Peasgood et al. (unpublished)
We have a copy
Unit-change comparability
I'm not sure this is stated correctly. It seems to overlap with cardinality.
📚 Further Reading: Unjournal Evaluations
The Unjournal has commissioned independent evaluations of papers relevant to this debate:
→ StrongMinds & Friendship Bench Evaluation — Critical assessment of HLI's meta-analysis and cost-effectiveness claims
→ Long-Run Effects of Psychotherapy on Depression — Cuijpers et al. meta-analysis on therapy durability
→ Cash Transfers vs Psychotherapy: Comparative Impact — McGuire et al. direct comparison in Liberia
→ Mental Health Therapy as a Core Strategy (Ghana) — Barker et al. on scaling community-based therapy
Put this somewhere else - I don't think it belongs within the focal case folding box. It should have its own folding box in the reading section and references
mortality-focused interventions
When comparing among interventions, some of which affect mortality.
Recent work has attempted to estimate the neutral point empirically. The estimates vary widely (0.6 to 6.0 on a 0-10 scale) depending on the elicitation method and sample. HLI is advising the UK Green Book guidelines and has compiled the following estimates:[*]

| Source | Value | Method | Sample |
| --- | --- | --- | --- |
| Samuelsson et al. (2023) - HLI pilot | 1.26 | Asked when life is no longer worth living on 0-10 LS scale | N=79, UK |
| Samuelsson et al. (2023) - HLI pilot | 5.30 | Asked where balance between satisfied/dissatisfied on 0-10 LS | N=128, UK |
| Peasgood et al. (unpublished) | 2.00 | Time trade-offs (QALY method rather than wellbeing scale) | N=75, UK |
| IDinsight Beneficiary Survey (2019) | 0.56 | "At what point on the ladder is it worse than dying?" | N=70, Ghana & Kenya |
| Moss (Rethink Priorities, unpublished) | 2.49 | Asked level preferring alive to dead (converted 0-100 → 0-10) | N=35, likely UK |
| Moss (Rethink Priorities, unpublished) | 6.05 | Asked minimally acceptable level to live an extra year | N=101, likely UK |
| Jamison et al. (forthcoming) | 2.39 | Policy comparison: saving people from dying (0-100 → 0-10) | N=1800 (Brazil, China, UK) |
| Jamison et al. (forthcoming) | 2.54 | Policy comparison: saving people from non-existence (0-100 → 0-10) | |

[*] Table compiled by Samuel Dupret (HLI), shared March 2026. Note that different questions elicit very different values—asking about "life no longer worth living" yields lower estimates than asking about "minimally acceptable" levels.
Try to source links/citations to these!
Practical guidance for funders now Given the uncertainties above, what should funders actually do? This section offers a decision-oriented framework, not a single prescription.
I didn't want the AI to give this 'practical guidance' -- that's meant to come out of the session!!
paper develops methods using calibration questions[5]Survey items with objectively correct answers—the same for all respondents. Benjamin et al. use visual calibration (e.g., "How dark is this circle?") to reveal individual scale-use tendencies without text-interpretation issues. and vignette exercises[6]Unlike traditional anchoring vignettes (rating hypothetical people), Benjamin et al.'s vignettes ask respondents to imagine situations in their own life and rate dimensions of well-being for themselves. Part of an "ideal approach" that also requires multi-dimensional wellbeing questions and stated-preference surveys. to detect
Remove the bolding here ... italics is better. Avoid bolding within paragraphs in general.
We're organizing the discussion around four key questions:
Restate this to more directly address the question in the heading on "what we want to achieve".
We want to:
- Help researchers understand practitioners' highest-value questions, considerations, and trade-offs.
- Help practitioners understand the most relevant and useful up-to-date research and its implications.
- Enable communication and collaboration by getting on the same page, agreeing on terminology, identifying points of consensus and high-value cruxes, etc.
- State and measure our beliefs about key issues and questions openly, with precision and calibrated uncertainty, driving high "value of information" Bayesian updating.
- Drive better decisions over measuring the impact of interventions in LMICs and using existing measures, leading to better funding decisions.
(This is a bit long -- just adjust the basic first sentence a tiny bit, and then footnote this more detailed theory of change.) #implement
The neutral point is the life satisfaction level representing neither positive nor negative welfare—essentially the boundary between "life worth living" and "suffering." Estimates range from 2-5 on the 0-10 scale. Peasgood et al. (2018) tentatively estimate ~2.
Add: "This is particularly important for comparing interventions that have impacts on mortality (and perhaps fertility). We should discuss this in this workshop to an extent, but we might de-emphasize it to avoid overstretching the scope, depending on interest and timing.
evaluation summary
Link it here https://unjournal.pubpub.org/pub/evalsumstrongminds/ -- however, I don't see anything in that summary that provides details suggesting this order of magnitude thing. Find a better reference.
QALYs (quality-adjusted life years)
Link one authoritative external resource presenting these in detail.
instruments like EQ-5D
dead link
Other measures include QALYs (quality-adjusted life years), income-equivalent measures, and multi-dimensional poverty indices. QALYs are similar to DALYs but measure health gained rather than lost.
This is being adjusted. NB we focus more on DALY than QALY because it's used a lot more in the LMIC intervention context, largely due to its ease of collection
Confirmed Participants
make this a folding box. Also add hyperlinks to people's web pages if you can find them
—and what would change their minds?
remove 'and what would change their minds' -- this doesn't fit. #implement
Unlike WELLBYs, DALYs are based on expert-derived disability weights rather than self-reported wellbeing—weights are constructed through surveys of health professionals rating hypothetical health states.
Are you sure that it's through surveys of health professionals? I thought the surveys were of people in the general population. And this explanation doesn't mention how an individual's DALY is constructed based on asking them about their health states or something. What's the data used?
Zoom chat for quick reactions;
No, I only want the Zoom chat to be used by the session organizers and mainly just to guide people on the structure of the workshop and where we're going next
Segment structure is set; timing may adjust slightly. Updated March 11, 2026
12 Mar 2026 -- Not entirely set -- we may add some small things. But close to set, and trying to harden the timings so we can send out a schedule soon that people can trust
calibrated
Give the definition of 'calibration' here as a footnote/tooltip. Roughly: when you say something will happen X% of the time, it in fact occurs about X% of the time, no more and no less.
If you are asked to give 80% CIs, the true values should fall within those intervals close to 80% of the time. If it happens less than 8 times in 10, you're being overconfident and stating intervals that are too narrow. If it happens more than 8 times in 10, you're being underconfident and stating intervals that are too wide.
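A minimal illustration of this coverage check, with made-up forecasts (not workshop data):

```python
# Each entry: (stated 80% interval lower bound, upper bound, realized value).
# All numbers are hypothetical.
forecasts = [
    (0.5, 2.0, 1.1),
    (10.0, 40.0, 55.0),
    (3.0, 9.0, 7.0),
    (0.0, 1.0, 0.4),
    (2.0, 6.0, 8.5),
]

hits = sum(lo <= truth <= hi for lo, hi, truth in forecasts)
coverage = hits / len(forecasts)
print(f"Empirical coverage: {coverage:.0%} (target: 80%)")
# Coverage well below 80% suggests overconfidence (intervals too narrow);
# coverage well above 80% suggests underconfidence (intervals too wide).
```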
Consider the value obtained when using the best feasible measure for cross-intervention comparison in contexts like the focal context. What share of this value is obtained, in expectation, from using the simple linear WELLBY measure (as defined above) for all interventions?
Above the 'operationalized version', add a discussion box here for people to answer the more general question.
Consider the value obtained
add a sub-sub-header "Operationalized version" here
Essential
'Essential' is too strong. Maybe 'Most important for discussion'. And note there's no way to do a thorough read of all of these in 2 hours. Just leave that 'time allotment' out.
1. WELLBY Reliability and Value
make an anchorable link here and for the other headers.
See the canonical formulations on Coda.
Make this a footnote. I don't think most people need to see that fairly confusing page. #implement
Vignette exercises: respondents rate hypothetical people's life satisfaction based on descriptions, revealing how individuals anchor the scale and enabling cross-person calibration.
Do they actually do this in the paper? Double-check.
Calibration questions ask respondents to rate well-defined scenarios (e.g., "How satisfied would you be if you won $1,000?"). By observing how people rate the same reference points, researchers can estimate individual differences in scale use.
Is this a reasonable example? Do they ask questions like that in the exercises mentioned in the paper?
Cost-effectiveness estimates vary by an order of magnitude depending on how WELLBYs are valued relative to DALYs.
What's the source for this OOM claim?? Find and link it with a verbatim quote. #implement
Also, it's not in our 'evaluation summary' as far as I know.
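If the claim survives sourcing, here is a purely illustrative sketch (all numbers assumed, not from any evaluation) of the mechanism, i.e. how the assumed WELLBY-per-DALY conversion alone can move the headline figure by 10x:

```python
# Hypothetical intervention costing $100 per WELLBY generated.
cost_per_wellby = 100.0

# Two assumed conversion rates between DALYs averted and WELLBYs gained.
for wellbys_per_daly in (0.5, 5.0):
    cost_per_daly_equiv = cost_per_wellby * wellbys_per_daly
    print(f"{wellbys_per_daly} WELLBYs per DALY -> ${cost_per_daly_equiv:.0f} per DALY-equivalent")
# The two conversions differ by 10x, so the implied $/DALY figure differs by
# 10x: the kind of order-of-magnitude sensitivity the quoted sentence points to.
```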
Open Philanthropy
It's now "Coefficient Giving" -- correct this on every page. And hyperlink "https://coefficientgiving.org/research/cost-effectiveness/" here. #implement
Each scale point represents an equal welfare increment. If violated, summing is invalid and interventions targeting different baselines become incomparable.
David Reinstein --- personally, this is the one I find least plausible and most important.
Interpersonal Comparability
LS_A = 7 ≈ LS_B = 7 implies U_A ≈ U_B
When two people report the same score, they experience similar welfare. Scale-use heterogeneity violates this assumption.
I don't think this one is necessary if we can (instead) assume that differences are equivalent. For example, if we assume that person A is actually experiencing higher welfare at every level of reported score, but the differences between the scores are comparable, then, when comparing interventions based on measured differences in well-being, that shouldn't matter.
I think it could also still be reliable if the distribution between the two populations is the same, even though we don't have specific inter-person comparability between any two compared individuals.
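A small sketch (hypothetical numbers) of the point in these comments: a constant scale-use offset shifts reported levels but cancels out of reported changes, so difference-based comparisons can survive a failure of level comparability:

```python
# Assumed latent welfare before and after an intervention (same for A and B).
welfare_before, welfare_after = 4.0, 6.0

report_a = lambda w: w        # person A reports the latent value directly
report_b = lambda w: w + 1.0  # person B shifts every report up by one point

change_a = report_a(welfare_after) - report_a(welfare_before)
change_b = report_b(welfare_after) - report_b(welfare_before)
print(change_a, change_b)  # both 2.0: the offset cancels out of differences
```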
Requires four implicit assumptions
Give a linked source and citation for this.
1 WELLBY = 1-point increase on a 0-10 life satisfaction scale × 1 person × 1 year
W = Σ_i Σ_t LS_it
Those are not clearly defined here, nor is the indexing.
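One possible explicit reading, offered as an assumption rather than the source's definition: i indexes persons, t indexes years, and the summand is the change in 0-10 life satisfaction attributable to the intervention. A minimal sketch:

```python
# Hypothetical effects: rows are persons i, columns are years t, and each
# entry is the change in 0-10 life satisfaction vs. the counterfactual.
delta_ls = [
    [1.0, 0.5, 0.25],  # person 1: effect fades over three years
    [2.0, 1.0, 0.0],   # person 2
]

# Linear WELLBY aggregation: sum over persons and years.
wellbys = sum(sum(person) for person in delta_ls)
print(wellbys)  # 4.75 WELLBYs under these assumed numbers
```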
We'll produce a practitioner-focused summary document, belief elicitation results with confidence intervals, and structured notes.
Change this to "we hope to" and "We will share outputs". -- I can't guarantee right now that we'll get enough input or have bandwidth to produce this. #implement
Participants can opt out of recording for specific segments if needed
Add "and we will ask for final approval before posting anything". #implement
(Note: QALYs may be more directly comparable than DALYs for this purpose.)
Leave out the QALYs parentheses bit here. Add "(or QALYs)" after "~1 SD in DALYs". #implement
scale?
Add "is a move from 1-3 for one person as good as a move from 1-2 for 2 people"? At the end of this paragraph... "even if these don't hold, does the linear WELLBY aggregation yield 'nearly as much value' for decisionmaking as other potential measures"? #adjust #implement
Where is the "neutral point" on the scale?
Remind me why the neutral point is important.
When comparing a mental health intervention (measured in WELLBYs) to a physical health intervention (measured in DALYs)
Either of these, especially the physical health intervention, could be measured either way. This overstates it a bit. Perhaps, just to give this as an example, suppose there is a case... #adjust #implement
but more work is needed.
"more work is neeeded" That's very much vague -- we nIt would be nice to have at least one specific point suggesting that the difference in scale means potentially matters and merits more study
Each has strengths and limitations—and how they relate to each other, and whether either reliably captures what matters for human welfare, directly affects which interventions get prioritized.
I'm allergic to platitudes. IIRC you should have some notes somewhere providing at least one case where this matters.
adversarial manipulation.
I don't think we discussed adversarial manipulation or have any results on it, so I'm a little worried that whatever generated this discussion is doing a sort of generic pandering and putting in what it generally expects to see in papers like this.
Our results support AI as structured screening and decision support rather than full automation,
This seems like a sort of milquetoast generic caveat. In what sense is this what our AI results support? This seems a bit pandering.
exhibiting consistent failure modes: compressed rating scales, uneven criterion coverage, and variable identification of expert-flagged concerns.
I'm guessing this is a bit premature/too much rounding up a few observations to general conclusions, but let me look at the results a bit more carefully.
often approach the ceiling implied by human inter-rater variability on several criteria,
This is interesting and strong. It comes across maybe a little bit overstated, so we just need to be careful about how we're framing this result.
high-quality but noisy reference signal
I think this is right, but the term "reference signal" sounds technical in an information theoretic sense, and we want to make sure we're not misapplying it.
narrative critiques
Yes, we focus on the critiques here, but the Unjournal evaluations do more than just critique. They discuss, they offer suggestions, implications, et cetera.
covering economics and social-science working papers
"covering ... working papers" Is mostly accurate but not quite right. We don't cover all working papers, and we have a specific focus on research relevant to global priorities. We can also evaluate post-journal publication, but I'm not sure how to best summarize this in a simple way in the abstract.
The idea of "open evaluation platform" also could be a bit confusing here because it's not mainly about crowd sourcing. Yes, the "paid expert review packages" cover this, but I don't quite think this is worded in the best possible way.
Peer review is strained, and AI tools generating referee-like feedback are already adopted by researchers and commercial services—yet field evidence on how reliably frontier LLMs can evaluate research remains scarce.
This is a decent first sentence, although it bears the marks of AI-generated text. But also I'm not sure if it's really in line with our newest spin on this.
“high” reasoning effort
Not relevant to Pro -- cut this
OpenAI Responses API
"Responses" is the newer one (as of 4 Nov 2025)
returned file id keyed by path, size, and modification time.
What does "keyed by" mean here?
This implies it is kept on the server and won't need a later upload.
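My guess at what this means (a hypothetical sketch, not the paper's actual code): the upload id is cached locally under a key built from the file's path, size, and modification time, so an unchanged file is not re-uploaded on later runs:

```python
import os

# Hypothetical helper: re-use a previously returned server-side file id when
# the same file (same path, size, and mtime) is seen again.
upload_cache: dict[tuple[str, int, float], str] = {}

def get_file_id(path: str, upload_fn) -> str:
    stat = os.stat(path)
    key = (path, stat.st_size, stat.st_mtime)
    if key not in upload_cache:
        upload_cache[key] = upload_fn(path)  # upload once; returns the file id
    return upload_cache[key]
```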
d the best performance from top reasoning models
Best relative to what? Better than the 'non-top reasoning models'? @valik
Zhang and Abernethy (2025) propose deploying LLMs as quality checkers to surface critical problems instead of
Is this the only empirical work? I thought there were others underway. Worth our digging into. Fwiw I can do an elicit.org query.
but still recommend human oversight.
why? based on some evidence of LLM limitations or risks?
emphasize
I'd say 'they argue' instead of 'emphasize'; the latter seems like a statement of absolute truth that we agree with.
The population of papers
Should we adjust "the population of papers" to "the reference is" ? to be more explicit?