1,537 Matching Annotations
  1. Mar 2026
    1. Annotate & Comment: Double-click any text to add a Hypothes.is annotation. No account needed to read; quick signup for a free account to post.

      We'd especially like pre-session feedback on

      • Are these roughly accurate?
      • Are they useful, and pitched at the right level?
      • What is redundant?
      • Which issues should we skip (because they're less important to intervention choices in LMICs, mostly resolved, or intractable)?
      • What is missing?
      • Is there a better overall structure and framing for these?
      • Where does it go into too much detail? Where is it too opinionated in cases where we should leave things open?
      • Are we failing to attribute any important sources for language, arguments, or claims? *
    2. Predictive validity: SWB predicts consequential outcomes systematically

      This was mentioned above, but does it do so in a scale-sensitive way?

      As I suggested, it's not enough for it to be 'somewhat predictive'.

    3. Transformation Sensitivity Demo

      This needs more context and explanation. I've forgotten what g(x) is here, and what the actual calculation is. Also, this doesn't seem to illustrate the point it's meant to: as I move the slider, population B always seems to be higher. It also seems like we're getting away from the discussion of the relative impact of different interventions; we don't want to simply compare populations. If this does pertain to interventions, explain how.

      Explain a bit more (as a footnote) what the 'transformation' means here and why/when it's used.

    4. Magnitude-sensitive cost-effectiveness: Even if signs are stable, cost-effectiveness ratios rely on magnitudes

      Do they? Magnitudes of what? Explain, and give a 1-2 sentence example as a footnote.

    5. Incremental WELLBY Estimate

      This is simple and perhaps obvious, but good for illustrating the basic linear-WELLBY concept -- though that's already been explained above, so I'm not sure it's useful down here. OK: put this at the top, in a folding box -- it just helps to make sure we're all on the same page about the definition of the WELLBY.

      Perhaps it would also be helpful to include some sort of adjusted-WELLBY calculator interface -- a more sophisticated concept people might not appreciate, particularly embodying the approach of Benjamin et al.

    6. What "non-identified" means A parameter is "identified" when data + assumptions pin down a unique value. Ordinal responses only tell us which interval a latent value falls into. Many different latent distributions and transformations can generate the same observed category counts, so rankings of means can change across equally admissible representations.

      This explanation is not clear and a bit too literal; it could be improved. Why do ordinal responses only tell us which interval a latent value falls into?

      This might also be worth folding
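One way to make the interval point concrete -- a tiny sketch with invented thresholds, purely illustrative:

```python
# Hypothetical reporting model: a respondent with latent wellbeing `latent`
# picks the ordinal category determined by fixed (unknown) cut-points.
thresholds = [2.5, 5.0, 7.5]  # made-up cut-points on the latent scale

def report(latent):
    """Ordinal category 0-3: the interval the latent value falls into."""
    return sum(latent > t for t in thresholds)

# Very different latent levels produce the identical observed answer,
# so the data alone cannot recover the latent value -- only its interval.
assert report(5.1) == report(7.4) == 2
```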

    7. Monotonic transformations can reverse conclusions

      An example here would be very helpful. ... Perhaps even an interactive display.

      Monotonic transformations of what?
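For instance, a toy sketch like this (invented data, not from any study) would show a monotone relabeling of the response scale reversing a comparison of group means:

```python
# Two groups' life-satisfaction reports on a 0-10 scale (hypothetical data).
group_a = [5, 5, 5, 5]   # everyone reports 5
group_b = [1, 1, 8, 8]   # polarized reports

def mean(xs):
    return sum(xs) / len(xs)

# On the raw scale, group A looks better off: 5.0 vs 4.5.
assert mean(group_a) > mean(group_b)

# Relabel the scale with a monotone (order-preserving) transformation that
# stretches the top, e.g. f(x) = x**2; the ranking flips: 25.0 vs 32.5.
f = lambda x: x ** 2
assert mean([f(x) for x in group_a]) < mean([f(x) for x in group_b])
```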

    8. Bond and Lang (2019) argue that with ordinal response data, comparing "average happiness" between groups is generally not identified without strong assumptions—monotonic transformations can reverse results.[11]

      This should be fleshed out in more detail and rigor, along with some responses to it, and probably belongs earlier on in the discussion.

      ....

      What do you mean, comparing "average happiness between groups is not identified"? What is the thing that is not identified?

    9. Time structure and discounting [sequence diagram: Baseline (t=0) → Follow-up (t=1) → Later (t>1), annotated "Persistence, decay, response shift?"]

      This diagram is not fully explained. I don't see how it relates to the rest of the content either.

    10. [Flow chart: Intervention → Study design → Measured outcomes (LS / DALY / depression) → Translation layer (mapping, calibration) → Common currency (WELLBY / DALY / $) → Decision]

      This flow chart is too small and it's underexplained. I don't understand what each of these is meant to mean and how they fit together.

    11. Cheap calibration methods: Can vignettes, anchoring questions, or other calibration approaches work in low-resource settings without excessive respondent burden?

      That seems fairly tractable for us to at least share our knowledge about in this conference. Cool.

    12. true mapping

      That's the second question combo, which we'll be setting up an explainer on. Once we do, we should link it, and also link that PQ here.

      But 'true mapping' needs a bit more definition. Maybe put it in scare quotes to flag that (or link the tentative formulation in the PQ space).

    13. Scale-use heterogeneity mapping: How do shifters vs. stretchers vary across LMIC populations, and can we predict which matters more in a given context?

      Measuring this seems fairly high value to me if it can be done at a low cost.

    14. These questions represent high-value areas for future research that could meaningfully improve the reliability of WELLBY-based comparisons:

      I wouldn't state this so directly and clearly; attribute the claims that these represent high value to the people making them. We want this to be one of the outputs of the workshop, but I'm not sure all of these are in fact high value -- some might be quite intractable.

    15. Within-person designs where each person serves as their own control

      But this can bring its own problematic effects if people feel prompted or motivated to report an improvement to please the experimenters, etc.

    16. Treat WELLBY estimates as one input among several, not the final answer

      That's the sort of milquetoast thing I want to avoid. People will always say, "Do compare multiple things, don't treat something as the gospel truth, etc." It's not a statement with a lot of meaning.

    17. 8. Practical Recommendations

      I don't like having a core practical-recommendations section here. The recommendations are meant to come out of the workshop; we shouldn't be pre-establishing them. It's OK if you want to compare the recommendations coming out of the existing reports & literature, though.

    18. DALYs and QALYs: Standardized But Narrower

      How are these measured in the relevant settings and how does it differ from WELLBY? These are based on external measurements?

    19. Years of Life Lost (YLL) + Years Lived with Disability (YLD)

      This seems like it must be incorrect/imprecise. Is a year with a disability actually measured here as being as bad as a year of life lost? This needs a better definition... how is it measured?
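For reference, my understanding (worth verifying against the GBD documentation) is that a year lived with a disability is multiplied by a disability weight between 0 and 1, so it does not count as badly as a year of life lost. A sketch with made-up numbers, not official GBD weights:

```python
# DALY = YLL + YLD, where YLD scales years lived with a condition by a
# disability weight in [0, 1]. Weight below is invented, for illustration.
def dalys(years_of_life_lost, years_with_disability, disability_weight):
    yll = years_of_life_lost
    yld = years_with_disability * disability_weight
    return yll + yld

# e.g. 2 years of life lost plus 10 years with a condition weighted 0.2:
assert dalys(2, 10, 0.2) == 4.0
```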

    20. It does not automatically imply that within-study randomized treatment effects are meaningless. It implies you should be explicit about what assumptions let you treat reported changes as welfare units

      This seems a bit babytalk/obvious.

    21. OECD (2024) concludes data remain meaningful for policy despite critiques

      Give a link... and what is the basis for this? 'Meaningful' is a somewhat vague term; it doesn't get at the hard questions about which measures we should use for comparing specific interventions.

    22. Survey response times can help solve identification (Liu & Netzer, AER 2023)

      This is highly counter-intuitive to me. How do survey response times help?

    23. A strong response to skepticism: even if the numbers seem arbitrary, do they behave like a measurement? Kaiser and Oswald show that single numeric feelings responses have strong predictive power—relationships to later "get-me-out-of-here" actions (changing neighborhoods, jobs, partners) tend to be replicable and close to linear in large longitudinal datasets.[10]

      This kind of seems like a weak response unless I'm missing something. Even if they are not arbitrary, even if they have informational value, it doesn't tell me that they provide reliable information in comparing the benefit/cost across multiple interventions which all improve people's lives.

    24. They do not solve cross-study comparability—but demonstrate that in at least one setting, SWB is responsive.

      But this doesn't seem to have been the challenge as posed. I'm not sure this is the most relevant thing to lead with, or maybe it needs to be motivated better

    25. Measurement error attenuates estimated effects (bias toward zero)—small real effects may be undervalued

      How does that affect the relative comparison of interventions?

    26. What breaks: Duration weighting is wrong. Why it might fail: Adaptation effects—people return to baseline. Mitigation: Long-term follow-up data.

      Again, this is too shorthand. I need an explanation, if necessary, in footnotes or a folding box, of what all this means.

    27. ΔLS has ≈ same welfare meaning across people

      'Meaning' should be clarified, perhaps with reference to the gold standards I suggest you add above. Should we state this in terms of an individual's willingness to make 'time trade-offs' (e.g., would be willing to go from 7-->6 for one year in exchange for going from 3-->4 in another year), a probability trade-off (would take a coin flip over the above), or a person trade-off (a third party would be willing to move one person from 7 to 6 if it meant moving someone else from 3 to 4)... [or vice versa in all cases]?

    28. ΔU(3→4) = ΔU(7→8)

      Obviously this notation is extremely crude! I wonder if important nuance is lost here.

      E.g., is this 'within person' or 'across people'?

    29. Validity

      "Validity" is vague and needs a better definition. Perhaps something more informative about the value the metric offers would help. Naturally, no metric is perfect, and even if a model's assumptions are violated in practice, they might be close enough to holding that the difference doesn't matter much.

      We need a better definition of the 'gold standard here'. What would an 'accurate comparison' tell us? What is the appropriate measure of 'degree of inaccuracy'?

    30. Test

      How would we test this? Define 'log transformation' more clearly here: what assumptions are necessary for it to accurately reflect tradeoffs?

    31. Ceiling/floor effects: Even with identical reporting functions, bounded scales can cause mechanical differences in responsiveness at high or low baselines.

      But this does not seem consistent. You are saying "when heterogeneity is most dangerous", but this doesn't look like heterogeneity.
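A minimal sketch of the mechanical point (hypothetical numbers): even with a single shared reporting function, clipping at the scale bound alone shrinks measured changes near the top.

```python
def report(latent):
    """Responses clipped to the bounded 0-10 scale."""
    return max(0.0, min(10.0, latent))

gain = 2.0  # identical latent improvement for both people
low_baseline_change = report(5.0 + gain) - report(5.0)   # full 2.0 points
high_baseline_change = report(9.0 + gain) - report(9.0)  # only 1.0 point
assert (low_baseline_change, high_baseline_change) == (2.0, 1.0)
```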

    32. Comparing across studies/countries: Different instruments, translations, norms, and populations. If the distribution of stretch factors bi differs, "1 point-year" is not the same welfare unit across the evidence base.

      Can you justify this a bit more, both in equations and in an intuitive explanation of what the problem is?
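Something like this might do it -- a toy model (all numbers invented) where each person's reported change is ΔLS_i = Δu_i / b_i, so populations with different stretch factors b_i report different effect sizes for identical welfare gains:

```python
# Toy model (invented numbers): person i's true welfare change is delta_u_i;
# they report delta_LS_i = delta_u_i / b_i, where b_i is the "stretch
# factor" -- how much latent welfare one scale point represents for them.
true_welfare_gain = 1.0           # same true gain for everyone, both studies

study_a_b = [1.0, 1.0, 1.0]       # population where 1 point = 1 welfare unit
study_b_b = [2.0, 2.0, 2.0]       # population where 1 point = 2 welfare units

mean_dls_a = sum(true_welfare_gain / b for b in study_a_b) / len(study_a_b)
mean_dls_b = sum(true_welfare_gain / b for b in study_b_b) / len(study_b_b)

# Identical welfare effects, but "1 point-year" means different things:
# study B's measured effect is half the size of study A's.
assert mean_dls_a == 1.0 and mean_dls_b == 0.5
```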

    33. interpersonal noncomparability is less of a threat for estimating an average treatment effect

      "less of a threat" is vague, needs clarification. And why? Give a citation and/or a proof and further explanation (perhaps in a footnote)

    34. studies, countries, or populations with different distributions of "stretch factors.

      Adapt this discussion to focus more on comparing different interventions (see the canonical example, but also link real-world relevant comparisons and studies)... where these interventions may take place in nearly identical, similar, or distinct contexts, and affect similar or different outcomes (wealth, health, etc.).

    35. Δui = bi × ΔLSi.

      this needs more explanation. What does 'fail' mean here? What's being compared, and how do the estimates compare with the ground truth?

    36. UA ≈ UB

      Maybe add a footnote explaining what sort of "utility" we are considering here, noting this is a bit of an oversimplification of welfare considerations.

    37. A common overstatement is that

      Who stated this? How is it 'common'? Maybe just change this to "Equal scores mean equal welfare" is stronger than most applications need.

    38. This second form requires a defined zero point (e.g., death = 0)

      Might benefit from some further explanation. How could Level-based be used for comparing interventions -- that's not clear here. How many people are we summing over? How do 'dead people' enter into that? Some explanations can go in footnotes.

    39. Σi Σt δt (LSit(k) − LSit(0))

      Is this really how it's depicted in the literature? It's a bit confusing at first, because it looks like one has to know two things for incremental WELLBYs and only one thing for the level-based measure. Furthermore, the incremental one seems to require knowledge of a counterfactual. However, one might be able to estimate a difference without knowing the levels. Isn't there a better notation/explanation for this?

    40. ΔWELLBY(k) = Σi Σt δt (LSit(k) − LSit(0))

      I'm missing the definition of the indices i and t, as well as the definition of the variable LS -- #adjust #implement
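A short worked computation (all numbers hypothetical) might help fix the definitions: i indexes people, t years since baseline, δ is an annual discount factor, and LS_it(k) / LS_it(0) are life satisfaction with the intervention and under the counterfactual:

```python
delta = 0.97  # hypothetical annual discount factor

# LS trajectories over t = 0, 1, 2 for two people (made-up numbers):
ls_with_k = [[6.0, 6.5, 6.2], [5.0, 5.8, 5.4]]          # LS_it(k)
ls_counterfactual = [[6.0, 6.0, 6.0], [5.0, 5.0, 5.0]]  # LS_it(0)

delta_wellby = sum(
    delta ** t * (ls_with_k[i][t] - ls_counterfactual[i][t])
    for i in range(len(ls_with_k))
    for t in range(len(ls_with_k[i]))
)
# Only the differences LS_it(k) - LS_it(0) enter the sum, so one can work
# with estimated *changes* without knowing absolute levels.
```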

    41. Benjamin et al. 2023, UK Green Book Wellbeing Guidance, Bond & Lang 2019, Haushofer & Shapiro 2016/2018, Kaiser & Oswald 2022)

      Are these really all the sources? I thought we had more.

    42. AI-Generated Content (March 2025): This page was created through iterative prompting of Claude Code (Opus 4.5) and GPT-5.2 Pro, feeding in workshop discussion content and focal papers for our Pivotal Questions initiative (Benjamin et al. 2023, UK Green Book Wellbeing Guidance, Bond & Lang 2019, Haushofer & Shapiro 2016/2018, Kaiser & Oswald 2022). While grounded in these sources, this content requires further human verification. Specific claims, citations, and numerical details should be checked against the original literature before relying on them.

      Make this a folding box #implement

    1. Most studies measure outcomes at baseline and one or two follow-ups;

      Give a footnote with some examples here. What do the studies involving LMIC interventions do?

    2. The measurement-to-decision pipeline [flow chart]

      The diagram is too small, and it was never explained!

    3. Some influential critiques argue that different monotone transformations can reverse conclusions about "average happiness"

      'Influential' -- that's subjective. Link to an example.

    4. Is "incremental WELLBY" standard terminology? Some literatures talk about WELLBYs as point-years of life satisfaction (UK guidance) and many evaluation contexts are inherently incremental. But "incremental WELLBY" itself is not uniformly a standard term. In this page, we use it as a descriptive label for counterfactual impact calculation, not as established jargon.

      Too much inside info for a whole box -- make this a footnote at most.

    5. WELLBY (unit of account): UK Green Book guidance defines a WELLBY as a one-point change in life satisfaction on a 0-10 scale, per person per year.[3]HM Treasury (2021/2024). Wellbeing Guidance for Appraisal: Supplementary Green Book Guidance.

      The standard framing of the LS question is missing here.

    6. The measurement-to-decision pipeline [flow chart: Intervention → Study design → Measured outcomes (LS / DALY / depression scale) → Translation layer (mapping, calibration, assumptions) → Common currency (WELLBY / DALY / $) → Decision / deliberation]

      This is too small and also underexplained.

    7. Plant, M. (2025). "A Happy Possibility: Rational Behavior and the Cardinality Thesis." Working paper.

      wait -- hallucination -- you renamed the title here!!

    8. If you compare to mortality-preventing interventions

      Adjust this to "if you compare interventions that affect mortality (or, in some accounting, birth rates)"

    1. 📊 View Aggregated Results See beliefs elicitation summaries and Metaculus question forecasts

      I don't think I want to show this here because I don't want people to anchor in stating their beliefs. #todo #adjust #implement

    2. discussion. SEGMENT 1 ~11:00 AM ET 25 min Stakeholder Problem Statement & Pivotal Questions Stakeholders present their WELLBY/DALY challenges (~10 min each), then we introduce key PQs for belief elicitation (~5 min) Speakers: Peter Hickman (Coefficient Giving), Matt Lerner (Founders Pledge) Upcoming

      Here or somewhere early in the workshop, we should have the opportunity for participants to provide feedback about whether the Pivotal questions are clear and useful, and which ones are more important to their work and to the ~welfare of humanity.

    3. SEGMENT 6 15 min Beliefs Elicitation Guided form to state priors on operationalized pivotal questions Self-guided form + Metaculus questions Upcoming

      This will be very loose. Reinstein will introduce the context and interfaces, and stick around to answer questions and fix bugs etc.

      Change this to "explains" not "introduces", as it was already introduced briefly

    4. SEGMENT 7 ~2:20 PM ET 30 min Practitioner Panel & Open Discussion Practical implications for funders and researchers Panelists: Matt Lerner (FP), Peter Hickman (CG)

      So this will be public -- another 10-minute presentation from each. Then we will open it up for discussion and questions -- David Reinstein will raise some if others don't. This will be followed by a private, invitation-only discussion among a few heavily involved participants (to be mentioned but not linked here).

    5. Scale-use heterogeneity findings, calibration methods, and implications for WELLBY use Speakers: Dan Benjamin (UCLA/NBER), Miles Kimball (CU Boulder)

      Extend this with a presentation on the application of this method in Israel -- or maybe put it after the evaluator responses/discussion?

    6. Evaluator Responses & Discussion Evaluation findings, author dialogue, Unjournal process Presenters: Caspar Kaiser, David Reinstein, Valentin Klotzbücher

      This will get into more technical research issues

    1. html`<div style="background: #f8f9fa; padding: 1rem 1.25rem; border-left: 4px solid #3498db; margin-bottom: 1.5rem; font-size: 0.95em; line-height: 1.6;"> <strong>What these numbers represent:</strong> Simulated <strong>production cost per kilogram of cultured chicken</strong> (wet weight, unprocessed) in <strong>${targetYear}</strong>, based on ${stats.n.toLocaleString()} Monte Carlo simulations. This is the cost to produce meat in a bioreactor — not retail price, which would include processing, distribution, and margins. <br><br> <strong>Why it matters:</strong> If production costs reach <strong>~$10/kg</strong> (comparable to conventional chicken), cultured meat could compete at scale. If costs remain <strong>>$50/kg</strong>, the technology may remain niche. These thresholds inform whether animal welfare interventions should prioritize supporting this industry. </div>` RuntimeError: targetYear is not definedOJS Runtime Error (line 804, column 163) targetYear is not defined

      How can we fix this runtime error? I think it was working before. The 'target year' should be the 'projection year' in the sidebar model parameters; the default year was 2036. #implement
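One guess at a fix (cell and variable names here are hypothetical -- check the actual sidebar input name in the notebook): define targetYear in a cell that reads the sidebar's projection-year parameter, falling back to 2036, before the html`...` cell references it.

```javascript
// Observable/OJS-style sketch. `projectionYear` stands in for the sidebar
// "projection year" model parameter (real name unknown -- verify in notebook).
const projectionYear = undefined;          // e.g. user hasn't set it yet
const targetYear = projectionYear ?? 2036; // fall back to the stated default
```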

    1. Add questions and comments directly to the collaborative notes above, or submit them via the beliefs elicitation form.

      Remove 'or submit them via...' Note that the "beliefs elicitation form" is doing something else. #adjust #implement (adjust this on all pages).

    2. decision problems that funders face when comparing interventions measured in different units (WELLBYs vs DALYs)

      It's not just about 'comparing interventions measured in different units' #adjust #implement

    3. How funders currently navigate WELLBY vs DALY in cost-effectiveness analysis

      Not just 'WELLBY vs DALY' ... --> How funders consider wellbeing and metrics based on self-reports in considering and comparing interventions.

    1. If the effectiveness of some programs have already been measured in terms of WELLBYs, while others are measured in terms of DALYs, what method or what "mapping structure or approach" should we use to compare and convert between them?

      There may be too many questions on conversion here if the workshop isn't focusing on conversion. We might want to move some of these more detailed questions to a second, outlinked page.

    2. These are some of the key operationalized questions from our Wellbeing Pivotal Questions project. We want to elicit expert and stakeholder beliefs—before, during, and after reviewing the evidence and key arguments—to see how views evolve and where consensus exists. (All questions are optional.)

      These have Metaculus versions. We probably want to link them here, but we also don't want to overwhelm people.

    3. 📋 Full question specifications: For more detail, context, and the complete set of operationalized questions, see the canonical Wellbeing PQ formulations on Coda →

      We link to these, but they might be a bit overwhelming for session participants - perhaps put a disclaimer here.

    4. About You

      We had a box here to indicate whether this is an original submission or an updated edition (first submission, second submission, etc.). Or perhaps this is redundant, as we will see it from the submission timestamp?

    5. One way to think about this: Imagine an ideal research team with unlimited resources, time, and data—perhaps even a kind of omniscience where they could perfectly understand the welfare and psychological states of everyone affected. What probability would you assign that this idealized team would ultimately conclude the statement is true?

      I'd like to link a "calibrate your judgment" tool here with a very quick exercise. Ideally, this is something we could even embed. I don't want people to have to sign up for things; friction is the enemy.

    1. DALY_01 What is the best numerical conversion factor between WELLBYs and DALYs? If a charity prevents 1 DALY (Disability-Adjusted Life Year), approximately how many WELLBYs does this represent? Current estimates range from 2-15. View on Metaculus

      Have this link to the specific Metaculus forecast, not just the general community page.

      But we should also embed the more detailed belief elicitation that is already here.

    1. Brief context on the Unjournal evaluation process

      Also mention how this relates to this Pivotal Questions initiative and how we're looking for Pivotal Questions evaluators. #implement

    1. —a “think first, score second” protocol designed to ground numeric ratings in specific textual evidence

      Is there backing in the literature for this approach? Is there any formal way of defining this approach?

    2. We pass the PDF directly to the model’s native multimodal input rather than extracting text, preserving tables, figures, equations, and layout cues that ad-hoc scraping could mangle. A single API call per paper avoids hand-offs and summary loss from m

      I think we've discussed this before. There are trade-offs here: we could be assessing some less meaningful components of paper processing rather than the actual reasoning frontiers, and how these differ by model, by type of paper, by field, etc. Are you sure this is what we're doing, and might we consider doing it differently?

    3. Sample and human reference data.

      I suggest some subsection headers here. There are different aspects to what you might call methods, including: the context of the content included; the LLM pipeline and procedure; the comparison of human and LLM ratings, including identification of issues; and the statistical/information-theoretic analysis. It might be helpful to divide this up.

    4. development economics, health policy, environmental economics,

      This leaves out some important field/priority areas - dig deeper. We include the economics of innovation and global catastrophic risk as well.

    1. models’ training data may include fragments of these papers or related discussions.

      This should be discussed in more detail, perhaps in a dedicated section addressing it both conceptually and with some empirical checks. We should perhaps have robustness checks (maybe we already have some) with models whose training cut-off dates precede the start of our evaluation sample, or that remove papers/evaluations occurring after this date. This should be linked and referred to here.

    2. Qualitative coverage varies widely across papers: on some, the LLM captures nearly all consensus human concerns; on others, it misses key critiques or raises issues absent from the expert consensus

      A link would be helpful here / an example.

    3. respectably

      Let's try to avoid terms like "respectably" without definitions; this sort of thing leads to sloppy thinking. Can we give a quantification in words in some meaningful way? Or perhaps we should depart from the norm of giving broad but imprecise explanations of everything (also with a lot of repetition) in papers like this.

    4. , approaching the ceiling set by inter-rater variability among humans themselves.

      It's noted elsewhere that this isn't really a ceiling, and it doesn't act as one in practice: it's not mathematically guaranteed, and I don't see a conceptual reason why it should hold.

    1. (likely Founders Pledge)

      update #implement -- CG and Founders Pledge are both likely to speak for about ten minutes, followed by a discussion of how we're mapping this into "pivotal questions". /// Try to keep this aligned with the "live sessions" page.

    2. Confirmed: Monday, March 16, 2026 · 11am–5pm ET / 4pm–10pm UK · Fully online · ~3.5 hours of live sessions (join only the segments you're interested in) + asynchronous

      Make this date/time more prominent here and on all pages. Note that you need to sign up to be given the Zoom link.

    3. Your primary role in this conversation (optional)

      Make a box below this with an optional free response, asking people what their background is and why they're interested.

      Add a caveat/"Note on access" in a folding box. Note that these sessions themselves, as proposed, are "by invitation only". We will share the Zoom link only with a limited set of people to keep things from becoming overwhelming. Please don't be offended if we don't follow up with you; we have limited bandwidth and may have overlooked you. But we aim to bring anyone interested into the conversation in some format, perhaps a future more open event. #implement

    1. Focal Question (DALY_01) If the impact of one program is measured in WELLBYs and another program impact is measured in DALYs, and we have a reported effect size and standard deviation for each, what is the best numerical conversion or mapping between them? Note: This may be treated as a secondary topic depending on time constraints.

      This should link or embed the space where people can state their beliefs

    1. 💬 Questions & Comments Submit questions and comments through the form below. Note: Submissions are publicly visible to all participants.

      There is no form here. How can we enable it? Ideally with an 'upvote' feature to prioritize these?

    1. ogether the paper's authors, the evaluators who assessed it,

      adjust -- not just 'the paper'; authors of several papers in this area, as well as Unjournal evaluators

  2. Feb 2026
    1. (explaining the slightly different ρ values).

      The difference between the value in the diagram and in the table. I didn't understand what difference was being referred to at first... So this should say, "Explaining the slightly different ... values between the table and the figure."

    2. evaluator pairs is no tighter than the LLM-human scatter in panel

      This is a vague statement. It's not obviously tighter, but you can't eyeball it and say that it's no tighter. OK, it's less tight if we use the Spearman measure, but that should be made clearer; I didn't see the Spearman values reported here.

      But I still think this gets back to the question of whether it's fair to compare individual human-human evaluator correlations to the correlation between the LLM and the average of humans. Given both signal and noise, I'd expect the average of two measures to be more reliably predicted by a third noisy measure than one individual measure predicts a second.
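
      A quick stdlib-only simulation of this concern (purely illustrative; equal, made-up noise levels for the humans and the LLM): even an LLM that is statistically identical to a single human rater correlates more strongly with the two-rater mean than the two humans correlate with each other.

```python
import random
import statistics

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n_papers, n_sims = 60, 1000
gaps = []
for _ in range(n_sims):
    true = [random.gauss(0, 1) for _ in range(n_papers)]   # latent paper quality
    r1 = [t + random.gauss(0, 1) for t in true]            # human rater 1
    r2 = [t + random.gauss(0, 1) for t in true]            # human rater 2
    llm = [t + random.gauss(0, 1) for t in true]           # "LLM" with identical noise
    hh = pearson(r1, r2)                                   # human-human agreement
    lm = pearson(llm, [(a + b) / 2 for a, b in zip(r1, r2)])  # LLM vs human mean
    gaps.append(lm - hh)

print(statistics.fmean(gaps))  # positive: the two-rater mean is an easier target
```

      With equal signal and noise variances the theoretical gap is about 0.08 (0.577 vs 0.50), so "LLM beats pairwise human agreement when scored against the human mean" is partly a mechanical artifact of noise averaging, which is exactly the fairness worry above.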

    3. Compare panels (b) and (c) directly to see whether LLM-human scatter is tighter than human-human scatter.

      Probably put at least one correlation metric in each plot because it's really hard to eyeball this

    4. ) ratings for each paper, revealing inter-rater variability—CIs often span 20–40~points.

      The statement is confusing and not fully explained. What CIs are we talking about here? Note that we ask each rater to explicitly provide 90% credible intervals for each rating; is that what this refers to? But that's a different thing from inter-rater variability.

    5. In most cases both LLMs fall within the range of human opinions, though several papers show substantial divergence.

      This might have been my own language? In any case, we should have some numbers to back this up; it's not clear to me that the statement is justified by the diagram. I seem to see more than a few cases where the LLM ratings fall outside the human range.

    6. Per-paper overview and model comparison. Figure 2.1 presents three complementary views of overall (0–100 percentile) ratings. Panel (a) displays individual human evaluator ratings alongside GPT-5 Pro (orange diamo

      I guess this is ordered from highest to lowest average human rating? Check this and explain it in the diagram or the discussion.

    7. Per-paper overview and model comparison. Figure 2.1 presents three

      The diagrams are too small; I can barely see them, at least in this version. Online, readers could of course zoom in, but for a printable version you'd need to make these a lot bigger. And no one can read the evaluator names either.

    8. We evaluate 6 frontier LLMs against human expert reviews from The Unjournal.

      This seems repetitive of what we said in the first section; to the extent it needs repeating, please take on board the Hypothes.is comments there.

    9. Results

      Putting results before methods might be the norm in computer science, but in economics the methods and discussion usually come first (although people often summarize the results in the introduction).

    10. criterion-level ceiling.

      I don't know why they use the word "ceiling." It's not really a ceiling: maybe a point of comparison, but nothing statistically or mathematically bounds the others to be below it. In fact, by this measure the models sometimes match humans better than humans match each other, at least in the stats I've seen.

    11. If two human evaluators agree at Spearman ρ = 0.55, an LLM achieving ρ = 0.57 against the human mean is performing within human inter-rater range.

      Not sure I completely understand the claim here, or what is meant by "performing within human inter-rater range."

    12. severity, topic familiarity, interpretation of the scale)

      Perhaps also mention that we're asking them to provide percentiles relative to papers in this area that they read in the last two years, and different evaluators may have read different selections of research. There should be a link here to the actual guidelines that we gave the humans (https://globalimpact.gitbook.io/the-unjournal-project-and-communication-space/policies-projects-evaluation-workflow/evaluation/guidelines-for-evaluators)

    13. 33 of these were also evaluated by Claude Opus 4.6

      We want to make sure this is either dynamically coded or kept updated, as we expect this number to increase.

    1. six frontier LLMs

      These are not all frontier models, I would say. Or am I wrong here? Does the term "frontier" include faster but less deeply reasoning models?

    2. Funding for The Unjournal has been provided by the Survival and Flourishing Fund, the Long Term Future Fund, and EA Funds.

      Do we need to mention Unjournal funding here?

    3. The Unjournal setting is particularly well suited for this comparison. It commissions paid expert evaluations using a structured rubric covering seven percentile criteria with 90% credible intervals plus journal-tier predictions, and publishes the resulting packages openly

      A bit more context on The Unjournal would probably be helpful here, mentioning our prioritization, etc.

      Claude added this comment (on an earlier version?) Claude: Selection bias: Unjournal selects papers from NBER/top working paper series. This is not a random sample of research. LLM performance on pre-screened quality papers may differ from performance on the full distribution (including poor papers). Explicitly note: "Our sample is pre-selected for quality; results may not generalize to evaluating lower-quality submissions."

    4. Our headline finding is that the best-performing model (GPT-5 Pro) matches or exceeds pairwise human inter-rater rank agreement on overall quality,

      I don't want to be seen as cherry-picking here. When we report this, we should also report the other important statistics, like Krippendorff's alpha, and at least mention which metrics the LLM performs worse on.

    5. while the journal-tier predictions provide an external reference point2

      By the language here, the predictions are not an external reference point. The publication outcomes, and perhaps citation outcomes, are the external reference point, even though, as we say, these are not precise measures of the "quality" of the paper.

    6. reducing classic gatekeeping motives and increasing reviewer effort.

      Not sure what they mean by "reducing classic gatekeeping motives." We argue that our setup leads to high reviewer effort for a few reasons, but this is not fully justified here. The case we make: we manage the process carefully; the reviews (we call them "evaluations") are made public, so people may want to set a better standard; some people sign their reviews, so there's a reputation motive; and we offer compensation as well as prizes for the strongest work, so there's a direct financial incentive, although our compensation is fairly modest.

    7. structured measurement schemas (Asirvatham, Mokski, and Shleifer 2026), iterative quality-checking workflows (Zhang and Abernethy 2025), or the kind of prompt-robustness engineering motivated by specification-search concerns (Asher et al. 2026)—should improve further.

      Of course, we want to look at these carefully before we praise them; I'm not super familiar with what each of these is. And I wouldn't state so strongly that they will necessarily improve on this; there may be countervailing constraints and limitations. Taking a look at the AMS (Asirvatham, Mokski, and Shleifer) abstract, I don't quite see that it's the same sort of thing we're trying to do.

    8. with no iteration, retrieval augmentation, chain-of-thought scaffolding, or multi-step agentic loop.

      Rephrase as "We do not do any iteration..." as a separate sentence; otherwise it's a little confusing what we're saying we do versus don't do.

    9. strong case that frontier LLMs can serve as additional expert raters in structured evaluation pipelines,

      I think saying "strong case" is probably too strong

    10. Our headline finding is that the best-performing model (GPT-5 Pro) matches or exceeds pairwise human inter-rater rank agreement on overall quality,

      Just need some clarification: it meets or exceeds this if we compare it to the average of the human ratings. Is that a fair comparison? Double-check, or compare it with how it would fare against individual human raters.

    11. against expert evaluations for 60 economics and social-science working papers

      I don't think this number should be 60; I thought we only have 57, at least 57 that are publicly released.

    12. multi-dimensional

      It's not clear what the advantage of our evaluations being "multi-dimensional" is; at least, this paragraph doesn't make it clear. The paper should make clear that we also ask for overall judgments relative to familiar journal tiers. I would say the advantage of multi-dimensionality is that it gives us a sense of which aspects of the research the LLM tools tend to agree or disagree with humans on: something like an understanding of tastes and prioritization.

    13. These developments make the evidentiary gap salient: funders, editors, and policymakers need to know when AI evaluation outputs are trustworthy enough to use, and when they are unstable, biased, or manipulable. Recent work highlights all three concerns. First, reproducibility can be “jagged”: repeated runs of the same models on the same corpus over time can be highly consistent for some tasks and models, but much less so for others (Thomas, Romasanta, and Pujol Priego 2026); robustness may require separating scientific judgment from computational execution (Xu and Yang 2026); and even without overt adversarial intent, subtle reframings of the same task can induce systematic shifts in outputs—a form of LLM “specification search”—raising concerns about frame-sensitive biases when models serve as measurement instruments (Asher et al. 2026). Second, adversarial manipulation is not hypothetical: invisible-text “prompt injection” can substantially inflate LLM-assigned review scores and acceptance recommendations in simulated peer review (Choi et al. 2026), and prompt-injection vulnerabilities are also documented in other high-stakes advice settings (Lee et al. 2025). Third, even when outputs look fluent and plausible, it remains unclear whether AI models approximate expert judgment: AI-generated reviews tend to cover more surface-level sections while being less thematically diverse and less focused on interpretation, originality, and applicability than human reviews (Rajakumar et al. 2026); LLMs used as manuscript quality checkers identify only a small fraction of confirmed critical errors even with the strongest reasoning models (Zhang and Abernethy 2025); and LLM scoring exhibits systematic range restriction and halo effects that can distort agreement metrics (Wang et al. 2025).

      This seems too long. It isn't really coming from us, so we might mention some of these things, but I'd make this a lot shorter; perhaps some of it can go in footnotes. Obviously we need to check these references carefully to see whether we agree with them.

      As I think I mentioned before, I'm not sure our work really speaks to the prompt-injection issue. The set of work we're asking the LLMs and humans to evaluate seems rather unlikely to contain such prompt injection, so we can't really test for it (unless we modified the work being fed in, but I don't think that's in our wheelhouse right now).

    14. Meanwhile, publishers are formalizing policies that treat manuscripts and reviews as confidential and prohibit reviewers from uploading them into general-purpose generative AI tools

      I'm just checking some of these references to see if there's hallucination going on; this one seems to check out. From the cited policy: "Reviewers should not upload a submitted manuscript or any part of it into a generative AI tool as this may violate the authors’ confidentiality and proprietary rights and, where the paper contains personally identifiable information, may breach data privacy right".

    1. irr::kripp.alpha(M, method = "interval")$value }, error = function(e) NA_real_)

      This is the interval version of Krippendorff's alpha, which penalizes the squared distances. I guess this means it particularly penalizes cases where raters are very far apart, while a larger number of small differences won't matter as much. I'm not sure whether this is appropriate; something to think about. Perhaps we also want to provide the ordinal version for comparison, or something else. I believe we've thought about this, but I can't remember what we came up with; we'll have to re-consult the notes.
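
      A toy illustration of the penalty structure (this compares only the distance functions used inside the observed-disagreement term; it is not a full alpha computation, which also normalizes by expected disagreement):

```python
def mean_disagreement(pairs, metric):
    """Average pairwise disagreement over (rater_a, rater_b) rating pairs."""
    return sum(metric(a, b) for a, b in pairs) / len(pairs)

sq = lambda a, b: (a - b) ** 2   # squared distance: the "interval" penalty
ab = lambda a, b: abs(a - b)     # absolute distance, for comparison

# Two rating datasets with the SAME total absolute disagreement (40 points):
one_big = [(50, 90)] + [(50, 50)] * 4      # one pair 40 points apart
many_small = [(50, 60)] * 4 + [(50, 50)]   # four pairs 10 points apart

print(mean_disagreement(one_big, sq), mean_disagreement(many_small, sq))  # 320.0 80.0
print(mean_disagreement(one_big, ab), mean_disagreement(many_small, ab))  # 8.0 8.0
```

      Under the absolute metric the two datasets disagree equally; under the squared (interval) metric the single 40-point gap counts four times as much as four 10-point gaps. That is the behavior we should decide whether we actually want before settling on `method = "interval"` vs the ordinal version.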

    2. 0.07

      Wow, this is nearly zero agreement among humans, but I wonder if something is going on here because of the way we changed the categories / introduced new criteria. I think the claims criterion might have been something we introduced later in the process, at the same point that I coalesced the two criteria related to global relevance. (I could check this; we have documentation.)

    3. Table A.3: Krippendorff’s αHH

      I suggest we also include the agreement measures for the journal-tier predictions; they should be comparable to the others, at least if the measure is fairly unit-less.

    1. At 20 kTA reference scale

      This still needs more explanation. I don't know why you're using this reference scale, or why we're talking about pharma grade, etc.

    2. Scalable GF technology 50% Switches to “cheap” GF prices Pivotal uncertaint

      As mentioned elsewhere, this needs a lot more explanation or discussion. What is the major factor switching us between cheap and expensive growth factors? How much does this affect the outcomes? What are the different price distributions for the cheap versus expensive ones? I'm not actually seeing growth factors, or any of these p's, in the equations you give, at least not in a way that lets me unpack each element. ... OK, now it's partially explained above, but I still don't see what the different price distributions are for cheap versus expensive, or where they come from.

    3. yments:

      You should provide an explanation, in a folding box, of the CRF formula in intuitive terms. I suppose it depends on the interest rate r and the number of years n; explain how that works and why it equals this complicated formula.

    4. The slider complements this by letting users explore “what if progress is partial?” scenarios.

      This seems underexplained, and it seems to contradict what you just said.

    5. If any one of these succeeds at commercial scale, the “cheap” price regime applies

      That makes sense, but then what determines how you model the price in the "cheap" regime?

    6. Consider correlated scenarios via the maturity slider

      We probably want to unpack this more. One could imagine some forms of technical development going together and others less so.

    7. Weighted average cost of capital

      That seems rather high; what are the references for this? Why should capital be so expensive here? Is this comparable to benchmarks?

      And again, I want to be able to look up each of these elements within an equation somewhere; I don't see where that equation is. Make the links clearer.

    8. Breakthrough technologies that could trigger the “cheap” scenario: - Autocrine cell lines (cells produce own FGF2) - Plant molecular farming ($1-10/g target) - Precision fermentation at scale - Polyphenol substitution (reduces GF requirements by 80%)

      OK, you got to the question I asked above, although it still seems underexplained. Wouldn't each of these things have an independent effect on the cost of growth factors? So why is it just a zero-one switch?

    9. 30-200 g/L Final biomass at harvest Cycle time 0.5-5 days Time per production batch Media turnover 1-10 ratio 1=batch, >1=perfusion

      Interesting, but it should be clearer how this maps into the ultimate cost equation. Everything should be linked back in some way to a total cost formula; I'd like to be able to open, close, and unpack the different elements.

    10. Why correlate? In “good worlds” for cultured chicken: - Technologies are more likely adopted (higher P) - Custom reactors are more common (lower CAPEX) - Financing is cheaper (lower WACC) This prevents unrealistic scenarios where technology succeeds but financing remains prohibitively expensive.

      This is really not well explained; I don't see how the discussion relates to the equations here.

    11. The model uses a latent maturity factor (0–1) to correlate technology adoption, reactor costs, and financing: Padopted=bound(Pbase+k⋅(m−0.5),0,1) What does “bound” mean? bound(x, 0, 1) ensures the result stays between 0 and 1. Also c

      What are k and m here? Define and explain.
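
      My reading of the formula, as a minimal sketch (k and m are undefined on the page; I'm assuming m is the latent maturity factor in [0, 1] and k a sensitivity coefficient):

```python
def bound(x, lo, hi):
    """Clamp x into [lo, hi]; this is what the page calls 'bound'."""
    return max(lo, min(hi, x))

def p_adopted(p_base, m, k):
    """Adoption probability shifted by latent maturity m in [0, 1].

    m = 0.5 is neutral (no shift); k sets how strongly maturity moves
    the probability; the clamp keeps the result a valid probability.
    (These interpretations are guesses; the page should state them.)
    """
    return bound(p_base + k * (m - 0.5), 0.0, 1.0)

print(p_adopted(0.4, 0.5, 0.6))  # 0.4: neutral maturity leaves P unchanged
print(p_adopted(0.4, 1.0, 0.6))  # ~0.7: full maturity adds k/2
print(p_adopted(0.9, 1.0, 0.6))  # 1.0: clamped at the upper bound
```

      Even if this reading is right, the page should say so explicitly and give the value (or distribution) of k.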

    12. The GF progress slider interpolates between current and target prices: PGF=Pcurrent×(0.01)progress At 0% progress: current prices ($5,000–500,000/g) At 100% progress: target prices ($1–100/g for cheap scenario)

      The equation doesn't seem to be displayed correctly here. Explain more. I also don't understand "0.01^progress": with current prices of $5,000-500,000/g, a factor of 0.01 at full progress gives $50-5,000/g, not the stated $1-100/g target.
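
      One reading that does hit the stated targets is geometric (log-linear) interpolation between the current and target prices; a sketch under that assumption (my guess, not necessarily the model's actual formula):

```python
def gf_price(p_current, p_target, progress):
    """Geometric interpolation of growth-factor price as progress goes 0 -> 1.

    At progress = 0 this returns the current price; at progress = 1 the
    target price; in between, equal progress steps give equal *ratio*
    reductions. The page's p_current * 0.01**progress only reaches
    p_current/100 at full progress, which misses the stated targets
    unless the current/target ratio happens to be exactly 100x.
    """
    return p_current * (p_target / p_current) ** progress

print(gf_price(5000.0, 1.0, 0.0))  # 5000.0: current price at 0% progress
print(gf_price(5000.0, 1.0, 1.0))  # ~1.0: target price at 100% progress
print(gf_price(5000.0, 1.0, 0.5))  # ~70.7: geometric midpoint
```

      Whatever the intended formula is, it should be stated so that the endpoints visibly match the quoted current and target price ranges.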

    13. Example calculation: - Cell density: 50 g/L → need 1000/50 = 20 L per kg - Media turnover: 3× (perfusion system) → 20 × 3 = 60 L/kg - Media price: $0.50/L (hydrolysates) → 60 × 0.50 = $30/kg

      Is the "per liter" figure meaningful, though? Doesn't the density depend strongly on the media contents used?
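
      The page's worked example, as a tiny sketch (variable names are mine; the page gives only the arithmetic), which also shows where the density assumption enters:

```python
def media_cost_per_kg(cell_density_g_per_l, media_turnover, media_price_per_l):
    """Media cost per kg of biomass, following the page's worked example.

    1 kg of cells = 1000 g, so the working volume needed per kg is
    1000 / density; perfusion (turnover > 1) multiplies media use.
    """
    liters_per_kg = 1000.0 / cell_density_g_per_l
    total_media_l = liters_per_kg * media_turnover
    return total_media_l * media_price_per_l

print(media_cost_per_kg(50, 3, 0.50))  # 30.0 $/kg, matching the example
```

      Note that the result is very sensitive to the density input, which supports the objection: if richer (pricier) media enable higher densities, price-per-liter and achievable density are not independent knobs.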

    1. Given the available collected data [...], how should [funders] measure the impact on wellbeing? [...] What measures of well-being should charities, NGOs, and RCTs collect for impact analysis?

      Let's split up the answer boxes within this question, to ask separately about the best use of currently collected data for these cases, and about what data should be collected in the future.

    2. How reliable is the WELLBY measure [...] relative to other available measures in the 'wellbeing space'? How much insight is lost by using WELLBY and when will it steer us wrong?

      Signpost more clearly that we are talking about the very simple use of the WELLBY measure.

    3. More detailed questions on WELLBY reliability

      This should be 'on WELLBY reliability and wellbeing measures.' Also, the folding box is still not ideal here; it would be better to link out to another page/subpage (opening in a new window).

    4. "Meaningful change" = at least one intervention currently in the top 5 moves out of the top 5, OR the #1 ranked intervention changes. This assumes future RCTs incorporate these methods and Founders Pledge updates their CEA accordingly.

      This one is nice -- is it the same in the PQ table?