1,845 Matching Annotations
  1. Last 7 days
    1. Evaluator anonymity in current Unjournal public data. A bit over half of deduplicated evaluator-paper rating pairs use anonymous or generic public identifiers: 65/113, or 57.5%. This matters because later gatekeepers see the public report, but often not a named evaluator.

      interesting but you haven't really connected this to the discussion or model. Why does evaluator anonymity matter to the author's decision?

    2. That the premium persists once public evaluation is common.

      also, as with these signaling games, there are usually multiple equilibria, including a sort of 'babbling' one iirc

    3. Author Benefits, With Limits

      this section is not written well. As far as I know only the things linked to Propositions 1 and 2 even engage the model ... you can mention other possible costs and benefits not considered by the model, which are largely empirical questions

    4. The Proposition 2 signal, made interpretable by benchmarked ratings, journal-tier equivalents, and uncertainty intervals.

      link and tooltip "Proposition 2" -- it hasn't been introduced yet

    5. not formal publication by The Unjournal.

      remove 'not formal publication' -- already noted, but the 'formal publication' is not a benefit per se

    6. visible

      The diagram below is underexplained. Also note we give evaluators the opportunity to revise if the author points out clear omissions or errors

    7. Pr(favorable | H) = Pr(adverse | L)

      Why should the probability of each of these be the same? That's weird and seems very limiting to me. The model should be made more general.
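
      A minimal sketch of what "more general" could look like, assuming a binary quality state and a single public evaluation that is either favorable or adverse. The names `p0`, `q_h`, and `q_l` are illustrative and not taken from the note; setting `q_h == q_l` recovers the symmetric case quoted above.

```python
def posterior_high(p0, q_h, q_l, favorable):
    """Posterior Pr(Q = H) after one public evaluation (Bayes' rule).

    p0  : prior Pr(Q = H) held by the reader (assumed name)
    q_h : Pr(favorable | H)
    q_l : Pr(adverse | L); the note's symmetric case sets q_l == q_h
    """
    if favorable:
        like_h, like_l = q_h, 1.0 - q_l
    else:
        like_h, like_l = 1.0 - q_h, q_l
    evidence = p0 * like_h + (1.0 - p0) * like_l
    if evidence == 0.0:
        raise ValueError("this outcome has probability zero under the inputs")
    return p0 * like_h / evidence


# Symmetric accuracy, as in the quoted assumption.
symmetric = posterior_high(p0=0.5, q_h=0.8, q_l=0.8, favorable=True)

# Asymmetric accuracy: strong papers are rarely panned, weak ones often pass.
asymmetric = posterior_high(p0=0.5, q_h=0.9, q_l=0.6, favorable=True)
```

      With asymmetric accuracies, a favorable report moves the reader less when weak papers often receive favorable reports too, which is the kind of case the symmetric restriction rules out.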

    8. author response

      encouraging authors to respond, and evaluators to update their evaluations if authors find clear mistakes or oversight.

      Tooltip: We're also working to build and coach our evaluator pool, and hope to provide paid calibration workshops in future.

    9. Yes. Peer review is noisy: the 2014 NeurIPS experiment had committee disagreement on 43 of 166 duplicated papers

      link please, and give details in tooltips

    10. where The Unjournal's structure changes the mechanism from cases where private author risk remains.

      this is a bit too stark. I don't know if we can be sure that UJ fully changes these situations, nor whether in the 'risky' cases the risk is substantial. State it more tentatively or directionally

    11. strongest private case is a paper already ready for serious expert scrutiny whose true quality exceeds its default credibility.

      Should we also mention that 'submitting your work for public evaluation can also be a strong signal of your confidence in the credibility of your work'?

    12. Logic map: where public evaluation changes the author's payoff

      LaTeX is not rendered properly in the diagram below for "Bar C". It's also not clear what these variables are; they need better labeling.

    13. Read it as a working decision aid, not a final institutional position.

      I don't much like 'read it as a working decision aid'. It's a bit of this, but also a bit of a 'here's how to think about this' discussion document. You can say 'this does not represent The Unjournal's official position or advice' and link our Author FAQ https://globalimpact.gitbook.io/the-unjournal-project-and-communication-space/faq-interaction/for-researchers-authors for the latter. But also avoid the typical 'not this but that' AI language. These are two separate things

    14. seminar-defensible

      why 'seminar-defensible' -- this could confuse things, people will say "why not present it at seminars"? Also, you mean for such a paper; this could be interpreted as public (UJ) evaluations will make it seminar defensible

  2. uj-exeter-talk.netlify.app uj-exeter-talk.netlify.app
    1. A public commitment — and a signal. “I’m willing to have this evaluated openly.” Feedback now, a public signal now — journal path still open.

      Can we have a 'separating equilibrium' or other image here? Let's focus this slide on the 'willingness to receive and respond to public criticism signals research strength' part ... and then the 'immediate underdog benefit' thing is the next slide (which you can tease)

    2. runs

      1st and second box ... thought bubbles or callouts (0-100 percentile relative to pool); both the suggester and the assessor write a motivating explanation/discussion.

      "Whole team votes" -- 5 point approval scale (strong/weak/neutral)

    3. What does open (Unjournal) evaluation provide? Now: faster, useful feedback + a credible public signal, and useful inputs to practitioners and funders. Soon: it starts to carry career value. Eventually: it can replace much or all of what we ask the journal stamp to do. Which of these would actually help your work?

      the text on this 'all green slide' is a bit hard to read. Make it more readable and clear.

    4. For research leaders & managers encouraging engagement signals a commitment to rigour, transparency & innovation — and opens the research-impact channel (our funder & practitioner network, incl. Pivotal Questions). Two audiences. Individuals/committees: a strong public evaluation should count as evidence of quality in its own right — hiring, promotion, REF narratives, grants, esteem. Research managers / those setting direction for a group or department: encouraging engagement is a visible demonstration of research rigour, transparency and openness to innovation — and brings more, faster, more transparent feedback and signals than standard peer review. Exeter needn’t lead, but it could position itself as open to this innovation. The research-impact channel is real: strong connections to funders and nonprofit practitioners, including via Pivotal Questions. If useful, happy to discuss a light next step.

      space this slide out more

    5. Full “model” (v. preliminary, ~Fable-generated with human feedback): unjournal-reluctance-note.netlify.app Reframed per your screening/sorting logic: the value is highest when you’re strong-but-under-credited or just below a bar — exactly where an extra credible signal can move you. And if committing to open evaluation becomes a positive signal in a sorting equilibrium, you want to be an early mover. The case for waiting is narrow: work that already clears the bar AND a genuinely sensitive moment — then the extra signal adds little. I’d be less worried about “harmful criticism”: our evaluations are constructive and you get a public response; public scrutiny isn’t a bogeyman. For timing/embargo, people talk to us.

      This bit is cut off at the bottom of the page. Need to use the vertical space a little more conservatively here. Have smaller fonts or more use of the horizontal space.

    6. journals never teach

      Obviously, journals never teach this. Also, our research education system doesn't tend to teach this. But it's also not clear that this alone is something that has a lot of career value. It's the methodology, theory, context, etc. that has the career value, as it allows you to do better work.

    7. Build a reviewing reputation early — citable evaluations on a CV Only where someone here sees the fit — not a demanded programme. A reading group could shadow-evaluate a published package and compare their judgements to the expert evaluations — good methods training. We can support student involvement and there are paid RA-style roles.

      I don't know that I would emphasize this quite so much. First of all, this only (mainly) holds if you choose to sign your reviews, which not everyone wants to do, particularly early in their career. It does show some value and helps you demonstrate your understanding, but I don't think the profession rewards refereeing and reviewing all that much.

      A nice thing is you will get feedback on your evaluation from us and potentially from the authors, which will help you improve and learn.

    8. Training in structured evaluation — a skill journals never teach

      I think this is missing some of the key benefits here. Our public evaluations help you understand what issues other economists care about, as well as, to some extent, practitioners, funders, and people interested in impact. It's a methodological discussion that will help your own work, as well as help you understand how to engage in the peer review process.

      This also helps make you part of a conversation involving funders, grantmakers, and people that might be able to help your career and help you have more impact.

    9. timing

      I can see something like "sensitive career moment + work that likely passes the bar" being a situation where you wouldn't want this sort of public criticism or these additional signals. But it depends. If you're at a sensitive career moment but you think you're coming up just below the bar, or you are being systematically undervalued, then these additional signals might help. And if committing to public evaluation itself becomes a positive signal in a screening/sorting equilibrium, you will definitely want to do it in such a situation.

    10. criticism likely about taste / importance / fit

      I don't see this as clearly argued - The Unjournal isn't going to give these sorts of critiques in a way that I think will be harmful, and I don't want people to keep thinking that public criticism is somehow a deeply harmful thing.

    11. Exeter strength Capabilities it brings Behavioural & experimental (Hauser, Fonseca, Balafoutas) decision-making, elicitation, policy design LEEP / environmental (Bateman, Groom, Day) valuation, natural capital, evidence-based policy Health & wellbeing (Jamison, Medina-Lara) cost-effectiveness, wellbeing, decision modelling Development / applied micro (Jamison, Banerjee) interventions, external validity, welfare Econometrics & methods (Clarke) evaluation design, calibration, meta-science

      This seems a bit small; it isn't using the whole page.

    12. Forecasts via Metaculus; partners incl. Institute for Replication, Center for Open Science

      The Metaculus thing is a bit separate now. Go back to the Pivotal questions knowledge base and update on this.

    13. Examples: cultured-meat costs · plant-based substitution · WELLBY ↔︎ DALY conversion. This isn’t replacing academic agendas with consulting — it’s taking questions funders and practitioners already face and asking what research and expert judgement imply for actual choices. We elicit high-value questions, curate and evaluate the evidence, add structured expert forecasts (our Metaculus community), and synthesise. Timing: the wellbeing and cultured-meat workshops have happened; the plant-based substitution workshop is still in planning. Partner/related orgs include the Institute for Replication, the Center for Open Science, and Metaculus. (Logos can be added if we want.)

      Hyperlink the workshops here.

    14. partners incl. Institute for Replication, Center for Open Science

      Those aren't the partners on this. The partners on this include Founders Pledge and Animal Charity Evaluators. People from Coefficient Giving and many other organizations have participated in our workshops on these pivotal questions as well.

    15. Each package = 2–3 expert evaluators, backed by a broader community:

      I think the earlier version of this page was better. It's a little bit confusing because you're talking about evaluators, but you're showing the management team and advisory board here.

    16. Quantified, benchmarked:

      This benchmarking is necessary because we don't 'accept or reject', and we don't have a journal tier. So this is the path to Unjournal evaluations having career value as well as value for research users.

    17. ~330 screened → the ~57 we’ve published

      This doesn't quite make sense. First of all, we don't publish the research; we publish evaluations. Second of all, where are you getting the 330 figure? Maybe leave this off.

    18. is it already famous?

      That's not quite getting at the right thing because we actually do favor research that is more well known, as it's likely to be more influential. What we don't favor is research that is just simply seen as deeply intellectually interesting or clever

    19. “A very positive review of our work”

      'positive' is more about their work and not about the evaluation ... look for better feedback, including from authors

    20. 180+ evaluators stand behind every evaluation:

      the "180 evaluators" don't stand behind every evaluation ... this doesn't make sense. We have usually 2 evaluators per evaluation. Also mention the field specialist counts too

    21. What we’ve evaluated — 57 packages by area

      incorporate image/text of one of the award-winning evaluations here (or as a bonus vertical slide)

    22. “But why expose my paper?”

      illustration is OK but it's a bit too 'obvious' -- it should note that a signal could go in either direction ... perhaps it should be nested in a graph considering both internal and external signals ... and illustrate cases where the expected value is positive

    1. The strongest private case is a paper already ready for serious expert scrutiny whose true quality exceeds its default credibility

      what about the case that 'willingness to make all work available for evaluation could be a strong signal of your confidence and credibility'?

  3. uj-exeter-talk.netlify.app uj-exeter-talk.netlify.app
    1. good

      More like, "How does AI evaluation compare to humans?"

      And I'd frame this more as an open question, one we're exploring, but at the moment the general attitude seems to be that there needs to be a human in the loop, at the very least, making the final judgment calls, prioritization, and communication

    2. Questions for you

      I'd add things like: - How could this invigorate teaching and research training? - How could it help with building agendas, attracting funding, and demonstrating value for exercises like the REF?

    3. What would make an evaluation count as evidence of quality?

      This is perhaps the most important question here - maybe put this one first. ... What would make it reliable, meaningful, and valued?

    4. Where would faster public evaluation be most useful in economic

      Not quite 'in economics'. That's asking too much. Leave the last bit out; presumably they'll understand that we're asking about what would be useful to them.

    5. LLM vs. human ratings: modest correlation — not aligned enough to substitute

      To be honest, we don't have strong enough evidence to say it's not aligned enough to substitute. We have only attempted one trial. This slide overstates things; it would be better to link to and show some of our output, so people know what was done in our trial.

    6. Forecasts via Metaculus; partners incl. Institute for Replication, Center for Open Science

      Those are not the relevant partners. The relevant partners are Founders Pledge and Animal Charity Evaluators. We've had workshop participants from Coefficient Giving and many other organizations.

      Metaculus is not really at the center of this at the moment. It's more on our own pages and platforms (https://uj-wellbeing-workshop.netlify.app/beliefs); the Metaculus thing is sort of an extension.

    7. ~9 management · 15+ advisory board · 40+ field specialists · 180+ evaluators (over half economists, over half doctorates). The point isn’t celebrity endorsement — it’s that we have enough disciplinary coverage, advisory oversight, evaluator depth, and process experience to be taken seriously. Usually 2–3 evaluators per package (not “everyone behind every evaluation”). Field specialists across eight areas help prioritise and recruit. Two of those field specialists are here at Exeter (next section). Full team at unjournal.org/team.

      Make this bit larger and more visible. Maybe include some of the visuals on the composition of the evaluator pool.

    8. markets

      I don't think labor markets get at it here. We're talking about the impact of transformative technological change on labor markets, not labor markets on their own.

    9. It’s a coordination problem

      Add "funding and grantmaker incentives will help".

      And maybe replace this with "solving the coordination problem".

    10. Replace Fear of Standing Out with Fear of Missing Out

      Also a bullet about how we're making ourselves prominent in the ecosystem so that the evaluations and ratings will be seen before the paper is reviewed by conventional journals.

    11. e.g. Bonn’s tenure criterion: “at least one article in a top-5 general-interest journal.” The signal is the system. The honest diagnosis is a collective-action failure, not preference. Outside demand (funders who value research) can fund a better signal while academia decides how much to trust it — which is why an early, low-cost engagement from a place like Exeter matters. The Bonn example shows how hard-coded journal prestige has become. Detail slide.

      Why is this quote here? It doesn't seem to relate to the slide.

    12. Research evaluation for choices that can’t wait. The honest origin story (keep on-slide light, say this aloud): some early funders and partners come from the global-priorities / EA-adjacent world. They’re not mainly asking “is this top-5 material?” — they’re asking “how should this change our beliefs, and what should we do differently?” They need quantified beliefs with uncertainty and explicit reasoning. That’s a different demand signal from the journal system. Important framing: this is NOT “academics don’t want to change.” Many academics dislike the current system — but individual researchers and departments can’t safely move first. It’s a coordination failure mistaken for a preference. Outside demand matters because it can pay for a better signal while academia decides how much to trust it. And the demand may grow: AI wealth may expand impact-focused philanthropy — Anthropic has confidentially filed a draft S-1 for a proposed IPO (not money in hand, but a plausible tailwind).

      Skip this slogan. It's not just about speed and timing.

    13. “Published — so stop bothering me about it”

      Add some more vertical slides with the other costs of the existing system and the benefits of separating evaluation from "publication" and of making the evaluation public

    14. 1 · The problem

      These green slides are not so visually compelling. The text is small, the numbering is not particularly helpful, and there's no image or anything that makes it seem interesting.

    15. Careful quantified evaluation can begin to compete with — and eventually replace — the journal stamp.

      That's nice, but I also want to emphasize the value that we're providing in the medium term.

    16. Which pieces, if any, would actually help your work? Distinguish horizons: near term, this provides useful feedback, decision evidence, and an additional public quality signal. Medium term, if the evaluations prove calibrated and useful, they can begin to carry career value. Long term, they can replace some of what we currently ask journal prestige to do — not a claim that committees should ignore journals tomorrow. Final spoken close: “I’m not asking Exeter to adopt a system today — I’m asking which pieces of this, if any, would be useful enough for researchers here to try, use, challenge, or build on.” (Aside: the deck is open to Hypothes.is comments.)

      What was meant by "pieces" here?

    17. How the triage runs

      I don't think this is the right diagram. I think we want the other one illustrating just what the process has been ... People on the team suggest it, give it a rating for prioritization/potential for impact, the whole team votes on it, we finalize it, and liaise with the authors, etc. This is just a diagram about a particular way that we do or do not consider certain things in doing this prioritization.

    18. There’s already a real connection to build on — if useful. Not cold outreach, and not an institutional ask. There are already people and examples connected to Exeter; if Julian or Ben are here, acknowledge them. I’m interested in whether any of these connections are useful to people in this room.

      Drop this. This is obviously already a point being made.

  4. Jun 2026
    1. About Summary Pivotal Questions Live Sessions Resources ▾ Readings Linear WELLBY Analysis DALY-WELLBY Conversion Metaculus Question

      the colored-on-black text here is very hard to read

    1. How many WELLBYs equal 1 DALY?

      check and annotate -- what does e.g., 5 wellbys per DALY mean in context, and how does it compare with what people currently do?

    2. problem under consideration. So I'd resist doing a simple exchange rate."

      This seems like a valid objection, but I think the question is still phrased so that you could give a meaningful answer: the value generated if you were forced to use a single conversion rate.

    3. PQ1B — Recommended measure for funders

      The discussion is likely to be more valuable than the response, particularly for this question.

    4. Composite well-being measure

      let's do better to differentiate this from calibrated well-being. It's not fully clear to someone glancing at this briefly.

    5. How many WELLBYs equal 1 DALY?

      Make sure you can access the literal question from this interface to know exactly what the respondents are answering. If these things get very long, you can use tooltips.

    1. Interactive uncertainty model

      I don't think this is a stochastic model? Perhaps an extension of this should give these (correlated?) distributions. ... Squiggle-type modeling

      There should also be a display of the actual equation behind the model, and a folding box or linked page explaining it in more detail

    2. Organizations should distinguish runway decisions from upside options. If a project is valuable only under a fast-funding scenario, that dependence should be explicit rather than hidden inside local rumor. Funders and field builders should prioritize grantmaker capacity, plural donor relationships, legal vehicles, and evaluation infrastructure. These are the bottlenecks that convert paper wealth into usable grants.

      this advice seems on the overly generic side?

    1. or policy relevance.

      remove 'or policy relevance' perhaps -- The Unjournal prioritizes research with global impact potential (although that's not what we mainly rate the research on)

    2. or the likely criticism is about taste, importance, novelty, or fit rather than checkable claims

      not sure I understand the logic behind the latter part

    3. r[qA - (1-q)L] + (1-r)[(1-q)A - qL] - k ≥ 0

      notation needs improvement, and it should be explained more -- how derived, how to interpret it? Tooltips and expanding sections could help
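
      One possible reading of the inequality, sketched in code so each term can be inspected. This assumes `r` is the author's probability that the paper is high quality, `q` is the evaluation's accuracy (Pr(favorable | H) = Pr(adverse | L)), `A` is the gain from a favorable report, `L` is the loss from an adverse one, and `k` is the cost of submitting. These interpretations are inferred from the surrounding highlights, not confirmed by the note.

```python
def expected_net_payoff(r, q, A, L, k):
    """Left-hand side of r[qA - (1-q)L] + (1-r)[(1-q)A - qL] - k."""
    payoff_if_high = q * A - (1.0 - q) * L   # favorable w.p. q when Q = H
    payoff_if_low = (1.0 - q) * A - q * L    # favorable w.p. 1-q when Q = L
    return r * payoff_if_high + (1.0 - r) * payoff_if_low - k


def break_even_belief(q, A, L, k):
    """Smallest r at which submitting is worthwhile, assuming q > 0.5.

    Derived by setting the expression to zero and solving for r:
    r* = (k + q*L - (1-q)*A) / ((2q - 1) * (A + L)).
    The result can fall outside [0, 1], meaning always or never submit.
    """
    if q <= 0.5:
        raise ValueError("the evaluation must be informative (q > 0.5)")
    return (k + q * L - (1.0 - q) * A) / ((2.0 * q - 1.0) * (A + L))


# Hypothetical numbers, chosen only to illustrate the calculation.
net = expected_net_payoff(r=0.7, q=0.8, A=1.0, L=1.0, k=0.1)
threshold = break_even_belief(q=0.8, A=1.0, L=1.0, k=0.1)
```

      Read this way, the author submits when their private belief `r` is at or above `threshold`; a more accurate evaluation (higher `q`) or a larger gain `A` lowers that threshold.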

    4. Requesting a noisy public test is not the same as disclosing an already-known verifiable fact.

      I suspect another paper has dealt with this question ... 'when noisy signals help the seller' or some such

    5. e reader should not update from p0 after observing the evaluation res

      this needs clarification, I don't quite see why this is the case. Isn't it possible that the author's signal is positive so they submit, but the evaluator reading the paper gets a negative signal?

    6. binary quality Q in {H,L}, w

      is this 'binary' the relevant threshold? Where did it come from? Is it generalizable? Consider whether it misses some important nuance

    7. Public anonymity statistic. The anonymity choice is empirically important. Running python unjournal_anonymity_stats.py on the public data bundle gives 65 anonymous/generic public evaluator identifiers among 113 deduplicated evaluator-paper pairs with quantitative ratings: 57.5%. In the subset matched to published-evaluation status, the share is 63/105: 60.0%, with 7 unmatched title rows. This supports saying "a bit over half" choose anonymous/generic public identifiers, but the denominator should be stated. The wider evaluator_paper_level.csv denominator is not clean for this claim because survey-only rows are assigned generic Evaluator N labels.

      this should be a fold or footnote -- give a quick statistic and footnote how it was captured
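
      For the footnote, the two shares can be reproduced from the counts quoted in the highlight. This is only the arithmetic on those reported counts, not the `unjournal_anonymity_stats.py` script itself; the helper name `share_pct` is illustrative.

```python
def share_pct(numerator, denominator):
    """Percentage share, with the denominator kept explicit."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return 100.0 * numerator / denominator


# Counts as quoted in the highlighted passage.
all_pairs = share_pct(65, 113)       # deduplicated evaluator-paper pairs
matched_subset = share_pct(63, 105)  # subset matched to published status

print(f"{all_pairs:.1f}% of 113 pairs; {matched_subset:.1f}% of 105 matched pairs")
```

      Stating both denominators in the footnote keeps the "a bit over half" claim checkable.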

    8. When the answer is unclear, the practical move is not immediate publicity. It is a fit-and-timing conversation, coauthor consen

      too much 'not this but that' AI speak. And 'publicity' is vague here

    9. 2. Is the main obstacle credibility, visibility, field fit, or network access?

      this needs further explanation and clarification -- 'usual channels' should already encompass clarification

    10. e relevant question is not whether public evaluation is always good; it is when a public signal improves expected outcomes relative to waiting, revising privately, or continuing th

      this is the 'AI language of dichotomy' overused

    11. a result that a credible public test strictly helps authors whose default standing sits below the bar — and we are precise about the downside it carries for those just above it;

      The language of this is a bit unclear. Try to make it easier to understand.

    1. Explicit crux Which specific uncertainties — AGI timing, takeoff speed, power-seeking tendency, offense-defense balance, pause feasibility — most shift expert p(doom) estimates? Community solicitation for explicit AI-risk cruxes: uncertainties whose resolution would significantly shift p(doom), including AGI arrival year, takeoff speed, power-seekin

      this is meta -- I don't want meta, or at least put that into an 'opt-in' list

    1. ee our early automated prioritization prototype, which is outside legal research and currently focuses mainly on economics and related work.

      We can swap in here the legal prioritization prototype -- https://uj-prioritization-prototype.netlify.app/legal/ -- please do this -- and note that we're looking for feedback and examples to help improve and train this. Note that we don't envision this prioritization to be mainly driven by AI models -- humans will be making the ultimate decisions -- but these tools can be very helpful in the process.

    1. Comment directly on this page using the Hypothes.is sidebar (the < tab on the right edge). Or use the rating buttons on each paper card — human ratings are how we will calibrate these scores.

      Give people the option to suggest/add content.

    2. Comment directly on this page using the Hypothes.is sidebar (the < tab on the right edge). Or use the rating buttons on each paper card — human ratings are how we will calibrate these scores.

      Let us know if you have any questions about this.

    1. How this was made. Drafted by GPT Pro from existing Unjournal research and discussion (the elasticity-validation survey, the Bray et al. evaluation materials, and the PBM substitution literature), then built and polished into this interactive report in Claude Code. It is currently being reviewed and adjusted by hand. Treat figures and attributions as provisional until that review is complete; the governing evaluation lives on PubPub.

      Make this a folding box - and the header should say AI/human collaboration in some way

      Another folding box should have the standard call out about how we want feedback, and you can use the hypothesis tool for that.

    1. Note: This workshop is in early planning. The framing, evidence base, and participant list are still being developed.

      Still considering how to frame this workshop; it depends on interest and participation. One frame directly targets what we know about plant-based products, who consumes them, and what that suggests for potential substitution and animal welfare. However, that evidence seems rather thin, inconclusive, and perhaps premature. (See links to EA forum posts, etc.) Furthermore, our evaluation of Bray et al. on experimental versus standard quantitative marketing/I.O. estimates of own-price elasticities suggests perhaps deep uncertainty and an inability to be confident in these parameters, not to mention cross-price effects and substitution patterns. This potentially motivates a pivot towards these methodological questions, framed in terms of "what can we know and what research is worth pursuing."

    1. Thank you for participating in The Unjournal's Plant-Based Substitution Pivotal Questions workshop. Your feedback helps us measure the workshop's impact and improve future workshops.

      Remove this page for now because it makes it seem like the workshop already happened.

    1. A major methodological innovation. The framework is elegant and the estimation strategy is sound. The empirical component would especially benefit from more diverse and reliable samples, and from direct comparisons against existing scale-correction methods so readers can judge incremental value. Logic and communication could be tightened in places — rated lower here than the other dimensions.

      This is not his full evaluation. He gave a very in-depth evaluation, and you've only taken one paragraph here.

    2. The cost of calibration questions The central tension is practical, not theoretical. Prati flags that the evidence rests on a large number of calibration questions. It is unclear how well the correction performs with the realistic two or three CQs — and even two can be a heavy burden in large surveys. He suspects this is “one crucial reason anchoring vignettes have not been implemented at scale in 20 years.” Kaiser rates the work highly but pushes for more diverse, reliable samples and direct comparisons against existing scale-correction methods, so readers can judge the incremental value. His lower marks fall on logic & communication and on claims & evidence.
      1. Firstly, the header does not fully describe the critiques here. It's only one of the critiques.
      2. Secondly, even in this scrollytelling depiction, we probably want a bit more about what the evaluators are saying, going into more than one theme very briefly, because this is the core of The Unjournal's value add.
    3. Two experts, eight criteria

      We probably want a bit of a transition here between the issue of measuring individuals' well-being through self-reports and what The Unjournal is now doing in rating the paper, which, funnily enough, also uses scales that may have subjective components. Make the distinction clearer here

    4. For decades, economists hesitated to use subjective well-being data for one stubborn reason: people use survey scales differently.

      This probably needs a little bit more context on why we're trying to measure people's well-being and happiness through self-reports.

    5. Estimated from a few extra calibration questions — not a full vignette battery.

      the diagram is not fully explained? what does each dot represent? Should we be giving 'names of people' (or IDs, or types of people) to make that clearer?

    6. data for one stubborn reason:

      I know this is meant for a public audience, but it's a little bit oversimplified. Perhaps we can say it in an equally concise and appealing way, but without absolute claims like "for one stubborn reason..."; there may have been other reasons too. (Note to AI -- try to make this a persistent pattern in your writing.)