Hypothesis

6 Matching Annotations

Apr 2026
arxiv.org

https://arxiv.org/abs/2604.15034

1
1. fxp007 24 Apr 2026
  
  in Public
  
  The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution.
  
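  Although most AI researchers believe that self-evolution can improve performance, few have been able to show that the gains consistently exceed strong baselines across multiple challenging benchmarks. The authors claim that their AGS system not only achieves self-evolution, but that this evolution is closed-loop and auditable. This challenges the AI community's current understanding of self-evolving systems and suggests that a more structured approach to evolution may be more effective than an open-ended one.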
  
  counterintuitive ai-evaluation self-improvement

Tags

ai-evaluation

self-improvement

counterintuitive

Annotators

fxp007

URL

arxiv.org/abs/2604.15034
www.anthropic.com

Harness design for long-running application development

1
1. fxp007 09 Apr 2026
  
  in Public
  
  tuning a standalone evaluator to be skeptical turns out to be far more tractable
  
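  This sharply exposes the limits of LLM self-evaluation: a generator struggles to stay critical of its own work. Decoupling generation from evaluation, and deliberately tuning a standalone evaluator to be skeptical, effectively breaks the loop of AI self-congratulation. This adversarial architecture is a powerful lever for improving output quality.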
  
  self-evaluation multi-agent core-argument

Tags

self-evaluation

core-argument

multi-agent

Annotators

fxp007

URL

anthropic.com/engineering/harness-design-long-running-apps
Oct 2023
arxiv.org

2305.15486.pdf

1
1. mark.crowley 25 Oct 2023
  
  in Public
  
  Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
  
  Them's fightin' words!
  
  I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, it's a strong result with a very strong implied claim. They are careful to say it comes from their empirical results, so it's very much worth a look. I suspect that the amount of implicit knowledge in the papers, text, and DAG is helping to do this.
  
  The Big Question: is their comparison to the RL baselines fair, given that those baselines are trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) with an LLM approach (or any approach using a foundation model) mean, when that model is not really trained from scratch?
  
  reinforcement-learning rdgrp-f23 reading_group_crowley nlp larg deep-learning self-supervised supervised-learning evaluation-methods

Tags

self-supervised

reading_group_crowley

evaluation-methods

reinforcement-learning

supervised-learning

larg

deep-learning

nlp

rdgrp-f23

Annotators

mark.crowley

URL

arxiv.org/pdf/2305.15486.pdf
Oct 2022
physicstoday.scitation.org

How to become a successful physicist

1
1. chrisaldrich 04 Oct 2022
  
  in Public
  
  To be a successful physicist requires mastering how to make all 29 decisions, but the reflection decisions (decisions 23–26) are arguably the most difficult to learn.
  
  Of the 29 problem-solving decisions identified as important, the four "reflection decisions" (23–26 in the list) may be the most difficult to learn, as they require metacognition and self-evaluation.
  
  metacognition problem solving reflection decisions decision making self-evaluation

Tags

self-evaluation

metacognition

reflection decisions

problem solving

decision making

Annotators

chrisaldrich

URL

physicstoday.scitation.org/doi/10.1063/PT.3.5082
Jun 2020
psyarxiv.com

Preprint Averting Repulsion? Body-Directed Self-Disgust and Autobiographical Memory Retrieval

1
1. gailelhalaby 28 Jun 2020
  
  in BehSci
  
  Spreckelsen, P. von, Wessel, I., Glashouwer, K., & Jong, P. J. de. (2020). Preprint Averting Repulsion? Body-Directed Self-Disgust and Autobiographical Memory Retrieval. https://doi.org/10.31234/osf.io/qhc35
  
  lang:en is:preprint autobiographical memory self esteem self evaluation repulsive body image self-disgust disgust prevention body image memory specificity eating disorder mental health

Tags

is:preprint

eating disorder

body image

self evaluation

mental health

autobiographical memory

memory specificity

disgust prevention

lang:en

repulsive body image

self-disgust

self esteem

Annotators

gailelhalaby

URL

psyarxiv.com/qhc35/
psyarxiv.com

Midgley et al. (in press) When Every Day is a High School Reunion- Social Media Comparisons and Self-Esteem JPSP Preprint.pdf

1
1. gailelhalaby 28 Jun 2020
  
  in BehSci
  
  Midgley, C., Thai, S., Lockwood, P., Kovacheff, C., & Page-Gould, E. (2020). When Every Day is a High School Reunion: Social Media Comparisons and Self-Esteem [Preprint]. PsyArXiv. https://doi.org/10.31234/osf.io/zmy29
  
  is:preprint lang:en facebook instagram self-esteem social media social comparison digital mental health cognition self evaluation gender self perception

Tags

is:preprint

cognition

self-esteem

facebook

digital mental health

self perception

self evaluation

social comparison

instagram

lang:en

social media

gender

Annotators

gailelhalaby

URL

psyarxiv.com/zmy29/
