Hypothesis

2 Matching Annotations

Jul 2026
deep-reinforce.com

https://deep-reinforce.com/ornith_1_0.html

1
1. fxp007 03 Jul 2026
  
  in Public
  
  a frozen LLM judge acts as a veto on top of the verifier rather than the primary reward.
  
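  ① Numbers: the third layer of defense introduces an independent large model as the adjudicator. ② Key quote: an intent reviewer is layered on top of the rule-based verifier. ④ Critique: using a model to supervise a model risks deception through co-evolution. Freezing the judge's parameters prevents collusion, but the judge's own capability ceiling sets the ceiling of the defense, so this is not an absolutely reliable final solution.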
  
  llm-judge critique agent-safety
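
A minimal sketch of the veto arrangement described in the highlight above, assuming the verifier returns a numeric reward and the frozen judge returns only a yes/no veto. The names `gated_reward`, `toy_verifier`, and `toy_judge_vetoes` are hypothetical stand-ins for illustration, not anything taken from the annotated page.

```python
from typing import Callable


def gated_reward(
    output: str,
    verifier: Callable[[str], float],
    judge_vetoes: Callable[[str], bool],
    veto_reward: float = 0.0,
) -> float:
    """Return the verifier's reward unless the frozen judge vetoes the output."""
    reward = verifier(output)
    # The judge never adds reward; it can only cancel reward the verifier granted.
    if reward > 0 and judge_vetoes(output):
        return veto_reward
    return reward


# Toy stand-ins (assumptions): a rule-based check and a fixed "intent" check.
def toy_verifier(output: str) -> float:
    return 1.0 if "PASS" in output else 0.0


def toy_judge_vetoes(output: str) -> bool:
    return "hardcoded" in output


honest = gated_reward("PASS", toy_verifier, toy_judge_vetoes)
gamed = gated_reward("PASS via hardcoded answer", toy_verifier, toy_judge_vetoes)
failed = gated_reward("no result", toy_verifier, toy_judge_vetoes)
```

In this sketch the judge is consulted only when the verifier has already granted reward, so the verifier stays the primary signal and the judge can only subtract. The annotator's critique still applies: the veto is only as good as whatever `judge_vetoes` is able to detect.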

Tags

agent-safety

critique

llm-judge

Annotators

fxp007

URL

deep-reinforce.com/ornith_1_0.html
Jan 2026
stunlaw.blogspot.com

The Bliss Attractor

1
1. peter_murray 08 Jan 2026
  
  in Public
  
  safety constraints work by reducing the model's generative capacity, constraining outputs that are considered risky, controversial, or potentially harmful. This reduction necessarily decreases entropy in the information-theoretic sense, narrowing the range of possible responses the model can generate. What safety optimises for is not maximum (or more) information but maximum predictability, steering the model away from novel or unexpected outputs toward safer, more conventional patterns.
  
  LLM safety constraints narrow responses to increase predictability
  
  building LLMs LLM safety
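
The entropy claim in the highlight above can be illustrated with a toy calculation. This is a sketch under stated assumptions: the "model" is a uniform distribution over four candidate responses, and a "safety constraint" is modelled as removing two of them and renormalising. The helper names `entropy_bits` and `restrict` are hypothetical.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)) over nonzero p."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def restrict(probs, allowed):
    """Zero out disallowed response indices and renormalise the rest."""
    kept = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]


# Assumption: four equally likely responses before any constraint.
unconstrained = [0.25, 0.25, 0.25, 0.25]
# Assumption: the constraint permits only responses 0 and 1.
constrained = restrict(unconstrained, {0, 1})

h_before = entropy_bits(unconstrained)  # 2.0 bits
h_after = entropy_bits(constrained)     # 1.0 bit
```

Halving the set of permitted responses removes one bit of entropy in this uniform case, which is the narrowing the highlight describes. The sketch covers only this toy distribution and makes no claim about non-uniform cases.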

Tags

building LLMs

LLM safety

Annotators

peter_murray

URL

stunlaw.blogspot.com/2026/01/the-bliss-attractor.html
