Hypothesis

the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus

This is a powerful counterintuitive claim. It suggests that the problem isn't that these models don't know enough diverse values, but that they have been over-trained to agree with the user, creating a consensus that is not based on a robust representation of human values but on a learned desire to avoid friction.

RLHF Failure Sycophantic Consensus

Tags

Annotators

URL