1 Matching Annotations
  1. Last 7 days
    1. the failure mode of contemporary RLHF-trained assistants is not insufficient coverage but sycophantic consensus

      This is a powerful counterintuitive claim. It suggests that the problem isn't that these models don't know enough diverse values, but that they have been over-trained to agree with the user, creating a consensus that is not based on a robust representation of human values but on a learned desire to avoid friction.