16 Matching Annotations
  1. Last 7 days
    1. We found that using MINE directly gave identical performance when the task was nontrivial, but became very unstable if the target was easy to predict from the context (e.g., when predicting a single step in the future and the target overlaps with the context).

      all content that points to important caveats and gotchas that I might consider when leaning too heavily on the results of this paper
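The instability the quote describes is one reason CPC uses the InfoNCE objective rather than MINE: InfoNCE scores the true target against a set of negatives with a softmax, so each example's loss is a bounded cross-entropy term rather than an unbounded density-ratio estimate. A minimal sketch of the InfoNCE loss (the function name and toy score matrix are illustrative, not from the paper):

```python
import numpy as np

def info_nce_loss(scores):
    """InfoNCE loss for a batch of (context, target) pairs.

    scores[i, j] is a learned compatibility score between context i and
    candidate target j; the diagonal holds the positive pairs, and the
    other entries in each row act as negative samples.
    """
    # numerically stable row-wise log-softmax, evaluated on the diagonal
    m = scores.max(axis=1, keepdims=True)
    log_probs = scores - (m + np.log(np.exp(scores - m).sum(axis=1, keepdims=True)))
    return -np.mean(np.diag(log_probs))

# Strong diagonal scores: positives are easy to identify, so the loss is
# near zero, and it can never go below zero no matter how easy the task is.
loss = info_nce_loss(np.array([[10.0, 0.0], [0.0, 10.0]]))
```

The lower bound at zero is the point of contrast with MINE, whose estimate can blow up exactly in the easy-to-predict regime the authors flag.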

    2. We note that better results [49, 27] have been published on these target datasets, by transfer learning from a different source task.

    3. We also found that not all the information encoded is linearly accessible. When we used a single hidden layer instead, the accuracy increased from 64.6 to 72.5, which is closer to the accuracy of the fully supervised model.
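The probing setup behind this caveat can be reproduced in miniature: fit a linear classifier and a one-hidden-layer classifier on the same frozen features and compare accuracy. The XOR-style synthetic data below is purely illustrative (the paper probes learned speech representations); it just shows how information that is present in the features can be invisible to a linear probe:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic "frozen representations": the label is the XOR of two binary
# factors, so it is fully encoded but not linearly accessible.
X = rng.integers(0, 2, size=(400, 2)).astype(float)
y = np.logical_xor(X[:, 0], X[:, 1]).astype(int)
X += rng.normal(scale=0.05, size=X.shape)  # small feature noise

linear_probe = LogisticRegression().fit(X, y)
mlp_probe = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                          random_state=0).fit(X, y)

acc_linear = linear_probe.score(X, y)  # near chance: no linear boundary works
acc_mlp = mlp_probe.score(X, y)        # near perfect: one hidden layer suffices
```

The same pattern explains the jump from 64.6 to 72.5 in the quote: the gap measures linear accessibility, not how much the representation actually encodes.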

    4. For lasertag_three_opponents_small, the contrastive loss neither helps nor hurts. We suspect that this is due to the task design, which does not require memory and thus yields a purely reactive policy.

    5. Although this is a standard transfer learning benchmark, we found that models that learn better relationships in the children's books did not necessarily perform better on the target tasks (which are very different: movie reviews, etc.).

    6. We found that more advanced sentence encoders did not significantly improve the results, which may be due to the simplicity of the transfer tasks (e.g., in MPQA most datapoints consist of one or a few words), and the fact that bag-of-words models usually perform well on many NLP tasks [48].

    7. It is important to note that the window size (maximum context size for the GRU) has a big impact on the performance, and longer segments would give better results. Our model had a maximum of 20480 timesteps to process, which is slightly longer than a second.
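The "slightly longer than a second" figure follows directly from the sampling rate, assuming the 16 kHz LibriSpeech audio the paper uses for this experiment:

```python
SAMPLE_RATE_HZ = 16_000   # LibriSpeech audio is sampled at 16 kHz
WINDOW_SAMPLES = 20_480   # maximum GRU context window from the quote

window_seconds = WINDOW_SAMPLES / SAMPLE_RATE_HZ  # 1.28 s
```

So comparisons against this model implicitly assume a ~1.28 s receptive field; the quote warns that results would shift with longer segments.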

    1. Finally, our study is limited to short-form question-answering; future work should extend this analysis to longer-form generation settings.

    2. While our work demonstrates a promising new approach to generating calibrated confidences through verbalization, there are limitations that could be addressed in future work. First, our experiments are focused on factual recall-oriented problems, and the extent to which our observations would hold for reasoning-heavy settings is an interesting open question.

    3. The 1-stage and 2-stage verbalized numerical confidence prompts sometimes differ drastically in the calibration of their confidences. How can we reduce the sensitivity of a model's calibration to the prompt?

    4. Additionally, the lack of technical details available for many state-of-the-art closed RLHF-LMs may limit our ability to understand what factors enable a model to verbalize well-calibrated confidences and differences in this ability across different models.

    5. With Llama2-70B-Chat, verbalized calibration provides improvement over conditional probabilities across some metrics, but the improvement is much less consistent compared to GPT-* and Claude-*.

    6. The verbal calibration of the open source model Llama-2-70b-chat is generally weaker than that of closed source models but still demonstrates improvement over its conditional probabilities by some metrics, and does so most clearly on TruthfulQA.
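When the quotes above say improvement holds only "across some metrics", expected calibration error (ECE) is one of the standard metrics such comparisons use: bucket predictions by stated confidence and average the gap between each bucket's mean confidence and its accuracy. A minimal equal-width-bin sketch (the toy inputs are illustrative):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: sum_b (n_b / N) * |accuracy_b - confidence_b|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # bin weight times calibration gap
    return ece

# An overconfident toy model: says 0.9 but is right half the time there,
# and underconfidently says 0.6 where it is always right.
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 0, 1, 1])
```

Because ECE depends on binning and on the confidence distribution, two prompts (e.g., the 1-stage vs 2-stage prompts above) can rank differently under ECE than under, say, AUROC, which is why per-metric consistency matters in these comparisons.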

  2. Apr 2020
  3. Jun 2017