51 Matching Annotations
  1. Aug 2023
    1. Incorporated with CSV, the model becomes capable of using code to verify answers

      self-refine checks the code at the logical level, whereas many math problems can be verified simply by substituting the answer back in, so the check can be carried out with code.
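
      A minimal sketch of such a substitution check (the equation and the use of sympy are my own illustration, not the paper's code):

      ```python
      # Hypothetical example: verify a candidate answer to "solve x**2 - 5x + 6 = 0"
      # by substituting it back into the equation, instead of re-checking the reasoning.
      import sympy as sp

      x = sp.Symbol("x")
      equation = x**2 - 5 * x + 6   # illustrative problem, not taken from the paper
      candidate_answer = 2          # answer produced by the model

      # The check is purely mechanical: plug the answer in and see whether it holds.
      is_correct = sp.simplify(equation.subs(x, candidate_answer)) == 0
      print(is_correct)  # True
      ```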

    2. This study provides the first systematic analysis of code generation, execution, and self-debugging’s role in mathematical problem-solving.

      Actually self-refine did similar work, just not based on gpt4-code.

    1. in one round,

      Most likely this is overfitting: the generator cannot produce meaningful data for training the verifier. Re-training from the initial checkpoint in every round might mitigate the problem.

    2. The results clearly demonstrate the capability of CoRe to greatly boost PLMs’ reasoning ability.

      The main improvement comes from the verifier, not from self-thinking.

    1. This helps explain the effectiveness of our method.

      The effectiveness of this work mainly comes from two things: (1) the model generates instructions from unlabeled outputs, and (2) the model iteratively selects the better instruction-output pairs by itself. The model can generate instructions from outputs, and the resulting data contains high-quality pairs; this is presumably because producing a high-quality output is much harder than producing a high-quality instruction, so generating an instruction for an existing output is easier and more likely to yield a high-quality pair.
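
      A rough sketch of that two-stage loop, assuming a generic `generate` call for the model; the prompts, the 1-5 scale, and the threshold are illustrative, not the paper's exact recipe:

      ```python
      # Stage 1: backward instruction generation from unlabeled outputs.
      # Stage 2: iterative self-curation of the resulting instruction-output pairs.
      # `generate` is a placeholder for a call to the current model.

      def backtranslate(outputs, generate):
          """Ask the model to write an instruction that each output answers well."""
          return [(generate(f"Write an instruction for which this is a good response:\n{o}"), o)
                  for o in outputs]

      def self_curate(pairs, generate, threshold=4, rounds=2):
          """The model scores its own pairs and keeps only the high-quality ones."""
          kept = pairs
          for _ in range(rounds):
              scored = [(p, int(generate(f"Rate this instruction-response pair from 1 to 5:\n{p}")))
                        for p in kept]
              kept = [p for p, s in scored if s >= threshold]
              # The full method would re-finetune the model on `kept` before the
              # next round; that step is omitted in this sketch.
          return kept
      ```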

    1. Figure 1: Running Examples of Evol-Instruct.

      To apply this pipeline to the vision-language domain, the main problems are: 1. designing the in-breadth and in-depth evolution directions; 2. designing how GPT and the LVLM interact.

    1. Our GPT4Tools stands distinct from previous and concurrent studies [5–11] in three ways

      Unlike most work, which uses GPT to generate vision-language instruction-tuning data, this work generates instruction data for decomposing tasks and calling APIs.

    1. Percentage (%) of full-mark answers on LMExamQA

      The fact that the two fine-tuned models reach such a high proportion of full-mark answers suggests this benchmark is not hard enough either.

    2. Overview of our benchmarking method.

      If the goal is to generate a benchmark that many language models are bad at, breadth means finding the directions LLMs are weak in, and depth means repeatedly deepening along those directions: for example, broadly generate tasks of different categories, test the LLM, pick the worst-performing directions according to the results, and increase the difficulty there. Difficulty then stays graded, rising one level with each deepening step.
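
      A sketch of the loop proposed in this note; `generate_categories`, `evaluate`, and `harder` are hypothetical placeholders for LLM calls and scoring, not an existing pipeline:

      ```python
      # Breadth: generate many task categories. Depth: repeatedly deepen the
      # categories the target LLM handles worst, one difficulty level per pass.

      def build_hard_benchmark(generate_categories, evaluate, harder, n_levels=3, keep_worst=5):
          categories = generate_categories()                 # in-breadth generation
          benchmark = []
          for _ in range(n_levels):                          # graded in-depth passes
              scores = {c: evaluate(c) for c in categories}  # test the target LLM
              worst = sorted(scores, key=scores.get)[:keep_worst]
              categories = [harder(c) for c in worst]        # deepen only weak directions
              benchmark.extend(categories)
          return benchmark
      ```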

    1. In the image encoder, we insert the adapters before the multi-head attention modules.

      This paper also adds adapters inside the image encoder, which introduces more trainable parameters.
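
      A minimal PyTorch sketch of a bottleneck adapter of this kind; the bottleneck width and the exact placement before attention are assumptions, not the paper's design:

      ```python
      import torch.nn as nn

      class BottleneckAdapter(nn.Module):
          """Generic down-project / up-project adapter with a residual connection."""
          def __init__(self, dim, bottleneck=64):
              super().__init__()
              self.down = nn.Linear(dim, bottleneck)
              self.up = nn.Linear(bottleneck, dim)
              self.act = nn.GELU()

          def forward(self, x):
              # The residual keeps the frozen backbone path intact.
              return x + self.up(self.act(self.down(x)))

      # Inside an image-encoder block the adapter would run right before attention:
      #   x = adapter(x)
      #   x = x + attn(norm(x))
      # Only the adapter parameters are trained; the backbone stays frozen.
      ```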

    2. the number of optimized parameters is still kept at a very small scale, e.g., 3∼5M

      LLaVA fine-tunes the entire LLM, so compared with LLaVA this method tunes far fewer parameters; but MiniGPT-4 tunes only a single linear layer, which should be the smallest number of tuned parameters.

    3. can maintain the NLP capabilities of LLMs

      MiniGPT-4 freezes the LLM part, so it also fully preserves the NLP ability. The comparison here is presumably with LLaVA: since LLaVA uses LLaMA, it unfreezes LLaMA's weights during the vision-language instruction-tuning stage.

  2. Jul 2023
    1. To amplify the correspondence signal between the input instance and the correct label

      The model only learns to predict from the label distribution associated with the instruction, while ignoring the relation between the input instance and the label.

    1. We hypothesize that during inference, LLMs learn the correspondence between answer choice in the instruction (e.g. Determine the speaker of the dialogue, "agent" or "customer".) and the label (e.g. agent) from demonstrations

      Perhaps what the LLM mainly learns is to solve the problem according to the instruction.

    1. we follow their setting and only feed the detail description subset of SVIT into the model

      More detailed image descriptions can reduce the model's hallucinations when perceiving images.

    1. This indicates the importance of keeping the training data diverse and balanced across different categories in IFT

      This exposes a problem with the method: for harder or more niche questions, GPT may give lower scores, which makes those instruction examples less likely to be selected.

    2. ∼6k high-quality data suffices to finetune LLaMA achieving similar performance as the original ALPACA

      Is the surplus data dropped because its quality is low, or because its diversity is poor?

    3. we designate “accuracy” as the dimension for rating purposes

      A single dimension is certainly insufficient, and scoring each instruction example in isolation also fails to account for many factors, such as diversity.
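
      A sketch of that single-dimension rating-and-filtering step; the prompt, the threshold, and the `ask_gpt` helper are hypothetical:

      ```python
      # Score each (instruction, response) pair on a single "accuracy" dimension
      # and keep only the high-scoring pairs. `ask_gpt` stands for an API call.

      def filter_by_accuracy(dataset, ask_gpt, threshold=4.5):
          kept = []
          for instruction, response in dataset:
              prompt = ("Rate the accuracy of the response on a 0-5 scale.\n"
                        f"Instruction: {instruction}\nResponse: {response}\nScore:")
              if float(ask_gpt(prompt)) >= threshold:
                  kept.append((instruction, response))
          # Scoring pairs in isolation cannot capture dataset-level properties such
          # as diversity or category balance, which is the concern raised above.
          return kept
      ```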

    1. nearest neighbour score, which is a metric of dataset diversity

      This metric turns out to be the most important one, probably because the validation set has no overlap at all with the training set, so what is being tested is generalization after instruction tuning, and higher diversity means stronger generalization.
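
      A sketch of how a nearest-neighbour diversity score could be computed over instruction embeddings; the cosine-distance definition here is an assumption, not necessarily the paper's exact metric:

      ```python
      import numpy as np

      def nearest_neighbour_score(embeddings: np.ndarray) -> float:
          """Mean cosine distance from each example to its closest other example;
          larger values indicate a less redundant, more diverse dataset."""
          normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
          sims = normed @ normed.T
          np.fill_diagonal(sims, -np.inf)           # exclude self-similarity
          nearest_sim = sims.max(axis=1)            # similarity to the nearest neighbour
          return float(np.mean(1.0 - nearest_sim))  # average nearest-neighbour distance
      ```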

    1. GRIPS works best when models can follow declarative instructions and are responsive to changes to instructions (shown in Appendix D)

      Yet on instruction-following models the gain from instruction search is actually the smallest, possibly because after instruction tuning the model's ability to generalize over instructions improves so much that it is no longer very sensitive to small differences between instructions.

    1. We define the partial order between y_1 and candidates behind it as y_{1,2:n} = y_1 ≻ {y_2, · · · , y_n}, then the objective of Bradley-Terry becomes

      This extends the probability of being the best of two candidates to the probability of being the best of many.
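
      One plausible way to write out that extension, consistent with the quoted definition; the score function r is generic notation rather than the paper's exact symbol:

      ```latex
      % Pairwise Bradley-Terry:
      %   P(y_1 \succ y_2 \mid x) = \frac{\exp r(x, y_1)}{\exp r(x, y_1) + \exp r(x, y_2)}
      % Extended to "y_1 is preferred over all remaining candidates":
      \[
      P\bigl(y_{1,2:n} \mid x\bigr)
        = P\bigl(y_1 \succ \{y_2, \dots, y_n\} \mid x\bigr)
        = \frac{\exp\!\bigl(r(x, y_1)\bigr)}{\sum_{i=1}^{n} \exp\!\bigl(r(x, y_i)\bigr)}
      \]
      ```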

  3. Jun 2023
    1. We regard the reasoning paths that match the ground truth final answer as positive, and the others as negative.

      Doing it this way may cause a small problem: incorrect reasoning steps can still arrive at the correct answer.
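
      A sketch of that outcome-based labeling rule; `extract_final_answer` is a placeholder for whatever answer parsing the pipeline uses:

      ```python
      # Label sampled reasoning paths by whether their final answer matches the
      # ground truth. As noted above, a path with wrong intermediate steps can
      # still end at the right answer and be mislabeled as positive.

      def label_paths(paths, ground_truth, extract_final_answer):
          return [(path, extract_final_answer(path) == ground_truth) for path in paths]
      ```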

    1. The modest performance gains in Math Reasoning can be traced back to the inability to accurately identify whether there is any error.

      Could a model be fine-tuned specifically to detect and localize errors?

    1. This step is not intended to teach the generator new skills; it is intended only to teach the generator to produce solutions in the desired format

      In the verifier paper the generator was lightly fine-tuned, presumably because GPT-4 is inherently much stronger than GPT-3: sampling directly from GPT-3 rarely yields meaningful solutions, whereas GPT-4 does not need this step.

    2. It provides more precise feedback, since it specifies the exact location of any errors that occur.

      A good reward model should be able to identify where in the solution the error occurred; generalizing this ability from outcome supervision alone is very hard, and a finer-grained supervision signal can simplify the learning process.
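
      A sketch contrasting the two kinds of supervision at scoring time; aggregating step scores with `min` is one common choice, not necessarily the paper's:

      ```python
      # An outcome reward model returns one scalar for the whole solution, so it
      # says nothing about where the error is; a process reward model scores every
      # step, and a low step score also localizes the error.
      # `score_solution` and `score_step` are placeholder callables.

      def outcome_score(solution, score_solution):
          return score_solution(solution)

      def process_score(steps, score_step):
          step_scores = [score_step(s) for s in steps]
          return min(step_scores), step_scores
      ```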

    1. Unfortunately, test@100 performance degrades much more sharply than test@1 as we increase the number of epochs

      This shows that when using majority voting or rerank-style methods, one must prevent the generator from overfitting.

    1. since a single logical error is enough to derail a much larger solution
      1. Solving hard math problems is not easy even for models like GPT-3 and GPT-4; GPT-3 may need many samples before it produces a single correct answer, and a strong reward model could potentially pick that correct answer out (a best-of-n sketch follows after this list).
      2. For complex reasoning that relies on chain-of-thought, an error in a single reasoning step can derail the entire answer.
      3. Can a model's ability to solve hard math problems be decoupled into the ability to lay out the solution steps and the ability to execute them, and analyzed separately?
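
      A minimal best-of-n sketch of the filtering idea in point 1; `sample_solution` and `reward` are placeholder callables:

      ```python
      # Sample many candidate solutions from the generator and let a reward model
      # pick the one it scores highest.

      def best_of_n(problem, sample_solution, reward, n=100):
          candidates = [sample_solution(problem) for _ in range(n)]
          return max(candidates, key=lambda sol: reward(problem, sol))
      ```
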
    1. we use random words sampled from the unlabeled evaluation dataset as the content-free text

      This approach also takes domain label bias into account.
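
      A sketch of the calibration step this enables, following the general contextual-calibration recipe of dividing out the label prior estimated on content-free (here, random in-domain word) inputs; details differ across papers:

      ```python
      import numpy as np

      def calibrate(label_probs: np.ndarray, content_free_probs: np.ndarray) -> np.ndarray:
          """Both arrays hold the model's probabilities over the label set."""
          prior = content_free_probs / content_free_probs.sum()  # estimated label bias
          adjusted = label_probs / prior                          # divide the bias out
          return adjusted / adjusted.sum()                        # renormalize
      ```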

  4. May 2023
    1. Domain Label Bias

      It seems that domain label bias and vanilla label bias are not necessarily tied to ICL; this form of bias does not exist only in ICL.

    2. Using random words limits the semantic meaning of the input, allowing us to estimate the vanilla-label and context-label biases, while using in-domain words accounts for the effect of the task corpus

      Overall, we want the prompt to provide only useful, unbiased information, and to some extent this can be achieved through calibration.

    1. On the other hand, the models do not improve (or even decline in accuracy) on evaluation datasets for which there is little support.
      1. In other words, generalization is poor; is there a direction along which the instructions could be optimized?
      2. What does the model learn during instruction tuning? That is the key factor determining its generalization ability.
      3. Could we evaluate the model's generalization along dimensions such as topic, task type, and format?