Incorporated with CSV, the model becomes capable of using code to verify answers
Self-refine checks the code only at the logical level, whereas many math problems can be checked for correctness by substituting the answer back in, and thus can be verified through code.
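As a concrete illustration of substitution-based checking (the equation and all names here are my own toy example, not from the paper):

```python
import sympy as sp

# Toy example: verify a candidate answer by substituting it back into
# the original equation, rather than re-checking the derivation itself.
x = sp.symbols('x')
equation = sp.Eq(x**2 - 5*x + 6, 0)   # problem: solve x^2 - 5x + 6 = 0
candidate = 2                          # model's proposed answer

# Substitution check: the candidate is correct iff the equation holds.
is_correct = sp.simplify(equation.lhs.subs(x, candidate) - equation.rhs) == 0
print(is_correct)  # True
```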
Generating code in brief and frequent segments
Possibly because each generated code segment is shorter, it is less prone to errors.
can improve computational capability more than the natural language chains C_NL
Code can eliminate calculation errors.
This study provides the first systematic analysis of code generation, execution, and self-debugging’s role in mathematical problem-solving.
Self-refine actually did similar work, just not on top of gpt4-code.
in one round,
Very likely overfitting: the generator fails to produce meaningful data for training the verifier. Training from the initial checkpoint in every round might alleviate this.
The results clearly demonstrate the capability of CoRe to greatly boost PLMs' reasoning ability.
The main improvement comes from the verifier rather than from self-thinking.
In summary, the overall training objective for verifiers is given by
In practice, the two parts of this objective are trained separately.
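The formula itself is cut off in the excerpt above; a common form of such a joint verifier objective (e.g., GSM8K-style verifiers combine a verification loss with an auxiliary language-modeling loss) would be something like:

```latex
% Hypothetical reconstruction, not the paper's exact formula;
% \lambda weights the auxiliary language-modeling term.
\mathcal{L}_{\text{verifier}}
  = \mathcal{L}_{\text{verification}} + \lambda \, \mathcal{L}_{\text{LM}}
```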
This helps explain the effectiveness of our method.
The effectiveness of this work comes mainly from two things: (1) the model generates instructions from unlabeled outputs, and (2) the model iteratively selects the better instruction-output pairs by itself. That the model can generate instructions from outputs, and that the resulting data contains high-quality pairs, is probably because generating a high-quality output is much harder than generating a high-quality instruction; producing an instruction for an existing output is the easier direction, so it more readily yields high-quality pairs.
Figure 1: Running Examples of Evol-Instruct.
To apply this pipeline to the vision-language domain, the main problems are: 1. designing the in-breadth and in-depth evolution directions; 2. designing how GPT and the LVLM interact.
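A rough sketch of what such an adaptation might look like (all prompt templates and the `llm`/`evolve` names are my own illustrative assumptions, not from Evol-Instruct):

```python
# Hypothetical sketch of adapting Evol-Instruct to vision-language data.

IN_DEPTH = (
    "Rewrite the following instruction about the given image so that it "
    "requires one extra step of visual reasoning (e.g., counting, spatial "
    "relations, OCR):\n{instruction}"
)
IN_BREADTH = (
    "Write a new instruction about the same image that covers a different "
    "skill than this one:\n{instruction}"
)

def evolve(instruction: str, image_caption: str, llm) -> list[str]:
    """One evolution round: GPT evolves the text; the LVLM only answers.

    `llm` is a stand-in for any chat-completion callable. Since GPT cannot
    see the image, the caption (or detailed region descriptions) serves as
    a textual surrogate for the visual content.
    """
    evolved = []
    for template in (IN_DEPTH, IN_BREADTH):
        prompt = f"Image description: {image_caption}\n" + template.format(
            instruction=instruction
        )
        evolved.append(llm(prompt))
    return evolved
```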
Our GPT4Tools stands distinct from previous and concurrent studies [5–11] in three ways
Unlike most works, which use GPT to generate vision-language instruction-tuning data, this work generates instruction data for decomposing tasks and invoking APIs.
Percentage (%) of full-mark answers on LMExamQA
The two fine-tuned models achieve such a high proportion of full-mark answers that this benchmark is arguably not hard enough either.
Overview of our benchmarking method.
If the goal is to generate a benchmark that many language models are bad at, breadth means finding the directions where LLMs are weak, and depth means continually deepening along those directions. For example: broadly generate tasks of different categories, test LLMs on them, pick the worst-performing directions from the results, and deepen the difficulty there. Difficulty then stays layered: each deepening step raises it one level.
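A minimal sketch of this breadth-then-depth loop, assuming hypothetical `generate`, `grade`, and `harden` helpers:

```python
def build_adversarial_benchmark(model, categories, generate, grade, harden,
                                rounds=3, per_cat=20):
    """`generate(category, n)` makes questions, `grade(model, questions)`
    returns the model's score (higher = easier for it), and `harden(q)`
    rewrites a question to be more difficult. All three are hypothetical
    callables standing in for LLM-backed steps."""
    # Breadth: seed questions across many task categories.
    benchmark = {c: generate(c, per_cat) for c in categories}
    for _ in range(rounds):
        scores = {c: grade(model, qs) for c, qs in benchmark.items()}
        # Depth: pick the weakest direction and raise its difficulty.
        worst = min(scores, key=scores.get)
        benchmark[worst] = [harden(q) for q in benchmark[worst]]
    return benchmark
```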
the visual neck of LaVIN is 6 times smaller than that of LLaVA [18],
The linear layers used first down-project and then up-project, so there are fewer parameters.
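Back-of-envelope arithmetic with illustrative dimensions (ViT-style 1024-d features into a 4096-d LLM embedding space; the exact sizes are assumptions, not taken from the papers):

```python
# Compare a direct projection against a down-then-up bottleneck.
d_vis, d_llm, d_bottleneck = 1024, 4096, 128

direct = d_vis * d_llm                                    # one linear map
bottleneck = d_vis * d_bottleneck + d_bottleneck * d_llm  # down, then up

print(direct, bottleneck, direct / bottleneck)  # ~4.2M vs ~0.66M, ~6.4x fewer
```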
In the image encoder, we insert the adapters before the multi-head attention modules.
This paper also adds adapters in the image encoder, introducing more trainable parameters.
the number of optimized parameters is still kept at a very small scale, e.g., 3∼5M
LLaVA fine-tunes the entire LLM, so compared with LLaVA this tunes far fewer parameters; but MiniGPT-4 tunes only a single linear layer and should be the one with the fewest tuned parameters.
Mixture-of-Modality Adapter (MM-Adapter)
Different adapters provide the capabilities for different modalities, and a router then performs modality selection.
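A minimal PyTorch sketch of this reading of the MM-Adapter (per-modality bottleneck adapters plus a soft router; a sketch of the idea, not LaVIN's exact implementation):

```python
import torch
import torch.nn as nn

class MMAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 128, n_modalities: int = 2):
        super().__init__()
        # One down-then-up bottleneck adapter per modality.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_modalities)
        )
        self.router = nn.Linear(dim, n_modalities)  # soft modality selection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route on the mean token representation, then mix adapter outputs.
        weights = self.router(x.mean(dim=1)).softmax(-1)           # (B, M)
        outs = torch.stack([a(x) for a in self.adapters], dim=-1)  # (B, T, D, M)
        return x + (outs * weights[:, None, None, :]).sum(-1)      # residual
```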
can maintain the NLP capabilities of LLMs
MiniGPT-4 freezes the LLM part, so it also fully preserves NLP capability. The comparison here is presumably with LLaVA, which, since it uses LLaMA, unfreezes LLaMA's weights during the VL instruction-tuning stage.
To amplify the correspondence signal between the input instance and the correct label
The model only learns to predict from the label distribution implied by the instruction, ignoring the relation between the input instance and the label.
We hypothesize that during inference, LLMs learn the correspondence between answer choice in the instruction (e.g. Determine the speaker of the dialogue, "agent" or "customer".) and the label (e.g. agent) from demonstrations
Perhaps what the LLM mainly learns is to solve the problem according to the instruction.
NLI
Natural language inference: judging whether the hypothesis can be derived from the premise. It can be seen as a multiple-choice question with only two options.
Sentence Completion
These are all natural-language commonsense reasoning problems.
‘no prompt text’ prompt of COSMOS-QA dataset
COSMOS-QA is a commonsense-based reading-comprehension dataset in multiple-choice QA format.
The model trained on SVIT can describe abundant details accurately
This presumably benefits from the extra region information provided in the dataset.
we follow their setting and only feed the detail description subset of SVIT into the model
More detailed image descriptions can reduce the model's hallucinations when perceiving images.
Errors in original annotations.
Could existing vision-language models be used to detect and filter out such errors?
This indicates the importance of keeping the training data diverse and balanced across different categories in IFT
This exposes a problem with the method: for harder or more niche questions, GPT's ratings may skew low, so such instruction data has a lower probability of being selected.
∼6k high-quality data suffices to finetune LLaMA achieving similar performance as the original ALPACA
Is the surplus data redundant because of low quality, or because of poor diversity?
ALPAGASUS trained on 3k/6k/9k selected data.
More low-quality data hurts performance, whereas for high-quality data, more is better.
we designate “accuracy” as the dimension for rating purposes
A single dimension is surely insufficient, and rating each instruction example in isolation also fails to account for many factors, such as diversity.
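For reference, the selection scheme boils down to something like the following sketch (the `judge` callable and the 4.5 threshold are assumptions standing in for the paper's LLM-as-judge setup):

```python
def filter_ift_data(pairs, judge, threshold=4.5):
    """AlpaGasus-style selection sketch: rate each (instruction, response)
    pair on the single "accuracy" dimension, keep high scorers.

    `judge` is any callable mapping a prompt string to a 1-5 float score,
    e.g., a wrapper around a chat-completion API."""
    kept = []
    for instruction, response in pairs:
        prompt = (
            "Rate the accuracy of the response to the instruction on a "
            f"1-5 scale.\nInstruction: {instruction}\nResponse: {response}"
        )
        if judge(prompt) >= threshold:
            kept.append((instruction, response))
    return kept
```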
nearest neighbour score, which is a metric of dataset diversity
This metric appears to be the most important one, probably because the validation set has no overlap with the training set, so what is tested is generalization after instruction tuning, and higher diversity yields stronger generalization.
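One plausible formulation of such a nearest-neighbour diversity score (whether it matches the paper's exact definition is an assumption):

```python
import numpy as np

def nn_diversity_score(embeddings: np.ndarray) -> float:
    """Average distance from each example to its closest other example.
    Larger values = less redundancy in the dataset."""
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diff, axis=-1)   # pairwise Euclidean distances
    np.fill_diagonal(dists, np.inf)         # ignore self-distance
    return float(dists.min(axis=1).mean())
```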
GRIPS works best when models can follow declarative instructions and are responsive to changes to instructions (shown in Appendix D)
Yet it is precisely on instruction-following models that instruction search brings the smallest gains. Perhaps instruction tuning greatly improves the model's generalization in understanding instructions, making it far less sensitive to small differences in wording.
We define the partial order between y_1 and the candidates behind it as y_{1,2:n} = y_1 ≻ {y_2, …, y_n}; then the objective of Bradley-Terry becomes
This extends the probability of being the better of two to the probability of being the best among many.
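The formula after "becomes" is not quoted above; the standard one-vs-rest extension of Bradley-Terry (Plackett-Luce style) would read:

```latex
% Presumed form, reconstructed from the note above rather than quoted.
P\left(y_{1,2:n} \mid x\right)
  = \frac{\exp\big(r(x, y_1)\big)}{\sum_{i=1}^{n} \exp\big(r(x, y_i)\big)}
```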
Step-aware Verifier
The advantage of the step-aware verifier comes from the denser supervision signal during training.
Step-aware Voting Verifier
Why not aggregate the results over all steps as the verdict for an answer?
We regard the reasoning paths that match the ground truth final answer as positive, and the others as negative.
This may have a minor issue: incorrect reasoning steps can still lead to a correct final answer.
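A minimal sketch of the voting-verifier aggregation, with a note on the step-level variant the question above asks about (names illustrative):

```python
from collections import defaultdict

def voting_verifier(paths):
    """`paths` is a list of (final_answer, verifier_score) pairs, one per
    sampled reasoning path. Instead of plain majority voting, each path's
    vote is weighted by its verifier score. A step-aware variant could
    replace the score with an aggregate (e.g., mean or min) of per-step
    scores, which is what the question above suggests."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)

# Example: three paths reach "42", one path reaches "41".
print(voting_verifier([("42", 0.9), ("42", 0.2), ("42", 0.4), ("41", 0.8)]))
```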
feedback-based imitation learning
In fact, methods can be divided into three levels according to how hard the feedback is to obtain.
several advantages
These are advantages shared by all feedback-without-RL methods.
The modest performance gains in Math Reasoning can be traced back to the inability to accurately identify whether there is any error.
Then could one specifically fine-tune a model to check for and localize errors?
Could step-supervised methods be combined with decoding, as a variant of rerank methods? Reranking fits better with the output-supervised idea.
This step is not intended to teach the generator new skills; it is intended only to teach the generator to produce solutions in the desired format
In the verifier paper, the generator was given a small amount of fine-tuning. This is presumably because GPT-4 is inherently much stronger than GPT-3: sampling directly from GPT-3 rarely yields meaningful solutions, whereas GPT-4 does not need such fine-tuning.
It provides more precise feedback, since it specifies the exact location of any errors that occur.
A good reward model should be able to identify where in a solution the error occurs. Generalizing this ability from outcome supervision alone is very difficult; finer-grained supervision signals can simplify the learning process.
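One common way such step-level (process) supervision is reduced to a solution score is a product of per-step correctness probabilities, since a single wrong step invalidates the whole solution. A sketch:

```python
import math

def solution_score(step_probs: list[float]) -> float:
    """Process-reward-model-style scoring: the solution is only as good
    as its weakest step, so multiply per-step correctness probabilities."""
    return math.prod(step_probs)

print(solution_score([0.99, 0.98, 0.95]))  # ~0.92: all steps sound
print(solution_score([0.99, 0.30, 0.95]))  # ~0.28: one weak step tanks it
```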
dropout significantly improves solution-level verifiers
A 7B model used for classification overfits very easily.
Unfortunately, test@100 performance degrades much more sharply than test@1 as we increase the number of epochs
This shows that when using majority voting or rerank-like methods, one must guard against generator overfitting.
Supervised finetuning
Supervised finetuning can also be divided into outcome-based and process-based.
PAIRRANKER outperforms other rankers.
When used only for ranking and not for RL, listwise or pairwise methods are surely more effective than pointwise ones.
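A minimal sketch of pairwise win-count ranking in the PairRanker spirit (the `prefer` comparator is a hypothetical stand-in for the trained pair scorer):

```python
from itertools import combinations

def pairwise_rank(candidates, prefer):
    """Compare candidates in pairs and rank by number of pairwise wins.
    `prefer(a, b)` returns True if the comparator prefers `a` over `b`."""
    wins = {c: 0 for c in candidates}
    for a, b in combinations(candidates, 2):
        wins[a if prefer(a, b) else b] += 1
    return sorted(candidates, key=lambda c: wins[c], reverse=True)
```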
since a single logical error is enough to derail a much larger solution
showing that they do not introduce domain-label bias
Domain-label bias is mitigated by using random in-domain words.
we use random words sampled from the unlabeled evaluation dataset as the content-free text
This way, domain-label bias is taken into account as well.
Domain Label Bias
Domain-label bias and vanilla label bias do not seem inherently tied to ICL; these forms of bias are not exclusive to ICL.
Using random words limits the semantic meaning of the input, allowing us to estimate the vanilla-label and context-label biases while using in-domain words accounts for the effect of the task corpus
Overall, we want the prompt as a whole to carry only useful, unbiased information, and calibration achieves this to some extent.
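A minimal sketch of contextual-calibration-style correction, assuming the bias estimate comes from content-free inputs (here, random in-domain words, as quoted above); the numbers are illustrative:

```python
import numpy as np

def calibrate(label_probs: np.ndarray, bias_probs: np.ndarray) -> np.ndarray:
    """`label_probs`: model's label distribution on a real input.
    `bias_probs`: average label distribution on content-free inputs.
    Divide out the estimated bias, then renormalize."""
    scores = label_probs / bias_probs
    return scores / scores.sum()

p = np.array([0.7, 0.3])     # raw prediction, biased toward label 0
bias = np.array([0.8, 0.2])  # labels' prior under content-free input
print(calibrate(p, bias))    # -> roughly [0.37, 0.63] after correction
```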
On the other hand, the models do not improve (or even decline in accuracy) on evaluation datasets for which there is little support.