844 Matching Annotations
  1. Sep 2023
    1. Run prediction and returns predictions and potential metrics. Depending on the dataset and your use case, your test dataset may contain labels. In that case, this method will also return metrics, like in evaluate().

      [!NOTE] What does the predict() method of 🤗 Trainer do?

      flashcard

      Runs prediction on the test dataset and returns the predictions; if the dataset contains labels, it also returns metrics (as in evaluate()).

    2. combined (bool, optional, defaults to True) — Creates combined metrics by updating all_results.json with metrics of this call

      [!NOTE] How do you make 🤗 Trainer update all_results.json?

      flashcard

      trainer.save_metrics(..., combined=True)

    3. This also means that if any other tool that is used along the Trainer calls torch.cuda.reset_peak_memory_stats, the gpu peak memory stats could be invalid. And the Trainer will disrupt the normal behavior of any such tools that rely on calling torch.cuda.reset_peak_memory_stats themselves.

      [!NOTE] What problems can 🤗 Trainer's memory-usage tracking have?

      flashcard

      It relies on torch.cuda.memory_allocated()/torch.cuda.max_memory_allocated(); if your own code or another tool also calls these or torch.cuda.reset_peak_memory_stats, the reported numbers can be wrong.

    4. Because evaluation calls may happen during train, we can’t handle nested invocations because torch.cuda.max_memory_allocated is a single counter, so if it gets reset by a nested eval call, train’s tracker will report incorrect info. If this pytorch issue gets resolved it will be possible to change this class to be re-entrant. Until then we will only track the outer level of train, evaluate and predict methods. Which means that if eval is called during train, it’s the latter that will account for its memory usage and that of the former.

      [!NOTE] What is the problem with torch.cuda.max_memory_allocated?

      flashcard

      It is a single counter and therefore not re-entrant: a nested call (e.g. eval inside train) resets it and the outer tracker reports incorrect numbers.

    5. this tracker doesn’t account for memory allocations outside of Trainer’s __init__, train, evaluate and predict calls.

      [!NOTE] Memory allocations inside which calls does 🤗 Trainer's memory tracker account for?

      flashcard

      1. __init__()
      2. train()
      3. evaluate()
      4. predict()
    6. the very first cuda call typically loads CUDA kernels, which may take from 0.5 to 2GB of GPU memory.

      [!NOTE] What extra memory does the very first CUDA call use?

      flashcard

      It loads the CUDA kernels, which typically take 0.5-2 GB of GPU memory.

    7. The CPU RAM metric measures RSS (Resident Set Size) includes both the memory which is unique to the process and the memory shared with other processes. It is important to note that it does not include swapped out memory, so the reports could be imprecise.

      [!NOTE] What exactly does the CPU RAM usage reported by 🤗 Trainer.log_metrics measure?

      flashcard

      RSS (Resident Set Size): includes both memory unique to the process and memory shared with other processes; excludes swapped-out memory.

    1. A class that handles the Trainer control flow. This class is used by the TrainerCallback to activate some switches in the training loop.

      [!NOTE] In a 🤗 TrainerCallback, how do you influence the training loop through control?

      flashcard

      1. Set the corresponding should_<action> attribute
      2. return control
    2. on_epoch_begin

      [!NOTE] How does a 🤗 TrainerCallback know when each of its methods is called?

      flashcard

      It is encoded in the method name (on_<event>); the Trainer calls the matching method at that point of the training loop.

    3. trainer = Trainer( model, args, train_dataset=train_dataset, eval_dataset=eval_dataset, callbacks=[MyCallback], # We can either pass the callback class this way or an instance of it (MyCallback()) ) Another way to register a callback is to call trainer.add_callback() as follows: Copied trainer = Trainer(...) trainer.add_callback(MyCallback) # Alternatively, we can pass an instance of the callback class trainer.add_callback(MyCallback())

      [!NOTE] How do you add a callback to a 🤗 Trainer?

      flashcard

      • Trainer(..., callbacks=[...])
      • trainer.add_callback(...)
    4. The argument args, state and control are positionals for all events, all the others are grouped in kwargs. You can unpack the ones you need in the signature of the event using them. As an example, see the code of the simple ~transformer.PrinterCallback. Example: Copied class PrinterCallback(TrainerCallback): def on_log(self, args, state, control, logs=None, **kwargs):

      [!NOTE] How are the arguments of 🤗 TrainerCallback methods defined?

      flashcard

      1. args
      2. state
      3. control
      4. kwargs (individual ones can also be unpacked explicitly in the signature)
    5. The control object is the only one that can be changed by the callback, in which case the event that changes it should return the modified version.

      [!NOTE] How does a 🤗 TrainerCallback method receive the control object?

      flashcard

      As the third positional argument.

    6. A class containing the Trainer inner state that will be saved along the model and optimizer when checkpointing and passed to the TrainerCallback.

      [!NOTE] How is 🤗 TrainerState passed to a callback?

      flashcard

      As the state positional argument of every event, e.g. def on_save(self, args, state, control, **kwargs); it is also saved alongside the model when checkpointing.

    7. When using gradient accumulation, one update step may require several forward and backward passes: if you use gradient_accumulation_steps=n, then one update step requires going through n batches.

      [!NOTE] With gradient accumulation (n steps), what computation does one optimizer update step correspond to, and how does this differ from training without it?

      flashcard

      It iterates over n batches, doing n forward + backward passes before a single update; without accumulation, the same work would correspond to n optimizer update steps. A sketch follows below.
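
      A minimal plain-PyTorch sketch of the idea (all names are illustrative; this is not the Trainer's internal code):

      ```python
      import torch
      from torch import nn

      model = nn.Linear(10, 1)
      optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
      dataloader = [(torch.randn(8, 10), torch.randn(8, 1)) for _ in range(8)]

      accumulation_steps = 4          # corresponds to gradient_accumulation_steps=n
      optimizer.zero_grad()
      for step, (x, y) in enumerate(dataloader):
          loss = nn.functional.mse_loss(model(x), y) / accumulation_steps  # scale so summed grads average out
          loss.backward()                                                  # gradients accumulate in .grad
          if (step + 1) % accumulation_steps == 0:
              optimizer.step()        # one update step after n forward/backward passes
              optimizer.zero_grad()
      ```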

    8. The main class that implements callbacks is TrainerCallback. It gets the TrainingArguments used to instantiate the Trainer, can access that Trainer’s internal state via TrainerState, and can take some actions on the training loop via TrainerControl.

      [!NOTE] What is the main class behind 🤗 Transformers Trainer callbacks, and what can it do?

      flashcard

      TrainerCallback: 1. it gets the TrainingArguments used to instantiate the Trainer, 2. it can access the Trainer's internal state via TrainerState, 3. it can take some actions on the training loop via TrainerControl.

    9. Callbacks are “read only” pieces of code, apart from the TrainerControl object they return, they cannot change anything in the training loop.

      [!NOTE] What is the scope of what 🤗 Transformers Trainer callbacks can do?

      flashcard

      • They cannot change anything in the training loop directly ("read only")
      • They can only influence it through the TrainerControl object they return
    10. For customizations that require changes in the training loop, you should subclass Trainer and override the methods you need (see trainer for examples).

      [!NOTE] In 🤗 Transformers, what do you need to do to change the training loop itself?

      flashcard

      1. Subclass Trainer
      2. Override the methods you need (see the sketch below)
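
      A minimal sketch of overriding one Trainer method, here compute_loss; the body is purely illustrative:

      ```python
      from transformers import Trainer

      class MyTrainer(Trainer):
          def compute_loss(self, model, inputs, return_outputs=False):
              outputs = model(**inputs)
              loss = outputs.loss  # replace with custom loss logic as needed
              return (loss, outputs) if return_outputs else loss
      ```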
    1. Question: can't <pad> simply be masked out via attn_mask, so that the model effectively never sees it? Answer: the effect comes from positional encoding. <pad> can indeed be ignored through attention_mask, but without positional information the tokens of a sequence could be placed in any order without affecting the model; that is exactly why Transformers introduce positional encodings. For Llama specifically, rotary position embedding (RoPE) makes attention depend on the relative positions between tokens, so inserting many <pad> tokens between <input> and <target> distorts those relative positions and hurts generation quality.

      [!NOTE] How exactly does the padding side affect a Transformer's computation?

      flashcard

      Through the positional encoding: with many pad tokens inserted in the middle, the tokens on either side appear far apart from each other, which does not match reality.

    1. MQA is a variant of MHA: K and V have only one head while Q keeps multiple heads, and the attention scores are computed against the shared K.

      [!NOTE] What is the basic idea of MQA?

      flashcard

      Only Q is multi-headed; K/V use a single shared head (see the shape sketch below).
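
      A shape-level sketch of the idea (illustrative only, not an optimized kernel):

      ```python
      import torch

      B, H, T, D = 2, 8, 16, 64
      q = torch.randn(B, H, T, D)   # multi-head queries
      k = torch.randn(B, 1, T, D)   # single shared key head
      v = torch.randn(B, 1, T, D)   # single shared value head

      scores = q @ k.transpose(-2, -1) / D**0.5   # (B, H, T, T): k broadcasts over the head dim
      out = scores.softmax(dim=-1) @ v            # (B, H, T, D)
      ```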

    1. bnb_4bit_quant_type (str, {fp4, nf4}, defaults to fp4) — This sets the quantization data type in the bnb.nn.Linear4Bit layers. Options are FP4 and NF4 data types which are specified by fp4 or nf4.

      [!NOTE] What are the common choices of data type for 4-bit quantization?

      flashcard

      • fp4
      • nf4 (what exactly is the difference between them?)
    2. llm_int8_has_fp16_weight (bool, optional, defaults to False) — This flag runs LLM.int8() with 16-bit main weights. This is useful for fine-tuning as the weights do not have to be converted back and forth for the backward pass.

      [!NOTE] With int8 quantization, which option keeps 16-bit main weights (useful for fine-tuning)?

      flashcard

      The llm_int8_has_fp16_weight parameter (but what exactly are the "main weights"?)

    3. llm_int8_skip_modules (List[str], optional) — An explicit list of the modules that we do not want to convert in 8-bit. This is useful for models such as Jukebox that has several heads in different places and not necessarily at the last position. For example for CausalLM models, the last lm_head is kept in its original dtype.

      [!NOTE] Which modules should LLM.int8() be applied to, and which should be skipped?

      flashcard

      The various heads (e.g. lm_head) are usually not quantized but kept in their original dtype, presumably so the outputs stay meaningful?

    4. llm_int8_threshold (float, optional, defaults to 6) — This corresponds to the outlier threshold for outlier detection as described in LLM.int8() : 8-bit Matrix Multiplication for Transformers at Scale paper: https://arxiv.org/abs/2208.07339 Any hidden states value that is above this threshold will be considered an outlier and the operation on those values will be done in fp16. Values are usually normally distributed, that is, most values are in the range [-3.5, 3.5], but there are some exceptional systematic outliers that are very differently distributed for large models. These outliers are often in the interval [-60, -6] or [6, 60]. Int8 quantization works well for values of magnitude ~5, but beyond that, there is a significant performance penalty. A good default threshold is 6, but a lower threshold might be needed for more unstable models (small models, fine-tuning).

      [!NOTE] How should the int8 threshold of LLM.int8() be chosen?

      flashcard

      • Hidden-state values are roughly normally distributed; most lie in [-3.5, 3.5]
      • int8 quantization works well up to magnitudes of about 5; beyond that there is a significant quality penalty
      • Systematic outliers usually have absolute values in [6, 60]
      • 6 is a good default threshold (unstable models may need a lower one)
    5. llm_int8_enable_fp32_cpu_offload (bool, optional, defaults to False) — This flag is used for advanced use cases and users that are aware of this feature. If you want to split your model in different parts and run some parts in int8 on GPU and some parts in fp32 on CPU, you can use this flag. This is useful for offloading large models such as google/flan-t5-xxl. Note that the int8 operations will not be run on CPU.

      [!NOTE] With BitsAndBytes int8 quantization, how can CPU resources be used?

      flashcard

      Offload parts of the model in fp32 to the CPU (the int8 operations themselves never run on CPU); a config sketch follows below.
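
      A sketch combining the int8 options discussed above via transformers' BitsAndBytesConfig (the model name and skip list are just examples):

      ```python
      from transformers import AutoModelForCausalLM, BitsAndBytesConfig

      bnb_config = BitsAndBytesConfig(
          load_in_8bit=True,
          llm_int8_threshold=6.0,                 # outlier threshold; values above it use the fp16 path
          llm_int8_skip_modules=["lm_head"],      # keep the head in its original dtype
          llm_int8_enable_fp32_cpu_offload=True,  # allow fp32 parts to live on the CPU
      )
      model = AutoModelForCausalLM.from_pretrained(
          "bigscience/bloom-1b7", quantization_config=bnb_config, device_map="auto"
      )
      ```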

  2. huggingface.co
    1. device_map (str or Dict[str, Union[int, str, torch.device]] or int or torch.device, optional) — A map that specifies where each submodule should go. It doesn’t need to be refined to each parameter/buffer name, once a given module name is inside, every submodule of it will be sent to the same device. If we only pass the device (e.g., "cpu", "cuda:1", "mps", or a GPU ordinal rank like 1) on which the model will be allocated, the device map will map the entire model to this device. Passing device_map = 0 means put the whole model on GPU 0. To have Accelerate compute the most optimized device_map automatically, set device_map="auto".

      [!NOTE] In 🤗 Accelerate, what values can device_map take?

      flashcard

      1. A single device (e.g. "cpu", "cuda:1", or a GPU index): the whole model is placed on it
      2. A dict mapping module names to devices, applied recursively to submodules
      3. "auto": let Accelerate compute an optimized map (examples below)
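
      Illustrative calls (the checkpoint and the module names in the dict are hypothetical and model-dependent):

      ```python
      from transformers import AutoModelForCausalLM

      name = "gpt2"  # example checkpoint
      m1 = AutoModelForCausalLM.from_pretrained(name, device_map="auto")  # let Accelerate decide
      m2 = AutoModelForCausalLM.from_pretrained(name, device_map=0)       # whole model on GPU 0
      m3 = AutoModelForCausalLM.from_pretrained(name, device_map={"transformer": 0, "lm_head": "cpu"})
      ```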
    1. LOCAL_RANK - The rank of the worker within a local worker group.

      [!NOTE] In PyTorch distributed training, how do you get the rank of the current process within its local worker group?

      flashcard

      The LOCAL_RANK environment variable (see the snippet below).
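
      Reading it in a training script launched with torchrun (a minimal sketch):

      ```python
      import os

      local_rank = int(os.environ.get("LOCAL_RANK", 0))
      print(f"local rank: {local_rank}")
      ```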

    1. Make sure Python is installed, put jemdoc in your path somewhere, type in your file, and run jemdoc index.jemdoc This will use a default configuration for the html elements, and create an index.html. Even simpler, you can omit the extension, and jemdoc will still process the index.jemdoc file, as in jemdoc index

      [!NOTE] What is the most basic input argument of jemdoc?

      flashcard

      A *.jemdoc markup document (the extension can even be omitted, e.g. jemdoc index).

    1. Note that ZeRO3 is not currently supported with QLoRA but ZeRO3 does support LoRA, which has a reference configuraiton under playground/deepspeed_config_s3.json

      [!NOTE] How well does ZeRO-3 support (Q)LoRA, and how is it configured?

      flashcard

      • As of 2023-09-04, QLoRA is not supported yet
      • LoRA is supported; a reference configuration is FastChat/playground/deepspeed_config_s3.json
    1. # restore the model file "model.h5" from a specific run by user "lavanyashukla"# in project "save_and_restore" from run "10pr4joa"best_model = wandb.restore('model.h5', run_path="lavanyashukla/save_and_restore/10pr4joa")

      [!NOTE] What are the input and output of wandb.restore()?

      flashcard

      • Input: the file to restore + a run identifier (run_path)
      • Output: the restored file object

      See the quoted snippet above for an example.

    1. The result is that the flattened subarrays are sorted in lexicographic order starting with the first element.

      [!NOTE] In what order does np.unique() return its values?

      flashcard

      In lexicographic order (for subarrays, sorted starting from the first element).

    1. Returns: fprndarray of shape (>2,)Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i]. tprndarray of shape (>2,)Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i]. thresholdsndarray of shape (n_thresholds,)Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to np.inf.

      [!NOTE] What does sklearn.metrics.roc_curve return?

      flashcard

      fpr, tpr, thresholds: the first two increase monotonically, the thresholds decrease monotonically (usage sketch below).
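
      A tiny usage sketch with toy data:

      ```python
      from sklearn.metrics import roc_curve

      y_true = [0, 0, 1, 1]
      y_score = [0.1, 0.4, 0.35, 0.8]
      fpr, tpr, thresholds = roc_curve(y_true, y_score)
      # fpr and tpr increase as the threshold decreases; thresholds[0] is set to np.inf
      ```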

    2. fprndarray of shape (>2,)Increasing false positive rates such that element i is the false positive rate of predictions with score >= thresholds[i]. tprndarray of shape (>2,)Increasing true positive rates such that element i is the true positive rate of predictions with score >= thresholds[i].

      [!NOTE] In a ROC curve, how do the x and y values behave as the threshold decreases?

      flashcard

      Both increase monotonically: the lower the threshold, the more predictions are classified as positive.

    3. drop_intermediatebool, default=TrueWhether to drop some suboptimal thresholds which would not appear on a plotted ROC curve. This is useful in order to create lighter ROC curves. New in version 0.17

      [!NOTE] In sklearn.metrics.roc_curve, how do you drop some of the less important (suboptimal) thresholds?

      flashcard

      The drop_intermediate parameter.

    4. y_scorearray-like of shape (n_samples,)Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions (as returned by “decision_function” on some classifiers).

      [!NOTE] In ROC computation, what is the basic requirement on y_score?

      flashcard

      It only has to be a number that can be compared against a threshold to decide positive vs. negative; it does not need to be a probability.

    5. y_truearray-like of shape (n_samples,)True binary labels. If labels are not either {-1, 1} or {0, 1}, then pos_label should be explicitly given.

      [!NOTE] In sklearn's ROC functions, what is required of the labels y_true?

      flashcard

      If they are not {-1, 1} or {0, 1}, pos_label must be given explicitly.

    1. To group the indices by element, rather than dimension, use argwhere, which returns a row for each non-zero element.

      [!NOTE] In NumPy, how do you get the multi-dimensional index of every non-zero element?

      flashcard

      np.argwhere() (example below).
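
      A small example contrasting it with np.nonzero():

      ```python
      import numpy as np

      x = np.array([[0, 1],
                    [2, 0]])
      print(np.argwhere(x))  # [[0 1] [1 0]]  -- one row of indices per non-zero element
      print(np.nonzero(x))   # (array([0, 1]), array([1, 0]))  -- grouped by dimension instead
      ```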

    1. Create a Precision-Recall curve in one line:wandb.log({"pr": wandb.plot.pr_curve(ground_truth, predictions)})You can log this whenever your code has access to:a model's predicted scores (predictions) on a set of examplesthe corresponding ground truth labels (ground_truth) for those examples(optionally) a list of the labels/class names (labels=["cat", "dog", "bird"...] if label index 0 means cat, 1 = dog, 2 = bird, etc.)(optionally) a subset (still in list format) of the labels to visualize in the plot

      [!NOTE] In WandB, how do you log an entire curve?

      flashcard

      For example, for PR/ROC curves: wandb.log({"pr": wandb.plot.pr_curve(ground_truth, predictions)})

    1. Note that resuming a run which was executed as part of a Sweep is not supported.

      [!NOTE] What restriction does WandB's resume feature have on which runs can be resumed?

      flashcard

      • Resuming a run that was executed as part of a Sweep is not supported
    2. If you set WANDB_RESUME equal to "allow", you can always set WANDB_RUN_ID to a unique string and restarts of the process will be handled automatically. If you set WANDB_RESUME equal to "must", W&B will throw an error if the run to be resumed does not exist yet instead of auto-creating a new run.

      [!NOTE] In WandB, what values can the resume policy take, and what do they do?

      flashcard

      • allow: restarts are handled automatically (a new run may be created if none exists)
      • must: the run must be resumed; an error is thrown if it does not exist
    3. wandb.restoreThis will allow you to log new historical values for your metrics to a run starting from where you left off but does not take care of re-establishing the state of your code, you will need to make sure you have written checkpoints that you can load!

      [!NOTE] In WandB, what do you use to restore only the logged history, without restoring the program state?

      flashcard

      wandb.restore

    4. We provide a utility to generate run_id: wandb.util.generate_id()

      [!NOTE] In WandB, how do you generate a run_id from Python?

      flashcard

      wandb.util.generate_id()

    5. The other form of resume requires you to provide the actual run id: wandb.init(id=run_id)

      [!NOTE] In WandB, how do you resume a run with a specific ID?

      flashcard

      wandb.init(id=run_id)

    6. runs can be resumed by passingresume=True to wandb.init(). This can be thought of as auto-resuming, where we “automatically” pick up from where an aborted run left off.

      [!NOTE] In WandB, how do you auto-resume the most recent aborted run?

      flashcard

      wandb.init(resume=True)

    7. if you want to be sure that it is resuming, you do wandb.init(id=run_id, resume="must"

      [!NOTE] In WandB, how do you make sure that resuming actually happens (and fails otherwise)?

      flashcard

      wandb.init(id=run_id, resume="must") (sketch below)
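
      A minimal sketch (project name and run id are illustrative):

      ```python
      import wandb

      wandb.init(project="my-project", id="10pr4joa", resume="must")
      ```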

    8. Note: This only works if you are running your script in the same directory as the one that failed as the file is stored at: wandb/wandb-resume.json.

      [!NOTE] Where does WandB record the last run that did not exit successfully and will be resumed by default?

      flashcard

      wandb/wandb-resume.json

    1. quick way to generate a new id: python -c "import wandb; print(wandb.util.generate_id())"

      [!NOTE] How do you conveniently run a few lines of Python directly from the command line?

      flashcard

      python -c "import wandb; print(wandb.util.generate_id())"

    2. a helper method wandb.util.generate_id() which can be used to generate a random set of characters to append to your experiment name.

      [!NOTE] In WandB, what does wandb.util.generate_id() do?

      flashcard

      It generates a random run ID (a random string of characters, e.g. to append to an experiment name).

    3. unfortunately today you can not re-use deleted run ids in a project.

      [!NOTE] In WandB, can a deleted run ID be reused within a project?

      flashcard

      • As of 2023-09-03, wandb sync --project <project_name> <run_path> appears to work
      • As of 2022-01-20 it was not possible
    4. @zaccharieramzi , could you also add the flag --include-offline when syncing all offline runs? The cmd should be: wandb sync --include-synced --include-offline --sync-all. The reason we need to include this flag is that in offline mode all your runs have a prefix offline (you can check these out in the wandb dir or the zip file you shared above). Therefore, the code at this line suggests that if this flag is provided, all runs in offline-run-* will be included when syncing to W&B.

      [!NOTE] How does WandB store (offline) runs locally, and how does syncing recognize them?

      flashcard

      • Storage: run directories are prefixed, giving offline-run-*
      • Upload: pass the --include-offline flag when syncing
    1. For performance, we recommend setting bias to None first, and then lora_only, before trying all.

      [!NOTE] In a LoRA config, how should bias be set?

      flashcard

      For performance, try None first, then lora_only, before all.

    2. The weight matrix is scaled by lora_alpha/r, and a higher lora_alpha value assigns more weight to the LoRA activations.

      [!NOTE] In LoRA, what is the role of $\alpha$?

      flashcard

      The LoRA weight matrix is scaled by $\alpha / r$; a higher $\alpha$ gives the LoRA activations more weight (see the formula below).
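
      As a formula sketch of this scaling (with $W_0$ the frozen base weight, $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$):

      $$h = W_0 x + \frac{\alpha}{r} \, B A \, x$$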

    1. The steps are very similar to the ones shown in this quickstart; prepare a PeftConfig for a 🤗 PEFT method, and use the get_peft_model to create a PeftModel from the configuration and base model. Then you can train it however you like!

      [!NOTE] What are the basic steps to use 🤗 PEFT?

      flashcard

      1. Create a PeftConfig object peft_config
      2. Create the base model object model
      3. Wrap it: model = get_peft_model(model, peft_config) (sketch below)
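
      A minimal sketch of the three steps (task type, hyperparameters, and checkpoint name are illustrative):

      ```python
      from peft import LoraConfig, TaskType, get_peft_model
      from transformers import AutoModelForSeq2SeqLM

      peft_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32, lora_dropout=0.1)
      model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-large")
      model = get_peft_model(model, peft_config)
      model.print_trainable_parameters()
      ```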
    2. + config = PeftConfig.from_pretrained(peft_model_id) model = AutoModelForSeq2SeqLM.from_pretrained(config.base_model_name_or_path) + model = PeftModel.from_pretrained(model, peft_model_id)

      [!NOTE] In 🤗 PEFT, how do you load a pretrained PEFT model?

      flashcard

      PeftModel.from_pretrained(model, peft_model_id)

    3. To get a sense of the number of trainable parameters in your model, use the print_trainable_parameters method.

      [!NOTE] In 🤗 PEFT, what do you use to see how many trainable parameters a model has?

      flashcard

      model.print_trainable_parameters()

    4. inference_mode, whether you’re using the model for inference or not

      [!QUESTION] What does PeftConfig.inference_mode do?

      flashcard

    1. Run paths include entity, project, and run ID, in the format entity/project/run_id.

      [!NOTE] In WandB, what are the components of a run path?

      flashcard

      entity/project/run_id, e.g. kidrain61/step-reward/8r60oi08

    1. when you send your model to prepare, it is wrapped in the DistributedDataParallel class which will reserve memory for the gradients it will need to reduce-average at the end of the backward pass.

      [!NOTE] What does accelerator.prepare(model) do to the model?

      flashcard

      • Wraps model in the DistributedDataParallel class
      • Reserves memory for the gradients that will be reduce-averaged at the end of the backward pass
    1. Saving the entire 16bit model weights to directly load later on using model.load_state_dict(torch.load(pytorch_model.bin)). For this, either set zero_optimization.stage3_gather_16bit_weights_on_model_save to True in DeepSpeed Config file or set zero3_save_16bit_model to True in DeepSpeed Plugin.

      [!NOTE] In 🤗 Accelerate, what is the prerequisite for saving 16-bit model weights with unwrapped_model.save_pretrained() (so they can later be loaded with model.load_state_dict(torch.load("pytorch_model.bin")))?

      flashcard

      • Either set zero_optimization.stage3_gather_16bit_weights_on_model_save to True in the DeepSpeed config file
      • or set zero3_save_16bit_model to True in the DeepSpeed Plugin
    2. all these functions require ~2x memory (general RAM) of the size of the final checkpoint.

      [!NOTE] With 🤗 Accelerate + DeepSpeed, roughly how much RAM does loading the 32-bit state require?

      flashcard

      About 2x the size of the final checkpoint (in general/CPU RAM).

    3. To get 32bit weights, first save the model using model.save_checkpoint(). Below is the snippet from examples/by_feature/deepspeed_with_config_support.py showing this:

      [!NOTE] With 🤗 Accelerate + DeepSpeed, how do you obtain 32-bit weights?

      flashcard

      1. success = model.save_checkpoint(PATH, ckpt_id, checkpoint_state_dict)
      2. ./zero_to_fp32.py . pytorch_model.bin
    1. Should only be used when wanting to save a checkpoint during training and restoring the state in the same environment.

      [!NOTE] What is the intended use case of accelerator.save_state()?

      flashcard

      Saving a checkpoint during training so that the exact state can be restored later in the same environment.

    2. If a ProjectConfiguration was passed to the Accelerator object with automatic_checkpoint_naming enabled then checkpoints will be saved to self.project_dir/checkpoints. If the number of current saves is greater than total_limit then the oldest save is deleted. Each checkpoint is saved in seperate folders named checkpoint_<iteration>.

      [!NOTE] In 🤗 Accelerate, what does automatic_checkpoint_naming do?

      flashcard

      • Saves checkpoints to self.project_dir/checkpoints
      • Enforces total_limit (the oldest save is deleted when the limit is exceeded)
      • Each checkpoint gets its own folder named checkpoint_<iteration>
    3. the underlying load function, such as optional arguments for DeepSpeed’s load_checkpoint function or a map_location to load the model and optimizer on.

      [!NOTE] What happens under the hood of accelerator.load_state()?

      flashcard

      It may call an underlying load function such as DeepSpeed's load_checkpoint, possibly with a map_location for the model and optimizer.

    1. device_map="auto" will be good enough as 🤗 Accelerate will attempt to fill all the space in your GPU(s), then loading them to the CPU, and finally if there is not enough RAM it will be loaded to the disk (the absolute slowest option).

      [!NOTE] In 🤗 Accelerate, what does device_map="auto" do?

      flashcard

      It fills the GPU(s) first, then spills over to CPU RAM, and finally offloads to disk (the slowest option).

    2. The first step is to init an empty skeleton of the model which won’t take up any RAM using the init_empty_weights() context manager: Copied from accelerate import init_empty_weights with init_empty_weights(): my_model = ModelClass(...) With this my_model currently is “parameterless”, hence leaving the smaller footprint than what one would normally get loading this onto the CPU directly.

      [!NOTE] In 🤗 Accelerate, how do you initialize an empty "skeleton" of a model that takes up no RAM?

      flashcard

      with init_empty_weights():

    1. Additional key word arguments passed along to the wandb.init method.

      [!NOTE] In 🤗 Accelerate, how do you pass extra arguments to tracker initialization?

      flashcard

      1. Pass init_kwargs={"wandb": {...}} to accelerator.init_trackers (which forwards them to step 2)
      2. They become keyword arguments of the tracker's __init__ / wandb.init (sketch below)
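
      A sketch of forwarding wandb.init arguments through Accelerate (project and run names are illustrative):

      ```python
      from accelerate import Accelerator

      accelerator = Accelerator(log_with="wandb")
      accelerator.init_trackers(
          "my_project",
          config={"lr": 1e-4},
          init_kwargs={"wandb": {"name": "run-42", "entity": "my-team"}},
      )
      ```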
    1. Following the recommended settings for Python’s configparser, Flake8 does not support inline comments for any of the keys. So while this is fine:

      [!NOTE] What does flake8's config file require regarding comments?

      flashcard

      Whole-line comments are fine; inline comments on a key are not supported.

    2. Not every Flake8 command-line option can be specified in the configuration file. See our list of options to determine which options will be parsed from the configuration files.

      [!NOTE] Which flake8 command-line options can be specified in the configuration file?

      flashcard

      Only some of them; see the options list in the documentation.

    3. we expect you to use INI to configure Flake8 (since each of these files already uses INI as a format). This means that any Flake8 configuration you wish to set needs to be in the flake8 section, which means it needs to start like so: [flake8] Each command-line option that you want to specify in your config file can be named in either of two ways: Using underscores (_) instead of hyphens (-) Simply using hyphens (without the leading hyphens)

      [!NOTE] What is the format of a flake8 configuration file?

      flashcard

      INI format starting with a [flake8] section; each command-line option you want to set goes on its own line, with hyphens optionally replaced by underscores (example below).
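
      A small illustrative config (the specific options are just examples):

      ```ini
      [flake8]
      max-line-length = 100
      extend-ignore = E203
      ```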

    4. Regardless of whether you keep your config in .flake8, setup.cfg, or tox.ini we expect you to use INI to configure Flake8 (since each of these files already uses INI as a format).

      [!NOTE] What can a flake8 configuration file be named?

      flashcard

      .flake8, setup.cfg, or tox.ini

    1. New in 3.0.0 The user can specify --append-config <path-to-file> repeatedly to include extra configuration files that should be read and take precedence over user and project files.

      [!NOTE] How do you temporarily include extra flake8 configuration files (on top of the user and project files)?

      flashcard

      The --append-config option.

    2. New in 3.0.0 The user can specify --config <path-to-file> to so this file is the only configuration file used. This is a change from Flake8 2 where pep8 would simply merge this configuration file into the configuration generated by user and project files (where this takes precedence).

      [!NOTE] What does flake8 --config ... do?

      flashcard

      The given file becomes the only configuration file used (pep8 used to merge it with the user/project config instead).

    3. New in 3.0.0 The user can specify --isolated to disable configuration via discovered configuration files.

      [!NOTE] How do you temporarily disable flake8's discovered configuration files?

      flashcard

      The --isolated option (>= 3.0.0).

    1. By default Black looks for pyproject.toml starting from the common base directory of all files and directories passed on the command line. If it’s not there, it looks in parent directories. It stops looking when it finds the file, or a .git directory, or a .hg directory, or the root of the file system, whichever comes first. If you’re formatting standard input, Black will look for configuration starting from the current working directory. You can use a “global” configuration, stored in a specific location in your home directory. This will be used as a fallback configuration, that is, it will be used if and only if Black doesn’t find any configuration as mentioned above. Depending on your operating system, this configuration file should be stored as: Windows: ~\.black Unix-like (Linux, MacOS, etc.): $XDG_CONFIG_HOME/black (~/.config/black if the XDG_CONFIG_HOME environment variable is not set) Note that these are paths to the TOML file itself (meaning that they shouldn’t be named as pyproject.toml), not directories where you store the configuration. Here, ~ refers to the path to your home directory. On Windows, this will be something like C:\\Users\UserName.

      [!NOTE] How does black look for its configuration file?

      flashcard

      1. It starts from the common base directory of the files/directories passed on the command line (or the current working directory for stdin)
      2. It walks up parent directories until it finds pyproject.toml, a .git or .hg directory, or the filesystem root
      3. If nothing is found, it falls back to a "global" config: $XDG_CONFIG_HOME/black or ~/.config/black on Unix-like systems (~\.black on Windows)
    1. run the Output: Focus on Output command in the Command Palette and then select the formatter extension channel

      [!NOTE] In VSCode, how do you view an extension's output?

      flashcard

      1. Run the Output: Focus on Output command
      2. Select the extension's channel in the top-right corner
    2. black does not support formatting sections of code.

      [!NOTE] Does black support formatting only a section of code?

      flashcard

      No.

    1. There are two ways to save a file to associate with a run.Use wandb.save(filename).Put a file in the wandb run directory, and it will get uploaded at the end of the run.

      [!NOTE] In WandB, how can you save (and upload) a file associated with a run?

      flashcard

      • wandb.save(filename)
      • Put the file in the run directory; it gets uploaded at the end of the run
    1. The time it takes to forward an LLM on a single input token is the same as forwarding it on a batch of K input tokens (and K is larger than you might think). This counter-intuitive fact is because sampling is heavily memory-bound: most of the "work" is not compute but reading the Transformer's weights from VRAM into the on-chip cache for processing.

      [!NOTE] Where does most of the time go during LLM inference?

      flashcard

      Reading the weights from VRAM into the on-chip cache.

    1. The LoRA layer for embeddings might not work as well on the output projection layer

      [!QUESTION] How does applying LoRA to the input embedding layer differ in effect from applying it to the output projection layer?

      flashcard

      The embedding-side LoRA may not work as well on the output projection layer? Why?

    2. A temporary fix involves the same solution of freezing the embeddings. This fix is not satisfactory for use cases where new tokens need to be added and corresponding representations tuned. A more general fix would be adding LoRA layers to the embeddings or allowing only the new embeddings to be trained.

      [!NOTE] In the official QLoRA implementation, what is the issue with newly added token(s)?

      flashcard

      Tuning the corresponding embeddings is not supported yet (the temporary fix is to freeze the embeddings).

    1. If you get an this issue ("illegal memory access") then you should use a newer HF LLaMA conversion or downgrade your PyTorch version.

      [!NOTE] (When using QLoRA) how do you fix an "illegal memory access" error?

      flashcard

      • use a newer HF LLaMA conversion
      • or downgrade your PyTorch version.
    2. Currently, using bnb_4bit_compute_type='fp16' can lead to instabilities. For 7B LLaMA, only 80% of finetuning runs complete without error. We have solutions, but they are not integrated yet into bitsandbytes.

      [!NOTE] What problem does bnb_4bit_compute_type='fp16' have, and how is it addressed?

      flashcard

      Instability (about 20% of 7B LLaMA fine-tuning runs fail); the QLoRA team has fixes that are not yet integrated into bitsandbytes.

    3. Resuming a LoRA training run with the Trainer currently not supported by HF.

      [!NOTE] Can a LoRA training run be resumed with the HF Trainer?

      flashcard

      As of 2023-09-02, apparently not yet.

    4. 4-bit inference is slow. Currently, our 4-bit inference implementation is not yet integrated with the 4-bit matrix multiplication

      [!NOTE] In the official QLoRA implementation, what is the issue with 4-bit inference?

      flashcard

      It is not yet integrated with 4-bit matrix multiplication, so it is slow.

    1. How do you download only specific content of a repo? The snapshot_download method provides two parameters, allow_regex and ignore_regex: the former downloads only the matching items, the latter skips the matching items and downloads everything else.

      [!NOTE] On the 🤗 Hub, how do you download only part of a repository?

      flashcard

      snapshot_download() with the allow_regex / ignore_regex parameters (sketch below).
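
      A sketch following the parameter names in the quoted article; pattern style and parameter names depend on the huggingface_hub version (newer releases use allow_patterns/ignore_patterns instead):

      ```python
      from huggingface_hub import snapshot_download

      # download only JSON and text files from an example repo
      snapshot_download("bert-base-uncased", allow_regex=["*.json", "*.txt"])
      ```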

    1. If you don't want to preface the docker command with sudo, create a Unix group called docker and add users to it. When the Docker daemon starts, it creates a Unix socket accessible by members of the docker group.

      [!NOTE] When using docker, how do you avoid having to prefix every command with sudo?

      flashcard

      1. Create a Unix group called docker (some distributions create it automatically) and add users to it
      2. When the Docker daemon starts, it creates a Unix socket accessible by members of that group
  3. Aug 2023
    1. The docker group grants root-level privileges to the user. For details on how this impacts security in your system, see Docker Daemon Attack Surface.

      [!QUESTION] How does the docker group take effect?

      flashcard

      It grants the user root-level privileges, which can be a security concern (see "Docker Daemon Attack Surface").

    2. By default it's the root user that owns the Unix socket, and other users can only access it using sudo.

      [!NOTE] What are the usual permissions on the Docker Unix socket?

      flashcard

      By default it is owned by root, so other users can only access it via sudo.

    1. .to(accelerator.device)

      [!NOTE] In 🤗 Accelerate, how do you get the device the current process runs on?

      flashcard

      accelerator.device

    2. accelerator.process_index

      [!NOTE] In 🤗 Accelerate, how do you get the index of the current process?

      flashcard

      accelerator.process_index

    1. It is if you want to use more than one GPU, using python script.py will only launch one process, you have to use accelerate launch or python -m torch.distributed.launch to use all your GPUs.

      [!NOTE] When using 🤗 Accelerate, what should you pay attention to on the command line?

      flashcard

      Launch with accelerate launch <script> (or python -m torch.distributed.launch) to start multiple processes; plain python script.py launches only one.

    1. Currently the default verbosity of the library is set to WARNING. To change the level of verbosity, use one of the direct setters.

      [!NOTE] In 🤗 Evaluate, what is the default verbosity level?

      flashcard

      WARNING

    1. Dictionary with split names as keys (‘train’, ‘test’ for example), and Dataset objects as values. It also has dataset transform methods like map or filter, to process all the splits at once.

      [!NOTE] In 🤗 Datasets, can a DatasetDict call map and similar methods directly?

      flashcard

      Yes; it is equivalent to calling them on each split separately.

    2. If batched is True and batch_size is n > 1, then the function takes a batch of n examples as input and can return a batch with n examples, or with an arbitrary number of examples. Note that the last batch may have less than n examples. A batch is a dictionary, e.g. a batch of n examples is {"text": ["Hello there !"] * n}.

      [!NOTE] In 🤗 Datasets map(), what does the batched parameter mean?

      flashcard

      Whether the mapping function receives a batch (a dict of lists) instead of a single example (sketch below).
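
      A sketch of a batched map function (the tokenizer and the "text" column are assumed to exist):

      ```python
      def encode_fn(batch):
          # batch is a dict of lists, e.g. {"text": ["...", "..."]}
          return tokenizer(batch["text"], truncation=True)

      dataset = dataset.map(encode_fn, batched=True, batch_size=1000)
      ```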

    1. When downloading a dataset, you should download it first on the main process and then load the cached dataset afterward load_dataset will perform a lock under the hood to stop multiple downloads from happening at once, but if you are downloading something not using this library you should use this method. Copied with accelerator.main_process_first(): datasets = load_dataset("glue", "mrpc") Under the hood this is the same as calling:

      [!NOTE] When downloading or map-processing a dataset under 🤗 Accelerate, what should you watch out for?

      flashcard

      Use with accelerator.main_process_first(): so that one (the main) process downloads/processes first and the other processes then load the cached result.

    1. Local components are optimized based on an overall feedback signal:SGD optimizes weights in a neural net to reduce its training lossNeural architecture search optimizes architectures and hyperparameters to have low validation lossPolicy gradient optimizes policy neural nets to choose actions that lead to high expected rewards

      [!NOTE] From the viewpoint of optimization and the kind of loss/reward used, how can neural-network training be categorized?

      flashcard

      • SGD: training loss
      • NAS: validation loss
      • RL (policy gradient): expected reward
    2. process-based ML systems have better differential capabilities: They help us apply ML to tasks where we don’t have access to outcomes. These tasks include long-range forecasting, policy decisions, and theoretical research.

      [!NOTE] What advantage does process supervision have regarding how hard the supervision signal is to obtain?

      flashcard

      It can supervise tasks where final outcomes are not accessible, such as long-range forecasting, policy decisions, and theoretical research.

    1. Outer alignment asks the question - "What should we aim our model at?" In other words, is the model optimizing for the correct reward such that there are no exploitable loopholes? It is also known as the reward misspecification problem.

      [!NOTE] What do Outer Alignment / the reward misspecification problem refer to?

      flashcard

      Whether the model is being optimized toward what humans actually want, i.e. a reward with no exploitable loopholes.

    1. Inner alignment asks the question - “Is the model trying to do what humans want it to do?”, or in other words can we robustly aim our AI optimizers at any objective function at all?

      [!NOTE] What is the basic idea of Inner Alignment?

      flashcard

      Making sure the model is actually (robustly) optimizing for the objective we aim it at.

    1. AI Boxing is attempts, experiments, or proposals to isolate ("box") a powerful AI (~AGI) where it can't interact with the world at large, save for limited communication with its human liaison. It is often proposed that so long as the AI is physically isolated and restricted, or "boxed", it will be harmless even if it is an unfriendly artificial intelligence (UAI).

      [!NOTE] What is the English term for isolating an AI inside a "container"?

      flashcard

      AI Boxing / Containment

    1. Codex can create and understand code, so it can be used for tasks such as explaining what the code in a file does. One way is to add, after a function, a comment starting with "This function" or "This application is"; Codex usually treats this as the start of an explanation and completes the rest of the text.

      [!NOTE] With a code copilot, how do you conveniently get an explanation of a piece of code?

      flashcard

      Write a comment starting with "This function" or "This application is".

    2. Testing an application usually requires sample data. Since Codex is a language model that understands how to write and read natural language, you can instruct it to create data such as arrays of fictitious names, products, and other variables. For example, here we ask Codex to create an array of weather temperatures.

      [!NOTE] With a code copilot, how can you conveniently obtain sample data?

      flashcard

      Ask the model to generate it, giving an instruction and an example.

    3. In most cases, setting the API temperature to 0 or close to 0 (such as 0.1 or 0.2) tends to give better results. With GPT-3 models, a higher temperature can yield useful creative and random results, but Codex models are different: a higher temperature may produce very random or unpredictable responses. If you need Codex to propose different potential results, start at 0 and increase by 0.1 until you find a suitable variation.

      [!NOTE] For code-generation tasks, how is the temperature usually set?

      flashcard

      Generally to a low value (0 or close to 0).

    4. For some languages, the comment style can improve output quality. For example, with Python, in some cases using docstrings (comments wrapped in triple quotes) gives higher-quality results than using the hash sign (#).

      [!NOTE] In Codex, how does the comment style affect quality?

      flashcard

      • Python: docstrings sometimes give better results than # comments
    1. tiktoken is a fast BPE tokeniser for use with OpenAI's models.

      [!NOTE] Which tokenizer do OpenAI's models use?

      flashcard

      tiktoken

    1. The closer a citation is to the text it supports, the shorter the distance the model has to look ahead to predict it, which suggests that inline citations mitigate fabricated content better than citations placed at the end.

      [!NOTE] When a model provides citations, does their position matter?

      flashcard

      Usually the closer the better: inline citations work better than end-of-text citations (footnotes).

    2. Sometimes a system message such as "only write true facts" or "do not fabricate information" is not enough to mitigate the problem. Instead, asking the model to include citations in its response helps reduce the probability of incorrect answers.

      [!NOTE] In a prompt, how can you reduce the model's hallucination?

      flashcard

      Ask it to include citations.

    3. Specifying the output structure in the prompt can have a major impact on the nature and quality of the results.

      [!NOTE] How much does specifying the LM's output structure affect quality?

      flashcard

      The impact can be large.

    4. If you are unsure which syntax to use, consider Markdown or XML. These models were trained on a large amount of web content in XML and Markdown, which may give better results.

      [!NOTE] Which syntaxes may improve results when used in a prompt?

      flashcard

      • Markdown / XML
      • Capitalization: for headings / special variables
    5. 在此简单示例中,将任务从一个步骤分解为两个步骤的效果并不是非常明显,但当尝试将其应用于包含许多事实性陈述的大段文本时,将任务分解就会产生很大的不同。

      [!NOTE] Prompt 任务分解在什么场景下很有必要?

      flashcard

      • 包含许多事实性陈述的大段文本
    6. This means including a few words or a short phrase at the end of the prompt so that the model's response follows the desired form. For example, ending the prompt with "Here's a bulleted list of key points:\n- " helps ensure the output is formatted as a bulleted list.

      [!NOTE] How do you steer the model's generation with a single cue?

      flashcard

      Provide a prefix of the desired output at the end of the prompt.

    7. Although this approach is still generally recommended, our tests show that, unlike earlier model versions (GPT-3 and before), the responses of ChatGPT and GPT-4 are the same with or without the technique. In the example below, adding the statement "several sources ... outbreak" at the beginning or at the end of the prompt does not change the final response.

      [!NOTE] Is it necessary to place the task description at the beginning or the end of the prompt?

      flashcard

      • Models before thorough alignment (GPT-3 and earlier) may need it
      • After thorough alignment (ChatGPT and GPT-4) it may no longer matter
    8. Models can be subject to recency bias: in this context, information at the end of the prompt may have a larger influence on the output than information at the beginning. It is therefore worth repeating the instructions at the end of the prompt and evaluating the effect on the generated response.

      [!NOTE] How does the position of an instruction within the prompt affect its influence?

      flashcard

      The end of the prompt may carry more weight than the beginning.

    1. Recently, there have been clear changes in the open-source policy and regulations of our overall organization's code, data, and models. Despite this, we have still worked hard to obtain opening the weights of the model first, but the data involves stricter auditing and is in review with our legal team . Our researchers have no authority to publicly release them without authorization. Thank you for your understanding.

      [!NOTE] How do you obtain the training data of the Wizard model series?

      flashcard

      As of 2023-08-26, it has not been released (the weights are open, but the data is still under legal review).

    1. The following repositories are used in xFormers, either in close to original form or as an inspiration: Sputnik GE-SpMM Triton LucidRain Reformer RevTorch Nystromformer FairScale Pytorch Image Models CUTLASS Flash-Attention

      [!NOTE] Which algorithms/libraries does xFormers build on?

      flashcard

      As of 2023-08-25, ten of them (listed in the quote above).

    1. This does not guarantee perfect memory utilization (waste is now limited to under 4%, and only in the last block), but it clearly improves on the ahead-of-time allocation schemes widely used in industry today.

      [!NOTE] To what level can vLLM & PagedAttention limit KV-cache memory waste?

      flashcard

      Under 4%, and only in the last block.

    2. FLAT-Attention and FlashAttention take different routes to solve the same problem. The proposed solutions differ, but the key ideas are the same (tiling and scheduling). The main differences are discussed below.

      [!NOTE] How do FLAT-Attention and FlashAttention differ in their implementation approach?

      flashcard

      See the discussion that follows the quote; the key ideas (tiling and scheduling) are shared.

    1. torch.cuda.empty_cache()[source] Releases all unoccupied cached memory currently held by the caching allocator so that those can be used in other GPU application and visible in nvidia-smi. Note empty_cache() doesn’t increase the amount of GPU memory available for PyTorch. However, it may help reduce fragmentation of GPU memory in certain cases.

      [!NOTE] What does torch.cuda.empty_cache() do?

      flashcard

      It releases all memory held by the caching allocator but not currently occupied. PyTorch would have reused that memory anyway, but releasing it makes it visible to other GPU applications and may help reduce fragmentation.

    1. PagedAttention, an attention algorithm inspired by the classic idea of virtual memory and paging in operating systems. Unlike the traditional attention algorithms, PagedAttention allows storing continuous keys and values in non-contiguous memory space. Specifically, PagedAttention partitions the KV cache of each sequence into blocks, each block containing the keys and values for a fixed number of tokens. During the attention computation, the PagedAttention kernel identifies and fetches these blocks efficiently.

      [!NOTE] What is the basic idea of PagedAttention?

      flashcard

      Analogous to virtual-memory paging: each sequence's KV cache is partitioned into fixed-size blocks that can be stored in non-contiguous memory; during the attention computation the kernel looks up and fetches the blocks it needs.

    2. Dynamic: Its size depends on the sequence length, which is highly variable and unpredictable. As a result, efficiently managing the KV cache presents a significant challenge. We find that existing systems waste 60% – 80% of memory due to fragmentation and over-reservation.

      [!NOTE] Why is KV-cache memory management hard, and what problems does that cause?

      flashcard

      Its size depends on the sequence length, which is highly variable and unpredictable. This easily leads to fragmentation and over-reservation; naive implementations can waste 60-80% of the memory.

    3. The KV cache is Large: Takes up to 1.7GB for a single sequence in LLaMA-13B.

      [!NOTE] How much memory does the KV cache take? Give a concrete example.

      flashcard

      LLaMA-13B: up to 1.7 GB per sequence (at <= 1024 tokens?).

  4. huggingface.co
    1. past_key_values (tuple(tuple(torch.FloatTensor)), optional, returned when use_cache=True is passed or when config.use_cache=True) — Tuple of tuple(torch.FloatTensor) of length config.n_layers, with each tuple having 2 tensors of shape (batch_size, num_heads, sequence_length, embed_size_per_head)) and 2 additional tensors of shape (batch_size, num_heads, encoder_sequence_length, embed_size_per_head)

      [!NOTE] In 🤗 Transformers, what is the shape of past_key_values?

      flashcard

      A tuple of tuples of tensors:
      1. the outer tuple has n_layers elements;
      2. each inner tuple holds up to 4 tensors (decoder key/value, plus encoder key/value for encoder-decoder models);
      3. tensor shapes: decoder (batch_size, num_heads, sequence_length, embed_size_per_head), encoder (batch_size, num_heads, encoder_sequence_length, embed_size_per_head).

    1. TruthfulQA: tests a model's tendency to reproduce falsehoods commonly found online.

      [!NOTE] What property of LLMs does TruthfulQA mainly test?

      flashcard

      The tendency to reproduce falsehoods commonly found online.

    2. HellaSwag: a commonsense-reasoning test that is nevertheless quite challenging for large language models.

      [!NOTE] In LLM evaluation, what is HellaSwag?

      flashcard

      A commonsense-reasoning benchmark (still challenging for LLMs).

    3. AI2: a reasoning test on science questions.

      [!NOTE] In LLM evaluation, what is AI2?

      flashcard

      A reasoning benchmark built from science questions.

    4. Stanford's full fine-tune of Alpaca-7B was done on 8 A100 80GB GPUs and took 3 hours.

      [!NOTE]- How much memory and time does a full fine-tune of (LLaMA(2)-)7B on the Alpaca dataset take?

      flashcard

      • Memory: 8x A100 (80 GB)
      • Time: 3 hrs
    5. The team initially fine-tuned mainly the attention modules (v_proj, q_proj, k_proj and o_proj), then switched to the gate_proj, down_proj and up_proj modules; except when the trainable parameters were below 0.1% of the total, tuning these modules performed better than tuning the attention modules. For consistency the same approach was applied to the 13B and 70B models, with 0.27% and 0.2% trainable parameters respectively.

      [!NOTE] Does LoRA work better on attention modules or on non-attention modules?

      flashcard

      Most of the time (except when the number of trainable parameters is very small), the non-attention (MLP) modules work better.

    6. A 13B Platypus model can be trained on a single A100 GPU with 25k questions in 5 hours.

      [!NOTE] How much memory and time does LoRA fine-tuning of LLaMA2-13B on 25k samples take?

      flashcard

      • Memory: 1x A100 (80 GB)
      • Time: 5 hrs
    7. The 70B model was fine-tuned for 22 hours on 4 A100 80GB GPUs.

      [!NOTE] How much memory and time does LoRA fine-tuning of LLaMA2-70B on 25k samples take?

      flashcard

      • Memory: 4x A100 (80 GB)
      • Time: 22 hrs
    8. Open-Platypus consists of 11 open-source datasets, mostly human-designed questions with only about 10% generated by LLMs; it achieves strong performance with minimal fine-tuning time and cost and focuses on improving LLMs' STEM and logical abilities.

      [!NOTE] What is Open-Platypus?

      flashcard

      An open-source instruction-tuning dataset: ~90% human-written, focused on improving STEM and logical abilities.

    1. Contrary to RNNs that have the position of each token embedded within them, transformers are unaware of the position of each token. Therefore, the position IDs (position_ids) are used by the model to identify each token’s position in the list of tokens.

      [!QUESTION] In Transformers, what are position IDs, and how do they relate to positional embeddings?

      flashcard

    1. run_on_remote.py is a script that launches any example on remote self-hosted hardware, with automatic hardware and environment setup.

      [!NOTE] In 🤗 Transformers, what do you use to run an example script on remote self-hosted hardware?

      flashcard

      The run_on_remote.py script.

  5. huggingface.co
    1. subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.

      [!NOTE] In 🤗 Transformers from_pretrained(), what do you use if the relevant files live in a subfolder of the model repo?

      flashcard

      The subfolder parameter.

    2. safe_serialization (bool, optional, defaults to False) — Whether to save the model using safetensors or the traditional PyTorch way (that uses pickle).

      [!NOTE] In 🤗 Transformers save_pretrained(), how do you choose the save format?

      flashcard

      The safe_serialization parameter selects safetensors vs. the traditional PyTorch (pickle-based) format.

    3. is_main_process (bool, optional, defaults to True) — Whether the process calling this is the main process or not. Useful when in distributed training like TPUs and need to call this function on all processes. In this case, set is_main_process=True only on the main process to avoid race conditions.

      [!NOTE] What should you watch out for when calling 🤗 Transformers save_pretrained() in a distributed setting?

      flashcard

      1. Set is_main_process=True only on the main process (to avoid race conditions)
      2. You may need to replace the save function, i.e. set the save_function parameter
    4. state_dict (nested dictionary of torch.Tensor) — The state dictionary of the model to save. Will default to self.state_dict(), but can be used to only save parts of the model or if special precautions need to be taken when recovering the state dictionary of a model (like when using model parallelism).

      [!NOTE] In 🤗 Transformers save_pretrained(), how do you save only part of the parameters?

      flashcard

      Pass the state_dict parameter (sketch below).
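
      A sketch of passing a filtered state_dict (the "encoder." prefix is just an example):

      ```python
      partial = {k: v for k, v in model.state_dict().items() if k.startswith("encoder.")}
      model.save_pretrained("out_dir", state_dict=partial)
      ```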

    5. A path or url to a PyTorch state_dict save file (e.g, ./pt_model/pytorch_model.bin). In this case, from_pt should be set to True and a configuration object should be provided as config argument.

      [!NOTE] In 🤗 Transformers from_pretrained(), which files must be provided?

      flashcard

      1. The model weights
      2. config.json
    6. If the torchscript flag is set in the configuration, can’t handle parameter sharing so we are cloning the weights instead.

      [!QUESTION] 为什么 TorchScript 会不允许处理参数共享?

      flashcard

    7. Takes care of tying weights embeddings afterwards if the model class has a tie_weights() method.

      [!NOTE] 🤗 Transformers 中,使用 resize_token_embeddings() 后需要注意什么?

      flashcard

      对于输入输出嵌入共享权重的模型,还需要调用 tie_weights()

    8. Returns torch.nn.Embedding Pointer to the input tokens Embeddings Module of the model.

      [!NOTE] 🤗 Transformers 中,model.resize_token_embeddings() 会返回什么?

      flashcard

      输入 token embedding 模组

    1. Solutions that failed to reach an answer within 1024 tokens were discarded, resulting in less than 1860 samples on some problems.

      [!NOTE] In PRM800K's scored samples, why do some problems have fewer than 1860 samples?

      flashcard

      Solutions that failed to reach an answer within 1024 tokens were discarded.

    1. For some datasets it can be much faster to yield batches of data rather than examples one by one. You can speed up the dataset generation by yielding Arrow tables directly, instead of examples. This is especially useful if your data comes from Pandas DataFrames for example, since the conversion from Pandas to Arrow is as simple as: Copied import pyarrow as pa pa_table = pa.Table.from_pandas(df)

      [!NOTE] Data in which formats can be yielded directly as Arrow tables to speed up dataset generation?

      flashcard

      • Pandas DataFrames (via pa.Table.from_pandas(df))
      • ...
    1. In the former case, the tokens will NOT be removed from the tokenizer’s full vocabulary - they are only being flagged as non-special tokens.

      [!NOTE] In 🤗 Transformers tokenizer.add_special_tokens(), if replace_additional_special_tokens=True, what happens to the previous additional_special_tokens?

      flashcard

      They are not removed from the tokenizer's vocabulary; they are only flagged as non-special tokens.

    2. Tokens are only added if they are not already in the vocabulary (tested by checking if the tokenizer assign the index of the unk_token to them).

      [!NOTE] In a 🤗 Transformers tokenizer, how is it tested whether a token already exists in the vocabulary?

      flashcard

      By checking whether the tokenizer assigns it the index of the unk_token.

    1. Represents a token that can be be added to a Tokenizer. It can have special options that defines the way it should behave.

      [!NOTE] In 🤗 Tokenizers, how do you represent a token that can be added to a tokenizer (with special behavior options)?

      flashcard

      tokenizers.AddedToken

    1. 🤗 Datasets uses Arrow for its local caching system. It allows datasets to be backed by an on-disk cache, which is memory-mapped for fast lookup. This architecture allows for large datasets to be used on machines with relatively small device memory. For example, loading the full English Wikipedia dataset only takes a few MB of RAM

      [!NOTE] Which technique does Apache Arrow use to save runtime memory, and how much does it save?

      flashcard

      Memory-mapping: data is mapped from disk instead of being loaded into RAM. Loading the wikipedia/20220301.en dataset (20+ GB) takes only about 50 MB of RAM.

    2. Iterating over Wikipedia on a laptop gives you speeds of 1-3 Gbit/s:

      [!NOTE] How is Apache Arrow's I/O performance?

      flashcard

      Iterating over the wikipedia/20220301.en dataset on a laptop reaches about 1-3 Gbit/s.

    3. Arrow enables large amounts of data to be processed and moved quickly. It is a specific data format that stores data in a columnar memory layout. This provides several significant advantages: Arrow’s standard format allows zero-copy reads which removes virtually all serialization overhead. Arrow is language-agnostic so it supports different programming languages. Arrow is column-oriented so it is faster at querying and processing slices or columns of data. Arrow allows for copy-free hand-offs to standard machine learning tools such as NumPy, Pandas, PyTorch, and TensorFlow. Arrow supports many, possibly nested, column types.

      [!NOTE] What advantages does the Apache Arrow data format have?

      flashcard

      Five: zero-copy reads (no serialization overhead), language-agnostic, column-oriented (fast slicing/querying), copy-free hand-offs to NumPy/Pandas/PyTorch/TensorFlow, and support for many (possibly nested) column types.

    1. We have already created Beam pipelines for some of the larger datasets like wikipedia, and wiki40b.

      [!NOTE] Roughly how large does a dataset need to be before Apache Beam is worth using?

      flashcard

      More than ~10 GB? For reference: wikipedia/20220301.en is 21.60 GB, wiki40b/en is 10.47 GB.

    2. The processing pipeline is executed on a distributed processing backend such as Apache Flink, Apache Spark, or Google Cloud Dataflow.

      [!NOTE] What are common parallel/distributed data-processing backends?

      flashcard

      • Apache Flink
      • Apache Spark
      • Google Cloud Dataflow
    3. Some datasets are too large to be processed on a single machine. Instead, you can process them with Apache Beam, a library for parallel data processing.

      [!NOTE] What can you use to process data in a parallel/distributed fashion?

      flashcard

      Apache Beam

    1. Other metrics, such as BLEU are harder to interpret: while they also range between 0 and 1, they can vary greatly depending on which parameters are used to generate the scores, especially when different tokenization and normalization techniques are used (see the metric card for more information about BLEU limitations).

      [!NOTE] What drawbacks do metrics such as BLEU have in terms of interpretability?

      flashcard

      1. The scores depend on the parameters used to generate them
      2. They are affected by tokenization, normalization, and similar choices
    2. These two types of evaluation can use different metrics and measure different aspects of model performance. For example, offline evaluation can compare a model to other models based on their performance on common benchmarks, whereas online evaluation will evaluate aspects such as latency and accuracy of the model based on production data (for example, the number of user queries that it was able to address).

      [!NOTE] What do offline vs. online model evaluation usually focus on?

      flashcard

      • Offline: common benchmarks, comparison against other models
      • Online: production data, aspects such as latency and accuracy on real user queries
    3. Having an imbalanced dataset can skew the results of your metrics. Imagine a dataset with 99 “non-fraud” cases and 1 “fraud” case. A simple model that always predicts “non-fraud” cases would give yield a 99% accuracy which might sound good at first until you realize that you will never catch a fraud case.

      [!NOTE] In evaluation, what bias can an imbalanced label distribution cause?

      flashcard

      Models that simply favor the majority label get an advantage (e.g. always predicting "non-fraud" yields 99% accuracy on a 99:1 dataset while never catching a fraud case).

    1. scipy.stats.bootstrap

      [!NOTE] In SciPy, what do you use to compute a bootstrap confidence interval?

      flashcard

      scipy.stats.bootstrap() (sketch below)
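
      A small sketch computing a 95% confidence interval for the mean of a toy sample:

      ```python
      import numpy as np
      from scipy.stats import bootstrap

      data = (np.random.default_rng(0).normal(size=100),)  # bootstrap expects a sequence of samples
      res = bootstrap(data, np.mean, confidence_level=0.95)
      print(res.confidence_interval)
      ```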

    1. Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and else CPU. If you want to use a specific device you can pass device to compute where -1 will use the GPU and a positive integer (starting with 0) will use the associated CUDA device.

      [!NOTE] In 🤗 Evaluate compute(), how do you choose the device?

      flashcard

      • By default, the first GPU if one is available, otherwise the CPU
      • device=-1 uses the GPU; device={num} (a non-negative integer) uses the corresponding CUDA device
    2. Currently supported tasks are: "text-classification": will use the TextClassificationEvaluator. "token-classification": will use the TokenClassificationEvaluator. "question-answering": will use the QuestionAnsweringEvaluator. "image-classification": will use the ImageClassificationEvaluator. "text-generation": will use the TextGenerationEvaluator. "text2text-generation": will use the Text2TextGenerationEvaluator. "summarization": will use the SummarizationEvaluator. "translation": will use the TranslationEvaluator. "automatic-speech-recognition": will use the AutomaticSpeechRecognitionEvaluator.

      [!NOTE] 🤗 Evaluate evaluator 目前支持哪些任务类型?

      flashcard

      9 种

    1. You can find examples of dataset structures by consulting the “Dataset Preview” function or the dataset card for a given dataset

      [!NOTE] In 🤗 Datasets, how do you find examples of a dataset's structure?

      flashcard

      1. The "Dataset Preview" function
      2. The dataset card
    2. You can find the right metric for your task by: Looking at the Task pages to see what metrics can be used for evaluating models for a given task. Checking out leaderboards on sites like Papers With Code (you can search by task and by dataset). Reading the metric cards for the relevant metrics and see which ones are a good fit for your use case. For example, see the BLEU metric card or SQuaD metric card. Looking at papers and blog posts published on the topic and see what metrics they report. This can change over time, so try to pick papers from the last couple of years!

      [!NOTE] 要寻找合适的 metric,有哪些常用的途径?

      flashcard

      4 种

    3. perplexity, which can be used for evaluating different kinds of (unsupervised) generative tasks.

      [!NOTE] Which generic metric can usually be used for unsupervised/generative tasks?

      flashcard

      Perplexity

    4. There are 3 high-level categories of metrics: Generic metrics, which can be applied to a variety of situations and datasets, such as precision and accuracy. Task-specific metrics, which are limited to a given task, such as Machine Translation (often evaluated using metrics BLEU or ROUGE) or Named Entity Recognition (often evaluated with seqeval). Dataset-specific metrics, which aim to measure model performance on specific benchmarks: for instance, the GLUE benchmark has a dedicated evaluation metric.

      [!NOTE] What are the three high-level categories of evaluation metrics?

      flashcard

      1. Generic metrics (often simple mathematical definitions, e.g. precision, accuracy)
      2. Task-specific metrics (e.g. BLEU/ROUGE for machine translation, seqeval for NER)
      3. Dataset-specific metrics (e.g. the GLUE benchmark's dedicated metric)
    1. we added a CLI that makes creating a new evaluation module much easier: Copied evaluate-cli create "My Metric" --module_type "metric" This will create a new Space on the 🤗 Hub, clone it locally, and populate it with a template.

      [!NOTE] 🤗 Evaluate 中,如何新建一个空白 evaluation 模组?

      flashcard

      evaluate-cli create "My Metric" --module_type "metric"

    1. Evaluation can be run by loading the EvaluationSuite and calling run() method with a model or pipeline. Copied >>> from evaluate import EvaluationSuite >>> suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite') >>> results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")

      [!NOTE] In 🤗 Evaluate, how do you load and run an EvaluationSuite?

      flashcard

      1. suite = EvaluationSuite.load(...)
      2. suite.run(model_or_pipeline)
    2. EvaluationSuite scripts can be defined as follows, and supports Python code for data preprocessing.

      [!NOTE] In 🤗 Evaluate, how do you define an EvaluationSuite subclass?

      flashcard

      As in the documentation example the quote refers to; Python code can be used for data preprocessing.

    3. It can be useful to evaluate models on a variety of different tasks to understand their downstream performance. The EvaluationSuite enables evaluation of models on a collection of tasks.

      [!NOTE] In 🤗 Evaluate, what do you use to evaluate a model on a collection of tasks?

      flashcard

      EvaluationSuite

    4. from evaluate.visualization import radar_plot

      [!NOTE] What visualization tools does 🤗 Evaluate provide?

      flashcard

      See evaluate.visualization (e.g. radar_plot).

    5. Currently only "text-classification" is supported with more tasks being added in the future.

      [!NOTE] In 🤗 Evaluate, which tasks does the evaluator support according to this page?

      flashcard

      Only "text-classification"? (more tasks are being added)

    6. The evaluator expects a "text" and "label" column for the data input. If your dataset differs you can provide the columns with the keywords input_column="text" and label_column="label"

      [!NOTE] In 🤗 Evaluate, what input format does the evaluator expect?

      flashcard

      1. By default, columns named "text" and "label"
      2. Otherwise, specify them with input_column="text" and label_column="label"
    7. results = eval.compute(model_or_pipeline=pipe, data=data, metric=metric, ... label_mapping={"NEGATIVE": 0, "POSITIVE": 1}, ... strategy="bootstrap", n_resamples=200)

      [!NOTE] In 🤗 Evaluate, how do you compute a metric result from a model, data, and metric?

      flashcard

      As in the quoted example: eval.compute(model_or_pipeline=pipe, data=data, metric=metric, ...).

    8. evaluate.save("./results/"experiment="run 42", **result, **hyperparams)

      [!NOTE] In 🤗 Evaluate, how do you save evaluation results?

      flashcard

      evaluate.save(save_path, **any_kwargs): it saves the passed keyword arguments plus some useful system information.

    9. With the evaluate.push_to_hub() function, you can easily report evaluation results to the model’s repository:

      [!NOTE] In 🤗 Evaluate, how do you push evaluation results to the model's repository?

      flashcard

      evaluate.push_to_hub(), as in the quoted example.

    10. clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"]) The combine function accepts both the list of names of the metrics as well as an instantiated modules. The compute call then computes each metric:

      [!NOTE] In 🤗 Evaluate, how do you combine several metrics computed on the same data?

      flashcard

      evaluate.combine(metrics) (sketch below)
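
      A tiny sketch with toy predictions/references:

      ```python
      import evaluate

      clf_metrics = evaluate.combine(["accuracy", "f1", "precision", "recall"])
      print(clf_metrics.compute(predictions=[0, 1, 0], references=[0, 1, 1]))
      ```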

    11. 🤗 Evaluate solves this issue by only computing the final metric on the first node. The predictions and references are computed and provided to the metric separately for each node. These are temporarily stored in an Apache Arrow table, avoiding cluttering the GPU or CPU memory. When you are ready to compute() the final metric, the first node is able to access the predictions and references stored on all the other nodes. Once it has gathered all the predictions and references, compute() will perform the final metric evaluation.

      [!NOTE] How does 🤗 Evaluate handle distributed metric computation?

      flashcard

      The final metric is computed only on the first node; every node computes its own predictions/references, which are stored in an Apache Arrow table and gathered by the first node when compute() is called.

    12. Typically, when a metric score is additive (f(AuB) = f(A) + f(B)), you can use distributed reduce operations to gather the scores for each subset of the dataset. But when a metric is non-additive (f(AuB) ≠ f(A) + f(B)), it’s not that simple. For example, you can’t take the sum of the F1 scores of each data subset as your final metric.

      [!NOTE] metric 可以使用 all-reduce 方法计算的条件是?

      flashcard

      对子集可加,即 f(AuB) = f(A) + f(B)

    13. When it comes to computing the actual score there are two main ways to do it: all-in-one and incremental. In the incremental approach the necessary inputs are added to the module with EvaluationModule.add() or EvaluationModule.add_batch() and the score is calculated at the end with EvaluationModule.compute(). Alternatively, one can pass all the inputs at once to compute().

      [!NOTE] 🤗 Evaluate 中,如何计算 evalution 结果?

      flashcard

      • add() / add_batch() + compute()
      • compute(references=..., predictions=...)
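
      两种方式的对照示意(toy 数据,两种写法得到相同的 accuracy):

      ```python
      import evaluate

      metric = evaluate.load("accuracy")

      # 增量式:逐 batch add,最后统一 compute
      for refs, preds in [([0, 1], [0, 1]), ([1, 0], [1, 1])]:
          metric.add_batch(references=refs, predictions=preds)
      print(metric.compute())   # {'accuracy': 0.75}

      # 一次性:把全部输入直接传给 compute()
      print(metric.compute(references=[0, 1, 1, 0], predictions=[0, 1, 1, 1]))
      ```
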
    14. All evaluation modules come with a range of useful attributes, stored in an EvaluationModuleInfo object:

      | Attribute | Description |
      | --- | --- |
      | description | A short description of the evaluation module. |
      | citation | A BibTex string for citation when available. |
      | features | A Features object defining the input format. |
      | inputs_description | This is equivalent to the module's docstring. |
      | homepage | The homepage of the module. |
      | license | The license of the module. |
      | codebase_urls | Link to the code behind the module. |
      | reference_urls | Additional reference URLs. |

      [!NOTE] 🤗 Evaluate 中,如何查看一个 evaluation 模组的详细信息?

      flashcard

      访问模组对应的属性即可(这些信息存储在 EvaluationModuleInfo 对象中),如 module.description、module.features
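
      例如(以 accuracy 模组为例):

      ```python
      import evaluate

      accuracy = evaluate.load("accuracy")
      print(accuracy.description)         # 模组简介
      print(accuracy.citation)            # BibTeX 引用
      print(accuracy.features)            # 输入格式(Features 对象)
      print(accuracy.inputs_description)  # 相当于模组的 docstring,含用法说明
      ```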

    15. With list_evaluation_modules() you can check what modules are available on the hub. You can also filter for specific modules and skip community metrics if you want. You can also see additional information such as likes:

      [!NOTE] 🤗 Evaluate 中,如何查看可用的 evaluation 模组?

      flashcard

      evaluate.list_evaluation_modules()
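
      一个带过滤条件的调用示意:

      ```python
      import evaluate

      modules = evaluate.list_evaluation_modules(
          module_type="comparison",   # 只列出 comparison 类模组
          include_community=False,    # 跳过社区贡献的模组
          with_details=True,          # 返回 likes 等详细信息
      )
      print(modules)
      ```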

    16. If you want to make sure you are loading the right type of evaluation (especially if there are name clashes) you can explicitly pass the type: >>> word_length = evaluate.load("word_length", module_type="measurement")

      [!NOTE] 🤗 Evaluate 中,如何加载 evaluation 工具?

      flashcard

      例如 word_length = evaluate.load("word_length", module_type="measurement")

    1. If you’d like to monitor your evaluation metrics during fine-tuning, specify the evaluation_strategy parameter in your training arguments to report the evaluation metric at the end of each epoch: >>> from transformers import TrainingArguments, Trainer >>> training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

      [!NOTE] 🤗 Transformers Trainer 中,要设置 evaluation 的时机,可以使用?

      flashcard

      TrainingArguments 中设置 evaluation_strategy 参数
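
      结合 🤗 Evaluate 的常见写法如下(model、small_train_dataset、small_eval_dataset 假定已按常规流程准备好):

      ```python
      import numpy as np
      import evaluate
      from transformers import Trainer, TrainingArguments

      metric = evaluate.load("accuracy")

      def compute_metrics(eval_pred):
          logits, labels = eval_pred
          predictions = np.argmax(logits, axis=-1)
          return metric.compute(predictions=predictions, references=labels)

      training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")
      trainer = Trainer(
          model=model,
          args=training_args,
          train_dataset=small_train_dataset,
          eval_dataset=small_eval_dataset,
          compute_metrics=compute_metrics,
      )
      trainer.train()   # 每个 epoch 结束时自动评测并记录 accuracy
      ```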

    1. The Zhihu plugin provides three very convenient ways to upload images, supporting the .gif, .png and .jpg formats; when an image is uploaded, an image link is automatically generated at the line where the Markdown cursor currently sits, so the author does not need to manage links manually

      [!NOTE] VSCode 知乎插件支持如何上传图片?

      flashcard

      3 种方式+自动上传到知乎图床

    2. The plugin automatically scans the content before the first level-1 heading and uses the first image link it finds as the cover image

      [!NOTE] VSCode 知乎插件会如何智能识别并处理背景图片?

      flashcard

      将第一个一级标题之前的图片链接作为背景图片

    3. The article title does not need to be typed manually: the plugin automatically detects the first level-1 heading in the text (# This is a title, which must contain exactly one #), uses it as the title, and that line will not appear in the body. If no such heading is detected, the user is asked to enter a title manually.

      [!NOTE] VSCode 知乎插件会如何智能识别并处理标题?

      flashcard

      将第一个一级标题作为文章标题,并从正文中删除

    4. Link scanning 😊 If you want to answer under a specific question, or edit one of your existing answers, put the question/answer link on the first line of the file in the form #! https://.... When publishing, the plugin scans and recognizes it, then publishes to the corresponding question or updates the existing answer. For example, to answer under the question 轻功是否真的存在,其在科学上可以解释吗?, just put #! https://www.zhihu.com/question/19602618 on the first line. For an answer you have already written, copy its link, e.g. #! https://www.zhihu.com/question/355223335/answer/1003461264, to the top of the file. For an article you have already written, use its link, e.g. #! https://zhuanlan.zhihu.com/p/107810342. If no link is found on the first line, the plugin asks what to do next: you can publish a new article, or pick a question from your favorites and publish the answer under it.

      [!NOTE] VSCode 知乎插件如何快捷发布内容?

      flashcard

      将问题/答案链接以 #! https://... 的格式放置于答案的第一行

    5. Supports converting mermaid diagrams to images { "zhihu.enableMermaidToPng": true, // must be set to true for this to take effect "zhihu.mermaidTheme": "dark" // the mermaid theme can also be set; the default is "default" }

      [!NOTE] 知乎@VSCode 中,如何开启 mermaid 转化为图片?

      flashcard

      "zhihu.enableMermaidToPng": true

    1. We tested these 12 wet toilet papers with a high-resolution mass spectrometer, and every one of them contained preservatives not listed on the ingredient label, such as 4'-hydroxyacetophenone and ethylparaben

      [!NOTE] 湿厕纸的防腐剂添加情况如何?

      flashcard

      目前没有标准要求湿厕纸一定要注明所有成分 许多湿厕纸添加了成分表里没有标识的防腐剂

    1. "While the 'water' in wet toilet paper gives a better cleaning experience, the moist environment inside the package also provides favorable conditions for germs and microbes to grow," said Qu Bin, director of dermatology at Shulan (Hangzhou) Hospital. As a pack of wet toilet paper is opened again and again, it comes into contact with the air each time, so over time it is hard to keep it sterile. From this angle there is indeed a certain health risk: it may cause skin allergies and trigger conditions such as eczema, and for someone who already has a skin disease it may aggravate it.

      [!NOTE] 湿厕纸能否保持无菌状态?

      flashcard

      很难,湿润+空气/手部污染

    2. Qu Bin said that the anus, urethra and vagina should all be kept dry, and that compared with the anus the skin of the urethra and vagina is more delicate and sensitive. Since China has no dedicated standard for wet toilet paper yet, the wipes are hard to keep sterile, and some also contain bactericides and preservatives, women are not advised to use them routinely to wipe the vaginal or urethral area; if the flora there becomes unbalanced and the area stays damp, the risk of disease actually increases. "If you are in the habit of using wet toilet paper, individually wrapped single wipes are relatively safer; in general just follow a wet-then-dry routine."

      [!NOTE] 隐私部位平时应保持什么状态?

      flashcard

      清洁、干燥(擦拭后应及时擦干)

    3. At present, properly manufactured wet toilet paper and wet wipes generally follow the "Hygienic standard for disposable sanitary products" issued by the General Administration of Quality Supervision. Its hygiene requirements are the same as those for ordinary-grade face masks. In other words, as far as hygiene is concerned, there is no problem using wet toilet paper to wipe your hands or mouth.

      [!NOTE] 湿厕纸、湿纸巾的卫生标准为?

      flashcard

      与普通级口罩一致

    1. FYI, all the comparisons between fp32 and fp16 that I ran on a hunch end up with the same result, except for the normalization weights.

      [!QUESTION] fp16fp32 最可能在哪里出现差距?

      flashcard

      1. 归一化权重
    1. device_map = infer_auto_device_map(model) This will return a dictionary mapping modules or weights to a device. On a machine with one Titan RTX for instance, we get the following: {'model.decoder.embed_tokens': 0,

      [!NOTE] 如何不实际占用内存就获取模型各层在设备上的分布

      flashcard

      accelerate.infer_auto_device_map(model)

    2. with init_empty_weights(): model = AutoModelForCausalLM.from_config(config)

      [!NOTE] 如何不占用实际内存创建模型实例?

      flashcard

      with init_empty_weights():
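
      把这两条引文串起来的示意(模型名 facebook/opt-13b 与 max_memory 数值均为示例):

      ```python
      from accelerate import infer_auto_device_map, init_empty_weights
      from transformers import AutoConfig, AutoModelForCausalLM

      config = AutoConfig.from_pretrained("facebook/opt-13b")
      with init_empty_weights():                 # 权重创建在 meta device 上,不占用内存/显存
          model = AutoModelForCausalLM.from_config(config)

      device_map = infer_auto_device_map(
          model,
          max_memory={0: "10GiB", "cpu": "30GiB"},   # 可选:限制各设备可用内存
      )
      print(device_map)   # 模块名 -> 设备(GPU 序号 / "cpu" / "disk")
      ```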

    1. To use GPUs, you need to install the NVIDIA Container Toolkit . We also recommend using NVIDIA drivers with CUDA version 11.8 or higher.

      [!NOTE] 在 Docker 等容器中,如何快速配置 CUDA 环境?

      flashcard

      NVIDIA Container Toolkit

    1. Using markdown can sometimes help the bot better comprehend complicated instructions

      [!NOTE] Prompt 中,应该使用什么结构化语法组织复杂 prompt?

      flashcard

      使用 MarkDown 语法可能有助于 LLM 理解

    2. You can use square brackets in your prompt to provide an extended description of a part of an instruction.

      [!NOTE] prompt 中,通常如何表示变量

      flashcard

      用方括号 [ ] 包裹

    1. remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method to crash. Note that using remove_invalid_values can slow down generation.

      [!NOTE] 🤗 Transformers 中,generate()remove_invalid_values 有什么用处?有什么影响?

      flashcard

      移除模型的 naninf 输出 但会减慢生成速度

    2. force_words_ids(List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.

      [!NOTE] 🤗 Transformers 中,要简单指定 generate() 必须生成的 token_ids,可以使用?

      flashcard

      参数 force_words_ids

    3. constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.

      [!NOTE] 🤗 Transformers 中,如何为 generate 添加约束?

      flashcard

      使用 contraints: List[transformers.Constraint] 参数

    4. A ModelOutput (if return_dict_in_generate=True or when config.return_dict_in_generate=True) or a torch.FloatTensor.

      [!NOTE] 🤗 Transformers 的 generate() 中,要返回多种值,需要设置什么?

      flashcard

      参数 return_dict_in_generate=True
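
      把这几个参数放在一起的示意(以 t5-small 英译德为例;force_words_ids 需要 num_beams > 1 的 beam search):

      ```python
      from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

      tokenizer = AutoTokenizer.from_pretrained("t5-small")
      model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

      force_words_ids = tokenizer(["Sie"], add_special_tokens=False).input_ids

      inputs = tokenizer("translate English to German: How old are you?", return_tensors="pt")
      outputs = model.generate(
          **inputs,
          num_beams=5,
          force_words_ids=force_words_ids,   # 生成结果中必须包含这些 token
          remove_invalid_values=True,        # 移除 nan/inf 防止崩溃,但会变慢
          return_dict_in_generate=True,      # 返回 ModelOutput 而非单个张量
          output_scores=True,
      )
      print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
      print(len(outputs.scores))             # 每个生成步对应一个分数张量
      ```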

    1. Tensor.view(*shape) → Tensor Returns a new tensor with the same data as the self tensor but of a different shape.

      [!NOTE] Tensor.view() 的输入、输出为?

      flashcard

      • 输入:形状 *shape,其中可用 -1 代替一个维数
      • 输出:相同元素组成的对应形状的 tensor
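
      一个小例子,演示用 -1 推断维数,以及 view 与原张量共享存储:

      ```python
      import torch

      x = torch.arange(12)      # 形状 (12,)
      y = x.view(3, 4)          # 形状 (3, 4),与 x 共享同一块数据
      z = x.view(-1, 6)         # -1 表示该维自动推断,这里得到 (2, 6)

      y[0, 0] = 100             # 修改 y 也会反映到 x
      print(x[0])               # tensor(100)
      ```
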
    1. scores (tuple(torch.FloatTensor) optional, returned when output_scores=True is passed or when config.output_scores=True) — Processed prediction scores of the language modeling head (scores for each vocabulary token before SoftMax) at each generation step.

      [!NOTE] 🤗 Transformers 中,generate() 返回的 scores 的内容为?

      flashcard

      每个生成步上语言建模头的预测分数(已经过 logits processor 处理、尚未做 softmax 的 logits);需再经 softmax 才是概率