94 Matching Annotations
  1. Nov 2022
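    1. In a fork-based multiprocess implementation, each fork gives the child process its own virtual address space, but that space initially still maps to the parent's physical memory, so the child can read the parent's memory data efficiently. As soon as the child performs a write, the operating system's copy-on-write fault is triggered and the system copies the data into a separate region for the child to use.

      When is copy-on-write triggered?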
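
A minimal sketch of the behavior described above, assuming a POSIX system (`fork`, `waitpid`). The function name `parent_value_after_child_write` is hypothetical. It only demonstrates the observable effect (the child's write does not reach the parent); the page copy itself happens inside the kernel and is not visible from user code.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that writes to a variable inherited from the parent, then
// returns the value the parent still sees. Returns -1 on any failure.
int parent_value_after_child_write() {
    volatile int value = 42;        // volatile: keep the variable in memory
    pid_t pid = fork();
    if (pid < 0) return -1;         // fork failed
    if (pid == 0) {
        value = 7;                  // first write: the child gets a private copy
        _exit(value == 7 ? 0 : 1);  // the child observes its own modification
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return -1;
    return value;                   // the parent's data was never touched
}
```

Calling `parent_value_after_child_write()` in the parent should yield the original value, since only the child's copy was modified.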

  2. Jun 2022
    1. Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

      Why can loop unrolling improve performance?
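
The two effects can be sketched with a simple summation. The function names are hypothetical, and the two-accumulator variant is one possible follow-on transformation, not necessarily the one the quoted text has in mind. Whether it is actually faster depends on the compiler and the machine.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Baseline: one index update and one branch per element.
long sum_simple(const std::vector<long>& v) {
    long acc = 0;
    for (std::size_t i = 0; i < v.size(); ++i) acc += v[i];
    return acc;
}

// Unrolled by 2: half as many index updates and branches. Using two
// independent accumulators also shortens the chain of dependent additions.
long sum_unrolled2(const std::vector<long>& v) {
    const std::size_t n = v.size();
    long acc0 = 0, acc1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += v[i];
        acc1 += v[i + 1];
    }
    for (; i < n; ++i) acc0 += v[i];  // leftover element when n is odd
    return acc0 + acc1;
}
```

Both functions return the same result for any input; only the amount of loop overhead and the dependency structure differ.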

  3. Apr 2022
    1. Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur. Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory. Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.

      What factors should be considered in order to write efficient programs?
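
As an illustration of the stride-1 advice, the sketch below sums a small two-dimensional array in two traversal orders. The names `Grid`, `sum_row_major`, and `sum_col_major` are hypothetical. C++ stores such arrays row by row, so the first traversal reads memory sequentially while the second jumps `kCols` elements at a time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kRows = 4;
constexpr std::size_t kCols = 8;
using Grid = std::array<std::array<int, kCols>, kRows>;

// Stride-1: the inner loop walks elements in the order they sit in memory.
int sum_row_major(const Grid& g) {
    int total = 0;
    for (std::size_t r = 0; r < kRows; ++r)
        for (std::size_t c = 0; c < kCols; ++c)
            total += g[r][c];
    return total;
}

// Stride-kCols: the inner loop skips a whole row between accesses.
int sum_col_major(const Grid& g) {
    int total = 0;
    for (std::size_t c = 0; c < kCols; ++c)
        for (std::size_t r = 0; r < kRows; ++r)
            total += g[r][c];
    return total;
}
```

The results are identical; the difference only shows up in cache behavior once the array is much larger than this toy size.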

    1. std::move_if_noexcept will return a movable r-value if the object has a noexcept move constructor, otherwise it will return a copyable l-value. We can use the noexcept specifier in conjunction with std::move_if_noexcept to use move semantics only when a strong exception guarantee exists (and use copy semantics otherwise).

      If an exception can occur during a move, how can it be handled?
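
A small sketch of how `std::move_if_noexcept` picks between the two constructors. The types `SafeMove` and `RiskyMove`, the `Origin` tag, and the helper `transfer` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <utility>

enum class Origin { Fresh, Copied, Moved };

// Move constructor is noexcept, so move_if_noexcept yields an r-value.
struct SafeMove {
    Origin origin = Origin::Fresh;
    SafeMove() = default;
    SafeMove(const SafeMove&) : origin(Origin::Copied) {}
    SafeMove(SafeMove&&) noexcept : origin(Origin::Moved) {}
};

// Move constructor may throw, so move_if_noexcept falls back to a const
// l-value reference and the copy constructor is selected instead.
struct RiskyMove {
    Origin origin = Origin::Fresh;
    RiskyMove() = default;
    RiskyMove(const RiskyMove&) : origin(Origin::Copied) {}
    RiskyMove(RiskyMove&&) : origin(Origin::Moved) {}
};

// Builds a new object from `source` and reports which constructor ran.
template <typename T>
Origin transfer(T& source) {
    T dest(std::move_if_noexcept(source));
    return dest.origin;
}
```

With the copy fallback, the source object is left intact if construction throws, which is what makes the strong exception guarantee possible.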

  4. Mar 2022
    1. ● Relocatable object file (.o file)
         ○ Code and data that can be combined with other relocatable object files to form an executable object file
           ■ Each .o file is produced from exactly one source (.c) file
       ● Executable object file (a.out file)
         ○ Code and data that can be copied directly into memory and then executed
       ● Shared object file (.so file)
         ○ Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run-time

      What types of object files exist after compilation?

    2. ● Modularity
         ○ Program can be written as a collection of smaller source files, rather than one monolithic mass.
       ● Efficiency
         ○ Time: Separate compilation
           ■ Change one source file, compile, and then relink. No need to recompile other source files.
         ○ Space: Libraries
           ■ Common functions can be aggregated into a single file...

      What are the benefits of a linker?

    3. ● Global symbols
         ○ Symbols defined by module m that can be referenced by other modules.
           ■ e.g., non-static C functions and non-static global variables.
       ● External symbols
         ○ Global symbols that are referenced by module m but defined by some other module.
       ● Local symbols
         ○ Symbols that are defined and referenced exclusively by module m.
           ■ e.g., C functions and global variables defined with the static attribute.
         ○ Local linker symbols are not local program variables

      What kinds of linker symbols are there?
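
The three categories can be seen in a single translation unit. All identifiers below are hypothetical except `std::strlen`. The quoted slide talks about C; the same linkage rules apply here, although C++ additionally mangles function names.

```cpp
#include <cassert>
#include <cstring>

int call_count = 0;      // global symbol: defined here, visible to other modules

static int scale = 3;    // local linker symbol: visible only inside this module

// Local linker symbol: a static function cannot be referenced from other modules.
static int scaled(int x) { return x * scale; }

// Global symbol. std::strlen is an external symbol from this module's point
// of view: it is declared via <cstring> but defined in the C library.
int scaled_length(const char* s) {
    int len = static_cast<int>(std::strlen(s));  // stack variable: not a linker symbol at all
    ++call_count;
    return scaled(len);
}
```

The stack variable `len` illustrates the last point of the quote: it never appears in the symbol table, unlike the `static` definitions.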

    4. ● Symbol resolution
         ○ Programs define and reference symbols (global variables and functions)
         ○ Linker associates each symbol reference with exactly 1 symbol definition
       ● Relocation
         ○ Merges separate code and data sections into single sections
         ○ Relocates symbols from relative locations in .o files to final memory locations
         ○ Updates all references to symbols to reflect new positions

      What exactly does the linker do?

    5. ● Processes #include, #define, #if, macros
         ○ Combines main source file with headers (textually)
         ○ Defines and expands macros (token-based shorthand)
         ○ Conditionally removes parts of the code (e.g. specialize for Linux, Mac, ...)
       ● Removes all comments

      What does the preprocessor stage do?
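
A short sketch of macro expansion and conditional compilation. `SQUARE`, `PLATFORM_NAME`, and the two functions are hypothetical names; `__linux__` and `__APPLE__` are macros commonly predefined by compilers on those platforms.

```cpp
#include <cassert>

// Macro: replaced token by token before the compiler proper runs.
// The parentheses keep SQUARE(a + b) from expanding to a + b * a + b.
#define SQUARE(x) ((x) * (x))

// Conditional compilation: only one of these branches survives preprocessing.
#if defined(__linux__)
#define PLATFORM_NAME "linux"
#elif defined(__APPLE__)
#define PLATFORM_NAME "mac"
#else
#define PLATFORM_NAME "other"
#endif

int square_of_sum(int a, int b) { return SQUARE(a + b); }

const char* platform_name() { return PLATFORM_NAME; }
```

With GCC or Clang, running the compiler with `-E` stops after this stage and prints the preprocessed source, which makes the textual nature of these steps easy to inspect.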

    1. Programs that repeatedly reference the same variables enjoy good temporal locality. For programs with stride-k reference patterns, the smaller the stride, the better the spatial locality. Programs with stride-1 reference patterns have good spatial locality. Programs that hop around memory with large strides have poor spatial locality. Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality.

      What are the key characteristics of locality, in summary?

    2. One reason circuit designers organize DRAMs as two-dimensional arrays instead of linear arrays is to reduce the number of address pins on the chip.

      Why are DRAMs designed as two-dimensional arrays?
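
The pin saving can be worked out with a little arithmetic. This is a sketch under the assumption that the row address and the column address are sent one after the other over the same pins, so the chip needs only enough pins for the larger of the two. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Number of address bits needed to select one of n positions (n >= 1).
unsigned address_bits(unsigned long n) {
    unsigned bits = 0;
    while ((1UL << bits) < n) ++bits;
    return bits;
}

// Linear organization: every cell needs its own full address.
unsigned pins_linear(unsigned long cells) { return address_bits(cells); }

// Two-dimensional organization: row and column addresses share the pins.
unsigned pins_2d(unsigned long rows, unsigned long cols) {
    return std::max(address_bits(rows), address_bits(cols));
}
```

For 16 cells, a linear layout needs 4 address pins, while a 4 × 4 layout needs 2. The price is that each access takes two address transfers instead of one.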

  5. Feb 2022
    1. The techniques we have outlined (randomization, stack protection, and limiting which portions of memory can hold executable code) are three of the most common mechanisms used to minimize the vulnerability of programs to buffer overflow attacks

      What techniques can protect programs from attacks?

    2. The final example shows that one can compute the difference of two pointers within the same data structure, with the result being data having type long and value equal to the difference of the two addresses divided by the size of the data type.

      How is the difference of two pointers computed?
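
A minimal sketch of the rule, with hypothetical helper names. In C++ the result type is `std::ptrdiff_t`, which is the same size as `long` on typical x86-64 Linux systems.

```cpp
#include <cassert>
#include <cstddef>

// Pointer subtraction yields a count of elements, not bytes.
std::ptrdiff_t element_distance(const long* first, const long* last) {
    return last - first;
}

// Casting to char* first gives the raw difference of the two addresses.
std::ptrdiff_t byte_distance(const long* first, const long* last) {
    return reinterpret_cast<const char*>(last) -
           reinterpret_cast<const char*>(first);
}
```

For two pointers into the same `long` array that are three elements apart, `element_distance` returns 3 and `byte_distance` returns `3 * sizeof(long)`; dividing the second by the element size recovers the first.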

    3. convention, registers %rbx, %rbp, and %r12–%r15 are classified as callee-saved registers. When procedure P calls procedure Q, Q must preserve the values of these registers, ensuring that they have the same values when Q returns to P as they did when Q was called

      What do callee-saved registers do, and how should they be understood?

    4. At times, however, local data must be stored in memory. Common cases of this include these: there are not enough registers to hold all of the local data; the address operator ‘&’ is applied to a local variable, and hence we must be able to generate an address for it; some of the local variables are arrays or structures and hence must be accessed by array or structure references.

      When must local data be stored in memory?

  6. Jan 2022
    1. The cmp instructions set the condition codes according to the differences of their two operands. They behave in the same way as the sub instructions, except that they set the condition codes without updating their destinations.

      What do the cmp instructions do?

  7. Dec 2021
  8. Nov 2021
    1. logically be named movzlq, but this instruction does not exist. Instead, this type of data movement can be implemented using a movl instruction having a register as the destination. This technique takes advantage of the property that an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros.

      Why is movzlq missing from the movz instructions?

    2. in memory, to a register destination. Instructions in the movz class fill out the remaining bytes of the destination with zeros, while those in the movs class fill them out by sign extension, replicating copies of the most significant bit of the source operand.

      For the two classes of move instructions that copy a smaller source to a larger destination, what does each one do?
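
The same two behaviors appear in C++ integer conversions, which makes them easy to try out. The function names are hypothetical, and the instructions named in the comments are what a compiler would typically emit on x86-64, not something the code guarantees.

```cpp
#include <cassert>
#include <cstdint>

// Unsigned source: the upper bytes are filled with zeros (movz-style).
std::uint64_t zero_extend_byte(std::uint8_t b) { return b; }

// Signed source: the upper bytes replicate the sign bit (movs-style).
std::int64_t sign_extend_byte(std::int8_t b) { return b; }

// 4-byte to 8-byte zero extension: on x86-64 a plain movl into a register
// already clears the upper 4 bytes, so no separate movzlq is needed.
std::uint64_t zero_extend_long(std::uint32_t x) { return x; }
```

The byte pattern 0xF0 becomes 0x00000000000000F0 when zero-extended and 0xFFFFFFFFFFFFFFF0 (the value -16) when sign-extended.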

    3. The source operand designates a value that is immediate, stored in a register, or stored in memory. The destination operand designates a location that is either a register or a memory address. x86-64 imposes the restriction that a move instruction cannot have both operands refer to memory locations. Copying a value from one memory location to another requires two instructions: the first to load the source value into a register, and the second to write this register value to the destination.

      What types can the source and destination operands of a move be?

    4. The most general form is shown at the bottom of the table with syntax Imm(rb,ri,s). Such a reference has four components: an immediate offset Imm, a base register rb, an index register ri, and a scale factor s, where s must be 1, 2, 4, or 8. Both the base and index must be 64-bit registers. The effective address is computed as Imm + R[rb] + R[ri] · s.

      How is the memory address for $$Imm(r_b, r_i, s)$$ computed, and what restrictions apply?
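
The formula and the scale restriction can be written out directly. This is only a software model of the addressing mode; the function name and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Computes Imm + R[rb] + R[ri] * s with 64-bit wraparound arithmetic.
// Returns false and leaves `out` untouched if s is not 1, 2, 4, or 8.
bool effective_address(std::int64_t imm, std::uint64_t base_value,
                       std::uint64_t index_value, unsigned scale,
                       std::uint64_t& out) {
    if (scale != 1 && scale != 2 && scale != 4 && scale != 8) return false;
    out = static_cast<std::uint64_t>(imm) + base_value + index_value * scale;
    return true;
}
```

For example, with Imm = 0xFC, R[rb] = 0xF000, R[ri] = 0x100, and s = 4, the effective address is 0xFC + 0xF000 + 0x400 = 0xF4FC.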

    5. A final difference is that we see two additional lines of code (lines 8–9). These instructions will have no effect on the program, since they occur after the return instruction (line 7). They have been inserted to grow the code for the function to 16 bytes, enabling a better placement of the next block of code in terms of memory system performance.

      Why does disassembled code sometimes contain padding added after ret using nop instructions?

    1. One of the best things about classes is that they contain destructors that automatically get executed when an object of the class goes out of scope. So if you allocate (or acquire) memory in your constructor, you can deallocate it in your destructor, and be guaranteed that the memory will be deallocated when the class object is destroyed (regardless of whether it goes out of scope, gets explicitly deleted, etc…).

      What is the principle behind smart pointers?
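
A bare-bones sketch of that idea. `ScopedPtr`, `Resource`, and `live_inside_scope` are hypothetical names; in real code `std::unique_ptr` fills this role and handles far more cases.

```cpp
#include <cassert>

// Counts how many Resource objects are currently alive.
struct Resource {
    static int live;
    Resource() { ++live; }
    ~Resource() { --live; }
};
int Resource::live = 0;

// Owns a heap object and deletes it when the owner goes out of scope.
template <typename T>
class ScopedPtr {
    T* ptr_;
public:
    explicit ScopedPtr(T* p = nullptr) : ptr_(p) {}
    ~ScopedPtr() { delete ptr_; }                    // release happens here
    ScopedPtr(const ScopedPtr&) = delete;            // no copies: avoids double delete
    ScopedPtr& operator=(const ScopedPtr&) = delete;
    T& operator*() const { return *ptr_; }
    T* operator->() const { return ptr_; }
};

// Returns the live count observed while the ScopedPtr still exists.
int live_inside_scope() {
    ScopedPtr<Resource> p(new Resource());
    return Resource::live;
}
```

Inside `live_inside_scope` one `Resource` is alive; once the function returns, the destructor has run and the count is back to zero without any explicit `delete` at the call site.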

    1. Three techniques to avoid losing critical information at half-precision: (1) Full-precision master copy of weights. Maintain a full precision (FP32) copy of model weights that accumulates gradients. The numbers are rounded to half-precision for forward & backward passes. The motivation is that each gradient update (i.e. gradient times the learning rate) might be too small to be fully contained within the FP16 range (i.e. anything below $2^{-24}$ becomes zero in FP16). (2) Loss scaling. Scale up the loss to better handle gradients with small magnitudes (See Fig. 16). Scaling up the gradients shifts them towards the right section (containing larger values) of the representable range, preserving values that are otherwise lost. (3) Arithmetic precision. For common network arithmetic (e.g. vector dot-product, reduction by summing up vector elements), we can accumulate the partial results in FP32 and then convert the final output to FP16 before saving it into memory. Point-wise operations can be executed in either FP16 or FP32.

      In mixed precision, what methods ensure that precision is not lost?

    2. Two major sources of memory consumption in large model training: The majority is occupied by model states, including optimizer states (e.g. Adam momentums and variances), gradients and parameters. Mixed-precision training demands a lot of memory since the optimizer needs to keep a copy of FP32 parameters and other optimizer states, besides the FP16 version. The rest is consumed by activations, temporary buffers and unusable fragmented memory (named residual states in the paper).

      What are the main sources of GPU memory overhead in deep network training?
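
To put rough numbers on the model states, the sketch below assumes mixed-precision training with Adam, where each parameter carries an FP16 weight, an FP16 gradient, and three FP32 values (master weight, momentum, variance). The struct and function names are hypothetical, and activations and other residual states are deliberately left out.

```cpp
#include <cassert>

struct ModelStateBytes {
    unsigned long long fp16_params;  // 2 bytes per parameter
    unsigned long long fp16_grads;   // 2 bytes per parameter
    unsigned long long fp32_states;  // 3 values x 4 bytes per parameter
    unsigned long long total;
};

// Memory for model states only, under the Adam assumption stated above.
ModelStateBytes mixed_precision_adam_bytes(unsigned long long num_params) {
    ModelStateBytes b{};
    b.fp16_params = 2 * num_params;
    b.fp16_grads = 2 * num_params;
    b.fp32_states = 3 * 4 * num_params;
    b.total = b.fp16_params + b.fp16_grads + b.fp32_states;
    return b;
}
```

Under these assumptions the cost is 16 bytes per parameter, so a model with 1.5 billion parameters needs about 24 GB for model states alone, three quarters of it for the FP32 optimizer side.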

    3. Asynchronous parallel (ASP): Every GPU worker processes the data asynchronously, no waiting or stalling. However, it can easily lead to stale weights being used and thus lower the statistical learning efficiency. Even though it increases the computation time, it may not speed up training time to convergence.

      What is ASP, and what are its advantages and disadvantages?

    4. Bulk synchronous parallel (BSP): Workers sync data at the end of every minibatch. It prevents model weight staleness and gives good learning efficiency, but each machine has to halt and wait for others to send gradients.

      What is BSP, and what are its advantages and disadvantages?