94 Matching Annotations
  1. Nov 2022
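    1. In a fork-based multiprocess implementation, each fork gives the child process its own virtual address space, but at that point it still maps to the parent's physical memory, so the child can read the parent's in-memory data efficiently. As soon as the child performs a write, the operating system's copy-on-write fault is triggered, and the system copies the data into a separate region for the child to use.

      When is copy-on-write triggered?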
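
The behavior described above can be observed from user space. The following is a minimal POSIX-only sketch; the function name `parent_view_after_child_write` is made up for illustration, and the page-copy itself happens inside the kernel, so the code can only show its visible effect: the child's write never reaches the parent.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical demo: returns what the parent sees in data[0] after a forked
// child has overwritten its own view of the same buffer.
int parent_view_after_child_write() {
    std::vector<int> data(1024, 1);    // backing pages are shared right after fork()
    pid_t pid = fork();
    if (pid < 0) return -1;            // fork failed
    if (pid == 0) {
        data[0] = 42;                  // write in the child: it now works on its own copy
        _exit(data[0] == 42 ? 0 : 1);  // exit without running the parent's cleanup code
    }
    int status = 0;
    waitpid(pid, &status, 0);          // wait until the child has written and exited
    assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
    return data[0];                    // the parent's data was never modified
}
```

Reads in the child do not trigger a copy; only the write to `data[0]` does.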

  2. Jun 2022
    1. Once a compiler must resort to register spilling, any advantage of maintaining multiple accumulators will most likely be lost.

      What does register spilling refer to?

    2. a reassociation transformation can reduce the number of operations along the critical path in a computation, resulting in better performance by better utilizing the multiple functional units and their pipelining capabilities.

      How exactly is reassociation done?
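
As a rough illustration of the question above, the sketch below shows the same 2x-unrolled product loop with two different parenthesizations. The function names are hypothetical. Note that for floating-point data the two versions may round differently in general, which is why compilers usually do not apply this transformation on their own.

```cpp
#include <cstddef>
#include <vector>

// Original association: both multiplies in an iteration depend on acc,
// so two multiplies sit on the critical path per iteration.
double product_unrolled(const std::vector<double>& d) {
    double acc = 1.0;
    const std::size_t n = d.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2)
        acc = (acc * d[i]) * d[i + 1];
    for (; i < n; ++i) acc *= d[i];      // leftover element, if any
    return acc;
}

// Reassociated: d[i] * d[i + 1] does not depend on acc, so only one
// multiply per iteration remains on the critical path.
double product_reassociated(const std::vector<double>& d) {
    double acc = 1.0;
    const std::size_t n = d.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2)
        acc = acc * (d[i] * d[i + 1]);
    for (; i < n; ++i) acc *= d[i];      // leftover element, if any
    return acc;
}
```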

    3. Loop unrolling can improve performance in two ways. First, it reduces the number of operations that do not contribute directly to the program result, such as loop indexing and conditional branching. Second, it exposes ways in which we can further transform the code to reduce the number of operations in the critical paths of the overall computation.

      Why can loop unrolling improve performance?

    4. For an operation with latency L and capacity C, this requires an unrolling factor k ≥ C · L

      How should the unrolling factor be set to keep the pipelines full?
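
A minimal sketch of unrolling with multiple accumulators, using k = 2 purely as an illustration (the factor actually needed depends on the latency and capacity of the operation on a given machine). The function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// 2x unrolling with two independent accumulators: the additions into acc0
// and acc1 do not depend on each other, so they can overlap in the pipeline.
double sum_two_accumulators(const std::vector<double>& d) {
    double acc0 = 0.0, acc1 = 0.0;
    const std::size_t n = d.size();
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        acc0 += d[i];        // even-indexed elements
        acc1 += d[i + 1];    // odd-indexed elements
    }
    for (; i < n; ++i) acc0 += d[i];   // leftover element, if any
    return acc0 + acc1;                // combine the partial sums once at the end
}
```

With many more accumulators than the machine has registers, the compiler would have to spill some of them to the stack, which is the situation the register-spilling note warns about.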

  3. Apr 2022
    1. are evaluated simultaneously, a phenomenon referred to as instruction-level parallelism.


    2. If a compiler cannot determine whether or not two pointers may be aliased, it must assume that either case is possible, limiting the set of possible optimizations.

      How should pointer aliasing as an optimization blocker be understood?

    3. The case where two pointers may designate the same memory location is known as memory aliasing.

      What is memory aliasing?
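
A small sketch of why possible aliasing blocks an optimization: the two functions below look equivalent, but they give different results when both pointers refer to the same variable, so a compiler cannot rewrite one into the other. `run_aliased` is a hypothetical helper added only for the demonstration.

```cpp
// Adds *yp to *xp twice (reads *yp twice).
void twiddle1(long* xp, long* yp) {
    *xp += *yp;
    *xp += *yp;
}

// Adds 2 * *yp to *xp once (reads *yp once).
void twiddle2(long* xp, long* yp) {
    *xp += 2 * *yp;
}

// Hypothetical helper: calls f with both pointers aimed at one variable.
long run_aliased(void (*f)(long*, long*), long start) {
    long x = start;
    f(&x, &x);
    return x;
}
```

With distinct variables both functions add twice the second value; with aliased pointers `twiddle1` quadruples the value while `twiddle2` triples it.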

    4. ● Focus your attention on the inner loops, where the bulk of the computations and memory accesses occur.
       ● Try to maximize the spatial locality in your programs by reading data objects sequentially, with stride 1, in the order they are stored in memory.
       ● Try to maximize the temporal locality in your programs by using a data object as often as possible once it has been read from memory.


    5. ● Repeated references to local variables are good because the compiler can cache them in the register file (temporal locality).
       ● Stride-1 reference patterns are good because caches at all levels of the memory hierarchy store data as contiguous blocks (spatial locality).

      Why are repeated use of local variables and stride-1 patterns cache-friendly?
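
To make the two points concrete, here is a small sketch that sums a 2D array in two loop orders. The dimensions and names are arbitrary choices for the example. Both functions return the same value; they differ only in the order in which memory is touched.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t ROWS = 4, COLS = 8;
using Matrix = std::array<std::array<int, COLS>, ROWS>;  // stored row by row

// Hypothetical helper: fills the matrix with 0, 1, 2, ... in storage order.
Matrix make_matrix() {
    Matrix a{};
    for (std::size_t i = 0; i < ROWS; ++i)
        for (std::size_t j = 0; j < COLS; ++j)
            a[i][j] = static_cast<int>(i * COLS + j);
    return a;
}

// Stride-1: the inner loop walks consecutive elements of one row.
// `sum` is a local variable reused on every iteration (temporal locality).
long sum_row_major(const Matrix& a) {
    long sum = 0;
    for (std::size_t i = 0; i < ROWS; ++i)
        for (std::size_t j = 0; j < COLS; ++j)
            sum += a[i][j];
    return sum;
}

// Stride-COLS: the inner loop jumps a whole row ahead on each access.
long sum_col_major(const Matrix& a) {
    long sum = 0;
    for (std::size_t j = 0; j < COLS; ++j)
        for (std::size_t i = 0; i < ROWS; ++i)
            sum += a[i][j];
    return sum;
}
```

For a matrix this small the difference is not measurable; the access pattern only starts to matter once the data no longer fits in cache.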

    6. we suggest adopting a mental model that assumes write-back, write-allocate caches.

      Which model is recommended for reasoning about write hits and write misses?

    7. fully associative caches are only appropriate for small caches

      What scenarios are fully associative caches suited for?

    8. A copy of w is contained in the line if and only if the valid bit is set and the tag in the cache line matches the tag in the address of w.

      How do we determine whether the word w to be read is in a cache line?

    9. The process that a cache goes through of determining whether a request is a hit or a miss and then extracting the requested word consists of three steps: (1) set selection, (2) line matching, and (3) word extraction.

      What three steps does a cache go through to serve a memory request?
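
The three steps can be sketched as bit manipulation on the address. This is a simplified model, not a description of any particular hardware: the struct names and the choice of `s` set-index bits and `b` block-offset bits as parameters are assumptions for the example.

```cpp
#include <cstdint>

// Assumed cache geometry: 2^s sets, 2^b bytes per block.
struct CacheGeometry { unsigned s; unsigned b; };

struct AddressParts { std::uint64_t tag, set_index, block_offset; };

struct Line { bool valid; std::uint64_t tag; };

// Splits an address into | tag | set index | block offset |.
constexpr AddressParts split_address(std::uint64_t addr, CacheGeometry g) {
    return AddressParts{
        addr >> (g.s + g.b),                   // tag: used for line matching
        (addr >> g.b) & ((1ull << g.s) - 1),   // set index: used for set selection
        addr & ((1ull << g.b) - 1)             // block offset: used for word extraction
    };
}

// Line matching: a hit requires the valid bit and an equal tag.
constexpr bool is_hit(Line line, AddressParts p) {
    return line.valid && line.tag == p.tag;
}
```

For example, with s = 2 and b = 3, the address 0b101101110 splits into tag 0b1011, set index 0b01, and block offset 0b110.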

    1. std::shared_ptr can be used when you need multiple smart pointers that can co-own a resource. The resource will be deallocated when the last std::shared_ptr goes out of scope. std::weak_ptr can be used when you want a smart pointer that can see and use a shared resource, but does not participate in the ownership of that resource.

      When is weak_ptr appropriate?

    1. Always make a copy of an existing std::shared_ptr if you need more than one std::shared_ptr pointing to the same resource.

      What is the recommended way to create multiple shared_ptr objects for the same resource?
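
A short sketch tying the two shared_ptr notes together: a second owner is made by copying an existing `std::shared_ptr`, and a `std::weak_ptr` observes the resource without owning it. The function name is hypothetical.

```cpp
#include <memory>

// Returns true if every step behaves as the notes describe.
bool observer_expires_with_last_owner() {
    std::weak_ptr<int> observer;                 // can see the resource, never owns it
    {
        auto owner = std::make_shared<int>(7);
        std::shared_ptr<int> co_owner = owner;   // copy: both share one control block
        observer = owner;                        // does not change the owner count
        if (owner.use_count() != 2) return false;

        // lock() yields a temporary owner, or an empty pointer if the resource is gone.
        std::shared_ptr<int> locked = observer.lock();
        if (!locked || *locked != 7) return false;
    }   // all owners are destroyed here, so the resource is deallocated
    return observer.expired();
}
```

Creating the second owner from the raw pointer instead of copying `owner` would produce two independent control blocks and a double delete.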

    1. Redistribution can easily become a bottleneck due to the bandwidth of cross-device links usually being magnitudes smaller than that of the on-device memory bus.

      What problem can array redistribution run into?

    2. Modern large-scale deep learning workloads highlight the need for parallel execution across many devices in order to fit model data into hardware accelerator memories. In these settings, array redistribution may be required during a computation, but can also become a bottleneck if not done efficiently

      Why is array redistribution needed?

    1. Second, don’t manually delete the resource out from underneath the std::unique_ptr.

      What are some ways to misuse std::unique_ptr?

    2. Use std::make_unique() instead of creating std::unique_ptr and using new yourself.

      What is the recommended way to create a std::unique_ptr, and what are the benefits?

    3. Favor std::array, std::vector, or std::string over a smart pointer managing a fixed array, dynamic array, or C-style string.

      For fixed arrays, dynamic arrays, and strings, which types are preferred?

    4. Because std::unique_ptr is designed with move semantics in mind, copy initialization and copy assignment are disabled. If you want to transfer the contents managed by std::unique_ptr, you must use move semantics.

      Can std::unique_ptr be copy-initialized?
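
A minimal sketch of the unique_ptr notes above: creation through `std::make_unique` and ownership transfer through `std::move`. `Resource` and `transfer_ownership` are hypothetical names.

```cpp
#include <memory>
#include <utility>

struct Resource {
    explicit Resource(int i) : id(i) {}
    int id;
};

// Returns the id seen through the new owner, or -1 if the old owner still holds it.
int transfer_ownership() {
    auto first = std::make_unique<Resource>(5);   // no explicit new or delete

    // std::unique_ptr<Resource> second = first;  // would not compile: copying is disabled
    std::unique_ptr<Resource> second = std::move(first);  // ownership moves to second

    return (first == nullptr) ? second->id : -1;  // a moved-from unique_ptr is empty
}
```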

    1. std::move_if_noexcept will return a movable r-value if the object has a noexcept move constructor, otherwise it will return a copyable l-value. We can use the noexcept specifier in conjunction with std::move_if_noexcept to use move semantics only when a strong exception guarantee exists (and use copy semantics otherwise).

      If an exception is thrown during a move, how can it be handled?
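
The selection rule can be seen with two small types that record how they were constructed. Both types and the helper `relocated_by_move` are invented for this sketch; only `std::move_if_noexcept` is the standard facility.

```cpp
#include <utility>

struct SafeMove {
    bool made_by_move = false;
    SafeMove() = default;
    SafeMove(const SafeMove&) : made_by_move(false) {}
    SafeMove(SafeMove&&) noexcept : made_by_move(true) {}   // promises not to throw
};

struct RiskyMove {
    bool made_by_move = false;
    RiskyMove() = default;
    RiskyMove(const RiskyMove&) : made_by_move(false) {}
    RiskyMove(RiskyMove&&) : made_by_move(true) {}           // not noexcept: may throw
};

// Constructs a T from std::move_if_noexcept(source) and reports whether
// the move constructor (true) or the copy constructor (false) was chosen.
template <typename T>
bool relocated_by_move() {
    T source;
    T dest(std::move_if_noexcept(source));
    return dest.made_by_move;
}
```

Because `RiskyMove` is copied rather than moved, `source` stays intact if construction fails, which is what preserves the strong exception guarantee.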

    1. std::move can be used whenever we want to treat an l-value like an r-value for the purpose of invoking move semantics instead of copy semantics.

      When can std::move be used?

    1. the goal of the move constructor and move assignment is to move ownership of the resources from one object to another (which is typically much less expensive than making a copy).

      What is the purpose of the move constructor and move assignment?

    2. By default, C++ will provide a copy constructor and copy assignment operator if one is not explicitly provided. These compiler-provided functions do shallow copies, which may cause problems for classes that allocate dynamic memory. So classes that deal with dynamic memory should override these functions to do deep copies.

      What kind of copy constructor does C++ provide by default, and what problems can it cause?
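
As a sketch of the difference between copying and moving a class that owns dynamic memory, consider the hypothetical `Buffer` below. The copy operations allocate new storage (deep copy), while the move operations only hand over the pointer. The two free functions are small checks written for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

class Buffer {
public:
    explicit Buffer(std::size_t n) : size_(n), data_(new int[n]()) {}
    ~Buffer() { delete[] data_; }

    // Deep copy: allocate separate storage and copy the elements.
    Buffer(const Buffer& other) : size_(other.size_), data_(new int[other.size_]) {
        std::copy(other.data_, other.data_ + size_, data_);
    }
    Buffer& operator=(const Buffer& other) {
        if (this != &other) { Buffer tmp(other); swap(tmp); }
        return *this;
    }

    // Move: take over the storage and leave the source empty.
    Buffer(Buffer&& other) noexcept : size_(other.size_), data_(other.data_) {
        other.size_ = 0;
        other.data_ = nullptr;
    }
    Buffer& operator=(Buffer&& other) noexcept {
        if (this != &other) { Buffer tmp(std::move(other)); swap(tmp); }
        return *this;
    }

    int& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
    const int* data() const { return data_; }

private:
    void swap(Buffer& o) noexcept { std::swap(size_, o.size_); std::swap(data_, o.data_); }
    std::size_t size_;
    int* data_;
};

// A copy must not share storage with its source.
bool copy_is_independent() {
    Buffer a(3);
    a[0] = 1;
    Buffer b = a;
    b[0] = 9;
    return a[0] == 1 && a.data() != b.data();
}

// A move hands the original storage to the destination.
bool move_steals_storage() {
    Buffer a(3);
    const int* original = a.data();
    Buffer b = std::move(a);
    return b.data() == original && a.data() == nullptr && a.size() == 0;
}
```

With the compiler-generated shallow copy, `a` and `b` would point at the same array and both destructors would delete it.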

    1. First, r-value references extend the lifespan of the object they are initialized with to the lifespan of the r-value reference (l-value references to const objects can do this too). Second, non-const r-value references allow you to modify the r-value!

      Which properties of r-value references are especially useful?

  4. Mar 2022
    1. Move semantics means the class will transfer ownership of the object rather than making a copy.

      What does move semantics mean?

    2. A Smart pointer is a composition class that is designed to manage dynamically allocated memory and ensure that memory gets deleted when the smart pointer object goes out of scope.

      What is a smart pointer, and what are its benefits?

    1. 1. Multiple strong symbols are not allowed
          ○ Each item can be defined only once
       2. Given a strong symbol and multiple weak symbols, choose the strong symbol
          ○ References to the weak symbol resolve to the strong symbol
       3. If there are multiple weak symbols, pick an arbitrary one

      How does the linker resolve duplicate symbol definitions?

    2. ● Relocatable object file (.o file)
         ○ Code and data that can be combined with other relocatable object files to form executable object file
           ■ Each .o file is produced from exactly one source (.c) file
       ● Executable object file (a.out file)
         ○ Code and data that can be copied directly into memory and then executed
       ● Shared object file (.so file)
         ○ Special type of relocatable object file that can be loaded into memory and linked dynamically, at either load time or run-time

      What types of object files are there after compilation?

    3. ● Static linking
         ○ Executable files and running memory images contain only the library code they actually use
       ● Dynamic linking
         ○ Executable files contain no library code
         ○ During execution, single copy of library code can be shared across all executing processes

      What are static linking and dynamic linking?

    4. ● Modularity
         ○ Program can be written as a collection of smaller source files, rather than one monolithic mass.
       ● Efficiency
         ○ Time: Separate compilation
           ■ Change one source file, compile, and then relink. No need to recompile other source files.
         ○ Space: Libraries
           ■ Common functions can be aggregated into a single file...

      What are the benefits of a linker?

    5. ● Global symbols
         ○ Symbols defined by module m that can be referenced by other modules.
           ■ e.g., non-static C functions and non-static global variables.
       ● External symbols
         ○ Global symbols that are referenced by module m but defined by some other module.
       ● Local symbols
         ○ Symbols that are defined and referenced exclusively by module m.
           ■ e.g., C functions and global variables defined with the static attribute.
         ○ Local linker symbols are not local program variables

      What kinds of linker symbols are there?

    6. ● Symbol resolution
         ○ Programs define and reference symbols (global variables and functions)
         ○ Linker associates each symbol reference with exactly 1 symbol definition
       ● Relocation
         ○ Merges separate code and data sections into single sections
         ○ Relocates symbols from relative locations in .o files to final memory locations
         ○ Updates all references to symbols to reflect new positions

      What exactly does the linker do?

    7. ● Aggregates multiple independently compiled files containing machine code
       ● Fills in those unknown addresses
       ● The goal is to create 1 file with all of the needed code to run the program

      What steps does the linker go through?

    8. ○ This changes the format and structure of the code but preserves the semantics (what it does)
       ○ Can change lots of details for optimization, as long as the overall effect is the same

      What does the compiler stage do?

    9. ● Processes #include, #define, #if, macros
         ○ Combines main source file with headers (textually)
         ○ Defines and expands macros (token-based shorthand)
         ○ Conditionally removes parts of the code (e.g. specialize for Linux, Mac, ...)
       ● Removes all comments

      What does the preprocessor stage do?

    10. Four steps for C: preprocessing, compiling, assembling, linking

      What are the four steps of compiling code?

    1. Restrictive placement policies of this kind lead to a type of miss known as a conflict miss, in which the cache is large enough to hold the referenced data objects, but because they map to the same cache block, the cache keeps missing.

      How should a conflict miss be understood?

    2. When the size of the working set exceeds the size of the cache, the cache will experience what are known as capacity misses.

      What is a capacity miss?

    3. For caches high in the memory hierarchy (close to the CPU) that are implemented in hardware and where speed is at a premium, this policy is usually too expensive to implement because randomly placed blocks are expensive to locate.

      Why shouldn't caches high in the memory hierarchy implement the most flexible placement policy?

    4. The decision about which block to replace is governed by the cache’s replacement policy.

      What needs to happen when a cache miss occurs, and what options are there?

    5. a program needs a particular data object d from level k + 1, it first looks for d in one of the blocks currently stored at level k. If d happens to be cached at level k, then we have what is called a cache hit.

      What is a cache hit? What is a cache miss?

    6. It is important to realize that while the block size is fixed between any particular pair of adjacent levels in the hierarchy, other pairs of levels can have different block sizes.

      What is notable about block sizes between levels of the memory hierarchy?

    7. The central idea of a memory hierarchy is that for each k, the faster and smaller storage device at level k serves as a cache for the larger and slower storage device

      What is the central idea of the memory hierarchy, and how should it be understood?

    8. ● Programs that repeatedly reference the same variables enjoy good temporal locality.
       ● For programs with stride-k reference patterns, the smaller the stride, the better the spatial locality. Programs with stride-1 reference patterns have good spatial locality. Programs that hop around memory with large strides have poor spatial locality.
       ● Loops have good temporal and spatial locality with respect to instruction fetches. The smaller the loop body and the greater the number of loop iterations, the better the locality.

      What are the key takeaways about locality?

    9. Visiting every kth element of a contiguous vector is called a stride-k reference pattern. Stride-1 reference patterns are a common and important source of spatial locality in programs. In general, as the stride increases, the spatial locality decreases.

      What is a stride-k reference pattern?

    10. Their alignment rule is based on the principle that any primitive object of K bytes must have an address that is a multiple of K.

      What is the principle behind data alignment?
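
The rule can be checked directly on a struct. In the sketch below, `Record`, `is_aligned`, and `record_members_are_aligned` are hypothetical names. The checks use `alignof` rather than `sizeof`, because the alignment a compiler actually requires is platform-specific; on x86-64 it coincides with the size for the primitive types used here.

```cpp
#include <cstddef>
#include <cstdint>

// The compiler inserts padding after `tag` and `flag` so that
// `count` and `weight` start at suitably aligned addresses.
struct Record {
    char tag;
    std::int32_t count;
    char flag;
    double weight;
};

// True if the address p is a multiple of k.
bool is_aligned(const void* p, std::size_t k) {
    return reinterpret_cast<std::uintptr_t>(p) % k == 0;
}

bool record_members_are_aligned() {
    Record r{};
    return is_aligned(&r.count, alignof(std::int32_t))
        && is_aligned(&r.weight, alignof(double))
        && sizeof(Record) % alignof(Record) == 0;   // keeps array elements aligned too
}
```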

    11. The disadvantage of the two-dimensional array organization is that addresses must be sent in two distinct steps, which increases the access time.

      What is the disadvantage of the two-dimensional array organization?

    12. One reason circuit designers organize DRAMs as two-dimensional arrays instead of linear arrays is to reduce the number of address pins on the chip.

      Why are DRAMs designed as two-dimensional arrays?

    13. The memory system must periodically refresh every bit of memory by reading it out and then rewriting it.

      DRAM cells are unstable; how does the computer keep their contents from changing?

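    1. The initial size is defined by the macro LEPT_PARSE_STACK_INIT_SIZE. The advantage of the #ifndef X #define X ... #endif pattern is that users can set the macro themselves through compiler options; if they do not, the default value is used.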
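
The pattern can be sketched as follows. The default value 256 and the helper `initial_stack_size` are assumptions made for this example, not taken from the quoted text.

```cpp
#include <cstddef>

// Use the value supplied on the command line if there is one,
// otherwise fall back to a default (256 is an assumed value).
#ifndef LEPT_PARSE_STACK_INIT_SIZE
#define LEPT_PARSE_STACK_INIT_SIZE 256
#endif

// Hypothetical helper that exposes the configured size.
constexpr std::size_t initial_stack_size() {
    return LEPT_PARSE_STACK_INIT_SIZE;
}
```

With GCC or Clang, a user could override the default by compiling with `-DLEPT_PARSE_STACK_INIT_SIZE=1024`.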
  5. Feb 2022
    1. The %rip register on x86-64 is a special-purpose register that always holds the memory address of the next instruction to execute in the program's code segment.

      What does %rip do?

    1. • %rax: return value
       • %rsp: stack pointer
       • %rdi: 1st argument
       • %rsi: 2nd argument
       • %rdx: 3rd argument
       • %rcx: 4th argument
       • %r8: 5th argument
       • %r9: 6th argument

      Which registers are commonly used and important?

    1. To manage a variable-size stack frame, x86-64 code uses register %rbp to serve as a frame pointer

      When is a frame pointer used?

    2. The techniques we have outlined (randomization, stack protection, and limiting which portions of memory can hold executable code) are three of the most common mechanisms used to minimize the vulnerability of programs to buffer overflow attacks


    3. The array elements are ordered in memory in row-major order, meaning all elements of row 0, which can be written A[0], followed by all elements of row 1 (A[1]), and so on.

      In what order are array elements laid out in memory?

    4. The final example shows that one can compute the difference of two pointers within the same data structure, with the result being data having type long and value equal to the difference of the two addresses divided by the size of the data type.

      How is the difference of two pointers computed?

    5. if p is a pointer to data of type T, and the value of p is x_p, then the expression p+i has value x_p + L · i, where L is the size of data type T
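
Both pointer rules, the scaled addition and the scaled difference, can be checked with a few lines. This is a sketch that assumes a flat address space (as on x86-64), where converting a pointer to `std::uintptr_t` yields its numeric address; the function name is hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

bool pointer_rules_hold() {
    long a[8] = {};
    long* p = a;
    const std::ptrdiff_t i = 3;
    long* q = p + i;                                // points at a[3]

    auto xp = reinterpret_cast<std::uintptr_t>(p);  // numeric address of p
    auto xq = reinterpret_cast<std::uintptr_t>(q);  // numeric address of p + i

    // p + i advances the address by i times the element size.
    bool address_rule = (xq - xp) == static_cast<std::uintptr_t>(i) * sizeof(long);
    // q - p is the address difference divided by the element size.
    bool difference_rule = (q - p) == i;
    return address_rule && difference_rule;
}
```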


    6. convention, registers %rbx, %rbp, and %r12–%r15 are classified as callee-saved registers. When procedure P calls procedure Q, Q must preserve the values of these registers, ensuring that they have the same values when Q returns to P as they did when Q was called

      What are callee-saved registers for, and how should they be understood?

    7. At times, however, local data must be stored in memory. Common cases of this include these:
       ● There are not enough registers to hold all of the local data.
       ● The address operator ‘&’ is applied to a local variable, and hence we must be able to generate an address for it.
       ● Some of the local variables are arrays or structures and hence must be accessed by array or structure references.

      When must local data be stored in memory?

  6. Jan 2022
    1. When an x86-64 procedure requires storage beyond what it can hold in registers, it allocates space on the stack. This region is referred to as the procedure’s stack frame.

      What is a stack frame?

    2. The advantage of using a jump table over a long sequence of if-else statements is that the time taken to perform the switch is independent of the number of switch cases.

      What is the advantage of a jump table over if-else?

    3. If one of those two expressions could possibly generate an error condition or a side effect, this could lead to invalid behavior. Such is the case for our earlier example

      In what situations must branching be used instead of a conditional move?

    4. The test instructions behave in the same manner as the and instructions, except that they set the condition codes without altering their destinations.

      What do the test instructions do?

    5. The cmp instructions set the condition codes according to the differences of their two operands. They behave in the same way as the sub instructions, except that they set the condition codes without updating their destinations.

      What do the cmp instructions do?

    6. By using a PC-relative encoding of the jump targets, the instructions can be compactly encoded (requiring just 2 bytes), and the object code can be shifted to different positions in memory without alteration.

      How is a PC-relative encoding computed, and what are its advantages?

    7. It is important to recognize that the suffixes for these instructions denote different conditions and not different operand sizes. For example, instructions setl and setb denote “set less” and “set below,” not “set long word” or “set byte.”

      What do the suffixes of the set instructions mean?

  7. Dec 2021
    1. one for unsigned (mulq) and one for two’s-complement (imulq) multiplication. For both of these instructions, one argument must be in register %rax, and the other is given as the instruction source operand.

      What do mulq and imulq denote, and what are the requirements on their operands?

    2. The different shift instructions can specify the shift amount either as an immediate value or with the single-byte register %cl.

      Which operands can the shift instructions accept for the shift amount?

  8. Nov 2021
    1. As with the mov instructions, the two operands cannot both be memory locations.

      Can both operands of a binary operation be memory locations?

    2. This operand can be either a register or a memory location.

      What can the operand of a unary operation be?

    3. The destination operand must be a register.

      What must the destination of load effective address be?

    4. The ability of the leaq instruction to perform addition and limited forms of multiplication proves useful when compiling simple arithmetic expressions such as this example.

      When is leaq useful?

    5. local variables such as x are often kept in registers rather than stored in memory locations. Register access is much faster than memory access.

      Where are local variables usually stored, and why?

    6. we see that what we call “pointers” in C are simply addresses. Dereferencing a pointer involves copying that pointer into a register, and then using this register in a memory reference.

      How is dereferencing a pointer implemented in assembly code?

    7. One important feature is that memory references in x86-64 are always given with quad word registers, such as %rax, even if the operand is a byte, single word, or double word.

      Which kind of register is used for memory references?

    8. logically be named movzlq, but this instruction does not exist. Instead, this type of data movement can be implemented using a movl instruction having a register as the destination. This technique takes advantage of the property that an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros.

      Why is movzlq missing from the movz instructions?

    9. in memory, to a register destination. Instructions in the movz class fill out the remaining bytes of the destination with zeros, while those in the movs class fill them out by sign extension, replicating copies of the most significant bit of the source operand.

      Which two kinds of move instructions copy a smaller source to a larger destination, and how does each work?
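
At the source level, the two kinds of extension correspond to widening an unsigned versus a signed value. The sketch below shows the effect on the bits; on x86-64 a compiler would typically implement these conversions with movz-class and movs-class instructions respectively, though the exact instruction chosen is up to the compiler. The function names are hypothetical.

```cpp
#include <cstdint>

// Widening an unsigned byte: the upper 56 bits become zeros.
constexpr std::uint64_t zero_extend_byte(std::uint8_t b) {
    return static_cast<std::uint64_t>(b);
}

// Widening a signed byte: the upper 56 bits become copies of the sign bit.
constexpr std::int64_t sign_extend_byte(std::int8_t b) {
    return static_cast<std::int64_t>(b);
}
```

The byte 0xF0 zero-extends to 0x00000000000000F0, while the same bit pattern read as a signed byte (-16) sign-extends to 0xFFFFFFFFFFFFFFF0.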

    10. The source operand designates a value that is immediate, stored in a register, or stored in memory. The destination operand designates a location that is either a register or a memory address. x86-64 imposes the restriction that a move instruction cannot have both operands refer to memory locations. Copying a value from one memory location to another requires two instructions: the first to load the source value into a register, and the second to write this register value to the destination.

      What types can the source and destination operands of a move be?

    11. The most general form is shown at the bottom of the table with syntax Imm(r_b, r_i, s). Such a reference has four components: an immediate offset Imm, a base register r_b, an index register r_i, and a scale factor s, where s must be 1, 2, 4, or 8. Both the base and index must be 64-bit registers. The effective address is computed as Imm + R[r_b] + R[r_i] · s.

      How is the address for $$Imm(r_b, r_i, s)$$ computed, and what restrictions apply?
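
The address computation can be written out as a small function. This is only an arithmetic model of the formula; the function name and parameter names are chosen for the example, with `base` and `index` standing for the current contents of the two registers.

```cpp
#include <cassert>
#include <cstdint>

// Effective address for the operand form Imm(rb, ri, s):
// Imm + R[rb] + R[ri] * s, with 64-bit wraparound arithmetic.
std::uint64_t effective_address(std::int64_t imm, std::uint64_t base,
                                std::uint64_t index, unsigned scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);  // only legal scale factors
    return static_cast<std::uint64_t>(imm) + base + index * scale;
}
```

For instance, with a base register holding 0x100, an index register holding 3, a scale of 4, and an offset of 9, the address is 0x100 + 12 + 9 = 0x115.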

    12. C declaration Intel data type Assembly-code suffix Size (bytes)

      What are the sizes of the different data types, and what are their suffixes in assembly?

    13. A final difference is that we see two additional lines of code (lines 8–9). These instructions will have no effect on the program, since they occur after the return instruction (line 7). They have been inserted to grow the code for the function to 16 bytes, enabling a better placement of the next block of code in terms of memory system performance.

      Why does disassembled code sometimes contain nop padding after ret?

    14. Its main feature is that it is in a more readable textual format, as compared to the binary format of machine code.

      What is the biggest difference between assembly code and machine code?

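    1. The reinterpret_cast operator does not change the value of the operand in the parentheses; instead, it reinterprets the object at the level of its bit pattern.

      How should reinterpret_cast be understood in C++?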
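
A small sketch of "reinterpreting the bit pattern": viewing the bytes of an integer through an `unsigned char` pointer, which is one of the uses of `reinterpret_cast` that the language explicitly permits. The function name is hypothetical, and the byte that comes back depends on the machine's byte order.

```cpp
#include <cstdint>

// Returns the byte stored at the lowest address of v.
// The object itself is not modified; only the way its bytes are read changes.
unsigned first_byte_of(const std::uint32_t& v) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&v);
    return bytes[0];
}
```

For the value 0x11223344 this yields 0x44 on a little-endian machine such as x86-64 and 0x11 on a big-endian one.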

    1. A namespace is a scope. C++ provides namespaces to prevent name conflicts.

      What is a namespace for?

    1. But the other effect of unnamed namespaces is that all identifiers inside an unnamed namespace are treated as if they had internal linkage, which means that the content of an unnamed namespace can’t be seen outside of the file in which the unnamed namespace is defined.

      What is an unnamed namespace for?

    1. One of the best things about classes is that they contain destructors that automatically get executed when an object of the class goes out of scope. So if you allocate (or acquire) memory in your constructor, you can deallocate it in your destructor, and be guaranteed that the memory will be deallocated when the class object is destroyed (regardless of whether it goes out of scope, gets explicitly deleted, etc…).

      What is the principle behind smart pointers?

    1. Three techniques to avoid losing critical information at half-precision:
       ● Full-precision master copy of weights. Maintain a full precision (FP32) copy of model weights that accumulates gradients. The numbers are rounded up to half-precision for forward & backward passes. The motivation is that each gradient update (i.e. gradient times the learning rate) might be too small to be fully contained within the FP16 range (i.e. $$2^{-24}$$ becomes zero in FP16).
       ● Loss scaling. Scale up the loss to better handle gradients with small magnitudes (See Fig. 16). Scaling up the gradients helps shift them to occupy a larger section towards the right section (containing larger values) of the representable range, preserving values that are otherwise lost.
       ● Arithmetic precision. For common network arithmetic (e.g. vector dot-product, reduction by summing up vector elements), we can accumulate the partial results in FP32 and then save the final output as FP16 before saving into memory. Point-wise operations can be executed in either FP16 or FP32.


    2. Two major sources of memory consumption in large model training:
       ● The majority is occupied by model states, including optimizer states (e.g. Adam momentums and variances), gradients and parameters. Mixed-precision training demands a lot of memory since the optimizer needs to keep a copy of FP32 parameters and other optimizer states, besides the FP16 version.
       ● The remainder is consumed by activations, temporary buffers and unusable fragmented memory (named residual states in the paper).


    3. It partitions optimizer state, gradients and parameters across multiple data parallel processes via a dynamic communication schedule to minimize the communication volume.

      How does ZeRO-DP work?

    4. Asynchronous parallel (ASP): Every GPU worker processes the data asynchronously, no waiting or stalling. However, it can easily lead to stale weights being used and thus lower the statistical learning efficiency. Even though it increases the computation time, it may not speed up training time to convergence.

      What is ASP, and what are its pros and cons?

    5. Bulk synchronous parallel (BSP): Workers sync data at the end of every minibatch. It prevents model weight staleness and gives good learning efficiency, but each machine has to halt and wait for others to send gradients.

      What is BSP, and what are its pros and cons?