8 Matching Annotations
  1. Last 7 days
    1. We split the residual of the space–time reconstruction into hyperbolic and parabolic contributions and treat them in different norms.

      Splitting the residual into a "hyperbolic part" and a "parabolic part" and measuring them in different norms looks like a trivial trick, but it is the single most important engineering decision in the paper: without the splitting, the estimator would contain terms of order ε⁻¹ and become useless in the convection-dominated regime. This kind of norm-splitting strategy is a deep device in PDE analysis, because the physical character of the problem (hyperbolic vs. parabolic) dictates the function space in which the error should be measured.

  2. Apr 2023
  3. Feb 2023
  4. Jan 2023
    1. One of the main features of the high-level architecture of a transformer is that each layer adds its results into what we call the “residual stream.” Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture. In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
    2. A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
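
      The read/write pattern these two annotations describe can be sketched in a few lines of numpy. This is a minimal illustration under assumptions of my own (single attention head, ReLU MLP, arbitrary small shapes), not the exact architecture from the quoted paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 8, 4  # illustrative sizes, not from the source

def attention_layer(x, W_q, W_k, W_v, W_o):
    # "Read" from the residual stream via linear projections.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # "Write" back by projecting the head output into the stream's basis.
    return (weights @ v) @ W_o

def mlp_layer(x, W_in, W_out):
    # Read (W_in), nonlinearity, write (W_out).
    return np.maximum(x @ W_in, 0.0) @ W_out

def residual_block(x, params):
    # Each layer ADDS its result into the residual stream;
    # the stream itself does no processing.
    x = x + attention_layer(x, *params["attn"])
    x = x + mlp_layer(x, *params["mlp"])
    return x

params = {
    "attn": [rng.normal(size=s) for s in
             [(d_model, d_head)] * 3 + [(d_head, d_model)]],
    "mlp": [rng.normal(size=(d_model, 4 * d_model)),
            rng.normal(size=(4 * d_model, d_model))],
}
stream = rng.normal(size=(seq, d_model))  # token embeddings enter the stream
out = residual_block(stream, params)
assert out.shape == stream.shape  # the stream's shape is invariant across blocks
```

      The key point the sketch makes concrete: every layer reads and writes through linear maps, so the residual stream acts purely as an additive communication channel between layers.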
  5. Feb 2019