8 Matching Annotations
  1. Last 7 days
    1. We split the residual of the space–time reconstruction into hyperbolic and parabolic contributions and treat them in different norms.

      Splitting the residual into a "hyperbolic part" and a "parabolic part" and measuring them in different norms looks like a trivial trick, but it is the single most important engineering decision in the paper: without the splitting, the estimator would contain terms of order ε⁻¹ and become useless in the convection-dominated regime. This kind of norm-splitting strategy is a deep device in PDE analysis, because the physical character of the problem (hyperbolic vs. parabolic) dictates the function space in which the error should be measured.

  2. Apr 2023
  3. Feb 2023
  4. Jan 2023
    1. One of the main features of the high-level architecture of a transformer is that each layer adds its results into what we call the “residual stream.” Constructing models with a residual stream traces back to early work by the Schmidhuber group, such as highway networks and LSTMs, which have found significant modern success in the more recent residual network architecture. In transformers, the residual stream vectors are often called the “embedding.” We prefer the residual stream terminology, both because it emphasizes the residual nature (which we believe to be important) and also because we believe the residual stream often dedicates subspaces to tokens other than the present token, breaking the intuitions the embedding terminology suggests. The residual stream is simply the sum of the output of all the previous layers and the original embedding. We generally think of the residual stream as a communication channel, since it doesn't do any processing itself and all layers communicate through it.
    2. A transformer starts with a token embedding, followed by a series of “residual blocks”, and finally a token unembedding. Each residual block consists of an attention layer, followed by an MLP layer. Both the attention and MLP layers each “read” their input from the residual stream (by performing a linear projection), and then “write” their result to the residual stream by adding a linear projection back in. Each attention layer consists of multiple heads, which operate in parallel.
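
      The read/write pattern these two annotations describe can be sketched in a few lines of numpy. This is a minimal illustration under assumptions of my own (single attention head, ReLU MLP, arbitrary small shapes), not the exact architecture from the quoted paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, seq = 16, 8, 4  # illustrative sizes, not from the source

def attention_layer(x, W_q, W_k, W_v, W_o):
    # "Read" from the residual stream via linear projections.
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d_head)
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    # "Write" back by projecting the head output into the stream's basis.
    return (weights @ v) @ W_o

def mlp_layer(x, W_in, W_out):
    # Read (W_in), nonlinearity, write (W_out).
    return np.maximum(x @ W_in, 0.0) @ W_out

def residual_block(x, params):
    # Each layer ADDS its result into the residual stream;
    # the stream itself does no processing.
    x = x + attention_layer(x, *params["attn"])
    x = x + mlp_layer(x, *params["mlp"])
    return x

params = {
    "attn": [rng.normal(size=s) for s in
             [(d_model, d_head)] * 3 + [(d_head, d_model)]],
    "mlp": [rng.normal(size=(d_model, 4 * d_model)),
            rng.normal(size=(4 * d_model, d_model))],
}
stream = rng.normal(size=(seq, d_model))  # token embeddings enter the stream
out = residual_block(stream, params)
assert out.shape == stream.shape  # the stream's shape is invariant across blocks
```

      The key point the sketch makes concrete: every layer reads and writes through linear maps, so the residual stream acts purely as an additive communication channel between layers.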
  5. Feb 2019