18 Matching Annotations
  1. Jun 2024
    1. In general, if one projects a vector into a lower-dimensional space, one can't reconstruct the original vector. However, this changes if one knows that the original vector is sparse. In this case, it is often possible to recover the original vector.

      What I don't really understand: why does it matter that you want to recover the original vector?
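
      A minimal numpy/scikit-learn sketch of the claim (the dimensions and the use of orthogonal matching pursuit are my own choices, not from the paper): a k-sparse vector projected down by a random matrix can still be recovered from the low-dimensional projection.

      ```python
      # Sketch: recovering a sparse vector from a random low-dimensional projection.
      import numpy as np
      from sklearn.linear_model import OrthogonalMatchingPursuit

      rng = np.random.default_rng(0)
      n, m, k = 200, 60, 5                 # original dim, projected dim, nonzeros

      # A sparse vector: only k of n entries are nonzero.
      x = np.zeros(n)
      support = rng.choice(n, size=k, replace=False)
      x[support] = rng.normal(size=k)

      # Project into the lower-dimensional space with a random matrix.
      P = rng.normal(size=(m, n)) / np.sqrt(m)
      y = P @ x                            # all we get to observe

      # Sparse recovery: find the sparsest x_hat consistent with y = P @ x_hat.
      omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False).fit(P, y)
      x_hat = omp.coef_

      print("relative recovery error:", np.linalg.norm(x_hat - x) / np.linalg.norm(x))
      ```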

    2. But for neural networks with highly sparse features, this cost may be outweighed by the benefit of being able to represent more features

      Maybe this is circular, but isn't it more that features are precisely those things which lend themselves to a sparse representation?

    3. Although it's only possible to have n orthogonal vectors in an n-dimensional space, it's possible to have \exp(n) many "almost orthogonal" (<\epsilon cosine similarity) vectors in high-dimensional spaces. See the Johnson–Lindenstrauss lemma.

      This really is key (and also a key result used in compressed sensing).
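
      A quick empirical illustration of the "almost orthogonal" point (the sizes below are arbitrary): sample random unit vectors and watch the worst-case pairwise cosine similarity shrink as the dimension grows.

      ```python
      # Sketch: random unit vectors in high dimensions are nearly orthogonal.
      import numpy as np

      rng = np.random.default_rng(0)
      num_vectors = 1000

      for d in (10, 100, 1000, 10000):
          V = rng.normal(size=(num_vectors, d))
          V /= np.linalg.norm(V, axis=1, keepdims=True)    # unit vectors
          cos = V @ V.T                                    # pairwise cosine similarities
          np.fill_diagonal(cos, 0.0)                       # ignore self-similarity
          print(f"d={d:>5}  max |cosine similarity| = {np.abs(cos).max():.3f}")
      ```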

    4. Statistical Efficiency. Representing features as different directions may allow non-local generalization in models with linear transformations (such as the weights of neural nets), increasing their statistical efficiency relative to models which can only locally generalize. This view is especially advocated in some of Bengio's writing (e.g. ). A more accessible argument can be found in this blog post.

      I mean that the algorithm should be able to provide good generalizations even for inputs that are far from those it has seen during training. It should be able to generalize to new combinations of the underlying concepts that explain the data. Nearest-neighbor methods and related ones like kernel SVMs and decision trees can only generalize in some neighborhood around the training examples, in a way that is simple (like linear interpolation or linear extrapolation). Because the number of possible configurations of the underlying concepts that explain the data is exponentially large, this kind of generalization is good but not sufficient at all. Non-local generalization refers to the ability to generalize to a huge space of possible configurations of the underlying causes of the data, potentially very far from the observed data, going beyond linear combinations of training examples that have been seen in the neighborhood of the given input.

      Via Bengio himself.

    5. Linear representations make features "linearly accessible." A typical neural network layer is a linear function followed by a non-linearity. If a feature in the previous layer is represented linearly, a neuron in the next layer can "select it" and have it consistently excite or inhibit that neuron. If a feature were represented non-linearly, the model would not be able to do this in a single step.

      Not sure I understand this?
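
      A toy sketch of what "linearly accessible" might buy you (the vectors here are made up, not from the paper): if a feature is a direction W_f in the previous layer's activations, a next-layer neuron whose weights align with W_f reads the feature's value out in a single linear step plus ReLU.

      ```python
      # Sketch: a next-layer neuron "selecting" a linearly represented feature.
      import numpy as np

      rng = np.random.default_rng(0)
      d = 50
      W_f = rng.normal(size=d)
      W_f /= np.linalg.norm(W_f)              # the feature's representation direction

      def neuron(x, w, b=0.0):
          """One unit of a typical layer: linear function followed by a ReLU."""
          return max(0.0, w @ x + b)

      for f_value in (0.0, 0.5, 2.0):
          x = f_value * W_f + 0.05 * rng.normal(size=d)    # feature plus a little noise
          print(f"feature value {f_value:.1f} -> activation {neuron(x, W_f):.2f}")
      ```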

    6. In a linear representation, each feature f_i has a corresponding representation direction W_i. The presence of multiple features f_1, f_2… activating with values x_{f_1}, x_{f_2}… is represented by x_{f_1}W_{f_1} + x_{f_2}W_{f_2}.... To be clear, the features being represented are almost certainly nonlinear functions of the input. It's only the map from features to activation vectors which is linear. Note that whether something is a linear representation depends on what you consider to be the features.

      Again, contrast with ICA, which has the same linear representation...
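
      A companion sketch of the linear representation itself (random, nearly orthogonal directions standing in for the W_i; this is an illustration, not the paper's setup): activations are the sum x_{f_1}W_{f_1} + x_{f_2}W_{f_2} + ..., and projecting back onto each W_i approximately recovers x_{f_i}.

      ```python
      # Sketch: features as directions, activations as their weighted sum.
      import numpy as np

      rng = np.random.default_rng(1)
      d, n_features = 100, 20
      W = rng.normal(size=(n_features, d))
      W /= np.linalg.norm(W, axis=1, keepdims=True)       # one direction per feature

      f = np.zeros(n_features)                            # sparse feature values
      active = rng.choice(n_features, size=3, replace=False)
      f[active] = rng.uniform(1.0, 2.0, size=3)

      activation = f @ W                                  # sum_i f_i * W_i

      # Readout is approximate because the directions are only almost orthogonal.
      f_hat = W @ activation                              # project back onto each W_i
      print("max readout error:", np.abs(f_hat - f).max())
      ```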

    7. Examples of interpretable neurons are also cases of features as directions, since the amount a neuron activates corresponds to a basis direction in the representation

      not immediately obvious, but of course. i think it's because i always forget what exactly a "neuron" is. it's just a vector that you are inner-product-ing with the input vector. so clearly it's just a "direction"

    8. A final approach is to define features as properties of the input which a sufficiently large neural network will reliably dedicate a neuron to representing. This definition is trickier than it seems. Specifically, something is a feature if there exists a large enough model size such that it gets a dedicated neuron. This creates a kind of "epsilon-delta"-like definition. Our present understanding – as we'll see in later sections – is that arbitrarily large models can still have a large fraction of their features be in superposition. However, for any given feature, assuming the feature importance curve isn't flat, it should eventually be given a dedicated neuron. This definition can be helpful in saying that something is a feature – curve detectors are a feature because you find them across a range of models larger than some minimal size – but unhelpful for the much more common case of features we only hypothesize about or observe in superposition. For example, curve detectors appear to reliably occur across sufficiently sophisticated vision models, and so are a feature. For interpretable properties which we presently only observe in polysemantic neurons, the hope is that a sufficiently large model would dedicate a neuron to them. This definition is slightly circular, but avoids the issues with the earlier ones.

      One motivation for this work: how do you even define a "feature"?

      We want there to be a simple "interpretable" switch/node/neuron that represents a general concept?

    9. Decomposability: Network representations can be described in terms of independently understandable features.

      This harks back to ICA, where you have a model of statistically independent components (and it's an additive model).
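
      For concreteness, a minimal FastICA sketch of that ICA picture (the signals and mixing matrix are invented here): an additive mixture of statistically independent components is unmixed back into the components, up to scale and permutation.

      ```python
      # Sketch: ICA decomposes an additive mixture into independent components.
      import numpy as np
      from sklearn.decomposition import FastICA

      rng = np.random.default_rng(0)
      t = np.linspace(0, 8, 2000)
      sources = np.column_stack([
          np.sin(2 * t),                       # independent component 1
          np.sign(np.sin(3 * t)),              # independent component 2
          rng.laplace(size=t.size),            # independent component 3
      ])

      A = rng.normal(size=(3, 3))              # unknown mixing matrix
      X = sources @ A.T                        # observed additive mixture

      ica = FastICA(n_components=3, random_state=0)
      recovered = ica.fit_transform(X)         # components up to scale/permutation
      print("recovered components shape:", recovered.shape)
      ```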

  2. Jun 2022
  3. Jan 2020
    1. The model was decomposable into different mini-models, where each one could be understood on its own.

      Similar to my idea about multiple trees. Should follow up.

    2. linear model that approximated COMPAS and depended on race, age, and criminal history, that COMPAS itself must depend on race.

      I need to check this: I assume approximating COMPAS is equivalent to simply running the linear model. Perhaps they're trying to say something about the model itself.

      Right, they're trying to claim that COMPAS is biased (these are from the ProPublica journalists, after all). Thus, they simply fit a linear model to the (black-box) model (which, I'm pretty sure, can't be too far from fitting it to the actual data itself).
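
      A small synthetic sketch of that point (stand-in data and models, not COMPAS or ProPublica's actual analysis): fit a linear surrogate to a black box's predictions, fit another directly to the labels, and compare the coefficients.

      ```python
      # Sketch: linear surrogate of a black box vs. linear model fit to the data.
      from sklearn.datasets import make_classification
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.linear_model import LogisticRegression

      X, y = make_classification(n_samples=2000, n_features=5, random_state=0)

      black_box = GradientBoostingClassifier(random_state=0).fit(X, y)
      y_bb = black_box.predict(X)                       # the black box's predictions

      surrogate = LogisticRegression().fit(X, y_bb)     # linear model mimicking the box
      direct = LogisticRegression().fit(X, y)           # linear model fit to the data

      print("surrogate coefficients:", surrogate.coef_.round(2))
      print("direct coefficients:   ", direct.coef_.round(2))
      ```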

    3. Such explanations usually try to either mimic the black box’s predictions using an entirely different model (perhaps with different important variables, masking what the black box might actually be doing), or they provide another statistic that yields incomplete information about the calculation of the black box. Such explanations are shallow, or even hollow, since they extend the authority of the black box rather than recognizing it is not necessary.

      Not particularly convinced by this point – building an interpretable model over a black box or over the original data doesn't really make that much of a difference.

    4. The full machine learning model is as follows: if the person has either >3 prior crimes, or is 18–20 years old and male, or is 21–23 years old and has two or three prior crimes, they are predicted to be rearrested within two years from their evaluation, and otherwise not

      Examples can abound (and in this case, it is a good one), but I'm always wary of any finagling. While the application is very important, the data itself is not particularly complicated (very little in the way of signal).

      So, I think it's an important point to make, but at the same time, it really depends on the data.
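
      The quoted rule list is small enough to write out directly (my transcription of the quote above; variable names are mine):

      ```python
      def predicted_rearrest(prior_crimes: int, age: int, male: bool) -> bool:
          """The quoted rule list: predict rearrest within two years of evaluation."""
          if prior_crimes > 3:
              return True
          if 18 <= age <= 20 and male:
              return True
          if 21 <= age <= 23 and 2 <= prior_crimes <= 3:
              return True
          return False

      print(predicted_rearrest(prior_crimes=0, age=19, male=True))    # True
      print(predicted_rearrest(prior_crimes=2, age=30, male=False))   # False
      ```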

    5. Being asked to choose an accurate machine or an understandable human is a false dichotomy.

      This is the crux of what I argued previously. I wouldn't be so confident as to say that it is a false dichotomy; rather, it's unclear whether the trade-off is inherent.