190 Matching Annotations
  1. Last 7 days
    1. the Elhage et al. (2021) study showing an information-copying role for self-attention.

      It turns out Meng does refer to induction heads, just not by name.

  2. Jan 2023
    1. The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
  3. Dec 2022
    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
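      A minimal sketch of the memory structure being hypothesized (illustrative shapes and data, not the paper's code): a single weight matrix W stores key→value associations in the Kohonen/Anderson sense, and a stored key retrieves its value with one matrix multiplication.

```python
# Sketch: a single linear layer as an optimal linear associative memory (OLAM).
import numpy as np

rng = np.random.default_rng(0)
d_k, d_v, n = 64, 32, 10          # key dim, value dim, number of stored pairs
K = rng.normal(size=(n, d_k))     # keys (rows): "meaningful context"
V = rng.normal(size=(n, d_v))     # values (rows): "determine output"

# Least-squares weights W such that K @ W ≈ V.
W, *_ = np.linalg.lstsq(K, V, rcond=None)

# Retrieval: a stored key approximately recovers its associated value.
print(np.allclose(K[3] @ W, V[3], atol=1e-6))   # True
```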
    1. AI training data is filled with racist stereotypes, pornography, and explicit images of rape, researchers Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe found after analyzing a data set similar to the one used to build Stable Diffusion.

      That is horrifying. You'd think that the authors would attempt to remove or filter this kind of material. There are, after all, models out there that are trained to find it. It makes me wonder what awful stuff is in the GPT-3 dataset too.

    1. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

      By using more data on a smaller language model, the authors were able to achieve better performance than with the larger models; this also reduces the cost of using the model for inference.

  4. Nov 2022
    1. curation filters on the recipient side, but then email spam would also be a solved problem, and I don't see that happening just yet.

      Are there projects that train models on collected spam emails?

    1. “The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”

      Interesting metaphor for why humans are happy to trust outputs from generative models

  5. Sep 2022
    1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
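      A hedged sketch of the toy setup in this quote (sizes, importances, and training details are illustrative assumptions, not the paper's exact configuration): five sparse features are embedded into two dimensions, decoded back, passed through a ReLU, and trained with importance-weighted mean squared error.

```python
# Toy superposition setup: 5 features -> 2 dims -> ReLU reconstruction,
# importance-weighted MSE, sparse inputs (all values illustrative).
import torch

n_feat, n_hidden, sparsity = 5, 2, 0.9
importance = torch.tensor([1.0, 0.7, 0.5, 0.3, 0.1])   # scalar multipliers on the loss
W = torch.randn(n_feat, n_hidden, requires_grad=True)
b = torch.zeros(n_feat, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    x = torch.rand(256, n_feat) * (torch.rand(256, n_feat) > sparsity)  # sparse features
    x_hat = torch.relu(x @ W @ W.T + b)                                 # embed, decode, filter
    loss = (importance * (x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```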
    1. The present generation of Southerners are not responsible for the past

      We can't judge or blame people based off of their ancestors' actions. In high school, I always hated that everyone knew my older siblings because it often felt like my future was already written for me even though I had not even experienced it myself yet.

    2. Haytian revolt

      We briefly touched on this in Traditions/Revolutions, and I know we will learn more about it later on in the course.

    3. his educational programme was unnecessarily narrow.

      When I was first annotating "The Education of the Negro," I also found Washington's idea of teaching industrial education singularly focused. However, towards the end of his article he made me come around to the idea because it seemed like a good way to instill a desire in students to work for themselves instead of someone else.

    4. the Free Negroes from 1830 up to war-time had striven to build industrial schools, and the American Missionary Association had from the first taught various trades; and Price and others had sought a way of honorable alliance with the best of the Southerners. But Mr. Washington first indissolubly linked these things; he put enthusiasm, unlimited energy, and perfect faith into his programme, and changed it from a by-path into a veritable Way of Life

      ML: He was not the first to come up with the idea, obviously, but he put a face on it. It seems like people, myself included, have a much easier time following something if there is a person in charge of it for them to follow.

    1. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life. They do not put into their hands the tools they are best fitted to use, and hence so many failures. Many a mother and sister have worked and slaved, living upon scanty food, in order to give a son and brother a ’liberal education,’ and in doing this have built up a barrier between the boy and the work he was fitted to do. Let me say to you that all honest work is honorable work. If the labor is manual, and seems common, you will have all the more chance to be thinking of other things, or of work that is higher and brings better pay, and to work out in your minds better and higher duties and responsibilities for yourselves, and for thinking of ways by which you can help others as well as yourselves, and bring them up to your own higher level.

      I still see this in our school systems today, especially in certain classes where you feel like you are never going to use anything that you have learned in the real world.

    2. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      When I was in high school, my mom would always say that they don't teach us some of the most important life skills in class. She was always ranting about how we should have to take a finance class to prepare for adulthood.

    3. “Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      This is still accurate for schools today. For example, in middle school we had 8 classes a day for 45 minutes each for one semester. Even though we had class every day, it was far too little time to actually learn a full subject. The teacher had to just give us a little bit of information on each topic we were supposed to cover.

  6. moodle.lynchburg.edu
    1. Uncle Bird had a small, rough farm, all woods and hills, miles from the big road; but he was full of tales

      My uncles are also full of tales that they like to share with everyone they have the chance to.

    2. willow

      I named my Jeep Willow.

    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
    2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
  7. Aug 2022
  8. Jul 2022
    1. Z-code models to improve common language understanding tasks such as named entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.

      This model is what Z-code actually is and what makes it special.

    2. have developed called Z-code, which offer the kind of performance and quality benefits that other large-scale language models have but can be run much more efficiently.

      can do the same but much faster

  9. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
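      A minimal numpy sketch of the attention step described here (shapes and weights are illustrative, not the article's code): each position's query is compared with every position's key, the scores are turned into weights, and the values are averaged with those weights.

```python
# One self-attention step: compare queries with keys, average the values.
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 16                                   # sequence length, vector width
X = rng.normal(size=(T, d))                    # one vector per (possibly masked) position
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv               # query, key, value at each position
scores = Q @ K.T / np.sqrt(d)                  # how much attention to pay to each position
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # softmax per row
out = weights @ V                              # weighted average of values
print(out.shape)                               # (6, 16): a new vector for each position
```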
    1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

      Matrix multiplication as table lookup
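      A tiny illustration of the point above (values are arbitrary): multiplying a one-hot row vector by a matrix selects exactly one row, i.e., a table lookup.

```python
# One-hot vector times a matrix = row lookup.
import numpy as np

E = np.arange(12).reshape(4, 3)     # a 4-row "table" (think: an embedding matrix)
one_hot = np.array([0, 0, 1, 0])    # selects row 2
print(one_hot @ E)                  # [6 7 8], identical to E[2]
```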

  10. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
  11. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
    1. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here
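      The parameter counts in the two examples reduce to one small formula; a quick check (the helper name is mine, not from the notes):

```python
# Per-neuron connection count in a Conv layer: local in 2D space, full along depth.
def conv_connections(field_h, field_w, in_depth, bias=0):
    return field_h * field_w * in_depth + bias

print(conv_connections(5, 5, 3, bias=1))   # Example 1: 5*5*3 + 1 = 76 (75 weights + bias)
print(conv_connections(3, 3, 20))          # Example 2: 3*3*20 = 180 connections
```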

    1. input (32x32x3): max activation 0.5, min -0.5; max gradient 1.08696, min -1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation 3.75919, min -4.48241; max gradient 0.36571, min -0.33032; parameters: 16x5x5x3+16 = 1216

      The dimensions of these first two layers are explained here

    1. Here the lower level layers are frozen and are not trained, only the new classification head will update itself to learn from the features provided from the pre-trained chopped up model on the left.
    1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

      And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)
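      A hedged sketch of the optimization loop being described (a small random network stands in for GoogLeNet's mixed4a layer; sizes and hyperparameters are illustrative):

```python
# Feature visualization by gradient ascent: optimize an image, starting from
# random noise, to activate one particular channel of a conv layer.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Conv2d(3, 16, 5, padding=2), nn.ReLU(),
                    nn.Conv2d(16, 32, 5, padding=2), nn.ReLU())   # stand-in model
unit = 11                                        # channel ("neuron") to maximize

img = torch.randn(1, 3, 64, 64, requires_grad=True)   # start from random noise
opt = torch.optim.Adam([img], lr=0.05)
for _ in range(200):
    activation = net(img)[0, unit].mean()        # mean activation of the chosen channel
    opt.zero_grad(); (-activation).backward(); opt.step()   # ascend the activation
```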

  12. Mar 2022
    1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
  13. Feb 2022
    1. Relational machine learning methods, which exploit the graph structure and in many cases deliver models of better quality.

      Relational machine learning approach

    2. In many applications, however, it is necessary not only to provide data in high quality and semantically enriched form, but also to generate new knowledge from existing information. For this we use machine learning.

      Combination with ML approaches to generate new knowledge

    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  14. Dec 2021
    1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

      How does this relate to "weighted sum shows similarity between the weights and the inputs"?
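      One way to connect the two statements (a sketch with made-up weights): the weighted sum w·x is a dot product, which is largest when the input points in the same direction as the weight vector, and the bias sets the threshold at which that similarity counts as one class rather than the other.

```python
# A single artificial neuron: dot-product similarity between weights and input,
# shifted by a bias and thresholded into one of two classes.
import numpy as np

w = np.array([0.9, -0.2, 0.4])      # learned weights (illustrative)
b = -0.1                            # bias

def neuron(x):
    return int(np.dot(w, x) + b > 0)    # 1 if the input is "similar enough" to w

print(neuron(np.array([1.0, -0.5, 0.5])))    # roughly aligned with w -> 1
print(neuron(np.array([-1.0, 0.5, -0.5])))   # anti-aligned with w -> 0
```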

    1. The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.
    1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
  15. Nov 2021
    1. Now that we've made peace with the concepts of projections (matrix multiplications)

      Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

    2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
    3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. You'll use a (70%, 20%, 10%) split for the training, validation, and test sets. Note the data is not being randomly shuffled before splitting. This is for two reasons: It ensures that chopping the data into windows of consecutive samples is still possible. It ensures that the validation/test results are more realistic, being evaluated on the data collected after the model was trained.

      Train, Validation, Test: 0.7, 0.2, 0.1
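      A minimal sketch of that split (the DataFrame here is a stand-in, not the tutorial's weather data): the slices are taken in time order, with no shuffling, so validation and test data come strictly after the training data.

```python
# (70%, 20%, 10%) split of a time series without shuffling.
import numpy as np
import pandas as pd

df = pd.DataFrame({"t": np.arange(1000), "y": np.random.randn(1000)})  # toy series
n = len(df)
train_df = df[0:int(n * 0.7)]              # first 70%, consecutive samples
val_df   = df[int(n * 0.7):int(n * 0.9)]   # next 20%
test_df  = df[int(n * 0.9):]               # last 10%: collected "after" training data
```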

    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel (see also Convolution arithmetic). A fully-connected layer computes output neurons as weighted sums of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. ReLU, first introduced by Nair and Hinton, calculates \(f(x)=\max(0,x)\) for each entry in a vector input; graphically, it is a hinge at the origin. Softmax calculates \(S(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}\) for each entry \(y_i\) in a vector input \(y\).

      This is a great visualization of MNIST hidden layers.

    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e., how relevant that word is to the Query word.

      Finally

    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

      This is first explanation of

    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code demo: for those of you who understand better through seeing the code, here is an example using Python pseudo code.
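      Since the passage promises Python pseudo code, here is a hedged sketch of one LSTM step with the three gates named above (weight shapes and names are illustrative, not the article's own demo):

```python
# One LSTM step: forget gate, input gate, output gate.
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W):
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z)      # forget gate: what to keep from prior steps
    i = sigmoid(W["i"] @ z)      # input gate: what to add from the current step
    g = np.tanh(W["g"] @ z)      # candidate values for the cell state
    o = sigmoid(W["o"] @ z)      # output gate: what the next hidden state should be
    c = f * c_prev + i * g
    h = o * np.tanh(c)
    return h, c

d_h, d_x = 4, 3
W = {k: np.random.randn(d_h, d_h + d_x) for k in "figo"}
h, c = lstm_step(np.zeros(d_x), np.zeros(d_h), np.zeros(d_h), W)
```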
  16. Oct 2021
    1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
    2. These layers warp and reshape the data to make it easier to classify.
    1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

      And here is a Google Colab notebook that demonstrates that

  17. Sep 2021
    1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well—we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

      Boy, don't they

    1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
    1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
  18. Aug 2021
    1. I'm going to try provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

      This is also good

    2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
    3. So basically: q = the vector representing a word K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
    1. Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.

      DataHub (http://datahub.io/dataset)

      World Health Organization (http://www.who.int/research/en/)

      Data.gov (http://data.gov)

      European Union Open Data Portal (http://open-data.europa.eu/en/data/)

      Amazon Web Service public datasets (http://aws.amazon.com/datasets)

      Facebook Graph (http://developers.facebook.com/docs/graph-api)

      Healthdata.gov (http://www.healthdata.gov)

      Google Trends (http://www.google.com/trends/explore)

      Google Finance (https://www.google.com/finance)

      Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

      Machine Learning Repository (http://archive.ics.uci.edu/ml/)

      As an idea of open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net ), which displays the connections of the data link among several open data sources currently available on the network (see Figure 1-3).

    1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
    2. Recursive Neural Networks
  19. Jul 2021
    1. In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

    1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
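      A two-line illustration of the difference (values are arbitrary): two vectors can point in exactly the same direction (cosine similarity 1) while still being far apart in Euclidean distance.

```python
# Euclidean distance vs. cosine similarity.
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])    # same direction as a, but farther from the origin

euclidean = np.linalg.norm(a - b)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(euclidean, cosine)          # nonzero distance, cosine similarity of 1.0
```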
    1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations:
      • DON'T use shifted PPMI with SVD.
      • DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with p = 0.5).
      • DO use PPMI and SVD with short contexts (window size of 2).
      • DO use many negative samples with SGNS.
      • DO always use context distribution smoothing (raise unigram distribution to the power of α = 0.75) for all methods.
      • DO use SGNS as a baseline (robust, fast and cheap to train).
      • DO try adding context vectors in SGNS and GloVe.
  20. Jun 2021
    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

  21. Apr 2021
    1. Machine learning app development has been gaining traction among companies from all over the world. When dealing with this part of machine learning application development, you need to remember that machine learning can recognize only the patterns it has seen before. Therefore, the data is crucial for your objectives. If you’ve ever wondered how to build a machine learning app, this article will answer your question.

    1. Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML

      Machine learning is an extension of the linear model that deals with much more complicated situations, where we take a few different inputs and get outputs.

  22. Nov 2020
    1. You can think of \(\pi_k\) as the weight of each component \(\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)\).

      Some books call this the responsibility.
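      For reference, in the usual mixture-model notation (e.g. Bishop's PRML), the mixing weights \(\pi_k\) and the per-point "responsibility" of component k are:

```latex
p(\boldsymbol{x}) = \sum_{k=1}^{K} \pi_k\,
  \mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1,
\qquad
\gamma_k(\boldsymbol{x}) =
  \frac{\pi_k\,\mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}
       {\sum_{j=1}^{K} \pi_j\,\mathcal{N}(\boldsymbol{x}\mid\boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)} .
```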

  23. Oct 2020
  24. May 2020
    1. Machine learning has a limited scope
    2. AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas, machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly
    1. Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
  25. Apr 2020
    1. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that: Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility). Supports both convolutional networks and recurrent networks, as well as combinations of the two. Runs seamlessly on CPU and GPU. Read the documentation at Keras.io. Keras is compatible with: Python 2.7-3.6.
  26. Jan 2020
    1. Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead

      a caveat

    2. These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned

      todo -- look into the implication for treatment assignment with heterogeneity

    3. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.

      this seems similar to my idea of regularizing on only a subset of the variables

    4. These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables

      some classical solutions to IV bias are akin to ML solutions

    5. Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.

      traditional 'finite sample bias of IV' is really overfitting

    6. Even when we are interested in a parameter \(\hat{\beta}\), the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values \(\hat{x}\). The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions \(\hat{x}\) enter the second stage; the coefficients in the first stage are merely a means to these fitted values.

      first stage of IV -- handled as an estimation problem, but really it's a prediction problem!
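      Written out, the two stages from the quoted passage; only the fitted values \(\hat{x}\) from the first stage carry into the second, which is why the first stage is "just" a prediction problem:

```latex
\text{first stage: } x = \gamma' z + \delta
  \;\;\Rightarrow\;\; \hat{x} = \hat{\gamma}' z,
\qquad
\text{second stage: } y = \beta' \hat{x} + \varepsilon .
```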

    7. Prediction in the Service of Estimation

      This is especially relevant to economists across the board, even the ML skeptics

    8. New Data

      The first application: constructing variables and meaning from high-dimensional data, especially outcome variables

      • satellite images (of energy use, lights etc) --> economic activity
      • cell phone data, Google street view to measure wealth
      • extract similarity of firms from 10k reports
      • even traditional data .. matching individuals in historical censuses
    9. Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.

      Basically unrealistic for microeconomic applications imho

    10. First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regular-ization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.

      Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?

    11. the variables are correlated with each other (say the number of rooms of a house and its square-footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.

      Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.

    12. Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run

      This is true but they fail to mention that LASSO also shrinks the coefficients on variables that it keeps towards zero (relative to OLS). I think this is commonly misunderstood (from people I've spoken with).
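      For reference, the LASSO objective; the \(\ell_1\) penalty is what both sets some coefficients exactly to zero and shrinks the retained coefficients toward zero relative to OLS:

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;
    \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - x_i'\beta\bigr)^2
    \;+\; \lambda \sum_{j} \lvert \beta_j \rvert .
```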

    13. One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.

      This is a very serious limitation for Economics academic work.

    14. First, econometrics can guide design choices, such as the number of folds or the function class.

      How would Econometrics guide us in this?

    15. These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.

      The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)

    16. Table 2: Some Machine Learning Algorithms

      This is a very helpful table!

    17. Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).

      ML explained while standing on one leg.

    18. Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.

      But we can't really make statistical inferences about the structure, can we?

    19. This procedure works because prediction quality is observable: both predictions \(\hat{y}\) and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.

      I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?

    20. In empirical tuning, we create an out-of-sample experiment inside the original sample.

      remember that tuning is done within the training sample

    21. Performance of Different Algorithms in Predicting House Values

      Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.

    22. We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate hold-out set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://e-jep.org

      Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be html and not pdfs...)

    23. Making sense of complex data such as images and text often involves a prediction pre-processing step.

      In using 'new kinds of data' in Economics we often need to do a 'classification step' first

    24. The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.

      I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Well-phrased.

    25. In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.

      This is most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks... to find the best set of 'control variables' etc. But without regularisation this seemed problematic.

    26. Machine Learning: An Applied Econometric Approach

      Shall we use Hypothesis to have a discussion?

  27. Dec 2019
  28. Aug 2019
    1. Machine learning is an approach to making many similar decisions that involves algorithmically finding patterns in your data and using these to react correctly to brand new data
  29. Jul 2019
    1. We translate all patient measurements into statistics that are predictive of unsuccessful discharge

      An analytics pipeline, roughly what we will also need to put together by the end.

  30. Feb 2019
    1. One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.

      Good explanation for why SGD is computationally better. I was confused about the benefits of repeatedly performing mini-batch GD, and why it might be better than batch GD. But I guess the advantage comes from being able to get better performance by vectorizing computation.
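      A small sketch of the contrast (data and step sizes are made up): batch gradient descent touches all samples every step, while minibatch SGD loads and works with one small batch per step and simply takes many more steps.

```python
# Minibatch SGD for least squares: many cheap steps on small batches.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 10))
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=100_000)

w, lr, batch = np.zeros(10), 0.01, 64
for step in range(2_000):
    idx = rng.integers(0, len(X), size=batch)    # load/work with one small batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch      # gradient on the minibatch only
    w -= lr * grad
```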

    1. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.

      I do not get this. Epoch 15 indicates that we are already over-fitting to the training data set, no? Assuming both training and test set come from the same population that we are trying to learn from.

    2. If we see that the accuracy on the test data is no longer improving, then we should stop training

      This contradicts the earlier statement about epoch 280 being the point where there is over-training.

    3. It might be that accuracy on the test data and the training data both stop improving at the same time

      Can this happen? Can the accuracy on the training data set ever increase with the training epoch?

    4. What is the limiting value for the output activations \(a^L_j\)?

      When c is large, small differences in \(z^L_j\) are magnified and the function jumps between 0 and 1, depending on the sign of the differences. On the other hand, when c is very small, all activation values will be close to 1/N, where N is the number of neurons in layer L.
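      Assuming the exercise's softmax has the usual form with a scale parameter c, writing it out makes both limits in the comment explicit:

```latex
a^L_j = \frac{e^{c z^L_j}}{\sum_k e^{c z^L_k}}
\;\xrightarrow{\,c \to \infty\,}\;
\begin{cases} 1 & \text{if } z^L_j = \max_k z^L_k \\ 0 & \text{otherwise,} \end{cases}
\qquad
a^L_j \;\xrightarrow{\,c \to 0\,}\; \frac{1}{N}.
```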

  31. Jan 2019
    1. By utilizing the Deeplearning4j library for model representation, learning and prediction, KNIME builds upon a well performing open source solution with a thriving community.
    2. It is especially thanks to the work of Yann LeCun and Yoshua Bengio (LeCun et al., 2015) that the application of deep neural networks has boomed in recent years. The technique, which utilizes neural networks with many layers and enhanced backpropagation algorithms for learning, was made possible through both new research and the ever increasing performance of computer chips.
    3. One of KNIME's strengths is its multitude of nodes for data analysis and machine learning. While its base configuration already offers a variety of algorithms for this task, the plugin system is the factor that enables third-party developers to easily integrate their tools and make them compatible with the output of each other.
  32. Dec 2018
  33. Nov 2018
    1. Method 3: MDS

      Multi-dimensional scaling (MDS) and Principal Coordinate Analysis (PCoA) are very similar to PCA, except that instead of converting correlations into a 2-D graph, they convert distances among the samples into a 2-D graph.

      So, in order to do MDS or PCoA, we have to calculate the distance between Cell1 and Cell2, and distance between Cell1 and Cell3...

      • 1 2
      • 1 3
      • 1 4
      • 2 3
      • 2 4
      • 3 4

      One very common way to calculate distance between two things is to calculate the Euclidean distance.

      And once we calculated the distance between every pair of cells, MDS and PCoA would reduce them to a 2-D graph.

      The bad news is that if we used the Euclidean Distance, the graph would be identical to a PCA graph!!

      In other words, clustering based on minimizing the linear distances is the same as maximizing the linear correlations.

      I think this is also why, at the beginning of the t-SNE lecture, Prof. Hung-yi Lee said that other unsupervised dimensionality-reduction algorithms only focus on making the within-cluster distances small, while t-SNE also considers how to make the between-cluster distances large.

      In other words, the essence of PCA (or another way to interpret it) is just to find a transformation function that makes two points that are close in the original space even closer after the transformation; it does not consider within-cluster vs. between-cluster at all, but treats all points the same way.

      The good news is that there are tons of other ways to measure distance!!!

      For example, another way to measure distances between cells is to calculate the average of the absolute value of the log fold changes among genes.

      Finally, we get a plot different from the PCA plot

      A biologist might choose to use log fold change to calculate distance because they are frequently interested in log fold changes among genes...

      But there are lots of distances to choose from...

      1. Manhattan Distance
      2. Hamming Distance
      3. Great Circle Distance

      In summary:

      • PCA creates plots based on correlations among samples;
      • MDS and PCoA create plots based on distances among samples
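      A hedged sketch of the pipeline described above (the cells-by-genes matrix and the distance choices are illustrative): compute a pairwise distance matrix with whatever metric suits the data, then let MDS reduce that matrix to a 2-D plot.

```python
# Pairwise distances between "cells", then 2-D MDS on the distance matrix.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
cells = rng.normal(size=(6, 50))        # 6 cells x 50 genes (toy data)

D_euclidean = squareform(pdist(cells, metric="euclidean"))
D_mean_abs  = squareform(pdist(cells, metric="cityblock")) / cells.shape[1]  # mean |difference| per gene

coords = MDS(n_components=2, dissimilarity="precomputed").fit_transform(D_mean_abs)
print(coords.shape)                      # (6, 2): one 2-D point per cell
```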

  34. Oct 2018
    1. t-distributed Stochastic Neighbor Embedding (t-SNE)

      All of the methods introduced so far share the same drawback:

      similar data are close, but different data may collapse; that is, points with similar labels are indeed very close together, but points with dissimilar labels may also end up very close together.

      How different unsupervised learning methods perform on MNIST

      How t-SNE works

      \(x \rightarrow z\)

      t-SNE is likewise dimensionality reduction, from the vector x down to z, but t-SNE has a very distinctive normalization step:

      t-SNE step 1: similarity normalization

      This step assumes that we already know the similarity formula; the similarity formula itself is discussed separately in step 4, because it is truly ingenious.

      This step normalizes the similarity between every pair of points, with the aim of keeping all similarity measures within [0, 1]. You can view it as standardizing the similarities, or as preparation for computing the KL divergence: deriving a conditional probability distribution.

      compute similarity between all pairs of x: \(S(x^i, x^j)\)

      Here we use Similarity(A, B) to approximate P(A and B), and \(\sum_{A\neq B}S(A,B)\) to approximate P(B).

      \(P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A\cap B)}{\sum_{all\ I\ \neq B}P(I\cap B)}\)

      \(P(x^j|x^i)=\frac{S(x^i, x^j)}{\sum_{k\neq i}S(x^i, x^k)}\)

      Suppose we have already found a low-dimensional z-space. We can then compute the similarities of the transformed samples, and from those the conditional probabilities for \(z^i\) and \(z^j\).

      compute similarity between all pairs of z: \(S'(z^i, z^j)\)

      \(Q(z^j|z^i)=\frac{S'(z^i, z^j)}{\sum_{k\neq i}S'(z^i, z^k)}\)

      Find a set of z making the two distributions as close as possible:

      \(L = \sum_{i}KL(P(\star | x^i)||Q(\star | z^i))\)

      t-SNE step 2: find z

      We want to find a set of transformed "samples" such that the distributions of the two sample sets, before and after the transformation (measured by KL divergence), are as close as possible:

      To measure how similar two distributions are, we use the KL divergence (also called information gain). The smaller the KL divergence, the closer the two probability distributions.

      \(L = \sum_{i}KL(P(\star | x^i) || Q(\star | z^i))\)

      find \(z^i\) to minimize L.

      This should be quite easy to do, because as long as we can find the similarity formula…
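      A hedged numpy sketch of just the bookkeeping described above: row-normalize pairwise similarities into conditional probabilities, then score a candidate z-space by the summed KL divergence (real t-SNE also specifies the similarity functions and minimizes this loss over z by gradient descent).

```python
# Similarity normalization and the KL objective of t-SNE (bookkeeping only).
import numpy as np

def conditional_probs(S):
    """Row-normalize a pairwise similarity matrix into P(j|i), with P(i|i) = 0."""
    P = S.astype(float).copy()
    np.fill_diagonal(P, 0.0)
    return P / P.sum(axis=1, keepdims=True)

def kl_loss(P, Q, eps=1e-12):
    """L = sum_i KL(P(.|x_i) || Q(.|z_i))."""
    mask = P > 0
    return np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps)))

rng = np.random.default_rng(0)
S_x = rng.random((5, 5)); S_x = (S_x + S_x.T) / 2    # similarities in the original space
S_z = rng.random((5, 5)); S_z = (S_z + S_z.T) / 2    # similarities in the candidate z-space
print(kl_loss(conditional_probs(S_x), conditional_probs(S_z)))
```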