184 Matching Annotations
  1. Last 7 days
    1. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

      By training a smaller language model on more data, the authors achieved better performance than the larger models, which also reduces the cost of using the model for inference.

  2. Nov 2022
    1. there are curation filters on the recipient side, but then e-mail spam would also be a solved problem, and I do not see that happening just yet.

      Are there projects that train models on collected spam e-mails?

    1. “The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”

      Interesting metaphor for why humans are happy to trust outputs from generative models

  3. Sep 2022
    1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
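
      The setup in this quote can be sketched as a forward pass plus an importance-weighted loss. This is only a sketch under assumptions that are not in the quoted sentence: the two-dimensional embedding is read back out with the transpose of the same matrix `W`, the importance values decay as `0.7 ** i`, and active features are drawn uniformly from [0, 1). All names are hypothetical, and no training loop is shown.

```python
import random

def relu(v):
    return [max(0.0, a) for a in v]

def toy_forward(W, b, x):
    """Embed the features into len(W) dimensions, read them back, apply ReLU.

    Assumption: the readout reuses W transposed (tied weights).
    """
    n_hidden, n_feat = len(W), len(x)
    h = [sum(W[k][i] * x[i] for i in range(n_feat)) for k in range(n_hidden)]
    pre = [sum(W[k][i] * h[k] for k in range(n_hidden)) + b[i] for i in range(n_feat)]
    return relu(pre)

def weighted_mse(importance, x, x_hat):
    # "Importance" is a scalar multiplier on each feature's squared error.
    return sum(w * (a - c) ** 2 for w, a, c in zip(importance, x, x_hat))

def sample_features(n_feat, sparsity, rng):
    # Each feature is zero with probability `sparsity` (assumed sampling scheme).
    return [0.0 if rng.random() < sparsity else rng.random() for _ in range(n_feat)]

importance = [0.7 ** i for i in range(5)]   # hypothetical importance schedule
W = [[1.0, 0.0, 0.0, 0.0, 0.0],             # 2 x 5 embedding that keeps only
     [0.0, 1.0, 0.0, 0.0, 0.0]]             # the two most important features
b = [0.0] * 5
x = [0.5, 0.25, 0.9, 0.0, 0.0]
x_hat = toy_forward(W, b, x)
loss = weighted_mse(importance, x, x_hat)   # only the dropped third feature contributes

sparse_x = sample_features(5, sparsity=0.9, rng=random.Random(0))
```

      With this hand-picked `W`, the first two features are reconstructed exactly and the third is lost, so the loss is just its importance times its squared value.
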
    1. The present generation of Southerners are not responsible for the past

      We can't judge or blame people based off of their ancestors' actions. In high school, I always hated that everyone knew my older siblings because it often felt like my future was already written for me even though I had not even experienced it myself yet.

    2. Haytian revolt

      We briefly touched on this in Traditions/Revolutions, and I know we will learn more about it later on in the course.

    3. his educational programme was un-necessarily narrow.

      When I was first annotating "The Education of the Negro," I also found Washington's idea of teaching industrial education singularly focused. However, towards the end of his article he made me come around to the idea because it seemed like a good way to instill a desire in students to work for themselves instead of someone else.

    4. the Free Negroes from 1830 up to war-time had striven to build industrial schools, and the American Missionary Association had from the first taught various trades; and Price and others had sought a way of honorable alliance with the best of the Southerners. But Mr. Washington first indissolubly linked these things; he put enthusiasm, unlimited energy, and perfect faith into his programme, and changed it from a by-path into a veritable Way of Life

      ML: He was not the first to come up with the idea, obviously, but he put a face on it. It seems like people, myself included, have a much easier time following something if there is a person in charge of it for them to follow.

    1. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life. They do not put into their hands the tools they are best fitted to use, and hence so many failures. Many a mother and sister have worked and slaved, living upon scanty food, in order to give a son and brother a ‘liberal education,’ and in doing this have built up a barrier between the boy and the work he was fitted to do. Let me say to you that all honest work is honorable work. If the labor is manual, and seems common, you will have all the more chance to be thinking of other things, or of work that is higher and brings better pay, and to work out in your minds better and higher duties and responsibilities for yourselves, and for thinking of ways by which you can help others as well as yourselves, and bring them up to your own higher level.

      I still see this in our school systems today, especially in certain classes where you feel like you are never going to use anything that you have learned in the real world.

    3. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      When I was in high school, my mom would always say that they don't teach us some of the most important life skills in class. She was always ranting about how we should have to take a finance class to prepare for adulthood.

    4. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      This is still accurate for schools today. For example, in middle school we had 8 classes a day for 45 minutes each for one semester. Even though we had class every day, it was far too little time to actually learn a full subject. The teacher had to just give us a little bit of information on each topic we were supposed to cover.

  4. moodle.lynchburg.edu
    1. Uncle Bird had a small, rough farm, all woods and hills, miles from the big road; but he was full of tales

      My uncles are also full of tales that they like to share with everyone they have the chance to.

    2. willow

      I named my Jeep Willow.

    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
    2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
  5. Aug 2022
  6. Jul 2022
    1. Z-code models to improve common language understanding tasks such as named entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.

      This class of model (Mixture of Experts) is what Z-code actually is, and what makes it special.

    2. have developed called Z-code, which offer the kind of performance and quality benefits that other large-scale language models have but can be run much more efficiently.

      can do the same but much faster

  7. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
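
      The query/key/value step described in this quote can be sketched in a few lines. This is a minimal single-head sketch, not the quoted author's code: it assumes the query, key, and value vectors are already computed, and it scales the scores by the square root of the key dimension, which is a common convention that the quote itself does not mention.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """For each query, compare it with every key, then average the values."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # How much attention to pay to each position (assumed sqrt(d) scaling).
        weights = softmax([dot(q, k) / math.sqrt(d) for k in keys])
        # Weighted average of the value vectors at all positions.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs

values = [[1.0, 2.0], [3.0, 4.0]]
uniform = attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], values)
peaked = attention([[10.0, 0.0]], [[10.0, 0.0], [-10.0, 0.0]], values)
```

      When all keys are identical, the weights are uniform and the output is the plain mean of the values; when one key matches the query far better than the rest, the output is close to that position's value.
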
    1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

      Matrix multiplication as table lookup
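
      A small sketch of the lookup trick, using plain Python lists and a hypothetical three-row table: multiplying a one-hot row vector by a matrix returns exactly the row that the 1 selects.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

# Hypothetical lookup table: one row per vocabulary entry.
table = [
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
]
one_hot = [[0.0, 1.0, 0.0]]      # selects row index 1
row = matmul(one_hot, table)     # a 1 x 2 matrix holding the selected row
```

      Every term in the sum is multiplied by zero except the one at the selected index, so the product is the same as indexing `table[1]`.
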

  8. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
  9. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
    1. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here

    1. input (32x32x3): max activation: 0.5, min: -0.5; max gradient: 1.08696, min: -1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation: 3.75919, min: -4.48241; max gradient: 0.36571, min: -0.33032; parameters: 16x5x5x3+16 = 1216

      The dimensions of these first two layers are explained here
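
      The counts quoted in the two annotations above (75 and 180 weights per neuron, 1216 parameters for the first conv layer) can be checked with a few lines of arithmetic. The function names are hypothetical helpers, not part of ConvNetJS.

```python
def weights_per_neuron(filter_h, filter_w, in_depth):
    """Each neuron connects to a filter_h x filter_w patch across the full input depth."""
    return filter_h * filter_w * in_depth

def conv_layer_params(n_filters, filter_h, filter_w, in_depth):
    """Each filter has its own weights plus one bias."""
    return n_filters * (weights_per_neuron(filter_h, filter_w, in_depth) + 1)

example_1 = weights_per_neuron(5, 5, 3)        # 32x32x3 input, 5x5 filter
example_2 = weights_per_neuron(3, 3, 20)       # 16x16x20 input, 3x3 filter
first_conv = conv_layer_params(16, 5, 5, 3)    # 16 filters of size 5x5x3
```

      The parameter count depends only on the filter size, the input depth, and the number of filters, not on the 32x32 spatial size of the input.
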

    1. Here the lower level layers are frozen and are not trained, only the new classification head will update itself to learn from the features provided from the pre-trained chopped up model on the left.
    1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

      And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

  10. Mar 2022
    1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
  11. Feb 2022
    1. Relational machine learning methods, which exploit the graph structure and in many cases yield models of better quality.

      Relational machine learning approach

    2. In many applications, however, it is necessary not only to provide data in high quality and in semantically enriched form, but also to generate new knowledge from existing information. We use machine learning for this.

      Combination with ML approaches to generate new knowledge

    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  12. Dec 2021
    1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

      How does this relate to "weighted sum shows similarity between the weights and the inputs"?
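
      One way to connect the two statements, sketched with hypothetical weights: the neuron computes a weighted sum (a dot product of weights and inputs), adds a bias, and thresholds the result. The dot product grows when the input points in the same direction as the weight vector, so the sum acts as a similarity score and the bias sets the cutoff between the two classes.

```python
def neuron(weights, bias, inputs):
    """Classify a point into one of two kinds: 1 if the weighted sum plus bias is positive."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Hypothetical parameters: fires only when both inputs are on.
weights = [1.0, 1.0]
bias = -1.5

both_on = neuron(weights, bias, [1.0, 1.0])   # weighted sum 2.0 clears the cutoff
one_on = neuron(weights, bias, [1.0, 0.0])    # weighted sum 1.0 does not
```
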

    1. The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.
    1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
  13. Nov 2021
    1. Now that we've made peace with the concepts of projections (matrix multiplications)

      Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

    2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
    3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least on the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. You'll use a (70%, 20%, 10%) split for the training, validation, and test sets. Note the data is not being randomly shuffled before splitting. This is for two reasons: (1) it ensures that chopping the data into windows of consecutive samples is still possible; (2) it ensures that the validation/test results are more realistic, being evaluated on the data collected after the model was trained.

      Train, Validation, Test: 0.7, 0.2, 0.1
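
      A minimal sketch of such an unshuffled split on a plain Python list. The function name is hypothetical, and integer percentages are used to avoid floating-point rounding at the boundaries.

```python
def chronological_split(samples, train_pct=70, val_pct=20):
    """Split in time order, without shuffling: first 70%, next 20%, last 10%."""
    n = len(samples)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]

series = list(range(100))    # stand-in for time-ordered observations
train, val, test = chronological_split(series)
```

      Because the slices are contiguous, every validation and test sample comes after every training sample, and each slice can still be cut into windows of consecutive samples.
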

    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. Notes: A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel (see also Convolution arithmetic). A fully-connected layer computes output neurons as weighted sums of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. ReLU, first introduced by Nair and Hinton, calculates $f(x)=\max(0,x)$ for each entry in a vector input; graphically, it is a hinge at the origin. The softmax function calculates $S(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}$ for each entry ($y_i$) in a vector input ($y$).

      This is a great visualization of MNIST hidden layers.

    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e. how relevant that word is to the Query word.


    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

      This is first explanation of

    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using Python pseudocode.
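
      The code demo mentioned in the quote is not part of the annotation. As a stand-in, here is a separate sketch of one LSTM step with a scalar input and a scalar state, which keeps the three gates easy to read. The parameter names in the dictionary are hypothetical, and the formulas follow the standard LSTM equations rather than the quoted article's code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input x, hidden state h_prev, and cell state c_prev."""
    # Forget gate: how much of the previous cell state to keep.
    f = sigmoid(p["wf_x"] * x + p["wf_h"] * h_prev + p["bf"])
    # Input gate: how much of the new candidate to add.
    i = sigmoid(p["wi_x"] * x + p["wi_h"] * h_prev + p["bi"])
    # Candidate values computed from the current step.
    g = math.tanh(p["wg_x"] * x + p["wg_h"] * h_prev + p["bg"])
    # Output gate: how much of the cell state to expose as the next hidden state.
    o = sigmoid(p["wo_x"] * x + p["wo_h"] * h_prev + p["bo"])
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# With all parameters at zero, every gate is 0.5 and the candidate is 0.
zero_params = {name: 0.0 for name in (
    "wf_x", "wf_h", "bf", "wi_x", "wi_h", "bi",
    "wg_x", "wg_h", "bg", "wo_x", "wo_h", "bo",
)}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, p=zero_params)
```

      In the zero-parameter case the cell state is simply halved by the forget gate, which makes the role of each gate easy to verify by hand.
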
  14. Oct 2021
    1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
    2. These layers warp and reshape the data to make it easier to classify.
    1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

      And here is a Google Colab notebook that demonstrates that

  15. Sep 2021
    1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well: we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

      Boy, don't they

    1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
    1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
  16. Aug 2021
    1. I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

      This is also good

    2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
    3. So basically: q = the vector representing a word; K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
    1. Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.

      DataHub (http://datahub.io/dataset)

      World Health Organization (http://www.who.int/research/en/)

      Data.gov (http://data.gov)

      European Union Open Data Portal (http://open-data.europa.eu/en/data/)

      Amazon Web Service public datasets (http://aws.amazon.com/datasets)

      Facebook Graph (http://developers.facebook.com/docs/graph-api)

      Healthdata.gov (http://www.healthdata.gov)

      Google Trends (http://www.google.com/trends/explore)

      Google Finance (https://www.google.com/finance)

      Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

      Machine Learning Repository (http://archive.ics.uci.edu/ml/)

      As an idea of open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net), which displays the connections of the data link among several open data sources currently available on the network (see Figure 1-3).

    1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
    2. Recursive Neural Networks
  17. Jul 2021
    1. In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

    1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
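
      The difference between the two measures shows up clearly on a small example with hypothetical vectors: two vectors can point in exactly the same direction (cosine similarity 1) while being far apart in Euclidean distance.

```python
import math

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short = [1.0, 0.0]
long_same_direction = [10.0, 0.0]
nearby_other_direction = [0.0, 1.0]

far_but_aligned = (euclidean_distance(short, long_same_direction),
                   cosine_similarity(short, long_same_direction))
close_but_orthogonal = (euclidean_distance(short, nearby_other_direction),
                        cosine_similarity(short, nearby_other_direction))
```

      Here `short` is much closer to `nearby_other_direction` in distance, yet it shares a direction only with `long_same_direction`.
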
    1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations: DON'T use shifted PPMI with SVD. DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with p = 0.5). DO use PPMI and SVD with short contexts (window size of 2). DO use many negative samples with SGNS. DO always use context distribution smoothing (raise unigram distribution to the power of α = 0.75) for all methods. DO use SGNS as a baseline (robust, fast and cheap to train). DO try adding context vectors in SGNS and GloVe.
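
      Context distribution smoothing, as named in the recommendations above, can be sketched as follows: raise each unigram count to the power 0.75 and renormalize. The word counts are hypothetical and the function name is an illustration, not a library call.

```python
def smoothed_unigram(counts, alpha=0.75):
    """Raise each count to the power alpha, then renormalize to a distribution."""
    powered = {word: count ** alpha for word, count in counts.items()}
    total = sum(powered.values())
    return {word: value / total for word, value in powered.items()}

counts = {"the": 16, "cat": 1}              # hypothetical corpus counts
raw = {w: c / sum(counts.values()) for w, c in counts.items()}
smoothed = smoothed_unigram(counts)
```

      The effect is to shift probability mass from frequent words toward rare ones: the rare word's share rises from 1/17 to 1/9 in this example.
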
  18. Jun 2021
    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

  19. Apr 2021
    1. Machine learning app development has been gaining traction among companies from all over the world. When dealing with this part of machine learning application development, you need to remember that machine learning can recognize only the patterns it has seen before. Therefore, the data is crucial for your objectives. If you’ve ever wondered how to build a machine learning app, this article will answer your question.

    1. Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML

      Machine learning is an extension of the linear model that deals with much more complicated situations, where we take several different inputs and produce outputs.

  20. Nov 2020
    1. One can regard $\pi_k$ as the weight of each component $\mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.


  21. Oct 2020
  22. May 2020
    1. Machine learning has a limited scope
    2. AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas, machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly
    1. Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
  23. Apr 2020
    1. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that: Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility). Supports both convolutional networks and recurrent networks, as well as combinations of the two. Runs seamlessly on CPU and GPU. Read the documentation at Keras.io. Keras is compatible with: Python 2.7-3.6.
  24. Jan 2020
    1. Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead

      a caveat

    2. These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned

      todo -- look into the implication for treatment assignment with heterogeneity

    3. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.

      this seems similar to my idea of regularizing on only a subset of the variables

    4. These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables

      some classical solutions to IV bias are akin to ML solutions

    5. Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.

      traditional 'finite sample bias of IV' is really overfitting

    6. Even when we are interested in a parameter $\hat{\beta}$, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress $x = \gamma' z + \delta$ on the instrument $z$, then regress $y = \beta' x + \varepsilon$ on the fitted values $\hat{x}$. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions $\hat{x}$ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.

      first stage of IV -- handled as an estimation problem, but really it's a prediction problem!
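
      The two-stage procedure can be sketched on simulated data with a single instrument. Everything about the data-generating process here (coefficients, noise, sample size, the confounder `u`) is a made-up assumption for illustration; the point is only that the second stage uses nothing from the first stage except the fitted values.

```python
import random

def mean(v):
    return sum(v) / len(v)

def ols(x, y):
    """Return (intercept, slope) of a least-squares fit of y on a single regressor x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

rng = random.Random(0)
n = 2000
z = [rng.gauss(0, 1) for _ in range(n)]                    # instrument
u = [rng.gauss(0, 1) for _ in range(n)]                    # unobserved confounder
x = [0.8 * zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [2.0 * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# First stage: predict x from z. Only the fitted values are carried forward.
intercept_1, gamma_hat = ols(z, x)
x_hat = [intercept_1 + gamma_hat * zi for zi in z]

# Second stage: regress y on the fitted values.
_, beta_2sls = ols(x_hat, y)

# For comparison: naive OLS of y on x, and the ratio of reduced-form to first-stage slopes.
_, beta_ols = ols(x, y)
wald_ratio = ols(z, y)[1] / gamma_hat
```

      With one instrument, the second-stage slope equals the ratio of the slope of `y` on `z` to the slope of `x` on `z`, which makes it visible that the first-stage coefficients matter only through `x_hat`.
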

    7. Prediction in the Service of Estimation

      This is especially relevant to economists across the board, even the ML skeptics

    8. New Data

      The first application: constructing variables and meaning from high-dimensional data, especially outcome variables

      • satellite images (of energy use, lights etc) --> economic activity
      • cell phone data, Google street view to measure wealth
      • extract similarity of firms from 10k reports
      • even traditional data .. matching individuals in historical censuses
    9. Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.

      Basically unrealistic for microeconomic applications imho

    10. First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.

      Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?

    11. If the variables are correlated with each other (say the number of rooms of a house and its square-footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.

      Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.'<br> This presents a problem for making inferences from estimated coefficients.

    12. Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run

      This is true but they fail to mention that LASSO also shrinks the coefficients on variables that it keeps towards zero (relative to OLS). I think this is commonly misunderstood (from people I've spoken with).
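      The shrinkage point can be illustrated in the one case where LASSO has a simple closed form. A minimal sketch, assuming an orthonormal design (XᵀX = I) and the objective ½‖y − Xb‖² + λ‖b‖₁, in which case each LASSO coefficient is the soft-thresholded OLS coefficient. The coefficient values and `lam` below are made up for illustration.

```python
import numpy as np

def soft_threshold(b_ols, lam):
    """LASSO solution for an orthonormal design: shrink every OLS
    coefficient towards zero by lam, and set it to zero if |b| <= lam."""
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

# Hypothetical OLS coefficients (not from the paper).
b_ols = np.array([3.0, -1.5, 0.4, -0.2])
b_lasso = soft_threshold(b_ols, lam=0.5)

# Small coefficients are dropped; the ones that are kept are also
# pulled towards zero relative to OLS.
for ols, lasso in zip(b_ols, b_lasso):
    print(f"OLS {ols:+.2f} -> LASSO {lasso:+.2f}")
```

      With correlated regressors there is no such closed form, but the same qualitative effect (selection plus shrinkage of the survivors) remains.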

    13. One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.

      This is a very serious limitation for Economics academic work.

    14. First, econometrics can guide design choices, such as the number of folds or the function class.

      How would Econometrics guide us in this?

    15. These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.

      The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)

    16. Table 2: Some Machine Learning Algorithms

      This is a very helpful table!

    17. Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).

      ML explained while standing on one leg.

    18. Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.

      But we can't really make statistical inferences about the structure, can we?

    19. This procedure works because prediction quality is observable: both predictions ŷ and outcomes y are observed. Contrast this with parameter estimation, where typically we must rely on assumptions about the data-generating process to ensure consistency.

      I'm not clear what the implication they are making here is. Does it in some sense 'not work' with respect to parameter estimation?

    20. In empirical tuning, we create an out-of-sample experiment inside the original sample.

      remember that tuning is done within the training sample

    21. Performance of Different Algorithms in Predicting House Values

      Any reason they didn't try a Ridge or an Elastic net model here? My instinct is that these will beat LASSO for most Economic applications.

    22. We consider 10,000 randomly selected owner-occupied units from the 2011 metropolitan sample of the American Housing Survey. In addition to the values of each unit, we also include 150 variables that contain information about the unit and its location, such as the number of rooms, the base area, and the census region within the United States. To compare different prediction techniques, we evaluate how well each approach predicts (log) unit value on a separate hold-out set of 41,808 units from the same sample. All details on the sample and our empirical exercise can be found in an online appendix available with this paper at http://e-jep.org

      Seems a useful example for trying/testing/benchmarking. But the link didn't work for me. Can anyone find it? Is it interactive? (This is why I think papers should be html and not pdfs...)

    23. Making sense of complex data such as images and text often involves a prediction pre-processing step.

      In using 'new kinds of data' in Economics we often need to do a 'classification step' first

    24. The fundamental insight behind these breakthroughs is as much statistical as computational. Machine intelligence became possible once researchers stopped approaching intelligence tasks procedurally and began tackling them empirically.

      I hadn't thought about how this unites the 'statistics to learn stuff' part of ML and the 'build a tool to do a task' part. Well-phrased.

    25. In another category of applications, the key object of interest is actually a parameter β, but the inference procedures (often implicitly) contain a prediction task. For example, the first stage of a linear instrumental variables regression is effectively prediction. The same is true when estimating heterogeneous treatment effects, testing for effects on multiple outcomes in experiments, and flexibly controlling for observed confounders.

      This is the most relevant tool for me. Before I learned about ML I often thought about using 'stepwise selection' for such tasks, to find the best set of 'control variables' etc. But without regularisation this seemed problematic.

    26. Machine Learning: An Applied Econometric Approach

      Shall we use Hypothesis to have a discussion?

  25. Dec 2019
  26. Aug 2019
    1. Machine learning is an approach to making many similar decisions that involves algorithmically finding patterns in your data and using these to react correctly to brand new data
  27. Jul 2019
    1. We translate all patient measurements into statistics that are predictive of unsuccessful discharge

      An analytics pipeline, roughly what we too should have put together by the end.

  28. Feb 2019
    1. One benefit of SGD is that it's computationally a whole lot faster. Large datasets often can't be held in RAM, which makes vectorization much less efficient. Rather, each sample or batch of samples must be loaded, worked with, the results stored, and so on. Minibatch SGD, on the other hand, is usually intentionally made small enough to be computationally tractable. Usually, this computational advantage is leveraged by performing many more iterations of SGD, making many more steps than conventional batch gradient descent. This usually results in a model that is very close to that which would be found via batch gradient descent, or better.

      Good explanation of why SGD is computationally better. I was confused about the benefits of repeatedly performing mini-batch GD, and why it might be better than batch GD. But I guess the advantage comes from being able to get better performance by vectorizing the computation.
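      To make the "many cheap steps" idea concrete, here is a minimal sketch of minibatch SGD on a noise-free linear regression. The data, learning rate, batch size and epoch count are illustrative assumptions, not values from the quoted answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])   # hypothetical ground truth
y = X @ w_true                        # noise-free targets

w = np.zeros(d)
lr, batch_size, epochs = 0.05, 32, 20

for _ in range(epochs):
    order = rng.permutation(n)        # reshuffle every epoch
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this minibatch only.
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad

print(w)   # close to w_true after many small, vectorized steps
```

      Each step touches only `batch_size` rows, so the whole dataset never has to sit in memory at once, while the per-batch arithmetic is still fully vectorized.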

    1. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.

      I do not get this. Epoch 15 indicates that we are already over-fitting to the training data set, no? Assuming both training and test set come from the same population that we are trying to learn from.

    2. If we see that the accuracy on the test data is no longer improving, then we should stop training

      This contradicts the earlier statement about epoch 280 being the point where there is over-training.

    3. It might be that accuracy on the test data and the training data both stop improving at the same time

      Can this happen? Can the accuracy on the training data set ever increase with the training epoch?

    4. What is the limiting value for the output activations \(a^L_j\)

      When c is large, small differences in z_j^L are magnified and the function jumps between 0 and 1, depending on the sign of the differences. On the other hand, when c is very small, all activation values will be close to 1/N; where N is the number of neurons in layer L.
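      A quick numerical check of both limits, assuming the exercise refers to a softmax whose exponents are scaled by a constant c, i.e. \(a^L_j = e^{c z^L_j} / \sum_k e^{c z^L_k}\). The weighted inputs `z` are made-up values.

```python
import numpy as np

def scaled_softmax(z, c):
    """Softmax of c * z, computed stably by subtracting the max."""
    s = c * z
    s = s - s.max()
    e = np.exp(s)
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0, 2.5])    # hypothetical weighted inputs z_j^L

for c in (1e-6, 1.0, 100.0):
    print(c, scaled_softmax(z, c).round(4))
# Tiny c: every activation is close to 1/N.
# Large c: almost all mass sits on the neuron with the largest z.
```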

  29. Jan 2019
    1. By utilizing the Deeplearning4j library1 for model representation, learning and prediction, KNIME builds upon a well performing open source solution with a thriving community.
    2. It is especially thanks to the work of Yann LeCun and Yoshua Bengio (LeCun et al., 2015) that the application of deep neural networks has boomed in recent years. The technique, which utilizes neural networks with many layers and enhanced backpropagation algorithms for learning, was made possible through both new research and the ever increasing performance of computer chips.
    3. One of KNIME's strengths is its multitude of nodes for data analysis and machine learning. While its base configuration already offers a variety of algorithms for this task, the plugin system is the factor that enables third-party developers to easily integrate their tools and make them compatible with the output of each other.
  30. Dec 2018
  31. Nov 2018
    1. III. Method 3: MDS

      Multi-dimensional scaling (MDS) and Principal Coordinate Analysis (PCoA) are very similar to PCA, except that instead of converting correlations into a 2-D graph, they convert distances among the samples into a 2-D graph.

      So, in order to do MDS or PCoA, we have to calculate the distance between Cell1 and Cell2, and distance between Cell1 and Cell3...

      • 1 2
      • 1 3
      • 1 4
      • 2 3
      • 2 4
      • 3 4

      One very common way to calculate the distance between two things is to calculate the Euclidean distance.

      And once we have calculated the distance between every pair of cells, MDS and PCoA reduce them to a 2-D graph.

      The bad news is that if we used the Euclidean distance, the graph would be identical to a PCA graph!!

      In other words, clustering based on minimizing the linear distances is the same as maximizing the linear correlations.

      I think this is why Professor Hung-yi Lee said at the start of his t-SNE lecture that other unsupervised dimension-reduction algorithms only focus on [making within-cluster distances small], whereas t-SNE also considers [making between-cluster distances large].

      In other words, the essence of PCA (or another way to interpret it) is just [finding a transformation that makes two points that are close in the original space even closer after the transformation]. It does not consider [within-cluster or between-cluster] at all and treats all points alike.

      The good news is that there are tons of other ways to measure distance!!!

      For example, another way to measure distances between cells is to calculate the average of the absolute value of the log fold changes among genes.

      Finally, we get a plot different from the PCA plot

      A biologist might choose to use log fold change to calculate distance because they are frequently interested in log fold changes among genes...

      But there are lots of distances to choose from...

      1. Manhattan Distance
      2. Hamming Distance
      3. Great Circle Distance

      In summary:

      • PCA creates plots based on correlations among samples;
      • MDS and PCoA create plots based on distances among samples
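      The claim that Euclidean MDS reproduces the PCA plot can be checked numerically. A minimal sketch, assuming classical (metric) MDS via double centering and PCA on mean-centered data; the 6 "cells" by 4 "genes" matrix is random, hypothetical data.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # 6 hypothetical cells, 4 genes
Xc = X - X.mean(axis=0)

# PCA scores from the SVD of the centered data.
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pca_2d = U[:, :2] * S[:2]

# Classical MDS: start from squared pairwise Euclidean distances only.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
n = len(X)
J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
B = -0.5 * J @ D2 @ J                  # recovered Gram matrix Xc @ Xc.T
vals, vecs = np.linalg.eigh(B)
top = np.argsort(vals)[::-1][:2]       # two largest eigenvalues
mds_2d = vecs[:, top] * np.sqrt(vals[top])

# The two embeddings agree up to the sign of each axis.
print(np.allclose(np.abs(pca_2d), np.abs(mds_2d)))
```

      Swapping `D2` for another distance (Manhattan, average absolute log fold change, ...) is what makes the MDS plot differ from the PCA plot.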

  32. Oct 2018
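    1. t-Distributed Stochastic Neighbor Embedding (t-SNE)


      Similar data are close, but different data may collapse; that is, points with similar labels do end up close together, but points with dissimilar labels may also end up close together.

      How different unsupervised learning methods perform on MNIST

      How t-SNE works

      \(x \rightarrow z\)

      t-SNE is also dimension reduction, from the vector x down to z. But t-SNE has one distinctive normalization step:

      I. t-SNE step 1: similarity normalization

      This step assumes we already know the similarity formula. The similarity formula is discussed separately in [Part IV], because it is truly ingenious.

      This step normalizes the similarity between every pair of points so that all similarity measures fall in [0,1]. You can see it as normalizing the similarities, or as preparation for computing the KL divergence, i.e. computing conditional probability distributions.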

      compute similarity between all pairs of x: \(S(x^i, x^j)\)

      Here we use Similarity(A,B) to approximate P(A and B), and \(\sum_{A\neq B}S(A,B)\) to approximate P(B).

      \(P(A|B) = \frac{P(A\cap B)}{P(B)} = \frac{P(A\cap B)}{\sum_{all\ I\ \neq B}P(I\cap B)}\)

      \(P(x^j|x^i)=\frac{S(x^i, x^j)}{\sum_{k\neq i}S(x^i, x^k)}\)

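      Suppose we have already found a low-dimensional z-space. Then we can compute the similarities of the transformed samples, and from them the conditional probability of \(z^j\) given \(z^i\).

      compute similarity between all pairs of z: \(S'(z^i, z^j)\)

      \(Q(z^j|z^i)=\frac{S'(z^i, z^j)}{\sum_{k\neq i}S'(z^i, z^k)}\)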

      Find a set of z making the two distributions as close as possible:

      \(L = \sum_{i}KL(P(\star | x^i)||Q(\star | z^i))\)

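      II. t-SNE step 2: find z

      We want to find a set of transformed "samples" such that the distributions of the two sample sets, before and after the transformation, are as close as possible (as measured by KL divergence):

      To measure how similar two distributions are, use the KL divergence (also called information gain). The smaller the KL divergence, the closer the two probability distributions.

      \(L = \sum_{i}KL(P(\star | x^i) || Q(\star | z^i))\)

      Find \(z^i\) to minimize \(L\).

      This should be easy to do: once we have the similarity formula, we can write the KL divergence as a function of \(z^i\) and then minimize it with gradient descent (GD).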
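      The objective \(L\) can be written out directly. A minimal sketch, assuming the similarity formulas given in Part IV of these notes (\(\exp(-\|\cdot\|_2)\) in x-space, \(1/(1+\|\cdot\|_2)\) in z-space) rather than the per-point bandwidths and squared distances used by library implementations. The random `x` and `z` are hypothetical.

```python
import numpy as np

def pairwise_dist(a):
    """Matrix of Euclidean distances between all rows of a."""
    diff = a[:, None, :] - a[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def conditional(sim):
    """Row-normalize similarities: entry (i, j) becomes P(j | i), with j != i."""
    sim = sim.copy()
    np.fill_diagonal(sim, 0.0)
    return sim / sim.sum(axis=1, keepdims=True)

def tsne_loss(x, z):
    """L = sum_i KL(P(. | x^i) || Q(. | z^i))."""
    P = conditional(np.exp(-pairwise_dist(x)))        # x-space similarity
    Q = conditional(1.0 / (1.0 + pairwise_dist(z)))   # z-space similarity
    off_diag = ~np.eye(len(x), dtype=bool)
    return float(np.sum(P[off_diag] * np.log(P[off_diag] / Q[off_diag])))

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))   # 5 points in the original space
z = rng.normal(size=(5, 2))   # a candidate 2-D embedding
loss = tsne_loss(x, z)
print(loss)                   # gradient descent on z would drive this down
```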

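      III. Drawbacks of t-SNE

      1. It needs the similarity of every pair of points
      2. When a new point is added, its similarity to all existing points has to be computed again
      3. Because of item 2, all the subsequent conditional probabilities \(P\) and \(Q\) have to be recomputed

      Because t-SNE requires the similarity between every pair of points in the dataset, it is a very computation-heavy algorithm. Adding a new data point also affects the whole procedure, which has to be rerun from scratch, and this is very unfriendly. So t-SNE is generally not used in a training pipeline, only for visualization, and even for visualization it is rarely used on its own, again because of its very high computational cost.

      When visualizing with t-SNE, one usually first uses PCA to reduce data points with thousands of dimensions to a few dozen dimensions, then applies t-SNE to those, reducing to, say, 2 dimensions, and plots the result on a plane.

      IV. t-SNE's similarity formula

      As mentioned earlier, one similarity formula is based on the 2-norm (Euclidean) distance between two points \(x^i, x^j\):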

      \(S(x^i, x^j)=exp(-||x^i - x^j||_2)\)

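      It is commonly used to compute similarity in graph models. Its advantage is that only very close points get a non-negligible similarity, because the exponential makes the value fall off exponentially as the distance between the two points grows.

      Before t-SNE there was an algorithm called SNE, which uses this similarity formula in both z-space and x-space.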

      similarity of x-space: \(S(x^i, x^j)=exp(-||x^i - x^j||_2)\) similarity of z-space: \(S'(z^i, z^j)=exp(-||z^i - z^j||_2)\)

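      What makes t-SNE ingenious is that it uses a different similarity formula in z-space, a member of the t-distribution family (the t-distribution has a tunable parameter, which gives many different shapes):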

      \(S(x^i, x^j)=exp(-||x^i - x^j||_2)\) \(S'(z^i, z^j)=\frac{1}{1+||z^i - z^j||_2}\)

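      Plotting the two functions shows why this modification is needed, and why it guarantees that points that were close in x-space stay close in z-space, while points that were somewhat far apart in x-space are pulled very far apart in z-space.

      t-distribution vs. Gaussian as similarity measure

      That is, if the points in x-space have some gaps between them (small similarity), these gaps are amplified after mapping to z-space and become much larger.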
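      The gap amplification can be seen by asking what distance each formula needs in order to produce the same similarity value. A rough sketch that ignores the normalization step and uses the two formulas exactly as written in these notes; the similarity levels are arbitrary.

```python
import numpy as np

# Arbitrary similarity levels, from "very similar" to "barely similar".
s = np.array([0.9, 0.5, 0.1, 0.01])

d_x = -np.log(s)        # distance giving similarity s under exp(-d)
d_z = 1.0 / s - 1.0     # distance giving similarity s under 1 / (1 + d)

for si, dx, dz in zip(s, d_x, d_z):
    print(f"s={si:<5} x-space d={dx:6.3f}  z-space d={dz:7.3f}  ratio={dz / dx:6.2f}")
# Near neighbours keep almost the same distance, while moderately
# distant pairs must sit much farther apart in z-space.
```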

    2. Unsupervised Learning: Neighbor Embedding

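      The well-known t-SNE algorithm ('NE' stands for Neighbor Embedding)

      Manifold learning

      Manifolds and the failure of Euclidean distance

      What is a manifold? A manifold is essentially a 2D plane that has been rolled up into a 3D object. Its key feature is that the Euclidean distance between two points in 3D space does not measure how 'near' or 'far' they are in the (unrolled) 2D space, especially when the two points are far apart. Euclidean geometry no longer applies at large distances in 3D; in the 3D space it can only be used for nearby points.

      Manifold: Euclidean distance becomes invalid

      Manifold learning addresses this failure of Euclidean geometry in 3D: it flattens the rolled-up plane so that Euclidean geometry can be used again (after all, many of our algorithms are based on Euclidean distance). This flattening is also a form of dimension reduction.

      Manifold learning algorithm 1: LLE


      Step 1: compute w

      For each point in the dataset, [select] its K neighbors (K is a hyperparameter, like the K in KNN), and denote the [relation] between the point \(x^i\) and its neighbor \(x^j\) by \(w_{ij}\): \(w_{ij}\) represents the relation between \(x^i\) and \(x^j\).

      \(w_{ij}\) is what we are looking for. We want \(x^i\) to be approximated by the \(w_{ij}\)-weighted sum of its K neighbors, with the quality of the approximation measured by Euclidean distance:

      given \(x^i, x^j\), find a set of \(w_{ij}\) minimizing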

      \(w_{ij} = argmin_{w_{ij},i\in [1,N],j\in [1,K]}\sum_i||x^i - \sum_jw_{ij}x^j||_2\)

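      Step 2: compute z. Do the dimension reduction while keeping \(w_{ij}\) unchanged: find \(z^i\) and \(z^j\), reducing \(x^i, x^j\) to \(z^i, z^j\). The principle is to keep \(w_{ij}\) fixed; since we are doing dimension reduction, the new \(z^i, z^j\) should have lower dimension than \(x^i, x^j\):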

      given \(w_{ij}\), find a set of \(z_i\) minimizing

      \(z_{i} = argmin_{z_{i},i\in [1,N],j\in [1,K]}\sum_i||z^i - \sum_jw_{ij}z^j||_2\)

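      A characteristic of LLE: it is transductive learning, like KNN; there is no explicit function (e.g. \(f(x)=z\)) that performs the reduction.

      An advantage of LLE: looking at [Step 2], even if we do not know what \(x^i\) is, as long as we know the relations [\(w_{ij}\)] between points we can still use LLE to find \(z^i\), because the only role of \(x^i\) is to produce \(w_{ij}\).

      A burden of LLE: the number of neighbors K must be chosen carefully; it has to be just right to get good results.

      • K too small: there are too few w (model parameters) overall, the model lacks capacity, and the result is poor

      • K too large: points far from \(x^i\) (x-space being the rolled-up 2D plane) are also taken into account. As analyzed earlier, on a manifold the Euclidean distance fails for points that are far apart, and our formula for w uses exactly the Euclidean distance, so the result is also poor.

      This is why K is so critical in LLE.

      Manifold learning algorithm 2: Laplacian Eigenmaps

      A graph-based approach to manifold learning

      Compute the similarity between every pair of points in the dataset and connect the pairs whose similarity exceeds some threshold, which builds a graph. Once we have the graph, the [distance between two points] can be replaced by the [length of the path connecting them]. In other words, Laplacian eigenmaps does not compute the straight-line (Euclidean) distance between two points but the distance along the curve: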

      Recall the graph-based method from semi-supervised learning that we studied earlier: if x1 and x2 are close in a high-density region, their labels (classes) are the same. The formula we used is:

      \(L=\sum_{x^r}C(y^r, \hat{y}^r) + \lambda S\)

      \(S=\frac{1}{2}\sum_{i,j}w_{i,j}(y^i - y^j)^2=y^TLy\)

      \(L = D - W\)

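      \(w_{i,j} = \text{similarity between } i \text{ and } j\) if connected, else \(0\)

      • \(x^r\): labeled data
      • \(S\): smoothness of the graph (built from the whole dataset)
      • \(w\): similarity between two points, i.e. the edge weight of the graph
      • \(y^i\): predicted label
      • \(\hat{y}^r\): true label
      • \(L\): the graph Laplacian

      The same method can be used in unsupervised learning: if the similarity (\(w_{i,j}\)) between xi and xj is large, then after dimension reduction (after the surface is flattened) the (Euclidean) distance between zi and zj should be small:

      \(S=\frac{1}{2}\sum_{i,j}w_{i,j}(z^i - z^j)^2\)

      But minimizing this S alone gives the trivial minimum 0 (e.g. all \(z^i\) equal), so z needs some constraints. Although we are flattening a twisted high-dimensional surface, we do not want the result of flattening (dimension reduction) to be something that can be flattened further (surface -> flatten, still a surface -> flatten again). That is, the result of this flattening should be [the flattest possible], which means:

      if the dim of z is M, \(\mathrm{Span}\{z^1, z^2, ..., z^N\} = R^M\)

      [Stating the result] It can be shown that these z are the eigenvectors of the Laplacian (\(L\)) corresponding to its smaller eigenvalues. That is why the algorithm is called Laplacian eigenmaps: the z it finds are the eigenvectors of the Laplacian matrix with the smallest eigenvalues.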

      Spectral clustering: clustering on z

      Building on Laplacian eigenmaps: if we run clustering (e.g. K-means) on the z found by Laplacian eigenmaps, the resulting algorithm is spectral clustering.

      spectral clustering = laplacian eigenmaps reduction + clustering
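      A minimal sketch of the pipeline on six hand-picked 2-D points (hypothetical data). It assumes the similarity \(w = \exp(-\|x^i - x^j\|_2)\) and, unlike the description above, keeps all edges instead of thresholding so that the graph stays connected. For two clusters, the sign of the second-smallest eigenvector stands in for K-means.

```python
import numpy as np

# Two small groups of points, hand-picked for illustration.
X = np.array([[0.0, 0.0], [0.5, 0.0], [0.0, 0.5],
              [3.0, 0.0], [3.5, 0.0], [3.0, 0.5]])

dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
W = np.exp(-dist)                  # edge weights (similarities)
np.fill_diagonal(W, 0.0)           # no self-loops
D = np.diag(W.sum(axis=1))         # degree matrix
L = D - W                          # graph Laplacian

vals, vecs = np.linalg.eigh(L)     # eigenvalues in ascending order
# vecs[:, 0] is the constant vector (eigenvalue ~ 0); the next
# eigenvector is the 1-D Laplacian-eigenmaps embedding z.
z = vecs[:, 1]

labels = (z > 0).astype(int)       # simplest possible clustering on z
print(z.round(3), labels)
```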

      t-Distributed Stochastic Neighbor Embedding (t-SNE)

    3. Unsupervised Learning: Word Embedding

      Why word embedding?

      Word embedding is a very good and very widely known application of dimension reduction.

      1-of-N encoding and its drawbacks

      apple = [1 0 0 0 0]

      bag = [0 1 0 0 0]

      cat = [0 0 1 0 0]

      dog = [0 0 0 1 0]

      elephant = [0 0 0 0 1]

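      N is the number of all words in the world, or you can build your own vocabulary, in which case N is the vocabulary size. But this vectorization treats every word as equal: words that are related, e.g. cat and dog are both animals, show no hint of that in [0 0 1 0 0] and [0 0 0 1 0].

      One way to address this is Word Class.

      Word Class and its drawbacks

      First cluster all words, then use the cluster instead of the 1-of-N encoding to represent a word.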

      • cluster 1: dog, cat, bird;
      • cluster 2: ran, jumped, walk
      • cluster 3: flower, tree, apple

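      Although Word Class keeps part-of-speech information so that words of the same class are grouped together, relationships between different classes still cannot be expressed: noun + verb + noun/pronoun, this subject-verb-object relationship cannot be represented with Word Class.

      This problem can be solved with word embedding.

      Word embedding projects each word's 1-of-N encoding vector into a low-dimensional space (dimension reduction); this low-dimensional space is called the embedding. Each word is then a point in the low-dimensional space, and we hope that:


      Moreover, when the embedding is a two-dimensional space (which can be visualized), after plotting all the points with their words on coordinate axes it is easy to see that each dimension (x and y axes) of this low-dimensional space carries some concrete meaning. For example, dim-x might represent living things and dim-y might represent actions.

      How to find vector of Embedding space?

      Why can't an autoencoder produce word embeddings?

      Word embedding is essentially [unsupervised dimension reduction]. Of the methods we studied before, the most direct is an autoencoder, but here it clearly does not work. The input is a 1-of-N encoding vector, and under this representation every input sample is independent, i.e. individual samples have no relation to each other. With samples that have no internal regularity, how could you learn the relations between them? (P.S. "Originally there is not a single thing; where could dust alight?")


      Our goal is to have the computer understand the meaning of words. This cannot be done through ordinary language communication, so we need an indirect route: the computer can only understand through other means. The most common indirect method is: understand a word by its context.

      • Ma Ying-jeou 520 sworn into office
      • Tsai Ing-wen 520 sworn into office

      The machine certainly does not know what Ma Ying-jeou and Tsai Ing-wen actually are, but once it has read a large number of articles [containing 'Ma Ying-jeou' and 'Tsai Ing-wen'] and learned that similar text often appears before and after both names (similar context), it can infer that "Ma Ying-jeou" and "Tsai Ing-wen" are somehow related things.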

      How to exploit the context?

      • count based
      • prediction based

      count based

      The classic method of this kind is GloVe (Global Vectors): https://nlp.stanford.edu/projects/glove/

      The cost function is exactly the same as in MF and LSA and can be solved with GD. The goal: the more often two words co-occur in the same document (number of co-occurrences), the more similar their word vectors (inner product).


      The more A, the more B ===> \(L = \sum(A-B)^2\)

      If two words \(w_i\) and \(w_j\) frequently co-occur (occur in the same document), \(V(w_i)\) and \(V(w_j)\) would be close to each other.

      • \(V(w_i)\): word vector of word \(w_i\)

      The core idea is similar to MF and LSA:

      \(V(w_i) \cdot V(w_j) \Longleftrightarrow N_{i,j}\)

      \(L = \sum_{i,j}(V(w_i)\cdot V(w_j) - N_{i,j})^2\)

      Find the word vectors \(V(w_i)\) and \(V(w_j)\) that minimize the distance between the inner product of these two word vectors and the number of times \(w_i\) and \(w_j\) occur in the same document.

      The prediction-based approach

      The prediction-based way to get word vectors is to train a single-layer NN that uses \(w_{i-1}\) to predict the occurrence of \(w_i\), with these as the sample's data and label (\(x=w_{i-1}, y=w_i\)), and to take the input of the first hidden layer as the word vector.

      [Note]: Didn't we just say an autoencoder cannot learn anything? That is because the autoencoder's input and output are both the word \(w_i\), i.e. (\(x=w_i, y=w_i\)), whereas the prediction-based NN here uses the 1-of-N encoding of the next word as the label.

      In essence, we train an NN with cross-entropy as the loss function. The NN's input is the 1-of-N encoding of some word, and its output is a (probability) vector of vocabulary-size dimension, giving the probability that each word in the vocabulary is the next word.

      $$ \text{Vocabulary size} = 100{,}000 $$

      $$ \begin{vmatrix} 0 \\ 1 \\ 0 \\ 0 \\ 0 \\.\\.\\. \end{vmatrix}^{R^{100000}} \rightarrow NN \rightarrow \begin{vmatrix} 0.01 \\ 0.73 \\ 0.02 \\ 0.04 \\ 0.11 \\.\\.\\. \end{vmatrix}^{R^{100000}} $$

      • take out the input of the neurons in the 1st layer
      • use it to represent a word \(w\)
      • word vector, word embedding feature: \(V(w)\)

      Once this prediction-based NN is trained, for a new word we can directly multiply its 1-of-N encoding by the weights of the first hidden layer to get that word's word vector.

      $$ x_i = 1-of-N\ encoding\ of\ word\ i $$

      $$ W = weight\ matrix\ of\ NN\ between\ input-layer\ and\ 1st-hidden-layer $$

      $$ WordVector_{x_i} = W^Tx_i $$
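      Because \(x_i\) is a 1-of-N vector, the product \(W^Tx_i\) is just a table lookup. A minimal sketch, assuming \(W\) is stored with shape |V| × |z| (one row per vocabulary word); the weights here are random stand-ins for a trained matrix.

```python
import numpy as np

vocab = ["apple", "bag", "cat", "dog", "elephant"]
V, Z = len(vocab), 2                 # vocabulary size, embedding size

rng = np.random.default_rng(0)
W = rng.normal(size=(V, Z))          # stand-in for trained first-layer weights

x = np.zeros(V)
x[vocab.index("cat")] = 1.0          # 1-of-N encoding of "cat"

word_vector = W.T @ x                # the formula above
print(word_vector, W[vocab.index("cat")])   # identical: row 2 of W
```

      In practice this is why embedding layers are implemented as an index into the weight matrix rather than as a matrix product.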

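      The principle behind the prediction-based approach


      • Ma Ying-jeou 520 sworn into office
      • Tsai Ing-wen 520 sworn into office

      Because both "Ma Ying-jeou" and "Tsai Ing-wen" are followed by "520 sworn into office", in the prediction-based NN we have

      • Ma Ying-jeou 520 sworn into office \((x=\text{'Ma Ying-jeou'}, y=\text{'520 sworn into office'})\)
      • Tsai Ing-wen 520 sworn into office \((x=\text{'Tsai Ing-wen'}, y=\text{'520 sworn into office'})\)

      That is, when the prediction-based NN is given Ma Ying-jeou or Tsai Ing-wen as input, it will "try hard" to project the two different inputs onto the same output. Every subsequent layer then does the same thing, so we can use the input of the first layer as the word vector.

      A better prediction-based NN

      • Use the previous 10 words to predict the next one

      In general we would not use only the previous word to predict the next word (\(w_{i-1} \Rightarrow w_i\)) and extract the 1st layer's input as the word vector; we would use at least the 10 preceding words to predict the next word (\([w_{i-10}, w_{i-9}, ..., w_{i-1}] \Rightarrow w_i\))

      • The same 1-of-N encoding position should use the same weight w across all 10 preceding words


      • "Ma Ying-jeou, when being sworn in, said xxxx"
      • "When being sworn in, Ma Ying-jeou said xxxx"

      [w1: Ma Ying-jeou] + [w2: when being sworn in] => [w3: said xxxx]

      [w1: when being sworn in] + [w2: Ma Ying-jeou] => [w3: said xxxx]

      If different weights were used, [w1: Ma Ying-jeou] + [w2: when being sworn in] and [w1: when being sworn in] + [w2: Ma Ying-jeou] would produce different word vectors.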


      • |V| : vocabulary size, for 1-of-N encoding
      • |z| : word vector size, the dimension of embedding space

      • the lengths of \(x_{i-1}\) and \(x_{i-2}\) are both |V|.

      • the length of \(z\) is |z|
      • \(z = W_1x_{i-2} + W_2x_{i-1}\)
      • the weight matrices \(W_1\) and \(W_2\) are both |z| × |V| matrices
      • \(W_1 = W_2 = W \rightarrow z = W(x_{i-2} + x_{i-1})\)
      • we get \(W\) after the NN is trained
      • a new word's vector is \(wordvector_{w_i} = W(\text{1-of-N encoding}_{w_i})\)

      In practice, how do we enforce weight sharing, i.e. \(W_1 = W_2\)?

      How to make wi equal to wj?

      Give \(w_i\) and \(w_j\) the same initialization and the same update at every step.

      \(w_i^0 = w_j^0\)

      \(w_i \leftarrow w_i - \eta \frac{\partial C}{\partial w_i} - \eta \frac{\partial C}{\partial w_j}\)

      \(w_j \leftarrow w_j - \eta \frac{\partial C}{\partial w_j} - \eta \frac{\partial C}{\partial w_i}\)
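      The tying rule (same start, and each weight updated with the sum of both partial derivatives) can be tried on a toy problem. A minimal sketch with two scalar weights and a made-up cost \(C = (w_i - 1)^2 + (2w_j + 1)^2\); the cost, learning rate and step count are assumptions for illustration.

```python
eta = 0.1
w_i = w_j = 0.5                      # identical initialization

def partials(w_i, w_j):
    """Partial derivatives of the made-up cost C = (w_i - 1)^2 + (2*w_j + 1)^2."""
    return 2.0 * (w_i - 1.0), 4.0 * (2.0 * w_j + 1.0)

for _ in range(100):
    g_i, g_j = partials(w_i, w_j)
    # Both weights receive the same total step, so they can never drift apart.
    w_i = w_i - eta * (g_i + g_j)
    w_j = w_j - eta * (g_j + g_i)

print(w_i, w_j)   # equal, and at the minimizer of the cost with w_i = w_j
```

      Summing the two partial derivatives is exactly the gradient of the cost with respect to one shared parameter, which is how frameworks implement weight sharing.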

      Extensions of the prediction-based approach

      • Continuous bag-of-words (CBOW) model

      \([w_{i-1}, w_{i+1}] \Rightarrow w_i\)

      • Skip-gram

      \(w_i \Rightarrow [w_{i-1}, w_{i+1}]\)

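      Word embeddings for relation inference and question answering


      With word vectors, many relations can be expressed numerically (as vectors)

      • inclusion relations
      • causal relations
      • and so on

      The differences between the word vectors of two classes of words show a kind of parallelism and similarity (similar triangles or polygons), and many applications can be built on this.

      \(WordVector_{China} - WordVector_{Beijing} \parallel WordVector_{Spain} - WordVector_{Madrid}\)

      for \(WordVector_{country} - WordVector_{capital}\), if a word A \(\in\) country, and word B \(\in\) capital of this country, then \(A_0 - B_0 \parallel A_1 - B_1 \parallel A_2 - B_2 \parallel \dots\)


      Question answering follows the same idea: it exploits the fact that word-vector differences are parallel to each other

      \(V(Germany) - V(Berlin) \parallel V(Italy) - V(Rome)\)

      \(vector = V(Berlin) + V(Italy) - V(Rome)\)

      Find the word in the corpus whose word vector is closest to 'vector'.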
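      The lookup can be sketched with hand-made 2-D vectors. The coordinates below are invented so that country minus capital is roughly the same offset for every pair; real embeddings are learned and only approximately parallel.

```python
import numpy as np

# Invented 2-D "word vectors" (illustration only, not trained embeddings).
V = {
    "Germany": np.array([1.0, 3.0]), "Berlin": np.array([1.0, 1.0]),
    "Italy":   np.array([4.0, 3.2]), "Rome":   np.array([4.0, 1.1]),
    "Spain":   np.array([7.0, 3.0]), "Madrid": np.array([7.0, 1.0]),
}

query = V["Berlin"] + V["Italy"] - V["Rome"]

# Search the vocabulary, leaving out the words used to build the query.
candidates = [w for w in V if w not in ("Berlin", "Italy", "Rome")]
answer = min(candidates, key=lambda w: np.linalg.norm(V[w] - query))
print(answer)
```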


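      Borrowing the network principle of the word-embedding prediction-based NN (single layer; neighboring words form one sample; take the first layer's input) enables further applications.

      First obtain Chinese word vectors, then English word vectors. Then (from here on using the prediction-based NN principle) learn an NN that projects Chinese and English word vectors with the same meaning onto the same point in some space. Afterwards, extract the input weights of this network's first hidden layer; they can be used to map other English and Chinese words into that space, and a word's meaning can be obtained from its nearest neighbors.

      Word embedding for images

      This too borrows the network principle of the word-embedding prediction-based NN. It enables extended image recognition: ordinary image recognition can only recognize classes already present in the dataset, whereas with this word-embedding scheme, classes absent from the image dataset (but present in the word dataset) can also be recognized correctly (rather than "calling a deer a horse": if the image dataset had no 'deer', ordinary image recognition would pick the most similar-looking class, i.e. it can only choose the best among the given classes).

      First build word vectors from large amounts of text, then train an NN whose input is an image and whose output is a vector of the same dimension (as the word vectors), such that the NN projects an image (e.g. a dog image) of the same type as a word (e.g. 'dog') near, or even exactly onto, that word's word vector.

      Then, given a new image such as 'deer.img', the NN will project it to some point near "deer" in the word vector space.

      Embedding documents

      1. word sequences with different lengths -> the vector with the same length
      2. the vector representing the meaning of the word sequence

      How do we turn a document into a vector?

      First represent the document as a bag of words (a vector), then feed it into an autoencoder; the autoencoder's bottleneck layer is the document's embedding vector.

      The reasoning here is the same as for why an autoencoder cannot be used for word embedding. For word embedding, the autoencoder faces completely independent 1-of-N encodings, which have no regularity at all, so it cannot learn anything; with no direct regularity, we look for an indirect one and judge a word's semantics by learning its context.

      Here the autoencoder faces a document (bag of words). A bag of words only contains word counts (roughly a vector like \([22, 1, 879, 53, 109, ....]\)). Both bag-of-words (for documents) and 1-of-N encoding (for words) are semantically poor encodings, so an NN cannot extract valuable information from them.


      • 1-of-N encoding lacks the relationships between words. What is lacking is what we use the NN to predict: we discover (construct) some form of data pair (x -> y) that can represent the "relationship" and use it to train an NN. The by-product is the hidden-layer weights, a function (or matrix) that projects a word to a word vector (a meaningful encoding).

      • bag-of-words encoding lacks word order (another kind of relationship), using (Professor Hung-yi Lee did not spell this out and only listed references)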

      white blood cells destroying an infection

      an infection destroying white blood cells

      They have the same bag of words but very different meanings.
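      The two sentences above can be checked directly: counting words discards their order, so both map to the same bag. A small sketch using whitespace tokenization, which is an assumption made for simplicity.

```python
from collections import Counter

a = "white blood cells destroying an infection"
b = "an infection destroying white blood cells"

bag_a = Counter(a.split())   # word -> count, order discarded
bag_b = Counter(b.split())

print(bag_a == bag_b)        # same bag of words
print(a == b)                # different sentences
```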

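    4. Thoughts on word embedding

      Word embedding: dimension reduction, creating relationships, unsupervised


    5. On transductive learning and inductive learning

      • transfer learning
      • transductive learning
      • inductive learning
      • unsupervised learning
      • self-taught learning
      • multi-task learning



      The typical transductive learning algorithm is KNN:

      1. Initialize K (a hyperparameter of your choice) center points.
      2. A new data point is used directly to compute its distance to each center point.
      3. The center point with the smallest of these distances is taken as the new data point's "cluster".
      4. Recompute the center point of that cluster (the cluster that the new point joined).


      We use unlabeled data as the test data, and use it to determine the new center points (the new clusters).


      Transductive learning: unlabeled data is the testing data.

      The typical inductive learning algorithm is Bayes: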

      \(P(y|x) = \frac{P(x|y)P(y)}{P(x)=\sum^K_{i=1}{P(x|y_i)P(y_i)}}\)

      Using a count-based method such as Naive Bayes (e.g. \(P([1,3,9,0] | y_1)=P(1|y_1)P(3|y_1)P(9|y_1)P(0|y_1)\)) or maximum likelihood (see the lec5 notes), first compute:

      Then plug these into Bayes' formula to obtain a model formula.

      \(P(y|x) = \frac{P(x|y)P(y)}{P(x)=\sum^K_{i=1}{P(x|y_i)P(y_i)}}\)

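      So the biggest difference between inductive and transductive learning is that the former yields a general model formula, while the latter has no model formula to use. For inductive learning, a new data point can simply be plugged into the model formula; for transductive learning, the whole computation has to be rerun every time a new point arrives.

      On the general model formula, Professor Hung-yi Lee mentions in lec5 (Logistic Regression and Generative Model):

      A Bayes model will "imagine" data that is not in the dataset.


      Transductive learning, by contrast, is an algorithm aimed at a specific problem domain.


      In inductive learning, the learner tries to exploit the unlabeled examples by itself, i.e. the whole learning process needs no human intervention and relies only on the learner's own use of the unlabeled examples.

      Where transductive learning differs from inductive learning:

      transductive learning assumes that the unlabeled examples are exactly the test examples



      Inductive learning considers an "open world": at learning time we do not know which examples will have to be predicted. Transductive learning considers a "closed world": at learning time we already know which examples need to be predicted.

      In fact, the idea of transductive learning comes directly from statistical learning theory [Vapnik], and some scholars regard it as the most important contribution of statistical learning theory to machine learning. Its starting point is: do not solve a relatively simple problem by way of solving a harder one. V. Vapnik argues that classical inductive learning aims to learn a decision function with low error over the entire example distribution, which actually complicates the problem, because in many cases people do not care how the decision function performs over the whole distribution and only want the best performance on the given examples to be predicted. The latter is simpler than the former, so the test examples can be considered explicitly during learning, making the goal easier to reach. This idea is still debated in the machine learning community.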

    6. Matrix Factorization

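      Sometimes you have two kinds of objects that are governed by some common latent factors.



      If these two vectors are similar, the two match, which produces [liking].


      • Given: the two vectors \(r_{person}\) and \(r_{idiom}\)
      • Find: the degree of match (liking) between a person and an anime character.


      • Given: the degree of match (liking) between people and anime characters
      • Find: the vectors \(r_{person}\) and \(r_{idiom}\)

      [Note] The number of latent factors is found by trial and error; we do not know it in advance.

      For example, latent attribute = ['tsundere', 'airheaded']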

      person A : \(r^A = [0.7, 0.3]\)

      character 1: \(r^1 = [0.7, 0.3]\)

      \(r^A \cdot r^1 = 0.58\)

      • # of Otaku = M
      • # of characters = N
      • # of latent factor(a vector) = K (means vector r is K dim)

      $$ X \in R^{M*N} \approx \begin{vmatrix} r^A \\ r^B \\r^C\\.\\.\\. \end{vmatrix}^{M*K} * \begin{vmatrix} r^1 & r^2 & r^3 &.&.&. \end{vmatrix}^{K*N} $$

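      This problem can be solved directly with SVD. Although SVD produces three matrices, you can merge the middle matrix into either the first or the last one as appropriate.

      MF with missing values: gradient descent

      The above is the ideal case: every value in table X is present.

      In reality, table X usually has gaps, with many missing values. What should we do then?

      Use GD, with the objective function: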

      \(L = \sum_{(i,j)}(r^i \cdot r^j - n_{ij})^2\)

      Minimizing this objective function gives r.

      These r can then be used to compute every value in the table.
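      A minimal sketch of this procedure in NumPy: plain gradient descent on the squared error over observed entries only, then a full table from the learned vectors. The ratings matrix, the use of 0 as a "missing" marker, K = 2, the learning rate and the step count are all illustrative assumptions.

```python
import numpy as np

# Hypothetical otaku-by-character table; 0 marks a missing entry.
X = np.array([[5, 3, 0, 1],
              [4, 3, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4],
              [0, 1, 5, 4]], dtype=float)
observed = X > 0

rng = np.random.default_rng(0)
K = 2                                            # number of latent factors
P = 0.1 * rng.normal(size=(X.shape[0], K))       # one r^i per person
Q = 0.1 * rng.normal(size=(X.shape[1], K))       # one r^j per character

def loss(P, Q):
    """Sum of squared errors over the observed entries only."""
    return float((((P @ Q.T - X) * observed) ** 2).sum())

losses = [loss(P, Q)]
lr = 0.005
for _ in range(5000):
    E = (P @ Q.T - X) * observed                 # errors, zero where missing
    grad_P, grad_Q = 2.0 * E @ Q, 2.0 * E.T @ P
    P, Q = P - lr * grad_P, Q - lr * grad_Q
losses.append(loss(P, Q))

X_filled = P @ Q.T                               # predictions for every cell
print(losses[0], losses[-1])
print(X_filled.round(2))
```

      The entries of `X_filled` at the missing positions are the model's guesses; adding the bias terms or a regularizer only changes `E` and the gradients.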

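      More about MF

      MF's scoring function (the function that computes table X, which so far we have assumed to be the degree of match between two latent vectors) can take more factors into account. It need not represent only the degree of match.

      From \(r^A \cdot r^1 \approx 5\) to the more precise \(r^A \cdot r^1 + b_A + b_1\approx 5\)

      • \(b_A\): how interested this person is in anime
      • \(b_1\): how heavily this anime is promoted

      The new GD objective then becomes:

      \(L = \sum_{(i,j)}(r^i \cdot r^j + b_i + b_j - n_{ij})^2\)

      We can also add L1 regularization, e.g. if \(r^i, r^j\) are sparse: preferences are clear-cut, either airheaded or tsundere, with no fuzzy preferences.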

      MF for Topic analysis

      MF applied to semantic analysis is called LSA (latent semantic analysis):

      • character -> document
      • otakus -> word
      • table item -> term frequency of word in this document

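      Note: when doing LSA we usually add one more step: term frequency is always weighted by inverse document frequency. This step is called TF-IDF.

      That is, the \(n_{ij}\) used in \(L = \sum_{(i,j)}(r^i \cdot r^j + b_i + b_j - n_{ij})^2\) is not the raw number of occurrences of a word in a document, but that count multiplied by the reciprocal of the number of documents containing the word, i.e.

      \(n_{ij} = TF \times IDF = \frac{TF}{DF}\), where \(DF\) is the number of documents containing the word

      Then, when we find the latent vectors by GD, each component of these vectors represents a topic (finance, politics, entertainment, games, etc.)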

    7. From components (PCA) to autoencoder

      From the conclusion obtained earlier through SVD matrix factorization: u = w

      and the formula: \(\hat{x} = \sum_{k=1}^Kc^kw^k \approx x-\bar{x}\)

      together with some linear algebra, we get that the c minimizing the reconstruction error is:

      \(c^k = (x-\bar{x})\cdot w^k\)

      Combining these two formulas, we can find an autoencoder structure:

      1. \((x-\bar{x})\cdot w^k = c^k\)
      2. \(\sum_{k=1}^Kc^kw^k = \hat{x}\)

      $$ \begin{vmatrix} x_1 - \bar{x} \\ x_2 - \bar{x} \\ .\\.\\. \end{vmatrix} \Rightarrow \begin{vmatrix} c^1 \\ c^2 \end{vmatrix} \Rightarrow \begin{vmatrix} \hat{x}_1 \\ \hat{x}_2 \\ .\\.\\. \end{vmatrix} $$
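      The encoder/decoder pair can be written in a few lines when the weights are taken from PCA itself instead of being learned by gradient descent. A minimal sketch on random, hypothetical data with K = 2.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3))   # hypothetical data
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Eigenvectors of the covariance matrix, largest eigenvalue first.
vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
vecs = vecs[:, ::-1]

K = 2
W = vecs[:, :K]          # columns are w^1 ... w^K

C = Xc @ W               # encoder: c^k = (x - x_bar) . w^k
X_hat = C @ W.T          # decoder: x_hat = sum_k c^k w^k

err_k2 = float(((Xc - X_hat) ** 2).sum())
err_k1 = float(((Xc - Xc @ vecs[:, :1] @ vecs[:, :1].T) ** 2).sum())
print(err_k1, err_k2)    # reconstruction error shrinks as K grows
```

      The encoder and decoder share the same matrix `W`, and its columns are orthonormal; a linear autoencoder trained by gradient descent reaches the same subspace but has no reason to return orthonormal columns.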

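      Drawback of the autoencoder: it cannot get the exact mathematical solution that PCA gets

      Such a linear autoencoder can be solved with gradient descent, but its solution can only get arbitrarily close to the PCA solution obtained via SVD or Lagrange multipliers; it is not guaranteed to coincide with it. The W matrix from PCA has mutually orthogonal columns, whereas the autoencoder does find a solution but cannot guarantee that the columns of its parameter matrix W are orthogonal.

      Advantage of the autoencoder: it can be deep, forming a nonlinear function, which is more powerful on complex problems

      PCA can only squash, it cannot straighten

      As shown below, PCA is helpless on this kind of manifold (rolled-up surface) data. It can only squash the data together along some direction.

      PCA: weakness

      A deep autoencoder, by contrast, can handle this kind of complex dimension reduction problem:

      PCA vs. autoencoder

      How many principal components?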



      \(eigen\ ratio\ = \frac{\lambda_i}{\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 + \lambda_6}\)

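      • \(\lambda\): eigenvalue of the Cov(x) matrix

      The conclusion given when we solved the PCA objective was:

      The columns of W are the eigenvectors of \(S=Cov(x)\) corresponding to its top K eigenvalues.

      The physical meaning of an eigenvalue is the variance of the dataset along that dimension of the reduced z-space.

      Before deciding the dimension of the z-space, we introduce a metric: the eigen ratio. What is it for? It helps us estimate how many of the leading eigenvectors are appropriate.

      Suppose we initially want to reduce to 6 dimensions. With the method learned earlier we get 6 eigenvectors and 6 eigenvalues, and hence 6 eigen ratios.

      From these 6 eigen ratios we can see which ones contribute very little variance (while our goal is to find the largest Variance(z)). An eigen ratio that is too small means that, after projection onto that dimension, all points are squeezed together; the dimension does not discriminate between points, i.e. it provides little useful information.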
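      A minimal sketch of computing eigen ratios and using them to pick the number of components. The synthetic 6-dimensional data and the 1% cut-off are assumptions for illustration; the notes do not prescribe a threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 6 features with very different spreads.
scales = np.array([5.0, 3.0, 2.0, 1.0, 0.1, 0.05])
X = rng.normal(size=(500, 6)) * scales

# Eigenvalues of Cov(x), largest first.
vals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
ratio = vals / vals.sum()          # eigen ratio of each component

threshold = 0.01                   # assumed cut-off, tune per problem
keep = int((ratio > threshold).sum())

print(ratio.round(4))
print("components worth keeping:", keep)
```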

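      Not mentioned before: a component has the same dimension as the original sample x, because you can view a component as a combination of the original dimensions (strokes <- pixels).

      Take Pokémon as an example: if the original 6-dimensional Pokémon data is reduced to 4 dimensions, the rough physical meanings of these 4 components are:

      1. 1st_component: overall strength; its coefficients on all 6 original dimensions are positive.
      2. 2nd_component: defense (at the expense of speed); among the original dimensions 'Def' is highest and 'Speed' lowest (negative).
      3. 3rd_component: special defense (at the expense of attack and HP); among the original dimensions 'Sp Def' is highest, 'HP' and 'Atk' lowest (both negative).
      4. 4th_component: HP (at the expense of attack and defense); among the original dimensions 'HP' is highest, 'Atk' and 'Def' lowest (both negative).

      PCA 'components' as seen in practical applications

      A 'component' is not necessarily a 'part'.

      • Reducing the dimension of handwritten-digit images

      • Reducing the dimension of face images



      We have kept emphasizing that a component seems to be a kind of part-versus-whole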