216 Matching Annotations
  1. Apr 2024
    1. Machine learning is acknowledged to have originated with the work of McCulloch and Pitts (1943). They recognised that brain signals are digital in nature, more specifically binary signals. According to Chakraborty and Joseph (2017) each ML system comprises five components: (1) a problem, (2) data source, (3) a model, (4) an optimization algorithm and (5) validation and testing. ML is best suited for situations that require extracting patterns from noisy data or sensory perception, or a data-up approach.

      “Benford’s Law” is one of the simplest ways to detect fraud. It is accomplished by running an analysis on the first digits in a given set of data. A predictable distribution of first digits will exist in a set of “real” data. Benford’s Law has existed since the late 1800s. AI is beneficial here because ML algorith....
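
The first-digit analysis described in this note can be sketched as follows. Under Benford's Law the expected share of leading digit d is log10(1 + 1/d). The function names and the sample data (powers of 2, which are known to follow the law closely) are assumptions chosen for illustration; they do not come from the annotated article.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's Law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit_shares(values):
    """Observed share of each leading digit among the non-zero values."""
    # Strip leading zeros and the decimal point so 0.0042 yields digit 4.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts.get(d, 0) / len(digits) for d in range(1, 10)}


# Illustrative data set: the first 200 powers of 2.
sample = [2 ** n for n in range(1, 201)]
expected = benford_expected()
observed = first_digit_shares(sample)

for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

A real fraud check would compare `observed` against `expected` with a statistical test; large gaps on several digits are a reason to look closer, not proof of fraud.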

  2. Mar 2024
  3. Feb 2024
    1. Constructing Prompts for the Command Model: Techniques for constructing prompts for the Command model.
    1. Now, let’s modify the prompt by adding a few examples of how we expect the output to be.

          # generate_text is the source tutorial's helper; it is not defined in this excerpt.
          user_input = "Send a message to Alison to ask if she can pick me up tonight to go to the concert together"

          prompt = f"""Turn the following message to a virtual assistant into the correct action:

          Message: Ask my aunt if she can go to the JDRF Walk with me October 6th
          Action: can you go to the jdrf walk with me october 6th

          Message: Ask Eliza what should I bring to the wedding tomorrow
          Action: what should I bring to the wedding tomorrow

          Message: Send message to supervisor that I am sick and will not be in today
          Action: I am sick and will not be in today

          Message: {user_input}"""

          response = generate_text(prompt, temp=0)
          print(response)

      This time, the style of the response is exactly how we want it.

          Can you pick me up tonight to go to the concert together?
    2. But we can also get the model to generate responses in a certain format. Let’s look at a couple of them: markdown tables
    3. And here’s the same request to the model, this time with the product description of the product added as context.

          # generate_text is the source tutorial's helper; it is not defined in this excerpt.
          context = """Think back to the last time you were working without any distractions in the office. That's right...I bet it's been a while. \
          With the newly improved CO-1T noise-cancelling Bluetooth headphones, you can work in peace all day. Designed in partnership with \
          software developers who work around the mayhem of tech startups, these headphones are finally the break you've been waiting for. With \
          fast charging capacity and wireless Bluetooth connectivity, the CO-1T is the easy breezy way to get through your day without being \
          overwhelmed by the chaos of the world."""

          user_input = "What are the key features of the CO-1T wireless headphone"

          prompt = f"""{context}
          Given the information above, answer this question: {user_input}"""

          response = generate_text(prompt, temp=0)
          print(response)

      Now, the model accurately lists the features of the model.

          The answer is: The CO-1T wireless headphones are designed to be noise-canceling and Bluetooth-enabled. They are also designed to be fast charging and have wireless Bluetooth connectivity.
    4. While LLMs excel in text generation tasks, they struggle in context-aware scenarios. Here’s an example. If you were to ask the model for the top qualities to look for in wireless headphones, it will duly generate a solid list of points. But if you were to ask it for the top qualities of the CO-1T headphone, it will not be able to provide an accurate response because it doesn’t know about it (CO-1T is a hypothetical product we just made up for illustration purposes). In real applications, being able to add context to a prompt is key because this is what enables personalized generative AI for a team or company. It makes many use cases possible, such as intelligent assistants, customer support, and productivity tools, that retrieve the right information from a wide range of sources and add it to the prompt.
    5. We set a default temperature value of 0, which nudges the response to be more predictable and less random. Throughout this chapter, you’ll see different temperature values being used in different situations. Increasing the temperature value tells the model to generate less predictable responses and instead be more “creative.”
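
The effect of temperature described in the last highlight can be illustrated with a temperature-scaled softmax. This is the standard textbook formulation, shown as a sketch; how the Command model applies temperature internally is not stated in the source, and the logits below are made-up scores for three candidate tokens.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; low temperature sharpens them."""
    if temperature <= 0:
        # temp=0 is usually handled as "always pick the top token" instead.
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # hypothetical scores, not from the source
low = softmax_with_temperature(logits, 0.2)   # close to deterministic
high = softmax_with_temperature(logits, 2.0)  # flatter, more "creative"
print([round(p, 3) for p in low])
print([round(p, 3) for p in high])
```

With the low temperature almost all probability mass sits on the top-scoring token, which is why sampling becomes more predictable; the high temperature spreads mass across the alternatives.
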
  4. Oct 2023
    1. racialized social hierarchies, thus facilitating domination and exploitation.

      We also talked about that in Traditions and Revolutions

  5. Sep 2023
    1. they do not know enough about the topic at hand or because, they say, they simply are not “smart enough.”

      I find myself saying this sometimes too

    2. BLEND THE AUTHOR’S WORDS WITH YOUR OWN

      Important to use a quotation for the essay we have to write

    3. VERBS FOR MAKING A CLAIM

      This will be very useful for writing my essays

    4. his ability to enter complex, many-sided conversations has taken on a special urgency in today’s polarized red state / blue state America,

      Politics in the United States

    5. Letter from Birmingham Jail,

      Learned about this in AP Gov; it is an important document in the Civil Rights Movement.

    6. If you have been taught to write a traditional five-paragraph essay, for example, you have learned how to develop a thesis and support it with evidence.

      What I was taught throughout my high school years

    7. Less experienced writers, by contrast, are often unfamiliar with these basic moves and unsure how to make them in their own writing.

      More reading done means more experience, and better writing.

    8. STATE YOUR OWN IDEAS AS A RESPONSE TO OTHERS

      I think this is a good way for essays to be written, and it seems like I write a lot of essays that require this format.

    9. once you mastered it you no longer had to give much conscious thought to the various moves that go into doing it.

      This is very true in my life, a lot of things come as second nature such as brushing my teeth and driving.

  6. moodle.lynchburg.edu
    1. “Why don’t you do something about it?’

      I think this goes to a lot of things in life. A lot of people say this and say that, but none of them ever do anything

  7. moodle.lynchburg.edu
    1. for their knowledge of their own ignorance.

      Rare ability to be aware of being ignorant, and we all struggle with it.

    2. The father was a quiet, simple soul, calmly ignorant, with no touch of vulgarity. The mother was different: strong, bustling, and energetic, with a quick, restless tongue, and an ambition to live “like folks.”

      This is very similar to my household, but I know it is not the norm in most.

    1. it is easier to do ill than well in the world

      I think this relates to everyone's life. It is often a lot harder to do the right thing that helps the world than it is to do the easy thing that hurts it.

  8. Jun 2023
    1. We use the same model and architecture as GPT-2

      What do they mean by "model" here? If they have retrained on more data, with a slightly different architecture, then the model weights after training must be different.

  9. May 2023
  10. Apr 2023
    1. Now we are getting somewhere. At this point, we also see that the dimensions of W and b for each layer are specified by the dimensions of the inputs and the number of nodes in each layer. Let’s clean up the above diagram by not labeling every w and b value individually.
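
The point about dimensions in this highlight can be sketched in a few lines: each layer's weight matrix is sized by the number of nodes in the layer and the number of inputs it receives. The convention that W has shape (nodes, inputs) is an assumption here; the source article may use the transpose. The function name and the 3-4-2 network are hypothetical.

```python
def layer_shapes(n_inputs, nodes_per_layer):
    """Return [(W_shape, b_shape), ...] for a fully connected network.

    Assumed convention: W has shape (nodes, inputs), b has shape (nodes,).
    """
    shapes = []
    fan_in = n_inputs
    for nodes in nodes_per_layer:
        shapes.append(((nodes, fan_in), (nodes,)))
        fan_in = nodes  # this layer's outputs feed the next layer
    return shapes


# Hypothetical network: 3 inputs, a hidden layer of 4 nodes, 2 output nodes.
shapes = layer_shapes(3, [4, 2])
for index, (w_shape, b_shape) in enumerate(shapes, start=1):
    print(f"layer {index}: W {w_shape}, b {b_shape}")
```
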
    1. The Delta Method, from the field of nonlinear regression. The Bayesian Method, from Bayesian modeling and statistics. The Mean-Variance Estimation Method, using estimated statistics. The Bootstrap Method, using data resampling and developing an ensemble of models.

      Four methods to compute prediction intervals.

    1. A novel method for estimating prediction uncertainty using machine learning techniques is presented. Uncertainty is expressed in the form of the two quantiles (constituting the prediction interval) of the underlying distribution of prediction errors. The idea is to partition the input space into different zones or clusters having similar model errors using fuzzy c-means clustering. The prediction interval is constructed for each cluster on the basis of empirical distributions of the errors associated with all instances belonging to the cluster under consideration and propagated from each cluster to the examples according to their membership grades in each cluster. Then a regression model is built for in-sample data using computed prediction limits as targets, and finally, this model is applied to estimate the prediction intervals (limits) for out-of-sample data. The method was tested on artificial and real hydrologic data sets using various machine learning techniques. Preliminary results show that the method is superior to other methods estimating the prediction interval. A new method for evaluating performance for estimating prediction interval is proposed as well.

      Prediction intervals using quantiles. Use clustering.

  11. Feb 2023
    1. the Elhage et al. (2021) study showing an information-copying role for self-attention.

      It turns out Meng does refer to induction heads, just not by name.

  12. Jan 2023
    1. The two areas in which the forward-forward algorithm may be superior to backpropagation are as a model of learning in cortex and as a way of making use of very low-power analog hardware without resorting to reinforcement learning (Jabri and Flower, 1992).
  13. Dec 2022
    1. Our method is based on the hypothesis that the weights of a generator act as Optimal Linear Associative Memory (OLAM). OLAM is a classic single-layer neural data structure for memorizing associations that was described by Teuvo Kohonen and James A Anderson (independently) in the 1970s. In our case, we hypothesize that within a large modern multilayer convolutional network, each individual layer plays the role of an OLAM that stores a set of rules that associates keys, which denote meaningful context, with values, which determine output.
    1. AI training data is filled with racist stereotypes, pornography, and explicit images of rape, researchers Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe found after analyzing a data set similar to the one used to build Stable Diffusion.

      That is horrifying. You'd think that authors would attempt to remove or filter this kind of material. There are, after all, models out there that are trained to find it. It makes me wonder what awful stuff is in the GPT-3 dataset too.

    1. We test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4× more data. Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a large range of downstream evaluation tasks. This also means that Chinchilla uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage. As a highlight, Chinchilla reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, greater than a 7% improvement over Gopher.

      By using more data on a smaller language model the authors were able to achieve better performance than with the larger models - this reduces the cost of using the model for inference.
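
A rough arithmetic check of the "same compute budget" claim, using the common rule of thumb that training compute is about 6 × parameters × tokens. That rule and the baseline token count are assumptions brought in for illustration; the quoted abstract gives only the parameter counts and the 4× data factor.

```python
def training_flops(params, tokens):
    """Rough training compute via the C ~ 6 * N * D rule of thumb (an assumption)."""
    return 6 * params * tokens


GOPHER_PARAMS = 280_000_000_000      # 280B, from the quoted abstract
CHINCHILLA_PARAMS = 70_000_000_000   # 70B, from the quoted abstract
BASE_TOKENS = 300_000_000_000        # illustrative baseline, not from the abstract

gopher_flops = training_flops(GOPHER_PARAMS, BASE_TOKENS)
chinchilla_flops = training_flops(CHINCHILLA_PARAMS, 4 * BASE_TOKENS)

# A quarter of the parameters with four times the data costs the same to train...
print(gopher_flops == chinchilla_flops)
# ...but each generated token touches 4x fewer weights at inference time.
print(GOPHER_PARAMS // CHINCHILLA_PARAMS)
```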

  14. Nov 2022
    1. there are curation filters on the recipient side, but then e-mail spam would also be a solved problem, and I don't see that happening just yet.

      Are there projects that train models on collected spam emails?

    1. “The metaphor is that the machine understands what I’m saying and so I’m going to interpret the machine’s responses in that context.”

      Interesting metaphor for why humans are happy to trust outputs from generative models

  15. Sep 2022
    1. Consider a toy model where we train an embedding of five features of varying importance (where “importance” is a scalar multiplier on mean squared error loss) in two dimensions, add a ReLU afterwards for filtering, and vary the sparsity of the features.
    1. The present generation of Southerners are not responsible for the past

      We can't judge or blame people based off of their ancestors' actions. In high school, I always hated that everyone knew my older siblings because it often felt like my future was already written for me even though I had not even experienced it myself yet.

    2. Haytian revolt

      We briefly touched on this in Traditions/Revolutions, and I know we will learn more about it later on in the course.

    3. his educational programme was unnecessarily narrow.

      When I was first annotating "The Education of the Negro," I also found Washington's idea of teaching industrial education singularly focused. However, towards the end of his article he made me come around to the idea because it seemed like a good way to instill a desire in students to work for themselves instead of someone else.

    4. the Free Negroes from 1830 up to war-time had striven to build industrial schools, and the American Missionary Association had from the first taught various trades; and Price and others had sought a way of honorable alliance with the best of the Southerners. But Mr. Washington first indissolubly linked these things; he put enthusiasm, unlimited energy, and perfect faith into his programme, and changed it from a by-path into a veritable Way of Life

      ML: He was not the first to come up with the idea, obviously, but he put a face on it. It seems like people, myself included, have a much easier time following something if there is a person in charge of it for them to follow.

    1. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life. They do not put into their hands the tools they are best fitted to use, and hence so many failures. Many a mother and sister have worked and slaved, living upon scanty food, in order to give a son and brother a ‘liberal education,’ and in doing this have built up a barrier between the boy and the work he was fitted to do. Let me say to you that all honest work is honorable work. If the labor is manual, and seems common, you will have all the more chance to be thinking of other things, or of work that is higher and brings better pay, and to work out in your minds better and higher duties and responsibilities for yourselves, and for thinking of ways by which you can help others as well as yourselves, and bring them up to your own higher level.

      I still see this in our school systems today, especially in certain classes where you feel like you are never going to use anything that you have learned in the real world.

    3. Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      When I was in high school, my mom would always say that they don't teach us some of the most important life skills in class. She was always ranting about how we should have to take a finance class to prepare for adulthood.

    4. “Our schools teach everybody a little of almost everything, but, in my opinion, they teach very few children just what they ought to know in order to make their way successfully in life.

      This is still accurate for schools today. For example, in middle school we had 8 classes a day for 45 minutes each for one semester. Even though we had class every day, it was far too little time to actually learn a full subject. The teacher could only give us a little bit of information on each topic we were supposed to cover.

  16. moodle.lynchburg.edu
    1. Uncle Bird had a small, rough farm, all woods and hills, miles from the big road; but he was full of tales

      My uncles are also full of tales that they like to share with everyone they have the chance to.

    2. willow

      I named my Jeep Willow.

    1. Now, the progression of NLP, as discussed, tells a story. We begin with tokens and then build representations of these tokens. We use these representations to find similarities between tokens and embed them in a high-dimensional space. The same embeddings are also passed into sequential models that can process sequential data. Those models are used to build context and, through an ingenious way, attend to parts of the input sentence that are useful to the output sentence in translation.
    2. Data, matrix multiplications, repeated and scaled with non-linear switches. Maybe that simplifies things a lot, but even today, most architectures boil down to these principles. Even the most complex systems, ideas, and papers can be boiled down to just that:
  17. Aug 2022
  18. Jul 2022
    1. Z-code models to improve common language understanding tasks such as name entity recognition, text summarization, custom text classification and key phrase extraction across its Azure AI services. But this is the first time a company has publicly demonstrated that it can use this new class of Mixture of Experts models to power machine translation products.

      This class of model is what Z-code actually is, and what makes it special.

    2. have developed called Z-code, which offer the kind of performance and quality benefits that other large-scale language models have but can be run much more efficiently.

      can do the same but much faster

  19. Jun 2022
    1. The dominant idea is one of attention, by which a representation at a position is computed as a weighted combination of representations from other positions. A common self-supervision objective in a transformer model is to mask out occasional words in a text. The model works out what word used to be there. It does this by calculating from each word position (including mask positions) vectors that represent a query, key, and value at that position. The query at a position is compared with the key at every position to calculate how much attention to pay to each position; based on this, a weighted average of the values at all positions is calculated. This operation is repeated many times at each level of the transformer neural net, and the resulting value is further manipulated through a fully connected neural net layer and through use of normalization layers and residual connections to produce a new vector for each word. This whole process is repeated many times, giving extra layers of depth to the transformer neural net. At the end, the representation above a mask position should capture the word that was there in the original text: for instance, committee as illustrated in Figure 1.
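
The attention step in this highlight can be sketched as single-head scaled dot-product attention in plain Python. In the standard formulation each query is scored against the keys, and the resulting weights average the values. The toy vectors and helper names are assumptions for illustration, and the sketch leaves out the learned projections, multiple heads, residual connections, and normalization that the passage also mentions.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Score this position's query against the key at every position.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors at all positions.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
        )
    return outputs


# Toy example: the query points toward the first key, so the output
# is dominated by the first value vector.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
out = attention([[5.0, 0.0]], keys, values)
print([round(x, 3) for x in out[0]])
```
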
    1. This trick of using a one-hot vector to pull out a particular row of a matrix is at the core of how transformers work.

      Matrix multiplication as table lookup
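
A tiny sketch of the lookup trick: multiplying a one-hot row vector by a matrix returns exactly the selected row. The 3×2 "embedding" table and the helper name are made up for illustration.

```python
def vector_times_matrix(vector, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    n_cols = len(matrix[0])
    return [
        sum(vector[r] * matrix[r][c] for r in range(len(matrix)))
        for c in range(n_cols)
    ]


# Hypothetical embedding table: one row per vocabulary item.
embedding = [
    [0.1, 0.2],
    [0.3, 0.4],
    [0.5, 0.6],
]
one_hot = [0, 1, 0]  # selects the second row

row = vector_times_matrix(one_hot, embedding)
print(row == embedding[1])
```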

  20. May 2022
    1. Given the complexities of the brain’s structure and the functions it performs, any one of these models is surely oversimplified and ultimately wrong—at best, an approximation of some aspects of what the brain does. However, some models are less wrong than others, and consistent trends in performance across models can reveal not just which model best fits the brain but also which properties of a model underlie its fit to the brain, thus yielding critical insights that transcend what any single model can tell us.
    1. Such a highly non-linear problem would clearly benefit from the computational power of many layers. Unfortunately, back-propagation learning generally slows down by an order of magnitude every time a layer is added to a network.

      The problem in 1988

    1. The source sequence will be passed to the TransformerEncoder, which will produce a new representation of it. This new representation will then be passed to the TransformerDecoder, together with the target sequence so far (target words 0 to N). The TransformerDecoder will then seek to predict the next words in the target sequence (N+1 and beyond).
  21. Apr 2022
    1. Our pre-trained network is nearly identical to the “AlexNet” architecture (Krizhevsky et al., 2012), but with local response normalization layers after pooling layers following (Jia et al., 2014). It was trained with the Caffe framework on the ImageNet 2012 dataset (Deng et al., 2009)
    1. Example 1. For example, suppose that the input volume has size [32x32x3], (e.g. an RGB CIFAR-10 image). If the receptive field (or the filter size) is 5x5, then each neuron in the Conv Layer will have weights to a [5x5x3] region in the input volume, for a total of 5*5*3 = 75 weights (and +1 bias parameter). Notice that the extent of the connectivity along the depth axis must be 3, since this is the depth of the input volume. Example 2. Suppose an input volume had size [16x16x20]. Then using an example receptive field size of 3x3, every neuron in the Conv Layer would now have a total of 3*3*20 = 180 connections to the input volume. Notice that, again, the connectivity is local in 2D space (e.g. 3x3), but full along the input depth (20).

      These two examples are the first two layers of Andrej Karpathy's wonderful working ConvNetJS CIFAR-10 demo here

    1. input (32x32x3): max activation 0.5, min -0.5; max gradient 1.08696, min -1.53051. conv (32x32x16): filter size 5x5x3, stride 1; max activation 3.75919, min -4.48241; max gradient 0.36571, min -0.33032; parameters: 16x5x5x3+16 = 1216

      The dimensions of these first two layers are explained here
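
The weight counts quoted in the two highlights above can be reproduced with a couple of helpers. The function names are hypothetical; the numbers checked are the ones stated in the highlights (75, 180, and 1216).

```python
def weights_per_neuron(filter_h, filter_w, in_depth):
    """Connections from one conv neuron to its local input region (bias excluded)."""
    return filter_h * filter_w * in_depth


def conv_layer_parameters(n_filters, filter_h, filter_w, in_depth):
    """Total learnable parameters: shared weights plus one bias per filter."""
    return n_filters * (weights_per_neuron(filter_h, filter_w, in_depth) + 1)


print(weights_per_neuron(5, 5, 3))         # Example 1: 5*5*3
print(weights_per_neuron(3, 3, 20))        # Example 2: 3*3*20
print(conv_layer_parameters(16, 5, 5, 3))  # demo's first conv layer: 16x5x5x3+16
```

The total depends only on the filter size, input depth, and number of filters, not on the 32x32 spatial extent, because the filter weights are shared across positions.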

    1. Here the lower level layers are frozen and are not trained, only the new classification head will update itself to learn from the features provided from the pre-trained chopped up model on the left.
    1. Starting from random noise, we optimize an image to activate a particular neuron (layer mixed4a, unit 11).

      And then we use that image as a kind of variable name to refer to the neuron in a way that is more helpful than the layer number and neuron index within the layer. This explanation is via one of Chris Olah's YouTube videos (https://www.youtube.com/watch?v=gXsKyZ_Y_i8)

  22. Mar 2022
    1. A special quality of humans, not shared by evolution or, as yet, by machines, is our ability to recognize gaps in our understanding and to take joy in the process of filling them in. It is a beautiful thing to experience the mysterious, and powerful, too.
  23. Feb 2022
    1. Relational Machine Learning methods, which exploit the graph structure and in many cases yield higher-quality models.

      Relational Machine Learning approach

    2. In many applications, however, it is necessary not only to provide data in high quality and semantically enriched form, but also to generate new knowledge from existing information. We use machine learning for this.

      Combination with ML approaches to generate new knowledge

    1. Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.
  24. Dec 2021
    1. the only thing an artificial neuron can do: classify a data point into one of two kinds by examining input values with weights and bias.

      How does this relate to "weighted sum shows similarity between the weights and the inputs"?
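
One possible way to connect the two statements, offered as a sketch and not as the source's own explanation: the weighted sum is a dot product, which grows as the input lines up with the weight vector, and the bias sets how much alignment is needed before the neuron answers "1". The AND-gate weights below are a made-up example.

```python
def weighted_sum(inputs, weights, bias):
    """Dot product of inputs and weights, shifted by the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias


def neuron_classify(inputs, weights, bias):
    """Classify a data point into one of two kinds: 1 if the sum is positive, else 0."""
    return 1 if weighted_sum(inputs, weights, bias) > 0 else 0


# Hypothetical neuron acting as a logical AND of two binary inputs.
weights = [1.0, 1.0]
bias = -1.5

for point in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(point, weighted_sum(point, weights, bias), neuron_classify(point, weights, bias))
```

Only the input `[1, 1]`, the one most aligned with the weights, produces a sum large enough to cross the threshold set by the bias.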

    1. The transformer model introduces the idea that, instead of adding another complex mechanism (attention) to an already complex Seq2Seq model, we can simplify the solution by forgetting about everything else and just focusing on attention.
    1. I’m particularly interested in two questions: First, just how weird is machine learning? Second, what sorts of choices do developers make as they shape a project?
  25. Nov 2021
    1. Now that we've made peace with the concepts of projections (matrix multiplications)

      Projections are matrix multiplications. Why didn't you say so? Spatial and channel projections in the gated gMLP.

    2. Computers are especially good at matrix multiplications. There is an entire industry around building computer hardware specifically for fast matrix multiplications. Any computation that can be expressed as a matrix multiplication can be made shockingly efficient.
    3. The selective-second-order-with-skips model is a useful way to think about what transformers do, at least in the decoder side. It captures, to a first approximation, what generative language models like OpenAI's GPT-3 are doing.
    1. You'll use a (70%, 20%, 10%) split for the training, validation, and test sets. Note the data is not being randomly shuffled before splitting. This is for two reasons: It ensures that chopping the data into windows of consecutive samples is still possible. It ensures that the validation/test results are more realistic, being evaluated on the data collected after the model was trained.

      Train, Validation, Test: 0.7, 0.2, 0.1
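
A minimal sketch of the unshuffled 70/20/10 split described in the highlight. It slices by position so the validation and test samples always come after the training samples in time. The function name and the stand-in data are assumptions; the source tutorial works on a DataFrame, which is not reproduced here.

```python
def chronological_split(samples, train_pct=70, val_pct=20):
    """Split an ordered sequence into train/val/test without shuffling."""
    n = len(samples)
    # Integer arithmetic avoids floating-point surprises such as 0.7 + 0.2 != 0.9.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return samples[:train_end], samples[train_end:val_end], samples[val_end:]


readings = list(range(10))  # stand-in for time-ordered samples
train, val, test = chronological_split(readings)
print(train, val, test)
```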

    1. The following figure presents a simple functional diagram of the neural network we will use throughout the article. The neural network is a sequence of linear (both convolutional and fully-connected), max-pooling, and ReLU layers, culminating in a softmax layer. Notes: A convolution calculates weighted sums of regions in the input; in neural networks, the learnable weights in convolutional layers are referred to as the kernel (image credit: https://towardsdatascience.com/gentle-dive-into-math-behind-convolutional-neural-networks-79a07dd44cf9; see also Convolution arithmetic). A fully-connected layer computes output neurons as a weighted sum of input neurons; in matrix form, it is a matrix that linearly transforms the input vector into the output vector. First introduced by Nair and Hinton, ReLU calculates $f(x)=\max(0,x)$ for each entry in a vector input; graphically, it is a hinge at the origin (image credit: https://pytorch.org/docs/stable/nn.html#relu). The softmax function calculates $S(y_i)=\frac{e^{y_i}}{\sum_{j=1}^{N} e^{y_j}}$ for each entry ($y_i$) in a vector input ($y$) (image credit: https://ljvmiranda921.github.io/notebook/2017/08/13/softmax-and-the-negative-log-likelihood/).

      This is a great visualization of MNIST hidden layers.

    1. The Query word can be interpreted as the word for which we are calculating Attention. The Key and Value word is the word to which we are paying attention, i.e. how relevant is that word to the Query word.

      Finally

    1. Other work on interpreting transformer internals has focused mostly on what the attention is looking at. The logit lens focuses on what GPT "believes" after each step of processing, rather than how it updates that belief inside the step.
    1. The cube of activations that a neural network for computer vision develops at each hidden layer. Different slices of the cube allow us to target the activations of individual neurons, spatial positions, or channels.

      This is first explanation of

    1. The attention layer (W in the diagram) computes three vectors based on the input, termed key, query, and value.

      Could you be more specific?

    2. Attention is a means of selectively weighting different elements in input data, so that they will have an adjusted impact on the hidden states of downstream layers.
    1. These findings provide strong evidence for a classic hypothesis about the computations underlying human language understanding, that the brain’s language system is optimized for predictive processing in the service of meaning extraction
    1. To review, the Forget gate decides what is relevant to keep from prior steps. The input gate decides what information is relevant to add from the current step. The output gate determines what the next hidden state should be. Code Demo: For those of you who understand better through seeing the code, here is an example using python pseudo code.
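
The pseudo code mentioned in the highlight is not part of this excerpt. As a stand-in, here is a small sketch of one LSTM step with scalar state, following the standard gate equations. It is not the original author's code, and the parameter layout (a dict mapping each gate to an input weight, a recurrent weight, and a bias) is an assumption made for readability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One step of a scalar LSTM cell; params maps gate name -> (w_x, w_h, b)."""

    def gate(name, squash):
        w_x, w_h, b = params[name]
        return squash(w_x * x + w_h * h_prev + b)

    forget = gate("forget", sigmoid)          # what to keep from prior steps
    write = gate("input", sigmoid)            # how much new information to add
    candidate = gate("candidate", math.tanh)  # the new information itself
    output = gate("output", sigmoid)          # what to expose as hidden state

    c = forget * c_prev + write * candidate   # updated cell state
    h = output * math.tanh(c)                 # next hidden state
    return h, c


# With all-zero parameters every sigmoid gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
zero_params = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, params=zero_params)
print(h, c)
```
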
  26. Oct 2021
    1. This approach, visualizing high-dimensional representations using dimensionality reduction, is an extremely broadly applicable technique for inspecting models in deep learning.
    2. These layers warp and reshape the data to make it easier to classify.
    1. Even with this very primitive single neuron, you can achieve 90% accuracy when recognizing a handwritten text image. To recognize all the digits from 0 to 9, you would need just ten neurons to recognize them with 92% accuracy.

      And here is a Google Colab notebook that demonstrates that

  27. Sep 2021
    1. Humans perform a version of this task when interpreting hard-to-understand speech, such as an accent which is particularly fast or slurred, or a sentence in a language we do not know very well; we do not necessarily hear every single word that is said, but we pick up on salient key words and contextualize the rest to understand the sentence.

      Boy, don't they

    1. A neural network will predict your digit in the blue square above. Your image is 784 pixels (= 28 rows by 28 columns with black=1 and white=0). Those 784 features get fed into a 3 layer neural network; Input:784 - AvgPool:196 - Dense:100 - Softmax:10.
    1. Personalized ASR models. For each of the 432 participants with disordered speech, we create a personalized ASR model (SI-2) from their own recordings. Our fine-tuning procedure was optimized for our adaptation process, where we only have between ¼ and 2 h of data per speaker. We found that updating only the first five encoder layers (versus the complete model) worked best and successfully prevented overfitting [10]
    1. So whenever you hear of someone “training” a neural network, it just means finding the weights we use to calculate the prediction.
  28. Aug 2021
    1. I'm going to try to provide an English text example. The following is based solely on my intuitive understanding of the paper 'Attention is all you need'.

      This is also good

    2. For the word q that your eyes see in the given sentence, what is the most related word k in the sentence to understand what q is about?
    3. So basically: q = the vector representing a word; K and V = your memory, thus all the words that have been generated before. Note that K and V can be the same (but don't have to). So what you do with attention is that you take your current query (word in most cases) and look in your memory for similar keys. To come up with a distribution of relevant words, the softmax function is then used.
    1. Here is a list of some open data available online. You can find a more complete list and details of the open data available online in Appendix B.

      DataHub (http://datahub.io/dataset)

      World Health Organization (http://www.who.int/research/en/)

      Data.gov (http://data.gov)

      European Union Open Data Portal (http://open-data.europa.eu/en/data/)

      Amazon Web Service public datasets (http://aws.amazon.com/datasets)

      Facebook Graph (http://developers.facebook.com/docs/graph-api)

      Healthdata.gov (http://www.healthdata.gov)

      Google Trends (http://www.google.com/trends/explore)

      Google Finance (https://www.google.com/finance)

      Google Books Ngrams (http://storage.googleapis.com/books/ngrams/books/datasetsv2.html)

      Machine Learning Repository (http://archive.ics.uci.edu/ml/)

      To get an idea of the open data sources available online, you can look at the LOD cloud diagram (http://lod-cloud.net), which displays the links among several open data sources currently available on the network (see Figure 1-3).

    1. A neural network with a hidden layer has universality: given enough hidden units, it can approximate any function. This is a frequently quoted – and even more frequently, misunderstood and applied – theorem. It’s true, essentially, because the hidden layer can be used as a lookup table.
    2. Recursive Neural Networks
  29. Jul 2021
    1. In the language of Interpretable Machine Learning (IML) literature like Molnar et al.[20], input saliency is a method that explains individual predictions.
    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

    1. Vectors with a small Euclidean distance from one another are located in the same region of a vector space. Vectors with a high cosine similarity are located in the same general direction from the origin.
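A small numeric check of the distinction above, using made-up two-dimensional vectors (an assumption for illustration): `b` points the same way as `a` but lies far from it, while `c` lies close to `a` but points in a different direction.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points in the vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, ignoring their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 1.0]
b = [10.0, 10.0]   # same direction as a, but far away
c = [1.0, -1.0]    # near a, but at a right angle to it

print(euclidean_distance(a, b), cosine_similarity(a, b))
print(euclidean_distance(a, c), cosine_similarity(a, c))
```

Here `c` is the nearer neighbour by Euclidean distance, while `b` is the more similar vector by cosine similarity, so the two measures can rank neighbours differently.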
    1. If you're serious about neural networks, I have one recommendation. Try to rebuild this network from memory.
    1. In our research, i.e., the wormnet project, we try to build machine learning models motivated by the C. elegans nervous system. By doing so, we have to pay a cost, as we constrain ourselves to such models in contrast to standard artificial neural networks, whose modeling space is purely constrained by memory and compute limitations. However, there are potentially some advantages and benefits we gain. Our objective is to better understand what’s necessary for effective neural information processing to emerge.
    1. Recommendations

      • DON'T use shifted PPMI with SVD.
      • DON'T use SVD "correctly", i.e. without eigenvector weighting (performance drops 15 points compared to with eigenvalue weighting with p = 0.5).
      • DO use PPMI and SVD with short contexts (window size of 2).
      • DO use many negative samples with SGNS.
      • DO always use context distribution smoothing (raise unigram distribution to the power of α = 0.75) for all methods.
      • DO use SGNS as a baseline (robust, fast and cheap to train).
      • DO try adding context vectors in SGNS and GloVe.
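Of these recommendations, context distribution smoothing is the easiest to show in code: each unigram count is raised to the power 0.75 and the results are renormalized. The sketch below assumes a made-up three-word vocabulary with invented counts, and `smoothed_unigram` is a hypothetical helper name.

```python
def smoothed_unigram(counts, alpha=0.75):
    """Raise each count to the power alpha, then renormalize to a distribution."""
    powered = {word: count ** alpha for word, count in counts.items()}
    total = sum(powered.values())
    return {word: value / total for word, value in powered.items()}

# Invented counts, for illustration only.
counts = {"the": 1000, "network": 50, "softmax": 5}

raw = smoothed_unigram(counts, alpha=1.0)      # plain unigram distribution
smooth = smoothed_unigram(counts, alpha=0.75)  # smoothed distribution

for word in counts:
    print(word, round(raw[word], 4), round(smooth[word], 4))
```

Smoothing moves probability mass from frequent words to rare ones, so rare contexts are sampled somewhat more often than their raw frequency would suggest.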
  30. Jun 2021
    1. One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning

      This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

  31. Apr 2021
    1. Machine learning app development has been gaining traction among companies from all over the world. When dealing with this part of machine learning application development, you need to remember that machine learning can recognize only the patterns it has seen before. Therefore, the data is crucial for your objectives. If you’ve ever wondered how to build a machine learning app, this article will answer your question.

    1. Machine learning is an extension of linear regression in a few ways. Firstly is that modern ML

      Machine learning is an extension of the linear model that deals with much more complicated situations, where we take a few different inputs and get outputs.

  32. Nov 2020
    1. We can regard $\pi_k$ as the weight of each component $\mathcal{N}(\boldsymbol{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$.

      Some books call this the responsibility.
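A one-dimensional sketch of how the component weights enter a Gaussian mixture. As far as I know, common textbook usage calls the weight of component k its mixing coefficient and reserves "responsibility" for the posterior share of a given point that component k accounts for; the sketch computes both so the two can be compared. The two-component parameters are invented for illustration.

```python
import math

def normal_pdf(x, mean, std):
    """Density of a one-dimensional Gaussian."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def mixture_density(x, weights, means, stds):
    """p(x) = sum over k of weight_k * N(x | mean_k, std_k)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds))

def responsibilities(x, weights, means, stds):
    """Posterior share of x attributed to each component."""
    joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    return [j / total for j in joint]

# Invented two-component mixture; the weights must sum to 1.
weights = [0.7, 0.3]
means = [0.0, 5.0]
stds = [1.0, 1.0]

density = mixture_density(0.0, weights, means, stds)
resp = responsibilities(4.0, weights, means, stds)
print(density, resp)
```

At x = 4 the second component takes almost all of the responsibility even though its weight is only 0.3, which shows that the two quantities are not the same thing.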

  33. Oct 2020
  34. May 2020
    1. Machine learning has a limited scope
    2. AI is a bigger concept to create intelligent machines that can simulate human thinking capability and behavior, whereas machine learning is an application or subset of AI that allows machines to learn from data without being programmed explicitly.
    1. Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed
  35. Apr 2020
    1. Keras is a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano. It was developed with a focus on enabling fast experimentation. Being able to go from idea to result with the least possible delay is key to doing good research. Use Keras if you need a deep learning library that: Allows for easy and fast prototyping (through user friendliness, modularity, and extensibility). Supports both convolutional networks and recurrent networks, as well as combinations of the two. Runs seamlessly on CPU and GPU. Read the documentation at Keras.io. Keras is compatible with: Python 2.7-3.6.
  36. Jan 2020
    1. Suppose the algorithm chooses a tree that splits on education but not on age. Conditional on this tree, the estimated coefficients are consistent. But that does not imply that treatment effects do not also vary by age, as education may well covary with age; on other draws of the data, in fact, the same procedure could have chosen a tree that split on age instead

      a caveat

    2. These heterogeneous treatment effects can be used to assign treatments; Misra and Dubé (2016) illustrate this on the problem of price targeting, applying Bayesian regularized methods to a large-scale experiment where prices were randomly assigned

      todo -- look into the implication for treatment assignment with heterogeneity

    3. Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, and Newey (2016) take care of high-dimensional controls in treatment effect estimation by solving two simultaneous prediction problems, one in the outcome and one in the treatment equation.

      this seems similar to my idea of regularizing on only a subset of the variables

    4. These same techniques applied here result in split-sample instrumental variables (Angrist and Krueger 1995) and “jackknife” instrumental variables

      some classical solutions to IV bias are akin to ML solutions

    5. Understood this way, the finite-sample biases in instrumental variables are a consequence of overfitting.

      traditional 'finite sample bias of IV' is really overfitting

    6. Even when we are interested in a parameter β̂, the tool we use to recover that parameter may contain (often implicitly) a prediction component. Take the case of linear instrumental variables understood as a two-stage procedure: first regress x = γ′z + δ on the instrument z, then regress y = β′x + ε on the fitted values x̂. The first stage is typically handled as an estimation step. But this is effectively a prediction task: only the predictions x̂ enter the second stage; the coefficients in the first stage are merely a means to these fitted values.

      first stage of IV -- handled as an estimation problem, but really it's a prediction problem!
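The two-stage reading of instrumental variables can be sketched on simulated data. Everything about the data-generating process below is an assumption made for illustration (one instrument, one unobserved confounder, a true coefficient of 2.0); the point is only that the first-stage coefficients are used for nothing except producing fitted values.

```python
import random

def ols(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Synthetic data: z is the instrument, u an unobserved confounder.
rng = random.Random(0)
n = 5000
true_beta = 2.0
z = [rng.gauss(0, 1) for _ in range(n)]
u = [rng.gauss(0, 1) for _ in range(n)]
x = [zi + ui + 0.5 * rng.gauss(0, 1) for zi, ui in zip(z, u)]
y = [true_beta * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

# Stage 1: predict x from z; only the fitted values are carried forward.
a0, a1 = ols(z, x)
x_hat = [a0 + a1 * zi for zi in z]

# Stage 2: regress y on the fitted values.
_, beta_2sls = ols(x_hat, y)

# Naive regression of y on x, for comparison; it absorbs the confounder.
_, beta_ols = ols(x, y)
print(round(beta_2sls, 2), round(beta_ols, 2))
```

Since stage 1 matters only through `x_hat`, any predictor of x from z could in principle take its place, which is the sense in which the first stage is a prediction problem.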

    7. Prediction in the Service of Estimation

      This is especially relevant to economists across the board, even the ML skeptics

    8. New Data

      The first application: constructing variables and meaning from high-dimensional data, especially outcome variables

      • satellite images (of energy use, lights etc) --> economic activity
      • cell phone data, Google street view to measure wealth
      • extract similarity of firms from 10k reports
      • even traditional data .. matching individuals in historical censuses
    9. Zhao and Yu (2006) who establish asymptotic model-selection consistency for the LASSO. Besides assuming that the true model is “sparse”—only a few variables are relevant—they also require the “irrepresentable condition” between observables: loosely put, none of the irrelevant covariates can be even moderately related to the set of relevant ones.

      Basically unrealistic for microeconomic applications imho

    10. First, it encourages the choice of less complex, but wrong models. Even if the best model uses interactions of number of bathrooms with number of rooms, regularization may lead to a choice of a simpler (but worse) model that uses only number of fireplaces. Second, it can bring with it a cousin of omitted variable bias, where we are typically concerned with correlations between observed variables and unobserved ones. Here, when regularization excludes some variables, even a correlation between observed variables and other observed (but excluded) ones can create bias in the estimated coefficients.

      Is this equally a problem for procedures that do not assume sparsity, such as the Ridge model?

    11. [If] the variables are correlated with each other (say the number of rooms of a house and its square-footage), then such variables are substitutes in predicting house prices. Similar predictions can be produced using very different variables. Which variables are actually chosen depends on the specific finite sample.

      Lasso-chosen variables are unstable because of what we usually call 'multicollinearity.' This presents a problem for making inferences from estimated coefficients.

    12. Through its regularizer, LASSO produces a sparse prediction function, so that many coefficients are zero and are “not used”—in this example, we find that more than half the variables are unused in each run

      This is true but they fail to mention that LASSO also shrinks the coefficients on variables that it keeps towards zero (relative to OLS). I think this is commonly misunderstood (from people I've spoken with).
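The point in the note above, that LASSO shrinks the coefficients it keeps as well as zeroing others, is easiest to see in the textbook special case of orthonormal regressors, where the LASSO solution is the soft-thresholded OLS coefficient (up to how the penalty is scaled). The sketch assumes that special case; the coefficient values and the penalty of 1.0 are invented for illustration.

```python
def soft_threshold(ols_coefficient, penalty):
    """LASSO coefficient under orthonormal regressors (assumed special case)."""
    if ols_coefficient > penalty:
        return ols_coefficient - penalty   # kept, but pulled toward zero
    if ols_coefficient < -penalty:
        return ols_coefficient + penalty   # kept, but pulled toward zero
    return 0.0                             # dropped entirely

# Invented OLS coefficients and penalty.
ols_coefficients = [3.0, -1.5, 0.4, -0.2]
lasso_coefficients = [soft_threshold(b, 1.0) for b in ols_coefficients]
print(lasso_coefficients)  # [2.0, -0.5, 0.0, 0.0]
```

The two small coefficients are set to zero, and the two surviving ones are each moved toward zero by the full penalty, so the retained coefficients are not the OLS estimates.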

    13. One obvious problem that arises in making such inferences is the lack of standard errors on the coefficients. Even when machine-learning predictors produce familiar output like linear functions, forming these standard errors can be more complicated than seems at first glance as they would have to account for the model selection itself. In fact, Leeb and Pötscher (2006, 2008) develop conditions under which it is impossible to obtain (uniformly) consistent estimates of the distribution of model parameters after data-driven selection.

      This is a very serious limitation for Economics academic work.

    14. First, econometrics can guide design choices, such as the number of folds or the function class.

      How would Econometrics guide us in this?

    15. These choices about how to represent the features will interact with the regularizer and function class: A linear model can reproduce the log base area per room from log base area and log room number easily, while a regression tree would require many splits to do so.

      The choice of 'how to represent the features' is consequential ... it's not just 'throw it all in' (kitchen sink approach)

    16. Table 2: Some Machine Learning Algorithms

      This is a very helpful table!

    17. Picking the prediction function then involves two steps: The first step is, conditional on a level of complexity, to pick the best in-sample loss-minimizing function. The second step is to estimate the optimal level of complexity using empirical tuning (as we saw in cross-validating the depth of the tree).

      ML explained while standing on one leg.
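The two steps can be sketched with a deliberately simple function class. As a stand-in for tree depth, the sketch assumes a piecewise-constant predictor on [0, 1) whose complexity is its number of bins: step 1 fits the bin means for a given number of bins, and step 2 picks the number of bins by 5-fold cross-validation. The simulated data, the candidate list and all function names are assumptions for illustration.

```python
import math
import random

def fit_bins(xs, ys, n_bins):
    """Step 1: for a fixed number of bins, the in-sample squared-loss
    minimizer is the mean of y within each bin."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)   # fallback prediction for empty bins
    return [s / c if c else overall for s, c in zip(sums, counts)]

def predict(bin_means, x):
    n_bins = len(bin_means)
    return bin_means[min(int(x * n_bins), n_bins - 1)]

def cv_error(xs, ys, n_bins, n_folds=5):
    """Step 2 helper: mean squared error on held-out folds."""
    total = 0.0
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        held_out = [i for i in range(len(xs)) if i % n_folds == fold]
        model = fit_bins([xs[i] for i in train], [ys[i] for i in train], n_bins)
        total += sum((predict(model, xs[i]) - ys[i]) ** 2 for i in held_out)
    return total / len(xs)

# Simulated data: a smooth signal plus noise.
rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]

# Step 2: choose the complexity level with the lowest held-out error.
candidates = [1, 2, 4, 8, 16, 32, 64]
errors = {b: cv_error(xs, ys, b) for b in candidates}
best = min(candidates, key=lambda b: errors[b])
print(best, round(errors[best], 3))
```

Too few bins cannot follow the signal and too many bins chase the noise; the held-out error is what lets the data arbitrate between the two.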

    18. Regularization combines with the observability of prediction quality to allow us to fit flexible functional forms and still find generalizable structure.

      But we can't really make stati