Hypothesis

176 Matching Annotations

May 2026
x.com

https://x.com/DimitrisPapail/status/2028669695344148946

1
1. fxp007 07 May 2026
  
  in Public
  
  Width, not depth, is the bottleneck. A wide model (d=256, 6 layers, 4.9M params) dramatically outperforms a deep model (d=128, 12 layers, 2.4M params). SUBLEQ execution requires routing 32 mem values through attention simultaneously and width helps for that.
  
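  Most people assume that in deep learning, model depth matters more than width, especially for complex tasks. But the author finds that for SUBLEQ execution, width rather than depth is the bottleneck. This challenges conventional thinking about deep learning architecture design and suggests that some computational tasks may call for different architectural priorities.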
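  As a rough sanity check on the quoted parameter counts, a common back-of-the-envelope estimate puts a transformer block at about 12·d² weights (4·d² for the attention projections plus 8·d² for an MLP with 4x expansion). That block layout is an assumption here, not something the post states, and the function name is hypothetical; embeddings, biases, and layer norms are ignored.

```python
def approx_block_params(d_model: int, n_layers: int, mlp_ratio: int = 4) -> int:
    """Rough weight count for a stack of transformer blocks.

    Assumes (not stated in the post) standard blocks: four d x d attention
    projections (Q, K, V, output) and a two-matrix MLP with hidden size
    mlp_ratio * d. Embeddings, biases, and layer norms are ignored.
    """
    attention = 4 * d_model * d_model
    mlp = 2 * mlp_ratio * d_model * d_model
    return n_layers * (attention + mlp)


# The two configurations quoted in the annotation.
wide = approx_block_params(d_model=256, n_layers=6)
deep = approx_block_params(d_model=128, n_layers=12)
print(f"wide: {wide / 1e6:.2f}M, deep: {deep / 1e6:.2f}M")
```

  Under these assumptions the wide model lands near 4.7M weights and the deep one near 2.4M, roughly in line with the quoted 4.9M and 2.4M. Halving the width while doubling the depth halves the block parameters, so the two models are not matched in size.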
  
  non-consensus deep-learning

Tags

deep-learning

non-consensus

Annotators

fxp007

URL

x.com/DimitrisPapail/status/2028669695344148946
Jun 2025
www.cs.toronto.edu

dqn.pdf

1
1. mark.crowley 09 Jun 2025
  
  in Public
  
  Playing Atari with Deep Reinforcement Learning 19 Dec 2013 · Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
  
  The 2013 paper that introduced the DQN algorithm, which combines Deep Learning with Reinforcement Learning to play Atari games.
  
  reinforcement-learning dqn atari-games deep-learning

Tags

deep-learning

reinforcement-learning

atari-games

dqn

Annotators

mark.crowley

URL

cs.toronto.edu/~vmnih/docs/dqn.pdf
Mar 2025
www.sciencedaily.com

Deep learning rethink overcomes major obstacle in AI industry

1
1. pbk1 11 Mar 2025
  
  in Public
  
  Anshumali's prime research work on SLIDE algorithms.
  
  Deep learning AI/ML

Tags

AI/ML

Deep learning

Annotators

pbk1

URL

sciencedaily.com/releases/2020/03/200305135041.htm
Nov 2024
medium.com

Translating vision into sound

1
1. stopresetgo 07 Nov 2024
  
  in Public
  
  for - article - Medium - Translating vision into sound - A deep learning perspectiive - Viktor Toth - 21 April, 2019 - from - search - Google - android app "The Voice" translates images into audio signal - https://hyp.is/OJKKmJ1MEe-TAp_w_0SK_Q/www.google.com/search?q=android+app+%22The+Voice%22+translates+images+into+audio+signal&sca_esv=6fa4053b1bfce2fa&sxsrf=ADLYWIK_UqZZZ9OCRCwH4D6FoSaykbMTpQ:1731013461104&ei=VSstZ4eCBqi8xc8P5KP_kAU&ved=0ahUKEwjHgM3Tj8uJAxUoXvEDHeTRH1IQ4dUDCA8&uact=5&oq=android+app+%22The+Voice%22+translates+images+into+audio+signal&gs_lp=Egxnd3Mtd2l6LXNlcnAiO2FuZHJvaWQgYXBwICJUaGUgVm9pY2UiIHRyYW5zbGF0ZXMgaW1hZ2VzIGludG8gYXVkaW8gc2lnbmFsMggQABiABBiiBDIIEAAYgAQYogQyCBAAGIAEGKIEMggQABiABBiiBDIIEAAYgAQYogRI2xdQpglYjRJwAXgCkAEAmAGZA6ABmQOqAQM0LTG4AQPIAQD4AQGYAgOgAqADwgIKEAAYsAMY1gQYR8ICBBAAGEeYAwDiAwUSATEgQIgGAZAGCJIHBTIuNC0xoAewBA&sclient=gws-wiz-serp
  
  article - Medium - Translating vision into sound - A deep learning perspectiive - Viktor Toth - 21 April, 2019 from - search - Google - android app "The Voice" translates images into audio signal

Tags

from - search - Google - android app "The Voice" translates images into audio signal

article - Medium - Translating vision into sound - A deep learning perspectiive - Viktor Toth - 21 April, 2019

Annotators

stopresetgo

URL

medium.com/mindsoft/translating-vision-into-sound-443b7e01eced
www.google.com

android app "The Voice" translates images into audio signal - Google Search

1
1. stopresetgo 07 Nov 2024
  
  in Public
  
  for - search - Google - android app "The Voice" translates images into audio signal - from - webcast - Michael Levin - Can we create new senses for humans? - interview - David Eagleman - https://hyp.is/BHS6up09Ee-1qefERFpeQg/www.youtube.com/watch?v=YCvFgrpfNGM - to - Medium - article Translating vision into sound. A deep learning perspective - Viktor Toth - April 2019 - https://hyp.is/lQL4Yp1MEe-66-dpgenOBA/medium.com/mindsoft/translating-vision-into-sound-443b7e01eced
  
  from - webcast - Michael Levin - Can we create new senses for humans? - interview - David Eagleman search - Google - android app "The Voice" translates images into audio signal to - Medium - article Translating vision into sound. A deep learning perspective - Viktor Toth - April 2019

Tags

search - Google - android app "The Voice" translates images into audio signal

to - Medium - article Translating vision into sound. A deep learning perspective - Viktor Toth - April 2019

from - webcast - Michael Levin - Can we create new senses for humans? - interview - David Eagleman

Annotators

stopresetgo

URL

google.com/search
www.youtube.com

Ex-Buddhist Monk REVEALS Secret Tibetan Prophecy Happening RIGHT NOW! | Dr. John Churchill

1
1. stopresetgo 04 Nov 2024
  
  in Public
  
  it isn't just about alleviating their own personal suffering, it's also about alleviating universal suffering, so this is where the bodhisattva or the Christ or those kinds of archetypes about being concerned about the whole
  
  for - example - individual's evolutionary learning journey - new self revisiting old self and gaining new insight - universal compassion of Buddhism and the individual / collective gestalt - adjacency - the universal compassion of the bodhisattva - Deep humanity idea of the individual / collective gestalt - the Deep Humanity Common Human Denominators (CHD) as pointing to the self / other fundamental identity - Freud, Winnicott, Klein's idea of the self formed by relationship with the other, in particular the mOTHER (Deep Humanity), the Most significant OTHER
  
  adjacency - between
  - the universal compassion of the bodhisattva
  - Deep Humanity idea of the individual / collective gestalt
  - the Deep Humanity Common Human Denominators (CHD) as pointing to the self / other fundamental identity
  - Freud, Winnicott, Klein's idea of the self formed by relationship with the other, in particular the mOTHER (Deep Humanity), the Most significant OTHER
  - adjacency relationship
    - When I heard John Churchill explain the second turning,
      - the Mahayana approach,
    - I was already familiar with it from my many decades of Buddhist teaching but with
      - those teachings in the rear view mirror of my life and
      - developing an open source, non-denominational spirituality (Deep Humanity)
    - Hearing these old teachings again, mixed with the new ideas of the individual / collective gestalt
    - This becomes an example of the Indyweb idea of recording our individual evolutionary learning journey and
      - the present self meeting the old self
    - When this happens, new adjacencies can often surface
    - In this case, due to my own situatedness in life, the universal compassion of the bodhisattva can be articulated from a Deep Humanity perspective:
      - The Freudian, Kleinian, Winnicott and Becker perspective of the individual as being constructed out of the early childhood social interactions with the mOTHER,
        - a Deep Humanity re-interpretation of "mother" to "mOTHER" to mean "the Most significant OTHER" of the newly born neonate.
      - A deep realization that OUR OWN SELF IDENTITY WAS CONSTRUCTED out of a SOCIAL RELATIONSHIP with mOTHER demonstrates our intertwingled individual/collective and self/other
      - The Deep Humanity "Common Human Denominators" (CHD) are a way to deeply APPRECIATE those qualities human beings have in common with each other
    - Later on, Churchill talks about how the sacred is lost in western modernity
      - A first step in that direction is treating other humans as sacred, then after that, to treat ALL life as sacred
      - Using tools like the CHD help us to find fundamental similarities while divisive differences might be polarizing and driving us apart
      - A universal compassion is only possible if we vividly see how we are constructed of the other
      - Another way to say this is that we see others not from an individual level, but from a species level
  
  adjacency - the universal compassion of the bodhisattva - Deep humanity idea of the individual / collective gestalt - the Deep Humanity Common Human Denominators (CHD) as pointing to the self / other fundamental identity - Freud, Winnicott, Kline's idea of the self formed by relationship with the other, in particular the mOTHER (Deep Humanity), the Most significant OTHER example - individual's evolutionary learning journey - new self revisiting old self and gaining new insight - universal compassion of Buddhism and the individual / collective gestalt

Tags

adjacency - the universal compassion of the bodhisattva - Deep humanity idea of the individual / collective gestalt - the Deep Humanity Common Human Denominators (CHD) as pointing to the self / other fundamental identity - Freud, Winnicott, Kline's idea of the self formed by relationship with the other, in particular the mOTHER (Deep Humanity), the Most significant OTHER

example - individual's evolutionary learning journey - new self revisiting old self and gaining new insight - universal compassion of Buddhism and the individual / collective gestalt

Annotators

stopresetgo

URL

youtube.com/watch
Oct 2024
medium.com

People talk about co-creating all the time.

1
1. stopresetgo 19 Oct 2024
  
  in Public
  
  Effective collaboration is essential for mutual learning.
  
  for - Deep Humanity - intertwingled individual / collective learning - evolutionary learning journey - symmathesy - mutual learning - Nora Bateson
  
  Deep Humanity - intertwingled individual / collective learning - evolutionary learning journey symmathesy - mutual learning - Nora Bateson

Tags

Deep Humanity - intertwingled individual / collective learning - evolutionary learning journey

symmathesy - mutual learning - Nora Bateson

Annotators

stopresetgo

URL

medium.com/@unstitution21/people-talk-about-co-creating-all-the-time-a51868e0266a
Sep 2024
www.datacamp.com

What is Deep Learning? A Tutorial for Beginners

2
1. mfaisalk 03 Sep 2024
  
  in Public
  
  How deep learning differs from traditional machine learning While machine learning has been a transformative technology in its own right, deep learning takes it a step further by automating many of the tasks that typically require human expertise. Deep learning is essentially a specialized subset of machine learning, distinguished by its use of neural networks with three or more layers. These neural networks attempt to simulate the behavior of the human brain—albeit far from matching its ability—in order to "learn" from large amounts of data. You can explore machine learning vs deep learning in more detail in a separate post.
  
  deep learning vs ML deep learning
2. mfaisalk 03 Sep 2024
  
  in Public
  
  Deep learning is a type of machine learning that teaches computers to perform tasks by learning from examples, much like humans do. Imagine teaching a computer to recognize cats: instead of telling it to look for whiskers, ears, and a tail, you show it thousands of pictures of cats. The computer finds the common patterns all by itself and learns how to identify a cat. This is the essence of deep learning. In technical terms, deep learning uses something called "neural networks," which are inspired by the human brain. These networks consist of layers of interconnected nodes that process information. The more layers, the "deeper" the network, allowing it to learn more complex features and perform more sophisticated tasks.
  
  deep learning what is

Tags

deep learning

deep learning vs ML

what is

Annotators

mfaisalk

URL

datacamp.com/tutorial/tutorial-deep-learning-tutorial
Aug 2024
www.youtube.com

Damian Marley ft Nas.... Patience #REACTION

1
1. MrHoornTheScholar 10 Aug 2024
  
  in Public
  
  For true deep processing and learning (intellectualism), one must think beyond the single source being consumed and draw on everything one knows. Keep selective attention in mind, though, for true learning and thinking.
  
  This process becomes habitual through a Zettelkasten and is further aided by tools like Hypothes.is.
  
  Selective Attention Intellectualism Learning Deep Learning Deep Thinking Deep Processing Hypothes.is Syntopical Reading Syntopical Thinking Networked Thinking Zettelkasten

Tags

Selective Attention

Deep Learning

Syntopical Thinking

Deep Processing

Learning

Intellectualism

Syntopical Reading

Zettelkasten

Hypothes.is

Deep Thinking

Networked Thinking

Annotators

MrHoornTheScholar

URL

youtube.com/watch
www.youtube.com

FIRST TIME LISTENING TO Nas & Damien Marley - Patience | 10s HIP HOP REACTION

1
1. MrHoornTheScholar 10 Aug 2024
  
  in Public
  
  Unrelated to the song itself. It is interesting that different people interpret the song's meaning differently. Likely due to individual differences in perspective, history, culture, etc.
  
  Makes me reflect. Is knowledge/wisdom contained solely in content and words? Or is knowledge/wisdom rather contained in the RELATIONSHIP, the INTERACTION, between past experience, previous knowledge (identity) and substance?
  
  Currently I am inclined to go for the latter.
  
  Knowledge Research Education Learning Deep Learning Reading Analytical Reading Interpretating Interpretative Research Science Critical Thinking Deep Thinking Self-Thinking Society Patience Song

Tags

Reading

Interpretative Research

Deep Learning

Education

Interpretating

Patience Song

Science

Learning

Critical Thinking

Research

Analytical Reading

Knowledge

Deep Thinking

Self-Thinking Society

Annotators

MrHoornTheScholar

URL

youtube.com/watch
Jul 2024
www.youtube.com

How to Learn Complex Skills Quickly (And Forever)

1
1. MrHoornTheScholar 13 Jul 2024
  
  in Public
  
  Rail Framework Dr. Justin Sung Deep Learning YouTube Watch

Tags

Watch

YouTube

Deep Learning

Dr. Justin Sung

Rail Framework

Annotators

MrHoornTheScholar

URL

youtube.com/watch
www.youtube.com

Mastery: How to Learn Anything Fast | Nishant Kasibhatla

1
1. MrHoornTheScholar 03 Jul 2024
  
  in Public
  
  Nishant says: 2x Output for 1x input...
  
  His formula for mastery: 1. Learn (input -- focus, singletasking) 2. Reflect (output, pause... what is the main takeaway, how to use?) 3. Implement (output, apply) 4. Share (output, teach the material)
  
  These principles are great... Obviously they are not comprehensive, as they do not necessarily reflect higher-order learning (see Bloom's and SOLO taxonomies), nor do they build on Cognitive Load Theory, for example... It's understandable though, since you can't mention everything in a 20 minute talk XD.
  
  The argument I'd make is that the 3 subsequent steps are a part of learning. So the first step should not be called learn but rather encode, since that is literally the process of forming the initial cognitive schemas and putting them into long-term memory...
  
  Focus Learning Multitasking Deep Learning Shallow Learning Nishant Kasibhatla Encoding Cognitive Schemas

Tags

Nishant Kasibhatla

Deep Learning

Focus

Shallow Learning

Learning

Multitasking

Cognitive Schemas

Encoding

Annotators

MrHoornTheScholar

URL

youtube.com/watch
Jun 2024
www.youtube.com

How I Built an Evidence-Based Learning System in 312 Weeks

1
1. MrHoornTheScholar 09 Jun 2024
  
  in Public
  
  (~0:45)
  
  Justin mentions that a better way to think about learning is in systems rather than techniques. This is true for virtually anything. Tips & Tricks don't get you anywhere, it is the systems which bring you massive improvements because they have components all working together to achieve one goal or a set of goals.
  
  Any good system has these components working together seamlessly, creating something emergent; worth more than the sum of its parts.
  
  Dr. Justin Sung Systems Systems Thinking Deep Learning Techniques Tips-and-Tricks

Tags

Techniques

Systems Thinking

Tips-and-Tricks

Deep Learning

Systems

Dr. Justin Sung

Annotators

MrHoornTheScholar

URL

youtube.com/watch
jaredhenderson.substack.com

Gatekeeping Ourselves

2
1. MrHoornTheScholar 08 Jun 2024
  
  in Public
  
  The ubiquity of smartphones and social media have also affected literacy across the board. Children and adults alike are reading in fundamentally different ways. For one, phones have been shown — to no one’s surprise — to interfere with our ability to focus. And apps such as TikTok, Facebook, and Instagram have shifted our reading habits toward short and often fragmentary text.
  
  The first thing I ask people who cannot focus for more than an hour straight (which I would argue is a necessity for proper deep learning; see also Flow) is how their dopamine regulation is.
  
  Dopamine regulation is the biggest factor that I know of (I am not an expert, so there might be even more influential factors) that hampers the ability to focus for prolonged times in a cyclic way.
  
  One can enjoy learning, and thus focus, if the average dopamine the brain produces is close to the dopamine they get when performing the act of learning. This is hard if someone uses "dopamine factories" such as TikTok and other shortform content.
  
  Dopamine Focus Deep Learning Learning Cognition Flow
2. MrHoornTheScholar 08 Jun 2024
  
  in Public
  
  Testing culture also discourages deep reading, critics say, because it emphasizes close reading of excerpts, for example, to study a particular literary technique, rather than reading entire works.
  
  Indeed. But testing in general, as it is done currently, in modern formal education, discourages deep learning as opposed to shallow learning.
  
  Why? Because tests with marks push students to start learning at most 3 days before the test, which gets knowledge into short-term memory rather than long-term memory. This renders the process of learning virtually useless, even though they "pass" the curriculum.
  
  I know this because I was such a student, and saw it all around me with virtually every other student I met, and I was in HAVO, a level not considered "low".
  
  It does not help that teachers, or the system, expect students to know how to learn (efficiently) without it ever being taught to them.
  
  My message to the system: start teaching students how to learn the moment they enter high school
  
  Education Learning Tests Shallow Learning Deep Learning Deep Processing Memory Cognition

Tags

Deep Learning

Dopamine

Focus

Education

Shallow Learning

Deep Processing

Memory

Learning

Flow

Tests

Cognition

Annotators

MrHoornTheScholar

URL

jaredhenderson.substack.com/p/gatekeeping-ourselves
May 2024
www.linkedin.com

Post | Feed | LinkedIn

2
1. MrHoornTheScholar 23 May 2024
  
  in Public
  
  Matthew van der Hoorn Yes totally agree but could be used for creating a draft to work with, that's always the angle I try to take but hear what you are saying Matthew!
  
  Reply to Nidhi Sachdeva: Nidhi Sachdeva, PhD Just went through the micro-lesson itself. In the context of teachers using it to generate instruction examples, I do not argue against that. The teacher does not have to learn the content, or so I hope.
  
  However, I would argue that the learners themselves should try to come up with examples or analogies, etc. But this depends on the learner's learning skills, which should be taught in schools in the first place.
  
  Nidhi Sachdeva Reply Learning Teaching Encoding Deep Processing AI
2. MrHoornTheScholar 23 May 2024
  
  in Public
  
  ***Deep Processing*** -> It's important in learning. It's when our brain constructs meaning and says, "Ah, I get it, this makes sense." -> It's when new knowledge establishes connections to your pre-existing knowledge. -> When done well, it's what makes the knowledge easily retrievable when you need it. How do we achieve deep processing in learning? 👉🏽 STORIES, EXPLANATIONS, EXAMPLES, ANALOGIES and more - they all promote deep meaningful processing. 🤔 BUT, it's not always easy to come up with stories and examples. It's also time-consuming. You can ask your AI buddies to help with that. We have it now, let's leverage it. Here's a microlesson developed on 7taps Microlearning about this topic.
  
  Reply to Nidhi Sachdeva: I agree mostly, but I would advise against using AI for this. If your brain is not doing the work (the AI is coming up with the story/analogy), it is much less effective. Dr. Sönke Ahrens already said: "He who does the effort, does the learning."
  
  I would bet that Cognitive Load Theory also would show that there is much less optimized intrinsic cognitive load (load stemming from the building or automation of cognitive schemas) when another person, or the AI, is thinking of the analogies.
  
  https://www.linkedin.com/feed/update/urn:li:activity:7199396764536221698/
  
  Nidhi Sachdeva Reply Cognitive Load Theory Cognitive Schemas Deep Processing AI Learning

Tags

Cognitive Load Theory

Reply

Deep Processing

Learning

Nidhi Sachdeva

Teaching

Cognitive Schemas

Encoding

AI

Annotators

MrHoornTheScholar

URL

linkedin.com/feed/update/urn:li:activity:7199396764536221698/
www.npr.org

Bulky Cameras, Meet The Lens-less FlatCam

3
1. MrHoornTheScholar 19 May 2024
  
  in Public
  
  A slew of recent brain imaging research suggests handwriting's power stems from the relative complexity of the process and how it forces different brain systems to work together to reproduce the shapes of letters in our heads onto the page.
  
  Interesting. Needs more research on my part.
  
  Writing by Hand Writing vs. Typing Learning Deep Thinking
2. MrHoornTheScholar 19 May 2024
  
  in Public
  
  In adults, taking notes by hand during a lecture, instead of typing, can lead to better conceptual understanding of material.
  
  This is because one needs to think (process) before writing. One can't possibly write everything verbatim. Deep processing. Relational thinking.
  
  Deep Thinking Learning Writing by Hand Writing vs. Typing Note-Making
3. MrHoornTheScholar 19 May 2024
  
  in Public
  
  Why writing by hand beats typing for thinking and learning
  
  Writing Writing by Hand Typing Writing vs. Typing Thinking Learning Deep Thinking

Tags

Writing vs. Typing

Note-Making

Learning

Thinking

Typing

Writing

Deep Thinking

Writing by Hand

Annotators

MrHoornTheScholar

URL

npr.org/sections/health-shots/2022/03/25/1088902487/former-nurse-found-guilty-in-accidental-injection-death-of-75-year-old-patient
Oct 2023
openreview.net

Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting

1
1. mark.crowley 25 Oct 2023
  
  in Public
  
  Shayan Shirahmad Gale Bagi, Zahra Gharaee, Oliver Schulte, and Mark Crowley Generative Causal Representation Learning for Out-of-Distribution Motion Forecasting In International Conference on Machine Learning (ICML). Honolulu, Hawaii, USA. Jul, 2023.
  
  causality causal-inference deep-learning machine-learning icml icml2023

Tags

causality

deep-learning

causal-inference

machine-learning

icml2023

icml

Annotators

mark.crowley

URL

openreview.net/pdf
www.nature.com

Scientific discovery in the age of artificial intelligence

1
1. mark.crowley 25 Oct 2023
  
  in Public
  
  Wang et al. "Scientific discovery in the age of artificial intelligence", Nature, 2023.
  
  A paper about the current state of using AI/ML for scientific discovery, connected with the AI4Science workshops at major conferences.
  
  (NOTE: since Springer/Nature don't allow public pdfs to be linked without a paywall, we can't use hypothesis directly on the pdf of the paper, this link is to the website version of it which is what we'll use to guide discussion during the reading group.)
  
  machine-learning deep-learning ai-for-science artificial-intelligence reading_group_crowley rdgrp-f23

Tags

deep-learning

ai-for-science

rdgrp-f23

machine-learning

artificial-intelligence

reading_group_crowley

Annotators

mark.crowley

URL

nature.com/articles/s41586-023-06221-2
arxiv.org

2105.03322.pdf

1
1. mark.crowley 25 Oct 2023
  
  in Public
  
  "Are Pre-trained Convolutions Better than Pre-trained Transformers?"
  
  transformers deep-learning nlp large-language-models reading_group_crowley rdgrp-s23

Tags

deep-learning

large-language-models

rdgrp-s23

transformers

reading_group_crowley

nlp

Annotators

mark.crowley

URL

arxiv.org/pdf/2105.03322.pdf
arxiv.org

2305.15486.pdf

1
1. mark.crowley 25 Oct 2023
  
  in Public
  
  Quantitatively, SPRING with GPT-4 outperforms all state-of-the-art RL baselines, trained for 1M steps, without any training.
  
  Them's fighten' words!
  
  I haven't read it yet, but we're putting it on the list for this fall's reading group. Seriously, a strong result with a very strong implied claim. They are careful to say it's from their empirical results; very worth a look. I suspect that the amount of implicit knowledge in the papers, text and DAG is helping to do this.
  
  The Big Question: is their comparison to RL baselines fair, are they being trained from scratch? What does a fair comparison of any from-scratch model (RL or supervised) mean when compared to an LLM approach (or any approach using a foundation model), when that model is not really from scratch.
  
  reinforcement-learning rdgrp-f23 reading_group_crowley nlp larg deep-learning self-supervised supervised-learning evaluation-methods

Tags

larg

reinforcement-learning

nlp

supervised-learning

deep-learning

rdgrp-f23

self-supervised

reading_group_crowley

evaluation-methods

Annotators

mark.crowley

URL

arxiv.org/pdf/2305.15486.pdf
Aug 2023
arxiv.org

2308.09543.pdf

1
1. mark.crowley 22 Aug 2023
  
  in Public
  
  Title: Delays, Detours, and Forks in the Road: Latent State Models of Training Dynamics
  Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho
  Note: This paper seems cool: it uses older, interpretable machine learning models (graphical models) to understand what is going on inside a deep neural network.
  
  Link: https://arxiv.org/pdf/2308.09543.pdf
  
  deep-learning machine-learning hidden-markov-models graphical-models interpretability visualization regularization

Tags

deep-learning

graphical-models

regularization

hidden-markov-models

machine-learning

visualization

interpretability

Annotators

mark.crowley

URL

arxiv.org/pdf/2308.09543.pdf
www.youtube.com

Lecture #9: How to Read so that you *Retain* Information - YouTube

1
1. MrHoornTheScholar 08 Aug 2023
  
  in Public
  
  The essence of this video is correct: active learning, progressive summarization, deep processing, relational analytical thinking, even evaluative.
  
  Yet, the implementation is severely lacking; marginalia, text writing, etc.
  
  Better would be the use of mindmaps or GRINDEmaps. I personally would combine it with the Antinet of course.
  
  I do like this guy's teaching style though 😂
  
  Learning Active Learning Learning Misunderstandings Deep Thinking Relational Thinking Deep Processing

Tags

Active Learning

Relational Thinking

Deep Thinking

Deep Processing

Learning

Learning Misunderstandings

Annotators

MrHoornTheScholar

URL

youtube.com/watch
Jul 2023
arxiv.org

Deep Reinforcement Learning with Double Q-learning

1
1. mark.crowley 10 Jul 2023
  
  in Public
  
  Paper that evaluated the existing Double Q-Learning algorithm on the new DQN approach and validated that it is very effective in the Deep RL realm.
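  As a minimal sketch of the idea (not code from the paper): Double Q-learning decouples action selection from action evaluation when forming the bootstrap target. In the Double DQN variant, the online network picks the next action and the target network scores it, whereas plain DQN takes the max over the target network's own values. The function names and the plain-list Q-values below are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def dqn_target(reward: float, gamma: float, done: bool,
               q_target_next: Sequence[float]) -> float:
    """Plain DQN target: the target network both selects and evaluates."""
    if done:
        return reward
    return reward + gamma * max(q_target_next)


def double_dqn_target(reward: float, gamma: float, done: bool,
                      q_online_next: Sequence[float],
                      q_target_next: Sequence[float]) -> float:
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    # Greedy action according to the online network's estimates.
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # Value of that action according to the target network.
    return reward + gamma * q_target_next[best_action]


# Illustrative numbers only: the two networks disagree about the best action.
q_online = [1.0, 3.0, 2.0]
q_target = [5.0, 0.5, 4.0]
plain = dqn_target(1.0, 0.5, False, q_target)
double = double_dqn_target(1.0, 0.5, False, q_online, q_target)
```

  With these made-up values the plain target is larger than the double target, which illustrates how the max operator can inflate estimates when the evaluating network is also the one choosing the action.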
  
  reinforcement-learning dqn deep-learning

Tags

deep-learning

reinforcement-learning

dqn

Annotators

mark.crowley

URL

arxiv.org/pdf/1509.06461v3
arxiv.org

Liang15.pdf

1
1. mark.crowley 10 Jul 2023
  
  in Public
  
  Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
  
  Response paper to DQN showing that well designed Value Function Approximations can also do well at these complex tasks without the use of Deep Learning
  
  A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!
  
  dqn reinforcement-learning atari-games deep-learning shallow-learning

Tags

deep-learning

dqn

shallow-learning

reinforcement-learning

atari-games

Annotators

mark.crowley

URL

arxiv.org/pdf/1512.01563
arxiv.org

1511.05952.pdf

1
1. mark.crowley 10 Jul 2023
  
  in Public
  
  Tom Schaul, John Quan, Ioannis Antonoglou and David Silver. "PRIORITIZED EXPERIENCE REPLAY", ICLR, 2016.
  
  reinforcement-learning ppo deep-learning deep-rl policy-gradient direct-policy-search trust-region

Tags

reinforcement-learning

deep-rl

deep-learning

trust-region

ppo

policy-gradient

direct-policy-search

Annotators

mark.crowley

URL

arxiv.org/pdf/1511.05952.pdf
openaccess.thecvf.com

Temporal Recurrent Networks for Online Action Detection

1
1. mark.crowley 07 Jul 2023
  
  in Public
  
  Xu, ICCV, 2019 "Temporal Recurrent Networks for Online Action Detection"
  
  arxiv: https://arxiv.org/abs/1811.07391 hypothesis: https://hyp.is/go?url=https%3A%2F%2Fopenaccess.thecvf.com%2Fcontent_ICCV_2019%2Fpapers%2FXu_Temporal_Recurrent_Networks_for_Online_Action_Detection_ICCV_2019_paper.pdf&group=world
  
  driver-behaviour-learning autonomous-driving lstm rnn deep-learning recurrent-neural-networks time-series

Tags

driver-behaviour-learning

deep-learning

time-series

recurrent-neural-networks

autonomous-driving

lstm

rnn

Annotators

mark.crowley

URL

openaccess.thecvf.com/content_ICCV_2019/papers/Xu_Temporal_Recurrent_Networks_for_Online_Action_Detection_ICCV_2019_paper.pdf
www.notion.so

Notion – The all-in-one workspace for your notes, tasks, wikis, and databases.

2
1. MrHoornTheScholar 06 Jul 2023
  
  in Public
  
  You can tell people just like I have you to focus their attention, choose a target. Imagine there's a spotlight shining just on it. Don't pay much attention to what's in your periphery almost as if you have like blinders on, right? So don't pay attention to those distractors. People can do that. We have them talk to us about like, well, what is it that you're focused on? What's catching your attention right now? Those are easy instructions to understand and it's easy to make your eyes do it. What's important though is that that's not what their eyes do naturally. When they're walking or when they're running, people do take a sort of wider perspective. They broaden their scope of attention relative to what these instructions are having them do. And when we taught people that narrowed style of attention, what we found is that they moved 23% faster in this course that we had set up. From the start line to the finish line, it was always exactly the same distance. And we were using our stop watches to see how fast did they move. They moved 23% faster and they said it hurt 17% less. Right? So exactly the same actual experience, but subjectively it was easier and they performed better. They increase the efficiency of this particular exercise.
  
  (24:58) In order to perform significantly better, you need to FOCUS your attention on a single thing only. Multitasking won't work, and thinking about different things at once also doesn't work. Set up your environment to foster this insane level of focus.
  
  Deep Thinking Deep Work Flow Focus Learning Cognition
2. MrHoornTheScholar 06 Jul 2023
  
  in Public
  
  We prioritize what we see versus what we hear, why is that? Now, what comes to mind when I say that is when, somebody is saying no, but shaking their head yes. And so we have this disconnect, but we tend to prioritize what the action and not what we're hearing. So something that we visually see instead of what we hear. Speaker 1: There isn't a definitive answer on that, but one source of insight on why do we do that, it could be related to the neurological real estate that's taken up by our visual experience. There's far more of our cortex, the outer layer of our brain that responds to visual information than any other form of information
  
  (13:36) Perhaps this is also why visual information is so useful for learning and cognition (see GRINDE)... Maybe the visual medium should be used more in instruction instead of primarily auditory lectures (do take into account redundancy and other medium effects from CLT though)
  
  Education Learning Cognition Deep Processing
Visit annotations in context

Tags

Education

Focus

Deep Processing

Learning

Flow

Deep Work

Cognition

Deep Thinking

Annotators

MrHoornTheScholar

URL

notion.so/matthew-van-der-hoorn/154-Emily-Balcetis-Setting-and-Achieving-Goals-bc1738c0f46645269388bb217392c120
Jun 2023
www.youtube.com www.youtube.com

"The Foundational Protocols of Health" Dr. Andrew Huberman & Dr. Li Wu - YouTube

2
1. MrHoornTheScholar 28 Jun 2023
  
  in Public
  
  (14:20-19:00) Dopamine Prediction Error is explained by Andrew Huberman in the following way: When we anticipate something exciting dopamine levels rise and rise, but when we fail it drops below baseline, decreasing motivation and drive immensely, sometimes even causing us to get sad. However, when we succeed, dopamine rises even higher, increasing our drive and motivation significantly... This is the idea that successes build upon each other, and why celebrating the "marginal gains" is a very powerful tool to build momentum and actually make progress. Surprise increases this effect even more: big dopamine hit, when you don't anticipate it.
  
  Social media algorithms make heavy use of this principle, thereby enslaving their users, in particular on infinite-scrolling platforms such as TikTok... Your dopamine levels rise as you're looking for that one thing you like, but they drop because you don't always find that one golden nugget. Then they rise once in a while when you do find it. This contrast creates an illusion of enjoyment and traps the user in an endless search for great content, especially when it's short-form. It makes you waste time very effectively. This is related to developing the success mindset of preferring delayed gratification over instant gratification.
  
  It would be useful to reflect and introspect on your dopamine baseline, and see what actually increases and decreases your dopamine, as well as whether or not these things help you achieve your ambitions. A high dopamine baseline (which means your dopamine circuit is getting used to high hits from things such as playing games, watching short-form content, or watching porn) decreases your ability to focus for long periods of time (attention span), and by extension your ability to learn and eventually reach success. Studying and learning can actually be fun if your dopamine levels are managed properly, meaning you don't often engage in activities that release very high amounts of dopamine. You want your brain to be used to the low amounts of dopamine that studying gives. A framework to help with this reflection would be Kolb's.
  
  A short-term dopamine reset is to not use the tool or device for about half an hour to an hour (or do NSDR). However, this is not a long-term solution.
  
  Andrew Huberman Dopamine Introspection Social Media Marginal Gains Learning Studying Intellectual Work Focus Deep Work NSDR Kolb's Attention Span Instant Gratification Delayed Gratification
2. MrHoornTheScholar 28 Jun 2023
  
  in Public
  
  The 4 (behavioral) keypoints for great physical and mental as well as cognitive health:
  
  One) (2:00-4:05) View sunlight early in the day. The light needs to reach the eyes--increasing alertness, mood, and focus, through certain receptors. Also increases sleep quality at night, according to Huberman. Ideally five to ten minutes on a clear day, and ten to twenty minutes on an overcast day. No sunglasses, and certainly not through windows and windshields. If no sun is out yet, use artificial bright light. Do this daily.
  
  Two) (4:05-6:10) Do physical exercise each and every day. Doesn't have to be super intense. Huberman recommends zone two cardiovascular exercise. Walking very fast, running, cycling, rowing, swimming are examples. He says to get at least between 150 and 200 minutes of this exercise per week. Some resistance training as well for longevity and wellbeing, increases metabolism as well. Do this at least every other day, according to Huberman. Huberman alternates each day between cardiovascular exercise and resistance training.
  
  Three) (6:20-9:10) People should have access to a rapid de-stress protocol or set of tools. It should be possible to use it quickly and instantly, without friction. A single breath can be enough to de-stress. (A deep, long breath in through the nose, one more quick breath in through the nose to completely fill the lungs, and then a long breath out through the mouth.)
  
  Four) (9:12-14:00) Have a deliberate protocol for rewiring the nervous system. One option is NSDR (Non-Sleep Deep Rest protocol), which is specifically meant to increase energy.
  
  Ideally the NSDR should be done after each learning session as well to imitate deep sleep (REM) and therefore accelerate neuroplasticity and thus rewire the nervous system; increasing the strength of connections between neurons and therefore increase retention significantly.
  
  NSDR is also a process of autonomity and control, it allows one to find that they are in control of their body and brain. It makes one realize that external factors don't necessarily have influence. According to Huberman, NSDR even replenishes dopamine when it is depleted, making it also suitable for increasing motivation.
  
  Health Success Andrew Huberman Learning Focus Intellectual Work Performance Enablers NSDR Calming Nervous System Neuroplasticity Autonomity Deep Work
Visit annotations in context

Tags

Dopamine

Focus

Studying

Kolb's

Delayed Gratification

Attention Span

Deep Work

Andrew Huberman

Success

Autonomity

Health

Neuroplasticity

NSDR

Instant Gratification

Marginal Gains

Introspection

Performance Enablers

Learning

Calming Nervous System

Intellectual Work

Social Media

Annotators

MrHoornTheScholar

URL

youtube.com/watch
cdn.openai.com cdn.openai.com

Language Models are Unsupervised Multitask Learners

1
1. mark.crowley 28 Jun 2023
  
  in Public
  
  Recent work in computer vision has shown that common image datasets contain a non-trivial amount of near-duplicate images. For instance CIFAR-10 has 3.3% overlap between train and test images (Barz & Denzler, 2019). This results in an over-reporting of the generalization performance of machine learning systems.
  
  CIFAR-10 performance results are overestimates since some of the training data is essentially in the test set.
  
  image-processing convolutional-neural-networks deep-learning machine-learning datasets
Visit annotations in context

Tags

image-processing

deep-learning

datasets

machine-learning

convolutional-neural-networks

Annotators

mark.crowley

URL

cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
www.fandm.edu www.fandm.edu

617813975725918530-aamas2016-shallow-rl.pdf

1
1. mark.crowley 16 Jun 2023
  
  in Public
  
  Liang, Machado, Talvitie, Bowling - AAMAS 2016 "State of the Art Control of Atari Games Using Shallow Reinforcement Learning"
  
  A great paper showing how to think differently about the latest advances in Deep RL. All is not always what it seems!
  
  reinforcement-learning dqn deep-learning shallow-learning atari-games uwece457C
Visit annotations in context

Tags

uwece457C

deep-learning

dqn

shallow-learning

reinforcement-learning

atari-games

Annotators

mark.crowley

URL

fandm.edu/uploads/files/617813975725918530-aamas2016-shallow-rl.pdf
May 2023
medium.com medium.com

How to make it simple to explain AI, ML, DL and Data Science?

1
1. WHPrivate 28 May 2023
  
  in Public
  
  Deep Learning (DL)
  A Technique for Implementing Machine Learning
  Subfield of ML that uses specialized techniques involving multi-layer (2+) artificial neural networks
  Layering allows cascaded learning and abstraction levels (e.g. line -> shape -> object -> scene)
  Computationally intensive, enabled by clouds, GPUs, and specialized HW such as FPGAs, TPUs, etc.
  
  [29] AI - Deep Learning
  
  AI Artificial Intelligence Deep Learning
Visit annotations in context

Tags

Artificial Intelligence

Deep Learning

AI

Annotators

WHPrivate

URL

medium.com/@marcellvollmer/how-to-make-it-simple-to-explain-ai-ml-dl-and-data-science-a49e54d54a12
Jan 2023
inst-fs-iad-prod.inscloudgate.net inst-fs-iad-prod.inscloudgate.net

Untitled document

1
1. mark.crowley 10 Jan 2023
  
  in Public
  
  "Talking About Large Language Models" by Murray Shanahan
  
  nlp large-language-models deep-learning transformers
Visit annotations in context

Tags

deep-learning

transformers

large-language-models

nlp

Annotators

mark.crowley

URL

inst-fs-iad-prod.inscloudgate.net/files/ada31e51-be16-45cc-8ec7-53e2dc795590/Talking About Large Language Models.pdf
Nov 2022
nlp.seas.harvard.edu nlp.seas.harvard.edu

The Annotated Transformer

1
1. hahattpro 14 Nov 2022
  
  in Public
  
  Self-attention, sometimes called intra-attention is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
  
  deep-learning nlp
Visit annotations in context

Tags

deep-learning

nlp

Annotators

hahattpro

URL

nlp.seas.harvard.edu//2018/04/03/attention.html
Apr 2022
github.com github.com

westonganger/active_snapshot: Simplified snapshots and restoration for ActiveRecord models and associations with a transparent white-box implementation

1
1. TylerRick 27 Apr 2022
  
  in Public
  
  Instead read this gem's brief source code completely before use OR copy the code straight into your codebase.
  
  read the source code learning by reading the source having a deep understanding of something software development: use of libraries: only use if you've read the source and understand how it works software development: use of libraries vs. copying code into app project copy and paste programming
Visit annotations in context

Tags

read the source code

having a deep understanding of something

software development: use of libraries vs. copying code into app project

software development: use of libraries: only use if you've read the source and understand how it works

copy and paste programming

learning by reading the source

Annotators

TylerRick

URL

github.com/westonganger/active_snapshot
Feb 2022
www.udacity.com www.udacity.com

Intro to TensorFlow for Deep Learning | Udacity

1
1. chrisaldrich 16 Feb 2022
  
  in Public
  
  https://www.udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187
  
  bookmark deep learning TensorFlow artificial intelligence Udacity
Visit annotations in context

Tags

TensorFlow

bookmark

artificial intelligence

Udacity

deep learning

Annotators

chrisaldrich

URL

udacity.com/course/intro-to-tensorflow-for-deep-learning--ud187
Dec 2021
media-exp1.licdn.com media-exp1.licdn.com

1639993836361

1
1. aidemoreto 24 Dec 2021
  
  in Public
  
  Deep learning: A definition
  
  Deep learning: A definition
  
  Deep learning: A definition
Visit annotations in context

Tags

Deep learning: A definition

Annotators

aidemoreto

URL

media-exp1.licdn.com/dms/document/C561FAQGCEOxjElpYdQ/feedshare-document-pdf-analyzed/0/1639993836361
Jun 2021
stackoverflow.com stackoverflow.com

How do I create an average from a Ruby array?

1
1. TylerRick 08 Jun 2021
  
  in Public
  
  Programmers should be encouraged to understand what is correct, why it is correct, and then propagate.
  
  new tag?:
  
  understand why it is correct
  
  good advice programming: understand the language, don't fear it quotable annotation meta: may need new tag programming languages: learning/understanding the subtleties having a deep understanding of something combating widespread incorrectness/misconception by consistently doing it correctly spreading/propagating good ideas
Visit annotations in context

Tags

programming languages: learning/understanding the subtleties

having a deep understanding of something

quotable

annotation meta: may need new tag

programming: understand the language, don't fear it

combating widespread incorrectness/misconception by consistently doing it correctly

spreading/propagating good ideas

good advice

Annotators

TylerRick

URL

stackoverflow.com/questions/1341271/how-do-i-create-an-average-from-a-ruby-array
Apr 2021
yashuseth.blog yashuseth.blog

BERT Explained – A list of Frequently Asked Questions

1
1. mromanello 22 Apr 2021
  
  in Public
  
  tutorial article on BERT
  
  bert deep learning language models
Visit annotations in context

Tags

language models

deep learning

bert

Annotators

mromanello

URL

yashuseth.blog/2019/06/12/bert-explained-faqs-understand-bert-working/
Mar 2021
arxiv.org arxiv.org

Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation

1
1. n.parfitt 15 Mar 2021
  
  in BehSci
  
  Kozlowski, Diego, Jennifer Dusdal, Jun Pang, and Andreas Zilian. ‘Semantic and Relational Spaces in Science of Science: Deep Learning Models for Article Vectorisation’. ArXiv:2011.02887 [Physics], 5 November 2020. http://arxiv.org/abs/2011.02887.
  
  lang:en is:article semantic relational science deep learning model article vectorization literature review epistemic social pattern computer science tool research Natural Language Processing Graph Neural Networks
Visit annotations in context

Tags

literature

review

science

Natural Language Processing

pattern

learning

vectorization

is:article

epistemic

model

Graph Neural Networks

relational

social

deep

tool

lang:en

article

computer science

research

semantic

Annotators

n.parfitt

URL

arxiv.org/abs/2011.02887
Feb 2021
moritz.digital moritz.digital

Cognition Augmentation Software (CAS)

1
1. dylsteck 08 Feb 2021
  
  in Public
  
  The RECALL Augmenting Memory architecture. It can, for example, help users restore context before their next conference or class. The student, walking to a lecture, could be primed with a summary of it through his smart glasses, surfacing relevant information. The description of the "Memory vault" in this architecture exhibits a high similarity to Vannevar Bush's Memex.
  
  It's these deep learning breakthroughs that now make a lot of these memex and semantic web technologies accessible. This is a note I also referenced in the SWTs whitepaper for Cortex. Great to see Moritz and RemNote picking up on this change as well.
  
  deep learning memex
Visit annotations in context

Tags

memex

deep learning

Annotators

dylsteck

URL

moritz.digital/blog/cas
Jan 2021
arxiv.org arxiv.org

2007.05112.pdf

1
1. ragged 07 Jan 2021
  
  in Public
  
  during the backward pass, feedback connections are used in concert with forward connections to dynamically invert the forward transformation, thereby enabling errors to flow backward
  
  deep learning neuroscience
Visit annotations in context

Tags

neuroscience

deep learning

Annotators

ragged

URL

arxiv.org/pdf/2007.05112.pdf
Dec 2020
sites.research.google sites.research.google

Tone Transfer — Magenta DDSP

1
1. tanemika 07 Dec 2020
  
  in Public
  
  PT machine learning musique deep learning outil
Visit annotations in context

Tags

musique

outil

PT

deep learning

machine learning

Annotators

tanemika

URL

sites.research.google/tonetransfer
towardsdatascience.com towardsdatascience.com

14 Deep and Machine Learning Uses That Made 2019 a New AI Age.

1
1. tanemika 05 Dec 2020
  
  in Public
  
  deep learning 2019 liste
Visit annotations in context

Tags

2019

deep learning

liste

Annotators

tanemika

URL

towardsdatascience.com/14-deep-learning-uses-that-blasted-me-away-2019-206a5271d98
Sep 2020
artbreeder.com artbreeder.com

Artbreeder

1
1. tanemika 06 Sep 2020
  
  in Public
  
  deep learning site
Visit annotations in context

Tags

site

deep learning

Annotators

tanemika

URL

artbreeder.com/browse
Jun 2020
arxiv.org arxiv.org

Deep learning of stochastic contagion dynamics on complex networks

1
1. ErikStuchly 11 Jun 2020
  
  in BehSci
  
  Murphy, C., Laurence, E., & Allard, A. (2020). Deep learning of stochastic contagion dynamics on complex networks. ArXiv:2006.05410 [Cond-Mat, Physics:Physics, Stat]. http://arxiv.org/abs/2006.05410
  
  is:article lang:en modeling deep learning dynamics stochastic process complex system network big data complexity structure
Visit annotations in context

Tags

dynamics

modeling

lang:en

stochastic process

deep learning

complex system

complexity

is:article

structure

network

big data

Annotators

ErikStuchly

URL

arxiv.org/abs/2006.05410
May 2020
www.sciencedirect.com www.sciencedirect.com

MONN: A Multi-objective Neural Network for Predicting Compound-Protein Interactions and Affinities

1
1. zhentg 02 May 2020
  
  in Public
  
  This work was done at Tsinghua University.
  
  open source deep learning binding affinity
Visit annotations in context

Tags

deep learning

open source

binding affinity

Annotators

zhentg

URL

sciencedirect.com/science/article/pii/S2405471220300818
Apr 2020
doi.org doi.org

COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms

1
1. edampf 23 Apr 2020
  
  in BehSci
  
  Punn, N. S., Sonbhadra, S. K., & Agarwal, S. (2020). COVID-19 Epidemic Analysis using Machine Learning and Deep Learning Algorithms [Preprint]. Health Informatics. https://doi.org/10.1101/2020.04.08.20057679
  
  is:preprint lang:en COVID-19 machine learning deep learning algorithm analysis epidemiology transmission AI artificial intelligence data sharing data modeling prediction future real-time information Johns Hopkins dashboard
Visit annotations in context

Tags

modeling

is:preprint

artificial intelligence

deep learning

data sharing

algorithm

real-time

epidemiology

dashboard

COVID-19

AI

lang:en

data

Johns Hopkins

machine learning

analysis

future

prediction

information

transmission

Annotators

edampf

URL

doi.org/10.1101/2020.04.08.20057679
Oct 2019
neuralnetworksanddeeplearning.com neuralnetworksanddeeplearning.com

Neural Networks and Deep Learning

1
1. fuelpress 20 Oct 2019
  
  in Public
  
  As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.PerceptronsWhat is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. 
But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.So how do perceptrons work? A perceptron takes several binary inputs, x1,x2,…x1,x2,…x_1, x_2, \ldots, and produces a single binary output: In the example shown the perceptron has three inputs, x1,x2,x3x1,x2,x3x_1, x_2, x_3. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, w1,w2,…w1,w2,…w_1,w_2,\ldots, real numbers expressing the importance of the respective inputs to the output. The neuron's output, 000 or 111, is determined by whether the weighted sum ∑jwjxj∑jwjxj\sum_j w_j x_j is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms: output={01if ∑jwjxj≤ thresholdif ∑jwjxj> threshold(1)(1)output={0if ∑jwjxj≤ threshold1if ∑jwjxj> threshold\begin{eqnarray} \mbox{output} & = & \left\{ \begin{array}{ll} 0 & \mbox{if } \sum_j w_j x_j \leq \mbox{ threshold} \\ 1 & \mbox{if } \sum_j w_j x_j > \mbox{ threshold} \end{array} \right. \tag{1}\end{eqnarray} That's all there is to how a perceptron works!That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors: Is the weather good? Does your boyfriend or girlfriend want to accompany you? Is the festival near public transit? (You don't own a car). 
We can represent these three factors by corresponding binary variables x1,x2x1,x2x_1, x_2, and x3x3x_3. For instance, we'd have x1=1x1=1x_1 = 1 if the weather is good, and x1=0x1=0x_1 = 0 if the weather is bad. Similarly, x2=1x2=1x_2 = 1 if your boyfriend or girlfriend wants to go, and x2=0x2=0x_2 = 0 if not. And similarly again for x3x3x_3 and public transit.Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight w1=6w1=6w_1 = 6 for the weather, and w2=2w2=2w_2 = 2 and w3=2w3=2w_3 = 2 for the other conditions. The larger value of w1w1w_1 indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of 555 for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting 111 whenever the weather is good, and 000 whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of 333. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.Obviously, the perceptron isn't a complete model of human decision-making! 
But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.Let's simplify the way we describe perceptrons. The condition ∑jwjxj>threshold∑jwjxj>threshold\sum_j w_j x_j > \mbox{threshold} is cumbersome, and we can make two notational changes to simplify it. The first change is to write ∑jwjxj∑jwjxj\sum_j w_j x_j as a dot product, w⋅x≡∑jwjxjw⋅x≡∑jwjxjw \cdot x \equiv \sum_j w_j x_j, where www and xxx are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, b≡−thresholdb≡−thresholdb \equiv -\mbox{threshold}. 
Using the bias instead of the threshold, the perceptron rule can be rewritten: output={01if w⋅x+b≤0if w⋅x+b>0(2)(2)output={0if w⋅x+b≤01if w⋅x+b>0\begin{eqnarray} \mbox{output} = \left\{ \begin{array}{ll} 0 & \mbox{if } w\cdot x + b \leq 0 \\ 1 & \mbox{if } w\cdot x + b > 0 \end{array} \right. \tag{2}\end{eqnarray} You can think of the bias as a measure of how easy it is to get the perceptron to output a 111. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a 111. But if the bias is very negative, then it's difficult for the perceptron to output a 111. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight −2−2-2, and an overall bias of 333. Here's our perceptron: Then we see that input 000000 produces output 111, since (−2)∗0+(−2)∗0+3=3(−2)∗0+(−2)∗0+3=3(-2)*0+(-2)*0+3 = 3 is positive. Here, I've introduced the ∗∗* symbol to make the multiplications explicit. Similar calculations show that the inputs 010101 and 101010 produce output 111. But the input 111111 produces output 000, since (−2)∗1+(−2)∗1+3=−1(−2)∗1+(−2)∗1+3=−1(-2)*1+(-2)*1+3 = -1 is negative. And so our perceptron implements a NAND gate!The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. 
The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, x1x1x_1 and x2x2x_2. This requires computing the bitwise sum, x1⊕x2x1⊕x2x_1 \oplus x_2, as well as a carry bit which is set to 111 when both x1x1x_1 and x2x2x_2 are 111, i.e., the carry bit is just the bitwise product x1x2x1x2x_1 x_2: To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight −2−2-2, and an overall bias of 333. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram: One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked: Up to now I've been drawing inputs like x1x1x_1 and x2x2x_2 as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs: This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. 
Then the weighted sum ∑jwjxj∑jwjxj\sum_j w_j x_j would always be zero, and so the perceptron would output 111 if b>0b>0b > 0, and 000 if b≤0b≤0b \leq 0. That is, the perceptron would simply output a fixed value, not the desired value (x1x1x_1, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, x1,x2,…x1,x2,…x_1, x_2,\ldots.The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.Sigmoid neuronsLearning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. 
For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!): If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from 000 to 111. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. 
But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x+b)$, where $\sigma$ is called the sigmoid function (incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology), and is defined by:

\begin{eqnarray} \sigma(z) \equiv \frac{1}{1+e^{-z}}. \tag{3}\end{eqnarray}

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1,x_2,\ldots$, weights $w_1,w_2,\ldots$, and bias $b$ is

\begin{eqnarray} \frac{1}{1+\exp(-\sum_j w_j x_j-b)}. \tag{4}\end{eqnarray}

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it.
In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x+b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x+b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x +b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x+b$ is of modest size that there's much deviation from the perceptron model.

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted.
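The limiting behaviour just described is easy to check numerically. What follows is a minimal Python sketch of Equations (3) and (4), not code from the book: the function names `sigmoid` and `sigmoid_neuron`, and the particular weights, inputs and bias, are illustrative assumptions.

```python
import math

def sigmoid(z):
    """The sigmoid (logistic) function of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(weights, inputs, bias):
    """Output of a sigmoid neuron, Equation (4): sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Made-up weights, inputs and bias, chosen only for illustration.
w = [2.0, -1.0]
x = [0.638, 0.2]
b = 0.5
output = sigmoid_neuron(w, x, b)  # a value strictly between 0 and 1

# Tabulate sigma(z): close to 0 for very negative z, close to 1 for
# large positive z, and in between for z of modest size.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, sigmoid(z))
```

Only the values of $z$ near zero give outputs that differ much from what a perceptron would produce.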
Here's the shape:

[Figure: sigmoid function, plotted against $z$ for $z$ from $-4$ to $4$, with values between $0.0$ and $1.0$]

This shape is a smoothed out version of a step function:

[Figure: step function, plotted against $z$ over the same range]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w\cdot x+b$ was positive or negative. (Actually, when $w \cdot x +b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.) By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form.
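To see why smoothness matters, it may help to compare how the two functions respond to a tiny change in $z$ near zero. This is a small Python sketch under assumed values: `z_before` and `z_after` are hypothetical numbers picked to sit just either side of zero.

```python
import math

def step(z):
    """Step function: 0 for z < 0, otherwise 1."""
    return 0.0 if z < 0 else 1.0

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values of z just either side of zero.
z_before, z_after = -0.01, 0.01

# The step function flips completely from 0 to 1 ...
step_change = step(z_after) - step(z_before)

# ... while the sigmoid output moves by only about 0.005.
sigmoid_change = sigmoid(z_after) - sigmoid(z_before)

print(step_change, sigmoid_change)
```

The same change of $0.02$ in $z$ flips the step output entirely, but shifts the sigmoid output only slightly.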
The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

\begin{eqnarray} \Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}\end{eqnarray}

where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} /\partial b$ denote partial derivatives of the $\mbox{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)?
In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9".
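As a tiny sketch of that convention in Python (the function name `is_nine` and the `threshold` parameter are hypothetical, introduced here only for illustration):

```python
def is_nine(output, threshold=0.5):
    """Interpret an output of at least `threshold` as "9", otherwise "not a 9"."""
    return output >= threshold

print(is_nine(0.689))  # at least 0.5, so read as "9"
print(is_nine(0.173))  # below 0.5, so read as "not a 9"
```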
I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I
Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$. Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs.
The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4,096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network.
We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems.
First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image into six separate images. We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network:

The input layer of the network contains neurons encoding the values of the input pixels.
As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.

You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient?
The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present:

As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier:

So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$.
If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology.
Here's a few images from MNIST:

As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$.
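The two vectors just introduced, the $784$-dimensional input $x$ and the $10$-dimensional desired output $y(x)$, can be sketched in a few lines of Python. This is only an illustration using plain lists: the helper names `flatten` and `desired_output` are hypothetical, and the all-white image is made-up data rather than an MNIST sample.

```python
def flatten(image):
    """Flatten a 28x28 nested list of grey values into a 784-entry list."""
    return [pixel for row in image for pixel in row]

def desired_output(digit, num_classes=10):
    """Return y(x): 1.0 in position `digit`, 0.0 everywhere else."""
    if not 0 <= digit < num_classes:
        raise ValueError("digit must be between 0 and num_classes - 1")
    y = [0.0] * num_classes
    y[digit] = 1.0
    return y

# A made-up, all-white 28x28 image (grey value 0.0 everywhere).
blank_image = [[0.0] * 28 for _ in range(28)]
x = flatten(blank_image)

# Desired output for a training image depicting a 6.
y = desired_output(6)
print(len(x), y)
```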
To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function. We use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks):

\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{6}\end{eqnarray}

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.

Why introduce the quadratic cost?
After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.

Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6). Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications.
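The contrast between the smooth cost and the count of correct classifications can be seen in a toy calculation. The Python sketch below is an illustration under stated assumptions, not the book's code: it uses two made-up training inputs with $2$-dimensional outputs instead of $10$, and the names `quadratic_cost` and `num_correct` are hypothetical.

```python
def quadratic_cost(outputs, targets):
    """Equation (6): C = 1/(2n) * sum over x of ||y(x) - a||^2."""
    n = len(outputs)
    total = 0.0
    for a, y in zip(outputs, targets):
        total += sum((yi - ai) ** 2 for ai, yi in zip(a, y))
    return total / (2.0 * n)

def num_correct(outputs, targets):
    """Count inputs whose largest output sits where the 1 is in y(x)."""
    return sum(a.index(max(a)) == y.index(max(y))
               for a, y in zip(outputs, targets))

# Made-up desired outputs y(x) for two training inputs.
targets = [[1.0, 0.0], [0.0, 1.0]]

# Made-up network outputs a before and after a small change in the weights.
outputs_before = [[0.6, 0.4], [0.3, 0.7]]
outputs_after = [[0.7, 0.3], [0.3, 0.7]]

cost_before = quadratic_cost(outputs_before, targets)
cost_after = quadratic_cost(outputs_after, targets)
correct_before = num_correct(outputs_before, targets)
correct_after = num_correct(outputs_after, targets)

# The cost falls from about 0.125 to about 0.09, while the number of
# correct classifications stays at 2 in both cases.
print(cost_before, cost_after, correct_before, correct_after)
```

The cost registers the improvement even though the number of correct classifications doesn't move. Of course, toy numbers like these say nothing about whether a different smooth cost would give different minimizing weights and biases, which is the concern just raised.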
However, the quadratic cost function of Equation (6) works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.

Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function $C(w, b)$. This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of $w$ and $b$ as weights and biases, the $\sigma$ function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.

Okay, let's suppose we're trying to minimize some function, $C(v)$. This could be any real-valued function of many variables, $v = v_1, v_2, \ldots$. Note that I've replaced the $w$ and $b$ notation by $v$ to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize $C(v)$ it helps to imagine $C$ as a function of just two variables, which we'll call $v_1$ and $v_2$:

What we'd like is to find where $C$ achieves its global minimum.
Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, $C$, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.

One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where $C$ is an extremum. With some luck that might work when $C$ is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function?
We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:

\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray}

We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. 
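To see Equation (7) at work numerically, the following sketch compares the actual change in a cost with the first-order estimate. The cost function, the starting point, and the step sizes are hypothetical choices made purely for illustration; they do not come from the text.

```python
def cost(v1, v2):
    """Hypothetical example cost: a simple bowl, C = v1^2 + 2*v2^2."""
    return v1 ** 2 + 2.0 * v2 ** 2


def partials(v1, v2):
    """Analytic partial derivatives (dC/dv1, dC/dv2) of the example cost."""
    return 2.0 * v1, 4.0 * v2


# Assumed starting point and small displacement.
v1, v2 = 1.0, -0.5
dv1, dv2 = -1e-3, 1e-3

# Actual change in C after the move.
actual = cost(v1 + dv1, v2 + dv2) - cost(v1, v2)

# First-order estimate from Equation (7).
g1, g2 = partials(v1, v2)
approx = g1 * dv1 + g2 * dv2

error = abs(actual - approx)
print("actual change:", actual)
print("estimate     :", approx)
print("difference   :", error)
```

For this particular bowl the difference is second order in the step: it equals $\Delta v_1^2 + 2 \Delta v_2^2$, so it shrinks quadratically as the step gets smaller.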
To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$. We denote the gradient vector by $\nabla C$, i.e.:

\begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray}

In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". 
There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as

\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray}

This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose

\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray}

where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). 
(Within, of course, the limits of the approximation in Equation (9)). This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm. That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:

\begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray}

Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this:

[Figure: a ball moving down the slope of the valley in the direction opposite to the gradient]

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". 
That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is

\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray}

where the gradient $\nabla C$ is the vector

\begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray}

Just as for the two variable case, we can choose

\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray}

and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative. This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule

\begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray}

You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. 
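As a rough illustration of the update rule $v \rightarrow v' = v - \eta \nabla C$, here is a minimal Python sketch. The example cost $C(v) = v_1^2 + 2 v_2^2$, the starting point, the learning rate of 0.1, the step count, and all function names are assumptions chosen for illustration; none of them come from the text.

```python
def cost(v):
    """Hypothetical example cost C(v) = v1^2 + 2*v2^2, minimized at the origin."""
    return v[0] ** 2 + 2.0 * v[1] ** 2


def gradient(v):
    """Analytic gradient of the example cost."""
    return [2.0 * v[0], 4.0 * v[1]]


def gradient_descent(grad, v, eta, steps):
    """Apply v -> v - eta * grad(v) repeatedly; return every position visited."""
    path = [list(v)]
    for _ in range(steps):
        g = grad(v)
        # One application of the update rule, component by component.
        v = [vj - eta * gj for vj, gj in zip(v, g)]
        path.append(v)
    return path


path = gradient_descent(gradient, [1.0, -0.5], eta=0.1, steps=100)
costs = [cost(v) for v in path]
print("final position:", path[-1])
print("final cost    :", costs[-1])
```

For this particular cost each step multiplies $v_1$ by $1 - 2\eta$ and $v_2$ by $1 - 4\eta$, so the cost falls at every step with $\eta = 0.1$, while any $\eta$ above $0.5$ makes $v_2$ overshoot and grow. That is one concrete instance of the learning rate being too large for the linear approximation to hold.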
It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises
Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.
I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives! (Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.) That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. 
But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

How can we apply gradient descent to learn in a neural network? The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have

\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray}

By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6), $C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2$. 
Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,

\begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray}

where the second sum is over the entire set of training data. Swapping sides we get

\begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray}

confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch. 
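The mini-batch estimate in Equation (19) can be tried out on a toy problem. The sketch below assumes a hypothetical one-parameter model with per-example cost $C_x(w) = (w - x)^2 / 2$, synthetic data drawn uniformly from $[0, 1)$, a fixed random seed, and a mini-batch size of 100; these are illustrative choices, not the network or data from the text.

```python
import random


def grad_Cx(w, x):
    """Gradient of the hypothetical per-example cost C_x(w) = (w - x)^2 / 2."""
    return w - x


def full_gradient(w, data):
    """Exact gradient: the average of grad C_x over all n training inputs."""
    return sum(grad_Cx(w, x) for x in data) / len(data)


def minibatch_gradient(w, data, m, rng):
    """Estimate of the gradient from m inputs sampled without replacement."""
    batch = rng.sample(data, m)
    return sum(grad_Cx(w, x) for x in batch) / m


rng = random.Random(0)                        # fixed seed, for repeatability
data = [rng.random() for _ in range(60000)]   # synthetic training inputs
w = 2.0

exact = full_gradient(w, data)
estimate = minibatch_gradient(w, data, 100, rng)
print("full gradient      :", exact)
print("mini-batch estimate:", estimate)
```

The estimate touches 100 inputs where the exact gradient touches 60,000, and the two should agree to within the sampling noise of the mini-batch.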
To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,

\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray}

where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. 
And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60,000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6,000$ speedup in estimating the gradient! Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise
An extreme version of gradient descent is to use a mini-batch size of just 1. 
That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. 
Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here.

Incidentally, when I described the MNIST data earlier, I said it was
  
  @fuelpress
  
  deep learning
Visit annotations in context

Tags

deep learning

Annotators

fuelpress

URL

neuralnetworksanddeeplearning.com/chap1.html
docs.pytorch.org docs.pytorch.org

Neural Transfer with PyTorch — PyTorch Tutorials 0.1.12_2 documentation

1
1. markus22 12 Oct 2019
  
  in Public
  
  gram matrix must be normalized by dividing each element by the total number of elements in the matrix.
  
  true, after downsampling your gradient will get smaller on later layers
  
  deep learning style transfer
Visit annotations in context

Tags

deep learning

style transfer

Annotators

markus22

URL

docs.pytorch.org/advanced/neural_style_tutorial.html
Sep 2019
github.com github.com

dzharii/awesome-elasticsearch

2
1. baxemyr 09 Sep 2019
  
  in Public
  
  Deep Learning for Search - teaches you how to leverage neural networks, NLP, and deep learning techniques to improve search performance. (2019) Relevant Search: with applications for Solr and Elasticsearch - demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. (2016)
  
  Elasticsearch Deep Learning
2. baxemyr 09 Sep 2019
  
  in Public
  
  Elasticsearch with Machine Learning (English translation) by Kunihiko Kido Recommender System with Mahout and Elasticsearch
  
  Elasticsearch Deep Learning
Visit annotations in context

Tags

Deep Learning

Elasticsearch

Annotators

baxemyr

URL

github.com/dzharii/awesome-elasticsearch
Jun 2019
www.d2l.ai www.d2l.ai

Dive into Deep Learning — Dive into Deep Learning 0.7 documentation

1
1. ildar 09 Jun 2019
  
  in Public
  
  @course deep learning
Visit annotations in context

Tags

deep learning

@course

Annotators

ildar

URL

d2l.ai/
May 2019
www.andrew.cmu.edu www.andrew.cmu.edu

95-865 Unstructured Data Analytics

1
1. ildar 03 May 2019
  
  in Public
  
  @course nlp deep learning
Visit annotations in context

Tags

deep learning

@course

nlp

Annotators

ildar

URL

andrew.cmu.edu/user/georgech/95-865/
Apr 2019
arxiv.org arxiv.org

1811.11987

1
1. ildar 10 Apr 2019
  
  in Public
  
  cnn deep learning
Visit annotations in context

Tags

cnn

deep learning

Annotators

ildar

URL

arxiv.org/pdf/1811.11987.pdf
Mar 2019
www.phontron.com www.phontron.com

CS 11-747: Neural Networks for NLP

1
1. ildar 26 Mar 2019
  
  in Public
  
  nlp deep learning @course
Visit annotations in context

Tags

deep learning

@course

nlp

Annotators

ildar

URL

phontron.com/class/nn4nlp2017/schedule.html
www.comp.nus.edu.sg www.comp.nus.edu.sg

CS6101 - Deep Learning for NLP

1
1. ildar 26 Mar 2019
  
  in Public
  
  @course nlp deep learning
Visit annotations in context

Tags

deep learning

@course

nlp

Annotators

ildar

URL

comp.nus.edu.sg/~kanmy/courses/6101_1810/
stacks.stanford.edu stacks.stanford.edu

thesis-augmented.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  NEURAL READING COMPREHENSION AND BEYOND
  
  deep-learning thesis
Visit annotations in context

Tags

deep-learning

thesis

Annotators

haiy

URL

stacks.stanford.edu/file/druid:gd576xb1833/thesis-augmented.pdf
csus-dspace.calstate.edu csus-dspace.calstate.edu

Sudarshan Deo - Masters Project Report Fall 2018 - Final Draft.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  DEEP LEARNING WITH CONVOLUTIONAL NEURAL NETWORKS FOR IMAGE RECOGNITION: STEP-BY-STEP PROCESS FROM PREPARATION TO GENERALIZATION
  
  deep-learning CNN tutorial
Visit annotations in context

Tags

CNN

deep-learning

tutorial

Annotators

haiy

URL

csus-dspace.calstate.edu/bitstream/handle/10211.3/207763/Sudarshan Deo - Masters Project Report Fall 2018 - Final Draft.pdf
scholarworks.sjsu.edu scholarworks.sjsu.edu

Deep Learning for Chatbots

1
1. haiy 08 Mar 2019
  
  in Public
  
  Deep Learning for Chatbots
  
  thesis deep-learning chatbot
Visit annotations in context

Tags

thesis

chatbot

deep-learning

Annotators

haiy

URL

scholarworks.sjsu.edu/cgi/viewcontent.cgi
stacks.stanford.edu stacks.stanford.edu

EFFICIENT METHODS AND HARDWARE FOR DEEP LEARNING-augmented.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  EFFICIENT METHODS AND HARDWARE FOR DEEP LEARNING
  
  "Deep Compression" can reduce the model size by 18× to 49× without hurting the prediction accuracy. We also discovered that pruning and the sparsity constraint not only applies to model compression but also applies to regularization, and we proposed dense-sparse-dense training (DSD), which can improve the prediction accuracy for a wide range of deep learning models. To efficiently implement "Deep Compression" in hardware, we developed EIE, the "Efficient Inference Engine", a domain-specific hardware accelerator that performs inference directly on the compressed model which significantly saves memory bandwidth. Taking advantage of the compressed model, and being able to deal with the irregular computation pattern efficiently, EIE improves the speed by 13× and energy efficiency by 3,400× over GPU
  
  deep-learning thesis
Visit annotations in context

Tags

deep-learning

thesis

Annotators

haiy

URL

stacks.stanford.edu/file/druid:qf934gh3708/EFFICIENT METHODS AND HARDWARE FOR DEEP LEARNING-augmented.pdf
cjc.ict.ac.cn cjc.ict.ac.cn

pl-201745181647.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  A Survey of Deep Text Matching (深度文本匹配综述)
  
  NLP deep-learning review valuable
Visit annotations in context

Tags

deep-learning

valuable

NLP

review

Annotators

haiy

URL

cjc.ict.ac.cn/online/onlinepaper/pl-201745181647.pdf
github.com github.com

brightmart/text_classification

1
1. haiy 08 Mar 2019
  
  in Public
  
  all kinds of text classification models and more with deep learning
  
  github NLP deep-learning tutorial CNN
Visit annotations in context

Tags

tutorial

deep-learning

NLP

CNN

github

Annotators

haiy

URL

github.com/brightmart/text_classification
arxiv.org arxiv.org

1510.03820v4.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  A Sensitivity Analysis of (and Practitioners’ Guide to) Convolutional Neural Networks for Sentence Classification
  
  NLP deep-learning review
Visit annotations in context

Tags

deep-learning

NLP

review

Annotators

haiy

URL

arxiv.org/pdf/1510.03820.pdf
www.ijcai.org www.ijcai.org

Differentiated Attentive Representation Learning for Sentence Classification

1
1. haiy 08 Mar 2019
  
  in Public
  
  Differentiated Attentive Representation Learning for Sentence Classification
  
  NLP deep-learning sentence-classification
Visit annotations in context

Tags

deep-learning

NLP

sentence-classification

Annotators

haiy

URL

ijcai.org/proceedings/2018/0644.pdf
arxiv.org arxiv.org

1610.02583.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation
  
  A Gentle Tutorial of Recurrent Neural Network with Error Backpropagation
  
  tutorial lstm deep-learning
Visit annotations in context

Tags

tutorial

deep-learning

lstm

Annotators

haiy

URL

arxiv.org/pdf/1610.02583.pdf
arxiv.org arxiv.org

1812.06834.pdf

1
1. haiy 08 Mar 2019
  
  in Public
  
  A Tutorial on Deep Latent Variable Models of Natural Language
  
  tutorial deep-learning NLP
Visit annotations in context

Tags

tutorial

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1812.06834.pdf
arxiv.org arxiv.org

1702.00887.pdf

1
1. haiy 07 Mar 2019
  
  in Public
  
  BACKGROUND: ATTENTION NETWORKS
  
  deep-learning attention NLP valuable
Visit annotations in context

Tags

deep-learning

attention

valuable

NLP

Annotators

haiy

URL

arxiv.org/pdf/1702.00887.pdf
arxiv.org arxiv.org

()

1
1. haiy 07 Mar 2019
  
  in Public
  
  To the best of our knowledge, there has not been any other work exploring the use of attention-based architectures for NMT
  
  So far, no one else has used attention for machine translation.
  
  deep-learning attention NLP
Visit annotations in context

Tags

deep-learning

attention

NLP

Annotators

haiy

URL

arxiv.org/pdf/1508.04025.pdf
github.com github.com

tensorflow/nmt

1
1. haiy 07 Mar 2019
  
  in Public
  
  deep-learning NLP nmt
Visit annotations in context

Tags

deep-learning

NLP

nmt

Annotators

haiy

URL

github.com/tensorflow/nmt
gitee.com gitee.com

nmt.dvi

1
1. haiy 07 Mar 2019
  
  in Public
  
  LSTM Derivations
  
  deep-learning tutorial lstm valuable
Visit annotations in context

Tags

tutorial

deep-learning

valuable

lstm

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/LSTM_Derivations.pdf
jalammar.github.io jalammar.github.io

The Illustrated Transformer

1
1. haiy 07 Mar 2019
  
  in Public
  
  Great blog post!
  
  NLP attention deep-learning valuable
Visit annotations in context

Tags

deep-learning

attention

valuable

NLP

Annotators

haiy

URL

jalammar.github.io/illustrated-transformer/
arxiv.org arxiv.org

1607.06450v1.pdf

2
1. haiy 07 Mar 2019
  
  in Public
  
  One of the challenges of deep learning is that the gradients with respect to the weights in one layerare highly dependent on the outputs of the neurons in the previous layer especially if these outputschange in a highly correlated way. Batch normalization [Ioffe and Szegedy, 2015] was proposedto reduce such undesirable “covariate shift”. The method normalizes the summed inputs to eachhidden unit over the training cases. Specifically, for theithsummed input in thelthlayer, the batchnormalization method rescales the summed inputs according to their variances under the distributionof the data
  
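  Batch normalization was introduced to address the strong dependence between a neuron's inputs and the values currently being computed. Computing the expectations exactly would require all training samples, which is clearly impractical, so the statistics are taken over the same mini-batch used in training. This shifts the limitation onto the mini-batch size and makes the method hard to apply to RNNs, which is why Layer Normalization is needed.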
  
  deep-learning normalization
2. haiy 07 Mar 2019
  
  in Public
  
  Layer Normalization
  
  deep-learning attention
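The contrast drawn in the first note above (statistics over the mini-batch versus statistics over one example's hidden units) can be sketched in plain Python. This is a minimal illustration, not the papers' implementation: the learnable gain and bias are omitted, and the function names and the toy batch are assumptions made up for this example.

```python
import math

def normalize(values, eps=1e-5):
    """Shift and scale a list of numbers to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def batch_norm(batch):
    """Normalize each hidden unit (column) across the examples in the mini-batch."""
    columns = [normalize(list(col)) for col in zip(*batch)]
    return [list(row) for row in zip(*columns)]

def layer_norm(batch):
    """Normalize each example (row) across its own hidden units."""
    return [normalize(row) for row in batch]

# Hypothetical summed inputs: a mini-batch of 2 examples with 3 hidden units each.
batch = [[1.0, 2.0, 3.0],
         [3.0, 6.0, 9.0]]
bn = batch_norm(batch)            # depends on which examples share the batch
ln = layer_norm(batch)            # each row is normalized independently
single = layer_norm([batch[0]])   # still well defined for a batch of one
```

With a batch of one, `batch_norm` has no variance to work with and returns zeros, while `layer_norm` gives the same result for an example regardless of batch size. That independence from the mini-batch is what makes it easier to use in recurrent networks.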
Visit annotations in context

Tags

normalization

deep-learning

attention

Annotators

haiy

URL

arxiv.org/pdf/1607.06450.pdf
www.semanticscholar.org www.semanticscholar.org

1409.0473v7.pdf

1
1. haiy 06 Mar 2019
  
  in Public
  
  NEURAL MACHINE TRANSLATION BY JOINTLY LEARNING TO ALIGN AND TRANSLATE
  
  deep-learning NLP attention
Visit annotations in context

Tags

deep-learning

attention

NLP

Annotators

haiy

URL

semanticscholar.org/reader/fa72afa9b2cbc8f0d7b05d52548906610ffbb9c5
emnlp2014.org emnlp2014.org

Learning Phrase Representations using RNN Encoder--Decoder for Statistical Machine Translation

1
1. haiy 06 Mar 2019
  
  in Public
  
  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
  
  deep-learning NLP valuable GRU
Visit annotations in context

Tags

deep-learning

GRU

valuable

NLP

Annotators

haiy

URL

emnlp2014.org/papers/pdf/EMNLP2014179.pdf
proceedings.neurips.cc proceedings.neurips.cc

Sequence to Sequence Learning with Neural Networks

1
1. haiy 06 Mar 2019
  
  in Public
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

proceedings.neurips.cc/paper_files/paper/2014/file/a14ac55a4f27472c5d894ec1c3c743d2-Paper.pdf
arxiv.org arxiv.org

1703.03130.pdf

1
1. haiy 06 Mar 2019
  
  in Public
  
  A STRUCTURED SELF-ATTENTIVE SENTENCE EMBEDDING
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1703.03130.pdf
arxiv.org arxiv.org

Untitled document

1
1. haiy 06 Mar 2019
  
  in Public
  
  Contextual Word Representations: A Contextual Introduction
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1902.06006.pdf
www.csie.ntu.edu.tw www.csie.ntu.edu.tw

Widescreen Presentation

1
1. haiy 06 Mar 2019
  
  in Public
  
  Deep Learning for Dialogue Systems
  
  deep-learning Dialogue System task-oriented tutorial valuable
Visit annotations in context

Tags

tutorial

deep-learning

valuable

Dialogue System

task-oriented

Annotators

haiy

URL

csie.ntu.edu.tw/~yvchen/doc/COLING18_Tutorial.pdf
Feb 2019
aclweb.org aclweb.org

End-to-End Task-Completion Neural Dialogue Systems

1
1. haiy 25 Feb 2019
  
  in Public
  
  Recent advances of deep learning have inspired many applications of neural models to dialogue systems. Wen et al. (2017) and Bordes et al. (2017) introduced a network-based end-to-end trainable task-oriented dialogue system, which treated dialogue system learning as the problem of learning a mapping from dialogue histories to system responses, and applied an encoder-decoder model to train the whole system
  
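  Wen and Bordes introduced a network-based, end-to-end task-oriented dialogue system. It treats dialogue system learning as the problem of learning a mapping from dialogue history to system responses, and uses an encoder-decoder to train the whole system.
  
  This idea is interesting and very close to my earlier plan of building a corpus of telemarketer conversations to generate replies to users. It looks quite feasible.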
  
  chatbot deep-learning
Visit annotations in context

Tags

deep-learning

chatbot

Annotators

haiy

URL

aclweb.org/anthology/I17-1074
github.com github.com

a short introduction to mxnet design and implementation (chinese) · Issue #797 · apache/incubator-mxnet

1
1. haiy 21 Feb 2019
  
  in Public
  
  A brief introduction to the design and implementation of MXNet
  
  deep-learning mxnet system
Visit annotations in context

Tags

deep-learning

mxnet

system

Annotators

haiy

URL

github.com/apache/incubator-mxnet/issues/797
arxiv.org arxiv.org

1611.05962.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning NLP
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1611.05962.pdf
arxiv.org arxiv.org

1507.05523.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  How to Generate a Good Word Embedding?
  
  deep-learning NLP
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1507.05523.pdf
licstar.net licstar.net

Deep Learning in NLP (1): Word Vectors and Language Models | licstar's blog

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning NLP
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

licstar.net/archives/328
gitee.com gitee.com

卷积神经网络&深度学习手册.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning CNN book tutorial
Visit annotations in context

Tags

CNN

deep-learning

book

tutorial

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/卷积神经网络&深度学习手册.pdf
yann.lecun.com yann.lecun.com

lecun-01a.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning CNN
Visit annotations in context

Tags

CNN

deep-learning

Annotators

haiy

URL

yann.lecun.com/exdb/publis/pdf/lecun-98.pdf
gitee.com gitee.com

Untitled document

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning book mxnet
Visit annotations in context

Tags

deep-learning

book

mxnet

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/mxnet-autograd.pdf
book.haihome.top book.haihome.top

Deep Learning

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning book
Visit annotations in context

Tags

deep-learning

book

Annotators

haiy

URL

book.haihome.top/deeplearning/www.deeplearningbook.org/
gitee.com gitee.com

Distributed Representations, Simple Recurrent Networks, And Grammatical Structure

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning
Visit annotations in context

Tags

deep-learning

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/RNN.pdf
gitee.com gitee.com

chris_gru_lstm.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning tutorial book
Visit annotations in context

Tags

tutorial

deep-learning

book

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/chris_gru_lstm.pdf
dougalmaclaurin.com dougalmaclaurin.com

autograd-thesis-harvard.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning book
Visit annotations in context

Tags

deep-learning

book

Annotators

haiy

URL

dougalmaclaurin.com/phd-thesis.pdf
gitee.com gitee.com

deeplearning.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning tutorial book
Visit annotations in context

Tags

tutorial

deep-learning

book

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/DeepLearningTutorial.pdf
gitee.com gitee.com

DeepLearningMethodsAndApplications.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning book
Visit annotations in context

Tags

deep-learning

book

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/DeepLearningMethodsAndApplications.pdf
gitee.com gitee.com

DeepLearningFramework.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  deep-learning system
Visit annotations in context

Tags

deep-learning

system

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/DeepLearningFramework.pdf
arxiv.org arxiv.org

1708.02709.pdf

1
1. haiy 21 Feb 2019
  
  in Public
  
  Recent Trends in Deep Learning Based Natural Language Processing
  
  deep-learning trends
Visit annotations in context

Tags

deep-learning

trends

Annotators

haiy

URL

arxiv.org/pdf/1708.02709.pdf
www.aclweb.org www.aclweb.org

Document Modeling with Gated Recurrent Neural Network for Sentiment Classification

1
1. haiy 20 Feb 2019
  
  in Public
  
  deep-learning text-classification
Visit annotations in context

Tags

deep-learning

text-classification

Annotators

haiy

URL

aclweb.org/anthology/D15-1167.pdf
gitee.com gitee.com

Frontiers of Natural Language Processing - Deep Learning Indaba 2018, Stellenbosch, South Africa

1
1. haiy 20 Feb 2019
  
  in Public
  
  NLP deep-learning review
Visit annotations in context

Tags

deep-learning

NLP

review

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/Frontiers of Natural Language Processing.pdf
nlp.stanford.edu nlp.stanford.edu

GloVe- Global Vectors for Word Representation.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  GloVe: Global Vectors for Word Representation
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

nlp.stanford.edu/pubs/glove.pdf
www.jmlr.org www.jmlr.org

collobert11a.dvi

1
1. haiy 20 Feb 2019
  
  in Public
  
  Natural Language Processing (Almost) from Scratch
  
  NLP deep-learning google
Visit annotations in context

Tags

google

deep-learning

NLP

Annotators

haiy

URL

jmlr.org/papers/volume12/collobert11a/collobert11a.pdf
gitee.com gitee.com

Short Text Similarity with Word Embeddings.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  Short Text Similarity with Word Embeddings
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/Short Text Similarity with Word Embeddings.pdf
gitee.com gitee.com

TRAINING RECURRENT NEURAL NETWORKS.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  deep-learning
Visit annotations in context

Tags

deep-learning

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/TRAINING RECURRENT NEURAL NETWORKS.pdf
gitee.com gitee.com

深度长文-NLP的巨人肩膀(下).pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/深度长文-NLP的巨人肩膀(下).pdf
gitee.com gitee.com

深度长文-NLP的巨人肩膀(上).pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/深度长文-NLP的巨人肩膀(上).pdf
markcmarino.com markcmarino.com

1706.03762.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  Attention Is All You Need
  
  NLP deep-learning attention
Visit annotations in context

Tags

deep-learning

attention

NLP

Annotators

haiy

URL

markcmarino.com/readings/150/maw/attentionisallyouneed.pdf
cs.stanford.edu cs.stanford.edu

paragraph_vector.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  Distributed Representations of Sentences and Documents - Doc2Vec
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

cs.stanford.edu/~quocle/paragraph_vector.pdf
tanthiamhuat.files.wordpress.com tanthiamhuat.files.wordpress.com

Deep Learning with Python

1
1. haiy 20 Feb 2019
  
  in Public
  
  deep-learning python book
Visit annotations in context

Tags

deep-learning

book

python

Annotators

haiy

URL

tanthiamhuat.files.wordpress.com/2018/03/deeplearningwithpython.pdf
aclanthology.org aclanthology.org

Deep Contextualized Word Representations

1
1. haiy 20 Feb 2019
  
  in Public
  
  Deep contextualized word representations
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

aclanthology.org/N18-1203.pdf
gitee.com gitee.com

Attention and Augmented Recurrent Neural Networks.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  attention deep-learning NLP
Visit annotations in context

Tags

deep-learning

attention

NLP

Annotators

haiy

URL

gitee.com/arthurhu/pdfs/raw/master/deeplearning/nlp/Attention and Augmented Recurrent Neural Networks.pdf
arxiv.org arxiv.org

1810.04805.pdf

1
1. haiy 20 Feb 2019
  
  in Public
  
  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
  
  NLP deep-learning
Visit annotations in context

Tags

deep-learning

NLP

Annotators

haiy

URL

arxiv.org/pdf/1810.04805
Jan 2019
www.sciencedirect.com www.sciencedirect.com

KNIME for reproducible cross-domain analysis of life science data

2
1. Maciej_Motyka 02 Jan 2019
  
  in Public
  
  By utilizing the Deeplearning4j library for model representation, learning and prediction, KNIME builds upon a well performing open source solution with a thriving community.
  
  KNIME ML/AI deep learning integration
2. Maciej_Motyka 02 Jan 2019
  
  in Public
  
  It is especially thanks to the work of Yann LeCun and Yoshua Bengio (LeCun et al., 2015) that the application of deep neural networks has boomed in recent years. The technique, which utilizes neural networks with many layers and enhanced backpropagation algorithms for learning, was made possible through both new research and the ever increasing performance of computer chips.
  
  ML/AI deep learning
Visit annotations in context

Tags

integration

ML/AI

KNIME

deep learning

Annotators

Maciej_Motyka

URL

sciencedirect.com/science/article/pii/S0168165617315651
Dec 2018
winstonhsu.info winstonhsu.info

[Fall 2018] Cognitive Computing (感知運算)

1
1. ildar 04 Dec 2018
  
  in Public
  
  deep learning @course
Visit annotations in context

Tags

deep learning

@course

Annotators

ildar

URL

winstonhsu.info/2018f-cognitive-computing/
berkeley-deep-learning.github.io berkeley-deep-learning.github.io

CS 294-131: Special Topics in Deep Learning

1
1. ildar 03 Dec 2018
  
  in Public
  
  deep learning @course
Visit annotations in context

Tags

deep learning

@course

Annotators

ildar

URL

berkeley-deep-learning.github.io/cs294-131-f18/
bcourses.berkeley.edu bcourses.berkeley.edu

CS194/294-129 Designing, Visualizing and Understanding Deep Neural Networks

1
1. ildar 03 Dec 2018
  
  in Public
  
  deep learning @course
Visit annotations in context

Tags

deep learning

@course

Annotators

ildar

URL

bcourses.berkeley.edu/courses/1468734
www.quora.com www.quora.com

Why is ReLU the most common activation function used in neural networks? - Quora

1
1. ildar 02 Dec 2018
  
  in Public
  
  deep learning activation function
Visit annotations in context

Tags

deep learning

activation function

Annotators

ildar

URL

quora.com/Why-is-ReLU-the-most-common-activation-function-used-in-neural-networks
github.com github.com

kh-kim/nlp_with_pytorch

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning korean
Visit annotations in context

Tags

korean

deep learning

nlp

Annotators

ildar

URL

github.com/kh-kim/nlp_with_pytorch
shop.oreilly.com shop.oreilly.com

Natural Language Processing with PyTorch

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning @find
Visit annotations in context

Tags

@find

deep learning

nlp

Annotators

ildar

URL

shop.oreilly.com/product/0636920063445.do
github.com github.com

astorfi/Deep-Learning-NLP

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning
Visit annotations in context

Tags

deep learning

nlp

Annotators

ildar

URL

github.com/astorfi/Deep-Learning-NLP
github.com github.com

rguthrie3/DeepLearningForNLPInPytorch

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning
Visit annotations in context

Tags

deep learning

nlp

Annotators

ildar

URL

github.com/rguthrie3/DeepLearningForNLPInPytorch
github.com github.com

rouseguy/DeepLearning-NLP

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning
Visit annotations in context

Tags

deep learning

nlp

Annotators

ildar

URL

github.com/rouseguy/DeepLearning-NLP
github.com github.com

brianspiering/awesome-dl4nlp

1
1. ildar 02 Dec 2018
  
  in Public
  
  nlp deep learning
Visit annotations in context

Tags

deep learning

nlp

Annotators

ildar

URL

github.com/brianspiering/awesome-dl4nlp
Oct 2018
Local file Local file

SO-YOLO based WBC Detection with Fourier Ptychographic Microscopy

1
1. aerobius 05 Oct 2018
  
  in Public
  
  Many detection methods such as Faster-RCNN and YOLO, perform badly in small objects detection. With some considerable improvements in the original framework of YOLOv2, our proposed SO-YOLO can solve this problem perfectly.
  
  YOLO SO-YOLO deep learning
Tags

YOLO

SO-YOLO

deep learning

Annotators

aerobius
ieeexplore.ieee.org ieeexplore.ieee.org

SO-YOLO based WBC Detection with Fourier Ptychographic Microscopy - IEEE Journals & Magazine

2
1. aerobius 04 Oct 2018
  
  in Public
  
  As a convolutional neural network, SO-YOLO outperforms state-of-the-art detection methods both in accuracy and speed.
  
  YOLO deep learning SO-YOLO
2. aerobius 04 Oct 2018
  
  in Public
  
  SO-YOLO performs well in detecting small objects compared with other methods.
  
  YOLO SO-YOLO deep learning
Visit annotations in context

Tags

YOLO

SO-YOLO

deep learning

Annotators

aerobius

URL

ieeexplore.ieee.org/document/8444969
May 2018
web.stanford.edu web.stanford.edu

Stanford University: Tensorflow for Deep Learning Research

1
1. gylpm 22 May 2018
  
  in Public
  
  CS 20: Tensorflow for Deep Learning Research
  
  Course dates: January-March 2018
  
  course stanford tensorflow deep learning
Visit annotations in context

Tags

stanford

course

tensorflow

deep learning

Annotators

gylpm

URL

web.stanford.edu/class/cs20si/
Mar 2018
arxiv.org arxiv.org

Mathematics of Deep Learning

1
1. gylpm 08 Mar 2018
  
  in Public
  
  Mathematics of Deep Learning
  
  paper deep learning
Visit annotations in context

Tags

paper

deep learning

Annotators

gylpm

URL

arxiv.org/abs/1712.04741
distill.pub distill.pub

Feature Visualization

1
1. gylpm 08 Mar 2018
  
  in Public
  
  deep learning visualization
Visit annotations in context

Tags

visualization

deep learning

Annotators

gylpm

URL

distill.pub/2017/feature-visualization
distill.pub distill.pub

The Building Blocks of Interpretability

1
1. gylpm 08 Mar 2018
  
  in Public
  
  deep learning visualization
Visit annotations in context

Tags

visualization

deep learning

Annotators

gylpm

URL

distill.pub/2018/building-blocks
www.pyimagesearch.com www.pyimagesearch.com

Face detection with OpenCV and deep learning - PyImageSearch

1
1. gylpm 07 Mar 2018
  
  in Public
  
  deep learning face detection caffe opencv
Visit annotations in context

Tags

caffe

opencv

face detection

deep learning

Annotators

gylpm

URL

pyimagesearch.com/2018/02/26/face-detection-with-opencv-and-deep-learning/
webfoundation.org webfoundation.org

AI_Report_WF.pdf

2
1. hiperterminal 01 Mar 2018
  
  in Public
  
  artificial neural network
  
  Deep learning includes neural networks
  
  Deep learning Redes neuronales
2. hiperterminal 01 Mar 2018
  
  in Public
  
  Artificial intelligence (AI), machine learning and deep learning
  
  Graphical explanation of artificial intelligence, machine learning and deep learning
  
  Inteligencia artificial Machine learning Deep learning
Visit annotations in context

Tags

Redes neuronales

Inteligencia artificial

Machine learning

Deep learning

Annotators

hiperterminal

URL

webfoundation.org/docs/2017/07/AI_Report_WF.pdf
Sep 2017
tenso.rs tenso.rs

TensorFire

1
1. dckc 01 Sep 2017
  
  in Public
  
  webgl tensorflow deep-learning data parallel
Visit annotations in context

Tags

tensorflow

deep-learning

data

parallel

webgl

Annotators

dckc

URL

tenso.rs/
Aug 2017
arxiv.org arxiv.org

1708.04347.pdf

1
1. hodapp 31 Aug 2017
  
  in Public
  
  This is a very easy paper to follow, but it looks like their methodology is a simple way to improve performance on limited data. I'm curious how well this is reproduced elsewhere.
  
  convolutional neural networks neural networks deep learning
Visit annotations in context

Tags

convolutional neural networks

deep learning

neural networks

Annotators

hodapp

URL

arxiv.org/pdf/1708.04347.pdf
databricks.com databricks.com

A Vision for Making Deep Learning Simple - The Databricks Blog

1
1. SamRose 31 Aug 2017
  
  in Public
  
  pyspark deep learning sparkdl
Visit annotations in context

Tags

pyspark

deep learning

sparkdl

Annotators

SamRose

URL

databricks.com/blog/2017/06/06/databricks-vision-simplify-large-scale-deep-learning.html
blog.athelas.com blog.athelas.com

A Brief History of CNNs in Image Segmentation: From R-CNN to Mask R-CNN

1
1. hodapp 17 Aug 2017
  
  in Public
  
  Excellent overview. I found the papers a little hard to grasp, and this cleared a lot of that up.
  
  deep learning computer vision machine learning
Visit annotations in context

Tags

computer vision

deep learning

machine learning

Annotators

hodapp

URL

blog.athelas.com/a-brief-history-of-cnns-in-image-segmentation-from-r-cnn-to-mask-r-cnn-34ea83205de4
Apr 2017
www.nature.com www.nature.com

Hybrid computing using a neural network with dynamic external memory : Nature

1
1. vaughn 19 Apr 2017
  
  in Public
  
  code: https://github.com/deepmind/dnc
  
  deeplearners deep learning ai mtldata
Visit annotations in context

Tags

deeplearners

mtldata

ai

deep learning

Annotators

vaughn

URL

nature.com/articles/nature20101
www.fast.ai www.fast.ai

Launching fast.ai · fast.ai

3
1. vaughn 16 Apr 2017
  
  in Public
  
  areas where deep learning is currently being poorly utilized
  
  who is curating a list of deep learning success stories, case studies and applications?
  
  deep learning ai mtldata deeplearners
2. vaughn 16 Apr 2017
  
  in Public
  
  highly automated tools for training deep learning models
  
  such as?
  
  deep learning ai deeplearners mtldata
3. vaughn 16 Apr 2017
  
  in Public
  
  The best way we can help these people is by giving them the tools and knowledge to solve their own problems, using their own expertise and experience.
  
  Agree or disagree?
  
  deeplearners deep learning ai
Visit annotations in context

Tags

mtldata

ai

deep learning

deeplearners

Annotators

vaughn

URL

fast.ai/2016/10/07/fastai-launch/
channel9.msdn.com channel9.msdn.com

Recurrent Neural Networks and Other Machines that Learn Algorithms Symposium Session 1

1
1. vaughn 16 Apr 2017
  
  in Public
  
  first 15 minutes is a very interesting history of deep learning
  
  deeplearners deep learning ai
Visit annotations in context

Tags

deeplearners

ai

deep learning

Annotators

vaughn

URL

channel9.msdn.com/Events/Neural-Information-Processing-Systems-Conference/Neural-Information-Processing-Systems-Conference-NIPS-2016/Recurrent-Neural-Networks-and-Other-Machines-that-Learn-Algorithms-Symposium-Session-1
mtldata.com mtldata.com

Learning Deep Learning - MTLDATA

3
1. vaughn 14 Apr 2017
  
  in Public
  
  Slack
  
  register or login here: http://mtldata.com/slack
  
  mtldata deep learners deep learning
2. vaughn 14 Apr 2017
  
  in Public
  
  r/deeplearners subreddit
  
  http://reddit.com/r/deeplearners
  
  deep learners deep learning
3. vaughn 14 Apr 2017
  
  in Public
  
  online reading group
  
  Please share any feedback you have.
  
  deep learners deep learning
Visit annotations in context

Tags

mtldata

deep learners

deep learning

Annotators

vaughn

URL

mtldata.com/deeplearners/
arxiv.org arxiv.org

1704.01568.pdf

1
1. vaughn 12 Apr 2017
  
  in Public
  
  Appendix A: Table of various deep learning applications
  
  This is a good list. Has anyone come across a comprehensive list of deep learning applications?
  
  deep learning
Visit annotations in context

Tags

deep learning

Annotators

vaughn

URL

arxiv.org/ftp/arxiv/papers/1704/1704.01568.pdf
colah.github.io colah.github.io

Understanding LSTM Networks -- colah's blog

1
1. rookiepig 06 Apr 2017
  
  in Public
  
  Almost all exciting results based on recurrent neural networks are achieved with them.
  
  LSTM is a kind of RNN, but "RNN" on its own usually refers to the traditional, standard RNN.
  
  deep learning lstm
Visit annotations in context

Tags

lstm

deep learning

Annotators

rookiepig

URL

colah.github.io/posts/2015-08-Understanding-LSTMs/
www.tensorflow.org www.tensorflow.org

MNIST For ML Beginners | TensorFlow

1
1. pavelanni 01 Apr 2017
  
  in Public
  
  If we write that out as equations, we get:
  
  It would be easier to understand what are x and y and W here if the actual numbers were used, like 784, 10, 55000, etc. In this simple example there are 3 x and 3 y, which is misleading. In reality there are 784 x elements (for each pixel) and 55,000 such x arrays and only 10 y elements (for each digit) and then 55,000 of them.
  
  tensorflow deep learning neural networks
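The sizes mentioned in the note above can be made concrete with a small pure-Python sketch of the softmax regression y = softmax(Wx + b). The constants come from the note; everything else (the function names, the all-zero placeholder parameters, the made-up image, and storing W as 10 rows of 784 values rather than the tutorial's layout) is an assumption for illustration only.

```python
import math

NUM_PIXELS = 784    # 28 x 28 image flattened into one x vector
NUM_CLASSES = 10    # one y entry per digit 0-9
NUM_TRAIN = 55000   # number of such x vectors in the training set, per the note

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, b, x):
    """Compute y = softmax(W x + b) for one flattened image x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return softmax(scores)

# Placeholder parameters: all zeros, so every digit gets the same probability.
W = [[0.0] * NUM_PIXELS for _ in range(NUM_CLASSES)]
b = [0.0] * NUM_CLASSES
x = [0.5] * NUM_PIXELS   # one made-up image
y = predict(W, b, x)     # 10 probabilities, one per digit
```

Each of the 55,000 training images is one such `x` of length 784, and each label is one `y` of length 10.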
Visit annotations in context

Tags

tensorflow

deep learning

neural networks

Annotators

pavelanni

URL

tensorflow.org/get_started/mnist/beginners
Mar 2017
engineering.skymind.io engineering.skymind.io

Distributed Deep Learning, Part 1: An Introduction to Distributed Training of Neural Networks

2
1. rookiepig 09 Mar 2017
  
  in Public
  
  Consequently, our advice is simple: continue to train your networks on a single machine, until the training time becomes prohibitive.
  
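  Make sure you have a clear estimate of data loading time, parameter communication time, and compute time; do not parallelize for its own sake. If a problem can be solved on a single machine, there is no rush to move to multiple machines.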
  
  deep learning
2. rookiepig 09 Mar 2017
  
  in Public
  
  ...model parallelism can work well in practice, data parallelism is arguably the preferred approach for distributed systems and has been the focus of more research
  
  Why?
  
  deep learning
Visit annotations in context

Tags

deep learning

Annotators

rookiepig

URL

engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
Dec 2016
www.youtube.com www.youtube.com

Nuts and Bolts of Applying Deep Learning (Andrew Ng)

1
1. sravya8 31 Dec 2016
  
  in Public
  
  Key points:
  
  Scale of data is especially good for large NNs
  
  Having a combination of HPC and AI skills is important to have optimal impact (handle scale challenges and bigger/complex NN)
  
  Most of the value right now comes from CNNs, FCs, and RNNs. Unsupervised learning, GANs and others might be the future, but they are research topics right now.
  
  E2E DL might be relevant for some cases in future like speech -> transcript, Image -> captioning, text -> image
  
  Self-driving cars might also move to E2E, but none of us has enough image -> steering data
  
  Workflow:
  
  Bias = Training error - Human error. Try Bigger model, run longer, New model architecture
  
  Variance = Dev error - Train error. Try More data, Regularization, New model architecture.
  
  Conflict between bias and variance is weaker in DL. We can have bigger model with more data.
  
  More data:
  
  Data synthesis/augmentation is becoming useful and popular: OCR (superimpose letters on various images), Speech (superimpose various background noises), NLP (?). It does have drawbacks if the synthesized data is not representative.
  
  Unified data warehouse helps leverage data usage across company
  
  Data set breakdown:
  
  Dev and test should come from same distribution. As we spend a lot of time optimizing for Dev accuracy.
  
  Progress plateaus above Human level performance:
  
  But there is theoretical optimal error rate (Bayes rate)
  
  What to do when bias is high:
  
  Look at the examples the machine got wrong
  
  Get labels from humans?
  
  Error analysis: Segment training - identify segments where training error is higher than human.
  
  Estimate bias/variance effect?
  
  How do you define human level performance: Example: Error of a panel of experts
  
  Size of data:
  
  How do you define a NN as small vs medium vs large?
  
  Can a large NN leverage more data because, unlike a smaller NN, it would not overfit?
  
  Deep learning
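The workflow section of the notes above boils down to two subtractions and a choice of remedies, which can be sketched as follows. The definitions of bias and variance and the two remedy lists are taken from the notes; the function name, the rule of tackling the larger gap first, and the example error rates are assumptions added for illustration.

```python
def diagnose(human_error, train_error, dev_error):
    """Apply the bias/variance workflow from the notes to three error rates."""
    bias = train_error - human_error      # gap to human-level performance
    variance = dev_error - train_error    # gap between dev and training sets
    # Assumption for this sketch: work on whichever gap is larger first.
    if bias >= variance:
        advice = ["bigger model", "train longer", "new model architecture"]
    else:
        advice = ["more data", "regularization", "new model architecture"]
    return bias, variance, advice

# Hypothetical error rates, expressed as fractions.
high_bias = diagnose(human_error=0.01, train_error=0.05, dev_error=0.06)
high_variance = diagnose(human_error=0.01, train_error=0.02, dev_error=0.10)
```

In the first call the training error sits far above human error, so the bias remedies come back; in the second the dev error sits far above the training error, so the variance remedies come back.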
Visit annotations in context

Tags

Deep learning

Annotators

sravya8

URL

youtube.com/watch
karpathy.medium.com karpathy.medium.com

Yes you should understand backprop – Andrej Karpathy – Medium

1
1. shubhamjain0594 27 Dec 2016
  
  in Public
  
  A good blog post on why you need to understand backpropagation; it links to videos from CS231n by Karpathy.
  
  deep-learning blog-post
Visit annotations in context

Tags

deep-learning

blog-post

Annotators

shubhamjain0594

URL

karpathy.medium.com/yes-you-should-understand-backprop-e2f06eab496b
Nov 2016
classroom.udacity.com classroom.udacity.com

Deep Neural Network in TensorFlow - Udacity

1
1. Lancejchen 22 Nov 2016
  
  in Public
  
  Deep neural networks use multiple layers with each layer requiring its own weight and bias.
  
  Every layer needs its own weights and bias. In TensorFlow, it is good practice to keep all the weights in a dictionary, which makes them easier to manage.
  
  deep learning
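The dictionary pattern from the note above can be shown without TensorFlow. This is a plain-Python sketch under stated assumptions: the layer names, layer sizes, initialization, and helper functions are all hypothetical, and real TensorFlow code would store variables or tensors in the dictionaries instead of nested lists.

```python
import random

def init_layer(n_in, n_out, rng):
    """Random weight matrix with n_out rows of n_in values."""
    return [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]

rng = random.Random(0)
# Hypothetical layer sizes as (inputs, outputs).
sizes = {"hidden_layer": (4, 3), "out": (3, 2)}

# One dictionary for weights and one for biases, keyed by layer name.
weights = {name: init_layer(n_in, n_out, rng)
           for name, (n_in, n_out) in sizes.items()}
biases = {name: [0.0] * n_out for name, (_, n_out) in sizes.items()}

def dense(x, W, b, activation=None):
    """One fully connected layer: W x + b, optionally followed by an activation."""
    out = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    return [activation(v) for v in out] if activation else out

def network(x):
    """Two-layer network that looks up each layer's parameters by name."""
    hidden = dense(x, weights["hidden_layer"], biases["hidden_layer"],
                   activation=lambda v: max(0.0, v))
    return dense(hidden, weights["out"], biases["out"])

output = network([1.0, 2.0, 3.0, 4.0])
```

Keeping the parameters keyed by layer name means adding a layer only touches `sizes` and `network`, and every parameter can be listed or saved by iterating over the two dictionaries.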
Visit annotations in context

Tags

deep learning

Annotators

Lancejchen

URL

classroom.udacity.com/nanodegrees/nd013/parts/fbf77062-5703-404e-b60c-95b78b2f3f9e/modules/6df7ae49-c61c-4bb2-a23e-6527e69209ec/lessons/b516a270-8600-4f93-a0a3-20dfeabe5da6/concepts/83a3a2a2-a9bd-4b7b-95b0-eb924ab14432
github.com github.com

sbrugman/deep-learning-papers

1
1. robertknight 10 Nov 2016
  
  in Public
  
  deep-learning
Visit annotations in context

Tags

deep-learning

Annotators

robertknight

URL

github.com/sbrugman/deep-learning-papers/blob/master/README.md
Oct 2016
medium.com medium.com

Deep Learning Is Going to Teach Us All the Lesson of Our Lives: Jobs Are for Machines – Basic income – Medium

1
1. otterscotter 23 Oct 2016
  
  in Public
  
  Big Data
  
  big data information deep learning machine learning AI
Visit annotations in context

Tags

AI

information

deep learning

big data

machine learning

Annotators

otterscotter

URL

medium.com/basic-income/deep-learning-is-going-to-teach-us-all-the-lesson-of-our-lives-jobs-are-for-machines-7c6442e37a49
www.miaoerduo.com www.miaoerduo.com

A Caffe-based implementation of DeepID2 (Part 1) - 喵耳朵

1
1. gitlinux 12 Oct 2016
  
  in Public
  
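  The requirement here is that the input data come in pairs, and each pair shares one label indicating whether the two samples belong to the same class.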
  
  Verification signal
  
  DeepID2 Deep Learning
Visit annotations in context

Tags

Deep Learning

DeepID2

Annotators

gitlinux

URL

miaoerduo.com/deep-learning/基于caffe的deepid2实现（上）.html
Jul 2016
arxiv.org arxiv.org

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

1
1. suhangpro 26 Jul 2016
  
  in Public
  
  XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
  
  xnor-net: a very efficient network
  
  arxiv paper deep learning
Visit annotations in context

Tags

arxiv

paper

deep learning

Annotators

suhangpro

URL

arxiv.org/abs/1603.05279
arxiv.org arxiv.org

Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks

1
1. suhangpro 12 Jul 2016
  
  in Public
  
  Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks
  
  arxiv paper vision deep learning motion prediction
Visit annotations in context

Tags

motion prediction

paper

vision

arxiv

deep learning

Annotators

suhangpro

URL

arxiv.org/abs/1607.02586
inst-fs-iad-prod.inscloudgate.net inst-fs-iad-prod.inscloudgate.net

Deep learning

6
1. haiy 10 Jul 2016
  
  in Public
  
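  half-spaces separated by a hyperplane [19].
  
  The limitation of traditional algorithms: in the image and speech domains, a model has to be insensitive to irrelevant variations while staying sensitive to a few very small differences.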
  
  Deep Learning
2. haiy 10 Jul 2016
  
  in Public
  
  Deep learning
  
  Three of the "big four" of deep learning
  
  Deep Learning
3. haiy 10 Jul 2016
  
  in Public
  
  The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure.
  
  The most important aspect of deep learning is that the multiple layers of features are learned automatically.
  
  Deep Learning
4. haiy 10 Jul 2016
  
  in Public
  
  most practitioners use a procedure called stochastic gradient descent (SGD).
  
  Stochastic gradient descent; explained very well here.
  
  Deep Learning
5. haiy 10 Jul 2016
  
  in Public
  
  The chain rule of derivatives tells us how two small effects (that of a small change of x on y, and that of y on z) are composed.
  
  Whoa! So that is how it works!!!
  
  Deep Learning
6. haiy 10 Jul 2016
  
  in Public
  
  The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives.
  
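  Backpropagation, used to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules, is really nothing more than a practical application of the chain rule for derivatives.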
  
  Deep Learning
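The last two highlights above (the chain rule composing two small effects, and backpropagation as its application) can be checked on a tiny stack of two modules. This is an illustrative sketch only: the two functions y = w1 * x and z = w2 * y**2, the variable names, and the input values are assumptions chosen so the arithmetic is easy to follow.

```python
def forward(x, w1, w2):
    """Two stacked modules: y = w1 * x, then z = w2 * y**2."""
    y = w1 * x
    z = w2 * y ** 2
    return y, z

def backward(x, w1, w2):
    """Gradients of z with respect to w1 and w2 via the chain rule."""
    y, _ = forward(x, w1, w2)
    dz_dy = 2 * w2 * y        # effect of a small change in y on z
    dy_dw1 = x                # effect of a small change in w1 on y
    dz_dw2 = y ** 2           # w2 touches z directly
    dz_dw1 = dz_dy * dy_dw1   # chain rule: compose the two small effects
    return dz_dw1, dz_dw2

def numeric_grad(f, value, eps=1e-6):
    """Central finite difference, used only to check the analytic gradient."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)

x, w1, w2 = 3.0, 0.5, 2.0
dz_dw1, dz_dw2 = backward(x, w1, w2)
check_w1 = numeric_grad(lambda w: forward(x, w, w2)[1], w1)
```

Here y = 1.5, so dz/dy = 6 and dy/dw1 = 3, and their product 18 agrees with the finite-difference estimate in `check_w1`.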
Visit annotations in context

Tags

Deep Learning

Annotators

haiy

URL

inst-fs-iad-prod.inscloudgate.net/files/2cc016f7-1e27-47f7-85bb-36aec57d7787/Deep Learning.pdf
en.wikipedia.org en.wikipedia.org

Run-length encoding - Wikipedia, the free encyclopedia

1
1. haiy 10 Jul 2016
  
  in Public
  
  Run-length encoding (RLE)
  
  A lossless data compression encoding
  
  Deep Learning
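Run-length encoding, as named in the note above, replaces each run of repeated symbols with the symbol and its count, and decoding restores the input exactly. A minimal sketch follows; the function names and the sample string are assumptions for illustration, and the pair-list output is just one of several possible encodings.

```python
from itertools import groupby

def rle_encode(text):
    """Collapse each run of identical characters into a (character, count) pair."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]

def rle_decode(pairs):
    """Expand (character, count) pairs back into the original string."""
    return "".join(ch * count for ch, count in pairs)

original = "WWWWBBBWW"
encoded = rle_encode(original)   # [("W", 4), ("B", 3), ("W", 2)]
decoded = rle_decode(encoded)    # identical to the original: nothing is lost
```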
Visit annotations in context

Tags

Deep Learning

Annotators

haiy

URL

en.wikipedia.org/wiki/Run-length_encoding
www.zhihu.com www.zhihu.com

What exactly is the role of the activation function in artificial neural networks? Why is ReLU better than the tanh and sigmoid functions? - Answer by a Zhihu user - Zhihu

1
1. haiy 10 Jul 2016
  
  in Public
  
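  Following reminders from @山丹丹 and @啸王 in the comments, I corrected some errors (shown in italics); thanks to both. I also added some material based on my recent understanding (shown in italics). If mistakes remain, corrections are welcome. Question 1: why introduce a nonlinear activation function? Without an activation function (which is equivalent to using f(x) = x), each layer's output is a linear function of the previous layer's input. It is easy to verify that, no matter how many layers the network has, the output is a linear combination of the input, which is equivalent to having no hidden layer at all; this is just the original perceptron. For that reason we introduce nonlinear functions as activation functions, so that deep neural networks become meaningful (the output is no longer a linear combination of the input and can approximate arbitrary functions). The earliest choices were the sigmoid or tanh function, whose outputs are bounded and therefore serve easily as input to the next layer (plus some people's biological explanations, blah blah). Question 2: why introduce ReLU? First, with functions such as sigmoid, computing the activation (an exponential) is expensive, and the derivative needed to backpropagate the error gradient involves division, which is also relatively expensive; with the ReLU activation the whole process costs far less. Second, in deep networks the sigmoid function easily causes vanishing gradients during backpropagation (near saturation the sigmoid changes too slowly and its derivative tends to 0, which loses information; see point 3 of @Haofeng Li's answer), so training of the deep network cannot be completed. Third, ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence of parameters, and mitigates overfitting (plus some people's biological explanations, blah blah). There are now also improvements on ReLU, such as PReLU and random ReLU, which bring some gains in training speed or accuracy on different datasets; see the relevant papers for details. One more remark: the mainstream practice now is to add a batch normalization step after ReLU, to keep the input to each layer as identically distributed as possible [1]. In the most recent paper [2], after adding bypass connections, the authors found that changing the position of batch normalization gives better results. Take a look if you are interested.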
  
  The advantages of ReLU
  
  Deep Learning
Visit annotations in context

Tags

Deep Learning

Annotators

haiy

URL

zhihu.com/question/29021768/answer/43017159
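
Two claims in the answer quoted above can be checked numerically: stacked layers without an activation collapse into a single linear map, and the sigmoid derivative shrinks gradients layer by layer while the ReLU derivative does not. The sketch below uses scalar "layers" and hypothetical function names; the ten-layer depth is an arbitrary assumption.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def linear_stack(x, weights):
    """Apply scalar layers with the identity activation f(x) = x."""
    for w in weights:
        x = w * x
    return x

# Claim 1: three linear layers equal one layer whose weight is their product.
weights = [2.0, -3.0, 0.5]
collapsed = math.prod(weights)
print(linear_stack(1.5, weights), collapsed * 1.5)

# Claim 2: the backpropagated gradient is a product of per-layer derivatives.
depth = 10
best_case_sigmoid = sigmoid_grad(0.0) ** depth  # upper bound on that product
active_relu = relu_grad(1.0) ** depth           # ReLU units that are switched on
print(best_case_sigmoid, active_relu)
```

Even in the best case the sigmoid product falls below one millionth after ten layers, whereas active ReLU units pass the gradient through unchanged. The example ignores the weights themselves, which also scale the gradient in a real network.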
arxiv.org arxiv.org

Unsupervised Learning of 3D Structure from Images

1
1. suhangpro 06 Jul 2016
  
  in Public
  
  Unsupervised Learning of 3D Structure from Images Authors: Danilo Jimenez Rezende, S. M. Ali Eslami, Shakir Mohamed, Peter Battaglia, Max Jaderberg, Nicolas Heess (Submitted on 3 Jul 2016) Abstract: A key goal of computer vision is to recover the underlying 3D structure from 2D observations of the world. In this paper we learn strong deep generative models of 3D structures, and recover these structures from 3D and 2D images via probabilistic inference. We demonstrate high-quality samples and report log-likelihoods on several datasets, including ShapeNet [2], and establish the first benchmarks in the literature. We also show how these models and their inference networks can be trained end-to-end from 2D images. This demonstrates for the first time the feasibility of learning to infer 3D representations of the world in a purely unsupervised manner.
  
  The 3D representation of a 2D image is ambiguous and multi-modal. We achieve such reasoning by learning a generative model of 3D structures, and recover this structure from 2D images via probabilistic inference.
  
  paper 3d deep learning arxiv vision graphics
Visit annotations in context

Tags

3d

paper

vision

arxiv

deep learning

graphics

Annotators

suhangpro

URL

arxiv.org/abs/1607.00662
arxiv.org arxiv.org

Learning without Forgetting

1
1. suhangpro 06 Jul 2016
  
  in Public
  
  When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning as standard practice for improved new task performance.
  
  Learning w/o Forgetting: distilled transfer learning
  
  deep learning paper arxiv vision
Visit annotations in context

Tags

arxiv

paper

deep learning

vision

Annotators

suhangpro

URL

arxiv.org/abs/1606.09282
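
The note "distilled transfer learning" above can be read as a two-term objective: an ordinary classification loss on the new task, plus a distillation term that keeps the network's old-task outputs close to the responses it gave before training on new data. The sketch below is only an illustration of that idea under stated assumptions; the function names, the temperature of 2.0, and the weight of 1.0 are hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional softening temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, logits, temperature=1.0):
    """Cross-entropy between a target distribution and softmax(logits)."""
    probs = softmax(logits, temperature)
    return -sum(t * math.log(p) for t, p in zip(target_probs, probs))

def lwf_style_loss(new_logits, new_label, old_logits_now, old_logits_recorded,
                   weight_old=1.0, temperature=2.0):
    """New-task loss plus a distillation penalty on the old-task head.

    old_logits_recorded: old-task outputs of the original network on this
    new-task example, stored before training begins.
    """
    one_hot = [1.0 if i == new_label else 0.0 for i in range(len(new_logits))]
    new_task = cross_entropy(one_hot, new_logits)
    soft_targets = softmax(old_logits_recorded, temperature)
    old_task = cross_entropy(soft_targets, old_logits_now, temperature)
    return new_task + weight_old * old_task

recorded = [2.0, 0.5, -1.0]
unchanged = lwf_style_loss([1.0, 3.0], 1, recorded, recorded)
drifted = lwf_style_loss([1.0, 3.0], 1, [-1.0, 0.5, 2.0], recorded)
print(unchanged < drifted)  # True: drifting from the recorded outputs costs more
```

Only new-task inputs appear in the loss, which matches the abstract's point that the original training data is assumed to be unavailable.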
Jun 2016
arxiv.org arxiv.org

Low-shot visual object recognition

1
1. suhangpro 13 Jun 2016
  
  in Public
  
  Low-shot visual object recognition
  
  vision paper arxiv deep learning to read
Visit annotations in context

Tags

to read

paper

vision

arxiv

deep learning

Annotators

suhangpro

URL

arxiv.org/abs/1606.02819
Apr 2016
techcrunch.com techcrunch.com

Your Algorithmic Self Meets Super-Intelligent AI

1
1. daveh70 21 Apr 2016
  
  in Public
  
  We should have control of the algorithms and data that guide our experiences online, and increasingly offline. Under our guidance, they can be powerful personal assistants.
  
  Big business has been very militant about protecting their "intellectual property". Yet they regard every detail of our personal lives as theirs to collect and sell at whim. What a bunch of little darlings they are.
  
  machine learning deep learning big data personal data internet web artificial intelligence
Visit annotations in context

Tags

web

internet

artificial intelligence

personal data

deep learning

big data

machine learning

Annotators

daveh70

URL

techcrunch.com/2015/12/14/your-algorithmic-self-meets-super-intelligent-ai/
Dec 2015
openai.com openai.com

OpenAI

1
1. daveh70 11 Dec 2015
  
  in Public
  
  OpenAI is a non-profit artificial intelligence research company. Our goal is to advance digital intelligence in the way that is most likely to benefit humanity as a whole, unconstrained by a need to generate financial return.
  
  https://twitter.com/open_ai
  
  They're hiring: https://openai.com/about/
  
  ai artificial intelligence machine learning deep learning
Visit annotations in context

Tags

artificial intelligence

ai

deep learning

machine learning

Annotators

daveh70

URL

openai.com/blog/introducing-openai/
