2,904 Matching Annotations
  1. Nov 2019
    1. The text documents a year-long research project into experiential learning in teacher professional development. Teachers participated in experiential learning themselves so that they could then implement it in their own classrooms to serve their students. By and large, teachers were receptive, had misconceptions addressed, and changed their practices with their colleagues and students to develop more engaging and active classrooms. Essentially, a shift from teacher-centered learning to student-centered learning was achieved in small increments by using experiential learning and reflection to facilitate teacher growth, thereby creating new pathways for student learning. Given the nature of the traditional methods predominantly used, this study seems to reflect some elements of transformative learning, in which teacher conventions and ideas were challenged and adjusted through heterogeneous groups and personal reflection. Rating: 9/10

    1. Problem-based learning (PBL) is a growing trend in adult learning, particularly in ESL/ELL classrooms. In this text, the basic principles and methods of PBL for ELL/ESL classes are covered for instructors to implement. Key aspects of PBL include relevance to students' lives and the opportunity to practice English in a heterogeneous group, with the end goal being application to another area of life. Multiple resources, including technology, are helpful for the implementation of PBL. The benefits of PBL are summarized, as are its drawbacks, with embedded suggestions for resolving possible difficulties. Rating: 8/10

    1. Author Jeff Cobb features guest Celisa to discuss trends in the field of lifelong learning. The speakers note twelve existing trends, such as MOOCs, micro-credentials, neuroscience, and self-directed learning. Both the private and public sectors are contributing to existing and emerging trends. Lifelong learning is transforming as providers explore free and paid services to extend learning to more populations.

    1. In this text, authors Kit Kacirek and Michael Miller explore adult learning for mature adults, or those identified as senior citizens. Research into mature adult learning programs centered on leisure activities reveals a situational pedagogy in which some traditional adult learning theory may need to be adapted to suit the cognitive changes that come with advanced age. A brief description of the research methods reveals that adults of advanced age prefer lecture, use of media, and field trips. The implications of such a study are useful as the population of mature adults grows due to advances in medicine and the demand for learning opportunities increases as well.

    1. Section 508 compliance is discussed to support instructors' knowledge of Section 508 and how to begin the process of ensuring instructional content is 508 compliant. Section 508 of the federal Rehabilitation Act governs access to media for all persons, whether or not they have a disability. Including captions, audio description, and accessible video players is vital to compliance. Compliance with 508 is necessary given the data illustrating the percentage of employees who need accommodations to support their learning. This brief article seems highly related to Universal Design for Learning. Rating: 10/10

    1. Author Douglas Lieberman provides insights into how to use text to improve learning. Suggestions regarding the type of text, volume of text, animations, and graphics are discussed to maximize their usefulness in conveying information to learners and/or facilitating discussion among learners. Rating: 6/10

    1. The Northwest Center for Public Health Practice's toolkit, titled "Effective Adult Learning: A Toolkit for Teaching Adults," is a highly comprehensive resource for instructional design for adult learning instructors. Sections include course or training design, objectives of adult learning, various tools to help in the process of course design, and brief overviews of adult learning methods and theory. The embedded section review charts make quick reference easier. Rating: 10/10

    2. To be effective in teaching adults, it’s important to know your audience and have a general understanding of how adults learn

      This literature is a resource to assist in adult teaching. The first section of the reading defines your audience (background, whether your selected audience needs more training, learning objectives). It then explains learning objectives in more detail and how to develop effective learning objectives (Specific, Measurable, Achievable, Relevant, and Time-bound); if needed, the ABCD model (Audience, Behavior, Condition, Degree) can be utilized. The second section covers developing training content, and the last covers delivering your training. The article is very good. Rating: 5/5

    1. conventional learning objectives can work against us.

      Cathy Moore discusses the love-hate relationship with learning objectives. Objectives can be a critical tool to guide instruction; however, we can miss the boat when it comes to meaningful, applicable, and relevant learning. In the text, Moore is critical of objectives that are merely used to ensure a learner knows content. It is preferable, and superior instruction, to ensure a learner can exercise the knowledge through observable actions in context. Rating: 9/10

    1. The Council for Adult and Experiential Learning (CAEL) provides opportunities for professional development for adult learning instructors and organizations that serve adult learners. CAEL has launched the first live stream of its conference to allow people to attend remotely. While the conference has since passed, this resource could be useful to calendar for the coming year. Included on the site are a blog, a newsletter sign-up, and resources for higher education, employers, and workforce development. Rating: 8/10

    1. The lesson plan template provided is a helpful tool for designing a basic lesson with adult learning concepts. Some of the lesson plan template is also part of pedagogy, but some key elements reflect adult learning theory. For example, the section on Practice and Application encourages activities that transfer skills to new situations, concluded by a reflection activity. Given that adult learners may have various goals for their learning, this segment addresses adult learning theory. The template could be used or adapted to begin designing around the technological tools used for instruction as well. The template does seem to reflect a model of synchronous, face-to-face learning, given that it suggests the instructor move around the room to monitor progress and assist learners. Rating: 6/10

    1. The use of on-line instructional delivery methods continues to grow as technological and societal changes have enabled and encouraged this growth.

      The article was written to help the reader understand how adult learners comprehend lessons and their learning styles. The learning method used in this article is the andragogical process model (an eight-element process). The article is an interesting view of how the andragogical process model can be used to explore how the adult mind uses online learning for self-education. Rating: 3/5

    1. The use of technology to support learning for K-12 students is gaining popularity, leading many to ask whether there might be similar solutions for low-skilled adults.

      This article focuses on how adult learning is hindered by technology and how to teach an adult learner, using five theories: 1) shared experience; 2) problem-solving scenarios; 3) reflection on experience; 4) owning their learning; 5) having an ah-ha moment. Adults all learn differently and all want the opportunity to have a new learning moment. Rating: 5/5

    1. Designed to be used in a workshop setting, the content provides an understanding of adult learning theory and its application to best practices in both face-to-face and e-learning environments. Participants are provided a list of web tools to facilitate learning.

      6/10: the format is a bit difficult to access out of context

    1. Drawing from constructivist principles, the authors address how emotions affect motivation and learning for adults. They then provide practical application for instructors to implement to create productive learning environments where adult learners feel safe to explore new knowledge and learn from their experiences.

      9/10: while most of the application is to learning in general, the strategies are still applicable to technology in the classroom

    1. Transformative learning theory and the methods that support it are discussed in this text. Andragogy is initially reviewed so that the reader becomes acclimated to the basic principles of adult learning. The transformative learning segments emphasize the methods and environments needed to achieve such deep and challenging learning. Due to the intensely personal nature of transformative learning, one must understand the readiness of the learner. The text notes that learners in transition are more apt to engage in transformative learning if given an opportunity to develop self-awareness and a willingness to sit with discomfort in open, non-hierarchical environments.

    1. In this text, instructional designers are given brief synopses of three adult learning theories, including andragogy, transformational learning, and experiential learning, in order to understand how adults best learn and apply learning. The text is structured as brief paragraphs with enumerated descriptors and/or bullet points for the reader's convenience. Suggestions for learning activities are also provided for the instructional designer to consider in their course design. In the segment on transformative learning, a link is included to give the instructional designer more specific methods to incorporate. At the end of the text, diagrams are provided to visualize the core aspects and flow of each learning theory's process. Rating: 7/10

    1. The Digital Promise article presents four major factors to consider when implementing technology for adult learning purposes: the flexibility and benefits of blended learning, using data to support the development of instruction, providing environments with diverse technology to support various learners, and allowing the instructor's role to change to meet learner needs. Issues related to each factor are shared and suggestions for resolving them are provided. Rating: 7/10 - a good resource for an introduction to the factors and issues in adult learning via technology.

    1. As online learning matures, it is important for both theorists and practitioners to understand how to apply new and emerging educational practices and technologies that foster a sense of community and optimize the online learning environment.

      The article describes the design theory elements (goals, values, methods) and how they can assist with defining new tools for online learning. Rating: 5/5

    1. An understanding of adult learning theories (ie, andragogy) in healthcare professional education programs is important for several reasons.

      The author of this article articulates the instrumental learning theories in the healthcare industry. The information provided serves as a quick way for students and healthcare providers to understand the learning theories. Rating: 4/5

    1. Key to this model is the assumption that online education has evolved as a subset of learning in general rather than a subset of distance learning

      This article helps the reader understand the major theories related to technology, using learning theories, theoretical frameworks, and models. Rating: 4/5

    1. Twitter offers two distinct benefits to engaging learners. First of all, it allows learners to respond to classroom discussions in a way that feels right for them, offering shy or introverted students a chance to participate in the class discussion without having to speak in a public forum. Secondly, it allows students to continue the conversation after class is completed, posting relevant links to course material, and reaching out to you (the educator) with additional thoughts or questions.

      The article explains how social media, student learning through digital experience, and Learning Management Systems can be beneficial to the learner/student. Article Rating: 3/5

    1. Some of our adult-ed students take their courses virtually, with students checking in with teachers via Skype or by email, but a majority spend at least some time in a classroom.

      This article describes how learning can be delivered using the internet and how one does not have to be in a classroom to learn.

  2. Oct 2019
    1. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output. In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}$$

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors: Is the weather good? Does your boyfriend or girlfriend want to accompany you? Is the festival near public transit? (You don't own a car.) We can represent these three factors by corresponding binary variables $x_1$, $x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
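A minimal Python sketch of this decision rule, using the weights ($w_1 = 6$, $w_2 = w_3 = 2$) and the threshold of $5$ from the example above; the function and variable names here are illustrative, not taken from the book's code:

```python
def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# Weights from the cheese-festival example: the weather matters most.
weights = [6, 2, 2]      # weather, partner, public transit
threshold = 5

# Good weather alone clears the threshold (6 > 5)...
print(perceptron([1, 0, 0], weights, threshold))   # -> 1
# ...while the other two factors together cannot make up for bad weather (4 <= 5).
print(perceptron([0, 1, 1], weights, threshold))   # -> 0
```

Lowering the threshold to 3 in this sketch reproduces the alternative model described above, where good transit plus a willing companion is also enough.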
Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions: In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

$$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2}$$

You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron: Then we see that input $00$ produces output $1$, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
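A short Python sketch that checks this claim by evaluating the bias form of the perceptron rule, Equation (2), over the whole NAND truth table; this is my own toy verification, not code from the book:

```python
def perceptron_with_bias(x, w, b):
    """Perceptron in bias form: output 1 if w.x + b > 0, else 0 (Equation 2)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Two inputs with weight -2 each and an overall bias of 3 implement NAND.
for x1 in (0, 1):
    for x2 in (0, 1):
        print((x1, x2), "->", perceptron_with_bias((x1, x2), (-2, -2), 3))
# Output: (0, 0) -> 1, (0, 1) -> 1, (1, 0) -> 1, (1, 1) -> 0, i.e. NAND.
```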
The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$. To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram.

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked.

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs. This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit.
To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!): If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons. Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function (incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons; it's useful to remember this terminology, since these terms are used by many people working with neural nets, but we'll stick with the sigmoid terminology), and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)}. \tag{4}$$

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.

What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted.

[Plot: the sigmoid function, a smooth S-shaped curve rising from 0 to 1 as z runs from -4 to 4.]

This shape is a smoothed out version of a step function:

[Plot: the step function, jumping from 0 to 1 at z = 0.]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative (actually, when $w \cdot x + b = 0$ the perceptron outputs $0$, while the step function outputs $1$, so strictly speaking we'd need to modify the step function at that one point; but you get the idea). By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

$$\Delta \text{output} \approx \sum_j \frac{\partial \, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \text{output}}{\partial b} \Delta b, \tag{5}$$

where the sum is over all the weights, $w_j$, and $\partial \, \text{output} / \partial w_j$ and $\partial \, \text{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
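Here is a small Python sketch that checks this numerically: nudging one weight of a sigmoid neuron by a small amount changes the output by a small amount, and the change agrees with the linear approximation in Equation (5). The particular weights, inputs, and bias are arbitrary values chosen for illustration:

```python
import math

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x, b):
    """Sigmoid neuron output sigma(w.x + b) for weights w, inputs x, bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

w, x, b = [0.4, -0.6], [0.9, 0.3], 0.1
base = output(w, x, b)

# Nudge the first weight slightly: the output changes smoothly and slightly.
dw = 0.001
actual_change = output([w[0] + dw, w[1]], x, b) - base
print(actual_change)

# Prediction from Equation (5): (d output / d w1) * dw, where
# d output / d w1 = sigmoid'(z) * x1 and sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)).
z = sum(wi * xi for wi, xi in zip(w, x)) + b
predicted_change = sigmoid(z) * (1 - sigmoid(z)) * x[0] * dw
print(predicted_change)   # nearly identical to the actual change
```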
If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly-used in work on neural nets, and is the activation function we'll use most often in this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or $1$. They can have as output any real number between $0$ and $1$, so values such as $0.173\ldots$ and $0.689\ldots$ are legitimate outputs. This can be useful, for example, if we want to use the output value to represent the average intensity of the pixels in an image input to a neural network. But sometimes it can be a nuisance. Suppose we want the output from the network to indicate either "the input image is a 9" or "the input image is not a 9". Obviously, it'd be easiest to do this if the output was a $0$ or a $1$, as in a perceptron. But in practice we can set up a convention to deal with this, for example, by deciding to interpret any output of at least $0.5$ as indicating a "9", and any output less than $0.5$ as indicating "not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause any confusion.

Exercises

Sigmoid neurons simulating perceptrons, part I: Suppose we take all the weights and biases in a network of perceptrons, and multiply them by a positive constant, $c > 0$. Show that the behaviour of the network doesn't change.

Sigmoid neurons simulating perceptrons, part II: Suppose we have the same setup as the last problem - a network of perceptrons. Suppose also that the overall input to the network of perceptrons has been chosen. We won't need the actual input value, we just need the input to have been fixed. Suppose the weights and biases are such that $w \cdot x + b \neq 0$ for the input $x$ to any particular perceptron in the network. Now replace all the perceptrons in the network by sigmoid neurons, and multiply the weights and biases by a positive constant $c > 0$.
Show that in the limit as $c \rightarrow \infty$ the behaviour of this network of sigmoid neurons is exactly the same as the network of perceptrons. How can this fail when $w \cdot x + b = 0$ for one of the perceptrons?

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying handwritten digits. In preparation for that, it helps to explain some terminology that lets us name different parts of a network. Suppose we have the network: As mentioned earlier, the leftmost layer in this network is called the input layer, and the neurons within the layer are called input neurons. The rightmost or output layer contains the output neurons, or, as in this case, a single output neuron. The middle layer is called a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term "hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it must have some deep philosophical or mathematical significance - but it really means nothing more than "not an input or an output". The network above has just a single hidden layer, but some networks have multiple hidden layers. For example, the following four-layer network has two hidden layers: Somewhat confusingly, and for historical reasons, such multiple layer networks are sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example, suppose we're trying to determine whether a handwritten image depicts a "9" or not. A natural way to design the network is to encode the intensities of the image pixels into the input neurons. If the image is a $64$ by $64$ greyscale image, then we'd have $4{,}096 = 64 \times 64$ input neurons, with the intensities scaled appropriately between $0$ and $1$. The output layer will contain just a single neuron, with output values of less than $0.5$ indicating "input image is not a 9", and values greater than $0.5$ indicating "input image is a 9".

While the design of the input and output layers of a neural network is often straightforward, there can be quite an art to the design of the hidden layers. In particular, it's not possible to sum up the design process for the hidden layers with a few simple rules of thumb. Instead, neural networks researchers have developed many design heuristics for the hidden layers, which help people get the behaviour they want out of their nets. For example, such heuristics can be used to help determine how to trade off the number of hidden layers against the time required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used as input to the next layer. Such networks are called feedforward neural networks. This means there are no loops in the network - information is always fed forward, never fed back. If we did have loops, we'd end up with situations where the input to the $\sigma$ function depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are possible. These models are called recurrent neural networks. The idea in these models is to have neurons which fire for some limited duration of time, before becoming quiescent. That firing can stimulate other neurons, which may fire a little while later, also for a limited duration. That causes still more neurons to fire, and so over time we get a cascade of neurons firing. Loops don't cause problems in such a model, since a neuron's output only affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent networks are still extremely interesting. They're much closer in spirit to how our brains work than feedforward networks. And it's possible that recurrent networks can solve important problems which can only be solved with great difficulty by feedforward networks. However, to limit our scope, in this book we're going to concentrate on the more widely-used feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of breaking an image containing many digits into a sequence of separate images, each containing a single digit. For example, we'd like to break the image into six separate images. We humans solve this segmentation problem with ease, but it's challenging for a computer program to correctly break up the image. Once the image has been segmented, the program then needs to classify each individual digit. So, for instance, we'd like our program to recognize that the first digit above is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual digits. We do this because it turns out that the segmentation problem is not so difficult to solve, once you have a good way of classifying individual digits. There are many approaches to solving the segmentation problem. One approach is to trial many different ways of segmenting the image, using the individual digit classifier to score each trial segmentation. A trial segmentation gets a high score if the individual digit classifier is confident of its classification in all segments, and a low score if the classifier is having a lot of trouble in one or more segments. The idea is that if the classifier is having trouble somewhere, then it's probably having trouble because the segmentation has been chosen incorrectly. This idea and other variations can be used to solve the segmentation problem quite well. So instead of worrying about segmentation we'll concentrate on developing a neural network which can solve the more interesting and difficult problem, namely, recognizing individual handwritten digits.

To recognize individual digits we will use a three-layer neural network: The input layer of the network contains neurons encoding the values of the input pixels. As discussed in the next section, our training data for the network will consist of many $28$ by $28$ pixel images of scanned handwritten digits, and so the input layer contains $784 = 28 \times 28$ neurons. For simplicity I've omitted most of the $784$ input neurons in the diagram above. The input pixels are greyscale, with a value of $0.0$ representing white, a value of $1.0$ representing black, and in between values representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this hidden layer by $n$, and we'll experiment with different values for $n$. The example shown illustrates a small hidden layer, containing just $n = 15$ neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an output $\approx 1$, then that will indicate that the network thinks the digit is a $0$. If the second neuron fires then that will indicate that the network thinks the digit is a $1$. And so on. A little more precisely, we number the output neurons from $0$ through $9$, and figure out which neuron has the highest activation value. If that neuron is, say, neuron number $6$, then our network will guess that the input digit was a $6$. And so on for the other output neurons.
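As a rough illustration of this architecture, here is a Python sketch of the forward pass through a 784-15-10 sigmoid network and of reading off the guess as the most active output neuron. This is not the book's own network code: the weights here are random and untrained, and the input is a stand-in for a real 28x28 image, so the "guess" is meaningless; the point is only the shape of the computation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Layer sizes from the text: 784 input pixels, 15 hidden neurons, 10 outputs.
rng = np.random.default_rng(0)
sizes = [784, 15, 10]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def feedforward(a):
    """Forward pass: apply sigmoid(w.a + b) layer by layer."""
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)
    return a

x = rng.random((784, 1))            # stand-in for a 28x28 greyscale image, values in [0, 1]
activations = feedforward(x)        # 10 output activations, one per digit
print("network's guess:", int(np.argmax(activations)))   # index of the most active output neuron
```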
You might wonder why we use $10$ output neurons. After all, the goal of the network is to tell us which digit ($0, 1, 2, \ldots, 9$) corresponds to the input image. A seemingly natural way of doing that is to use just $4$ output neurons, treating each neuron as taking on a binary value, depending on whether the neuron's output is closer to $0$ or to $1$. Four neurons are enough to encode the answer, since $2^4 = 16$ is more than the 10 possible values for the input digit. Why should our network use $10$ neurons instead? Isn't that inefficient? The ultimate justification is empirical: we can try out both network designs, and it turns out that, for this particular problem, the network with $10$ output neurons learns to recognize digits better than the network with $4$ output neurons. But that leaves us wondering why using $10$ output neurons works better. Is there some heuristic that would tell us in advance that we should use the $10$-output encoding instead of the $4$-output encoding?

To understand why we do this, it helps to think about what the neural network is doing from first principles. Consider first the case where we use $10$ output neurons. Let's concentrate on the first output neuron, the one that's trying to decide whether or not the digit is a $0$. It does this by weighing up evidence from the hidden layer of neurons. What are those hidden neurons doing? Well, just suppose for the sake of argument that the first neuron in the hidden layer detects whether or not an image like the following is present. It can do this by heavily weighting input pixels which overlap with the image, and only lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument that the second, third, and fourth neurons in the hidden layer detect whether or not the following images are present. As you may have guessed, these four images together make up the $0$ image that we saw in the line of digits shown earlier. So if all four of these hidden neurons are firing then we can conclude that the digit is a $0$. Of course, that's not the only sort of evidence we can use to conclude that the image was a $0$ - we could legitimately get a $0$ in many other ways (say, through translations of the above images, or slight distortions). But it seems safe to say that at least in this case we'd conclude that the input was a $0$.

Supposing the neural network functions in this way, we can give a plausible explanation for why it's better to have $10$ outputs from the network, rather than $4$. If we had $4$ outputs, then the first output neuron would be trying to decide what the most significant bit of the digit was. And there's no easy way to relate that most significant bit to simple shapes like those shown above. It's hard to imagine that there's any good historical reason the component shapes of the digit will be closely related to (say) the most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural network has to operate in the way I described, with the hidden neurons detecting simple component shapes. Maybe a clever learning algorithm will find some assignment of weights that lets us use only $4$ output neurons. But as a heuristic the way of thinking I've described works pretty well, and can save you a lot of time in designing good neural network architectures.

Exercise

There is a way of determining the bitwise representation of a digit by adding an extra layer to the three-layer network above. The extra layer converts the output from the previous layer into a binary representation, as illustrated in the figure below. Find a set of weights and biases for the new output layer. Assume that the first $3$ layers of neurons are such that the correct output in the third layer (i.e., the old output layer) has activation at least $0.99$, and incorrect outputs have activation less than $0.01$.

Learning with gradient descent

Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we'll need is a data set to learn from - a so-called training data set. We'll use the MNIST data set, which contains tens of thousands of scanned images of handwritten digits, together with their correct classifications. MNIST's name comes from the fact that it is a modified subset of two data sets collected by NIST, the United States' National Institute of Standards and Technology. Here's a few images from MNIST: As you can see, these digits are, in fact, the same as those shown at the beginning of this chapter as a challenge to recognize. Of course, when testing our network we'll ask it to recognize images which aren't in the training set!

The MNIST data comes in two parts. The first part contains 60,000 images to be used as training data. These images are scanned handwriting samples from 250 people, half of whom were US Census Bureau employees, and half of whom were high school students. The images are greyscale and 28 by 28 pixels in size. The second part of the MNIST data set is 10,000 images to be used as test data. Again, these are 28 by 28 greyscale images. We'll use the test data to evaluate how well our neural network has learned to recognize digits. To make this a good test of performance, the test data was taken from a different set of 250 people than the original training data (albeit still a group split between Census Bureau employees and high school students). This helps give us confidence that our system can recognize digits from people whose writing it didn't see during training.

We'll use the notation $x$ to denote a training input. It'll be convenient to regard each training input $x$ as a $28 \times 28 = 784$-dimensional vector. Each entry in the vector represents the grey value for a single pixel in the image. We'll denote the corresponding desired output by $y = y(x)$, where $y$ is a $10$-dimensional vector. For example, if a particular training image, $x$, depicts a $6$, then $y(x) = (0, 0, 0, 0, 0, 0, 1, 0, 0, 0)^T$ is the desired output from the network. Note that $T$ here is the transpose operation, turning a row vector into an ordinary (column) vector.

What we'd like is an algorithm which lets us find weights and biases so that the output from the network approximates $y(x)$ for all training inputs $x$. To quantify how well we're achieving this goal we define a cost function (sometimes referred to as a loss or objective function; we use the term cost function throughout this book, but you should note the other terminology, since it's often used in research papers and other discussions of neural networks):

$$C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2. \tag{6}$$

Here, $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is input, and the sum is over all training inputs, $x$. Of course, the output $a$ depends on $x$, $w$ and $b$, but to keep the notation simple I haven't explicitly indicated this dependence. The notation $\| v \|$ just denotes the usual length function for a vector $v$. We'll call $C$ the quadratic cost function; it's also sometimes known as the mean squared error or just MSE. Inspecting the form of the quadratic cost function, we see that $C(w,b)$ is non-negative, since every term in the sum is non-negative. Furthermore, the cost $C(w,b)$ becomes small, i.e., $C(w,b) \approx 0$, precisely when $y(x)$ is approximately equal to the output, $a$, for all training inputs, $x$. So our training algorithm has done a good job if it can find weights and biases so that $C(w,b) \approx 0$. By contrast, it's not doing so well when $C(w,b)$ is large - that would mean that $y(x)$ is not close to the output $a$ for a large number of inputs. So the aim of our training algorithm will be to minimize the cost $C(w,b)$ as a function of the weights and biases. In other words, we want to find a set of weights and biases which make the cost as small as possible. We'll do that using an algorithm known as gradient descent.

Why introduce the quadratic cost? After all, aren't we primarily interested in the number of images correctly classified by the network? Why not try to maximize that number directly, rather than minimizing a proxy measure like the quadratic cost? The problem with that is that the number of images correctly classified is not a smooth function of the weights and biases in the network. For the most part, making small changes to the weights and biases won't cause any change at all in the number of training images classified correctly. That makes it difficult to figure out how to change the weights and biases to get improved performance. If we instead use a smooth cost function like the quadratic cost it turns out to be easy to figure out how to make small changes in the weights and biases so as to get an improvement in the cost. That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.
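To make Equation (6) concrete, here is a minimal Python sketch of the quadratic cost evaluated over a tiny made-up batch; the one-hot targets and the placeholder network outputs are invented for illustration, not real training data:

```python
import numpy as np

def quadratic_cost(outputs, desired):
    """Quadratic (MSE) cost C = (1/2n) * sum_x ||y(x) - a||^2, as in Equation (6)."""
    n = len(outputs)
    return sum(np.linalg.norm(y - a) ** 2 for a, y in zip(outputs, desired)) / (2 * n)

# Toy example: two training inputs with 10-dimensional desired outputs.
desired = [np.eye(10)[6], np.eye(10)[3]]          # one-hot vectors y(x) for a "6" and a "3"
outputs = [np.full(10, 0.1), np.full(10, 0.1)]    # some (poor) network outputs a
print(quadratic_cost(outputs, desired))            # a largish cost; it shrinks as a -> y(x)
```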
That's why we focus first on minimizing the quadratic cost, and only after that will we examine the classification accuracy.Even given that we want to use a smooth cost function, you may still wonder why we choose the quadratic function used in Equation (6)C(w,b)≡12n∑x∥y(x)−a∥2C(w,b)≡12n∑x‖y(x)−a‖2\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}$('#margin_501822820305_reveal').click(function() {$('#margin_501822820305').toggle('slow', function() {});});. Isn't this a rather ad hoc choice? Perhaps if we chose a different cost function we'd get a totally different set of minimizing weights and biases? This is a valid concern, and later we'll revisit the cost function, and make some modifications. However, the quadratic cost function of Equation (6)C(w,b)≡12n∑x∥y(x)−a∥2C(w,b)≡12n∑x‖y(x)−a‖2\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2 \nonumber\end{eqnarray}$('#margin_555483302348_reveal').click(function() {$('#margin_555483302348').toggle('slow', function() {});}); works perfectly well for understanding the basics of learning in neural networks, so we'll stick with it for now.Recapping, our goal in training a neural network is to find weights and biases which minimize the quadratic cost function C(w,b)C(w,b)C(w, b). This is a well-posed problem, but it's got a lot of distracting structure as currently posed - the interpretation of www and bbb as weights and biases, the σσ\sigma function lurking in the background, the choice of network architecture, MNIST, and so on. It turns out that we can understand a tremendous amount by ignoring most of that structure, and just concentrating on the minimization aspect. So for now we're going to forget all about the specific form of the cost function, the connection to neural networks, and so on. Instead, we're going to imagine that we've simply been given a function of many variables and we want to minimize that function. We're going to develop a technique called gradient descent which can be used to solve such minimization problems. Then we'll come back to the specific function we want to minimize for neural networks.Okay, let's suppose we're trying to minimize some function, C(v)C(v)C(v). This could be any real-valued function of many variables, v=v1,v2,…v=v1,v2,…v = v_1, v_2, \ldots. Note that I've replaced the www and bbb notation by vvv to emphasize that this could be any function - we're not specifically thinking in the neural networks context any more. To minimize C(v)C(v)C(v) it helps to imagine CCC as a function of just two variables, which we'll call v1v1v_1 and v2v2v_2:What we'd like is to find where CCC achieves its global minimum. Now, of course, for the function plotted above, we can eyeball the graph and find the minimum. In that sense, I've perhaps shown slightly too simple a function! A general function, CCC, may be a complicated function of many variables, and it won't usually be possible to just eyeball the graph to find the minimum.One way of attacking the problem is to use calculus to try to find the minimum analytically. We could compute derivatives and then try using them to find places where CCC is an extremum. With some luck that might work when CCC is a function of just one or a few variables. But it'll turn into a nightmare when we have many more variables. And for neural networks we'll often want far more variables - the biggest neural networks have cost functions which depend on billions of weights and biases in an extremely complicated way. 
Using calculus to minimize that just won't work!

(After asserting that we'll gain insight by imagining $C$ as a function of just two variables, I've turned around twice in two paragraphs and said, "hey, but what if it's a function of many more than two variables?" Sorry about that. Please believe me when I say that it really does help to imagine $C$ as a function of two variables. It just happens that sometimes that picture breaks down, and the last two paragraphs were dealing with such breakdowns. Good thinking about mathematics often involves juggling multiple intuitive pictures, learning when it's appropriate to use each picture, and when it's not.)

Okay, so calculus doesn't work. Fortunately, there is a beautiful analogy which suggests an algorithm which works pretty well. We start by thinking of our function as a kind of a valley. If you squint just a little at the plot above, that shouldn't be too hard. And we imagine a ball rolling down the slope of the valley. Our everyday experience tells us that the ball will eventually roll to the bottom of the valley. Perhaps we can use this idea as a way to find a minimum for the function? We'd randomly choose a starting point for an (imaginary) ball, and then simulate the motion of the ball as it rolled down to the bottom of the valley. We could do this simulation simply by computing derivatives (and perhaps some second derivatives) of $C$ - those derivatives would tell us everything we need to know about the local "shape" of the valley, and therefore how our ball should roll.

Based on what I've just written, you might suppose that we'll be trying to write down Newton's equations of motion for the ball, considering the effects of friction and gravity, and so on. Actually, we're not going to take the ball-rolling analogy quite that seriously - we're devising an algorithm to minimize $C$, not developing an accurate simulation of the laws of physics! The ball's-eye view is meant to stimulate our imagination, not constrain our thinking. So rather than get into all the messy details of physics, let's simply ask ourselves: if we were declared God for a day, and could make up our own laws of physics, dictating to the ball how it should roll, what law or laws of motion could we pick that would make it so the ball always rolled to the bottom of the valley?

To make this question more precise, let's think about what happens when we move the ball a small amount $\Delta v_1$ in the $v_1$ direction, and a small amount $\Delta v_2$ in the $v_2$ direction. Calculus tells us that $C$ changes as follows:
\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{7}\end{eqnarray}
We're going to find a way of choosing $\Delta v_1$ and $\Delta v_2$ so as to make $\Delta C$ negative; i.e., we'll choose them so the ball is rolling down into the valley. To figure out how to make such a choice it helps to define $\Delta v$ to be the vector of changes in $v$, $\Delta v \equiv (\Delta v_1, \Delta v_2)^T$, where $T$ is again the transpose operation, turning row vectors into column vectors. We'll also define the gradient of $C$ to be the vector of partial derivatives, $\left(\frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2}\right)^T$.
We denote the gradient vector by $\nabla C$, i.e.:
\begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{8}\end{eqnarray}
In a moment we'll rewrite the change $\Delta C$ in terms of $\Delta v$ and the gradient, $\nabla C$. Before getting to that, though, I want to clarify something that sometimes gets people hung up on the gradient. When meeting the $\nabla C$ notation for the first time, people sometimes wonder how they should think about the $\nabla$ symbol. What, exactly, does $\nabla$ mean? In fact, it's perfectly fine to think of $\nabla C$ as a single mathematical object - the vector defined above - which happens to be written using two symbols. In this point of view, $\nabla$ is just a piece of notational flag-waving, telling you "hey, $\nabla C$ is a gradient vector". There are more advanced points of view where $\nabla$ can be viewed as an independent mathematical entity in its own right (for example, as a differential operator), but we won't need such points of view.

With these definitions, the expression (7) for $\Delta C$ can be rewritten as
\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{9}\end{eqnarray}
This equation helps explain why $\nabla C$ is called the gradient vector: $\nabla C$ relates changes in $v$ to changes in $C$, just as we'd expect something called a gradient to do. But what's really exciting about the equation is that it lets us see how to choose $\Delta v$ so as to make $\Delta C$ negative. In particular, suppose we choose
\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{10}\end{eqnarray}
where $\eta$ is a small, positive parameter (known as the learning rate). Then Equation (9) tells us that $\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2$. Because $\| \nabla C \|^2 \geq 0$, this guarantees that $\Delta C \leq 0$, i.e., $C$ will always decrease, never increase, if we change $v$ according to the prescription in (10). (Within, of course, the limits of the approximation in Equation (9).) This is exactly the property we wanted! And so we'll take Equation (10) to define the "law of motion" for the ball in our gradient descent algorithm.
That is, we'll use Equation (10) to compute a value for $\Delta v$, then move the ball's position $v$ by that amount:
\begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{11}\end{eqnarray}
Then we'll use this update rule again, to make another move. If we keep doing this, over and over, we'll keep decreasing $C$ until - we hope - we reach a global minimum.

Summing up, the way the gradient descent algorithm works is to repeatedly compute the gradient $\nabla C$, and then to move in the opposite direction, "falling down" the slope of the valley. We can visualize it like this: [Figure: a sequence of gradient descent steps moving down the $C(v_1, v_2)$ surface toward the valley floor.]

Notice that with this rule gradient descent doesn't reproduce real physical motion. In real life a ball has momentum, and that momentum may allow it to roll across the slope, or even (momentarily) roll uphill. It's only after the effects of friction set in that the ball is guaranteed to roll down into the valley. By contrast, our rule for choosing $\Delta v$ just says "go down, right now". That's still a pretty good rule for finding the minimum!

To make gradient descent work correctly, we need to choose the learning rate $\eta$ to be small enough that Equation (9) is a good approximation. If we don't, we might end up with $\Delta C > 0$, which obviously would not be good! At the same time, we don't want $\eta$ to be too small, since that will make the changes $\Delta v$ tiny, and thus the gradient descent algorithm will work very slowly. In practical implementations, $\eta$ is often varied so that Equation (9) remains a good approximation, but the algorithm isn't too slow. We'll see later how this works.

I've explained gradient descent when $C$ is a function of just two variables. But, in fact, everything works just as well even when $C$ is a function of many more variables. Suppose in particular that $C$ is a function of $m$ variables, $v_1,\ldots,v_m$. Then the change $\Delta C$ in $C$ produced by a small change $\Delta v = (\Delta v_1, \ldots, \Delta v_m)^T$ is
\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v, \tag{12}\end{eqnarray}
where the gradient $\nabla C$ is the vector
\begin{eqnarray} \nabla C \equiv \left(\frac{\partial C}{\partial v_1}, \ldots, \frac{\partial C}{\partial v_m}\right)^T. \tag{13}\end{eqnarray}
Just as for the two variable case, we can choose
\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{14}\end{eqnarray}
and we're guaranteed that our (approximate) expression (12) for $\Delta C$ will be negative.
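As a quick illustration of the update rule, here is a short, self-contained Python sketch (my own, not from the book) that runs gradient descent on the toy function $C(v) = v_1^2 + 3 v_2^2$, whose gradient can be written down by hand; the choice of function, starting point, and learning rate $\eta = 0.1$ are just assumptions for the demo:

import numpy as np

def C(v):
    """Toy cost: C(v) = v1^2 + 3*v2^2, minimized at v = (0, 0)."""
    return v[0] ** 2 + 3 * v[1] ** 2

def grad_C(v):
    """Gradient of the toy cost: (dC/dv1, dC/dv2) = (2*v1, 6*v2)."""
    return np.array([2 * v[0], 6 * v[1]])

eta = 0.1                      # learning rate
v = np.array([2.0, -1.5])      # arbitrary starting point
for step in range(30):
    v = v - eta * grad_C(v)    # v -> v' = v - eta * grad C   (Equation 11)

print(v, C(v))                 # v is now close to (0, 0), and C(v) is close to 0

Each pass through the loop decreases $C$, exactly as the argument around Equations (9) and (10) predicts, provided $\eta$ is kept small.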
This gives us a way of following the gradient to a minimum, even when $C$ is a function of many variables, by repeatedly applying the update rule
\begin{eqnarray} v \rightarrow v' = v-\eta \nabla C. \tag{15}\end{eqnarray}
You can think of this update rule as defining the gradient descent algorithm. It gives us a way of repeatedly changing the position $v$ in order to find a minimum of the function $C$. The rule doesn't always work - several things can go wrong and prevent gradient descent from finding the global minimum of $C$, a point we'll return to explore in later chapters. But, in practice gradient descent often works extremely well, and in neural networks we'll find that it's a powerful way of minimizing the cost function, and so helping the net learn.

Indeed, there's even a sense in which gradient descent is the optimal strategy for searching for a minimum. Let's suppose that we're trying to make a move $\Delta v$ in position so as to decrease $C$ as much as possible. This is equivalent to minimizing $\Delta C \approx \nabla C \cdot \Delta v$. We'll constrain the size of the move so that $\| \Delta v \| = \epsilon$ for some small fixed $\epsilon > 0$. In other words, we want a move that is a small step of a fixed size, and we're trying to find the movement direction which decreases $C$ as much as possible. It can be proved that the choice of $\Delta v$ which minimizes $\nabla C \cdot \Delta v$ is $\Delta v = - \eta \nabla C$, where $\eta = \epsilon / \|\nabla C\|$ is determined by the size constraint $\|\Delta v\| = \epsilon$. So gradient descent can be viewed as a way of taking small steps in the direction which does the most to immediately decrease $C$.

Exercises

1. Prove the assertion of the last paragraph. Hint: If you're not already familiar with the Cauchy-Schwarz inequality, you may find it helpful to familiarize yourself with it.
2. I explained gradient descent when $C$ is a function of two variables, and when it's a function of more than two variables. What happens when $C$ is a function of just one variable? Can you provide a geometric interpretation of what gradient descent is doing in the one-dimensional case?

People have investigated many variations of gradient descent, including variations that more closely mimic a real physical ball. These ball-mimicking variations have some advantages, but also have a major disadvantage: it turns out to be necessary to compute second partial derivatives of $C$, and this can be quite costly. To see why it's costly, suppose we want to compute all the second partial derivatives $\partial^2 C/ \partial v_j \partial v_k$. If there are a million such $v_j$ variables then we'd need to compute something like a trillion (i.e., a million squared) second partial derivatives*! That's going to be computationally costly. With that said, there are tricks for avoiding this kind of problem, and finding alternatives to gradient descent is an active area of investigation. But in this book we'll use gradient descent (and variations) as our main approach to learning in neural networks.

*Actually, more like half a trillion, since $\partial^2 C/ \partial v_j \partial v_k = \partial^2 C/ \partial v_k \partial v_j$. Still, you get the point.

How can we apply gradient descent to learn in a neural network?
The idea is to use gradient descent to find the weights $w_k$ and biases $b_l$ which minimize the cost in Equation (6). To see how this works, let's restate the gradient descent update rule, with the weights and biases replacing the variables $v_j$. In other words, our "position" now has components $w_k$ and $b_l$, and the gradient vector $\nabla C$ has corresponding components $\partial C / \partial w_k$ and $\partial C / \partial b_l$. Writing out the gradient descent update rule in terms of components, we have
\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{16}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{17}\end{eqnarray}
By repeatedly applying this update rule we can "roll down the hill", and hopefully find a minimum of the cost function. In other words, this is a rule which can be used to learn in a neural network.

There are a number of challenges in applying the gradient descent rule. We'll look into those in depth in later chapters. But for now I just want to mention one problem. To understand what the problem is, let's look back at the quadratic cost in Equation (6). Notice that this cost function has the form $C = \frac{1}{n} \sum_x C_x$, that is, it's an average over costs $C_x \equiv \frac{\|y(x)-a\|^2}{2}$ for individual training examples. In practice, to compute the gradient $\nabla C$ we need to compute the gradients $\nabla C_x$ separately for each training input, $x$, and then average them, $\nabla C = \frac{1}{n} \sum_x \nabla C_x$. Unfortunately, when the number of training inputs is very large this can take a long time, and learning thus occurs slowly.

An idea called stochastic gradient descent can be used to speed up learning. The idea is to estimate the gradient $\nabla C$ by computing $\nabla C_x$ for a small sample of randomly chosen training inputs. By averaging over this small sample it turns out that we can quickly get a good estimate of the true gradient $\nabla C$, and this helps speed up gradient descent, and thus learning.

To make these ideas more precise, stochastic gradient descent works by randomly picking out a small number $m$ of randomly chosen training inputs. We'll label those random training inputs $X_1, X_2, \ldots, X_m$, and refer to them as a mini-batch. Provided the sample size $m$ is large enough we expect that the average value of the $\nabla C_{X_j}$ will be roughly equal to the average over all $\nabla C_x$, that is,
\begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{18}\end{eqnarray}
where the second sum is over the entire set of training data.
Swapping sides we get
\begin{eqnarray} \nabla C \approx \frac{1}{m} \sum_{j=1}^m \nabla C_{X_{j}}, \tag{19}\end{eqnarray}
confirming that we can estimate the overall gradient by computing gradients just for the randomly chosen mini-batch.

To connect this explicitly to learning in neural networks, suppose $w_k$ and $b_l$ denote the weights and biases in our neural network. Then stochastic gradient descent works by picking out a randomly chosen mini-batch of training inputs, and training with those,
\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{20}\\ b_l & \rightarrow & b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}, \tag{21}\end{eqnarray}
where the sums are over all the training examples $X_j$ in the current mini-batch. Then we pick out another randomly chosen mini-batch and train with those. And so on, until we've exhausted the training inputs, which is said to complete an epoch of training. At that point we start over with a new training epoch.

Incidentally, it's worth noting that conventions vary about scaling of the cost function and of mini-batch updates to the weights and biases. In Equation (6) we scaled the overall cost function by a factor $\frac{1}{n}$. People sometimes omit the $\frac{1}{n}$, summing over the costs of individual training examples instead of averaging. This is particularly useful when the total number of training examples isn't known in advance. This can occur if more training data is being generated in real time, for instance. And, in a similar way, the mini-batch update rules (20) and (21) sometimes omit the $\frac{1}{m}$ term out the front of the sums. Conceptually this makes little difference, since it's equivalent to rescaling the learning rate $\eta$. But when doing detailed comparisons of different work it's worth watching out for.

We can think of stochastic gradient descent as being like political polling: it's much easier to sample a small mini-batch than it is to apply gradient descent to the full batch, just as carrying out a poll is easier than running a full election. For example, if we have a training set of size $n = 60{,}000$, as in MNIST, and choose a mini-batch size of (say) $m = 10$, this means we'll get a factor of $6{,}000$ speedup in estimating the gradient!
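The epoch/mini-batch loop can be summarized in a few lines of Python. This is only a schematic sketch of the idea, not the book's actual implementation: it assumes a hypothetical helper grad_C_single(w, b, x, y) that returns the per-example gradients, parameters stored as flat NumPy arrays, and training_data given as a list of (x, y) pairs.

import random
import numpy as np

def sgd_epoch(w, b, training_data, eta, m, grad_C_single):
    """One epoch of mini-batch stochastic gradient descent.

    w, b          : parameter vectors (NumPy arrays) for the weights and biases
    training_data : list of (x, y) training pairs
    eta           : learning rate
    m             : mini-batch size
    grad_C_single : hypothetical helper returning (grad_w, grad_b) for one (x, y)
    """
    random.shuffle(training_data)
    for k in range(0, len(training_data), m):
        batch = training_data[k:k + m]
        grads = [grad_C_single(w, b, x, y) for x, y in batch]
        grad_w = np.mean([g[0] for g in grads], axis=0)
        grad_b = np.mean([g[1] for g in grads], axis=0)
        # Equations (20) and (21): step against the averaged mini-batch gradient.
        w = w - eta * grad_w
        b = b - eta * grad_b
    return w, b

Repeating sgd_epoch over many epochs is all "training" means at this level of description; everything network-specific is hidden inside the per-example gradient computation.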
Of course, the estimate won't be perfect - there will be statistical fluctuations - but it doesn't need to be perfect: all we really care about is moving in a general direction that will help decrease $C$, and that means we don't need an exact computation of the gradient. In practice, stochastic gradient descent is a commonly used and powerful technique for learning in neural networks, and it's the basis for most of the learning techniques we'll develop in this book.

Exercise

1. An extreme version of gradient descent is to use a mini-batch size of just 1. That is, given a training input, $x$, we update our weights and biases according to the rules $w_k \rightarrow w_k' = w_k - \eta \partial C_x / \partial w_k$ and $b_l \rightarrow b_l' = b_l - \eta \partial C_x / \partial b_l$. Then we choose another training input, and update the weights and biases again. And so on, repeatedly. This procedure is known as online, on-line, or incremental learning. In online learning, a neural network learns from just one training input at a time (just as human beings do). Name one advantage and one disadvantage of online learning, compared to stochastic gradient descent with a mini-batch size of, say, $20$.

Let me conclude this section by discussing a point that sometimes bugs people new to gradient descent. In neural networks the cost $C$ is, of course, a function of many variables - all the weights and biases - and so in some sense defines a surface in a very high-dimensional space. Some people get hung up thinking: "Hey, I have to be able to visualize all these extra dimensions". And they may start to worry: "I can't think in four dimensions, let alone five (or five million)". Is there some special ability they're missing, some ability that "real" supermathematicians have? Of course, the answer is no. Even most professional mathematicians can't visualize four dimensions especially well, if at all. The trick they use, instead, is to develop other ways of representing what's going on. That's exactly what we did above: we used an algebraic (rather than visual) representation of $\Delta C$ to figure out how to move so as to decrease $C$. People who are good at thinking in high dimensions have a mental library containing many different techniques along these lines; our algebraic trick is just one example. Those techniques may not have the simplicity we're accustomed to when visualizing three dimensions, but once you build up a library of such techniques, you can get pretty good at thinking in high dimensions. I won't go into more detail here, but if you're interested then you may enjoy reading this discussion of some of the techniques professional mathematicians use to think in high dimensions. While some of the techniques discussed are quite complex, much of the best content is intuitive and accessible, and could be mastered by anyone.

Implementing our network to classify digits

Alright, let's write a program that learns how to recognize handwritten digits, using stochastic gradient descent and the MNIST training data. We'll do this with a short Python (2.7) program, just 74 lines of code! The first thing we need is to get the MNIST data. If you're a git user then you can obtain the data by cloning the code repository for this book,

git clone https://github.com/mnielsen/neural-networks-and-deep-learning.git

If you don't use git then you can download the data and code here. Incidentally, when I described the MNIST data earlier, I said it was

      @fuelpress

    1. The Gram matrix must be normalized by dividing each element by the total number of elements in the matrix.

      true, after downsampling your gradient will get smaller on later layers
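      As a minimal illustration of that normalization (my own sketch, not from the annotated page, and following the note's literal reading of "the matrix" as the Gram matrix itself), here is how a Gram matrix of convolutional feature maps might be computed and scaled in NumPy:

import numpy as np

def normalized_gram(features):
    """Gram matrix of a feature map, divided by its total number of elements.

    features: array of shape (channels, height, width), e.g. one layer's activations
    """
    c, h, w = features.shape
    f = features.reshape(c, h * w)     # flatten each channel into a row vector
    gram = f @ f.T                     # (c, c) matrix of channel-wise dot products
    return gram / gram.size            # divide each element by the element count

example = np.random.rand(8, 16, 16)    # stand-in feature maps
print(normalized_gram(example).shape)  # (8, 8)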

    1. Coming back to Rhizomatic Learning, I am therefore left mulling over how +dave cormier has successfully ‘managed the MOOC’. I must be honest that the word ‘manage’ may be slightly misleading, implying incorrectly a sense of power and control; I think that what the course has done instead is instigate learning throughout. In some respects this has now been coordinated by everyone, although Dave has ‘set’ the tasks and facilitated the communications and conversations. However, as was demonstrated by +Mariana Funes’ post, much was left to the community to continue the learning.

      On Dave Cormier and Rhizomatic Learning

    1. The vertical bar on the letter T represents the depth of related skills and expertise in a single field, whereas the horizontal bar is the ability to collaborate across disciplines with experts in other areas and to apply knowledge in areas of expertise other than one's own.

      T shaped knowledge

  3. Sep 2019
    1. One of the most useful features in Xcode is the Filename search. You can open it with the keyboard shortcut Shift + Command + O.

      Found what I was looking for to jump to file in Xcode

    1. Think-pair-share

      They're used to death, but for good reason. There are few things better than a good TPS for getting students warmed up for discussion. One can even allow the TPS to inform the entire lesson: if the TPS results in a class-generated set of questions or learning objectives, teach from that, or plan to teach from it in the next class session.

    1. Anki seems more common among software engineers
    2. Engineers are creatures of habit. Make reviewing your flashcard app your first work task (or the train, the toilet right before Candy Crush). Stop StackOverflowing "how do i amend my git commit" five times every month.

      Spaced repetition is a solution to googling the same thing 5 times a month

    3. Outside of medical students and language learning apps like Duolingo, spaced repetition isn't common. It's not as cool as cramming, but it works. Medical students use it to memorize those awful thousand page textbooks. Duolingo uses it because it's effective

      The most popular appliers of spaced repetition:

      1. Medical students
      2. Duolingo users
    4. But Why Option 3?

      Why spaced repetition is superior to cramming (reviewing just a week before the exam):

      1. Cramming rarely works once the material fades from short-term memory. How many cram sessions do you remember from high school?
      2. Evenly spaced reminders sort of work, but you'd have to review all your knowledge at every interval, which doesn't sound scalable, fun, or compatible with a social life.
      3. Our brains work best with exponentially spaced reminders.
    5. Spaced repetition is a remembering technique that reminds you of concepts at spaced intervals to maximize memory retention efficiently

      Spaced repetition

    1. in SM, learning and remembering are blended into one: you read (learn) and review (remember) at the same time. Incremental Reading is essentially “spaced repetition-ing” your reading

      Super Memo combines learning + remembering

    2. Learning = reading and understanding new things
       Remembering = memorizing what you learned

      Learning vs remembering

    1. At the moment, GPT-2 uses a binary search algorithm, which means that its output can be considered a ‘true’ set of rules. If OpenAI is right, it could eventually generate a Turing complete program, a self-improving machine that can learn (and then improve) itself from the data it encounters. And that would make OpenAI a threat to IBM’s own goals of machine learning and AI, as it could essentially make better than even humans the best possible model that the future machines can use to improve their systems. However, there’s a catch: not just any new AI will do, but a specific type; one that uses deep learning to learn the rules, algorithms, and data necessary to run the machine to any given level of AI.

      This is a machine generated response in 2019. We are clearly closer than most people realize to machines that can pass a text-based Turing Test.

    1. Supporting Personalised Learning Frequently mentioned throughout the interviews was the goal of allowing learners to explore their personal interests, culture and social context through assessment. Several participants sought to design assessment that allowed learners to tap into these aspects of their personal lives. Where learners could exercise choice and pursue projects of personal interest, a greater sense of ownership was observed. James commented that “they love the idea that they are in control of what they do”, when given more choice around assessment. Other participants suggested it was possible to have learners working on projects that could benefit their personal lives or professional trajectories as part of formal coursework. In her final assignment, Olivia provides the learners “absolute free reign in terms of what kind of a thing they produced.” Learners use their creative interests to develop resources for the course, as Olivia reflects “some opted for essays still, but other students created digital timelines, infographics, podcasts, comic books, videos.” Personalisation of assessment was suggested to allow learners to represent and situate themselves authentically and creatively through their work.

      Giving learners more autonomy in their learning is a great pedagogical principle, and in the context of the article's focus on learning design, I can see how this fits with "open": it requires the course design to be more "open", as in flexible enough to allow for this kind of learner autonomy. There is overlap here between authentic learning and open pedagogy.

    1. you don’t have to learn alone. In fact, it is the uniqueness of the people with which you learn and the discussions you have together that make what you learn unforgettable

      Teamwork also applies to learning

    2. being able to communicate what you’ve learned is one of the main skills that differentiates a good developer from a great one (IMHO).

      Know how to explain what you just learned

    3. When facing procrastination, think of process over product. I often procrastinate when I’m overwhelmed by the thought, “Ok, I have to get X done”. Instead, it can be beneficial to think, “Ok, I will spend an hour on X” — which isn’t overwhelming, doesn’t require a long breakdown of tasks, and gets me started (90% of the battle)

      Solution to procrastination

    4. Knowing when to apply a particular concept is as important as knowing how.

      Use cases are more important than we think

    5. Spread it out over many sessions and over many different modes of learning.

      Don't learn everything in a single session!

    6. test yourself as you’re encountering new material. Recall is a simple example of this mini-testing.

      Recall = mini-testing

    7. taking a couple minutes to summarize or recall material you are trying to learn

      It's worth taking the time to ponder

    8. Recently, I found this great application called Highly (you should use this!). They make it really simple to highlight any article that I’m reading on the web using a Chrome extension.

      This inspired me to research similar applications (such as Liner), and I finally ended up with hypothes.is

    9. Highlighting or underlining are also techniques that often lead to this illusion of learning. On the other hand, brief notes that summarize keys concepts are much more effective.

      Indeed… Therefore, let's leave a note here :)

    10. First, survey and priming — this involves scanning a book or the syllabus of a course, for example, to get a general idea of the bigger picture. Second, observe an example. Then, do it yourself. And, finally, do it again and again in different contexts.

      Chunk the knowledge

    11. take breaks, meditate, think about other things, and give yourself plenty of time in both modes.

      Give yourself some free time while learning

    1. many instructional designers and others adjacent to the field have responded swiftly with critiques that range from outright rejection of the term, to general skepticism about the concept, to distrust for its advocates and their support of learning analytics and outcomes-based learning.

      Why the rejection of the term? Is it too mechanical?

    1. Because documentation of student learning impacts may not reflect the core objectives of all CTLs — and because this investigation is resource-intensive

      Measuring impact on student learning outcomes is resource-intensive. This makes me think of the Tracer project.

    1. Deep Learning for Search - teaches you how to leverage neural networks, NLP, and deep learning techniques to improve search performance. (2019) Relevant Search: with applications for Solr and Elasticsearch - demystifies relevance work. Using Elasticsearch, it teaches you how to return engaging search results to your users, helping you understand and leverage the internals of Lucene-based search engines. (2016)
    2. Elasticsearch with Machine Learning (English translation) by Kunihiko Kido Recommender System with Mahout and Elasticsearch
    1. moderating discussion forums

      I don't know if I would consider this a routine task, considering the amount of facilitation a good discussion often requires. Perhaps moderating a forum devoted to routine support questions (such as "when is my term paper due?" or "how do I access the course syllabus?") could count as routine. But when you get into forums where learner discourse is key to the learning process, the moderating is not routine; in fact, "moderating" may not even be the right word to frame those discussions, as these types of forums often require a facilitator, not a moderator.

    1. Since all neurons in a single depth slice share the same parameters, the forward pass in each depth slice of the convolutional layer can be computed as a convolution of the neuron's weights with the input volume.[nb 2] Therefore, it is common to refer to the sets of weights as a filter (or a kernel), which is convolved with the input. The result of this convolution is an activation map, and the set of activation maps for each different filter are stacked together along the depth dimension to produce the output volume. Parameter sharing contributes to the translation invariance of the CNN architecture. Sometimes, the parameter sharing assumption may not make sense. This is especially the case when the input images to a CNN have some specific centered structure; for which we expect completely different features to be learned on different spatial locations. One practical example is when the inputs are faces that have been centered in the image: we might expect different eye-specific or hair-specific features to be learned in different parts of the image. In that case it is common to relax the parameter sharing scheme, and instead simply call the layer a "locally connected layer".

      Important terms you hear repeatedly. Great visuals and graphics at https://distill.pub/2018/building-blocks/
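      To make "the same filter convolved across the input" concrete, here is a small NumPy sketch (mine, not from the quoted article) that computes one activation map by sliding a single shared 3x3 kernel over a 2D input:

import numpy as np

def conv2d_single(image, kernel):
    """Valid 2D "convolution" of one channel with one shared kernel
    (no kernel flipping, i.e. cross-correlation, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    activation = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights are reused at every spatial position: parameter sharing.
            activation[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return activation

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)                 # one filter = one shared set of weights
print(conv2d_single(image, kernel).shape)     # (6, 6) activation map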

    1. Here's a playground where you can select different kernel matrices and see how they affect the original image, or build your own kernel. You can also upload your own image or use live video if your browser supports it. The sharpen kernel emphasizes differences in adjacent pixel values. This makes the image look more vivid. The blur kernel de-emphasizes differences in adjacent pixel values. The emboss kernel (similar to the sobel kernel and sometimes referred to interchangeably) gives the illusion of depth by emphasizing the differences of pixels in a given direction. In this case, in a direction along a line from the top left to the bottom right. The identity kernel leaves the image unchanged. How boring! The custom kernel is whatever you make it.

      I'm all about my custom kernels!
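      For reference, these are the standard 3x3 kernel matrices the playground describes (my own sketch; applying one just means running a 2D convolution, here via scipy.signal.convolve2d):

import numpy as np
from scipy.signal import convolve2d

identity = np.array([[0, 0, 0],
                     [0, 1, 0],
                     [0, 0, 0]])
sharpen  = np.array([[ 0, -1,  0],
                     [-1,  5, -1],
                     [ 0, -1,  0]])   # emphasizes differences between adjacent pixels
box_blur = np.ones((3, 3)) / 9.0      # averages each pixel with its neighbours
emboss   = np.array([[-2, -1, 0],
                     [-1,  1, 1],
                     [ 0,  1, 2]])    # emphasizes differences along the top-left to bottom-right diagonal

image = np.random.rand(64, 64)        # stand-in for a greyscale image
sharpened = convolve2d(image, sharpen, mode='same', boundary='symm')
print(sharpened.shape)                # (64, 64)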

    1. We developed a new metric, UAR, which compares the robustness of a model against an attack to adversarial training against that attack. Adversarial training is a strong defense that uses knowledge of an adversary by training on adversarially attacked images.[3] To compute UAR, we average the accuracy of the defense across multiple distortion sizes and normalize by the performance of an adversarially trained model; a precise definition is in our paper. A UAR score near 100 against an unforeseen adversarial attack implies performance comparable to a defense with prior knowledge of the attack, making this a challenging objective.

      @metric
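      Based only on the informal description in the excerpt (the paper gives the precise definition), a rough sketch of the computation might look like the following; the factor of 100 is my assumption, chosen so that "a UAR score near 100" corresponds to matching the adversarially trained baseline:

import numpy as np

def uar(defense_accuracies, adv_trained_accuracies):
    """Rough sketch of UAR as described informally in the excerpt above.

    defense_accuracies     : accuracy of the evaluated defense at several distortion sizes
    adv_trained_accuracies : accuracy of a model adversarially trained against the same
                             attack, at the same distortion sizes
    """
    return 100.0 * np.mean(defense_accuracies) / np.mean(adv_trained_accuracies)

print(uar([0.62, 0.48, 0.31], [0.70, 0.55, 0.40]))   # ~85, close to the adversarially trained baseline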

  4. Aug 2019
    1. And they have largely moved beyond the mental model of universal design (UD) in the physical environment, which is static, bounded, and predictable—instead designing interactions according to UDL, which sees interactions as dynamic, open, and emergent.

      Really interesting point here about the limit of the "curb cut" metaphor.

    1. The goal is to build a better web thanks to the contribution and collaboration of a diverse set of personalities and thinkers: visual learners; physical learners; social learners; and everyone in-between.
    2. Though the web has evolved, its methods for training future web developers have not. Sure, the resources are more attractive now. They’re better written and more easily findable. But they continue to be geared toward a very narrow kind of person. They’re generally online, self-guided, impersonal, and novice-unfriendly.
    1. HTM and SDR's - part of how the brain implements intelligence.

      "In this first introductory episode of HTM School, Matt Taylor, Numenta's Open Source Flag-Bearer, walks you through the high-level theory of Hierarchical Temporal Memory in less than 15 minutes."

    1. Using multiple copies of a neuron in different places is the neural network equivalent of using functions. Because there is less to learn, the model learns more quickly and learns a better model. This technique – the technical name for it is ‘weight tying’ – is essential to the phenomenal results we’ve recently seen from deep learning.

      This parameter sharing allows CNNs, for example, to need far fewer parameters/weights than fully connected NNs.

    2. The known connection between geometry, logic, topology, and functional programming suggests that the connections between representations and types may be of fundamental significance.

      Examples for each?

    3. Representations are Types With every layer, neural networks transform data, molding it into a form that makes their task easier to do. We call these transformed versions of data “representations.” Representations correspond to types.

      Interesting.

      Like a Queue Type represents a FIFO flow and a Stack a FILO flow, where the space we transformed is the operation space of the type (eg a Queue has a folded operation space compared to an Array)

      Just free styling here...

    4. In this view, the representations narrative in deep learning corresponds to type theory in functional programming. It sees deep learning as the junction of two fields we already know to be incredibly rich. What we find, seems so beautiful to me, feels so natural, that the mathematician in me could believe it to be something fundamental about reality.

      compositional deep learning

    5. Appendix: Functional Names of Common Layers
       Deep Learning Name: Functional Name
       Learned Vector: Constant
       Embedding Layer: List Indexing
       Encoding RNN: Fold
       Generating RNN: Unfold
       General RNN: Accumulating Map
       Bidirectional RNN: Zipped Left/Right Accumulating Maps
       Conv Layer: "Window Map"
       TreeNet: Catamorphism
       Inverse TreeNet: Anamorphism

      👌translation. I like to think about embeddings as List lookups

    1. As a log-bilinear regression model for unsupervised learning of word representations, it combines the features of two model families, namely the global matrix factorization and local context window methods

      What does "log-bilinear regression" mean exactly?

    1. The reward system in the brain can be triggered by the anticipation of all kinds of rewards, from points to praise.

      This is interesting. The reward system is triggered by the anticipation of the reward, not the actual reward itself.

    1. Retrieval practice boosts learning by pulling information out of students’ heads (by responding to a brief writing prompt, for example), rather than cramming information into their heads (by lecturing at students, for example). In the classroom, retrieval practice can take many forms, including a quick no-stakes quiz. When students are asked to retrieve new information, they don’t just show what they know, they solidify and expand it. Feedback boosts learning by revealing to students what they know and what they don’t know. At the same time, this increases students’ metacognition — their understanding about their own learning progress. Spaced practice boosts learning by spreading lessons and retrieval opportunities out over time so that new knowledge and skills are not crammed in all at once. By returning to content every so often, students’ knowledge has time to be consolidated and then refreshed. Interleaving — or practicing a mix of skills (such as doing addition, subtraction, multiplication, and division problems all in one sitting) — boosts learning by encouraging connections between and discrimination among closely related topics. Interleaving sometimes slows students’ initial learning of a concept, but it leads to greater retention and learning over time.

      How can I build this into my curriculum?

    1. Genus Species + Species Hybrids Example

      Great examples of remixes in the real world

    2. Lessig (2005) provides a range of examples of the kinds of digital remix practices that in his view constitute “the more interesting ways [to write]” for young people. These include remixing clips from movies to create “faux” trailers for hypothetical movies; setting remixed movie trailers to remixed music of choice that is synchronized to the visual action; recording a series of anime cartoons and then video-editing them in synchrony with a popular music track; mixing “found” images with original images in order to express a theme or idea (with or without text added); and mixing images, animations and texts to create cartoons or satirical posters (including political cartoons and animations), to name just a few types. We accept this conceptual extension of “writing” to include practices of producing, exchanging and negotiating digitally remixed texts, which may employ a single medium or may be multimedia remixes. (We also recognize as forms of remix various practices that do not necessarily involve digitally remixing sound, image and animation, such as paper-based forms of fanfiction writing and fan-producing manga art and comics, which continue to go on alongside their hugely subscribed digital variants.

      These are all very good examples. The great thing is that, as a language teacher, there are so many different types of media that the students can really home in on their interests.

    3. where someone creates a cultural product by mixing meaningful elements together (e.g., ideas from different people with ideas of one’s own), and then someone else comes along and remixes this cultural artefact with others to create yet another artefact.

      I think this could be fun to do with students in Spanish. I can introduce music, poems, art and have students remix them.

  5. Jul 2019
    1. Communities of practice are one of the ways in which experiential learning, social constructivism, and connectivism can be combined, illustrating the limitations of trying to rigidly classify learning theories. Practice tends to be more complex.
      • Constructivism has roots in the philosophical and psychological viewpoints of this century, especially Piaget, Bruner and Goodman. Learning occurs when the mind filters inputs from the world to produce its unique reality. The mind is believed to be the source of all meaning, yet direct experiences with the environment are considered critical. It crosses both categories by emphasizing the interaction between learner and the real world.

      • Social constructivism would emphasize critical experiences between the learner and other learners and mentors.

      • Connectivism is the integration of principles explored by chaos, network, complexity and self-organization theory. Much of the content that previously resided within the learner is now offloaded to the machine.

    1. Open learning, also known as open education

      requires an open, sharing, collaborative environment. Promotes pedagogical dialogue. OER have the potential to transcend "geographic, economic, or language barriers". Also, OER strengthen digital literacy.

    2. e-purpose.

      Creative Commons covers 4 areas of practice:
      • reuse: right to verbatim reuse content
      • revise: right to change/modify the content
      • remix: right to combine original or revised with new content
      • redistribute: right to make and share copies of content

      Great for expanding, exploring, sharing and remixing content in the educational world.

    3. free to use and access, and to re-purpose.

      open learning is influential in areas of design, practice, pedagogy, and theory in education. Open Education Resources at the K-12 level are fundamental to OL.

    4. Open learning

      defined as "set of practices, resources, and scholarship that are open to the public and that are accessible, free to use and access, and re-purpose"

    1. We will discuss classification in the context of support vector machines

      SVMs aren't used that much in practice anymore. It's more of an academic fling, because they're nice to work with mathematically. Empirically, Tree Ensembles or Neural Nets are almost always better.

    1. Find Native Speakers

      This is a great idea to engage students. I have thought about it before but I have not yet put it into practice. I did pen pal letters one year but snail mail was too slow. I am going to try to incorporate this idea even more.

    1. for caring adults, teachers, parents, learners and their peers to share interests and contribute to a common purpose. The potential of cross-generational learning and connection unfolds when centered on common goals.

      important to have a caring, experienced community to rely on and learn from

    2. Powered with possibilities made available by today’s social media, this peer culture can produce learning that’s engaging and powerful.

      this is what makes connected learning modern

    3. For more than a century, educators have strived to customize education to the learner. Connected Learning leverages the advances of the digital age to make that dream a reality — connecting academics to interests, learners to inspiring peers and mentors, and educational goals to the higher order skills the new economy rewards.

      good summary quote

    1. Implication means co-occurrence, not causality!
    2. Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction
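      A tiny illustration of what such a rule looks like in practice (my own sketch, not from the annotated slides): computing the support and confidence of a candidate rule {bread} -> {milk} over a handful of transactions.

def support_confidence(transactions, antecedent, consequent):
    """Support and confidence of the association rule antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
print(support_confidence(transactions, {"bread"}, {"milk"}))   # (0.6, 0.75)

      High confidence here only says that milk frequently co-occurs with bread, which is exactly the point of the other excerpt: co-occurrence, not causality.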
    1. Hanauer (2012) contends that “language learning within these settings is defined overwhelmingly in linguistic, structural, and cognitive terms. Thus the language learner at the center of this system becomes nothing more than an intellectual entity involved in an assessable cognitive process” (p. 105). In this assessable cognitive instruction, students are not afforded the opportunity to use English as a social semiotic tool for expressing their own personal feelings (emotions), opinions, and stories as lived experience as well as for enacting social practices.

    1. organizations and caring adults can form partnerships, broker connections across settings, and share on openly networked platforms and portfolios.

      This is where networking, both in person and online, could come into play.

    2. earners need to feel a sense of belonging and be able to make meaningful contributions to a community in order to experience connected learning. Groups that foster connected learning have shared

      I don't think real positive change or learning can occur unless a student feels safe, welcomed, and like they belong. See Maslow's hierarchy of needs.

    3. hrough collaborative production, friendly competition, civic action, and joint research, youth and adults make things, have fun, learn, and make a difference together.

      shared interests and collaboration are instrumental for connected learning; reminds me of the phrase "great minds think alike"

    4. They do this by being sponsors of what youth are genuinely interested in — recognizing diverse interests and providing mentorship, space, and other resources.

      sponsorship/adult support in connected learning = important to learning success and an important resource

    5. Learning is irresistible and life-changing when it connects personal interests to meaningful relationships and real-world opportunity.

      absolutely true. passion+learning+education= change in the world for good

    6. embraces the diverse backgrounds and interests of all young people.

      importance of diversity in connected learning will heighten cultural awareness

    1. Various sources told me that personalized learning, when aided by screens, is a bad fit for vulnerable students—those from low-income families, ethnic and racial minorities, kids with special needs, and English-language learners. In some areas of the country, including Providence, these groups account for almost the entire population of public schools. But the experience of personalized learning is, indeed, personal, and exceptions abound
    1. To understand what has happened, we only need to look at the history of writing and printing to note two very different consequences (a) the first, a vast change over the last 450 years in how the physical and social worlds are dealt with via the inventions of modern science and governance, and (b) that most people who read at all still mostly read fiction, self-help and religion books, and cookbooks, etc.* (all topics that would be familiar to any cave-person).
    1. In Reader, Come Home: The Reading Brain in a Digital World, Maryanne Wolf talks about how technology has led to more skimming rather than reading slowly and carefully. She talks about the benefits of “cognitive patience.” And she reminds us that reading quickly isn’t what makes someone a good reader.
    1. Compared with neural networks configured by a pure grid search, we find that random search over the same domain is able to find models that are as good or better within a small fraction of the computation time.
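      A minimal sketch of the comparison (an assumed setup, not the paper's actual experiments): random search draws hyperparameter settings independently from the same ranges a grid would cover, so with the same budget it explores many more distinct values per dimension.

import random

def sample_random():
    """One randomly drawn hyperparameter setting (hypothetical search space)."""
    return {
        "learning_rate": 10 ** random.uniform(-4, -1),   # log-uniform over [1e-4, 1e-1]
        "hidden_units": random.randint(32, 512),
    }

def grid():
    """A 4 x 4 grid over the same domain: only 4 distinct values per dimension."""
    return [{"learning_rate": lr, "hidden_units": h}
            for lr in (1e-4, 1e-3, 1e-2, 1e-1)
            for h in (32, 128, 256, 512)]

budget = 16
random_trials = [sample_random() for _ in range(budget)]   # 16 distinct values per dimension
grid_trials = grid()                                       # same budget, far coarser coverage
print(len(random_trials), len(grid_trials))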
  6. Jun 2019
    1. To interpret a model, we require the following insights:
       • Features in the model which are most important.
       • For any single prediction from a model, the effect of each feature in the data on that particular prediction.
       • Effect of each feature over a large number of possible predictions.

      Machine learning interpretability
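      One common way to get the first of those insights (global feature importance) is permutation importance; here is a sketch using scikit-learn's permutation_importance on a stock dataset and model of my choosing. Per-prediction effects, the second insight, would need something like SHAP instead.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in turn and measure how much the held-out score drops.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for idx in result.importances_mean.argsort()[::-1][:5]:
    print(idx, result.importances_mean[idx])   # indices of the five most important features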

    1. Balance exploration and exploitation: the choice of examples to label is seen as a dilemma between the exploration and the exploitation over the data space representation. This strategy manages this compromise by modelling the active learning problem as a contextual bandit problem. For example, Bouneffouf et al.[9] propose a sequential algorithm named Active Thompson Sampling (ATS), which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point label.
       • Expected model change: label those points that would most change the current model.
       • Expected error reduction: label those points that would most reduce the model's generalization error.
       • Exponentiated Gradient Exploration for Active Learning:[10] in this paper, the author proposes a sequential algorithm named exponentiated gradient (EG)-active that can improve any active learning algorithm by an optimal random exploration.
       • Membership Query Synthesis: this is where the learner generates its own instance from an underlying natural distribution. For example, if the dataset are pictures of humans and animals, the learner could send a clipped image of a leg to the teacher and query if this appendage belongs to an animal or human. This is particularly useful if your dataset is small.[11]
       • Pool-Based Sampling: in this scenario, instances are drawn from the entire data pool and assigned an informative score, a measurement of how well the learner "understands" the data. The system then selects the most informative instances and queries the teacher for the labels.
       • Stream-Based Selective Sampling: here, each unlabeled data point is examined one at a time with the machine evaluating the informativeness of each item against its query parameters. The learner decides for itself whether to assign a label or query the teacher for each datapoint.
       • Uncertainty sampling: label those points for which the current model is least certain as to what the correct output should be.
       • Query by committee: a variety of models are trained on the current labeled data, and vote on the output for unlabeled data; label those points for which the "committee" disagrees the most.
       • Querying from diverse subspaces or partitions:[12] when the underlying model is a forest of trees, the leaf nodes might represent (overlapping) partitions of the original feature space. This offers the possibility of selecting instances from non-overlapping or minimally overlapping partitions for labeling.
       • Variance reduction: label those points that would minimize output variance, which is one of the components of error.
       • Conformal Predictors: this method predicts that a new data point will have a label similar to old data points in some specified way, and the degree of the similarity within the old examples is used to estimate the confidence in the prediction.[13]
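      Of the strategies in the list above, uncertainty sampling is the easiest to show in a few lines. Here is a sketch of my own, assuming a scikit-learn-style probabilistic classifier and a pool of unlabeled points; the labeling step is hypothetical.

import numpy as np

def least_confident_queries(model, unlabeled_pool, n_queries=10):
    """Uncertainty sampling: pick the pool points the model is least sure about."""
    proba = model.predict_proba(unlabeled_pool)   # shape (n_samples, n_classes)
    confidence = proba.max(axis=1)                # confidence in the predicted class
    return np.argsort(confidence)[:n_queries]     # indices to send to the human oracle

# Typical loop: fit on the small labeled set, query the oracle, refit, repeat.
# query_idx = least_confident_queries(model, X_pool)
# new_labels = ask_oracle(X_pool[query_idx])      # hypothetical labeling step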
    1. Throughout the past two decades, he has been conducting research in the fields of psychology of learning and hybrid neural network (in particular, applying these models to research on human skill acquisition). Specifically, he has worked on the integrated effect of "top-down" and "bottom-up" learning in human skill acquisition,[1][2] in a variety of task domains, for example, navigation tasks,[3] reasoning tasks, and implicit learning tasks.[4] This inclusion of bottom-up learning processes has been revolutionary in cognitive psychology, because most previous models of learning had focused exclusively on top-down learning (whereas human learning clearly happens in both directions). This research has culminated with the development of an integrated cognitive architecture that can be used to provide a qualitative and quantitative explanation of empirical psychological learning data. The model, CLARION, is a hybrid neural network that can be used to simulate problem solving and social interactions as well. More importantly, CLARION was the first psychological model that proposed an explanation for the "bottom-up learning" mechanisms present in human skill acquisition: His numerous papers on the subject have brought attention to this neglected area in cognitive psychology.
    1. By comparison, Amazon’s Best Seller badges, which flag the most popular products based on sales and are updated hourly, are far more straightforward. For third-party sellers, “that’s a lot more powerful than this Choice badge, which is totally algorithmically calculated and sometimes it’s totally off,” says Bryant.

      "Amazon's Choice" is made by an algorithm.

      Essentially, "Amazon" is Skynet.

    1. This problem is called overfitting—it's like memorizing the answers instead of understanding how to solve a problem.

      Simple and clear explanation of overfitting

    1. Many writers have highlighted the power of the global digital tribe, particularly the way groups tend to solve problems more effectively than individual experts (Surowiecki, 2009). We read of how groups can self-organise and co-ordinate their actions in connected global environments (Shirky, 2008) and that there seems to be no limit what a tribe can do when it is given the appropriate tools (Godin, 2008). Mobile and personal technologies that are connected to global networks have afforded us with the priceless ability to collaborate and cooperate in new and inventive ways (Rheingold, 2002), and allow us to rapidly self organise into new collective forces (Tapscott and Williams, 2008). Connected technology not only gives us access to existing knowledge, it encourages and enables us to create new knowledge and share it widely to a global audience.

      I am enjoying this series, Steve. A book that has influenced my thinking on the topic has been Teaching Crowds by Jon Dron and Terry Anderson.

      One thing that I am left wondering is how the benefits and affordances change and develop over time? I was left thinking about this while reading Clive Thompson’s new book Coders compared with his last book Smarter Than You Think.

      Also posted on Read Write Collect

  7. May 2019
    1. disruptive of formal education and enabling of student-centered and interest-driven learning

      To what extent are these actually at odds?

    1. Boundaries between different learning and discourse spaces (e.g., public vs. private, formal education vs. workplace learning) are to be crossed if not totally dissolved

      This is probably a long-term goal of mine that I might as well own up to.

    1. Professors base these grades on a combination of factors and values, such as 10% participation, 20% homework, 30% final exam, and 40% group project. Digital adaptive learning tools can do this too, and then take the student’s score and match it with the next best skill in the subject’s overall scope and sequence.

      This is interesting. This could be useful in design.
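
      A toy sketch of the mechanism described in the excerpt: combine weighted component scores, then pick the next skill in a scope and sequence. The weights come from the excerpt; the scores, mastery threshold, and skill names are hypothetical.

      ```python
      # Toy sketch of weighted grading plus "next skill" selection.
      # Weights mirror the excerpt; scores, threshold, and skills are hypothetical.
      weights = {"participation": 0.10, "homework": 0.20, "final_exam": 0.30, "group_project": 0.40}
      scores = {"participation": 95, "homework": 82, "final_exam": 74, "group_project": 88}

      overall = sum(weights[k] * scores[k] for k in weights)  # weighted average on a 0-100 scale

      sequence = ["fractions", "ratios", "proportions", "linear_equations"]  # hypothetical scope and sequence
      current = 1                  # hypothetical: student is working on "ratios"
      mastery_threshold = 70       # hypothetical cut-off

      if overall >= mastery_threshold:
          next_skill = sequence[min(current + 1, len(sequence) - 1)]  # advance
      else:
          next_skill = sequence[current]                              # stay and remediate

      print(f"overall score: {overall:.1f} -> next skill: {next_skill}")
      ```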

    2. Adaptive learning does not fit easily into the status quo. Besides having to use a blended learning model, in which class-time is divvied up between traditional and electronic learning, teachers must be willing to let students progress at their own pace.

      Could this fit within a trades model?

    3. This is different to simply providing differentiated content for students. For instance, if a learner was not in class during a period when a particular skill was introduced, and years later was learning a new skill that built on that prior knowledge, that learner would struggle. Adaptive sequencing tools could help that student go back to find this gap and learn this content first, rather than following the same sequence as everyone else

      This could be very powerful in trades training.

    4. Practice Engine

      This is brilliant. Start simple and then ramp it up for practice.

    5. A fixed-form assessment is one in which the items are preselected, and every student is tested on the same set of questions (e.g. a final exam).

      fixed form assessment vs. adaptive assessment.

    6. Let’s break these down a little further

      Content, assessment, sequence: the three places adaptive learning occurs.

    7. How do we use testing – or assessment – not simply to rank students but as meaningful windows into why they struggle to learn? And the big one: Can changes in digital curriculum help close the aching achievement gap?

      OMG YES!!!

    8. we define digital adaptive learning tools as education technologies that can respond to a student’s interactions in real-time by automatically providing the student with individual support

      Definition of adaptive learning

    9. Knewton alone has raised nearly $160 million.

      interesting

    10. The tools, however, are not a panacea. For several reasons, it’s unlikely that a single tool will ever be able to take over a student’s education and direct them to every single thing they should do. Nor is it likely that we would want it to, as a critical part of education is building student agency – helping students own their learning, make decisions, become lifelong learners, and develop their metacognitive skills.

      YES!

    11. But a critical challenge correctly noted in this report, written by EdSurge and supported by Pearson, is to decipher just what it means for a learning technology to be adaptive.
    12. Adaptive learning is an enormously promising field. Educators worldwide are using adaptive tools to change their practice. The tools are growing and gaining acceptance in classrooms.
    1. A PLE can be entirely controlled or adapted by a student according to his or her formal and informal learning needs, however not all students possess the knowledge management and the self-regulatory skills to effectively use social media in order to customize a PLE to provide the learning experience they desire.

      Teaching students to become self-regulated learners

    1. policy change index - machine learning on corpus of text to identify and predict policy changes in China
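
      A generic, heavily hedged sketch of the kind of approach this note describes: classify text snippets as signalling policy change or not. The toy documents, labels, and TF-IDF-plus-logistic-regression pipeline are illustrative assumptions, not the actual Policy Change Index methodology.

      ```python
      # Generic sketch of classifying policy text; the toy documents, labels, and
      # model choice are assumptions, not the actual Policy Change Index method.
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.linear_model import LogisticRegression

      docs = [  # hypothetical editorial snippets
          "reform of state-owned enterprises will accelerate",
          "maintain stability and continue existing policy",
          "new market opening measures announced for trade",
          "no change to the current five-year plan priorities",
      ]
      labels = [1, 0, 1, 0]  # 1 = signals policy change, 0 = status quo (made up)

      vectorizer = TfidfVectorizer()
      X = vectorizer.fit_transform(docs)

      clf = LogisticRegression().fit(X, labels)
      print(clf.predict(vectorizer.transform(["officials signal major reform ahead"])))
      ```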

  8. Apr 2019
    1. Annotation Profile Follow learners as they bookmark content, highlight selected text, and tag digital resources. Analyze annotations to better assess learner engagement, comprehension and satisfaction with the materials assigned.

      There is already a Caliper profile for "annotation." Do we have any suggestions about the model?
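
      This is not the IMS Caliper AnnotationProfile schema; it is only a generic sketch of rolling annotation events (bookmark, highlight, tag) up into a per-learner engagement summary, and every field name here is invented for illustration.

      ```python
      # Generic sketch of aggregating annotation events into a per-learner summary.
      # Fields ("learner", "action", "resource") are invented, not Caliper terms.
      from collections import Counter, defaultdict

      events = [  # hypothetical event log
          {"learner": "a01", "action": "highlight", "resource": "chapter-1"},
          {"learner": "a01", "action": "tag", "resource": "chapter-1"},
          {"learner": "b02", "action": "bookmark", "resource": "chapter-2"},
          {"learner": "a01", "action": "highlight", "resource": "chapter-2"},
      ]

      per_learner = defaultdict(Counter)
      for e in events:
          per_learner[e["learner"]][e["action"]] += 1

      for learner, actions in per_learner.items():
          total = sum(actions.values())
          print(f"{learner}: {total} annotations ({dict(actions)})")
      ```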

    1. game for students, Calculation Nation from the National Council of Teachers of Mathematics is a wonderful resource.

      math game for summer

    2. Beat Summer Slide: The Parent Summer Checklist

      read this

    1. Dive into equations: When plunging into a pool, have your child calculate the volume and weight of the water and the rate at which the pool will fill or drain.
       Be a meteorologist: Track summer weather, convert daily temperatures from Celsius to Fahrenheit, and monitor monthly rainfall.
       Show the relevance: Invite your child to help you prepare poolside treats. Encourage them to use measuring cups and proportion snacks into different size bowls.
       Connect Math & Language: If your child excels at language, then use that subject as a platform to help them excel in math. Give them picture books and nonfiction texts to read that focus on math.
       Turn errands into learning opportunities: While at the grocery store, have your child figure out which box of crackers is closest to the $2.50 price point and count the kiwis as they put them in the bag.
       Add some education to your road trips: Distract your child from asking “Are we there yet?!” by creating paper tickets that identify all the rest stops along the way, so they can practice time and distance on the ride there.
       Make your beach day mathematical: Have your child arrange their seashells into piles of 3 or 5, and use those piles as the basis for multiplication and subtraction activities.
       Note numbers: Have your child pay close attention to numbers found on clocks, cereal boxes, the kitchen calendar and the local newspaper. Have them tell you how many articles are on page B4 of the paper and calculate how long they’ve been awake.
       Pay close attention to menus: Whether you go out for dinner or order in, there’s bound to be a menu involved. Have your child pinpoint the price of a specific item, list items that range between $10 and $15, or calculate how much a hamburger and a juice would cost.
       Change it up: Give your child a pile of coins – the bigger the assortment, the better! Have them find as many coin combinations as possible that equal the price of a beach ball.

      math ideas
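
      For the "Be a meteorologist" idea above, the Celsius-to-Fahrenheit conversion is just F = C * 9/5 + 32; the sample temperatures below are made up.

      ```python
      # Celsius-to-Fahrenheit conversion for the "Be a meteorologist" idea;
      # the sample daily highs are made up for illustration.
      daily_highs_c = [28.0, 31.5, 26.4]

      for c in daily_highs_c:
          f = c * 9 / 5 + 32
          print(f"{c:.1f} C = {f:.1f} F")
      ```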

    1. Even organized sports teach children about mathematics, rules, teamwork, planning, and so on. Likewise, a family game like Scrabble is about linguistics, psychology, mathematics, memory, competition, and doggedness. It’s about mastering the rules.

      even sports and family games help...

    1. Updated! 10 Online Summer Learning Opportunities

      list of ten fun activities...lots of tech stuff

    1. S.M.A.R.T. Cases are boxed kits that include science activities and supplementary materials that make it a complete learning package for young people. S.M.A.R.T. Cases are sponsored by the Torrance Refining Company.  WHY IS THIS CASE SO “S.M.A.R.T.”? S.M.A.R.T. Cases are science kits designed for hands-on learning. They come with the tools and resources to make learning fun and easy. The Next Generation Science Standards (NGSS) requires that students engage in practice-rich activities that support their use of the case contents to figure out and explain complex phenomena and make connections to principles that cut across the content areas (NRC, 2012; NGSS Lead States, 2013). With the assistance of faculty at Torrance Unified School District, each case was evaluated for grade level and compliance with the framework for the NGSS.

      check one out for presentation

    1. Topic: Reading Classroom Ideas. 10 Kids Summer Reading Programs We Love. Summer reading is better reading. Shellie Deringer, June 13, 2017. Next to the benefits of playing and swimming all summer long, reading is just about the most important thing kids can do this summer. We put together this list of free kids summer reading programs to help keep the learning going over the next few months. Share these kids summer reading programs with your students and their families! 1. Barnes & Noble Summer Reading for Kids: This program begins in May and runs through September. Kids can earn a FREE book after they read eight books and log them on the reading sheet. The Barnes and Noble kids summer reading program is only available to children in grades 1-6. Only one book is available for each child who completes a reading journal, and the choice must be made from the selected books available at the store.

      Reading programs

    1. READS for Summer Learning.[14] In READS, which has been iteratively modified over several randomized trials, students receive eight books in the mail over the summer that are matched to their reading level and interests. Along with each book, students receive a tri-fold paper that leads them through a pre-reading activity and a post-reading comprehension check. Students are asked to mail the postage-prepaid tri-fold back; families receive reminders when tri-folds are not returned.

      great idea to add to presentation

    2. An early comprehensive review of the literature summarized several findings regarding summer loss.[2] The authors concluded that: (1) on average, students’ achievement scores declined over summer vacation by one month’s worth of school-year learning, (2) declines were sharper for math than for reading, and (3) the extent of loss was larger at higher grade levels.

      research to use

    1. focus on collaboration, connection, diversity, democracy, and critical assessments of educational tools and structures

      Also critical assessments of authority structures, truth claims, value judgments...

    1. (PBL) is an instructional method in which students learn

      Problem based learning is an instructional method in which students learn through facilitated problem solving. Problem based curricula provide students with guided experience in learning through solving complex, real-world problems. Rating: 9/10

    1. Emotional learning involves meddling with deeply personal, private aspects of workers’ lives in an effort to influence and shape their emotions, sometimes with constructive and sometimes with destructive results. Two aspects of emotion have particular relevance in the workplace: emotional intelligence and emotion labor.
  9. learn-us-east-1-prod-fleet01-xythos.s3.us-east-1.amazonaws.com
    1. Articulate what they know; 2. reflect on what they have learned; 3. support the internal negotiation of meaning making; 4. construct personal representations of meaning; and 5. support intentional, mindful thinking

      what technology should do in an online course to reach adults

    2. Since online learning has a different setting from the conventional classroom, online educators need to use some special techniques and perceptions to lead to success. Moreover, adults have special needs and requirements as learners compared with children and adolescents, thus online educators should know how adults can learn best because of their special characteristics. Philosophical and methodological shifts also affect instruction. Many researchers have suggested that constructivism should be applied in distance education. Thus, this paper attempts to examine the impact of constructivism in online learning environments when focusing on adult learners. The author develops the connection between constructivism and adult learning theory. In addition, the paper proposes instructional guidelines using the constructivist approach in online learning for adults.
  10. learn-us-east-1-prod-fleet01-xythos.s3.us-east-1.amazonaws.com
    1. Workplace-related learning is learning that is related to the firm in which the learner is employed and that is supported at least to some extent by their employer, but that is not foundational or higher education. Individuals may engage in this type of learning for the purpose of learning a new job, improving their job performance, for professional development, as an employee benefit or because it is required by legislation.
    2. Key dimensions of adult learning activities

      form, provider, payer, purpose, duration, design, delivery, instructor quality, credential

    3. Five broad types of adult learning

      Adult learning types include foundational, higher education, workplace, personal, and social, with examples of the kinds of learning that fall in each category.

    1. The Use of Mobile Devices for Academic Purposes at the University of Washington: Current State and Future Prospects

      Professional development opportunities and incentives for faculty to integrate mobile devices as a teaching and learning tool.

    1. Can Tablet Computers Enhance Faculty Teaching?

      Studies faculty provided with tablet computers and peer mentoring workshops to help increase understanding and use of mobile devices in pedagogical approaches

    1. The ITL department at The Ohio State University at Mansfield has six primary themes: (a) developmentally appropriate practice, (b) integrated curriculum, (c) literature-based instruction, (d) classroom-based inquiry, (e) diversity and equity issues, and (f) technology integration. The goal for technology integration, like the other themes in the program, is to integrate the theme into each course of the program, when appropriate. For example, instructors find ways to integrate children’s literature into each of the methods courses, whether it is a mathematics, science, or social studies methods course. The goal is to integrate the common themes of the program throughout the methods courses and the other graduate courses leading up to student teaching.
    1. Author Tom Vander Ark, also author of Getting Smart: How Digital Learning is Changing the World, offers a reflection on what ends up being 10 trends and 10 suggestions on how to develop impact in relation to those trends. The article is straightforward in presenting the trends and also offers platform and educational examples to enhance the content.

      Rating: 8/10

    1. This article is a breakdown from the U.S. Department of Education of the types of learning environments that exist in the technology arena. It provides examples of schools fulfilling these different environments and offers a collection of additional resources.

      Rating: 9/10

    1. This article discusses adult learners who connected with industry professionals in a career exploration course that focused around technology and coding. The program is hoping to show other places that focus on adult learning a model that would work for adult learners to gain access to industry.

      Rating: 6/10. Interesting article, but not really a focus on how they effectively engaged the adult learners in the program or their approach to actually developing the course and curriculum.

    1. This article is a study of both in-person and online courses and the effect of internet usage on students' engagement in those courses. The article notes how saturated the learning environment has become and describes the authors' approach of using student self-reported data to measure engagement. The authors provide an extensive review of prior literature on both technology and student engagement topics. The data should be reviewed with caution, as the authors note that the survey questions have not been thoroughly vetted for validity and reliability.

      Rating: 6/10. The article had positive results, but the data questions being untested is a bit concerning. The article is also from 2009, and the landscape has changed much since then.

  11. Mar 2019
    1. Designing Technology for Adult Learners: Applying Adult Learning Theory

      Discusses how adult learning theory can be applied to digital learning for adults. It suggests making sure interactions are built on real-world and relevant situations, that learners can go at their own pace, that they are allowed to reflect on their learning, and that they interact with each other and with different points of view. Rating 10/10

    1. This article notes that research on technology-rich classrooms is lacking. The authors created a scale with which to evaluate classroom environments, tested it, and determined it was a good starting framework for how to improve classroom environments. This scale could be useful later in class when evaluating technologies. Rating 9/10 for helpful assessment techniques

    1. This paper addresses the question of how today's schools can prepare learners for the future in the age of technology. The response is a discussion of innovative learning environments that involve the use of technology. Technology has driven rapid change in the educational landscape, and this paper aims to highlight transformation and innovation in relation to technology for teaching and learning. 9/10 for helpful diagrams and tables.

    1. The eZoomBook Tool: A Blended and Eclectic Approach to Digital Pedagogy

      Discusses the use of the eZoomBook Tool, which allows learners to navigate back through subject matter they need to refresh as they learn new material. It supports peer-to-peer teaching and working, which is its most successful feature for adult learners. The eZB template is open-format and can be adapted to a variety of learning situations. Results from their initial experiments show high intrinsic motivation among adult learners once they got a handle on the platform.

    1. This paper discusses the idea that design is responsible for developing learning and teaching in technology-rich environments, arguing from the perspective of Cultural Historical Activity Theory. It uses this perspective to discuss ideas of design in connection with the digital age, drawing on German, Nordic, Russian, and Vygotskyan concepts that seek to define the relationship between learning and teaching in relation to design. Rating 9/10 for mixing design with digital learning

  12. eds.a.ebscohost.com.libproxy.nau.edu
    1. The purpose of this book is to help learners plan, develop, and deliver online training programs for adults in the workplace. It can be understood as a guide for training managers, instructional designers, course developers, and educators who are looking to transition from classroom material to self-paced instructional programs. The main purpose is to enable people who deliver training programs to design them for online delivery; most importantly, learners' needs are addressed during development. Rating 7/10: material is interesting and relevant but slightly outdated.

    1. Beyond the Click: Rethinking Assessment of an Adult Professional Development MOOC

      Examines the design and implementation of a MOOC about flipped teaching. It used digital surveys and the LMS system to gauge participant experiences and expectations. A unique aspect of this MOOC is that it asked participants to set what level of activity they expected to have in the program: active, passive, drop-in, observer. And it found that 60% of people engaged directly at that level. This is useful for designing online education experience and connecting participants with each other and in the classroom based upon their learning goals.

    1. Can an Evidence-Based Blended Learning Model Serve Healthcare Patients and Adult Education Students?

      Discusses the use of blended-learning incorporating technology especially for adult education programs that reduce education gaps and help the under-employed with career readiness. This also focuses in on adults with chronic disease and how online education might better support their needs. It uses constructivist leanings placing education in the context of activity and environment and recreating the correct environments online.

    1. The Career Curriculum Continuum

      Discusses the place of universities in lifelong learning, especially with the advancement of technology in education and the workforce. The career curriculum continuum includes free and self-paced options such as MOOCs, educational videos on YouTube, and wikis, but also suggests more structured learning placed in context. Universities can offer this as short courses that are cheaper and offer more options for pathways to a full degree program. It also suggests professional certificates for expanding the skills of those already working. Digital institutions will be the most widely used methods for consuming new knowledge and advancing skills. Rating 10/10

    1. Q&A: How to Develop ‘Program Architecture’

      Discusses the ways in which Kacey Thorne of WGU outlines plans for developing underlying competencies for online programs. Program architecture refers to the connection of skills and competencies for specific industries, linking back to a network of what students will learn in school through offered programs. This is necessary for creating relevant programs that teach translatable skills for the real world after college. Rating 10/10

    1. Using Web 2.0 to teach Web 2.0: A case study in aligning teaching, learning and assessment with professional practice

      Research article. Discusses the use of Web 2.0, including blogs, wikis, and social media, as a method of information sharing that is impacting education through teaching and learning management. The work suggests that learning outcomes, activities, and assessment have to be in alignment to create effective learning experiences, and uses a case study within an information management program in which students use various Web 2.0 tools and document their use.

    1. This article is for teachers and contains multiple resources about how to integrate technology into the classroom and the different types of technology integration. This article is full of examples and ideas teachers can use to facilitate technology in the classroom. Rating: 9/10 for use of examples and practical application.

    1. The use of digital technologies across the adult life span in distance education.

      Research article. This article explores how older and younger students approach studying through the use of technology and reveals that those in older age groups were more likely to use technology in deep and focused ways to study once they got the hang of it, while younger groups were more likely to remain at the surface level of a variety of technologies.

    1. Effect of a metacognitive scaffolding on self-efficacy, metacognition, and achievement in e-learning environments

      This article discusses the effect of a metacognitive scaffolding on self-efficacy, metacognition, and achievement in e-learning environments. It is a study of 67 higher education students: half of the group learned through e-learning with scaffolding while the other group did not have the scaffolding. Not surprisingly, the results show that scaffolding is essential to learning, and the scaffolded group performed better than the group without it.

      Rating 8/10

    2. Effect of a metacognitive scaffolding on self-efficacy, metacognition, and achievement in e-learning environments

      Research paper. This work highlights scaffolding, in which students work through their learning in stages with support from digital technology, making adjustments to their learning environment as needed as they progress through the material. Self-evaluations are a critical component, helping students reflect on the content learned. Scaffolding helps students determine not only what to do but how to do it, until they are ready to learn more fully on their own. Rating 6/10

    1. What Makes for Effective Adult Learning

      This article provides a short overview of strategies and techniques to make adult learning effective. It quotes adult learning researchers such as Knowles to provide information about meaningful learning experiences, and it offers ideas for activities that support effective adult learning.

    1. Engaging Adult Learners with Technology: Through hands-on experience and reviewing the literature, two instruction librarians explore and model best practice

      This article comes from the Twin Cities Campus Library and discusses how to engage adult learners with technology. First, it looks at Kolb's learning model of instructional design, which includes applying, awakening, practicing, and observing. It is also imperative to have hands-on learning when it comes to technology. Rating 7/10

    1. Vision: Preparing Learning Communities to succeed in College and Careers in a global society through technology. Vision and Goals

      This proposal is a draft technology plan for Arizona adult education. It outlines the plan's goals and how Arizona can address them moving forward, identifies future trends in technology, and acknowledges challenges that might come up down the line. It also reviews teaching standards and instruction, as well as operations for the future. Rating 6/10 for being a draft, but with good ideas!

    1. Online is clearly where the growth is, especially when it comes to enrolling adults.

      This article is based around the idea that online education increases access for learners but lags when it comes to completion. It provides data from across the United States from a study conducted over several years. Generally speaking, the article encourages blended learning rather than fully online learning to obtain better outcomes for adult learners. Rating 7/10 for use of graphs and evidence from data.

    1. Adult students have a higher incidence of disability and are less likely to seek accommodations than the general student population, so it is critical that institutions of higher education anticipate their needs, especially in online classes.

      This article provides statistics about the number of adult learners with disabilities who learn online and argues that these numbers need to be addressed. The author observes that adult learners are the least likely to ask for help and that it is the designer's job to assess their work to make it more accessible. The article provides recommendations on how to become more familiar with the technology and which guidelines people should be following. Rating: 10/10 for addressing accessibility among adult learners and providing recommendations.

    1. This article reviews three learning styles and gives examples of how they fit into the three learning domains. Additionally, it reviews assumptions about adult learning and what they might actually mean. Lastly, it reviews the instructional system design model and breaks down its components. Rating 7/10 for lack of discussion but helpful tables

    1. At The Economist, we take data visualisation seriously. Every week we publish around 40 charts across print, the website and our apps. With every single one, we try our best to visualise the numbers accurately and in a way that best supports the story. But sometimes we get it wrong. We can do better in future if we learn from our mistakes — and other people may be able to learn from them, too.

      This is, factually and literally speaking, laudable in the extreme.

      Anybody can make mistakes; the best one can do is to admit that one does, and publicly learn from them - if one is a magazine. This is beauteously done.