8 Matching Annotations
  1. Last 7 days
  2. Aug 2025
    1. Skinner believed that association—learning, through trial and error, to link an action with a punishment or reward—was the building block of every behavior, not just in pigeons but in all living organisms, including human beings. His “behaviorist” theories fell out of favor with psychologists and animal researchers in the 1960s but were taken up by computer scientists who eventually provided the foundation for many of the artificial-intelligence tools from leading firms like Google and OpenAI.

      Animal behavior studies as a foundation for reinforcement learning (toy sketch below).
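
      An illustrative sketch (my own, not from the article): an epsilon-greedy bandit that learns, by trial and error, which action yields reward — the association idea behaviorism described and that reinforcement learning formalizes. All names and numbers are hypothetical.

      ```python
      import random

      # Epsilon-greedy bandit: learn by trial and error which action pays off.
      def run_bandit(true_reward_probs, steps=1000, epsilon=0.1, seed=0):
          rng = random.Random(seed)
          n_actions = len(true_reward_probs)
          value_estimates = [0.0] * n_actions  # learned action -> reward associations
          counts = [0] * n_actions

          for _ in range(steps):
              # Explore occasionally; otherwise exploit the best-known action.
              if rng.random() < epsilon:
                  action = rng.randrange(n_actions)
              else:
                  action = max(range(n_actions), key=lambda a: value_estimates[a])

              # The environment delivers a reward (1) or nothing (0).
              reward = 1.0 if rng.random() < true_reward_probs[action] else 0.0

              # Incremental average: strengthen or weaken the association.
              counts[action] += 1
              value_estimates[action] += (reward - value_estimates[action]) / counts[action]

          return value_estimates

      if __name__ == "__main__":
          # Hypothetical "pecking keys" with different payoff rates.
          print(run_bandit([0.2, 0.5, 0.8]))
      ```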

  3. Jul 2025
    1. AI data centers could use up to 12% of all U.S. electricity by 2028. But how much power does it take to create one video and what really happens after you hit “enter” on that AI prompt? WSJ’s Joanna Stern visited “Data Center Valley” in Virginia to trace the journey and then grills up some steaks to show just how much energy it all takes.

  4. Jan 2025
    1. Distillation is a means of extracting understanding from another model; you can send inputs to the teacher model and record the outputs, and use that to train the student model. This is how you get models like GPT-4 Turbo from GPT-4. Distillation is easier for a company to do on its own models, because they have full access, but you can still do distillation in a somewhat more unwieldy way via API, or even, if you get creative, via chat clients.

      Distillation

      Using the outputs of a "teacher model" to train a "student model" (minimal sketch below).
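
      A minimal sketch of that loop, with stub functions standing in for the real teacher API and the student fine-tuning step (nothing here is a real vendor SDK): collect prompts, record the teacher's outputs, then fine-tune the student on the recorded pairs.

      ```python
      import json

      def query_teacher(prompt: str) -> str:
          """Stand-in for an API (or chat-client) call to the teacher model."""
          return f"[teacher answer to: {prompt}]"

      def build_distillation_set(prompts, path="distill_data.jsonl"):
          """Record (prompt, teacher output) pairs -- the student's training targets."""
          with open(path, "w", encoding="utf-8") as f:
              for p in prompts:
                  f.write(json.dumps({"prompt": p, "completion": query_teacher(p)}) + "\n")
          return path

      def finetune_student(dataset_path):
          """Stand-in for supervised fine-tuning of the smaller student model."""
          with open(dataset_path, encoding="utf-8") as f:
              pairs = [json.loads(line) for line in f]
          print(f"fine-tuning student on {len(pairs)} teacher-labeled examples")

      if __name__ == "__main__":
          prompts = ["Explain multi-head attention.", "Summarize mixture of experts."]
          finetune_student(build_distillation_set(prompts))
      ```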

    2. DeepSeekMLA was an even bigger breakthrough. One of the biggest limitations on inference is the sheer amount of memory required: you both need to load the model into memory and also load the entire context window. Context windows are particularly expensive in terms of memory, as every token requires both a key and corresponding value; DeepSeekMLA, or multi-head latent attention, makes it possible to compress the key-value store, dramatically decreasing memory usage during inference.

      Multi-head Latent Attention

      Compress the key-value store of tokens, which decreases memory usage during inference; see the sketch below.
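
      An illustrative NumPy sketch of the compression idea only (real multi-head latent attention is more involved, and all dimensions here are assumed): cache one small latent vector per token instead of full per-head keys and values, then reconstruct K and V from the cache at attention time.

      ```python
      import numpy as np

      d_model, n_heads, d_head, d_latent = 512, 8, 64, 64  # assumed sizes
      rng = np.random.default_rng(0)

      W_down = rng.standard_normal((d_model, d_latent)) * 0.02           # compress hidden state
      W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # expand latent to keys
      W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02  # expand latent to values

      def cache_tokens(hidden_states):
          """Store only a d_latent vector per token, not full per-head K and V."""
          return hidden_states @ W_down

      def keys_values_from_cache(latent_cache):
          """Rebuild per-head K and V from the compressed cache when attending."""
          seq = latent_cache.shape[0]
          k = (latent_cache @ W_up_k).reshape(seq, n_heads, d_head)
          v = (latent_cache @ W_up_v).reshape(seq, n_heads, d_head)
          return k, v

      # Cache 1000 tokens: 1000 * 64 floats instead of 1000 * 8 * 64 * 2 for full K+V.
      hidden = rng.standard_normal((1000, d_model))
      latents = cache_tokens(hidden)
      K, V = keys_values_from_cache(latents)
      print(latents.shape, K.shape, V.shape)  # (1000, 64) (1000, 8, 64) (1000, 8, 64)
      ```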

    3. The “MoE” in DeepSeekMoE refers to “mixture of experts”. Some models, like GPT-3.5, activate the entire model during both training and inference; it turns out, however, that not every part of the model is necessary for the topic at hand. MoE splits the model into multiple “experts” and only activates the ones that are necessary; GPT-4 was a MoE model that was believed to have 16 experts with approximately 110 billion parameters each. DeepSeekMoE, as implemented in V2, introduced important innovations on this concept, including differentiating between more finely-grained specialized experts, and shared experts with more generalized capabilities. Critically, DeepSeekMoE also introduced new approaches to load-balancing and routing during training; traditionally MoE increased communications overhead in training in exchange for efficient inference, but DeepSeek’s approach made training more efficient as well.

      Mixture-of-Experts

      Split an LLM into "expert" components with specialized knowledge, then activate only the experts needed to address a given prompt, as in the sketch below.
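
      A toy sketch of the routing idea (sizes and the top-k rule are illustrative; DeepSeekMoE's fine-grained and shared experts and its load balancing are not modeled): a router scores the experts for each token and only the top-k are run, so most parameters stay inactive.

      ```python
      import numpy as np

      d_model, n_experts, top_k = 16, 8, 2
      rng = np.random.default_rng(0)

      W_router = rng.standard_normal((d_model, n_experts)) * 0.1
      experts = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(n_experts)]

      def softmax(x):
          e = np.exp(x - x.max())
          return e / e.sum()

      def moe_layer(token):
          scores = softmax(token @ W_router)    # router: which experts suit this token
          chosen = np.argsort(scores)[-top_k:]  # activate only the top-k experts
          out = np.zeros_like(token)
          for idx in chosen:
              out += scores[idx] * (token @ experts[idx])  # weighted sum of active experts
          return out

      token = rng.standard_normal(d_model)
      print(moe_layer(token).shape)  # (16,) -- only 2 of 8 experts were computed
      ```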

  5. May 2024
  6. Jul 2023
    1. AI-generated content may also feed future generative models, creating a self-referential aesthetic flywheel that could perpetuate AI-driven cultural norms. This flywheel may in turn reinforce generative AI’s aesthetics, as well as the biases these models exhibit.

      AI bias becomes self-reinforcing

      Does this point to a need for more diversity in AI companies? Different aesthetic/training choices lead to opportunities for more diverse output. To say nothing of identifying and segregating AI-generated output from being used in the training data of subsequent models.