13 Matching Annotations
  1. Jul 2025
    1. Navigating Failures in Pods With Devices

      Summary: Navigating Failures in Pods With Devices

      This article examines the unique challenges Kubernetes faces in managing specialized hardware (e.g., GPUs, accelerators) within AI/ML workloads, and explores current pain points, DIY solutions, and the future roadmap for more robust device failure handling.

      Why AI/ML Workloads Are Different

      • Heavy Dependence on Specialized Hardware: AI/ML jobs require devices like GPUs, with hardware failures causing significant disruptions.
      • Complex Scheduling: Tasks may consume entire machines or need coordinated scheduling across nodes due to device interconnects.
      • High Running Costs: Specialized nodes are expensive; idle time is wasteful.
      • Non-Traditional Failure Models: Standard Kubernetes assumptions (like treating nodes as fungible, or pods as easily replaceable) don’t apply well; failures can trigger large-scale restarts or job aborts.

      Major Failure Modes in Kubernetes With Devices

      1. Kubernetes Infrastructure Failures

        • Multiple actors (device plugin, kubelet, scheduler) must work together; failures can occur at any stage.
        • Issues include pods failing admission, poor scheduling, or pods unable to run despite healthy hardware.
        • Best Practices: Early restarts, close monitoring, canary deployments, use of verified device plugins and drivers.
      2. Device Failures

        • Kubernetes has limited built-in ability to handle device failures—unhealthy devices simply reduce the allocatable count.
        • Lacks correlation between device failure and pod/container failure.
        • DIY Solutions:
          • Node Health Controllers: Restart nodes if device capacity drops, but these can be slow and blunt.
          • Pod Failure Policies: Pods exit with special codes to signal device errors, but support is limited and mostly geared toward batch jobs (a minimal Job sketch follows the Workarounds section below).
          • Custom Pod Watchers: Scripts or controllers watch pod and device status and forcibly delete pods attached to failed devices, prompting rescheduling (a watcher sketch follows this list of failure modes).
      3. Container Code Failures

        • Kubernetes can only restart containers or reschedule pods, with limited expressiveness about what counts as failure.
        • For large AI/ML jobs: Orchestration wrappers restart failed main executables, aiming to avoid expensive full job restart cycles.
      4. Device Degradation

        • Not all device issues result in outright failure; degraded performance now occurs more frequently (e.g., one slow GPU dragging down training).
        • Detection and remediation are largely DIY; Kubernetes does not yet natively express "degraded" status.
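
      As a minimal sketch of the "custom pod watcher" approach described above (not the article's implementation): the controller below watches node status with the official kubernetes Python client, treats a drop of allocatable below capacity for the assumed nvidia.com/gpu resource as a device failure, and deletes GPU pods on that node so their controllers reschedule them. The threshold heuristic, resource name, and all-namespaces eviction are illustrative assumptions.

        from kubernetes import client, config, watch

        config.load_kube_config()        # or config.load_incluster_config() when run in-cluster
        v1 = client.CoreV1Api()
        GPU_RESOURCE = "nvidia.com/gpu"  # extended-resource name used by the NVIDIA device plugin

        def evict_gpu_pods(node_name: str) -> None:
            """Delete GPU-requesting pods on the node so their controllers reschedule them."""
            pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
            for pod in pods.items:
                requests_gpu = any(
                    c.resources and GPU_RESOURCE in (c.resources.requests or {})
                    for c in pod.spec.containers
                )
                if requests_gpu:
                    v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

        # Heuristic: when fewer GPUs are allocatable than the node reports as capacity,
        # assume at least one device has gone unhealthy and evict the GPU pods on it.
        w = watch.Watch()
        for event in w.stream(v1.list_node):
            node = event["object"]
            capacity = int(node.status.capacity.get(GPU_RESOURCE, "0"))
            allocatable = int(node.status.allocatable.get(GPU_RESOURCE, "0"))
            if capacity and allocatable < capacity:
                evict_gpu_pods(node.metadata.name)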

      Current Workarounds & Limitations

      • Most device-failure strategies are manual or require high privileges.
      • Workarounds are often fragile, costly, or disruptive.
      • Kubernetes lacks standardized abstractions for device health and device importance at pod or cluster level.
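
      For illustration, a minimal sketch of the pod failure policy workaround mentioned under DIY Solutions above, built with the official kubernetes Python client: the training container exits with an agreed-upon code when it detects a device error, and the Job's policy ignores those failures so they do not burn the backoff limit while the pod is recreated. The exit code 42, names, image, and GPU count are placeholders, and podFailurePolicy requires a reasonably recent cluster and client.

        from kubernetes import client, config

        config.load_kube_config()

        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="training-job"),
            spec=client.V1JobSpec(
                backoff_limit=6,
                pod_failure_policy=client.V1PodFailurePolicy(rules=[
                    # Exit code 42 = "device error" by convention inside the container image:
                    # do not count it against backoffLimit, just recreate the pod.
                    client.V1PodFailurePolicyRule(
                        action="Ignore",
                        on_exit_codes=client.V1PodFailurePolicyOnExitCodesRequirement(
                            operator="In", values=[42])),
                ]),
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",  # required when podFailurePolicy is set
                        containers=[client.V1Container(
                            name="trainer",
                            image="example.com/trainer:latest",
                            resources=client.V1ResourceRequirements(
                                limits={"nvidia.com/gpu": "1"}))])),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)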

      Roadmap: What’s Next for Kubernetes

      SIG Node and the wider Kubernetes community are focusing on:

      • Improving core reliability: Ensuring kubelet, device manager, and plugins handle failures gracefully.
      • Making Failure Signals Visible: Initiatives like KEP 4680 aim to expose device health in the pod status.
      • Integration With Pod Failure Policies: Plans to recognize device failures as first-class events for triggering recovery.
      • Pod Descheduling: Enabling pods to be rescheduled off failed/unhealthy devices, even with restartPolicy: Always.
      • Better Handling for Large-Scale AI/ML Workloads: More granular recovery, fast in-place restarts, state snapshotting.
      • Device Degradation Signals: Early discussions on tracking performance degradation, but no mature standard yet.

      Key Takeaway

      Kubernetes remains the platform of choice for AI/ML, but device- and hardware-aware failure handling is still evolving. Most robust solutions are still "DIY," but community and upstream investment is underway to standardize and automate recovery and resilience for workloads depending on specialized hardware.

  2. Jun 2025
    1. 1000x Increase in AI Demand
      • NVIDIA’s latest earnings highlight a dramatic surge in AI demand, driven by a shift from simple one-shot inference to more complex, compute-intensive reasoning tasks.
      • Reasoning models require hundreds to thousands of times more computational resources and tokens per task, significantly increasing GPU usage, especially for AI coding agents and advanced applications.
      • Major hyperscalers like Microsoft, Google, and OpenAI are experiencing exponential growth in token generation, with Microsoft alone processing over 100 trillion tokens in Q1—a fivefold year-over-year increase.
      • Hyperscalers are deploying nearly 1,000 NVL72 racks (72,000 Blackwell GPUs) per week, and NVIDIA-powered “AI factories” have doubled year-over-year to nearly 100, with the average GPU count per factory also doubling.
      • To meet this unprecedented demand, more than $300 billion in capital expenditure is being invested this year in data centers (rebranded by NVIDIA as “AI factories”), signaling a new industrial revolution in AI infrastructure.
  3. Jan 2025
  4. Aug 2024
    1. We are using set theory: a certain piece of reference text is either part of my collection or it's not. If it's part of my collection, somewhere in my fingerprint there is a corresponding dot for it. So there is a very clear, direct link from the root data to the actual representation, and to the position that dot has versus all the other dots. The topology of that space (the geometry, if you want, of the patterns you get) contains the knowledge of the world whose language I'm using. And that is super easy to compute for a computer; I don't even need a GPU.

      For comparison: Cortical.io / semantic folding vs. standard AI (no GPU required).
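
      A toy sketch of the idea being described, not Cortical.io's actual semantic folding algorithm: each text becomes a sparse set of "dot" positions, and similarity is plain set overlap, which is cheap integer work on a CPU. The hash-based placement below is an illustrative stand-in for the learned topology that gives real fingerprints their geometry.

        from hashlib import blake2b

        FINGERPRINT_BITS = 16_384  # flattened 128 x 128 grid; size chosen for illustration

        def fingerprint(text: str) -> set[int]:
            """Map a text to a sparse set of dot positions, one per word (toy stand-in:
            real semantic folding places related words at nearby grid positions)."""
            dots = set()
            for word in text.lower().split():
                digest = blake2b(word.encode(), digest_size=4).digest()
                dots.add(int.from_bytes(digest, "big") % FINGERPRINT_BITS)
            return dots

        def overlap(a: set[int], b: set[int]) -> float:
            """Similarity as the fraction of shared dots (Jaccard-style set overlap)."""
            return len(a & b) / max(1, len(a | b))

        doc = fingerprint("reference text that is part of my collection")
        query = fingerprint("is this text part of the collection")
        print(overlap(doc, query))  # plain set arithmetic: no GPU required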

  5. Apr 2023
  6. Jan 2023
    1. Other hardware options do exist, including Google Tensor Processing Units (TPUs); AMD Instinct GPUs; AWS Inferentia and Trainium chips; and AI accelerators from startups like Cerebras, Sambanova, and Graphcore. Intel, late to the game, is also entering the market with their high-end Habana chips and Ponte Vecchio GPUs. But so far, few of these new chips have taken significant market share. The two exceptions to watch are Google, whose TPUs have gained traction in the Stable Diffusion community and in some large GCP deals, and TSMC, who is believed to manufacture all of the chips listed here, including Nvidia GPUs (Intel uses a mix of its own fabs and TSMC to make its chips).

      Look at the market share for TensorFlow and PyTorch, which both offer first-class NVIDIA support; that likely spells out the story. If you are getting into AI, you go learn one of those frameworks, and they tell you to install CUDA.

  7. Nov 2022
  8. Nov 2021
    1. Two major sources of memory consumption in large-model training: the majority is occupied by model states, including optimizer states (e.g., Adam momentums and variances), gradients, and parameters. Mixed-precision training demands a lot of memory because the optimizer must keep a copy of the FP32 parameters and other optimizer states in addition to the FP16 versions. The remainder is consumed by activations, temporary buffers, and unusable fragmented memory (named residual states in the paper).

      What are the main sources of GPU memory consumption when training deep networks?
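
      A back-of-the-envelope sketch of the model-state portion, following the ZeRO paper's accounting for Adam with mixed precision (the 7.5B-parameter figure is just an example):

        def model_state_bytes(num_params: float) -> float:
            """Bytes of model states for mixed-precision Adam training:
            fp16 params (2) + fp16 grads (2) + fp32 param copy (4) + fp32 momentum (4)
            + fp32 variance (4) = 16 bytes per parameter. Activations, temporary
            buffers, and fragmentation (the residual states) come on top of this."""
            fp16_params = 2 * num_params
            fp16_grads = 2 * num_params
            fp32_optimizer_states = (4 + 4 + 4) * num_params  # param copy, momentum, variance
            return fp16_params + fp16_grads + fp32_optimizer_states

        # Example: a 7.5B-parameter model needs ~120 GB just for model states.
        print(model_state_bytes(7.5e9) / 1e9)  # -> 120.0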

  9. Apr 2021
    1. 3.1 Checking whether the GPU is available

      Sometimes this check returns FALSE. In that case GPU tensors do not work; as far as I know, the problem comes from CUDA version compatibility with the GPU. Since many setups use CUDA 10.1 or 10.2, I installed that version as well, but even after reinstalling CUDA and reloading the library, it still returned FALSE. Try the code below. The Sys.setenv() line must, of course, point to the directory where your own CUDA version is installed. The source() line may throw an error, but if you reinstall the package and load it again, it works normally.

      # Point CUDA_HOME at the directory where your CUDA version is installed
      Sys.setenv("CUDA_HOME" = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2")
      # Reinstall the torch package via the mlverse install script, then load it again
      source("https://raw.githubusercontent.com/mlverse/torch/master/R/install.R")
      install.packages("torch")

  10. Apr 2020