13 Matching Annotations
  1. Jul 2025
    1. Navigating Failures in Pods With Devices

      Summary: Navigating Failures in Pods With Devices

      This article examines the unique challenges Kubernetes faces in managing specialized hardware (e.g., GPUs, accelerators) within AI/ML workloads, and explores current pain points, DIY solutions, and the future roadmap for more robust device failure handling.

      Why AI/ML Workloads Are Different

      • Heavy Dependence on Specialized Hardware: AI/ML jobs require devices like GPUs, with hardware failures causing significant disruptions.
      • Complex Scheduling: Tasks may consume entire machines or need coordinated scheduling across nodes due to device interconnects.
      • High Running Costs: Specialized nodes are expensive; idle time is wasteful.
      • Non-Traditional Failure Models: Standard Kubernetes assumptions (like treating nodes as fungible, or pods as easily replaceable) don’t apply well; failures can trigger large-scale restarts or job aborts.

      Major Failure Modes in Kubernetes With Devices

      1. Kubernetes Infrastructure Failures

        • Multiple actors (device plugin, kubelet, scheduler) must work together; failures can occur at any stage.
        • Issues include pods failing admission, poor scheduling, or pods unable to run despite healthy hardware.
        • Best Practices: Early restarts, close monitoring, canary deployments, use of verified device plugins and drivers.
      2. Device Failures

        • Kubernetes has limited built-in ability to handle device failures—unhealthy devices simply reduce the allocatable count.
        • Lacks correlation between device failure and pod/container failure.
        • DIY Solutions:
          • Node Health Controllers: Restart nodes if device capacity drops, but these can be slow and blunt.
          • Pod Failure Policies: Pods exit with special codes to signal device errors, but support is limited and mostly geared toward batch jobs (a minimal Job sketch follows the Workarounds section below).
          • Custom Pod Watchers: Scripts or controllers watch pod and device status and forcibly delete pods attached to failed devices, prompting rescheduling (a watcher sketch follows this list of failure modes).
      3. Container Code Failures

        • Kubernetes can only restart containers or reschedule pods, with limited expressiveness about what counts as failure.
        • For large AI/ML jobs: Orchestration wrappers restart failed main executables, aiming to avoid expensive full job restart cycles.
      4. Device Degradation

        • Not all device issues result in outright failure; degraded performance now occurs more frequently (e.g., one slow GPU dragging down training).
        • Detection and remediation are largely DIY; Kubernetes does not yet natively express "degraded" status.
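
      As a minimal sketch of the "custom pod watcher" approach described above (not the article's implementation): the controller below watches node status with the official kubernetes Python client, treats a drop of allocatable below capacity for the assumed nvidia.com/gpu resource as a device failure, and deletes GPU pods on that node so their controllers reschedule them. The threshold heuristic, resource name, and all-namespaces eviction are illustrative assumptions.

        from kubernetes import client, config, watch

        config.load_kube_config()        # or config.load_incluster_config() when run in-cluster
        v1 = client.CoreV1Api()
        GPU_RESOURCE = "nvidia.com/gpu"  # extended-resource name used by the NVIDIA device plugin

        def evict_gpu_pods(node_name: str) -> None:
            """Delete GPU-requesting pods on the node so their controllers reschedule them."""
            pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
            for pod in pods.items:
                requests_gpu = any(
                    c.resources and GPU_RESOURCE in (c.resources.requests or {})
                    for c in pod.spec.containers
                )
                if requests_gpu:
                    v1.delete_namespaced_pod(pod.metadata.name, pod.metadata.namespace)

        # Heuristic: when fewer GPUs are allocatable than the node reports as capacity,
        # assume at least one device has gone unhealthy and evict the GPU pods on it.
        w = watch.Watch()
        for event in w.stream(v1.list_node):
            node = event["object"]
            capacity = int(node.status.capacity.get(GPU_RESOURCE, "0"))
            allocatable = int(node.status.allocatable.get(GPU_RESOURCE, "0"))
            if capacity and allocatable < capacity:
                evict_gpu_pods(node.metadata.name)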

      Current Workarounds & Limitations

      • Most device-failure strategies are manual or require high privileges.
      • Workarounds are often fragile, costly, or disruptive.
      • Kubernetes lacks standardized abstractions for device health and device importance at pod or cluster level.
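
      For illustration, a minimal sketch of the pod failure policy workaround mentioned under DIY Solutions above, built with the official kubernetes Python client: the training container exits with an agreed-upon code when it detects a device error, and the Job's policy ignores those failures so they do not burn the backoff limit while the pod is recreated. The exit code 42, names, image, and GPU count are placeholders, and podFailurePolicy requires a reasonably recent cluster and client.

        from kubernetes import client, config

        config.load_kube_config()

        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name="training-job"),
            spec=client.V1JobSpec(
                backoff_limit=6,
                pod_failure_policy=client.V1PodFailurePolicy(rules=[
                    # Exit code 42 = "device error" by convention inside the container image:
                    # do not count it against backoffLimit, just recreate the pod.
                    client.V1PodFailurePolicyRule(
                        action="Ignore",
                        on_exit_codes=client.V1PodFailurePolicyOnExitCodesRequirement(
                            operator="In", values=[42])),
                ]),
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(
                        restart_policy="Never",  # required when podFailurePolicy is set
                        containers=[client.V1Container(
                            name="trainer",
                            image="example.com/trainer:latest",
                            resources=client.V1ResourceRequirements(
                                limits={"nvidia.com/gpu": "1"}))])),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)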

      Roadmap: What’s Next for Kubernetes

      SIG Node and the wider Kubernetes community are focusing on:

      • Improving core reliability: Ensuring kubelet, device manager, and plugins handle failures gracefully.
      • Making Failure Signals Visible: Initiatives like KEP 4680 aim to expose device health in the pod status.
      • Integration With Pod Failure Policies: Plans to recognize device failures as first-class events for triggering recovery.
      • Pod Descheduling: Enabling pods to be rescheduled off failed/unhealthy devices, even with restartPolicy: Always.
      • Better Handling for Large-Scale AI/ML Workloads: More granular recovery, fast in-place restarts, state snapshotting.
      • Device Degradation Signals: Early discussions on tracking performance degradation, but no mature standard yet.

      Key Takeaway

      Kubernetes remains the platform of choice for AI/ML, but device- and hardware-aware failure handling is still evolving. Most robust solutions are still "DIY," but community and upstream investment is underway to standardize and automate recovery and resilience for workloads depending on specialized hardware.

  2. Jun 2025
    1. 1000x Increase in AI Demand
      • NVIDIA’s latest earnings highlight a dramatic surge in AI demand, driven by a shift from simple one-shot inference to more complex, compute-intensive reasoning tasks.
      • Reasoning models require hundreds to thousands of times more computational resources and tokens per task, significantly increasing GPU usage, especially for AI coding agents and advanced applications.
      • Major hyperscalers like Microsoft, Google, and OpenAI are experiencing exponential growth in token generation, with Microsoft alone processing over 100 trillion tokens in Q1—a fivefold year-over-year increase.
      • Hyperscalers are deploying nearly 1,000 NVL72 racks (72,000 Blackwell GPUs) per week, and NVIDIA-powered “AI factories” have doubled year-over-year to nearly 100, with the average GPU count per factory also doubling.
      • To meet this unprecedented demand, more than $300 billion in capital expenditure is being invested this year in data centers (rebranded by NVIDIA as “AI factories”), signaling a new industrial revolution in AI infrastructure.
  3. Jan 2025
  4. Aug 2024
    1. We are using set theory: a certain piece of reference text is either part of my collection or it's not. If it's part of my collection, somewhere in my fingerprint there is a corresponding dot for it. So there is a very clear, direct link from the root data to the actual representation, and to the position that dot has versus all the other dots. The topology of that space (the geometry, if you want, of the patterns you get) contains the knowledge of the world whose language I'm using. And that is super easy to compute for a computer; I don't even need a GPU.

      For comparison: Cortical.io / semantic folding vs. standard AI (no GPU required).
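
      A toy sketch of the idea being described, not Cortical.io's actual semantic folding algorithm: each text becomes a sparse set of "dot" positions, and similarity is plain set overlap, which is cheap integer work on a CPU. The hash-based placement below is an illustrative stand-in for the learned topology that gives real fingerprints their geometry.

        from hashlib import blake2b

        FINGERPRINT_BITS = 16_384  # flattened 128 x 128 grid; size chosen for illustration

        def fingerprint(text: str) -> set[int]:
            """Map a text to a sparse set of dot positions, one per word (toy stand-in:
            real semantic folding places related words at nearby grid positions)."""
            dots = set()
            for word in text.lower().split():
                digest = blake2b(word.encode(), digest_size=4).digest()
                dots.add(int.from_bytes(digest, "big") % FINGERPRINT_BITS)
            return dots

        def overlap(a: set[int], b: set[int]) -> float:
            """Similarity as the fraction of shared dots (Jaccard-style set overlap)."""
            return len(a & b) / max(1, len(a | b))

        doc = fingerprint("reference text that is part of my collection")
        query = fingerprint("is this text part of the collection")
        print(overlap(doc, query))  # plain set arithmetic: no GPU required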

  5. Apr 2023
  6. Jan 2023
    1. Other hardware options do exist, including Google Tensor Processing Units (TPUs); AMD Instinct GPUs; AWS Inferentia and Trainium chips; and AI accelerators from startups like Cerebras, Sambanova, and Graphcore. Intel, late to the game, is also entering the market with their high-end Habana chips and Ponte Vecchio GPUs. But so far, few of these new chips have taken significant market share. The two exceptions to watch are Google, whose TPUs have gained traction in the Stable Diffusion community and in some large GCP deals, and TSMC, who is believed to manufacture all of the chips listed here, including Nvidia GPUs (Intel uses a mix of its own fabs and TSMC to make its chips).

      Look at the market share for TensorFlow and PyTorch, which both offer first-class NVIDIA support; that likely spells out the story. If you are getting into AI, you go learn one of those frameworks, and they tell you to install CUDA.

  7. Nov 2022
  8. Nov 2021
    1. Two major sources of memory consumption in large-model training: the majority is occupied by model states, including optimizer states (e.g., Adam momentums and variances), gradients, and parameters. Mixed-precision training demands a lot of memory because the optimizer must keep a copy of the FP32 parameters and other optimizer states in addition to the FP16 versions. The remainder is consumed by activations, temporary buffers, and unusable fragmented memory (named residual states in the paper).

      What are the main sources of GPU memory consumption when training deep networks?
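
      A back-of-the-envelope sketch of the model-state portion, following the ZeRO paper's accounting for Adam with mixed precision (the 7.5B-parameter figure is just an example):

        def model_state_bytes(num_params: float) -> float:
            """Bytes of model states for mixed-precision Adam training:
            fp16 params (2) + fp16 grads (2) + fp32 param copy (4) + fp32 momentum (4)
            + fp32 variance (4) = 16 bytes per parameter. Activations, temporary
            buffers, and fragmentation (the residual states) come on top of this."""
            fp16_params = 2 * num_params
            fp16_grads = 2 * num_params
            fp32_optimizer_states = (4 + 4 + 4) * num_params  # param copy, momentum, variance
            return fp16_params + fp16_grads + fp32_optimizer_states

        # Example: a 7.5B-parameter model needs ~120 GB just for model states.
        print(model_state_bytes(7.5e9) / 1e9)  # -> 120.0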

  9. Apr 2021
    1. 3.1 Checking whether the GPU is available

      Sometimes this check returns FALSE. In that case GPU tensors do not work; as far as I know, the problem comes from CUDA version compatibility with the GPU. Since many setups use CUDA 10.1 or 10.2, I installed that version as well, but even after reinstalling CUDA and reloading the library, it still returned FALSE. Try the code below. The Sys.setenv() line must, of course, point to the directory where your own CUDA version is installed. The source() line may throw an error, but if you reinstall the package and load it again, it works normally.

      # Point CUDA_HOME at the directory where your CUDA version is installed
      Sys.setenv("CUDA_HOME" = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v10.2")
      # Reinstall the torch package via the mlverse install script, then load it again
      source("https://raw.githubusercontent.com/mlverse/torch/master/R/install.R")
      install.packages("torch")

  10. Apr 2020