97 Matching Annotations
  1. Feb 2024
    1. We’ve (painstakingly) manually reviewed 310 live MLOps positions, advertised across various platforms in Q4 this year

      They reviewed 310 role descriptions and, although descriptions vary significantly, identified 3 core skills that a large percentage of MLOps roles require:

      • 📦 Docker and Kubernetes
      • 🐍 Python
      • 🌥 Cloud

  2. Nov 2023
    1. But rather than do all that work to identify the running pod, why not run the backup using the deployment? (That's what I've always done.)

      kubectl exec deployments/gitlab --namespace gitlab -- gitlab-rake gitlab:backup:create
  3. Oct 2023
  4. Mar 2023
    1. cluster with 4096 IP addresses can deploy at most 1024 models assuming each InferenceService has 4 pods on average (two transformer replicas and two predictor replicas).

      Kubernetes clusters have a hard IP-address limit, which caps pod count: 4096 IPs ÷ 4 pods per InferenceService = 1024 models

    2. According to Kubernetes best practice, a node shouldn't run more than 100 pods.
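
      A minimal sketch for checking a node's pod capacity against that guideline (the node name is hypothetical):

      kubectl get node my-node -o jsonpath='{.status.capacity.pods}'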
  5. Feb 2023
  6. Jan 2023
  7. Dec 2022
    1. Kubernetes is a purpose-built container orchestration solution
    2. While a full dive into container orchestration is beyond the scope of this article, two prominent players are Docker (with Docker Compose and Docker Swarm mode) and Kubernetes. In rough order of complexity: Docker Compose is a container orchestration solution that deals with multi-container deployments on a single host; when multiple hosts are involved, Docker Swarm mode is required.
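
      A minimal sketch of the single-host case (service names and images are illustrative):

      cat > docker-compose.yml <<EOF
      services:
        web:
          image: nginx
          ports:
            - "8080:80"
        cache:
          image: redis
      EOF
      docker compose up -d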
  8. Nov 2022
    1. Orchestration involves provisioning, configuration, scheduling, scaling, monitoring, deployment, and more. Kubernetes is an example of a popular container orchestration solution.
  9. Jan 2022
    1. Adopting Kubernetes-native environments ensures true portability for the hybrid cloud. However, we also need a Kubernetes-native framework to provide the "glue" for applications to seamlessly integrate with Kubernetes and its services. Without application portability, the hybrid cloud is relegated to an environment-only benefit. That framework is Quarkus.

      Quarkus framework

    2. Kubernetes-native is a specialization of cloud-native, and not divorced from what cloud native defines. Whereas a cloud-native application is intended for the cloud, a Kubernetes-native application is designed and built for Kubernetes.

      Kubernetes-native application

  10. Nov 2021
    1. If for some reason you don’t see a running pod from this command, then using kubectl describe po a is your next-best option. Look at the events to find errors for what might have gone wrong.

      kubectl run a --image alpine --command -- /bin/sleep 1d

    2. As with listing nodes, you should first look at the status column and look for errors. The ready column will show how many pods are desired and how many are running.

      kubectl get pods -A -o wide

    3. -o wide option will tell us additional details like operating system (OS), IP address and container runtime. The first thing you should look for is the status. If the node doesn’t say “Ready” you might have a problem, but not always.

      kubectl get nodes -o wide

    4. This command will be the easiest way to discover if your scheduler, controller-manager and etcd node(s) are healthy.

      kubectl get componentstatus

    5. If something broke recently, you can look at the cluster events to see what was happening before and after things broke.

      kubectl get events -A

    6. this command will tell you what CRDs (custom resource definitions) have been installed in your cluster and what API version each resource is at. This could give you some insights into looking at logs on controllers or workload definitions.

      kubectl api-resources -o wide --sort-by name

    7. kubectl get --raw '/healthz?verbose'

      An alternative to kubectl get componentstatus. It does not show scheduler or controller-manager output, but it adds a lot of additional checks that might be valuable if things are broken

    8. Here are the eight commands to run

      8 commands to debug Kubernetes cluster:

      kubectl version --short
      kubectl cluster-info
      kubectl get componentstatus
      kubectl api-resources -o wide --sort-by name
      kubectl get events -A
      kubectl get nodes -o wide
      kubectl get pods -A -o wide
      kubectl run a --image alpine --command -- /bin/sleep 1d
      
  11. Oct 2021
    1. Argo Workflow is part of the Argo project, which offers a range of, as they like to call it, Kubernetes-native get-stuff-done tools (Workflow, CD, Events, Rollouts).

      High level definition of Argo Workflow

    2. Argo is designed to run on top of k8s. Not a VM, not AWS ECS, not Container Instances on Azure, not Google Cloud Run or App Engine. This means you get all the good of k8s, but also the bad.

      Pros of Argo Workflow:

      • Resilience
      • Autoscaling
      • Configurability
      • Support for RBAC

      Cons of Argo Workflow:

      • A lot of YAML files required
      • k8s knowledge required
    3. If you are already heavily invested in Kubernetes, then yes, look into Argo Workflow (and its brothers and sisters from the parent project). The broader and harder question you should ask yourself is: go full k8s-native or not? Look at your team's cloud and k8s experience, size, and growth targets. Most probably you will land somewhere in the middle first, as there is no free lunch.

      Should you go into Argo, or not?

    4. In order to reduce the number of lines of text in Workflow YAML files, use WorkflowTemplate. This allows for reuse of common components.

      kind: WorkflowTemplate
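
      A minimal sketch of defining one and referencing it (all names are hypothetical):

      kubectl apply -f - <<EOF
      apiVersion: argoproj.io/v1alpha1
      kind: WorkflowTemplate
      metadata:
        name: hello-template
      spec:
        templates:
        - name: say-hello
          container:
            image: alpine
            command: [echo, "hello"]
      EOF
      # Workflows can then point at say-hello via templateRef
      # instead of repeating the container spec in every file.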

  12. Sep 2021
    1. Hence, Podman allows the creation and execution of Pods from a Kubernetes YAML file (see podman-play-kube). Podman can also generate Kubernetes YAML based on a container or Pod (see podman-generate-kube), which allows for an easy transition from a local development environment to a production Kubernetes cluster.
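
      A short sketch of that round trip (the container name is hypothetical):

      podman generate kube mycontainer > pod.yaml   # YAML from a running container
      podman play kube pod.yaml                     # recreate the Pod with Podman
      kubectl apply -f pod.yaml                     # or hand the same file to a cluster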
    2. We believe that Kubernetes is the de facto standard for composing Pods and for orchestrating containers, making Kubernetes YAML a de facto standard file format.
    1. kind, microk8s, or k3s are replacements for Docker Desktop. False. Minikube is the only drop-in replacement. The other tools require a Linux distribution, which makes them a non-starter on macOS or Windows. Running any of these in a VM misses the point – you don't want to be managing the Kubernetes lifecycle and a virtual machine lifecycle. Minikube abstracts all of this.

      At the current moment the best approach is to use minikube with a preferred backend (Docker Engine and Podman are already supported), and you can simply run one command to configure the Docker CLI to use the engine from the cluster.
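
      A sketch of that workflow, assuming the Docker driver:

      minikube start --driver=docker
      # Point the local Docker CLI at the Docker engine inside the cluster
      eval $(minikube docker-env)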

  13. Aug 2021
    1. First, let's look at the OUTPUT rules in the nat table:

      The result can be viewed with the following command:

      sudo iptables -t nat -nvL OUTPUT
      
    1. kubectl run --image=nginx nginx-web-1 --image-pull-policy='IfNotPresent'

      A Deployment should be created here instead:

      kubectl create deployment nginx-web-1 --image=nginx
      
    1. k3d is basically running k3s inside of Docker. It provides an instant benefit over using k3s on a local machine, that is, multi-node clusters. Running inside Docker, we can easily spawn multiple instances of our k3s Nodes.

      k3d <--- k3s in Docker, which allows running multi-node clusters on a local machine
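
      A quick sketch (the cluster name is hypothetical):

      k3d cluster create demo --agents 2   # one server plus two agent nodes, all Docker containers
      kubectl get nodes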

    2. Kubernetes in Docker (KinD) is similar to minikube, but it does not spawn VMs to run clusters and works only with Docker. KinD for the most part has the least bells and whistles and offers an intuitive developer experience in getting started with Kubernetes in no time.

      KinD (Kubernetes in Docker) <--- sounds like the most recommended solution to learn k8s locally
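
      Getting a local cluster, assuming kind is installed:

      kind create cluster --name demo
      kubectl cluster-info --context kind-demo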

    3. Contrary to the name, it comes in a larger binary of 150 MB+. It can be run as a binary or in DinD mode. k0s takes security seriously and out of the box, it meets the FIPS compliance.

      k0s <--- similar to k3s, but not as lightweight

    4. k3s is a lightweight Kubernetes distribution from Rancher Labs. It is specifically targeted for running on IoT and Edge devices, meaning it is a perfect candidate for your Raspberry Pi or a virtual machine.

      k3s <--- lightweight solution

    5. All of the tools listed here more or less offer the same feature, including but not limited to

      7 tools for learning k8s locally:

      1. k3s
      2. k0s
      3. Microk8s
      4. DinD
      5. minikube
      6. KinD
      7. k3d
    6. There are multiple tools for running Kubernetes on your local machine, but it basically boils down to two approaches on how it is done

      We can run Kubernetes locally as a:

      1. binary package
      2. container using dind
    7. Before we move on to talk about all the tools, it will be beneficial if you installed arkade on your machine.

      With arkade, we can quickly set up different k8s tools, while using a single command:

      e.g. arkade get k9s

  14. Jul 2021
    1. there is a drawback: docker-compose runs on a single node, which makes scaling hard, manual, and very limited. To be able to scale services across multiple hosts/nodes, orchestrators like docker-swarm or kubernetes come into play.
      • docker-compose runs on a single node (hard to scale)
      • docker-swarm or kubernetes run on multiple nodes (see the sketch below)
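
      A sketch of the multi-node path with Swarm (the service name is illustrative):

      docker swarm init                            # make this host a swarm manager
      docker service create --name web --replicas 3 nginx
      docker service scale web=10                  # spread replicas across swarm nodes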
    1. Even though Kubernetes is moving away from Docker, it will always support the OCI and Docker image formats. Kubernetes doesn't pull and run images itself; instead the Kubelet relies on container engines like CRI-O and containerd to pull and run the images. These are the two main container engines used with the CRI, and they both support the Docker and OCI image formats, so no worries on this one.

      Reason why one should not be worried about k8s deprecating Docker support

  15. Jun 2021
    1. Secure service-to-service communication in a cluster with TLS encryption, strong identity-based authentication and authorization

      What about between clusters?

    1. After finishing this article, Chinese readers can refer to this article to learn more; it adds some supplements to the source-code analysis presented here.

  16. May 2021
    1. The only problem is that Kubeflow Pipelines must be deployed on a Kubernetes Cluster. You will struggle with permissions, VPC and lots of problems to deploy and use it if you are in a small company that uses sensitive data, which makes it a bit difficult to adopt. Vertex AI solves this problem with a managed pipeline runner: you can define a Pipeline and it will execute it, being responsible to provision all resources, store all the artifacts you want and pass them through each of the wanted steps.

      How Vertex AI solves the problem/need of deploying on a Kubernetes Cluster

  17. Apr 2021
    1. With Spark 3.1, the Spark-on-Kubernetes project is now considered Generally Available and Production-Ready.

      With Spark 3.1, k8s becomes a production-ready alternative to YARN

  18. Mar 2021
    1. We use Prometheus to collect time-series metrics and Grafana for graphs, dashboards, and alerts.

      How Prometheus and Grafana can be used to collect information from running ML on K8s

    2. large machine learning job spans many nodes and runs most efficiently when it has access to all of the hardware resources on each node. This allows GPUs to cross-communicate directly using NVLink, or GPUs to directly communicate with the NIC using GPUDirect. So for many of our workloads, a single pod occupies the entire node.

      The way OpenAI runs large ML jobs on K8s

    1. We use Kubernetes mainly as a batch scheduling system and rely on our autoscaler to dynamically scale up and down our cluster — this lets us significantly reduce costs for idle nodes, while still providing low latency while iterating rapidly.
    2. For high availability, we always have at least 2 masters, and set the --apiserver-count flag to the number of apiservers we’re running (otherwise Prometheus monitoring can get confused between instances).

      Tip for high availability:

      • have at least 2 masters
      • set --apiserver-count flag to the number of running apiservers
    3. We’ve increased the max etcd size with the --quota-backend-bytes flag, and the autoscaler now has a sanity check not to take action if it would terminate more than 50% of the cluster.

      With more than ~1k nodes, etcd can hit its hard storage limit and stop accepting writes
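
      A sketch of raising the quota (the 8 GiB value is only an example):

      etcd --quota-backend-bytes=8589934592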

    4. Another helpful tweak was storing Kubernetes Events in a separate etcd cluster, so that spikes in Event creation wouldn’t affect performance of the main etcd instances.

      Another trick, apart from tweaking the default settings of Fluentd & Datadog

    5. The root cause: the default setting for Fluentd’s and Datadog’s monitoring processes was to query the apiservers from every node in the cluster (for example, this issue which is now fixed). We simply changed these processes to be less aggressive with their polling, and load on the apiservers became stable again:

      Default settings of Fluentd and Datadog might not be suited for running many nodes

    6. We then moved the etcd directory for each node to the local temp disk, which is an SSD connected directly to the instance rather than a network-attached one. Switching to the local disk brought write latency to 200us, and etcd became healthy!

      One of the fixes for etcd using only about 10% of the available IOPS; it worked until about 1k nodes

  19. Dec 2020
  20. Nov 2020
  21. Oct 2020
    1. Kubernetes doesn’t have the ability to schedule and manage GPU resources

      But GPU scheduling is provided via device plugins (e.g. NVIDIA's)
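
      A minimal sketch: once a device plugin is installed, pods request GPUs as an extended resource (the pod name and image tag are illustrative):

      kubectl apply -f - <<EOF
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-test
      spec:
        restartPolicy: Never
        containers:
        - name: cuda
          image: nvidia/cuda:11.0-base
          command: ["nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
      EOF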

  22. May 2020
  23. Apr 2020
    1. It's responsible for allocating and scheduling containers, providing them with abstracted functionality like internal networking and file storage, and then monitoring the health of all of these elements and stepping in to repair or adjust them as necessary. In short, it's all about abstracting how, when and where containers are run.

      Kubernetes (simple explanation)

    1. You’ll see pressure to push towards “Cloud neutral” solutions using Kubernetes in various places

      Kubernetes may have the advantage of being cloud neutral, but you pay the cost of a cloud migration up front:

      • maintaining abstractions
      • isolating yourself from useful vendor-specific features
    2. Heroku? App Services? App Engine?

      You can get yourself set up in production in minutes to only a few hours

    3. Kubernetes (often irritatingly abbreviated to k8s, along with its wonderful ecosystem of esoterically named additions like helm, and flux) requires a full time ops team to operate, and even in “managed vendor mode” on EKS/AKS/GKE the learning curve is far steeper than the alternatives.

      Kubernetes:

      • require a full time ops team to operate
      • the learning curve is far steeper than the alternatives
    4. Azure App Services, Google App Engine and AWS Lambda will be several orders of magnitude more productive for you as a programmer. They’ll be easier to operate in production, and more explicable and supported.

      Use the closest thing to a pure-managed platform as you possibly can. It will be easier to operate in production, and more explicable and supported:

      • Azure App Service
      • Google App Engine
      • AWS Lambda
    5. With the popularisation of docker and containers, there’s a lot of hype gone into things that provide “almost platform like” abstractions over Infrastructure-as-a-Service. These are all very expensive and hard work.

      Kubernetes isn't always required; it mainly pays off when you work on huge problems

  24. Mar 2020
    1. from Docker Compose on a single machine, to Heroku and similar systems, to something like Snakemake for computational pipelines.

      Other alternatives to Kubernetes:

      • Docker Compose on a single machine
      • Heroku and similar systems
      • Snakemake for computational pipelines
    2. if what you care about is downtime, your first thought shouldn’t be “how do I reduce deployment downtime from 1 second to 1ms”, it should be “how can I ensure database schema changes don’t prevent rollback if I screw something up.”

      Caring about downtime

    3. The features Kubernetes provides for reliability (health checks, rolling deploys), can be implemented much more simply, or already built-in in many cases. For example, nginx can do health checks on worker processes, and you can use docker-autoheal or something similar to automatically restart those processes.

      Kubernetes' health checks can be replaced with nginx on worker processes + docker-autoheal to automatically restart those processes
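
      A sketch of the autoheal piece, using the willfarrell/autoheal image; it watches Docker health checks and restarts containers labeled autoheal=true:

      docker run -d --name autoheal --restart=always \
        -e AUTOHEAL_CONTAINER_LABEL=autoheal \
        -v /var/run/docker.sock:/var/run/docker.sock \
        willfarrell/autoheal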

    4. Scaling for many web applications is typically bottlenecked by the database, not the web workers.
    5. Kubernetes might be useful if you need to scale a lot. But let’s consider some alternatives

      Kubernetes alternatives:

      • cloud VMs with up to 416 vCPUs and 8 TiB RAM
      • scale many web apps with Heroku
    6. Distributed applications are really hard to write correctly. Really. The more moving parts, the more these problems come into play. Distributed applications are hard to debug. You need whole new categories of instrumentation and logging to get understanding that isn't quite as good as what you'd get from the logs of a monolithic application.

      Microservices remain a hard nut to crack.

      They are fine as an organisational scaling technique: when you have 500 developers working on one live website, they let teams work independently. For example, each team of 5 developers can be given one microservice

    7. you need to spin up a complete K8s system just to test anything, via a VM or nested Docker containers.

      You need a complete K8s to run your code, or you can use Telepresence to code locally against a remote Kubernetes cluster
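
      A sketch with the Telepresence 2 CLI (service name and port are hypothetical):

      telepresence connect                           # bridge your laptop into the cluster network
      telepresence intercept myservice --port 8080   # route that service's traffic to local code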

    8. “Kubernetes is a large system with significant operational complexity. The assessment team found configuration and deployment of Kubernetes to be non-trivial, with certain components having confusing default settings, missing operational controls, and implicitly defined security controls.”

      Deployment of Kubernetes is non-trivial

    9. Before you can run a single application, you need the following highly-simplified architecture

      Before running the simplest Kubernetes app, you need at least this architecture:

    10. the Kubernetes codebase has significant room for improvement. The codebase is large and complex, with large sections of code containing minimal documentation and numerous dependencies, including systems external to Kubernetes.

      As of March 2020, the Kubernetes code base has more than 580,000 lines of Go code

    11. Kubernetes has plenty of moving parts—concepts, subsystems, processes, machines, code—and that means plenty of problems.

      Kubernetes might be not the best solution in a smaller team

  25. Feb 2020
  26. Jan 2020
  27. Jul 2019
  28. Jun 2019
  29. May 2019
    1. Installing runtime

      apt-get install -y docker.io

    2. apt-get update && apt-get install -y apt-transport-https curl
      curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
      cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
      deb https://apt.kubernetes.io/ kubernetes-xenial main
      EOF
      apt-get update
      apt-get install -y kubelet kubeadm kubectl
      apt-mark hold kubelet kubeadm kubectl

      Install Docker container runtime first.

      apt-get install -y docker.io

    1. Joining your nodes

      Install runtime.

      sudo -i
      apt-get update && apt-get upgrade -y
      apt-get install -y docker.io
      

      Install kubeadm, kubelet and kubectl.

      https://kubernetes.io/docs/setup/independent/install-kubeadm/#installing-kubeadm-kubelet-and-kubectl

      apt-get update && apt-get install -y apt-transport-https curl
      curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
      cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
      deb https://apt.kubernetes.io/ kubernetes-xenial main
      EOF
      apt-get update
      apt-get install -y kubelet kubeadm kubectl
      apt-mark hold kubelet kubeadm kubectl
      
  30. Apr 2019
  31. Mar 2019
    1. CI/CD pipeline on Kubernetes using Jenkins and Spinnaker

      Wow! Many topics from the LPI DevOps exam are covered in this talk. Keep an eye on the topic: 702 Container Management.

  32. Feb 2019
  33. Jan 2019
  34. Dec 2018
  35. Jan 2018
  36. Jul 2017
    1. This figure shows the Inception-v3 model, proposed by Google in 2015. The model reaches 95% accuracy on the ImageNet dataset. However, it has 25 million parameters, and classifying a single image takes 5 billion addition or multiplication operations.

      95% accuracy requires 25,000,000 parameters!