How when AWS was down, we were not
Brief summary
Authress avoided downtime during the AWS us-east-1 outage through multi-region, redundant infrastructure: automated DNS failover driven by custom health checks, edge-optimized routing, and anomaly detection, backed by rigorous testing and incremental deployments that limit the blast radius of any change. The system design assumes failure is inevitable and focuses on quick detection, seamless failover, and minimizing single points of failure through automation and continuous validation.
Long summary
- A major AWS us-east-1 outage took down DynamoDB and other critical AWS services that Authress depends on.
- They run infrastructure in us-east-1 due to customer location demands, despite known risks.
- The control planes of global AWS services such as CloudFront, Certificate Manager, Lambda@Edge, and IAM are centralized in us-east-1, so an incident there affects their availability.
- Hitting a 5-nines SLA (99.999% uptime) takes more than relying on AWS's own SLAs, which are not sufficient on their own.
- Simple single-region architectures cannot reach that level of reliability because AWS incidents are too frequent.
- Authress recognizes "everything fails all the time" and designs systems assuming failure.
- Retry strategies are analyzed mathematically; a third-party dependency needs at least 99.7% per-request reliability before retries can carry the system to its availability target (a worked example follows this list).
- Multi-region, redundant infrastructure uses Route 53 health checks to drive automatic DNS failover (a configuration sketch follows this list).
- Custom health checks validate actual application health rather than relying on default DNS-level checks.
- Edge-optimized architecture using CloudFront and Lambda@Edge improves latency and provides better failover options.
- DynamoDB Global Tables replicate data across regions so a failover region already has current data (see the failover-read sketch after this list).
- Rigorous testing and validation, including application-level tests, mitigate risks of bugs in production.
- Incremental deployment by customer bucket limits impact by rolling changes out gradually (see the bucketing sketch after this list).
- Asynchronous validation tests check consistency across databases after deployments.
- Anomaly detection is used to identify meaningful incidents impacting business logic, beyond mere HTTP error codes.
- Customer support feedback is integrated into incident detection to catch undetected or gray failures.
- Security measures include rate limiting, AWS WAF with IP reputation lists, and blocking suspicious high-volume requests.
- Preventing resource exhaustion is critical; rate limiting is implemented at multiple infrastructure layers (a token-bucket sketch follows this list).
- Differences between Infrastructure as Code (IaC) deployments across regions and the edge make it hard to keep configurations consistent.
- Despite all these measures, achieving a true 5-nines SLA is extremely challenging but remains a core commitment.
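To put a number on the retry claim above: if attempts are treated as independent (an assumption that correlated outages break, as the HN discussion below points out), a dependency called with one retry needs roughly 99.68% per-attempt success to reach 99.999% overall, which lines up with the 99.7% figure cited above. A minimal sketch of that arithmetic:

```python
# Illustration of the retry math behind the ~99.7% figure, assuming attempts
# fail independently; real correlated outages violate this assumption.

def effective_availability(per_attempt_success: float, attempts: int) -> float:
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - per_attempt_success) ** attempts

def required_per_attempt(target: float, attempts: int) -> float:
    """Minimum per-attempt success rate for `attempts` tries to hit `target`."""
    return 1.0 - (1.0 - target) ** (1.0 / attempts)

if __name__ == "__main__":
    five_nines = 0.99999
    # With one retry (two attempts), each attempt must succeed ~99.68% of the
    # time to reach five nines overall -- close to the 99.7% cited above.
    print(f"{required_per_attempt(five_nines, attempts=2):.5f}")  # ~0.99684
    print(f"{effective_availability(0.997, attempts=2):.7f}")     # ~0.9999910
```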
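The DNS failover described above could be wired roughly like this with Route 53; a minimal boto3 sketch in which the domain names, hosted zone ID, and endpoint addresses are placeholders rather than Authress's actual configuration. The health check targets an application path so it reflects real service health, matching the custom-health-check point.

```python
# Sketch of Route 53 DNS failover driven by a custom health check (boto3).
# Domain names, zone ID, and IPs are placeholders, not Authress's real setup.
import uuid
import boto3

route53 = boto3.client("route53")

# Health check against an application-level endpoint so that it reflects real
# service health (the handler can exercise DynamoDB and other dependencies),
# not just TCP reachability.
health_check = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "us-east-1.api.example.com",
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 10,
        "FailureThreshold": 2,
    },
)

def failover_record(identifier, role, ip, health_check_id=None):
    """Build an UPSERT change for one side of a PRIMARY/SECONDARY failover pair."""
    record = {
        "Name": "api.example.com",
        "Type": "A",
        "SetIdentifier": identifier,
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLEZONEID",
    ChangeBatch={"Changes": [
        failover_record("us-east-1", "PRIMARY", "203.0.113.10",
                        health_check["HealthCheck"]["Id"]),
        failover_record("eu-west-1", "SECONDARY", "203.0.113.20"),
    ]},
)
```

When the primary's health check fails, Route 53 starts answering with the secondary record; the low TTL keeps resolver caching from delaying the switch for too long, though the HN thread below notes that caching is never fully under your control.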
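For the Global Tables point, a minimal sketch of a failover read: try the usual region's replica first and fall back to another replica if it errors. The table name, key, and regions are illustrative assumptions.

```python
# Sketch: read from a DynamoDB Global Table, falling back to a replica region
# when the preferred region's endpoint is failing. Table name, key schema, and
# regions are illustrative placeholders.
import boto3
from botocore.config import Config
from botocore.exceptions import BotoCoreError, ClientError

REGIONS = ["us-east-1", "eu-west-1"]  # replica regions of the global table

# Short timeouts and minimal SDK retries so a regional failure surfaces quickly
# and the next replica gets tried instead.
_cfg = Config(connect_timeout=1, read_timeout=2, retries={"max_attempts": 1})
_tables = [
    boto3.resource("dynamodb", region_name=region, config=_cfg).Table("accounts")
    for region in REGIONS
]

def get_account(account_id: str):
    last_error = None
    for table in _tables:
        try:
            return table.get_item(Key={"accountId": account_id}).get("Item")
        except (ClientError, BotoCoreError) as err:
            last_error = err  # fall through to the next replica region
    raise last_error
```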
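A minimal sketch of rollout by customer bucket, assuming accounts are hashed deterministically into a fixed number of buckets and a release is enabled one bucket at a time; the bucket count and account identifier are illustrative.

```python
# Sketch of incremental rollout by customer bucket: each account hashes to a
# stable bucket, and a release is only enabled for buckets rolled out so far.
import hashlib

BUCKET_COUNT = 10

def bucket_for(account_id: str) -> int:
    """Deterministically map an account to one of BUCKET_COUNT buckets."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKET_COUNT

def release_enabled(account_id: str, buckets_rolled_out: int) -> bool:
    """True if this account's bucket falls inside the current rollout window."""
    return bucket_for(account_id) < buckets_rolled_out

# Example: at stage 2 of the rollout, only accounts in buckets 0 and 1 see the change.
print(release_enabled("acct_123", buckets_rolled_out=2))
```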
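On the rate-limiting point, one plausible application-layer piece is a per-caller token bucket, sitting behind the WAF and edge rules mentioned above; a minimal sketch with illustrative limits:

```python
# Sketch of application-layer rate limiting with a token bucket per caller.
# Refill rate and burst capacity are illustrative values.
import time
from collections import defaultdict

RATE = 50       # tokens refilled per second
CAPACITY = 100  # maximum burst size

_buckets = defaultdict(lambda: {"tokens": float(CAPACITY), "updated": time.monotonic()})

def allow_request(caller_id: str) -> bool:
    """Return True if the caller still has budget, False if it should get a 429."""
    bucket = _buckets[caller_id]
    now = time.monotonic()
    elapsed = now - bucket["updated"]
    bucket["tokens"] = min(CAPACITY, bucket["tokens"] + elapsed * RATE)
    bucket["updated"] = now
    if bucket["tokens"] >= 1.0:
        bucket["tokens"] -= 1.0
        return True
    return False
```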
Summary of HN discussion
https://news.ycombinator.com/item?id=45955565
- The discussion highlights concerns about automation and Infrastructure as Code (IaC) being potential failure points, emphasizing the challenge of safely updating these systems.
- Rollbacks are rarely automatic; it is often better to know in advance which rollouts to avoid, since automated rollbacks can make a failure worse.
- Simpler infrastructure changes are preferred because they reduce human error, the leading cause of incidents.
- There is skepticism about the reliability of Route 53 failover in practice, with concerns about its failure modes and the complexity of multi-cloud DNS failover.
- Some contributors suggest modular IaC approaches (Pulumi, Terragrunt) for safer, repeatable deployments but warn about added complexity.
- Blind retries are criticized: when failures are correlated, retries do not improve reliability the way the simple math suggests, and they add load to an already overloaded system during an outage.
- Latency and client timeout budgets limit how many retries are practical (see the retry-budget sketch after this list).
- DNS is acknowledged as a single point of failure with caching and failover timing challenges.
- Multi-cloud failover at DNS level is complex, costly, and not widely implemented due to infrastructure and coordination requirements.
- Gray failures, where the system reports healthy while customers experience problems, make it hard to know an incident's real impact without customer feedback.
- Customer support is critical in incident detection since automated systems cannot catch every failure.
- Detailed monitoring via CloudFront and telemetry helps identify actual service issues during outages.
- Overall, the theme is the difficulty in achieving perfect reliability, the importance of simplicity, and the need for layered detection and response strategies to manage failures.
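To illustrate the retry-budget point: the client's overall timeout, not a fixed retry count, ends up deciding how many attempts are even possible. A minimal sketch with illustrative budget and backoff values:

```python
# Sketch of deadline-aware retries: stop retrying once the remaining time
# budget cannot absorb another backoff, instead of piling on during an outage.
import random
import time

def call_with_deadline(operation, total_budget_s=3.0, base_backoff_s=0.1):
    """Call `operation` until it succeeds or the time budget is exhausted."""
    deadline = time.monotonic() + total_budget_s
    attempt = 0
    while True:
        try:
            return operation()
        except Exception:
            attempt += 1
            # Exponential backoff with full jitter, capped at one second per wait.
            backoff = min(1.0, base_backoff_s * (2 ** attempt)) * random.random()
            if time.monotonic() + backoff >= deadline:
                raise  # out of budget: surface the failure rather than retry again
            time.sleep(backoff)
```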