60 Matching Annotations
  1. Apr 2025
    1. In order to restart an existing Galera cluster, one first needs to identify the node whose local database contains the latest transaction acknowledged by the cluster, i.e. the one with the biggest seqno.

      In a two-node cluster with etcd, this should mean that all writes are acknowledged only if they are received by both nodes, and that any node can restart the cluster
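
      To find the node with the biggest seqno, each node's grastate.dat (in the MySQL data directory) can be inspected. A sketch of that state file; the uuid and seqno values are placeholders:

```
# GALERA saved state
version: 2.1
uuid:    <cluster state UUID>
seqno:   1932
safe_to_bootstrap: 0
```

      If a node shut down uncleanly, seqno is -1, and the last committed position can be recovered with `mysqld --wsrep-recover`, which prints it to the log.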

  2. Dec 2024
    1. consider performing a scale test on your controller in a large cluster (tools like kwok can help with creating synthetic clusters), and monitor your controller’s behavior through audit logs, and metrics around how long reconciliation loops take, whether your controller is able to keep up with the rate of changes in the cluster, etc.

      kwok is a useful tool for creating synthetic clusters, to help test a controller in large clusters

    2. decided to move all static node labels directly to the kubelet configuration file

      The kubelet configuration file is a good place to store static node labels

    3. NodeFeature custom resources have their owner references set to the nfd-worker Pod, which is a bad idea, because these Pod get deleted all the time during upgrades etc, and Kubernetes would garbage-collect these NodeFeature resources.

      Writing a Custom Resource (CR) with its owner reference set to a volatile resource (e.g. a Pod) is a bad idea: such resources are deleted often, so the CR is garbage-collected often as well
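
      The mechanism is visible in the CR's metadata; a sketch with hypothetical names (the API group is an assumption about NFD's CRD):

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1   # assumption: NodeFeature API group
kind: NodeFeature
metadata:
  name: example-node
  ownerReferences:
    - apiVersion: v1
      kind: Pod              # volatile owner
      name: nfd-worker-xyz   # hypothetical Pod name
      uid: <pod uid>
```

      When the owning Pod is deleted (e.g. during an upgrade), the garbage collector deletes the NodeFeature along with it.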

    4. using a higher level controller development framework like controller-runtime would make it impossible to get this wrong

      relying on a high-level controller development framework like controller-runtime makes implementing controllers less risky

    5. This is rather proving my point that implementing controllers correctly is inherently hard

      Implementing Kubernetes controllers is hard and can easily lead to big mistakes

    6. Normally Kubernetes controllers must not start unless the controller has successfully built an informer cache. However, NFD did not check the return value of the WaitForCacheSync() method

      The WaitForCacheSync() method tells the controller when the cache is complete and reads from it are authoritative

    7. The large object sizes have made the NFD controller unable to list the large number of NodeFeatures from the apiserver, causing its list requests to repeatedly time out.

      A list request to the apiserver might time out if the response data is large

  3. Oct 2024
  4. Sep 2024
    1. If a cluster permanently loses a majority of its members, a new cluster will need to be started from an old data directory to recover the previous state.

      permanent loss of quorum requires starting a new cluster from the old data directory
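
      etcd supports this via the --force-new-cluster flag, which discards the old membership information while keeping the data; a sketch with placeholder names and paths:

```shell
# start a single-member cluster from the surviving data directory
etcd --name recovered-node \
  --data-dir /var/lib/etcd \
  --force-new-cluster
```

      New members can then be added back one at a time.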

    2. To join the new etcd member into the existing cluster, specify the correct initial-cluster and set initial-cluster-state to existing

      a new member joins with initial-cluster-state set to "existing"
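
      The shape of the two steps, with hypothetical member names and addresses:

```shell
# announce the new member to the existing cluster
etcdctl member add node3 --peer-urls=http://10.0.0.3:2380

# start the new member pointing at the full membership list
etcd --name node3 \
  --initial-cluster node1=http://10.0.0.1:2380,node2=http://10.0.0.2:2380,node3=http://10.0.0.3:2380 \
  --initial-cluster-state existing
```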

    1. When choosing the best PAC (accumulation plan) it is important to "know the investor's inflows, outflows, and spending timelines, so as to choose the markets or strategies best suited to each profile"

      what does this mean?

  5. Aug 2024
    1. The main assets generated by the installation program are the Ignition config files for the bootstrap, master, and worker machines.

      Ignition has configuration files for:
      • bootstrap nodes
      • master nodes
      • worker nodes

    2. By meeting dependencies instead of running commands, the installation program is able to recognize and use existing components instead of running the commands to create them again.

      The installation program is idempotent: it does not repeat an operation whose result already exists.

    3. These two basic types of OpenShift Container Platform clusters are frequently called installer-provisioned infrastructure clusters and user-provisioned infrastructure clusters.
      • installer-provisioned: the installer provisions the infrastructure and deploys a cluster, the cluster maintains the infrastructure.
      • user-provisioned: the client prepares and maintains the infrastructure, but the installer can still deploy the cluster in it.
    1. The Kubernetes Controller Manager (kube controller) reads the status values every 10 seconds, by default. If the kube controller cannot read a node status value, it loses contact with that node after a configured period.

      how kube-controller checks nodes

      If the kube-controller cannot read a node's status, by default it sets the node's Ready condition to Unknown, which:
      • causes the scheduler to stop scheduling pods on that node
      • taints the node with node.kubernetes.io/unreachable

    1. Define your audience

      ... in terms of their "proximity" to the knowledge

      • do they know similar concepts?
      • have they been in contact with the concepts recently?
    1. Generally speaking, embedded lists are a poor way to present technical information. Try to transform embedded lists into either bulleted lists or numbered lists.

      This refers to lists of items separated by commas. The initial "Generally" adverb is the key: do not convert every comma-separated enumeration into a list, but consider doing it more often.

    2. If you rearrange the items in a bulleted list, the list's meaning does not change. If you rearrange the items in a numbered list, the list's meaning changes.

      So actually use a numbered list only when the order matters

    3. When editing, scrutinize subordinate clauses. Keep the one sentence = one idea, single-responsibility principle in mind. Do the subordinate clauses in a sentence extend the single idea or do they branch off into a separate idea? If the latter, consider dividing the offending subordinate clause(s) into separate sentences.

      the single-responsibility principle applies to sentences too:
      • keep subordinate clauses that extend the main idea
      • move the others into separate sentences

      The examples in the section below are enlightening

    4. Namely, when introducing a long-winded concept name or product name, you may also specify a shortened version of that name

      Use the same term for a given concept. If the term is long, introduce a shortened version immediately, so you can use it later.

  6. Jul 2024
    1. To apply a custom layered image, you create a Containerfile that references an OpenShift Container Platform image and the RPM that you want to apply.

      I need an image with my RPM in it. Can I use dnf, or must it be rpm-ostree?

      Maybe it depends on the base image I use: the base image must be the same one used for RHCOS, given by

      oc adm release info --image-for rhel-coreos

      all the following examples use rpm-ostree, though
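
      A sketch of such a Containerfile; the base image is whatever the `oc adm release info` command above returns, and the RPM name is a placeholder:

```
FROM <rhel-coreos image from oc adm release info>
RUN rpm-ostree install my-package.rpm && \
    rpm-ostree cleanup -m && \
    ostree container commit
```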

    2. You create a custom layered image by using a Containerfile and applying it to nodes by using a MachineConfig object.

      are you not supposed to use rpm-ostree directly, then?

    1. Slave is functionally identical to the Started state in a stateless resource agent

      for every resource the start action is the same (as is stop). If the resource is stateful, it also has a promote action, which makes it Master.

    2. A resource agent receives all configuration information about the resource it manages via environment variables. The names of these environment variables are always the name of the resource parameter, prefixed with OCF_RESKEY_. For example, if the resource has an ip parameter set to 192.168.1.1, then the resource agent will have access to an environment variable OCF_RESKEY_ip holding that value.

      configuration via environment variables only:
      • prefixed OCF_RESKEY_<name>
      • a default value if the parameter is not mandatory
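
      The convention can be simulated outside a cluster; a minimal sketch (the parameter name "ip" and its value are just illustrative):

```python
import os

# Simulate what the cluster manager would export for an "ip" parameter
os.environ["OCF_RESKEY_ip"] = "192.168.1.1"

# Inside the agent: read the parameter, falling back to a default
# when a non-mandatory parameter was not set
ip = os.environ.get("OCF_RESKEY_ip", "127.0.0.1")
print(ip)  # 192.168.1.1
```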

    3. Please name resource agents using lower case letters, with words separated by dashes (example-agent-name).

      naming convention: neither camel case nor snake case, but dash case (kebab case)?

    1. Just because a node is unresponsive doesn’t mean it has stopped accessing your data. The only way to be 100% sure that your data is safe, is to use fencing to ensure that the node is truly offline before allowing the data to be accessed from another node.

      the only 100% safe machine is a disconnected machine

    2. Using IPMI as a power fencing device may seem like a good choice. However, if the IPMI shares power and/or network access with the host (such as most onboard IPMI controllers), a power or network failure will cause both the host and its fencing device to fail.

      IPMI as a fencing device is cool until the IPMI shares the same power (and/or network) access as the host it should fence

      (from STONITH to STIF: shoot thyself in the foot)

    1. Be mindful of the difference between local and cluster bindings. For example, if you bind the cluster-admin role to a user by using a local role binding, it might appear that this user has the privileges of a cluster administrator. This is not the case. Binding the cluster-admin to a user in a project grants super administrator privileges for only that project to the user.

      No matter if you bind a ClusterRole: if bound locally, it only grants the user privileges in a single project
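
      The shape of such a binding; the names are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding            # a *local* binding, even though it references a ClusterRole
metadata:
  name: local-cluster-admin
  namespace: my-project      # the grant applies only in this project
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: alice
```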

    1. you can define a Service without specifying a selector to match Pods

      The effect is that the EndpointSlice won't be created automatically. You must do it manually, pointing to whatever you want this Service to use

      (an EndpointSlice holds the references to the Pods matching the selector)
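
      A sketch of the pair, with hypothetical names and addresses:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: external-db          # no spec.selector: no automatic EndpointSlice
spec:
  ports:
    - port: 80
      targetPort: 9376
---
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: external-db-1
  labels:
    kubernetes.io/service-name: external-db   # links the slice to the Service
addressType: IPv4
ports:
  - port: 9376
endpoints:
  - addresses:
      - "10.1.2.3"           # any backend, even outside the cluster
```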

    2. The controller for that Service continuously scans for Pods that match its selector, and then makes any necessary updates to the set of EndpointSlices for the Service.

      Basically it routes TCP traffic from port 80 (of the Service) to port 9376 for each Pod matching the selector

  7. Jun 2024
    1. we need to be able to deploy the app to the cluster

      deploy it with Tilt

      (now, I'm unsure whether Tilt is used here only because it automatically rebuilds the code, or because it is actually useful to debug it. If the code is already deployed, can I ignore Tilt?)

    2. Notice that in the Dockerfile we also install the Delve debugger.

      important: the image built with the Dockerfile must have the Delve debugger

      (it's not just "Notice that", it should be highlighted more)

    3. The idea is to launch our application through a debug server and expose it so that we can connect remotely from our terminal or IDE to debug it as if we were running our application from our machine.

      the idea: launch the application in a way we can connect remotely

    4. The main goal is to ease the developer experience by helping with local continuous development and deployment of apps to local Kubernetes clusters. It does this by monitoring the source code and automatically building and pushing the deployments.

      Tilt rebuilds and pushes the deployments at each code change
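
      A minimal Tiltfile sketch (image and manifest names are hypothetical):

```python
# Tiltfile: rebuild the image on source changes and (re)apply the manifests
docker_build('example-app', '.')
k8s_yaml('deploy/app.yaml')
k8s_resource('example-app', port_forwards=2345)  # e.g. expose the Delve port
```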

  8. Mar 2024
    1. ClusterOperators and Operands. It also provides information about developing with OpenShift operators and the OpenShift release payload. When updating READMEs in core OpenShift repositories

      Are ClusterOperator and Core Operator the same thing? (yes, see below)

  9. Feb 2024
    1. you might want to check for functionality, readability, performance, security, documentation, testing, or compliance

      Review the code six or seven times, checking one property per pass, or try to keep every property in mind while reviewing it once?

  10. Apr 2023
    1. This folder is for things that seem interesting to me, but are either not interesting enough to motivate me to give them the attention I’d like to, and/or they’re not relevant enough to any topics I’m working on.

      so why keep it?

  11. Oct 2022
    1. make install run

      pay attention: it is both "install" and "run". The install target is most likely applying the manifest for the CR definition (config/crd/bases/cache.example.com_memcacheds.yaml)
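
      In an operator-sdk scaffolded project, the two targets typically look like this in the generated Makefile (a sketch, paths per the default scaffold):

```make
install: manifests kustomize ## Install CRDs into the cluster
	$(KUSTOMIZE) build config/crd | kubectl apply -f -

run: manifests generate fmt vet ## Run the controller locally
	go run ./main.go
```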

  12. Sep 2022
  13. Mar 2020
    1. For each process, the operating system maintains 2 integers with the bits corresponding to signal numbers

      how does the OS manage signals for each process?
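
      The two integers are the blocked-signal mask and the pending-signal set; a small POSIX-only sketch that pokes at both:

```python
import os
import signal

received = []
signal.signal(signal.SIGUSR1, lambda signum, frame: received.append(signum))

# Set the SIGUSR1 bit in the process's blocked-signal mask
signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})

os.kill(os.getpid(), signal.SIGUSR1)          # delivery is deferred...
print(signal.SIGUSR1 in signal.sigpending())  # True: ...the pending bit is set

# Clear the blocked bit: the pending signal is delivered and the handler runs
signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})
print(received == [signal.SIGUSR1])           # True
```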