102 Matching Annotations
  1. Aug 2023
    1. Partition Leader Balancing

      Balance the load on leaders. The preferred replica is the first replica of each partition, and preferred replicas are evenly distributed among brokers. Kafka tries to make the preferred replica the leader of its partition, thereby balancing the load.

    2. Replication

      Followers that lag behind the leader for longer than replica.lag.time.max.ms are removed from the ISR.
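As a rough sketch of that rule (illustrative Python, not broker code; the function name and data shapes are made up):

```python
# Illustrative sketch: drop followers whose last fully-caught-up time
# is older than replica.lag.time.max.ms. Names here are hypothetical.
REPLICA_LAG_TIME_MAX_MS = 30_000  # broker default is 30 s

def shrink_isr(isr, last_caught_up_ms, now_ms):
    """Keep only replicas that were fully caught up recently enough."""
    return [r for r in isr
            if now_ms - last_caught_up_ms[r] <= REPLICA_LAG_TIME_MAX_MS]

# Broker 3 last caught up 45 s ago, so it is dropped from the ISR.
isr = shrink_isr([1, 2, 3], {1: 100_000, 2: 95_000, 3: 60_000}, now_ms=105_000)
```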

    3. Kafka

      Follower failure is handled by removing lagging or failed followers from the ISR; the high watermark then advances based only on the remaining in-sync replicas.

    4. Protocol

      A record is committed once all in-sync replicas (the ISR) have replicated it; only then does the high watermark move forward.

    5. Data Plane

      Only data up to the high watermark is visible to consumers.

    6. Replication

      Leaders & followers: the leader commits records with an epoch (which marks a generation in the lifetime of the log). A follower sends a fetch request with an offset number to sync with the leader, and the leader responds with new records tagged with the epoch. On the next fetch request, the leader advances the high watermark based on the offset in that request (records up to n-1 are acknowledged), and in that response it passes the updated high watermark back to the follower.

    1. course

      How data is received: client -> socket -> network thread (lightweight) -> request queue -> I/O thread (CRC check & append to commit log [.index and .log files]) -> response queue -> network thread -> socket

      Purgatory map -> holds requests until the data is replicated

      How data is fetched: client -> socket -> network thread -> request queue -> I/O thread (fetch ranges are calculated with .index) -> response queue -> network thread -> socket

      Purgatory map -> the request waits here until data arrives, based on consumer config

      Kafka also uses zero copy, meaning data fetched from the file goes directly to the network thread; it is mostly served from the page cache

      if not cached -> may need to use tiered storage
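The produce path above can be modeled with two queues and a CRC check (a toy single-threaded sketch; the real broker uses pools of network and I/O threads plus purgatory, all omitted here):

```python
# Toy model of the produce path: network thread -> request queue ->
# I/O thread (CRC check, append to commit log) -> response queue.
import queue
import zlib

request_queue, response_queue, log = queue.Queue(), queue.Queue(), []

def network_thread_receive(payload: bytes):
    """Network thread: frame the request and enqueue it with its checksum."""
    request_queue.put((payload, zlib.crc32(payload)))

def io_thread_step():
    """I/O thread: verify the CRC, append to the log, enqueue a response."""
    payload, crc = request_queue.get()
    if zlib.crc32(payload) != crc:            # CRC check
        response_queue.put(("error", None))
        return
    log.append(payload)                       # append to commit log
    response_queue.put(("ok", len(log) - 1))  # offset of the appended record

network_thread_receive(b"hello")
io_thread_step()
status, offset = response_queue.get()
```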

    2. Inside the Apache Kafka Broker

      Consumer properties => analogous to the producer's: a wait time and a batch size (fetch.max.wait.ms and fetch.min.bytes).

    3. Kafka Broker

      Producer key properties => linger time -> linger.ms; batch size -> batch.size

      These determine the throughput and latency of Kafka producers.
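The interplay of the two settings can be sketched as a flush decision (a simplification of the producer's real record accumulator; the function and values are hypothetical):

```python
# Sketch of the producer's send decision driven by linger.ms and
# batch.size. Larger values of either trade latency for throughput.
def should_flush(batch_bytes, batch_size, waited_ms, linger_ms):
    """A batch is sent when it is full OR it has lingered long enough."""
    return batch_bytes >= batch_size or waited_ms >= linger_ms

# Full batch: send immediately even if linger.ms has not elapsed.
full = should_flush(batch_bytes=16_384, batch_size=16_384, waited_ms=0, linger_ms=5)
# Small batch, just arrived: hold it back to accumulate more records.
held = should_flush(batch_bytes=100, batch_size=16_384, waited_ms=1, linger_ms=5)
```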

  2. May 2023
    1. Today is 9th Feb. The oldest segment – segment 100 – still can’t be deleted by the 7-day topic retention policy, because the most recent message in that segment is only 5 days old. The result is that they can see messages dating back to 28th January on this topic, even though the 28th Jan is now 12 days ago. In a couple of days, all the messages in segment 100 will be old enough to exceed the retention threshold so that segment file will be deleted.

      retention.ms set to 7 days doesn't guarantee that you will only see topic messages from the last 7 days. Think of it as a threshold that the Kafka broker can use to decide when messages are eligible for being automatically deleted.
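A sketch of that deletion rule, assuming each segment is keyed by the timestamp of its newest record (names and shapes are made up):

```python
# A segment is eligible for deletion only once its *newest* record is
# older than retention.ms, so older records in the same segment survive.
def deletable_segments(segments, now_ms, retention_ms):
    """segments: list of (name, newest_record_timestamp_ms)."""
    return [name for name, newest_ts in segments
            if now_ms - newest_ts > retention_ms]

DAY = 86_400_000  # ms per day

# segment-100's newest message is only 5 days old, so the whole segment
# is kept, even though its oldest messages are 12 days old.
doomed = deletable_segments(
    [("segment-090", 92 * DAY), ("segment-100", 95 * DAY)],
    now_ms=100 * DAY,
    retention_ms=7 * DAY,
)
```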

  3. Mar 2023
  4. Feb 2023
    1. The Bitnami Apache Kafka docker image disables the PLAINTEXT listener for security reasons. You can enable the PLAINTEXT listener by adding the next environment variable, but remember that this configuration is not recommended for production.

      production note

  5. Dec 2022
  6. Aug 2022
    1. It uses a pull mechanism, rather than the push model of typical message queues.


  7. Mar 2022
  8. Sep 2021
    1. You don’t have to download them manually, as a docker-compose.yml will do that for you. Here’s the code, so you can copy it to your machine:

      Sample docker-compose.yml file to download both: Kafka and Zookeeper containers

    2. Kafka version 2.8.0 introduced early access to a Kafka version without Zookeeper, but it’s not ready yet for production environments.

      In the future, Zookeeper might be no longer needed to operate Kafka

    3. Kafka consumer — A program you write to get data out of Kafka. Sometimes a consumer is also a producer, as it puts data elsewhere in Kafka.

      Simple Kafka consumer terminology

    4. Kafka producer — An application (a piece of code) you write to get data to Kafka.

      Simple producer terminology

    5. Kafka topic — A category to which records are published. Imagine you had a large news site — each news category could be a single Kafka topic.

      Simple Kafka topic terminology

    6. Kafka broker — A single Kafka Cluster is made of Brokers. They handle producers and consumers and keeps data replicated in the cluster.

      Simple Kafka broker terminology

    7. Kafka — Basically an event streaming platform. It enables users to collect, store, and process data to build real-time event-driven applications. It’s written in Java and Scala, but you don’t have to know these to work with Kafka. There’s also a Python API.

      Simple Kafka terminology

  9. Oct 2020
    1. User topics must be created and manually managed ahead of time

      Javadoc says: "should be created".

      The reason is: Auto-creation of topics may be disabled in your Kafka cluster. Auto-creation automatically applies the default topic settings such as the replication factor. These default settings might not be what you want for certain output topics (e.g., auto.create.topics.enable=true in the Kafka broker configuration).

  10. Jun 2020
  11. Apr 2020
  12. Feb 2020
  13. Jul 2019
    1. Macksey is also responsible for one of his first short films. “When I was at Hopkins, there was no film program. We talked to Dick [about making a short film] and he said ‘let’s do it,’ and we ended up doing a movie in which he has a small part.”

      The short black and white film of just a few minutes was called Fratricide and is based on the Franz Kafka story A Fratricide.

      <small>Caleb Deschanel on the set of Fratricide. This may likely have been his first DP job circa ’66 while a student at Johns Hopkins. Photo courtesy of classmate Henry James Korn.</small>

  14. Feb 2018
    1. This minimal, yet poignant presence is reflected in the brick work—Kafka’s novel showcasing how a small idea can have a monumental presence.

      love it!

    1. Now he stood there naked.

      Why naked? How do you read this little ritual of disrobing? What might it have to do with the comedy that happens in the preceding paragraph?

    2. his women

      Third or fourth mention of the Commandant's 'women.' Connection to 'Honour your superiors'? Objectification of marginalized groups.

    3. I usually kneel down at this point and observe the phenomenon.

      The fetishistic obsession with inscription is particularly disturbing here.

  15. Jan 2018
    1. Guilt is always beyond a doubt.


    2. “It would be useless to give him that information. He experiences it on his own body.

      But hasn't the Condemned Man already experienced this subjection and degradation in his body? (like a vandalized house) The inscription of it on his body simply embodies what has already been true.

    3. “That’s cotton wool?” asked the Traveler and bent down. “Yes, it is,” said the Officer smiling, “feel it for yourself.”

      Kafka keeps bringing our attention to the cotton, perhaps to ensure we recognize the historical implications?

    4. harrow

      Common term used in farming/planting.

    5. epaulettes

      According to Google, "an ornamental shoulder piece on an item of clothing, typically on the coat or jacket of a military uniform"

    6. the administration of the colony was so self-contained that even if his successor had a thousand new plans in mind, he would not be able to alter anything of the old plan, at least not for several years.


    7. vacant-looking man with a broad mouth and dilapidated hair and face

      The descriptors "Vacant-looking" and "dilapidated" summon up imagery of haunted houses and manors left in ruin rather than people. These terms are primarily used to describe things, not people.

      Why then is our "Condemned" an empty house? What has pushed him from subject to object in this way?

    8. Officer to the Traveler,

      Officer, Traveler, Condemned. Everyone is defined solely by the roles that they inhabit.

    9. Then the Traveler heard a cry of rage from the Officer.

      How does affect work in this tale? What kinds of feelings are evoked in whom by what kinds of stimuli? What do these eruptions of feeling tell us about the unspoken value system that undergirds this society?

    10. That gave rise to certain technical difficulties with fastening the needles securely, but after several attempts we were successful. We didn’t spare any efforts. And now, as the inscription is made on the body, everyone can see through the glass. Don’t you want to come closer and see the needles for yourself.”

      Why glass? Given that the Apparatus is a mere tool, an agent of "justice," why such pains to make its workings visible? Why talk about it so much?

    11. The Traveler wanted to raise various questions, but after looking at the Condemned Man he merely asked, “Does he know his sentence?” “No,” said the Officer. He wished to get on with his explanation right away, but the Traveler interrupted him: “He doesn’t know his own sentence?” “No,” said the Officer once more. He then paused for a moment, as if he was asking the Traveler for a more detailed reason for his question, and said, “It would be useless to give him that information. He experiences it on his own body.”

      How do you read this crucial moment? Who knows what in this story, and how does Kafka exploit the lack of symmetry between Commandant, Officer, Traveler, Condemned, and so on?

    12. “He was indeed,” said the Officer, nodding his head with a fixed and thoughtful expression. Then he looked at his hands, examining them. They didn’t seem to him clean enough to handle the diagrams. So he went to the bucket and washed them again. Then he pulled out a small leather folder and said, “Our sentence does not sound severe. The law which a condemned man has violated is inscribed on his body with the harrow. This Condemned Man, for example,” and the Officer pointed to the man, “will have inscribed on his body, ‘Honour your superiors.’”

      Alas, the double entendre of "sentence" as a grammatical and legal entity at once is not active in German, but the slippage certainly fits here!

    13. “However,” the Officer said, interrupting himself, “I’m chattering, and his apparatus stands here in front of us. As you see, it consists of three parts. With the passage of time certain popular names have been developed for each of these parts. The one underneath is called the bed, the upper one is called the inscriber, and here in the middle, this moving part is called the harrow.” “The harrow?” the Traveler asked. He had not been listening with full attention. The sun was excessively strong, trapped in the shadowless valley, and one could hardly collect one’s thoughts. So the Officer appeared to him all the more admirable in his tight tunic weighed down with epaulettes and festooned with braid, ready to go on parade, as he explained the matter so eagerly and, while he was talking, adjusted screws here and there with a screwdriver.

      What's the effect of Kafka's use of abstraction here? Those who know his other works are perhaps used to this stylistic feature, but why the abstract titles/names, from Commandant to Traveler to apparatus?

    14. Of course, interest in the execution was not very high, not even in the penal colony itself.

      What's the tone of this story? Why does it matter that no one is interested in the execution, including the condemned?

  16. Nov 2017
    1. SubscribePattern allows you to use a regex to specify topics of interest

      This can remove the need to reload the Kafka writers in order to consume messages from new topics.

      regex - "topic-ua-*"

    2. The cache for consumers has a default maximum size of 64. If you expect to be handling more than (64 * number of executors) Kafka partitions, you can change this setting via spark.streaming.kafka.consumer.cache.maxCapacity.

      You might need this for keeping track of all partitions consumed.

  17. Jul 2017
    1. In distributed mode, you start many worker processes using the same group.id and they automatically coordinate to schedule execution of connectors and tasks across all available workers.

      Distributed workers.


    2. Connectors and tasks are logical units of work and must be scheduled to execute in a process. Kafka Connect calls these processes workers and has two types of workers: standalone and distributed.

      Workers = JVM processes

    1. When you are starting your Kafka broker you can define a set of properties in the conf/server.properties file. This file is just a key-value property file. One of the properties is auto.create.topics.enable; if it is set to true (the default), Kafka will create a topic automatically when you send a message to a non-existing topic. All config options you can find here. Imho, a simple rule for creating topics is the following: the number of replicas must not exceed the number of nodes that you have, and the number of partitions should be a multiple of the number of nodes in your cluster. For example: if you have a 9-node cluster, your topic could have 9 partitions and 9 replicas, or 18 partitions and 9 replicas, or 36 partitions and 9 replicas, and so on.

      Number of replicas = #replicas; number of nodes = #nodes; number of partitions = #partitions

      #replicas <= #nodes

      #partitions = k × #nodes

    1. ab -t 15 -k -T "application/vnd.kafka.binary.v1+json" -p postfile http://localhost:8082/topics/test

      ab benchmark

  18. Jun 2017
    1. in sync replicas (ISRs) should be exactly equal to the total number of replicas.

      ISRs are a very important metric

    2. Kafka metrics can be broken down into three categories: Kafka server (broker) metrics, producer metrics, and consumer metrics.

      3 Metrics:

      • Broker
      • Producer (Netty)
      • Consumer (SECOR)
    1. "isr" is the set of "in-sync" replicas.

      ISRs are pretty important: when nodes go down, you will see replicas drop out of the ISR and rejoin later once they catch up.

    1. You measure the throughput that you can achieve on a single partition for production (call it p) and consumption (call it c). Let’s say your target throughput is t.

      t = target throughput; p = single-partition production throughput; c = single-partition consumption throughput
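From those definitions, the minimum partition count is max(t/p, t/c); a small sketch:

```python
# Partition-sizing rule from the quote: you need enough partitions to
# reach the target throughput on both the produce and the consume side.
import math

def partitions_needed(t, p, c):
    """t: target throughput; p, c: per-partition produce/consume rates."""
    return max(math.ceil(t / p), math.ceil(t / c))

# e.g. target 100 MB/s, 10 MB/s produce and 20 MB/s consume per
# partition -> production is the bottleneck, so 10 partitions.
n = partitions_needed(100, 10, 20)
```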

    1. Messages are immediately written to the filesystem when they are received. Messages are not deleted when they are read but retained with some configurable SLA (say a few days or a week)
    1. ZooKeeper snapshots can be one such a source of concurrent writes, and ideally should be written on a disk group separate from the transaction log.

      ZooKeeper's snapshots are one source of concurrent writes; ideally they go on a disk group separate from the transaction log.

    2. If you do end up sharing the ensemble, you might want to use the chroot feature. With chroot, you give each application its own namespace.

      chroot jails each application's ZooKeeper namespace from the other apps

    1. In merced, we used the low-level simple consumer and wrote our own work dispatcher to get precise control.

      difference between merced and secor

    1. A better alternative is at least once message delivery. For at least once delivery, the consumer reads data from a partition, processes the message, and then commits the offset of the message it has processed. In this case, the consumer could crash between processing the message and committing the offset and when the consumer restarts it will process the message again. This leads to duplicate messages in downstream systems but no data loss.

      This is what SECOR does.

    2. no data loss will occur as long as producers and consumers handle this possibility and retry appropriately.

      Retries should be built into the consumer and producer code. If the leader for a partition fails, you will see a LeaderNotAvailableException.

    3. By electing a new leader as soon as possible, messages may be dropped, but we will minimize downtime, since any new machine can be leader.

      Two scenarios to get a leader back: 1) wait to bring the old leader back online, or 2) elect the first node that comes back up. In the second scenario, if that replica was behind the old leader, everything written between the moment the replica went down and the moment the leader went down is lost.

      So there is a trade-off between availability and consistency (durability).
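The trade-off can be sketched as an election policy (illustrative only; real Kafka gates the second branch behind the broker setting unclean.leader.election.enable):

```python
# Sketch of the availability-vs-consistency choice in leader election.
def elect_leader(alive_replicas, isr, allow_unclean):
    in_sync_alive = [r for r in alive_replicas if r in isr]
    if in_sync_alive:
        return in_sync_alive[0]      # consistent: no committed data lost
    if allow_unclean and alive_replicas:
        return alive_replicas[0]     # available: may drop recent records
    return None                      # unavailable until an ISR member returns

# Only a lagging replica (3) is alive; the ISR members (1, 2) are down.
leader = elect_leader(alive_replicas=[3], isr=[1, 2], allow_unclean=True)
```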

    4. keep in mind that these guarantees hold as long as you are producing to one partition and consuming from one partition.

      This is very important: a 1-to-1 mapping between writer and reader per partition. If you have multiple producers writing to a partition, or multiple consumers reading from it, the ordering guarantees no longer hold.

    1. On every received heartbeat, the coordinator starts (or resets) a timer. If no heartbeat is received when the timer expires, the coordinator marks the member dead and signals the rest of the group that they should rejoin so that partitions can be reassigned. The duration of the timer is known as the session timeout and is configured on the client with the setting session.timeout.ms. 

      Time-to-live for consumers: if a heartbeat doesn't reach the coordinator within this duration, the coordinator redistributes the partitions to the remaining consumers in the consumer group.
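A toy version of the coordinator's timer check (names and data shapes are illustrative):

```python
# A member is considered dead once no heartbeat has arrived within
# session.timeout.ms; its partitions are then reassigned to the rest.
SESSION_TIMEOUT_MS = 10_000

def dead_members(last_heartbeat_ms, now_ms):
    """Return the set of members whose session timer has expired."""
    return {m for m, ts in last_heartbeat_ms.items()
            if now_ms - ts > SESSION_TIMEOUT_MS}

# c2's last heartbeat was 20 s ago -> marked dead, group rebalances.
dead = dead_members({"c1": 95_000, "c2": 80_000}, now_ms=100_000)
```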

    2. The high watermark is the offset of the last message that was successfully copied to all of the log’s replicas.

      High Watermark: messages copied over to log replicas
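Equivalently, the high watermark is the minimum log-end offset across the log's replicas; a minimal sketch:

```python
# Everything below the high watermark exists on every replica of the
# log, so only that prefix is exposed to consumers.
def high_watermark(log_end_offsets):
    """log_end_offsets: next offset to be written on each replica."""
    return min(log_end_offsets)

# Leader has 15 records, slowest follower has 12 -> consumers see 12.
hw = high_watermark([15, 12, 14])
```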

    3. Kafka new Client which uses a different protocol for consumption in a distributed environment.

    4. Kafka scales topic consumption by distributing partitions among a consumer group, which is a set of consumers sharing a common group identifier.

      Topic consumption is distributed among the consumers of a consumer group.

    1. Kafka consumer offset management protocol to keep track of what’s been uploaded to S3

      Consumers keep track of what has been written and where they left off by looking at Kafka consumer offsets rather than checking S3, since S3 is an eventually consistent system.

    2. Data lost or corrupted at this stage isn’t recoverable so the greatest design objective for Secor is data integrity.

      data loss in S3 is being mitigated.

    1. An index can potentially store a large amount of data that can exceed the hardware limits of a single node. For example, a single index of a billion documents taking up 1TB of disk space may not fit on the disk of a single node or may be too slow to serve search requests from a single node alone.

      Indexes may outgrow a single node's disk. Hence you shard an index across nodes to get the most out of your instances.

    1. incidents are an unavoidable reality of working with distributed systems, no matter how reliable. A prompt alerting solution should be an integral part of the design,

      see how it can hook into the current logging mechanism

    2. Consumers in this group are designed to be dead-simple, performant, and highly resilient. Since the data copied verbatim, no code upgrades are required to support new message types.

      exactly what we want

  19. May 2017
    1. With Flume & FlumeNG, and a File channel, if you lose a broker node you will lose access to those events until you recover that disk.

      In Flume you lose access to events if the disk is down. This is very bad for our use case.

    1. The Kafka cluster retains all published records—whether or not they have been consumed—using a configurable retention period. For example, if the retention policy is set to two days, then for the two days after a record is published, it is available for consumption, after which it will be discarded to free up space. Kafka's performance is effectively constant with respect to data size so storing data for a long time is not a problem.

      Irrespective of whether the consumer has consumed a message, the message is kept in Kafka for the entire retention period.

      You can have two or more consumer groups: 1 -> real time 2 -> back up consumer group

    2. Kafka for Stream Processing

      Could be something we can consider for directing data from a raw log to a tenant based topic.

    3. replication factor N, we will tolerate up to N-1 server failures without losing any records

      Replication factor means the number of nodes/brokers that can go down before we start losing data.

      So if you have a replication factor of 6 on an 11-node cluster, you are fault tolerant until 5 of the brokers holding a partition's replicas go down. After that point you are going to lose data for that partition.

    4. Messages sent by a producer to a particular topic partition will be appended in the order they are sent. That is, if a record M1 is sent by the same producer as a record M2, and M1 is sent first, then M1 will have a lower offset than M2 and appear earlier in the log.

      ordering is guaranteed.

    5. Consumers label themselves with a consumer group name, and each record published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes or on separate machines.

      kafka takes care of the consumer groups. Just create one Consumer Group for each topic.

    6. The partitions of the log are distributed over the servers in the Kafka cluster with each server handling data and requests for a share of the partitions.

      Partitions of the log are defined on a per-topic basis

    1. The first limitation is that each partition is physically represented as a directory of one or more segment files. So you will have at least one directory and several files per partition. Depending on your operating system and filesystem this will eventually become painful. However this is a per-node limit and is easily avoided by just adding more total nodes in the cluster.

      The total number of topics supported depends on the number of partitions per topic, since the real limit is on partitions per node.

      partition = directory of 1 or more segment files. This is a per-node limit.

    1. the number of partitions -- there's no real "formula" other than this: you can have no more parallelism than you have partitions.

      This is an important thing to keep in mind. If we need massive parallelism we need to have more partitions.

    1. The offset defines the ordering of messages as an immutable sequence. Kafka maintains this message ordering for you.

      Kafka maintains the ordering for you...

    1. replication-factor 3

      With n-1 = 2, you can lose up to 2 nodes without losing data; if all 3 replicas go down, you will lose data.

    2. For a topic with replication factor N, we will tolerate up to N-1 server failures without losing any records committed to the log.

      E.g., for a given topic there are 11 brokers/servers and the replication factor is 6. That means the topic will start losing data if more than 5 of the brokers holding its replicas go down.

    3. The way consumption is implemented in Kafka is by dividing up the partitions in the log over the consumer instances so that each instance is the exclusive consumer of a "fair share" of partitions at any point in time. This process of maintaining membership in the group is handled by the Kafka protocol dynamically. If new instances join the group they will take over some partitions from other members of the group; if an instance dies, its partitions will be distributed to the remaining instances.

      The coolest feature: all you need to do is add new consumers to a consumer group to auto-scale per topic

    4. Consumers label themselves with a consumer group name

      maintain separate consumer group per tenant basis. Helps to scale out when we have more load per tenant.

    5. The producer is responsible for choosing which record to assign to which partition within the topic.

      The producer chooses which partition of the topic each record is published to

    6. individual partition must fit on the servers that host it

      Each partition must fit on the server that hosts it.

    7. the only metadata retained on a per-consumer basis is the offset or position of that consumer in the log. This offset is controlled by the consumer: normally a consumer will advance its offset linearly as it reads records, but, in fact, since the position is controlled by the consumer it can consume records in any order it likes.

      The partition offset is controlled by the consumer and persisted, so that if the consumer goes down nothing breaks.

    8. the retention policy is set to two days, then for the two days after a record is published,

      Might have to tweak this based on the persistence level we want to keep.