131 Matching Annotations
  1. Dec 2023
    1. Shared Process Pool

      We can update the nested for-loop to use a shared process pool accessed from each task.

      Firstly, we can create a manager, which will create a server process for hosting a centralized process pool.

      We can then create a hosted process pool and get a proxy object for the pool that we can share around with tasks in other processes.

      ```
      ...
      # create manager
      with multiprocessing.Manager() as manager:
          # create the shared pool
          pool = manager.Pool(100)
      ```

      If we just tried to create a process pool and share it with workers, we would get an error indicating that the Pool object cannot be pickled.

      This can be achieved by changing our task/subtask/etc functions to take the pool as an argument and using the starmap() method instead of the map() method, since starmap() supports multiple arguments.

      This approach requires prior knowledge of the total number of tasks that will be issued to the pool in the hierarchy, and the configuration of a pool large enough to support all of those tasks concurrently.

      This is bad news because we very likely have more CPU-bound tasks than CPU cores available and would prefer to queue up tasks and have them executed one per CPU core.

      As such, this design is fragile.
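
      The snippets above can be stitched into a rough end-to-end sketch. This is an illustrative reconstruction, not the article's listing: the task and subtask functions, counts, and the pool size of 100 are assumptions, and the pattern is most reliable with the "fork" start method on Linux (with "spawn", the task functions should live in an importable module).

      ```
      # hedged sketch: a Manager-hosted pool shared with tasks via a proxy object
      # (function names and counts are assumptions, not from the original article)
      from multiprocessing import Manager

      # second-level task, receives the pool proxy as an argument
      def subtask(arg, pool):
          print(f'\t>>subtask {arg}')

      # top-level task, issues subtasks to the same shared pool via the proxy
      def task(arg, pool):
          print(f'>task {arg}')
          pool.starmap(subtask, [(i, pool) for i in range(3)])

      if __name__ == '__main__':
          # create the manager, which hosts a centralized pool in a server process
          with Manager() as manager:
              # pool sized so every task and subtask can occupy a worker at once
              pool = manager.Pool(100)
              # issue the top-level tasks, passing the pool proxy to each
              pool.starmap(task, [(i, pool) for i in range(5)])
      ```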

    2. Running nested loops in parallel using process pools is more challenging than using thread pools.

      The reason is twofold.

      Firstly, workers in a process pool are daemon processes and are unable to create their own child processes.

      This means that tasks executed by workers in a process pool cannot create their own process pool to execute subtasks. Doing so will result in an error: AssertionError: daemonic processes are not allowed to have children

      Secondly, processes do not have shared memory.

      Instead, at best they have a copy of the global variables from the parent process, depending on the start method.

      Therefore, in order to share a process pool across all tasks/subtasks/etc in the hierarchy we must simulate a shared memory, such as by using a centralized process pool in a server process via a multiprocessing.Manager object.

      We can explore this with a multiprocessing.Pool, although a concurrent.futures.ProcessPoolExecutor would be very similar.

    3. Separate Thread Pools

      An alternate solution to making a nested for-loop concurrent is to create a separate pool of workers at each level of the hierarchy of tasks.

      This is easily done with thread pools.

      ```
      # SuperFastPython.com
      # example of a nested for-loop with separate thread pools
      import time
      import multiprocessing.pool

      # third level task
      def task3(arg):
          # report message
          print(f'\t\t>>>task3 {arg}')
          # do work
          time.sleep(1)

      # second level task
      def task2(arg):
          # report message
          print(f'\t>>task2 {arg}')
          # do work
          time.sleep(1)
          # create pool
          with multiprocessing.pool.ThreadPool(3) as pool:
              # issue third level tasks
              pool.map(task3, range(2))

      # top level task
      def task1(arg):
          # report message
          print(f'>task1 {arg}')
          # do work
          time.sleep(1)
          # create pool
          with multiprocessing.pool.ThreadPool(3) as pool:
              # issue second level tasks and wait
              pool.map(task2, range(3))

      # protect the entry point
      if __name__ == '__main__':
          # create pool
          with multiprocessing.pool.ThreadPool(5) as pool:
              # issue top level tasks to pool and wait
              pool.map(task1, range(5))
      ```

      This approach overcomes the problem of needing to know the total number of tasks in the hierarchy, as we did in the first approach, but it means we have many more concurrent thread pools. Importantly, we have exactly enough concurrent workers to complete all tasks.

      This approach is fine, as long as the total number of concurrent tasks does not exceed the capability of the system. This may not be the case if a level in the hierarchy balloons.

      A modern system may only be able to support a maximum of a few thousand concurrent threads before running out of main memory and/or taxing the underlying operating system too much with context switching required between threads.

    4. Single ThreadPool With a Shared Queue (Unbounded)

      We can update the single thread pool with the shared queue so that the total number of tasks does not need to be known.

      This can be achieved by maintaining a shared thread-safe counter that is incremented for each task/subtask/subsubtask that is issued and is only decremented once a given task is completed and known not to issue a subtask.

      Once the counter is zero, we know there are no further tasks running and no further tasks to be issued, meaning we can stop consuming items from the queue and close the thread pool.

      The downside is that both the queue and the counter must be shared with each task. This could be abstracted away with a cleverer design (for example: pass them as arguments to a worker initialization function, store them in thread-local or global variables, and have all tasks issue subtasks through a single shared helper function).

      We can develop our own thread-safe counter, but for brevity, we can use a threading.Semaphore class.

      ```
      ...
      # shared counter for total tasks
      global counter
      counter = threading.Semaphore(5)
      ```

      The Semaphore can be incremented by calling the release() method and specifying the “n” argument as the number of subtasks by which to increment the counter.

      The Semaphore can be decremented by calling the acquire() method.

      We can check if the counter is zero by acquiring the internal condition variable and checking the internal counter value.

      Therefore, our loop in the main thread can loop forever until we explicitly break the loop when the counter is zero.

      ```
      ...
      # loop over all known tasks
      while True:
          # check for no further tasks
          with counter._cond:
              if not counter._value:
                  break
      ```

      This is fragile because we are accessing private members in the stdlib. A custom counter class is preferred if you are implementing this for a production system.

      It is possible that we may try to get a task from the queue when there are no further tasks to consume, such as when the final task is running but not yet done. The queue is empty in this case, but the counter is non-zero.

      To overcome this case we can get items from the queue with a small timeout, allowing us to give up and check the counter again.

      ```
      ...
      # consume a task
      try:
          task, args = task_queue.get(timeout=0.5)
      except queue.Empty:
          continue
      ```

      We can decrement the counter automatically after every task is completed using a callback function on the asynchronous task in the ThreadPool.

      We can define a callback function and have it decrement the counter.

      ```
      # callback on all tasks in the thread pool
      def callback(arg):
          global counter
          # decrement the counter
          counter.acquire()
      ```

      The callback can be attached to each task issued to the thread pool via the apply_async() method.

      ```
      ...
      # issue task to the thread pool
      async_result = pool.apply_async(task, args, callback=callback)
      ```

      Each task can be updated to declare the global counter (semaphore) variable as well as the queue.

      ```
      ...
      # declare the global queue and counter
      global task_queue, counter
      ```

      Then after subtasks and subsubtasks are issued to the queue, we can increment the counter.

      ```
      ...
      # increment the counter by the number of subtasks
      counter.release(n=3)
      ```
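
      Pulling those pieces together, here is a compact runnable sketch of the design described above. It is a reconstruction under assumptions, not the article's listing: the task names and the tree shape (5 tasks, 3 subtasks each, 2 sub-subtasks each) are made up, and it keeps the fragile private-member check on the Semaphore for brevity.

      ```
      # hedged sketch of the unbounded shared-queue design (names and tree shape assumed)
      import queue
      import threading
      from multiprocessing.pool import ThreadPool

      task_queue = queue.Queue()        # shared queue of (function, args) tuples
      counter = threading.Semaphore(5)  # counts issued-but-unfinished tasks (5 top-level tasks)

      # third-level task: a leaf, issues no further work
      def task3(arg):
          print(f'\t\t>>>task3 {arg}')

      # second-level task: issues two third-level tasks
      def task2(arg):
          print(f'\t>>task2 {arg}')
          counter.release(n=2)                 # count the subtasks before queueing them
          for i in range(2):
              task_queue.put((task3, (i,)))

      # top-level task: issues three second-level tasks
      def task1(arg):
          print(f'>task1 {arg}')
          counter.release(n=3)
          for i in range(3):
              task_queue.put((task2, (i,)))

      # decrement the counter each time a task finishes
      def callback(result):
          counter.acquire()

      if __name__ == '__main__':
          with ThreadPool(4) as pool:
              # seed the queue with the top-level tasks (already counted by Semaphore(5))
              for i in range(5):
                  task_queue.put((task1, (i,)))
              while True:
                  # stop once no tasks are queued or running (fragile private access)
                  with counter._cond:
                      if not counter._value:
                          break
                  # consume a task, giving up periodically to re-check the counter
                  try:
                      function, args = task_queue.get(timeout=0.5)
                  except queue.Empty:
                      continue
                  pool.apply_async(function, args, callback=callback)
              pool.close()
              pool.join()
      ```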

    5. Single ThreadPool With a Shared Queue

      We can make a nested for-loop concurrent using a single thread pool and a shared queue.

      You may recall that Python provides a thread-safe queue via the queue.Queue class.

      Each task can push its subtasks into the queue and finish, allowing workers in the thread pool to be released, rather than wait for the subtasks to be completed.

      Another thread, or the main thread, can consume task requests from the queue and issue them asynchronously to the thread pool, such as via the apply_async() method.

      This is functionally the same as sharing a ThreadPool object via a global variable and having tasks issue their subtasks directly.

      The important difference is that because workers are released immediately after subtasks are issued, we do not need to have one worker per task in the ThreadPool. We can have far fewer workers than tasks, without the risk of a deadlock.

      The downside, still, is that we must know how many tasks overall will be issued.

      The reason is that the thread responsible for issuing tasks to the thread pool needs to know when to stop consuming items from the queue and wait for all issued tasks to complete before terminating the program.

      The queue can be defined as a global variable and shared with all tasks at all levels. Tasks issue their subtasks to the queue as a tuple containing the function name and any arguments.

      The main thread creates the thread pool and then consumes a fixed number of tasks from the queue before stopping and waiting for all issued tasks to the pool to finish.

    6. Single Shared ThreadPool

      We can update the nested for-loop to use a single shared thread pool.

      This can be achieved by creating a thread pool as a global variable, then having each task and subtask function access the global variable and issue tasks to the same shared thread pool directly.

      It requires that the pool have enough capacity to execute all tasks.

      One approach is to have each task wait for its subtasks. This is a straightforward solution, although it has the downside of occupying a worker thread while it waits for all subtasks to complete. This could turn disastrous, resulting in a deadlock if more tasks/subtasks/subsubtasks are issued to the thread pool than there are worker threads.

      This approach can only be used if the total number of overall tasks/subtasks/etc is known and as many or more workers than tasks can be specified.
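
      A minimal sketch of this design, assuming a hierarchy of 5 tasks, 3 subtasks each, and 2 sub-subtasks each (50 tasks in total), so the pool is sized to hold every waiting and running task at once. The function names are illustrative, not taken from the article.

      ```
      # hedged sketch: one shared ThreadPool, each task waits on its subtasks
      from multiprocessing.pool import ThreadPool

      def task3(arg):
          print(f'\t\t>>>task3 {arg}')

      def task2(arg):
          print(f'\t>>task2 {arg}')
          # occupy this worker while the sub-subtasks run in other workers
          pool.map(task3, range(2))

      def task1(arg):
          print(f'>task1 {arg}')
          pool.map(task2, range(3))

      if __name__ == '__main__':
          # 5 + 15 + 30 = 50 workers: enough for every task in the hierarchy
          pool = ThreadPool(50)
          pool.map(task1, range(5))
          pool.close()
          pool.join()
      ```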

    7. Single Shared ThreadPool

      We can update the nested for-loop to use a single shared thread pool.

      This can be achieved by creating a thread pool as a global variable, then having each task and subtask function access the global variable and issue tasks to the same shared thread pool directly.

      It requires that the pool have enough capacity to execute all tasks.

    8. There are two main approaches we can use to make a nested for-loop concurrent.

      They are:

      1. Create a pool of workers at each level in the hierarchy.
      2. Share a pool of workers across the hierarchy.

      Approach 1: One Pool of Workers Per Level

      Each level in a nested for-loop can have its own pool of workers.

      That is, each task runs, does its work, creates a pool of workers, and issues the subtasks to the pool. If there is another level of subsubtasks, each of these would create its own pool of workers and issue its own tasks.

      This is suited to nested for-loops that have a large number of tasks to execute at a given level.

      The downside is the redundancy of having many pools of workers competing with each other. This is not a problem with thread pools, as we may have many thousands of concurrent threads, but process pools are typically limited to one worker per CPU core.

      As such, some tuning of the number of workers per pool may be required.

      Another downside of this approach is that when using process pools, child processes are typically daemonic and are unable to create their own child processes. This means that if a task executing in a child process tries to create its own pool of workers, it will fail with an error.

      As such, this approach may only be viable when working with thread pools, and even then, perhaps only in a nested loop with tasks and subtasks with many subtasks per task.

      Approach 2: Shared Pool of Workers Across Levels

      Another approach is to create one pool of workers and issue all tasks, subtasks, and subsubtasks to this pool.

      When using thread pools in one process, the pool can be shared with tasks and subtasks as a shared global variable, allowing tasks to be issued directly.

      When using process pools, things are more tricky. A centralized pool of workers can be created in a server process using a multiprocessing.Manager and the proxy objects for using the centralized server can be shared among all tasks and subtasks.

      An alternate design might be to use a shared queue. All tasks and subtasks may be placed onto the queue and a single consumer of tasks can retrieve items from the queue and issue them to the pool of workers.

      This is functionally the same, although it separates the concern of issuing tasks from how they are executed, potentially allowing the consumer to decide to use a thread pool or process pool based on the types of tasks issued to the queue.

    1. Comparison of Time Functions

      The time.get_clock_info(name) function can be used to report the technical details of each timer.

      The program below reports these details.

      ```
      # report the details of each time function
      from time import get_clock_info

      # time.time()
      print(get_clock_info('time'))

      # time.perf_counter()
      print(get_clock_info('perf_counter'))

      # time.monotonic()
      print(get_clock_info('monotonic'))

      # time.process_time()
      print(get_clock_info('process_time'))

      # time.thread_time()
      print(get_clock_info('thread_time'))
      ```

    2. Measure Execution Time With time.thread_time()

      The time.thread_time() function reports the time that the current thread has been executing.

      The time begins or is zero when the current thread is first created.

      Return the value (in fractional seconds) of the sum of the system and user CPU time of the current thread.

      It is an equivalent value to the time.process_time(), except calculated at the scope of the current thread, not the current process.

      This value is calculated as the sum of the system time and the user time.

      thread time = user time + system time

      The reported time does not include sleep time.

      This means if the thread is blocked by a call to time.sleep() or perhaps is suspended by the operating system, then this time is not included in the reported time. This is called a “thread-wide” or “thread-specific” time.
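
      A small illustration of that claim (the workload and sleep duration are arbitrary placeholders): thread_time() ignores the sleep, while perf_counter() includes it.

      ```
      import time

      start_cpu = time.thread_time()
      start_wall = time.perf_counter()

      sum(i * i for i in range(1_000_000))  # CPU-bound work: counted by thread_time()
      time.sleep(1)                         # sleeping: not counted by thread_time()

      print(f'thread_time : {time.thread_time() - start_cpu:.3f} seconds')
      print(f'perf_counter: {time.perf_counter() - start_wall:.3f} seconds')
      ```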

    3. Measure Execution Time With time.process_time()

      The time.process_time() function reports the time that the current process has been executing.

      The time begins or is zero when the current process is first created.

      Calculated as the sum of the system time and the user time:

      process time = user time + system time

      System time is the time the CPU spends executing system calls for the kernel (e.g. the operating system).

      User time is time spent by the CPU executing calls in the program (e.g. your code).

      When a program loops through an array, it is accumulating user CPU time. Conversely, when a program executes a system call such as exec or fork, it is accumulating system CPU time.

      The reported time does not include sleep time.

      This means if the process is blocked by a call to time.sleep() or perhaps is suspended by the operating system, then this time is not included in the reported time. This is called a “process-wide” time.

      As such, it only reports the time that the current process was executed since it was created by the operating system.

    4. Measure Execution Time With time.monotonic()

      The time.monotonic() function returns time stamps from a clock that cannot go backwards, as its name suggests.

      In mathematics, a monotonic function is one whose output only ever increases (or only ever decreases).

      This means that the result from the time.monotonic() function will never be before the result from a prior call.

      Return the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards.

      It is a high-resolution timestamp, although it is not relative to the epoch like time.time(). Instead, like time.perf_counter(), it uses a timer separate from the system clock.

      The time.monotonic() function has a lower resolution than the time.perf_counter() function.

      This means that values from the time.monotonic() function can be compared to each other, relatively, but not to the system clock.

      Like the time.perf_counter() function, time.monotonic() function is “system-wide”, meaning that it is not affected by changes to the system clock, such as updates or clock adjustments due to time synchronization.

      Like the time.perf_counter() function, the time.monotonic() function was introduced in Python version 3.3 with the intent of addressing the limitations of the time.time() function tied to the system clock, such as use in short-duration benchmarking.

      Monotonic clock (cannot go backward), not affected by system clock updates.

    5. Measure Execution Time With time.perf_counter()

      The time.perf_counter() function reports the value of a performance counter on the system.

      It does not report the time since epoch like time.time().

      Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide.

      The returned value, in seconds with fractional components (e.g. milliseconds and nanoseconds), provides a high-resolution timestamp.

      Calculating the difference between two timestamps from the time.perf_counter() allows high-resolution execution time benchmarking, e.g. in the millisecond and nanosecond range.

      The timestamp from the time.perf_counter() function is consistent, meaning that two durations can be compared relative to each other in a meaningful way.

      The time.perf_counter() function was introduced in Python version 3.3 with the intended use for short-duration benchmarking.

      The perf_counter() function was specifically designed to overcome the limitations of other time functions to ensure that the result is consistent across platforms and monotonic (always increasing).

      For accuracy, the timeit module uses the time.perf_counter() internally.
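
      The typical benchmarking pattern looks like the following (a generic sketch; the benchmarked statement is just a placeholder):

      ```
      import time

      start = time.perf_counter()
      data = [i ** 2 for i in range(1_000_000)]  # code being benchmarked
      duration = time.perf_counter() - start
      print(f'Took {duration:.6f} seconds')
      ```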

    6. Measure Execution Time With time.time()

      The time.time() function reports the number of seconds since the epoch (epoch is January 1st 1970, which is used on Unix systems and beyond as an arbitrary fixed time in the past) as a floating point number.

      The result is a floating point value, potentially offering fractions of a second (e.g. milliseconds), if the platform supports it.

      The time.time() function is not perfect.

      It is possible for a subsequent call to time.time() to return a value in seconds less than the previous value, due to rounding.

      Note: even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

    7. there are automatic ways to measure execution time, such as via the timeit module.
    8. There are 5 ways to measure execution time manually in Python using the time module, they are:
      1. Use time.time()
      2. Use time.perf_counter()
      3. Use time.monotonic()
      4. Use time.process_time()
      5. Use time.thread_time()

      Note, each function returns a time in seconds and has an equivalent function that returns the time in nanoseconds, e.g. time.time_ns(), time.perf_counter_ns(), time.monotonic_ns(), time.process_time_ns() and time.thread_time_ns().

      Recall that there are 1,000 nanoseconds in one microsecond, 1,000 microseconds in 1 millisecond, and 1,000 milliseconds in one second. This highlights that the nanosecond versions of the functions are for measuring very short time scales indeed.
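
      For example (a trivial sketch), the nanosecond variants are called the same way and return integers:

      ```
      import time

      print(time.time_ns())          # integer nanoseconds since the epoch
      print(time.perf_counter_ns())  # integer nanoseconds from the performance counter
      ```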

    9. It is critical to be systematic when benchmarking code.

      The first step is to record how long an unmodified version of the program takes to run. This provides a performance baseline against which all other versions of the program must be compared. If we are adding concurrency, then the unmodified version of the program will typically perform tasks sequentially, e.g. one-by-one.

      The modified versions of the program must perform better than the unmodified version. If they do not, they are not improvements and should not be adopted.

    10. Benchmarking is the practice of comparing business processes and performance metrics to industry bests and best practices from other companies. Dimensions typically measured are quality, time and cost.
    11. Benchmarking Python code refers to comparing the performance of one program to variations of the program.
    1. Error 5: Assuming Race Conditions and Deadlocks are Impossible

      Concurrent programming has the hazard of concurrency-specific failure modes.

      This includes problems such as race conditions and deadlocks.

      A race condition involves two or more units of concurrency executing the same critical section at the same time and leaving a resource or data in an inconsistent or unexpected state. This can lead to data corruption and data loss.

      A deadlock is when a unit of concurrency waits for a condition that can never occur, such as for a resource to become available.

      Many Python developers believe these problems are not possible with coroutines in asyncio.

      The reason is that only one coroutine can run within the event loop at any one time.

      It is true that only one coroutine can run at a time.

      The problem is, coroutines can suspend and resume and may do so while using a shared resource or shared variable.

      Without protecting critical sections, race conditions can occur in asyncio programs.

      Without careful management of synchronization primitives, deadlocks can occur in asyncio programs.

      As such, it is important that asyncio programs are created to ensure coroutine-safety, a concept similar to thread safety and process-safety, applied to coroutines.

    2. Error 4: Exiting the Main Coroutine Too Early

      A major point of confusion in asyncio programs is not giving tasks enough time to complete.

      We can schedule many coroutines to run independently within an asyncio program via the asyncio.create_task() method.

      The main coroutine, the entry point for the asyncio program, can then carry on with other activities.

      If the main coroutine exits, then the asyncio program will terminate.

      The program will terminate even if there are one or many coroutines running independently as tasks.

      This can catch you off guard.

      You may issue many tasks and then allow the main coroutine to resume, expecting all issued tasks to complete in their own time.

      Instead, if the main coroutine has nothing else to do, it should wait on the remaining tasks.

      This can be achieved by first getting a set of all running tasks via the asyncio.all_tasks() function, removing the current task from this set, then waiting on the remaining tasks via the asyncio.wait() function.

      ```
      ...
      # get a set of all running tasks
      all_tasks = asyncio.all_tasks()
      # get the current task
      current_task = asyncio.current_task()
      # remove the current task from the list of all tasks
      all_tasks.remove(current_task)
      # suspend until all tasks are completed
      await asyncio.wait(all_tasks)
      ```

    3. Error 4: Exiting the Main Coroutine Too Early

      a

    4. Error 3: Using the Low-Level Asyncio API

      A big problem for beginners is using the wrong asyncio API.

      This is common for a number of reasons.

      • The API has changed a lot with recent versions of Python.
      • The API docs page makes things confusing, showing both APIs.
      • Examples elsewhere on the web mix up using the different APIs.

      Using the wrong API makes things more verbose (e.g. more code), more difficult, and way less understandable.

      Asyncio offers two APIs:

      • High-level API for application developers (us)
      • Low-level API for framework and library developers (not us)

      The lower-level API provides the foundation for the high-level API and includes the internals of the event loop, transport protocols, policies, and more.

      We should almost always stick to the high-level API. We absolutely must stick to the high-level API when getting started.

      We may dip into the low-level API to achieve specific outcomes on occasion.

      If you start getting a handle on the event loop or use a “loop” variable to do things, you are doing it wrong.

      I am not saying don’t learn the low-level API. Go for it. It’s great. Just don’t start there.

      Drive asyncio via the high-level API for a while. Develop some programs. Get comfortable with asynchronous programming and running coroutines at will. Then later, dip in and have a look around.

    5. Error 2: Not Letting Coroutines Run in the Event Loop

      If a coroutine is not run, you will get a runtime warning as follows: sys:1: RuntimeWarning: coroutine 'custom_coro' was never awaited

      This will happen if you create a coroutine object but do not schedule it for execution within the asyncio event loop.

      For example, you may attempt to call a coroutine from a regular Python program:

      ```
      ...
      # attempt to call the coroutine
      custom_coro()
      ```

      This will not call the coroutine. Instead, it will create a coroutine object.

      ```
      ...
      # create a coroutine object
      coro = custom_coro()
      ```

      If you do not allow this coroutine to run, you will get a runtime error.

      You can let the coroutine run, as we saw in the previous section, by starting the asyncio event loop and passing it the coroutine object.

      ```
      ...
      # create a coroutine object
      coro = custom_coro()
      # run a coroutine
      asyncio.run(coro)
      ```

      Or, on one line in a compound statement:

      ```
      ...
      # run a coroutine
      asyncio.run(custom_coro())
      ```

      If you get this error within an asyncio program, it is because you have created a coroutine and have not scheduled it for execution.

      This can be achieved using the await expression.

      ```
      ...
      # create a coroutine object
      coro = custom_coro()
      # suspend and allow the other coroutine to run
      await coro
      ```

      Or, you can schedule it to run independently as a task.

      ```
      ...
      # create a coroutine object
      coro = custom_coro()
      # schedule the coro to run as a task independently
      task = asyncio.create_task(coro)
      ```

    6. Error 1: Trying to Run Coroutines by Calling Them

      The most common error encountered by beginners to asyncio is calling a coroutine like a function.

      For example, we can define a coroutine using the “async def” expression:

      ```
      # custom coroutine
      async def custom_coro():
          print('hi there')
      ```

      **Calling a coroutine like a function will not execute the body of the coroutine.**

      ```
      ...
      # error attempt at calling a coroutine like a function
      custom_coro()
      ```

      Instead, it will create a coroutine object.

      This object can then be awaited within the asyncio runtime, e.g. the event loop.

      We can start the event loop to run the coroutine using the asyncio.run() function.

      ```
      ...
      # run a coroutine
      asyncio.run(custom_coro())
      ```

      Alternatively, we can suspend the current coroutine and schedule the other coroutine using the “await” expression.

      ```
      ...
      # schedule a coroutine
      await custom_coro()
      ```

    7. There are common errors experienced by beginners when getting started with asyncio in Python.

      They are:

      1. Trying to run coroutines by calling them.
      2. Not letting coroutines run in the event loop.
      3. Using the asyncio low-level API.
      4. Exiting the main coroutine too early.
      5. Assuming race conditions and deadlocks are impossible.
    1. And this is where the asynchronicity comes in: The "results" list does not actually contain the results from running our functions. Instead, it contains "futures" which are similar to the JavaScript idea of "promises." In order to allow our program to continue running, we get back these futures that represent a placeholder for a value. If we try to print the future, depending on whether it's finished running or not, we'll either get back a state of "pending" or "finished." Once it's finished we can get the return value (assuming there is one) using var.result().
    2. The difference between asyncio.sleep() and time.sleep() is that asyncio.sleep() is non-blocking.
    3. The calls don't actually get made until we schedule them with await asyncio.gather(*tasks). This runs all of the tasks in our list and waits for them to finish before continuing with the rest of our program.
    4. programming with asyncio pretty much enforces* using some sort of "main" function.

      This is because you need to use the "async" keyword in order to use the "await" syntax, and the "await" syntax is the only way to actually run other async functions.

    5. async for (not used here) iterates over an asynchronous stream.
    6. async with allows awaiting async responses and file operations.
    7. When should you use multiprocessing vs asyncio or threading?
      1. Use multiprocessing when you need to do many heavy calculations and you can split them up.
      2. Use asyncio or threading when you're performing I/O operations -- communicating with external resources or reading/writing from/to files.
      3. Multiprocessing and asyncio can be used together, but a good rule of thumb is to fork a process before you thread/use asyncio instead of the other way around -- threads are relatively cheap compared to processes.
    8. Is it possible to combine asyncio with multiprocessing?

      We can do that too.

    9. What's the difference between concurrency and parallelism?

      A concurrent process performs multiple tasks at the same time whether they're being diverted total attention or not; a parallel process is physically performing multiple tasks all at the same time.

    10. What is parallelism?

      Parallelism is very-much related to concurrency. In fact, parallelism is a subset of concurrency: whereas a concurrent process performs multiple tasks at the same time whether they're being diverted total attention or not, a parallel process is physically performing multiple tasks all at the same time.

    11. When should you use threading, and when should you use asyncio?

      When you're writing new code, use asyncio. If you need to interface with older libraries or those that don't support asyncio, you might be better off with threading.

    12. Why is the asyncio method always a bit faster than the threading method?

      This is because when we use the "await" syntax, we essentially tell our program "hold on, I'll be right back," but our program keeps track of how long it takes us to finish what we're doing. Once we're done, our program will know, and will pick back up as soon as it's able. Threading in Python allows asynchronicity, but our program could theoretically skip around different threads that may not yet be ready, wasting time if there are threads ready to continue running.

    13. What does it mean when something is non-blocking?

      "Non-blocking" means a program will allow other threads to continue running while it's waiting. This is opposed to "blocking" code, which stops execution of your program completely. Normal, synchronous I/O operations suffer from this limitation.

    14. What is a thread?

      A thread is a way of allowing your computer to break up a single process/program into many lightweight pieces that execute in parallel. Somewhat confusingly, Python's standard implementation of threading limits threads to only being able to execute one at a time due to something called the Global Interpreter Lock (GIL). The GIL is necessary because CPython's (Python's default implementation) memory management is not thread-safe. Because of this limitation, threading in Python is concurrent, but not parallel. To get around this, Python has a separate multiprocessing module not limited by the GIL that spins up separate processes, enabling parallel execution of your code. Using the multiprocessing module is nearly identical to using the threading module.

      Asynchronous nature of threading: as one function waits, another one begins, and so on.
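
      A minimal example of the create/start/join lifecycle mentioned in the surrounding notes (the worker function and sleep are placeholders):

      ```
      import threading
      import time

      def worker(name):
          time.sleep(1)  # stand-in for blocking I/O
          print(f'{name} done')

      threads = [threading.Thread(target=worker, args=(f'thread-{i}',)) for i in range(2)]
      for t in threads:
          t.start()      # starting a thread begins its execution
      for t in threads:
          t.join()       # wait for the thread to finish before continuing
      print('all threads finished')
      ```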

    15. What's a callback?

      The idea of performing a function in response to another function is called a "callback."

    16. What is an event loop?

      Event loops are constructs inherent to asynchronous programming that allow performing tasks asynchronously.

      In its purest essence, an event loop is a process that waits around for triggers and then performs specific (programmed) actions once those triggers are met. They often return a "promise" (JavaScript syntax) or "future" (Python syntax) of some sort to denote that a task has been added. Once the task is finished, the promise or future returns a value passed back from the called function (assuming the function does return a value).

    17. when we join threads with thread.join(), all we're doing is ensuring the thread has finished before continuing on with our code.
    18. Creating a thread is not the same as starting a thread, however. To start your thread, use {the name of your thread}.start(). Starting a thread means "starting its execution."
    19. Without the "User-Agent" header you will receive a 304.
    20. concurrency is great for I/O-intensive processes -- tasks that involve waiting on web requests or file read/write operations.
    21. What is concurrency?

      An effective definition for concurrency is "being able to perform multiple tasks at once". This is a bit misleading though, as the tasks may or may not actually be performed at exactly the same time. Instead, a process might start, then once it's waiting on a specific instruction to finish, switch to a new task, only to come back once it's no longer waiting. Once one task is finished, it switches again to an unfinished task until they have all been performed. Tasks start asynchronously, get performed asynchronously, and then finish asynchronously.

    22. There are many reasons your applications can be slow. Sometimes this is due to poor algorithmic design or the wrong choice of data structure. Sometimes, however, it's due to forces outside of our control, such as hardware constraints or the quirks of networking.

      That's where concurrency and parallelism fit in. They allow your programs to do multiple things at once, either at the same time or by wasting the least possible time waiting on busy tasks.

    1. You need to measure

      Before you spend too much time on trying to fix this particular performance issue, you really should measure your software’s performance and figure out where its actual bottlenecks are. It’s quite possible that threading actually works just fine (option #1), or that the extra overhead from communicating across processes doesn’t matter (option #2).

      You will only know if you profile your software and figure out what the actual bottlenecks are.

    2. Running the code in a subprocess is much slower than running a thread, not because the computation is slower, but because of the overhead of copying and (de)serializing the data. So how do you avoid this overhead?

      Reducing the performance hit of copying data between processes:

      Option #1: Just use threads

      Processes have overhead, threads do not. And while it’s true that generic Python code won’t parallelize well when using multiple threads, that’s not necessarily true for your Python code. For example, NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.

      ```
      # numpy_gil.py
      import numpy as np
      from time import time
      from multiprocessing.pool import ThreadPool

      arr = np.ones((1024, 1024, 1024))

      start = time()
      for i in range(10):
          arr.sum()
      print("Sequential:", time() - start)

      expected = arr.sum()

      start = time()
      with ThreadPool(4) as pool:
          result = pool.map(np.sum, [arr] * 10)
          assert result == [expected] * 10
      print("4 threads:", time() - start)
      ```

      When run, we see that NumPy uses multiple cores just fine when using threads, at least for this operation:

      ```
      $ python numpy_gil.py
      Sequential: 4.253053188323975
      4 threads: 1.3854241371154785
      ```

      Pandas is built on NumPy, so many numeric operations will likely release the GIL as well. However, anything involving strings, or Python objects in general, will not. So another approach is to use a library like Polars which is designed from the ground-up for parallelism, to the point where you don’t have to think about it at all, it has an internal thread pool.

      Option #2: Live with it

      If you’re stuck with using processes, you might just decide to live with the overhead of pickling. In particular, if you minimize how much data gets passed back and forth between processes, and the computation in each process is significant enough, the cost of copying and serializing data might not significantly impact your program’s runtime. Spending a few seconds on pickling doesn’t really matter if your subsequent computation takes 10 minutes.

      Option #3: Write the data to disk

      Instead of passing data directly, you can write the data to disk, and then pass the path to this file:

      • to the subprocess (as an argument)
      • to the parent process (as the return value of the function running in the worker process)

      The recipient process can then parse the file.

      ```
      import pandas as pd
      import multiprocessing as mp
      from pathlib import Path
      from tempfile import mkdtemp
      from time import time

      def noop(df: pd.DataFrame):
          # real code would process the dataframe here
          pass

      def noop_from_path(path: Path):
          df = pd.read_parquet(path, engine="fastparquet")
          # real code would process the dataframe here
          pass

      def main():
          df = pd.DataFrame({"column": list(range(10_000_000))})

          with mp.get_context("spawn").Pool(1) as pool:
              # Pass the DataFrame to the worker process
              # directly, via pickling:
              start = time()
              pool.apply(noop, (df,))
              print("Pickling-based:", time() - start)

              # Write the DataFrame to a file, pass the path to
              # the file to the worker process:
              start = time()
              path = Path(mkdtemp()) / "temp.parquet"
              df.to_parquet(
                  path,
                  engine="fastparquet",
                  # Run faster by skipping compression:
                  compression="uncompressed",
              )
              pool.apply(noop_from_path, (path,))
              print("Parquet-based:", time() - start)

      if __name__ == "__main__":
          main()
      ```

      **Option #4: `multiprocessing.shared_memory`**

      Because processes sometimes do want to share memory, operating systems typically provide facilities for explicitly creating shared memory between processes. Python wraps these facilities in the multiprocessing.shared_memory module.

      However, unlike threads, where the same memory address space allows trivially sharing Python objects, in this case you’re mostly limited to sharing arrays. And as we’ve seen, NumPy releases the GIL for expensive operations, which means you can just use threads, which is much simpler. Still, in case you ever need it, it’s worth knowing this module exists.

      Note: The module also includes ShareableList, which is a bit like a Python list but limited to int, float, bool, small str and bytes, and None. But this doesn’t help you cheaply share an arbitrary Python object.
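
      A hedged sketch of what using the module looks like with a NumPy array (the array contents and names are arbitrary): the child attaches to the block by name instead of receiving a pickled copy.

      ```
      import numpy as np
      from multiprocessing import Process, shared_memory

      def double_in_place(name, shape, dtype):
          # attach to the existing block by name and wrap it in an array view
          shm = shared_memory.SharedMemory(name=name)
          arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
          arr *= 2                      # modifies the parent's memory directly
          shm.close()

      if __name__ == '__main__':
          src = np.arange(10, dtype=np.int64)
          shm = shared_memory.SharedMemory(create=True, size=src.nbytes)
          arr = np.ndarray(src.shape, dtype=src.dtype, buffer=shm.buf)
          arr[:] = src                  # copy the data into shared memory once
          p = Process(target=double_in_place, args=(shm.name, arr.shape, arr.dtype))
          p.start()
          p.join()
          print(arr)                    # the child's changes are visible here
          shm.close()
          shm.unlink()                  # free the shared block
      ```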

      A bad option for Linux: the "fork" context

      You may have noticed we did multiprocessing.get_context("spawn").Pool() to create a process pool. This is because Python has multiple implementations of multiprocessing on some OSes. "spawn" is the only option on Windows, the only non-broken option on macOS, and available on Linux. When using "spawn", a completely new process is created, so you always have to copy data across.

      On Linux, the default is "fork": the new child process has a complete copy of the memory of the parent process at the time of the child process’ creation. This means any objects in the parent (arrays, giant dicts, whatever) that were created before the child process was created, and were stored somewhere helpful like a module, are accessible to the child. Which means you don’t need to pickle/unpickle to access them.

      Sounds useful, right? There’s only one problem: the "fork" context is super-broken, which is why it will stop being the default in Python 3.14.

      Consider the following program:

      ```
      import threading
      import sys
      from multiprocessing import Process

      def thread1():
          for i in range(1000):
              print("hello", file=sys.stderr)

      threading.Thread(target=thread1).start()

      def foo():
          pass

      Process(target=foo).start()
      ```

      On my computer, this program consistently deadlocks: it freezes and never exits. Any time you have threads in the parent process, the "fork" context can cause potential deadlocks, or even corrupted memory, in the child process.

      You might think that you’re fine because you don’t start any threads. But many Python libraries start a thread pool on import, for example NumPy. If you’re using NumPy, Pandas, or any other library that depends on NumPy, you are running a threaded program, and therefore at risk of deadlocks, segfaults, or data corruption when using the "fork" multiprocessing context. For more details see this article on why multiprocessing’s default is broken on Linux.

      You’re just shooting yourself in the foot if you take this approach.

    3. When you’re writing Python, though, you want to share Python objects between processes.

      To enable this, when you pass Python objects between processes using Python’s multiprocessing library:

      • On the sender side, the arguments get serialized to bytes with the pickle module.
      • On the receiver side, the bytes are unserialized using pickle.

      This serialization and deserialization process involves computation, which can potentially be slow.
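
      A rough way to see that cost (sizes and timings will vary by machine; this snippet is not from the article):

      ```
      import pickle
      import time

      data = list(range(10_000_000))          # a large structure of Python objects
      start = time.perf_counter()
      blob = pickle.dumps(data)               # what the sender side does
      restored = pickle.loads(blob)           # what the receiver side does
      elapsed = time.perf_counter() - start
      print(f'pickle round trip: {elapsed:.2f}s for {len(blob):,} bytes')
      ```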

    4. Threads vs. processes

      Multiple threads let you run code in parallel, potentially on multiple CPUs. In Python, however, the global interpreter lock makes this parallelism harder to achieve.

      Multiple processes also let you run code in parallel—so what’s the difference between threads and processes?

      All the threads inside a single process share the same memory address space. If thread 1 in a process stores some memory at address 0x7f0cd1a88810, thread 2 can access the same memory at the same address. That means passing objects between threads is cheap: you just need to get the pointer to the memory address from one thread to the other. A memory address is 8 bytes: this is not a lot of data to move around.

      In contrast, processes do not share the same memory space. There are some shared memory facilities provided by the operating system, typically, and we’ll get to that later. But by default, no memory is shared. That means you can’t just share the address of your data across processes: you have to copy the data.

    1. Technique #2: Sampling

      How do you load only a subset of the rows?

      When you load your data, you can specify a skiprows function that will randomly decide whether to load that row or not:

      ```
      >>> from random import random
      >>> def sample(row_number):
      ...     if row_number == 0:
      ...         # Never drop the row with column names:
      ...         return False
      ...     # random() returns uniform numbers between 0 and 1:
      ...     return random() > 0.001
      ...
      >>> sampled = pd.read_csv("/tmp/voting.csv", skiprows=sample)
      >>> len(sampled)
      973
      ```

    2. Technique #1: Changing numeric representations

      For most purposes having a huge amount of accuracy isn’t too important.

      Instead of representing the values as floating point numbers, we can represent them as percentages between 0 and 100. We’ll be down to two-digit accuracy, but again for many use cases that’s sufficient. Plus, if this is output from a model, those last few digits of “accuracy” are likely to be noise; they won’t actually tell us anything useful.

      Whole percentages have the nice property that they can fit in a single byte, an int8—as opposed to float64, which uses eight bytes:
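
      A small sketch of the idea, assuming a made-up column of model scores between 0 and 1 (the column name and sizes are not from the article):

      ```
      import numpy as np
      import pandas as pd

      df = pd.DataFrame({"score": np.random.rand(1_000_000)})  # float64: 8 bytes per value
      print(df["score"].memory_usage(deep=True))

      # store whole percentages instead: fits in one byte per value
      df["score_pct"] = (df["score"] * 100).round().astype(np.int8)
      print(df["score_pct"].memory_usage(deep=True))
      ```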

    3. Lossy compression is often about the specific structure of your data, and your own personal understanding of which details matter and which details don’t.

      So if you’re running low on memory, think about what data you really need, and what alternative representations can make it smaller.

      And if compression still isn’t enough, you can also try processing your data in chunks.

    4. lossy compression: drop some of your data in a way that doesn’t impact your final results too much.

      If parts of your data don’t impact your analysis, no need to waste memory keeping extraneous details around.

    1. Finally, here's a cool thing I learned from the asyncio docs. When writing decorators, you can use partial() to bind the decorated function to an existing wrapper, instead of always returning a new one. The result is a more descriptive representation:

      ```
      >>> return_args_and_exceptions(do_stuff)
      functools.partial(<function _return_args_and_exceptions at 0x10647fd80>, <function do_stuff at 0x10647d8a0>)
      ```

      Compare with the traditional version:

      ```
      def return_args_and_exceptions(func):
          async def wrapper(*args):
              ...
          return wrapper
      ```
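
      A hedged reconstruction of the partial()-based version described above (the wrapper's behavior, returning the arguments plus the result or exception, is inferred from the function name, so treat it as an assumption):

      ```
      import functools

      async def _return_args_and_exceptions(func, *args):
          # shared wrapper: return the arguments along with the result or the exception
          try:
              return (*args, await func(*args))
          except Exception as e:
              return (*args, e)

      def return_args_and_exceptions(func):
          # bind func to the existing wrapper instead of defining a new closure
          return functools.partial(_return_args_and_exceptions, func)

      async def do_stuff(x):
          return x * 2

      print(return_args_and_exceptions(do_stuff))
      # functools.partial(<function _return_args_and_exceptions at 0x...>, <function do_stuff at 0x...>)
      ```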

    2. except StopAsyncIteration if is_async else StopIteration:

      Interesting: using ternary operator in except clause

      In sync code, you might use a thread pool and imap_unordered():

      ```
      pool = multiprocessing.dummy.Pool(2)

      for result in pool.imap_unordered(do_stuff, things_to_do):
          print(result)
      ```

      Here, concurrency is limited by the fixed number of threads.

    4. So, you're doing some async stuff, repeatedly, many times.

      Like, hundreds of thousands of times.

      Either way, it's a good idea to not do it all at once. For one, it's not polite to the services you're calling. For another, it'll load everything in memory, all at once.
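
      One hedged way to do that in asyncio, mirroring the thread-pool snippet above: cap concurrency with a semaphore and consume results as they complete (do_stuff and the limit of 2 are placeholders):

      ```
      import asyncio

      async def do_stuff(thing):
          await asyncio.sleep(0.1)          # stand-in for real async work
          return thing * 2

      async def main(things_to_do, limit=2):
          sem = asyncio.Semaphore(limit)

          async def limited(thing):
              async with sem:               # at most `limit` coroutines run at once
                  return await do_stuff(thing)

          for fut in asyncio.as_completed([limited(t) for t in things_to_do]):
              print(await fut)

      asyncio.run(main(range(10)))
      ```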

    1. Gunicorn and multiprocessing

      Gunicorn forks a base process into n worker processes, and each worker is managed by Uvicorn (with the asynchronous uvloop). Which means:

      • Each worker is concurrent
      • The worker pool implements parallelism

      This way, we can have the best of both worlds: concurrency (multithreading) and parallelism (multiprocessing).
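
      A typical invocation of that setup, assuming the FastAPI app object lives in main.py (the worker count is arbitrary):

      ```
      gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app
      ```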

    2. There is another way to declare a route with FastAPI

      Using asyncio:

      ```
      import asyncio

      from fastapi import FastAPI

      app = FastAPI()

      @app.get("/asyncwait")
      async def asyncwait():
          duration = 0.05
          await asyncio.sleep(duration)
          return {"duration": duration}
      ```

    1. Use Python asyncio.as_completed

      There will be moments when you don't have to await for every single task to be processed right away.

      We do this by using asyncio.as_completed which returns a generator with completed coroutines.

    2. When to use Python Async

      Async only makes sense if you're doing IO.

      There's ZERO benefit in using async for stuff like this that is CPU-bound:

      ```
      import asyncio

      async def sum_two_numbers_async(n1: int, n2: int) -> int:
          return n1 + n2

      async def main():
          await sum_two_numbers_async(2, 2)
          await sum_two_numbers_async(4, 4)

      asyncio.run(main())
      ```

      Your code might even get slower by doing that due to the Event Loop.

      That's because Python async only optimizes IDLE time!

    3. If you want 2 or more functions to run concurrently, you need asyncio.create_task.

      Creating a task triggers the async operation, and it needs to be awaited at some point.

      For example:

      ```
      task = create_task(my_async_function('arg1'))
      result = await task
      ```

      As we're creating many tasks, we need asyncio.gather which awaits all tasks to be done.
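
      A minimal sketch putting those two pieces together (the coroutine name and arguments are placeholders):

      ```
      import asyncio

      async def my_async_function(arg):
          await asyncio.sleep(0.1)
          return f'done {arg}'

      async def main():
          # creating the tasks triggers the async operations
          tasks = [asyncio.create_task(my_async_function(f'arg{i}')) for i in range(3)]
          # gather awaits all tasks and returns their results in order
          results = await asyncio.gather(*tasks)
          print(results)

      asyncio.run(main())
      ```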

    4. If you want 2 or more functions to run concurrently, you need asyncio.create_task.

      Creating a task triggers the async operation, and it needs to be awaited at some point.

      For example:

      ```
      task = create_task(my_async_function('arg1'))
      result = await task
      ```

    5. IO-bound operations are related to reading/writing operations.

      A good example would be:

      • Requesting some data from HTTP
      • Reading/Writing some json/txt file
      • Reading data from a database

      All these operations consist of waiting for the data to be available.

      While the data is UNAVAILABLE the EVENT LOOP does something else.

      This is Concurrency.

      NOT Parallelism.

    6. they think async is parallel which is not true
    1. Fast API

      Fast API is a high-level web framework like flask, but that happens to be async, unlike flask. With the added benefit of using type hints and pydantic to generate schemas.

      It's not a building block like twisted, gevent, trio or asyncio. In fact, it's built on top of asyncio. It's in the same group as flask, bottle, django, pyramid, etc. Although it's a micro-framework, so it's focused on routing, data validation and API delivery.

    2. The code isn't that different from your typical asyncio script:

      ```
      import re
      import time

      import httpx
      import trio

      urls = [
          "https://www.bitecode.dev/p/relieving-your-python-packaging-pain",
          "https://www.bitecode.dev/p/hype-cycles",
          "https://www.bitecode.dev/p/why-not-tell-people-to-simply-use",
          "https://www.bitecode.dev/p/nobody-ever-paid-me-for-code",
          "https://www.bitecode.dev/p/python-cocktail-mix-a-context-manager",
          "https://www.bitecode.dev/p/the-costly-mistake-so-many-makes",
          "https://www.bitecode.dev/p/the-weirdest-python-keyword",
      ]

      title_pattern = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE)

      user_agent = (
          "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
      )

      async def fetch_url(url):
          start_time = time.time()

          async with httpx.AsyncClient() as client:
              headers = {"User-Agent": user_agent}
              response = await client.get(url, headers=headers)
              match = title_pattern.search(response.text)
              title = match.group(1) if match else "Unknown"
              print(f"URL: {url}\nTitle: {title}")

          end_time = time.time()
          elapsed_time = end_time - start_time
          print(f"Time taken for {url}: {elapsed_time:.4f} seconds\n")

      async def main():
          global_start_time = time.time()

          # That's the biggest API difference
          async with trio.open_nursery() as nursery:
              for url in urls:
                  nursery.start_soon(fetch_url, url)

          global_end_time = time.time()
          global_elapsed_time = global_end_time - global_start_time
          print(f"Total time taken for all URLs: {global_elapsed_time:.4f} seconds")

      if __name__ == "__main__":
          trio.run(main)
      ```

      Because it doesn't create nor schedule coroutines immediately (notice the nursery.start_soon(fetch_url, url) is not nursery.start_soon(fetch_url(url))), it will also consume less memory. But the most important part is the nursery:

      ```
      # That's the biggest API difference
      async with trio.open_nursery() as nursery:
          for url in urls:
              nursery.start_soon(fetch_url, url)
      ```

      The with block scopes all the tasks, meaning everything that is started inside that context manager is guaranteed to be finished (or terminated) when it exits. First, the API is better than expecting the user to wait manually like with asyncio.gather: you cannot start concurrent coroutines without a clear scope in trio, it doesn't rely on the coder's discipline. But under the hood, the design is also different. The whole bunch of coroutines you group and start can be canceled easily, because trio always knows where things begin and end.

      As soon as things get complicated, code with a curio-like design becomes radically simpler than code with an asyncio-like design.

    3. trio

      For many years, the very talented dev and speaker David Beazley has been showing unease with asyncio's design, and made more and more experiments and public talks about what an alternative could look like. It culminated with the excellent Die Threads presentation, live coding the sum of the experience of all those ideas, which would eventually become the curio library. Watch it. It’s so good.

      Trio is not compatible with asyncio, nor gevent or twisted by default. This means it's also its little own async island.

      But in exchange for that, it provides a very different internal take on how to deal with this kind of concurrency, where every coroutine is tied to an explicit scope, everything can be awaited easily, or canceled.

    4. Because of the way gevent works, you can take a blocking script, and with very few modifications, make it async. Let's take the original stdlib one, and convert it to gevent:

      ```
      import re
      import time

      import gevent
      from gevent import monkey

      monkey.patch_all()  # THIS MUST BE DONE BEFORE IMPORTING URLLIB

      from urllib.request import Request, urlopen

      urls = [
          "https://www.bitecode.dev/p/relieving-your-python-packaging-pain",
          "https://www.bitecode.dev/p/hype-cycles",
          "https://www.bitecode.dev/p/why-not-tell-people-to-simply-use",
          "https://www.bitecode.dev/p/nobody-ever-paid-me-for-code",
          "https://www.bitecode.dev/p/python-cocktail-mix-a-context-manager",
          "https://www.bitecode.dev/p/the-costly-mistake-so-many-makes",
          "https://www.bitecode.dev/p/the-weirdest-python-keyword",
      ]

      title_pattern = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE)

      user_agent = (
          "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0"
      )

      # We move the fetching into a function so we can isolate it into a green thread
      def fetch_url(url):
          start_time = time.time()

          headers = {"User-Agent": user_agent}

          with urlopen(Request(url, headers=headers)) as response:
              html_content = response.read().decode("utf-8")
              match = title_pattern.search(html_content)
              title = match.group(1) if match else "Unknown"

              print(f"URL: {url}\nTitle: {title}")

          end_time = time.time()
          elapsed_time = end_time - start_time

          print(f"Time taken: {elapsed_time:.4f} seconds\n")

      def main():
          global_start_time = time.time()

          # Here is where we convert synchronous calls into async ones
          greenlets = [gevent.spawn(fetch_url, url) for url in urls]
          gevent.joinall(greenlets)

          global_end_time = time.time()
          global_elapsed_time = global_end_time - global_start_time

          print(f"Total time taken: {global_elapsed_time:.4f} seconds")

      main()
      ```

      No async, no await. No special lib except for gevent. In fact it would work with the requests lib just as well. Very few modifications are needed, for a net perf gain.

      The only danger is if you call gevent.monkey.patch_all() too late. You get a cryptic error that crashes your program.

    5. So what's the deal with asyncio, twisted, gevent, trio and all that stuff?

      asyncio

      asyncio is the modern module for asynchronous network programming provided with the python stdlib since 3.4. In other words, it's the default stuff at your disposal if you want to code something without waiting on the network.

      asyncio replaces the old deprecated asyncore module. It is quite low level, so while you can manually code most network-related things with it, you are still at the level of TCP or UDP. If you want higher-level protocols, like FTP, HTTP or SSH, you have to either code it yourself, or install a third party library or module.

      Because asyncio is the default solution, it has the biggest ecosystem of 3rd party libs, and pretty much everything async strives to be compatible with it directly, or through compatibility layers like anyio.

      Twisted

      20 years ago, there was no asyncio, there was no async/await, nodejs didn't exist and Python 3 was half a decade away. Yet, it was the .com bubble, and everything needed to be connected now. And so was born twisted, the grandfather of all the asynchronous frameworks we have today. The Twisted ecosystem grew to include everything, from mail to ssh.

      To this day, Twisted is still a robust and versatile tool. But you do pay the price of its age: it doesn't follow PEP8 very well, and the design leans on the heavy side.

      Tornado

      Tornado was developed after Twisted, by FriendFeed, during that weird 2005-2015 web dev period where everything needed to be social and web scale. It was like Twisted, but touted to be faster, and it was higher level. Out of the box, the HTTP story is way nicer.

      Today, you are unlikely to use Tornado unless you work at Facebook or contribute to jupyter. After all, if you want to make async web things, the default tool is FastAPI in 2023.

      gevent

      Gevent came about in 2009, the same year as Tornado, but with a fundamentally different design. Instead of attempting to provide an asynchronous API, it decided to do black magic. When you use gevent, you call from gevent import monkey; monkey.patch_all() and it changes the underlying mechanism of Python networking, making everything non-blocking.

    6. asyncio, twisted, tornado and gevent have one trick up their sleeve: they can send a message to the network, and while waiting for the response, wake up another part of the program to do some other work. And they can do that with many messages in a row. While waiting for the network, they can let other parts of the program use the CPU core.

      Note that they can only speed up waiting on the network. They will not run two calculations at the same time (they can't use several CPU cores like multiprocessing does), and they can't speed up waiting on other types of I/O (like when you use threads to avoid blocking on user input or disk writes).

      All in all, they are good for writing things like bots (web crawlers, chat bots, network sniffers, etc.) and servers (web servers, proxies, ...). For maximum benefit, it's possible to use them inside other concurrency tools, such as multiprocessing or multithreading. You can perfectly have 4 processes, each of them containing 4 threads (so 16 threads in total), and each thread with its own asyncio loop running, as sketched below.
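
      A minimal sketch of that layout (my own toy example, with a made-up placeholder task, not code from the article): 4 processes, each starting 4 threads, each thread running its own asyncio loop.

      ```python
      import asyncio
      import threading
      from concurrent.futures import ProcessPoolExecutor

      async def do_io(worker_id):
          await asyncio.sleep(0.1)  # stand-in for a network call
          return worker_id

      def thread_worker(worker_id):
          # each thread gets its own, independent event loop
          return asyncio.run(do_io(worker_id))

      def process_worker(process_id):
          threads = [
              threading.Thread(target=thread_worker, args=(f"{process_id}-{i}",))
              for i in range(4)
          ]
          for t in threads:
              t.start()
          for t in threads:
              t.join()
          return process_id

      if __name__ == "__main__":
          # 4 processes x 4 threads = 16 threads, each with its own event loop
          with ProcessPoolExecutor(max_workers=4) as executor:
              print(list(executor.map(process_worker, range(4))))
      ```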

    7. Which means, if you think about it, that concurrency has a lot to do with sharing one resource.

      In a computer, you may have to share different things:

      • Battery charge.
      • CPU calculation power.
      • RAM space.
      • Disk space and throughput.
      • Network throughput.
      • File system handles.
      • User input.
      • Screen real estate.
    8. The typical analogy is this:
      • concurrency is having two lines of customers ordering from one cashier;
      • parallelism is having two lines of customers ordering from two cashiers.
    9. parallelism

      "doing lots of things at once" (As Rob Pike said)

    10. concurrency

      "is about dealing with a lot of things as once" (As Rob Pike said)

    1. Tips
      • if __name__ == "__main__" is important for multiprocessing because it will spawn a new Python, which will import the module. You don't want this module to spawn a new Python that imports the module that will spawn a new Python...
      • If the function to submit to the executor has complicated arguments to be passed to it, use a lambda or functools.partial.
      • max_workers=1 is a very nice way to get a poor man’s task queue (see the sketch below).
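
      A minimal sketch of that trick (my own example; send_email is a made-up placeholder task):

      ```python
      from concurrent.futures import ThreadPoolExecutor

      def send_email(to):
          print(f"sending email to {to}")

      # With max_workers=1, submitted tasks run one at a time, in submission order,
      # in the background, without blocking the caller: a poor man's task queue.
      queue = ThreadPoolExecutor(max_workers=1)
      queue.submit(send_email, "alice@example.com")
      queue.submit(send_email, "bob@example.com")
      queue.shutdown(wait=True)  # wait for the queue to drain
      ```
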
    2. Both are bad if you need to cancel tasks, collaborate a lot between tasks, deal precisely with the task lifecycle, need a huge number of workers or want to milk out every single bit of perf. You won’t get anywhere near Rust levels of speed.
    3. Process pools are good for:
      • When you don't need to share data between tasks.
      • When you are CPU bound.
      • When you don't have too many tasks to run at the same time.
      • When you need true parallelism and want to exercise your juicy cores.
    4. Thread pools are good for:
      • Tasks (network, file, etc.) that need fewer than 10_000 I/O interactions per second. The number is higher than you would expect, because threads are surprisingly cheap nowadays, and you can spawn a lot of them without bloating memory too much. The limit is more the price of context switching. This is not a scientific number, it's a general direction that you should challenge by measuring your own particular case.
      • When you need to share data between the tasks.
      • When you are not CPU bound.
      • When you are OK with executing tasks a bit slower to ensure you are not blocking any of them (e.g., a user UI and a long calculation).
      • When you are CPU bound, but the CPU calculations are delegated to a C extension that releases the GIL, such as numpy. Free parallelism on the cheap, yeah!

      E.g.: a web scraper, a GUI to zip files, a development server, sending emails without blocking web page rendering, etc.

    5. What would a version with multiprocessing look like?

      Pretty much the same, but we use ProcessPoolExecutor instead.

      ```python
      from concurrent.futures import ProcessPoolExecutor, as_completed

      ...

      with ProcessPoolExecutor(max_workers=5) as executor:
          ...
      ```

      Note that here the number of workers maps to the number of CPU cores I want to dedicate to the program. Processes are way more expensive than threads, as each starts a new Python instance.

    6. The Python standard library comes with a beautiful abstraction for them that I see too few people use: the pool executors.
    7. ThreadPoolExecutor.

      ```python
      from concurrent.futures import ThreadPoolExecutor, as_completed

      def main():
          with ThreadPoolExecutor(max_workers=len(URLs)) as executor:
              tasks = {}
              for url in URLs:
                  future = executor.submit(fetch_url, url)
                  tasks[future] = url

              for future in as_completed(tasks):
                  title = future.result()
                  url = tasks[future]
                  print(f"URL: {url}\nTitle: {title}")
      ```

    8. You can distribute work to a bunch of process workers or thread workers with a few lines of code:

      ```python
      from concurrent.futures import ThreadPoolExecutor, as_completed

      with ThreadPoolExecutor(max_workers=5) as executor:
          executor.submit(do_something_blocking)
      ```

    1. The true distinction: static vs. dynamic

      The true distinction that we should be teaching students is the difference between properties of languages that can be determined statically—that is, by just staring at the code without running it—and properties that can only be known dynamically, during runtime.

      Notice that I said “properties” and not “languages”. Every programming language chooses its own set of properties that can be determined either statically or dynamically, and taken together, this makes a language more “dynamic” or more “static”. Static versus dynamic is a spectrum, and yes, Python falls on the more dynamic end of the spectrum. A language like Java has far more static features than Python, but even Java includes things like reflection, which is inarguably a dynamic feature.
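
      As a toy illustration of the distinction (my own, not from the article), here is a property of a Python program that can only be known dynamically:

      ```python
      import random

      class Thing:
          pass

      obj = Thing()
      if random.random() > 0.5:
          obj.x = 42  # the attribute may or may not be set

      # Whether obj has an attribute "x" cannot be determined by staring at the code;
      # it is a dynamic property, only knowable at runtime.
      print(hasattr(obj, "x"))
      ```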

    2. “Compiled vs. Interpreted” limits what we think is possible with programming languages

      For instance, JavaScript is commonly lumped into the “interpreted language” category. But for a while, JavaScript running in Google Chrome would never be interpreted—instead, JavaScript was compiled directly to machine code! As a result, JavaScript can keep pace with C++.

    3. When you run your Python program using [CPython], the code is parsed and converted to an internal bytecode format, which is then executed inside the VM. From the user’s perspective, this is clearly an interpreter—they run their program from source. But if you look under CPython’s scaly skin, you’ll see that there is definitely some compiling going on. The answer is that it is both. CPython is an interpreter, and it has a compiler.
    4. “Compiled vs. Interpreted language” is a false dichotomy

      A language is not inherently compiled or interpreted; whether a language is compiled or interpreted (or both!) is an implementation detail.

    5. You can actually compile all of your Python code beforehand using the compileall module on the command line:

      $ python3 -m compileall .

      This will place the compiled bytecode of all Python files in the current directory in __pycache__/ and show you any compiler errors.

    6. Python is both a compiled and interpreted language

      The CPython interpreter really is an interpreter. But it also is a compiler. Python must go through a few stages before ever running the first line of code:

      1. scanning
      2. parsing

      Older versions of Python added an additional stage:

      1. scanning
      2. parsing
      3. checking for valid assignment targets

      Let’s compare this to the stages of compiling a C program:

      1. ~~preprocessing~~
      2. lexical analysis (another term for “scanning”)
      3. syntactic analysis (another term for “parsing”)
      4. ~~semantic analysis~~
      5. ~~linking~~
    7. The next stage is parsing (also known as syntactic analysis), and the parser reports the first error in the source code. Parsing the whole file happens before running the first line of code, which means that Python does not even see the error on line 1 and reports the syntax error on line 2.
    8. I haven’t done a deep dive into the source code of the CPython interpreter to verify this, but I think the reason that this is the first error detected is because one of the first steps that Python 3.12 does is scanning (also known as lexical analysis). The scanner converts the ENTIRE file into a series of tokens before continuing to the next stage. A missing quotation mark at the end of a string literal is an error that is detected by the scanner—the scanner wants to turn the ENTIRE string into one big token, but it can’t do that until it finds the closing quotation mark. The scanner runs first, before anything else in Python 3.12, hence why this is the first error message.
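
      You can peek at what the scanner produces with the stdlib tokenize module (a quick sketch of mine, not from the article):

      ```python
      import io
      import token
      import tokenize

      source = 'greeting = "hello"\n'
      # generate_tokens() wants a readline callable and yields the token stream
      for tok in tokenize.generate_tokens(io.StringIO(source).readline):
          print(token.tok_name[tok.type], repr(tok.string))
      ```
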
    9. Python reports only one error message at a time—so the game is which error message will be reported first?

      Here is the buggy program:

      ```python
      1 / 0
      print() = None
      if False
      ñ = "hello
      ```

      Each line of code generates a different error message:

      • 1 / 0 will generate ZeroDivisionError: division by zero.
      • print() = None will generate SyntaxError: cannot assign to function call.
      • if False will generate SyntaxError: expected ':'.
      • ñ = "hello will generate SyntaxError: EOL while scanning string literal.

      The question is… which will be reported first?

      Spoilers: the specific version of Python matters (more than I thought it would) so keep that in mind if you see different results.

      The first error message detected is on the last line of source code. What this tells us is that Python must read the entire source code file before running the first line of code. If you have a definition in your head of an “interpreted language” that includes “interpreted languages run the code one line at a time”, then I want you to cross that out!

    10. The fact that error messages are generated by different stages of the compiler, and that compilers generally issue errors from earlier stages before continuing, also means that you can discover the stages of your compiler by deliberately creating errors in a program.
    11. GCC splits the task of turning your code into a running program into various different stages:
      1. preprocessing
      2. lexical analysis
      3. syntactic analysis
      4. semantic analysis
      5. linking
    1. How to Use Multiple Desktops on One Screen in Windows 11
      • Quickly add a desktop by using the keyboard shortcut <kbd>Windows Key + Ctrl + D</kbd>.

      • Quickly switch desktops by using the keyboard shortcuts <kbd>Windows Key + Ctrl + Left Arrow</kbd> or <kbd>Windows Key + Ctrl + Right Arrow</kbd>.

      • To rename your desktops, open the Task View pane, right-click a desktop and click Rename.

      • To change desktop backgrounds open the Task View pane, right-click a desktop and click Choose background.

      • You can click and drag applications from one desktop to another through the Task View pane, or you can right-click an application, click Move to and then click which desktop you want to move the application to.

      • To close a virtual desktop, open up the Task View pane and hover over the desktop you want to close until an X appears in the upper-right corner. Click the X to close the desktop. You can also open Task View by pressing <kbd>Windows Key + Tab</kbd>, then use your arrow keys to select a virtual desktop and press the <kbd>Delete</kbd> key on the virtual desktop you want to close.

    1. The tokenizer takes your source code and chunks it into “tokens”. Tokens are just small pieces of source code that you can identify in isolation. As examples, there will be tokens for numbers, mathematical operators, variable names, and keywords (like if or for). The parser will take that linear sequence of tokens and essentially reshape them into a tree structure (that's what the T in AST stands for: Tree). This tree is what gives meaning to your tokens, providing a nice structure that is easier to reason about and work on. As soon as we have that tree structure, our compiler can go over the tree and figure out what bytecode instructions represent the code in the tree. For example, if part of the tree represents a function, we may need a bytecode for the return statement of that function. Finally, the interpreter takes those bytecode instructions and executes them, producing the results of our original program.
    2. Recap

      In this article you started implementing your own version of Python. To do so, you needed to create four main components:

      A tokenizer:
      • accepts strings as input (supposedly, source code);
      • chunks the input into atomic pieces called tokens;
      • produces tokens regardless of whether their sequence makes sense or not.

      A parser:
      • accepts tokens as input;
      • consumes the tokens one at a time, while making sure they come in an order that makes sense;
      • produces a tree that represents the syntax of the original code.

      A compiler:
      • accepts a tree as input;
      • traverses the tree to produce bytecode operations.

      An interpreter:
      • accepts bytecode as input;
      • traverses the bytecode and performs the operation that each one represents;
      • uses a stack to help with the computations.

    3. Each bytecode is defined by two things: the type of bytecode operation we're dealing with (e.g., pushing things on the stack or doing an operation); and the data associated with that bytecode operation, which not all bytecode operations need.
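
      Roughly, such a bytecode type could look like this (a sketch of my own guessing at the shape described, not the article's exact code):

      ```python
      from dataclasses import dataclass
      from enum import Enum, auto

      class BytecodeType(Enum):
          PUSH = auto()   # push a value onto the stack; needs data (the value)
          BINOP = auto()  # apply a binary operator; needs data (the operator, e.g. "+")

      @dataclass
      class Bytecode:
          type: BytecodeType
          value: object = None  # not every bytecode operation needs data

      # 3 + 5 could compile down to:
      program = [
          Bytecode(BytecodeType.PUSH, 3),
          Bytecode(BytecodeType.PUSH, 5),
          Bytecode(BytecodeType.BINOP, "+"),
      ]
      ```
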
    4. The interpreter accepts a list of bytecode operations and its method interpret will go through the list of bytecodes, interpreting one at a time.

      ```
      from .compiler import Bytecode, BytecodeType

      ...

      class Interpreter:
          def __init__(self, bytecode: list[Bytecode]) -> None:
              self.stack = Stack()
              self.bytecode = bytecode
              self.ptr: int = 0

          def interpret(self) -> None:
              for bc in self.bytecode:
                  # Interpret this bytecode operator.
                  if bc.type == BytecodeType.PUSH:
                      self.stack.push(bc.value)
                  elif bc.type == BytecodeType.BINOP:
                      right = self.stack.pop()
                      left = self.stack.pop()
                      if bc.value == "+":
                          result = left + right
                      elif bc.value == "-":
                          result = left - right
                      else:
                          raise RuntimeError(f"Unknown operator {bc.value}.")
                      self.stack.push(result)

              print("Done!")
              print(self.stack)
      ```

    5. The interpreter is the part of the program that is responsible for taking bytecode operations as input and using those to actually run the source code you started off with.
    6. To write our compiler, we'll just create a class with a method compile. The method compile will mimic the method parse in its structure. However, the method parse produces tree nodes and the method compile will produce bytecode operations.
    7. The compiler is the part of our program that will take a tree (an AST, to be more precise) and it will produce a sequence of instructions that are simple and easy to follow.
    8. Instead of interpreting the tree directly, we'll use a compiler to create an intermediate layer.
    9. After we have our sequence of operations (bytecodes), we will “interpret” it. To interpret the bytecode means that we go over the bytecode, sequence by sequence, and at each point we perform the simple operation that the bytecode tells us to perform.
    10. Bytecodes are just simple, atomic instructions that do one thing, and one thing only.
    11. Abstract syntax tree

      It's an abstract syntax tree because it is a tree representation that doesn't care about the original syntax we used to write the operation. It only cares about the operations we are going to perform.

    12. The parser is the part of our program that accepts a stream of tokens and makes sure they make sense.
    13. The tokenizer

      The tokenizer is the part of your program that accepts the source code and produces a linear sequence of tokens – bits of source code that you identify as being relevant.

    14. The four parts of our program
      • Tokenizer takes source code as input and produces tokens;
      • Parser takes tokens as input and produces an AST;
      • Compiler takes an AST as input and produces bytecode;
      • Interpreter takes bytecode as input and produces program results.
    1. Once an interpreter is running (remembering what I said that it is preferable to leave them running) you can share data using a channel. The channels module is also part of PEP554 and available using a secret-import:

      ```
      import _xxsubinterpreters as interpreters
      import _xxinterpchannels as channels

      interp_id = interpreters.create(site=site)
      channel_id = channels.create()

      interpreters.run_string(
          interp_id,
          """
      import _xxinterpchannels as channels
      channels.send('hello!')
      """,
          shared={
              "channel_id": channel_id
          }
      )

      print(channels.recv(channel_id))
      ```

    2. To share data, you can use the shared argument and provide a dictionary with shareable (int, float, bool, bytes, str, None, tuple) values:

      ```
      import _xxsubinterpreters as interpreters

      interp_id = interpreters.create(site=site)

      interpreters.run_string(
          interp_id,
          "print(message)",
          shared={
              "message": "hello world!"
          }
      )

      interpreters.run_string(
          interp_id,
          """
      for message in messages:
          print(message)
      """,
          shared={
              "messages": ("hello world!", "this", "is", "me")
          }
      )

      interpreters.destroy(interp_id)
      ```

    3. To start an interpreter that sticks around, you can use interpreters.create() which returns the interpreter ID. This ID can be used for subsequent .run_string calls:

      ```
      import _xxsubinterpreters as interpreters

      interp_id = interpreters.create(site=site)

      interpreters.run_string(interp_id, "print('hello world')")
      interpreters.run_string(interp_id, "print('hello universe')")

      interpreters.destroy(interp_id)
      ```

    4. Starting a sub interpreter is a blocking operation, so most of the time you want to start one inside a thread.

      ```
      from threading import Thread
      import _xxsubinterpreters as interpreters

      t = Thread(target=interpreters.run, args=("print('hello world')",))
      t.start()
      ```

    5. You can create, run and stop a sub interpreter with the .run() function which takes a string or a simple function

      ```
      import _xxsubinterpreters as interpreters

      interpreters.run('''
      print("Hello World")
      ''')
      ```

    6. Worker state management

      If a sub interpreter crashes, it won’t kill the main interpreter. Exceptions can be raised up to the main interpreter and handled gracefully.

    7. Inter-Worker communication

      Whether using sub interpreters or multiprocessing you cannot simply send existing Python objects to worker processes.

      Multiprocessing uses pickle by default. When you start a process or use a process pool, you can use pipes, queues and shared memory as mechanisms for sending data to/from the workers and the main process. These mechanisms revolve around pickling. Pickling is the builtin serialization library for Python that can convert most Python objects into a byte string and back into a Python object.

      Pickle is very flexible. You can serialize a lot of different types of Python objects (but not all) and Python objects can even define a method for how they can be serialized. It also handles nested objects and properties. However, with that flexibility comes a performance hit. Pickle is slow. So if you have a worker model that relies upon continuous inter-worker communication of complex pickled data, you’ll likely see a bottleneck.
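
      A minimal sketch of that round trip (my own example, not from the article):

      ```python
      import pickle

      data = {"a": 1, "b": [1, 2, 3]}
      blob = pickle.dumps(data)      # Python object -> byte string
      restored = pickle.loads(blob)  # byte string -> Python object
      assert restored == data
      ```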

      Sub interpreters can accept pickled data. They also have a second mechanism called shared data. Shared data is a high-speed shared memory space that interpreters can write to and share data with other interpreters. It supports only immutable types, those are:

      • Strings
      • Byte Strings
      • Integers and Floats
      • Boolean and None
      • Tuples (and tuples of tuples)

      To share data with an interpreter, you can either set it as initialization data or you can send it through a channel.

    8. The next point when using a parallel execution model like multiprocessing or sub interpreters is how you share data.

      Once you get over the hurdle of starting one, this quickly becomes the most important point. You have two questions to answer:

      • How do we communicate between workers?
      • How do we manage the state of workers?
    9. Half of the time taken to start an interpreter is taken up running “site import”. This is a special module called site.py that lives within the Python installation. Interpreters have their own caches, their own builtins, they are effectively mini-Python processes. Starting a thread or a coroutine is so fast because it doesn’t have to do any of that work (it shares that state with the owning interpreter), but it’s bound by the lock and isn’t parallel.
    10. Both multiprocessing processes and interpreters have their own import state. This is drastically different to threads and coroutines. When you await an async function, you don’t need to worry about whether that coroutine has imported the required modules. The same applies for threads.

      For example, you can import something in your module and reference it from inside the thread function:

      ```python
      import threading
      from super.duper.module import cool_function

      def worker(info):
          # This already exists in the interpreter state
          cool_function()

      info = {'a': 1}
      thread = threading.Thread(target=worker, args=(info,))
      ```

    11. Another important point is that multiprocessing is often used in a model where the processes are long-running and handed lots of tasks instead of being spawned and destroyed for a single workload. One great example is Gunicorn, the popular Python web server. Gunicorn will spawn “workers” using multiprocessing and those workers will live for the lifetime of the main process. The time to start a process or a sub interpreter then becomes irrelevant (at 89 ms or 1 second) when the web worker can be running for weeks, months or years. The ideal way to use these parallel workers for small tasks (like handling a single web request) is to keep them running and use a main process to coordinate and distribute the workload.
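
      A rough sketch of that model (my own toy example, nothing to do with Gunicorn's actual code): the pool and its worker processes live for the whole program, so the startup cost is paid once rather than per task.

      ```python
      from concurrent.futures import ProcessPoolExecutor

      def handle_request(request_id):
          # stand-in for a small unit of work, e.g. handling one web request
          return f"handled {request_id}"

      if __name__ == "__main__":
          # long-lived workers; the main process only coordinates and distributes tasks
          with ProcessPoolExecutor(max_workers=4) as pool:
              for result in pool.map(handle_request, range(100)):
                  print(result)
      ```
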
    12. Threads are only parallel with IO-bound tasks
    13. Python’s system architecture is roughly made up of three parts:
      • A Python process, which contains one or more interpreters
      • An interpreter, which contains a lock (the GIL) and one or more Python threads
      • A thread, which contains information about the currently executing code.
    14. What is the difference between threading, multiprocessing, and sub interpreters?

      The Python standard library has a few options for concurrent programming, depending on some factors:

      • Is the task you’re completing IO-bound (e.g. reading from a network, writing to disk)?
      • Does the task require CPU-heavy work, e.g. computation?
      • Can the tasks be broken into small chunks or are they large pieces of work?

      Here are the models:

      • Threads are fast to create, you can share any Python objects between them and have a small overhead. Their drawback is that Python threads are bound to the GIL of the process, so if the workload is CPU-intensive then you won’t see any performance gains. Threading is very useful for background, polling tasks like a function that waits and listens for a message on a queue.
      • Coroutines are extremely fast to create, you can share any Python objects between them and have a minuscule overhead. Coroutines are ideal for IO-based activity that has an underlying API that supports async/await.
      • Multiprocessing is a Python wrapper that creates Python processes and links them together. These processes are slow to start, so the workload that you give them needs to be large enough to see the benefit of parallelising the workload. However, they are truly parallel since each one has its own GIL.
      • Sub interpreters have the parallelism of multiprocessing, but with a much faster startup time.
    1. Rebuild your images regularly

      If you want both the benefits of caching, and to get security updates within a reasonable amount of time, you will need two build processes:

      1. The normal image build process that happens whenever you release new code.
      2. Once a week, or every night, rebuild your Docker image from scratch using docker build --pull --no-cache to ensure you have security updates.
    2. Disabling caching

      That suggests that sometimes you’re going to want to bypass the caching. You can do so by passing two arguments to docker build:

      • --pull: This pulls the latest version of the base Docker image, instead of using the locally cached one.
      • --no-cache: This ensures all additional layers in the Dockerfile get rebuilt from scratch, instead of relying on the layer cache.

      If you add those arguments to docker build, you ensure that the new image has the latest (system-level) packages and security updates.

    3. As long as you’re relying on caching, you’ll still get the old, insecure packages distributed in your images
    4. caching means no updates
    5. caching can lead to insecure images.