923 Matching Annotations
  1. Feb 2024
    1. The result? Our runtime image just got 6x smaller! Six times! From > 1.1 GB to 170 MB.

      See (above this annotation) the most optimized & CI friendly Python Docker build with Poetry (until this issue gets resolved)

    1. Python List append() Method

      This web page offers a brief overview of the append() function for lists in python.

    1. We’ve (painstakingly) manually reviewed 310 live MLOps positions, advertised across various platforms in Q4 this year

      They went through 310 role descriptions and, even though role descriptions may vary significantly, they found 3 core skills that a large percentage of MLOps roles required:

      📦 Docker and Kubernetes 🐍 Python 🌥 Cloud

  2. Jan 2024
    1. setuptools is the most popular (at 50k packages), Poetry is second at 41k, Hatchling is third at 8.1k. Other tools to cross 500 users include Flit (4.4k), PDM (1.3k), Maturin (1.3k, build backend for Rust-based packages).

      Popularity of Python package managers in 2024

  3. Dec 2023
    1. ambiguous src

      Why is using "src" folder to contain module files ambiguous? In this case "sample" seams more ambiguous cause it could be refering a folder containing sample data. Also how do you know the module is not called docs?

    1. Measure Execution Time With time.thread_time()

      The time.thread_time() reports the time that the current thread has been executing.

      The time begins or is zero when the current thread is first created.

      Return the value (in fractional seconds) of the sum of the system and user CPU time of the current thread.

      It is an equivalent value to the time.process_time(), except calculated at the scope of the current thread, not the current process.

      This value is calculated as the sum of the system time and the user time.

      thread time = user time + system time

      The reported time does not include sleep time.

      This means if the thread is blocked by a call to time.sleep() or perhaps is suspended by the operating system, then this time is not included in the reported time. This is called a “thread-wide” or “thread-specific” time.

    2. Measure Execution Time With time.process_time()

      The time.process_time() reports the time that the current process has been executed.

      The time begins or is zero when the current process is first created.

      Calculated as the sum of the system time and the user time:

      process time = user time + system time

      System time is time that the CPU is spent executing system calls for the kernel (e.g. the operating system)

      User time is time spent by the CPU executing calls in the program (e.g. your code).

      When a program loops through an array, it is accumulating user CPU time. Conversely, when a program executes a system call such as exec or fork, it is accumulating system CPU time.

      The reported time does not include sleep time.

      This means if the process is blocked by a call to time.sleep() or perhaps is suspended by the operating system, then this time is not included in the reported time. This is called a “process-wide” time.

      As such, it only reports the time that the current process was executed since it was created by the operating system.

    3. Measure Execution Time With time.monotonic()

      The time.monotonic() function returns time stamps from a clock that cannot go backwards, as its name suggests.

      In mathematics, monotonic, e.g. a monotonic function means a function whose output over increases (or decreaes).

      This means that the result from the time.monotonic() function will never be before the result from a prior call.

      Return the value (in fractional seconds) of a monotonic clock, i.e. a clock that cannot go backwards.

      It is a high-resolution time stamp, although is not relative to epoch-like time.time(). Instead, like time.perf_counter() uses a separate timer separate from the system clock.

      The time.monotonic() has a lower resolution than the time.perf_counter() function.

      This means that values from the time.monotonic() function can be compared to each other, relatively, but not to the system clock.

      Like the time.perf_counter() function, time.monotonic() function is “system-wide”, meaning that it is not affected by changes to the system clock, such as updates or clock adjustments due to time synchronization.

      Like the time.perf_counter() function, the time.monotonic() function was introduced in Python version 3.3 with the intent of addressing the limitations of the time.time() function tied to the system clock, such as use in short-duration benchmarking.

      Monotonic clock (cannot go backward), not affected by system clock updates.

    4. Measure Execution Time With time.perf_counter()

      The time.perf_counter() function reports the value of a performance counter on the system.

      It does not report the time since epoch like time.time().

      Return the value (in fractional seconds) of a performance counter, i.e. a clock with the highest available resolution to measure a short duration. It does include time elapsed during sleep and is system-wide.

      The returned value in seconds with fractional components (e.g. milliseconds and nanoseconds), provides a high-resolution timestamp.

      Calculating the difference between two timestamps from the time.perf_counter() allows high-resolution execution time benchmarking, e.g. in the millisecond and nanosecond range.

      The timestamp from the time.perf_counter() function is consistent, meaning that two durations can be compared relative to each other in a meaningful way.

      The time.perf_counter() function was introduced in Python version 3.3 with the intended use for short-duration benchmarking.

      The perf_counter() function was specifically designed to overcome the limitations of other time functions to ensure that the result is consistent across platforms and monotonic (always increasing).

      For accuracy, the timeit module uses the time.perf_counter() internally.

    5. Measure Execution Time With time.time()

      The time.time() function reports the number of seconds since the epoch (epoch is January 1st 1970, which is used on Unix systems and beyond as an arbitrary fixed time in the past) as a floating point number.

      The result is a floating point value, potentially offering fractions of a seconds (e.g. milliseconds), if the platforms support it.

      The time.time() function is not perfect.

      It is possible for a subsequent call to time.time() to return a value in seconds less than the previous value, due to rounding.

      Note: even though the time is always returned as a floating point number, not all systems provide time with a better precision than 1 second. While this function normally returns non-decreasing values, it can return a lower value than a previous call if the system clock has been set back between the two calls.

    6. there are automatic ways to measure execution time, such as via the timeit module.
    7. There are 5 ways to measure execution time manually in Python using the time module, they are:
      1. Use time.time()
      2. Use time.perf_counter()
      3. Use time.monotonic()
      4. Use time.process_time()
      5. Use time.thread_time()

      Note, each function returns a time in seconds and has an equivalent function that returns the time in nanoseconds, e.g. time.time_ns(), time.perf_counter_ns(), time.monotonic_ns(), time.process_time_ns() and time.thread_time_ns().

      Recall that there are 1,000 nanoseconds in one microsecond, 1,000 microseconds in 1 millisecond, and 1,000 milliseconds in one second. This highlights that the nanosecond versions of the function are for measuring very short time scales indeed.

    1. There are common errors experienced by beginners when getting started with asyncio in Python.

      They are:

      1. Trying to run coroutines by calling them.
      2. Not letting coroutines run in the event loop.
      3. Using the asyncio low-level API.
      4. Exiting the main coroutine too early.
      5. Assuming race conditions and deadlocks are impossible.
    1. And this is where the asynchronicity comes in: The "results" list does not actually contain the results from running our functions. Instead, it contains "futures" which are similar to the JavaScript idea of "promises." In order to allow our program to continue running, we get back these futures that represent a placeholder for a value. If we try to print the future, depending on whether it's finished running or not, we'll either get back a state of "pending" or "finished." Once it's finished we can get the return value (assuming there is one) using var.result().
    2. The difference between asyncio.sleep() and time.sleep() is that asyncio.sleep() is non-blocking.
    3. The calls don't actually get made until we schedule them with await asyncio.gather(*tasks). This runs all of the tasks in our list and waits for them to finish before continuing with the rest of our program.
    4. programming with asyncio pretty much enforces* using some sort of "main" function.

      This is because you need to use the "async" keyword in order to use the "await" syntax, and the "await" syntax is the only way to actually run other async functions.`

    5. async for (not used here) iterates over an asynchronous stream.
    6. async with allows awaiting async responses and file operations.
    7. What is a thread?

      A thread is a way of allowing your computer to break up a single process/program into many lightweight pieces that execute in parallel. Somewhat confusingly, Python's standard implementation of threading limits threads to only being able to execute one at a time due to something called the Global Interpreter Lock (GIL). The GIL is necessary because CPython's (Python's default implementation) memory management is not thread-safe. Because of this limitation, threading in Python is concurrent, but not parallel. To get around this, Python has a separate multiprocessing module not limited by the GIL that spins up separate processes, enabling parallel execution of your code. Using the multiprocessing module is nearly identical to using the threading module.

      Asynchronous nature of threading: as one function waits, another one begins, and so on.

    8. when we join threads with thread.join(), all we're doing is ensuring the thread has finished before continuing on with our code.
    9. Creating a thread is not the same as starting a thread, however. To start your thread, use {the name of your thread}.start(). Starting a thread means "starting its execution."
    1. Running the code in a subprocess is much slower than running a thread, not because the computation is slower, but because of the overhead of copying and (de)serializing the data. So how do you avoid this overhead?

      Reducing the performance hit of copying data between processes:

      Option #1: Just use threads

      Processes have overhead, threads do not. And while it’s true that generic Python code won’t parallelize well when using multiple threads, that’s not necessarily true for your Python code. For example, NumPy releases the GIL for many of its operations, which means you can use multiple CPU cores even with threads.

      ``` # numpy_gil.py import numpy as np from time import time from multiprocessing.pool import ThreadPool

      arr = np.ones((1024, 1024, 1024))

      start = time() for i in range(10): arr.sum() print("Sequential:", time() - start)

      expected = arr.sum()

      start = time() with ThreadPool(4) as pool: result = pool.map(np.sum, [arr] * 10) assert result == [expected] * 10 print("4 threads:", time() - start) ```

      When run, we see that NumPy uses multiple cores just fine when using threads, at least for this operation:

      $ python numpy_gil.py Sequential: 4.253053188323975 4 threads: 1.3854241371154785

      Pandas is built on NumPy, so many numeric operations will likely release the GIL as well. However, anything involving strings, or Python objects in general, will not. So another approach is to use a library like Polars which is designed from the ground-up for parallelism, to the point where you don’t have to think about it at all, it has an internal thread pool.

      Option #2: Live with it

      If you’re stuck with using processes, you might just decide to live with the overhead of pickling. In particular, if you minimize how much data gets passed and forth between processes, and the computation in each process is significant enough, the cost of copying and serializing data might not significantly impact your program’s runtime. Spending a few seconds on pickling doesn’t really matter if your subsequent computation takes 10 minutes.

      Option #3: Write the data to disk

      Instead of passing data directly, you can write the data to disk, and then pass the path to this file: * to the subprocess (as an argument) * to parent process (as the return value of the function running in the worker process).

      The recipient process can then parse the file.

      ``` import pandas as pd import multiprocessing as mp from pathlib import Path from tempfile import mkdtemp from time import time

      def noop(df: pd.DataFrame): # real code would process the dataframe here pass

      def noop_from_path(path: Path): df = pd.read_parquet(path, engine="fastparquet") # real code would process the dataframe here pass

      def main(): df = pd.DataFrame({"column": list(range(10_000_000))})

      with mp.get_context("spawn").Pool(1) as pool:
          # Pass the DataFrame to the worker process
          # directly, via pickling:
          start = time()
          pool.apply(noop, (df,))
          print("Pickling-based:", time() - start)
      
          # Write the DataFrame to a file, pass the path to
          # the file to the worker process:
          start = time()
          path = Path(mkdtemp()) / "temp.parquet"
          df.to_parquet(
              path,
              engine="fastparquet",
              # Run faster by skipping compression:
              compression="uncompressed",
          )
          pool.apply(noop_from_path, (path,))
          print("Parquet-based:", time() - start)
      

      if name == "main": main() `` **Option #4:multiprocessing.shared_memory`**

      Because processes sometimes do want to share memory, operating systems typically provide facilities for explicitly creating shared memory between processes. Python wraps this facilities in the multiprocessing.shared_memory module.

      However, unlike threads, where the same memory address space allows trivially sharing Python objects, in this case you’re mostly limited to sharing arrays. And as we’ve seen, NumPy releases the GIL for expensive operations, which means you can just use threads, which is much simpler. Still, in case you ever need it, it’s worth knowing this module exists.

      Note: The module also includes ShareableList, which is a bit like a Python list but limited to int, float, bool, small str and bytes, and None. But this doesn’t help you cheaply share an arbitrary Python object.

      A bad option for Linux: the "fork" context

      You may have noticed we did multiprocessing.get_context("spawn").Pool() to create a process pool. This is because Python has multiple implementations of multiprocessing on some OSes. "spawn" is the only option on Windows, the only non-broken option on macOS, and available on Linux. When using "spawn", a completely new process is created, so you always have to copy data across.

      On Linux, the default is "fork": the new child process has a complete copy of the memory of the parent process at the time of the child process’ creation. This means any objects in the parent (arrays, giant dicts, whatever) that were created before the child process was created, and were stored somewhere helpful like a module, are accessible to the child. Which means you don’t need to pickle/unpickle to access them.

      Sounds useful, right? There’s only one problem: the "fork" context is super-broken, which is why it will stop being the default in Python 3.14.

      Consider the following program:

      ``` import threading import sys from multiprocessing import Process

      def thread1(): for i in range(1000): print("hello", file=sys.stderr)

      threading.Thread(target=thread1).start()

      def foo(): pass

      Process(target=foo).start() ```

      On my computer, this program consistently deadlocks: it freezes and never exits. Any time you have threads in the parent process, the "fork" context can cause in potential deadlocks, or even corrupted memory, in the child process.

      You might think that you’re fine because you don’t start any threads. But many Python libraries start a thread pool on import, for example NumPy. If you’re using NumPy, Pandas, or any other library that depends on NumPy, you are running a threaded program, and therefore at risk of deadlocks, segfaults, or data corruption when using the "fork" multiprocessing context. For more details see this article on why multiprocessing’s default is broken on Linux.

      You’re just shooting yourself in the foot if you take this approach.

    2. When you’re writing Python, though, you want to share Python objects between processes.

      To enable this, when you pass Python objects between processes using Python’s multiprocessing library:

      • On the sender side, the arguments get serialized to bytes with the pickle module.
      • On the receiver side, the bytes are unserialized using pickle.

      This serialization and deserialization process involves computation, which can potentially be slow.

    3. Threads vs. processes

      Multiple threads let you run code in parallel, potentially on multiple CPUs. On Python, however, the global interpreter lock makes this parallelism harder to achieve.

      Multiple processes also let you run code in parallel—so what’s the difference between threads and processes?

      All the threads inside a single process share the same memory address space. If thread 1 in a process stores some memory at address 0x7f0cd1a88810, thread 2 can access the same memory at the same address. That means passing objects between threads is cheap: you just need to get the pointer to the memory address from one thread to the other. A memory address is 8 bytes: this is not a lot of data to move around.

      In contrast, processes do not share the same memory space. There are some shared memory facilities provided by the operating system, typically, and we’ll get to that later. But by default, no memory is shared. That means you can’t just share the address of your data across processes: you have to copy the data.

    1. Technique #2: Sampling

      How do you load only a subset of the rows?

      When you load your data, you can specify a skiprows function that will randomly decide whether to load that row or not:

      ```

      from random import random

      def sample(row_number): ... if row_number == 0: ... # Never drop the row with column names: ... return False ... # random() returns uniform numbers between 0 and 1: ... return random() > 0.001 ... sampled = pd.read_csv("/tmp/voting.csv", skiprows=sample) len(sampled) 973 ```

    2. lossy compression: drop some of your data in a way that doesn’t impact your final results too much.

      If parts of your data don’t impact your analysis, no need to waste memory keeping extraneous details around.

    1. except StopAsyncIteration if is_async else StopIteration:

      Interesting: using ternary operator in except clause

    2. In sync code, you might use

      a thread pool and imap_unordered():

      ``` pool = multiprocessing.dummy.Pool(2)

      for result in pool.imap_unordered(do_stuff, things_to_do): print(result) ```

      Here, concurrency is limited by the fixed number of threads.

    1. Gunicorn and multiprocessing

      Gunicorn forks a base process into n worker processes, and each worker is managed by Uvicorn (with the asynchronous uvloop). Which means:

      • Each worker is concurrent
      • The worker pool implements parallelism

      This way, we can have the best of both worlds: concurrency (multithreading) and parallelism (multiprocessing).

    2. There is another way to declare a route with FastAPI

      Using the asyncio:

      ``` import asyncio

      from fastapi import FastAPI

      app = FastAPI()

      @app.get("/asyncwait") async def asyncwait(): duration = 0.05 await asyncio.sleep(duration) return {"duration": duration} ```

    1. Use Python asyncio.as_completed

      There will be moments when you don't have to await for every single task to be processed right away.

      We do this by using asyncio.as_completed which returns a generator with completed coroutines.

    2. When to use Python Async

      Async only makes sense if you're doing IO.

      There's ZERO benefit in using async to stuff like this that is CPU-bound:

      ``` import asyncio

      async def sum_two_numbers_async(n1: int, n2: int) -> int: return n1 + n2

      async def main(): await sum_two_numbers_async(2, 2) await sum_two_numbers_async(4, 4)

      asyncio.run(main()) ```

      Your code might even get slower by doing that due to the Event Loop.

      That's because Python async only optimizes IDLE time!

    3. If you want 2 or more functions to run concurrently, you need asyncio.create_task.

      Creating a task triggers the async operation, and it needs to be awaited at some point.

      For example:

      task = create_task(my_async_function('arg1')) result = await task

      As we're creating many tasks, we need asyncio.gather which awaits all tasks to be done.

    4. they think async is parallel which is not true
    1. Fast API

      Fast API is a high-level web framework like flask, but that happens to be async, unlike flask. With the added benefit of using type hints and pydantic to generate schemas.

      It's not a building block like twisted, gevent, trio or asyncio. In fact, it's built on top of asyncio. It's in the same group as flask, bottle, django, pyramid, etc. Although it's a micro-framework, so it's focused on routing, data validation and API delivery.

    2. The code isn't that different from your typical asyncio script:

      ``` import re import time

      import httpx import trio

      urls = [ "https://www.bitecode.dev/p/relieving-your-python-packaging-pain", "https://www.bitecode.dev/p/hype-cycles", "https://www.bitecode.dev/p/why-not-tell-people-to-simply-use", "https://www.bitecode.dev/p/nobody-ever-paid-me-for-code", "https://www.bitecode.dev/p/python-cocktail-mix-a-context-manager", "https://www.bitecode.dev/p/the-costly-mistake-so-many-makes", "https://www.bitecode.dev/p/the-weirdest-python-keyword", ]

      title_pattern = re.compile(r"<title[^>]>(.?)</title>", re.IGNORECASE)

      user_agent = ( "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" )

      async def fetch_url(url): start_time = time.time()

      async with httpx.AsyncClient() as client:
          headers = {"User-Agent": user_agent}
          response = await client.get(url, headers=headers)
          match = title_pattern.search(response.text)
          title = match.group(1) if match else "Unknown"
          print(f"URL: {url}\nTitle: {title}")
      
      end_time = time.time()
      elapsed_time = end_time - start_time
      print(f"Time taken for {url}: {elapsed_time:.4f} seconds\n")
      

      async def main(): global_start_time = time.time()

      # That's the biggest API difference
      async with trio.open_nursery() as nursery:
          for url in urls:
              nursery.start_soon(fetch_url, url)
      
      global_end_time = time.time()
      global_elapsed_time = global_end_time - global_start_time
      print(f"Total time taken for all URLs: {global_elapsed_time:.4f} seconds")
      

      if name == "main": trio.run(main) ```

      Because it doesn't create nor schedule coroutines immediately (notice the nursery.start_soon(fetch_url, url) is not nursery.start_soon(fetch_url(url))), it will also consume less memory. But the most important part is the nursery:

      # That's the biggest API difference async with trio.open_nursery() as nursery: for url in urls: nursery.start_soon(fetch_url, url)

      The with block scopes all the tasks, meaning everything that is started inside that context manager is guaranteed to be finished (or terminated) when it exits. First, the API is better than expecting the user to wait manually like with asyncio.gather: you cannot start concurrent coroutines without a clear scope in trio, it doesn't rely on the coder's discipline. But under the hood, the design is also different. The whole bunch of coroutines you group and start can be canceled easily, because trio always knows where things begin and end.

      As soon as things get complicated, code with curio-like design become radically simpler than ones with asyncio-like design.

    3. trio

      For many years, the very talented dev and speaker David Beazley has been showing unease with asyncio's design, and made more and more experiments and public talks about what could an alternative look like. It culminated with the excellent Die Threads presentation, live coding the sum of the experience of all those ideas, that eventually would become the curio library. Watch it. It’s so good.

      Trio is not compatible with asyncio, nor gevent or twisted by default. This means it's also its little own async island.

      But in exchange for that, it provides a very different internal take on how to deal with this kind of concurrency, where every coroutine is tied to an explicit scope, everything can be awaited easily, or canceled.

    4. Because of the way gevent works, you can take a blocking script, and with very few modifications, make it async. Let's take the original stdlib one, and convert it to gevent:

      ``` import re import time

      import gevent from gevent import monkey

      monkey.patch_all() # THIS MUST BE DONE BEFORE IMPORTING URLLIB

      from urllib.request import Request, urlopen

      urls = [ "https://www.bitecode.dev/p/relieving-your-python-packaging-pain", "https://www.bitecode.dev/p/hype-cycles", "https://www.bitecode.dev/p/why-not-tell-people-to-simply-use", "https://www.bitecode.dev/p/nobody-ever-paid-me-for-code", "https://www.bitecode.dev/p/python-cocktail-mix-a-context-manager", "https://www.bitecode.dev/p/the-costly-mistake-so-many-makes", "https://www.bitecode.dev/p/the-weirdest-python-keyword", ]

      title_pattern = re.compile(r"<title[^>]>(.?)</title>", re.IGNORECASE)

      user_agent = ( "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0" )

      We move the fetching into a function so we can isolate it into a green thread

      def fetch_url(url): start_time = time.time()

      headers = {"User-Agent": user_agent}
      
      with urlopen(Request(url, headers=headers)) as response:
          html_content = response.read().decode("utf-8")
          match = title_pattern.search(html_content)
          title = match.group(1) if match else "Unknown"
      
          print(f"URL: {url}\nTitle: {title}")
      
      end_time = time.time()
      elapsed_time = end_time - start_time
      
      print(f"Time taken: {elapsed_time:.4f} seconds\n")
      

      def main(): global_start_time = time.time()

      # Here is where we convert synchronous calls into async ones
      greenlets = [gevent.spawn(fetch_url, url) for url in urls]
      gevent.joinall(greenlets)
      
      global_end_time = time.time()
      global_elapsed_time = global_end_time - global_start_time
      
      print(f"Total time taken: {global_elapsed_time:.4f} seconds")
      

      main() ```

      No async, no await. No special lib except for gevent. In fact it would work with the requests lib just as well. Very few modifications are needed, for a net perf gain.

      The only danger is if you call gevent.monkey.patch_all() too late. You get a cryptic error that crashes your program.

    1. Tips
      • if name == "main" is important for multiprocessing because it will spawn a new Python, that will import the module. You don't want this module to spawn a new Python that imports the module that will spawn a new Python...
      • If the function to submit to the executor has complicated arguments to be passed to it, use a lambda or functools.partial.
      • max_worker = 1 is a very nice way to get a poor man’s task queue.
    2. Both are bad if you need to cancel tasks, collaborate a lot between tasks, deal precisely with the task lifecycle, needs a huge number of workers or want to milk out every single bit of perfs. You won’t get nowhere near Rust level of speed.
    3. Process pools are good for:
      • When you don't need to share data between tasks.
      • When you are CPU bound.
      • When you don't have too many tasks to run at the same time.
      • When you need true parallelism and want to exercise your juicy cores.
    4. Thread pools are good for:
      • Tasks (network, file, etc.) that needs less than 10_000 I/O interactions per second. The number is higher than you would expect, because threads are surprisingly cheap nowadays, and you can spawn a lot of them without bloating memory too much. The limit is more the price of context switching. This is not a scientific number, it's a general direction that you should challenge by measuring your own particular case.
      • When you need to share data between the tasks.
      • When you are not CPU bound.
      • When you are OK to execute tasks a bit slower to you ensure you are not blocking any of them (E.G: user UI and a long calculation).
      • When you are CPU bound, but the CPU calculations are delegating to a C extension that releases the GIL, such as numpy. Free parallelism on the cheap, yeah!

      E.G: a web scraper, a GUI to zip files, a development server, sending emails without blocking web page rendering, etc.

    5. What would a version with multiprocessing look like?

      Pretty much the same, but, we use ProcessPoolExecutor instead.

      ```python from concurrent.futures import ProcessPoolExecutor, as_completed

      ...

      with ProcessPoolExecutor(max_workers=5) as executor: ... ```

      Note that here the number of workers maps to the number of CPU cores I want to dedicate to the program. Processes are way more expensive than threads, as each starts a new Python instance.

    6. Python standard library comes with a beautiful abstraction for them I see too few people use: the pool executors.
    7. ThreadPoolExecutor.

      ```python from concurrent.futures import ThreadPoolExecutor, as_completed

      def main(): with ThreadPoolExecutor(max_workers=len(URLs)) as executor: tasks = {} for url in URLs: future = executor.submit(fetch_url, url) tasks[future] = url

          for future in as_completed(tasks):
              title = future.result()
              url = tasks[future]
              print(f"URL: {url}\nTitle: {title}")
      

      ```

    8. You can distribute work to a bunch of process workers or thread workers with a few lines of code:

      ```python from concurrent.futures import ThreadPoolExecutor, as_completed

      with ThreadPoolExecutor(max_workers=5) as executor: executor.submit(do_something_blockint) ```

    1. When you run your Python program using [CPython], the code is parsed and converted to an internal bytecode format, which is then executed inside the VM. From the user’s perspective, this is clearly an interpreter—they run their program from source. But if you look under CPython’s scaly skin, you’ll see that there is definitely some compiling going on. The answer is that it is both. CPython is an interpreter, and it has a compiler.
    2. You can actually compile all of your Python code beforehand using the compileall module on the command line:

      $ python3 -m compileall .

      This will place the compiled bytecode of all Python files in the current directory in pycache/ and show you any compiler errors.

    3. Python is both a compiled and interpreted language

      The CPython interpreter really is an interpreter. But it also is a compiler. Python must go through a few stages before ever running the first line of code:

      1. scanning
      2. parsing

      Older versions of Python added an additional stage:

      1. scanning
      2. parsing
      3. checking for valid assignment targets

      Let’s compare this to the stages of compiling a C program:

      1. ~~preprocessing~~
      2. lexical analysis (another term for “scanning”)
      3. syntactic analysis (another term for “parsing”)
      4. ~~semantic analysis~~
      5. ~~linking~~
    4. next stage is parsing (also known as syntactic analysis) and the parser reports the first error in the source code. Parsing the whole file happens before running the first line of code which means that Python does not even see the error on line 1 and reports the syntax error on line 2.
    5. I haven’t done a deep dive into the source code of the CPython interpreter to verify this, but I think the reason that this is the first error detected is because one of the first steps that Python 3.12 does is scanning (also known as lexical analysis). The scanner converts the ENTIRE file into a series of tokens before continuing to the next stage. A missing quotation mark at the end of a string literal is an error that is detected by the scanner—the scanner wants to turn the ENTIRE string into one big token, but it can’t do that until it finds the closing quotation mark. The scanner runs first, before anything else in Python 3.12, hence why this is the first error message.
    6. Python reports only one error message at a time—so the game is which error message will be reported first?

      Here is the buggy program:

      python 1 / 0 print() = None if False ñ = "hello

      Each line of code generates a different error message:

      • 1 / 0 will generate ZeroDivisionError: division by zero.
      • print() = None will generate SyntaxError: cannot assign to function call.
      • if False will generate SyntaxError: expected ':'.
      • ñ = "hello will generate SyntaxError: EOL while scanning string literal.

      The question is… which will be reported first?

      Spoilers: the specific version of Python matters (more than I thought it would) so keep that in mind if you see different results.

      The first error message detected is on the last line of source code. What this tells us is that Python must read the entire source code file before running the first line of code. If you have a definition in your head of an “interpreted language” that includes “interpreted languages run the code one line at a time”, then I want you to cross that out!

    1. The tokenizer takes your source code and chunks it into “tokens”. Tokens are just small pieces of source code that you can identify in isolation. As examples, there will be tokens for numbers, mathematical operators, variable names, and keywords (like if or for). The parser will take that linear sequence of tokens and essentially reshape them into a tree structure (that's what the T in AST stands for: Tree). This tree is what gives meaning to your tokens, providing a nice structure that is easier to reason about and work on. As soon as we have that tree structure, our compiler can go over the tree and figure out what bytecode instructions represent the code in the tree. For example, if part of the tree represents a function, we may need a bytecode for the return statement of that function. Finally, the interpreter takes those bytecode instructions and executes them, producing the results of our original program.
    2. Recap

      In this article you started implementing your own version of Python. To do so, you needed to create four main components:

      A tokenizer: * accepts strings as input (supposedly, source code); * chunks the input into atomic pieces called tokens; * produces tokens regardless of their sequence making sense or not.

      A parser: * accepts tokens as input; * consumes the tokens one at a time, while making sense they come in an order that makes sense; * produces a tree that represents the syntax of the original code.

      A compiler: * accepts a tree as input; * traverses the tree to produce bytecode operations.

      An interpreter: * accepts bytecode as input; * traverses the bytecode and performs the operation that each one represents; * uses a stack to help with the computations.

    3. Each bytecode is defined by two things: the type of bytecode operation we're dealing with (e.g., pushing things on the stack or doing an operation); and the data associated with that bytecode operation, which not all bytecode operations need.
    4. The interpreter accepts a list of bytecode operations and its method interpret will go through the list of bytecodes, interpreting one at a time.

      ``` from .compiler import Bytecode, BytecodeType

      ...

      class Interpreter: def init(self, bytecode: list[Bytecode]) -> None: self.stack = Stack() self.bytecode = bytecode self.ptr: int = 0

      def interpret(self) -> None:
          for bc in self.bytecode:
              # Interpret this bytecode operator.
              if bc.type == BytecodeType.PUSH:
                  self.stack.push(bc.value)
              elif bc.type == BytecodeType.BINOP:
                  right = self.stack.pop()
                  left = self.stack.pop()
                  if bc.value == "+":
                      result = left + right
                  elif bc.value == "-":
                      result = left - right
                  else:
                      raise RuntimeError(f"Unknown operator {bc.value}.")
                  self.stack.push(result)
      
          print("Done!")
          print(self.stack)
      

      ```

    5. The interpreter is the part of the program that is responsible for taking bytecode operations as input and using those to actually run the source code you started off with.
    6. To write our compiler, we'll just create a class with a method compile. The method compile will mimic the method parse in its structure. However, the method parse produces tree nodes and the method compile will produce bytecode operations.
    7. The compiler is the part of our program that will take a tree (an AST, to be more precise) and it will produce a sequence of instructions that are simple and easy to follow.
    8. Instead of interpreting the tree directly, we'll use a compiler to create an intermediate layer.
    9. After we have our sequence of operations (bytecodes), we will “interpret” it. To interpret the bytecode means that we go over the bytecode, sequence by sequence, and at each point we perform the simple operation that the bytecode tells us to perform.
    10. Bytecodes are just simple, atomic instructions that do one thing, and one thing only.
    11. Abstract syntax tree

      It's an abstract syntax tree because it is a tree representation that doesn't care about the original syntax we used to write the operation. It only cares about the operations we are going to perform.

    12. The parser is the part of our program that accepts a stream of tokens and makes sure they make sense.
    13. The tokenizer

      The tokenizer is the part of your program that accepts the source code and produces a linear sequence of tokens – bits of source code that you identify as being relevant.

    14. The four parts of our program
      • Tokenizer takes source code as input and produces tokens;
      • Parser takes tokens as input and produces an AST;
      • Compiler takes an AST as input and produces bytecode;
      • Interpreter takes bytecode as input and produces program results.
    1. Once an interpreter is running (remembering what I said that it is preferable to leave them running) you can share data using a channel. The channels module is also part of PEP554 and available using a secret-import:

      ``` import _xxsubinterpreters as interpreters import _xxinterpchannels as channels

      interp_id = interpreters.create(site=site) channel_id = channels.create()

      interpreters.run_string( interp_id, """ import _xxinterpchannels as channels channels.send('hello!') """, shared={ "channel_id": channel_id } )

      print(channels.recv(channel_id)) ```

    2. To share data, you can use the shared argument and provide a dictionary with shareable (int, float, bool, bytes, str, None, tuple) values:

      ``` import _xxsubinterpreters as interpreters

      interp_id = interpreters.create(site=site)

      interpreters.run_string( interp_id, "print(message)", shared={ "message": "hello world!" } )

      interpreters.run_string( interp_id, """ for message in messages: print(message) """, shared={ "messages": ("hello world!", "this", "is", "me") } )

      interpreters.destroy(interp_id) ```

    3. To start an interpreter that sticks around, you can use interpreters.create() which returns the interpreter ID. This ID can be used for subsequent .run_string calls:

      ``` import _xxsubinterpreters as interpreters

      interp_id = interpreters.create(site=site)

      interpreters.run_string(interp_id, "print('hello world')") interpreters.run_string(interp_id, "print('hello universe')")

      interpreters.destroy(interp_id) ```

    4. Starting a sub interpreter is a blocking operation, so most of the time you want to start one inside a thread.

      ``` from threading import Thread import _xxsubinterpreters as interpreters

      t = Thread(target=interpreters.run, args=("print('hello world')",)) t.start() ```

    5. You can create, run and stop a sub interpreter with the .run() function which takes a string or a simple function

      ``` import _xxsubinterpreters as interpreters

      interpreters.run(''' print("Hello World") ''') ```

    6. Inter-Worker communication

      Whether using sub interpreters or multiprocessing you cannot simply send existing Python objects to worker processes.

      Multiprocessing uses pickle by default. When you start a process or use a process pool, you can use pipes, queues and shared memory as mechanisms to sending data to/from the workers and the main process. These mechanisms revolve around pickling. Pickling is the builtin serialization library for Python that can convert most Python objects into a byte string and back into a Python object.

      Pickle is very flexible. You can serialize a lot of different types of Python objects (but not all) and Python objects can even define a method for how they can be serialized. It also handles nested objects and properties. However, with that flexibility comes a performance hit. Pickle is slow. So if you have a worker model that relies upon continuous inter-worker communication of complex pickled data you’ll likely see a bottleneck.

      Sub interpreters can accept pickled data. They also have a second mechanism called shared data. Shared data is a high-speed shared memory space that interpreters can write to and share data with other interpreters. It supports only immutable types, those are:

      • Strings
      • Byte Strings
      • Integers and Floats
      • Boolean and None
      • Tuples (and tuples of tuples)

      To share data with an interpreter, you can either set it as initialization data or you can send it through a channel.

    7. The next point when using a parallel execution model like multiprocessing or sub interpreters is how you share data.

      Once you get over the hurdle of starting one, this quickly becomes the most important point. You have two questions to answer:

      • How do we communicate between workers?
      • How do we manage the state of workers?
    8. Half of the time taken to start an interpreter is taken up running “site import”. This is a special module called site.py that lives within the Python installation. Interpreters have their own caches, their own builtins, they are effectively mini-Python processes. Starting a thread or a coroutine is so fast because it doesn’t have to do any of that work (it shares that state with the owning interpreter), but it’s bound by the lock and isn’t parallel.
    9. Both multiprocessing processes and interpreters have their own import state. This is drastically different to threads and coroutines. When you await an async function, you don’t need to worry about whether that coroutine has imported the required modules. The same applies for threads.

      For example, you can import something in your module and reference it from inside the thread function:

      ```python import threading from super.duper.module import cool_function

      def worker(info): # This already exists in the interpreter state cool_function()

      info = {'a': 1} thread = Thread(target=worker, args=(info, )) ```

    10. Another important point is that multiprocessing is often used in a model where the processes are long-running and handed lots of tasks instead of being spawned and destroyed for a single workload. One great example is Gunicorn, the popular Python web server. Gunicorn will spawn “workers” using multiprocessing and those workers will live for the lifetime of the main process. The time to start a process or a sub interpreter then becomes irrelevant (at 89 ms or 1 second) when the web worker can be running for weeks, months or years. The ideal way to use these parallel workers for small tasks (like handle a single web request) is to keep them running and use a main process to coordinate and distribute the workload
    11. What is the difference between threading, multiprocessing, and sub interpreters?

      The Python standard library has a few options for concurrent programming, depending on some factors:

      • Is the task you’re completing IO-bound (e.g. reading from a network, writing to disk)
      • Does the task require CPU-heavy work, e.g. computation
      • Can the tasks be broken into small chunks or are they large pieces of work?

      Here are the models:

      • Threads are fast to create, you can share any Python objects between them and have a small overhead. Their drawback is that Python threads are bound to the GIL of the process, so if the workload is CPU-intensive then you won’t see any performance gains. Threading is very useful for background, polling tasks like a function that waits and listens for a message on a queue.
      • Coroutines are extremely fast to create, you can share any Python objects between them and have a miniscule overhead. Coroutines are ideal for IO-based activity that has an underlying API that supports async/await.
      • Multiprocessing is a Python wrapper that creates Python processes and links them together. These processes are slow to start, so the workload that you give them needs to be large enough to see the benefit of parallelising the workload. However, they are truly parallel since each one has it’s own GIL.
      • Sub interpreters have the parallelism of multiprocessing, but with a much faster startup time.
  4. Nov 2023
    1. Rosetta is now Generally Available for all users on macOS 13 or later. It provides faster emulation of Intel-based images on Apple Silicon. To use Rosetta, see Settings. Rosetta is enabled by default on macOS 14.1 and later.

      Tested it on my side, and poetry install of one Python project took 44 seconds instead of 2 minutes 53 seconds, so it's nearly a 4x speed increase!

  5. Oct 2023
    1. Method 1: numpy.any() to check if the NumPy array is empty in Python numpy.any() method is used to test whether any array element along a given axis evaluates to True. Syntax: numpy.any(a, axis = None, out = None, keepdims = <no value>) Parameters: array: Input array whose elements need to be checked.axis: Axis along which array elements are evaluated.out: Output array having the same dimensions as Input arraykeepdmis: If this is set to True, the axes which are reduced are left in the result. Return Value: A new Boolean array (depending on the ‘out;’ parameter) 1234567import numpy as nparr = np.array([])flag = not np.any(arr)if flag:    print('Array is empty')else:    print('Array is not empty') Output: Array is empty In this example, we have used numpy.any() method to check whether the array is empty or not. As the array is empty, the value of the flag variable becomes True, and so the output ‘Array is empty’ is displayed. The limitation to this function is that it does not work if the array contains the value 0 in it.

      This is WRONG.

      numpy.any() checks if there is at least one non-zero element in an array.

    1. nserting a new chicken into a ring at some specified location in it, usu-ally first or last.2. Removing a chicken from a ring.3. Putting all the chickens of one ring, in order, into another at some speci-fied location in it, usually first or last.4. Performing some auxiliary operation on each member of a ring in eitherforward or reverse order

      In simpler terms, Sutherland's thesis is discussing the basic operations of a data structure known as a ring. A ring is a type of list where the elements are connected in a circular manner. The operations he mentions are:

      1. Inserting a new element into the ring at a specified location.
      2. Removing an element from the ring.
      3. Moving all elements of one ring, in order, into another ring at a specified location.
      4. Performing an operation on each member of a ring in either forward or reverse order.

      These operations are implemented using macro instructions in the compiler language. A macro instruction is a directive to the compiler which specifies how an input pattern should be mapped to an output pattern.

      The thesis also discusses the generation of new elements. Subroutines are used to set up new elements in free spaces in the storage structure. When parts of the drawing are deleted, the registers representing them become free and are placed in a 'FREES' ring. New components are set up at the end of the storage area, while free blocks are allowed to accumulate. A process called 'garbage collection' periodically compacts the storage structure by removing the free blocks and relocating the information above them.

      In Python, the ring data structure can be implemented using a doubly linked list. Here's a simple example:

      ```python class Node: def init(self, data): self.data = data self.next = None self.prev = None

      class Ring: def init(self): self.head = None

      def append(self, data):
          if not self.head:
              self.head = Node(data)
              self.head.next = self.head
              self.head.prev = self.head
          else:
              new_node = Node(data)
              new_node.prev = self.head.prev
              new_node.next = self.head
              self.head.prev.next = new_node
              self.head.prev = new_node
      
      def display(self):
          temp = self.head
          while True:
              print(temp.data, end = " ")
              temp = temp.next
              if temp == self.head:
                  break
      

      ```

      In this example, the Node class represents an element in the ring, and the Ring class represents the ring itself. The append method is used to add a new element to the ring, and the display method is used to print all elements in the ring.

    2. RING STRUCTURE

      Ivan writes about a concept called "Ring Structure". This is a way of organizing and linking different elements or components in a system.

      In simpler terms, imagine you have a bunch of different objects (like points and lines in a drawing program like Sketchpad). You want to keep track of how these objects are related to each other. For example, you might want to know all the lines that end at a particular point.

      To do this, Ivan uses a "ring structure". Each object has a "string of pointers" - basically a list of references to other objects. This list is circular - the last item in the list points back to the first item. This makes it easy to move forwards and backwards through the list.

      Each object has two "registers" or slots for keeping track of these relationships. One slot is for the object itself, and the other is for the list of related objects.

      Ivan uses the terms "hen" and "chicken" to describe these slots. The "hen" is the object itself, and the "chicken" is the list of related objects.

      Here's a simple Python code example to illustrate this concept:

      ```python class Point: def init(self): self.hen = self self.chickens = []

      class Line: def init(self, point1, point2): self.hen = self self.chickens = [point1, point2] point1.chickens.append(self) point2.chickens.append(self) ```

      In this example, a Point object has a hen that refers to itself and a list of chickens that will contain any Line objects that end at this point. When a Line is created, it adds itself to the chickens list of its end points.

      The "ring structure" is a way to organize and link different elements in a system, making it easier to find and update related elements.

    3. MNEMONICS AND CONVENTIONS

      Mnemonics for Registers: Instead of remembering numerical indices, we use human-readable keys.

      python point = {'TYPE': 'Point', 'PVAL_X': 5, 'PVAL_Y': 10}

      Flexibility: If we want to change the internal structure, we can easily do so by changing the keys.

      ```python

      Changing 'PVAL_X' to 'X_COORD'

      point = {'TYPE': 'Point', 'X_COORD': 5, 'PVAL_Y': 10} ```

      Conventions:

      • The first component ('TYPE') indicates the type of element.
      • Numerical information like coordinates is stored at the end.

      python line = {'TYPE': 'Line', 'START': 'point1', 'END': 'point2', 'LENGTH': 7.2}

      Pointers and Topology: We can use pointers (references) to other elements to establish relationships.

      python point1 = {'TYPE': 'Point', 'PVAL_X': 1, 'PVAL_Y': 1} point2 = {'TYPE': 'Point', 'PVAL_X': 4, 'PVAL_Y': 5} line = {'TYPE': 'Line', 'START': point1, 'END': point2}

      Relocation: If we move point1 to a new variable, we update the pointer in line.

      python new_point1 = point1 # Relocating point1 to new_point1 line['START'] = new_point1 # Updating the pointer

      Segregation of Data: Numerical data is at the end, so if we need to move elements, the numerical data remains untouched.

      ```python

      Even if we relocate, the numerical data ('PVAL_X' and 'PVAL_Y') remains the same.

      ```

  6. Sep 2023
    1. Configuring PyCharm: Open PyCharm with ‘Pytest Web Framework’ Press Ctrl+Alt+S > Project Click ‘Project Interpreter’ Select Python 3.6 Click ‘OK’ Go to write over 100500 automated tests!!!

      This section provides a step-by-step guide on setting up PyCharm for automated testing using the 'Pytest Web Framework'.

  7. Aug 2023
  8. Jul 2023
    1. cat requirements.txt | grep -E '^[^# ]' | cut -d= -f1 | xargs -n 1 poetry add

      Use poetry init to create a sample pyproject.toml, and then trigger this line to export requirements.txt into a pyproject.toml

    1. As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as "select the first two rows of arr2d."

      Slices follow a similar logic than indexing in NumPy array's. array[:2] selects a range of elements along a single axis,, but array[:2, 1:] does it along two axis.

    1. You might want to suppress only ValueError, since a TypeError (the input was not a string or numeric value) might indicate a legitimate bug in your program. To do that, write the exception type after except: def attempt_float(x): try: return float(x) except ValueError: return x
    2. Since generators produce output one element at a time versus an entire list all at once, it can help your program use less memory.
    3. It is not until you request elements from the generator that it begins executing its code:

      A generator is a function-like iterator object.

    4. A generator is a convenient way, similar to writing a normal function, to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. To create a generator, use the yield keyword instead of return in a function:
    5. In this case, return_value would be a 3-tuple with the three returned variables. A potentially attractive alternative to returning multiple values like before might be to return a dictionary instead:

      Returning multiple values in Python is expressed as a tuple by default and each value is correspondingly assigned. Optionally, you can return a dictionary if specified.

    6. The for parts of the list comprehension are arranged according to the order of nesting, and any filter condition is put at the end as before.

      Nested list comprehensions follow the same logic as nested for loops. The difference strives that in list comprehensions the filtered variable is mentioned twice.

    1. python -m calendar

      So surprised that you can output a calendar view using Python

    2. python -m site, which outputs useful information about your installation

      python -m site <--- see useful information about your Python installation

    1. ```python from flask import Flask, request from collections import defaultdict import re import random

      GREEN ="🟩" YELLOW ="🟨" WHITE ="⬜"

      def get_answers(): with open("allowed_answers.txt") as f: answers = set(l for l in f.read().splitlines() if l) return answers

      def get_guesses(): guesses = get_answers() with open("allowed_guesses.txt") as f: for l in f.read().splitlines(): if l: guesses.add(l) return guesses

      app = Flask(name, static_folder="static") app.answers = get_answers() app.guesses = get_guesses() word = random.choice(list(app.answers)) print(f"The word is {word}")

      def with_header(content): return f"""

      <html> <head> <link rel="search" type="application/opensearchdescription+xml" title="searchGame" href="http://searchgame:5000/static/opensearch.xml" /> </head> <body> {content} </body></html>

      """

      @app.route("/") def home(): return with_header("

      Right click on the address bar to install the search engine.

      ")

      @app.route("/search") def search(): return with_header(f"Content: {request.args.get('q')}")

      def to_result(guess, answer): chars = [WHITE] * 5 count = defaultdict(int) for idx, (g, a) in enumerate(zip(guess, answer)): if g == a: chars[idx] = GREEN else: count[a] += 1

      for idx, g in enumerate(guess):
          if g in count and count[g] > 0 and chars[idx] == WHITE:
              chars[idx] = YELLOW
              count[g] -= 1
      
      return "".join(chars)
      

      def maybe_error(guess): if len(guess) < 5: return f"less than 5 characters" if len(guess) > 5: return f"greater than 5 characters" if guess not in app.guesses: return f"not in wordlist" return None

      @app.route("/game") def game(): query = request.args.get("q") guesses = [x for x in re.split("[. ]", query) if x] response = [] if not guesses: response.append("Enter 5-letter guesses separated by spaces") else: most_recent = guesses[-1] # Don't show "too short" error for most recent guess if len(most_recent) < 5: guesses = guesses[:-1] if not guesses: response.append("Enter a wordle guess") for guess in guesses[::-1]: error = maybe_error(guess) if error is None: result = to_result(guess, word) s = f"{guess} | {result}" if result == GREEN * 5: s = f"{s} | CORRECT!" response.append(s) else: response.append(f"{guess} | ERROR: {error}")

      return [query, response]
      

      ```

    1. o con

      حالا می خواد بره با Axios وصلش کنه

    2. e

      حالا رفت سراغ Vue

    3. requests

      در واقع تعیین می کنه که درخواست ها از چه پروتکلی، با چه IP یا Domain Name و رو چه Port می تونی جواب بدی. خیلی خووبه براش تعیین کنی تو فقط باید از این IP و Port که در Front تعیین کردی استفاده کنی

    4. Flask-CORS

      تازه رفت سراغ Flask-origin چقد جاالب.

    5. pyth

      اول یه Virtual Environment می سازه

    1. For a new project, I’d just immediately start with Ruff; for existing projects, I would strongly recommend trying it as soon as you start getting annoyed about how long linting is taking in CI (or even worse, on your computer).

      Recommendation for when to use Ruff over PyLint or Flake8

  9. Jun 2023
    1. ```python def split_user(userid): """ Return the user and domain parts from the given user id as a dict.

      For example if userid is u'acct:seanh@hypothes.is' then return
      {'username': u'seanh', 'domain': u'hypothes.is'}'
      
      :raises InvalidUserId: if the given userid isn't a valid userid
      
      """
      match = re.match(r"^acct:([^@]+)@(.*)$", userid)
      if match:
          return {"username": match.groups()[0], "domain": match.groups()[1]}
      raise InvalidUserId(userid)
      

      ```

    1. ```html

      <body> <div> {% for chat in chats %} <div>{{ chat.contents }}</div> <button id={{chat.id}} ✅ onClick=getId(id) ✅ > print this chat_id out </button> {% endfor %} </div> ... <script> function getId(id){ console.log(id) } </script> </body>

      ```

    1. Examples of frontends include: pip, build, poetry, hatch, pdm, flit Examples of backends include: setuptools (>=61), poetry-core, hatchling, pdm-backend, flit-core

      Frontend and backend examples of Python's build backends

    2. pyproject.toml-based builds are the future, and they promote better practices for reliable package builds and installs. You should prefer to use them!

      setup.py is considered a "legacy" functionality these days

    3. Did you say setuptools? Yes! You may be familiar with setuptools as the thing that used your setup.py files to build packages. Setuptools now also fully supports pyproject.toml-based builds since version 61.0.0. You can do everything in pyproject.toml and no longer need setup.py or setup.cfg.

      setuptools can now utilize pyproject.toml

  10. May 2023
    1. With this dataclass, I have an explicit description of what the function returns.

      Dataclasses give you a lot more clarity of what the function returns, in comparison to returning tuples or dictionaries

    1. Host machine: docker run -it -p 8888:8888 image:version Inside the Container : jupyter notebook --ip 0.0.0.0 --no-browser --allow-root Host machine access this url : localhost:8888/tree‌

      3 ways of running jupyter notebook in a container

    1. How can I add, subtract, and compare binary numbers in Python without converting to decimal?

      I think the requirements of this were not spelled out well. After reading this over a couple of times, I think the problem should be…

      "Add, subtract, and compare binary numbers in Python as strings, without converting them to decimal."

      I'll take on that problem sometime when I get free time!

    1. 'handlers': { 'console': { 'level': 'INFO', 'class': 'logging.StreamHandler', 'stream': sys.stdout, 'formatter': 'verbose' }, },

      It's as simple as adding "sys.stdout" to the "stream" attribute.

  11. Apr 2023
    1. If you install a package with pip’s --user option, all its files will be installed in the .local directory of the current user’s home directory.

      One of the recommendations for Python multi-stage Docker builds. Thanks to pip install --user, the packages won't be spread across 3 different paths.

  12. Mar 2023
    1. Honestly, all the activation scripts do are:

      See the 4 steps below to understand what activating an environment in Python really does

    1. Using pex in combination with S3 for storing the pex files, we built a system where the fast path avoids the overhead of building and launching Docker images.Our system works like this: when you commit code to GitHub, the GitHub action either does a full build or a fast build depending on if your dependencies have changed since the previous deploy. We keep track of the set of dependencies specified in setup.py and requirements.txt.For a full build, we build your project dependencies into a deps.pex file and your code into a source.pex file. Both are uploaded to Dagster cloud. For a fast build we only build and upload the source.pex file.In Dagster Cloud, we may reuse an existing container or provision a new container as the code server. We download the deps.pex and source.pex files onto this code server and use them to run your code in an isolated environment.

      Fast vs full deployments

    1. Well, in short, with iterators, the flow of information is one-way only. When you have an iterator, all you can really do call the __next__ method to get the very next value to be yielded. In contrast, the flow of information with generators is bidirectional: you can send information back into the generator via the send method.
      • Iterator ← one-way communication (can only yield stuff)
      • Generator ← bidirectional communication (can also accept information via the send method)
    1. PythonCopyconfigs = {"fs.azure.account.auth.type": "OAuth", "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider", "fs.azure.account.oauth2.client.id": "<application-id>", "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope-name>",key="<service-credential-key-name>"), "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"} # Optionally, you can add <directory-name> to the source URI of your mount point. dbutils.fs.mount( source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/", mount_point = "/mnt/<mount-name>", extra_configs = configs)
    1. So why aren't more people using Nim? I don't know! It's the closest thing to a perfect language that I've used by far.

      Nim sounds as the most ideal language when comparing to Python, Rust, Julia, C#, Swift, C

    1. depending on how smart the framework is, you might find yourself installing Conda packages over and over again on every run. This is inefficient, even when using a faster installer like Mamba.
    2. there’s the bootstrapping problem: depending on the framework you’re using, you might need to install Conda and the framework driver before you can get anything going. A Docker image would come prepackaged with both, in addition to your code and its dependencies. So even if your framework supports Conda directly, you might want to use Docker anyway.
    3. The only thing that will depend on the host operating system is glibc, pretty much everything else will be packaged by Conda. So a pinned environment.yml or conda-lock.yml file is a reasonable alternative to a Docker image as far as having consistent dependencies.

      Conda can be a sufficient alternative to Docker

    4. To summarize, for the use case of Python development environments, here’s how you might choose alternatives to Docker:

      (see table below)

    5. Conda packages everything but the standard C library, from C libraries to the Python interpreter to command-line tools to compilers.
    1. When you call 'foo' in Ruby, what you're actually doing is sending a message to its owner: "please call your method 'foo'". You just can't get a direct hold on functions in Ruby in the way you can in Python; they're slippery and elusive. You can only see them as though shadows on a cave wall; you can only reference them through strings/symbols that happen to be their name. Try and think of every method call 'object.foo(args)' you do in Ruby as the equivalent of this in Python: 'object.getattribute('foo')(args)'.
    2. def document(f): def wrap(x): print "I am going to square", x f(x) return wrap @document def square(x): print math.pow(x, 2) square(5)
    1. Post involved with finding a good jupyter notebook visualization lib for DAGs.

  13. Feb 2023
    1. What object-oriented means

      What does the object-oriented means? Objects are models of somethings that can do certain things and have certain things done to them. Formally, an object is a collection of data and associated behaviors.

    2. The difference between object-oriented design and object-orientedprogramming

      What is the design and programming mean in OOP?

    Tags

    Annotators