46 Matching Annotations
  1. Oct 2024
    1. if (dtc->wb_thresh < 2 * wb_stat_error()) { wb_reclaimable = wb_stat_sum(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat_sum(wb, WB_WRITEBACK); } else { wb_reclaimable = wb_stat(wb, WB_RECLAIMABLE); dtc->wb_dirty = wb_reclaimable + wb_stat(wb, WB_WRITEBACK); }

      This is a configuration policy: when the dirty threshold for the writeback context is below twice the maximal error of a per-CPU stat counter, the kernel switches from the cheap approximate counters to the exact (but more expensive) wb_stat_sum() when counting reclaimable and writeback pages, because at that scale the counter error would dominate the result.
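
      A hedged sketch of why that cutoff exists (generic per-CPU counter reasoning; the exact batch constant used by the writeback stats is an assumption here):

      /*
       * Each CPU can hold up to one batch of not-yet-folded-in updates, so an
       * approximate read of a per-CPU counter can be off by roughly
       * nr_cpus * batch.  When the threshold itself is within 2x of that
       * error, the cheap approximate wb_stat() read is meaningless and the
       * exact (but slower) wb_stat_sum() walk over all CPUs is used instead.
       */
      unsigned long approx_stat_error = num_possible_cpus() * percpu_counter_batch;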

    2. static long wb_min_pause(struct bdi_writeback *wb, long max_pause, unsigned long task_ratelimit, unsigned long dirty_ratelimit, int *nr_dirtied_pause

      This function is an algorithmic policy that uses heuristics to determine the minimum time a process is throttled between consecutive rounds of dirtying pages. It balances the load on the I/O subsystem so that dirtier tasks do not generate work faster than the devices can absorb, which would otherwise hurt overall system performance.
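
      A conceptual sketch only (the real wb_min_pause() adds further smoothing and clamping; the names below are illustrative, not the kernel's):

      static long min_pause_sketch(long nr_dirtied, unsigned long task_ratelimit,
                                   long max_pause)
      {
              /*
               * Size the pause so the pages this task just dirtied are "paid
               * for" at its allowed ratelimit:
               *     pause (jiffies) ~= HZ * nr_dirtied / task_ratelimit
               */
              long pause = HZ * nr_dirtied / (long)(task_ratelimit + 1);

              /* long enough to throttle, short enough to stay responsive */
              return clamp(pause, 1L, max_pause);
      }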

    3. if (!laptop_mode && nr_reclaimable > gdtc->bg_thresh && !writeback_in_progress(wb)) wb_start_background_writeback(wb);

      This is a configuration policy that determines whether to start background writeout. If laptop_mode (which defers disk activity to save power) is not set, and the number of reclaimable dirty pages exceeds the bg_thresh background threshold, the kernel kicks off background writeback, provided none is already in progress.

    4. if (thresh > dirty) return 1UL << (ilog2(thresh - dirty) >> 1);

      This implements a configuration policy that determines how many more pages a task may dirty before it must re-check the dirty thresholds; the interval grows roughly with the square root of the remaining headroom below the threshold, so tasks far from the limit are checked rarely and tasks near it are checked often.
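
      A minimal sketch of the heuristic (the return value of 1 for the over-threshold case is assumed from the surrounding context):

      static unsigned long dirty_poll_interval_sketch(unsigned long dirty,
                                                      unsigned long thresh)
      {
              if (thresh > dirty)
                      /* e.g. 4096 pages of headroom -> 1 << (12 >> 1) = 64 pages */
                      return 1UL << (ilog2(thresh - dirty) >> 1);

              /* at or above the threshold: re-check after every page */
              return 1;
      }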

    5. limit -= (limit - thresh) >> 5;

      This is a configuration policy that determines by how much the limit is adjusted on each update. The limit caps the amount of dirty memory allowed in the system, and this line walks it toward the current threshold by 1/32 of the remaining gap at a time.
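
      A hedged sketch of the decay, simplified from the surrounding update logic: each refresh moves the limit a 1/32 step toward the threshold instead of snapping to it.

      static void update_dirty_limit_sketch(unsigned long *limit, unsigned long thresh)
      {
              if (*limit < thresh) {
                      /* raise immediately so throttling never bites below thresh */
                      *limit = thresh;
                      return;
              }

              /* decay slowly: drop by 1/32 (~3%) of the remaining gap per update */
              *limit -= (*limit - thresh) >> 5;
      }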

    6. if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && (!mdtc || m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { unsigned long intv; unsigned long m_intv; free_running: intv = dirty_poll_interval(dirty, thresh); m_intv = ULONG_MAX; current->dirty_paused_when = now; current->nr_dirtied = 0; if (mdtc) m_intv = dirty_poll_interval(m_dirty, m_thresh); current->nr_dirtied_pause = min(intv, m_intv); break; } /* Start writeback even when in laptop mode */ if (unlikely(!writeback_in_progress(wb))) wb_start_background_writeback(wb); mem_cgroup_flush_foreign(wb); /* * Calculate global domain's pos_ratio and select the * global dtc by default. */ if (!strictlimit) { wb_dirty_limits(gdtc); if ((current->flags & PF_LOCAL_THROTTLE) && gdtc->wb_dirty < dirty_freerun_ceiling(gdtc->wb_thresh, gdtc->wb_bg_thresh)) /* * LOCAL_THROTTLE tasks must not be throttled * when below the per-wb freerun ceiling. */ goto free_running; } dirty_exceeded = (gdtc->wb_dirty > gdtc->wb_thresh) && ((gdtc->dirty > gdtc->thresh) || strictlimit); wb_position_ratio(gdtc); sdtc = gdtc; if (mdtc) { /* * If memcg domain is in effect, calculate its * pos_ratio. @wb should satisfy constraints from * both global and memcg domains. Choose the one * w/ lower pos_ratio. */ if (!strictlimit) { wb_dirty_limits(mdtc); if ((current->flags & PF_LOCAL_THROTTLE) && mdtc->wb_dirty < dirty_freerun_ceiling(mdtc->wb_thresh, mdtc->wb_bg_thresh)) /* * LOCAL_THROTTLE tasks must not be * throttled when below the per-wb * freerun ceiling. */ goto free_running; } dirty_exceeded |= (mdtc->wb_dirty > mdtc->wb_thresh) && ((mdtc->dirty > mdtc->thresh) || strictlimit); wb_position_ratio(mdtc); if (mdtc->pos_ratio < gdtc->pos_ratio) sdtc = mdtc; }

      This is an algorithmic policy that decides whether the process may keep running freely or must be throttled to control the writeback rate, by checking whether the number of dirty pages is still below the freerun ceiling (the average of the threshold and the background threshold), for both the global domain and, if present, the memcg domain.

    7. shift = dirty_ratelimit / (2 * step + 1); if (shift < BITS_PER_LONG) step = DIV_ROUND_UP(step >> shift, 8); else step = 0; if (dirty_ratelimit < balanced_dirty_ratelimit) dirty_ratelimit += step; else dirty_ratelimit -= step;

      This is a configuration policy that determines how much to increase or decrease dirty_ratelimit, the rate at which tasks are allowed to dirty pages so that dirtying stays matched to how fast the storage can write pages back.
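
      A worked example of the damping (illustrative numbers): with dirty_ratelimit = 1024 and a proposed step of 100, shift = 1024 / (2*100 + 1) = 5, so the applied step becomes DIV_ROUND_UP(100 >> 5, 8) = DIV_ROUND_UP(3, 8) = 1. The closer the rate already is to balanced_dirty_ratelimit, the larger the shift and the smaller the nudge, which keeps the rate from oscillating.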

    8. ratelimit_pages = dirty_thresh / (num_online_cpus() * 32); if (ratelimit_pages < 16) ratelimit_pages = 16;

      This is a configuration policy that dynamically determines how many pages a task may dirty before it is forced through the dirty-limit check; it scales inversely with the number of online CPUs and is floored at 16 pages.
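
      A hedged sketch with a worked example (the function name and the dirty_thresh value are illustrative):

      static unsigned long set_ratelimit_sketch(unsigned long dirty_thresh)
      {
              /* e.g. dirty_thresh = 32768 pages and 8 CPUs: 32768 / (8 * 32) = 128 */
              unsigned long ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);

              /* floor: even with huge CPU counts, still batch at least 16 pages */
              return ratelimit_pages < 16 ? 16 : ratelimit_pages;
      }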

    9. t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));

      This implements a configuration policy that determines the maximum time a throttled task may pause between checks of its dirty state. The cap scales with the device's write bandwidth, which keeps pauses short enough that dirty pages are still flushed to disk within a reasonable time frame, limiting the risk of data loss after a system crash.
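
      A hedged sketch with illustrative numbers (assuming bw is the device's average write bandwidth in pages per second, as the writeback code tracks it):

      static long max_pause_sketch(unsigned long wb_dirty, unsigned long bw)
      {
              /*
               * e.g. wb_dirty = 12000 pages, bw = 24000 pages/s, HZ = 1000:
               * roundup_pow_of_two(1 + HZ / 8) = 128, so
               * t = 12000 / (1 + 24000 / 128) = 63 jiffies (~63 ms).
               */
              long t = wb_dirty / (1 + bw / roundup_pow_of_two(1 + HZ / 8));

              /* the kernel additionally clamps the result to a hard maximum */
              return t + 1;
      }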

    10. if (IS_ENABLED(CONFIG_CGROUP_WRITEBACK) && mdtc) {

      This is a configuration policy that controls whether the memcg domain's limits are updated as well. CONFIG_CGROUP_WRITEBACK enables per-cgroup control over the writeback of dirty pages, which allows better resource management and improved performance.

    1. if (si->cluster_info) { if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) goto scan; } else if (unlikely(!si->cluster_nr--)) {

      Algorithmic policy decision: if the device is an SSD, use the wear-leveling-friendly algorithm; otherwise use the HDD algorithm, which minimizes seek times. Certain access patterns may make one of these algorithms less effective (e.g., with wear leveling, when the swap is constantly full).

    2. scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; /* Locate the first empty (unaligned) cluster */ for (; last_in_cluster <= si->highest_bit; offset++) { if (si->swap_map[offset]) last_in_cluster = offset + SWAPFILE_CLUSTER; else if (offset == last_in_cluster) { spin_lock(&si->lock); offset -= SWAPFILE_CLUSTER - 1; si->cluster_next = offset; si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; } if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; } }

      Here (when using HDDs), a policy is implemented that places the swapped page in the first available slot. This is supposed to reduce seek time on spinning drives, as it encourages keeping swap entries near each other.

    3. /* * Even if there's no free clusters available (fragmented), * try to scan a little more quickly with lock held unless we * have scanned too many slots already. */ if (!scanned_many) { unsigned long scan_limit; if (offset < scan_base) scan_limit = scan_base; else scan_limit = si->highest_bit; for (; offset <= scan_limit && --latency_ration > 0; offset++) { if (!si->swap_map[offset]) goto checks; } }

      Here we have a configuration policy where we do another smaller scan as long as we haven't exhausted our latency_ration. Another alternative could be yielding early in anticipation that we aren't going to find a free slot.

    4. /* * select a random position to start with to help wear leveling * SSD */ for_each_possible_cpu(cpu) { per_cpu(*p->cluster_next_cpu, cpu) = get_random_u32_inclusive(1, p->highest_bit); }

      An algorithmic (random) policy is used here to spread swap pages around an SSD to help with wear leveling instead of writing to the same area of an SSD often.

    5. cluster_list_add_tail(&si->discard_clusters, si->cluster_info, idx);

      A policy decision is made to add a cluster to the discard list on a first-come, first-served basis. However, this approach could be enhanced by prioritizing certain clusters higher on the list based on their 'importance.' This 'importance' could be defined by how closely a cluster is related to other pages in the swap. By doing so, the system can reduce seek time as mentioned on line 817.

    6. if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; } if (swap_offset_available_and_locked(si, offset)) goto checks; } offset = si->lowest_bit; while (offset < scan_base) { if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; scanned_many = true; }

      Here, a policy decision is made to yield back to the scheduler once latency_ration is exhausted and then fully replenish it with LATENCY_LIMIT, so that when the task is scheduled again it has the full LATENCY_LIMIT for the next stretch of scanning. Alternative policies could grow or shrink the budget as a better heuristic instead of fully replenishing it each time.

      Marked config/value as we're replacing latency_ration with a compile-time-defined limit.

    7. while (scan_swap_map_ssd_cluster_conflict(si, offset)) { /* take a break if we already got some slots */ if (n_ret) goto done; if (!scan_swap_map_try_ssd_cluster(si, &offset, &scan_base)) goto scan;

      Here, a policy decision is made to stop scanning if some slots were already found. Other policy decisions could be made to keep scanning or take into account how long the scan took or how many pages were found.

    8. else if (!cluster_list_empty(&si->discard_clusters)) { /* * we don't have free cluster but have some clusters in * discarding, do discard now and reclaim them, then * reread cluster_next_cpu since we dropped si->lock */ swap_do_scheduled_discard(si); *scan_base = this_cpu_read(*si->cluster_next_cpu); *offset = *scan_base; goto new_cluster;

      This algorithmic policy discards + reclaims pages as-needed whenever there is no free cluster. Other policies could be explored that do this preemptively in order to avoid the cost of doing it here.

    9. #ifdef CONFIG_THP_SWAP

      This is a build-time flag that configures how hugepages are handled when swapped. When it is defined, hugepages are swapped out in one piece; without it, they are split into smaller units and those units are swapped individually.

    10. if (swap_flags & SWAP_FLAG_DISCARD_ONCE) p->flags &= ~SWP_PAGE_DISCARD; else if (swap_flags & SWAP_FLAG_DISCARD_PAGES) p->flags &= ~SWP_AREA_DISCARD

      This is a configuration policy decision where a sysadmin can pass flags to sys_swapon() to control how discards are handled. If DISCARD_ONCE is set, the flag for per-page-cluster discards after use is cleared, leaving only the single discard of the swap area at swapon time; if DISCARD_PAGES is set, the flag for the swapon-time area discard is cleared, leaving only page-cluster discards after use.

    1. randomize_stack_top

      This function uses a configuration policy to enable Address Space Layout Randomization (ASLR) for a specific process if the PF_RANDOMIZE flag is set. It randomizes the position of the process's stack to help defend against certain attacks by making memory addresses unpredictable.
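
      A simplified sketch of the randomization (the downward-growing-stack case only; the masking constants follow the usual arch-specific limits):

      static unsigned long randomize_stack_top_sketch(unsigned long stack_top)
      {
              unsigned long random_offset = 0;

              if (current->flags & PF_RANDOMIZE) {
                      /* keep a bounded number of random bits, at page granularity */
                      random_offset = get_random_long() & STACK_RND_MASK;
                      random_offset <<= PAGE_SHIFT;
              }

              /* stack grows down: pull the top a random distance lower */
              return PAGE_ALIGN(stack_top) - random_offset;
      }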

    1. if (prev_class) { if (can_merge(prev_class, pages_per_zspage, objs_per_zspage)) { pool->size_class[i] = prev_class; continue; } }

      This is an algorithmic policy. A zs_pool maintains zspages of different size_classes. However, some size_classes share exactly the same characteristics, namely pages_per_zspage and objs_per_zspage. Recall my other annotation: free zspages are looked up by size_class via zspage = find_get_zspage(class). Thus, merging classes with identical characteristics improves memory utilization.
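
      A hedged sketch of the merge test implied by the quoted condition: two size classes can share one struct size_class exactly when both packing characteristics match.

      static bool can_merge_sketch(struct size_class *prev, int pages_per_zspage,
                                   int objs_per_zspage)
      {
              /* identical packing means identical fullness behaviour, so share it */
              return prev->pages_per_zspage == pages_per_zspage &&
                     prev->objs_per_zspage == objs_per_zspage;
      }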

    2. zspage = find_get_zspage(class); if (likely(zspage)) { obj = obj_malloc(pool, zspage, handle); /* Now move the zspage to another fullness group, if required */ fix_fullness_group(class, zspage); record_obj(handle, obj); class_stat_inc(class, ZS_OBJS_INUSE, 1); goto out; }

      This is an algorithmic policy. Instead of immediately allocating new zspages for each memory request, the algorithm first attempts to find and reuse existing partially filled zspages from a given size class by invoking the find_get_zspage(class) function. It also updates the corresponding fullness groups.

    1. 1

      This is a configuration policy that sets the timeout between retries if vmap_pages_range() fails. This could be a tunable variable.

    2. if (!(flags & VM_ALLOC)) area->addr = kasan_unpoison_vmalloc(area->addr, requested_size, KASAN_VMALLOC_PROT_NORMAL);

      This is an algorithmic policy. It is an optimization that avoids marking accessibility twice: only mappings created without VM_ALLOC (e.g. ioremap) have not already been marked accessible (unpoisoned), so they require the explicit kasan_unpoison_vmalloc() call here.

    3. 100U

      This is a configuration policy that caps a single bulk-allocator request at 100 pages. However, alloc_pages_bulk_array_mempolicy() does not itself enforce such a limit, so I believe this is really an algorithmic policy tied to some optimization (perhaps bounding how long any single bulk call can run).

    4. if (!order) {

      This is an algorithmic policy that uses the bulk allocator only for order-0 pages (non-huge pages). Perhaps the bulk allocator could also be applied to higher-order pages to speed up allocation; so far I have not seen a reason why it could not be.

    5. if (likely(count <= VMAP_MAX_ALLOC)) { mem = vb_alloc(size, GFP_KERNEL); if (IS_ERR(mem)) return NULL; addr = (unsigned long)mem; } else { struct vmap_area *va; va = alloc_vmap_area(size, PAGE_SIZE, VMALLOC_START, VMALLOC_END, node, GFP_KERNEL, VMAP_RAM); if (IS_ERR(va)) return NULL; addr = va->va_start; mem = (void *)addr; }

      This is an algorithmic policy that decides whether to use the more efficient vb_alloc(), which avoids the heavier vmap-area machinery. Unlike alloc_vmap_area(), vb_alloc() does not need to traverse the rb-tree of free vmap areas; it simply finds a large enough block in the per-CPU vmap_block_queue.

    6. VMAP_PURGE_THRESHOLD

      The threshold VMAP_PURGE_THRESHOLD is a configuration policy that could be tuned by machine learning. Setting the threshold lower reduces purging activity, while setting it higher reduces fragmentation.

    7. if (!(force_purge

      This is an algorithmic policy that prevents purging blocks that still have a considerable amount of usable memory unless force_purge is requested.

    8. resched_threshold = lazy_max_pages() << 1;

      The assignment of resched_threshold (together with lines 1776-1777) is a configuration policy that determines below how many remaining lazily-freed pages the loop should temporarily yield the CPU to higher-priority tasks.

    9. log = fls(num_online_cpus());

      This heuristic scales lazy_max_pages logarithmically with the CPU count, which is a configuration policy (see the sketch under the next annotation). Alternatively, machine learning could determine the optimal scaling function, whether linear, logarithmic, square-root, or something else.

    10. 32UL * 1024 * 1024

      This is a configuration policy that always returns a multiple of 32 MB worth of pages. This could be a tunable variable rather than a fixed magic number.
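
      A hedged sketch combining this annotation with the previous one (the fls()-based scaling): the lazy-free budget is a multiple of 32 MB that grows with the logarithm of the online CPU count.

      static unsigned long lazy_max_pages_sketch(void)
      {
              unsigned int log = fls(num_online_cpus());      /* e.g. 8 CPUs -> 4 */

              /* e.g. 4 * 32 MB worth of pages may stay lazily unmapped */
              return log * (32UL * 1024 * 1024 / PAGE_SIZE);
      }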

  2. Sep 2024
    1. if (lruvec->file_cost + lruvec->anon_cost > lrusize / 4) { lruvec->file_cost /= 2; lruvec->anon_cost /= 2; }

      It is a configuration policy. The policy adjusts the accumulated reclaim cost of the pages on the LRU lists; the cost represents the overhead the kernel would incur if such pages were evicted and had to be brought back. The kernel uses a decay scheme: whenever the combined cost exceeds lrusize / 4, both file_cost and anon_cost are halved. Without this decay, the costs would only grow over a long run, and the kernel might keep favoring pages that were frequently visited long ago but are inactive now, causing performance degradation. The value lrusize / 4 is a trade-off between performance and sensitivity: if it is too small, the kernel re-weights pages too frequently and processes suffer; if it is too large, the kernel can be misled by stale history and wrongly swap out currently popular pages, again degrading performance.

    2. if (megs < 16) page_cluster = 2; else page_cluster = 3;

      It is a configuration policy that determines the size of the page cluster; the values "2" and "3" are chosen externally to the kernel. The page cluster is the unit actually used when swapping: on a small-memory machine the kernel swaps 4 (2^2) pages at a time, otherwise 8 (2^3). The rationale is to avoid extra memory pressure on small-memory systems while improving throughput on large-memory systems.
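
      A minimal sketch of the setup and its effect (the readahead relationship is the assumed consumer of page_cluster in the swap-in path):

      static void swap_setup_sketch(void)
      {
              unsigned long megs = totalram_pages() >> (20 - PAGE_SHIFT);

              /* tiny machines swap 2^2 = 4 pages at a time, others 2^3 = 8 */
              page_cluster = (megs < 16) ? 2 : 3;
      }

      /* swap readahead around a fault then covers about 1 << page_cluster pages */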

    1. mod_memcg_page_state(area->pages[i], MEMCG_VMALLOC, 1);

      It is a configuration policy. It is used to account the physical page to the memory control group. The policy here is that one physical page corresponds to one count charged to the memory control group; if huge pages were used, the increment here might not be 1.

    2. schedule_timeout_uninterruptible(1);

      It is a configuration policy. If the kernel cannot allocate a large enough, properly aligned virtual space while nofail disallows failure, it puts the process to sleep for 1 timer tick; the sleep is uninterruptible, so the task leaves the CPU run queue and cannot be woken early by signals. The "1" here is a configuration parameter made externally to the kernel: a compromise that is not so large that it hurts latency-sensitive processes, yet not so small that the allocation retries too frequently.

    3. if (array_size > PAGE_SIZE) { area->pages = __vmalloc_node(array_size, 1, nested_gfp, node, area->caller); } else { area->pages = kmalloc_node(array_size, nested_gfp, node); }

      It is an algorithmic policy and also a configuration policy. The algorithmic policy is to keep the page-pointer array in contiguous virtual pages to preserve data locality: if the demanded size is larger than one page, the kernel allocates multiple contiguous virtual pages for it via __vmalloc_node(); otherwise it calls kmalloc_node() to place it within a single page. The configuration part is that, to allocate those contiguous pages, __vmalloc_node() is invoked with align = 1.

    4. if (!counters) return; if (v->flags & VM_UNINITIALIZED) return;

      It is an algorithmic policy. show_numa_info() prints how the virtual memory of v (a vm_struct) distributes its pages across NUMA nodes. The two if conditions filter out cases that cannot be reported: the first bails out when no counter buffer is available to accumulate the per-node counts, and the second skips areas whose initialization has not completed yet.

    5. if (page) copied = copy_page_to_iter_nofault(page, offset, length, iter); else copied = zero_iter(iter, length);

      It is an algorithmic policy. The policy here is to avoid lock/unlock overhead by using copy_page_to_iter_nofault() together with an if/else branch. Because copy_page_to_iter_nofault() works without taking page faults, the kernel must first check that the physical page is still alive, which is what the if condition guarantees. On the other hand, if the page is not valid, the kernel fills the iterator with zeroes instead of returning an error; I think this is because it makes the call site more convenient: the caller neither needs to verify a return value nor take further action to handle it.

    6. if (base + last_end < vmalloc_start + last_end) goto overflow; /* * Fitting base has not been found. */ if (va == NULL) goto overflow;

      It is an algorithmic policy. The two if conditions check the feasibility of the allocation. The first means the allocation's address goes beyond the allowable address range, causing overflow; in this case, since the base address is chosen from higher addresses toward lower ones, reducing the base address cannot help (note that the base address depends on the va). The second means no valid va with an appropriate base address was found. So the policy in both cases is to jump to the overflow branch and re-allocate the va list.

    7. if (base + start < va->va_start) { va = node_to_va(rb_prev(&va->rb_node)); base = pvm_determine_end_from_reverse(&va, align) - end; term_area = area; continue; }

      It is an algorithmic policy. It checks whether the start address of the current va is valid. The policy here is to switch to a different va, unlike the solution for the "end address check" (explained in my other annotation). The reason is that the base address is searched from high addresses toward low ones; merely continuing to reduce the base address would leave the condition violated forever, so moving to a va with a lower start address is the only way forward.

    8. if (base + end > va->va_end) { base = pvm_determine_end_from_reverse(&va, align) - end; term_area = area; continue; }

      It is an algorithmic policy. The outer loop runs until it finds a valid virtual memory region satisfying the requirements. The if condition here guarantees the space in va is large enough; if it is not, the scan continues from high addresses toward low addresses while keeping the alignment requirement, specifically by tuning the base address down to a lower address, and then the checks are retried.

    9. va = kmem_cache_zalloc(vmap_area_cachep, GFP_NOWAIT); if (WARN_ON_ONCE(!va)) continue;

      It is an algorithmic policy. The for loop transfers virtual memory management from the early vmlist of vm_struct (a linked list) to vmap_area structures (kept in an rb-tree). Inside the loop, if a new vmap_area cannot be allocated from vmap_area_cachep (va == NULL), the code simply prints a warning and skips the current tmp, instead of retrying or doing anything else to guarantee integrity.

    10. if (!spin_trylock(&vmap_area_lock)) return false;

      It is an algorithmic policy. vmalloc_dump_obj() prints information about the specified virtual memory object. Due to concurrency, the lock must be held before touching the shared state, and the policy here is spin_trylock(): it tries only once, and if that fails the function simply returns false. I think the intent is to avoid a potentially long wait and preserve performance.

  3. Aug 2024
    1. if (hugetlb_cgroup_disabled())

      This is probably not an interesting policy decision for ldos; it is a feature flag for the running OS. But if cgroups were themselves controlled by policy, then this flag would be governed by that cgroup decision.