65 Matching Annotations
  1. Feb 2024
    1. , as it is common for a page to convert from being frequently accessed to not being accessed at all on two consecutive epochs, thus the prediction MAE can be significantly high.

      this is interesting

    2. Overall, we prove that the accuracy of the RNN predictions is such that it can deliver application performance similar to what would be possible with oracular knowledge of the access frequency.

      I would've liked to see these results with real workloads; these benchmarks have a much smaller footprint than real workloads, which makes me skeptical as to how much larger than the 100 pages the selected set would need to be to achieve the same results for real workloads.

    3. Also, we assume dedicated DMA engines that allow seamless page migration, which is overlapped with the computation, as explored in [14, 19].

      This would be very messy in reality

    4. Concerning the memory footprint of these applications,

      As the footprint of the app increases, it might be harder to apply this solution, since the number of selected pages might have to increase.

    5. Identifies the subset of application pages that are important to performance, through its page selector component, described in detail later on

      We can re-use this for the LRB model

    6. However, when normalizing hundred thousand values in such a way (total number of pages according to Table 1), there will be vast information loss.

      This is what transforMAP does?
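      A tiny back-of-the-envelope check of that information-loss argument, as a sketch assuming min-max normalization of page IDs into [0, 1] (the 200,000 page count is an illustrative figure, not taken from Table 1):

      ```python
      n_pages = 200_000      # illustrative count, on the order of Table 1's totals
      spacing = 1 / n_pages  # gap between adjacent normalized page IDs: 5e-6

      # A regression output must land within +/- spacing/2 of the target to
      # recover the exact page; even a tiny absolute error in normalized
      # space already spans thousands of pages.
      err = 0.01
      print(err / spacing)   # 2000.0 pages covered by a 1% error
      ```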

    7. More specifically, it is the sequence of the page accesses that were serviced from main memory and not the processor’s hardware caches, as they happened throughout the application run time.

      I believe this means cache misses.

    8. , so as to aggregate the accesses on an application page granularity and then determine an ordering of heavily accessed pages. These predictions need to happen periodically, when the page scheduler is invoked, so that the appropriate page migrations are determined and executed.

      This is very similar to what we want to do, although at a different granularity. I really like this learning objective.
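      A minimal sketch of that objective, assuming a trace of byte addresses serviced from main memory, 4 KiB pages, and made-up epoch_len/top_k parameters:

      ```python
      from collections import Counter

      PAGE_SIZE = 4096  # assumed 4 KiB pages

      def hot_pages_per_epoch(addresses, epoch_len=10_000, top_k=100):
          """Aggregate a main-memory access trace at page granularity and
          return the top_k most heavily accessed pages for each epoch."""
          epochs, counts = [], Counter()
          for i, addr in enumerate(addresses, 1):
              counts[addr // PAGE_SIZE] += 1
              if i % epoch_len == 0:
                  epochs.append([page for page, _ in counts.most_common(top_k)])
                  counts = Counter()
          if counts:
              epochs.append([page for page, _ in counts.most_common(top_k)])
          return epochs
      ```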

    9. History page scheduler can reduce performance up to 55% (in the case of lulesh) and 13% on average.

      13% on average might be very hard to recover; I find that ML can be good for workloads where heuristics leave a lot on the table.

    10. Purely history-based page scheduling methods are limited in the performance opportunities they can provide to applications running on hybrid memory systems. Instead, they must be augmented with more intelligent, predictive methods

      I think this is a bit of a broad claim: the history page scheduler is only one method, so we cannot make such a strong claim about all possible heuristic solutions.

    11. common theme among these approaches is that they rely exclusively on historic information about page accesses. Specifically, the state-of-the-art [20, 27, 28] in system-level dynamic page management solutions for HMS utilize the immediate observed behavior to make decisions on the best future page placement.

      A mistake even we might be making

    12. An effective page scheduler is responsible for ensuring that hot pages – the ones that are accessed frequently

      Is having the entire page in DRAM always better, or is hardware cache-line caching sufficient for some pages? I have been thinking about this for some time now.

    13. Kleio reduces on average 80% of the performance gap between the existing solutions and an oracle with knowledge of future access pattern

      Interesting way to put their performance

  2. Oct 2023
    1. Shorter thresholds thus make the task of the ML predictor more tractable. While longer thresholds move the byte miss ratio of relaxed Belady closer to Belady’s MIN. It is thus important to find the proper threshold

      This is an interesting insight for our work on prefetching N pages instead of 1.
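      A toy illustration of the relaxed Belady boundary the quote describes, assuming oracle knowledge of each object's next access time (the names and interface are hypothetical):

      ```python
      def relaxed_belady_victims(next_access, now, threshold):
          """Under relaxed Belady, any cached object whose next access lies
          beyond `threshold` is an acceptable eviction victim; a shorter
          threshold turns exact reuse-distance ranking into an easier
          beyond/within-threshold decision, at the cost of drifting away
          from Belady's MIN."""
          return [obj for obj, t in next_access.items() if t - now > threshold]
      ```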

  3. Sep 2023
    1. Distribution of tier2 access ratio and tier2 residency ratio under different promotion policies.

      Okay, this is great: the fact that the access ratio is higher under the 60s promotion policy and lowest under PMU tells me that prefetching actually has scope even for this system, but it's just difficult to do. The sampling-based prefetching work could be a good fit here.

    2. Tier2 performance impact and overheads.

      I think the system overhead needs important distinctions: does this include the original address-space walk to determine the accessed bits, or just the userspace daemon walking over the accessed bits?

    3. the large variation means that population studies are necessary to determine the actual impact.

      I actually don't agree with this; these results are nice and helpful, but I do want to see the bottom line of how the TPS, latency, or completion time of individual jobs is affected.

    4. that this metric is relatively high especially compared to swap based solutions [32],

      Why is it higher than swap-based solutions? That's not obvious to me, apart from one key difference: swap-based solutions might be promoting more pages, as promotion is compulsory on access, whereas here promotion is not compulsory.

    5. The base policy promotes pages touched in two or more consecutive hot scan periods.

      I would be interested to see whether promoted pages are actually accessed after 1 minute; I wonder if there is such a study.

    6. The hardware deployed with TMTS supports precise events filtered to loads sourced from tier2. Unfortunately, the hardware does not support such filtering for memory stores, so stores are not sampled

      Interesting; I am not sure if by hardware they mean specific Intel processors. Not accounting for stores discards HeMem, as their primary finding was that stores hurt and should be treated differently.

    7. hence, rely on swap-backed memory (anonymous and tmpfs) for more than 98% of their pages. Thus, only swap-backed pages are treated as demotion candidates.

      Interesting thing to keep in mind; this is also my observation, but it is important to note that this holds at datacenter scale as well.

    8. The diversity and scale of its applications motivates an application-transparent solution in the general case, adaptable to specific workload demands.

      Motivation for why things need to be application transparent, though I don't necessarily buy it yet

  4. Jul 2023
    1. Figure 5

      This figure tells me that for almost all workloads, 1 epoch of training on a 10k memory trace is enough for prediction, because they never need to be retrained.

    2. n ≈ 50,000

      What are they predicting? They are literally saying that there are 50,000 possible deltas. EDIT: I revisited the previous paper and there really are 50,000 possible deltas; we should look at what their deltas are vs. what our deltas would be. 50k is impossible as a delta for page prefetching.
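      A quick way to sanity-check that on one of our own traces, as a sketch assuming a list of byte addresses and comparing 64 B cache-line vs. 4 KiB page granularity:

      ```python
      def delta_vocab_size(addresses, granularity):
          """Count the distinct successive deltas in an access trace at the
          given granularity (64 for cache lines, 4096 for pages)."""
          blocks = [addr // granularity for addr in addresses]
          return len({b - a for a, b in zip(blocks, blocks[1:])})

      # delta_vocab_size(trace, 64) vs. delta_vocab_size(trace, 4096):
      # the page-granularity vocabulary should be far smaller than ~50k.
      ```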

    3. may lead to slowing down of inference due to large number of output labels.

      This has to be smaller in our case than in caches; the intuition is that a +64 delta at page granularity means skipping to every 64th page, which is a very large amount of memory.

    4. We propose to use a hybrid offline+online training approach where a base model is trained offline first. At runtime, in case of low accuracy, a more specialized model is trained in real-time with the hypotheses that high accuracy can be obtained by only few training samples with few epochs, and that this high accuracy can be sustained for a long period of time before another round of retraining is required

      basically finetuning online
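      A rough sketch of what that offline+online loop could look like, with a hypothetical model interface (predict/fit) and made-up accuracy threshold and window size:

      ```python
      def serve_with_online_finetune(model, stream, acc_threshold=0.8,
                                     window=10_000, finetune_epochs=2):
          """Serve predictions from the offline-pretrained `model`; whenever
          accuracy over the last `window` samples drops below the threshold,
          fine-tune on those recent samples for a few epochs."""
          recent, correct = [], 0
          for x, y in stream:
              correct += int(model.predict(x) == y)
              recent.append((x, y))
              if len(recent) == window:
                  if correct / window < acc_threshold:
                      model.fit(recent, epochs=finetune_epochs)
                  recent, correct = [], 0
      ```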

    5. the approach of training offline and testing online for individual application is not a practical prefetcher,

      Why is it not a practical prefetcher? If we overfit a model for an application and it can predict all the patterns then this should work.

  5. Jun 2023
    1. Second, it is not worth learning correlations for addresses that occur infrequently.

      This is very interesting; for page prefetching it should be the opposite. We want to learn correlations for pages that are less frequent, because the pages that are more frequent would never be considered cold and thus we should never need to prefetch them, unless we are dealing with a scenario of total memory disaggregation, because in that case cache and page prefetching would be the same.

    2. For example, a prefetch is considered correct if any one of the ten predictions by the model match the next address, thus ignoring practical considerations of accuracy and timeliness.

      Totally agree with this criticism; it was also a sore point for me when I read the paper.

    3. Unfortunately, regression-based models are trained to arrive close to the ground truth label, but since a small error in a cache line address will prefetch the wrong line, being close is not useful for prefetching.

      This is true for cache prefetching but not for page prefetching; maybe regression is good enough for page prefetching, and that might even make it achievable at runtime?
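      A small example of why a regression error that is fatal at cache-line granularity can still be useful at page granularity (the addresses are made up):

      ```python
      LINE, PAGE = 64, 4096

      def is_useful(pred_addr, true_addr, granularity):
          # A regression prediction only needs to land inside the right block.
          return pred_addr // granularity == true_addr // granularity

      pred, true = 0x10000400, 0x10000000  # prediction is off by 1 KiB
      print(is_useful(pred, true, LINE))   # False: wrong cache line
      print(is_useful(pred, true, PAGE))   # True: still the same 4 KiB page
      ```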

    4. While the RL formulation is conceptually powerful, the use of tables is insufficient for RL because tables are sample inefficient and sensitive to noise in contexts.

      I have no clue what this means; I need to clear it up with someone who knows ML.

    5. data prefetchers have no known ground truth labels from which to learn.

      the fact that there is no ground truth is very important and pertinent in prefetching, but maybe hopp can be used as a ground truth?

    6. To address this issue, we use a novel attention-based embedding layer that allows the page prediction to provide context for the offset prediction.

      Using the first output as input to the second.
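      A rough PyTorch sketch of the idea as I read it (not the paper's exact architecture; names and sizes are illustrative): the page head's choice is embedded and concatenated into the offset head's input, so the offset prediction is conditioned on the predicted page.

      ```python
      import torch
      import torch.nn as nn

      class TwoStagePredictor(nn.Module):
          def __init__(self, n_pages, n_offsets=64, dim=128):
              super().__init__()
              self.page_emb = nn.Embedding(n_pages, dim)
              self.page_head = nn.Linear(dim, n_pages)
              self.offset_head = nn.Linear(2 * dim, n_offsets)

          def forward(self, hidden):  # hidden: (batch, dim) history context
              page_logits = self.page_head(hidden)
              # Embed the predicted page and let the offset head see it.
              page_ctx = self.page_emb(page_logits.argmax(dim=-1))
              offset_logits = self.offset_head(torch.cat([hidden, page_ctx], dim=-1))
              return page_logits, offset_logits
      ```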

    1. Only PC

      Coupled with the t-SNE visualization, it is clear that the deltas for a particular PC are clustered, so what if we just trained a model on that? These results do puzzle me a bit because I am having a hard time wrapping my head around them.

    2. the size of the vocabulary required in order to obtain at best 50% accuracy is usually O(1000) or less

      This is intuitive; you wouldn't expect more deltas than that, because >1000 distinct deltas implies that numbers greater than 1000 could be a delta, at which point I would stop calling it a pattern, especially when clustering comes into play.

  6. May 2023
    1. We relate contemporary prefetching strategies to n-gram models in natural language processing

      Relating cache prefetching to n-gram models (NLP) could be a good precursor to relating prefetching to LLMs.
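      A toy version of that analogy, treating address deltas as the "words" and the previous n-1 deltas as the n-gram context (roughly what a table-based prefetcher does):

      ```python
      from collections import Counter, defaultdict

      class NGramDeltaPredictor:
          """Predict the next delta from the previous n-1 deltas, the way an
          n-gram language model predicts the next word from its context."""
          def __init__(self, n=2):
              self.n = n
              self.table = defaultdict(Counter)

          def train(self, deltas):
              for i in range(len(deltas) - self.n + 1):
                  ctx = tuple(deltas[i:i + self.n - 1])
                  self.table[ctx][deltas[i + self.n - 1]] += 1

          def predict(self, recent_deltas):
              dist = self.table.get(tuple(recent_deltas[-(self.n - 1):]))
              return dist.most_common(1)[0][0] if dist else None
      ```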