, as it is common for a page to convert from being frequently accessed to not being accessed at all on two consecutive epochs, thus the prediction MAE can be significantly high.
this is interesting
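A tiny worked example (numbers invented purely for illustration) of why per-page MAE stays high when a page flips between hot and idle across epochs:

```python
# Hypothetical per-epoch access counts for a single page.
predicted = [900, 880, 850]   # a history-based predictor keeps expecting "hot"
actual    = [  0, 920,   0]   # the page flips between hot and idle each epoch

mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(mae)  # ~597 accesses of error per epoch for this one page
```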
Overall, we prove that the accuracy of the RNN predictions is such that it can deliver application performance similar to what would be possible with oracular knowledge of the access frequency.
I would've liked to see these results with real workloads; these benchmarks have a much smaller footprint than real workloads, which makes me skeptical about how much the 100-page budget would need to change to achieve the same results for real workloads.
Also, we assume dedicated DMA engines that allow seamless page migration, which is overlapped with the computation, as explored in [14, 19].
This would be very messy in reality
Concerning the memory footprint of these applications,
As the footprint of the app increases, it might be harder to apply this solution, since the number of selected pages might have to increase.
Number of accesses
Is this total accesses or last scheduling epoch?
Identifies the subset of application pages that are important to performance, through its page selector component, described in detail later on
We can re-use this for the LRB model
we flip the problem and explore the case of predicting when a page is going to be accessed next
This is exactly what LRB also does
However, when normalizing hundred thousand values in such a way (total number of pages according to Table 1), there will be vast information loss.
This is what transforMAP does?
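A minimal sketch of the information-loss point, assuming a plain min-max normalization of page IDs into [0, 1] (the paper may normalize differently):

```python
import numpy as np

# Hypothetical: ~200k distinct page IDs squeezed into [0, 1].
page_ids = np.arange(200_000, dtype=np.float64)
normalized = (page_ids - page_ids.min()) / (page_ids.max() - page_ids.min())

# Adjacent pages end up ~5e-6 apart; with a single regression-style output,
# the model effectively cannot distinguish them, hence the information loss.
print(normalized[100_000], normalized[100_001])
```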
This approach has several limitations
Very important for us to understand this
More specifically, it is the sequence of the page accesses that were serviced from main memory and not the processor's hardware caches, as they happened throughout the application run time.
I believe this means cache misses.
, so as to aggregate the accesses on an application page granularity and then determine an ordering of heavily accessed pages. These predictions need to happen periodically, when the page scheduler is invoked, so that the appropriate page migrations are determined and executed.
This is very similar to what we want to do, although at a different granularity. I really like this learning objective.
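A minimal sketch of that learning objective at page granularity, assuming a flat per-epoch trace of raw addresses and 4 KiB pages (names are mine, not Kleio's):

```python
from collections import Counter

PAGE_SHIFT = 12  # 4 KiB pages (assumption)

def hottest_pages(epoch_accesses, top_n=100):
    """Aggregate raw addresses into per-page counts and rank the pages."""
    counts = Counter(addr >> PAGE_SHIFT for addr in epoch_accesses)
    return [page for page, _ in counts.most_common(top_n)]

# Each time the page scheduler is invoked, this ranking would drive the
# migrations: keep the top_n pages in the fast tier, demote the rest.
```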
Figure 2:
This figure is very hard to read for me
History page scheduler can reduce performance up to 55% (in the case of lulesh) and 13% on average.
13% on average might be very hard to recover, I find that ML can be good for workloads where heuristics leave a lot on the table.
Purely history-based page scheduling methods are limited in the performance opportunities they can provide to applications running on hybrid memory systems. Instead, they must be augmented with more intelligent, predictive methods
I think this is a bit of a broad claim: the history page scheduler is only one method, so we cannot make such a strong claim about all possible heuristic solutions.
A common theme among these approaches is that they rely exclusively on historic information about page accesses. Specifically, the state-of-the-art [20, 27, 28] in system-level dynamic page management solutions for HMS utilize the immediate observed behavior to make decisions on the best future page placement.
A mistake even we might be making
An effective page scheduler is responsible for ensuring that hot pages – the ones that are accessed frequently
Is having the entire page in DRAM always better, or is hardware cache-line caching alone sufficient for some pages? I have been thinking about this for some time now.
Kleio reduces on average 80% of the performance gap between the existing solutions and an oracle with knowledge of future access pattern
Interesting way to put their performance
Shorter thresholds thus make the task of the ML predictor more tractable, while longer thresholds move the byte miss ratio of relaxed Belady closer to Belady's MIN. It is thus important to find the proper threshold
This is an interesting insight for our work to prefetch N number of pages instead of 1.
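As I understand the relaxed Belady idea, the threshold turns labeling into a binary question of whether the next reuse falls within the boundary; a minimal sketch under that assumption (my own naming, not LRB's code):

```python
def label_examples(next_reuse_distances, boundary):
    """1 = next reuse falls within the relaxed Belady boundary, 0 = beyond it.

    A shorter boundary makes the ML task easier (shorter prediction horizon);
    a longer one tracks Belady's MIN byte miss ratio more closely.
    """
    return [1 if d <= boundary else 0 for d in next_reuse_distances]
```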
Distribution of tier2 access ratio and tier2 residency ratio under different promotion policies.
Okay this is great; the fact that the access ratio is higher with 60s promotion and lowest with PMU tells me that prefetching actually has scope even for this system, but it's just difficult to do. The sampling based prefetching work can be a good fit here
However, a newly referenced page is likely to be referenced again soon and frequently.
How soon and how frequently?
Tier2 performance impact and overheads.
I think the system overhead needs an important distinction: does this include the original address space walking to determine the accessed bits, or just the userspace daemon walking over the accessed bits?
the large variation means that population studies are necessary to determine the actual impact.
I actually don't agree with this; these results are nice and helpful, but I do wanna see the bottom line of how the TPS or latency or completion time of individual jobs is affected.
that this metric is relatively high especially compared to swap based solutions [32],
Why is it higher than swap based solutions? That's not obvious to me, apart from one key difference: swap based solutions might be promoting more pages, as promotion is compulsory on access, whereas here promotion is not compulsory
. The base policy promotes pages touched in two or more consecutive hot scan periods.
I would be interested to see if promoted pages are actually accessed after 1 minute, I wonder if there is such a study
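A minimal sketch of the base promotion rule as described (data structures are my own assumption, not TMTS code):

```python
def promotion_candidates(access_history):
    """access_history: {page: [accessed-bit per hot scan period, newest last]}.

    Promote only pages touched in the last two consecutive hot scan periods,
    i.e. the base policy quoted above.
    """
    return [page for page, hist in access_history.items()
            if len(hist) >= 2 and hist[-1] and hist[-2]]
```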
The hardware deployed with TMTS supports precise events filtered to loads sourced from tier2. Unfortunately, the hardware does not support such filtering for memory stores, so stores are not sampled
Interesting; I am not sure if by hardware they mean specific Intel processors. Not counting stores discards HeMem's primary finding, which was that stores hurt and should be treated differently
hence, rely on swap-backed memory (anonymous and tmpfs) for more than 98% of their pages. Thus, only swap-backed pages are treated as demotion candidates.
Interesting thing to keep in mind; this is also my observation, but it is important to note that it holds at datacenter scale as well
. This system design point is quite distinct from virtual memory
I believe this is a very very important distinction that can have big consequences
The diversity and scale of its applications motivates an application-transparent solution in the general case, adaptable to specific workload demands.
Motivation for why things need to be application transparent, though I don't necessarily buy it yet
at a much higher training time and memory footprint
how high? some numbers would be nice.
variable temporal dependencies
proof? analysis?
general
what is general about this model?
Figure 5
This figure tells me that for almost all workloads, 1 epoch of training and a 10k memory trace is enough for prediction, because they never need to be retrained.
measured to be up to 15x the training times
Some concrete numbers would've been nice
Emulating Online Learning:
The important question is when you want to learn; you don't want to keep learning the same patterns again and again
This reduces the output size and the neural network is trained to predict binary outputs that are later converted to decimal.
The concept is very simple it seems
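If I read this right, the model predicts each bit of the label instead of one class per value; a minimal encode/decode sketch under that assumption:

```python
def to_bits(value, width=20):
    """Encode an integer label as `width` binary targets (MSB first)."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def from_bits(bits):
    """Convert thresholded binary outputs back to a decimal label."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

# 20 sigmoid outputs cover 2^20 values instead of a ~1M-way softmax,
# which is where the output-size reduction comes from.
assert from_bits(to_bits(123_456)) == 123_456
```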
000x compression
in memory? model size? what compression?
n ≈ 50,000
What are they predicting? They are literally saying that there are 50,000 possible deltas? EDIT: revisited the previous paper and there really are 50,000 possible deltas; we should look at what their deltas are vs. what our deltas would be. 50k is impossible as a delta for page prefetching.
may lead to slowing down of inference due to large number of output labels.
This has to be smaller in our case than for caches; the intuition is that a delta of +64 means jumping to every 64th page, which is a very large amount of memory.
Figure 1: Autocorrelation coefficients for each trace for various lags.
This figure made no sense to me
for various lags
First use of this term and I have no clue what it means
We propose to use a hybrid offline+online training approach where a base model is trained offline first. At runtime, in case of low accuracy, a more specialized model is trained in real-time with the hypotheses that high accuracy can be obtained by only few training samples with few epochs, and that this high accuracy can be sustained for a long period of time before another round of retraining is required
basically finetuning online
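A minimal sketch of that control loop, i.e. offline pretraining plus accuracy-triggered online fine-tuning (evaluate, finetune and issue_prefetches are hypothetical helpers, not the paper's API):

```python
def hybrid_prefetcher_loop(base_model, trace_windows, acc_threshold=0.5):
    model = base_model                      # trained offline beforehand
    for window in trace_windows:
        acc = evaluate(model, window)       # hypothetical helper
        if acc < acc_threshold:
            # Few samples, few epochs: specialize online, then reuse the
            # specialized model until accuracy degrades again.
            model = finetune(model, window, epochs=2)   # hypothetical helper
        issue_prefetches(model, window)     # hypothetical helper
```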
the approach of training offline and testing online for individual application is not a practical prefetcher,
Why is it not a practical prefetcher? If we overfit a model for an application and it can predict all the patterns then this should work.
Art
haha?
Second, it is not worth learning correlations for addresses that occur infrequently.
This is very interesting; for page prefetching it should be the opposite: we want to learn correlations for pages that are less frequent, because the pages that are more frequent would never be considered cold and thus we should never need to prefetch them, unless we are dealing with a scenario of total memory disaggregation, because in that case cache and page prefetching would be the same.
illustrates the page-aware offset embedding
I'll have to go over this entire part again, I don't know how attention networks work, will revisit
the entire model is trained online
Interesting
second embedding layer
is this just a neural network?
Voyager is trained to predict the most predictable address from multiple possible labels
What does this even mean if not the next address?
P(Addr_{t+1} | Addr_t)
This seems very difficult to do for a very large access stream, how would you have enough instances to learn this?
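Estimating P(Addr_{t+1} | Addr_t) straight from a trace is essentially a bigram table, which makes the sample-sparsity concern concrete; a minimal sketch:

```python
from collections import defaultdict, Counter

def bigram_table(addr_trace):
    """Per-address counts of the next address, i.e. an empirical P(next | current)."""
    table = defaultdict(Counter)
    for cur, nxt in zip(addr_trace, addr_trace[1:]):
        table[cur][nxt] += 1
    return table

# With millions of distinct addresses, most rows contain only one or two
# samples, so the estimated conditionals are extremely noisy.
```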
For example, a prefetch is considered correct if any one of the ten predictions by the model match the next address, thus ignoring practical considerations of accuracy and timeliness.
Totally agree with this criticism; it was also a sore point for me when I read the paper
Unfortunately, regression-based models are trained to arrive close to the ground truth label, but since a small error in a cache line address will prefetch the wrong line, being close is not useful for prefetching.
This is true for cache prefetching but not for page prefetching; maybe regression is good enough for page prefetching, and that might even make it achievable at runtime?
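The hypothesis in miniature: a regression error of a few cache lines points at the wrong 64 B line but usually stays inside the same 4 KiB page, so page-granularity prefetching tolerates "close" predictions (illustrative numbers only):

```python
LINE_SHIFT, PAGE_SHIFT = 6, 12              # 64 B lines, 4 KiB pages (assumption)

true_addr = 0x7F3A_1234_0000
pred_addr = true_addr + 3 * 64              # regression lands 3 cache lines off

wrong_line = (pred_addr >> LINE_SHIFT) != (true_addr >> LINE_SHIFT)   # True
same_page  = (pred_addr >> PAGE_SHIFT) == (true_addr >> PAGE_SHIFT)   # True
```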
While the RL formulation is conceptually powerful, the use of tables is insufficient for RL because tables are sample inefficient and sensitive to noise in contexts.
I have no clue what this means, need to clear it out with someone who knows ML
data prefetchers have no known ground truth labels from which to learn.
the fact that there is no ground truth is very important and pertinent in prefetching, but maybe hopp can be used as a ground truth?
, in the presence of data-dependent correlations across multiple PCs.
Citation would be useful
STMS
To address this issue, we use a novel attention-based embedding layer that allows the page prediction to provide context for the offset prediction.
Using first output as input to second
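My rough mental model of this layer, sketched as plain scaled dot-product attention where the page prediction's embedding acts as the query over offset embeddings (a sketch of the idea only, not Voyager's exact layer):

```python
import numpy as np

def page_aware_offset_context(page_emb, offset_embs):
    """page_emb: (d,), offset_embs: (num_offsets, d).

    The page embedding scores each offset embedding; the softmax-weighted
    mix is the offset-side context conditioned on the predicted page.
    """
    d = page_emb.shape[-1]
    scores = offset_embs @ page_emb / np.sqrt(d)     # (num_offsets,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ offset_embs                     # (d,)
```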
Only PC
Coupled with the t-SNE visualization, it is clear that deltas for a particular PC are clustered, and what if we just trained a model for that? These results do puzzle me a bit because I am having a hard time wrapping my head around it.
t-SNE
I really like this visualization, its pretty cool
K highest-probability deltas are chosen for prefetching
I am assuming this is just for the next miss and not the next K misses
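A minimal sketch of that selection step, assuming a plain softmax over a delta vocabulary and that all K prefetches target the single upcoming miss:

```python
import numpy as np

def top_k_deltas(probs, delta_vocab, k=2):
    """probs: softmax output over the delta vocabulary; returns the k best deltas."""
    idx = np.argsort(probs)[::-1][:k]
    return [delta_vocab[i] for i in idx]

# All k deltas are issued for the next miss, not spread over the next k misses.
```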
hierarchical softmax
I don't know what this is and will need to read more.
the size of the vocabulary required in order to obtain at best 50% accuracy is usually O(1000) or less
This is intuitive; you wouldn't expect more deltas than that, because >1000 deltas would imply that numbers greater than 1000 could be a delta, at which point I would stop calling it a pattern, especially when clustering comes into play.
Figure 1
I am not able to understand this figure or what it is trying to convey
LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ is a good article to understand LSTMs
We relate contemporary prefetching strategies to n-gram models in natural language processing
Relating cache prefetching to n-gram (NLP), could be a good precursor to relating prefetching to LLMs.
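The analogy in code: an n-gram prefetcher is just a table keyed by the last n-1 deltas that predicts the most frequent next delta (a minimal sketch, not the paper's implementation):

```python
from collections import defaultdict, Counter

def build_ngram_prefetcher(deltas, n=3):
    """Map each (n-1)-delta history to its most frequent next delta."""
    table = defaultdict(Counter)
    for i in range(len(deltas) - n + 1):
        history = tuple(deltas[i:i + n - 1])
        table[history][deltas[i + n - 1]] += 1
    return {h: c.most_common(1)[0][0] for h, c in table.items()}
```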
However, the space of machine learning for computer hardware architecture is only lightly explored.
Motivation for future work.