, as it is common for a page to convert from being frequently accessed to not being accessed at all on two consecutive epochs, thus the prediction MAE can be significantly high.
this is interesting
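A tiny worked example (numbers invented purely for illustration) of why per-page MAE stays high when a page flips between hot and idle across epochs:

```python
# Hypothetical per-epoch access counts for a single page.
predicted = [900, 880, 850]   # a history-based predictor keeps expecting "hot"
actual    = [  0, 920,   0]   # the page flips between hot and idle each epoch

mae = sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)
print(mae)  # ~597 accesses of error per epoch for this one page
```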
Overall, we prove that the accuracy of the RNN predictions is such that it can deliver application performance similar to what would be possible with oracular knowledge of the access frequency.
I would've liked to see these results with real workloads; these benchmarks have a much smaller footprint than real workloads, which makes me skeptical about how much the 100-page budget would need to change to achieve the same results for real workloads.
Also, we assume dedicated DMA engines that allow seamless page migration, which is overlapped with the computation, as explored in [14, 19].
This would be very messy in reality
Concerning the memory footprint of these applications,
As the footprint of the app increases, it might be harder to apply this solution, since the number of selected pages might have to increase.
Number of accesses
Is this total accesses or last scheduling epoch?
Identifies the subset of application pages that are important to performance, through its page selector component, described in detail later on
We can re-use this for the LRB model
we flip the problem and explore the case of predicting when a page is going to be accessed next
This is exactly what LRB also does
However, when normalizing hundred thousand values in such a way (total number of pages according to Table 1), there will be vast information loss.
This is what transforMAP does?
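A minimal sketch of the information-loss point, assuming a plain min-max normalization of page IDs into [0, 1] (the paper may normalize differently):

```python
import numpy as np

# Hypothetical: ~200k distinct page IDs squeezed into [0, 1].
page_ids = np.arange(200_000, dtype=np.float64)
normalized = (page_ids - page_ids.min()) / (page_ids.max() - page_ids.min())

# Adjacent pages end up ~5e-6 apart; with a single regression-style output,
# the model effectively cannot distinguish them, hence the information loss.
print(normalized[100_000], normalized[100_001])
```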
This approach has several limitations
Very important for us to understand this
More specifically, it is the sequence of the page accesses that were serviced from main memory and not the processor's hardware caches, as they happened throughout the application run time.
I believe this means cache misses.
, so as to aggregate the accesses on an application page granularity and then determine an ordering of heavily accessed pages. These predictions need to happen periodically, when the page scheduler is invoked, so that the appropriate page migrations are determined and executed.
This is very similar to what we want to do, although at a different granularity. I really like this learning objective.
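A minimal sketch of that learning objective at page granularity, assuming a flat per-epoch trace of raw addresses and 4 KiB pages (names are mine, not Kleio's):

```python
from collections import Counter

PAGE_SHIFT = 12  # 4 KiB pages (assumption)

def hottest_pages(epoch_accesses, top_n=100):
    """Aggregate raw addresses into per-page counts and rank the pages."""
    counts = Counter(addr >> PAGE_SHIFT for addr in epoch_accesses)
    return [page for page, _ in counts.most_common(top_n)]

# Each time the page scheduler is invoked, this ranking would drive the
# migrations: keep the top_n pages in the fast tier, demote the rest.
```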
Figure 2:
This figure is very hard to read for me
History page scheduler can reduce performance up to 55% (in the case of lulesh) and 13% on average.
13% on average might be very hard to recover, I find that ML can be good for workloads where heuristics leave a lot on the table.
Purely history-based page scheduling methods are limited in the performance opportunities they can provide to applications running on hybrid memory systems. Instead, they must be augmented with more intelligent, predictive methods
I think this is a bit of a broad claim: the history page scheduler is only one method, so we cannot make such a strong claim about all possible heuristic solutions.
A common theme among these approaches is that they rely exclusively on historic information about page accesses. Specifically, the state-of-the-art [20, 27, 28] in system-level dynamic page management solutions for HMS utilize the immediate observed behavior to make decisions on the best future page placement.
A mistake even we might be making
An effective page scheduler is responsible for ensuring that hot pages – the ones that are accessed frequently
Is having the entire page in DRAM always better, or is hardware cache-line caching alone sufficient for some pages? I have been thinking about this for some time now.
Kleio reduces on average 80% of the performance gap between the existing solutions and an oracle with knowledge of future access pattern
Interesting way to put their performance
Shorter thresholds thus make the task of the ML predictor more tractable, while longer thresholds move the byte miss ratio of relaxed Belady closer to Belady's MIN. It is thus important to find the proper threshold
This is an interesting insight for our work to prefetch N number of pages instead of 1.
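As I understand the relaxed Belady idea, the threshold turns labeling into a binary question of whether the next reuse falls within the boundary; a minimal sketch under that assumption (my own naming, not LRB's code):

```python
def label_examples(next_reuse_distances, boundary):
    """1 = next reuse falls within the relaxed Belady boundary, 0 = beyond it.

    A shorter boundary makes the ML task easier (shorter prediction horizon);
    a longer one tracks Belady's MIN byte miss ratio more closely.
    """
    return [1 if d <= boundary else 0 for d in next_reuse_distances]
```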
Distribution of tier2 access ratio and tier2 residency ratio under different promotion policies.
Okay this is great; the fact that the access ratio is higher with 60s promotion and lowest with PMU tells me that prefetching actually has scope even for this system, but it's just difficult to do. The sampling based prefetching work can be a good fit here
However, a newly referenced page is likely to be referenced again soon and frequently.
How soon and how frequently?
Tier2 performance impact and overheads.
I think the system overhead needs an important distinction: does this include the original address space walking to determine the accessed bits, or just the userspace daemon walking over the accessed bits?
the large variation means that population studies are necessary to determine the actual impact.
I actually don't agree with this; these results are nice and helpful, but I do wanna see the bottom line of how the TPS or latency or completion time of individual jobs is affected.
that this metric is relatively high especially compared to swap based solutions [32],
Why is it higher than swap based solutions? That's not obvious to me, apart from one key difference: swap based solutions might be promoting more pages, as promotion is compulsory on access, whereas here promotion is not compulsory
. The base policy promotes pages touched in two or more consecutive hot scan periods.
I would be interested to see if promoted pages are actually accessed after 1 minute, I wonder if there is such a study
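A minimal sketch of the base promotion rule as described (data structures are my own assumption, not TMTS code):

```python
def promotion_candidates(access_history):
    """access_history: {page: [accessed-bit per hot scan period, newest last]}.

    Promote only pages touched in the last two consecutive hot scan periods,
    i.e. the base policy quoted above.
    """
    return [page for page, hist in access_history.items()
            if len(hist) >= 2 and hist[-1] and hist[-2]]
```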
The hardware deployed with TMTS supports precise events filtered to loads sourced from tier2. Unfortunately, the hardware does not support such filtering for memory stores, so stores are not sampled
Interesting; I am not sure if by hardware they mean specific Intel processors. Not counting stores discards HeMem's primary finding, which was that stores hurt and should be treated differently
hence, rely on swap-backed memory (anonymous and tmpfs) for more than 98% of their pages. Thus, only swap-backed pages are treated as demotion candidates.
Interesting thing to keep in mind; this is also my observation, but it is important to note that it holds at datacenter scale as well
. This system design point is quite distinct from virtual memory
I believe this is a very very important distinction that can have big consequences
The diversity and scale of its applications motivates an application-transparent solution in the general case, adaptable to specific workload demands.
Motivation for why things need to be application transparent, though I don't necessarily buy it yet
at a much higher training time and memory footprint
how high? some numbers would be nice.
variable temporal dependencies
proof? analysis?
general
what is general about this model?
Figure 5
This figure tells me that for almost all workloads, 1 epoch of training and a 10k memory trace is enough for prediction, because they never need to be retrained.
measured to be up to 15x the training times
Some concrete numbers would've been nice
Emulating Online Learning:
The important question is when you want to learn; you don't want to keep learning the same patterns again and again
This reduces the output size and the neural network is trained to predict binary outputs that are later converted to decimal.
The concept is very simple it seems
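If I read this right, the model predicts each bit of the label instead of one class per value; a minimal encode/decode sketch under that assumption:

```python
def to_bits(value, width=20):
    """Encode an integer label as `width` binary targets (MSB first)."""
    return [(value >> (width - 1 - i)) & 1 for i in range(width)]

def from_bits(bits):
    """Convert thresholded binary outputs back to a decimal label."""
    out = 0
    for b in bits:
        out = (out << 1) | b
    return out

# 20 sigmoid outputs cover 2^20 values instead of a ~1M-way softmax,
# which is where the output-size reduction comes from.
assert from_bits(to_bits(123_456)) == 123_456
```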
000x compression
in memory? model size? what compression?
n ≈ 50,000
What are they predicting? They are literally saying that there are 50,000 possible deltas? EDIT: revisited the previous paper and there really are 50,000 possible deltas; we should look at what their deltas are vs. what our deltas would be. 50k is impossible as a delta for page prefetching.
may lead to slowing down of inference due to large number of output labels.
This has to be smaller in our case than for caches; the intuition is that a delta of +64 means jumping to every 64th page, which is a very large amount of memory.
Figure 1: Autocorrelation coefficients for each trace for various lags.
This figure made no sense to me
for various lags
First use of this term and I have no clue what it means
We propose to use a hybrid offline+online training approach where a base model is trained offline first. At runtime, in case of low accuracy, a more specialized model is trained in real-time with the hypotheses that high accuracy can be obtained by only few training samples with few epochs, and that this high accuracy can be sustained for a long period of time before another round of retraining is required
basically finetuning online
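A minimal sketch of that control loop, i.e. offline pretraining plus accuracy-triggered online fine-tuning (evaluate, finetune and issue_prefetches are hypothetical helpers, not the paper's API):

```python
def hybrid_prefetcher_loop(base_model, trace_windows, acc_threshold=0.5):
    model = base_model                      # trained offline beforehand
    for window in trace_windows:
        acc = evaluate(model, window)       # hypothetical helper
        if acc < acc_threshold:
            # Few samples, few epochs: specialize online, then reuse the
            # specialized model until accuracy degrades again.
            model = finetune(model, window, epochs=2)   # hypothetical helper
        issue_prefetches(model, window)     # hypothetical helper
```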
the approach of training offline and testing online for individual application is not a practical prefetcher,
Why is it not a practical prefetcher? If we overfit a model for an application and it can predict all the patterns then this should work.
Art
haha?
Second, it is not worth learning correlations for addresses that occur infrequently.
This is very interesting; for page prefetching it should be the opposite: we want to learn correlations for pages that are less frequent, because the pages that are more frequent would never be considered cold and thus we should never need to prefetch them, unless we are dealing with a scenario of total memory disaggregation, because in that case cache and page prefetching would be the same.
illustrates the page-aware offset embedding
I'll have to go over this entire part again, I don't know how attention networks work, will revisit
the entire model is trained online
Interesting
second embedding layer
is this just a neural network?
Voyager is trained to predict the most predictable address from multiple possible labels
What does this even mean if not the next address?
P(Addr_{t+1} | Addr_t)
This seems very difficult to do for a very large access stream, how would you have enough instances to learn this?
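Estimating P(Addr_{t+1} | Addr_t) straight from a trace is essentially a bigram table, which makes the sample-sparsity concern concrete; a minimal sketch:

```python
from collections import defaultdict, Counter

def bigram_table(addr_trace):
    """Per-address counts of the next address, i.e. an empirical P(next | current)."""
    table = defaultdict(Counter)
    for cur, nxt in zip(addr_trace, addr_trace[1:]):
        table[cur][nxt] += 1
    return table

# With millions of distinct addresses, most rows contain only one or two
# samples, so the estimated conditionals are extremely noisy.
```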
For example, a prefetch is considered correct if any one of the ten predictions by the model match the next address, thus ignoring practical considerations of accuracy and timeliness.
Totally agree with this criticism; it was also a sore point for me when I read the paper
Unfortunately, regression-based models are trained to arrive close to the ground truth label, but since a small error in a cache line address will prefetch the wrong line, being close is not useful for prefetching.
This is true for cache prefetching but not for page prefetching; maybe regression is good enough for page prefetching, and that might even make it achievable at runtime?
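The hypothesis in miniature: a regression error of a few cache lines points at the wrong 64 B line but usually stays inside the same 4 KiB page, so page-granularity prefetching tolerates "close" predictions (illustrative numbers only):

```python
LINE_SHIFT, PAGE_SHIFT = 6, 12              # 64 B lines, 4 KiB pages (assumption)

true_addr = 0x7F3A_1234_0000
pred_addr = true_addr + 3 * 64              # regression lands 3 cache lines off

wrong_line = (pred_addr >> LINE_SHIFT) != (true_addr >> LINE_SHIFT)   # True
same_page  = (pred_addr >> PAGE_SHIFT) == (true_addr >> PAGE_SHIFT)   # True
```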
While the RL formulation is conceptually powerful, the use of tables is insufficient for RL because tables are sample inefficient and sensitive to noise in contexts.
I have no clue what this means, need to clear it out with someone who knows ML
data prefetchers have no known ground truth labels from which to learn.
the fact that there is no ground truth is very important and pertinent in prefetching, but maybe hopp can be used as a ground truth?
, in the presence of data-dependent correlations across multiple PCs.
Citation would be useful
STMS
To address this issue, we use a novel attention-based embedding layer that allows the page prediction to provide context for the offset prediction.
Using first output as input to second
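My rough mental model of this layer, sketched as plain scaled dot-product attention where the page prediction's embedding acts as the query over offset embeddings (a sketch of the idea only, not Voyager's exact layer):

```python
import numpy as np

def page_aware_offset_context(page_emb, offset_embs):
    """page_emb: (d,), offset_embs: (num_offsets, d).

    The page embedding scores each offset embedding; the softmax-weighted
    mix is the offset-side context conditioned on the predicted page.
    """
    d = page_emb.shape[-1]
    scores = offset_embs @ page_emb / np.sqrt(d)     # (num_offsets,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ offset_embs                     # (d,)
```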
Only PC
Coupled with the t-SNE visualization, it is clear that deltas for a particular PC are clustered, and what if we just trained a model for that? These results do puzzle me a bit because I am having a hard time wrapping my head around it.
t-SNE
I really like this visualization, its pretty cool
K highest-probability deltas are chosen for prefetching
I am assuming this is just for the next miss and not the next K misses
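A minimal sketch of that selection step, assuming a plain softmax over a delta vocabulary and that all K prefetches target the single upcoming miss:

```python
import numpy as np

def top_k_deltas(probs, delta_vocab, k=2):
    """probs: softmax output over the delta vocabulary; returns the k best deltas."""
    idx = np.argsort(probs)[::-1][:k]
    return [delta_vocab[i] for i in idx]

# All k deltas are issued for the next miss, not spread over the next k misses.
```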
hierarchical softmax
I don't know what this is and will need to read more.
the size of the vocabulary required in order to obtain at best 50% accuracy is usually O(1000) or less
This is intuitive; you wouldn't expect more deltas than that, because >1000 deltas would imply that numbers greater than 1000 could be a delta, at which point I would stop calling it a pattern, especially when clustering comes into play.
Figure 1
I am not able to understand this figure or what it is trying to convey
LSTM
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ is a good article to understand LSTMs
We relate contemporary prefetching strategies to n-gram models in natural language processing
Relating cache prefetching to n-gram (NLP), could be a good precursor to relating prefetching to LLMs.
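The analogy in code: an n-gram prefetcher is just a table keyed by the last n-1 deltas that predicts the most frequent next delta (a minimal sketch, not the paper's implementation):

```python
from collections import defaultdict, Counter

def build_ngram_prefetcher(deltas, n=3):
    """Map each (n-1)-delta history to its most frequent next delta."""
    table = defaultdict(Counter)
    for i in range(len(deltas) - n + 1):
        history = tuple(deltas[i:i + n - 1])
        table[history][deltas[i + n - 1]] += 1
    return {h: c.most_common(1)[0][0] for h, c in table.items()}
```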
However, the space of machine learning for computer hardware architecture is only lightly explored.
Motivation for future work.