Hypothesis

The lack of KV sharing across requests leads to redundant prefill computation and wasted memory.

KV sharing across concurrent requests is a non-obvious efficiency lever: today, if two users send similar prompts, each request's prefill KV state is computed and stored independently, even where the prompts overlap. CXL's shared memory pool makes cross-request KV reuse architecturally possible for the first time without expensive GPU-to-GPU transfers.
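
To make the redundancy concrete, here is a minimal sketch of cross-request reuse. It rests on several assumptions that are not part of the note: reuse is restricted to exact shared prefixes (the usual condition under causal attention, which is narrower than "similar prompts"), KV state is managed in fixed-size blocks, and a plain Python dict stands in for the shared pool. The names `SharedKVPool` and `BLOCK` are hypothetical, and nothing here models CXL itself.

```python
from dataclasses import dataclass, field

BLOCK = 4  # tokens per KV block (hypothetical; chosen small for illustration)


@dataclass
class SharedKVPool:
    """Toy stand-in for a pool of KV blocks visible to every request."""
    blocks: dict = field(default_factory=dict)  # full prefix -> placeholder KV block
    computed_tokens: int = 0
    reused_tokens: int = 0

    def prefill(self, tokens):
        """Walk the prompt block by block, reusing blocks whose whole prefix is cached."""
        full = len(tokens) - len(tokens) % BLOCK
        for end in range(BLOCK, full + 1, BLOCK):
            # A block's KV depends on every preceding token, so key on the full prefix.
            key = tuple(tokens[:end])
            if key in self.blocks:
                self.reused_tokens += BLOCK
            else:
                self.blocks[key] = f"kv[{end - BLOCK}:{end}]"  # placeholder, not real tensors
                self.computed_tokens += BLOCK
        # The trailing partial block is always computed and never shared.
        self.computed_tokens += len(tokens) - full


pool = SharedKVPool()
system_prompt = list(range(8))                            # 8 tokens shared by both requests
pool.prefill(system_prompt + [100, 101, 102])             # request A: 11 tokens
pool.prefill(system_prompt + [200, 201, 202, 203, 204])   # request B: 13 tokens
print(pool.computed_tokens, pool.reused_tokens)           # 16 computed, 8 reused
```

Without the shared pool, the two requests would prefill 11 + 13 = 24 tokens; with it, the 8-token common prefix is computed once. A production design would likely key blocks by a chained hash rather than the full prefix tuple, and would need eviction and consistency handling that this sketch omits.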

kv-sharing prefill multi-tenant-inference

Tags