ServerlessLLM: Locality-Enhanced Serverless Inference for Large Language Models (paper repo)
What are the motivations for this work?
Large Language Models (LLMs) have recently been incorporated into many online applications, and serving LLM inference at scale is a challenging problem: LLMs require large amounts of GPU resources, which are expensive today.
The technical challenges of serving LLMs include:
- LLMs are memory-intensive applications
- LLM applications face highly dynamic, bursty traffic
The technical challenges of serving LLMs as microservices include:
- Transferring checkpoints from model repositories
- Costly checkpoint loading from storage devices
The current solution for serverless machine-learning applications relies on checkpointing, which incurs significant overhead and latency. Other solutions used by regular serverless applications have their own drawbacks:
- Keeping instances warm: wastes GPU resources
- In-memory caching: not efficient for large models
- Additional storage servers: large communication overhead
What is the proposed solution?
Leverage the multi-tier storage architecture of GPU servers for local checkpoint storage, and exploit its substantial aggregate bandwidth for efficient checkpoint loading. The design includes:
- Fast LLM checkpoint loading: Increase memory addressing efficiency
- Locality-driven LLM inference with live migration
- Locality-aware server allocation
Fast LLM checkpoint loading
- Preloading onto the GPU
- Utilizes parallelized PCIe transfers, direct reads/writes, and throughput optimizations (see the sketch after this list).
- Open question: preloading may introduce new overheads, and the description here is not entirely clear.
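A minimal sketch of the chunked, pipelined loading idea, assuming PyTorch; the file path, chunk size, and flat byte layout are illustrative assumptions, not the authors' checkpoint format. Disk reads land in pinned host buffers and are copied to the GPU asynchronously, so the next read overlaps the previous PCIe transfer:

```python
# Minimal sketch (not the authors' implementation): stream a checkpoint file to
# GPU memory in fixed-size chunks using double-buffered pinned host memory and
# an asynchronous copy stream.
import torch

CHUNK_BYTES = 64 * 1024 * 1024  # hypothetical chunk size

def load_checkpoint_to_gpu(path: str, total_bytes: int) -> torch.Tensor:
    gpu_buf = torch.empty(total_bytes, dtype=torch.uint8, device="cuda")
    pinned = [torch.empty(CHUNK_BYTES, dtype=torch.uint8).pin_memory() for _ in range(2)]
    done = [torch.cuda.Event() for _ in range(2)]   # marks when a pinned buffer is reusable
    copy_stream = torch.cuda.Stream()
    offset, i = 0, 0
    with open(path, "rb", buffering=0) as f:        # unbuffered reads; a real loader would use O_DIRECT
        while offset < total_bytes:
            done[i].synchronize()                   # wait until this pinned buffer is free again
            n = f.readinto(memoryview(pinned[i].numpy()))
            if n == 0:
                break
            with torch.cuda.stream(copy_stream):
                gpu_buf[offset:offset + n].copy_(pinned[i][:n], non_blocking=True)
                done[i].record(copy_stream)
            offset += n
            i ^= 1                                  # double-buffer: alternate pinned buffers
    copy_stream.synchronize()                       # all chunks are on the GPU after this
    return gpu_buf
```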
Locality-Driven LLM Inference with Live Migration
- Token-based live migration (the steps are listed below; a minimal sketch follows them)
- Fault tolerance
1. The model loading scheduler sends a model-loading request to the dest server to load model A into its GPUs. If there is already an idle instance of model A on the dest server, the scheduler skips this step.
2. After loading completes, the scheduler sends a migration request carrying the address of the dest server to the src server.
3. Upon receiving the migration request, the src server sets itself as “migrating” and, if the inference is not yet complete, sends a resume request with the intermediate tokens (i.e., the input tokens and the output tokens produced before step 3) to the dest server; otherwise, it immediately returns to the scheduler.
4. The dest server recomputes the KV cache from the tokens in the resume request.
5. Once the resume request is done, the src server stops inference, returns to the scheduler, and replies to the request router with all tokens (i.e., the intermediate tokens together with the remaining tokens produced between steps 3 and 5) and a flag “migrated”. For long contexts, the set of all tokens can be very large, so resuming takes a long time, during which many new tokens are produced; in that case, the above two steps can be repeated to further reduce the number of tokens sent between src and dest.
6. The scheduler finishes the migration, unloads model A at the src server, and starts loading model B.
7. The request router checks the flag in the inference response. If it is “migrated”, the request router replaces the src server with the dest server in its routing table and sends all tokens to the dest server to continue inference.
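A minimal sketch of the token handoff in steps 3–5, assuming hypothetical InferenceServer objects (not the authors' code): the only state transferred is the token list, the dest side rebuilds its KV cache by recomputing over those tokens, and the handoff can be repeated so each round only sends the tokens dest has not yet seen.

```python
# Hypothetical sketch of token-based live migration (steps 3-5), not the
# authors' implementation. Only token IDs cross the network; the dest server
# rebuilds its KV cache by recomputing (prefilling) over the received tokens.
from dataclasses import dataclass, field

@dataclass
class InferenceServer:
    name: str
    tokens: list = field(default_factory=list)    # prompt + generated token IDs
    kv_cache_len: int = 0                         # number of tokens covered by the KV cache

    def resume(self, tokens):
        """Step 4 on dest: recompute the KV cache for the given tokens (a prefill pass)."""
        self.tokens = list(tokens)
        self.kv_cache_len = len(self.tokens)

    def decode(self, n):
        """Stand-in for generating n more tokens while migration is in flight."""
        start = len(self.tokens)
        self.tokens.extend(range(start, start + n))
        self.kv_cache_len = len(self.tokens)

def migrate(src, dest, rounds=2):
    """Steps 3-5: repeatedly send dest the tokens it has not seen yet, shrinking
    the gap each round, then return the full token list for the router."""
    for _ in range(rounds):
        missing = src.tokens[dest.kv_cache_len:]  # tokens dest is still missing
        dest.resume(dest.tokens + missing)        # dest recomputes its KV cache over them
        src.decode(4)                             # src keeps producing tokens meanwhile
    return src.tokens                             # replied with flag "migrated"

src = InferenceServer("src", tokens=list(range(16)), kv_cache_len=16)
dest = InferenceServer("dest")
all_tokens = migrate(src, dest)
dest.resume(all_tokens)                           # step 7: router sends all tokens to dest
```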
Locality-Aware Server Allocation
Key idea: measure and estimate the model loading time and the migration time for each candidate server, and allocate the server with the lowest estimated startup latency (see the sketch below).
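A back-of-the-envelope sketch of that estimation, with made-up bandwidth and token numbers (not the paper's cost model): for each candidate server, estimate how long until the model is ready, either by loading the checkpoint from the fastest local tier that holds it, or by first migrating the instance currently running there, then pick the minimum.

```python
# Hypothetical estimator for locality-aware allocation; fields and numbers are
# illustrative, not the paper's cost model.
def loading_time(model_bytes, tier_bandwidth):
    return model_bytes / tier_bandwidth               # seconds to load the checkpoint

def migration_time(num_tokens, recompute_tokens_per_s):
    return num_tokens / recompute_tokens_per_s        # KV-cache recompute on dest

def estimated_startup(server, model_bytes):
    tiers = server.get("tiers_with_ckpt", [])
    load = min((loading_time(model_bytes, bw) for bw in tiers), default=float("inf"))
    wait = migration_time(server["busy_tokens"], server["tokens_per_s"]) if server.get("busy") else 0.0
    return wait + load

servers = [
    # checkpoint on local NVMe (~3 GB/s), server idle
    {"name": "A", "tiers_with_ckpt": [3e9], "busy": False},
    # checkpoint cached in DRAM (~20 GB/s), but another model is running
    {"name": "B", "tiers_with_ckpt": [20e9], "busy": True,
     "busy_tokens": 2048, "tokens_per_s": 4000.0},
]
model_bytes = 13e9                                    # roughly a 6.7B-parameter model in fp16
best = min(servers, key=lambda s: estimated_startup(s, model_bytes))
print(best["name"], round(estimated_startup(best, model_bytes), 2))
```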
What is the work's evaluation of the proposed solution?
ServerlessLLM demonstrates a 10-200X latency improvement for running OPT model inference across the evaluated datasets, which supports its effectiveness.
- LLM serving workloads: based on those described in AlpaServe
- LLM checkpoint loading: compared against the PyTorch and Safetensors loaders
What is your analysis of the identified problem, idea and evaluation?
Running LLMs on serverless platforms may become a future trend, and the overheads introduced by serverless infrastructure are amplified for LLMs. It is good to see a solution that exploits system throughput and adds locality; however, overheads and open questions remain around recovering from checkpoints, migrating by token, and invoking by GPU address.
What are the contributions?
- Token-based LLM live migration
- Measurements of LLM migration and model loading
- GPU memory management
What are future directions for this research?
WIP
What questions are you left with?
WIP
What is your take-away message from this paper?
WIP