Efficient Streaming Language Models with Attention Sinks
Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value (KV) states consumes extensive memory. Secondly, popular LLMs cannot generalize to longer texts than the training sequence length.
arxiv.org
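The paper's core observation is that keeping the KV entries of a handful of initial "attention sink" tokens, together with a rolling window of the most recent tokens, keeps the cache bounded while preserving generation quality. Below is a minimal PyTorch sketch of that eviction policy; the function name, default sizes, and tensor shapes are illustrative assumptions rather than the paper's code, and the positional re-assignment StreamingLLM performs inside the cache is omitted.

```python
import torch

def evict_kv(keys, values, num_sinks=4, window=2044):
    """StreamingLLM-style eviction sketch: keep the first `num_sinks`
    "attention sink" tokens plus the most recent `window` tokens.

    keys, values: [batch, num_heads, seq_len, head_dim]
    (Shapes and defaults are illustrative; the real method also
    re-assigns positions within the cache, which is omitted here.)
    """
    seq_len = keys.size(2)
    if seq_len <= num_sinks + window:
        return keys, values  # cache not full yet, nothing to evict
    sink_k, sink_v = keys[:, :, :num_sinks], values[:, :, :num_sinks]
    recent_k, recent_v = keys[:, :, -window:], values[:, :, -window:]
    return (torch.cat([sink_k, recent_k], dim=2),
            torch.cat([sink_v, recent_v], dim=2))

# Example: a cache of 3000 tokens is trimmed back to 4 + 2044 entries.
k = torch.randn(1, 8, 3000, 64)
v = torch.randn(1, 8, 3000, 64)
k, v = evict_kv(k, v)
print(k.shape)  # torch.Size([1, 8, 2048, 64])
```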
Ring Attention with Blockwise Transformers for Near-Infinite Context
HyperAttention: Long-context Attention in Near-Linear Time
Flash-Decoding for long context inference
Stanford CRFM
Motivation: Large language models (LLMs) such as ChatGPT or Llama have received unprecedented attention lately. However, they remain massively expensive to run. Even though generating a single response can cost only about $0.01 (a few seconds of an 8xA100 instance on AWS), the costs add up quickly when scaling to billions of users who may have multiple daily interactions with such LLMs.
crfm.stanford.edu
Efficient Memory Management for Large Language Model Serving with PagedAttention