Daily-Trend-Review

2023/10/18: Long-context optimization methods

hellcat 2023. 10. 18. 09:21

Efficient Streaming Language Models with Attention Sinks (arxiv.org)

Deploying Large Language Models (LLMs) in streaming applications such as multi-round dialogue, where long interactions are expected, is urgently needed but poses two major challenges. Firstly, during the decoding stage, caching previous tokens' Key and Value states (KV) consumes extensive memory.
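The StreamingLLM idea above can be sketched as a cache-eviction policy: always retain the first few "attention sink" tokens plus a sliding window of recent tokens, evicting everything in between. A minimal illustration (parameter values like `n_sink=4` are illustrative, not the paper's tuned settings):

```python
def streaming_kv_keep(cache_len, n_sink=4, window=8):
    """Return the indices of KV-cache entries kept by an attention-sink
    policy: the first n_sink tokens (the "sinks") plus the most recent
    `window` tokens. Sketch of the StreamingLLM eviction rule only;
    a real implementation would evict the corresponding KV tensors."""
    if cache_len <= n_sink + window:
        return list(range(cache_len))  # nothing to evict yet
    sinks = list(range(n_sink))
    recent = list(range(cache_len - window, cache_len))
    return sinks + recent

# With 20 cached tokens, keep tokens 0-3 and 12-19 (12 entries total),
# so memory stays bounded no matter how long the stream runs.
kept = streaming_kv_keep(20, n_sink=4, window=8)
```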

Ring Attention with Blockwise Transformers for Near-Infinite Context

HyperAttention: Long-context Attention in Near-Linear Time

 

Flash-Decoding for long context inference (Stanford CRFM)

Motivation: Large language models (LLMs) such as ChatGPT or Llama have received unprecedented attention lately. However, they remain massively expensive to run. Even though generating a single response can cost about $0.01 (a few seconds of an 8xA100 instance), the costs add up quickly when scaling to billions of users.
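Flash-Decoding's core trick is to split the long KV cache into chunks, attend to each chunk in parallel while tracking per-chunk (max, sum-of-exponentials, partial output), and then merge the partials with a log-sum-exp rescaling so the result equals full softmax attention. A numpy sketch for a single query vector (chunk count is illustrative):

```python
import numpy as np

def attention(q, K, V):
    """Reference: full softmax attention for one query vector."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

def flash_decode(q, K, V, n_chunks=4):
    """Sketch of Flash-Decoding's split-KV reduction: score each KV chunk
    independently, then merge partial results with log-sum-exp rescaling.
    In the real kernel the chunks are processed in parallel on the GPU."""
    m, l = -np.inf, 0.0                 # running max and sum of exponentials
    o = np.zeros(V.shape[1])            # running (unnormalized) output
    for idx in np.array_split(np.arange(len(K)), n_chunks):
        s = K[idx] @ q
        m_c = s.max()
        p = np.exp(s - m_c)
        l_c, o_c = p.sum(), p @ V[idx]  # per-chunk partials
        m_new = max(m, m_c)             # rescale both sides to the new max
        l = l * np.exp(m - m_new) + l_c * np.exp(m_c - m_new)
        o = o * np.exp(m - m_new) + o_c * np.exp(m_c - m_new)
        m = m_new
    return o / l
```

The merge step is the same online-softmax recurrence used inside FlashAttention; Flash-Decoding's contribution is applying it across the sequence dimension so decoding a single token can use all SMs even at batch size 1.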

Efficient Memory Management for Large Language Model Serving with PagedAttention
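PagedAttention's key idea is to manage the KV cache like virtual memory: each sequence gets a block table mapping logical token positions to fixed-size physical blocks from a shared pool, so memory is allocated on demand instead of pre-reserved for the maximum length. A toy sketch (class and method names are illustrative, not vLLM's actual API):

```python
class PagedKVCache:
    """Toy block-table allocator in the spirit of PagedAttention.
    Real implementations store KV tensors in the blocks; here we only
    track the logical-to-physical mapping."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # shared pool of physical blocks
        self.tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}  # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:       # last block is full (or first token)
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())  # allocate one block on demand
        self.lengths[seq_id] = n + 1

    def slot(self, seq_id, pos):
        """Logical position -> (physical block id, offset within block)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size
```

Because blocks are small and allocated lazily, fragmentation is bounded by one block per sequence, which is what lets vLLM pack many more concurrent sequences into the same GPU memory.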
