https://vgel.me/posts/faster-inference/
Table of Contents
- Why is simple inference so slow?
- Hardware
- Batching
- Shrinking model weights
- KV caching
- Speculative Decoding
- Training time optimizations
- Conclusion
How to make LLMs go fast (vgel.me, December 18, 2023)
Excerpt: "In my last post, we made a transformer by hand. There, we used the classic autoregressive sampler, along the lines of: ... This approach to inference is elegant and cuts to the heart of how LLMs work: they're autoregressive ..."
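The excerpt breaks off right where the original post shows its sampling loop ("along the lines of:"). As a stand-in, here is a minimal sketch of such a classic autoregressive sampling loop, not the post's actual code; `model` (a callable returning per-position next-token logits) and `tokenizer` are hypothetical names assumed for illustration.

```python
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    # Hypothetical interfaces: tokenizer.encode/decode and a model that maps
    # a token sequence to logits of shape (seq_len, vocab_size).
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(tokens)              # full forward pass every step
        logits = logits[-1] / temperature   # keep only the last position
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                # softmax -> next-token distribution
        next_token = np.random.choice(len(probs), p=probs)
        tokens.append(int(next_token))      # feed the sampled token back in
    return tokenizer.decode(tokens)
```

The point the post builds on is visible in the sketch: each new token requires another forward pass over the growing sequence, which is what the later sections (batching, KV caching, speculative decoding) aim to speed up.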