Towards 100x Speedup: Full Stack Transformer Inference Optimization
yaofu.notion.site
"Imagine two companies have equally powerful models. Company A can serve the model to 10 users with 1 GPU, but company B can serve 20 users. Who will win in the long run?"
Speculative Decoding for 2x Faster Whisper Inference
https://huggingface.co/blog/whisper-speculative-decoding
OpenAI's Whisper is a general-purpose speech transcription model that achieves state-of-the-art results across a range of benchmarks and audio conditions. The post shows how speculative decoding roughly doubles its inference speed while producing the same transcriptions.
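As a rough illustration of the technique behind that post, here is a minimal sketch of assisted generation in transformers, where a small draft model proposes tokens and the large Whisper model verifies them. The model IDs and settings are illustrative assumptions, not a copy of the post's code.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# Large "verifier" model plus a smaller draft ("assistant") model.
processor = AutoProcessor.from_pretrained("openai/whisper-large-v2")
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=dtype
).to(device)
assistant = AutoModelForSpeechSeq2Seq.from_pretrained(
    "distil-whisper/distil-large-v2", torch_dtype=dtype
).to(device)

def transcribe(audio, sampling_rate=16_000):
    """Transcribe a 1-D waveform (e.g. a numpy array of float samples)."""
    features = processor(
        audio, sampling_rate=sampling_rate, return_tensors="pt"
    ).input_features.to(device, dtype)
    # Passing `assistant_model` enables speculative (assisted) decoding:
    # the draft model proposes several tokens and the large model verifies
    # them in one forward pass, so greedy output is unchanged, only faster.
    ids = model.generate(features, assistant_model=assistant)
    return processor.batch_decode(ids, skip_special_tokens=True)[0]
```

The speedup depends on how often the draft model's tokens are accepted; a draft model distilled from the verifier (as in the distil-whisper pairing above) tends to agree often enough to approach the advertised 2x.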
Mixture of Experts Explained
https://huggingface.co/blog/moe#serving-techniques
With the release of Mixtral 8x7B, a class of transformer has become the hottest topic in the open AI community: Mixture of Experts, or MoEs for short. The post walks through the building blocks of MoE models, including techniques for serving them.
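To make the "building block" concrete, here is a toy top-2 routed MoE layer: a router scores the experts per token, and only the top-k experts run, with their outputs combined by the router weights. The dimensions, expert count, and loop-based dispatch are illustrative assumptions, not an excerpt from the post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model)
        logits = self.router(x)                      # (tokens, n_experts)
        weights, idx = logits.topk(self.k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)         # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # only k experts run per token
            for e in idx[:, slot].unique().tolist():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

moe = TopKMoE()
y = moe(torch.randn(16, 512))   # (16, 512)
```

This sparsity is what gives MoEs their serving trade-off: parameters for all experts must sit in memory, but each token only pays the compute of k experts.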
A guide to LLM inference and performance
https://www.baseten.co/blog/llm-transformer-inference-guide/
Learn whether LLM inference is compute-bound or memory-bound so you can fully utilize GPU power, with insights on better GPU resource utilization.
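As a quick back-of-envelope companion to the compute- vs memory-bound question, the snippet below computes an ops:byte ratio. The hardware numbers are assumed A100-class figures, not values quoted from the guide.

```python
# Roofline-style check: compare the GPU's FLOPs-per-byte capability with the
# arithmetic intensity of the workload.
PEAK_FLOPS = 312e12        # ~312 TFLOPS FP16/BF16 (dense), assumed A100-class
MEM_BANDWIDTH = 1.5e12     # ~1.5 TB/s HBM bandwidth, assumed A100-class

ops_to_byte = PEAK_FLOPS / MEM_BANDWIDTH   # ~208 FLOPs available per byte moved

# Decoding one token for a 7B-parameter model in fp16 streams ~2 bytes per
# parameter and does ~2 FLOPs per parameter, i.e. ~1 FLOP/byte, far below
# ~208: single-stream decoding is memory bound, which is why batching and
# tricks like speculative decoding recover throughput.
print(f"ops:byte ratio ≈ {ops_to_byte:.0f}")
```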