Daily-Trend-Review

2023/12/25: Towards 100x Speedup: Full Stack Transformer Inference Optimization

hellcat 2023. 12. 25. 19:49

https://yaofu.notion.site/Towards-100x-Speedup-Full-Stack-Transformer-Inference-Optimization-43124c3688e14cffaf2f1d6cbdf26c6c

 

Towards 100x Speedup: Full Stack Transformer Inference Optimization

Imagine two companies have equally powerful models. Company A can serve the model to 10 users with 1 GPU, but company B can serve 20 users. Who will win in the long run?

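As a rough illustration of why one GPU can serve only so many users at once, the back-of-the-envelope calculation below estimates how many users' KV caches fit next to the weights of a 7B-class model. Every number here (model shape, fp16 cache, 2K context, 80 GB GPU) is an assumption for the example, not a figure from the post.

```python
# Illustrative only: assumed 7B-class model shape, fp16 KV cache, 80 GB GPU.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2                            # fp16
ctx_len = 2048                                # tokens cached per user

# KV cache per token = 2 (K and V) * layers * heads * head_dim * bytes
kv_per_token = 2 * layers * heads * head_dim * bytes_per_elem
kv_per_user = kv_per_token * ctx_len
print(f"KV cache per user ~ {kv_per_user / 2**30:.2f} GiB")     # ~1.0 GiB

gpu_mem = 80 * 2**30                          # 80 GB-class GPU
weights = 7e9 * bytes_per_elem                # fp16 weights of a 7B model
free_for_cache = gpu_mem - weights            # ignores activations and overhead
print(f"Concurrent users at full context ~ {free_for_cache // kv_per_user:.0f}")
```

Anything that shrinks the weights or the per-user cache raises that count, which is the kind of lever the post surveys across the stack.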

 

https://huggingface.co/blog/whisper-speculative-decoding

 

Speculative Decoding for 2x Faster Whisper Inference

OpenAI's Whisper is a general-purpose speech transcription model that achieves state-of-the-art results across a range of benchmarks and audio conditions. The post shows how speculative decoding with a smaller assistant model can roughly double Whisper's inference speed while leaving the transcriptions unchanged.

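In transformers, speculative (assisted) decoding is switched on by passing an assistant_model to generate, which the ASR pipeline forwards via generate_kwargs. Below is a minimal sketch along those lines; the checkpoint pairing (whisper-large-v2 with whisper-tiny as a stand-in draft model; the post itself uses Distil-Whisper as the assistant) and the audio path are assumptions, so check the post for the exact setup it benchmarks.

```python
# Minimal sketch of speculative decoding for Whisper with Hugging Face transformers.
# Checkpoints and the audio file are placeholders; see the linked post for its setup.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Main (target) model that verifies the drafted tokens.
model_id = "openai/whisper-large-v2"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Small draft (assistant) model sharing the same tokenizer; whisper-tiny as a stand-in.
assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    "openai/whisper-tiny", torch_dtype=dtype, low_cpu_mem_usage=True, use_safetensors=True
).to(device)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=dtype,
    device=device,
    # The assistant drafts several tokens cheaply; the large model accepts or
    # rejects them in a single forward pass, so the output text stays the same.
    generate_kwargs={"assistant_model": assistant_model},
)

print(pipe("sample.wav")["text"])
```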

https://huggingface.co/blog/moe#serving-techniques

 

Mixture of Experts Explained

With the release of Mixtral 8x7B, a class of transformer model has become the hottest topic in the open-source AI community: Mixture of Experts, or MoEs for short. The post walks through the building blocks of MoEs, how they are trained, and the trade-offs to consider when serving them for inference; the link above points to its section on serving techniques.

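As a companion to that explanation, here is a stripped-down sparse MoE feed-forward layer with a learned router and top-2 expert selection, written in PyTorch. The dimensions, activation, and gating details are simplified assumptions for illustration, not the Mixtral implementation, and it loops over experts instead of using the batched kernels a real serving stack would need.

```python
# Toy sparse MoE layer: a router picks top-2 of 8 expert FFNs per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)   # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, n_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(SparseMoE()(tokens).shape)               # torch.Size([10, 512])
```

Only top_k of the expert FFNs run for each token, which is why MoEs carry far more parameters than they spend compute on; serving them is largely about keeping all those mostly-idle parameters in memory, which is where the serving-techniques section picks up.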

https://www.baseten.co/blog/llm-transformer-inference-guide/

 

A guide to LLM inference and performance

Learn whether LLM inference is compute-bound or memory-bound so you can fully utilize GPU power, and get insights on better GPU resource utilization.

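The guide's central question, whether inference is compute-bound or memory-bound, boils down to comparing the GPU's ops:byte ratio with the arithmetic intensity of the decode step. Below is a rough sketch of that comparison; the hardware figures are assumed A10-class numbers rather than values quoted from the article, and KV-cache traffic is ignored.

```python
# Rough compute-vs-memory-bound check for LLM decoding (assumed A10-class figures).
FLOPS = 125e12       # ~fp16 tensor throughput, FLOPs per second (assumption)
BANDWIDTH = 600e9    # memory bandwidth, bytes per second (assumption)
ops_per_byte = FLOPS / BANDWIDTH
print(f"GPU ops:byte ratio ~ {ops_per_byte:.0f}")          # ~208

# Decoding one token for a batch of B sequences reads every fp16 weight (2 bytes)
# once, and each weight does ~2 FLOPs (multiply + add) per sequence in the batch.
def arithmetic_intensity(batch_size: int) -> float:
    return 2 * batch_size / 2                              # FLOPs per byte moved

for b in (1, 8, 64, 256):
    ai = arithmetic_intensity(b)
    bound = "memory-bound" if ai < ops_per_byte else "compute-bound"
    print(f"batch {b:>3}: ~{ai:>3.0f} FLOPs/byte -> {bound}")
```

At batch size 1 the intensity is about 1 FLOP per byte, two orders of magnitude below the GPU's ops:byte ratio, which is why single-stream decoding leaves most of the compute idle and why batching is the first lever for better utilization.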