Daily-Trend-Review

2024/01/27: Harmonizing Multi-GPUs

hellcat 2024. 1. 27. 11:36

Harmonizing Multi-GPUs: Efficient Scaling of LLM Inference

 

TitanML docs

The machine learning optimization company

docs.titanml.co

 

In the Fast Lane! Speculative Decoding - 10x Larger Model, No Extra Cost

 

TitanML docs

The machine learning optimization company

docs.titanml.co
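The core trick behind speculative decoding is that a small draft model proposes several tokens cheaply and the large target model only verifies them. Below is a minimal greedy-verification sketch (not TitanML's implementation): `target_next` and `draft_next` are hypothetical stand-ins for real model forward passes, and verification is simulated token by token rather than in one batched pass.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=12):
    """Greedy speculative decoding sketch.

    target_next / draft_next: callables mapping a token sequence to that
    model's greedy next token (stand-ins for real LLM forward passes).
    The draft proposes k tokens; the target verifies them and keeps the
    longest agreeing prefix plus its own correction, so the output is
    identical to decoding with the target alone.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1) the cheap draft model proposes k tokens autoregressively
        proposal, ctx = [], list(seq)
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) the target verifies the proposal: accept while it agrees,
        #    then append the target's own token at the first mismatch
        accepted, ctx = [], list(seq)
        for t in proposal:
            expect = target_next(ctx)
            if expect == t:
                accepted.append(t)
                ctx.append(t)
            else:
                accepted.append(expect)
                break
        else:
            # all k draft tokens accepted: one bonus token for free
            accepted.append(target_next(ctx))
        seq.extend(accepted)
    return seq[len(prompt):]
```

Because every emitted token is one the target model would have produced greedily, quality is unchanged; the speedup comes from the target verifying k tokens per pass instead of generating one.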

Large Language Models - the hardware connection

 

Large Language Models — the hardware connection | APNIC Blog

Guest Post: The role of networking when scaling LLM architectures' gigantic models.

blog.apnic.net

Tensor Parallelism

 

Tensor Parallelism - NADDOD Blog

Tensor parallelism alleviates memory issues in large-scale training. RoCE enables efficient communication for GPU tensor parallelism, accelerating computations.

www.naddod.com
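The idea the NADDOD post describes can be shown in a few lines: shard a two-layer MLP across devices Megatron-style (first weight split by columns, second by rows), let each "GPU" compute its partial result independently, and reassemble with a single all-reduce. This is a NumPy sketch of the math only; the device placement and RoCE/NCCL communication are what real frameworks add.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 8))    # batch of activations
W1 = rng.standard_normal((8, 16))  # first linear layer
W2 = rng.standard_normal((16, 8))  # second linear layer

n_gpu = 4
# column-parallel: each "GPU" holds a slice of W1's output columns
W1_shards = np.split(W1, n_gpu, axis=1)
# row-parallel: each "GPU" holds the matching slice of W2's rows
W2_shards = np.split(W2, n_gpu, axis=0)

# each device computes its partial output independently...
partials = [np.maximum(x @ w1, 0) @ w2 for w1, w2 in zip(W1_shards, W2_shards)]
# ...and one all-reduce (sum) reassembles the full result
y_parallel = np.sum(partials, axis=0)

# reference: the unsharded computation
y_full = np.maximum(x @ W1, 0) @ W2
assert np.allclose(y_parallel, y_full)
```

The column-then-row split works because the element-wise ReLU acts on disjoint column slices, so no communication is needed between the two layers; each GPU also only stores 1/n of each weight matrix, which is the memory relief the post mentions.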

Fast and Expressive LLM Inference with RadixAttention and SGLang

 

Fast and Expressive LLM Inference with RadixAttention and SGLang | LMSYS Org

Large Language Models (LLMs) are increasingly utilized for complex tasks that require multiple chained generation calls, advanced prompting techniques, co...

lmsys.org
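RadixAttention's key idea is keeping the KV cache of past requests in a radix tree so that requests sharing a prompt prefix (a system prompt, few-shot examples) reuse it instead of re-prefilling. A toy per-token trie (simpler than a true radix tree, and without SGLang's GPU tensors or eviction policy) is enough to show the reuse; `compute_kv` is a hypothetical stand-in for prefilling one token's KV entry.

```python
class PrefixCacheNode:
    def __init__(self):
        self.children = {}  # token -> child node
        self.kv = None      # stand-in for this token's KV-cache entry

class PrefixCache:
    """Toy prefix cache: a per-token trie standing in for a radix tree."""

    def __init__(self):
        self.root = PrefixCacheNode()

    def run(self, tokens, compute_kv):
        """Walk the trie, calling compute_kv only for uncached tokens.
        Returns how many tokens actually had to be (re)computed."""
        node, computed = self.root, 0
        for tok in tokens:
            if tok not in node.children:
                child = PrefixCacheNode()
                child.kv = compute_kv(tok)  # "prefill" this token
                node.children[tok] = child
                computed += 1
            node = node.children[tok]
        return computed

# two requests sharing a long system-prompt prefix
cache = PrefixCache()
fake_kv = lambda tok: ("kv", tok)
first = cache.run(list("You are a helpful assistant. Answer Q1"), fake_kv)
second = cache.run(list("You are a helpful assistant. Answer Q2"), fake_kv)
```

The first request prefills every token; the second recomputes only the tokens after the shared prefix, which is where the multi-call prompting workloads in the post get their speedup.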

All you need to know about LLMs