2023/12/08: LLM transformer inference guide

hellcat 2023. 12. 8. 13:38

https://www.baseten.co/blog/llm-transformer-inference-guide/

A guide to LLM inference and performance

To attain the full power of a GPU during LLM inference, you have to know if the inference is compute bound or memory bound. Learn how to better utilize GPU resources.

Calculating the ops:byte ratio

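ops:byte of the A10 = (125 TFLOPS)/(600 GB/s) = 208.3 ops/byte
A workload whose arithmetic intensity is below 208.3 ops/byte is memory-bound on the A10
A workload whose arithmetic intensity is above 208.3 ops/byte is compute-bound on the A10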
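
The ratio above can be reproduced in a few lines. This is a minimal sketch: the function names `ops_to_byte_ratio` and `classify` are illustrative (not from the source article), and the A10 figures are the ones quoted in these notes.

```python
def ops_to_byte_ratio(peak_flops: float, mem_bandwidth_bytes_per_s: float) -> float:
    """Operations the GPU can execute for every byte it moves from memory."""
    return peak_flops / mem_bandwidth_bytes_per_s


def classify(arithmetic_intensity: float, ops_per_byte: float) -> str:
    """Compare a workload's arithmetic intensity with the hardware ratio."""
    if arithmetic_intensity < ops_per_byte:
        return "memory-bound"
    return "compute-bound"


# A10 figures quoted in these notes: 125 TFLOPS and 600 GB/s.
a10_ratio = ops_to_byte_ratio(125e12, 600e9)
print(f"A10 ops:byte = {a10_ratio:.1f}")
```

Swapping in the peak throughput and memory bandwidth of another GPU gives its ratio in the same way.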

Calculating arithmetic intensity

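Prefill:
- The model ingests the prompt tokens in parallel and fills the KV cache
- The KV cache can be thought of as the state of the model
Autoregressive sampling:
- Uses the current state (stored in the KV cache) to sample and decode the next token
- Without the KV cache, sampling each successive token takes longer, because the keys and values of all previous tokens must be recomputed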
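
The two phases can be sketched with a toy single-head attention in pure Python. This is an illustration under simplifying assumptions, not how a real model is implemented: the `KVCache` class, `prefill`, and `decode_step` are hypothetical names, and the key/value vectors are passed in directly instead of being projected from hidden states with learned weights.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query vector over all given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Keeps the key and value vectors of every token seen so far."""

    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def prefill(cache, prompt_kv):
    """Prefill: store K/V for every prompt token (a GPU does this in parallel)."""
    for k, v in prompt_kv:
        cache.append(k, v)


def decode_step(cache, q, k, v):
    """One autoregressive step: cache the new token's K/V, then attend over the cache."""
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)


# Toy 2-dimensional vectors, chosen only for illustration.
prompt_kv = [([1.0, 0.0], [1.0, 2.0]), ([0.0, 1.0], [3.0, 4.0])]
cache = KVCache()
prefill(cache, prompt_kv)
out = decode_step(cache, q=[1.0, 0.0], k=[1.0, 1.0], v=[5.0, 6.0])
print(len(cache.keys), out)
```

Each decode step only computes the new token's key and value; everything earlier is read back from the cache, which is why the step is dominated by memory traffic rather than arithmetic.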

Analyzing the attention equation

N : sequence length (=4096)
d : d_head (=128)
Q, K, V: N x d (4096 x 128)
S, P: N x N (4096 x 4096)
O: N x d (4096 x 128)

Total memory IO: 8N^2 + 8dN bytes
Total compute: 4dN^2 + 3N^2 ops
Arithmetic intensity for Llama: Total compute / Total memory IO = (4dN^2 + 3N^2) / (8N^2 + 8dN) = 62 ops/byte
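
As a quick check of the arithmetic, the sketch below plugs N = 4096 and d = 128 into the two totals. Arithmetic intensity is compute divided by memory IO (ops per byte). The function name `attention_arithmetic_intensity` is illustrative.

```python
def attention_arithmetic_intensity(n: int, d: int) -> float:
    """Ops per byte for attention, using the totals listed in these notes."""
    memory_io_bytes = 8 * n**2 + 8 * d * n
    compute_ops = 4 * d * n**2 + 3 * n**2
    return compute_ops / memory_io_bytes


ai = attention_arithmetic_intensity(n=4096, d=128)
print(f"arithmetic intensity = {ai:.1f} ops/byte")  # about 62 ops/byte
```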

Finding the inference bottleneck

Llama 7B is memory-bound during the autoregressive phase
- Arithmetic intensity of Llama 7B = 62 ops/byte < A10 ops:byte = 208.3
We pay for an expensive GPU but do not fully utilize its compute
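
One way to put a number on the unused compute is a simple roofline-style estimate. This is an assumption layered on top of the notes, which only compare the two ratios: when a workload is memory-bound, its attainable throughput is taken to be arithmetic intensity times memory bandwidth, capped at the peak. The name `attainable_flops` is illustrative.

```python
def attainable_flops(arithmetic_intensity: float, peak_flops: float, mem_bw: float) -> float:
    """Roofline-style bound: limited by memory traffic or by peak compute."""
    return min(peak_flops, arithmetic_intensity * mem_bw)


peak = 125e12  # A10 peak throughput quoted in these notes (FLOPS)
bandwidth = 600e9  # A10 memory bandwidth quoted in these notes (bytes/s)

achieved = attainable_flops(62, peak, bandwidth)
utilization = achieved / peak
print(f"attainable: {achieved / 1e12:.1f} TFLOPS ({utilization:.0%} of peak)")
```

Under this model, at 62 ops/byte the A10 can reach only about 30% of its peak compute, which is the gap batching tries to close.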

Batching memory-bound processes on the GPU

Generation: time/token = (model weights in bytes)/(accelerator memory BW)
Prefill: prefill time = # of prompt tokens * (# of model parameters)/(accelerator compute BW)
Total generation time = prefill time + # of output tokens * time/token
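
The generation-side formulas can be sketched as follows. Several things here are assumptions rather than figures from these notes: the weights are taken as 7B parameters in FP16 (2 bytes each, about 14 GB), the 600 GB/s bandwidth is the A10 figure used earlier, and prefill time is left as an input so that only the decode portion is estimated. The function names are illustrative.

```python
def time_per_token(model_bytes: float, mem_bw_bytes_per_s: float) -> float:
    """Seconds per generated token when every step streams all weights once."""
    return model_bytes / mem_bw_bytes_per_s


def total_generation_time(prefill_s: float, n_output_tokens: int, per_token_s: float) -> float:
    """Total time = prefill time + number of output tokens * time per token."""
    return prefill_s + n_output_tokens * per_token_s


model_bytes = 7e9 * 2  # assumption: 7B parameters stored in FP16
per_token = time_per_token(model_bytes, 600e9)

# Decode-only estimate for 150 output tokens (prefill passed as 0 here).
decode_only = total_generation_time(0.0, 150, per_token)
print(f"time/token = {per_token * 1e3:.1f} ms, decode for 150 tokens = {decode_only:.2f} s")
```

These are lower bounds from bandwidth alone; measured latencies also include kernel launch overhead and KV cache reads.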

Llama 2 7B Chat benchmark on NVIDIA GPUs

Input tokens: 350
Output tokens: 150
