2023/12/08: LLM transformer inference guide

Daily-Trend-Review

hellcat 2023. 12. 8. 13:38

https://www.baseten.co/blog/llm-transformer-inference-guide/

A guide to LLM inference and performance

To attain the full power of a GPU during LLM inference, you have to know if the inference is compute bound or memory bound. Learn how to better utilize GPU resources.

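Computing the ops:byte ratio

A10 ops:byte = (125 TFLOPS)/(600 GB/s) = 208.3 ops/byte
A workload whose arithmetic intensity is below 208.3 ops/byte is memory-bound
A workload whose arithmetic intensity is at or above 208.3 ops/byte is compute-bound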
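
The ratio above can be reproduced in a few lines of Python. This is only a minimal sketch: the function names `ops_per_byte` and `classify` are made up for illustration, and the A10 figures are the ones quoted in this post.

```python
def ops_per_byte(peak_flops: float, mem_bandwidth_bytes: float) -> float:
    """Hardware ops:byte ratio: peak compute divided by memory bandwidth."""
    return peak_flops / mem_bandwidth_bytes


def classify(arithmetic_intensity: float, hw_ops_per_byte: float) -> str:
    """Compare a workload's arithmetic intensity with the hardware ratio."""
    if arithmetic_intensity < hw_ops_per_byte:
        return "memory-bound"
    return "compute-bound"


# A10 figures quoted in this post: 125 TFLOPS and 600 GB/s.
a10_ratio = ops_per_byte(125e12, 600e9)
print(f"A10 ops:byte = {a10_ratio:.1f}")
print(classify(62, a10_ratio))
```

Swapping in the peak compute and memory bandwidth of another GPU gives its threshold in the same way.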

Computing arithmetic intensity

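Prefill:
- The model ingests the prompt tokens in parallel and fills the KV cache
- The KV cache can be thought of as the model's state
Autoregressive sampling:
- The current state (stored in the KV cache) is used to sample and decode the next token
- Without the KV cache, sampling each subsequent token takes longer, because the keys and values of all earlier tokens have to be recomputed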
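
To get a feel for why the KV cache matters, the sketch below counts how many token positions the model has to process to generate a sequence, with and without a cache. It is a simplified counting model rather than a real inference loop; the function names and the example lengths (a 350-token prompt, 150 generated tokens) are illustrative assumptions.

```python
def tokens_processed_with_cache(prompt_len: int, new_tokens: int) -> int:
    # Prefill processes the whole prompt once and yields the first new token.
    # Every later step only processes the single token sampled just before it.
    return prompt_len + (new_tokens - 1)


def tokens_processed_without_cache(prompt_len: int, new_tokens: int) -> int:
    # Without a cache, step i re-processes the prompt plus the i tokens
    # generated so far in order to rebuild their keys and values.
    return sum(prompt_len + i for i in range(new_tokens))


prompt_len, new_tokens = 350, 150  # illustrative values
cached = tokens_processed_with_cache(prompt_len, new_tokens)
uncached = tokens_processed_without_cache(prompt_len, new_tokens)
print(f"with cache: {cached}, without cache: {uncached}")
```

The gap between the two counts grows quadratically with the number of generated tokens.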

Analyzing the attention equation

N : sequence length (=4096)
d : d_head (=128)
Q, K, V: N x d (4096 x 128)
S, P: N x N (4096 x 4096)
O: N x d (4096 x 128)
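
The shapes listed above match standard scaled dot-product attention for a single head, which can be written as follows. The softmax is applied row-wise; the 1/sqrt(d) scaling is the usual convention and is an assumption here, since these notes only give the shapes.

```latex
S = \frac{Q K^{\top}}{\sqrt{d}} \in \mathbb{R}^{N \times N}, \qquad
P = \operatorname{softmax}(S) \in \mathbb{R}^{N \times N}, \qquad
O = P V \in \mathbb{R}^{N \times d}
```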

Total memory IO: 8N^2 + 8dN
Total Compute: 4dN^2 + 3N^2
Arithmetic intensity for llama: Total Compute / Total memory IO = (4dN^2 + 3N^2) / (8N^2 + 8dN) = 62 ops/byte
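
The totals above can be checked numerically. The per-step breakdown below is one accounting that adds up to the same totals; it assumes FP16 (2 bytes per element), that S and P are written to and read back from GPU memory, and roughly 3 operations per element for the softmax. These are assumptions of the sketch, not figures stated in these notes.

```python
N = 4096   # sequence length
d = 128    # head dimension
BYTES = 2  # assumption: FP16, 2 bytes per element

# Memory traffic in bytes, step by step.
io_qk = 2 * N * d * BYTES + N * N * BYTES              # read Q and K, write S
io_softmax = N * N * BYTES + N * N * BYTES             # read S, write P
io_pv = N * N * BYTES + N * d * BYTES + N * d * BYTES  # read P and V, write O
total_io = io_qk + io_softmax + io_pv                  # equals 8N^2 + 8dN

# Operation counts.
ops_qk = 2 * d * N * N    # S = Q K^T, one multiply and one add per term
ops_softmax = 3 * N * N   # assumption: about 3 ops per element of S
ops_pv = 2 * d * N * N    # O = P V
total_ops = ops_qk + ops_softmax + ops_pv              # equals 4dN^2 + 3N^2

arithmetic_intensity = total_ops / total_io
print(f"{arithmetic_intensity:.1f} ops/byte")
```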

Finding the inference bottleneck

Llama 7B is memory-bound during autoregressive generation
- Arithmetic intensity of Llama 7B = 62 ops/byte < A10 ops:byte = 208.3
The GPU is expensive, yet its compute capacity is not being fully used
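
How much of the GPU goes unused can be estimated with a simple roofline model, in which throughput is capped either by peak compute or by memory bandwidth times arithmetic intensity. The sketch below applies it to the numbers in this post; the roofline formula and the function name `attainable_flops` are assumptions added for illustration.

```python
def attainable_flops(arithmetic_intensity: float,
                     peak_flops: float,
                     mem_bandwidth: float) -> float:
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, arithmetic_intensity * mem_bandwidth)


PEAK_FLOPS = 125e12  # A10 peak compute, FLOP/s
MEM_BW = 600e9       # A10 memory bandwidth, bytes/s
AI = 62              # arithmetic intensity of attention, ops/byte

attainable = attainable_flops(AI, PEAK_FLOPS, MEM_BW)
utilization = attainable / PEAK_FLOPS
print(f"attainable: {attainable / 1e12:.1f} TFLOPS ({utilization:.0%} of peak)")
```

Under these assumptions, less than a third of the A10's peak compute is reachable at this arithmetic intensity.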

Batching memory-bound work on the GPU

Generation: time/token = (model weights)/(accelerator memory BW)
Prefill: time = # of tokens * (# of parameters)/(accelerator compute bandwidth)
Total Generation Time = Prefill time + # of tokens * time/token
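
As a rough worked example, the sketch below plugs the A10 figures and a 7B-parameter model into these formulas, using the 350 input and 150 output tokens from the benchmark setup in this post. It treats prefill as limited by compute throughput, on the reasoning that prompt tokens are processed in parallel. That choice, the FP16 weights, and the 2 FLOPs per parameter per token are assumptions of the sketch, and the result is an estimate rather than a measured number.

```python
PARAMS = 7e9          # Llama 2 7B
BYTES_PER_PARAM = 2   # assumption: FP16 weights
FLOPS_PER_PARAM = 2   # assumption: one multiply and one add per parameter per token
MEM_BW = 600e9        # A10 memory bandwidth, bytes/s
PEAK_FLOPS = 125e12   # A10 peak compute, FLOP/s

prompt_tokens = 350
output_tokens = 150

# Generation: every new token requires reading all weights from memory once.
time_per_token = PARAMS * BYTES_PER_PARAM / MEM_BW

# Prefill: assumed compute-bound, so it scales with FLOPs per prompt token.
prefill_time = prompt_tokens * PARAMS * FLOPS_PER_PARAM / PEAK_FLOPS

total_time = prefill_time + output_tokens * time_per_token

print(f"time/token: {time_per_token * 1e3:.1f} ms")
print(f"prefill:    {prefill_time * 1e3:.1f} ms")
print(f"total:      {total_time:.2f} s")
```

In this estimate the autoregressive phase dominates the total, which is consistent with generation being the memory-bound part of inference.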

Llama 2 7B Chat benchmark on NVIDIA GPUs

Input tokens: 350
Output tokens: 150