https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference
Quantitative performance metrics for LLMs (a measurement sketch follows the list)
- Completed requests per minute
- TTFT (Time To First Token)
- ITL (Inter-Token Latency)
- End-to-End Latency
- Cost/request
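
The per-request metrics above can all be read off the timestamps of a streaming response. Below is a minimal sketch, assuming a hypothetical `stream_completion` client that yields tokens as the server streams them and a placeholder per-output-token price; it only illustrates the definitions, not any particular benchmarking tool.

```python
import time

# Minimal sketch: derive TTFT, ITL, end-to-end latency, and cost per request
# from the arrival times of streamed tokens. `stream_completion` is a
# hypothetical client callable; swap in your actual streaming API call.
def measure_request(stream_completion, prompt, price_per_output_token=0.0):
    t_start = time.perf_counter()
    token_times = []
    for _token in stream_completion(prompt):
        token_times.append(time.perf_counter())

    if not token_times:
        raise RuntimeError("no tokens were streamed back")

    ttft = token_times[0] - t_start               # Time To First Token
    e2e_latency = token_times[-1] - t_start       # End-to-End Latency
    # Inter-Token Latency: mean gap between consecutive generated tokens
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    itl = sum(gaps) / len(gaps) if gaps else 0.0
    # Cost/request, counting output tokens only (placeholder price)
    cost = len(token_times) * price_per_output_token
    return {"ttft_s": ttft, "itl_s": itl, "e2e_s": e2e_latency, "cost": cost}
```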
Additional metrics for dedicated instances (a throughput sketch follows the list)
- Configuration
  - 8 replicas with 1 GPU each --> lowest TTFT (data parallelism only)
  - 1 replica with 8 GPUs --> highest throughput (8x the memory BW)
- Output token throughput
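
Output token throughput is an aggregate measure: the total number of tokens generated across all requests divided by the wall-clock duration of the run. A minimal sketch under that definition (the function name and inputs are assumptions, not from the article):

```python
# Sketch of the aggregate "output token throughput" metric: sum the output
# token counts from every request in the run and divide by the run's
# wall-clock duration. With 8 single-GPU replicas the per-replica counts are
# summed here; a single 8-GPU replica contributes one larger count.
def output_token_throughput(per_request_token_counts, wall_clock_seconds):
    total_tokens = sum(per_request_token_counts)
    return total_tokens / wall_clock_seconds  # tokens/s for the whole deployment
```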
Benchmark results (a concurrency-sweep sketch follows the list)
- Completed queries/minute vs concurrent requests
- Time to First Token(TTFT) vs concurrent requests
- Inter-Token Latency(ITL) vs concurrent requests
- End-to-End Latency vs concurrent requests
- Cost per thousand requests
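
Each of the plots above varies the number of concurrent requests. A hedged sketch of such a sweep, assuming a hypothetical async `send_request` coroutine and a placeholder GPU-hour price; it reports completed queries/minute and cost per thousand requests for each concurrency level.

```python
import asyncio
import time

GPU_HOUR_PRICE = 2.0  # USD per GPU-hour; placeholder, not a quoted price

async def sweep(send_request, prompts, concurrencies, num_gpus=1):
    """Run the prompt set at each concurrency level and aggregate metrics."""
    results = {}
    for c in concurrencies:
        start = time.perf_counter()
        completed = 0
        # Fire batches of `c` concurrent requests until the prompts run out.
        for i in range(0, len(prompts), c):
            batch = prompts[i:i + c]
            await asyncio.gather(*(send_request(p) for p in batch))
            completed += len(batch)
        elapsed = time.perf_counter() - start

        queries_per_min = completed / (elapsed / 60)
        # Cost per thousand requests = total GPU cost of the run,
        # divided by completed requests, scaled to 1,000 requests.
        cost_per_1k = (GPU_HOUR_PRICE * num_gpus) * (elapsed / 3600) / completed * 1000
        results[c] = {
            "queries_per_min": queries_per_min,
            "cost_per_1k_requests": cost_per_1k,
        }
    return results
```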