https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference
Reproducible Performance Metrics for LLM inference
Anyscale is releasing LLMPerf for benchmarking LLMs on current LLM offerings. See benchmarking results for Anyscale Endpoints vs Fireworks.ai.
Quantitative performance metrics for LLMs
- Completed requests per minute (requests/min)
- TTFT(Time To First Token)
- ITL(Inter-Token Latency)
- End-to-End Latency
- Cost/request
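
As a rough illustration of how the latency metrics above could be computed from a streamed response, here is a minimal sketch. It assumes a hypothetical `RequestTrace` record holding the send time and the arrival time of each output token; it is not LLMPerf's implementation. ITL is taken here as the mean gap between consecutive tokens after the first; some benchmarks fold TTFT into ITL, which this sketch does not do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    """Hypothetical per-request record; all timestamps are in seconds."""
    sent_at: float            # when the request was sent
    token_times: List[float]  # arrival time of each streamed output token


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: delay between sending and the first token."""
    return trace.token_times[0] - trace.sent_at


def itl(trace: RequestTrace) -> float:
    """Inter-Token Latency: mean gap between consecutive tokens (assumed definition)."""
    n = len(trace.token_times)
    if n < 2:
        return 0.0
    return (trace.token_times[-1] - trace.token_times[0]) / (n - 1)


def end_to_end_latency(trace: RequestTrace) -> float:
    """End-to-end latency: delay between sending and the last token."""
    return trace.token_times[-1] - trace.sent_at


def requests_per_minute(traces: List[RequestTrace]) -> float:
    """Completed requests per minute over the window spanned by all traces."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    return len(traces) * 60.0 / (end - start)
```

For a request sent at t = 0 whose tokens arrive at 0.5, 0.6, 0.7 and 0.8 seconds, this gives a TTFT of 0.5 s, an ITL of about 0.1 s, and an end-to-end latency of 0.8 s.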
Additional metrics for dedicated instances
- Configuration
  - 8 replicas with 1 GPU each --> lowest TTFT (data parallelism only)
  - 1 replica with 8 GPUs --> highest throughput (8x the memory bandwidth)
- Output token throughput
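
Output token throughput can be sketched as total generated tokens divided by the wall-clock measurement window, summed across replicas when the deployment uses several. The function name and the per-replica token counts below are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def output_token_throughput(tokens_per_replica: Sequence[int],
                            wall_seconds: float) -> float:
    """Aggregate output tokens per second over one shared measurement window.

    tokens_per_replica: output tokens generated by each replica in the window
                        (a single-element sequence for a 1-replica deployment).
    wall_seconds:       length of the measurement window in seconds.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(tokens_per_replica) / wall_seconds
```

With this definition, an 8-replica deployment and a 1-replica, 8-GPU deployment can be compared on the same footing, since both reduce to tokens per second for the whole machine.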
Benchmark results
- Completed queries/minute vs concurrent requests
- Time to First Token(TTFT) vs concurrent requests
- Inter-Token Latency(ITL) vs concurrent requests
- End-to-End Latency vs concurrent requests
- Cost per thousand requests
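
Cost per thousand requests depends on the pricing model. The sketch below covers two assumed cases: per-token pricing (a flat price per million tokens, input and output alike) and a dedicated instance billed per hour. All function names and prices are hypothetical and not taken from the benchmark.

```python
def cost_per_1k_requests_per_token(mean_input_tokens: float,
                                   mean_output_tokens: float,
                                   usd_per_million_tokens: float) -> float:
    """Cost of 1,000 requests under flat per-token pricing (assumed model)."""
    tokens_per_request = mean_input_tokens + mean_output_tokens
    cost_per_request = tokens_per_request * usd_per_million_tokens / 1_000_000
    return cost_per_request * 1000


def cost_per_1k_requests_hourly(usd_per_hour: float,
                                requests_per_minute: float) -> float:
    """Cost of 1,000 requests on an instance billed per hour (assumed model)."""
    if requests_per_minute <= 0:
        raise ValueError("requests_per_minute must be positive")
    requests_per_hour = requests_per_minute * 60
    return usd_per_hour / requests_per_hour * 1000
```

Under hourly billing the cost per request falls as completed requests per minute rise, so the cost figure is only meaningful alongside the concurrency level at which it was measured.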