
2023/12/11: Reproducible Performance Metrics for LLM inference

hellcat 2023. 12. 11. 08:36



Reproducible Performance Metrics for LLM inference

Anyscale is releasing LLMPerf, a tool for benchmarking current LLM inference offerings. The post includes benchmarking results for Anyscale Endpoints vs. Fireworks.ai.



Quantitative performance metrics for LLMs

  • Completed requests per minute (requests/min)
  • TTFT (Time To First Token)
  • ITL (Inter-Token Latency)
  • End-to-End Latency
  • Cost/request
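
As a rough illustration of how the timing metrics above relate to each other, the sketch below derives them from per-token arrival timestamps. `RequestTrace` and the function names are hypothetical and are not LLMPerf's API. The ITL here is the mean gap between consecutive output tokens, excluding the first-token delay, which is an assumption; a benchmark may define it differently.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Timestamps in seconds, as recorded by a hypothetical benchmark client."""
    sent_at: float            # when the request was sent
    token_times: list[float]  # arrival time of each streamed output token


def ttft(trace: RequestTrace) -> float:
    """Time to first token: delay from sending until the first output token."""
    return trace.token_times[0] - trace.sent_at


def inter_token_latency(trace: RequestTrace) -> float:
    """Mean gap between consecutive output tokens (first-token delay excluded)."""
    times = trace.token_times
    if len(times) < 2:
        return 0.0  # a single token has no inter-token gap
    return mean(b - a for a, b in zip(times, times[1:]))


def end_to_end_latency(trace: RequestTrace) -> float:
    """Delay from sending the request until the last output token arrives."""
    return trace.token_times[-1] - trace.sent_at


def completed_requests_per_minute(num_completed: int, window_seconds: float) -> float:
    """Completed requests normalized to a one-minute window."""
    return num_completed * 60.0 / window_seconds
```

For a request with a long output, the end-to-end latency is approximately TTFT plus ITL times the number of remaining output tokens, which is why the two are usually reported separately.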

Additional metrics for dedicated instances

  • Configuration
    • 8 replicas with 1 GPU each --> lowest TTFT (Data Parallelism only)
    • 1 replica with 8 GPUs --> highest throughput (8x the memory bandwidth)
  • Output token throughput
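
Output token throughput is typically measured across all concurrent requests rather than per request. A minimal sketch, assuming the benchmark client records the number of output tokens per completed request and the wall-clock duration of the whole run (the function name is hypothetical):

```python
def output_token_throughput(output_token_counts: list[int],
                            wall_clock_seconds: float) -> float:
    """Output tokens per second, summed over every request in the run."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return sum(output_token_counts) / wall_clock_seconds
```

Dividing by the wall-clock time of the run, not by the sum of per-request latencies, is what lets this number grow with concurrency.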

Benchmark results

  • Completed queries/minute vs concurrent requests
  • Time to First Token (TTFT) vs concurrent requests
  • Inter-Token Latency (ITL) vs concurrent requests
  • End-to-End Latency vs concurrent requests
  • Cost per thousand requests
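
For per-token pricing, cost per thousand requests can be estimated from the average input and output token counts. The sketch below assumes separate prices per million input and output tokens. The function name is hypothetical and the prices in the usage example are placeholders, not figures from any provider.

```python
def cost_per_thousand_requests(input_tokens: int,
                               output_tokens: int,
                               usd_per_million_input: float,
                               usd_per_million_output: float) -> float:
    """Estimated USD cost of 1,000 requests with the given token counts."""
    per_request = (input_tokens * usd_per_million_input
                   + output_tokens * usd_per_million_output) / 1_000_000
    return per_request * 1000


# Placeholder prices for illustration only (not real provider pricing).
example_cost = cost_per_thousand_requests(500, 100, 1.0, 2.0)
```

Comparing providers on this number only makes sense when the input and output token counts are held fixed across runs.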