Daily-Trend-Review

2023/12/11: Reproducible Performance Metrics for LLM inference

hellcat 2023. 12. 11. 08:36

https://www.anyscale.com/blog/reproducible-performance-metrics-for-llm-inference

 

Anyscale is releasing LLMPerf for benchmarking LLMs on current LLM offerings. See benchmarking results for Anyscale Endpoints vs Fireworks.ai.

 

Quantitative performance metrics for LLMs

  • Completed requests per minute (requests/min)
  • TTFT(Time To First Token)
  • ITL(Inter-Token Latency)
  • End-to-End Latency
  • Cost/request
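As a rough illustration of how the metrics above could be derived from one streamed response, here is a minimal Python sketch. The `RequestTrace` record, the function names, and the per-million-token prices are all hypothetical and are not taken from LLMPerf. ITL is computed here as the mean gap between consecutive output tokens; other tools may define it differently, for example by folding TTFT into the average.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class RequestTrace:
    """Hypothetical record of one streamed request (all times in seconds)."""
    start: float              # when the request was sent
    token_times: list[float]  # arrival time of each output token
    input_tokens: int         # number of prompt tokens


def ttft(trace: RequestTrace) -> float:
    """Time To First Token: first token arrival minus send time."""
    return trace.token_times[0] - trace.start


def itl(trace: RequestTrace) -> float:
    """Inter-Token Latency: mean gap between consecutive output tokens."""
    gaps = [b - a for a, b in zip(trace.token_times, trace.token_times[1:])]
    return mean(gaps) if gaps else 0.0


def end_to_end(trace: RequestTrace) -> float:
    """End-to-end latency: last token arrival minus send time."""
    return trace.token_times[-1] - trace.start


def cost_per_request(trace: RequestTrace,
                     usd_per_m_input: float,
                     usd_per_m_output: float) -> float:
    """Cost of one request under assumed per-million-token prices."""
    output_tokens = len(trace.token_times)
    return (trace.input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000


def completed_per_minute(traces: list[RequestTrace],
                         window_seconds: float) -> float:
    """Completed requests per minute over a measurement window."""
    return len(traces) * 60 / window_seconds
```

For a trace sent at t = 0 whose four tokens arrive at 0.5, 0.6, 0.7 and 0.8 seconds, this gives a TTFT of 0.5 s, an ITL of about 0.1 s, and an end-to-end latency of 0.8 s.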

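Additional metrics for dedicated instances

  • Configuration
    • 8 replicas with 1 GPU each --> likely the lowest TTFT (data parallelism only, so 8 independent queues)
    • 1 replica with 8 GPUs --> likely the highest throughput (effectively 8x the memory bandwidth)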
  • Output token throughput
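Output token throughput can be sketched as the total number of generated tokens divided by the wall-clock time of the run, summed over every request served in that window. The function name and inputs below are hypothetical, chosen only for illustration.

```python
def output_token_throughput(output_token_counts: list[int],
                            wall_seconds: float) -> float:
    """Output tokens per second across all requests in the window.

    output_token_counts: tokens generated by each completed request.
    wall_seconds: wall-clock duration of the whole measurement window.
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return sum(output_token_counts) / wall_seconds
```

Because the numerator sums over concurrent requests, this figure describes the instance as a whole rather than the speed seen by any single user, which is why it is listed separately from ITL.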

Benchmark results

  • Completed queries/minute vs concurrent requests
  • Time to First Token(TTFT) vs concurrent requests
  • Inter-Token Latency(ITL) vs concurrent requests
  • End-to-End Latency vs concurrent requests
  • Cost per thousand requests
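Each "vs concurrent requests" result above comes from repeating the same measurement at several concurrency levels. The sketch below shows one way such a sweep could be structured with `asyncio`. It is not how LLMPerf is implemented: `fake_request` is a hypothetical stand-in that only sleeps, and a real run would replace it with a call to the endpoint under test.

```python
import asyncio
import time


async def fake_request() -> float:
    """Hypothetical stand-in for one LLM call; returns its latency in seconds."""
    start = time.perf_counter()
    await asyncio.sleep(0.01)  # pretend the endpoint takes about 10 ms
    return time.perf_counter() - start


async def run_level(concurrency: int, total_requests: int) -> dict:
    """Issue total_requests calls with at most `concurrency` in flight."""
    limit = asyncio.Semaphore(concurrency)

    async def one() -> float:
        async with limit:  # wait for a free slot before sending
            return await fake_request()

    wall_start = time.perf_counter()
    latencies = await asyncio.gather(*(one() for _ in range(total_requests)))
    wall = time.perf_counter() - wall_start
    return {
        "concurrency": concurrency,
        "completed_per_minute": total_requests * 60 / wall,
        "mean_e2e_seconds": sum(latencies) / len(latencies),
    }


def sweep(levels: list[int], total_requests: int = 20) -> list[dict]:
    """Run the same workload once per concurrency level."""
    return [asyncio.run(run_level(c, total_requests)) for c in levels]


def cost_per_thousand(cost_per_request: float) -> float:
    """Scale an average per-request cost to a thousand requests."""
    return cost_per_request * 1000
```

Calling `sweep([1, 4, 16])` returns one row per level, which can then be plotted against concurrency in the same way as the charts listed above.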