2023/12/10: optimizing your llm in production


hellcat 2023. 12. 10. 15:07

Source: Optimizing your LLM in production (https://huggingface.co/blog/optimize-llm)

The most effective techniques for efficient LLM deployment

Lower Precision
Flash Attention
Architectural Innovations
- ALiBi
- Rotary embeddings
- Multi-Query Attention (MQA): Falcon, PaLM, MPT, BLOOM
- Grouped-Query Attention (GQA): Llama 2

KV cache

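With a KV cache, the text input does not grow in length: only one token (length = 1) is fed to the model at each decoding step.

The KV cache itself, in contrast, grows by 1 at every decoding step.

Far less computation is needed than when computing the full QK^T matrix, so computational efficiency improves considerably.
Peak memory grows linearly with the number of generated tokens rather than quadratically.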
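The behavior described above can be illustrated with a minimal sketch in plain Python. It is a toy single-head attention, not the implementation of any library: the class `KVCache`, the helper names, and the simplification that queries, keys, and values are all the raw input vector (no learned projections) are assumptions made for illustration only.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    # Attention output for ONE query vector over the given keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def full_causal_attention(xs):
    # No cache: every row of the causal QK^T matrix is recomputed.
    return [attend(xs[i], xs[: i + 1], xs[: i + 1]) for i in range(len(xs))]


class KVCache:
    # Hypothetical toy cache: stores one key and one value per past token.
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # The input is a single token (length 1); the cache grows by 1.
        self.keys.append(x)    # assumption: key projection is the identity
        self.values.append(x)  # assumption: value projection is the identity
        return attend(x, self.keys, self.values)


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]

cache = KVCache()
cached_outputs, cache_lengths = [], []
for x in tokens:
    cached_outputs.append(cache.step(x))
    cache_lengths.append(len(cache.keys))

full_outputs = full_causal_attention(tokens)
```

In this sketch `cache_lengths` increases by one per step while each call to `step` receives a single token, and `cached_outputs` matches `full_outputs`, which is the point of the cache: the same result with only one new query row computed per step.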

User: How many people live in France?
Assistant: Roughly 75 million people live in France
User: And how many are in Germany?
Assistant: Germany has ca. 81 million inhabitants

In this chat, the LLM runs auto-regressive decoding twice.

1st decoding step

The KV cache is empty.
The model receives the prompt "User: How many people live in France?" and auto-regressively generates "Roughly 75 million people live in France".

2nd decoding step

Thanks to the KV cache, all KV vectors for the first two sentences have already been computed, so only "User: And how many are in Germany?" needs to be passed as the input prompt.
While this shortened prompt is processed, its computed KV vectors are concatenated to the KV cache from the first decoding.
The model then auto-regressively generates the second Assistant answer, "Germany has ca. 81 million inhabitants", using a KV cache that encodes "User: How many people live in France? \n Assistant: Roughly 75 million people live in France \n User: And how many are in Germany?".
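The two decoding steps above can be sketched as a small bookkeeping simulation. It does not run a model: `ChatSession` is a hypothetical class, whitespace splitting stands in for a real tokenizer, and feeding the whole reply at once stands in for token-by-token generation. It only counts how many tokens have to be processed per turn.

```python
class ChatSession:
    # Hypothetical toy chat loop that keeps a KV cache between turns.
    def __init__(self):
        # One entry per token whose key/value vectors would be cached.
        self.cached_tokens = []

    def process(self, text):
        # Feed only the NEW text; return how many tokens were processed.
        new_tokens = text.split()  # assumption: whitespace "tokenizer"
        self.cached_tokens.extend(new_tokens)
        return len(new_tokens)


turn1_prompt = "User: How many people live in France?"
turn1_reply = "Assistant: Roughly 75 million people live in France"
turn2_prompt = "User: And how many are in Germany?"

session = ChatSession()
session.process(turn1_prompt)  # 1st decoding: cache starts empty
session.process(turn1_reply)   # generated tokens are cached as well
processed_with_cache = session.process(turn2_prompt)  # 2nd decoding

# Without a cache, the whole history would be re-encoded for turn 2.
full_history = " ".join([turn1_prompt, turn1_reply, turn2_prompt])
processed_without_cache = len(full_history.split())
cache_length = len(session.cached_tokens)
```

Under these assumptions the second turn processes only the 7 new prompt tokens instead of all 22 tokens of the conversation, while the cache still covers the full history.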

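Two things follow from this.

#1. For a chat LLM, keeping the full context is essential so that the model can understand all previous turns of the conversation. In the example above, when the user asks "And how many are in Germany?", the LLM has to understand that the user is referring to population. (The first sentence, "User: How many people live in France?", provides the hint.)
#2. The KV cache is extremely useful for chat because the encoded chat history can keep growing continuously, instead of being re-encoded from scratch as it would be with an encoder-decoder architecture.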

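However, the KV cache requires a lot of memory.

For Multi-Head Attention (MHA), the key-value vectors for all previous input vectors x_i (for i ∈ {1, ..., c-1}) must be stored, for every self-attention layer and every attention head.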
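A rough estimate of that memory cost can be sketched as follows. The function name `kv_cache_elements` and the model configuration (40 layers, 48 heads, hidden size 6144, 16,000 tokens) are illustrative assumptions, not values taken from this post; the formula simply counts one key vector and one value vector per token, layer, and head.

```python
def kv_cache_elements(seq_len, n_layers, n_kv_heads, head_dim):
    # Factor 2: one key vector and one value vector are stored
    # per token, per layer, per key-value head.
    return 2 * seq_len * n_layers * n_kv_heads * head_dim


# Assumed, illustrative configuration.
seq_len = 16_000
n_layers, n_heads, hidden_size = 40, 48, 6144
head_dim = hidden_size // n_heads  # 128

# Multi-Head Attention: every head keeps its own keys and values.
mha_elements = kv_cache_elements(seq_len, n_layers, n_heads, head_dim)
mha_bytes_fp16 = mha_elements * 2  # 2 bytes per float16 value

# Multi-Query Attention: all query heads share a single key-value head.
mqa_elements = kv_cache_elements(seq_len, n_layers, 1, head_dim)
```

With these assumed numbers the MHA cache holds 7,864,320,000 values, roughly 15.7 GB in float16, whereas sharing one key-value head as in MQA shrinks it by a factor equal to the number of heads (48 here).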
