Large Language Models - the hardware connection
LLM inference - HW/SW optimizations
How to Build Low-cost Networks for Large Language Models (without Sacrificing Performance)?
Reducing Activation Recomputation in Large Transformer Models
Per-layer activation memory breakdown from the paper (s: sequence length, b: microbatch size, h: hidden dimension, a: number of attention heads; sizes in bytes, assuming activations stored in 16-bit precision):

| Block | Sub-component | Activation memory | Subtotal |
|---|---|---|---|
| Attention block | Q, K, V matrix multiplies | 2sbh | 11sbh + 5as²b |
| | QKᵀ | 4sbh | |
| | Softmax | 2as²b | |
| | Softmax dropout | as²b | |
| | Attention over values (V) | 2as²b (dropout output) + 2sbh (values) | |
| | Linear projection | 2sbh | |
| | Attention dropout | sbh | |
| MLP | First linear layer input | 2sbh | 19sbh |
| | Second linear layer input | 8sbh | |
| | GeLU input | 8sbh | |
| | Dropout mask | sbh | |
| Layer normalization | LayerNorm #1 | 2sbh | 4sbh |
| | LayerNorm #2 | 2sbh | |
| Total activations per layer | | sbh(34 + 5as/h) | |
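To make the formula concrete, here is a minimal Python sketch (my own, not from the paper) that evaluates the per-layer total sbh(34 + 5as/h). The example dimensions are an assumption based on a GPT-3-175B-like layer, just to show the order of magnitude.

```python
def activation_bytes_per_layer(s: int, b: int, h: int, a: int) -> int:
    """Per-layer activation memory in bytes, without any recomputation:
    s * b * h * (34 + 5 * a * s / h), following the table above.

    s: sequence length, b: microbatch size,
    h: hidden dimension, a: number of attention heads.
    """
    return int(s * b * h * (34 + 5 * a * s / h))

# Assumed GPT-3-175B-like layer: s=2048, b=1, h=12288, a=96
per_layer = activation_bytes_per_layer(s=2048, b=1, h=12288, a=96)
print(f"Activations per layer: {per_layer / 2**30:.2f} GiB")  # ~2.67 GiB

# Note: the 5as²b portion comes from the attention score / softmax / dropout
# tensors and grows quadratically with sequence length, which is why the
# paper's selective recomputation targets exactly those tensors.
```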