Large Language Models - the hardware connection
LLM inference - HW/SW optimizations
HOW TO BUILD LOW-COST NETWORKS FOR LARGE LANGUAGE MODELS (WITHOUT SACRIFICING PERFORMANCE)?
Reducing Activation Recomputation in Large Transformer Models
Per-layer activation memory breakdown (s = sequence length, b = microbatch size, h = hidden dimension, a = number of attention heads; sizes are in bytes assuming 16-bit activations):

| Block | Operation | Activation memory | Block total |
|---|---|---|---|
| Attention | Q, K, V matrix multiplies (shared input) | 2sbh | 11sbh + 5as²b |
| | QKᵀ matrix multiply (Q and K) | 4sbh | |
| | Softmax output | 2as²b | |
| | Softmax dropout mask | as²b | |
| | Attention over values (V) | 2as²b (dropout output) + 2sbh (values) | |
| | Linear projection input | 2sbh | |
| | Attention dropout mask | sbh | |
| MLP | Linear #1 input | 2sbh | 19sbh |
| | GeLU input | 8sbh | |
| | Linear #2 input | 8sbh | |
| | Dropout mask | sbh | |
| Layer normalization | Layernorm #1 input | 2sbh | 4sbh |
| | Layernorm #2 input | 2sbh | |
| Total activations | | | sbh(34 + 5as/h) |
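
To make the formula concrete, below is a minimal Python sketch (not from the paper's code) that evaluates the per-layer total sbh(34 + 5as/h) from the table. The GPT-3-like settings in the example (s=2048, b=1, h=12288, a=96) are illustrative assumptions, not numbers taken from this post.

```python
def activations_per_layer_bytes(s: int, b: int, h: int, a: int) -> int:
    """Activation memory (bytes) stored by one transformer layer,
    assuming 16-bit activations, i.e. the formula sbh(34 + 5as/h)."""
    attention = 11 * s * b * h + 5 * a * s * s * b   # attention block: 11sbh + 5as²b
    mlp       = 19 * s * b * h                       # MLP block: 19sbh
    layernorm = 4 * s * b * h                        # two layer norms: 4sbh
    return attention + mlp + layernorm               # = sbh * (34 + 5*a*s/h)

if __name__ == "__main__":
    # Illustrative GPT-3-like settings: sequence length 2048, microbatch 1,
    # hidden size 12288, 96 attention heads.
    per_layer = activations_per_layer_bytes(s=2048, b=1, h=12288, a=96)
    print(f"{per_layer / 2**30:.2f} GiB of activations per layer")
```

With these assumed settings the formula gives roughly 2.7 GiB of activations per layer, which is the kind of number that motivates the paper's selective activation recomputation.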