Daily-Trend-Review

2023/08/28: inference optimization

hellcat 2023. 8. 28. 00:37

Full Stack Optimization of Transformer Inference: a Survey

 

How Long Can Open-Source LLMs Truly Promise on Context Length?

 


In this blog post, we introduce our latest series of chatbot models, LongChat-7B and LongChat-13B, featuring a new level of extended context length up to 1...

lmsys.org

 

Efficiently Scaling Transformer Inference

 

Dissecting Batching Effects in GPT Inference

 


Machine learning models rely on batching to improve inference throughput, especially smaller computer vision models such as ResNet and DenseNet. GPT, as well as other large language models (LLMs), is the hottest model these days. Does batching still...

le.qun.ch
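The post above examines why batching helps GPT inference. A minimal back-of-the-envelope sketch (assuming a d×d fp16 weight matrix, batch size b, and counting only weight and activation traffic) shows that arithmetic intensity of the decoding matmul grows almost linearly with batch size, because the same weights are reused across the batch:

```python
def arithmetic_intensity(d: int, b: int, bytes_per_elem: int = 2) -> float:
    """Rough FLOPs-per-byte for a (b, d) x (d, d) matmul during decoding.

    FLOPs: 2 * b * d * d (one multiply-accumulate per output element).
    Bytes: weights (d*d) plus input and output activations (2 * b * d),
    each counted once in fp16 (2 bytes per element).
    """
    flops = 2 * b * d * d
    bytes_moved = bytes_per_elem * (d * d + 2 * b * d)
    return flops / bytes_moved

# At batch size 1 the weight matrix dominates traffic, so intensity is
# about 1 FLOP/byte; larger batches reuse the weights, so intensity
# grows almost linearly until activations start to matter.
print(arithmetic_intensity(4096, 1))   # ~1.0
print(arithmetic_intensity(4096, 64))  # ~62
```

This is a simplification (it ignores the KV cache, attention, and kernel details), but it captures why batch-1 decoding is bandwidth-bound and larger batches recover compute utilization.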

Accelerating transformer inference on my RTX 4090

 


Justifying an unjustifiable purchase. (I am financially ruined)

www.ericjwang.com

GPU Performance Background User's Guide

 

GPU Performance Background User's Guide - NVIDIA Docs

The GPU is a highly parallel processor architecture, composed of processing elements and a memory hierarchy. At a high level, NVIDIA® GPUs consist of a number of Streaming Multiprocessors (SMs), on-chip L2 cache, and high-bandwidth DRAM. ...

docs.nvidia.com

Code Llama: Open Foundation Models for Code

 

Math-Bound VS Memory-Bound Operations

 


Computation Bandwidth, Memory Bandwidth, and Data Reuse

leimao.github.io
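The rule of thumb from the article above: an operation is math-bound when its FLOPs-per-byte (arithmetic intensity) exceeds the machine's ops:byte ratio, and memory-bound otherwise. A minimal sketch, assuming A100-like peak numbers (the 312 TFLOPS fp16 and ~2 TB/s figures are assumptions, not taken from the article):

```python
def is_math_bound(flops: float, bytes_moved: float,
                  peak_flops: float = 312e12,      # fp16 tensor-core peak, A100-like (assumed)
                  peak_bandwidth: float = 2.0e12   # HBM bandwidth in bytes/s (assumed)
                  ) -> bool:
    """Math-bound if arithmetic intensity exceeds the machine's
    ops:byte ratio; otherwise memory bandwidth is the limiter."""
    machine_balance = peak_flops / peak_bandwidth   # ~156 FLOPs per byte here
    return (flops / bytes_moved) > machine_balance

# A large square fp16 GEMM (n=4096): intensity ~ n/3 ~ 1365 -> math-bound.
n = 4096
assert is_math_bound(2 * n**3, 2 * 3 * n * n)
# An fp16 matrix-vector product (batch-1 decoding): intensity ~ 1 -> memory-bound.
assert not is_math_bound(2 * n * n, 2 * (n * n + 2 * n))
```

The same comparison underlies the roofline model: below the machine balance, speedups come from moving fewer bytes (data reuse), not from faster math.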

Transformer Inference Arithmetic

 

Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking

 

Pre-computed memory or on-the-fly encoding? A hybrid approach to retrieval augmentation makes the most of your compute

 

GPTQ or bitsandbytes: Which Quantization Method to Use for LLMs — Examples with Llama 2

 


Large language model quantization for affordable fine-tuning and inference on your computer

towardsdatascience.com
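Both GPTQ and bitsandbytes store weights in lower precision with a floating-point scale. The sketch below is the naive round-to-nearest absmax scheme that underlies these ideas, not either library's actual algorithm (GPTQ in particular uses second-order information to pick rounding directions):

```python
import numpy as np

def absmax_quantize_int8(w: np.ndarray):
    """Naive per-tensor absmax quantization: int8 weights + one fp32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 4)).astype(np.float32)
q, s = absmax_quantize_int8(w)
w_hat = dequantize(q, s)

# int8 storage is 4x smaller than fp32; round-to-nearest bounds the
# per-weight error by half a quantization step (scale / 2).
assert np.max(np.abs(w - w_hat)) <= s / 2 + 1e-6
```

Real implementations quantize per-row or per-block rather than per-tensor, precisely because one outlier weight inflates the scale and wastes precision everywhere else.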

Understanding QLoRA & LoRA: Fine-tuning of LLMs

 


In this short note, we gently review LoRA [1] and QLoRA [2] papers. Fine-tuning LLMs is a popular subject these days. These two papers have…

medium.com
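The core idea of LoRA is to freeze the pretrained weight W and learn only a low-rank update ΔW = BA, so r·(d_in + d_out) parameters train instead of d_in·d_out (QLoRA additionally keeps the frozen W in 4-bit). A minimal numpy sketch; the dimensions and alpha value are illustrative, not from the papers:

```python
import numpy as np

d_out, d_in, r = 1024, 1024, 8   # rank r << d (illustrative sizes)
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in)).astype(np.float32)     # frozen pretrained weight
A = rng.standard_normal((r, d_in)).astype(np.float32) * 0.01  # trainable
B = np.zeros((d_out, r), dtype=np.float32)                    # trainable, zero-init so dW starts at 0
alpha = 16.0                                                  # LoRA scaling hyperparameter

def lora_forward(x: np.ndarray) -> np.ndarray:
    """y = W x + (alpha / r) * B (A x); W never receives gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

# Trainable parameters shrink from d_out * d_in to r * (d_in + d_out):
full, lora = W.size, A.size + B.size
print(full // lora)  # 64x fewer trainable parameters at r=8
```

Initializing B to zero means training starts exactly at the pretrained model's behavior, which is part of why LoRA fine-tuning is stable.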