Daily-Trend-Review

2023/08/20: Llama 2 inference, continuous batching, etc.

hellcat 2023. 8. 20. 22:56

How continuous batching enables 23x throughput in LLM inference while reducing p50 latency

In this blog, we discuss continuous batching, a critical systems-level optimization that improves both throughput and latency under load for LLMs.

www.anyscale.com
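
The idea behind continuous (iteration-level) batching is easy to show with a toy scheduler. The sketch below is not the Anyscale/vLLM implementation, just a minimal simulation with an assumed batch size and random request lengths: static batching makes every request in a batch wait for the longest one, while continuous batching refills freed slots after every decoding step.

```python
import random
from collections import deque

# Toy simulation of continuous (iteration-level) batching vs. static batching.
# Not the vLLM/Orca code; just the scheduling idea: after every decoding step,
# finished sequences leave the batch and queued requests take their slots.

MAX_BATCH = 4  # assumed maximum number of concurrent sequences

def simulate_continuous(requests):
    """requests: list of generation lengths (tokens). Returns total decoding steps."""
    queue = deque(requests)
    running = []          # remaining tokens for each in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into free slots at every iteration.
        while queue and len(running) < MAX_BATCH:
            running.append(queue.popleft())
        # One decoding step for every running sequence; drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps

def simulate_static(requests):
    """Static batching: each batch of MAX_BATCH waits for its longest request."""
    steps = 0
    for i in range(0, len(requests), MAX_BATCH):
        steps += max(requests[i:i + MAX_BATCH])
    return steps

if __name__ == "__main__":
    lengths = [random.randint(5, 200) for _ in range(32)]
    print("static steps:    ", simulate_static(lengths))
    print("continuous steps:", simulate_continuous(lengths))
```

With these assumed numbers the continuous scheduler finishes in roughly sum(lengths) / MAX_BATCH steps, while the static one pays the maximum length of every batch, which is where the throughput gap comes from.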

Why GPT-3.5 is (mostly) cheaper than Llama 2

Llama-2 is more expensive than you'd think. In this post, we explore why it's often more expensive than gpt-3.5-turbo.

www.cursor.so
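
The post's argument is mostly arithmetic: self-hosted cost per token is the GPU bill divided by the tokens you actually generate, so it balloons when utilization or batch size is low. Below is a rough sketch with illustrative numbers; the replica price, throughput, and API price are assumptions, not measurements.

```python
# Back-of-the-envelope comparison in the spirit of the linked post.
# Every constant here is an assumption for illustration only.

REPLICA_COST_PER_HOUR = 8.0        # USD/hour for the GPUs hosting one Llama-2 replica (assumed)
REPLICA_TOKENS_PER_SEC = 600       # aggregate generated tokens/s with good batching (assumed)
GPT35_PRICE_PER_1K_TOKENS = 0.002  # gpt-3.5-turbo output price per 1K tokens (mid-2023 list price)

def self_hosted_cost_per_1k(cost_per_hour: float, tokens_per_sec: float, utilization: float) -> float:
    """Cost of generating 1K tokens on self-hosted GPUs at a given utilization."""
    tokens_per_hour = tokens_per_sec * 3600 * utilization
    return cost_per_hour / tokens_per_hour * 1000

for util in (0.05, 0.25, 1.0):
    cost = self_hosted_cost_per_1k(REPLICA_COST_PER_HOUR, REPLICA_TOKENS_PER_SEC, util)
    print(f"utilization {util:>4.0%}: self-hosted ${cost:.4f} / 1K tokens vs "
          f"gpt-3.5-turbo ${GPT35_PRICE_PER_1K_TOKENS:.4f}")
```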

How is LLaMa.cpp possible?

This is an attempt at answering the question in the title. Note: this was written in March of '23 and is out of date (AI moves quickly!).

finbarr.ca
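
The post's core estimate fits in a few lines: at batch size 1, every generated token has to stream all of the (quantized) weights through memory, so tokens/s is bounded by memory bandwidth divided by model size. The bandwidth figures below are rough assumptions used only to show the shape of the estimate.

```python
# Rough upper bound on single-stream generation speed, in the spirit of the
# linked post: generation is memory-bandwidth-bound, not FLOPs-bound.

PARAMS = 7e9              # LLaMA-7B parameter count
BYTES_PER_PARAM = 0.5     # ~4-bit quantization as used by llama.cpp (assumed)
MODEL_BYTES = PARAMS * BYTES_PER_PARAM

bandwidths_gb_s = {
    "laptop CPU (~50 GB/s, assumed)": 50,
    "Apple M-series unified memory (~200 GB/s, assumed)": 200,
    "A100 HBM (~1500 GB/s, assumed)": 1500,
}

for name, bw in bandwidths_gb_s.items():
    # Upper bound: each token reads all weights from memory once.
    tokens_per_sec = bw * 1e9 / MODEL_BYTES
    print(f"{name}: <= {tokens_per_sec:.0f} tokens/s")
```

With these assumed numbers, a 4-bit 7B model (~3.5 GB) tops out around 14 tokens/s on an ordinary laptop, which is why llama.cpp is usable at all on consumer hardware.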

Implementation of Llama v2.0, FAISS in Python using LangChain

Ever since ChatGPT arrived on the market and OpenAI launched GPT-4, the craze for Large Language Models (LLMs) among developers…

medium.com
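
For reference, the usual LangChain + FAISS retrieval pattern looks roughly like the sketch below; the file path, embedding model, and chunking parameters are assumptions, and the linked article's exact code may differ (it presumably goes on to wire the retriever into a Llama 2 chain).

```python
# Minimal sketch of the load -> chunk -> embed -> FAISS -> retrieve pattern.
# Model name, file path, and chunk settings are illustrative assumptions.

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# 1. Load and chunk the source documents.
docs = TextLoader("my_corpus.txt").load()                 # hypothetical file
chunks = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Embed the chunks and index them in FAISS.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")  # assumed model
db = FAISS.from_documents(chunks, embeddings)

# 3. Retrieve relevant chunks for a query; the retrieved text would then be
#    passed as context to a Llama 2 chain (e.g. a RetrievalQA setup).
hits = db.similarity_search("What does the corpus say about chunking?", k=3)
for doc in hits:
    print(doc.page_content[:120], "...")
```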

Optimize LLM Enterprise Applications through Embeddings and Chunking Strategy.

How to choose an embedding model? What’s the right chunk size?

actalyst.medium.com
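
Chunk size is the main knob such posts discuss: small chunks embed precisely but lose context, large chunks keep context but dilute the embedding and inflate the prompt. A quick way to compare settings is to split the same text several ways and look at what retrieval would actually hand to the model; the sizes and corpus file below are illustrative assumptions, not recommendations.

```python
# Compare a few chunking settings with LangChain's recursive splitter.
# The corpus file and chunk sizes are hypothetical, for illustration only.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text = open("my_corpus.txt").read()   # hypothetical corpus

for chunk_size in (200, 500, 1000):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_size // 10,   # ~10% overlap to avoid cutting ideas in half
    )
    chunks = splitter.split_text(text)
    avg = sum(len(c) for c in chunks) / len(chunks)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg {avg:.0f} chars")
```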