THEFASTEST.AI
Human conversations are fast, typically around 200ms between turns, and we think LLMs should be just as quick. This site provides reliable measurements for the performance of popular models.
You can filter models using the text fields in the header, e.g., Llama 3.1 405B providers, GPT-4 vs Claude 3 vs Gemini.
Definitions, methodology, and links to source below. Stats updated daily.
Have another model you want us to benchmark? File an issue on GitHub.
Definitions
===========
TTFT (Time To First Token): The time from when the request is sent until the first token of the response is received.
Methodology
===========
Distributed Footprint: We run our tools daily in multiple data centers using Fly.io. Currently we run in cdg (Paris), iad (Ashburn, Virginia), and sea (Seattle).
Connection Warmup: A warmup request is made first so that HTTP connection setup (TCP and TLS handshakes) is excluded from the measured latency.
TTFT Measurement: The TTFT clock starts when the HTTP request is made and stops when the first token is received in the response stream (see the first sketch after this list).
Input Tokens: The number of input tokens varies with the type of media supplied. For text inputs, the input is approximately 1000 tokens.
Output Tokens: The number of output tokens is set to 20 (~100 chars), the length of a typical conversational sentence.
Try 3, Keep 1: For each provider, three separate inferences are performed, and the best result is kept (to remove outliers due to queuing, etc.), as shown in the sketches below.
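
To make the TTFT measurement concrete, here is a minimal sketch in Python, assuming an OpenAI-compatible streaming endpoint and the official openai client. The prompt, model name, and token settings mirror the numbers above but are illustrative, not the actual code from the ai-benchmarks repo.

```python
import time
import openai

# Assumes OPENAI_API_KEY is set; any OpenAI-compatible endpoint works the same way.
client = openai.OpenAI()

# Roughly 1000 tokens of text input (each repetition is ~11 tokens).
PROMPT = "All work and no play makes Jack a dull boy. " * 90

def measure_ttft(model: str) -> float:
    """Return seconds from sending the request to receiving the first token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=20,  # ~100 chars, the length of a typical conversational sentence
        stream=True,
    )
    for chunk in stream:
        # Skip role-only or empty chunks; stop the clock at the first real token.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")
```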
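Building on that sketch, the warmup and try-3-keep-1 steps might look like the following; benchmark is a hypothetical helper for illustration, not a function from the repo.

```python
def benchmark(model: str, tries: int = 3) -> float:
    measure_ttft(model)  # warmup: absorbs connection setup on the pooled HTTP client
    # Try 3, keep 1: run three measured inferences and keep the best.
    return min(measure_ttft(model) for _ in range(tries))

if __name__ == "__main__":
    print(f"best TTFT: {benchmark('gpt-4o-mini'):.3f}s")
```

Taking the minimum rather than the mean matches the goal stated above: runs inflated by transient queuing are discarded rather than averaged in.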
Sources
=======
Raw Data: All data is in this public GCS bucket.
Benchmarking Tools: The full test suite is available in the ai-benchmarks repo.
Website: Full source code for this site is on GitHub.