████████╗██╗  ██╗███████╗███████╗ █████╗ ███████╗████████╗███████╗███████╗████████╗ █████╗ ██╗
╚══██╔══╝██║  ██║██╔════╝██╔════╝██╔══██╗██╔════╝╚══██╔══╝██╔════╝██╔════╝╚══██╔══╝██╔══██╗██║
   ██║   ███████║█████╗  █████╗  ███████║███████╗   ██║   █████╗  ███████╗   ██║   ███████║██║
   ██║   ██╔══██║██╔══╝  ██╔══╝  ██╔══██║╚════██║   ██║   ██╔══╝  ╚════██║   ██║   ██╔══██║██║
   ██║   ██║  ██║███████╗██║     ██║  ██║███████║   ██║   ███████╗███████║   ██║██╗██║  ██║██║
   ╚═╝   ╚═╝  ╚═╝╚══════╝╚═╝     ╚═╝  ╚═╝╚══════╝   ╚═╝   ╚══════╝╚══════╝   ╚═╝╚═╝╚═╝  ╚═╝╚═╝

Human conversations are fast, typically around 200ms between turns, and we think LLMs should be just as quick. This site provides reliable measurements for the performance of popular models.

You can filter models using the text fields in the header, e.g., Llama 3.1 405B providers, GPT-4 vs Claude 3 vs Gemini.

Definitions, methodology, and links to source below. Stats updated daily.

Have another model you want us to benchmark? File an issue on GitHub.

TTFT: Time To First Token
TPS: Tokens Per Second
Total Time: From request to final token


Definitions
===========

  • Model: The LLM used.
  • TTFT: Time To First Token. This is how quickly the model can process the incoming request and begin to output text, and translates directly into how quickly the UI starts to update. Lower values = lower latency/faster performance.
  • TPS: Tokens Per Second. This is how quickly the model can produce text and controls how quickly the full response shows up in the UI. Higher values = more throughput/faster performance.
  • Total: The total time from the start of the request until the response is complete, i.e., the last token has been generated. Total time = TTFT + Tokens / TPS (worked example below). Lower values = lower latency/faster performance.
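
For example, with hypothetical numbers (not measured results), a 0.25 s TTFT at 50 TPS producing 20 tokens gives a total of 0.65 s. In Python:

    # Hypothetical values for illustration only; not measured results.
    ttft = 0.25            # seconds until the first token arrives
    tps = 50.0             # tokens generated per second
    tokens = 20            # number of output tokens
    total = ttft + tokens / tps
    print(total)           # 0.65 seconds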

Methodology
===========

Distributed Footprint: We run our tools daily in multiple data centers using Fly.io. Currently we run in cdg (Paris), iad (Ashburn, Virginia), and sea (Seattle).
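
Fly.io exposes the current region to each machine through the FLY_REGION environment variable, so a sketch of tagging measurements by data center (not the repo's actual code) might look like:

    import os

    # Fly.io sets FLY_REGION (e.g., "cdg", "iad", "sea") inside each machine,
    # so every measurement can record which data center it came from.
    region = os.environ.get("FLY_REGION", "unknown")
    print(f"benchmarking from region: {region}")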

Connection Warmup: A warmup connection is made to remove any HTTP connection setup latency.
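
A minimal sketch of the warmup idea, using httpx and a placeholder endpoint (the real tooling is in the ai-benchmarks repo):

    import httpx

    # One long-lived client reuses the TCP/TLS connection, so the warmup
    # request absorbs all connection-setup cost and later timed requests
    # start from an already-established connection.
    client = httpx.Client()
    client.get("https://api.example.com/v1/models")  # warmup; response ignored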

TTFT Measurement: The TTFT clock starts when the HTTP request is made and stops when the first token is received in the response stream.
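
A rough sketch of that timing, assuming an OpenAI-style streaming chat completion (the URL, model name, and payload are illustrative, and the first streamed chunk is treated as the first token):

    import time
    import httpx

    client = httpx.Client()
    payload = {
        "model": "example-model",  # placeholder model name
        "messages": [{"role": "user", "content": "Say hello."}],
        "stream": True,
        "max_tokens": 20,
    }

    ttft = None
    start = time.perf_counter()  # clock starts when the request is made
    with client.stream("POST", "https://api.example.com/v1/chat/completions",
                       json=payload) as response:
        for chunk in response.iter_text():
            ttft = time.perf_counter() - start  # clock stops on first chunk
            break
    print(f"TTFT: {ttft:.3f}s")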

Max Tokens: The number of output tokens is capped at 20 (~100 chars), roughly the length of a typical conversational sentence.
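
The character estimate relies on a common rule of thumb of roughly 5 characters per English token, not a measured value:

    # Sanity check on the "~100 chars" figure (rule of thumb, not measured).
    max_tokens = 20
    chars_per_token = 5                   # assumed average for English text
    print(max_tokens * chars_per_token)   # ~100 characters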

Try 3, Keep 1: For each provider, three separate inferences are performed and the best result is kept (to remove outliers due to queuing, etc.).
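
In sketch form, with run_once standing in for a single timed inference (it is not a real function in the repo):

    import random

    def run_once() -> dict:
        """Stand-in for one timed inference; returns hypothetical timings."""
        return {"ttft": random.uniform(0.2, 0.6),
                "total": random.uniform(0.5, 1.5)}

    # Three attempts per provider; keep the fastest so a single queued or
    # throttled request does not skew the published number.
    attempts = [run_once() for _ in range(3)]
    best = min(attempts, key=lambda r: r["total"])
    print(best)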

Sources
=======

Raw Data: All data is in this public GCS bucket.

Benchmarking Tools: The full test suite is available in the ai-benchmarks repo.

Website: Full source code for this site is on GitHub.