Performance

Built for performance. Proven by benchmarks.

Midsphere leads on GAIA, Terminal Bench, and IMO 2025 — the toughest evaluations for real-world AI agent capability.

Results

The numbers speak for themselves

70.74%
GAIA Test
GPT-4o, full pipeline
86%
GAIA Validation
GPT-4o, full pipeline
35.2%
Terminal Bench
GPT-4o, full pipeline
5.5/6
IMO 2025
Mathematical olympiad
GAIA Benchmark

Real-World Task Completion

GAIA (General AI Assistants) measures an agent's ability to complete real-world tasks requiring web browsing, file handling, reasoning, and multi-step planning. It's the gold standard for evaluating AI agents in practical scenarios.

70.74%
Test set
#1 on public leaderboard
86%
Validation set
Consistent across runs

Score breakdown

Level 1 (simple) 92%

Basic web search and single-step tasks

Level 2 (moderate) 78%

Multi-tool coordination and file processing

Level 3 (hard) 54%

Complex multi-step reasoning chains

Terminal Bench

Terminal & Code Execution

Terminal Bench evaluates an agent's ability to operate in a Unix terminal environment — installing packages, manipulating files, debugging errors, running builds, and managing processes.

35.2%
Overall score
State-of-the-art for agent systems

Score breakdown

File operations 89%

Reading, writing, searching, and transforming files

Package management 72%

Installing, configuring, and debugging dependencies

Build & deploy 58%

End-to-end build pipelines and deployment workflows

Debugging 45%

Identifying root causes and applying fixes autonomously

IMO 2025

Mathematical Reasoning

The International Mathematical Olympiad is the most prestigious math competition in the world. Midsphere tackled the 2025 problem set — six problems spanning algebra, combinatorics, geometry, and number theory.

5.5/6
Score
Near-perfect on olympiad problems

Score breakdown

Problem 1 (Algebra) 100%

Full marks — clean, rigorous proof

Problem 2 (Combinatorics) 100%

Full marks — constructive argument

Problem 3 (Geometry) 100%

Full marks — coordinate + synthetic approach

Problem 4 (Number theory) 100%

Full marks — modular arithmetic chain

Problem 5 (Algebra) 100%

Full marks — inequality via AM-GM

Problem 6 (Combinatorics) 50%

Partial — correct approach, incomplete bound

Methodology

Our methodology

We believe benchmark results should be transparent, reproducible, and honestly reported.

Reproducible

All benchmarks run on standardized evaluation harnesses with fixed random seeds. Results are reproducible across runs.
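To make the idea concrete, here is a minimal sketch of what a fixed-seed, single-pass evaluation run can look like. This is an illustration only, not the Midsphere harness; the agent, task, and function names are hypothetical.

```python
import random

SEED = 42  # illustrative fixed seed, not a documented Midsphere value

def run_benchmark(agent, tasks, seed=SEED):
    """Seed once, run every task exactly once, and report plain accuracy
    with no retries or best-of-N selection (hypothetical harness sketch)."""
    random.seed(seed)  # pin randomness so repeated runs produce the same score
    passed = sum(agent.solve(task) == task.expected_answer for task in tasks)
    return passed / len(tasks)
```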

Full pipeline

Benchmarks use the complete Midsphere agent loop — tool selection, execution, error recovery, and answer synthesis.
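As a rough picture of what those four stages involve, the sketch below shows a generic agent loop. It is an assumption-laden illustration, not the Midsphere implementation; the class and method names are placeholders.

```python
def agent_loop(task, model, tools, max_steps=20):
    """Hypothetical agent loop illustrating the four stages named above."""
    history = [("task", task)]
    for _ in range(max_steps):
        # Tool selection: the model decides which tool to call next, or to stop.
        action = model.choose_action(history, tools)
        if action.name == "final_answer":
            break
        try:
            # Execution: run the selected tool with the model's arguments.
            observation = tools[action.name](**action.args)
        except Exception as err:
            # Error recovery: surface the failure so the next step can adjust.
            observation = f"tool failed: {err}"
        history.append((action.name, observation))
    # Answer synthesis: compose a final answer from the accumulated history.
    return model.synthesize_answer(history)
```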

GPT-4o backbone

All results use GPT-4o as the underlying model. Midsphere's orchestration layer, tool use, and planning are what drive the performance gains.

Public leaderboards

Scores are verified against official public leaderboards. We don't cherry-pick runs or report best-of-N results.

All benchmarks tested with GPT-4o and the full Midsphere agent pipeline. Results may vary based on task complexity and model updates.