Built for performance. Proven by benchmarks.
Midsphere leads on GAIA, Terminal Bench, and IMO 2025 — the toughest evaluations for real-world AI agent capability.
The numbers speak for themselves
Real-World Task Completion
GAIA (General AI Assistants) measures an agent's ability to complete real-world tasks requiring web browsing, file handling, reasoning, and multi-step planning. It's the gold standard for evaluating AI agents in practical scenarios.
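To make the grading concrete, here is a minimal sketch of GAIA-style scoring: each task pairs a question with a single ground-truth final answer, and a response is graded by normalized exact match. The normalization rules below are illustrative assumptions, not the official GAIA grader.

```python
import re

def normalize(answer: str) -> str:
    """Illustrative normalization: lowercase, trim, collapse whitespace,
    and drop thousands separators. The official grader's rules differ
    in detail; this is only a sketch."""
    answer = answer.strip().lower()
    answer = re.sub(r"(?<=\d),(?=\d)", "", answer)  # "1,234" -> "1234"
    answer = re.sub(r"\s+", " ", answer)
    return answer

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Fraction of tasks where the normalized prediction exactly
    matches the normalized ground-truth final answer."""
    hits = sum(
        normalize(predictions[task_id]) == normalize(gold[task_id])
        for task_id in gold
    )
    return hits / len(gold)

# Hypothetical two-task example: one match, one miss.
gold = {"t1": "1,234", "t2": "Paris"}
preds = {"t1": "1234", "t2": "London"}
print(score(preds, gold))  # 0.5
```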
Score breakdown
Level 1: Basic web search and single-step tasks
Level 2: Multi-tool coordination and file processing
Level 3: Complex multi-step reasoning chains
Terminal & Code Execution
Terminal Bench evaluates an agent's ability to operate in a Unix terminal environment — installing packages, manipulating files, debugging errors, running builds, and managing processes.
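As an illustration of what "operating in a terminal" means for an agent, here is a minimal sketch of the kind of shell-execution tool an agent loop might call. The function name, timeout, and truncation limit are assumptions for illustration; Terminal Bench's actual harness differs.

```python
import subprocess

def run_shell(command: str, timeout: int = 60) -> dict:
    """Hypothetical agent tool: run a shell command and return the
    exit code, stdout, and stderr so the model can observe the
    result and decide its next step."""
    result = subprocess.run(
        command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        "exit_code": result.returncode,
        "stdout": result.stdout[-4000:],  # truncate long output for the context window
        "stderr": result.stderr[-4000:],
    }

# Example: the observation an agent would see after a failed install.
print(run_shell("pip install nonexistent-package-xyz"))
```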
Score breakdown
File operations: Reading, writing, searching, and transforming files
Package management: Installing, configuring, and debugging dependencies
Build & deploy: End-to-end build pipelines and deployment workflows
Debugging: Identifying root causes and applying fixes autonomously
Mathematical Reasoning
The International Mathematical Olympiad is the most prestigious math competition in the world. Midsphere tackled the 2025 problem set — six problems spanning algebra, combinatorics, geometry, and number theory.
Score breakdown
Problem 1: Full marks (clean, rigorous proof)
Problem 2: Full marks (constructive argument)
Problem 3: Full marks (coordinate + synthetic approach)
Problem 4: Full marks (modular arithmetic chain)
Problem 5: Full marks (inequality via AM-GM; see the note after this list)
Problem 6: Partial credit (correct approach, incomplete bound)
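For reference, the AM-GM step cited for Problem 5 is the standard arithmetic mean–geometric mean inequality:

```latex
% AM-GM: for nonnegative reals x_1, ..., x_n,
\[
  \frac{x_1 + x_2 + \cdots + x_n}{n} \;\ge\; \sqrt[n]{x_1 x_2 \cdots x_n},
\]
% with equality if and only if x_1 = x_2 = \cdots = x_n.
```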
Our methodology
We believe benchmark results should be transparent, reproducible, and honestly reported.
Reproducible
All benchmarks run on standardized evaluation harnesses with fixed random seeds. Results are reproducible across runs.
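As a sketch of what fixed seeding can look like in practice (the function and values below are illustrative, not Midsphere's actual harness):

```python
import os
import random

def seed_harness(seed: int = 0) -> None:
    """Illustrative reproducibility setup: pin the sources of
    randomness the harness controls so repeated runs are identical."""
    random.seed(seed)                         # Python-level randomness
    os.environ["PYTHONHASHSEED"] = str(seed)  # propagated to subprocesses the harness spawns
    # Model calls are pinned too, e.g. temperature=0 and a fixed
    # sampling seed where the API supports one.
```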
Full pipeline
Benchmarks use the complete Midsphere agent loop — tool selection, execution, error recovery, and answer synthesis.
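A minimal sketch of that loop's shape, with stub helpers standing in for Midsphere's internals; everything here is hypothetical and only illustrates the four stages named above.

```python
class ToolError(Exception):
    """Raised when a tool invocation fails."""

def select_tool(task, history):
    """Stub: a real implementation would ask the model to choose
    the next tool. Here we stop after one echo step."""
    if history:
        return None, None  # done after one observation
    return "echo", {"text": task}

def execute(tool, args):
    """Stub tool executor for the single 'echo' tool."""
    if tool == "echo":
        return args["text"]
    raise ToolError(f"unknown tool: {tool}")

def synthesize_answer(task, history):
    """Stub: a real implementation would ask the model to write the
    final answer from the observations gathered."""
    return history[-1][2] if history else "no answer"

def agent_loop(task: str, max_steps: int = 10) -> str:
    """Illustrative loop: tool selection -> execution -> error
    recovery -> answer synthesis, as in the pipeline described above."""
    history = []
    for _ in range(max_steps):
        tool, args = select_tool(task, history)  # tool selection
        if tool is None:
            break                                # model signals it is done
        try:
            observation = execute(tool, args)    # execution
        except ToolError as err:
            observation = f"tool failed: {err}"  # error recovery: surface the
                                                 # failure as an observation
        history.append((tool, args, observation))
    return synthesize_answer(task, history)      # answer synthesis

print(agent_loop("What is 2 + 2?"))  # echoes the task via the stub tool
```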
GPT-4o backbone
All results use GPT-4o as the underlying model. Midsphere's orchestration layer, tool use, and planning are what drive the performance gains.
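A sketch of what holding the backbone fixed can look like, assuming the OpenAI Python SDK; the orchestration layer above this call is where an agent's own logic would live:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def backbone(messages: list[dict]) -> str:
    """Single fixed entry point to the underlying model: every
    planning, tool-selection, and synthesis step goes through the
    same GPT-4o call, so capability gains come from orchestration,
    not from swapping models."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0,  # deterministic-leaning decoding for benchmarking
    )
    return response.choices[0].message.content
```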
Public leaderboards
Scores are verified against official public leaderboards. We don't cherry-pick runs or report best-of-N results.
All benchmarks tested with GPT-4o and the full Midsphere agent pipeline. Results may vary based on task complexity and model updates.