We Built a Different Kind of AI Benchmark — Here’s Why
Over the past year, we’ve watched the AI industry obsess over one thing: AGI-style benchmarks.
Can a model solve PhD-level maths?
Can it reason about obscure scientific problems?
Can it cure diseases?
Those are fascinating questions — but they’re not the questions our clients ask us.
At Sulta, we work with businesses every day that are deploying AI into real systems: internal tools, customer-facing platforms, data pipelines, support workflows, and decision-making processes. In those environments, raw intelligence is only part of the equation.
So we built something different.
The Problem With Current AI Benchmarks
Most popular benchmarks optimise for abstract intelligence, not business output.
They don’t tell you:
- How well a model performs under high-pressure, high-stakes customer Q&A
- Whether it can reliably call tools across messy, real-world APIs
- How well it understands and consolidates internal documentation
- If it can summarise long, contradictory inputs into something actually usable
- How autonomous it can be without silently failing or hallucinating
In enterprise environments, failure isn’t academic — it costs time, money, and trust.
We kept running into the same issue:
great benchmark scores, disappointing production results.
That gap is what pushed us to build our own benchmark from the ground up.
What We Actually Built
This isn’t a single test or leaderboard.
It’s a fully modular, autonomous benchmarking framework designed specifically to measure the business performance of LLMs.
At a high level, it evaluates models across scenarios like:
- Custom enterprise Q&A under tight constraints
- Multi-step tool calling with brittle dependencies
- Long-context summarisation with conflicting signals
- Internal document understanding and consolidation
- Decision-making under incomplete or noisy data
- Autonomous task execution with measurable outcomes
Each module is independent, configurable, and designed to mirror real enterprise workloads, not synthetic test cases.
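To make “modular and configurable” a little more concrete, here is a rough sketch of how independent scenario modules could plug into a shared runner. Every name in it (ScenarioModule, ScenarioResult, run_suite) is illustrative only, not the framework’s actual interface.

```python
# Illustrative sketch: independent, configurable scenario modules behind a
# common interface. Names and shapes are hypothetical, not the real framework.
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class ScenarioResult:
    scenario: str
    score: float                      # 0.0-1.0, meaning defined by the module
    details: dict = field(default_factory=dict)


class ScenarioModule(Protocol):
    name: str

    def run(self, model: Callable[[str], str]) -> ScenarioResult:
        """Execute the scenario against a model callable and score the outcome."""
        ...


def run_suite(model: Callable[[str], str],
              modules: list[ScenarioModule]) -> list[ScenarioResult]:
    # Each module owns its own prompts, tools, and scoring, so modules can be
    # added, removed, or reconfigured without touching the rest of the suite.
    return [module.run(model) for module in modules]
```

The point of the shape, rather than the specific names, is that an enterprise Q&A module and a tool-calling module can evolve independently while still reporting comparable results.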
The question we cared about wasn’t “can the model answer correctly?” It was “does this move a business forward?”
The Hard Parts (And There Were Many)
Building this was not clean or easy.
One of the biggest challenges was defining success. In real business workflows, answers aren’t binary. Sometimes the “best” response is conservative. Sometimes it’s incomplete but safe. Sometimes speed matters more than depth.
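As a rough illustration of how non-binary success can be scored, a weighted rubric lets a conservative or deliberately incomplete answer still earn credit. The dimensions and weights below are hypothetical; each scenario would define its own.

```python
# Sketch of non-binary scoring: rate a response along several weighted
# dimensions instead of pass/fail. Dimensions and weights are illustrative.
from dataclasses import dataclass


@dataclass
class Rubric:
    weights: dict[str, float]

    def score(self, dimension_scores: dict[str, float]) -> float:
        # Weighted average over whichever dimensions the scenario measured.
        total_weight = sum(self.weights[d] for d in dimension_scores)
        return sum(self.weights[d] * s for d, s in dimension_scores.items()) / total_weight


support_rubric = Rubric(weights={
    "correctness": 0.4, "safety": 0.3, "completeness": 0.2, "latency": 0.1,
})

# A safe but incomplete answer can still score reasonably here, which a
# binary pass/fail check would simply mark as a failure.
print(support_rubric.score({"correctness": 0.6, "safety": 1.0,
                            "completeness": 0.5, "latency": 0.9}))  # 0.73
```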
Another issue was tool reliability. Models behave very differently when tools fail, respond slowly, or return malformed data — which happens constantly in production systems. Most benchmarks ignore this entirely. We leaned into it.
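One way to lean into tool unreliability is to inject it deliberately during a run: wrap each tool so some calls fail outright, some stall, and some return malformed data, then measure how the model recovers. The sketch below is only an illustration; the failure rates, the delay, and the lookup_customer tool are all hypothetical.

```python
# Sketch of fault injection for tool calls: simulate outages, latency, and
# malformed payloads so the benchmark exercises recovery behaviour.
import random
import time
from typing import Any, Callable


def make_flaky(tool: Callable[..., Any],
               failure_rate: float = 0.1,
               malformed_rate: float = 0.1,
               max_delay_s: float = 2.0) -> Callable[..., Any]:
    def flaky_tool(*args: Any, **kwargs: Any) -> Any:
        roll = random.random()
        if roll < failure_rate:
            raise TimeoutError("simulated tool outage")
        if roll < failure_rate + malformed_rate:
            # Malformed payload: structurally wrong, silently truncated.
            return {"error": None, "data": "<<truncated response>>"}
        time.sleep(random.uniform(0, max_delay_s))  # simulated latency
        return tool(*args, **kwargs)
    return flaky_tool


# Example: a hypothetical CRM lookup, wrapped before the agent under test sees it.
def lookup_customer(customer_id: str) -> dict:
    return {"id": customer_id, "tier": "enterprise"}


flaky_lookup = make_flaky(lookup_customer, failure_rate=0.2, malformed_rate=0.15)
```

What matters in production is not whether the tool call succeeds, but what the model does when it doesn’t.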
We also had to confront an uncomfortable truth:
Some models that look incredible on public benchmarks perform poorly when placed inside autonomous, high-stakes workflows.
That insight alone made the entire effort worth it.
Why This Matters for Enterprises
If you’re deploying AI inside a company, you don’t care if a model can pass an exam.
You care if it can:
- Reduce operational load
- Improve decision speed
- Handle edge cases safely
- Integrate cleanly into existing systems
- Scale without constant human babysitting
Our benchmark is designed to surface those qualities — and the absence of them.
This allows us (and eventually others) to choose models based on outcomes, not hype.
A First for the South African AI Landscape
As far as we know, this is the first benchmark of its kind built in South Africa, focused entirely on enterprise-grade AI performance.
That matters to us.
Too often, African companies are consumers of AI research rather than contributors. We wanted to build something globally relevant, but grounded in the realities of businesses operating here — where efficiency, reliability, and cost sensitivity are non-negotiable.
What Comes Next
We’re committing serious resources to this project.
Over the next phase, we’ll be spending hundreds of thousands running this benchmark at scale across models, configurations, and workloads. The findings won’t sit in a drawer — we’ll be sharing them with the companies that trust us as their AI partner.
Transparency matters, especially in an industry moving this fast.
We’re also planning to open-source the framework in the near future. Our goal is to let the broader community:
- Contribute new modules
- Improve evaluation methods
- Use it to make better deployment decisions
Why We’re Doing This
We didn’t build this for marketing. We built it because we needed it.
More soon.