MemoCheck

An eval-driven study of how reliably LLM agents extract structured intent from real-world voice memo transcripts.

Headline: eval-driven iteration took the agent from v0 to v1 by cutting the hallucination rate by about 60% (0.057 → 0.023) and pushing type-classification accuracy to near-perfect (0.958 → 0.997). Neither gain is noise: the 95% bootstrap confidence interval (1000 resamples) on each v0 → v1 change excludes zero. The same single prompt change also cost a small drop in detection (0.982 → 0.966), which is reported here for transparency.

MemoCheck is two things that only matter together: an agent that turns a voice-memo transcript into strict JSON intent (todos, calendar events, reminders, notes), and an evaluation suite that measures how reliably that extraction works. Scoring is deterministic and owned in-repo; DeepEval is used only as the pytest-native test runner, not as the grader. Unless broken out per provider, every number below is micro-averaged across 4 providers, 30 cases, and 3 attempts. v1 is the agent of record.

Full write-up, methodology, and reproduction steps live in the project README on GitHub.

3-minute walkthrough: Video walkthrough

A 3-minute tour of what I built and what the evaluation found. The rest of the page is the written version.

The hero result: v0 to v1, with confidence intervals

The same deltas shown as improvements (oriented so positive is better), each with a 95% bootstrap confidence interval. Only Type and Hallucination clear zero. Detection is the disclosed cost sitting left of zero. Date's interval is too wide to call.

[Figure: v0 → v1 improvement per metric, all 30 cases, with 95% bootstrap confidence intervals (1000 resamples, ADR-005). Type accuracy and Hallucination rate clear zero, Detection sits left of zero, Date straddles zero.]
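The intervals above are described as 1000-resample bootstrap CIs. The following is a minimal sketch of one way to compute such an interval on a v0 → v1 delta, assuming a paired percentile bootstrap that resamples cases with replacement. The actual procedure is specified in ADR-005 and may differ; the function name and the per-case scores here are hypothetical.

```python
import random


def bootstrap_delta_ci(v0_scores, v1_scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(v1) - mean(v0), resampling cases in pairs."""
    if len(v0_scores) != len(v1_scores):
        raise ValueError("v0 and v1 must be scored on the same cases")
    rng = random.Random(seed)
    n = len(v0_scores)
    deltas = []
    for _ in range(n_resamples):
        # Draw n case indices with replacement; keep each case's v0/v1 pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(v1_scores[i] - v0_scores[i] for i in idx) / n)
    deltas.sort()
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return deltas[lo_idx], deltas[hi_idx]


# Hypothetical per-case scores for 30 cases (not the study's data).
v0 = [0.90] * 30
v1 = [1.00] * 20 + [0.95] * 10

lo, hi = bootstrap_delta_ci(v0, v1)
excludes_zero = lo > 0 or hi < 0  # the sense in which an interval "clears zero"
print(f"95% CI on delta: [{lo:+.3f}, {hi:+.3f}], excludes zero: {excludes_zero}")
```

Resampling whole cases, rather than individual attempts, treats the 30 cases as the unit of sampling, which matches the page's framing of noise as case-sampling noise.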

Leaderboard: Per-provider, v0 to v1

Each cell shows the v0 score, then v1. Higher is better for every metric except hallucination rate, where lower is better. The bottom row is the micro-average across all four providers.

Provider                          Detection       Hallucination   Type            Date            Attribution
Anthropic (Haiku 4.5)             0.981 → 0.962   0.044 → 0.000   0.922 → 1.000   0.709 → 0.687   0.977 → 1.000
Gemini (3.1 Flash Lite Preview)   1.000 → 1.000   0.019 → 0.000   0.981 → 1.000   0.863 → 0.904   0.958 → 0.980
Groq (Llama 3.3 70B)              0.962 → 0.962   0.057 → 0.038   0.960 → 1.000   0.771 → 0.847   1.000 → 0.979
OpenAI (GPT-4.1 mini)             0.987 → 0.942   0.105 → 0.052   0.968 → 0.986   0.738 → 0.738   0.965 → 0.963
All (micro-avg)                   0.982 → 0.966   0.057 → 0.023   0.958 → 0.997   0.772 → 0.795   0.975 → 0.981

OpenAI is the hallucination laggard (0.052 at v1, still the highest after halving from 0.105 at v0) and took the biggest detection hit (0.987 → 0.942). Gemini leads on date accuracy (0.904 at v1). The negation and schema-adherence metrics are reported separately below, because neither carries an iteration story.
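The bottom row of the leaderboard is a micro-average, which is not the same as averaging the four provider scores. Here is a small sketch of the difference, assuming each metric reduces to a (correct, total) count per provider. The counts are made up for illustration and are not the study's data.

```python
def micro_average(counts):
    """Pool all items first, then divide: providers with more items weigh more."""
    correct = sum(c for c, _ in counts)
    total = sum(t for _, t in counts)
    return correct / total


def macro_average(counts):
    """Score each provider first, then take the unweighted mean of the scores."""
    return sum(c / t for c, t in counts) / len(counts)


# Hypothetical (correct, total) pairs for four providers.
per_provider = [(45, 50), (10, 10), (18, 20), (54, 60)]

micro = micro_average(per_provider)  # 127 / 140
macro = macro_average(per_provider)  # (0.9 + 1.0 + 0.9 + 0.9) / 4
print(f"micro={micro:.3f} macro={macro:.3f}")
```

The two agree only when every provider contributes the same number of scored items. When a provider extracts fewer or more items, the pooled figure shifts toward the providers with larger denominators.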

By provider and metric: v1 scores, at a glance

The v1 (agent of record) score for each provider and metric. Hallucination is excluded here because lower is better for it; see the leaderboard and the hero chart for that metric.

[Figure: heatmap of v1 scores by provider across type accuracy, detection, date accuracy, attribution, and negation handling (higher is better). Color spans 0.6 to 1.0; cell labels are exact.]

Where v1 moved the needle: Per-category deltas

The v0 to v1 change broken out by failure category, so the story is which failure modes actually moved, not just the aggregate.

[Figure: per-category v0 to v1 deltas; most categories have a single case and are marked anecdotal with no confidence interval.]
Read with care. Most categories contain only a single case (n=1), marked anecdotal on the chart with no confidence interval. Treat per-category bars as directional, not statistical. The aggregate result and its CIs (above) are the load-bearing numbers; this view is the qualitative complement.

The honest negative: v2 attacked date accuracy, and it did not move

v2's one job was to lift date accuracy. Across all 30 cases, date accuracy went 0.795 → 0.766 from v1 to v2, with a CI on the change of [-0.086, +0.026]. The interval contains zero, so there is no detectable move. That is the honest result. At 30 cases, the effect a single prompt edit can produce (~0.03) is smaller than the test set's case-sampling noise (~±0.06), so date is sample-size-bound. This is a finding about the benchmark's resolution, not a v2 failure, and it is why there is no v3.

[Figure: date accuracy across v0, v1, and v2; it stays roughly flat and does not improve at v2. The v1 to v2 change is within sampling noise. Full reasoning in the v2 failure analysis.]
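The resolution argument can be checked with a few lines of arithmetic. This sketch uses only the numbers quoted in this section; treating half the interval width as a rough measure of case-sampling noise is a simplifying assumption.

```python
# Values quoted above: all-30 date accuracy at v1 and v2, and the CI on the change.
v1_date, v2_date = 0.795, 0.766
ci_lo, ci_hi = -0.086, 0.026

delta = v2_date - v1_date         # observed v1 -> v2 change
half_width = (ci_hi - ci_lo) / 2  # rough size of the sampling noise
contains_zero = ci_lo <= 0.0 <= ci_hi

print(f"delta={delta:+.3f} half_width={half_width:.3f} contains_zero={contains_zero}")

# The observed change is smaller than the noise band, so the benchmark
# cannot resolve an effect of this size at 30 cases.
effect_below_resolution = abs(delta) < half_width
```

An observed change of about 0.03 against a half-width of about 0.056 is the sense in which date accuracy is sample-size-bound.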

Already solved: Two metrics with no iteration story

  • Negation handling is near-perfect from v0 (0.995, then 1.000 at v1 and v2 with a zero-width CI). Current models handle "scratch that" retractions and false-positive traps well, so this is a single finding, not an iteration metric. The test set deliberately over-invested in negation expecting a failure mode that did not materialize.
  • Schema adherence is 100% on the first LLM attempt across every provider and version. The structured-output plus Pydantic-validate-and-retry design works. It is a validation result, not a before/after signal.
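The validate-and-retry pattern named in the second bullet can be sketched as follows. This is a standard-library stand-in, not the project's code: the project uses Pydantic, while this sketch swaps in hand-written checks so it runs without dependencies. The item fields, type labels, retry limit, and function names are all hypothetical.

```python
import json

# Hypothetical labels for the four intent kinds named on this page.
ALLOWED_TYPES = {"todo", "calendar_event", "reminder", "note"}


def validate_items(raw):
    """Parse model output and check it strictly; raise ValueError on any problem."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, list):
        raise ValueError("expected a JSON array of items")
    for item in data:
        if not isinstance(item, dict) or set(item) != {"type", "text"}:
            raise ValueError("each item must be an object with exactly 'type' and 'text'")
        if not isinstance(item["type"], str) or item["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unknown type: {item['type']!r}")
        if not isinstance(item["text"], str):
            raise ValueError("'text' must be a string")
    return data


def extract_with_retry(call_model, transcript, max_attempts=2):
    """Call the model until its output validates; return (items, attempts_used)."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        # The previous validation error is passed back so the model can correct itself.
        raw = call_model(transcript, last_error)
        try:
            return validate_items(raw), attempt
        except ValueError as exc:
            last_error = str(exc)
    raise RuntimeError(f"no valid output after {max_attempts} attempts: {last_error}")


def fake_model(transcript, last_error):
    """Stand-in for an LLM call: fails validation once, then returns valid JSON."""
    if last_error is None:
        return "Sure! Here is the JSON you asked for."
    return json.dumps([{"type": "todo", "text": "buy milk"}])


items, attempts = extract_with_retry(fake_model, "uh, remind me to buy milk")
print(items, attempts)
```

Under this design, "100% on the first LLM attempt" would correspond to `attempts` being 1 on every call, so the retry path exists as a safety net rather than something the results depend on.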

How it works: Methodology

The full methodology, including the matcher spot-check validation and the labeling conventions, is in the README.

Honest caveats: Limitations

The complete limitations list is in the README.

What's next: Where this study points next

Full future-work list, including productionization, in the README.

Run it yourself: Reproduction

The frozen run data lives in data/db_snapshot/, so every number on this page is auditable even though the agents are not bit-for-bit reproducible (run-to-run nondeterminism is explained in the v2 failure analysis). The charts on this page are generated from data/results/*.json by scripts/build_charts.py.

Clone the repo, install it in editable mode (pip install -e .), and follow the reproduction steps in the README.