MemoCheck results: eval-driven voice-memo intent extraction

MemoCheck is two things that only matter together: an agent that turns a voice-memo transcript into strict JSON intent (todos, calendar events, reminders, notes), and an evaluation suite that measures how reliably that extraction works. Scoring is deterministic and owned in-repo; DeepEval is used only as the pytest-native test runner, not as the grader. Every number below is micro-averaged across 4 providers, 30 cases, and 3 attempts. v1 is the agent of record.

Full write-up, methodology, and reproduction steps live in the project README on GitHub.

3-minute video walkthrough

A 3-minute tour of what I built and what the evaluation found. The rest of the page is the written version.

The hero result: v0 to v1, with confidence intervals

The v0 to v1 deltas shown as improvements (oriented so positive is better), each with a 95% bootstrap confidence interval. Only Type and Hallucination clear zero. Detection is the disclosed cost, sitting left of zero. Date's interval is too wide to call.

Figure: v0 → v1 improvement per metric across all 30 cases, with 95% bootstrap confidence intervals from 1000 resamples (ADR-005). Type accuracy and Hallucination rate clear zero, Detection sits left of zero, and Date straddles zero.

Leaderboard: Per-provider, v0 to v1

Each cell shows the v0 score, then v1. Higher is better for every metric except hallucination rate, where lower is better. The bottom row is the micro-average across all four providers.

Provider	Detection	Hallucination	Type	Date	Attribution
Anthropic (Haiku 4.5)	0.981→0.962	0.044→0.000	0.922→1.000	0.709→0.687	0.977→1.000
Gemini (3.1 Flash Lite Preview)	1.000→1.000	0.019→0.000	0.981→1.000	0.863→0.904	0.958→0.980
Groq (Llama 3.3 70B)	0.962→0.962	0.057→0.038	0.960→1.000	0.771→0.847	1.000→0.979
OpenAI (GPT-4.1 mini)	0.987→0.942	0.105→0.052	0.968→0.986	0.738→0.738	0.965→0.963
All (micro-avg)	0.982→0.966	0.057→0.023	0.958→0.997	0.772→0.795	0.975→0.981

OpenAI is the hallucination laggard (0.052 at v1, still the highest after halving from 0.105 at v0) and took the biggest detection hit. Gemini leads on date accuracy. The negation and schema-adherence metrics are reported separately below, because neither carries an iteration story.
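The three claims in the paragraph above can be rechecked directly from the leaderboard rows. The sketch below is only an illustration: the dictionary layout and the helper name `improvement` are my own and are not taken from the repo. The scores are copied from the table, and hallucination is sign-flipped so that positive always means better.

```python
# (v0, v1) scores copied from the leaderboard table above.
# The dict layout and helper names are illustrative, not the repo's own.
LEADERBOARD = {
    "Anthropic": {"detection": (0.981, 0.962), "hallucination": (0.044, 0.000), "date": (0.709, 0.687)},
    "Gemini":    {"detection": (1.000, 1.000), "hallucination": (0.019, 0.000), "date": (0.863, 0.904)},
    "Groq":      {"detection": (0.962, 0.962), "hallucination": (0.057, 0.038), "date": (0.771, 0.847)},
    "OpenAI":    {"detection": (0.987, 0.942), "hallucination": (0.105, 0.052), "date": (0.738, 0.738)},
}
LOWER_IS_BETTER = {"hallucination"}


def improvement(metric, v0, v1):
    """Signed v0 -> v1 change, flipped for lower-is-better metrics so positive means better."""
    delta = v1 - v0
    return round(-delta if metric in LOWER_IS_BETTER else delta, 3)


# Provider with the most negative detection change.
biggest_detection_hit = min(
    LEADERBOARD, key=lambda p: improvement("detection", *LEADERBOARD[p]["detection"])
)
# Provider with the highest hallucination rate at v1.
highest_v1_hallucination = max(LEADERBOARD, key=lambda p: LEADERBOARD[p]["hallucination"][1])
# Provider with the highest date accuracy at v1.
best_v1_date = max(LEADERBOARD, key=lambda p: LEADERBOARD[p]["date"][1])

print(biggest_detection_hit)      # OpenAI
print(highest_v1_hallucination)   # OpenAI
print(best_v1_date)               # Gemini
print(improvement("hallucination", *LEADERBOARD["OpenAI"]["hallucination"]))  # 0.053
```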

By provider and metric: v1 scores, at a glance

The v1 (agent of record) score for each provider and metric. Hallucination is excluded here because lower is better for it; see the leaderboard and the hero chart for that metric.

Figure: heatmap of v1 scores by provider across type accuracy, detection, date accuracy, attribution, and negation handling (higher is better). Color spans 0.6 to 1.0; cell labels are exact.

Where v1 moved the needle: Per-category deltas

The v0 to v1 change broken out by failure category, so the story is which failure modes actually moved, not just the aggregate.

Figure: per-category v0 to v1 deltas. Most categories have a single case and are marked anecdotal, with no confidence interval.

Read with care. Most categories contain only a single case (n=1), marked anecdotal on the chart with no confidence interval. Treat per-category bars as directional, not statistical. The aggregate result and its CIs (above) are the load-bearing numbers; this view is the qualitative complement.

The honest negative: v2 attacked date accuracy, and it did not move

v2's one job was to lift date accuracy. Across all 30 cases, date accuracy went 0.795 → 0.766, a delta of -0.029 with a 95% CI of [-0.086, +0.026] that straddles zero: no detectable move. That is the honest result. At 30 cases, the effect a single prompt edit can produce (~0.03) is smaller than the test set's case-sampling noise (~±0.06), so date is sample-size-bound. This is a finding about the benchmark's resolution, not a v2 failure, and it is why there is no v3.

Figure: date accuracy across v0, v1, and v2. It stays roughly flat and does not improve at v2; the v1 to v2 change is within sampling noise. Full reasoning is in the v2 failure analysis.
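The resolution argument can be put into a few lines of arithmetic. This is a back-of-envelope sketch, not the project's analysis code: the function names are hypothetical, and `cases_needed` assumes the CI half-width shrinks as 1/sqrt(N), which is a standard rule of thumb rather than something measured in this study.

```python
import math


def straddles_zero(ci):
    """True when a confidence interval contains zero, i.e. the sign of the delta is not resolved."""
    lo, hi = ci
    return lo <= 0.0 <= hi


def cases_needed(n_now, half_width_now, target_effect):
    """Rough N at which the CI half-width would shrink to the target effect size.

    Assumes half-width scales as 1/sqrt(N); this is an assumption of the sketch,
    not a power analysis.
    """
    return math.ceil(n_now * (half_width_now / target_effect) ** 2)


delta = round(0.766 - 0.795, 3)   # v1 -> v2 date accuracy change reported above
ci = (-0.086, 0.026)              # its reported bootstrap CI

print(delta)                         # -0.029
print(straddles_zero(ci))            # True
print(cases_needed(30, 0.06, 0.03))  # 120
```

Under that scaling assumption, roughly 120 comparable cases would be needed just for the noise half-width to match a ~0.03 effect. A properly powered design would likely ask for more.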

Already solved: Two metrics with no iteration story

Negation handling is near-perfect from v0 (0.995, then 1.000 at v1 and v2 with a zero-width CI). Current models handle "scratch that" retractions and false-positive traps well, so this is a single finding, not an iteration metric. The test set deliberately over-invested in negation expecting a failure mode that did not materialize.
Schema adherence is 100% on the first LLM attempt across every provider and version. The structured-output plus Pydantic-validate-and-retry design works. It is a validation result, not a before/after signal.
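For readers unfamiliar with the validate-and-retry pattern, here is a minimal sketch of its shape. It is not MemoCheck's code: the project uses Pydantic, while this example swaps in a hand-rolled standard-library validator so it runs without dependencies. The schema (`items`, `type`, `text`), the type labels, the `call_model` signature, and `max_attempts` are all hypothetical.

```python
import json

# Hypothetical type labels; the real schema lives in the repo.
ALLOWED_TYPES = {"todo", "calendar_event", "reminder", "note"}


class ValidationError(ValueError):
    """Raised when model output does not match the (hypothetical) schema."""


def validate_intent(raw):
    """Parse raw model output and check a minimal, illustrative schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValidationError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("items"), list):
        raise ValidationError("expected an object with an 'items' list")
    for item in data["items"]:
        if not isinstance(item, dict):
            raise ValidationError("each item must be an object")
        if item.get("type") not in ALLOWED_TYPES:
            raise ValidationError(f"unknown type: {item.get('type')!r}")
        if not isinstance(item.get("text"), str):
            raise ValidationError("each item needs a string 'text'")
    return data


def extract_with_retry(call_model, transcript, max_attempts=3):
    """Call the model, validate, and retry with the error as feedback.

    Returns (parsed_output, attempts_used). attempts_used == 1 corresponds to
    schema adherence on the first LLM attempt.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(transcript, feedback)
        try:
            return validate_intent(raw), attempt
        except ValidationError as exc:
            feedback = str(exc)
    raise ValidationError(f"no valid output after {max_attempts} attempts: {feedback}")


def fake_model(transcript, feedback):
    """Stand-in for a provider call: fails once, then returns valid JSON."""
    if feedback is None:
        return "not json"
    return json.dumps({"items": [{"type": "todo", "text": "buy milk"}]})


result, attempts = extract_with_retry(fake_model, "remind me to buy milk")
print(attempts)  # 2
```

In this framing, the 100% figure above means the retry branch was never exercised during the evaluation runs.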

How it works: Methodology

Test set. 30 hand-labeled transcripts (22 self-recorded plus 8 synthetic edge cases), split into 24 visible and 6 held-out (ADR-004). The held-out six were never opened during failure-mode analysis or v1 design.
Scoring. A deterministic in-repo scorer: an embedding plus Hungarian matcher with a judged band (ADR-002), then tiered metrics for detection, hallucination, type, date, attribution, and negation (ADR-001).
Aggregation. Every metric is micro-averaged: the sum of raw per-case numerators over the sum of denominators, so a case with one action item does not count as much as a case with five.
Confidence intervals. Each reported delta carries a 95% bootstrap CI from 1000 resamples of the per-case scores (ADR-005).
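The aggregation and confidence-interval steps can be sketched in plain Python. This is an illustration under stated assumptions, not the in-repo scorer: the function names are hypothetical, the per-case counts are made up, and the bootstrap resamples cases while keeping each case's before/after pair together (a paired percentile bootstrap, which may differ from what ADR-005 specifies).

```python
import random


def micro_average(cases):
    """Sum of per-case numerators over sum of per-case denominators."""
    num = sum(n for n, _ in cases)
    den = sum(d for _, d in cases)
    return num / den if den else float("nan")


def macro_average(cases):
    """Mean of per-case ratios, shown only for contrast with micro-averaging."""
    ratios = [n / d for n, d in cases if d]
    return sum(ratios) / len(ratios)


def bootstrap_delta_ci(before, after, resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for micro_average(after) - micro_average(before).

    Cases are resampled with replacement; each case's before/after pair stays
    together (an assumption of this sketch).
    """
    rng = random.Random(seed)
    n = len(before)
    deltas = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        b = [before[i] for i in idx]
        a = [after[i] for i in idx]
        deltas.append(micro_average(a) - micro_average(b))
    deltas.sort()
    lo_idx = int(resamples * alpha / 2)
    hi_idx = resamples - 1 - lo_idx
    return deltas[lo_idx], deltas[hi_idx]


# Made-up (numerator, denominator) pairs: correct items over labeled items per case.
v0_cases = [(1, 1), (3, 5), (2, 2), (4, 5)]
v1_cases = [(1, 1), (4, 5), (2, 2), (5, 5)]

print(micro_average(v0_cases))  # 10/13: five-item cases carry more weight
print(macro_average(v0_cases))  # (1.0 + 0.6 + 1.0 + 0.8) / 4: every case weighs the same
print(bootstrap_delta_ci(v0_cases, v1_cases))
```

The gap between the two averages on the toy data shows why the choice matters: micro-averaging lets a five-item memo move the score more than a one-item memo.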

The full methodology, including the matcher spot-check validation and the labeling conventions, is in the README.

Honest caveats: Limitations

Small N. 24 visible plus 6 held-out cases. The bootstrap CIs are wide, especially on the held-out split. A held-out CI that straddles zero is not on its own evidence of failure to generalize.
v0 is not a blind baseline. The v0 prompt and the labeling guide were co-designed, so v0 already knows the schema conventions the test set uses. The v0 to v1 delta is still meaningful, but absolute v0 scores are not "what an off-the-shelf agent would do on novel data".
English only, single speaker, single labeler. Accent robustness, multilingual extraction, noisy-background performance, and inter-annotator agreement are all out of scope.
Provider snapshot, not provider capability. Every score is conditional on the model versions used at run time. Provider behavior drifts.

The complete limitations list is in the README.

What's next: Where this study points next

A bigger, date-dense test set, sized for resolution. The flat date trajectory above is a measurement limit, not a dead end: at N=30 the benchmark cannot resolve a ~0.03 date effect against ~±0.06 sampling noise. The next step is more date-bearing cases, stratified by memo size and date type, planned for statistical power.
A detection-focused iteration. v1 cut hallucinations but dropped 10 real items, because detection and hallucination read off the same matcher pool. The next iteration would decouple them: recover the dropped true positives without re-adding the false positives, likely a two-pass extract-then-filter rather than one prompt knob.

The full future-work list, including productionization, is in the README.

Run it yourself: Reproduction

The frozen run data lives in data/db_snapshot/, so every number on this page is auditable even though the agents are not bit-for-bit reproducible (run-to-run nondeterminism is explained in the v2 failure analysis). The charts on this page are generated from data/results/*.json by scripts/build_charts.py.

Clone the repo, pip install -e ., and follow the reproduction steps in the README.