Receipt Evals | Ashter Haider

The question

How do you make an AI workflow more reliable before adding more complexity?

I started with a deliberately small receipt-review pipeline:

extract_receipt_details(image_path) reads a receipt image and returns structured data.
evaluate_receipt_for_audit(receipt_details) decides whether the expense needs review.
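
The contract between the two stages could be typed roughly as follows. This is a minimal sketch, not the project's actual schema: the model names ReceiptDetails and AuditDecision, their fields, and the 50-unit threshold are all assumptions made for illustration. The rule-based body of evaluate_receipt_for_audit is a stand-in, since the real step may be model-driven. extract_receipt_details is left out because it depends on a vision-model call.

```python
from typing import List, Optional

from pydantic import BaseModel, Field, ValidationError


class ReceiptDetails(BaseModel):
    # Hypothetical fields; the real extraction schema may differ.
    merchant: Optional[str] = None
    total: Optional[float] = None
    currency: Optional[str] = None
    items: List[str] = Field(default_factory=list)


class AuditDecision(BaseModel):
    # Hypothetical fields; the real audit schema may differ.
    needs_audit: bool
    reasons: List[str] = Field(default_factory=list)


def evaluate_receipt_for_audit(receipt_details: ReceiptDetails) -> AuditDecision:
    """Stand-in decision logic that only sees already-extracted data."""
    reasons = []
    if receipt_details.merchant is None:
        reasons.append("missing merchant")
    if receipt_details.total is None:
        reasons.append("missing total")
    elif receipt_details.total > 50:  # hypothetical review threshold
        reasons.append("total exceeds threshold")
    return AuditDecision(needs_audit=bool(reasons), reasons=reasons)


# Validation happens at the boundary: malformed extraction output is
# rejected before it ever reaches the audit step.
details = ReceiptDetails(merchant="Cafe Uno", total=72.40, currency="USD")
decision = evaluate_receipt_for_audit(details)

try:
    ReceiptDetails(merchant="Cafe Uno", total="not a number")
    rejected = False
except ValidationError:
    rejected = True
```

Because the audit function takes a validated model rather than an image, it can be tested with hand-written inputs, without running extraction at all.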

Why start small

The first goal is to understand failure modes, not to build a broad product surface. Each run saves the extraction and audit JSON separately, outputs from repeated runs are preserved rather than overwritten, and results can be compared against labeled examples with a lightweight assessment helper.
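
One way to keep every run's outputs is sketched below. It is an illustration under assumptions, not the repository's code: the function name save_run, the run_NNN directory layout, and the file names are all hypothetical.

```python
import json
from pathlib import Path


def save_run(output_dir, receipt_id: str, extraction: dict, audit: dict) -> Path:
    """Write one run's outputs into a fresh run_NNN directory.

    Earlier runs for the same receipt are never overwritten, so repeated
    outputs can be compared later.
    """
    receipt_dir = Path(output_dir) / receipt_id
    receipt_dir.mkdir(parents=True, exist_ok=True)

    # Next index = number of existing run directories + 1.
    run_index = sum(1 for p in receipt_dir.glob("run_*") if p.is_dir()) + 1
    run_dir = receipt_dir / f"run_{run_index:03d}"
    run_dir.mkdir()  # raises FileExistsError rather than clobbering a run

    # Extraction and audit results live in separate files so each stage
    # can be inspected or re-scored on its own.
    (run_dir / "extraction.json").write_text(
        json.dumps(extraction, indent=2), encoding="utf-8"
    )
    (run_dir / "audit.json").write_text(
        json.dumps(audit, indent=2), encoding="utf-8"
    )
    return run_dir
```

Calling save_run twice for the same receipt_id would produce run_001 and run_002 side by side, which makes run-to-run variation visible as a plain file diff.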

The next step, once the useful metrics are clear, is a batch eval harness.

What it demonstrates

Structured output contracts with Pydantic
Image-to-data extraction
Explicit separation between extraction and business decisions
Ground-truth comparison
An eval-driven approach to iteration
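
The ground-truth comparison can be as simple as a per-field equality check. The sketch below assumes outputs and labels are plain dictionaries; the helper names compare_to_label and field_accuracy and the example values are hypothetical and may not match the project's assessment helper.

```python
from typing import Any, Dict


def compare_to_label(predicted: Dict[str, Any], expected: Dict[str, Any]) -> Dict[str, bool]:
    """Return a match flag for every field present in the labeled example.

    Fields missing from the prediction count as mismatches.
    """
    return {field: predicted.get(field) == value for field, value in expected.items()}


def field_accuracy(matches: Dict[str, bool]) -> float:
    """Fraction of labeled fields that matched; 0.0 when nothing is labeled."""
    return sum(matches.values()) / len(matches) if matches else 0.0


# Made-up example values, for illustration only.
predicted = {"merchant": "Cafe Uno", "total": 72.4, "needs_audit": True}
expected = {"merchant": "Cafe Uno", "total": 74.2, "needs_audit": True}

matches = compare_to_label(predicted, expected)
accuracy = field_accuracy(matches)
```

Here only the total disagrees with the label, so two of the three labeled fields match. Keeping the per-field flags, rather than a single score, shows which fields fail most often across receipts.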

The code is available on GitHub.