The question
How do you make an AI workflow more reliable before adding more complexity?
I started with a deliberately small receipt-review pipeline:
- extract_receipt_details(image_path) reads a receipt image and returns structured data.
- evaluate_receipt_for_audit(receipt_details) decides whether the expense needs review.
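To make the split concrete, here is a minimal sketch of the two stages. Only the two function names and their arguments come from the pipeline; the field names, the stubbed extraction result, the rule-based audit, and the `limit` parameter are assumptions for illustration. Plain dataclasses stand in for the Pydantic models so the sketch runs on the standard library alone.

```python
from dataclasses import dataclass, field


@dataclass
class ReceiptDetails:
    # Hypothetical fields; the real extraction schema may differ.
    merchant: str
    total: float
    currency: str
    items: list[str] = field(default_factory=list)


@dataclass
class AuditDecision:
    needs_review: bool
    reasons: list[str]


def extract_receipt_details(image_path: str) -> ReceiptDetails:
    # Stub: a real implementation would send the image to a vision model.
    # A fixed result keeps the sketch runnable without any API access.
    return ReceiptDetails(
        merchant="Example Cafe", total=82.50, currency="USD", items=["lunch"]
    )


def evaluate_receipt_for_audit(
    receipt_details: ReceiptDetails, limit: float = 50.0
) -> AuditDecision:
    # Assumed rule-based stand-in for the audit step. It sees only the
    # structured data, never the image, so it can be tested on its own.
    reasons = []
    if receipt_details.total > limit:
        reasons.append(
            f"total {receipt_details.total:.2f} exceeds limit {limit:.2f}"
        )
    if not receipt_details.items:
        reasons.append("no line items extracted")
    return AuditDecision(needs_review=bool(reasons), reasons=reasons)


details = extract_receipt_details("receipts/example.jpg")
decision = evaluate_receipt_for_audit(details)
print(decision)
```

Because the audit step takes structured data as input, it can be exercised with hand-written `ReceiptDetails` values, which keeps extraction errors and decision errors from being confused with each other.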
Why start small
The first goal is not a broad product surface. It is to understand failure modes. Each run saves extraction and audit JSON separately, repeated outputs are preserved, and labeled examples can be compared with a lightweight assessment helper.
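One way to keep extraction and audit outputs separate while preserving repeated runs is to give every run its own directory. The sketch below is an assumption about layout, not the project's actual storage code: `save_run`, the `run_001` naming, and the file names are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_run(base_dir: Path, receipt_id: str, extraction: dict, audit: dict) -> Path:
    """Write one run's outputs into a new directory; earlier runs are kept."""
    receipt_dir = base_dir / receipt_id
    receipt_dir.mkdir(parents=True, exist_ok=True)

    # Number runs sequentially based on the run directories already present.
    run_index = sum(1 for p in receipt_dir.iterdir() if p.is_dir()) + 1
    run_dir = receipt_dir / f"run_{run_index:03d}"
    run_dir.mkdir()  # raises if the directory exists, so nothing is overwritten

    # Extraction and audit results go into separate files so each stage
    # can be inspected and compared independently.
    (run_dir / "extraction.json").write_text(json.dumps(extraction, indent=2))
    (run_dir / "audit.json").write_text(json.dumps(audit, indent=2))
    return run_dir


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    first = save_run(base, "receipt_a", {"total": 82.5}, {"needs_review": True})
    second = save_run(base, "receipt_a", {"total": 28.5}, {"needs_review": False})
    print(first.name, second.name)
```

Keeping both runs side by side makes it easy to see when the same receipt produces different extractions, which is one of the failure modes worth measuring before building anything larger.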
The next step is a batch eval harness once the useful metrics are clear.
What it demonstrates
- Structured output contracts with Pydantic
- Image-to-data extraction
- Explicit separation between extraction and business decisions
- Ground-truth comparison
- An eval-driven approach to iteration
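As an illustration of the ground-truth comparison, the following sketch scores predictions against labeled examples field by field. It is a guess at what a lightweight assessment helper could look like, not the project's code: `compare_to_label`, `field_accuracy`, the field names, and the matching rules (numeric tolerance, case-insensitive strings) are all assumptions.

```python
from collections import defaultdict


def compare_to_label(predicted: dict, expected: dict, tolerance: float = 0.01) -> dict:
    """Return a match flag for each field present in the labeled example."""
    results = {}
    for key, want in expected.items():
        got = predicted.get(key)
        if isinstance(want, float) and isinstance(got, (int, float)):
            # Amounts match if they agree within a small tolerance.
            results[key] = abs(got - want) <= tolerance
        elif isinstance(want, str) and isinstance(got, str):
            # Text fields ignore case and surrounding whitespace.
            results[key] = got.strip().lower() == want.strip().lower()
        else:
            results[key] = got == want
    return results


def field_accuracy(cases: list[tuple[dict, dict]]) -> dict:
    """Aggregate per-field accuracy over (predicted, expected) pairs."""
    hits = defaultdict(int)
    counts = defaultdict(int)
    for predicted, expected in cases:
        for key, ok in compare_to_label(predicted, expected).items():
            counts[key] += 1
            hits[key] += int(ok)
    return {key: hits[key] / counts[key] for key in counts}


# Made-up examples: the second prediction misreads the total and,
# as a consequence, gets the audit flag wrong as well.
cases = [
    ({"merchant": "example cafe ", "total": 82.5, "needs_review": True},
     {"merchant": "Example Cafe", "total": 82.50, "needs_review": True}),
    ({"merchant": "Example Cafe", "total": 28.5, "needs_review": False},
     {"merchant": "Example Cafe", "total": 82.50, "needs_review": True}),
]
accuracy = field_accuracy(cases)
print(accuracy)
```

On these two made-up cases the merchant field scores 1.0 while the total and the audit flag each score 0.5. Reporting accuracy per field rather than as a single number shows which part of the pipeline to fix first.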
The code is available on GitHub.