Invoice data extraction with OCR + LLMs

Every accounts payable (AP) desk has a different invoice format, and almost none of them speak the same language. The work was building the extraction and normalization layer that turns a pile of vendor PDFs into clean, comparable records, without forcing the team to babysit it.

What shipped

An OCR + LLM extraction pipeline tuned per client portfolio, with industry-specific parsing rules.
A normalization layer that collapses vendor formats into one comparable record shape.
Discovery and PRD work with engineering to ship the pipeline into the existing AP workflow.
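
To make "one comparable record shape" concrete, here is a minimal sketch of what a normalization step could look like. It is illustrative only, not the shipped implementation: the `InvoiceRecord` fields, the `VENDOR_RULES` table, the vendor names, and the field labels are all hypothetical, and the input is assumed to be the label/value pairs an upstream OCR + LLM step has already extracted.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal


@dataclass(frozen=True)
class InvoiceRecord:
    """Hypothetical target shape that every vendor format collapses into."""
    vendor: str
    invoice_number: str
    invoice_date: date
    total: Decimal
    currency: str


# Hypothetical per-vendor rules: which extracted label feeds which field,
# how the vendor writes dates, and which currency it bills in.
VENDOR_RULES = {
    "acme": {
        "fields": {"Invoice No.": "invoice_number", "Date": "invoice_date", "Amount Due": "total"},
        "date_format": "%d/%m/%Y",
        "currency": "EUR",
    },
    "globex": {
        "fields": {"Ref": "invoice_number", "Issued": "invoice_date", "Total": "total"},
        "date_format": "%Y-%m-%d",
        "currency": "USD",
    },
}


def parse_amount(raw: str) -> Decimal:
    """Drop currency symbols and thousands separators.

    Assumes '.' is the decimal separator; a real pipeline would need
    locale-aware handling.
    """
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".-")
    return Decimal(cleaned)


def normalize(vendor: str, extracted: dict[str, str]) -> InvoiceRecord:
    """Map one vendor's extracted label/value pairs onto the shared record."""
    rules = VENDOR_RULES[vendor]
    values = {target: extracted[label].strip() for label, target in rules["fields"].items()}
    return InvoiceRecord(
        vendor=vendor,
        invoice_number=values["invoice_number"],
        invoice_date=datetime.strptime(values["invoice_date"], rules["date_format"]).date(),
        total=parse_amount(values["total"]),
        currency=rules["currency"],
    )


# Two differently formatted invoices end up directly comparable.
acme = normalize("acme", {"Invoice No.": "A-1001", "Date": "03/02/2024", "Amount Due": "€1,250.00"})
globex = normalize("globex", {"Ref": "G-77", "Issued": "2024-02-03", "Total": "$1,250.00"})
```

Keeping the vendor differences in a data table rather than in branching code is one plausible way to tune rules per client portfolio without touching the shared record type.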

Outcome

~2 minutes saved per invoice across ~70k invoices/month, or roughly 2,300 hours of manual handling per month.
Vendor-specific formatting stopped blocking the team; the data was already in shape by the time it hit review.