mmc

Invoice data extraction with OCR + LLMs

Cut invoice processing time by ~2 minutes per invoice across ~70k invoices/month using OCR and LLM-driven extraction, normalization, and homogenization.

role: Product Manager
year: 2024
Product discovery · PRD · OCR · LLMs · Data normalization · Automation

Every accounts payable (AP) desk has a different invoice format, and almost none of them speak the same language. The work was building the extraction and normalization layer that turns a pile of vendor PDFs into clean, comparable records without forcing the team to babysit it.

What shipped

  • An OCR + LLM extraction pipeline tuned per client portfolio, with industry-specific parsing rules.
  • A normalization and homogenization layer so vendor formats collapse into one comparable record shape.
  • Discovery and PRD work with engineering to ship the pipeline into the existing AP workflow.
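
To make the normalization idea concrete, here is a minimal sketch of how two vendor layouts could collapse into one record shape. It is an illustration under stated assumptions, not the shipped pipeline: the field aliases, the date formats, the InvoiceRecord shape, and the function names are all hypothetical, and the input dicts stand in for whatever the OCR + LLM step returns.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal

# Hypothetical aliases: vendor-specific labels mapped to one canonical field.
FIELD_ALIASES = {
    "invoice_number": ("invoice no", "invoice #", "factura"),
    "invoice_date": ("invoice date", "date", "fecha"),
    "total": ("total", "amount due", "importe total"),
}

# Hypothetical set of date layouts seen across vendors.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


@dataclass(frozen=True)
class InvoiceRecord:
    """The single comparable shape every vendor format collapses into."""
    invoice_number: str
    invoice_date: date
    total: Decimal


def parse_date(raw: str) -> date:
    """Try each known layout in turn; fail loudly if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")


def parse_amount(raw: str) -> Decimal:
    """Drop currency symbols and accept both 1,234.56 and 1.234,56.

    Heuristic: whichever separator appears last is the decimal mark, so
    this assumes amounts always carry a decimal part.
    """
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ",.-")
    if cleaned.rfind(",") > cleaned.rfind("."):
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return Decimal(cleaned)


def normalize(extracted: dict[str, str]) -> InvoiceRecord:
    """Map vendor-specific keys to canonical fields and parse the values."""
    lowered = {key.strip().lower(): value for key, value in extracted.items()}
    canonical = {}
    for field, aliases in FIELD_ALIASES.items():
        for alias in aliases:
            if alias in lowered:
                canonical[field] = lowered[alias]
                break
        else:
            raise KeyError(f"missing field: {field}")
    return InvoiceRecord(
        invoice_number=canonical["invoice_number"].strip(),
        invoice_date=parse_date(canonical["invoice_date"]),
        total=parse_amount(canonical["total"]),
    )


# Two made-up vendor layouts describing the same invoice.
vendor_a = {"Invoice No": "A-1001", "Date": "2024-03-05", "Amount Due": "$1,234.50"}
vendor_b = {"Factura": "A-1001", "Fecha": "05/03/2024", "Importe Total": "1.234,50 €"}

record_a = normalize(vendor_a)
record_b = normalize(vendor_b)
print(record_a == record_b)
```

Both sample inputs resolve to the same InvoiceRecord, which is the property that lets review compare invoices without caring which vendor sent them. A real portfolio would likely need per-client alias tables and a fallback path for fields the extractor could not find.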

Outcome

  • ~2 minutes saved per invoice across ~70k invoices/month.
  • Vendor-specific formatting stopped blocking the team; the data was already in shape by the time it hit review.
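
For scale, the two headline figures can be combined into a monthly total. This is a back-of-the-envelope sketch that takes both approximate numbers at face value; the constant and function names are illustrative only.

```python
MINUTES_SAVED_PER_INVOICE = 2   # approximate figure quoted above
INVOICES_PER_MONTH = 70_000     # approximate figure quoted above


def monthly_hours_saved(minutes_per_invoice: float, invoices: int) -> float:
    """Convert a per-invoice saving in minutes into hours per month."""
    return minutes_per_invoice * invoices / 60


hours = monthly_hours_saved(MINUTES_SAVED_PER_INVOICE, INVOICES_PER_MONTH)
print(f"~{hours:,.0f} hours/month")
```

Since both inputs are rounded estimates, the result is an order-of-magnitude figure (roughly 2,300 hours a month), not a measured one.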