PDF parsing pipeline that extracts structured data from invoices, contracts, and compliance documents.
PDFs are converted to images and sent to Claude Vision for extraction. A Zod schema defines the expected output structure per document type (invoice: vendor, date, line items, total; contract: parties, terms, dates, clauses). Claude returns the extracted data as JSON, which is then validated against the schema. Failed validations trigger a re-extraction with more specific prompting.
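The validate-then-retry loop described above might look roughly like the following sketch. It assumes Zod's `safeParse` API. The `InvoiceSchema` field names, the `extractWithRetry` helper, and the `Extractor` callback (a stand-in for the actual Claude Vision call, which is not shown) are hypothetical names chosen for illustration, not the project's real code.

```typescript
import { z } from "zod";

// Hypothetical invoice schema; field names are assumptions based on the
// fields listed above (vendor, date, line items, total).
const InvoiceSchema = z.object({
  vendor: z.string().min(1),
  date: z.string().min(1),
  lineItems: z.array(
    z.object({
      description: z.string(),
      amount: z.number(),
    })
  ),
  total: z.number(),
});

type Invoice = z.infer<typeof InvoiceSchema>;

// Stand-in for the vision-model call: takes page images plus a prompt and
// resolves to whatever JSON the model returned (not yet validated).
type Extractor = (pageImages: string[], prompt: string) => Promise<unknown>;

async function extractWithRetry(
  extract: Extractor,
  pageImages: string[],
  maxAttempts = 2
): Promise<Invoice> {
  let prompt = "Extract vendor, date, line items, and total as JSON.";
  let lastProblems: string[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await extract(pageImages, prompt);
    const result = InvoiceSchema.safeParse(raw);
    if (result.success) return result.data;

    // Turn each validation issue into a targeted hint for the next attempt.
    lastProblems = result.error.issues.map(
      (issue) => `${issue.path.join(".")}: ${issue.message}`
    );
    prompt +=
      `\nThe previous answer failed validation: ${lastProblems.join("; ")}.` +
      " Return corrected JSON.";
  }

  throw new Error(`Extraction failed validation: ${lastProblems.join("; ")}`);
}
```

Feeding the validation issues back into the prompt is one way to make the retry "more specific"; how the real pipeline phrases its follow-up prompt is not stated here.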
Accuracy has been tested across 12 document types: invoices (97%), contracts (89%), compliance reports (91%), receipts (95%), insurance forms (87%), employment contracts (90%), NDAs (93%), purchase orders (96%), bank statements (94%), tax returns (88%), medical records (82%), and lease agreements (85%). The main failure mode is handwritten annotations: Claude Vision struggles with poor handwriting.
The pipeline is being developed for the AI Compliance Engine (Irvo) and as a standalone integration. Client use case: a property management company processing 500 lease agreements annually. Each document currently takes 2 hours; the target is 5 minutes with human review.
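As a back-of-the-envelope check on those figures, the sketch below works out the yearly hours before and after. It uses only the numbers quoted above; the variable names are illustrative.

```typescript
// Figures quoted above for the lease-agreement use case.
const documentsPerYear = 500;
const currentMinutesPerDoc = 2 * 60; // 2 hours per document today
const targetMinutesPerDoc = 5;       // target, with human review

const currentHoursPerYear = (documentsPerYear * currentMinutesPerDoc) / 60;
const targetHoursPerYear = (documentsPerYear * targetMinutesPerDoc) / 60;
const speedup = currentMinutesPerDoc / targetMinutesPerDoc;

console.log(currentHoursPerYear);           // 1000
console.log(targetHoursPerYear.toFixed(1)); // 41.7
console.log(speedup);                       // 24
```

If the target is met, roughly 1,000 hours of work per year would drop to about 42, a 24-fold reduction per document.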
Processing 200 invoices per week from different vendors, each with a different format. The pipeline extracts vendor, date, line items, VAT, and total, and feeds them directly into Xero or QuickBooks.
Reviewing contracts for key clauses (termination, liability, IP ownership). The pipeline highlights relevant sections and flags missing standard clauses, saving 30 minutes per contract review.
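Flagging missing clauses reduces to a set difference once the clauses present in a contract have been extracted. A minimal sketch, assuming the extraction step yields a list of labelled clauses; `REQUIRED_CLAUSES`, `ExtractedClause`, and `findMissingClauses` are hypothetical names, and the required list simply mirrors the three clauses named above.

```typescript
// The three key clauses named above; a real checklist would likely be longer.
const REQUIRED_CLAUSES = ["termination", "liability", "ip_ownership"] as const;
type ClauseType = (typeof REQUIRED_CLAUSES)[number];

// Hypothetical shape of one extracted clause: its label and where it was found.
interface ExtractedClause {
  type: string;
  page: number;
}

// Returns the required clause types that do not appear in the extraction.
function findMissingClauses(found: ExtractedClause[]): ClauseType[] {
  const present = new Set(found.map((clause) => clause.type));
  return REQUIRED_CLAUSES.filter((required) => !present.has(required));
}

const missing = findMissingClauses([
  { type: "termination", page: 4 },
  { type: "liability", page: 7 },
  { type: "confidentiality", page: 2 },
]);
console.log(missing); // [ 'ip_ownership' ]
```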
Extracting policy details from renewal documents. Coverage amounts, exclusions, and premium changes are pulled into a comparison spreadsheet automatically.
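One way to picture the comparison step is to diff two extracted policy records and emit one spreadsheet row. This is a sketch under stated assumptions: `PolicyDetails`, its fields, and `comparePolicies` are hypothetical, and the sample figures are made up purely to exercise the function.

```typescript
// Hypothetical shape of the details extracted from one policy document.
interface PolicyDetails {
  coverageAmount: number;
  annualPremium: number;
  exclusions: string[];
}

// One row of the comparison spreadsheet (column names are assumptions).
interface ComparisonRow {
  coverageChange: number;
  premiumChange: number;
  premiumChangePct: number;
  exclusionsAdded: string[];
  exclusionsRemoved: string[];
}

function comparePolicies(previous: PolicyDetails, renewal: PolicyDetails): ComparisonRow {
  const before = new Set(previous.exclusions);
  const after = new Set(renewal.exclusions);
  const premiumChange = renewal.annualPremium - previous.annualPremium;
  return {
    coverageChange: renewal.coverageAmount - previous.coverageAmount,
    premiumChange,
    premiumChangePct: (premiumChange * 100) / previous.annualPremium,
    exclusionsAdded: renewal.exclusions.filter((e) => !before.has(e)),
    exclusionsRemoved: previous.exclusions.filter((e) => !after.has(e)),
  };
}

// Illustrative input only; not real policy data.
const row = comparePolicies(
  { coverageAmount: 1000000, annualPremium: 2000, exclusions: ["flood"] },
  { coverageAmount: 1000000, annualPremium: 2200, exclusions: ["flood", "subsidence"] }
);
console.log(row.premiumChangePct); // 10
console.log(row.exclusionsAdded);  // [ 'subsidence' ]
```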