Your AI can "see" images. So why does it keep missing the invoice total? Lessons from vision + language in production.
Common Use Cases
Invoice and receipt processing, ID verification, product catalog (image + description), medical image analysis, visual Q&A. Multi-modal models (GPT-4V, Claude 3, Gemini) promise one model for all of it—but deployment often fails without the right pipeline. The gap is usually preprocessing, fallbacks, and cost control, not model capability.
Why Multi-Modal Fails
- Image quality—too small, blurry, or rotated. Vision models struggle with low-res or skewed scans; a quick preprocessing step (resize, deskew, contrast) often doubles accuracy.
- OCR is better—for text-heavy images (invoices, forms, receipts), use Tesseract or AWS Textract first. Send the extracted text to the LLM instead of the raw image; you get better accuracy and lower cost. Use vision only for layout, validation, or when OCR fails.
- Model limits—small text, complex layouts, tables. Even the best vision models miss fine print or misread dense grids; combine OCR + LLM for tables.
- No structured output—model returns prose instead of JSON; hard to integrate. Always prompt for a schema (e.g. JSON with fields like total, date, line_items) and validate before writing to your DB.
- Cost—images consume many tokens; resolution drives cost. Resize to the minimum needed (e.g. 1024px long edge for docs) and avoid sending the same image multiple times in one request.
- No fallback—when vision fails, the whole process fails. Design for timeouts and errors: retry once, then fall back to OCR-only or human review so the pipeline doesn't block.
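The resize step from the bullets above can be sketched as pure dimension math; an imaging library such as Pillow would do the actual resampling, and the 1024px cap is just the example figure from the cost bullet, not a universal constant.

```python
def resize_for_vision(width: int, height: int, max_long_edge: int = 1024):
    """Compute target dimensions that cap the long edge (cost control).

    Returns (new_width, new_height) with aspect ratio preserved.
    Never upscales: small images pass through unchanged.
    """
    long_edge = max(width, height)
    if long_edge <= max_long_edge:
        return width, height  # already small enough; upscaling wastes tokens
    scale = max_long_edge / long_edge
    return round(width * scale), round(height * scale)
```

For a 300-DPI letter scan (3300x2550), this yields 1024x791, which is usually plenty for document layout while cutting token cost sharply versus sending the full scan.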
Real-world scenario: An invoice pipeline sent every scan directly to a vision API. Accuracy was 76% and cost was high. After adding preprocessing (deskew, contrast) and switching to Textract for text extraction plus GPT-4 for "parse this text into JSON (total, date, line_items)," accuracy reached 94% and cost per invoice dropped by about 60%. Vision was then used only for layout validation (e.g. "is this a valid invoice image?") when Textract confidence was low.
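The "parse this text into JSON" step from the scenario can be sketched as a prompt builder. The exact wording is illustrative; the schema fields match the invoice example later in the post, and the resulting string would be sent to whatever LLM client you use.

```python
def build_invoice_prompt(ocr_text: str) -> str:
    """Build an extraction prompt that pins the reply to a JSON schema.

    The model sees OCR-extracted text rather than the raw image, which
    is cheaper and usually more accurate for text-heavy documents.
    """
    schema = (
        '{"total": "number", "date": "YYYY-MM-DD", "vendor": "string", '
        '"line_items": [{"description": "string", "amount": "number"}]}'
    )
    return (
        "Parse this invoice text into JSON matching exactly this schema:\n"
        f"{schema}\n"
        "Return only the JSON object, with no prose before or after.\n\n"
        f"Invoice text:\n{ocr_text}"
    )
```

Keeping the schema inline in the prompt (rather than described loosely) is what makes the downstream validation step reliable.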
"For text-heavy docs: OCR first, then use vision for validation or extraction. Always have a fallback path when the vision call fails."

Best Practices
- Preprocess images (resize, rotate/deskew, contrast) before any model call.
- For text-heavy docs, run OCR first; use vision for validation or extraction.
- Prompt for structured output (a JSON schema) and validate before persisting.
- Handle vision failures: retry once, then fall back to OCR-only or human review, so the workflow doesn't break when a call fails or times out.
- Resize to the minimum resolution that preserves accuracy to control cost.
- Real examples: invoice processing with OCR + GPT-4V at 95% accuracy; ID verification with vision plus validation rules; medical imaging with vision plus human review.
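The retry-then-fallback practice above can be sketched as a small wrapper; `run_with_fallback` and the flaky-vision simulation are illustrative names, and in production the primary would be your vision API call and the fallback an OCR-only path or a human-review queue.

```python
def run_with_fallback(primary, fallback, retries: int = 1):
    """Call primary(); on any exception, retry, then fall back.

    Returns (result, source) where source is "primary" or "fallback",
    so downstream code can flag lower-confidence results for review.
    """
    for _ in range(retries + 1):
        try:
            return primary(), "primary"
        except Exception:
            continue
    return fallback(), "fallback"


# Usage sketch: simulate a vision call that always times out.
attempts = {"n": 0}

def flaky_vision():
    attempts["n"] += 1
    raise TimeoutError("vision call timed out")

result, source = run_with_fallback(
    flaky_vision,
    lambda: {"total": None, "needs_review": True},  # OCR-only / review stub
)
```

Returning the source alongside the result is the key design choice: it lets you persist fallback output while still routing it for extra scrutiny instead of silently trusting it.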
Pipeline and Fallback
Pipeline order:
1. Validate the image (size, format).
2. Preprocess (resize, deskew, contrast).
3. For documents, run OCR and pass the extracted text to the LLM; for non-document images, send to the vision model with a strict prompt and schema.
4. Validate the output against the schema before persisting.
5. On failure, retry once or route to human review.

Example schema for invoice extraction:
{
  "total": "number",
  "date": "YYYY-MM-DD",
  "vendor": "string",
  "line_items": [{"description": "string", "amount": "number"}]
}
Validate that the model's response conforms to the schema before writing to your database; if it doesn't, retry once or send it to human review. We'll share code for preprocessing, the OCR+LLM pipeline, and structured output in a follow-up.
What to Do Next
If you're deploying vision or document AI, start with preprocessing, OCR-first handling for text-heavy images, structured output, and a fallback path. If you'd like help designing the pipeline, schedule multi-modal implementation support; our AI Agent Development practice builds production-ready document and vision pipelines with the right mix of OCR, vision, and human review.
