Developers using local OCR models waste time building image preprocessing pipelines (rotation correction, quality enhancement) and switching between models to handle edge cases like MRZ (machine-readable zone) lines, angled photos, and low-quality scans.
A self-hostable or cloud OCR API that chains image preprocessing (auto-rotation, deskewing, enhancement) with the best small VLM for the task, returning structured JSON output. Automatically detects document type and routes to the right model/config.
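The routing step can be sketched as a lookup from a detected document type to a per-type config. Everything here is hypothetical: the `RouteConfig` fields, the placeholder model labels, and the `detect_doc_type` heuristic (look for full-length MRZ lines or the word "invoice" in a cheap first-pass transcription) are illustrative assumptions, not a tested classifier.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteConfig:
    model: str              # placeholder model label, not a real model ID
    prompt: str             # extraction instruction sent to the VLM
    steps: tuple[str, ...]  # preprocessing step names, applied in order


# Hypothetical routing table: one config per document type.
ROUTES: dict[str, RouteConfig] = {
    "passport": RouteConfig("small-vlm", "Return the MRZ and holder fields as JSON.",
                            ("rotate", "deskew", "enhance")),
    "id_card": RouteConfig("small-vlm", "Return the MRZ and card fields as JSON.",
                           ("rotate", "deskew", "enhance")),
    "invoice": RouteConfig("larger-vlm", "Return vendor, date, line items and total as JSON.",
                           ("rotate", "deskew")),
    "general": RouteConfig("small-vlm", "Return all visible text as JSON.", ("rotate",)),
}

# Passport (TD3) MRZ lines are 44 characters; TD1 ID-card MRZ lines are 30.
MRZ_TD3 = re.compile(r"[A-Z0-9<]{44}")
MRZ_TD1 = re.compile(r"[A-Z0-9<]{30}")


def detect_doc_type(draft_text: str) -> str:
    """Rough guess at the document type from a cheap first-pass transcription."""
    # OCR often inserts stray spaces into MRZ lines, so drop them before matching.
    lines = [line.replace(" ", "") for line in draft_text.splitlines()]
    if any(MRZ_TD3.fullmatch(line) for line in lines):
        return "passport"
    if any(MRZ_TD1.fullmatch(line) for line in lines):
        return "id_card"
    if re.search(r"\binvoice\b", draft_text, re.IGNORECASE):
        return "invoice"
    return "general"


def route(draft_text: str) -> tuple[str, RouteConfig]:
    """Return the detected type and the config the pipeline should run."""
    doc_type = detect_doc_type(draft_text)
    return doc_type, ROUTES[doc_type]
```

In the MVP, where the developer picks the endpoint, the endpoint name could select the `ROUTES` key directly and detection could be skipped; the table stays the same once auto-detection is added.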
Freemium API: free tier at 500 pages/month, paid tiers based on volume. Self-hosted license for enterprises.
The Reddit thread and pain signals are textbook developer frustration. Preprocessing is genuinely tedious: rotation correction, deskewing, and quality enhancement are all well-known time sinks. The MRZ edge case alone can burn a week. Most developers who have tried to build document processing have hit this wall. The pain is real, recurring, and currently solved by duct-taping 3-4 libraries together.
Document processing TAM is enormous ($10B+), but this targets a specific niche: developers who want something better than Tesseract but simpler than Google Document AI. The estimated serviceable market is $200M-500M, covering SMB fintech KYC, SaaS document workflows, and developer tools. The self-hosted angle opens enterprise deals. It is not a tiny market, but you are competing for a slice of it against very large incumbents.
Developers already pay for OCR APIs (Google, AWS, Azure), and companies doing KYC pay $0.05-$0.50 per verification. The key bet is that teams currently paying $0.10/page to Google would pay $0.03-0.05/page for equivalent quality with a self-host option. Enterprise self-hosted licenses ($500-2000/month) are viable for data-sensitive industries. The 109 upvotes on an OCR post suggest an engaged audience, but converting Reddit enthusiasm into paying customers is always a gap.
Highly feasible for a solo dev with an ML/CV background. The core components exist: OpenCV for preprocessing, Qwen2.5-VL or Florence-2 for extraction, and FastAPI for the API layer. The innovation is in the orchestration and document-type routing, not in building models from scratch. An MVP in 4-6 weeks is realistic. The preprocessing pipeline (deskew, rotation, enhancement) is well-understood computer vision. The main risk is making the auto-routing reliable across diverse document types.
This is the strongest dimension. No existing product combines (1) a smart preprocessing pipeline, (2) small-VLM-powered extraction, (3) automatic document-type detection and routing, (4) structured JSON output, and (5) self-hosting. The cloud giants offer little or no self-hosting. Open-source tools don't offer the orchestration layer. The 'intelligent preprocessing + VLM routing' combo is genuinely underserved, and the Reddit comments show developers manually stitching these pieces together.
Document processing is inherently ongoing: companies process documents continuously, not once. KYC verification is per customer. Invoice processing is monthly. API usage is naturally metered and recurring. Self-hosted licenses renew annually. This is one of the most naturally recurring use cases in developer tools. Once the API is integrated into a pipeline, switching costs are high.
- +Clear, validated pain point with direct user quotes from an engaged community (109 upvotes, 43 comments)
- +Strong competitive gap - no one combines preprocessing + VLM routing + self-hosting in a drop-in API
- +Excellent recurring revenue dynamics - document processing is continuous, not one-time
- +Timing is favorable - small VLMs (Qwen2.5-VL 3B/7B) have just crossed the quality threshold that makes this viable without GPU clusters
- +Self-hosted angle is a massive differentiator for regulated industries (fintech, healthcare, government) where data cannot leave premises
- +Technical moat grows with each document type and preprocessing rule added - hard for competitors to replicate the routing intelligence
- !Cloud giants (Google, AWS, Azure) could add better preprocessing and VLM-based extraction to their existing products, compressing the gap
- !Small VLM quality may not match cloud API quality for edge cases, leading to churn from developers who expected parity
- !Developer tools market is notoriously hard to monetize - many will use the free tier or self-host and never pay
- !Document type auto-detection and routing is the hardest technical challenge - if it fails on edge cases, the whole value prop collapses
- !Supporting the long tail of document types (global IDs, varied invoice formats, handwritten forms) could become an endless engineering treadmill
Google Document AI: cloud-based document processing with specialized processors for invoices, receipts, IDs, and custom documents. Handles preprocessing internally on Google's infrastructure.
Amazon Textract: Amazon's document text extraction service with table, form, and query-based extraction. Includes some built-in image correction.
Open-source OCR library with built-in preprocessing
PaddleOCR: open-source OCR toolkit from Baidu with text detection, recognition, and some preprocessing. Supports 80+ languages.
Unstructured: open-source library and hosted API for extracting and transforming unstructured data from documents, PDFs, images, and more. Focuses on RAG pipeline preprocessing.
FastAPI service with 3 endpoints: /ocr (general), /ocr/id (identity documents), and /ocr/invoice. Preprocessing pipeline: auto-rotation via OpenCV, deskewing, and contrast enhancement. Use Qwen2.5-VL-3B as the default model, with MRZ-specific handling for passports and IDs. Return structured JSON with confidence scores. Ship a Docker image for self-hosting and a simple web playground for testing. Focus the MVP on identity documents (passport, driver's license, national ID), since KYC is the highest willingness-to-pay use case. Skip automatic document-type detection in the MVP and let the developer choose the endpoint.
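MRZ-specific handling can lean on the check digits defined in ICAO Doc 9303: each field's digit is the weighted sum of its character values (weights 7, 3, 1 repeating) mod 10, where digits keep their value, A-Z map to 10-35, and the filler `<` is 0. A minimal sketch that validates line 2 of a passport (TD3) MRZ as read by the VLM; the function names and the dict-of-booleans return shape are hypothetical choices for illustration.

```python
# ICAO Doc 9303 check digits: weights 7, 3, 1 repeat across the field; sum mod 10.
WEIGHTS = (7, 3, 1)


def char_value(ch: str) -> int:
    """Map an MRZ character to its numeric value: 0-9 as-is, A-Z -> 10-35, '<' -> 0."""
    if ch in "0123456789":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field: str) -> int:
    """Compute the check digit for one MRZ field."""
    return sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def validate_td3_line2(line: str) -> dict[str, bool]:
    """Check every check digit on line 2 of a passport (TD3) MRZ."""
    if len(line) != 44:
        raise ValueError(f"expected 44 characters, got {len(line)}")

    def matches(field: str, pos: int) -> bool:
        # A '<' in a check-digit position is treated as 0, via char_value.
        return check_digit(field) == char_value(line[pos])

    return {
        "document_number": matches(line[0:9], 9),
        "birth_date": matches(line[13:19], 19),
        "expiry_date": matches(line[21:27], 27),
        "personal_number": matches(line[28:42], 42),
        # Composite: document number, birth date, and expiry/personal data, each with its digit.
        "composite": matches(line[0:10] + line[13:20] + line[21:43], 43),
    }
```

On the ICAO Doc 9303 specimen line `L898902C36UTO7408122F1204159ZE184226B<<<<<10` every check passes, while a single misread character (say `8` read as `B` in the document number) fails both the field and composite checks. A failed check could be used as a cheap trigger to re-run extraction or to lower the reported confidence score.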
Free tier (500 pages/month, community support) -> Pro ($49/month for 10K pages, priority models, webhook callbacks) -> Business ($199/month for 100K pages, custom document types, SLA) -> Enterprise (self-hosted license $999-2999/month, on-prem deployment support, custom model fine-tuning). Early revenue comes from the Pro tier, targeting indie SaaS builders doing KYC; revenue scales through Enterprise self-hosted licenses sold to fintechs and banks.
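A quick calculation of what the paid tiers imply per page, assuming a customer uses the full monthly allowance, makes them comparable with the per-page figures in the pricing discussion. Only the prices and page counts come from the tier list; the comparison reuses the $0.10/page Google figure cited earlier, and overage pricing (not specified) is ignored.

```python
# Monthly price and included pages per paid tier, as listed in the tier breakdown.
TIERS = {
    "Pro": (49.0, 10_000),
    "Business": (199.0, 100_000),
}
CLOUD_RATE = 0.10  # $/page, the Google figure used in the pricing discussion


def per_page(price: float, pages: int) -> float:
    """Effective $/page when a customer uses the tier's full allowance."""
    return price / pages


def cloud_cost(pages: int, rate: float = CLOUD_RATE) -> float:
    """What the same monthly volume would cost at a flat per-page cloud rate."""
    return pages * rate


for name, (price, pages) in TIERS.items():
    print(f"{name}: ${per_page(price, pages):.4f}/page "
          f"(vs ${cloud_cost(pages):,.0f} for {pages:,} pages at ${CLOUD_RATE:.2f}/page)")
```

At full allowance the Pro tier works out to about $0.005/page and Business to about $0.002/page, roughly an order of magnitude below the $0.03-0.05/page range discussed earlier, so the tier prices and that per-page target are worth reconciling.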
4-6 weeks to MVP, 8-10 weeks to first paying customer. The path: in Weeks 1-2, build the preprocessing pipeline and API skeleton; in Weeks 3-4, integrate the VLM and build document-type handlers; in Weeks 5-6, add billing/auth and deploy. First revenue will likely come from a Hacker News/Reddit launch targeting the same community where the pain signals originated. Identity document processing for KYC is the fastest path to revenue, since those buyers have budget and urgency.
- “needed some image pre-processing to rotate images correctly for good results”
- “MRZ at the bottom of Passport or ID documents throws it in a loop”
- “from clear scans to potato phone pics”
- “Paddle but that's not a simple model like qwen”