Overall Score: 7.8 · High · GO

SchemaLens

AI-powered tool that reverse-engineers undocumented legacy databases by inferring schemas, relationships, and generating living documentation.

DevTools · Data · For data engineers, database administrators, and consultants who work with legacy databases
The Gap

Data engineers inherit large legacy databases with no documentation, no formalized schema, and no information_schema — forcing weeks of manual code reading and guesswork to understand data flows and relationships.

Solution

Connects to any database, samples data, analyzes column names/types/values/foreign key patterns, and uses AI to infer entity relationships, generate ER diagrams, and produce queryable documentation. Supports Oracle, SQL Server, MySQL, Postgres, and other legacy systems.

Revenue Model

Freemium — free for small databases (<50 tables), paid tiers for enterprise-scale databases, team collaboration, and ongoing schema drift monitoring. $49/mo individual, $299/mo team.

Feasibility Scores
Pain Intensity: 9/10

This is a genuine, visceral pain. The Reddit thread confirms it — someone literally spent 3 weeks reading code to understand a database. This isn't a 'nice to have' — it blocks entire migration projects worth millions. When a consultant bills $200/hr and spends 80 hours manually reverse-engineering a schema, that's $16K of pure waste per engagement. The pain is acute, time-bound, and has real dollar cost.

Market Size: 7/10

TAM is meaningful but niche. There are ~500K data engineers globally and millions of DBAs. Legacy database modernization is a $20B+ market. However, the specific tool market (reverse-engineering documentation) is a subset. Estimated serviceable market: $200M-$500M if you capture consultants, enterprises, and data teams. Not a billion-dollar standalone market, but strong enough for a very profitable SaaS.

Willingness to Pay: 8/10

Strong signals. Enterprises already pay $50K+ for Alation/Collibra. Consultants bill $150-300/hr and would happily pay $49/mo to save days of work — the ROI is absurd (save 40 hours = $6K-12K vs $49 cost). The Reddit thread shows someone literally built their own tool to solve this — that's the ultimate willingness-to-pay signal. Enterprise procurement for migration projects has budget. $299/mo for teams is well within 'expense it on a credit card' range.

Technical Feasibility: 7/10

Core MVP is buildable by a strong solo dev in 6-8 weeks: connect to DB via JDBC/ODBC, query system catalogs + sample data, analyze column name patterns (user_id → users.id), check referential integrity in actual data, feed to LLM for relationship inference, generate ER diagrams. The hard parts: (1) handling databases that truly lack information_schema (old Oracle, AS/400, etc.) requires specialized connectors; (2) AI inference accuracy needs to be high enough to be trusted — hallucinated relationships are worse than none; (3) scale testing on 1000+ table databases. Doable but not trivial.
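The naming-pattern plus referential-integrity step described above can be sketched roughly as follows. This is a minimal illustration, not the product's implementation: it uses an in-memory SQLite database as a stand-in for a legacy system, a naive `*_id` → pluralized-table heuristic, and a hypothetical `infer_fk_candidates` helper.

```python
import re
import sqlite3

def infer_fk_candidates(conn):
    """Heuristically infer foreign keys: match *_id column names to table
    names, then verify referential integrity against the actual data."""
    cur = conn.cursor()
    tables = [r[0] for r in cur.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    candidates = []
    for table in tables:
        cols = [r[1] for r in cur.execute(f"PRAGMA table_info({table})")]
        for col in cols:
            m = re.fullmatch(r"(\w+)_id", col)
            if not m:
                continue
            parent = m.group(1) + "s"  # naive pluralization: user_id -> users
            if parent not in tables or parent == table:
                continue
            # Data-level check: any orphaned value disqualifies the candidate.
            orphans = cur.execute(
                f"SELECT COUNT(*) FROM {table} t "
                f"LEFT JOIN {parent} p ON t.{col} = p.id "
                f"WHERE t.{col} IS NOT NULL AND p.id IS NULL").fetchone()[0]
            if orphans == 0:
                candidates.append((table, col, parent, "id"))
    return candidates

# Demo on a toy database with NO declared foreign keys or constraints.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users  VALUES (1, 'ada'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1, 9.99), (11, 2, 4.50);
""")
print(infer_fk_candidates(conn))  # [('orders', 'user_id', 'users', 'id')]
```

In a real build the containment check would run against sampled rows rather than full tables, and low-confidence candidates would go to the LLM stage for adjudication rather than being dropped outright.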

Competition Gap: 9/10

This is the strongest signal. Every existing tool falls into one of two buckets: (1) free/cheap tools that only read declared metadata — useless for the core problem of undocumented databases; (2) enterprise data catalogs that cost $50K+ and take months to deploy. There is NO mid-market, AI-powered tool focused specifically on the acute problem of 'understand this legacy database fast.' The gap is enormous and well-defined.

Recurring Potential: 6/10

Mixed. The acute use case (reverse-engineer a database during migration) is project-based, not recurring — you need it intensely for 2-4 weeks, then you're done. Schema drift monitoring adds recurring value but is a weaker pain point. Team collaboration and living documentation improve stickiness. The best recurring path is per-database pricing for consultants who do this repeatedly across clients. Enterprise contracts for ongoing documentation maintenance are possible but require more product depth.

Strengths
  • +Massive, validated gap — no AI-powered tool exists in the mid-market for this specific acute pain
  • +Clear, quantifiable ROI — saves weeks of manual work that costs thousands in labor
  • +Strong pain signals from real practitioners (Reddit thread, someone built their own tool)
  • +Natural enterprise upsell path — starts with individual data engineers, expands to team/org
  • +AI timing is perfect — LLMs are now good enough to do credible schema inference that wasn't possible 3 years ago
  • +Defensible moat potential — training on patterns from thousands of legacy databases creates compounding data advantage
Risks
  • !Recurring revenue challenge — core use case is project-based (migration), not ongoing. Must find sticky features (drift monitoring, living docs) or target consultants who do this repeatedly
  • !Accuracy trust gap — if AI infers wrong relationships, users lose trust fast. False positives in schema inference could lead to bad migration decisions. Need high precision over recall
  • !Legacy database connectivity is a long tail of pain — each old system (AS/400, Informix, ancient Oracle versions) has its own quirks. Supporting the truly legacy databases that need this most is hard
  • !Enterprise sales cycle — the teams with the biggest pain are inside large enterprises with procurement processes, security reviews, and data access restrictions. Getting DB credentials from a Fortune 500 is non-trivial
  • !Open-source risk — SchemaSpy or SchemaCrawler could add AI features, or someone could build an open-source alternative quickly
Competition
Dataedo

Database documentation tool that connects to databases, imports metadata, lets teams add descriptions, and generates documentation with ER diagrams. Supports reverse-engineering schemas from 20+ database types.

Pricing: Starts at ~$299/year per user (Professional tier)
Gap: No AI-powered inference of undocumented relationships. Relies heavily on existing metadata and foreign keys — if they don't exist in the DB, you're still doing manual work. No automatic data sampling or pattern-based relationship detection. Documentation is static, not queryable.
SchemaSpy

Open-source tool that analyzes database metadata and generates interactive HTML documentation with ER diagrams. Reads information_schema and foreign keys to map relationships.

Pricing: Free and open-source (GPL)
Gap: Completely dependent on existing metadata — if there are no declared foreign keys or information_schema, it produces almost nothing. Zero AI or heuristic inference. No data sampling. No support for truly undocumented databases. Output is static HTML, not living documentation. No collaboration features.
SchemaCrawler

Open-source database schema discovery and comprehension tool. Provides detailed schema metadata, generates ER diagrams, and supports scripting/automation for schema analysis.

Pricing: Free and open-source
Gap: Same core limitation — reads only declared metadata. No AI inference, no data-driven relationship discovery, no pattern matching on column names or values. Command-line only, no web UI. No collaboration. Requires technical setup. Useless on databases without proper constraints.
DbVisualizer / DBeaver (with ER features)

General-purpose database IDE tools that include ER diagram generation and schema browsing as part of their feature set. DBeaver is open-source with a Pro tier; DbVisualizer is commercial.

Pricing: DBeaver Community: Free. DBeaver Pro: $249/year. DbVisualizer: $197-$297/year.
Gap: ER diagrams only show declared foreign keys — no inference whatsoever. These are IDE tools, not documentation tools. No AI analysis, no data sampling, no relationship inference from naming patterns or data values. Documentation generation is minimal or absent. No schema drift monitoring.
Alation / Atlan / data catalog tools

Enterprise data catalog platforms that crawl databases, infer lineage, and provide searchable metadata with collaborative documentation. Increasingly adding AI features.

Pricing: Enterprise pricing only — typically $50K-$500K+/year. Not accessible to individuals or small teams.
Gap: Massively overpriced and over-scoped for the specific problem of reverse-engineering a single legacy database. Months-long implementation. Designed for ongoing data governance, not the acute pain of 'I just inherited this Oracle DB and need to understand it in 2 weeks.' No focused legacy database reverse-engineering workflow.
MVP Suggestion

CLI + web UI tool that connects to PostgreSQL, MySQL, SQL Server, and Oracle. Queries system catalogs AND samples actual data (first 1000 rows per table). Uses column name pattern matching (regex-based: *_id, *_code, *_key) plus data value intersection analysis to infer foreign key relationships. Feeds metadata + samples to GPT-4/Claude to generate natural language table/column descriptions and relationship confidence scores. Outputs interactive ER diagram (use Mermaid.js or D3) and a searchable HTML documentation site. Free for <50 tables, require sign-up for larger databases. Ship in 6 weeks.
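The ER-diagram output step could be as simple as emitting Mermaid's `erDiagram` text format from the inferred relationships. A small sketch, assuming relationships arrive as `(child_table, child_col, parent_table, parent_col)` tuples from the inference step; the `to_mermaid` name is illustrative:

```python
def to_mermaid(relationships):
    """Render inferred FK links as a Mermaid erDiagram definition.
    Each relationship is (child_table, child_col, parent_table, parent_col)."""
    lines = ["erDiagram"]
    for child, child_col, parent, parent_col in relationships:
        # ||--o{ reads: one parent row relates to zero-or-many child rows
        lines.append(f'    {parent} ||--o{{ {child} : "{child_col} -> {parent_col}"')
    return "\n".join(lines)

diagram = to_mermaid([("orders", "user_id", "users", "id")])
print(diagram)
```

The resulting text renders directly in any Mermaid-capable viewer (GitHub, Confluence, the tool's own web UI), which keeps the documentation site a thin static artifact.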

Monetization Path

Free tier (<50 tables, single database) drives adoption with individual data engineers → $49/mo Individual (unlimited tables, multiple databases, export to PDF/Confluence, AI-generated documentation) → $299/mo Team (shared workspace, collaborative annotations, schema diff/drift alerts, SSO) → Enterprise ($1K+/mo, on-prem deployment, audit logs, API access, custom integrations). Secondary revenue: consulting marketplace connecting SchemaLens power users with companies needing legacy DB expertise.

Time to Revenue

8-12 weeks. Weeks 1-6: build MVP with 4 database connectors, AI inference, and basic web UI. Weeks 6-8: beta with 20-30 users from Reddit/HN data engineering communities. Weeks 8-10: iterate on accuracy based on feedback. Weeks 10-12: launch paid tier. First paying customers likely from consultants and freelance data engineers who hit this pain regularly. Could see $1K-5K MRR within 3 months of launch if product-market fit is validated.

What people are saying
  • reverse engineering a very large legacy enterprise database, no formalised schema, no information_schema, no documentation
  • interested in tools that infer relationships automatically, or whether it's always a manual grind
  • I just read the code for like 3 weeks. Noted down what I thought was the flow
  • the maintainer of that old code was very open about not understanding it, because he didn't write the origin
  • I built a tool to solve that problem a few years ago based on queries of the Oracle data dictionary