When the previous DE leaves, all business logic knowledge walks out the door — new DEs spend months reverse-engineering undocumented transformations.
Connects to data pipelines, SQL scripts, and BI tools to parse transformation logic, then generates human-readable documentation with lineage graphs and version history.
Freemium — free for up to 10 queries/pipelines, $49-$149/mo for full workspace with collaboration and change tracking.
This is a top-3 pain point in data engineering. The Reddit thread captures a universal experience: every DE who inherits an undocumented stack has lived this nightmare. It causes months of lost productivity, costly errors in business reporting, and significant organizational risk. Companies have made wrong business decisions because no one understood the legacy logic. The pain is acute, recurring (every time someone leaves), and has real dollar consequences.
TAM is tricky. There are ~200k data engineers in the US, but the sweet spot is solo/small-team DEs at companies with 50-500 employees — maybe 30-50k potential users. At $99/mo average, that's ~$36-60M TAM. Not venture-scale, but excellent for a bootstrapped SaaS. The constraint is that enterprise (where the big money is) is served by catalogs, and very small teams may resist paying. Mid-market is the wedge but it's a narrower band than it appears.
DEs at companies spending $50k+/year on Snowflake can justify $149/mo for documentation tooling — it's a rounding error. The buyer is often the DE themselves (bottoms-up) or their engineering manager. Pain signals suggest this saves weeks of onboarding time, which easily justifies the price. However, there's a cultural challenge: many DEs see documentation as something that 'should be free' or 'should be part of dbt/the warehouse.' The $49-149 range is well-calibrated — low enough for a credit card purchase, high enough to signal value.
Core SQL parsing is solved (SQLGlot handles most dialects). LLMs can generate human-readable explanations effectively. The hard parts: (1) connecting to diverse data stacks (Airflow, dbt, Looker, Tableau, stored procs, Python scripts) requires many integrations, (2) resolving cross-pipeline dependencies accurately is non-trivial, (3) keeping documentation 'living' (auto-updating on changes) requires webhooks/polling infrastructure. A solo dev can build an MVP for SQL-only in 6-8 weeks, but the 'connects to everything' vision is a 6-12 month journey. Start narrow: just SQL files + one warehouse.
The gap is real and significant. Enterprise catalogs are too expensive and complex for SMBs. dbt requires rewriting everything. ChatGPT is ephemeral. Nobody offers: (1) auto-discovery of business logic from raw SQL with (2) persistent, versioned, human-readable documentation at (3) a price point accessible to solo DEs. The intersection of 'automated logic extraction + living docs + affordable' is genuinely unoccupied. The risk is that dbt Cloud or a catalog player adds this feature, but their incentives point upmarket, not down.
Strong recurring dynamics: (1) pipelines change constantly, so documentation must be continuously updated; this isn't a one-time export, (2) new team members onboard regularly, creating ongoing value, (3) version history compounds in value as the record of changes grows, (4) once documentation exists, removing it is painful. Low churn risk once embedded in the workflow. The 'living' aspect is the key to retention: static docs would be a one-time purchase, but auto-updating docs are a subscription.
- +Solves a universally recognized, high-pain problem that every data engineer has experienced firsthand — strong emotional resonance for marketing
- +Clear competition gap: enterprise catalogs are 100x the price, dbt requires migration, ChatGPT is ephemeral — no one serves the solo DE at SMBs
- +LLM advances make the core technical proposition (auto-explain SQL in plain English) dramatically more feasible than it was 2 years ago
- +Strong bottoms-up adoption potential: individual DEs can sign up without procurement approval at the $49-149 price point
- +Built-in retention moat: documentation becomes more valuable over time and is painful to abandon once the team relies on it
- !Integration breadth is the make-or-break challenge: every data stack is different (Snowflake + dbt + Airflow vs. BigQuery + Dataform + Composer vs. stored procedures in SQL Server), and supporting even the top 3 combinations requires significant engineering effort
- !dbt Cloud is aggressively adding AI documentation features and could close the gap for dbt users specifically — this would eliminate a significant portion of the target market
- !Solo DEs at small companies may have the pain but not the budget authority or culture to pay for documentation tooling — they might just paste SQL into ChatGPT and call it good enough
- !Accuracy risk: if auto-generated business logic explanations are wrong or misleading, it's worse than no documentation — trust is hard to earn and easy to lose
- !The 'living' aspect requires reliable change detection and re-parsing, which adds operational complexity (CI/CD hooks, warehouse query log polling, git monitoring)
Open-source SQL transformation framework that encourages documenting models, tests, and lineage as code. dbt Cloud adds a hosted catalog and lineage visualization.
Enterprise data catalog platforms that provide metadata management, data lineage, search, and governance. Atlan is the modern challenger; Alation and Collibra are incumbents.
Open-source Python libraries that parse SQL to extract column-level lineage, transformations, and dependencies. SQLGlot can transpile between dialects and analyze query structure.
Modern data discovery and documentation platforms that auto-crawl warehouses and BI tools to generate lineage and documentation. Select Star emphasizes automated lineage; Castor focuses on AI-generated documentation.
Developers paste SQL into ChatGPT or use Copilot to get explanations of what queries do. Ad-hoc but increasingly common workflow for reverse-engineering legacy SQL.
- Week 1-2: Build a web app where users upload or paste SQL files/queries. Use SQLGlot for parsing and an LLM (Claude API) to generate human-readable business logic explanations. Show a basic lineage graph (table/column dependencies).
- Week 3-4: Add Snowflake and BigQuery direct connections to auto-discover queries from query history.
- Week 5-6: Add versioning: detect when SQL changes and highlight what business logic changed.
- Week 7-8: Add a shareable workspace with search. Ship the free tier (10 queries) and start collecting feedback.

Do NOT build Airflow/dbt/Looker integrations until you have 50+ users asking for them.
Free tier (10 queries, paste-only) to drive adoption and SEO/word-of-mouth → $49/mo Pro (unlimited queries, warehouse connection, version history) for individual DEs → $149/mo Team (collaboration, shared workspace, change notifications, SSO) for small data teams → $499+/mo Enterprise (API access, custom integrations, audit logs, on-prem) once you have traction. Consider a one-time 'audit report' product ($199-499) for DEs who just need to document a stack once during onboarding — this captures the 'I just joined and need to understand everything' moment.
4-6 weeks to MVP launch, 8-12 weeks to first paying customer. The paste-SQL-and-explain feature can be built and monetized quickly. Target the 'just joined a new company' moment — post in r/dataengineering, data Twitter/Bluesky, and dbt Slack community. First dollar likely comes from a solo DE who just inherited a messy stack and needs to understand it fast. Path to $5k MRR in 4-6 months if execution is strong.
- “not only did no one know how the business logic was set up”
- “the old logic they were referring to was incorrect”
- “the last person in that role left three years ago”
- “source of that data was an Excel sheet that was last updated three years ago”