90% of companies waste money collecting and storing massive amounts of data when they'd get more value from cleaning what they already have. Most orgs don't query data older than 30 days but keep paying to store and maintain it.
Connects to existing data warehouses (Postgres, BigQuery, Snowflake), analyzes query patterns, data freshness, and actual usage. Generates a signal-to-noise score per table/dataset, recommends archival or deletion of unused data, and highlights high-value columns that drive actual decisions. Includes a cost savings calculator showing how much you're spending on idle data.
Freemium — free audit for one data source, paid tiers ($99-499/mo) for continuous monitoring, automated cleanup recommendations, and multi-source support
The pain is real and well-documented — companies genuinely overspend on idle data storage and under-invest in data quality. The Reddit thread and broader industry sentiment confirm this. However, it's a 'slow bleed' pain, not an 'on-fire' pain. Most companies tolerate the waste because nobody's job depends on fixing it, and for sub-1000-employee companies the wasted storage may amount to only $500-5K/month — painful, but not urgent enough to trigger an immediate purchase decision.
TAM is significant if you count all companies with data warehouses (hundreds of thousands globally). But the serviceable market — SMBs with 1-5 person data teams who are over-provisioned, have budget authority, and would self-serve a $99-499/mo tool — is narrower. Estimated SAM: ~50K companies globally, $100M+ at full penetration. The challenge is that the biggest spenders (who feel the most pain) are enterprises who want enterprise sales, while SMBs may not spend enough on data infrastructure to justify even $99/mo for optimization.
This is the weakest link. The value proposition is cost savings — 'pay us $X to save you $Y.' But for SMBs spending $1-5K/month on Snowflake/BigQuery, the savings from archiving idle tables might be $200-800/month. Hard to justify $499/month tooling to save $500/month. The free audit is smart — it creates the 'aha moment' — but converting to ongoing $99-499/month recurring revenue requires continuous value beyond the initial cleanup. Data teams at SMBs also tend to DIY this with SQL queries against metadata tables. The buyer (data team lead) often doesn't control infrastructure budget.
Very buildable as an MVP. The core components are: (1) connect to warehouse metadata APIs (Snowflake ACCOUNT_USAGE, BigQuery INFORMATION_SCHEMA, Postgres pg_stat_user_tables), (2) analyze query logs and access history, (3) calculate storage costs per table, (4) generate a composite score. All data sources have well-documented APIs. A solo dev with data engineering experience could build a working prototype for one warehouse (e.g., Snowflake) in 4-6 weeks. Multi-warehouse support and a polished UI add time but the core logic is straightforward.
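The composite score in step (4) is the only non-obvious piece, and even it is simple. The sketch below shows one plausible shape in Python; the weights, saturation thresholds, and field names are illustrative assumptions, not a spec — in practice the inputs would be aggregated from the metadata sources listed above (e.g. Snowflake's ACCOUNT_USAGE query history).

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    """Per-table metadata, e.g. aggregated from Snowflake ACCOUNT_USAGE.
    Field names are illustrative, not actual warehouse column names."""
    name: str
    queries_last_90d: int        # queries touching the table this quarter
    days_since_last_query: int
    unique_users_90d: int
    downstream_dependents: int   # views/jobs that read from it
    storage_gb: float

def signal_score(t: TableStats) -> float:
    """Composite signal-to-noise score in [0, 1]; weights are hypothetical."""
    # Frequency: saturates at ~100 queries per quarter.
    freq = min(t.queries_last_90d / 100, 1.0)
    # Recency: decays linearly to zero over 90 days.
    recency = max(0.0, 1 - t.days_since_last_query / 90)
    # Breadth: more distinct readers suggests decision-driving data.
    breadth = min(t.unique_users_90d / 10, 1.0)
    # Dependencies: anything downstream keeps a table alive.
    deps = min(t.downstream_dependents / 5, 1.0)
    return 0.35 * freq + 0.30 * recency + 0.20 * breadth + 0.15 * deps

hot = TableStats("orders", 400, 1, 12, 3, 120.0)
cold = TableStats("legacy_events_2019", 0, 400, 0, 0, 800.0)
print(round(signal_score(hot), 2), signal_score(cold))  # 0.94 0.0
```

Tuning these weights against real usage patterns is where the product earns its keep over a hand-rolled query; the scoring harness itself is a week of work, not six.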
Clear whitespace exists. The market is segmented into three camps that don't talk to each other: data quality tools ('is it correct?'), data catalogs ('where is it?'), and cost optimization tools ('how to run cheaper?'). Nobody connects quality + usage + cost into a unified 'is this data worth keeping?' score with actionable archive/delete recommendations. The closest competitors (Select Star, Unravel) each cover one piece but not the synthesis. The SMB price point ($99-499/mo) is also completely unserved — existing tools start at $50K+/year.
The initial value is a one-time audit ('here's what to clean up'), which is powerful for acquisition but risky for retention. Continuous monitoring ('alert when a table goes stale, track data ROI trends over time, catch new waste as it accumulates') creates recurring value but requires the user to care about ongoing optimization rather than one-time cleanup. Monthly/quarterly reports showing cost savings achieved and new waste detected could drive retention. The freemium model with continuous monitoring at paid tiers is the right structure, but churn risk is real — users may clean up and cancel.
- +Clear competitive whitespace — nobody synthesizes usage + quality + cost into a single 'keep or kill' recommendation at SMB price points
- +Technically very feasible with well-documented warehouse metadata APIs; solo dev can ship MVP in 4-6 weeks for one warehouse
- +Pain is real and growing — cloud data warehouse 'bill shock' is a documented trend as post-pandemic data infrastructure matures
- +Free audit creates a powerful acquisition hook — immediate, tangible value that demonstrates ROI before asking for payment
- +SMB pricing ($99-499/mo) is radically underserved — entire data observability market is priced for enterprises at $50K+/year
- !Willingness-to-pay ceiling: SMBs with $1-5K/month warehouse bills may not pay $499/month for optimization tooling, and the savings may not justify the cost at lower tiers
- !One-time value trap: the biggest 'aha moment' is the initial audit, after which users may clean up and churn rather than pay for ongoing monitoring
- !DIY competition: a competent data engineer can write SQL against INFORMATION_SCHEMA and ACCOUNT_USAGE to get 80% of this value in a day — your tool needs to be dramatically easier and more insightful
- !Security/access concerns: connecting to production data warehouses requires significant trust, especially for a new/unknown vendor. SMBs may be hesitant to grant read access to metadata
- !Warehouse vendors are building native features: Snowflake's Account Usage views, BigQuery's INFORMATION_SCHEMA, and native cost management dashboards are improving — platform risk if they add 'unused table detection' natively
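To make the DIY risk concrete: the core 'idle tables' report is a single join over warehouse metadata. The sketch below uses an in-memory SQLite database as a stand-in; against a real warehouse the same query shape would run over Snowflake's ACCOUNT_USAGE views or Postgres's pg_stat_user_tables, and the table and column names here are illustrative mocks.

```python
import sqlite3

# In-memory stand-in for warehouse metadata views. In production this
# query targets ACCOUNT_USAGE (Snowflake) or pg_stat_user_tables (Postgres).
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE table_storage (table_name TEXT, storage_gb REAL);
CREATE TABLE access_history (table_name TEXT, days_since_last_query INTEGER);
INSERT INTO table_storage VALUES ('orders', 120), ('legacy_events', 800);
INSERT INTO access_history VALUES ('orders', 1), ('legacy_events', 400);
""")

# The whole 'idle data report' is one join: tables with storage but no
# recent reads, ordered by how much they cost to keep around.
idle = db.execute("""
    SELECT s.table_name, s.storage_gb
    FROM table_storage s
    JOIN access_history a USING (table_name)
    WHERE a.days_since_last_query > 90
    ORDER BY s.storage_gb DESC
""").fetchall()
print(idle)  # [('legacy_events', 800.0)]
```

A competent data engineer writes this in an afternoon — the paid tool has to win on the synthesis (scoring, lineage impact, alerts, trend reports), not on the raw query.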
End-to-end data observability platform that monitors pipelines for freshness, volume, schema changes, and distribution anomalies. ML-based anomaly detection with field-level lineage across the modern data stack.
Automated data discovery and lineage platform that analyzes query logs to build lineage, track data usage by user/frequency, and identify popular vs unused assets. Lightweight catalog functionality.
Data quality testing and monitoring via SodaCL
Full-stack data observability and optimization platform focused on performance and cost. Analyzes query performance, resource utilization, and costs across Snowflake, Databricks, BigQuery. FinOps for data with chargeback/showback.
Active metadata platform combining data catalog, governance, lineage, collaboration, and quality integrations. Positions as 'GitHub for data teams' with modern UI and embedded AI for discovery.
Single-warehouse connector (start with Snowflake — largest SMB data warehouse market). Read-only connection to ACCOUNT_USAGE and INFORMATION_SCHEMA. Dashboard showing: (1) signal-to-noise score per table based on query frequency, recency, unique users, and downstream dependencies, (2) idle data report — tables not queried in 30/60/90 days with storage cost, (3) total monthly cost savings if idle data is archived. One-click export of recommended actions. No automated cleanup in MVP — just the diagnosis and recommendations. Ship as a web app with OAuth-based Snowflake connection.
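The savings number in (3) is straightforward once idle volume is known. A minimal sketch of the calculator, assuming a flat storage rate — the $23/TB-month figure approximates Snowflake on-demand storage pricing and varies by region, contract, and Time Travel retention:

```python
# Assumed flat rate; approximates Snowflake on-demand storage and is
# region- and contract-dependent.
STORAGE_USD_PER_TB_MONTH = 23.0

def monthly_savings(idle_gb: float, rate: float = STORAGE_USD_PER_TB_MONTH) -> float:
    """Storage dollars recovered per month by archiving or deleting idle tables."""
    return idle_gb / 1024 * rate

# e.g. 2 TB of tables untouched for 90+ days:
print(round(monthly_savings(2048), 2))  # 46.0
```

Note what this arithmetic implies for pricing: at pure storage rates, an SMB needs tens of terabytes of idle data before savings clear even the $99/mo tier, so the pitch likely has to include compute waste (queries scanning stale tables), not storage alone.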
Free one-time audit for a single Snowflake account → $99/month for continuous monitoring + monthly reports + Slack/email alerts when tables go stale → $299/month for multi-source support (BigQuery + Postgres) + team features + automated archival scripts → $499/month for advanced analytics (column-level value scoring, data lineage impact analysis, compliance/retention policy engine). Long-term: usage-based pricing tied to warehouse spend under management (1-2% of optimized savings).
8-12 weeks. Weeks 1-6: build MVP with Snowflake connector and scoring engine. Weeks 7-8: beta with 5-10 design partners from data engineering communities (Reddit, dbt Slack, Locally Optimistic). Weeks 9-10: iterate based on feedback, add cost calculator polish. Weeks 11-12: launch free audit publicly, convert early users to paid monitoring tier. First paying customer likely in week 10-12 if the free audit delivers a clear 'you're wasting $X/month' moment.
- “90% of companies would get more value from cleaning the data they already have than collecting more of it”
- “You need more signal, not more rows”
- “most organizations didn't query data older than 30 days and almost none were querying data older than one year — outside of that window it just sat there idle”
- “everyone else is just paying for Spark clusters to run queries that Postgres could handle”