6.7 / 10 · medium · CONDITIONAL GO

DataSignal

Automated data quality and signal-to-noise analyzer that scores your existing datasets and recommends what to keep, clean, or drop.

DevTools · Small-to-mid-size companies (sub-1000 employees) with data teams of 1-5 people
The Gap

90% of companies waste money collecting and storing massive amounts of data when they'd get more value from cleaning what they already have. Most orgs don't query data older than 30 days but keep paying to store and maintain it.

Solution

Connects to existing data warehouses (Postgres, BigQuery, Snowflake), analyzes query patterns, data freshness, and actual usage. Generates a signal-to-noise score per table/dataset, recommends archival or deletion of unused data, and highlights high-value columns that drive actual decisions. Includes a cost savings calculator showing how much you're spending on idle data.
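The composite score described above could be assembled along these lines; a minimal sketch in which the weights, caps, and field names are all invented for illustration rather than taken from any real implementation:

```python
from dataclasses import dataclass

@dataclass
class TableStats:
    # Per-table metadata pulled from warehouse system views (fields are illustrative)
    name: str
    queries_last_90d: int
    distinct_users_90d: int
    days_since_last_query: int
    storage_gb: float

def signal_to_noise(t: TableStats) -> float:
    """Toy 0-100 score: heavily used, recently queried tables score high;
    large idle tables score low. All weights and caps are placeholders."""
    usage = min(t.queries_last_90d / 100, 1.0)          # saturate at 100 queries
    reach = min(t.distinct_users_90d / 10, 1.0)         # saturate at 10 users
    freshness = max(0.0, 1.0 - t.days_since_last_query / 90)
    size_penalty = min(t.storage_gb / 500, 1.0) * 0.2   # big idle tables hurt more
    raw = 0.4 * usage + 0.3 * reach + 0.3 * freshness - size_penalty
    return round(max(0.0, 100 * raw), 1)

hot = TableStats("orders", 450, 12, 1, 40)
cold = TableStats("clickstream_2021", 0, 0, 400, 800)
print(signal_to_noise(hot), signal_to_noise(cold))   # → 98.1 0.0
```

Ranking tables by a score like this is what turns raw metadata into the "keep, clean, or drop" recommendation.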

Revenue Model

Freemium — free audit for one data source, paid tiers ($99-499/mo) for continuous monitoring, automated cleanup recommendations, and multi-source support

Feasibility Scores
Pain Intensity: 7/10

The pain is real and well-documented — companies genuinely overspend on idle data storage and under-invest in data quality. The Reddit thread and broader industry sentiment confirm this. However, it's a 'slow bleed' pain, not an 'on-fire' pain. Most companies tolerate the waste because nobody's job depends on fixing it, and the absolute dollar amounts for sub-1000 employee companies may be $500-5K/month in wasted storage — painful but not urgent enough to trigger immediate purchase decisions.

Market Size: 6/10

TAM is significant if you count all companies with data warehouses (hundreds of thousands globally). But the serviceable market — SMBs with 1-5 person data teams who are over-provisioned, have budget authority, and would self-serve a $99-499/mo tool — is narrower. Estimated SAM: ~50K companies globally, $100M+ at full penetration. The challenge is that the biggest spenders (who feel the most pain) are enterprises who want enterprise sales, while SMBs may not spend enough on data infrastructure to justify even $99/mo for optimization.

Willingness to Pay: 5/10

This is the weakest link. The value proposition is cost savings — 'pay us $X to save you $Y.' But for SMBs spending $1-5K/month on Snowflake/BigQuery, the savings from archiving idle tables might be $200-800/month. Hard to justify $499/month tooling to save $500/month. The free audit is smart — it creates the 'aha moment' — but converting to ongoing $99-499/month recurring revenue requires continuous value beyond the initial cleanup. Data teams at SMBs also tend to DIY this with SQL queries against metadata tables. The buyer (data team lead) often doesn't control infrastructure budget.
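The break-even arithmetic in that paragraph is worth making explicit; the figures below are the illustrative ones from the analysis, not measurements:

```python
def net_monthly_value(storage_saved: float, tool_cost: float) -> float:
    """Customer's net gain per month: storage savings minus the subscription."""
    return storage_saved - tool_cost

# Illustrative figures: ~$500/month of recoverable storage spend
print(net_monthly_value(500, 499))   # top tier: barely breaks even → 1
print(net_monthly_value(500, 99))    # entry tier: clearly positive → 401
```

This is why the lower tiers carry the pricing story for SMBs: at $99/month the ROI case is obvious, at $499/month it collapses.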

Technical Feasibility: 8/10

Very buildable as an MVP. The core components are: (1) connect to warehouse metadata APIs (Snowflake ACCOUNT_USAGE, BigQuery INFORMATION_SCHEMA, Postgres pg_stat_user_tables), (2) analyze query logs and access history, (3) calculate storage costs per table, (4) generate a composite score. All data sources have well-documented APIs. A solo dev with data engineering experience could build a working prototype for one warehouse (e.g., Snowflake) in 4-6 weeks. Multi-warehouse support and a polished UI add time but the core logic is straightforward.
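For the Postgres case, the metadata pull could look roughly like the query below. Note the `last_seq_scan` / `last_idx_scan` columns only exist in PostgreSQL 16+; on older versions the tool would have to fall back to scan counters alone. The 90-day window is a placeholder:

```python
# Rough idle-table report for Postgres. last_seq_scan / last_idx_scan were added
# in PostgreSQL 16; pg_total_relation_size reports the on-disk footprint.
IDLE_TABLES_SQL = """
SELECT schemaname,
       relname,
       seq_scan + COALESCE(idx_scan, 0)       AS total_scans,
       GREATEST(last_seq_scan, last_idx_scan) AS last_scan,
       pg_total_relation_size(relid)          AS bytes
FROM pg_stat_user_tables
WHERE GREATEST(last_seq_scan, last_idx_scan) < now() - interval '90 days'
   OR (last_seq_scan IS NULL AND last_idx_scan IS NULL)
ORDER BY bytes DESC;
"""
print(IDLE_TABLES_SQL.strip())
```

The Snowflake and BigQuery equivalents query ACCOUNT_USAGE and INFORMATION_SCHEMA respectively; the shape of the report is the same.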

Competition Gap: 8/10

Clear whitespace exists. The market is segmented into three camps that don't talk to each other: data quality tools ('is it correct?'), data catalogs ('where is it?'), and cost optimization tools ('how to run cheaper?'). Nobody connects quality + usage + cost into a unified 'is this data worth keeping?' score with actionable archive/delete recommendations. The closest competitors (Selectstar, Unravel) each cover one piece but not the synthesis. The SMB price point ($99-499/mo) is also completely unserved — existing tools start at $50K+/year.

Recurring Potential: 7/10

The initial value is a one-time audit ('here's what to clean up'), which is powerful for acquisition but risky for retention. Continuous monitoring ('alert when a table goes stale, track data ROI trends over time, catch new waste as it accumulates') creates recurring value but requires the user to care about ongoing optimization rather than one-time cleanup. Monthly/quarterly reports showing cost savings achieved and new waste detected could drive retention. The freemium model with continuous monitoring at paid tiers is the right structure, but churn risk is real — users may clean up and cancel.
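The continuous-monitoring piece could start as a simple staleness check run on a schedule; in this sketch the 90-day window and table names are placeholders:

```python
from datetime import date, timedelta

STALE_AFTER = timedelta(days=90)   # placeholder staleness window

def stale_tables(last_queried: dict[str, date], today: date) -> list[str]:
    """Tables whose most recent query is older than the staleness window;
    a continuous-monitoring tier would run this daily and alert on new hits."""
    return sorted(t for t, d in last_queried.items() if today - d > STALE_AFTER)

today = date(2024, 6, 1)
print(stale_tables({"orders": date(2024, 5, 30),
                    "legacy_events": date(2023, 11, 2)}, today))   # → ['legacy_events']
```

Surfacing newly stale tables over time, rather than once, is what gives the paid tier something to alert on each month.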

Strengths
  • Clear competitive whitespace — nobody synthesizes usage + quality + cost into a single 'keep or kill' recommendation at SMB price points
  • Technically very feasible with well-documented warehouse metadata APIs; solo dev can ship MVP in 4-6 weeks for one warehouse
  • Pain is real and growing — cloud data warehouse 'bill shock' is a documented trend as post-pandemic data infrastructure matures
  • Free audit creates a powerful acquisition hook — immediate, tangible value that demonstrates ROI before asking for payment
  • SMB pricing ($99-499/mo) is radically underserved — entire data observability market is priced for enterprises at $50K+/year
Risks
  • Willingness-to-pay ceiling: SMBs with $1-5K/month warehouse bills may not pay $499/month for optimization tooling, and the savings may not justify the cost at lower tiers
  • One-time value trap: the biggest 'aha moment' is the initial audit, after which users may clean up and churn rather than pay for ongoing monitoring
  • DIY competition: a competent data engineer can write SQL against INFORMATION_SCHEMA and ACCOUNT_USAGE to get 80% of this value in a day — your tool needs to be dramatically easier and more insightful
  • Security/access concerns: connecting to production data warehouses requires significant trust, especially for a new/unknown vendor. SMBs may be hesitant to grant read access to metadata
  • Warehouse vendors are building native features: Snowflake's Account Usage views, BigQuery's INFORMATION_SCHEMA, and native cost management dashboards are improving — platform risk if they add 'unused table detection' natively
Competition
Monte Carlo Data

End-to-end data observability platform that monitors pipelines for freshness, volume, schema changes, and distribution anomalies. ML-based anomaly detection with field-level lineage across the modern data stack.

Pricing: Sales-driven, ~$100K-$300K+/year. No free tier.
Gap: No dataset value/usage scoring, no archive/delete recommendations, no cost savings calculations, no signal-to-noise analysis. Answers 'is the data correct?' but never 'is the data worth keeping?' Priced far out of reach for SMBs with 1-5 person data teams.
Selectstar

Automated data discovery and lineage platform that analyzes query logs to build lineage, track data usage by user/frequency, and identify popular vs unused assets. Lightweight catalog functionality.

Pricing: Free tier for small deployments, Pro ~$1K-2K/month, Enterprise custom.
Gap: Shows query frequency but does NOT produce a composite value score weighting usage by downstream impact or business criticality. No actionable archive/delete recommendations with projected savings. No cost-to-storage mapping. Surfaces unused tables in reports but stops short of telling you what to do about them.
Soda.io

Data quality testing and monitoring via SodaCL, a YAML-based language for declaring data quality checks.

Pricing: Soda Core: free/open-source. Soda Cloud Free: 1 datasource. Pro: ~$300-500/month. Enterprise: custom.
Gap: Purely quality-focused — checks if data is correct, not whether anyone uses it or whether it's worth the storage cost. No usage scoring, no archive/delete recommendations, no cost analysis, no signal-to-noise concept. You could have perfectly 'healthy' data that nobody has queried in two years.
Unravel Data

Full-stack data observability and optimization platform focused on performance and cost. Analyzes query performance, resource utilization, and costs across Snowflake, Databricks, BigQuery. FinOps for data with chargeback/showback.

Pricing: Sales-driven, ~$50K-$200K+/year based on warehouse spend under management.
Gap: Optimizes compute costs ('run your queries cheaper') but does NOT analyze table-level storage value or recommend what data to stop paying for entirely. No dataset-level value scoring, no storage-focused cost savings, no signal-to-noise analysis. Oriented toward 'run what you have cheaper' rather than 'figure out what you should stop running.' Enterprise pricing excludes SMBs.
Atlan

Active metadata platform combining data catalog, governance, lineage, collaboration, and quality integrations. Positions as 'GitHub for data teams' with modern UI and embedded AI for discovery.

Pricing: Sales-driven, ~$50K-$200K+/year. No self-serve tier.
Gap: Has partial usage tracking but does NOT produce a quantified value score or ROI metric per dataset. No automated lifecycle recommendations (archive/delete). No cost savings calculations. Catalogs everything equally — does not distinguish high-value from low-value data. Enterprise-only pricing with no SMB path.
MVP Suggestion

Single-warehouse connector (start with Snowflake — largest SMB data warehouse market). Read-only connection to ACCOUNT_USAGE and INFORMATION_SCHEMA. Dashboard showing: (1) signal-to-noise score per table based on query frequency, recency, unique users, and downstream dependencies, (2) idle data report — tables not queried in 30/60/90 days with storage cost, (3) total monthly cost savings if idle data is archived. One-click export of recommended actions. No automated cleanup in MVP — just the diagnosis and recommendations. Ship as a web app with OAuth-based Snowflake connection.
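The cost-savings line item in the dashboard reduces to simple arithmetic. In this sketch the $23/TB-month rate is an assumed Snowflake capacity-storage price and the idle-table sizes are invented:

```python
def monthly_savings(idle_gb: float, price_per_tb_month: float = 23.0) -> float:
    """Projected monthly saving from archiving idle tables.
    The default rate is an assumed Snowflake capacity-storage price."""
    return round(idle_gb / 1024 * price_per_tb_month, 2)

# Invented idle-table report: table name -> size in GB
idle_report = {"clickstream_2021": 800, "raw_logs_old": 2048, "tmp_backfill": 300}
print(monthly_savings(sum(idle_report.values())))   # → 70.71
```

Even a rough number like this is the "you're wasting $X/month" headline the free audit needs to deliver.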

Monetization Path

Free one-time audit for a single Snowflake account → $99/month for continuous monitoring + monthly reports + Slack/email alerts when tables go stale → $299/month for multi-source support (BigQuery + Postgres) + team features + automated archival scripts → $499/month for advanced analytics (column-level value scoring, data lineage impact analysis, compliance/retention policy engine). Long-term: usage-based pricing tied to warehouse spend under management (1-2% of optimized savings).

Time to Revenue

8-12 weeks. Weeks 1-6: build MVP with Snowflake connector and scoring engine. Weeks 7-8: beta with 5-10 design partners from data engineering communities (Reddit, dbt Slack, Locally Optimistic). Weeks 9-10: iterate based on feedback and polish the cost calculator. Weeks 11-12: launch free audit publicly, convert early users to paid monitoring tier. First paying customer likely in weeks 10-12 if the free audit delivers a clear 'you're wasting $X/month' moment.

What people are saying
  • "90% of companies would get more value from cleaning the data they already have than collecting more of it"
  • "You need more signal, not more rows"
  • "Most organizations didn't query data older than 30 days, and almost none were querying data older than one year; outside of that window it just sat there idle"
  • "Everyone else is just paying for Spark clusters to run queries that Postgres could handle"