7.7 high GO

Entity Resolution Platform

A managed, opinionated MDM platform that handles merge, unmerge, lineage tracking, and conflict resolution out of the box.

DevTools
Data engineers and data platform teams at mid-to-large companies dealing with...
The Gap

Data engineers working on identity resolution face a complex web of decisions: field-level merge strategies, child data deduplication, unmerge/backtracking, data recency trust scoring, and unique ID management. Most teams rebuild these from scratch or cobble together partial solutions like Splink.

Solution

A SaaS platform that provides configurable entity resolution pipelines with built-in merge strategies, automatic lineage/audit trails, one-click unmerge with child cascade, recency-weighted field resolution, and a unified ID graph. Integrates with warehouses (Snowflake, BigQuery, Databricks) and exposes APIs.
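To make "recency-weighted field resolution" concrete, here is a minimal Python sketch of one way it could work. The `FieldValue` type, the per-source trust weights, the default trust of 0.5 for unknown sources, and the 180-day half-life are all illustrative assumptions, not a description of any existing product.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FieldValue:
    value: str
    source: str          # hypothetical source system name, e.g. "crm"
    updated_at: datetime


def recency_score(fv, now, source_trust, half_life_days=180.0):
    """Source trust multiplied by an exponential decay on the value's age."""
    age_days = max((now - fv.updated_at).total_seconds() / 86400.0, 0.0)
    decay = 0.5 ** (age_days / half_life_days)
    # Unknown sources get a neutral trust of 0.5 (an assumption).
    return source_trust.get(fv.source, 0.5) * decay


def resolve_field(candidates, now, source_trust):
    """Pick the surviving value for one field; None if every candidate is empty."""
    non_empty = [c for c in candidates if c.value]
    if not non_empty:
        return None
    return max(non_empty, key=lambda c: recency_score(c, now, source_trust))


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
trust = {"crm": 0.9, "web_form": 0.6}
candidates = [
    FieldValue("old@example.com", "crm", datetime(2022, 6, 1, tzinfo=timezone.utc)),
    FieldValue("new@example.com", "web_form", datetime(2024, 5, 1, tzinfo=timezone.utc)),
]
winner = resolve_field(candidates, now, trust)
```

With these made-up inputs the fresher `web_form` value wins even though `crm` is the more trusted source, because the two-year-old value has decayed through roughly four half-lives.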

Revenue Model

subscription

Feasibility Scores
Pain Intensity: 9/10

The Reddit thread itself is a masterclass in pain signals — every comment describes a different dimension of complexity (merge strategies, unmerge, child cascade, recency trust, ID management). This is a known, recurring headache that data engineers face repeatedly across companies. The phrase 'welcome to the problem space' implies veterans know this is unsolved. Teams spend months rebuilding these pipelines from scratch. Pain is real, frequent, and expensive.

Market Size: 8/10

The entity resolution/MDM TAM is $15-20B and growing at a 12-15% CAGR. Even capturing a niche (developer-first, warehouse-native ER for mid-to-large companies) represents a $500M+ addressable segment. Every company with customer data eventually needs identity resolution. The segment is neither consumer-tiny nor enterprise-only, which makes it a sweet spot for a focused SaaS.

Willingness to Pay: 7/10

Enterprise MDM buyers already pay $200K-$1M+/year for Reltio/Tamr/Informatica. Data engineering teams have tooling budgets ($5K-$50K/year per tool is normal for Snowflake/dbt/Fivetran ecosystem). A warehouse-native ER platform at $1K-$10K/month would be a fraction of what enterprises pay today. However, open-source alternatives (Splink) create a free floor, and convincing data engineers to pay for managed services over DIY requires proving significant time savings. Score docked because the buyer (data engineer) often isn't the budget holder.

Technical Feasibility: 5/10

Entity resolution is genuinely hard computer science — probabilistic matching, graph algorithms, conflict resolution logic, lineage DAGs, warehouse-native execution (Snowflake UDFs vs BigQuery remote functions vs Databricks). A true MVP covering configurable merge strategies, unmerge with cascade, lineage tracking, recency weighting, AND multi-warehouse integration is ambitious for 4-8 weeks. A solo dev could build a proof-of-concept for ONE warehouse with basic merge/unmerge in 8 weeks, but production-grade multi-warehouse support with all promised features is more like 4-6 months. The core matching engine alone is a deep problem.
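As a small illustration of the graph algorithms involved, grouping pairwise matches into entities can be sketched with a union-find structure. The `IdGraph` class, the record ID format, and the choice of the smallest member ID as the canonical ID are hypothetical, chosen only to keep the example deterministic.

```python
from collections import defaultdict


class IdGraph:
    """Union-find over record IDs; each cluster is one resolved entity."""

    def __init__(self):
        self._parent = {}

    def _find(self, x):
        self._parent.setdefault(x, x)
        root = x
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every node on the walk directly at the root.
        while self._parent[x] != root:
            self._parent[x], x = root, self._parent[x]
        return root

    def link(self, a, b):
        """Record that a and b were matched as the same entity."""
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            # Keep the smaller ID as root so the canonical ID is deterministic.
            lo, hi = sorted((ra, rb))
            self._parent[hi] = lo

    def canonical_id(self, x):
        return self._find(x)

    def clusters(self):
        groups = defaultdict(set)
        for node in list(self._parent):
            groups[self._find(node)].add(node)
        return dict(groups)


graph = IdGraph()
graph.link("crm:7", "web:3")
graph.link("web:3", "billing:1")
graph.link("crm:9", "crm:10")
```

Union-find on its own cannot undo a link, so an unmerge feature would also need to keep the original match edges and recompute the affected cluster from them.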

Competition Gap: 8/10

The whitespace is clear and validated: NO existing product offers configurable merge/unmerge + lineage + recency-weighted resolution + warehouse-native execution + developer-first UX together. Open-source tools (Splink, Zingg) only do matching. Enterprise platforms (Reltio, Informatica, Tamr) are $200K+/year, not warehouse-native, and not built for data engineers. API tools (Senzing, Tilores) resolve but don't manage. The gap is real and well-defined.

Recurring Potential: 9/10

Entity resolution is inherently ongoing — new records arrive daily, matches evolve, merges/unmerges happen continuously, data quality degrades over time. This is not a one-time ETL job. Companies need persistent identity graphs maintained in perpetuity. Usage-based pricing on record volume + monthly platform fee is natural. Very strong subscription/consumption model fit. Once integrated into a data pipeline, switching costs are extremely high.

Strengths
  • + Clearly validated pain with specific, articulated sub-problems (merge, unmerge, lineage, recency); not a solution looking for a problem
  • + Massive competition gap: nothing is both warehouse-native AND developer-first with full MDM workflows
  • + Existing market spending proves willingness to pay: you just need to offer 80% of value at 10% of enterprise MDM price
  • + Extremely high switching costs once integrated into data pipelines, which is a strong retention moat
  • + Growing market with tailwinds: cloud warehouse adoption, data mesh, regulatory pressure all increase demand
Risks
  • ! Technical complexity is high: entity resolution is a deep domain with many edge cases. Underestimating build time is the #1 risk
  • ! Open-source Splink is 'good enough' for many teams, creating a free floor that makes initial conversion harder
  • ! Selling to data engineers (influencers) vs. data platform leaders (budget holders) creates a two-step sale that slows deals
  • ! Multi-warehouse support (Snowflake + BigQuery + Databricks) triples integration surface area, which is a scope creep risk
  • ! Enterprise MDM vendors (Reltio, Informatica) could build warehouse-native connectors and close the gap from above
Competition
Splink

Open-source Python library for probabilistic record linkage and entity resolution. Built by the UK Ministry of Justice. Runs on Spark, DuckDB, or Athena. Identifies matching records using the Fellegi-Sunter probabilistic model.
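For readers unfamiliar with Fellegi-Sunter, the core arithmetic is sketched below in plain Python. This is not Splink's API. The field names, the m and u probabilities, and the prior are made-up illustrative values. For each compared field, m is the probability that the field agrees given a true match and u is the probability that it agrees given a non-match; the model assumes fields are conditionally independent, so per-field log2 weights simply add.

```python
import math


def match_weight(agrees, m, u):
    """log2 Bayes factor contributed by one field comparison."""
    return math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))


def match_probability(agreements, params, prior):
    """Combine per-field weights with the prior odds that two records match."""
    total = math.log2(prior / (1 - prior))
    for field_name, agrees in agreements.items():
        m, u = params[field_name]
        total += match_weight(agrees, m, u)
    odds = 2 ** total
    return odds / (1 + odds)


# Hypothetical parameters: (m, u) per field, and a 1-in-1000 prior match rate.
params = {"email": (0.9, 0.001), "first_name": (0.8, 0.05)}
both_agree = match_probability({"email": True, "first_name": True}, params, 0.001)
both_differ = match_probability({"email": False, "first_name": False}, params, 0.001)
```

Agreement on a rarely coincident field such as email carries a large positive weight, while disagreement pushes the score down, which is why a low prior can still yield a high match probability when several fields agree.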

Pricing: Free / open source (MIT License)
Gap: It's a matching LIBRARY, not a platform. No merge/unmerge workflows, no lineage or audit trails, no conflict resolution or survivorship rules, no unified ID graph, no recency-weighted resolution, no warehouse-native managed service. Requires significant engineering to productionize into MDM.
Tamr (acquired by Mastercard ~2024)

ML-powered data mastering platform combining machine learning with human-in-the-loop curation for entity resolution, schema mapping, and data classification. Targets large enterprise data unification.

Pricing: Enterprise only, typically $300K-$1M+/year with professional services
Gap: No one-click unmerge with cascade, no recency-weighted resolution, not warehouse-native (runs in its own env, not inside Snowflake/BQ/Databricks), very expensive and heavy (requires PS), limited lineage tracking, not developer-first — built for data stewards not data engineers
Reltio

Cloud-native MDM SaaS platform providing golden record management, matching, merging, and graph-based relationship visualization. Strong in healthcare, financial services, life sciences.

Pricing: SaaS subscription, typically $200K-$500K+/year based on profile count. No free tier.
Gap: Not warehouse-native — data must move INTO Reltio's cloud. Unmerge exists but no cascading child unmerge. No recency-weighted resolution out of the box. Lineage is basic, not full merge genealogy. Expensive for data engineering teams. UI-driven config, not code-first.
Senzing

Embeddable entity resolution API/engine using proprietary AI. Self-hosted or cloud-deployed. Focuses purely on entity resolution.

Pricing: Free tier for up to 100K records. Paid tiers ~$0.01-0.05 per record/year at scale. Also on AWS Marketplace.
Gap: No configurable merge/survivorship strategies (opinionated and closed), no lineage or audit trail, no unmerge workflow, no recency-weighted resolution, not warehouse-native, not full MDM — purely matching, no golden record management or stewardship UI
Zingg

Open-source ML-based entity resolution built on Apache Spark. Uses active learning — you label a few examples, it trains a model and scales matching across large datasets.

Pricing: Free / open source (AGPL license)
Gap: Everything beyond matching — like Splink it's a matching library, not a platform. No merge/unmerge workflows, no lineage, no conflict resolution, no unified ID graph, no recency-weighted resolution. Smaller community than Splink. AGPL license may deter commercial adoption.
MVP Suggestion

Start with ONE warehouse (Snowflake — largest data engineering community). Build a managed entity resolution pipeline with: (1) configurable field-level merge strategies via YAML/code, (2) basic unmerge with child cascade, (3) automatic merge lineage/audit log, (4) recency-weighted field resolution, (5) unified ID graph queryable via SQL. Skip the UI initially — expose everything via SQL functions + a CLI/API. Use Splink's matching under the hood for the probabilistic linkage layer and focus your differentiation on the MDM workflow layer (merge/unmerge/lineage/survivorship). Deploy as a Snowflake Native App or dbt package + managed service.
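The MDM workflow layer described above (merge, lineage log, unmerge with child cascade) could look roughly like the in-memory Python sketch below. The `Store` and `MergeEvent` names, the "keep the survivor's value, fill gaps from the absorbed record" survivorship rule, and the last-merge-first undo order are all assumptions made for illustration; a real implementation would persist this state in warehouse tables.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class MergeEvent:
    survivor_id: str
    absorbed_id: str
    survivor_before: dict   # survivor's fields before the merge
    absorbed_before: dict   # absorbed record's fields before the merge
    moved_children: list    # child IDs re-pointed from absorbed to survivor
    undone: bool = False    # events are flagged, never deleted, to keep the audit trail


@dataclass
class Store:
    records: dict = field(default_factory=dict)   # record_id -> field dict
    children: dict = field(default_factory=dict)  # child_id -> parent record_id
    lineage: list = field(default_factory=list)   # append-only merge log

    def merge(self, survivor_id, absorbed_id):
        survivor, absorbed = self.records[survivor_id], self.records[absorbed_id]
        moved = [c for c, p in self.children.items() if p == absorbed_id]
        self.lineage.append(MergeEvent(
            survivor_id, absorbed_id,
            copy.deepcopy(survivor), copy.deepcopy(absorbed), moved))
        # Assumed survivorship rule: survivor's value wins, gaps filled from absorbed.
        for key, value in absorbed.items():
            if not survivor.get(key):
                survivor[key] = value
        for child_id in moved:
            self.children[child_id] = survivor_id
        del self.records[absorbed_id]

    def unmerge_last(self):
        """Undo the most recent merge that has not already been undone."""
        event = next((e for e in reversed(self.lineage) if not e.undone), None)
        if event is None:
            raise ValueError("no merge to undo")
        event.undone = True
        self.records[event.survivor_id] = copy.deepcopy(event.survivor_before)
        self.records[event.absorbed_id] = copy.deepcopy(event.absorbed_before)
        # Child cascade: hand the moved children back to their original parent.
        for child_id in event.moved_children:
            self.children[child_id] = event.absorbed_id
        return event


store = Store(
    records={"a": {"name": "Ann", "phone": ""}, "b": {"name": "Ann L.", "phone": "555"}},
    children={"order1": "a", "order2": "b"},
)
store.merge("a", "b")
merged_view = copy.deepcopy(store.records)
store.unmerge_last()
```

This sketch restores the pre-merge snapshots verbatim, so it silently discards any edits made to the survivor after the merge; handling that case is one of the edge cases that makes production unmerge harder than it looks.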

Monetization Path

Free: open-source dbt package or Snowflake Native App for basic entity matching (captures Splink users). Paid ($500-2K/month): managed merge/unmerge workflows, lineage tracking, recency weighting, conflict resolution UI. Enterprise ($5K-20K/month): multi-warehouse support, SSO, audit compliance (SOC2/HIPAA), dedicated support, custom merge strategies. Scale: consumption-based pricing on records resolved per month.

Time to Revenue

3-5 months to MVP with first design partner paying. 6-9 months to repeatable revenue with 5-10 paying customers. The key is finding 2-3 design partners from the Reddit thread commenters or similar communities who will co-develop the MVP in exchange for discounted pricing.

What people are saying
  • Welcome to the problem space (implying it's a known, recurring headache)
  • do you throw away fields of data, or do you consider both sets, to enrich your master
  • Do you keep a backtracking trace, to be able to unmerge. Unmerge of children too
  • Do you trust more recent data more than older data
  • What unique ID do you keep, or do you make up a [new one]
  • If they have child data, do you keep the union of all children