Lego Block 2: Data Quality

Lead Cleanroom

Normalizer + Dedup Engine

Standalone service that takes messy lead records, standardizes fields, detects duplicates, and produces golden (clean) lead records with complete audit trail. Run independently or integrate with your existing systems.

10+
Normalization Rules
4
Dedup Strategies
Auto
Golden Records
Full
Audit Trail

What You Will Build

A standalone mini-product that processes raw lead data from any source (CSV, CRM, API), applies normalization rules, runs deduplication logic, and outputs clean, merged golden records ready for use.

Field Normalization

Standardize name, email, phone, domain, title, and company fields with smart rules.

Duplicate Detection

Find duplicates using email, phone, domain+name, and company+name matching strategies.

Golden Records

Create best-version merged records from duplicate clusters with configurable merge policies.

Audit Trail

Track every change with source lead IDs, match reasons, confidence scores, and field history.

Core Capabilities

Must-Have Features

  • Normalize name, email, phone, domain, title, company
  • Deduplicate using deterministic rules
  • Group duplicates into clusters
  • Create golden record for each cluster
  • Maintain complete audit trail

Future Enhancements

  • Fuzzy matching (Levenshtein/Jaro distance)
  • Advanced merge policies (most recent, longest, verified)
  • Manual review UI for approve/reject merges
  • Redis caching for lookup performance

Architecture Overview

Clean, modular architecture designed for data quality and auditability. Runs as a standalone service or integrates with existing systems.

Spring Boot Service

Orchestrates normalization and dedup logic with strong domain modeling

Java 17+Spring Boot 3Spring Data JPAREST API

MySQL Database

Stores raw leads, normalized leads, clusters, golden records, and audit logs

MySQL 8Flyway MigrationsIndexed LookupsACID Transactions

Optional React UI

Admin interface for testing, viewing clusters, and audit trail

ReactTypeScriptREST ClientPostman Alternative

Normalization Rules

Smart field-level rules to standardize messy data from various sources. All normalization is reversible with original values preserved.

Email

  • Trim whitespace and convert to lowercase
  • Remove "mailto:" prefix if present
  • Validate format (contains @ and domain)
  • Store normalized for matching

Domain

  • Derive from email domain if missing
  • Remove www. prefix
  • Convert to lowercase
  • Strip paths (hostname only)

Phone

  • Remove all non-digit characters
  • Convert to E.164 format (+1...)
  • Handle US default country code
  • Store normalized for dedup

Name

  • Trim extra spaces between words
  • Split full name into first/last (best-effort)
  • Optional title-case formatting
  • Keep original for display

Company

  • Convert to lowercase for matching
  • Remove suffixes (Inc, LLC, Ltd, Corp)
  • Preserve original display name
  • Normalize for clustering

Title

  • Trim whitespace
  • Standardize common abbreviations
  • Optional seniority detection
  • Keep original phrasing

Deduplication Strategy

Four-tier matching strategy runs in priority order. Higher priority matches take precedence. Each match includes confidence scoring and reason tracking.

Priority 1Score: 100

Email Exact Match

Match leads with identical normalized email addresses

cluster_key: email:{email_normalized}
Example: john.doe@acme.com = john.doe@acme.com
Priority 2Score: 95

Phone Exact Match

Match leads with identical E.164 formatted phone numbers

cluster_key: phone:{phone_e164}
Example: +14155551234 = +14155551234
Priority 3Score: 75

Domain + Name

Match leads from same domain with identical full name

cluster_key: hash(domain + full_name)
Example: acme.com + John Doe
Priority 4Score: 65

Company + Name

Match leads from same company with identical full name

cluster_key: hash(company + full_name)
Example: Acme Inc + John Doe

How It Works

  1. 1Normalize the incoming lead using field rules
  2. 2Run dedup strategies in priority order (email → phone → domain+name → company+name)
  3. 3If match found, add lead to existing cluster; otherwise create new cluster
  4. 4Record match type, confidence score, and reason for audit trail
  5. 5Trigger golden record rebuild for the cluster

Database Schema

Six-table design optimized for data lineage, deduplication, and auditability. All relationships use UUIDs for global uniqueness.

raw_leads

Original unprocessed lead data from any source

  • id (uuid)
  • source (CSV/CRM/API)
  • payload_json
  • created_at
normalized_leads

Cleaned and standardized lead records with indexes

  • id (uuid)
  • raw_lead_id (FK)
  • email_normalized (indexed)
  • phone_e164 (indexed)
  • domain_normalized (indexed)
  • company_name_normalized (indexed)
  • full_name, first_name, last_name
  • title_normalized, linkedin_url_normalized
dedup_clusters

Groups of duplicate leads with unique cluster keys

  • id (uuid)
  • cluster_key (unique)
  • match_type (EMAIL_EXACT/PHONE_EXACT/etc)
  • created_at, updated_at
cluster_members

Links normalized leads to their clusters

  • cluster_id (FK)
  • normalized_lead_id (FK)
  • is_primary (bool)
  • match_reason, match_score (0-100)
golden_leads

Best-version merged records for each cluster

  • id (uuid)
  • cluster_id (FK unique)
  • full_name, email, phone
  • company_name, domain, title
  • linkedin_url
  • merge_policy_version, updated_at
field_audit_log

Complete history of field changes and merge decisions

  • id (PK)
  • cluster_id
  • field_name
  • old_value, new_value
  • reason
  • changed_at

Schema Benefits

Data Lineage

Track every lead from raw input through normalization to golden record

Performance

Indexed lookups on email, phone, domain, and company for fast matching

Auditability

Complete history of changes, merge decisions, and confidence scores

Golden Record

Merge Policy Engine

Smart field-level merge policies create the best-version golden record from duplicate clusters. Every decision is logged for full auditability.

Email

Highest

Prefer non-null, valid email address with @ and domain

Phone

High

Prefer E.164 formatted number, fallback to any valid phone

Company

Medium

Prefer longest non-null value (most complete)

Title

Medium

Prefer most recent non-null value

LinkedIn

Medium

Prefer URL matching linkedin.com domain

Name

High

Prefer full_name with both first and last parts present

Automatic Rebuild

Golden record automatically updates when new leads join cluster or policies change

Version Control

Track merge policy versions so you can evolve rules without losing history

Audit Trail

Every field change logged with old/new values, reason, and timestamp for compliance

REST API

Clean REST endpoints for integration. Test with Postman or use the optional React admin UI.

POST/cleanroom/leads

Submit raw lead for normalization

Request
{ full_name, email, phone, company, ... }
Response
{ raw_lead_id, normalized_lead_id }
POST/cleanroom/leads/{id}/dedup

Run deduplication for normalized lead

Request
normalized_lead_id (path param)
Response
{ cluster_id, match_type, match_score }
POST/cleanroom/clusters/{id}/golden/rebuild

Rebuild golden record for cluster

Request
cluster_id (path param)
Response
{ golden_lead: {...} }
GET/cleanroom/clusters/{id}

View cluster details with members

Request
cluster_id (path param)
Response
{ cluster, members[], golden_lead }
GET/cleanroom/search

Search by email, phone, domain, company, name

Request
?email=&phone=&domain=&company=&name=
Response
{ clusters: [...], count: N }

Optional React Admin UI

Build a simple React interface for testing and monitoring:

  • Paste lead JSON and click "Normalize" to test field rules
  • Click "Run Dedup" to see matching and clustering in action
  • View cluster members and golden lead side-by-side
  • Browse audit log to understand merge decisions

Key Benefits

Clean, deduplicated lead data with complete traceability. Built for reliability and integration flexibility.

Data Quality

Eliminate duplicates and standardize fields for consistent, reliable lead data across systems.

Standalone Service

Run independently or integrate with existing tools. Works with any lead source via simple API.

Full Auditability

Complete audit trail of every normalization, match decision, and golden record update for compliance.

Scalable Architecture

Indexed lookups and efficient clustering handle thousands of leads with fast matching performance.

Flexible Integration

REST API works with CSV imports, CRM systems, or custom data pipelines without vendor lock-in.

Production Ready

Deterministic dedup rules, versioned merge policies, and MySQL transactions ensure data integrity.

Done Criteria

Your Lead Cleanroom is ready for production when these conditions are met:

  • Insert 1,000 raw leads with known duplicates
  • Service creates correct clusters by email, phone, domain+name
  • Golden leads generated with best-version data
  • Audit trail shows why each field was chosen
  • Search API returns clusters and golden records instantly

Build Your Next Product With AI Expertise

Experience the future of software development. Let our GenAI platform accelerate your next project.

Schedule a Free AI Blueprint Session