Lead Cleanroom
Normalizer + Dedup Engine
Standalone service that takes messy lead records, standardizes fields, detects duplicates, and produces golden (clean) lead records with complete audit trail. Run independently or integrate with your existing systems.
What You Will Build
A standalone mini-product that processes raw lead data from any source (CSV, CRM, API), applies normalization rules, runs deduplication logic, and outputs clean, merged golden records ready for use.
Field Normalization
Standardize name, email, phone, domain, title, and company fields with smart rules.
Duplicate Detection
Find duplicates using email, phone, domain+name, and company+name matching strategies.
Golden Records
Create best-version merged records from duplicate clusters with configurable merge policies.
Audit Trail
Track every change with source lead IDs, match reasons, confidence scores, and field history.
Core Capabilities
Must-Have Features
- Normalize name, email, phone, domain, title, company
- Deduplicate using deterministic rules
- Group duplicates into clusters
- Create golden record for each cluster
- Maintain complete audit trail
Future Enhancements
- Fuzzy matching (Levenshtein/Jaro distance)
- Advanced merge policies (most recent, longest, verified)
- Manual review UI for approve/reject merges
- Redis caching for lookup performance
Architecture Overview
Clean, modular architecture designed for data quality and auditability. Runs as a standalone service or integrates with existing systems.
Spring Boot Service
Orchestrates normalization and dedup logic with strong domain modeling
MySQL Database
Stores raw leads, normalized leads, clusters, golden records, and audit logs
Optional React UI
Admin interface for testing, viewing clusters, and audit trail
Normalization Rules
Smart field-level rules to standardize messy data from various sources. All normalization is reversible with original values preserved.
- •Trim whitespace and convert to lowercase
- •Remove "mailto:" prefix if present
- •Validate format (contains @ and domain)
- •Store normalized for matching
Domain
- •Derive from email domain if missing
- •Remove www. prefix
- •Convert to lowercase
- •Strip paths (hostname only)
Phone
- •Remove all non-digit characters
- •Convert to E.164 format (+1...)
- •Handle US default country code
- •Store normalized for dedup
Name
- •Trim extra spaces between words
- •Split full name into first/last (best-effort)
- •Optional title-case formatting
- •Keep original for display
Company
- •Convert to lowercase for matching
- •Remove suffixes (Inc, LLC, Ltd, Corp)
- •Preserve original display name
- •Normalize for clustering
Title
- •Trim whitespace
- •Standardize common abbreviations
- •Optional seniority detection
- •Keep original phrasing
Deduplication Strategy
Four-tier matching strategy runs in priority order. Higher priority matches take precedence. Each match includes confidence scoring and reason tracking.
Email Exact Match
Match leads with identical normalized email addresses
cluster_key: email:{email_normalized}Phone Exact Match
Match leads with identical E.164 formatted phone numbers
cluster_key: phone:{phone_e164}Domain + Name
Match leads from same domain with identical full name
cluster_key: hash(domain + full_name)Company + Name
Match leads from same company with identical full name
cluster_key: hash(company + full_name)How It Works
- 1Normalize the incoming lead using field rules
- 2Run dedup strategies in priority order (email → phone → domain+name → company+name)
- 3If match found, add lead to existing cluster; otherwise create new cluster
- 4Record match type, confidence score, and reason for audit trail
- 5Trigger golden record rebuild for the cluster
Database Schema
Six-table design optimized for data lineage, deduplication, and auditability. All relationships use UUIDs for global uniqueness.
raw_leadsOriginal unprocessed lead data from any source
- id (uuid)
- source (CSV/CRM/API)
- payload_json
- created_at
normalized_leadsCleaned and standardized lead records with indexes
- id (uuid)
- raw_lead_id (FK)
- email_normalized (indexed)
- phone_e164 (indexed)
- domain_normalized (indexed)
- company_name_normalized (indexed)
- full_name, first_name, last_name
- title_normalized, linkedin_url_normalized
dedup_clustersGroups of duplicate leads with unique cluster keys
- id (uuid)
- cluster_key (unique)
- match_type (EMAIL_EXACT/PHONE_EXACT/etc)
- created_at, updated_at
cluster_membersLinks normalized leads to their clusters
- cluster_id (FK)
- normalized_lead_id (FK)
- is_primary (bool)
- match_reason, match_score (0-100)
golden_leadsBest-version merged records for each cluster
- id (uuid)
- cluster_id (FK unique)
- full_name, email, phone
- company_name, domain, title
- linkedin_url
- merge_policy_version, updated_at
field_audit_logComplete history of field changes and merge decisions
- id (PK)
- cluster_id
- field_name
- old_value, new_value
- reason
- changed_at
Schema Benefits
Data Lineage
Track every lead from raw input through normalization to golden record
Performance
Indexed lookups on email, phone, domain, and company for fast matching
Auditability
Complete history of changes, merge decisions, and confidence scores
Merge Policy Engine
Smart field-level merge policies create the best-version golden record from duplicate clusters. Every decision is logged for full auditability.
Prefer non-null, valid email address with @ and domain
Phone
HighPrefer E.164 formatted number, fallback to any valid phone
Company
MediumPrefer longest non-null value (most complete)
Title
MediumPrefer most recent non-null value
Prefer URL matching linkedin.com domain
Name
HighPrefer full_name with both first and last parts present
Automatic Rebuild
Golden record automatically updates when new leads join cluster or policies change
Version Control
Track merge policy versions so you can evolve rules without losing history
Audit Trail
Every field change logged with old/new values, reason, and timestamp for compliance
REST API
Clean REST endpoints for integration. Test with Postman or use the optional React admin UI.
/cleanroom/leadsSubmit raw lead for normalization
{ full_name, email, phone, company, ... }{ raw_lead_id, normalized_lead_id }/cleanroom/leads/{id}/dedupRun deduplication for normalized lead
normalized_lead_id (path param){ cluster_id, match_type, match_score }/cleanroom/clusters/{id}/golden/rebuildRebuild golden record for cluster
cluster_id (path param){ golden_lead: {...} }/cleanroom/clusters/{id}View cluster details with members
cluster_id (path param){ cluster, members[], golden_lead }/cleanroom/searchSearch by email, phone, domain, company, name
?email=&phone=&domain=&company=&name={ clusters: [...], count: N }Optional React Admin UI
Build a simple React interface for testing and monitoring:
- •Paste lead JSON and click "Normalize" to test field rules
- •Click "Run Dedup" to see matching and clustering in action
- •View cluster members and golden lead side-by-side
- •Browse audit log to understand merge decisions
Key Benefits
Clean, deduplicated lead data with complete traceability. Built for reliability and integration flexibility.
Data Quality
Eliminate duplicates and standardize fields for consistent, reliable lead data across systems.
Standalone Service
Run independently or integrate with existing tools. Works with any lead source via simple API.
Full Auditability
Complete audit trail of every normalization, match decision, and golden record update for compliance.
Scalable Architecture
Indexed lookups and efficient clustering handle thousands of leads with fast matching performance.
Flexible Integration
REST API works with CSV imports, CRM systems, or custom data pipelines without vendor lock-in.
Production Ready
Deterministic dedup rules, versioned merge policies, and MySQL transactions ensure data integrity.
Done Criteria
Your Lead Cleanroom is ready for production when these conditions are met:
- Insert 1,000 raw leads with known duplicates
- Service creates correct clusters by email, phone, domain+name
- Golden leads generated with best-version data
- Audit trail shows why each field was chosen
- Search API returns clusters and golden records instantly
Build Your Next Product With AI Expertise
Experience the future of software development. Let our GenAI platform accelerate your next project.
Schedule a Free AI Blueprint Session