Lego Block 2: Data Quality

Lead Cleanroom

Normalizer + Dedup Engine

Standalone service that takes messy lead records, standardizes fields, detects duplicates, and produces golden (clean) lead records with complete audit trail. Run independently or integrate with your existing systems.

Explore Features Request Implementation

10+

Normalization Rules

Dedup Strategies

Auto

Golden Records

Full

Audit Trail

What You Will Build

A standalone mini-product that processes raw lead data from any source (CSV, CRM, API), applies normalization rules, runs deduplication logic, and outputs clean, merged golden records ready for use.

Field Normalization

Standardize name, email, phone, domain, title, and company fields with smart rules.

Duplicate Detection

Find duplicates using email, phone, domain+name, and company+name matching strategies.

Golden Records

Create best-version merged records from duplicate clusters with configurable merge policies.

Audit Trail

Track every change with source lead IDs, match reasons, confidence scores, and field history.

Core Capabilities

Must-Have Features

Normalize name, email, phone, domain, title, company
Deduplicate using deterministic rules
Group duplicates into clusters
Create golden record for each cluster
Maintain complete audit trail

Future Enhancements

Fuzzy matching (Levenshtein/Jaro distance)
Advanced merge policies (most recent, longest, verified)
Manual review UI for approve/reject merges
Redis caching for lookup performance

Architecture Overview

Clean, modular architecture designed for data quality and auditability. Runs as a standalone service or integrates with existing systems.

Spring Boot Service

Orchestrates normalization and dedup logic with strong domain modeling

Java 17+Spring Boot 3Spring Data JPAREST API

MySQL Database

Stores raw leads, normalized leads, clusters, golden records, and audit logs

MySQL 8Flyway MigrationsIndexed LookupsACID Transactions

Optional React UI

Admin interface for testing, viewing clusters, and audit trail

ReactTypeScriptREST ClientPostman Alternative

Normalization Rules

Smart field-level rules to standardize messy data from various sources. All normalization is reversible with original values preserved.

Email

•Trim whitespace and convert to lowercase
•Remove "mailto:" prefix if present
•Validate format (contains @ and domain)
•Store normalized for matching

Domain

•Derive from email domain if missing
•Remove www. prefix
•Convert to lowercase
•Strip paths (hostname only)

Phone

•Remove all non-digit characters
•Convert to E.164 format (+1...)
•Handle US default country code
•Store normalized for dedup

Name

•Trim extra spaces between words
•Split full name into first/last (best-effort)
•Optional title-case formatting
•Keep original for display

Company

•Convert to lowercase for matching
•Remove suffixes (Inc, LLC, Ltd, Corp)
•Preserve original display name
•Normalize for clustering

Title

•Trim whitespace
•Standardize common abbreviations
•Optional seniority detection
•Keep original phrasing

Deduplication Strategy

Four-tier matching strategy runs in priority order. Higher priority matches take precedence. Each match includes confidence scoring and reason tracking.

Priority 1Score: 100

Email Exact Match

Match leads with identical normalized email addresses

cluster_key: email:{email_normalized}

Example: john.doe@acme.com = john.doe@acme.com

Priority 2Score: 95

Phone Exact Match

Match leads with identical E.164 formatted phone numbers

cluster_key: phone:{phone_e164}

Example: +14155551234 = +14155551234

Priority 3Score: 75

Domain + Name

Match leads from same domain with identical full name

cluster_key: hash(domain + full_name)

Example: acme.com + John Doe

Priority 4Score: 65

Company + Name

Match leads from same company with identical full name

cluster_key: hash(company + full_name)

Example: Acme Inc + John Doe

How It Works

1Normalize the incoming lead using field rules
2Run dedup strategies in priority order (email → phone → domain+name → company+name)
3If match found, add lead to existing cluster; otherwise create new cluster
4Record match type, confidence score, and reason for audit trail
5Trigger golden record rebuild for the cluster

Database Schema

Six-table design optimized for data lineage, deduplication, and auditability. All relationships use UUIDs for global uniqueness.

raw_leads

Original unprocessed lead data from any source

id (uuid)
source (CSV/CRM/API)
payload_json
created_at

normalized_leads

Cleaned and standardized lead records with indexes

id (uuid)
raw_lead_id (FK)
email_normalized (indexed)
phone_e164 (indexed)
domain_normalized (indexed)
company_name_normalized (indexed)
full_name, first_name, last_name
title_normalized, linkedin_url_normalized

dedup_clusters

Groups of duplicate leads with unique cluster keys

id (uuid)
cluster_key (unique)
match_type (EMAIL_EXACT/PHONE_EXACT/etc)
created_at, updated_at

cluster_members

Links normalized leads to their clusters

cluster_id (FK)
normalized_lead_id (FK)
is_primary (bool)
match_reason, match_score (0-100)

golden_leads

Best-version merged records for each cluster

id (uuid)
cluster_id (FK unique)
full_name, email, phone
company_name, domain, title
linkedin_url
merge_policy_version, updated_at

field_audit_log

Complete history of field changes and merge decisions

id (PK)
cluster_id
field_name
old_value, new_value
reason
changed_at

Schema Benefits

Data Lineage

Track every lead from raw input through normalization to golden record

Performance

Indexed lookups on email, phone, domain, and company for fast matching

Auditability

Complete history of changes, merge decisions, and confidence scores

Golden Record

Merge Policy Engine

Smart field-level merge policies create the best-version golden record from duplicate clusters. Every decision is logged for full auditability.

Email

Highest

Prefer non-null, valid email address with @ and domain

Phone

High

Prefer E.164 formatted number, fallback to any valid phone

Company

Medium

Prefer longest non-null value (most complete)

Title

Medium

Prefer most recent non-null value

Medium

Prefer URL matching linkedin.com domain

Name

High

Prefer full_name with both first and last parts present

Automatic Rebuild

Golden record automatically updates when new leads join cluster or policies change

Version Control

Track merge policy versions so you can evolve rules without losing history

Audit Trail

Every field change logged with old/new values, reason, and timestamp for compliance

REST API

Clean REST endpoints for integration. Test with Postman or use the optional React admin UI.

POST/cleanroom/leads

Submit raw lead for normalization

Request

{ full_name, email, phone, company, ... }

Response

{ raw_lead_id, normalized_lead_id }

POST/cleanroom/leads/{id}/dedup

Run deduplication for normalized lead

Request

normalized_lead_id (path param)

Response

{ cluster_id, match_type, match_score }

POST/cleanroom/clusters/{id}/golden/rebuild

Rebuild golden record for cluster

Request

cluster_id (path param)

Response

{ golden_lead: {...} }

GET/cleanroom/clusters/{id}

View cluster details with members

Request

cluster_id (path param)

Response

{ cluster, members[], golden_lead }

GET/cleanroom/search

Search by email, phone, domain, company, name

Request

?email=&phone=&domain=&company=&name=

Response

{ clusters: [...], count: N }

Optional React Admin UI

Build a simple React interface for testing and monitoring:

•Paste lead JSON and click "Normalize" to test field rules
•Click "Run Dedup" to see matching and clustering in action
•View cluster members and golden lead side-by-side
•Browse audit log to understand merge decisions

Key Benefits

Clean, deduplicated lead data with complete traceability. Built for reliability and integration flexibility.

Data Quality

Eliminate duplicates and standardize fields for consistent, reliable lead data across systems.

Standalone Service

Run independently or integrate with existing tools. Works with any lead source via simple API.

Full Auditability

Complete audit trail of every normalization, match decision, and golden record update for compliance.

Scalable Architecture

Indexed lookups and efficient clustering handle thousands of leads with fast matching performance.

Flexible Integration

REST API works with CSV imports, CRM systems, or custom data pipelines without vendor lock-in.

Production Ready

Deterministic dedup rules, versioned merge policies, and MySQL transactions ensure data integrity.

Done Criteria

Your Lead Cleanroom is ready for production when these conditions are met:

Insert 1,000 raw leads with known duplicates
Service creates correct clusters by email, phone, domain+name
Golden leads generated with best-version data
Audit trail shows why each field was chosen
Search API returns clusters and golden records instantly

Build Your Next Product With AI Expertise

Experience the future of software development. Let our GenAI platform accelerate your next project.

Schedule a Free AI Blueprint Session

Lead Cleanroom

What You Will Build

Field Normalization

Duplicate Detection

Golden Records

Audit Trail

Core Capabilities

Must-Have Features

Future Enhancements

Architecture Overview

Spring Boot Service

MySQL Database

Optional React UI

Normalization Rules

Email

Domain

Phone

Name

Company

Title

Deduplication Strategy

Email Exact Match

Phone Exact Match

Domain + Name

Company + Name

How It Works

Database Schema

Schema Benefits

Data Lineage

Performance

Auditability

Merge Policy Engine

Email

Phone

Company

Title

LinkedIn

Name

Automatic Rebuild

Version Control

Audit Trail

REST API

Optional React Admin UI

Key Benefits

Data Quality

Standalone Service

Full Auditability

Scalable Architecture

Flexible Integration

Production Ready

Done Criteria

Build Your Next Product With AI Expertise