Website Scraper + Field Extractor

Domain-to-signal mini-service that safely scrapes company websites and extracts structured fields for enrichment and personalization.

Smart Crawling

Fetches homepage + key pages (About, Careers, Pricing, Contact) with safe limits and timeout controls

Field Extraction

Extracts company name, summary, keywords, social links, and contact info with confidence scoring

Audit Trail

Stores raw HTML/text + extracted JSON in MongoDB for audit, reuse, and debugging purposes

Scope & Features

A focused mini-service that takes a company domain/URL, safely fetches the website, and returns structured company signals.

Must-Have (v1)

  • Input domain or URL for crawling
  • Fetch homepage + key pages (About, Careers, Pricing, Contact)
  • Extract company name, description (1-3 sentences), keywords/topics
  • Detect social links (LinkedIn, Twitter/X, Facebook, Instagram, YouTube)
  • Find emails/phones on contact pages (optional)
  • Store raw HTML/text + extracted JSON for audit and reuse

Nice-to-Have (Later)

  • robots.txt respect + crawl budget management
  • sitemap.xml support for discovering pages
  • JS-rendered pages (Playwright) fallback for dynamic sites
  • Language detection for internationalized content
  • Batch scraping mode for processing multiple domains

Three-Tier Architecture

FastAPI service with a Requests + BeautifulSoup scraper and MongoDB storage for scalable web extraction.

API Layer

FastAPI

RESTful API for crawl job submission, status checks, and extraction retrieval

Scraping Engine

Requests + BeautifulSoup

Fetches pages safely with timeout, size limits, and redirect controls. Playwright fallback for JS sites.

Storage Layer

MongoDB

Stores crawl jobs, raw pages, and extracted fields with full audit trail

MongoDB Collections

Three collections to track job state, store raw pages, and maintain extracted company data.

crawl_jobs

Tracks crawl job lifecycle from QUEUED to COMPLETED/FAILED

Field        Type       Description
job_id       uuid       Unique job identifier
input_url    string     Original domain or URL
status       enum       QUEUED | FETCHING | EXTRACTING | COMPLETED | FAILED
started_at   timestamp  Job start time
finished_at  timestamp  Job completion time
error        string     Error message if failed
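
A minimal sketch of how a job document might move through this lifecycle, using pymongo; the database name and helper names are assumptions, not part of the spec:

from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4

from pymongo import MongoClient

class JobStatus(str, Enum):
    QUEUED = "QUEUED"
    FETCHING = "FETCHING"
    EXTRACTING = "EXTRACTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

db = MongoClient("mongodb://localhost:27017")["scraper"]

def create_job(input_url: str) -> str:
    # Insert a new crawl job in the QUEUED state.
    job_id = str(uuid4())
    db.crawl_jobs.insert_one({
        "job_id": job_id,
        "input_url": input_url,
        "status": JobStatus.QUEUED.value,
        "started_at": datetime.now(timezone.utc),
        "finished_at": None,
        "error": None,
    })
    return job_id

def set_status(job_id: str, status: JobStatus, error: str | None = None) -> None:
    # Advance the job: QUEUED -> FETCHING -> EXTRACTING -> COMPLETED/FAILED.
    update = {"status": status.value, "error": error}
    if status in (JobStatus.COMPLETED, JobStatus.FAILED):
        update["finished_at"] = datetime.now(timezone.utc)
    db.crawl_jobs.update_one({"job_id": job_id}, {"$set": update})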

pages

Raw pages fetched during crawl with metadata

Field             Type       Description
job_id            uuid       Reference to crawl job
url               string     Page URL
http_status       int        HTTP status code
content_type      string     Content-Type header
html              text       Raw HTML (optional, can be large)
text              text       Cleaned text content
title             string     Page title
meta_description  string     Meta description tag
fetched_at        timestamp  Fetch timestamp

extractions

Structured company data extracted from crawled pages

Field            Type       Description
job_id           uuid       Reference to crawl job
domain           string     Company domain
company_name     string     Extracted company name
summary          text       1-3 sentence company description
keywords         array      List of relevant keywords/topics
social_links     object     LinkedIn, Twitter, Facebook, Instagram, YouTube links
detected_emails  array      Emails found on contact page
detected_phones  array      Phone numbers found on contact page
about_url        string     About page URL if found
careers_url      string     Careers page URL if found
pricing_url      string     Pricing page URL if found
contact_url      string     Contact page URL if found
confidence       int        0-100 confidence score
created_at       timestamp  Extraction timestamp
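
The indexes these collections would plausibly need, sketched with pymongo (job_id lookups for status polling, domain lookups for reuse); the database name is an assumption:

from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["scraper"]

# Job lookups by job_id are the hot path for status polling.
db.crawl_jobs.create_index([("job_id", ASCENDING)], unique=True)

# Pages are always read per job; raw HTML can be large, so keep it out of indexes.
db.pages.create_index([("job_id", ASCENDING)])

# One extraction per job; the domain index supports reuse across jobs.
db.extractions.create_index([("job_id", ASCENDING)], unique=True)
db.extractions.create_index([("domain", ASCENDING), ("created_at", DESCENDING)])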

Crawling Strategy (v1)

Deterministic, small-footprint approach that focuses on key pages with hard safety limits.

Crawl Steps

1. Normalize Input: Convert domain to https://{domain} format
2. Fetch Homepage: Download main landing page content
3. Discover Key Pages: Find About, Careers, Pricing, Contact pages from homepage links
4. Fetch Key Pages: Download discovered pages (max 5 total pages)
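
A compact sketch of steps 1-4; fetch_page and discover_key_pages are passed in as callables, and sketches matching their signatures appear under Safety Limits and Page Discovery Rules below:

from typing import Callable
from urllib.parse import urlparse

def normalize_input(domain_or_url: str) -> str:
    # Step 1: accept "example.com" or a full URL and normalize to https://host.
    if not domain_or_url.startswith(("http://", "https://")):
        domain_or_url = "https://" + domain_or_url
    return f"https://{urlparse(domain_or_url).netloc}"

def crawl(domain_or_url: str,
          fetch_page: Callable[[str], str],
          discover_key_pages: Callable[[str, str], dict],
          max_pages: int = 5) -> dict[str, str]:
    base_url = normalize_input(domain_or_url)                   # step 1
    pages = {base_url: fetch_page(base_url)}                    # step 2
    key_pages = discover_key_pages(pages[base_url], base_url)   # step 3
    for url in key_pages.values():                              # step 4
        if len(pages) >= max_pages:
            break  # hard cap on total pages per crawl
        if url not in pages:
            pages[url] = fetch_page(url)
    return pages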

Safety Limits

  • Max Response Size: 2-5 MB
  • Max Timeout: 10 seconds
  • Max Redirects: 5 hops
  • Max Total Pages: 5 pages
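
One way to enforce these limits with requests, as a sketch; the chunk size and User-Agent string are assumptions, and the size cap uses the 5 MB upper bound:

import requests

MAX_BYTES = 5 * 1024 * 1024   # response size cap (upper end of the 2-5 MB range)
TIMEOUT = 10                  # seconds
MAX_REDIRECTS = 5

session = requests.Session()
session.max_redirects = MAX_REDIRECTS

def fetch_page(url: str) -> str:
    # Stream the body so oversized responses can be cut off mid-download.
    resp = session.get(url, timeout=TIMEOUT, stream=True,
                       headers={"User-Agent": "scraper-v1"})
    resp.raise_for_status()
    chunks, total = [], 0
    for chunk in resp.iter_content(chunk_size=65536):
        total += len(chunk)
        if total > MAX_BYTES:
            raise ValueError(f"response exceeded {MAX_BYTES} bytes: {url}")
        chunks.append(chunk)
    return b"".join(chunks).decode(resp.encoding or "utf-8", errors="replace")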

Page Discovery Rules

  • About: /about, /company, /who-we-are
  • Careers: /careers, /jobs
  • Pricing: /pricing, /plans
  • Contact: /contact, /support
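
A sketch of these discovery rules applied to homepage anchors with BeautifulSoup; the KEY_PAGE_PATHS mapping simply mirrors the bullets above:

from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

KEY_PAGE_PATHS = {
    "about": ("/about", "/company", "/who-we-are"),
    "careers": ("/careers", "/jobs"),
    "pricing": ("/pricing", "/plans"),
    "contact": ("/contact", "/support"),
}

def discover_key_pages(homepage_html: str, base_url: str) -> dict[str, str]:
    # Scan homepage links and keep the first match per category.
    soup = BeautifulSoup(homepage_html, "html.parser")
    base_host = urlparse(base_url).netloc
    found: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        parsed = urlparse(url)
        if parsed.netloc != base_host:
            continue  # stay on the company's own site
        path = parsed.path.rstrip("/").lower() or "/"
        for label, prefixes in KEY_PAGE_PATHS.items():
            if label not in found and any(
                path == p or path.startswith(p + "/") for p in prefixes
            ):
                found[label] = url
    return found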

Extraction Rules (v1)

Best-effort extraction logic to pull structured company signals from raw HTML/text with confidence scoring.

Company Name

  • Prefer <meta property="og:site_name">
  • Else <title> cleanup (remove trailing " | Company")
  • Else header/logo alt text
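
A best-effort sketch of this fallback chain with BeautifulSoup; treating the first image with alt text as the logo is a crude stand-in for real header/logo detection:

import re
from bs4 import BeautifulSoup

def extract_company_name(html: str) -> str | None:
    # Fallback chain: og:site_name -> cleaned <title> -> first image alt text.
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:site_name")
    if og and og.get("content", "").strip():
        return og["content"].strip()
    if soup.title and soup.title.string:
        # Strip trailing " | Company" / " - Tagline" style suffixes.
        return re.split(r"\s*\|\s*|\s+[-–]\s+", soup.title.string.strip())[0]
    logo = soup.find("img", alt=True)  # crude stand-in for header/logo alt text
    if logo and logo["alt"].strip():
        return logo["alt"].strip()
    return None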

Summary (1-3 sentences)

  • Prefer <meta name="description">
  • Else first meaningful paragraph from homepage/about page
  • Clean boilerplate (cookie banners, nav text)
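
A sketch of the same preference order; the 60-character threshold and the cookie-banner check are rough boilerplate heuristics, not fixed rules:

from bs4 import BeautifulSoup

def extract_summary(html: str) -> str | None:
    # Prefer the meta description, else the first substantial paragraph.
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content", "").strip():
        return meta["content"].strip()
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Rough boilerplate filter: skip short fragments and cookie-banner text.
        if len(text) > 60 and "cookie" not in text.lower():
            return text
    return None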

Keywords/Topics

  • Extract from headings (h1/h2) + meta keywords
  • Remove stopwords, keep top N unique tokens
  • Normalize to lowercase
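
A sketch of this tokenization; the stopword list here is a deliberately tiny placeholder:

import re
from collections import Counter
from bs4 import BeautifulSoup

STOPWORDS = {"the", "and", "for", "with", "our", "your", "from", "that", "this"}

def extract_keywords(html: str, top_n: int = 10) -> list[str]:
    # Tokenize h1/h2 headings plus meta keywords, drop stopwords, keep top N.
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(h.get_text(" ", strip=True) for h in soup.find_all(["h1", "h2"]))
    meta = soup.find("meta", attrs={"name": "keywords"})
    if meta and meta.get("content"):
        text += " " + meta["content"].replace(",", " ")
    tokens = [t for t in re.findall(r"[a-z][a-z\-]{2,}", text.lower())
              if t not in STOPWORDS]
    return [token for token, _ in Counter(tokens).most_common(top_n)]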

Social Links

  • Find anchors with domains: linkedin.com, twitter.com/x.com, facebook.com, instagram.com, youtube.com
  • Keep the first match per platform
  • Normalize URLs to canonical format
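
A sketch of the platform matching; subdomains such as uk.linkedin.com are ignored here for brevity:

from urllib.parse import urlparse
from bs4 import BeautifulSoup

SOCIAL_HOSTS = {
    "linkedin": ("linkedin.com",),
    "twitter": ("twitter.com", "x.com"),
    "facebook": ("facebook.com",),
    "instagram": ("instagram.com",),
    "youtube": ("youtube.com",),
}

def extract_social_links(html: str) -> dict[str, str]:
    # First match per platform wins; URLs normalized to https without query strings.
    soup = BeautifulSoup(html, "html.parser")
    links: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        parsed = urlparse(a["href"])
        host = parsed.netloc.lower().removeprefix("www.")
        path = parsed.path.rstrip("/")
        for platform, hosts in SOCIAL_HOSTS.items():
            if platform not in links and host in hosts and path:
                links[platform] = f"https://{host}{path}"
    return links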

Emails

  • Regex scan only on Contact page text (reduce noise)
  • Pattern: something@domain
  • Deduplicate; filter out generic inboxes (info@, sales@, support@)
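
A sketch of the Contact-page scan; keeping generic inboxes only as a fallback is one reading of the filter rule above:

import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GENERIC_PREFIXES = ("info@", "sales@", "support@")

def extract_emails(contact_page_text: str) -> list[str]:
    # Scan only Contact-page text to reduce noise; dedupe while preserving order.
    seen: set[str] = set()
    emails: list[str] = []
    for email in EMAIL_RE.findall(contact_page_text):
        email = email.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    # Prefer specific addresses; fall back to generic inboxes if nothing else.
    specific = [e for e in emails if not e.startswith(GENERIC_PREFIXES)]
    return specific or emails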

Phones

  • Basic patterns: (XXX) XXX-XXXX, XXX-XXX-XXXX, +X XXX XXX XXXX
  • Scan only on Contact page
  • Don't overfit; cover common US and international formats
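
A sketch covering the three patterns listed, kept deliberately loose per the note above:

import re

# US formats plus a loose international pattern; intentionally not exhaustive.
PHONE_RE = re.compile(
    r"\(\d{3}\)\s?\d{3}-\d{4}"          # (XXX) XXX-XXXX
    r"|\d{3}-\d{3}-\d{4}"               # XXX-XXX-XXXX
    r"|\+\d{1,3}(?:\s\d{2,4}){2,4}"     # +X XXX XXX XXXX
)

def extract_phones(contact_page_text: str) -> list[str]:
    # Scan only the Contact page; dedupe while preserving order.
    return list(dict.fromkeys(PHONE_RE.findall(contact_page_text)))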

Confidence Score (0-100)

Scoring Logic

  • Has company_name: +30
  • Has summary: +30
  • Has LinkedIn link: +20
  • Found about/contact page: +20
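
The scoring rules above as a function, assuming the extraction dict shape from the extractions collection, with social_links as a nested object:

def confidence_score(extraction: dict) -> int:
    # Additive scoring per the rules above; the four signals sum to at most 100.
    score = 0
    if extraction.get("company_name"):
        score += 30
    if extraction.get("summary"):
        score += 30
    if extraction.get("social_links", {}).get("linkedin"):
        score += 20
    if extraction.get("about_url") or extraction.get("contact_url"):
        score += 20
    return score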

Example Output

{
  "confidence": 100,
  "company_name": "Acme Corp",
  "summary": "Leading B2B SaaS...",
  "linkedin": "linkedin.com/company/acme",
  "about_url": "acme.com/about",
  "contact_url": "acme.com/contact"
}

FastAPI Endpoints

RESTful API for job submission, status polling, and one-shot extraction.

POST /crawl

Submit a new crawl job for a domain

Request Body

{
  "domain": "example.com",
  "mode": "simple"
}

Response

{
  "job_id": "uuid-123",
  "status": "QUEUED"
}
GET /crawl/{job_id}

Check status and get extraction results when ready

Response

{
  "job_id": "uuid-123",
  "status": "COMPLETED",
  "extraction": {
    "company_name": "Example Corp",
    "summary": "Leading B2B SaaS provider...",
    "confidence": 95
  }
}
GET /crawl/{job_id}/pages

Get the list of fetched pages with metadata (optionally without full HTML)

Response

{
  "pages": [
    {
      "url": "https://example.com",
      "title": "Example Corp - Homepage",
      "http_status": 200
    }
  ]
}
POST /extract

One-shot mode: crawl + extract in a single call for quick testing

Request Body

{
  "url": "https://example.com"
}

Response

{
  "company_name": "Example Corp",
  "summary": "Leading B2B SaaS provider...",
  "confidence": 95
}
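
A skeletal wiring of the first two endpoints in FastAPI; the in-memory jobs dict and the run_crawl_job worker are stand-ins for MongoDB and the real crawl pipeline:

from uuid import uuid4
from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}  # stand-in for the crawl_jobs collection

class CrawlRequest(BaseModel):
    domain: str
    mode: str = "simple"

@app.post("/crawl")
def submit_crawl(req: CrawlRequest, background_tasks: BackgroundTasks):
    # Create the job record, kick off the crawl, return immediately.
    job_id = str(uuid4())
    jobs[job_id] = {"job_id": job_id, "status": "QUEUED", "extraction": None}
    background_tasks.add_task(run_crawl_job, job_id, req.domain)
    return {"job_id": job_id, "status": "QUEUED"}

@app.get("/crawl/{job_id}")
def get_crawl(job_id: str):
    # Status polling; includes the extraction once the job completes.
    job = jobs.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="job not found")
    return job

def run_crawl_job(job_id: str, domain: str) -> None:
    # Placeholder worker: fetch pages, run extractors, mark COMPLETED/FAILED.
    jobs[job_id]["status"] = "COMPLETED"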

Optional React UI

Minimal testing interface for submitting crawl jobs and viewing extraction results.

Domain Input

Simple form to enter company domain or URL

Crawl Status

Real-time progress tracking (QUEUED → FETCHING → EXTRACTING → COMPLETED)

Extraction Results

Display company name, summary, keywords, social links in clean UI

Page List

Show fetched pages with status codes and titles

UI Purpose

  • Quick testing and validation of extraction logic
  • Useful for demos and manual QA
  • Not required for production API usage
  • Can be built with React + Tailwind in a few hours

Production Benefits

A focused, production-ready service that extracts company signals safely and reliably.

Domain-to-Signal Pipeline

Turn any company domain into structured enrichment data

Safe & Reliable

Hard limits prevent runaway crawls and resource exhaustion

Full Audit Trail

Raw HTML + extracted JSON stored for debugging and reprocessing

Fast Development

Simple v1 scope with clear extension path for advanced features

Done Criteria

  • Enter 20 random business domains and get a consistent company name, summary, and LinkedIn URL for most of them
  • Stored raw pages + extracted fields available in MongoDB for audit
  • Safe limits prevent long crawls, huge downloads, and timeout issues
  • Confidence scoring provides quality signal for downstream systems
  • Optional React UI for manual testing and demos

Next Steps: Add a URL classification function, a clean-text extractor recipe, and a Playwright fallback for JS-heavy sites in v2.
