Website Scraper + Field Extractor

Domain-to-signal mini-service that safely scrapes company websites and extracts structured fields for enrichment and personalization.

Smart Crawling

Fetches homepage + key pages (About, Careers, Pricing, Contact) with safe limits and timeout controls

Field Extraction

Extracts company name, summary, keywords, social links, and contact info with confidence scoring

Audit Trail

Stores raw HTML/text + extracted JSON in MongoDB for audit, reuse, and debugging purposes

Scope & Features

A focused mini-service that takes a company domain/URL, safely fetches the website, and returns structured company signals.

Must-Have (v1)

  • Input domain or URL for crawling
  • Fetch homepage + key pages (About, Careers, Pricing, Contact)
  • Extract company name, description (1-3 sentences), keywords/topics
  • Detect social links (LinkedIn, Twitter/X, Facebook, Instagram, YouTube)
  • Find emails/phones on contact pages (optional)
  • Store raw HTML/text + extracted JSON for audit and reuse

Nice-to-Have (Later)

  • robots.txt respect + crawl budget management
  • sitemap.xml support for discovering pages
  • JS-rendered pages (Playwright) fallback for dynamic sites
  • Language detection for internationalized content
  • Batch scraping mode for processing multiple domains

Three-Tier Architecture

FastAPI service with a Requests + BeautifulSoup scraper and MongoDB storage for scalable web extraction.

API Layer

FastAPI

RESTful API for crawl job submission, status checks, and extraction retrieval

Scraping Engine

Requests + BeautifulSoup

Fetches pages safely with timeout, size limits, and redirect controls. Playwright fallback for JS sites.

Storage Layer

MongoDB

Stores crawl jobs, raw pages, and extracted fields with full audit trail

MongoDB Collections

Three collections to track job state, store raw pages, and maintain extracted company data.

crawl_jobs

Tracks crawl job lifecycle from QUEUED to COMPLETED/FAILED

Field        Type       Description
job_id       uuid       Unique job identifier
input_url    string     Original domain or URL
status       enum       QUEUED | FETCHING | EXTRACTING | COMPLETED | FAILED
started_at   timestamp  Job start time
finished_at  timestamp  Job completion time
error        string     Error message if failed
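
A minimal sketch of how a job document might move through this lifecycle, using pymongo; the database name and helper names are assumptions, not part of the spec:

from datetime import datetime, timezone
from enum import Enum
from uuid import uuid4

from pymongo import MongoClient

class JobStatus(str, Enum):
    QUEUED = "QUEUED"
    FETCHING = "FETCHING"
    EXTRACTING = "EXTRACTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

db = MongoClient("mongodb://localhost:27017")["scraper"]

def create_job(input_url: str) -> str:
    # Insert a new crawl job in the QUEUED state.
    job_id = str(uuid4())
    db.crawl_jobs.insert_one({
        "job_id": job_id,
        "input_url": input_url,
        "status": JobStatus.QUEUED.value,
        "started_at": datetime.now(timezone.utc),
        "finished_at": None,
        "error": None,
    })
    return job_id

def set_status(job_id: str, status: JobStatus, error: str | None = None) -> None:
    # Advance the job: QUEUED -> FETCHING -> EXTRACTING -> COMPLETED/FAILED.
    update = {"status": status.value, "error": error}
    if status in (JobStatus.COMPLETED, JobStatus.FAILED):
        update["finished_at"] = datetime.now(timezone.utc)
    db.crawl_jobs.update_one({"job_id": job_id}, {"$set": update})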

pages

Raw pages fetched during crawl with metadata

Field             Type       Description
job_id            uuid       Reference to crawl job
url               string     Page URL
http_status       int        HTTP status code
content_type      string     Content-Type header
html              text       Raw HTML (optional, can be large)
text              text       Cleaned text content
title             string     Page title
meta_description  string     Meta description tag
fetched_at        timestamp  Fetch timestamp

extractions

Structured company data extracted from crawled pages

Field            Type       Description
job_id           uuid       Reference to crawl job
domain           string     Company domain
company_name     string     Extracted company name
summary          text       1-3 sentence company description
keywords         array      List of relevant keywords/topics
social_links     object     LinkedIn, Twitter, Facebook, Instagram, YouTube links
detected_emails  array      Emails found on contact page
detected_phones  array      Phone numbers found on contact page
about_url        string     About page URL if found
careers_url      string     Careers page URL if found
pricing_url      string     Pricing page URL if found
contact_url      string     Contact page URL if found
confidence       int        0-100 confidence score
created_at       timestamp  Extraction timestamp
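
The indexes these collections would plausibly need, sketched with pymongo (job_id lookups for status polling, domain lookups for reuse); the database name is an assumption:

from pymongo import ASCENDING, DESCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017")["scraper"]

# Job lookups by job_id are the hot path for status polling.
db.crawl_jobs.create_index([("job_id", ASCENDING)], unique=True)

# Pages are always read per job; raw HTML can be large, so keep it out of indexes.
db.pages.create_index([("job_id", ASCENDING)])

# One extraction per job; the domain index supports reuse across jobs.
db.extractions.create_index([("job_id", ASCENDING)], unique=True)
db.extractions.create_index([("domain", ASCENDING), ("created_at", DESCENDING)])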

Crawling Strategy (v1)

Deterministic, small-footprint approach that focuses on key pages with hard safety limits.

Crawl Steps

1. Normalize Input: Convert domain to https://{domain} format
2. Fetch Homepage: Download main landing page content
3. Discover Key Pages: Find About, Careers, Pricing, Contact pages from homepage links
4. Fetch Key Pages: Download discovered pages (max 5 total pages)
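
A compact sketch of steps 1-4; fetch_page and discover_key_pages are passed in as callables, and sketches matching their signatures appear under Safety Limits and Page Discovery Rules below:

from typing import Callable
from urllib.parse import urlparse

def normalize_input(domain_or_url: str) -> str:
    # Step 1: accept "example.com" or a full URL and normalize to https://host.
    if not domain_or_url.startswith(("http://", "https://")):
        domain_or_url = "https://" + domain_or_url
    return f"https://{urlparse(domain_or_url).netloc}"

def crawl(domain_or_url: str,
          fetch_page: Callable[[str], str],
          discover_key_pages: Callable[[str, str], dict],
          max_pages: int = 5) -> dict[str, str]:
    base_url = normalize_input(domain_or_url)                   # step 1
    pages = {base_url: fetch_page(base_url)}                    # step 2
    key_pages = discover_key_pages(pages[base_url], base_url)   # step 3
    for url in key_pages.values():                              # step 4
        if len(pages) >= max_pages:
            break  # hard cap on total pages per crawl
        if url not in pages:
            pages[url] = fetch_page(url)
    return pages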

Safety Limits

  • Max Response Size: 2-5 MB
  • Max Timeout: 10 seconds
  • Max Redirects: 5 hops
  • Max Total Pages: 5 pages
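
One way to enforce these limits with requests, as a sketch; the chunk size and User-Agent string are assumptions, and the size cap uses the 5 MB upper bound:

import requests

MAX_BYTES = 5 * 1024 * 1024   # response size cap (upper end of the 2-5 MB range)
TIMEOUT = 10                  # seconds
MAX_REDIRECTS = 5

session = requests.Session()
session.max_redirects = MAX_REDIRECTS

def fetch_page(url: str) -> str:
    # Stream the body so oversized responses can be cut off mid-download.
    resp = session.get(url, timeout=TIMEOUT, stream=True,
                       headers={"User-Agent": "scraper-v1"})
    resp.raise_for_status()
    chunks, total = [], 0
    for chunk in resp.iter_content(chunk_size=65536):
        total += len(chunk)
        if total > MAX_BYTES:
            raise ValueError(f"response exceeded {MAX_BYTES} bytes: {url}")
        chunks.append(chunk)
    return b"".join(chunks).decode(resp.encoding or "utf-8", errors="replace")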

Page Discovery Rules

  • About: /about, /company, /who-we-are
  • Careers: /careers, /jobs
  • Pricing: /pricing, /plans
  • Contact: /contact, /support
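
A sketch of these discovery rules applied to homepage anchors with BeautifulSoup; the KEY_PAGE_PATHS mapping simply mirrors the bullets above:

from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

KEY_PAGE_PATHS = {
    "about": ("/about", "/company", "/who-we-are"),
    "careers": ("/careers", "/jobs"),
    "pricing": ("/pricing", "/plans"),
    "contact": ("/contact", "/support"),
}

def discover_key_pages(homepage_html: str, base_url: str) -> dict[str, str]:
    # Scan homepage links and keep the first match per category.
    soup = BeautifulSoup(homepage_html, "html.parser")
    base_host = urlparse(base_url).netloc
    found: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        url = urljoin(base_url, a["href"])
        parsed = urlparse(url)
        if parsed.netloc != base_host:
            continue  # stay on the company's own site
        path = parsed.path.rstrip("/").lower() or "/"
        for label, prefixes in KEY_PAGE_PATHS.items():
            if label not in found and any(
                path == p or path.startswith(p + "/") for p in prefixes
            ):
                found[label] = url
    return found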

Extraction Rules (v1)

Best-effort extraction logic to pull structured company signals from raw HTML/text with confidence scoring.

Company Name

  • Prefer <meta property="og:site_name">
  • Else <title> cleanup (remove trailing " | Company")
  • Else header/logo alt text
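
A best-effort sketch of this fallback chain with BeautifulSoup; treating the first image with alt text as the logo is a crude stand-in for real header/logo detection:

import re
from bs4 import BeautifulSoup

def extract_company_name(html: str) -> str | None:
    # Fallback chain: og:site_name -> cleaned <title> -> first image alt text.
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:site_name")
    if og and og.get("content", "").strip():
        return og["content"].strip()
    if soup.title and soup.title.string:
        # Strip trailing " | Company" / " - Tagline" style suffixes.
        return re.split(r"\s*\|\s*|\s+[-–]\s+", soup.title.string.strip())[0]
    logo = soup.find("img", alt=True)  # crude stand-in for header/logo alt text
    if logo and logo["alt"].strip():
        return logo["alt"].strip()
    return None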

Summary (1-3 sentences)

  • Prefer <meta name="description">
  • Else first meaningful paragraph from homepage/about page
  • Clean boilerplate (cookie banners, nav text)
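
A sketch of the same preference order; the 60-character threshold and the cookie-banner check are rough boilerplate heuristics, not fixed rules:

from bs4 import BeautifulSoup

def extract_summary(html: str) -> str | None:
    # Prefer the meta description, else the first substantial paragraph.
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content", "").strip():
        return meta["content"].strip()
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Rough boilerplate filter: skip short fragments and cookie-banner text.
        if len(text) > 60 and "cookie" not in text.lower():
            return text
    return None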

Keywords/Topics

  • Extract from headings (h1/h2) + meta keywords
  • Remove stopwords, keep top N unique tokens
  • Normalize to lowercase
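
A sketch of this tokenization; the stopword list here is a deliberately tiny placeholder:

import re
from collections import Counter
from bs4 import BeautifulSoup

STOPWORDS = {"the", "and", "for", "with", "our", "your", "from", "that", "this"}

def extract_keywords(html: str, top_n: int = 10) -> list[str]:
    # Tokenize h1/h2 headings plus meta keywords, drop stopwords, keep top N.
    soup = BeautifulSoup(html, "html.parser")
    text = " ".join(h.get_text(" ", strip=True) for h in soup.find_all(["h1", "h2"]))
    meta = soup.find("meta", attrs={"name": "keywords"})
    if meta and meta.get("content"):
        text += " " + meta["content"].replace(",", " ")
    tokens = [t for t in re.findall(r"[a-z][a-z\-]{2,}", text.lower())
              if t not in STOPWORDS]
    return [token for token, _ in Counter(tokens).most_common(top_n)]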

Social Links

  • Find anchors with domains: linkedin.com, twitter.com/x.com, facebook.com, instagram.com, youtube.com
  • Keep the first match per platform
  • Normalize URLs to canonical format
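
A sketch of the platform matching; subdomains such as uk.linkedin.com are ignored here for brevity:

from urllib.parse import urlparse
from bs4 import BeautifulSoup

SOCIAL_HOSTS = {
    "linkedin": ("linkedin.com",),
    "twitter": ("twitter.com", "x.com"),
    "facebook": ("facebook.com",),
    "instagram": ("instagram.com",),
    "youtube": ("youtube.com",),
}

def extract_social_links(html: str) -> dict[str, str]:
    # First match per platform wins; URLs normalized to https without query strings.
    soup = BeautifulSoup(html, "html.parser")
    links: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        parsed = urlparse(a["href"])
        host = parsed.netloc.lower().removeprefix("www.")
        path = parsed.path.rstrip("/")
        for platform, hosts in SOCIAL_HOSTS.items():
            if platform not in links and host in hosts and path:
                links[platform] = f"https://{host}{path}"
    return links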

Emails

  • Regex scan only on Contact page text (reduce noise)
  • Pattern: something@domain
  • Deduplicate; filter out generic inboxes (info@, sales@, support@)
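
A sketch of the Contact-page scan; keeping generic inboxes only as a fallback is one reading of the filter rule above:

import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
GENERIC_PREFIXES = ("info@", "sales@", "support@")

def extract_emails(contact_page_text: str) -> list[str]:
    # Scan only Contact-page text to reduce noise; dedupe while preserving order.
    seen: set[str] = set()
    emails: list[str] = []
    for email in EMAIL_RE.findall(contact_page_text):
        email = email.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    # Prefer specific addresses; fall back to generic inboxes if nothing else.
    specific = [e for e in emails if not e.startswith(GENERIC_PREFIXES)]
    return specific or emails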

Phones

  • Basic patterns: (XXX) XXX-XXXX, XXX-XXX-XXXX, +X XXX XXX XXXX
  • Scan only on Contact page
  • Don't overfit; cover common US and international formats
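
A sketch covering the three patterns listed, kept deliberately loose per the note above:

import re

# US formats plus a loose international pattern; intentionally not exhaustive.
PHONE_RE = re.compile(
    r"\(\d{3}\)\s?\d{3}-\d{4}"          # (XXX) XXX-XXXX
    r"|\d{3}-\d{3}-\d{4}"               # XXX-XXX-XXXX
    r"|\+\d{1,3}(?:\s\d{2,4}){2,4}"     # +X XXX XXX XXXX
)

def extract_phones(contact_page_text: str) -> list[str]:
    # Scan only the Contact page; dedupe while preserving order.
    return list(dict.fromkeys(PHONE_RE.findall(contact_page_text)))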

Confidence Score (0-100)

Scoring Logic

  • Has company_name: +30
  • Has summary: +30
  • Has LinkedIn link: +20
  • Found about/contact page: +20
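
The scoring rules above as a function, assuming the extraction dict shape from the extractions collection, with social_links as a nested object:

def confidence_score(extraction: dict) -> int:
    # Additive scoring per the rules above; the four signals sum to at most 100.
    score = 0
    if extraction.get("company_name"):
        score += 30
    if extraction.get("summary"):
        score += 30
    if extraction.get("social_links", {}).get("linkedin"):
        score += 20
    if extraction.get("about_url") or extraction.get("contact_url"):
        score += 20
    return score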

Example Output

{
  "confidence": 100,
  "company_name": "Acme Corp",
  "summary": "Leading B2B SaaS...",
  "linkedin": "linkedin.com/company/acme",
  "about_url": "acme.com/about",
  "contact_url": "acme.com/contact"
}

FastAPI Endpoints

RESTful API for job submission, status polling, and one-shot extraction.

POST /crawl

Submit a new crawl job for a domain

Request Body

{
  "domain": "example.com",
  "mode": "simple"
}

Response

{
  "job_id": "uuid-123",
  "status": "QUEUED"
}
GET /crawl/{job_id}

Check status and get extraction results when ready

Response

{
  "job_id": "uuid-123",
  "status": "COMPLETED",
  "extraction": {
    "company_name": "Example Corp",
    "summary": "Leading B2B SaaS provider...",
    "confidence": 95
  }
}
GET /crawl/{job_id}/pages

Get the list of fetched pages with metadata (optionally without full HTML)

Response

{
  "pages": [
    {
      "url": "https://example.com",
      "title": "Example Corp - Homepage",
      "http_status": 200
    }
  ]
}
POST /extract

One-shot mode: crawl + extract in a single call for quick testing

Request Body

{
  "url": "https://example.com"
}

Response

{
  "company_name": "Example Corp",
  "summary": "Leading B2B SaaS provider...",
  "confidence": 95
}
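
A skeletal wiring of the first two endpoints in FastAPI; the in-memory jobs dict and the run_crawl_job worker are stand-ins for MongoDB and the real crawl pipeline:

from uuid import uuid4
from fastapi import BackgroundTasks, FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
jobs: dict[str, dict] = {}  # stand-in for the crawl_jobs collection

class CrawlRequest(BaseModel):
    domain: str
    mode: str = "simple"

@app.post("/crawl")
def submit_crawl(req: CrawlRequest, background_tasks: BackgroundTasks):
    # Create the job record, kick off the crawl, return immediately.
    job_id = str(uuid4())
    jobs[job_id] = {"job_id": job_id, "status": "QUEUED", "extraction": None}
    background_tasks.add_task(run_crawl_job, job_id, req.domain)
    return {"job_id": job_id, "status": "QUEUED"}

@app.get("/crawl/{job_id}")
def get_crawl(job_id: str):
    # Status polling; includes the extraction once the job completes.
    job = jobs.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="job not found")
    return job

def run_crawl_job(job_id: str, domain: str) -> None:
    # Placeholder worker: fetch pages, run extractors, mark COMPLETED/FAILED.
    jobs[job_id]["status"] = "COMPLETED"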

Optional React UI

Minimal testing interface for submitting crawl jobs and viewing extraction results.

Domain Input

Simple form to enter company domain or URL

Crawl Status

Real-time progress tracking (QUEUED → FETCHING → EXTRACTING → COMPLETED)

Extraction Results

Display company name, summary, keywords, social links in clean UI

Page List

Show fetched pages with status codes and titles

UI Purpose

  • Quick testing and validation of extraction logic
  • Useful for demos and manual QA
  • Not required for production API usage
  • Can be built with React + Tailwind in a few hours

Production Benefits

A focused, production-ready service that extracts company signals safely and reliably.

Domain-to-Signal Pipeline

Turn any company domain into structured enrichment data

Safe & Reliable

Hard limits prevent runaway crawls and resource exhaustion

Full Audit Trail

Raw HTML + extracted JSON stored for debugging and reprocessing

Fast Development

Simple v1 scope with clear extension path for advanced features

Done Criteria

  • Enter 20 random business domains and get a consistent company name, summary, and LinkedIn URL for most of them
  • Stored raw pages + extracted fields available in MongoDB for audit
  • Safe limits prevent long crawls, huge downloads, and timeout issues
  • Confidence scoring provides quality signal for downstream systems
  • Optional React UI for manual testing and demos

Next Steps: Add a URL classification function, a clean-text extractor recipe, and a Playwright fallback for JS-heavy sites in v2.
