Website Scraper + Field Extractor
Domain-to-signal mini-service that safely scrapes company websites and extracts structured fields for enrichment and personalization.
Smart Crawling
Fetches homepage + key pages (About, Careers, Pricing, Contact) with safe limits and timeout controls
Field Extraction
Extracts company name, summary, keywords, social links, and contact info with confidence scoring
Audit Trail
Stores raw HTML/text + extracted JSON in MongoDB for audit, reuse, and debugging
Scope & Features
A focused mini-service that takes a company domain/URL, safely fetches the website, and returns structured company signals.
Must-Have (v1)
- Input domain or URL for crawling
- Fetch homepage + key pages (About, Careers, Pricing, Contact)
- Extract company name, description (1-3 sentences), keywords/topics
- Detect social links (LinkedIn, Twitter/X, Facebook, Instagram, YouTube)
- Find emails/phones on contact pages (optional)
- Store raw HTML/text + extracted JSON for audit and reuse
Nice-to-Have (Later)
- robots.txt respect + crawl budget management
- sitemap.xml support for discovering pages
- Playwright fallback for JS-rendered (dynamic) sites
- Language detection for internationalized content
- Batch scraping mode for processing multiple domains
Three-Tier Architecture
FastAPI service with BeautifulSoup scraper and MongoDB storage for scalable web extraction.
API Layer
FastAPI: RESTful API for crawl job submission, status checks, and extraction retrieval
Scraping Engine
Requests + BeautifulSoup: Fetches pages safely with timeout, size, and redirect controls; Playwright fallback for JS-rendered sites
Storage Layer
MongoDB: Stores crawl jobs, raw pages, and extracted fields with a full audit trail
MongoDB Collections
Three collections to track job state, store raw pages, and maintain extracted company data.
crawl_jobs
Tracks crawl job lifecycle from QUEUED to COMPLETED/FAILED
| Field | Type | Description |
|---|---|---|
| job_id | uuid | Unique job identifier |
| input_url | string | Original domain or URL |
| status | enum | QUEUED \| FETCHING \| EXTRACTING \| COMPLETED \| FAILED |
| started_at | timestamp | Job start time |
| finished_at | timestamp | Job completion time |
| error | string | Error message if failed |
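For illustration, the status lifecycle maps naturally onto a small Python enum, and a helper can build a crawl_jobs document in the shape of the table above. Names like `JobStatus` and `new_crawl_job` are illustrative, not part of the spec:

```python
import uuid
from datetime import datetime, timezone
from enum import Enum

class JobStatus(str, Enum):
    """Lifecycle states for a crawl job (mirrors the enum column above)."""
    QUEUED = "QUEUED"
    FETCHING = "FETCHING"
    EXTRACTING = "EXTRACTING"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"

def new_crawl_job(input_url: str) -> dict:
    """Build a crawl_jobs document matching the schema above."""
    return {
        "job_id": str(uuid.uuid4()),
        "input_url": input_url,
        "status": JobStatus.QUEUED.value,
        "started_at": datetime.now(timezone.utc),
        "finished_at": None,
        "error": None,
    }
```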
pages
Raw pages fetched during crawl with metadata
| Field | Type | Description |
|---|---|---|
| job_id | uuid | Reference to crawl job |
| url | string | Page URL |
| http_status | int | HTTP status code |
| content_type | string | Content-Type header |
| html | text | Raw HTML (optional; can be large) |
| text | text | Cleaned text content |
| title | string | Page title |
| meta_description | string | Meta description tag |
| fetched_at | timestamp | Fetch timestamp |
extractions
Structured company data extracted from crawled pages
| Field | Type | Description |
|---|---|---|
| job_id | uuid | Reference to crawl job |
| domain | string | Company domain |
| company_name | string | Extracted company name |
| summary | text | 1-3 sentence company description |
| keywords | array | List of relevant keywords/topics |
| social_links | object | LinkedIn, Twitter, Facebook, Instagram, YouTube links |
| detected_emails | array | Emails found on contact page |
| detected_phones | array | Phone numbers found on contact page |
| about_url | string | About page URL if found |
| careers_url | string | Careers page URL if found |
| pricing_url | string | Pricing page URL if found |
| contact_url | string | Contact page URL if found |
| confidence | int | 0-100 confidence score |
| created_at | timestamp | Extraction timestamp |
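A minimal storage sketch, assuming pymongo; the connection string, database name, and helper names are placeholders, not part of the spec:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # connection string is an assumption
db = client["scraper"]  # database name is a placeholder
db.extractions.create_index("job_id")  # job_id lookups are the common access path

def save_extraction(extraction: dict) -> None:
    """Insert an extractions document shaped like the table above."""
    db.extractions.insert_one(extraction)

def get_extraction(job_id: str) -> dict | None:
    """Fetch the extraction for a finished job, without the Mongo _id."""
    return db.extractions.find_one({"job_id": job_id}, {"_id": 0})
```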
Crawling Strategy (v1)
Deterministic, small-footprint approach that focuses on key pages with hard safety limits.
Crawl Steps
1. Normalize Input: convert a bare domain to https://{domain}
2. Fetch Homepage: download the main landing page
3. Discover Key Pages: find About, Careers, Pricing, and Contact links on the homepage
4. Fetch Key Pages: download the discovered pages (max 5 pages total)
Safety Limits
- Max 5 pages per crawl
- Per-request timeout
- Response size limit
- Redirect controls
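A rough sketch of steps 1-2 with these limits enforced, using requests; the specific numeric values are assumptions, not spec'd:

```python
import requests

MAX_PAGES = 5            # hard cap from the crawl steps above
TIMEOUT_SECONDS = 10     # per-request timeout (assumed value)
MAX_BYTES = 2_000_000    # response size limit (assumed value)
MAX_REDIRECTS = 5        # redirect control (assumed value)

def normalize_url(domain_or_url: str) -> str:
    """Step 1: convert a bare domain into an https:// URL."""
    if domain_or_url.startswith(("http://", "https://")):
        return domain_or_url
    return f"https://{domain_or_url}"

def safe_fetch(url: str) -> str | None:
    """Fetch one page with timeout, size, and redirect limits; return HTML or None."""
    session = requests.Session()
    session.max_redirects = MAX_REDIRECTS
    try:
        resp = session.get(url, timeout=TIMEOUT_SECONDS, stream=True,
                           headers={"User-Agent": "mini-scraper/0.1"})
        resp.raise_for_status()
        # Read at most MAX_BYTES + 1 so oversized bodies can be detected and skipped
        body = resp.raw.read(MAX_BYTES + 1, decode_content=True)
        if len(body) > MAX_BYTES:
            return None
        return body.decode(resp.encoding or "utf-8", errors="replace")
    except requests.RequestException:
        return None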
Page Discovery Rules
- About: /about, /company, /who-we-are
- Careers: /careers, /jobs
- Pricing: /pricing, /plans
- Contact: /contact, /support
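The discovery step could look roughly like this with BeautifulSoup: match homepage anchors against the path fragments above and keep the first hit per category (helper names are illustrative):

```python
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Path fragments from the discovery rules above
KEY_PAGE_PATTERNS = {
    "about": ("/about", "/company", "/who-we-are"),
    "careers": ("/careers", "/jobs"),
    "pricing": ("/pricing", "/plans"),
    "contact": ("/contact", "/support"),
}

def discover_key_pages(homepage_url: str, html: str) -> dict[str, str]:
    """Match homepage links against the patterns; first hit per category wins."""
    soup = BeautifulSoup(html, "html.parser")
    found: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        url = urljoin(homepage_url, a["href"])
        path = urlparse(url).path.rstrip("/").lower()
        for category, patterns in KEY_PAGE_PATTERNS.items():
            if category not in found and any(path.endswith(p) for p in patterns):
                found[category] = url
    return found
```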
Extraction Rules (v1)
Best-effort extraction logic to pull structured company signals from raw HTML/text with confidence scoring.
Company Name
- Prefer <meta property="og:site_name">
- Else <title> cleanup (remove trailing " | Company")
- Else header/logo alt text
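A minimal sketch of this fallback chain; the logo alt-text fallback is omitted for brevity:

```python
from bs4 import BeautifulSoup

def extract_company_name(html: str) -> str | None:
    """Try og:site_name first, then a cleaned <title>."""
    soup = BeautifulSoup(html, "html.parser")
    og = soup.find("meta", property="og:site_name")
    if og and og.get("content"):
        return og["content"].strip()
    if soup.title and soup.title.string:
        # Drop trailing taglines like "Acme | Home" or "Acme - Pricing"
        return soup.title.string.split("|")[0].split(" - ")[0].strip()
    return None
```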
Summary (1-3 sentences)
- Prefer <meta name="description">
- Else first meaningful paragraph from homepage/about page
- Clean out boilerplate (cookie banners, nav text)
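A possible implementation, using a crude length threshold as the "meaningful paragraph" test; the threshold value is an assumption:

```python
from bs4 import BeautifulSoup

def extract_summary(html: str, min_length: int = 60) -> str | None:
    """Prefer the meta description; otherwise the first substantial paragraph."""
    soup = BeautifulSoup(html, "html.parser")
    meta = soup.find("meta", attrs={"name": "description"})
    if meta and meta.get("content"):
        return meta["content"].strip()
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        # Length threshold and keyword check are crude boilerplate filters
        if len(text) >= min_length and "cookie" not in text.lower():
            return text
    return None
```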
Keywords/Topics
- Extract from headings (h1/h2) + meta keywords
- Remove stopwords, keep top N unique tokens
- Normalize to lowercase
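One way to implement this; the stopword list here is a tiny illustrative subset, and the meta keywords source is omitted:

```python
from collections import Counter
from bs4 import BeautifulSoup

# Minimal stopword list for illustration; a real one would be much larger
STOPWORDS = {"the", "a", "an", "and", "or", "for", "of", "to", "in", "with", "your", "our"}

def extract_keywords(html: str, top_n: int = 10) -> list[str]:
    """Tokenize h1/h2 headings, drop stopwords, return top-N frequent tokens."""
    soup = BeautifulSoup(html, "html.parser")
    tokens: list[str] = []
    for tag in soup.find_all(["h1", "h2"]):
        for word in tag.get_text(" ", strip=True).lower().split():
            word = word.strip(".,:;!?()")
            if word and word.isalpha() and word not in STOPWORDS:
                tokens.append(word)
    return [word for word, _ in Counter(tokens).most_common(top_n)]
```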
Social Links
- Find anchors pointing at linkedin.com, twitter.com/x.com, facebook.com, instagram.com, or youtube.com
- Keep the first best match per platform
- Normalize URLs to canonical form
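A sketch of per-platform first-match detection; canonical URL normalization is left as a follow-up step:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

SOCIAL_DOMAINS = {
    "linkedin": ("linkedin.com",),
    "twitter": ("twitter.com", "x.com"),
    "facebook": ("facebook.com",),
    "instagram": ("instagram.com",),
    "youtube": ("youtube.com",),
}

def extract_social_links(html: str) -> dict[str, str]:
    """Keep the first matching anchor per platform."""
    soup = BeautifulSoup(html, "html.parser")
    links: dict[str, str] = {}
    for a in soup.find_all("a", href=True):
        host = urlparse(a["href"]).netloc.lower().removeprefix("www.")
        for platform, domains in SOCIAL_DOMAINS.items():
            if platform not in links and any(
                host == d or host.endswith("." + d) for d in domains
            ):
                links[platform] = a["href"]
    return links
```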
Emails
- Regex scan only on Contact page text (reduces noise)
- Pattern: something@domain
- Filter out duplicate info@, sales@, support@ addresses
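A minimal email scan along these lines; the deliberately simple regex is an assumption, not a spec'd pattern:

```python
import re

# Deliberately simple pattern; scanned against contact-page text only
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(contact_text: str) -> list[str]:
    """Scan contact-page text and return deduplicated, lowercased emails."""
    seen: list[str] = []
    for email in EMAIL_RE.findall(contact_text):
        email = email.lower()
        if email not in seen:
            seen.append(email)
    return seen
```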
Phones
- Basic patterns: (XXX) XXX-XXXX, XXX-XXX-XXXX, +X XXX XXX XXXX
- Scan only on Contact page
- Don't overfit; cover common US and international formats
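A matching sketch for phones, covering the three formats listed above:

```python
import re

# Covers (XXX) XXX-XXXX, XXX-XXX-XXXX, and +X XXX XXX XXXX style numbers
PHONE_RE = re.compile(
    r"(?:\+\d{1,3}[\s.-]?)?(?:\(\d{3}\)\s?|\d{3}[\s.-])\d{3}[\s.-]\d{4}"
)

def extract_phones(contact_text: str) -> list[str]:
    """Scan contact-page text for common phone formats, preserving order."""
    return list(dict.fromkeys(m.strip() for m in PHONE_RE.findall(contact_text)))
```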
Confidence Score (0-100)
Scoring Logic
- Has company_name: +30
- Has summary: +30
- Has LinkedIn link: +20
- Found about/contact page: +20
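The additive rules translate directly into a small scoring function; field names follow the extractions schema above:

```python
def confidence_score(extraction: dict) -> int:
    """Apply the additive scoring rules above, capped at 100."""
    score = 0
    if extraction.get("company_name"):
        score += 30
    if extraction.get("summary"):
        score += 30
    if extraction.get("social_links", {}).get("linkedin"):
        score += 20
    if extraction.get("about_url") or extraction.get("contact_url"):
        score += 20
    return min(score, 100)
```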
Example Output

```json
{
  "confidence": 100,
  "company_name": "Acme Corp",
  "summary": "Leading B2B SaaS...",
  "linkedin": "linkedin.com/company/acme",
  "about_url": "acme.com/about",
  "contact_url": "acme.com/contact"
}
```

FastAPI Endpoints
RESTful API for job submission, status polling, and one-shot extraction.
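A minimal FastAPI sketch of the first two endpoints below, using an in-memory job store in place of MongoDB to stay self-contained; the background worker that advances job status is out of scope here:

```python
import uuid
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Website Scraper + Field Extractor")

class CrawlRequest(BaseModel):
    domain: str
    mode: str = "simple"

# In-memory store for illustration; the real service would use MongoDB
JOBS: dict[str, dict] = {}

@app.post("/crawl")
def submit_crawl(req: CrawlRequest) -> dict:
    """Queue a crawl job and return its id, mirroring the response shown below."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"job_id": job_id, "input_url": req.domain, "status": "QUEUED"}
    # A background worker (not shown) would advance status through the lifecycle
    return {"job_id": job_id, "status": "QUEUED"}

@app.get("/crawl/{job_id}")
def get_crawl(job_id: str) -> dict:
    """Return job status, plus the extraction once the job is COMPLETED."""
    job = JOBS.get(job_id)
    if job is None:
        raise HTTPException(status_code=404, detail="unknown job_id")
    return job
```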
POST /crawl: Submit a new crawl job for a domain

Request Body

```json
{
  "domain": "example.com",
  "mode": "simple"
}
```

Response

```json
{
  "job_id": "uuid-123",
  "status": "QUEUED"
}
```

GET /crawl/{job_id}: Check status and get extraction results when ready
Response

```json
{
  "job_id": "uuid-123",
  "status": "COMPLETED",
  "extraction": {
    "company_name": "Example Corp",
    "summary": "Leading B2B SaaS provider...",
    "confidence": 95
  }
}
```

GET /crawl/{job_id}/pages: Get the list of fetched pages with metadata (optionally without full HTML)
Response

```json
{
  "pages": [
    {
      "url": "https://example.com",
      "title": "Example Corp - Homepage",
      "http_status": 200
    }
  ]
}
```

POST /extract: One-shot mode, crawl + extract in a single call for quick testing
Request Body

```json
{
  "url": "https://example.com"
}
```

Response

```json
{
  "company_name": "Example Corp",
  "summary": "Leading B2B SaaS provider...",
  "confidence": 95
}
```

Optional React UI
Minimal testing interface for submitting crawl jobs and viewing extraction results.
Domain Input
Simple form to enter company domain or URL
Crawl Status
Real-time progress tracking (QUEUED → FETCHING → EXTRACTING → COMPLETED)
Extraction Results
Display company name, summary, keywords, social links in clean UI
Page List
Show fetched pages with status codes and titles
UI Purpose
- Quick testing and validation of extraction logic
- Useful for demos and manual QA
- Not required for production API usage
- Can be built with React + Tailwind in a few hours
Production Benefits
A focused, production-ready service that extracts company signals safely and reliably.
Domain-to-Signal Pipeline
Turn any company domain into structured enrichment data
Safe & Reliable
Hard limits prevent runaway crawls and resource exhaustion
Full Audit Trail
Raw HTML + extracted JSON stored for debugging and reprocessing
Fast Development
Simple v1 scope with clear extension path for advanced features
Done Criteria
- Enter 20 random business domains and get a consistent company name, summary, and LinkedIn link for most of them
- Stored raw pages + extracted fields available in MongoDB for audit
- Safe limits prevent long crawls, oversized downloads, and timeouts
- Confidence scoring provides quality signal for downstream systems
- Optional React UI for manual testing and demos
Next Steps: Add a URL classification function, a clean-text extractor recipe, and a Playwright fallback for JS-heavy sites in v2.