Intelligent Document Ingestion, Crawling & Semantic Search
A unified platform combining intelligent web crawling, document chunking, embedding, indexing, and semantic search to transform unstructured content into searchable knowledge.
Overview
Ensar Web & Search is an end-to-end document ingestion and crawling platform. It crawls websites, extracts and chunks text, generates vector embeddings through an OpenAI-compatible API, and indexes them in DuckDB with an HNSW index for fast vector similarity search. It supports real-time job management, semantic search, tagging, fuzzy filtering, and document APIs served through FastAPI. Designed for building knowledge bases from internal or external web content, it lets organizations search by meaning, not just keywords.
Highlights
- Fully automated web crawling and ingestion pipeline
- Chunked semantic indexing using OpenAI-compatible embeddings
- Real-time semantic search over embedded content
- Fuzzy tag search and full-text retrieval
- Scalable, modular, and cloud-ready architecture
- Real-time monitoring of crawling and processing progress
- Flexible tagging to organize and categorize content
- Searchable knowledge bases built from web content
- Retrieval based on meaning, not just keywords
Key Features
Asynchronous Web Crawler
Discovers and extracts content from websites using a breadth-first crawl strategy built on crawl4ai, and filters out non-essential content to produce high-quality markdown output.
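The breadth-first strategy can be illustrated with a minimal, library-independent sketch. It does not use crawl4ai's API; `crawl_bfs`, `fetch_links`, and the toy `SITE` graph are hypothetical names introduced here, and a real crawler would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_bfs(
    start_url: str,
    fetch_links: Callable[[str], Iterable[str]],
    max_depth: int = 2,
) -> list:
    """Return URLs in the order a breadth-first crawl would visit them."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue  # pages at the depth limit are visited but not expanded
        for link in fetch_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# Toy link graph standing in for real HTTP fetches (assumption for the demo).
SITE = {
    "/": ["/docs", "/blog"],
    "/docs": ["/docs/api", "/"],
    "/blog": ["/blog/post-1"],
    "/docs/api": [],
    "/blog/post-1": [],
}

if __name__ == "__main__":
    print(crawl_bfs("/", lambda url: SITE.get(url, []), max_depth=2))
```

Because the queue is first-in, first-out, every page at depth 1 is visited before any page at depth 2, which is what keeps a depth-limited crawl close to the entry page.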
Chunking Engine
Uses LangChain text splitters to segment long-form content into semantically meaningful chunks with configurable overlap.
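To show what "configurable overlap" means in practice, here is a simplified fixed-size character splitter. It is a stand-in, not LangChain's implementation: LangChain's splitters also try to break on separators such as paragraphs and sentences, and the `chunk_text` name and its parameters are assumptions made for this sketch.

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

The overlap repeats the tail of each chunk at the head of the next, so a sentence that straddles a boundary still appears whole in at least one chunk.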
Semantic Vector Processing
Transforms content into 1536-dimensional vector embeddings through LiteLLM (OpenAI-compatible), so that chunks can be compared by semantic similarity.
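The embedding step can be sketched as a batching loop around a pluggable embedding function. The example below makes no real API call: `embed_chunks`, `embed_batch`, and `fake_embedder` are hypothetical names, and in a real deployment `embed_batch` would wrap the embedding client. Only the 1536 dimension comes from the description above.

```python
from typing import Callable, Sequence

EXPECTED_DIM = 1536  # embedding dimension stated in the feature description


def embed_chunks(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], Sequence[Sequence[float]]],
    batch_size: int = 16,
    expected_dim: int = EXPECTED_DIM,
) -> list:
    """Embed chunks in batches and verify the shape of every returned vector."""
    vectors = []
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        batch_vectors = embed_batch(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embedder returned a different number of vectors than inputs")
        for vec in batch_vectors:
            if len(vec) != expected_dim:
                raise ValueError(f"expected {expected_dim} dimensions, got {len(vec)}")
            vectors.append(list(vec))
    return vectors


def fake_embedder(texts: Sequence[str]) -> list:
    """Deterministic placeholder; a real embedding client call would go here."""
    return [[float(len(t))] * EXPECTED_DIM for t in texts]


if __name__ == "__main__":
    result = embed_chunks(["alpha", "beta", "gamma"], fake_embedder, batch_size=2)
    print(len(result), len(result[0]))
```

Checking the dimension at ingestion time is a cheap guard: a vector of the wrong length would otherwise only fail later, when it is written to a fixed-width index column.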
Vector Indexing & Similarity Search
Indexes vectors in DuckDB with an HNSW (Hierarchical Navigable Small World) index for fast cosine similarity search.
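For intuition about what the index returns, the sketch below runs an exact, brute-force cosine search over a tiny in-memory collection. It is not DuckDB code and not an HNSW implementation: an HNSW index approximates this same nearest-neighbor ranking without scanning every vector. The names `cosine_distance`, `top_k`, and `INDEX` are illustrative assumptions.

```python
import heapq
import math


def cosine_distance(a, b) -> float:
    """1 - cosine similarity; 0.0 means the vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, index: dict, k: int = 3) -> list:
    """Return the k (doc_id, distance) pairs closest to the query, nearest first."""
    scored = ((doc_id, cosine_distance(query, vec)) for doc_id, vec in index.items())
    return heapq.nsmallest(k, scored, key=lambda pair: pair[1])


# Three-dimensional toy vectors; real embeddings would have 1536 dimensions.
INDEX = {
    "a": [1.0, 0.0, 0.0],
    "b": [0.9, 0.1, 0.0],
    "c": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for doc_id, distance in top_k([1.0, 0.0, 0.0], INDEX, k=2):
        print(doc_id, round(distance, 4))
```

The linear scan costs one distance computation per stored vector per query, which is the cost an approximate index is meant to avoid as the corpus grows.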
Distributed Job Management
Manages crawl and indexing tasks asynchronously using Redis Queue (RQ) with built-in job tracking, status updates, and fault tolerance.
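The job lifecycle described here can be sketched with an in-memory stand-in. This is not the Redis Queue (RQ) API and keeps no state in Redis: `InMemoryQueue`, `Job`, and `work_one` are hypothetical names, and the status labels are chosen for illustration. The point is only to show how a job moves from queued to finished or failed without a failure stopping the worker.

```python
import uuid
from collections import deque
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Job:
    func: Callable[..., Any]
    args: tuple
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: str = "queued"
    result: Any = None
    error: Optional[str] = None


class InMemoryQueue:
    """Single-process stand-in for a Redis-backed job queue."""

    def __init__(self) -> None:
        self._pending = deque()
        self.jobs = {}  # job id -> Job, used for status lookups

    def enqueue(self, func: Callable[..., Any], *args: Any) -> Job:
        job = Job(func=func, args=args)
        self.jobs[job.id] = job
        self._pending.append(job)
        return job

    def work_one(self) -> bool:
        """Run the oldest pending job; return False when the queue is empty."""
        if not self._pending:
            return False
        job = self._pending.popleft()
        job.status = "started"
        try:
            job.result = job.func(*job.args)
            job.status = "finished"
        except Exception as exc:  # record the failure instead of crashing the worker
            job.error = repr(exc)
            job.status = "failed"
        return True


if __name__ == "__main__":
    queue = InMemoryQueue()
    job = queue.enqueue(len, "crawl this page")
    queue.work_one()
    print(job.id, job.status, job.result)
```

Keeping every job in a lookup table keyed by id is what makes status polling possible: a client holds only the id and asks for the current state later.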
Tag and Domain Filtering
Supports both exact and fuzzy tag filtering using JSON-based filters and Levenshtein distance matching for flexible document discovery.
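Fuzzy tag matching by Levenshtein distance can be sketched in a few lines of standard-library Python. The function names and the `max_distance` threshold below are assumptions made for illustration; the platform's actual threshold and normalization rules are not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match_tags(query: str, tags, max_distance: int = 2) -> list:
    """Return tags within max_distance edits of the query, closest first."""
    normalized = query.lower()
    scored = [(levenshtein(normalized, tag.lower()), tag) for tag in tags]
    return [tag for distance, tag in sorted(scored) if distance <= max_distance]


if __name__ == "__main__":
    print(fuzzy_match_tags("pyhton", ["python", "java", "Python3"]))
```

An exact filter is the special case `max_distance=0`; raising the threshold trades precision for tolerance of typos.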
RESTful API with FastAPI
Provides endpoints for document search, retrieval, tag listing, and job monitoring with pagination and structured responses.
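A paginated, structured response might take a shape like the one below. This is a framework-independent sketch rather than the platform's actual schema: the `paginate` helper and the field names (`items`, `page`, `page_size`, `total`, `total_pages`) are assumptions, and in a FastAPI application a dictionary like this would simply be returned from a route handler.

```python
import math
from typing import Any, Sequence


def paginate(items: Sequence[Any], page: int = 1, page_size: int = 10) -> dict:
    """Slice a result list into one page and attach paging metadata."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be >= 1")
    total = len(items)
    start = (page - 1) * page_size
    return {
        "items": list(items[start:start + page_size]),
        "page": page,
        "page_size": page_size,
        "total": total,
        "total_pages": math.ceil(total / page_size),
    }


if __name__ == "__main__":
    documents = [f"doc-{n}" for n in range(1, 8)]
    print(paginate(documents, page=2, page_size=3))
```

Returning the total alongside the page lets a client render page controls without issuing a second count request.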
Common Use Cases
Enterprise Knowledge Search
Index internal wikis and documentation for employees to search semantically, reducing time spent navigating siloed resources.
Technical Documentation Search
Index and unify product, API, and support documentation for fast developer access to relevant topics using semantic matching.
Compliance Document Management
Crawl and semantically search legal and regulatory sites to build a compliance repository searchable by intent, not just keywords.
Research Knowledge Base
Aggregate research papers, journals, and publications into a semantic knowledge system for concept-based academic exploration.
AI Agent Knowledge Base
Feed AI agents with a dense vector-indexed corpus to enhance conversational understanding and domain-specific QA capabilities.
Educational Content Repository
Crawl and index online learning materials to enable students and educators to discover educational content by topic and concept.
Competitive Intelligence
Continuously crawl and index industry and competitor sites to build a semantic market intelligence dashboard.