Learning Block

Document Parsing & Chunking

Master text extraction from PDFs and DOCX files, implement intelligent chunking strategies with overlap, and persist structured data in PostgreSQL for RAG and semantic search applications.

Text Extraction
PDF, DOCX, Plain Text
Smart Chunking
500-700 tokens, 20-30% overlap
PostgreSQL Storage
Structured, Retrievable
// Success Response
{
"documentId": "uuid-123",
"chunksCreated": 18
}

Learning Objectives & Scope

Build a production-ready document parsing and chunking service to master text extraction, chunking strategies, and structured data persistence for RAG applications.

Goal

Master document parsing, chunking strategies, and text persistence for building RAG pipelines and semantic search applications.

Success Criteria

  • Clean text extraction from PDFs and DOCX
  • Stable chunk boundaries with overlap
  • Stored and retrievable chunks

Tech Stack

FastAPI
PyPDF2, python-docx
PostgreSQL + SQLAlchemy

Must-Have Features

  • Accept raw text or uploaded file (PDF, DOCX)
  • Extract clean text from documents
  • Chunk text into 500-700 tokens with 20-30% overlap
  • Store documents and chunks in PostgreSQL
  • Retrieve chunks by document ID
  • Handle parsing errors gracefully

Nice-to-Have Features

  • Support for additional formats (HTML, Markdown)
  • Configurable chunking strategies (semantic, fixed, sliding)
  • Metadata extraction (title, author, creation date)
  • Chunk deduplication across documents
  • Full-text search on chunks
  • Batch document processing
Three-Tier Architecture

System Architecture

FastAPI service with parser layer, chunking engine, and PostgreSQL storage for scalable document processing.

Parser Layer

PyPDF2, python-docx, charset detection

PDF text extraction with layout preservation
DOCX parsing with style information
Plain text with encoding detection
Error handling for corrupted files

Chunking Engine

Token-based splitting with overlap

Configurable chunk size (500-700 tokens)
Overlap percentage (20-30%)
Sentence boundary detection
Metadata preservation (source, position)

Storage Layer

PostgreSQL with SQLAlchemy ORM

documents table (id, title, raw_text)
chunks table (id, document_id, chunk_index, text)
Efficient indexing and retrieval
Transaction management
PostgreSQL Schema

Data Model

Two-table schema with documents and chunks, designed for efficient storage and retrieval.

documents

Master document records

id          UUID          PRIMARY KEY
title       VARCHAR(500)  NOT NULL
raw_text    TEXT          NOT NULL
file_type   VARCHAR(10)   NOT NULL
created_at  TIMESTAMP     DEFAULT NOW()

chunks

Text chunks with position tracking

id           UUID       PRIMARY KEY
document_id  UUID       FOREIGN KEY → documents(id)
chunk_index  INTEGER    NOT NULL
text         TEXT       NOT NULL
token_count  INTEGER    NOT NULL
created_at   TIMESTAMP  DEFAULT NOW()

Relationship

chunks.document_id → documents.id (FOREIGN KEY, CASCADE DELETE)
Each document can have multiple chunks. Deleting a document automatically removes all associated chunks.
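
The two tables map naturally onto SQLAlchemy ORM models. A minimal sketch, assuming SQLAlchemy 1.4+: `String(36)` UUIDs are used here for portability across databases (on PostgreSQL you would typically use `sqlalchemy.dialects.postgresql.UUID`), and the ORM-level cascade mirrors the CASCADE DELETE relationship above.

```python
import uuid

from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String, Text,
                        func)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Document(Base):
    __tablename__ = "documents"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    title = Column(String(500), nullable=False)
    raw_text = Column(Text, nullable=False)
    file_type = Column(String(10), nullable=False)
    created_at = Column(DateTime, server_default=func.now())

    # Deleting a Document cascades to its Chunks at the ORM level
    chunks = relationship("Chunk", back_populates="document",
                          cascade="all, delete-orphan",
                          order_by="Chunk.chunk_index")


class Chunk(Base):
    __tablename__ = "chunks"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    document_id = Column(String(36),
                         ForeignKey("documents.id", ondelete="CASCADE"),
                         nullable=False)
    chunk_index = Column(Integer, nullable=False)
    text = Column(Text, nullable=False)
    token_count = Column(Integer, nullable=False)
    created_at = Column(DateTime, server_default=func.now())

    document = relationship("Document", back_populates="chunks")
```

`ondelete="CASCADE"` covers deletes issued directly in SQL, while `cascade="all, delete-orphan"` covers deletes performed through the session, so both paths remove orphaned chunks.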
Intelligent Chunking

Chunking Strategy

Fixed-size chunking with overlap ensures context preservation and optimal performance for downstream RAG and embedding tasks.

Fixed Token Size

Each chunk targets 500-700 tokens

Chunk 1: 650 tokens, Chunk 2: 620 tokens, Chunk 3: 580 tokens

Percentage Overlap

20-30% overlap between consecutive chunks

If chunk size = 600 tokens, overlap = 120-180 tokens (e.g., at 25%, the last 150 tokens of chunk N become the first 150 tokens of chunk N+1)
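
As a quick sanity check on the arithmetic for a 600-token chunk (variable names here are illustrative):

```python
chunk_size = 600

for pct in (0.20, 0.25, 0.30):
    overlap = int(chunk_size * pct)
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    print(f"{pct:.0%} overlap -> {overlap} shared tokens, stride {stride}")
```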

Sentence Boundaries

Prefer splitting at sentence boundaries

Instead of cutting mid-sentence, each chunk ends after a complete sentence

Chunking Algorithm

1
Tokenize Full Text
Use tiktoken or similar to count tokens in the document
2
Calculate Chunk Boundaries
Determine start and end positions for each chunk with configured overlap
3
Adjust for Sentence Boundaries
Fine-tune boundaries to end at sentence completion when possible
4
Create Chunk Records
Store each chunk with document_id, chunk_index, text, and token_count
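
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: whitespace-separated words stand in for real tokens (a real service would count tokens with tiktoken, as in step 1), and the function name and parameters are illustrative.

```python
def chunk_text(text, chunk_size=600, overlap=150):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Words stand in for tokens here; swap in a tiktoken encoder for
    accurate token counts.
    """
    tokens = text.split()
    stride = chunk_size - overlap  # how far each chunk's start advances
    chunks = []
    start = 0
    index = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append({
            "chunk_index": index,
            "text": " ".join(chunk_tokens),
            "token_count": len(chunk_tokens),
        })
        if end == len(tokens):
            break  # final chunk reached the end of the document
        start += stride
        index += 1
    return chunks
```

Sentence-boundary adjustment (step 3) would nudge `end` backward to the nearest sentence-final token before materializing each chunk; it is omitted here to keep the sketch short.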

Document Parsers

Dedicated parsers for each file format ensure clean text extraction with proper handling of encoding, layout, and formatting.

PDF

PyPDF2
Extract text from all pages
Preserve whitespace and layout
Handle encrypted PDFs with password
Detect and handle image-based PDFs (OCR fallback)
from PyPDF2 import PdfReader

def extract_pdf_text(file_path):
    reader = PdfReader(file_path)
    pages = []
    for page in reader.pages:
        # extract_text() can return None for image-only pages
        pages.append(page.extract_text() or "")
    return "\n".join(pages)

DOCX

python-docx
Extract paragraphs and runs
Preserve formatting metadata
Handle tables and lists
Extract header/footer content
from docx import Document

def extract_docx_text(file_path):
    doc = Document(file_path)
    # doc.paragraphs covers body text; tables and headers/footers
    # need separate traversal
    return "\n".join(para.text for para in doc.paragraphs)

Plain Text

chardet
Auto-detect character encoding
Handle UTF-8, ASCII, Latin-1
Normalize line endings
Strip control characters
import chardet

def extract_text(file_bytes):
    detected = chardet.detect(file_bytes)
    # Fall back to UTF-8 when detection is inconclusive
    encoding = detected["encoding"] or "utf-8"
    return file_bytes.decode(encoding, errors="replace").strip()
RESTful API

API Endpoints

Simple and intuitive API for uploading documents and retrieving chunked text.

POST /parse

Upload and parse a document

Request
{
  "file": <binary>,
  "title": "Q4 Earnings Report"
}
Response
{
  "documentId": "550e8400-e29b-41d4-a716-446655440000",
  "chunksCreated": 18,
  "title": "Q4 Earnings Report",
  "fileType": "pdf"
}
GET /documents/{id}/chunks

Retrieve all chunks for a document

Response
{
  "documentId": "550e8400-e29b-41d4-a716-446655440000",
  "chunks": [
    {
      "id": "...",
      "chunkIndex": 0,
      "text": "Q4 2024 earnings exceeded...",
      "tokenCount": 650
    },
    {
      "id": "...",
      "chunkIndex": 1,
      "text": "...exceeded expectations with revenue...",
      "tokenCount": 620
    }
  ]
}

Learning Benefits

Build foundational skills for RAG pipelines and document intelligence applications.

What You'll Learn

  • Parse documents with PyPDF2 and python-docx
  • Implement intelligent chunking strategies for RAG
  • Master token counting and overlap calculation
  • Store structured data in PostgreSQL with proper indexing
  • Handle file uploads and multipart form data
  • Build production-ready error handling and validation

Definition of Done

  • POST /parse endpoint accepts PDF, DOCX, and plain text
  • Text is cleanly extracted from all document types
  • Chunks are created with 500-700 tokens and 20-30% overlap
  • Documents and chunks are stored in PostgreSQL
  • GET /documents/{id}/chunks retrieves all chunks in order
  • API returns documentId and chunksCreated on success

Ready to Build?

This block provides the foundation for building RAG systems, document search, and text analytics applications. Master document parsing and chunking to unlock advanced AI capabilities.

FastAPI · PyPDF2 · python-docx · PostgreSQL · SQLAlchemy
