Learning Block

Document Parsing & Chunking

Master text extraction from PDFs and DOCX files, implement intelligent chunking strategies with overlap, and persist structured data in PostgreSQL for RAG and semantic search applications.

Text Extraction
PDF, DOCX, Plain Text
Smart Chunking
500-700 tokens, 20-30% overlap
PostgreSQL Storage
Structured, Retrievable
// Success Response
{
"documentId": "uuid-123",
"chunksCreated": 18
}

Learning Objectives & Scope

Build a production-ready document parsing and chunking service to master text extraction, chunking strategies, and structured data persistence for RAG applications.

Goal

Master document parsing, chunking strategies, and text persistence for building RAG pipelines and semantic search applications.

Success Criteria

  • Clean text extraction from PDFs and DOCX
  • Stable chunk boundaries with overlap
  • Stored and retrievable chunks

Tech Stack

FastAPI
PyPDF2, python-docx
PostgreSQL + SQLAlchemy

Must-Have Features

  • Accept raw text or uploaded file (PDF, DOCX)
  • Extract clean text from documents
  • Chunk text into 500-700 tokens with 20-30% overlap
  • Store documents and chunks in PostgreSQL
  • Retrieve chunks by document ID
  • Handle parsing errors gracefully

Nice-to-Have Features

  • Support for additional formats (HTML, Markdown)
  • Configurable chunking strategies (semantic, fixed, sliding)
  • Metadata extraction (title, author, creation date)
  • Chunk deduplication across documents
  • Full-text search on chunks
  • Batch document processing
Three-Tier Architecture

System Architecture

FastAPI service with parser layer, chunking engine, and PostgreSQL storage for scalable document processing.

Parser Layer

PyPDF2, python-docx, charset detection

PDF text extraction with layout preservation
DOCX parsing with style information
Plain text with encoding detection
Error handling for corrupted files

Chunking Engine

Token-based splitting with overlap

Configurable chunk size (500-700 tokens)
Overlap percentage (20-30%)
Sentence boundary detection
Metadata preservation (source, position)

Storage Layer

PostgreSQL with SQLAlchemy ORM

documents table (id, title, raw_text)
chunks table (id, document_id, chunk_index, text)
Efficient indexing and retrieval
Transaction management
PostgreSQL Schema

Data Model

Two-table schema with documents and chunks, designed for efficient storage and retrieval.

documents

Master document records

id          UUID          PRIMARY KEY
title       VARCHAR(500)  NOT NULL
raw_text    TEXT          NOT NULL
file_type   VARCHAR(10)   NOT NULL
created_at  TIMESTAMP     DEFAULT NOW()

chunks

Text chunks with position tracking

id           UUID       PRIMARY KEY
document_id  UUID       FOREIGN KEY → documents(id)
chunk_index  INTEGER    NOT NULL
text         TEXT       NOT NULL
token_count  INTEGER    NOT NULL
created_at   TIMESTAMP  DEFAULT NOW()

Relationship

chunks.document_id → documents.id (FOREIGN KEY, CASCADE DELETE)
Each document can have multiple chunks. Deleting a document automatically removes all associated chunks.
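
The two tables map naturally onto SQLAlchemy ORM models. A minimal sketch, assuming SQLAlchemy 1.4+: `String(36)` UUIDs are used here for portability across databases (on PostgreSQL you would typically use `sqlalchemy.dialects.postgresql.UUID`), and the ORM-level cascade mirrors the CASCADE DELETE relationship above.

```python
import uuid

from sqlalchemy import (Column, DateTime, ForeignKey, Integer, String, Text,
                        func)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Document(Base):
    __tablename__ = "documents"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    title = Column(String(500), nullable=False)
    raw_text = Column(Text, nullable=False)
    file_type = Column(String(10), nullable=False)
    created_at = Column(DateTime, server_default=func.now())

    # Deleting a Document cascades to its Chunks at the ORM level
    chunks = relationship("Chunk", back_populates="document",
                          cascade="all, delete-orphan",
                          order_by="Chunk.chunk_index")


class Chunk(Base):
    __tablename__ = "chunks"

    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    document_id = Column(String(36),
                         ForeignKey("documents.id", ondelete="CASCADE"),
                         nullable=False)
    chunk_index = Column(Integer, nullable=False)
    text = Column(Text, nullable=False)
    token_count = Column(Integer, nullable=False)
    created_at = Column(DateTime, server_default=func.now())

    document = relationship("Document", back_populates="chunks")
```

`ondelete="CASCADE"` covers deletes issued directly in SQL, while `cascade="all, delete-orphan"` covers deletes performed through the session, so both paths remove orphaned chunks.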
Intelligent Chunking

Chunking Strategy

Fixed-size chunking with overlap ensures context preservation and optimal performance for downstream RAG and embedding tasks.

Fixed Token Size

Each chunk targets 500-700 tokens

Chunk 1: 650 tokens, Chunk 2: 620 tokens, Chunk 3: 580 tokens

Percentage Overlap

20-30% overlap between consecutive chunks

If chunk size = 600 tokens, overlap = 120-180 tokens (e.g., at 25%, the last 150 tokens of chunk N become the first 150 tokens of chunk N+1)
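
As a quick sanity check on the arithmetic for a 600-token chunk (variable names here are illustrative):

```python
chunk_size = 600

for pct in (0.20, 0.25, 0.30):
    overlap = int(chunk_size * pct)
    stride = chunk_size - overlap  # distance between consecutive chunk starts
    print(f"{pct:.0%} overlap -> {overlap} shared tokens, stride {stride}")
```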

Sentence Boundaries

Prefer splitting at sentence boundaries

Instead of cutting mid-sentence, each chunk ends after a complete sentence

Chunking Algorithm

1
Tokenize Full Text
Use tiktoken or similar to count tokens in the document
2
Calculate Chunk Boundaries
Determine start and end positions for each chunk with configured overlap
3
Adjust for Sentence Boundaries
Fine-tune boundaries to end at sentence completion when possible
4
Create Chunk Records
Store each chunk with document_id, chunk_index, text, and token_count
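
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: whitespace-separated words stand in for real tokens (a real service would count tokens with tiktoken, as in step 1), and the function name and parameters are illustrative.

```python
def chunk_text(text, chunk_size=600, overlap=150):
    """Split text into overlapping chunks of roughly chunk_size tokens.

    Words stand in for tokens here; swap in a tiktoken encoder for
    accurate token counts.
    """
    tokens = text.split()
    stride = chunk_size - overlap  # how far each chunk's start advances
    chunks = []
    start = 0
    index = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append({
            "chunk_index": index,
            "text": " ".join(chunk_tokens),
            "token_count": len(chunk_tokens),
        })
        if end == len(tokens):
            break  # final chunk reached the end of the document
        start += stride
        index += 1
    return chunks
```

Sentence-boundary adjustment (step 3) would nudge `end` backward to the nearest sentence-final token before materializing each chunk; it is omitted here to keep the sketch short.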

Document Parsers

Dedicated parsers for each file format ensure clean text extraction with proper handling of encoding, layout, and formatting.

PDF

PyPDF2
Extract text from all pages
Preserve whitespace and layout
Handle encrypted PDFs with password
Detect and handle image-based PDFs (OCR fallback)
from PyPDF2 import PdfReader

def extract_pdf_text(file_path):
    reader = PdfReader(file_path)
    pages = []
    for page in reader.pages:
        # extract_text() can return None for image-only pages
        pages.append(page.extract_text() or "")
    return "\n".join(pages)

DOCX

python-docx
Extract paragraphs and runs
Preserve formatting metadata
Handle tables and lists
Extract header/footer content
from docx import Document

def extract_docx_text(file_path):
    doc = Document(file_path)
    # doc.paragraphs covers body text; tables and headers/footers
    # need separate traversal
    return "\n".join(para.text for para in doc.paragraphs)

Plain Text

chardet
Auto-detect character encoding
Handle UTF-8, ASCII, Latin-1
Normalize line endings
Strip control characters
import chardet

def extract_text(file_bytes):
    detected = chardet.detect(file_bytes)
    # Fall back to UTF-8 when detection is inconclusive
    encoding = detected["encoding"] or "utf-8"
    return file_bytes.decode(encoding, errors="replace").strip()
RESTful API

API Endpoints

Simple and intuitive API for uploading documents and retrieving chunked text.

POST /parse

Upload and parse a document

Request
{
  "file": <binary>,
  "title": "Q4 Earnings Report"
}
Response
{
  "documentId": "550e8400-e29b-41d4-a716-446655440000",
  "chunksCreated": 18,
  "title": "Q4 Earnings Report",
  "fileType": "pdf"
}
GET /documents/{id}/chunks

Retrieve all chunks for a document

Response
{
  "documentId": "550e8400-e29b-41d4-a716-446655440000",
  "chunks": [
    {
      "id": "...",
      "chunkIndex": 0,
      "text": "Q4 2024 earnings exceeded...",
      "tokenCount": 650
    },
    {
      "id": "...",
      "chunkIndex": 1,
      "text": "...exceeded expectations with revenue...",
      "tokenCount": 620
    }
  ]
}

Learning Benefits

Build foundational skills for RAG pipelines and document intelligence applications.

What You'll Learn

  • Parse documents with PyPDF2 and python-docx
  • Implement intelligent chunking strategies for RAG
  • Master token counting and overlap calculation
  • Store structured data in PostgreSQL with proper indexing
  • Handle file uploads and multipart form data
  • Build production-ready error handling and validation

Definition of Done

  • POST /parse endpoint accepts PDF, DOCX, and plain text
  • Text is cleanly extracted from all document types
  • Chunks are created with 500-700 tokens and 20-30% overlap
  • Documents and chunks are stored in PostgreSQL
  • GET /documents/{id}/chunks retrieves all chunks in order
  • API returns documentId and chunksCreated on success

Ready to Build?

This block provides the foundation for building RAG systems, document search, and text analytics applications. Master document parsing and chunking to unlock advanced AI capabilities.

FastAPI · PyPDF2 · python-docx · PostgreSQL · SQLAlchemy
