If you’ve built retrieval-augmented generation systems, you’ve likely experienced the frustration: vector databases that return semantically similar but irrelevant chunks, context windows that break across arbitrary boundaries, and retrieval accuracy that plummets when documents contain tables, complex layouts, or multi-page context. Traditional RAG was designed around similarity, and similarity is not relevance.

PageIndex by VectifyAI fundamentally reimagines document indexing for reasoning-based RAG. Instead of chopping documents into artificial chunks and hoping vector similarity finds the right ones, PageIndex transforms lengthy PDFs into hierarchical tree structures that let large language models navigate documents the way human experts do: through reasoning, not matching.

This tutorial provides a complete, production-ready guide to installing and using PageIndex for your AI applications.

Why traditional RAG fails with complex documents

The chunking problem

Conventional RAG systems rely on three flawed assumptions:

  1. Semantic similarity equals relevance: Vector embeddings measure surface similarity, but financial reports, legal contracts, and technical manuals demand domain expertise and multi-step reasoning, and similar terminology recurs throughout such documents, so the most similar chunk is often not the relevant one.
  2. Chunks preserve context: Arbitrary chunking breaks tables across boundaries, destroys layout relationships, and loses hierarchical document structure.
  3. Top-K retrieval is sufficient: Setting a fixed number of retrieved chunks (top-K) forces you to choose between recall and precision, missing relevant passages or drowning in noise.
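
The chunking failure mode in point 2 is easy to reproduce. A hypothetical sketch (the table text and chunk size are illustrative, not from any real pipeline):

```python
# Naive fixed-size chunking, as used by many vector-RAG pipelines.
def chunk_text(text, chunk_size):
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# A small table flattened to text, as a PDF parser might emit it.
table = "Quarter Revenue\nQ1 100\nQ2 120\nQ3 135\nQ4 150"

chunks = chunk_text(table, 20)
# The Q1 revenue figure is split between chunks[0] ("...Q1 1") and
# chunks[1] ("00..."), so neither chunk alone can answer
# "What was Q1 revenue?".
print(chunks)
```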

The layout preservation challenge

Complex documents contain:

  • Multi-column layouts that lose meaning when linearized
  • Tables spanning multiple pages
  • Hierarchical sections with implicit relationships
  • Figures and diagrams with captions separated from content

Traditional OCR and parsing tools extract page-level text but discard global structure, making it impossible for LLMs to understand document organization.

What is PageIndex and how it solves these problems

PageIndex is a vectorless, reasoning-based document indexing system that builds hierarchical tree structures from long documents, enabling LLMs to perform tree search retrieval instead of vector similarity search.

Core innovation

Inspired by AlphaGo’s tree search algorithms, PageIndex performs retrieval in two steps:

  1. Generate a “table-of-contents” tree structure index: Transform PDFs into semantic trees where each node represents a logical document section with precise page boundaries.
  2. Perform reasoning-based retrieval through tree search: Enable LLMs to navigate the tree structure, reasoning about which branches contain relevant information—just like a human expert flipping to the right chapter.
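
The second step can be pictured as a greedy traversal in which an LLM call decides which branches to expand. A simplified sketch, not PageIndex's actual API: the `choose` callback stands in for the LLM call.

```python
def tree_search(query, roots, choose):
    """Greedy tree search: at each level, `choose` (an LLM call in
    practice) picks which child branches look relevant to the query."""
    selected, frontier = [], list(roots)
    while frontier:
        node = frontier.pop(0)
        children = node.get("nodes", [])
        if not children:
            selected.append(node["node_id"])          # leaf: candidate section
        else:
            frontier.extend(choose(query, children))  # descend chosen branches
    return selected
```

In a real system `choose` would prompt the model with the children's titles and summaries and parse back the selected branches; here any filtering function works.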

The PageIndex tree structure

Each node in a PageIndex tree contains:

  • Title: Section heading
  • Node ID: Unique identifier for retrieval
  • Start/End index: Exact physical page numbers
  • Summary: Concise description of node content
  • Nested nodes: Child sections forming the hierarchy
For example:

{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    }
  ]
}

Key features that transform your RAG pipeline

No vectors needed

Eliminates expensive vector databases, embedding pipelines, and similarity search infrastructure. PageIndex uses document structure and LLM reasoning for retrieval, reducing operational complexity and cost.

No chunking required

Documents are organized into natural sections, not artificial chunks. This preserves full context, prevents fragmentation, and maintains the original document hierarchy.

Human-like retrieval

Simulates how human experts navigate complex documents. The LLM traverses a table-of-contents-like structure, reasoning about relevance at each branch—enabling true multi-step reasoning.

Precise page referencing

Every node contains exact physical page numbers and summaries, allowing pinpoint retrieval and verifiable citations. This makes answers traceable and auditable.

Scales to massive documents

Designed to handle hundreds or thousands of pages efficiently. The tree structure enables logarithmic search complexity rather than linear scanning.
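
The claim is easy to quantify: with roughly b children per node, a tree of depth d covers b^d sections, so the model inspects about b times log_b(N) nodes instead of scanning all N. A back-of-the-envelope check with illustrative numbers:

```python
import math

sections = 1000   # leaf sections in a large document (illustrative)
branching = 10    # children per tree node (illustrative)

depth = math.ceil(math.log(sections, branching))  # levels to descend
nodes_inspected = branching * depth               # candidates examined

print(depth, nodes_inspected)  # 3 levels, about 30 nodes instead of 1000
```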

Prerequisites and environment setup

Before installing PageIndex, ensure your environment meets these requirements:

System requirements

  • Python: Version 3.8 or higher
  • Operating system: Linux, macOS, or Windows with WSL
  • Disk space: At least 2GB free for dependencies and document processing
  • Memory: Minimum 8GB RAM recommended for large documents

API keys and credentials

PageIndex requires an OpenAI API key for LLM-powered document analysis. Create a .env file in your project root:

CHATGPT_API_KEY=your_openai_api_key_here

Note: While PageIndex is optimized for GPT models, future versions will support additional LLM providers.
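
If you call PageIndex from Python rather than the CLI, the key must be present in the environment. Here is a minimal stdlib sketch of what a .env loader does (in practice a library such as python-dotenv handles this; the function below is illustrative):

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: KEY=value lines, '#' comments, no quoting."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

load_env()
if not os.environ.get("CHATGPT_API_KEY"):
    print("CHATGPT_API_KEY is not set; check your .env file")
```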

Optional dependencies

For advanced OCR capabilities with scanned PDFs, consider using PageIndex’s cloud service, which includes a specialized long-context OCR model that preserves document hierarchy better than standard tools.

Step-by-step installation guide

Step 1: Clone the PageIndex repository

git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

Step 2: Install Python dependencies

PageIndex uses a minimal dependency footprint. Install requirements using pip:

pip3 install --upgrade -r requirements.txt

The requirements.txt includes essential packages for PDF processing, API integration, and configuration management.

Step 3: Configure your API key

Create a .env file in the root directory:

echo "CHATGPT_API_KEY=your_openai_api_key_here" > .env

Replace your_openai_api_key_here with your actual OpenAI API key.

Step 4: Verify installation

Test your installation by running the help command:

python3 run_pageindex.py --help

You should see the available parameters and options displayed.

Usage guide: Indexing documents and performing queries

Basic document indexing

Process a PDF document to generate its PageIndex tree structure:

python3 run_pageindex.py --pdf_path /path/to/your/document.pdf

This command creates a JSON file in the ./results/ directory containing the hierarchical tree structure.
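
The resulting JSON can be inspected directly. A small sketch that renders the section outline with page ranges; the sample tree below mirrors the node format shown earlier, and in practice you would json.load the file from ./results/ instead:

```python
import json

def outline(node, depth=0):
    """Return indented 'title (pp. start-end)' lines for a PageIndex tree."""
    lines = ["  " * depth +
             f"{node['title']} (pp. {node['start_index']}-{node['end_index']})"]
    for child in node.get("nodes", []):
        lines.extend(outline(child, depth + 1))
    return lines

# Sample tree in the same shape as the generated JSON.
tree = json.loads("""{"title": "Financial Stability", "node_id": "0006",
  "start_index": 21, "end_index": 22, "summary": "...",
  "nodes": [{"title": "Monitoring Financial Vulnerabilities",
             "node_id": "0007", "start_index": 22, "end_index": 28,
             "summary": "..."}]}""")

print("\n".join(outline(tree)))
```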

Customizing processing parameters

PageIndex offers several parameters to optimize for your document type:

python3 run_pageindex.py \
  --pdf_path /path/to/complex_report.pdf \
  --model gpt-4o-2024-11-20 \
  --toc-check-pages 30 \
  --max-pages-per-node 15 \
  --max-tokens-per-node 25000 \
  --if-add-node-summary yes

Parameter explanations:

  • --model: OpenAI model to use (default: gpt-4o-2024-11-20)
  • --toc-check-pages: Pages to check for table of contents (default: 20)
  • --max-pages-per-node: Maximum pages per node (default: 10)
  • --max-tokens-per-node: Maximum tokens per node (default: 20000)
  • --if-add-node-id: Add node ID (yes/no, default: yes)
  • --if-add-node-summary: Add node summary (yes/no, default: no)
  • --if-add-doc-description: Add document description (yes/no, default: yes)

Markdown file support

PageIndex also supports markdown files, using heading levels to determine hierarchy:

python3 run_pageindex.py --md_path /path/to/document.md

Important: Ensure your markdown uses proper heading hierarchy (#, ##, ###). For PDFs converted to markdown, use PageIndex OCR to preserve original structure.

Programmatic usage in Python

Integrate PageIndex into your applications:

from pageindex import page_index_main, config
import os

# Configure options
opt = config(
    model='gpt-4o-2024-11-20',
    toc_check_page_num=20,
    max_page_num_each_node=10,
    max_token_num_each_node=20000,
    if_add_node_id='yes',
    if_add_node_summary='yes',
    if_add_doc_description='yes'
)

# Process PDF
tree_structure = page_index_main('/path/to/document.pdf', opt)

# Save results
import json
with open('document_structure.json', 'w') as f:
    json.dump(tree_structure, f, indent=2)

Building a reasoning-based RAG system with PageIndex

Preprocessing workflow

  1. Process documents: Generate PageIndex trees for all documents in your corpus
  2. Store structures: Save tree structures and document IDs in a database
  3. Index node contents: Store each node’s content in a separate table indexed by node ID
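
The three steps above can be sketched with SQLite. This is an illustrative schema; the table names, column names, and the get_node_text callback are assumptions, not part of PageIndex:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in production
conn.execute("CREATE TABLE documents (doc_id TEXT PRIMARY KEY, tree_json TEXT)")
conn.execute("""CREATE TABLE nodes (
    doc_id TEXT, node_id TEXT, title TEXT, content TEXT,
    PRIMARY KEY (doc_id, node_id))""")

def store_document(doc_id, tree, get_node_text):
    """Store the tree structure, then each node's content keyed by node ID."""
    conn.execute("INSERT INTO documents VALUES (?, ?)",
                 (doc_id, json.dumps(tree)))
    stack = [tree]
    while stack:
        node = stack.pop()
        conn.execute("INSERT INTO nodes VALUES (?, ?, ?, ?)",
                     (doc_id, node["node_id"], node["title"],
                      get_node_text(node)))
        stack.extend(node.get("nodes", []))
    conn.commit()

def fetch_node_contents(doc_id, node_ids):
    """Fetch and join the content of the nodes the LLM selected."""
    qs = ",".join("?" * len(node_ids))
    rows = conn.execute(
        f"SELECT content FROM nodes WHERE doc_id = ? AND node_id IN ({qs})",
        [doc_id, *node_ids]).fetchall()
    return "\n\n".join(r[0] for r in rows)
```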

Retrieval pipeline

import json

def reasoning_based_retrieval(query, document_tree):
    """
    Perform reasoning-based retrieval using a PageIndex tree structure.
    `call_llm`, `fetch_node_contents`, and `generate_answer` are placeholders
    for your LLM client and node-content store.
    """
    prompt = f"""
    You are given a question and a tree structure of a document.
    Find all nodes likely to contain the answer through reasoning.
    
    Question: {query}
    
    Document tree: {document_tree}
    
    Reply in JSON format:
    {{
        "thinking": "Reasoning about where to look...",
        "node_list": ["node_id1", "node_id2"]
    }}
    """
    
    # Use the LLM to reason about relevant nodes; parse its JSON reply
    response = json.loads(call_llm(prompt))
    relevant_nodes = response['node_list']
    
    # Fetch node contents and generate the answer
    context = fetch_node_contents(relevant_nodes)
    answer = generate_answer(query, context)
    
    return answer

Example prompt for node selection

The key to reasoning-based RAG is enabling the LLM to think through the document structure:

prompt = f"""
You are a financial analyst answering questions from SEC filings.
Use the document tree structure to locate relevant sections.

Question: What was the company's revenue growth in Q4 2024?

Document tree structure: {tree_structure}

Instructions:
1. Analyze the question to identify required information
2. Navigate the tree structure logically
3. Select nodes most likely to contain the answer
4. Provide reasoning for each selection

Return JSON:
{{
    "thinking": "Revenue growth would be in financial statements...",
    "node_list": ["0003", "0007", "0012"]
}}
"""

Real-world performance: Mafin 2.5 case study

PageIndex powers Mafin 2.5, a state-of-the-art reasoning-based RAG model for financial document analysis that achieved 98.7% accuracy on FinanceBench—a benchmark for financial question answering.

Performance highlights

  • 98.7% accuracy on FinanceBench, significantly outperforming vector-based RAG systems
  • Precise navigation through complex SEC filings and earnings disclosures
  • Explainable retrieval with clear page-level references for auditability
  • Domain expertise integration through few-shot learning in the reasoning process

Comparison with traditional RAG

| Feature | Traditional vector RAG | PageIndex reasoning-based RAG |
| --- | --- | --- |
| Retrieval method | Semantic similarity | Tree search + reasoning |
| Accuracy on domain docs | 60-75% | 98.7% (FinanceBench) |
| Chunking required | Yes | No |
| Vector database needed | Yes | No |
| Retrieval traceability | Black box | Fully explainable |
| Context preservation | Fragmented | Hierarchical |
| Infrastructure cost | High (vector DB) | Minimal |

Advanced features and cloud integration

PageIndex MCP server

The new MCP (Model Context Protocol) server brings PageIndex into Claude, Cursor, and any MCP-enabled agent. Chat with long PDFs using human-like, reasoning-based retrieval.

Install the MCP server:

npm install -g @vectifyai/pageindex-mcp

PageIndex OCR for complex documents

For scanned PDFs or complex layouts, PageIndex OCR provides superior hierarchy preservation compared to standard OCR tools. The cloud service includes this advanced OCR capability.

Cloud API and dashboard

If self-hosting isn’t ideal, use the hosted API:

  • Dashboard: Upload and explore PDFs visually
  • API: Integrate into production environments
  • Free tier: 1,000 pages available

Troubleshooting common issues

Bug fix notice

A bug introduced on April 18 has been fixed. If you cloned the repository between April 18–23, update to the latest version:

git pull origin main

Document parsing failures

For complex PDFs that fail to parse correctly:

  1. Try the cloud service with PageIndex OCR
  2. Convert to markdown first using specialized tools
  3. Adjust --max-pages-per-node and --max-tokens-per-node parameters

API rate limits

Large documents may hit OpenAI rate limits. Consider:

  • Processing documents in batches
  • Using a paid OpenAI tier for higher limits
  • Implementing exponential backoff in your code
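
Exponential backoff is a few lines of standard-library Python. A generic sketch; in practice, narrow the except clause to your client's rate-limit exception:

```python
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0):
    """Retry `call`, doubling the delay each attempt and adding jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:  # narrow to e.g. your client's RateLimitError
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            time.sleep(delay)
```

Wrap whatever request you make, e.g. `with_backoff(lambda: call_llm(prompt))`.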

The future of RAG is reasoning-based

PageIndex represents a paradigm shift in document indexing for AI applications. By eliminating vectors, chunking, and similarity search, it enables LLMs to retrieve information through reasoning—mirroring how human experts navigate complex documents.

The 98.7% accuracy on FinanceBench demonstrates that reasoning-based retrieval isn’t just theoretically superior; it delivers measurable improvements in production environments.

Whether you’re building financial analysis tools, legal document search systems, or technical documentation assistants, PageIndex provides the foundation for truly intelligent document retrieval.

Start with the self-hosted open-source version for development, then scale to the cloud service for production OCR capabilities. The future of RAG isn’t about better embeddings—it’s about better reasoning.

Next steps and resources

Leave a star on the GitHub repository to support the project and receive updates on new features and improvements.
