If you’ve built retrieval-augmented generation (RAG) systems, you’ve likely experienced the frustration: vector databases that return semantically similar but irrelevant chunks, context windows that break across arbitrary boundaries, and retrieval accuracy that plummets when documents contain tables, complex layouts, or multi-page context. Traditional RAG was designed for similarity, not relevance, and similarity ≠ relevance.
PageIndex by VectifyAI fundamentally reimagines document indexing for reasoning-based RAG. Instead of chopping documents into artificial chunks and hoping vector similarity works, PageIndex transforms lengthy PDFs into tree structures that let large language models navigate documents like human experts: through reasoning, not matching.
This tutorial provides a complete, production-ready guide to installing and using PageIndex for your AI applications.
Why traditional RAG fails with complex documents
The chunking problem
Conventional RAG systems rely on three flawed assumptions:
- Semantic similarity equals relevance: Vector embeddings measure similarity, but financial reports, legal contracts, and technical manuals demand domain expertise and multi-step reasoning; similar terms appear throughout such documents, so similarity alone is a weak relevance signal.
- Chunks preserve context: Arbitrary chunking breaks tables across boundaries, destroys layout relationships, and loses hierarchical document structure.
- Top-K retrieval is sufficient: Setting a fixed number of retrieved chunks (top-K) forces you to choose between recall and precision, missing relevant passages or drowning in noise.
The layout preservation challenge
Complex documents contain:
- Multi-column layouts that lose meaning when linearized
- Tables spanning multiple pages
- Hierarchical sections with implicit relationships
- Figures and diagrams with captions separated from content
Traditional OCR and parsing tools extract page-level text but discard global structure, making it impossible for LLMs to understand document organization.
What is PageIndex and how it solves these problems
PageIndex is a vectorless, reasoning-based document indexing system that builds hierarchical tree structures from long documents, enabling LLMs to perform tree search retrieval instead of vector similarity search.
Core innovation
Inspired by AlphaGo’s tree search algorithms, PageIndex performs retrieval in two steps:
- Generate a “table-of-contents” tree structure index: Transform PDFs into semantic trees where each node represents a logical document section with precise page boundaries.
- Perform reasoning-based retrieval through tree search: Enable LLMs to navigate the tree structure, reasoning about which branches contain relevant information—just like a human expert flipping to the right chapter.
The PageIndex tree structure
Each node in a PageIndex tree contains:
- Title: Section heading
- Node ID: Unique identifier for retrieval
- Start/End index: Exact physical page numbers
- Summary: Concise description of node content
- Nested nodes: Child sections forming the hierarchy
```json
{
  "title": "Financial Stability",
  "node_id": "0006",
  "start_index": 21,
  "end_index": 22,
  "summary": "The Federal Reserve ...",
  "nodes": [
    {
      "title": "Monitoring Financial Vulnerabilities",
      "node_id": "0007",
      "start_index": 22,
      "end_index": 28,
      "summary": "The Federal Reserve's monitoring ..."
    }
  ]
}
```

Key features that transform your RAG pipeline
No vectors needed
Eliminates expensive vector databases, embedding pipelines, and similarity search infrastructure. PageIndex uses document structure and LLM reasoning for retrieval, reducing operational complexity and cost.
No chunking required
Documents are organized into natural sections, not artificial chunks. This preserves full context, prevents fragmentation, and maintains the original document hierarchy.
Human-like retrieval
Simulates how human experts navigate complex documents. The LLM traverses a table-of-contents-like structure, reasoning about relevance at each branch—enabling true multi-step reasoning.
Precise page referencing
Every node contains exact physical page numbers and summaries, allowing pinpoint retrieval and verifiable citations. This makes answers traceable and auditable.
Scales to massive documents
Designed to handle hundreds or thousands of pages efficiently. The tree structure enables logarithmic search complexity rather than linear scanning.
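The traversal idea behind that logarithmic claim can be sketched in a few lines: at each level, the retriever picks the most promising child, so a balanced tree of N sections needs only one relevance decision per level rather than a scan of every section. In the sketch below, a simple keyword-overlap score stands in for the LLM's relevance reasoning, and the toy tree is illustrative, not real PageIndex output.

```python
def descend(node, query_terms):
    """Walk a PageIndex-style tree, picking the best-matching child at
    each level. Keyword overlap stands in for the LLM's reasoning step."""
    path = [node["node_id"]]
    while node.get("nodes"):
        node = max(
            node["nodes"],
            key=lambda n: len(query_terms & set(n["summary"].lower().split())),
        )
        path.append(node["node_id"])
    return path

# Illustrative tree following the node schema shown earlier.
tree = {
    "node_id": "0001", "summary": "annual report",
    "nodes": [
        {"node_id": "0002", "summary": "monetary policy overview", "nodes": []},
        {"node_id": "0003", "summary": "financial stability vulnerabilities",
         "nodes": [{"node_id": "0004",
                    "summary": "monitoring financial vulnerabilities",
                    "nodes": []}]},
    ],
}
print(descend(tree, {"financial", "vulnerabilities"}))  # ['0001', '0003', '0004']
```

Three decisions reach the target node; a flat scan of the same document would have scored every section.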
Prerequisites and environment setup
Before installing PageIndex, ensure your environment meets these requirements:
System requirements
- Python: Version 3.8 or higher
- Operating system: Linux, macOS, or Windows with WSL
- Disk space: At least 2GB free for dependencies and document processing
- Memory: Minimum 8GB RAM recommended for large documents
API keys and credentials
PageIndex requires an OpenAI API key for LLM-powered document analysis. Create a .env file in your project root:
```
CHATGPT_API_KEY=your_openai_api_key_here
```

Note: While PageIndex is optimized for GPT models, future versions will support additional LLM providers.
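If you want to read the same `.env` file from your own scripts, a few lines of stdlib Python are enough (the `python-dotenv` package does the same job more robustly); the demo below writes a throwaway file so the snippet runs anywhere:

```python
import os
import tempfile

def load_env(path):
    """Minimal .env parser: KEY=value lines, '#' comments ignored."""
    vals = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                vals[key.strip()] = value.strip()
                os.environ.setdefault(key.strip(), value.strip())
    return vals

# Demo with a temporary file instead of a real key.
path = os.path.join(tempfile.mkdtemp(), ".env")
with open(path, "w") as f:
    f.write("CHATGPT_API_KEY=sk-demo\n")
print(load_env(path)["CHATGPT_API_KEY"])  # sk-demo
```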
Optional dependencies
For advanced OCR capabilities with scanned PDFs, consider using PageIndex’s cloud service, which includes a specialized long-context OCR model that preserves document hierarchy better than standard tools.
Step-by-step installation guide
Step 1: Clone the PageIndex repository
```bash
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex
```

Step 2: Install Python dependencies
PageIndex uses a minimal dependency footprint. Install requirements using pip:
```bash
pip3 install --upgrade -r requirements.txt
```

The requirements.txt includes essential packages for PDF processing, API integration, and configuration management.
Step 3: Configure your API key
Create a .env file in the root directory:
```bash
echo "CHATGPT_API_KEY=your_openai_api_key_here" > .env
```

Replace your_openai_api_key_here with your actual OpenAI API key.
Step 4: Verify installation
Test your installation by running the help command:
```bash
python3 run_pageindex.py --help
```

You should see the available parameters and options displayed.
Usage guide: Indexing documents and performing queries
Basic document indexing
Process a PDF document to generate its PageIndex tree structure:
```bash
python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
```

This command creates a JSON file in the ./results/ directory containing the hierarchical tree structure.
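Once the run finishes, the generated tree is easy to inspect. Assuming the output follows the node schema shown earlier, a short recursive function prints a page-referenced outline (the inline `tree` below is illustrative; in practice you would `json.load` the file from ./results/):

```python
def outline(node, depth=0):
    """Render an indented outline with page ranges from a PageIndex tree."""
    lines = ["  " * depth
             + f'{node["title"]} (pp. {node["start_index"]}-{node["end_index"]})']
    for child in node.get("nodes", []):
        lines += outline(child, depth + 1)
    return lines

# Normally: tree = json.load(open("./results/your_document.json"))
# Inline structure matching the node schema shown earlier:
tree = {"title": "Financial Stability", "start_index": 21, "end_index": 22,
        "nodes": [{"title": "Monitoring Financial Vulnerabilities",
                   "start_index": 22, "end_index": 28, "nodes": []}]}
print("\n".join(outline(tree)))
```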
Customizing processing parameters
PageIndex offers several parameters to optimize for your document type:
```bash
python3 run_pageindex.py \
    --pdf_path /path/to/complex_report.pdf \
    --model gpt-4o-2024-11-20 \
    --toc-check-pages 30 \
    --max-pages-per-node 15 \
    --max-tokens-per-node 25000 \
    --if-add-node-summary yes
```

Parameter explanations:
- --model: OpenAI model to use (default: gpt-4o-2024-11-20)
- --toc-check-pages: Pages to check for a table of contents (default: 20)
- --max-pages-per-node: Maximum pages per node (default: 10)
- --max-tokens-per-node: Maximum tokens per node (default: 20000)
- --if-add-node-id: Add node IDs (yes/no, default: yes)
- --if-add-node-summary: Add node summaries (yes/no, default: no)
- --if-add-doc-description: Add a document description (yes/no, default: yes)
Markdown file support
PageIndex also supports markdown files, using heading levels to determine hierarchy:
```bash
python3 run_pageindex.py --md_path /path/to/document.md
```

Important: Ensure your markdown uses proper heading hierarchy (#, ##, ###). For PDFs converted to markdown, use PageIndex OCR to preserve the original structure.
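The heading-to-hierarchy mapping is easy to picture: each heading opens a child of the nearest shallower heading. A toy version of that logic (not PageIndex's actual parser) shows why a clean `#`/`##`/`###` hierarchy matters:

```python
def md_tree(markdown):
    """Build a nested outline from markdown heading levels.
    Toy illustration, not PageIndex's actual parser."""
    root = {"title": "root", "level": 0, "nodes": []}
    stack = [root]
    for line in markdown.splitlines():
        if line.startswith("#"):
            level = len(line) - len(line.lstrip("#"))
            node = {"title": line.lstrip("# ").strip(), "level": level, "nodes": []}
            # Pop back to the nearest heading shallower than this one.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["nodes"].append(node)
            stack.append(node)
    return root

doc = "# Report\n## Revenue\n### Q4\n## Outlook\n"
tree = md_tree(doc)
print([n["title"] for n in tree["nodes"][0]["nodes"]])  # ['Revenue', 'Outlook']
```

If a document skips levels or uses bold text instead of headings, the resulting tree degenerates, which is why well-formed heading hierarchy is a hard requirement here.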
Programmatic usage in Python
Integrate PageIndex into your applications:
```python
import json

from pageindex import page_index_main, config

# Configure processing options
opt = config(
    model='gpt-4o-2024-11-20',
    toc_check_page_num=20,
    max_page_num_each_node=10,
    max_token_num_each_node=20000,
    if_add_node_id='yes',
    if_add_node_summary='yes',
    if_add_doc_description='yes'
)

# Process the PDF into a tree structure
tree_structure = page_index_main('/path/to/document.pdf', opt)

# Save the results
with open('document_structure.json', 'w') as f:
    json.dump(tree_structure, f, indent=2)
```

Building a reasoning-based RAG system with PageIndex
Preprocessing workflow
- Process documents: Generate PageIndex trees for all documents in your corpus
- Store structures: Save tree structures and document IDs in a database
- Index node contents: Store each node’s content in a separate table indexed by node ID
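Steps 2 and 3 can be sketched with SQLite (the table and column names here are illustrative, not part of PageIndex): flatten each tree and key every node's content by (doc_id, node_id), so the retrieval step can fetch contexts for the node IDs the LLM selects.

```python
import json
import sqlite3

def flatten(node, out=None):
    """Collect every node in a PageIndex tree, depth-first."""
    out = [] if out is None else out
    out.append(node)
    for child in node.get("nodes", []):
        flatten(child, out)
    return out

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trees (doc_id TEXT PRIMARY KEY, tree_json TEXT)")
conn.execute("CREATE TABLE node_contents (doc_id TEXT, node_id TEXT, "
             "content TEXT, PRIMARY KEY (doc_id, node_id))")

# Illustrative tree following the node schema shown earlier.
tree = {"node_id": "0006", "title": "Financial Stability",
        "nodes": [{"node_id": "0007",
                   "title": "Monitoring Financial Vulnerabilities",
                   "nodes": []}]}
conn.execute("INSERT INTO trees VALUES (?, ?)", ("fed_report", json.dumps(tree)))
for node in flatten(tree):
    conn.execute("INSERT INTO node_contents VALUES (?, ?, ?)",
                 ("fed_report", node["node_id"], f'text of {node["title"]}'))

row = conn.execute("SELECT content FROM node_contents WHERE doc_id=? AND node_id=?",
                   ("fed_report", "0007")).fetchone()
print(row[0])  # text of Monitoring Financial Vulnerabilities
```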
Retrieval pipeline
```python
def reasoning_based_retrieval(query, document_tree):
    """Perform reasoning-based retrieval using a PageIndex tree structure.

    call_llm, fetch_node_contents, and generate_answer are placeholders
    for your own LLM client and node-content store.
    """
    prompt = f"""
    You are given a question and a tree structure of a document.
    Find all nodes likely to contain the answer through reasoning.

    Question: {query}
    Document tree: {document_tree}

    Reply in JSON format:
    {{
        "thinking": "Reasoning about where to look...",
        "node_list": ["node_id1", "node_id2"]
    }}
    """
    # Use the LLM to reason about which nodes are relevant
    response = call_llm(prompt)
    relevant_nodes = response['node_list']

    # Fetch node contents and generate the answer
    context = fetch_node_contents(relevant_nodes)
    answer = generate_answer(query, context)
    return answer
```

Example prompt for node selection
The key to reasoning-based RAG is enabling the LLM to think through the document structure:
```python
prompt = f"""
You are a financial analyst answering questions from SEC filings.
Use the document tree structure to locate relevant sections.

Question: What was the company's revenue growth in Q4 2024?
Document tree structure: {tree_structure}

Instructions:
1. Analyze the question to identify required information
2. Navigate the tree structure logically
3. Select nodes most likely to contain the answer
4. Provide reasoning for each selection

Return JSON:
{{
    "thinking": "Revenue growth would be in financial statements...",
    "node_list": ["0003", "0007", "0012"]
}}
"""
```

Real-world performance: Mafin 2.5 case study
PageIndex powers Mafin 2.5, a state-of-the-art reasoning-based RAG model for financial document analysis that achieved 98.7% accuracy on FinanceBench—a benchmark for financial question answering.
Performance highlights
- 98.7% accuracy on FinanceBench, significantly outperforming vector-based RAG systems
- Precise navigation through complex SEC filings and earnings disclosures
- Explainable retrieval with clear page-level references for auditability
- Domain expertise integration through few-shot learning in the reasoning process
Comparison with traditional RAG
| Feature | Traditional vector RAG | PageIndex reasoning-based RAG |
| --- | --- | --- |
| Retrieval method | Semantic similarity | Tree search + reasoning |
| Accuracy on domain docs | 60-75% | 98.7% (FinanceBench) |
| Chunking required | Yes | No |
| Vector database needed | Yes | No |
| Retrieval traceability | Black box | Fully explainable |
| Context preservation | Fragmented | Hierarchical |
| Infrastructure cost | High (vector DB) | Minimal |
Advanced features and cloud integration
PageIndex MCP server
The new MCP (Model Context Protocol) server brings PageIndex into Claude, Cursor, and any MCP-enabled agent. Chat with long PDFs using human-like, reasoning-based retrieval.
Install the MCP server:
```bash
npm install -g @vectifyai/pageindex-mcp
```

PageIndex OCR for complex documents
For scanned PDFs or complex layouts, PageIndex OCR provides superior hierarchy preservation compared to standard OCR tools. The cloud service includes this advanced OCR capability.
Cloud API and dashboard
If self-hosting isn’t ideal, use the hosted API:
- Dashboard: Upload and explore PDFs visually
- API: Integrate into production environments
- Free tier: 1,000 pages available
Troubleshooting common issues
Bug fix notice
A bug introduced on April 18 has been fixed. If you cloned the repository between April 18–23, update to the latest version:
```bash
git pull origin main
```

Document parsing failures
For complex PDFs that fail to parse correctly:
- Try the cloud service with PageIndex OCR
- Convert to markdown first using specialized tools
- Adjust the --max-pages-per-node and --max-tokens-per-node parameters
API rate limits
Large documents may hit OpenAI rate limits. Consider:
- Processing documents in batches
- Using a paid OpenAI tier for higher limits
- Implementing exponential backoff in your code
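A generic backoff wrapper (independent of any particular client library) covers the last point: retry with exponentially growing waits plus jitter, and re-raise once the retry budget is spent. In this demo the sleep is replaced with a no-op and the failing call is simulated, so the snippet runs instantly:

```python
import random
import time

def with_backoff(fn, max_retries=5, base=1.0, sleep=time.sleep):
    """Retry fn with exponential backoff plus jitter on any exception."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base * 2 ** attempt + random.random())

# Demo: a simulated call that is rate-limited twice before succeeding.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # ok
```

In production you would pass your actual OpenAI request as `fn` and keep the real `time.sleep`, ideally catching only the client's rate-limit exception rather than every `Exception`.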
The future of RAG is reasoning-based
PageIndex represents a paradigm shift in document indexing for AI applications. By eliminating vectors, chunking, and similarity search, it enables LLMs to retrieve information through reasoning—mirroring how human experts navigate complex documents.
The 98.7% accuracy on FinanceBench demonstrates that reasoning-based retrieval isn’t just theoretically superior; it delivers measurable improvements in production environments.
Whether you’re building financial analysis tools, legal document search systems, or technical documentation assistants, PageIndex provides the foundation for truly intelligent document retrieval.
Start with the self-hosted open-source version for development, then scale to the cloud service for production OCR capabilities. The future of RAG isn’t about better embeddings—it’s about better reasoning.
Next steps and resources
- GitHub repository: https://github.com/VectifyAI/PageIndex
- Cloud dashboard: https://pageindex.ai
- Discord community: Join for support and discussions
- Cookbook examples: Explore advanced use cases in the repository
- MCP server: Integrate with Claude and Cursor for agentic workflows
Leave a star on the GitHub repository to support the project and receive updates on new features and improvements.