Project Eden

Intelligent Retrieval-Augmented Generation System

Overview

What is Project Eden?

Project Eden is an intelligent Retrieval-Augmented Generation (RAG) system that transforms unstructured data files (PDFs, Excel spreadsheets, etc.) into a queryable knowledge base. It combines structured queries, semantic search, and agentic reasoning to answer questions from your data with high accuracy and proper citations.

Unlike traditional RAG systems, Project Eden uses LLM-powered planning to automatically understand your data structure, generate optimal queries, and route questions to the best retrieval strategy—all without manual configuration.

Automatic Schema Detection

Uses LLMs to analyze file structure, detect entities, infer data types, and normalize headers automatically—no manual schema definition required.

Multi-Strategy Search

Four retrieval modes: structured DSL queries, semantic vector search, hybrid filtering, and reciprocal rank fusion—each optimized for different query types.

Intelligent Query Routing

Automatically analyzes questions and selects the optimal retrieval strategy based on query characteristics (structured vs semantic, complexity, etc.).

Built-in Testing & Evaluation

Comprehensive test harness with automatic metrics, LLM-based quality evaluation, comparison tools, and performance regression tracking.

Rich CLI Interface

Full-featured command-line interface with color-coded outputs, progress indicators, file viewers, and detailed debug information.

Type-Safe DSL

Custom domain-specific language for structured queries with schema validation, type-safe accessors, and helpful error messages.

Quick Start

Setup & Process Your First File

# Install dependencies
pnpm install
pnpm build

# Configure environment
# Create .env with:
#   OPENAI_API_KEY=sk-...
#   DATABASE_URL=postgresql://...

# Initialize database
pnpm eden db:migrate

# Process a file
pnpm eden process fixtures/input/sample.xlsx

# Ask a question
pnpm eden ask "Show me all records with more than 5 bedrooms"

Key Capabilities

DSL Queries

Structured queries for precise filtering. Best for numeric constraints and exact field matching.

Vector Search

Semantic similarity search using embeddings. Perfect for conceptual queries and exploratory search.

Hybrid Filter

DSL pre-filtering combined with vector ranking. Filter first, then rank by semantic similarity.

Hybrid Fusion

Combines DSL and vector results using Reciprocal Rank Fusion for optimal relevance.

Auto Routing

Automatically selects the optimal retrieval strategy based on query characteristics.

Schema Detection

LLM-powered automatic schema detection. No manual configuration required.

Architecture Overview

Project Eden follows a three-phase pipeline: Planning → Ingestion → Persistence, followed by intelligent query routing and answer generation.

Data Flow

Raw File → Plan → Ingest → Persist → Query → Answer
    ↓       ↓       ↓        ↓        ↓       ↓
   XLSX  Schemas  Chunks   DB+Vec   Router   LLM

Phase 1: Planning
• LLM analyzes file structure
• Detects schemas and entities
• Generates normalization plan
• Infers data types and relationships

Phase 2: Ingestion
• Executes plan with deterministic tools
• Normalizes headers and data
• Extracts structured records
• Generates summaries and evidence

Phase 3: Persistence
• Generates embeddings (batched)
• Stores chunks, schemas, vectors
• Creates database indexes
• Makes data queryable

Query Phase:
• Router analyzes question
• Selects retrieval strategy
• Executes query (DSL/vector/hybrid)
• LLM synthesizes answer with citations

Handbook

System Architecture

Core Components

Project Eden is built on a modular architecture with clear separation of concerns. Each component handles a specific aspect of the RAG pipeline.

Planning System

runPlanner(options: PlannerOptions): Promise

Main entry point for the planning phase. Analyzes file structure and generates a processing plan.

Parameters
Name | Type | Description
fileId | string | Unique identifier for the file being processed
filePath | string | Path to the file to analyze
client | LlmClient | LLM client for schema detection
llmConfig | LlmConfig | LLM configuration settings
Returns
Returns
Promise resolving to an object containing the plan, schema_v0, schema_v1, and a quality grade
Examples
const result = await runPlanner({
  fileId: 'abc-123',
  filePath: './data.xlsx',
  client,
  llmConfig
});

Query Routing

selectRetrievalStrategy(question: string, context: SchemaContext): Promise<RetrievalStrategy>

Analyzes a question and selects the optimal retrieval strategy (DSL, vector, hybrid-filter, or hybrid-fusion).

Parameters
Name | Type | Description
question | string | Natural language question to answer
context | SchemaContext | Available schemas and DSL specification
Returns
RetrievalStrategy - Contains mode, dslQuery (if applicable), and semanticQuery (if applicable)
Examples
const strategy = await selectRetrievalStrategy(
  "Which chalet has 5 bedrooms?",
  schemaContext
);
// Returns: { mode: 'dsl', dslQuery: {...} }

Vector Search

vectorSearch(client: PoolClient, options: VectorSearchOptions): Promise

Performs cosine similarity search using pgvector. Returns results ranked by semantic similarity.

Parameters
Name | Type | Description
client | PoolClient | PostgreSQL client connection
options | VectorSearchOptions | Search options including queryVector, accountId, filters, and limit
Returns
Promise resolving to results with similarity scores and ranks
Examples
const results = await vectorSearch(client, {
  accountId: 'user-123',
  queryVector: embeddingVector,
  limit: 20,
  schemaIds: ['accommodation']
});
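
Under the hood, a pgvector cosine query looks roughly like the sketch below. This is illustrative rather than the project's actual SQL; pgvector's <=> operator returns cosine distance, so 1 - distance yields similarity.

import { Pool } from 'pg';

// Illustrative sketch: rank normalized_records by cosine similarity.
async function cosineSearchSketch(
  pool: Pool,
  accountId: string,
  queryVector: number[],
  limit: number
) {
  const sql = `
    SELECT id, schema_id, kind,
           1 - (embedding <=> $1::vector) AS similarity
    FROM normalized_records
    WHERE account_id = $2
    ORDER BY embedding <=> $1::vector
    LIMIT $3
  `;
  // pgvector accepts vectors as '[v1,v2,...]' string literals.
  const vectorLiteral = `[${queryVector.join(',')}]`;
  const { rows } = await pool.query(sql, [vectorLiteral, accountId, limit]);
  return rows;
}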

Database Schema

Core Tables

Project Eden uses PostgreSQL with the pgvector extension for vector similarity search. The schema is designed for efficient querying with JSONB for flexible data storage and GIN indexes for fast lookups.

files Table

Column | Type | Description
id | UUID | Primary key, auto-generated
account_id | UUID | Tenant identifier for multi-tenancy
name | TEXT | Original filename
mime | TEXT | MIME type (e.g., 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet')
bytes | BYTEA | File content (optional, for binary storage)
sha256 | BYTEA | SHA-256 hash for deduplication
created_at | TIMESTAMPTZ | Creation timestamp

normalized_records Table

Column | Type | Description
id | UUID | Primary key, auto-generated
account_id | UUID | Tenant identifier
file_id | UUID | Foreign key to files.id
schema_id | TEXT | Schema identifier (e.g., 'accommodation')
kind | TEXT | Entity type (e.g., 'property', 'amenity')
search_data | JSONB | Normalized record data (indexed with GIN)
evidence | JSONB | Original source data and provenance
embedding | VECTOR(1024) | Embedding vector for semantic search
created_at | TIMESTAMPTZ | Creation timestamp

file_plans Table

Column | Type | Description
file_id | UUID | Primary key, references files.id
plan | JSONB | Processing plan (tool execution steps)
schema_v0 | JSONB | Initial schema detected by planner
schema_v1 | JSONB | Materialized execution schema
quality_grade | TEXT | Planner confidence: 'A', 'B', or 'C'
created_at | TIMESTAMPTZ | Creation timestamp

Indexes

Efficient indexing is crucial for performance. Project Eden uses several index types optimized for different query patterns.

Index Definitions

-- GIN index for JSONB containment queries (DSL queries)
CREATE INDEX nr_gin ON normalized_records USING GIN (search_data jsonb_path_ops);

-- Composite indexes for filtering
CREATE INDEX nr_file_kind ON normalized_records(file_id, kind);
CREATE INDEX nr_account_kind ON normalized_records(account_id, kind);

-- IVFFlat index for vector similarity search
CREATE INDEX nr_vec ON normalized_records USING ivfflat (embedding);

-- Unique index for file deduplication
CREATE UNIQUE INDEX files_account_hash_idx ON files(account_id, sha256);

Extensions

Required PostgreSQL extensions must be installed before running migrations.

-- Required extensions
CREATE EXTENSION IF NOT EXISTS pgcrypto;      -- For UUID generation
CREATE EXTENSION IF NOT EXISTS vector;        -- For vector similarity search
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch; -- For fuzzy string matching

DSL Query Language

Overview

The Domain-Specific Language (DSL) provides a type-safe, composable query language for filtering JSONB records in PostgreSQL. It compiles to optimized SQL using JSONB operators and GIN indexes.
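
To make the compilation concrete, here is a minimal, hypothetical sketch of how an eq() filter on a string field could become a containment query (the real compiler lives in src/dsl/compile.ts and covers the full operator set):

// Hypothetical sketch: compile eq(string(path), value) to a JSONB
// containment predicate that the GIN jsonb_path_ops index can serve.
function compileEqSketch(path: string, value: string): { sql: string; param: string } {
  // Build a nested object from the dotted path, e.g.
  // 'property.country' + 'France' → {"property":{"country":"France"}}
  const nested = path
    .split('.')
    .reverse()
    .reduce<unknown>((acc, key) => ({ [key]: acc }), value);
  return {
    sql: 'SELECT * FROM normalized_records WHERE search_data @> $1::jsonb',
    param: JSON.stringify(nested),
  };
}

// compileEqSketch('property.country', 'France').param
// → '{"property":{"country":"France"}}'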

Type Accessors

All field references must be wrapped in a type accessor that matches the schema definition. This enables schema validation and proper SQL casting.

Type Accessors

Accessor | Schema Types | Example
string("path") | string, datetime, date, time | string("property.name")
number("path") | number | number("property.size_m2")
boolean("path") | boolean | boolean("property.has_wifi")
json("path") | json | json("property.amenities")

Comparators

Equality & Existence

eq(accessor, value)

Exact match. Matches NULL or missing fields when value is null.

Examples
eq(string("property.country"), "France")
eq(number("property.bedrooms"), 4)
eq(string("property.name"), null)

Numeric Comparisons

gt(accessor, value)

Greater than

Examples
gt(number("property.size_m2"), 100)

Set Membership

in(accessor, array(values...))

Value in list. Uses PostgreSQL ANY() for efficient comparison.

Examples
in(string("property.country"), array("France", "Switzerland", "Austria"))
in(number("property.bedrooms"), array(4, 5, 6))

String Pattern Matching

contains(accessor, substring)

Case-insensitive substring match. Compiles to SQL ILIKE '%substring%'.

Examples
contains(string("property.name"), "luxury")
contains(string("room_amenity.question"), "cleaning")

Logical Operators

Combine filters with logical operators for complex queries.

Logical Operators

and(...filters)   # All must match
or(...filters)    # At least one must match
not(filter)       # Negation

# Example: Large chalets in Three Valleys
and(
  eq(string("property.building_type"), "Chalet"),
  eq(string("property.skiarea"), "Three Valleys"),
  ge(number("property.size_m2"), 200)
)

# Example: Complex nested logic
and(
  or(
    eq(string("property.building_type"), "Chalet"),
    eq(string("property.building_type"), "Apartment")
  ),
  ge(number("property.size_m2"), 200),
  not(exists(string("property.shared_facilities")))
)

Sorting and Limiting

Control result ordering and count with sort() and limit() functions.

Sort and Limit

# Sort by field (ascending or descending)
sort(filter, "property.size_m2", "desc")

# Limit results (default: 5, max: 20)
limit(filter, 10)

# Combine: Top 3 largest properties
limit(
  sort(
    gt(number("property.size_m2"), 0),
    "property.size_m2",
    "desc"
  ),
  3
)

Schema Validation

All queries are validated against schema_v1 before execution:

  1. Field existence: Referenced paths must exist in schema
  2. Type compatibility: Type accessor must match schema type
  3. Early failure: Errors caught before database query
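
A minimal sketch of these checks, assuming schema_v1 has been flattened into a map from field paths to types (the real validator is src/dsl/validator.ts, and string() also accepts date/time fields per the accessor table):

// Illustrative validation: reject unknown paths and mismatched accessors
// before any SQL is generated.
type AccessorType = 'string' | 'number' | 'boolean' | 'json';

function validateAccessorSketch(
  schemaFields: Record<string, AccessorType>,
  path: string,
  accessor: AccessorType
): void {
  const fieldType = schemaFields[path];
  // 1. Field existence
  if (fieldType === undefined) {
    throw new Error(`Unknown field "${path}": not present in schema_v1`);
  }
  // 2. Type compatibility
  if (fieldType !== accessor) {
    throw new Error(
      `Type mismatch on "${path}": schema says ${fieldType}, accessor is ${accessor}`
    );
  }
  // 3. Early failure: throwing here means the query never reaches the database.
}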

Performance Optimization

Different operators have different performance characteristics:

Optimized Operators

# ✅ Fast: Uses GIN index
eq(string("property.country"), "France")
exists(string("property.sauna"))
in(string("country"), array("FR", "CH"))

# ⚠️ Slower: Requires JSONB extraction and casting
gt(number("property.size"), 100)  # Sequential scan

# 💡 Optimization tip: Filter first with indexed operators

# ✅ Good: Filter by country first (GIN), then scan for size
and(
  eq(string("property.country"), "France"),  # GIN index
  ge(number("property.size_m2"), 200)        # Sequential scan on subset
)

# ❌ Slower: Size scan first (no index), then filter
and(
  ge(number("property.size_m2"), 200),       # Full table scan
  eq(string("property.country"), "France")
)

Query Routing System

Strategy Selection

The query router analyzes each question using an LLM to determine the optimal retrieval strategy. The decision is based on query characteristics, available schemas, and query complexity.

Retrieval Strategies

Strategy | When Used | Best For | Example Query
DSL | Structured constraints, numeric filters | Precise filtering on known fields | "properties with exactly 5 bedrooms in France"
Vector | Semantic/conceptual queries | Descriptions, open-ended questions | "cozy mountain retreat", "what are the sauna policies?"
Hybrid-Filter | Hard constraint + semantic refinement | MUST be in France, find luxury ones | "luxury" among "bedrooms >= 4" results
Hybrid-Fusion | Mixed structured + semantic | 6+ people with microwave and TV | "properties in Méribel with hot tub"

Decision Criteria

The router considers several factors when selecting a strategy:

Numeric Constraints

Presence of counts, measurements, or comparisons (bedrooms >= 5, price < 500) suggests DSL mode

Structural vs Semantic

Questions about specific fields or values favor DSL; conceptual questions favor vector search

Complexity

Questions combining both structured filters and semantic concepts use hybrid approaches

Schema Availability

DSL queries require matching fields in available schemas; vector search works without schema knowledge

Reciprocal Rank Fusion (RRF)

The hybrid-fusion strategy uses Reciprocal Rank Fusion to combine results from DSL and vector searches. RRF provides a robust way to merge rankings without requiring score normalization.

RRF Algorithm

# RRF Score Formula
RRF(rank) = 1 / (k + rank)

# Where:
# - k = constant (default: 60)
# - rank = position in result set (1-indexed)

# Combined Score:
# - Result appears in DSL ranking:    score_dsl = 1 / (k + rank_dsl)
# - Result appears in vector ranking: score_vec = 1 / (k + rank_vec)
# - Final score = score_dsl + score_vec

# Results are sorted by combined RRF score, highest first
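
The same algorithm in compact TypeScript (an illustrative sketch; the project's implementation lives in src/vector/hybridFusion.ts):

// Fuse two ranked ID lists with Reciprocal Rank Fusion.
function rrfFuse(dslIds: string[], vectorIds: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  const add = (ids: string[]) =>
    ids.forEach((id, i) => {
      // Ranks are 1-indexed, so rank = i + 1; contribution is 1 / (k + rank).
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  add(dslIds);
  add(vectorIds);
  // Sort by combined RRF score, highest first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}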

Processing Pipeline

Planning Phase

The planner uses LLMs to analyze file structure and generate a deterministic processing plan. This phase is critical for understanding data without manual configuration.

Schema Detection

Identifies entities (e.g., 'accommodation', 'amenity'), their attributes, and relationships between entities.

Type Inference

Determines data types for each attribute: string, number, boolean, date, datetime, time, or json.

Header Normalization

Maps messy headers (e.g., 'Bedrooms', 'bedrooms', 'Bed Rms') to normalized field names (e.g., 'bedrooms').

Plan Generation

Creates an ordered list of tool executions (normalize_headers, infer_orientation, segment_text, etc.) with parameters.

Quality Assessment

Assigns a quality grade (A/B/C) based on confidence in schema detection and plan correctness.
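
For orientation, a planner result might look roughly like the object below. The top-level field names follow the Zod schema shown later under Type Safety; the per-tool parameters are invented for illustration.

// Hypothetical shape of a planner result (tool parameters are illustrative;
// only the top-level fields are grounded in plannerResponseSchema).
const examplePlannerResult = {
  plan: [
    { tool: 'infer_orientation', params: { tableIndex: 0 } },
    { tool: 'normalize_headers', params: { tableIndex: 0, mappings: { 'Bed Rms': 'bedrooms' } } },
  ],
  schema_v0: { /* detected entities, attributes, and inferred types */ },
  quality_hypothesis: 'A', // planner confidence: 'A', 'B', or 'C'
};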

Ingestion Phase

The executor runs the plan deterministically, transforming raw data into normalized records.

Execution Tools

normalize_headers(tableIndex: number, mappings: Record<string, string>)

Normalizes column headers according to planner mappings. Maps original headers to normalized attribute IDs.

Parameters
Name | Type | Description
tableIndex | number | Index of table in tables array
mappings | Record<string, string> | Header mappings: original header → normalized ID
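
As a hypothetical illustration, the mappings argument for the messy-header example from the Planning Phase might look like this (header names invented):

// Hypothetical mappings: original header → normalized attribute ID.
const mappings: Record<string, string> = {
  'Bed Rms': 'bedrooms',
  'Size (m²)': 'size_m2',
  'Country': 'country',
};
// The executor would then run normalize_headers as a plan step,
// e.g. normalize_headers(0, mappings).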

Persistence Phase

Normalized records are enriched with embeddings and persisted to the database.

Chunk Enrichment

LLM generates summaries and extracts evidence from raw chunks for better retrieval context

Embedding Generation

Batched embedding generation using OpenAI's text-embedding-3-small model (1024 dimensions)

Database Insertion

Batched inserts with transaction support for atomicity

Index Creation

GIN indexes for JSONB queries, IVFFlat indexes for vector similarity search
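
A simplified sketch of the batched embedding step, assuming the official openai Node client (the batch size of 50 comes from the Performance Considerations section; the function name is hypothetical):

import OpenAI from 'openai';

// Illustrative sketch: embed texts in batches to reduce API round trips.
async function embedInBatches(texts: string[], batchSize = 50): Promise<number[][]> {
  const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    const res = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: batch,
      // text-embedding-3-small supports shortened outputs; 1024 matches
      // the VECTOR(1024) column in normalized_records.
      dimensions: 1024,
    });
    vectors.push(...res.data.map((d) => d.embedding));
  }
  return vectors;
}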

CLI Commands Reference

Database Commands

Database Management

pnpm eden db:migrate

Run database migrations to set up or update the database schema.

Returns
void
Throws
Error if migrations fail or the pgvector extension is not installed

Data Processing

Processing Pipeline

pnpm eden process <file> [--output <dir>] [--force-plan] [--debug]

End-to-end processing: plan generation, ingestion, and persistence in one command.

Parameters
Name | Type | Description
file | string | Path to file (PDF, XLSX, etc.)
--output | string, optional | Output directory for artifacts
--force-plan | boolean, optional | Force regeneration of plan
--debug | boolean, optional | Show detailed progress
Examples
pnpm eden process fixtures/input/accommodations.xlsx
pnpm eden process data.pdf --debug

Query Commands

Query Execution

pnpm eden ask <question> [--file-ids <ids>] [--schema-ids <ids>] [--limit <n>] [--debug]

Ask a natural language question with automatic strategy selection. Returns natural language answer with citations.

Parameters
Name | Type | Description
question | string | Natural language question
--limit | number, optional | Maximum results (default: 20)
--debug | boolean, optional | Show strategy selection reasoning
Examples
pnpm eden ask "Which chalet can host 10 people?"
pnpm eden ask "Do you have accommodations with a jacuzzi?" --limit 5

Testing & Evaluation

Testing Commands

pnpm eden test --input <file> [--output <path>] [--limit <n>] [--debug]

Run a test suite with a list of questions. Generates comprehensive results with statistics and metrics.

Parameters
Name | Type | Description
--input | string | Path to questions JSON file (required)
--output | string, optional | Output path for results
Examples
pnpm eden test --input fixtures/test/test-questions.json

Data Management

Data Operations

pnpm eden repair [--schema <id>] [--file-id <id>] [--batch-size <n>] [--force]

Repair and regenerate embeddings for existing data. Useful after changing embedding models.

Parameters
Name | Type | Description
--schema | string, optional | Specific schema to repair
--force | boolean, optional | Force regeneration even if embeddings exist

LLM Integration

Configuration

Project Eden uses a flexible LLM configuration system that supports multiple providers and models. Configuration is defined in config/llm.json.

LLM Configuration Schema

{ "providers": { "openai": { "apiKeyEnvVar": "OPENAI_API_KEY" }, "groq": { "apiKeyEnvVar": "GROQ_API_KEY" } }, "tasks": { "planner": { "provider": "openai", "model": "gpt-4", "maxOutputTokens": 8000 }, "answer": { "provider": "openai", "model": "gpt-4", "maxOutputTokens": 4000 }, "embed": { "provider": "openai", "model": "text-embedding-3-small" } }, "pricing": { "openai": { "gpt-4": { "inputPerMillion": 30.0, "outputPerMillion": 60.0 }, "text-embedding-3-small": { "perMillion": 0.02 } } }, "temperature": 0.2 }

LLM Tasks

LLM Task Types

Task | Purpose | Model | Output
planner | Schema detection and plan generation | gpt-4 | Plan JSON with schemas
repair | Data repair and validation | gpt-4 | Repaired records
classifier | Query classification | gpt-4 | Retrieval strategy
answer | Answer synthesis | gpt-4 | Natural language answer with citations
embed | Embedding generation | text-embedding-3-small | 1024-dimension vectors

Retry & Error Handling

Project Eden implements robust retry logic for LLM API calls with exponential backoff and configurable retry policies.

Retry Policy

interface RetryPolicy {
  maxRetries: number;        // Default: 3
  initialDelayMs: number;    // Default: 1000
  maxDelayMs: number;        // Default: 10000
  backoffMultiplier: number; // Default: 2
}

// Retries on:
// - Network errors
// - Rate limit errors (429)
// - Server errors (5xx)
// - Timeout errors
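
A minimal helper that applies this policy with exponential backoff (an illustrative sketch, not the project's actual implementation; real code would also check that the error is retryable):

// Retry an async operation with exponential backoff, capped at maxDelayMs.
async function withRetry<T>(
  fn: () => Promise<T>,
  policy: RetryPolicy = {
    maxRetries: 3,
    initialDelayMs: 1000,
    maxDelayMs: 10000,
    backoffMultiplier: 2,
  }
): Promise<T> {
  let delay = policy.initialDelayMs;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= policy.maxRetries) throw err; // retry budget exhausted
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * policy.backoffMultiplier, policy.maxDelayMs);
    }
  }
}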

Testing & Evaluation

Testing Framework

Project Eden includes a comprehensive testing framework for regression testing and quality assurance.

Test Input Format

Questions JSON Format

{ "questions": [ "Can you recommend a chalet for 10 people?", "Which property is best for families?", "Do any chalets offer mountain views?" ] }

Automatic Metrics

The test framework automatically computes several metrics without requiring LLM calls:

Answer Similarity

Levenshtein distance between answers for consistency tracking

Mode Consistency

Tracks which retrieval strategy was used for each question

Performance Benchmarking

Timing breakdown (planning, retrieval, answer generation) and cost tracking

Confidence Distribution

Analysis of high/medium/low confidence answer distribution
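
A sketch of the answer-similarity metric, assuming Levenshtein distance normalized by the length of the longer answer (the harness's exact normalization is not documented here):

// Classic dynamic-programming Levenshtein distance, folded into a
// similarity score in [0, 1] where 1 means identical answers.
function answerSimilarity(a: string, b: string): number {
  const m = a.length;
  const n = b.length;
  if (m === 0 && n === 0) return 1;
  let prev = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    const curr = [i];
    for (let j = 1; j <= n; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,       // deletion
        curr[j - 1] + 1,   // insertion
        prev[j - 1] + cost // substitution
      );
    }
    prev = curr;
  }
  return 1 - prev[n] / Math.max(m, n);
}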

LLM-Based Evaluation

Optional LLM evaluation provides deeper quality assessment (at additional API cost):

Evaluation Dimensions

Dimension | Description | Scale
Correctness | Factual accuracy of the answer | 1-5
Completeness | Whether all relevant information is included | 1-5
Relevance | How well the answer addresses the question | 1-5
Citation Quality | Accuracy and helpfulness of source citations | 1-5

Comparison Reports

The compare command generates detailed comparison reports between test runs, including:

Performance Comparison

Timing deltas, cost differences, and throughput metrics

Answer Consistency

Similarity scores and mode changes between runs

Quality Analysis

LLM evaluation scores with comparative reasoning

Recommendations

Actionable suggestions for improving system performance

Implementation Details

Technology Stack

Core Technologies

Technology | Purpose | Version
TypeScript | Primary language | 5.9.2
Node.js | Runtime | 18+
PostgreSQL | Database | 14+
pgvector | Vector similarity | 0.2.1
OpenAI API | LLM & Embeddings | 5.23.0
Zod | Schema validation | 3.24.1
yargs | CLI framework | 18.0.0
pdfjs-dist | PDF parsing | 5.4.149
xlsx | Excel parsing | 0.20.3

Project Structure

Directory Layout

project-eden/
├── src/
│   ├── cli/                    # CLI commands and UI
│   │   ├── commands/           # Individual command implementations
│   │   ├── ui/                 # Terminal UI components
│   │   └── viewers/            # File viewer implementations
│   ├── db/                     # Database client and migrations
│   ├── dsl/                    # DSL parser and query engine
│   │   ├── compile.ts          # DSL to SQL compiler
│   │   ├── parseJson.ts        # JSON DSL parser
│   │   ├── stringParser.ts     # String DSL parser
│   │   └── validator.ts        # Schema validation
│   ├── executor/               # Plan execution and ingestion
│   │   ├── executePlan.ts      # Tool execution orchestrator
│   │   ├── runIngestion.ts     # Main ingestion pipeline
│   │   └── semantic/           # LLM-powered enrichment
│   ├── llm/                    # LLM client and tasks
│   │   ├── client.ts           # LLM client abstraction
│   │   ├── prompts/            # Prompt templates
│   │   ├── schemas/            # Zod schemas for LLM responses
│   │   └── tasks/              # Task creation utilities
│   ├── loader/                 # File loaders (PDF, XLSX)
│   ├── persist/                # Database persistence layer
│   │   ├── embeddings.ts       # Embedding generation
│   │   └── repositories/       # Data access layer
│   ├── planner/                # Agentic planning system
│   │   ├── runPlanner.ts       # Main planner entry point
│   │   ├── schema.ts           # Schema detection logic
│   │   └── toolCatalog.ts      # Available execution tools
│   ├── router/                 # Query routing and answer generation
│   │   ├── askPipeline.ts      # Main ask command pipeline
│   │   └── executeRetrieval.ts # Strategy execution
│   ├── tools/                  # Execution tools
│   │   ├── normalizeHeaders.ts
│   │   ├── inferOrientation.ts
│   │   ├── segmentText.ts
│   │   └── extractTable.ts
│   └── vector/                 # Vector search implementation
│       ├── vectorSearch.ts
│       ├── hybridFilter.ts
│       └── hybridFusion.ts
├── db/migrations/              # SQL migrations
├── fixtures/                   # Sample data and test files
└── out/                        # Output artifacts

Type Safety

Project Eden is built with strict TypeScript and uses Zod for runtime validation. All LLM responses are validated against Zod schemas before use.

Schema Validation Pattern

// Define Zod schema
const plannerResponseSchema = z.object({
  plan: z.array(plannerToolSchema),
  schema_v0: plannerSchemaV0,
  quality_hypothesis: z.enum(['A', 'B', 'C'])
});

// Validate LLM response
const validatedResponse = plannerResponseSchema.parse(response.data);

// Type-safe usage
type PlannerResponse = z.infer<typeof plannerResponseSchema>;

Error Handling

The system uses custom error types for better error messages and debugging:

Error Types

// DSL validation errors
class DslValidationError extends Error {
  constructor(
    message: string,
    public field?: string,
    public suggestions?: string[]
  ) {
    super(message);
  }
}

// LLM errors
class LlmError extends Error {
  constructor(
    message: string,
    public code: string,
    public retryable: boolean
  ) {
    super(message);
  }
}

// Planner errors
class PlannerError extends Error {
  constructor(
    message: string,
    public qualityGrade: 'A' | 'B' | 'C'
  ) {
    super(message);
  }
}

Performance Considerations

Key performance optimizations:

Batched Embeddings

Embeddings generated in batches (default: 50) to reduce API calls and improve throughput

GIN Indexes

JSONB containment queries use GIN indexes for fast lookups

IVFFlat Indexes

Vector similarity search uses IVFFlat indexes for approximate nearest neighbor search

Parallel Query Execution

Hybrid-fusion executes DSL and vector queries in parallel before merging

Connection Pooling

PostgreSQL connection pooling for efficient database access
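
For instance, the parallel step of hybrid-fusion reduces to a single Promise.all. In this sketch the two search thunks stand in for the real DSL and vector query paths, and rrfFuse is the sketch from the RRF section:

// Run both retrieval paths concurrently, then merge their ranked ID lists.
async function hybridFusionSketch(
  runDsl: () => Promise<string[]>,
  runVector: () => Promise<string[]>
): Promise<string[]> {
  const [dslIds, vectorIds] = await Promise.all([runDsl(), runVector()]);
  return rrfFuse(dslIds, vectorIds);
}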

Limitations & Future Work

Current Limitations

Single Language Support

Currently optimized for English. Multi-language support planned.

PostgreSQL Only

Requires PostgreSQL with pgvector. Additional vector databases planned.

OpenAI Embeddings Only

Uses OpenAI embeddings. Local embedding models (sentence-transformers) planned.

No Incremental Updates

Full re-ingestion required for updates. Delta processing planned.

Limited PDF Support

Basic PDF parsing. Advanced table extraction and complex layouts planned.

Planned Improvements

Multi-Language Support

Support for multiple languages in planning, querying, and answers

Additional Vector Databases

Support for Pinecone, Weaviate, Qdrant, and other vector databases

Local Embedding Models

Integration with sentence-transformers for local embedding generation

Streaming Answers

Stream answers as they're generated for better UX

Web Interface

Browser-based UI for querying and data management

Incremental Updates

Delta processing for updating existing data without full re-ingestion

Advanced PDF Parsing

Better table extraction, image OCR, and complex layout handling

Multi-Modal Support

Support for images, audio, and other non-text content