The first post covered architecture. Here the focus shifts to data: how to generate a realistic product catalog at scale, why description quality matters for RAG, and how the ingestion pipeline embeds everything into ChromaDB.
The pipeline produced 1180 products with rich descriptions, embedded them in 39 seconds, and returned retrieval results that actually held up.
The archetype strategy
Writing 1180 product descriptions by hand is infeasible. Having Claude write them one by one is slow and produces inconsistent output. The solution: archetype-based generation.
An archetype is a product template that defines:
- Brands, series names, and name patterns
- Attribute ranges (weight, capacity, price, materials)
- A description template with placeholders
- Tag pools and use case pools
- Variation logic
Here’s the backpacking tent archetype:
{
  "archetype": "backpacking_tent",
  "category": "tents",
  "brands": ["Big Agnes", "NEMO", "MSR", "REI Co-op", "Marmot"],
  "name_patterns": [
    "{brand} {series} {capacity}-Person Tent",
    "{brand} {series} Ultralight {capacity}P"
  ],
  "series_names": ["Ridgeline", "Summit", "Crestview", "Alpine"],
  "price_range": [179, 499],
  "capacity_range": [1, 4],
  "description_template": "The {name} is a {weight}-pound {season}-season backpacking tent designed for {use_case}. Features a {fly_type} rainfly with {waterproof_rating}mm waterproof coating and {pole_material} poles. The {floor_area} sq ft floor area comfortably fits {capacity} sleeper(s) with {vestibule_count} vestibule(s) for gear storage. Packed size: {packed_size}. {extra_detail}",
  "attribute_ranges": {
    "weight_lbs": [2.2, 5.8],
    "seasons": [3, 4],
    "fly_type": ["full-coverage", "partial-coverage"],
    "pole_material": ["aluminum", "carbon fiber", "DAC Featherlite"],
    "floor_area_sqft": [20, 58],
    "vestibule_count": [1, 2],
    "waterproof_rating_mm": [1200, 3000],
    "packed_size": ["4x15 in", "5x17 in", "6x21 in"]
  },
  "extra_details": [
    "Interior mesh pockets keep essentials organized.",
    "Color-coded pole sleeves allow quick pitching.",
    "Reflective guylines improve nighttime visibility."
  ],
  "tag_pool": ["ultralight", "waterproof", "freestanding", "3-season"],
  "use_case_pool": ["weekend backpacking", "thru-hiking", "alpine camping"]
}
The generate_catalog.py script reads archetypes and generates variations:
import random

def generate_product_from_archetype(archetype: dict) -> dict:
    """Generate a single product variation from an archetype."""
    # Random selections from pools
    brand = random.choice(archetype["brands"])
    series = random.choice(archetype["series_names"])
    capacity = random.randint(*archetype["capacity_range"])
    price = round(random.uniform(*archetype["price_range"]), 2)

    # Fill name pattern
    name_pattern = random.choice(archetype["name_patterns"])
    name = name_pattern.format(brand=brand, series=series, capacity=capacity)

    # Generate attributes from ranges
    attributes = {}
    for key, value_range in archetype["attribute_ranges"].items():
        if isinstance(value_range, list) and isinstance(value_range[0], str):
            # It's a list of options, pick one
            attributes[key] = random.choice(value_range)
        else:
            # It's a numeric range
            if isinstance(value_range[0], int):
                attributes[key] = random.randint(*value_range)
            else:
                attributes[key] = round(random.uniform(*value_range), 1)

    # Fill description template
    template_vars = {
        "name": name,
        "weight": attributes.get("weight_lbs"),
        "season": attributes.get("seasons"),
        "use_case": random.choice(archetype["use_case_pool"]),
        "fly_type": attributes.get("fly_type"),
        "waterproof_rating": attributes.get("waterproof_rating_mm"),
        "pole_material": attributes.get("pole_material"),
        "floor_area": attributes.get("floor_area_sqft"),
        "capacity": capacity,
        "vestibule_count": attributes.get("vestibule_count"),
        "packed_size": attributes.get("packed_size"),
        "extra_detail": random.choice(archetype["extra_details"])
    }
    description = archetype["description_template"].format(**template_vars)

    # Sample tags
    tags = random.sample(archetype["tag_pool"], k=min(5, len(archetype["tag_pool"])))

    return {
        "name": name,
        "brand": brand,
        "category": archetype["category"],
        "subcategory": archetype.get("subcategory"),
        "price": price,
        "description": description,
        "attributes": attributes,
        "tags": tags
    }
Run it 50-60 times per archetype across 20 archetypes → 1180 products with realistic variation in names, specs, and descriptions.
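The driving loop is simple. Here's a minimal sketch of what it can look like; the archetype directory layout and the exact per-archetype counts are assumptions, not the script's verbatim code:

import json
import random
from pathlib import Path

def generate_catalog(archetype_dir: str = "data/archetypes") -> list[dict]:
    """Generate the full catalog by sampling each archetype 50-60 times."""
    products = []
    for path in sorted(Path(archetype_dir).glob("*.json")):  # hypothetical layout
        archetype = json.loads(path.read_text())
        for _ in range(random.randint(50, 60)):
            products.append(generate_product_from_archetype(archetype))
    return products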
Why description quality matters
Vector embeddings capture semantic meaning. If descriptions are thin or generic, embeddings can’t differentiate between products.
Bad description:
“Great tent. Very light. Good for camping.”
Good description:
“The Big Agnes Ridgeline 2-Person Tent is a 2.8-pound 3-season backpacking tent designed for weekend backpacking. Features a full-coverage rainfly with 2000mm waterproof coating and aluminum poles. The 28 sq ft floor area comfortably fits 2 sleepers with 1 vestibule for gear storage. Packed size: 5x17 in. Reflective guylines improve nighttime visibility.”
The good description gives the embedding model something to work with:
- Specific weight (2.8 pounds)
- Season rating (3-season)
- Use case (weekend backpacking)
- Technical specs (2000mm waterproof, aluminum poles)
- Concrete dimensions (28 sq ft, 5x17 in packed)
When a user searches “lightweight tent for weekend trips,” the embedding model can match “2.8-pound” with “lightweight” and “weekend backpacking” with “weekend trips” even though the exact words don’t overlap. That’s semantic similarity working.
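This is easy to sanity-check with the same local embedding model the pipeline uses. A minimal sketch (illustrative; exact scores will vary, but the richer description should rank higher):

import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

def cosine(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed_model.get_query_embedding("lightweight tent for weekend trips")
thin = embed_model.get_text_embedding("Great tent. Very light. Good for camping.")
rich = embed_model.get_text_embedding(
    "The Big Agnes Ridgeline 2-Person Tent is a 2.8-pound 3-season "
    "backpacking tent designed for weekend backpacking."
)
# Expect the rich description to score meaningfully higher than the thin one
print(f"thin: {cosine(query, thin):.4f}, rich: {cosine(query, rich):.4f}")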
The archetype templates enforce this quality automatically. Every generated product has 2-4 sentences with specific attributes embedded in natural language.
The database schema
Products live in Neon (Postgres):
CREATE TABLE products (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    brand TEXT NOT NULL,
    category TEXT NOT NULL,
    subcategory TEXT,
    price NUMERIC(10,2) NOT NULL,
    description TEXT NOT NULL,  -- This is the primary RAG input
    attributes JSONB,           -- {"weight_lbs": 3.2, "seasons": 3, ...}
    tags TEXT[],                -- ["waterproof", "lightweight", ...]
    image_url TEXT,             -- Placeholder image
    created_at TIMESTAMPTZ DEFAULT now()
);

-- For keyword search (built in Phase 3)
ALTER TABLE products ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(name, '') || ' ' ||
            coalesce(brand, '') || ' ' ||
            coalesce(description, '') || ' ' ||
            coalesce(category, '')
        )
    ) STORED;

CREATE INDEX idx_products_search ON products USING GIN(search_vector);
CREATE INDEX idx_products_category ON products(category);
Why JSONB for attributes? Different product types have different attributes. Tents have vestibule_count and floor_area_sqft. Sleeping bags have fill_power and temp_rating_f. Storing these in JSONB keeps the schema flexible without creating dozens of nullable columns.
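As a concrete example, a JSONB attribute can still be filtered efficiently at query time. Here's a sketch reusing the async-session pattern from the ingestion script; the helper name is hypothetical, but the cast syntax is standard Postgres:

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

async def tents_under_weight(session: AsyncSession, max_lbs: float) -> list[dict]:
    """Find tents lighter than max_lbs by casting the JSONB attribute."""
    result = await session.execute(
        text(
            "SELECT name, price, (attributes->>'weight_lbs')::numeric AS weight_lbs "
            "FROM products "
            "WHERE category = 'tents' "
            "AND (attributes->>'weight_lbs')::numeric < :max_lbs"
        ),
        {"max_lbs": max_lbs},
    )
    return [dict(row) for row in result.mappings().all()]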
Why store description separately from attributes? The description is natural language text optimized for embeddings. Attributes are structured data optimized for filtering and display. Keeping them separate makes both more useful.
The ingestion pipeline
backend/scripts/ingest.py reads products from Neon, embeds them, and stores vectors in ChromaDB.
Step 1: Fetch products
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

async def fetch_all_products():
    """Fetch all products from Neon."""
    # DATABASE_URL is the Neon connection string, loaded elsewhere in the script
    engine = create_async_engine(DATABASE_URL)
    async_session = async_sessionmaker(engine, class_=AsyncSession)
    async with async_session() as session:
        result = await session.execute(text("SELECT * FROM products"))
        products = [dict(row) for row in result.mappings().all()]
    await engine.dispose()
    return products
Returns a list of product dicts with all fields.
Step 2: Build LlamaIndex Documents
A Document is LlamaIndex’s unit of text that gets embedded. For each product, combine the important fields into one string:
from llama_index.core import Document

def build_document_text(product: dict) -> str:
    """Combine product fields into a single text string for embedding."""
    parts = [
        f"{product['name']} by {product['brand']}.",
        product['description'],
        f"Category: {product['category']}.",
    ]
    if product.get('tags'):
        parts.append(f"Tags: {', '.join(product['tags'])}.")
    if product.get('attributes'):
        # Flatten JSONB attributes into readable text
        attr_strings = [
            f"{k.replace('_', ' ')}: {v}"
            for k, v in product['attributes'].items()
        ]
        parts.append(f"Specifications: {'. '.join(attr_strings)}.")
    return " ".join(parts)

def create_documents(products: list[dict]) -> list[Document]:
    """Convert product dicts to LlamaIndex Documents."""
    documents = []
    for product in products:
        doc = Document(
            text=build_document_text(product),
            metadata={
                "product_id": str(product['id']),
                "price": float(product['price']),
                "category": product['category'],
                "brand": product['brand'],
            }
        )
        documents.append(doc)
    return documents
Why combine multiple fields? The embedding model needs a single string. We smoosh name, description, tags, and flattened attributes together so the vector captures everything semantically meaningful about the product.
Why include attributes in the text? A user might search for “tent under 4 pounds.” If weight isn’t in the embedded text, the vector won’t capture it and retrieval will miss. Flattening {"weight_lbs": 3.2} into "weight lbs: 3.2" makes it retrievable.
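To make that concrete, here's what build_document_text produces for a hypothetical product (the values are invented for illustration):

product = {
    "name": "NEMO Summit 2-Person Tent",
    "brand": "NEMO",
    "category": "tents",
    "description": "The NEMO Summit 2-Person Tent is a 3.1-pound 3-season backpacking tent...",
    "tags": ["ultralight", "freestanding"],
    "attributes": {"weight_lbs": 3.1, "fly_type": "full-coverage"},
}
print(build_document_text(product))
# NEMO Summit 2-Person Tent by NEMO. The NEMO Summit 2-Person Tent is a
# 3.1-pound 3-season backpacking tent... Category: tents. Tags: ultralight,
# freestanding. Specifications: weight lbs: 3.1. fly type: full-coverage.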
What is metadata? Fields stored alongside the vector but NOT embedded. Used for:
- Post-retrieval identification (product_id to fetch from Neon)
- Filtering (not used in v1, but common in production)
Metadata is not part of the semantic meaning; it's just bookkeeping.
Step 3: Index into ChromaDB
import chromadb
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

def create_index(documents: list[Document]) -> VectorStoreIndex:
    """Create ChromaDB index from documents."""
    # Create ChromaDB client with persistent storage
    chroma_client = chromadb.PersistentClient(path="./chroma_db")

    # Delete existing collection if it exists (clean re-runs)
    try:
        chroma_client.delete_collection(name="ozark_ridge_products")
    except Exception:
        # Collection doesn't exist yet; nothing to delete
        pass

    # Create collection
    chroma_collection = chroma_client.create_collection("ozark_ridge_products")

    # Wrap in LlamaIndex's ChromaVectorStore
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build the index - embeds all documents and stores in ChromaDB
    print(f"Indexing {len(documents)} products...")
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
    print(f"✓ Indexed {len(documents)} products into ChromaDB")
    return index
What happens during VectorStoreIndex.from_documents():
- Takes each Document’s text
- Calls bge-small-en-v1.5 to embed it (384-dimensional vector)
- Stores vector + metadata in ChromaDB
- Shows a progress bar
First run downloads the model (~130MB), then caches it. Subsequent runs use the cached model. 1180 products embedded in 39 seconds on CPU.
Configuration
Before any of this runs, configure LlamaIndex settings:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use local embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# We're calling Claude directly, not through LlamaIndex
Settings.llm = None
Why Settings.llm = None? LlamaIndex has built-in LLM integrations, but we want full control over prompts. Setting this to None prevents LlamaIndex from trying to use its default LLM.
Running the pipeline
cd backend
python scripts/ingest.py
Output:
Starting ingestion pipeline...
1. Fetching products from database...
✓ Fetched 1180 products
2. Converting products to LlamaIndex Documents...
✓ Created 1180 documents
3. Creating ChromaDB index...
Indexing 1180 products...
Generating embeddings: 100% |████████| 1180/1180 [00:39<00:00, 30.16it/s]
✓ Indexed 1180 products into ChromaDB
Ingestion pipeline completed successfully.
ChromaDB persisted to ./chroma_db directory.
The chroma_db/ directory now contains the persisted vector index. Future runs of the application load this index without re-embedding everything.
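Loading the persisted index from the application side looks roughly like this, following the same APIs used above (a sketch, not the app's verbatim startup code):

import chromadb
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Same embedding config as ingestion: query vectors must match document vectors
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None

chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("ozark_ridge_products")
vector_store = ChromaVectorStore(chroma_collection=collection)

# Wrap the existing vectors without re-embedding anything
index = VectorStoreIndex.from_vector_store(vector_store)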
Testing retrieval
Before building the API endpoint, test that retrieval actually works with backend/scripts/test_retrieval.py:
from llama_index.core import VectorStoreIndex

def test_query(index: VectorStoreIndex, query: str, top_k: int = 5):
    """Run a test query and print results."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    results = retriever.retrieve(query)
    print(f"\nQuery: '{query}'")
    print(f"Top {len(results)} results:\n")
    for i, result in enumerate(results, 1):
        print(f"{i}. Score: {result.score:.4f}")
        print(f"   Product ID: {result.metadata.get('product_id')}")
        print(f"   Category: {result.metadata.get('category')}")
        print(f"   Text: {result.text[:150]}...\n")
Results for “waterproof 2-person tent”:
1. Score: 0.6646
   Product ID: 0c167f77-8985-4f4c-8123-86531198b69a
   Category: tents
   Text: Kelty Horizon 2-Person Tent by Kelty. The Kelty Horizon 2-Person
   Tent is a 2.4-pound 3-season backpacking tent designed for bike
   touring...

2. Score: 0.6468
   Product ID: 22356b8c-90f9-44b1-9182-75512d44c3af
   Category: tents
   Text: Big Agnes Crestview 2-Person Backpacking Tent by Big Agnes. The
   Big Agnes Crestview 2-Person Backpacking Tent is a 2.6-pound
   4-season...
Scores of 0.64-0.67 are solid matches. The top results are actually 2-person tents. Retrieval is working.
What makes this different from tutorials
Most RAG tutorials skip the data layer entirely or use a handful of manually-written documents. This project makes different choices:
Archetype-based generation scales. Writing 1180 product descriptions by hand is infeasible. Generating them one by one with Claude is slow and inconsistent. Archetypes with variation logic produce realistic, diverse products at scale with no manual writing.
Description quality is non-negotiable. Thin descriptions produce thin embeddings. The archetype templates enforce 2-4 sentence descriptions with specific attributes in natural language, which is what makes semantic search work.
Attributes are embedded as text. Storing {"weight_lbs": 3.2} in JSONB is correct for the database, but it’s useless for embeddings. Flattening it to "weight lbs: 3.2" in the document text makes it retrievable.
The ingestion pipeline is separate from the app. ingest.py is a script you run manually, not part of the FastAPI server. This is the correct production pattern — indexing is expensive, you don’t rebuild it on every server restart.
What I’d change
Add re-indexing logic. Right now, re-running ingest.py deletes the old collection and rebuilds from scratch. In production, you’d want upsert logic: update changed products, add new ones, remove deleted ones. ChromaDB supports this with document IDs.
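A minimal upsert sketch, keyed on the stable Postgres UUID. Note this writes raw records and bypasses LlamaIndex's node bookkeeping, so treat it as an illustration of the ChromaDB API rather than a drop-in for this index:

import chromadb
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def upsert_products(changed: list[dict]) -> None:
    """Re-embed only the products that changed, keyed by product ID."""
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection("ozark_ridge_products")
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # Reuse build_document_text from Step 2 of the ingestion pipeline
    texts = [build_document_text(p) for p in changed]
    collection.upsert(
        ids=[str(p["id"]) for p in changed],  # stable IDs make re-runs idempotent
        embeddings=[embed_model.get_text_embedding(t) for t in texts],
        documents=texts,
        metadatas=[{"product_id": str(p["id"]), "category": p["category"]} for p in changed],
    )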
Track indexing timestamps. Store when each product was last indexed. If a product’s description changed in Neon but wasn’t re-indexed, retrieval returns stale content. Tracking timestamps makes this visible.
Add a dry-run mode. Let generate_catalog.py print what it would generate without writing to the database. Useful for verifying archetype changes before committing them.
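For instance (a hypothetical flag; generate_catalog.py's actual CLI isn't shown here):

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
                    help="print a sample of generated products instead of writing to Neon")
args = parser.parse_args()

products = generate_catalog()  # the driving loop sketched earlier
if args.dry_run:
    print(json.dumps(products[:5], indent=2))  # preview a handful, write nothing
else:
    write_products_to_neon(products)  # hypothetical existing writer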
The next step
The catalog is built, the ingestion pipeline works, and retrieval returns relevant products. Next up: putting this semantic search head-to-head with plain keyword search.
Series navigation
Previous: Building AI Search for a Retail Website: The Stack and Why
Next: Keyword Search vs Semantic Search
Source code
Full project: github.com/tylerwellss/ozark-ridge
LlamaIndex docs: docs.llamaindex.ai
ChromaDB docs: docs.trychroma.com