The first post covered architecture. Here the focus shifts to data: how to generate a realistic product catalog at scale, why description quality matters for RAG, and how the ingestion pipeline embeds everything into ChromaDB.
The pipeline produced 1180 products with rich descriptions, embedded them in 39 seconds, and returned retrieval results that actually held up.
The archetype strategy
Writing 1180 product descriptions by hand is infeasible. Having Claude write them one by one is slow and produces inconsistent output. The solution: archetype-based generation.
An archetype is a product template that defines:
- Brands, series names, and name patterns
- Attribute ranges (weight, capacity, price, materials)
- A description template with placeholders
- Tag pools and use case pools
- Variation logic
Here’s the backpacking tent archetype:
{
  "archetype": "backpacking_tent",
  "category": "tents",
  "brands": ["Big Agnes", "NEMO", "MSR", "REI Co-op", "Marmot"],
  "name_patterns": [
    "{brand} {series} {capacity}-Person Tent",
    "{brand} {series} Ultralight {capacity}P"
  ],
  "series_names": ["Ridgeline", "Summit", "Crestview", "Alpine"],
  "price_range": [179, 499],
  "capacity_range": [1, 4],
  "description_template": "The {name} is a {weight}-pound {season}-season backpacking tent designed for {use_case}. Features a {fly_type} rainfly with {waterproof_rating}mm waterproof coating and {pole_material} poles. The {floor_area} sq ft floor area comfortably fits {capacity} sleeper(s) with {vestibule_count} vestibule(s) for gear storage. Packed size: {packed_size}. {extra_detail}",
  "attribute_ranges": {
    "weight_lbs": [2.2, 5.8],
    "seasons": [3, 4],
    "fly_type": ["full-coverage", "partial-coverage"],
    "pole_material": ["aluminum", "carbon fiber", "DAC Featherlite"],
    "floor_area_sqft": [20, 58],
    "vestibule_count": [1, 2],
    "waterproof_rating_mm": [1200, 3000],
    "packed_size": ["4x15 in", "5x17 in", "6x21 in"]
  },
  "extra_details": [
    "Interior mesh pockets keep essentials organized.",
    "Color-coded pole sleeves allow quick pitching.",
    "Reflective guylines improve nighttime visibility."
  ],
  "tag_pool": ["ultralight", "waterproof", "freestanding", "3-season"],
  "use_case_pool": ["weekend backpacking", "thru-hiking", "alpine camping"]
}
The generate_catalog.py script reads archetypes and generates variations:
import random

def generate_product_from_archetype(archetype: dict) -> dict:
    """Generate a single product variation from an archetype."""
    # Random selections from pools
    brand = random.choice(archetype["brands"])
    series = random.choice(archetype["series_names"])
    capacity = random.randint(*archetype["capacity_range"])
    price = round(random.uniform(*archetype["price_range"]), 2)

    # Fill name pattern
    name_pattern = random.choice(archetype["name_patterns"])
    name = name_pattern.format(brand=brand, series=series, capacity=capacity)

    # Generate attributes from ranges
    attributes = {}
    for key, value_range in archetype["attribute_ranges"].items():
        if isinstance(value_range, list) and isinstance(value_range[0], str):
            # It's a list of options, pick one
            attributes[key] = random.choice(value_range)
        else:
            # It's a numeric range
            if isinstance(value_range[0], int):
                attributes[key] = random.randint(*value_range)
            else:
                attributes[key] = round(random.uniform(*value_range), 1)

    # Fill description template
    template_vars = {
        "name": name,
        "weight": attributes.get("weight_lbs"),
        "season": attributes.get("seasons"),
        "use_case": random.choice(archetype["use_case_pool"]),
        "fly_type": attributes.get("fly_type"),
        "waterproof_rating": attributes.get("waterproof_rating_mm"),
        "pole_material": attributes.get("pole_material"),
        "floor_area": attributes.get("floor_area_sqft"),
        "capacity": capacity,
        "vestibule_count": attributes.get("vestibule_count"),
        "packed_size": attributes.get("packed_size"),
        "extra_detail": random.choice(archetype["extra_details"])
    }
    description = archetype["description_template"].format(**template_vars)

    # Sample tags
    tags = random.sample(archetype["tag_pool"], k=min(5, len(archetype["tag_pool"])))

    return {
        "name": name,
        "brand": brand,
        "category": archetype["category"],
        "subcategory": archetype.get("subcategory"),
        "price": price,
        "description": description,
        "attributes": attributes,
        "tags": tags
    }
Run it 50-60 times per archetype across 20 archetypes → 1180 products with realistic variation in names, specs, and descriptions.
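The driving loop is simple. Here's a minimal sketch of what it can look like; the archetype directory layout and the exact per-archetype counts are assumptions, not the script's verbatim code:

import json
import random
from pathlib import Path

def generate_catalog(archetype_dir: str = "data/archetypes") -> list[dict]:
    """Generate the full catalog by sampling each archetype 50-60 times."""
    products = []
    for path in sorted(Path(archetype_dir).glob("*.json")):  # hypothetical layout
        archetype = json.loads(path.read_text())
        for _ in range(random.randint(50, 60)):
            products.append(generate_product_from_archetype(archetype))
    return products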
Why description quality matters
Vector embeddings capture semantic meaning. If descriptions are thin or generic, embeddings can’t differentiate between products.
Bad description:
“Great tent. Very light. Good for camping.”
Good description:
“The Big Agnes Ridgeline 2-Person Tent is a 2.8-pound 3-season backpacking tent designed for weekend backpacking. Features a full-coverage rainfly with 2000mm waterproof coating and aluminum poles. The 28 sq ft floor area comfortably fits 2 sleepers with 1 vestibule for gear storage. Packed size: 5x17 in. Reflective guylines improve nighttime visibility.”
The good description gives the embedding model something to work with:
- Specific weight (2.8 pounds)
- Season rating (3-season)
- Use case (weekend backpacking)
- Technical specs (2000mm waterproof, aluminum poles)
- Concrete dimensions (28 sq ft, 5x17 in packed)
When a user searches “lightweight tent for weekend trips,” the embedding model can match “2.8-pound” with “lightweight” and “weekend backpacking” with “weekend trips” even though the exact words don’t overlap. That’s semantic similarity working.
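This is easy to sanity-check with the same local embedding model the pipeline uses. A minimal sketch (illustrative; exact scores will vary, but the richer description should rank higher):

import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

def cosine(a: list[float], b: list[float]) -> float:
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = embed_model.get_query_embedding("lightweight tent for weekend trips")
thin = embed_model.get_text_embedding("Great tent. Very light. Good for camping.")
rich = embed_model.get_text_embedding(
    "The Big Agnes Ridgeline 2-Person Tent is a 2.8-pound 3-season "
    "backpacking tent designed for weekend backpacking."
)
# Expect the rich description to score meaningfully higher than the thin one
print(f"thin: {cosine(query, thin):.4f}, rich: {cosine(query, rich):.4f}")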
The archetype templates enforce this quality automatically. Every generated product has 2-4 sentences with specific attributes embedded in natural language.
The database schema
Products live in Neon (Postgres):
CREATE TABLE products (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    brand TEXT NOT NULL,
    category TEXT NOT NULL,
    subcategory TEXT,
    price NUMERIC(10,2) NOT NULL,
    description TEXT NOT NULL,  -- This is the primary RAG input
    attributes JSONB,           -- {"weight_lbs": 3.2, "seasons": 3, ...}
    tags TEXT[],                -- ["waterproof", "lightweight", ...]
    image_url TEXT,             -- Placeholder image
    created_at TIMESTAMPTZ DEFAULT now()
);

-- For keyword search (built in Phase 3)
ALTER TABLE products ADD COLUMN search_vector tsvector
    GENERATED ALWAYS AS (
        to_tsvector('english',
            coalesce(name, '') || ' ' ||
            coalesce(brand, '') || ' ' ||
            coalesce(description, '') || ' ' ||
            coalesce(category, '')
        )
    ) STORED;

CREATE INDEX idx_products_search ON products USING GIN(search_vector);
CREATE INDEX idx_products_category ON products(category);
Why JSONB for attributes? Different product types have different attributes. Tents have vestibule_count and floor_area_sqft. Sleeping bags have fill_power and temp_rating_f. Storing these in JSONB keeps the schema flexible without creating dozens of nullable columns.
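As a concrete example, a JSONB attribute can still be filtered efficiently at query time. Here's a sketch reusing the async-session pattern from the ingestion script; the helper name is hypothetical, but the cast syntax is standard Postgres:

from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

async def tents_under_weight(session: AsyncSession, max_lbs: float) -> list[dict]:
    """Find tents lighter than max_lbs by casting the JSONB attribute."""
    result = await session.execute(
        text(
            "SELECT name, price, (attributes->>'weight_lbs')::numeric AS weight_lbs "
            "FROM products "
            "WHERE category = 'tents' "
            "AND (attributes->>'weight_lbs')::numeric < :max_lbs"
        ),
        {"max_lbs": max_lbs},
    )
    return [dict(row) for row in result.mappings().all()]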
Why store description separately from attributes? The description is natural language text optimized for embeddings. Attributes are structured data optimized for filtering and display. Keeping them separate makes both more useful.
The ingestion pipeline
backend/scripts/ingest.py reads products from Neon, embeds them, and stores vectors in ChromaDB.
Step 1: Fetch products
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession, async_sessionmaker, create_async_engine

async def fetch_all_products():
    """Fetch all products from Neon."""
    # DATABASE_URL is the Neon connection string, loaded elsewhere in the script
    engine = create_async_engine(DATABASE_URL)
    async_session = async_sessionmaker(engine, class_=AsyncSession)
    async with async_session() as session:
        result = await session.execute(text("SELECT * FROM products"))
        products = [dict(row) for row in result.mappings().all()]
    await engine.dispose()
    return products
Returns a list of product dicts with all fields.
Step 2: Build LlamaIndex Documents
A Document is LlamaIndex’s unit of text that gets embedded. For each product, combine the important fields into one string:
from llama_index.core import Document

def build_document_text(product: dict) -> str:
    """Combine product fields into a single text string for embedding."""
    parts = [
        f"{product['name']} by {product['brand']}.",
        product['description'],
        f"Category: {product['category']}.",
    ]
    if product.get('tags'):
        parts.append(f"Tags: {', '.join(product['tags'])}.")
    if product.get('attributes'):
        # Flatten JSONB attributes into readable text
        attr_strings = [
            f"{k.replace('_', ' ')}: {v}"
            for k, v in product['attributes'].items()
        ]
        parts.append(f"Specifications: {'. '.join(attr_strings)}.")
    return " ".join(parts)

def create_documents(products: list[dict]) -> list[Document]:
    """Convert product dicts to LlamaIndex Documents."""
    documents = []
    for product in products:
        doc = Document(
            text=build_document_text(product),
            metadata={
                "product_id": str(product['id']),
                "price": float(product['price']),
                "category": product['category'],
                "brand": product['brand'],
            }
        )
        documents.append(doc)
    return documents
Why combine multiple fields? The embedding model needs a single string. We smoosh name, description, tags, and flattened attributes together so the vector captures everything semantically meaningful about the product.
Why include attributes in the text? A user might search for “tent under 4 pounds.” If weight isn’t in the embedded text, the vector won’t capture it and retrieval will miss. Flattening {"weight_lbs": 3.2} into "weight lbs: 3.2" makes it retrievable.
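To make that concrete, here's what build_document_text produces for a hypothetical product (the values are invented for illustration):

product = {
    "name": "NEMO Summit 2-Person Tent",
    "brand": "NEMO",
    "category": "tents",
    "description": "The NEMO Summit 2-Person Tent is a 3.1-pound 3-season backpacking tent...",
    "tags": ["ultralight", "freestanding"],
    "attributes": {"weight_lbs": 3.1, "fly_type": "full-coverage"},
}
print(build_document_text(product))
# NEMO Summit 2-Person Tent by NEMO. The NEMO Summit 2-Person Tent is a
# 3.1-pound 3-season backpacking tent... Category: tents. Tags: ultralight,
# freestanding. Specifications: weight lbs: 3.1. fly type: full-coverage.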
What is metadata? Fields stored alongside the vector but NOT embedded. Used for:
- Post-retrieval identification (product_id to fetch from Neon)
- Filtering (not used in v1, but common in production)
Metadata is not part of the semantic meaning; it's just bookkeeping.
Step 3: Index into ChromaDB
import chromadb
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

def create_index(documents: list[Document]) -> VectorStoreIndex:
    """Create ChromaDB index from documents."""
    # Create ChromaDB client with persistent storage
    chroma_client = chromadb.PersistentClient(path="./chroma_db")

    # Delete existing collection if it exists (clean re-runs)
    try:
        chroma_client.delete_collection(name="ozark_ridge_products")
    except Exception:
        # Collection doesn't exist yet; nothing to delete
        pass

    # Create collection
    chroma_collection = chroma_client.create_collection("ozark_ridge_products")

    # Wrap in LlamaIndex's ChromaVectorStore
    vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    # Build the index - embeds all documents and stores in ChromaDB
    print(f"Indexing {len(documents)} products...")
    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )
    print(f"✓ Indexed {len(documents)} products into ChromaDB")
    return index
What happens during VectorStoreIndex.from_documents():
- Takes each Document’s text
- Calls bge-small-en-v1.5 to embed it (384-dimensional vector)
- Stores vector + metadata in ChromaDB
- Shows a progress bar
First run downloads the model (~130MB), then caches it. Subsequent runs use the cached model. 1180 products embedded in 39 seconds on CPU.
Configuration
Before any of this runs, configure LlamaIndex settings:
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Use local embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# We're calling Claude directly, not through LlamaIndex
Settings.llm = None
Why Settings.llm = None? LlamaIndex has built-in LLM integrations, but we want full control over prompts. Setting this to None prevents LlamaIndex from trying to use its default LLM.
Running the pipeline
cd backend
python scripts/ingest.py
Output:
Starting ingestion pipeline...
1. Fetching products from database...
✓ Fetched 1180 products
2. Converting products to LlamaIndex Documents...
✓ Created 1180 documents
3. Creating ChromaDB index...
Indexing 1180 products...
Generating embeddings: 100% |████████| 1180/1180 [00:39<00:00, 30.16it/s]
✓ Indexed 1180 products into ChromaDB
Ingestion pipeline completed successfully.
ChromaDB persisted to ./chroma_db directory.
The chroma_db/ directory now contains the persisted vector index. Future runs of the application load this index without re-embedding everything.
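Loading the persisted index from the application side looks roughly like this, following the same APIs used above (a sketch, not the app's verbatim startup code):

import chromadb
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

# Same embedding config as ingestion: query vectors must match document vectors
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
Settings.llm = None

chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_collection("ozark_ridge_products")
vector_store = ChromaVectorStore(chroma_collection=collection)

# Wrap the existing vectors without re-embedding anything
index = VectorStoreIndex.from_vector_store(vector_store)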
Testing retrieval
Before building the API endpoint, test that retrieval actually works with backend/scripts/test_retrieval.py:
from llama_index.core import VectorStoreIndex

def test_query(index: VectorStoreIndex, query: str, top_k: int = 5):
    """Run a test query and print results."""
    retriever = index.as_retriever(similarity_top_k=top_k)
    results = retriever.retrieve(query)
    print(f"\nQuery: '{query}'")
    print(f"Top {len(results)} results:\n")
    for i, result in enumerate(results, 1):
        print(f"{i}. Score: {result.score:.4f}")
        print(f"   Product ID: {result.metadata.get('product_id')}")
        print(f"   Category: {result.metadata.get('category')}")
        print(f"   Text: {result.text[:150]}...\n")
Results for “waterproof 2-person tent”:
1. Score: 0.6646
   Product ID: 0c167f77-8985-4f4c-8123-86531198b69a
   Category: tents
   Text: Kelty Horizon 2-Person Tent by Kelty. The Kelty Horizon 2-Person
   Tent is a 2.4-pound 3-season backpacking tent designed for bike
   touring...

2. Score: 0.6468
   Product ID: 22356b8c-90f9-44b1-9182-75512d44c3af
   Category: tents
   Text: Big Agnes Crestview 2-Person Backpacking Tent by Big Agnes. The
   Big Agnes Crestview 2-Person Backpacking Tent is a 2.6-pound
   4-season...
Scores of 0.64-0.67 are solid matches. The top results are actually 2-person tents. Retrieval is working.
What makes this different from tutorials
Most RAG tutorials skip the data layer entirely or use a handful of manually-written documents. This project makes different choices:
Archetype-based generation scales. Writing 1180 product descriptions by hand is infeasible. Generating them one by one with Claude is slow and inconsistent. Archetypes with variation logic produce realistic, diverse products at scale with no manual writing.
Description quality is non-negotiable. Thin descriptions produce thin embeddings. The archetype templates enforce 2-4 sentence descriptions with specific attributes in natural language, which is what makes semantic search work.
Attributes are embedded as text. Storing {"weight_lbs": 3.2} in JSONB is correct for the database, but it’s useless for embeddings. Flattening it to "weight lbs: 3.2" in the document text makes it retrievable.
The ingestion pipeline is separate from the app. ingest.py is a script you run manually, not part of the FastAPI server. This is the correct production pattern — indexing is expensive, you don’t rebuild it on every server restart.
What I’d change
Add re-indexing logic. Right now, re-running ingest.py deletes the old collection and rebuilds from scratch. In production, you’d want upsert logic: update changed products, add new ones, remove deleted ones. ChromaDB supports this with document IDs.
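A minimal upsert sketch, keyed on the stable Postgres UUID. Note this writes raw records and bypasses LlamaIndex's node bookkeeping, so treat it as an illustration of the ChromaDB API rather than a drop-in for this index:

import chromadb
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def upsert_products(changed: list[dict]) -> None:
    """Re-embed only the products that changed, keyed by product ID."""
    client = chromadb.PersistentClient(path="./chroma_db")
    collection = client.get_collection("ozark_ridge_products")
    embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

    # Reuse build_document_text from Step 2 of the ingestion pipeline
    texts = [build_document_text(p) for p in changed]
    collection.upsert(
        ids=[str(p["id"]) for p in changed],  # stable IDs make re-runs idempotent
        embeddings=[embed_model.get_text_embedding(t) for t in texts],
        documents=texts,
        metadatas=[{"product_id": str(p["id"]), "category": p["category"]} for p in changed],
    )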
Track indexing timestamps. Store when each product was last indexed. If a product’s description changed in Neon but wasn’t re-indexed, retrieval returns stale content. Tracking timestamps makes this visible.
Add a dry-run mode. Let generate_catalog.py print what it would generate without writing to the database. Useful for verifying archetype changes before committing them.
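For instance (a hypothetical flag; generate_catalog.py's actual CLI isn't shown here):

import argparse
import json

parser = argparse.ArgumentParser()
parser.add_argument("--dry-run", action="store_true",
                    help="print a sample of generated products instead of writing to Neon")
args = parser.parse_args()

products = generate_catalog()  # the driving loop sketched earlier
if args.dry_run:
    print(json.dumps(products[:5], indent=2))  # preview a handful, write nothing
else:
    write_products_to_neon(products)  # hypothetical existing writer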
The next step
The catalog is built, the ingestion pipeline works, and retrieval returns relevant products. Next up: putting this semantic search head-to-head with plain keyword search.
Series navigation
Previous: Building AI Search for a Retail Website: The Stack and Why
Next: Keyword Search vs Semantic Search
Source code
Full project: github.com/tylerwellss/ozark-ridge
LlamaIndex docs: docs.llamaindex.ai
ChromaDB docs: docs.trychroma.com