This is the final post in the series. The first four covered what I built and how. This one covers what I learned, what I’d do differently, and why this architecture matters beyond the demo.


What worked

Archetype-based catalog generation scaled cleanly. Writing 1180 product descriptions by hand would have been infeasible. Generating them one-by-one with Claude would have been slow and inconsistent. The archetype system with variation logic produced realistic, diverse products at scale with no manual writing and consistent quality across the catalog.

The key decision: enforcing 2-4 sentence descriptions with specific attributes embedded in natural language. That’s what made retrieval work at 0.65+ similarity scores instead of 0.4-0.5 noise. Description quality is the ceiling on retrieval quality — no amount of tuning top-k or choosing better embedding models fixes thin data.

Separation of concerns made the system flexible. Neon as source of truth, ChromaDB as index, LlamaIndex as abstraction, FastAPI as orchestration. Each piece can be swapped independently. ChromaDB could become Pinecone with one configuration change. bge-small-en-v1.5 could become OpenAI embeddings with three lines of code. The relational schema in Neon is independent of the vector store entirely.
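
For illustration, the swap looks roughly like this with LlamaIndex's Settings API (the exact wiring in the project may differ):

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Current setup: local model, zero API calls
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Hosted alternative: swap the assignment (requires OPENAI_API_KEY)
# from llama_index.embeddings.openai import OpenAIEmbedding
# Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")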

This matters for portfolio projects because interviewers ask “how would this work in production?” The answer: swap ChromaDB for a hosted vector store, add re-indexing logic for changed products, scale horizontally with more FastAPI instances. The architecture already supports it.

The keyword vs. AI comparison was the right framing. Building both and running the same queries through each produced the clearest demonstration of value. “Something warm for cold nights at camp” failing keyword search (0 results) and succeeding with AI search (6 relevant products) is more convincing than any architectural explanation.

Interviewers understand the problem immediately because they’ve experienced it as users. Bad e-commerce search is universal. Showing the fix side-by-side makes the technical architecture tangible.

Local embeddings eliminated API costs and rate limits. Running bge-small-en-v1.5 locally via HuggingFace meant zero API calls for embedding. The model downloads once (~130MB), caches, and then runs instantly. For a demo with 1180 products, it embedded everything in 39 seconds on CPU.

The alternative — OpenAI embeddings — would have cost ~$0.50 for ingestion and $0.02 per search query. Negligible for a demo, but local embeddings also mean no rate limits, no network dependency, and full control over the model version.

The retrieval-then-hydration pattern was correct. ChromaDB returns product IDs and scores. Those IDs fetch fresh records from Neon. This keeps the vector store as an index, not a database, which is the right production pattern. Prices change, descriptions update, inventory moves — Neon stays current, ChromaDB gets re-indexed periodically.

Storing full product data in ChromaDB would have worked for the demo but wouldn’t scale to real e-commerce where product data changes constantly.
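
A sketch of the pattern (names are illustrative; assumes the LlamaIndex retriever from earlier posts and a psycopg connection returning dict rows):

def search_products(query: str) -> list[dict]:
    nodes = retriever.retrieve(query)                 # ChromaDB: IDs + scores only
    ids = [n.metadata["product_id"] for n in nodes]
    with conn.cursor() as cur:                        # Neon: always-current records
        cur.execute("SELECT * FROM products WHERE id = ANY(%s)", (ids,))
        rows = {row["id"]: row for row in cur.fetchall()}
    return [rows[i] for i in ids if i in rows]        # preserve similarity order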


What didn’t work as well

Placeholder images hurt the demo polish. The decision to use generic placeholders instead of real product images was pragmatic (avoiding Unsplash API rate limits and cost), but it makes the site look unfinished. In a portfolio demo competing for attention against other projects, polish matters.

The fix: pre-fetch 20 images (one per archetype category) from Unsplash in a one-time script, store them locally, and reference them by category. All tents get the tent image, all sleeping bags get the sleeping bag image. That’s 20 API calls total instead of 1180, zero runtime overhead, and the demo looks polished.
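
A sketch of that one-time script (assumes an UNSPLASH_ACCESS_KEY environment variable; paths and category names are illustrative):

import os
import requests

CATEGORIES = ["tents", "sleeping-bags"]  # ...plus the other 18 archetype categories

for category in CATEGORIES:
    resp = requests.get(
        "https://api.unsplash.com/search/photos",
        params={"query": category.replace("-", " "), "per_page": 1},
        headers={"Authorization": f"Client-ID {os.environ['UNSPLASH_ACCESS_KEY']}"},
    )
    url = resp.json()["results"][0]["urls"]["small"]
    with open(f"frontend/public/images/{category}.jpg", "wb") as f:
        f.write(requests.get(url).content)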

The retrieval decision heuristic is brittle. The keyword-based check for whether to trigger ChromaDB retrieval works >90% of the time but has obvious failure modes. “What other colors does this come in?” triggers retrieval even though it’s a product-specific question that doesn’t need it. “Compare this tent’s weight to the alternatives” triggers only because of the word “compare”; rephrase the same question without that keyword and it might not trigger at all.

A better approach: a lightweight classification call to Claude before the main query. Send the message and current product context to Claude with a simple prompt: “Does this question require information about other products? Reply yes or no.” Parse the response, trigger retrieval if yes. This adds one API call and ~200ms of latency but removes most of the heuristic’s false positives and misses.
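
A sketch of the gate (model choice and prompt wording are illustrative, not tested):

from anthropic import Anthropic

client = Anthropic()

def needs_retrieval(message: str, product_name: str) -> bool:
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # small, fast model keeps added latency low
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"The user is viewing '{product_name}' and asked: '{message}'. "
                "Does answering require information about other products? "
                "Reply with only yes or no."
            ),
        }],
    )
    return resp.content[0].text.strip().lower().startswith("yes")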

Top-k tuning was guesswork. The project uses top_k=10 for search and top_k=5 for assistant retrieval. Those numbers were picked arbitrarily and never validated. Maybe 8 is better. Maybe 15. Without an eval harness measuring retrieval quality at different k values, there’s no way to know.

An eval harness like the one in the RAG agent project would fix this: define 20-30 test queries with expected products, run them at different top-k values, measure precision/recall. Pick the k that maximizes both. Right now it’s just “10 seems reasonable.”
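
A sketch of the sweep (expected product IDs are hypothetical; retrieve_products is the project’s search function):

TEST_QUERIES = [
    ("something warm for cold nights at camp", {"sb-101", "sb-204"}),
    # ... 20-30 more (query, expected_product_ids) pairs
]

for k in (5, 8, 10, 15, 20):
    precision = recall = 0.0
    for query, expected in TEST_QUERIES:
        ids = {r["id"] for r in retrieve_products(query, top_k=k)}
        hits = len(ids & expected)
        precision += hits / k
        recall += hits / len(expected)
    n = len(TEST_QUERIES)
    print(f"k={k}: precision={precision / n:.2f} recall={recall / n:.2f}")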

No deployment meant no real-world testing. Running locally is fine for development, but you don’t learn about latency, caching, or concurrency issues until real traffic hits the system. The first time 5 users search simultaneously, you discover whether ChromaDB’s file-based storage handles concurrent reads gracefully. (It does, but that’s not obvious from the architecture.)

Deploying to Vercel (frontend) + Render/Railway (backend) + keeping ChromaDB local would surface these issues. It’s also easier to share a deployed link with potential employers than “clone this repo and run two servers.”


What I’d build in v2

Re-ranking with Voyage AI. Retrieve top-20 by vector similarity, re-rank to top-5 using a cross-encoder. Cross-encoders read the query and document together and score their relevance as a pair, which is more accurate than comparing pre-computed vectors. Voyage AI’s rerank API makes this trivial:

import voyageai

voyage_client = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

retrieved = retrieve_products(query, top_k=20)  # broad first pass by vector similarity
docs = [r["text"] for r in retrieved]
reranked = voyage_client.rerank(query, docs, model="rerank-2", top_k=5)
final_results = [retrieved[r.index] for r in reranked.results]  # map back to product records

The RAG agent project used this pattern and measured a noticeable improvement in precision for ambiguous queries. The trade-off: one additional API call per search, ~100-200ms added latency. Worth it for production, probably overkill for a demo unless you’re specifically showcasing re-ranking.

Filters on AI search. Let users constrain by category, price range, or brand before the semantic search runs. The cleanest implementation: ChromaDB metadata filters. When creating the index, store category and price in each node’s metadata. At query time, pass structured filters to the retriever (LlamaIndex wraps them in MetadataFilters rather than a raw dict):

from llama_index.core.vector_stores import FilterOperator, MetadataFilter, MetadataFilters

retriever = index.as_retriever(
    similarity_top_k=10,
    filters=MetadataFilters(filters=[
        MetadataFilter(key="category", value="tents"),
        MetadataFilter(key="price", value=300, operator=FilterOperator.LT),
    ]),
)

This runs the vector similarity search only over products that match the filters, which improves relevance and reduces noise. The UI would be dropdowns or sliders that populate the filter list.

Conversation memory across sessions. Right now, closing the chat widget wipes the conversation. In production, you’d persist conversations in a database keyed by user ID (or session ID if unauthenticated). When the user returns, load their previous conversation and let them continue.

This requires minimal changes: store messages in Postgres with a conversation_id foreign key, fetch on widget mount, append new messages. The architecture already supports it — the conversation history is just an array.
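
A sketch of the persistence layer (table and endpoint names are hypothetical; assumes the same psycopg connection as the rest of the backend):

@router.get("/conversations/{conversation_id}/messages")
async def get_messages(conversation_id: str):
    # Fetched on widget mount to restore the previous session
    with conn.cursor() as cur:
        cur.execute(
            "SELECT role, content FROM messages "
            "WHERE conversation_id = %s ORDER BY created_at",
            (conversation_id,),
        )
        return cur.fetchall()

def append_message(conversation_id: str, role: str, content: str):
    # Called once per user message and once per assistant reply
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO messages (conversation_id, role, content) "
            "VALUES (%s, %s, %s)",
            (conversation_id, role, content),
        )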

Evaluation harness. Build a test suite that runs 30-40 queries with known expected products and measures:

  • Retrieval precision (did the right products surface in top-k?)
  • Answer quality (is Claude’s summary accurate and grounded?)
  • Latency (is the end-to-end response time acceptable?)

Run this before and after every change to top-k, embedding model, re-ranking, or prompt engineering. Without it, every change is a guess. With it, you can measure whether a change improved or degraded quality.

The RAG agent project has this and it’s the most valuable part of that codebase. It catches regressions immediately and makes optimization data-driven instead of intuition-driven.

Streaming responses. Use Claude’s streaming API to render tokens as they arrive instead of waiting for the full response. FastAPI supports this with StreamingResponse:

from anthropic import AsyncAnthropic
from fastapi.responses import StreamingResponse

client = AsyncAnthropic()

async def stream_response(query, context):
    # Async client so token iteration doesn't block the event loop
    async with client.messages.stream(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=f"Recommend only from these retrieved products:\n{context}",  # ground the answer
        messages=[{"role": "user", "content": query}]
    ) as stream:
        async for text in stream.text_stream:
            yield f"data: {text}\n\n"  # SSE framing: one data event per token chunk

@router.post("/search/ai/stream")
async def ai_search_stream(request: AISearchRequest):
    # ... retrieval logic ...
    return StreamingResponse(
        stream_response(request.query, products_context),
        media_type="text/event-stream"
    )

The frontend consumes the stream with fetch and a ReadableStream reader (EventSource only supports GET, and this endpoint is a POST) and appends tokens as they arrive. Perceived latency drops significantly — a 3-second response that streams feels faster than a 1-second response that produces nothing and then dumps everything at once.


Why this architecture matters for e-commerce

The value proposition for AI search in e-commerce is straightforward: users phrase queries in natural language, and product catalogs don’t use that language. Semantic search bridges the gap.

But the real opportunity is in the assistant. Amazon’s Rufus, Instacart’s AI assistant, Shopify’s product recommendation tools — these aren’t novelties. They’re strategic investments because conversational interfaces reduce friction in the buying process.

A user who lands on a product page and asks “Is this good for winter camping?” is expressing intent. Answering that question accurately moves them closer to conversion. A user who asks “What sleeping bag pairs with this tent?” is upselling themselves. The assistant’s job is to surface the right product at the right time with the right reasoning.

The technical challenge is making the assistant useful rather than frustrating:

  • Fast enough that users don’t give up waiting (1-3 seconds is the ceiling)
  • Accurate enough that bad recommendations don’t erode trust (hallucination is a dealbreaker)
  • Contextual enough that it understands what the user is looking at (without requiring them to re-explain)

This project demonstrates all three. The assistant responds in 1-2 seconds, only recommends products from the retrieved context, and automatically knows which product the user is viewing. That’s the baseline for production.


The broader lesson: RAG is a data problem first

The most important takeaway from building this: retrieval quality is bounded by data quality, and data quality is bounded by how much effort you put into descriptions.

The “something warm for cold nights at camp” query worked because sleeping bag descriptions included temperature ratings in natural language. If those descriptions were thin (“Great sleeping bag. Warm and light.”), the embedding model would have nothing to work with and retrieval would fail regardless of top-k tuning or embedding model choice.

This generalizes beyond e-commerce. Every RAG application has a corpus of some kind — documentation, internal knowledge base, customer support tickets, legal contracts. The quality ceiling on retrieval is set by that corpus. Chunking strategy, embedding models, and re-ranking all matter, but they’re optimizations on top of a foundation. If the foundation is weak, optimization doesn’t help.

For this project, the archetype-based generation enforced description quality automatically. Every product got 2-4 sentences with specific attributes in natural language. That’s what made retrieval work.

For production, this means: invest in data quality before investing in retrieval sophistication. Audit your product descriptions, documentation, or knowledge base. Make sure the text is rich, specific, and uses the language your users actually search with. That work has higher ROI than swapping embedding models or tuning hyperparameters.


Source code

Full project: github.com/tylerwellss/ozark-ridge


Series navigation

Previous: Building the AI Product Assistant

Series index:

  1. The Stack and Why
  2. Building the Catalog and Ingestion Pipeline
  3. Keyword Search vs Semantic Search
  4. Building the AI Product Assistant
  5. Lessons Learned and What’s Next (this post)