The previous posts covered architecture and data ingestion. This one is about the core value proposition: why semantic search matters and how to demonstrate it.

The approach: build both keyword and AI search, run the same queries through each, and document where keyword search fails. The results make the case for semantic search more effectively than any architectural explanation could.


What keyword search actually does

Postgres full-text search works by tokenizing text into lexemes (normalized words), removing stop words, and matching query tokens against indexed documents. It’s fast, deterministic, and has been reliable for decades.

The implementation:

-- Add a generated tsvector column
ALTER TABLE products ADD COLUMN search_vector tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english', 
      coalesce(name, '') || ' ' || 
      coalesce(brand, '') || ' ' || 
      coalesce(description, '') || ' ' || 
      coalesce(category, '')
    )
  ) STORED;

CREATE INDEX idx_products_search ON products USING GIN(search_vector);

The tsvector column is automatically maintained — any change to name, brand, description, or category updates it. The GIN index makes lookups fast even at scale.

The FastAPI endpoint:

from fastapi import APIRouter, Depends
from sqlalchemy import text
from sqlalchemy.ext.asyncio import AsyncSession

router = APIRouter()

@router.get("/search/keyword")
async def keyword_search(q: str, db: AsyncSession = Depends(get_db)):
    """Traditional keyword search using Postgres full-text."""
    # get_db is the project's async session dependency, defined elsewhere
    query = text("""
        SELECT * FROM products
        WHERE search_vector @@ plainto_tsquery('english', :query)
        ORDER BY ts_rank(search_vector, plainto_tsquery('english', :query)) DESC
        LIMIT 20
    """)
    result = await db.execute(query, {"query": q})
    return result.mappings().all()

plainto_tsquery converts plain English into a tsquery, AND-ing the stemmed tokens together — every token has to match somewhere in the document. ts_rank scores matches: documents where the query tokens appear more often rank higher. The results come back in <100ms.
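
You can check the tokenization directly in psql. The stems below line up with the ones referenced in the tests that follow (exact behavior depends on the ‘english’ stop word list and stemmer shipped with your Postgres version):

-- "hiking boots" reduces to the stems that actually get matched
SELECT plainto_tsquery('english', 'hiking boots');
-- 'hike' & 'boot'

-- stop words ("something", "for", "at") are dropped, plurals are stemmed
SELECT plainto_tsquery('english', 'something warm for cold nights at camp');
-- 'warm' & 'cold' & 'night' & 'camp'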


Where keyword search fails

The limitation isn’t implementation quality. It’s fundamental to how token matching works.

Test 1: “hiking boots”

Keyword search works perfectly. The query tokens (“hike”, “boot” after stemming) match product names and descriptions directly. Results are ranked by how many times those tokens appear.

Result: 15 products, all boots, correctly ranked.

Test 2: “waterproof tent for 2 people under $300”

This is where it starts to break. The query contains:

  • “waterproof” — a direct token match
  • “tent” — a direct token match
  • “2 people” — rarely matches; descriptions say “2-person” or “capacity: 2”, and the English stemmer reduces “people” to “peopl”, which doesn’t match “person”
  • “under $300” — Postgres can’t interpret this as a price constraint

Keyword search returns tents, but misses many 2-person tents because they don’t contain the exact phrase “2 people.” It also returns $500 tents because it can’t filter by price from a natural language query.

Result: 8 products, 3 of which don’t match the query intent.

Test 3: “something warm for cold nights at camp”

This is where keyword search collapses entirely.

The query tokens after stemming: “warm”, “cold”, “night”, “camp”

Products that should match:

  • Sleeping bags rated to 0°F or below
  • Insulated jackets with high warmth-to-weight ratios
  • Base layers designed for cold weather

None of these products contain the words “warm” or “cold” in their descriptions. They say things like:

  • “rated to -10°F”
  • “700-fill down insulation”
  • “comfort range: 15-35°F”
  • “midweight merino wool”

These are semantically equivalent to “warm for cold nights” but share zero lexemes with the query.

Result: 0 products.

This isn’t a fixable problem. You can’t write product descriptions that anticipate every possible phrasing a user might try. Token matching fundamentally can’t bridge the gap between “something warm” and “rated to -10°F.”


What semantic search does differently

Semantic search embeds both the query and the product descriptions into a high-dimensional vector space where similar meanings cluster together geometrically. Cosine similarity measures the angle between vectors — small angle = similar meaning.
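
In code, the similarity measure itself is one line. A minimal numpy sketch:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector lengths:
    # 1.0 means identical direction, 0.0 means orthogonal (unrelated)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))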

The same query (“something warm for cold nights at camp”) embedded with bge-small-en-v1.5 produces a 384-dimensional vector. That vector is compared against every product vector in ChromaDB. The closest matches:

1. Score: 0.6576
   Kelty Blaze 33°F Insulated Bag
   "...synthetic-insulated sleeping bag rated to 33°F..."

2. Score: 0.6489
   Kelty Blaze Synthetic 23° Bag
   "...rated to 23°F, using Heatseeker Eco insulation..."

3. Score: 0.6479
   Kelty Trailmix 34°F Insulated Bag
   "...synthetic-insulated sleeping bag rated to 34°F..."

Scores of 0.64-0.66 are solid matches. The model understood that “warm for cold nights” relates to temperature ratings and insulation, even though the words don’t overlap.

The embedding model learned these relationships during training on massive text corpora. It knows that “cold nights” and “low temperature rating” and “insulated” and “down fill” all cluster together semantically.
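
A minimal sketch of that retrieval step, assuming a persistent ChromaDB client, a collection named "products", and a local storage path (all three are assumptions), with cosine distances converted back to the similarity scores quoted above:

import chromadb
from sentence_transformers import SentenceTransformer

# Use the same embedding model as at ingestion time
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

client = chromadb.PersistentClient(path="./chroma_data")  # path is an assumption
collection = client.get_collection("products")            # name is an assumption

query = "something warm for cold nights at camp"
embedding = model.encode(query).tolist()  # 384-dimensional vector

results = collection.query(query_embeddings=[embedding], n_results=10)

# Assuming the collection was created with cosine distance:
# similarity = 1 - distance
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"{1 - dist:.4f}  {doc[:60]}")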


The side-by-side comparison

Running both searches on the same queries produces this:

Query | Keyword results | AI results | Winner
“hiking boots” | 15 boots | 12 boots | Tie — both work
“waterproof tent for 2 people under $300” | 8 tents (3 wrong capacity/price) | 4 tents (all correct) | AI
“something warm for cold nights at camp” | 0 results | 5 sleeping bags + jackets | AI
“lightweight gear for a solo thru-hike” | 3 random items | 8 ultralight products across categories | AI
“best bang for your buck camp stove” | 2 stoves | 5 budget-friendly stoves ranked by value | AI

The pattern: keyword search works when the query uses the exact terminology of the product descriptions. AI search works when the query expresses intent in natural language, regardless of phrasing.


What the retrieval scores mean

ChromaDB returns similarity scores from 0.0 (unrelated) to 1.0 (identical). In practice:

  • 0.7-1.0: Very strong match, rare outside exact duplicates
  • 0.6-0.7: Solid match, semantically related
  • 0.5-0.6: Weak but plausible match
  • <0.5: Noise, probably not relevant

The “waterproof 2-person tent” query returned scores around 0.66 for actual 2-person tents. That’s a good signal — not perfect, but clearly above the noise threshold.

One nuance worth noting: scores are relative to your corpus and embedding model. A 0.66 in this corpus might be a 0.55 in a different one with different content density. The absolute number matters less than the relative ranking.


The architectural trade-off

Keyword search is:

  • Fast — <100ms for any query
  • Deterministic — same query always returns same results
  • Simple — no ML models, no embeddings, no vector stores
  • Cheap — just Postgres

Semantic search is:

  • Slower — 1-3 seconds including embedding + retrieval + generation
  • Non-deterministic — embedding models can update, rankings can shift slightly
  • Complex — requires embedding model, vector store, orchestration layer
  • More expensive — compute for embeddings, storage for vectors, API calls for inference

The decision depends on your use case. If users search with product-specific terminology (“NEMO Tensor sleeping pad”), keyword search is fine. If they search with natural language intent (“something to sleep on that packs small”), semantic search is necessary.

For e-commerce, users phrase queries both ways. The right architecture supports both and lets the frontend decide which to use based on query characteristics — or just runs both and shows the better results.


The role of Claude in this pipeline

Semantic search handles retrieval. Claude handles presentation.

After ChromaDB returns the top-10 most similar products, those products get passed to Claude with the original query. Claude’s job:

  1. Write a natural language summary that explains what it found
  2. Re-rank products by relevance, considering constraints that vector similarity alone can’t capture (like “under $300”)
  3. Return structured output that the frontend can render

The prompt:

"""You are a search assistant for Ozark Ridge, an outdoor gear retailer.
The user searched for: "{query}"

Here are the most relevant products from our catalog:
{products_context}

Respond with a JSON object containing exactly two fields:
- "summary": a 2-3 sentence response addressing what the user is looking for
- "product_ids": an array of product IDs ordered by relevance, max 10

Return ONLY valid JSON."""

Claude’s re-ranking matters. On the “waterproof tent under $300” query, ChromaDB returned a $450 tent ranked highly because it was semantically very similar. Claude demoted it in the final ranking because it didn’t meet the price constraint stated in the query. Vector similarity got it into the candidate set; Claude’s reasoning filtered it out.
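
Wiring this up is a standard Messages API call. A sketch, where SEARCH_PROMPT stands in for the template above, and the model string and token limit are illustrative rather than what the project necessarily uses:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# SEARCH_PROMPT is the template shown above; query and products_context
# come from the request and the ChromaDB results respectively
prompt = SEARCH_PROMPT.format(query=query, products_context=products_context)

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model choice
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)

# The prompt demands bare JSON, so parse the text block directly
result = json.loads(response.content[0].text)
summary, product_ids = result["summary"], result["product_ids"]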


Demonstrating the difference

The demo script for this project:

  1. Open keyword search
  2. Enter “something warm for cold nights at camp”
  3. Show: 0 results
  4. Switch to AI search
  5. Enter the same query
  6. Show: 5-6 relevant products (sleeping bags, insulated jackets)
  7. Click one, show the product detail page
  8. Explain: keyword search can’t bridge “warm for cold nights” to “rated to 20°F” — semantic search can

That 30-second demo makes the value proposition immediately clear in a way that architectural diagrams and explanations never do.


When keyword search is still the right choice

Semantic search isn’t always better. Cases where keyword search wins:

Exact match queries — “Big Agnes Copper Spur HV UL2” is an exact product name. Keyword search finds it instantly. Semantic search might return similar tents that aren’t the one the user asked for.

Technical specs — “tent with 2000mm waterproof rating” works fine with keyword search if “2000mm” appears in descriptions. Semantic search might interpret this more loosely and return products with 1500mm ratings.

Brand/SKU searches — “MSR” or “product #12345” are token matches, not semantic queries.

The best implementation: detect the query type and route accordingly. If the query contains a product name, SKU, or brand, use keyword search. If it’s a natural language question or intent-based query, use semantic search. Or run both and merge results.

For a portfolio demo, running both side-by-side and letting the user toggle between them shows you understand the trade-offs.


The data layer determines the ceiling

One finding that only became clear after running queries at scale: retrieval quality is bounded by description quality, not by the sophistication of the retrieval system.

The “warm for cold nights” query worked because sleeping bag descriptions included temperature ratings in natural language (“rated to 20°F”, “comfort range: 15-35°F”). If those descriptions were thin (“Great sleeping bag. Warm and light.”), the embedding model would have nothing to work with and retrieval would fail.

This is the most important takeaway for building production search: spend more time on data quality than on tuning top-k or choosing embedding models. Rich, specific, natural-language descriptions make semantic search work. Thin descriptions make it fail regardless of how good your vector store is.

The archetype-based generation strategy in this project enforces description quality automatically. Every product has 2-4 sentences with specific attributes embedded in natural language. That’s what makes retrieval work at 0.65+ similarity scores instead of 0.4-0.5 noise.


What I’d add next

Hybrid search — combine vector similarity with keyword matching. Some queries benefit from both. “Big Agnes tent under $300” should match “Big Agnes” exactly (keyword) but interpret “under $300” semantically. Combining scores from both retrieval methods is straightforward:

final_score = (0.7 * semantic_score) + (0.3 * keyword_score)

The weights depend on your corpus and query patterns. Test and tune.
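
A minimal sketch of that merge, assuming both retrievers' scores have already been normalized to the 0-1 range (function and variable names are hypothetical):

def hybrid_rank(semantic_hits: dict, keyword_hits: dict,
                w_semantic: float = 0.7, w_keyword: float = 0.3) -> list:
    """Blend normalized scores from both retrievers.

    Each hits dict maps product_id -> score in [0, 1].
    """
    combined = {}
    for pid in set(semantic_hits) | set(keyword_hits):
        combined[pid] = (w_semantic * semantic_hits.get(pid, 0.0)
                         + w_keyword * keyword_hits.get(pid, 0.0))
    # Highest blended score first
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)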

Query classification — detect whether a query is product-specific (use keyword) or intent-based (use semantic). Simple heuristics work: if the query contains a brand name or product model, route to keyword. If it’s a question or contains words like “for” or “best”, route to semantic.
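
The heuristic version fits in a few lines. A sketch, where KNOWN_BRANDS is a hypothetical set built from the catalog's brand column:

import re

KNOWN_BRANDS = {"msr", "kelty", "nemo", "big agnes"}  # hypothetical; load from the catalog

def route_query(q: str) -> str:
    """Crude router: product-specific queries go to keyword, intent queries to semantic."""
    lowered = q.lower()
    if re.search(r"#?\d{4,}", lowered):  # looks like a SKU or model number
        return "keyword"
    if any(brand in lowered for brand in KNOWN_BRANDS):
        return "keyword"
    # Questions and intent phrasing ("for", "best", "something") fall through here
    return "semantic"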

Re-ranking — retrieve top-20 by vector similarity, re-rank to top-5 with a cross-encoder model. Cross-encoders are slower but more accurate because they read the query and document together, not just compare pre-computed vectors. Voyage AI’s rerank API is the standard choice.
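
A sketch of that re-rank step using an open-source cross-encoder from sentence-transformers rather than a hosted API (candidate_docs stands in for the top-20 documents returned by ChromaDB):

from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and document together, unlike the bi-encoder above
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "something warm for cold nights at camp"
candidate_docs = [  # stub; in practice, the top-20 documents from the vector search
    "Synthetic-insulated sleeping bag rated to 23°F...",
    "Midweight merino wool base layer...",
]

scores = reranker.predict([(query, doc) for doc in candidate_docs])

# Keep the top-5 by cross-encoder score
top5 = sorted(zip(candidate_docs, scores), key=lambda x: x[1], reverse=True)[:5]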


The broader point

Natural-language queries need vector embeddings because token matching cannot bridge the semantic gap between how users phrase intent and how products describe features.

“Something warm for cold nights” and “rated to -10°F” mean the same thing. Keyword search can’t see that. Semantic search can.

The value isn’t in the technology — it’s in removing friction from the user’s search experience. Users shouldn’t have to learn product-specific terminology to find what they need. Semantic search lets them ask questions the way they naturally would, and the system figures out the mapping.

That’s the demo. That’s the value.

Series navigation

Previous: Building the Catalog and Ingestion Pipeline

Next: Building the AI Product Assistant


Source code

Full project: github.com/tylerwellss/ozark-ridge

ChromaDB docs: docs.trychroma.com

Postgres full-text search: postgresql.org/docs/current/textsearch.html