Last Updated: January 12, 2026
When a user types "running shoes size 10" into a search box, they expect results in milliseconds. They expect those results to be relevant, ranked by some notion of quality. They expect to filter by brand, price range, and color.
Traditional databases cannot deliver this experience. A SQL `LIKE '%running shoes%'` query scans every row, ignores word order, misses synonyms, and provides no relevance ranking. Even with full-text indexes, relational databases lack the sophisticated text analysis and scoring that users expect.
Full-text search engines are purpose-built for this problem. They use inverted indexes that map words to documents, enabling sub-second searches across millions of documents.
They apply linguistic analysis to understand that "running," "runs," and "ran" are related. They score documents by relevance using algorithms like BM25. They support faceted navigation, autocomplete, and fuzzy matching out of the box.
The inverted index is the core data structure that makes full-text search fast. Instead of mapping documents to words, it maps words to documents.
Consider indexing three documents:
A traditional (forward) index maps document to content:
An inverted index maps terms to documents:
To search for "quick brown", look up each term's posting list and intersect the results:
Each term lookup is O(1) with a hash table or O(log n) with a tree; intersecting sorted posting lists is O(min(|list1|, |list2|)). This is dramatically faster than scanning every document.
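The lookup-and-intersect flow can be sketched in Python; the three documents here are hypothetical stand-ins for the example:

```python
from collections import defaultdict

docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "the quick red fox",
}

# Build the inverted index: term -> sorted list of document IDs.
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term in set(text.split()):
        index[term].append(doc_id)

def search(*terms):
    """AND-search: intersect the posting lists of all query terms."""
    postings = [set(index[t]) for t in terms]
    return sorted(set.intersection(*postings))

print(search("quick", "brown"))  # -> [1]
```

Real engines keep posting lists compressed and sorted on disk, but the access pattern is the same: fetch each term's list, then intersect.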
Each term maps to a posting list containing: the IDs of the documents that contain the term, the term's frequency within each document, and the positions where it occurs.
Positions enable phrase queries like "quick brown" (terms must appear adjacent).
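As a sketch of how stored positions support phrase queries (documents hypothetical), two terms form a phrase match when their positions differ by exactly one:

```python
from collections import defaultdict

docs = {1: "quick brown fox", 2: "brown quick fox"}

# Positional postings: term -> {doc_id: [positions]}
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        postings[term].setdefault(doc_id, []).append(pos)

def phrase_search(first, second):
    """Match docs where `second` appears immediately after `first`."""
    hits = []
    for doc_id, positions in postings[first].items():
        next_positions = postings[second].get(doc_id, [])
        if any(p + 1 in next_positions for p in positions):
            hits.append(doc_id)
    return sorted(hits)

print(phrase_search("quick", "brown"))  # -> [1]
```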
Before indexing and searching, text goes through analysis to normalize it into searchable tokens.
Tokenizer: Splits text into tokens
| Tokenizer | Input | Output |
|---|---|---|
| Standard | "Hello, World!" | ["Hello", "World"] |
| Whitespace | "Hello, World!" | ["Hello,", "World!"] |
| N-gram (3) | "Hello" | ["Hel", "ell", "llo"] |
Token Filters: Transform tokens
| Filter | Input | Output |
|---|---|---|
| Lowercase | "HELLO" | "hello" |
| Stemmer | "running" | "run" |
| Stop words | "the quick fox" | "quick fox" |
| Synonym | "laptop" | ["laptop", "notebook"] |
| ASCII folding | "café" | "cafe" |
Elasticsearch analyzer example:
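A minimal sketch of a custom analyzer definition, assuming a hypothetical `products` index and combining filters from the table above:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_product": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "stop", "porter_stem"]
        }
      }
    }
  }
}
```

The same analyzer is applied at index time and at query time so that tokens line up on both sides.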
Consider the query "running" searching for documents containing "runs":
Without stemming: "running" and "runs" are indexed as different terms, so the query misses the document.
With stemming: both terms reduce to "run" at index and query time, so the document matches.
Proper text analysis dramatically improves recall (finding relevant documents).
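A toy suffix-stripping stemmer illustrates the idea (real engines use Porter or Snowball stemmers, which handle far more cases):

```python
def naive_stem(word: str) -> str:
    """Strip common English suffixes; a toy stand-in for a real stemmer."""
    if word.endswith("ing"):
        word = word[:-3]
        # undo consonant doubling: "running" -> "runn" -> "run"
        if len(word) >= 2 and word[-1] == word[-2]:
            word = word[:-1]
    elif word.endswith("s"):
        word = word[:-1]
    return word

print(naive_stem("running"), naive_stem("runs"))  # -> run run
```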
Not all matching documents are equally relevant. Search engines rank results by relevance using scoring algorithms.
The classic TF-IDF formula combines term frequency (TF), how often the term appears in a document, with inverse document frequency (IDF), how rare the term is across the corpus: score = TF × IDF, where IDF = log(N / df).
Example: in a corpus of 1,000 documents, a term appearing in 10 of them has IDF = log(1000/10) ≈ 4.6, while a term appearing in 900 has IDF ≈ 0.1.
Common words like "the" have low IDF (appear everywhere), so they contribute little to relevance.
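The computation can be sketched directly (parameter names are illustrative):

```python
import math

def tf_idf(tf: int, doc_count: int, docs_with_term: int) -> float:
    """Score one term in one document: TF x log(N / df)."""
    idf = math.log(doc_count / docs_with_term)
    return tf * idf

# A rare term (10 of 1,000 docs) outweighs a common one (900 of 1,000).
print(round(tf_idf(3, 1000, 10), 2))   # -> 13.82
print(round(tf_idf(3, 1000, 900), 2))  # -> 0.32
```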
BM25 is the modern standard, improving on TF-IDF with two key refinements: term-frequency saturation (controlled by the k1 parameter, so a term appearing 100 times is not 100x more relevant than a term appearing once) and document length normalization (controlled by the b parameter, so matches in short documents count for more than matches in long ones).
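A hedged single-term BM25 sketch with the usual default parameters (k1 = 1.2, b = 0.75):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len, doc_count, docs_with_term,
                    k1=1.2, b=0.75):
    """Score one term in one document using the BM25 formula."""
    idf = math.log((doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5) + 1)
    # Saturating TF, normalized by document length relative to the average.
    norm_tf = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm_tf
```

Saturation means ten occurrences score well under ten times a single occurrence, and the length term rewards matches in shorter-than-average documents.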
Adjust relevance based on field or query-time factors:
Matches in title are 3x more important than matches in content.
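In Elasticsearch this can be expressed as a `multi_match` query with a per-field boost (index and field names illustrative):

```json
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "running shoes",
      "fields": ["title^3", "content"]
    }
  }
}
```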
Combine text relevance with other signals:
This boosts popular and recent documents in addition to text relevance.
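One way to express this in Elasticsearch is a `function_score` query; field names such as `popularity` and `published_at` are illustrative:

```json
GET /articles/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "content": "running shoes" } },
      "functions": [
        { "field_value_factor": { "field": "popularity", "modifier": "log1p" } },
        { "gauss": { "published_at": { "origin": "now", "scale": "30d" } } }
      ],
      "boost_mode": "multiply"
    }
  }
}
```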
Modern search engines provide features beyond basic text matching.
Handle typos and misspellings:
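For example, a `match` query with `fuzziness` tolerates small edit distances; `"AUTO"` allows one edit for short terms and two for longer ones (index and field names illustrative):

```json
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "runing shoes",
        "fuzziness": "AUTO"
      }
    }
  }
}
```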
Match terms in order:
Only matches documents with these words adjacent and in order.
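A `match_phrase` sketch (a `slop` parameter can relax the adjacency requirement when needed):

```json
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": "quick brown"
    }
  }
}
```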
Suggest completions as the user types:
Implemented using specialized data structures like edge n-grams or finite state transducers.
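The edge n-gram approach can be sketched in Python: index every prefix of each term so a partial query becomes a plain exact lookup (the word list is hypothetical):

```python
def edge_ngrams(term, min_len=2):
    """All prefixes of `term`, from min_len up to the full term."""
    return [term[:i] for i in range(min_len, len(term) + 1)]

# Index every prefix so a partial query matches instantly.
suggestions = {}
for word in ["running", "runner", "rugby"]:
    for prefix in edge_ngrams(word):
        suggestions.setdefault(prefix, []).append(word)

print(suggestions["run"])  # -> ['running', 'runner']
```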
Return snippets with matching terms highlighted; by default, matches in the returned fragments are wrapped in `<em>` tags.
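A hedged request sketch (field names illustrative); each hit in the response then carries a `highlight` section with the matching fragments:

```json
GET /articles/_search
{
  "query": { "match": { "content": "brown fox" } },
  "highlight": {
    "fields": { "content": {} }
  }
}
```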
Return aggregations for filtering:
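In Elasticsearch these are aggregations, computed alongside the query results (field names illustrative; `brand.keyword` assumes a keyword sub-field):

```json
GET /products/_search
{
  "query": { "match": { "name": "running shoes" } },
  "aggs": {
    "brands": { "terms": { "field": "brand.keyword" } },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [{ "to": 50 }, { "from": 50, "to": 100 }, { "from": 100 }]
      }
    }
  }
}
```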
Elasticsearch is the most popular open-source search engine, built on Apache Lucene.
Concepts: an index is a collection of documents; a document is a JSON object of typed fields; an index is divided into shards (each a Lucene index) distributed across nodes; each shard can have replicas; a mapping defines each field's type and how it is analyzed.
Query types:
| Query | Purpose | Scoring |
|---|---|---|
| `match` | Full-text search | Yes |
| `term` | Exact match | No (filter) |
| `range` | Numeric/date range | No (filter) |
| `bool` | Combine queries | Combines scores |
Filter vs Query context: clauses in query context (such as `must`) contribute to the relevance score; clauses in filter context (such as `filter`) answer a yes/no question, are not scored, and can be cached and reused across queries.
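A `bool` query combining both contexts (field names illustrative): the `match` clause is scored, while the `filter` clauses only narrow the result set:

```json
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "running shoes" } }
      ],
      "filter": [
        { "term": { "brand": "acme" } },
        { "range": { "price": { "lte": 150 } } }
      ]
    }
  }
}
```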
Read scaling: Add replicas. Each replica can serve search queries.
Write scaling: Add primary shards. Writes are distributed across primaries.
Storage scaling: Shards distribute data across nodes.
Elasticsearch: the dominant open-source search engine, with the largest ecosystem of tooling and integrations.
OpenSearch: the AWS-maintained fork of Elasticsearch, created in 2021 after Elastic's license change.
Solr: another Lucene-based search platform, long established in enterprise deployments.
Algolia: proprietary search-as-a-service focused on speed and developer experience.
Meilisearch: open-source, developer-friendly search focused on simplicity and fast setup.
| Feature | Elasticsearch | OpenSearch | Solr | Algolia | Meilisearch |
|---|---|---|---|---|---|
| Open source | Yes (SSPL) | Yes | Yes | No | Yes |
| Managed options | Elastic Cloud | AWS | Various | Yes (only) | Meilisearch Cloud |
| Learning curve | Medium | Medium | Steep | Low | Low |
| Scale | Massive | Massive | Massive | Large | Medium |
| Best for | General purpose | AWS users | Enterprise | SaaS, ease | Simplicity |
Requirements:
Use case: Centralized logging and analysis
Internal knowledge base or documentation search:
Search engines are the right choice when: users issue free-form text queries, results need relevance ranking, features like facets, autocomplete, or fuzzy matching are required, or keyword search must stay fast across millions of documents.
Search engines may not fit when: you need a system of record with strong transactional guarantees (search engines are typically near-real-time secondary stores), queries are exact-match lookups a database index already serves well, or the data set is small enough that a relational database's built-in full-text features suffice.
Full-text search engines are optimized for text search with relevance ranking:
| Aspect | Search Engine Approach |
|---|---|
| Data structure | Inverted index mapping terms to documents |
| Text analysis | Tokenization, stemming, normalization |
| Relevance | BM25 scoring with boosting |
| Features | Fuzzy matching, facets, autocomplete |
| Scaling | Sharding for distribution, replicas for availability |
The next chapter explores vector databases, a specialized database type that has emerged to support AI and machine learning applications by enabling similarity search over high-dimensional embeddings.