Full-Text Search Engines

Last Updated: January 12, 2026

Ashish Pratap Singh

When a user types "running shoes size 10" into a search box, they expect results in milliseconds. They expect those results to be relevant, ranked by some notion of quality. They expect to filter by brand, price range, and color.

Traditional databases cannot deliver this experience. A SQL LIKE '%running shoes%' query scans every row, ignores word order, misses synonyms, and provides no relevance ranking. Even with full-text indexes, relational databases lack the sophisticated text analysis and scoring that users expect.

Full-text search engines are purpose-built for this problem. They use inverted indexes that map words to documents, enabling sub-second searches across millions of documents.

They apply linguistic analysis to understand that "running," "runs," and "ran" are related. They score documents by relevance using algorithms like BM25. They support faceted navigation, autocomplete, and fuzzy matching out of the box.

The Inverted Index

The inverted index is the core data structure that makes full-text search fast. Instead of mapping documents to words, it maps words to documents.

How It Works

Consider indexing three documents:

A traditional (forward) index maps each document to its content:

An inverted index maps terms to documents:
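As an illustrative sketch (the three documents here are invented for the example, chosen so the lookups below work out), building a forward and an inverted index takes only a few lines:

```python
from collections import defaultdict

# Three invented example documents
docs = {
    "Doc1": "the quick brown fox",
    "Doc2": "the lazy brown dog",
    "Doc3": "the quick red fox",
}

# Forward index: document -> list of terms
forward_index = {doc_id: text.split() for doc_id, text in docs.items()}

# Inverted index: term -> list of documents containing it
inverted_index = defaultdict(list)
for doc_id, terms in forward_index.items():
    for term in sorted(set(terms)):
        inverted_index[term].append(doc_id)

print(inverted_index["quick"])  # ['Doc1', 'Doc3']
print(inverted_index["brown"])  # ['Doc1', 'Doc2']
```

Real engines store far more per entry (frequencies, positions, compression), but the core mapping is exactly this inversion.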

Why Inverted Indexes Are Fast

To search for "quick brown":

  1. Look up "quick" → [Doc1, Doc3]
  2. Look up "brown" → [Doc1, Doc2]
  3. Intersect the lists → [Doc1]

Each term lookup is O(1) with a hash table or O(log n) with a tree. Intersecting two sorted posting lists takes time roughly proportional to the shorter list (with skip pointers or galloping search), and at worst linear in their combined length. Either way, this is dramatically faster than scanning all documents.
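The steps above can be sketched as a two-pointer intersection of sorted posting lists:

```python
def intersect(postings_a, postings_b):
    """Intersect two sorted posting lists with two pointers.

    Runs in time linear in the combined length; with skip pointers or
    galloping search, engines get closer to O(min(|a|, |b|)) in practice.
    """
    result = []
    i = j = 0
    while i < len(postings_a) and j < len(postings_b):
        if postings_a[i] == postings_b[j]:
            result.append(postings_a[i])
            i += 1
            j += 1
        elif postings_a[i] < postings_b[j]:
            i += 1
        else:
            j += 1
    return result

# "quick" -> [Doc1, Doc3], "brown" -> [Doc1, Doc2]
print(intersect(["Doc1", "Doc3"], ["Doc1", "Doc2"]))  # ['Doc1']
```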

Posting Lists

Each term maps to a posting list containing:

  • Document IDs
  • Term frequency (how many times the term appears)
  • Term positions (for phrase queries)
  • Optional: additional scoring factors

Positions enable phrase queries like "quick brown" (terms must appear adjacent).
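A toy positional index and adjacency check (the postings below are invented for illustration) might look like:

```python
# Invented positional postings: term -> {doc_id: [positions in document]}
postings = {
    "quick": {"Doc1": [1], "Doc3": [1]},
    "brown": {"Doc1": [2], "Doc2": [3]},
}

def phrase_match(first, second, postings):
    """Return docs where `second` occurs at some position p
    and `first` occurs at p - 1 (i.e., the terms are adjacent)."""
    matches = []
    for doc_id in sorted(postings[first].keys() & postings[second].keys()):
        first_positions = set(postings[first][doc_id])
        if any(p - 1 in first_positions for p in postings[second][doc_id]):
            matches.append(doc_id)
    return matches

print(phrase_match("quick", "brown", postings))  # ['Doc1']
```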

Text Analysis

Before indexing and searching, text goes through analysis to normalize it into searchable tokens.

The Analysis Pipeline

Components

Tokenizer: Splits text into tokens

| Tokenizer | Input | Output |
|---|---|---|
| Standard | "Hello, World!" | ["Hello", "World"] |
| Whitespace | "Hello, World!" | ["Hello,", "World!"] |
| N-gram (3) | "Hello" | ["Hel", "ell", "llo"] |

Token Filters: Transform tokens

| Filter | Input | Output |
|---|---|---|
| Lowercase | "HELLO" | "hello" |
| Stemmer | "running" | "run" |
| Stop words | "the quick fox" | "quick fox" |
| Synonym | "laptop" | ["laptop", "notebook"] |
| ASCII folding | "café" | "cafe" |

Analyzer Configuration

Elasticsearch analyzer example:
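The original example is not shown here; a representative custom analyzer definition (the index and analyzer names are illustrative) could look like:

```json
PUT /products
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "porter_stem", "asciifolding"]
        }
      }
    }
  }
}
```

The same analyzer must then be referenced in the field mapping so that indexing and querying run through identical analysis.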

Why Analysis Matters

Consider the query "running" searching for documents containing "runs":

Without stemming:

  • Query: "running"
  • Document: "runs"
  • Result: No match (different tokens)

With stemming:

  • Query: "running" → "run"
  • Document: "runs" → "run"
  • Result: Match!

Proper text analysis dramatically improves recall (finding relevant documents).

Relevance Scoring

Not all matching documents are equally relevant. Search engines rank results by relevance using scoring algorithms.

TF-IDF

The classic scoring formula combines:

  • Term Frequency (TF): How often does the term appear in this document? More = more relevant.
  • Inverse Document Frequency (IDF): How rare is this term across all documents? Rare terms are more significant.

Example:

  • Document contains "database" 5 times out of 100 words: TF = 0.05
  • "database" appears in 1,000 of 1,000,000 documents: IDF = log(1,000,000/1,000) = 3
  • TF-IDF = 0.05 × 3 = 0.15

Common words like "the" have low IDF (appear everywhere), so they contribute little to relevance.
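The arithmetic above can be checked directly (the example uses a base-10 logarithm):

```python
import math

def tf_idf(term_count, doc_length, docs_with_term, total_docs):
    tf = term_count / doc_length                   # 5 / 100 = 0.05
    idf = math.log10(total_docs / docs_with_term)  # log10(1000) = 3.0
    return tf * idf

score = tf_idf(term_count=5, doc_length=100,
               docs_with_term=1_000, total_docs=1_000_000)
print(round(score, 2))  # 0.15
```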

BM25

BM25 is the modern standard, improving on TF-IDF.

Key improvements over TF-IDF:

  • Term frequency has diminishing returns (5 occurrences is not 5x better than 1)
  • Normalizes for document length (longer documents are not unfairly penalized)
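As a sketch, one common formulation of the BM25 per-term score (k1 ≈ 1.2 and b ≈ 0.75 are typical defaults; production engines such as Lucene use slightly adjusted variants):

```python
import math

def bm25_term_score(tf, doc_len, avg_doc_len,
                    docs_with_term, total_docs, k1=1.2, b=0.75):
    """BM25 contribution of one term: IDF times a saturating,
    length-normalized term-frequency factor."""
    idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    tf_norm = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

one = bm25_term_score(tf=1, doc_len=100, avg_doc_len=100,
                      docs_with_term=1_000, total_docs=1_000_000)
five = bm25_term_score(tf=5, doc_len=100, avg_doc_len=100,
                       docs_with_term=1_000, total_docs=1_000_000)
print(five < 5 * one)  # True: diminishing returns on term frequency
```

The k1 parameter controls how quickly term frequency saturates; b controls how strongly document length is normalized.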

Boosting

Adjust relevance based on field or query-time factors:

Matches in title are 3x more important than matches in content.
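In Elasticsearch, the 3x title boost described above can be expressed with a `^` field boost on a multi_match query (index and field names are illustrative):

```json
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "inverted index",
      "fields": ["title^3", "content"]
    }
  }
}
```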

Custom Scoring

Combine text relevance with other signals:

This boosts popular and recent documents in addition to text relevance.
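One way to express this in Elasticsearch is a function_score query; the field names (popularity, published_at) are illustrative:

```json
GET /articles/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "content": "inverted index" } },
      "functions": [
        { "field_value_factor": { "field": "popularity", "modifier": "log1p" } },
        { "gauss": { "published_at": { "origin": "now", "scale": "30d" } } }
      ],
      "boost_mode": "multiply"
    }
  }
}
```

Here the text relevance score is multiplied by a logarithmic popularity factor and a Gaussian decay on document age.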

Search Features

Modern search engines provide features beyond basic text matching.

Fuzzy Matching

Handle typos and misspellings:
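For example, a match query with fuzziness enabled (index and field names illustrative):

```json
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "runnign shoes",
        "fuzziness": "AUTO"
      }
    }
  }
}
```

With "AUTO", the allowed edit distance scales with term length, so short terms stay strict while longer terms tolerate one or two typos.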

Phrase Queries

Match terms in order:

Only matches documents with these words adjacent and in order.
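A match_phrase query expresses this (index and field names illustrative):

```json
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": "quick brown"
    }
  }
}
```

A `slop` parameter can relax the adjacency requirement to allow a few intervening words.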

Autocomplete

Suggest completions as the user types:

Implemented using specialized data structures like edge n-grams or finite state transducers.
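In Elasticsearch, one option is the completion suggester; the field must be mapped with the `completion` type (names here are illustrative):

```json
GET /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "runn",
      "completion": { "field": "name_suggest" }
    }
  }
}
```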

Highlighting

Return snippets with matching terms highlighted:
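A representative request (index and field names illustrative):

```json
GET /articles/_search
{
  "query": { "match": { "content": "inverted index" } },
  "highlight": { "fields": { "content": {} } }
}
```

Each hit in the response then carries a highlight section whose snippet fragments wrap the matching terms in `<em>` tags by default.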

Faceted Search

Return aggregations for filtering:
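A representative facet request combining a terms aggregation and a range aggregation (field names illustrative):

```json
GET /products/_search
{
  "query": { "match": { "name": "running shoes" } },
  "aggs": {
    "brands": { "terms": { "field": "brand" } },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [{ "to": 50 }, { "from": 50, "to": 100 }, { "from": 100 }]
      }
    }
  }
}
```

The response returns per-bucket counts (e.g. how many results each brand has) alongside the hits, which drives the filter sidebar UI.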

Elasticsearch Architecture

Elasticsearch is the most popular open-source search engine, built on Apache Lucene.

Cluster Architecture

Concepts:

  • Cluster: A collection of nodes working together
  • Node: A single Elasticsearch instance
  • Index: A collection of documents (like a database table)
  • Shard: A partition of an index (for horizontal scaling)
  • Replica: A copy of a shard (for availability and read scaling)

Indexing Documents
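A representative indexing request (index name and fields are illustrative):

```json
PUT /products/_doc/1
{
  "name": "Trail Running Shoes",
  "brand": "Acme",
  "price": 89.99,
  "description": "Lightweight trail running shoes with aggressive grip"
}
```

PUT with an explicit ID creates or replaces that document; POST to `/products/_doc` lets Elasticsearch auto-generate the ID.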

Searching

Query types:

| Query | Purpose | Scoring |
|---|---|---|
| match | Full-text search | Yes |
| term | Exact match | No (filter) |
| range | Numeric/date range | No (filter) |
| bool | Combine queries | Combines scores |

Filter vs Query context:

  • Filters are cached and faster (no scoring)
  • Use filters for structured data (price, category)
  • Use queries for text relevance
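The two contexts are typically combined in a bool query: scored full-text matching in `must`, cached structured constraints in `filter` (index and field names illustrative):

```json
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "running shoes" } }
      ],
      "filter": [
        { "term": { "brand": "acme" } },
        { "range": { "price": { "lte": 100 } } }
      ]
    }
  }
}
```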

Scaling

Read scaling: Add replicas. Each replica can serve search queries.

Write scaling: Add primary shards. Writes are distributed across primaries.

Storage scaling: Shards distribute data across nodes.
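Shard and replica counts are set in the index settings (values here are illustrative); note that the primary shard count is fixed at index creation, while the replica count can be changed dynamically:

```json
PUT /products
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```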

Popular Search Engines

Elasticsearch

The dominant open-source search engine:

  • Based on: Apache Lucene
  • API: REST + JSON
  • Ecosystem: Kibana (visualization), Logstash (ingestion), Beats (agents)
  • Use cases: Full-text search, log analytics, metrics, security analytics

OpenSearch

AWS-maintained fork of Elasticsearch:

  • Origin: Forked from Elasticsearch 7.10 after license change
  • Compatibility: Largely compatible with Elasticsearch
  • Use cases: Same as Elasticsearch, especially on AWS

Apache Solr

Another Lucene-based search platform:

  • Strengths: Mature, battle-tested, strong for enterprise search
  • Differences: XML configuration, different query syntax
  • Use cases: Enterprise search, e-commerce

Algolia

Search-as-a-service:

  • Deployment: Managed cloud only
  • Strengths: Low latency, excellent developer experience
  • Features: Typo tolerance, instant search, AI-powered
  • Use cases: Site search, e-commerce, mobile apps

Meilisearch

Open-source, developer-friendly search:

  • Strengths: Easy setup, typo tolerance out of the box
  • Language: Rust (fast, memory-safe)
  • Use cases: Small to medium-scale search, prototyping

Comparison

| Feature | Elasticsearch | OpenSearch | Solr | Algolia | Meilisearch |
|---|---|---|---|---|---|
| Open source | Yes (SSPL) | Yes | Yes | No | Yes |
| Managed options | Elastic Cloud | AWS | Various | Yes (only) | Meilisearch Cloud |
| Learning curve | Medium | Medium | Steep | Low | Low |
| Scale | Massive | Massive | Massive | Large | Medium |
| Best for | General purpose | AWS users | Enterprise | SaaS, ease | Simplicity |

Use Cases

E-commerce Product Search

Requirements:

  • Fast autocomplete
  • Typo tolerance
  • Faceted filtering (brand, price, rating)
  • Personalized ranking
  • Synonym handling (headphones = earphones)

Log Analytics (ELK Stack)

Use case: Centralized logging and analysis

  • Ingest logs from all applications
  • Search and filter logs in real-time
  • Build dashboards and alerts
  • Investigate incidents

Document Search

Use case: Internal knowledge base or documentation search


  • Index documents (PDFs, Word, HTML)
  • Extract and analyze text
  • Search with relevance ranking
  • Highlight matching snippets

When to Choose Search Engines

Search engines are the right choice when:

  • Full-text search is a primary feature. Users search by natural language queries.
  • Relevance ranking matters. Results need to be ordered by quality, not just filtered.
  • Fast, complex queries are needed. Faceted navigation, autocomplete, typo tolerance.
  • Log and event analytics. Searching and aggregating massive volumes of log data.

When to Consider Alternatives

Search engines may not fit when:

  • Simple LIKE queries suffice. For basic substring matching, database full-text indexes work.
  • Transactions are needed. Search engines are eventually consistent, not ACID.
  • Primary data storage. Use as a secondary index, not as the source of truth.
  • Vector/semantic search. While Elasticsearch has vector search, purpose-built vector databases may be better.

Summary

Full-text search engines are optimized for text search with relevance ranking:

| Aspect | Search Engine Approach |
|---|---|
| Data structure | Inverted index mapping terms to documents |
| Text analysis | Tokenization, stemming, normalization |
| Relevance | BM25 scoring with boosting |
| Features | Fuzzy matching, facets, autocomplete |
| Scaling | Sharding for distribution, replicas for availability |

Key concepts:

  • Inverted index: Maps terms to documents for O(1) lookup
  • Analyzers: Process text into searchable tokens
  • BM25: Modern relevance scoring algorithm
  • Facets: Aggregations for filtering navigation
  • Shards: Horizontal partitioning for scale

The next chapter explores vector databases, a specialized database type that has emerged to support AI and machine learning applications by enabling similarity search over high-dimensional embeddings.