A search engine is a software system that helps users find relevant information from a large collection of data by processing queries and returning matching results.
It typically works by indexing content (such as web pages or documents), allowing users to perform keyword-based searches, and ranking the results based on relevance.
Popular real-world examples include Google, Bing, and Elasticsearch.
In this chapter, we will explore the low-level design of a simple in-memory search engine.
Let's start by clarifying the requirements:
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a conversation between the candidate and the interviewer might unfold:
Candidate: Implementing web crawling can add significant complexity. Should we preload documents or web pages into the system?
Interviewer: For this version, assume a predefined set of documents or web pages is already available in memory. No need to implement crawling.
Candidate: Should the search engine support only keyword-based search, or also handle phrases queries and logical operators?
Interviewer: Keep it simple for now. Basic keyword-based search is sufficient.
Candidate: Should the system return only exact matches, or also support partial and fuzzy matches?
Interviewer: Let's support exact matches only for now. You can assume case-insensitive search.
Candidate: Do we need to rank the results, or is returning any matching document enough?
Interviewer: Basic scoring and ranking should be implemented (e.g., based on the frequency of the keyword within each document).
Candidate: Should we include text processing techniques like stop-word removal or stemming during indexing and querying?
Interviewer: Not for this version. Treat all words equally. No stop-word removal or stemming.
Candidate: Should we allow users to input search queries dynamically, or can we hardcode a set of search queries?
Interviewer: For this version, assume queries are predefined and supplied via code. No need to handle runtime user input.
After gathering the details, we can summarize the key system requirements.