Search Implementation
Overview
Search functionality transforms how users discover and access content. Modern search systems must handle full-text queries, provide relevant results quickly, support filtering and faceting, and scale with data growth. This guide covers search engine selection, implementation patterns, relevance tuning, and performance optimization.
Search implementation involves three core concerns: indexing (structuring data for efficient retrieval), querying (understanding user intent and executing searches), and ranking (determining result relevance). The architecture typically separates the search engine from the primary database to optimize each for its specific purpose.
The indexing pipeline extracts data from the source database, transforms it into search documents, and writes to the search engine. The query pipeline parses user input, executes searches, ranks results, and returns responses. Analytics feed back into relevance tuning to continuously improve search quality.
Search Engine Selection
Choose search engines based on use case, scale, features, and operational complexity. Different engines optimize for different trade-offs.
Elasticsearch / OpenSearch
Elasticsearch is the most widely adopted search engine for complex use cases. OpenSearch is an open-source fork (created by AWS from Elasticsearch 7.10) whose API remains largely compatible, though the two projects have diverged since the fork.
Strengths:
- Powerful query DSL with boolean logic, fuzzy matching, boosting
- Advanced aggregations for faceted search and analytics
- Distributed architecture scales to billions of documents
- Rich ecosystem of plugins and integrations
- Near real-time indexing and search
- Comprehensive REST API
Considerations:
- Resource intensive (memory, CPU)
- Complex cluster management and tuning
- Requires dedicated infrastructure
- Learning curve for query DSL
Best for: Large-scale applications, complex queries, analytics, faceted search, and applications requiring advanced features like geo-search or nested document queries.
Typesense
Typesense is a modern, fast, typo-tolerant search engine optimized for instant search experiences.
Strengths:
- Simple API and minimal configuration
- Automatic typo tolerance with configurable fuzziness
- Sub-50ms search latency for instant search
- Built-in faceting and filtering
- Easy deployment (single binary)
- Semantic search capabilities
Considerations:
- Smaller ecosystem than Elasticsearch
- Limited advanced features
- Data must fit in RAM for best performance
- Newer project with evolving features
Best for: Applications requiring instant search, e-commerce catalogs, documentation search, and use cases prioritizing simplicity and speed over advanced features.
Algolia
Algolia is a hosted search-as-a-service platform optimized for end-user search experiences.
Strengths:
- Extremely fast (sub-10ms queries globally)
- Comprehensive UI libraries (InstantSearch)
- Built-in typo tolerance and relevance
- Automatic infrastructure scaling
- Rich dashboard for configuration
- A/B testing and analytics
Considerations:
- Expensive at scale (pricing by operations and records)
- Less flexible than self-hosted solutions
- Vendor lock-in
- Limited query complexity compared to Elasticsearch
Best for: Customer-facing search, e-commerce, content discovery, and applications requiring world-class search UX without operational overhead.
Meilisearch
Meilisearch provides powerful search with minimal configuration, designed for modern web applications.
Strengths:
- Simple setup with sensible defaults
- Fast search responses (sub-50ms)
- Automatic typo tolerance
- Multi-language support
- Easy deployment (Docker, single binary)
- Good documentation and community
Considerations:
- Limited advanced features
- Smaller community than Elasticsearch
- Basic analytics capabilities
Best for: Small to medium applications, documentation sites, content management systems, and projects requiring quick search implementation.
Apache Solr
Solr is a mature, enterprise-grade search platform built on Apache Lucene, the same library that powers Elasticsearch.
Strengths:
- Proven stability and reliability
- Rich feature set comparable to Elasticsearch
- Strong consistency guarantees
- Excellent documentation
- Enterprise support options
Considerations:
- XML-based configuration (more verbose)
- Less modern API compared to Elasticsearch
- Smaller community momentum
Best for: Enterprise applications, government systems, and organizations requiring proven stability with strong support.
Full-Text Search Implementation
Full-text search analyzes text, tokenizes it into terms, and enables finding documents matching query terms. Understanding text analysis is critical for effective search.
Text Analysis Pipeline
Text analysis transforms raw text into searchable terms through a series of steps:
Character Filters: Clean text before tokenization (remove HTML tags, normalize characters).
Tokenizer: Split text into tokens (words). Common tokenizers include:
- Standard: Splits on whitespace and punctuation
- Whitespace: Splits only on whitespace
- N-gram: Creates overlapping character sequences for partial matching
Token Filters: Transform tokens (lowercase, stemming, stopword removal, synonyms).
Example analysis configuration in Elasticsearch:
{
"settings": {
"analysis": {
"analyzer": {
"custom_english_analyzer": {
"type": "custom",
"tokenizer": "standard",
"char_filter": ["html_strip"],
"filter": [
"lowercase",
"english_stop",
"english_stemmer",
"asciifolding"
]
}
},
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "custom_english_analyzer",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"content": {
"type": "text",
"analyzer": "custom_english_analyzer"
},
"category": {
"type": "keyword"
}
}
}
}
The html_strip character filter removes HTML tags. The lowercase filter normalizes case so "Search" matches "search". The english_stop filter removes common words like "the", "is", "at" that don't add meaning. The english_stemmer reduces words to their root form so "searching", "searched", and "searches" all become "search".
The asciifolding filter converts accented characters to ASCII equivalents so "café" matches "cafe". The multi-field setup (title.keyword) enables both full-text search and exact matching or aggregations.
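The effect of this chain can be approximated in plain TypeScript. This is a simplified sketch with a toy stopword list and no stemmer — real analyzers use full stopword sets and the Porter stemmer — but it illustrates the order of operations:

```typescript
// Toy stopword list (the english_stop filter uses a much larger one).
const STOPWORDS = new Set(['the', 'is', 'at', 'a', 'an', 'and', 'of']);

function analyze(text: string): string[] {
  const stripped = text.replace(/<[^>]+>/g, ' ');                   // char filter: html_strip
  const tokens = stripped.split(/[^\p{L}\p{N}]+/u).filter(Boolean); // tokenizer: standard (approx.)
  return tokens
    .map(t => t.toLowerCase())                                      // filter: lowercase
    .filter(t => !STOPWORDS.has(t))                                 // filter: stop
    .map(t => t.normalize('NFD').replace(/\p{M}/gu, ''));           // filter: asciifolding
}
```

Running `analyze('<p>The Café is Open</p>')` yields `['cafe', 'open']`: tags stripped, case folded, stopwords dropped, accents removed — exactly the terms that end up in the inverted index.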
Search Query Implementation
Build search queries that handle user intent, typos, and relevance requirements:
// Elasticsearch search service
import { Client } from '@elastic/elasticsearch';
interface SearchQuery {
query: string;
filters?: Record<string, any>;
page?: number;
pageSize?: number;
sortBy?: string;
sortOrder?: 'asc' | 'desc';
}
interface SearchResult<T> {
results: T[];
total: number;
page: number;
pageSize: number;
facets?: Record<string, FacetResult[]>;
}
interface FacetResult {
value: string;
count: number;
label?: string;
}
class ElasticsearchService {
private client: Client;
constructor() {
this.client = new Client({
node: process.env.ELASTICSEARCH_URL,
auth: {
apiKey: process.env.ELASTICSEARCH_API_KEY,
},
});
}
async search<T>(
index: string,
searchQuery: SearchQuery
): Promise<SearchResult<T>> {
const { query, filters, page = 1, pageSize = 20, sortBy, sortOrder } = searchQuery;
const esQuery = this.buildQuery(query, filters);
const from = (page - 1) * pageSize;
const response = await this.client.search({
index,
body: {
query: esQuery,
from,
size: pageSize,
sort: sortBy ? [{ [sortBy]: sortOrder || 'desc' }] : undefined,
highlight: {
fields: {
title: { pre_tags: ['<mark>'], post_tags: ['</mark>'] },
content: {
pre_tags: ['<mark>'],
post_tags: ['</mark>'],
fragment_size: 150,
number_of_fragments: 3,
},
},
},
aggs: this.buildAggregations(filters),
},
});
return {
results: response.hits.hits.map(hit => ({
...hit._source,
id: hit._id,
score: hit._score,
highlights: hit.highlight,
})) as T[],
total: response.hits.total.value,
page,
pageSize,
facets: this.parseFacets(response.aggregations),
};
}
private buildQuery(query: string, filters?: Record<string, any>): any {
const must: any[] = [];
const filter: any[] = [];
// Multi-match query for text search
if (query) {
must.push({
multi_match: {
query,
fields: [
'title^3', // Boost title 3x
'description^2', // Boost description 2x
'content',
],
type: 'best_fields',
fuzziness: 'AUTO',
operator: 'or',
minimum_should_match: '75%',
},
});
}
// Apply filters
if (filters) {
Object.entries(filters).forEach(([field, value]) => {
if (Array.isArray(value)) {
filter.push({ terms: { [field]: value } });
} else if (typeof value === 'object' && value.min !== undefined) {
// Range filter
filter.push({
range: {
[field]: {
gte: value.min,
lte: value.max,
},
},
});
} else {
filter.push({ term: { [field]: value } });
}
});
}
return {
bool: {
must: must.length > 0 ? must : [{ match_all: {} }],
filter,
},
};
}
private buildAggregations(filters?: Record<string, any>): any {
return {
categories: {
terms: { field: 'category', size: 20 },
},
price_ranges: {
range: {
field: 'price',
ranges: [
{ to: 50 },
{ from: 50, to: 100 },
{ from: 100, to: 200 },
{ from: 200 },
],
},
},
};
}
private parseFacets(aggregations: any): Record<string, FacetResult[]> {
if (!aggregations) return {};
const facets: Record<string, FacetResult[]> = {};
if (aggregations.categories) {
facets.categories = aggregations.categories.buckets.map(bucket => ({
value: bucket.key,
count: bucket.doc_count,
}));
}
if (aggregations.price_ranges) {
facets.priceRanges = aggregations.price_ranges.buckets.map(bucket => ({
label: this.formatPriceRange(bucket),
value: `${bucket.from || 0}-${bucket.to || ''}`,
count: bucket.doc_count,
}));
}
return facets;
}
private formatPriceRange(bucket: any): string {
if (bucket.from === undefined) return `Under ${bucket.to}`;
if (bucket.to === undefined) return `${bucket.from}+`;
return `${bucket.from}-${bucket.to}`;
}
}
The multi_match query searches across multiple fields with different boost factors. The title^3 syntax weights title matches three times as heavily as content matches, reflecting that title matches are typically more relevant. The fuzziness: 'AUTO' setting allows typos (no edits for terms of 1-2 characters, 1 edit for terms of 3-5 characters, 2 edits for longer terms).
The minimum_should_match: '75%' parameter requires that 75% of query terms match, preventing irrelevant results when users enter many terms. The type: 'best_fields' setting scores documents by the best matching field rather than summing scores across fields.
Filters in the filter clause don't affect relevance scores but narrow results. This is more efficient than including filter criteria in the must clause. Terms filters handle multi-select facets (e.g., multiple categories). Range filters handle numerical or date ranges (e.g., price ranges).
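The 75% threshold is applied to the count of query terms and rounded down; a quick sketch of Elasticsearch's rule for positive percentages makes the arithmetic concrete:

```typescript
// minimum_should_match with a positive percentage: the required number of
// matching clauses is the percentage of optional clauses, rounded down.
function requiredMatches(termCount: number, percent: number): number {
  return Math.floor((termCount * percent) / 100);
}
```

So a four-term query needs three matching terms, and a five-term query still needs only three — worth remembering when users paste long phrases.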
Handling the N+1 Problem
The N+1 problem occurs when search results require fetching related data from the primary database, resulting in one query for search results plus N queries for related data:
// Anti-pattern: N+1 queries
async function searchProducts(query: string): Promise<Product[]> {
const searchResults = await searchService.search('products', { query });
// N additional database queries
const products = await Promise.all(
searchResults.results.map(result =>
db.products.findUnique({
where: { id: result.id },
include: { category: true, reviews: true },
})
)
);
return products;
}
Solution 1: Denormalize into search documents
Include all necessary data in search documents to avoid database queries:
// Index document with denormalized data
interface ProductSearchDocument {
id: string;
name: string;
description: string;
price: number;
categoryId: string;
categoryName: string;
averageRating: number;
reviewCount: number;
imageUrl: string;
inStock: boolean;
}
// No additional queries needed
async function searchProducts(query: string): Promise<ProductSearchDocument[]> {
const searchResults = await searchService.search<ProductSearchDocument>('products', { query });
return searchResults.results;
}
Denormalization trades data redundancy for query performance. When category names change, reindex all products in that category. Use message queues or database triggers to keep search documents synchronized with source data.
Solution 2: Batch database queries
If denormalization isn't feasible, batch database queries:
async function searchProducts(query: string): Promise<Product[]> {
const searchResults = await searchService.search('products', { query });
const productIds = searchResults.results.map(r => r.id);
// Single query with WHERE IN clause
const products = await db.products.findMany({
where: { id: { in: productIds } },
include: { category: true, reviews: true },
});
// Preserve search order
const productMap = new Map(products.map(p => [p.id, p]));
return productIds.map(id => productMap.get(id)).filter(Boolean);
}
Search Relevance Tuning
Relevance determines which results appear first. Good relevance balances textual similarity with business logic like popularity, recency, and user preferences.
Field Boosting
Boost important fields to prioritize their matches:
const query = {
multi_match: {
query: userQuery,
fields: [
'title^5', // Title matches most important
'sku^4', // Product codes highly relevant
'description^2', // Description somewhat important
'content', // Body content baseline
'tags^1.5', // Tags slightly boosted
],
type: 'cross_fields',
},
};
Field boost values are relative. Doubling a boost value doesn't necessarily double the score due to Elasticsearch's scoring algorithm (BM25 by default). Tune boosts empirically by testing queries and adjusting based on result quality.
Function Score Queries
Combine text relevance with custom scoring functions:
const functionScoreQuery = {
function_score: {
query: {
multi_match: {
query: userQuery,
fields: ['title^3', 'description^2', 'content'],
},
},
functions: [
{
// Boost popular items
field_value_factor: {
field: 'popularity_score',
factor: 1.2,
modifier: 'log1p',
missing: 0,
},
},
{
// Boost recent items
gauss: {
created_at: {
origin: 'now',
scale: '30d',
decay: 0.5,
},
},
},
{
// Boost in-stock items
filter: { term: { in_stock: true } },
weight: 1.5,
},
],
score_mode: 'multiply',
boost_mode: 'multiply',
},
};
The field_value_factor function incorporates a numeric field into scoring. The log1p modifier applies logarithmic scaling to prevent extreme values from dominating scores (log1p(x) = log(1 + x)). This means an item with popularity 1000 doesn't score 1000x higher than one with popularity 1.
The gauss function creates a decay curve for date-based boosting. Items created "now" receive full boost. Items created 30 days ago receive 50% boost (decay: 0.5). Older items decay further. This prioritizes recent content without completely eliminating older results.
The filter function applies a constant boost to items matching a filter (in-stock products). The score_mode: 'multiply' combines function scores by multiplication. The boost_mode: 'multiply' multiplies function scores with text relevance scores.
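The decay and modifier math can be reproduced directly. This sketch mirrors the formulas Elasticsearch documents for the gauss function (with zero offset) and the log1p modifier, which helps when tuning scale and decay values:

```typescript
// Gauss decay with offset 0: exp(-age^2 / (2*sigma^2)),
// where sigma^2 = -scale^2 / (2 * ln(decay)).
function gaussDecay(ageDays: number, scaleDays: number, decay: number): number {
  const sigmaSq = -(scaleDays * scaleDays) / (2 * Math.log(decay));
  return Math.exp(-(ageDays * ageDays) / (2 * sigmaSq));
}

// field_value_factor with the log1p modifier: factor * log(1 + value).
function popularityBoost(popularity: number, factor = 1.2): number {
  return factor * Math.log1p(popularity);
}
```

With scale '30d' and decay 0.5, an item created now scores 1.0, a 30-day-old item scores 0.5, and a 60-day-old item scores 0.0625 — the curve falls off quickly but never reaches zero.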
Personalized Ranking
Incorporate user behavior and preferences into ranking:
async function personalizedSearch(
userId: string,
query: string
): Promise<SearchResult> {
const userProfile = await getUserSearchProfile(userId);
const personalizedQuery = {
function_score: {
query: {
multi_match: {
query,
fields: ['title^3', 'description^2', 'content'],
},
},
functions: [
// Boost categories user frequently views
...userProfile.preferredCategories.map(category => ({
filter: { term: { category: category.name } },
weight: 1 + category.affinity * 0.5,
})),
// Boost items similar to past interactions
{
script_score: {
script: {
source: `
cosineSimilarity(params.user_vector, 'embedding_vector') + 1.0
`,
params: {
user_vector: userProfile.embeddingVector,
},
},
},
},
],
score_mode: 'sum',
boost_mode: 'multiply',
},
};
return await searchService.search('products', {
query: personalizedQuery,
});
}
interface UserSearchProfile {
userId: string;
preferredCategories: Array<{ name: string; affinity: number }>;
embeddingVector: number[];
}
async function getUserSearchProfile(userId: string): Promise<UserSearchProfile> {
// Build profile from user behavior
const interactions = await db.userInteractions.findMany({
where: { userId },
orderBy: { timestamp: 'desc' },
take: 100,
});
// Calculate category affinities
const categoryFrequency = new Map<string, number>();
interactions.forEach(interaction => {
const count = categoryFrequency.get(interaction.category) || 0;
categoryFrequency.set(interaction.category, count + 1);
});
const preferredCategories = Array.from(categoryFrequency.entries())
.map(([name, count]) => ({
name,
affinity: count / interactions.length,
}))
.sort((a, b) => b.affinity - a.affinity)
.slice(0, 5);
// Generate embedding vector from interaction history
const embeddingVector = await generateUserEmbedding(interactions);
return {
userId,
preferredCategories,
embeddingVector,
};
}
User profiles track behavior like viewed categories, clicked results, and purchase history. Category affinity represents the proportion of interactions with each category. The embedding vector represents user interests in high-dimensional space for similarity comparisons.
The script score function computes cosine similarity between the user's embedding vector and product embedding vectors. This enables semantic matching based on interest similarity rather than just keywords. For vector search and embeddings, see Machine Learning Integration.
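The script_score above relies on cosine similarity, which Elasticsearch computes natively for dense_vector fields; a plain-TypeScript version shows what the +1.0 shift does (it keeps scores non-negative, since cosine similarity ranges from -1 to 1):

```typescript
// Cosine similarity: dot product of the vectors divided by the
// product of their magnitudes. Assumes equal-length, non-zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the painless script: similarity shifted into [0, 2].
function scriptScore(userVec: number[], docVec: number[]): number {
  return cosineSimilarity(userVec, docVec) + 1.0;
}
```

Identical vectors score 2.0, orthogonal vectors score 1.0, and opposite vectors score 0.0, so the function score never goes negative.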
Pagination Strategies
Pagination affects performance, deep result access, and API design.
Offset-Based Pagination
Traditional pagination using offset and limit:
interface PaginationParams {
page: number;
pageSize: number;
}
async function searchWithOffsetPagination(
query: string,
pagination: PaginationParams
): Promise<SearchResult> {
const from = (pagination.page - 1) * pagination.pageSize;
return await esClient.search({
index: 'products',
from,
size: pagination.pageSize,
body: {
query: { match: { title: query } },
},
});
}
Advantages: Simple implementation, users can jump to any page, easy to calculate total pages.
Disadvantages: Deep pagination is slow (Elasticsearch must score all documents up to the offset), performance degrades linearly with page number, results can shift between pages if documents are added/deleted.
Elasticsearch limits offset + size to 10,000 by default (index.max_result_window setting). Deep pagination requires alternative approaches.
Cursor-Based Pagination
Use search_after for efficient deep pagination:
interface CursorPaginationParams {
pageSize: number;
searchAfter?: any[];
}
async function searchWithCursorPagination(
query: string,
pagination: CursorPaginationParams
): Promise<SearchResult & { cursor: string }> {
const response = await esClient.search({
index: 'products',
size: pagination.pageSize,
body: {
query: { match: { title: query } },
sort: [
{ _score: 'desc' },
{ id: 'asc' }, // Tiebreaker for consistent ordering
],
search_after: pagination.searchAfter,
},
});
const results = response.hits.hits.map(hit => hit._source);
const lastHit = response.hits.hits[response.hits.hits.length - 1];
// Encode cursor for next page
const cursor = lastHit
? Buffer.from(JSON.stringify(lastHit.sort)).toString('base64')
: null;
return {
results,
total: response.hits.total.value,
pageSize: pagination.pageSize,
cursor,
};
}
// Decode cursor from client
function decodeCursor(cursor: string): any[] {
return JSON.parse(Buffer.from(cursor, 'base64').toString('utf-8'));
}
// Encode sort values as an opaque cursor for the client
function encodeCursor(sort: any[]): string {
return Buffer.from(JSON.stringify(sort)).toString('base64');
}
The search_after parameter uses the last document's sort values to fetch the next page. This performs consistently regardless of depth because Elasticsearch doesn't score skipped documents. The tiebreaker sort field (id) ensures consistent ordering when scores are identical.
Advantages: Constant performance for any page depth, efficient for infinite scroll UIs, resilient to data changes during pagination.
Disadvantages: Cannot jump to arbitrary pages, cannot calculate total pages, requires stateful cursor management.
Relay Connection Specification
For GraphQL APIs, the Relay connection spec standardizes cursor pagination:
interface Connection<T> {
edges: Array<{
node: T;
cursor: string;
}>;
pageInfo: {
hasNextPage: boolean;
hasPreviousPage: boolean;
startCursor: string;
endCursor: string;
};
totalCount: number;
}
async function searchProducts(
query: string,
first: number,
after?: string
): Promise<Connection<Product>> {
const searchAfter = after ? decodeCursor(after) : undefined;
const response = await esClient.search({
index: 'products',
size: first + 1, // Fetch one extra to determine hasNextPage
body: {
query: { match: { title: query } },
sort: [{ _score: 'desc' }, { id: 'asc' }],
search_after: searchAfter,
},
});
const hasMore = response.hits.hits.length > first;
const hits = response.hits.hits.slice(0, first);
const edges = hits.map(hit => ({
node: hit._source as Product,
cursor: encodeCursor(hit.sort),
}));
return {
edges,
pageInfo: {
hasNextPage: hasMore,
hasPreviousPage: !!after,
startCursor: edges[0]?.cursor,
endCursor: edges[edges.length - 1]?.cursor,
},
totalCount: response.hits.total.value,
};
}
The Relay spec uses opaque cursors (no assumption about their contents) and provides pagination metadata in pageInfo. Fetching one extra result determines if more pages exist without requiring a separate count query.
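The fetch-one-extra trick is independent of Elasticsearch and worth isolating; a minimal helper under the same assumption (request `first + 1` rows, keep `first`):

```typescript
// Given up to first + 1 fetched rows, return the visible page and
// whether another page exists.
function paginate<T>(rows: T[], first: number): { page: T[]; hasNextPage: boolean } {
  return {
    page: rows.slice(0, first),
    hasNextPage: rows.length > first,
  };
}
```

This avoids a second count query: the presence of the extra row is itself the hasNextPage signal.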
Faceted Search and Filtering
Faceted search enables users to narrow results using filters based on document attributes. Facets show available filter options with result counts.
Aggregations for Facets
async function searchWithFacets(
query: string,
filters: Record<string, string[]>
): Promise<SearchResult> {
const response = await esClient.search({
index: 'products',
body: {
query: buildFilteredQuery(query, filters),
aggs: {
categories: {
terms: {
field: 'category.keyword',
size: 50,
},
},
brands: {
terms: {
field: 'brand.keyword',
size: 50,
},
},
price_ranges: {
range: {
field: 'price',
ranges: [
{ key: 'Under $50', to: 50 },
{ key: '$50-$100', from: 50, to: 100 },
{ key: '$100-$200', from: 100, to: 200 },
{ key: '$200+', from: 200 },
],
},
},
attributes: {
nested: {
path: 'attributes',
},
aggs: {
attribute_names: {
terms: {
field: 'attributes.name.keyword',
size: 20,
},
aggs: {
attribute_values: {
terms: {
field: 'attributes.value.keyword',
size: 10,
},
},
},
},
},
},
},
},
});
return {
results: response.hits.hits.map(hit => hit._source),
total: response.hits.total.value,
facets: parseFacets(response.aggregations),
};
}
function buildFilteredQuery(
query: string,
filters: Record<string, string[]>
): any {
const must: any[] = [
{
multi_match: {
query,
fields: ['title^3', 'description'],
},
},
];
const filter: any[] = [];
// Apply selected filters
Object.entries(filters).forEach(([field, values]) => {
if (values.length > 0) {
filter.push({
terms: { [`${field}.keyword`]: values },
});
}
});
return {
bool: {
must,
filter,
},
};
}
Aggregations execute in the same request as the query and calculate facet values from the filtered result set. The terms aggregation groups results by field values and counts documents in each group. The size parameter limits the number of buckets returned (here, the top 50 categories).
The range aggregation creates predefined buckets for numerical ranges. The nested aggregation handles nested document structures (e.g., products with multiple attributes). The sub-aggregation attribute_values groups values within each attribute name.
Multi-Select Facets
Multi-select facets allow selecting multiple values within a facet (e.g., several categories at once). For this to work, each facet's counts must be calculated as if that facet's own selections weren't applied; otherwise, selecting one category would collapse the category facet to that single value and hide the alternatives.
function buildMultiSelectFacetAggregations(
query: string,
filters: Record<string, string[]>
): any {
const aggs: Record<string, any> = {};
// For each facet, calculate counts excluding its own filter
['categories', 'brands', 'colors'].forEach(facet => {
const filtersExcludingThisFacet = { ...filters };
delete filtersExcludingThisFacet[facet];
aggs[facet] = {
filter: buildFilteredQuery(query, filtersExcludingThisFacet),
aggs: {
values: {
terms: {
field: `${facet}.keyword`,
size: 50,
},
},
},
};
});
return aggs;
}
Each facet aggregation wraps a filter that excludes that facet's selections. This ensures facet counts reflect result counts if that facet value were selected, enabling progressive refinement.
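The "exclude own filter" rule is easiest to see in miniature. This in-memory sketch (hypothetical documents, no search engine) counts facet values over documents matching every other facet's selections:

```typescript
type Doc = Record<string, string>;

// Count values of `facet` across docs that match all selections
// EXCEPT the selections on `facet` itself.
function facetCounts(
  docs: Doc[],
  selections: Record<string, string[]>,
  facet: string
): Record<string, number> {
  const matches = (d: Doc) =>
    Object.entries(selections).every(
      ([field, values]) =>
        field === facet || values.length === 0 || values.includes(d[field])
    );
  const counts: Record<string, number> = {};
  for (const d of docs.filter(matches)) {
    counts[d[facet]] = (counts[d[facet]] ?? 0) + 1;
  }
  return counts;
}
```

With category 'shoes' selected, the category facet still counts every category (so 'hats' remains visible as an option), while the brand facet counts only brands within shoes.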
Autocomplete and Search Suggestions
Autocomplete helps users formulate queries by suggesting completions as they type.
Prefix Matching
Simple prefix matching for autocomplete:
async function autocomplete(prefix: string, limit: number = 10): Promise<string[]> {
const response = await esClient.search({
index: 'products',
body: {
suggest: {
title_suggest: {
prefix,
completion: {
field: 'title_suggest',
size: limit,
skip_duplicates: true,
},
},
},
},
});
return response.suggest.title_suggest[0].options.map(
option => option.text
);
}
The completion suggester requires a completion field type in the mapping:
{
"mappings": {
"properties": {
"title": {
"type": "text"
},
"title_suggest": {
"type": "completion"
}
}
}
}
Index documents with suggestion inputs:
{
"title": "Wireless Bluetooth Headphones",
"title_suggest": {
"input": [
"Wireless Bluetooth Headphones",
"Bluetooth Headphones",
"Headphones",
"Wireless Headphones"
],
"weight": 10
}
}
The input array contains phrases that should trigger this suggestion. The weight parameter prioritizes suggestions (higher weights appear first). The completion suggester is backed by FSTs (finite state transducers), giving millisecond-level prefix matching.
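The input phrases can be generated at index time. One simple approach is to emit every word-suffix of the title — a sketch only; the example above also includes a non-contiguous variant ("Wireless Headphones"), which would need a skip-gram or manual step:

```typescript
// Emit every word-suffix of a title as a completion input phrase,
// so users can start typing from any word onward.
function suggestionInputs(title: string): string[] {
  const words = title.trim().split(/\s+/);
  return words.map((_, i) => words.slice(i).join(' '));
}
```

For "Wireless Bluetooth Headphones" this produces the three contiguous suffixes, covering users who type "bluetooth…" or "head…" directly.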
Typo-Tolerant Autocomplete
The Elasticsearch completion suggester offers basic fuzzy prefix matching via its fuzzy option, but for richer typo tolerance across full queries, use Typesense or a custom implementation:
// Typesense autocomplete with typo tolerance
async function typoTolerantAutocomplete(
prefix: string,
limit: number = 10
): Promise<string[]> {
const response = await typesenseClient
.collections('products')
.documents()
.search({
q: prefix,
query_by: 'title',
prefix: true,
num_typos: 2, // Allow up to 2 typos
per_page: limit,
});
return response.hits.map(hit => hit.document.title);
}
Typesense's typo tolerance uses weighted edit distance, allowing configurable typo counts. This handles common typing errors like transpositions, insertions, deletions, and substitutions.
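The "typos" being counted are edit-distance operations. A classic (unweighted) Levenshtein distance makes the num_typos budget concrete — Typesense applies a weighted variant, but the idea is the same:

```typescript
// Unweighted Levenshtein distance: minimum number of insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  // dp[i][j] = distance between a[0..i) and b[0..j)
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,     // delete from a
        dp[i][j - 1] + 1,     // insert into a
        dp[i - 1][j - 1] + cost // substitute (or match)
      );
    }
  }
  return dp[a.length][b.length];
}
```

"hedphones" is one edit from "headphones", so it falls within a num_typos: 2 budget; note that a transposition like "serach" counts as two unweighted edits.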
Popular Query Suggestions
Suggest popular queries based on analytics:
interface PopularQuery {
query: string;
count: number;
lastSeen: Date;
}
class QuerySuggestionService {
async trackQuery(query: string, resultCount: number): Promise<void> {
await db.searchQueries.upsert({
where: { query: query.toLowerCase() },
create: {
query: query.toLowerCase(),
count: 1,
lastSeen: new Date(),
resultCount,
},
update: {
count: { increment: 1 },
lastSeen: new Date(),
resultCount,
},
});
}
async getPopularQueries(
prefix: string,
limit: number = 10
): Promise<PopularQuery[]> {
return db.searchQueries.findMany({
where: {
query: {
startsWith: prefix.toLowerCase(),
},
resultCount: {
gt: 0, // Only suggest queries that return results
},
},
orderBy: [
{ count: 'desc' },
{ lastSeen: 'desc' },
],
take: limit,
});
}
// Clean up old queries periodically
async cleanupStaleQueries(): Promise<void> {
const cutoffDate = new Date();
cutoffDate.setDate(cutoffDate.getDate() - 90);
await db.searchQueries.deleteMany({
where: {
lastSeen: { lt: cutoffDate },
count: { lt: 10 },
},
});
}
}
Track all search queries and their result counts. Suggest queries that match the prefix, prioritizing by frequency and recency. Filter out queries with zero results to avoid suggesting dead ends.
Indexing Strategies
Indexing performance and freshness affect search functionality and infrastructure costs.
Real-Time Indexing
Index documents immediately when they're created or updated:
// Event-driven indexing
class ProductService {
async createProduct(product: Product): Promise<Product> {
// Save to primary database
const created = await db.products.create({ data: product });
// Index in search engine
await searchService.indexDocument('products', {
id: created.id,
...this.transformForSearch(created),
});
return created;
}
async updateProduct(id: string, updates: Partial<Product>): Promise<Product> {
const updated = await db.products.update({
where: { id },
data: updates,
});
// Update search index
await searchService.updateDocument('products', id, {
...this.transformForSearch(updated),
});
return updated;
}
private transformForSearch(product: Product): ProductSearchDocument {
return {
id: product.id,
title: product.title,
description: product.description,
category: product.category.name,
price: product.price,
inStock: product.quantity > 0,
popularity: product.viewCount + product.purchaseCount * 10,
};
}
}
Real-time indexing provides immediate search visibility but increases latency for write operations. Use async processing for non-critical updates.
Bulk Indexing
Batch index operations for better performance:
class BulkIndexingService {
private indexQueue: Array<{ id: string; document: any }> = [];
private flushInterval: NodeJS.Timeout;
constructor(private batchSize: number = 100, private flushIntervalMs: number = 5000) {
// Flush periodically
this.flushInterval = setInterval(() => this.flush(), flushIntervalMs);
}
async queueForIndexing(id: string, document: any): Promise<void> {
this.indexQueue.push({ id, document });
if (this.indexQueue.length >= this.batchSize) {
await this.flush();
}
}
private async flush(): Promise<void> {
if (this.indexQueue.length === 0) return;
const batch = this.indexQueue.splice(0, this.batchSize);
const operations = batch.flatMap(({ id, document }) => [
{ index: { _index: 'products', _id: id } },
document,
]);
try {
const response = await esClient.bulk({ operations });
if (response.errors) {
const failed = response.items
.filter(item => item.index?.error)
.map(item => ({
id: item.index._id,
error: item.index.error,
}));
logger.error('Bulk indexing errors', { failed });
}
} catch (error) {
logger.error('Bulk indexing failed', { error, count: batch.length });
// Re-queue failed items
this.indexQueue.unshift(...batch);
}
}
}
Bulk operations reduce network overhead and improve throughput by sending multiple documents in a single request. Elasticsearch's bulk API processes operations in batches internally. The flush interval ensures documents are indexed within a predictable time window even if the batch size isn't reached.
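The interleaved action/document shape of the bulk payload is easy to get wrong, so it helps to isolate it. A standalone helper building the same structure the flush method produces:

```typescript
// Build the action/document pairs the bulk API expects: one action entry
// ({ index: { _index, _id } }) immediately followed by the document body.
function toBulkOperations(
  index: string,
  docs: Array<{ id: string; body: Record<string, unknown> }>
): object[] {
  return docs.flatMap(({ id, body }) => [
    { index: { _index: index, _id: id } },
    body,
  ]);
}
```

Every document contributes exactly two entries, so a batch of N documents yields an operations array of length 2N.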
Change Data Capture (CDC)
Use database change streams to keep search indexes synchronized:
// PostgreSQL logical replication for CDC
import { Client } from 'pg';
class CDCIndexingService {
private pgClient: Client;
async startListening(): Promise<void> {
this.pgClient = new Client({
connectionString: process.env.DATABASE_URL,
});
await this.pgClient.connect();
// Create replication slot
await this.pgClient.query(`
SELECT pg_create_logical_replication_slot('search_indexing', 'wal2json');
`);
// Start consuming changes
this.consumeChanges();
}
private async consumeChanges(): Promise<void> {
while (true) {
const result = await this.pgClient.query(`
SELECT * FROM pg_logical_slot_get_changes(
'search_indexing',
NULL,
NULL,
'format-version', '2'
);
`);
for (const row of result.rows) {
const change = JSON.parse(row.data);
await this.handleChange(change);
}
await this.sleep(1000);
}
}
private async handleChange(change: any): Promise<void> {
if (change.table === 'products') {
if (change.action === 'INSERT' || change.action === 'UPDATE') {
const product = await this.fetchProductWithRelations(change.data.id);
await searchService.indexDocument('products', this.transformForSearch(product));
} else if (change.action === 'DELETE') {
await searchService.deleteDocument('products', change.data.id);
}
}
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
CDC provides reliable, eventual consistency between the database and search engine without coupling them in application code. Database changes are the source of truth. This approach handles scenarios like database migrations, bulk operations, or external database modifications that bypass the application.
Alternative CDC implementations include Debezium (Kafka-based), AWS DMS, or database triggers. For message queue integration, see Event-Driven Architecture.
Search Analytics
Track search behavior to improve relevance and identify content gaps.
Query Analytics
interface SearchAnalytics {
query: string;
resultCount: number;
clickedResults: string[];
firstClickPosition: number | null;
userId?: string;
timestamp: Date;
}
class SearchAnalyticsService {
async trackSearch(analytics: SearchAnalytics): Promise<void> {
await db.searchAnalytics.create({
data: analytics,
});
}
// Identify queries with no results
async getNoResultQueries(
startDate: Date,
endDate: Date
): Promise<Array<{ query: string; count: number }>> {
return db.searchAnalytics.groupBy({
by: ['query'],
where: {
resultCount: 0,
timestamp: { gte: startDate, lte: endDate },
},
_count: { query: true },
orderBy: { _count: { query: 'desc' } },
take: 100,
});
}
// Calculate click-through rate by query
async getClickThroughRate(
startDate: Date,
endDate: Date
): Promise<Array<{ query: string; ctr: number }>> {
const searches = await db.searchAnalytics.groupBy({
by: ['query'],
where: {
timestamp: { gte: startDate, lte: endDate },
},
// _count on a nullable field counts only rows where it is non-null,
// so _count.firstClickPosition is the number of searches with a click
_count: { query: true, firstClickPosition: true },
});
return searches.map(row => ({
query: row.query,
ctr: row._count.query > 0
? row._count.firstClickPosition / row._count.query
: 0,
}));
}
// Identify poorly ranked results
async getLowClickPositions(
startDate: Date,
endDate: Date
): Promise<Array<{ query: string; avgPosition: number }>> {
const results = await db.searchAnalytics.groupBy({
by: ['query'],
where: {
firstClickPosition: { not: null },
timestamp: { gte: startDate, lte: endDate },
},
_avg: { firstClickPosition: true },
_count: { query: true },
having: {
firstClickPosition: { _avg: { gt: 5 } },
},
});
return results
.filter(row => row._count.query > 10) // Minimum query volume
.map(row => ({
query: row.query,
avgPosition: row._avg.firstClickPosition,
}));
}
}
No-result queries indicate missing content or inadequate synonyms. Low click-through rates suggest poor relevance or unappealing result presentation. High average click positions indicate relevant results are ranked too low.
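These metrics can also be computed directly from raw analytics rows; an in-memory sketch (hypothetical row shape matching the SearchAnalytics interface above) of click-through rate and average first-click position:

```typescript
interface AnalyticsRow {
  query: string;
  firstClickPosition: number | null; // null when the user clicked nothing
}

// CTR = searches that produced a click / total searches.
function clickThroughRate(rows: AnalyticsRow[]): number {
  if (rows.length === 0) return 0;
  return rows.filter(r => r.firstClickPosition !== null).length / rows.length;
}

// Average position of the first click, over searches that had one.
function avgFirstClickPosition(rows: AnalyticsRow[]): number | null {
  const clicked = rows.filter(r => r.firstClickPosition !== null);
  if (clicked.length === 0) return null;
  return clicked.reduce((s, r) => s + (r.firstClickPosition as number), 0) / clicked.length;
}
```

Tracking both together distinguishes "nobody clicks" (low CTR, a relevance or presentation problem) from "people click, but far down the page" (high average position, a ranking problem).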
Use analytics to:
- Add synonyms for common query variations
- Adjust field boosts based on click patterns
- Identify content gaps to fill
- A/B test relevance changes
For observability and monitoring of search systems, see Observability Guidelines.
Related Topics
- API Design - Designing search API endpoints
- Performance Testing - Load testing search infrastructure
- Caching - Caching search results
- Database Design - Designing searchable data models
- Observability - Logging - Logging search queries and performance
- GraphQL - Implementing GraphQL search APIs
- Spring Boot API Design - Search endpoints in Spring Boot
- React State Management - Managing search state in React
- Angular State Management - Search state with Signals