DavidSearch

Search Engine · Java, HTTP, indexing, ranking, information retrieval

Overview

An end-to-end web search engine featuring an HTTP server, a distributed key-value store (KVS), an analytics engine, query processing, ranked retrieval, and a frontend that serves search result pages.

Writeup.

Implementation

This project implements an end-to-end web search engine. First, the crawling job is run. The crawler starts from several root URLs, downloads each page's content into a distributed key-value store, and then follows the links on those pages and repeats.

After the crawler has collected pages, the system processes the stored HTML content to build two main data structures: an inverted index for keyword search and a page ranking table for ordering results.

The indexing pipeline begins by reading each crawled page’s URL and HTML content from storage. Pages missing either field are skipped. For each valid page, the raw HTML is cleaned by removing tags, punctuation, and unnecessary whitespace. The remaining text is converted to lowercase and split into individual words. The system then records which words appear on which pages, making sure that a page is listed only once per word even when the word repeats. These word-to-page mappings are grouped to produce the inverted index, in which each word points to the set of pages that contain it.

The ranking pipeline analyzes the link structure between pages. For each crawled page, the system extracts all outgoing links, resolves relative links, normalizes URLs into a consistent format, and filters out invalid links. Each page is treated as a node in a graph, and each hyperlink becomes a directed edge to another page. The system then runs an iterative PageRank-style algorithm over this graph. Initially, every page starts with the same rank. During each iteration, a page distributes most of its current rank evenly across its outgoing links, while every page also receives a small base rank to account for random navigation. This process repeats until the ranks change by less than a chosen convergence threshold. Once the algorithm converges, the final rank for each page is written back to storage. The result is a searchable backend.
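
The rank iteration described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the project's actual distributed job: the class name PageRankSketch, the in-memory adjacency map, the 0.85 damping factor, and the way dangling pages are handled are all assumed. The source only says that a page passes on "most of" its rank and that every page receives a small base rank.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PageRankSketch {
    // Assumed damping factor; the writeup does not state the value used.
    static final double DAMPING = 0.85;

    /** outLinks maps every crawled page to its (already normalized) outgoing links. */
    static Map<String, Double> rank(Map<String, List<String>> outLinks, double threshold) {
        Map<String, Double> ranks = new HashMap<>();
        for (String page : outLinks.keySet()) {
            ranks.put(page, 1.0); // every page starts with the same rank
        }

        double maxChange;
        do {
            // Every page receives a small base rank for random navigation.
            Map<String, Double> next = new HashMap<>();
            for (String page : outLinks.keySet()) {
                next.put(page, 1.0 - DAMPING);
            }
            // Each page splits the damped part of its rank evenly across its links.
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) {
                    continue; // assumption: a dangling page's share is simply dropped
                }
                double share = DAMPING * ranks.get(e.getKey()) / targets.size();
                for (String target : targets) {
                    // Links to pages outside the crawled set are ignored.
                    next.computeIfPresent(target, (k, v) -> v + share);
                }
            }
            // Converged once no page's rank moved by the threshold or more.
            maxChange = 0.0;
            for (String page : ranks.keySet()) {
                maxChange = Math.max(maxChange, Math.abs(next.get(page) - ranks.get(page)));
            }
            ranks = next;
        } while (maxChange >= threshold);
        return ranks;
    }
}
```

Calling PageRankSketch.rank(graph, 1e-9) on a small link graph returns one rank per crawled page. At the corpus sizes discussed in this writeup the same loop would need to run as a distributed job over KVS tables rather than over in-memory maps.
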
Query results are ordered by a score that combines PageRank with the TF-IDF relevance of each page.
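
The TF-IDF side of that score depends on the inverted index built by the indexing pipeline described earlier in this section. A minimal single-machine sketch of those steps (skip incomplete pages, strip tags and punctuation, lowercase, split, deduplicate per page) might look like the following. The class name InvertedIndexSketch, the regex-based tag stripping, and the in-memory maps are assumptions for illustration; the real pipeline reads from and writes to the KVS.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

final class InvertedIndexSketch {
    /** Removes tags and punctuation, lowercases, and splits the text into words. */
    static List<String> tokenize(String html) {
        String text = html
                .replaceAll("<[^>]*>", " ")              // naive tag removal (assumption)
                .replaceAll("[^\\p{L}\\p{Nd}\\s]", " ")  // drop punctuation
                .toLowerCase(Locale.ROOT)
                .trim();
        if (text.isEmpty()) {
            return List.of();
        }
        return Arrays.asList(text.split("\\s+"));
    }

    /** pages maps URL to raw HTML; the result maps each word to the URLs containing it. */
    static Map<String, Set<String>> build(Map<String, String> pages) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            String url = page.getKey();
            String html = page.getValue();
            if (url == null || html == null) {
                continue; // pages missing either field are skipped
            }
            for (String word : tokenize(html)) {
                // A set per word keeps a page from being listed twice for a repeated word.
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(url);
            }
        }
        return index;
    }
}
```

For a page whose HTML is "<p>Hello, hello world!</p>", the sketch lists that URL once under "hello" and once under "world".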

Optimizations

The hardest problem was scale. We were ambitious early on and crawled 1.28 million pages by Thanksgiving, but PageRank runtime became a bottleneck at that size, so our final deployment runs over a 200,000-page corpus that was fully crawled, indexed, and ranked. Getting even that far meant optimizing nearly every layer of the system, since KVS write throughput was a bottleneck from start to finish.

Most of the heavy lifting happened in the Flame distributed-computation layer. We added the ability to persist intermediate tables to disk and delete old ones to keep memory and disk in check, then built a producer–consumer pipeline where CPU workers push rows into a bounded blocking queue while dedicated consumers batch them into the KVS, cutting per-row network overhead from round trips and packet metadata. After noticing workers sitting idle while the KVS was maxed out, we added multi-level aggregation that combines rows sharing a key before they ever hit the network. We also fused common operations, turning a separate fromTable and flatMapToPair into a single fromTableToPair that skips writing and re-reading an intermediate table.
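
The producer-consumer pipeline described above can be illustrated with a bounded queue and a single batching consumer. This is a sketch under assumptions rather than the Flame code: the class name BatchingWriter, the sentinel-based shutdown, and the sink callback (which stands in for a batched KVS write) are all hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

final class BatchingWriter {
    // Sentinel row, compared by identity, that tells the consumer to stop.
    private static final Map.Entry<String, String> POISON =
            new AbstractMap.SimpleImmutableEntry<>("", "");

    private final BlockingQueue<Map.Entry<String, String>> queue;
    private final int batchSize;
    private final Consumer<List<Map.Entry<String, String>>> sink; // stand-in for a batched KVS put
    private final Thread consumer;

    BatchingWriter(int capacity, int batchSize, Consumer<List<Map.Entry<String, String>>> sink) {
        this.queue = new ArrayBlockingQueue<>(capacity); // bounded: producers block when full
        this.batchSize = batchSize;
        this.sink = sink;
        this.consumer = new Thread(this::drainLoop, "kvs-batcher");
        this.consumer.start();
    }

    /** Called by CPU workers; blocks while the queue is full, which applies backpressure. */
    void put(String key, String value) {
        enqueue(new AbstractMap.SimpleImmutableEntry<>(key, value));
    }

    /** Signals end of input and waits until the remaining rows have been sent. */
    void close() {
        enqueue(POISON);
        try {
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private void enqueue(Map.Entry<String, String> row) {
        try {
            queue.put(row);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private void drainLoop() {
        List<Map.Entry<String, String>> batch = new ArrayList<>(batchSize);
        try {
            while (true) {
                Map.Entry<String, String> row = queue.take();
                if (row == POISON) {
                    break;
                }
                batch.add(row);
                if (batch.size() >= batchSize) {
                    sink.accept(new ArrayList<>(batch)); // one network call for many rows
                    batch.clear();
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        if (!batch.isEmpty()) {
            sink.accept(new ArrayList<>(batch)); // flush the final partial batch
        }
    }
}
```

With a batch size of 3, seven rows reach the sink as batches of 3, 3, and 1. The bounded queue is the design point: when the KVS falls behind, producers block instead of buffering rows without limit.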

On the KVS side, batched multiGet/multiPut requests amortized network cost across many rows, and lock striping with 65,521 lock objects (a prime, for even distribution) replaced coarse table-level locks, letting threads write to different rows concurrently without the unbounded memory growth of a lock-per-row scheme. The biggest win was asynchronous background ingestion: incoming writes land in per-table in-memory queues and the HTTP request returns immediately, while background flusher threads batch them to disk. This decoupled network I/O from disk I/O so the store could absorb bursty write traffic, with an explicit flush() API for jobs to call when they need durability guaranteed.
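
Lock striping as described above can be sketched as a fixed array of locks indexed by a hash of the row. The class name StripedLocks, the hash mixing, and the withRowLock helper are assumptions for illustration; only the stripe count of 65,521 comes from the text.

```java
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

final class StripedLocks {
    // Stripe count from the writeup; a prime spreads hash codes evenly across stripes.
    static final int STRIPES = 65_521;

    private final ReentrantLock[] locks = new ReentrantLock[STRIPES];

    StripedLocks() {
        for (int i = 0; i < STRIPES; i++) {
            locks[i] = new ReentrantLock();
        }
    }

    /** Maps a (table, row) pair to a stripe index in [0, STRIPES). */
    int stripeFor(String table, String rowKey) {
        int h = 31 * table.hashCode() + rowKey.hashCode();
        return Math.floorMod(h, STRIPES); // floorMod keeps the index non-negative
    }

    /** Runs the action while holding the stripe lock for this row. */
    <T> T withRowLock(String table, String rowKey, Supplier<T> action) {
        ReentrantLock lock = locks[stripeFor(table, rowKey)];
        lock.lock();
        try {
            return action.get();
        } finally {
            lock.unlock();
        }
    }
}
```

Memory use is fixed at 65,521 lock objects no matter how many rows exist. Two rows contend only when they hash to the same stripe, which is the trade-off against both a single table-level lock and a lock per row.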

The crawler gained per-domain page limits, regex URL blacklisting, restart-from-checkpoint support for multi-day crawls, adaptive frontier sampling to bound memory, and layered language detection that cut non-English pages by roughly 70%. The indexer used Porter stemming and aggressive stopword filtering to shrink the index by about 40% while improving match quality.
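
Two of the crawler controls above, per-domain page limits and regex URL blacklisting, are simple enough to sketch together. The class name CrawlFilter, the admit method, and the example patterns in the usage note are hypothetical; the writeup does not describe how the real crawler implements these checks.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.regex.Pattern;

final class CrawlFilter {
    private final int maxPagesPerDomain;
    private final List<Pattern> blacklist = new ArrayList<>();
    private final Map<String, AtomicInteger> pagesPerDomain = new ConcurrentHashMap<>();

    CrawlFilter(int maxPagesPerDomain, List<String> blacklistRegexes) {
        this.maxPagesPerDomain = maxPagesPerDomain;
        for (String regex : blacklistRegexes) {
            blacklist.add(Pattern.compile(regex));
        }
    }

    /** Returns true if the URL may be crawled, and counts it against its domain's limit. */
    boolean admit(String url) {
        for (Pattern p : blacklist) {
            if (p.matcher(url).find()) {
                return false; // blacklisted URLs never consume domain quota
            }
        }
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // malformed URL
        }
        if (host == null) {
            return false;
        }
        AtomicInteger count = pagesPerDomain.computeIfAbsent(
                host.toLowerCase(Locale.ROOT), h -> new AtomicInteger());
        return count.incrementAndGet() <= maxPagesPerDomain;
    }
}
```

For example, new CrawlFilter(2, List.of("\\.pdf$", "/login")) admits the first two URLs seen on a host, rejects the third, and rejects any URL ending in .pdf regardless of host.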

Ranking

Final results combine content relevance with link authority. For each query the frontend normalizes and stems terms with the same logic as the indexer, fetches posting lists from the KVS in parallel across a thread pool, and pre-filters to the top candidates by query-term overlap, avoiding scoring millions of documents and cutting computation by around 99% on large corpora. It then scores documents with 0.8 · cosine(d, q) + 0.2 · normalizedPR(d), where the cosine term measures TF-IDF similarity and PageRank is log-normalized so a few high-authority pages don't dominate. The 80/20 split balances topical relevance against trustworthiness in the final ranking.
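
The scoring formula above can be written out as a short Java sketch. The 0.8 and 0.2 weights come from the text; the class name ScoreSketch, the sparse-map vector representation, and the exact log normalization (log(1 + pr) / log(1 + maxPr)) are assumptions, since the writeup says only that PageRank is log-normalized.

```java
import java.util.Map;

final class ScoreSketch {
    static final double W_CONTENT = 0.8; // weight on TF-IDF cosine similarity
    static final double W_LINK = 0.2;    // weight on normalized PageRank

    /** Cosine similarity between sparse TF-IDF vectors (term to weight). */
    static double cosine(Map<String, Double> doc, Map<String, Double> query) {
        double dot = 0.0;
        double queryNorm = 0.0;
        for (Map.Entry<String, Double> e : query.entrySet()) {
            dot += e.getValue() * doc.getOrDefault(e.getKey(), 0.0);
            queryNorm += e.getValue() * e.getValue();
        }
        double docNorm = 0.0;
        for (double w : doc.values()) {
            docNorm += w * w;
        }
        if (docNorm == 0.0 || queryNorm == 0.0) {
            return 0.0;
        }
        return dot / Math.sqrt(docNorm * queryNorm);
    }

    /** Assumed normalization: log(1 + pr) / log(1 + maxPr), which lies in [0, 1]. */
    static double normalizedPR(double pr, double maxPr) {
        return maxPr <= 0.0 ? 0.0 : Math.log1p(pr) / Math.log1p(maxPr);
    }

    /** Final score: 0.8 * cosine(d, q) + 0.2 * normalizedPR(d). */
    static double score(Map<String, Double> doc, Map<String, Double> query,
                        double pr, double maxPr) {
        return W_CONTENT * cosine(doc, query) + W_LINK * normalizedPR(pr, maxPr);
    }
}
```

Under this normalization a document that matches the query vector exactly and has the highest PageRank in the corpus scores 1.0, while a document with no overlapping terms can score at most 0.2 from link authority alone.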