DavidSearch
Overview
An end to end web search engine featuring an HTTP server, distributed kvs store, analytics engine, query processing, ranked retrieval, and frontend serving search result pages.
Implementation
This project implements an end to end web search engine. First, the crawling job is run. This crawler takes pages given by several root URLs, downloads the content into a distributed key-value store, and then follows links from the original pages and repeats. After pages are collected by the crawler, the system processes the stored HTML content to build two main data structures: an inverted index for keyword search and a page ranking table for ordering results. The indexing pipeline begins by reading each crawled page’s URL and HTML content from storage. Pages missing either field are skipped. For each valid page, the raw HTML is cleaned by removing tags, punctuation, and unnecessary whitespace. The remaining text is converted to lowercase and split into individual words. The system then records which words appear on which pages, making sure that the same page is not listed multiple times for a repeated word. These word-to-page mappings are grouped together to produce an inverted index, where each word points to the set of pages that contain it. The ranking pipeline analyzes the link structure between pages. For each crawled page, the system extracts all outgoing links, resolves relative links, normalizes URLs into a consistent format, and filters invalid links. Each page is treated as a node in a graph, and each hyperlink becomes a directed edge to another page. The system then runs an iterative PageRank-style algorithm over this graph. Initially, every page starts with the same rank. During each iteration, a page distributes most of its current rank evenly across its outgoing links, while every page also receives a small base rank to account for random navigation. This process repeats until the ranks change by less than a chosen convergence threshold. Once the algorithm converges, the final rank for each page is written back to storage. The result is a searchable backend. The results are returned through a rank that is calculated using a combination of PageRank and the tf-idf score.
Optimizations
One of the biggest problems with the system was the scale of our corpus. We decided early on to be ambitious and crawl 10x the class recommended amount of 100k total pages. To accomplish a feat like this, we first needed to make several optimizations to the crawler. WIP