Web Crawler

A web crawler is an automated program or software that explores the World Wide Web, collecting and downloading HTML web pages through a systematic approach. Crawlers may also be referred to as spiders, robots, or worms.

In 1993, Matthew Gray launched the world's first documented web crawler, the "World Wide Web Wanderer". It was originally developed to measure the size of the Web, but it soon became a means to collect and index web pages into the "Wandex". The "Wanderer" thereby led to the first web search engine.

In its wake, more web crawlers were developed and introduced to the World Wide Web, which also contributed to the rise of search engines such as Yahoo!, AltaVista, and Google.

Functions of a crawler

The web crawler can be used for many purposes, but its main role in modern information retrieval is to download web pages and index them so that search engine users can find pages matching their queries. It may also be used to support web archiving and data mining.

How does crawling work?

Today, crawling is conducted through multiple processes that run in parallel.

First, the web crawler must keep track of the URLs it still needs to download, as well as the URLs it has already visited and downloaded. Two notable data structures support the web crawling process:

  • Duplicate URL eliminator: A data structure that records the URLs already discovered, so that the same URL is not added for downloading more than once.
  • Frontier: A data structure that keeps track of the URLs that have yet to be downloaded.

For each process, the crawler first identifies an initial set of URLs to traverse, also called seed URLs. This set of URLs is added to the frontier. The frontier embodies the logic and policies a crawler must adhere to while it runs: for example, which pages to visit next, the order in which pages should be prioritized, and how often to revisit a page that has already been crawled. Crawlers also follow "politeness policies", which ensure they do not send an overwhelming number of requests to a web server and degrade the website's performance.

When the crawler receives a URL from the frontier, it downloads the web document located at that URL, then parses it to extract any new links on the page, which are added to the frontier.
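
As a rough illustration, the following Python sketch walks through this loop for a single crawling process, using a plain queue as the frontier and a set as the duplicate URL eliminator. The seed URL, page limit, and helper names are assumptions made for the example, not part of any particular crawler.

  # A minimal single-process sketch of the crawl loop (illustrative only).
  from collections import deque
  from html.parser import HTMLParser
  from urllib.parse import urljoin
  from urllib.request import urlopen

  SEED_URLS = ["https://example.com/"]   # assumed seed set
  MAX_PAGES = 100                        # stop condition for this sketch

  class LinkExtractor(HTMLParser):
      """Collects href values from <a> tags while parsing a page."""
      def __init__(self):
          super().__init__()
          self.links = []
      def handle_starttag(self, tag, attrs):
          if tag == "a":
              for name, value in attrs:
                  if name == "href" and value:
                      self.links.append(value)

  frontier = deque(SEED_URLS)   # URLs yet to be downloaded
  seen = set(SEED_URLS)         # duplicate URL eliminator: URLs already discovered

  while frontier and len(seen) <= MAX_PAGES:
      url = frontier.popleft()
      try:
          html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
      except (OSError, ValueError):
          continue                        # skip pages that fail to download
      extractor = LinkExtractor()
      extractor.feed(html)
      for link in extractor.links:
          absolute = urljoin(url, link)   # resolve relative links
          if absolute not in seen:        # duplicate elimination
              seen.add(absolute)
              frontier.append(absolute)   # hand the new URL to the frontier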

Robots Exclusion Protocol

Before downloading a page, the crawler must also check the site's robots.txt file, if one exists, which specifies the files on the site that are off-limits to crawlers. When a crawler requests this file, the result is usually cached so the crawler does not need to request it repeatedly. Web administrators can also set an expiry time for the cached copy, ensuring the crawler fetches a fresh copy once the file is updated.
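
As a small illustration, the sketch below checks a site's robots.txt with Python's standard robotparser module before fetching a URL; the user agent name is an assumption for the example.

  # A sketch of honouring the Robots Exclusion Protocol with Python's standard
  # library; the user agent name "LIBR557Bot" is an illustrative assumption.
  from urllib import robotparser
  from urllib.parse import urlparse, urlunparse

  def allowed_to_fetch(url, user_agent="LIBR557Bot"):
      """Return True if the site's robots.txt permits crawling this URL."""
      parts = urlparse(url)
      robots_url = urlunparse((parts.scheme, parts.netloc, "/robots.txt", "", "", ""))
      parser = robotparser.RobotFileParser()
      parser.set_url(robots_url)
      try:
          parser.read()              # downloads and parses the robots.txt file
      except OSError:
          return True                # no robots.txt reachable: treat as allowed
      return parser.can_fetch(user_agent, url)

  # A real crawler would cache the parsed file per host (and respect its
  # expiry) rather than re-reading robots.txt for every URL on the same site.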

The newly identified links are sent to a URL distributor that delegates each URL to a crawling process. The URLs are then examined by a custom URL filter, which discards blacklisted URLs and those with irrelevant file extensions, and by the duplicate URL eliminator, which removes URLs that have already been discovered.

After the filtering process, another component, the URL prioritizer, decides the order in which the remaining URLs are accessed, based on factors such as each page's perceived importance.
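
A minimal sketch of these two steps might look as follows, with an assumed blacklist, an assumed set of skipped file extensions, and a toy depth-based score standing in for a real importance measure.

  # A sketch of the URL filter and prioritizer; the blacklist, skipped file
  # extensions, and the depth-based scoring rule are illustrative assumptions.
  import heapq
  from urllib.parse import urlparse

  BLACKLISTED_HOSTS = {"spam.example.net"}               # assumed blacklist
  SKIPPED_EXTENSIONS = (".jpg", ".png", ".zip", ".exe")  # assumed irrelevant types

  def passes_filter(url):
      """Custom URL filter: drop blacklisted hosts and irrelevant file types."""
      parts = urlparse(url)
      if parts.netloc in BLACKLISTED_HOSTS:
          return False
      return not parts.path.lower().endswith(SKIPPED_EXTENSIONS)

  def priority(url):
      """Toy importance score: shallower URLs are fetched first."""
      return urlparse(url).path.count("/")   # heapq pops the smallest value first

  prioritized_frontier = []                  # a priority queue replaces the plain queue

  def enqueue(url):
      if passes_filter(url):
          heapq.heappush(prioritized_frontier, (priority(url), url))

  def next_url():
      score, url = heapq.heappop(prioritized_frontier)
      return url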

This cycle continues until no URLs remain to be crawled within the process.

Types of crawling

Batch crawling

The crawler takes a "static snapshot" of a collection of web documents, starting from an initial seed set of URLs and traversing the pages reached through the links found in those seeds. Because of the large amount of content involved, the process can be slow and consume significant network bandwidth.

Incremental or continuous crawling

The crawler balances its resources between downloading brand-new web pages and re-downloading pages that have been crawled in the past. Its main purpose is to discover fresh content while maintaining coverage of both new and existing pages.
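
One simple way to picture this balance is a scheduler that interleaves a queue of new URLs with a queue of pages due for re-crawling, as in the sketch below; the 70/30 split is an arbitrary assumption for illustration.

  # A toy scheduler illustrating how an incremental crawler balances fetching
  # brand-new URLs against refreshing already-crawled pages.
  import random
  from collections import deque

  new_urls = deque()        # pages never downloaded before
  revisit_urls = deque()    # previously crawled pages due for a refresh
  NEW_PAGE_SHARE = 0.7      # assumed fraction of fetch slots spent on discovery

  def next_fetch():
      """Pick the next URL, favouring discovery while keeping coverage fresh."""
      if new_urls and (not revisit_urls or random.random() < NEW_PAGE_SHARE):
          return new_urls.popleft()
      if revisit_urls:
          return revisit_urls.popleft()
      return None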

Scope crawling

Scope crawlers crawl only pages within a specific category or scope, such as a topic, language, format, genre, or geographical location. For example, a crawler can be set to crawl only web documents related to "house plants". Limiting the scope reduces both the time needed to crawl and the cost of running the process.
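
A scope crawler needs some test of whether a page falls within its scope. The sketch below uses simple keyword matching for the "house plants" example; real focused crawlers typically rely on more sophisticated topic classifiers.

  # A sketch of a scope (focused) crawler's relevance test; keyword matching
  # is a deliberately simple stand-in for a real topic classifier.
  TOPIC_KEYWORDS = ("house plant", "houseplant", "succulent", "potting soil")

  def in_scope(page_text):
      """Keep a page, and follow its links, only if it matches the topic."""
      text = page_text.lower()
      return any(keyword in text for keyword in TOPIC_KEYWORDS)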

Deep web crawling

Some content cannot be crawled unless the crawler fills in and submits HTML forms. This is most commonly seen on sites with large amounts of regularly updated content, such as Facebook, Twitter, and other social media platforms.
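
In the simplest case, a deep web crawler fills in a form by sending the form's fields in a POST request, as in the hypothetical sketch below; the form URL and the field name "q" are assumptions for the example.

  # A sketch of deep-web crawling through an HTML form: the crawler submits a
  # query as a POST request and crawls the resulting page.
  from urllib.parse import urlencode
  from urllib.request import Request, urlopen

  def fetch_form_results(form_url, query):
      """Fill in a search form programmatically and return the result HTML."""
      data = urlencode({"q": query}).encode("utf-8")   # the form fields to submit
      request = Request(form_url, data=data)           # supplying data makes this a POST
      with urlopen(request, timeout=10) as response:
          return response.read().decode("utf-8", errors="replace")

  # e.g. fetch_form_results("https://example.com/search", "house plants")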

Challenges

Scale of the web

The World Wide Web is home to over a billion websites, many hosting immense collections of web pages, and this vast body of content evolves continuously as the world changes. While modern search engines employ large numbers of computers and high-speed network links, crawlers still need to be engineered to handle the sheer quantity of URLs to be traversed.

Content selection tradeoffs

Crawling is meticulously controlled to target valuable content and ignore irrelevant, duplicate, or meaningless content. A crawler may come across a seemingly irrelevant page and stop following that branch, never reaching linked pages further on that could be useful. While content creators can be encouraged to produce meaningful, structured content, the crawler must also strike a balance between discovering new pages and re-visiting previously crawled pages.

Crawler traps

Many websites contain malicious or misleading content designed to rank higher in search engines, which can trap crawlers and lead them to inadvertently drive web traffic to commercial sources. Spammers may benefit financially from the resulting advertising, and meaningless or inappropriate content may be mistakenly crawled and surfaced to users.

Future directions

While crawling has existed for nearly three decades, more research is needed to refine its processes. With the explosion of user-generated content platforms such as TikTok and Instagram, it would be beneficial to explore how best to deep crawl these multimedia platforms and integrate their content into search results.
