PageRank and Web-based Information Retrieval
Crawler
Follow these steps to download, modify, compile, and install Webbot on a Unix/Linux machine.
It is a quick and thorough crawler, but I found a small modification necessary to get
complete link-relationship text output.
- Download Webbot version 5.4.0 (it comes with the Libwww library)
- I have made two small changes to the source code:
- Text logging records each destination link only once, even when distinct pages refer
to it, so the full link graph is lost. SQL logging may count all links, but I couldn't
get libwww to compile with MySQL. With my modified Robot/src/HTRobot.c, all links are
logged (prefixed by :::: so you can grep them out; see the sketch after this list).
- Webbot prompts you for a directory in which to store temporary files. This gets
annoying on unattended crawls, so I removed the prompt by changing Library/src/HTDialog.c
- Then build and install as usual: ./configure --with-regex && make && make install
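
To pull the link relationships back out of the crawl output, grep for the :::: marker
(a minimal sketch; the exact format following the prefix depends on the modified
HTRobot.c, so adjust the sed step to match):

    # Keep only the link-relationship lines emitted by the patched
    # HTRobot.c, strip the :::: marker, and save one link per line.
    grep '^::::' crawl.log | sed 's/^:::: *//' > links.txt
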
Webbot supports many options that define or restrict the depth and breadth of the crawl.
Here is an example bash script:
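(A minimal sketch; the -prefix and -depth flags are from the webbot 5.4.0 usage text as
I recall it, so double-check with webbot -help on your build.)

    #!/bin/bash
    # Minimal webbot crawl sketch.  Flag names are assumptions; verify
    # them against "webbot -help" before running.
    SEED="http://www.hollins.edu/"   # example seed page
    OUT=crawl.log

    # Stay within the site and stop five hops from the seed.  The patched
    # HTDialog.c means webbot never stops to ask for a temp directory.
    webbot -prefix "$SEED" -depth 5 "$SEED" > "$OUT" 2>&1

    # Count the link lines emitted by the modified HTRobot.c.
    grep -c '^::::' "$OUT"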
Data
- Langville and Kamvar have several good data sets (see links below).
- hollins.edu webbot crawl (Jan '04)
Code
Pictures
Reports
Researchers
Links
Kenneth Massey, June 1, 2004