Roundabout – A high-performance, distributed crawler

When I started development on the Arachni high-performance grid, my focus was on the audit part, i.e. finding a way to distribute the audit of batches of individual elements across multiple nodes while avoiding duplication of effort amongst them.
It was a bit tricky to get right but it turned out to be quite doable and worthwhile.
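To give you a rough idea of what I mean, here's a minimal Ruby sketch (hypothetical names and node addresses, not Arachni's actual code): give each element a stable ID and assign it to a node by hashing that ID, so every node computes the same partitioning and nothing gets audited twice.

require 'digest'

# Hypothetical illustration of audit-workload distribution -- not Arachni's real code.
class WorkloadDistributor
    def initialize( node_urls )
        @node_urls = node_urls
    end

    # Deterministically maps an element ID to a node; every node computes
    # the same assignment, so no element gets audited twice.
    def node_for( element_id )
        @node_urls[ Digest::SHA1.hexdigest( element_id ).to_i( 16 ) % @node_urls.size ]
    end

    # Groups a batch of element IDs by the node responsible for auditing them.
    def distribute( element_ids )
        element_ids.group_by { |id| node_for( id ) }
    end
end

distributor = WorkloadDistributor.new( %w(node-1:7331 node-2:7331 node-3:7331) )
batches     = distributor.distribute( %w(form:/login:user form:/login:pass link:/search:q) )
batches.each { |node, ids| puts "#{node} => #{ids.join( ', ' )}" }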

However, the crawl was still done the old-fashioned way: the master instance would crawl the targeted website and, once that was complete, analyze all the pages it had found and spread the workload.
I had always intended to try my hand at something similar for the crawling process, but it was never a high priority.
Still, as you can see from my last post, I did sort of figure it out, although I hadn't had a chance to implement it until now.

This is tricky to do because there's no way of knowing the workload beforehand; a website is basically a freaking labyrinth, and precious information (new paths) can be hidden behind walls and walls of crap.

On the other hand, when running Arachni in HPG mode you already have a few nodes up and running anyway, so why not utilize them a bit more, even if the distributed crawl turns out to be only slightly faster than a single crawler?
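To make the idea a bit more concrete, here's a rough, single-process Ruby sketch (purely hypothetical, nothing like Roundabout's actual code): every node owns a slice of the URL space, keeps whatever it discovers within its own slice and forwards the rest to the responsible peer, so no page gets crawled twice.

require 'digest'
require 'set'

# Toy illustration of the distributed crawl idea -- not Roundabout's real code.
class CrawlerNode
    def initialize( id, registry )
        @id       = id
        @registry = registry   # id => node; stands in for RPC clients to the peers
        @queue    = []
        @visited  = Set.new
    end

    # Deterministically picks the node responsible for a URL.
    def owner_of( url )
        ids = @registry.keys.sort
        ids[ Digest::SHA1.hexdigest( url ).to_i( 16 ) % ids.size ]
    end

    # Keep the URL if we own it, otherwise forward it to the responsible peer.
    def push( url )
        owner = owner_of( url )
        return @registry[owner].push( url ) if owner != @id
        @queue << url unless @visited.include?( url ) || @queue.include?( url )
    end

    def idle?
        @queue.empty?
    end

    # Drains the local queue; the block stands in for fetching a page and
    # extracting its paths.
    def run( &discover )
        while url = @queue.shift
            next if @visited.include?( url )
            @visited << url
            Array( discover.call( url ) ).each { |u| push( u ) }
        end
    end
end

# Tiny simulated site so the example is self-contained.
links = {
    'http://test.com/'  => %w(http://test.com/a http://test.com/b),
    'http://test.com/a' => %w(http://test.com/b http://test.com/c)
}

registry = {}
%w(node-1 node-2 node-3).each { |id| registry[id] = CrawlerNode.new( id, registry ) }
registry['node-1'].push( 'http://test.com/' )

# The real thing would run the nodes concurrently; here we just loop until
# every queue has drained.
registry.each_value { |node| node.run { |url| links[url] || [] } } until registry.values.all?( &:idle? )

The nice part, at least in this sketch, is that the URL-to-node mapping is computed locally by each node, so there's no central queue to become a bottleneck.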

With that in mind, I started implementing that sort of crawler yesterday, and here it is.
It exists purely as a toy, a fun experiment, not as a stable system. I may put more effort into it in the future, but my main reason for doing this is to explore the idea and eventually port it over to Arachni.

If you find this interesting, want to help out with the research, have any sort of feedback or just want to get in touch, don't hesitate to do so.

Cheers,
Tasos L.


Posted in: Arachni, Open Source, Programming, Projects, Ruby, Security, Web Application




2 Comments



And you should read:

http://irl.cs.tamu.edu/people/hsin-tsang/papers/tweb2009.pdf

"""
This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.

10. CONCLUSION
This article tackled the issue of scaling Web crawlers to billions and even trillions of pages using a single server with constant CPU, disk, and memory speed. We identified several impediments to building an efficient large-scale crawler and showed that they could be overcome by simply changing the BFS crawling order and designing low-overhead disk-based data structures. We experimentally tested our techniques in the Internet and found them to scale much better than the methods proposed in prior literature. Future work involves refining reputation algorithms, assessing their performance, and mining the collected data.
"""

Comment by Kristian Hermansen on 22/05/2012 9:11 am


Great paper (and great acronyms); it sucks that I don't have the resources to even compare my stuff to it, though. That said, the paper addresses a lot of issues that don't really transfer to Arachni's domain, i.e. there's no need for politeness or worrying about intrusiveness.

However, one thing I could transfer from the paper to my crawler is the URL off-loading (RAM->disk) technique. I've already got something similar going on, but it's used less for look-ups and more for storage.
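Just to illustrate what I mean (a rough sketch, not my actual implementation): keep only a small buffer of pending URLs in RAM, de-dup using cheap digests, and flush the buffer to disk once it grows past a threshold.

require 'digest'
require 'set'
require 'tempfile'

# Rough illustration of RAM->disk URL off-loading -- hypothetical, not my real code.
class DiskBackedFrontier
    def initialize( limit = 10_000 )
        @limit  = limit
        @buffer = []
        @seen   = Set.new                    # only digests, to keep RAM usage low
        @disk   = Tempfile.new( 'frontier' )
    end

    def push( url )
        digest = Digest::SHA1.hexdigest( url )
        return if @seen.include?( digest )
        @seen << digest
        @buffer << url
        offload if @buffer.size >= @limit
    end

    # Moves the in-memory buffer to disk, one URL per line.
    def offload
        return if @buffer.empty?
        @disk.puts( @buffer )
        @disk.flush
        @buffer.clear
    end

    # Yields everything stored so far, reading the bulk of it back from disk.
    def each_url( &block )
        offload
        @disk.rewind
        @disk.each_line { |line| block.call( line.chomp ) }
    end
end

frontier = DiskBackedFrontier.new( 2 )
%w(http://test.com/ http://test.com/a http://test.com/ http://test.com/b).each { |u| frontier.push( u ) }
frontier.each_url { |url| puts url }

The digests still live in RAM in this sketch, so it's really about keeping the bulk of the URL strings off the heap rather than doing disk-based look-ups, which is what I meant above.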

It was a very interesting read nonetheless, man; thanks a lot for sharing.

Comment by Zapotek on 22/05/2012 4:44 pm
