Arachni grid, draft design

Hi guys,

After the last release a few people noticed that I’ve been adding more and more distributed computing features, so I figured that others may be interested to know where I’m heading with this.

In all honesty, the distributed stuff is just a way for me to have fun. The end result will eventually be an automatically deployed, load-balanced, high-performance grid — which sounds (and is) cool, but it'd only be of use to a handful of organisations and it's by no means a priority.

To give you some background, this is what happens when you use the XMLRPC infrastructure (WebUI or the arachni_xmlrpc script):
C: Client
D: Dispatcher
P: Instance pool
I: Instance
S: Server

Client -> Dispatcher operations

Real person talk:
1: The Client issues a “dispatch” command to the Dispatcher — the Dispatcher then pops an Instance out of the pool.
2: The Dispatcher gives the info of the popped Instance to the Client (an authentication token is also included in that info).
3: The Client sends the scan options to the Instance — URL, modules, plugins, etc.
4: The Instance replies with OK if the options were valid, an error if not.
5: While the Instance is busy (i.e. the scan is in progress), the Client asks it to flush the output buffer of the scanner and give it to us.
6: The Client requests the scan report.

All these messages are encrypted via SSL using key “k”.
The pool exists to remove the overhead of Instance initialisation and to avoid blocking during the “dispatch” call in message 1.
The pool is populated on start-up, and after an Instance is popped it replenishes itself to avoid starvation and, as a result, blocking.
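The pool behaviour can be sketched in a few lines of Ruby (a toy sketch only; the class, field names and port numbers are made up, not Arachni's actual internals):

```ruby
require 'securerandom'

# Minimal sketch of the Dispatcher's Instance pool: filled at start-up,
# replenished immediately after every pop so a "dispatch" call never
# blocks waiting for an Instance to initialise.
class InstancePool
  def initialize(size)
    @counter = 0
    @pool    = []
    size.times { replenish }
  end

  # Hand out an Instance's connection info (including its auth token)
  # and immediately spawn a replacement to avoid starvation.
  def pop
    info = @pool.shift
    replenish
    info
  end

  def size
    @pool.size
  end

  private

  # Stand-in for forking a real Instance process; here we just
  # fabricate connection info and an authentication token.
  def replenish
    @counter += 1
    @pool << { url: "localhost:#{7330 + @counter}", token: SecureRandom.hex(8) }
  end
end
```

A “dispatch” then boils down to `pool.pop`, with the returned token used by the Client to authenticate against that Instance.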
And this is how it’s been for quite some time.

After the new release though, Andres (w3af’s lead developer) and I were talking about plans and ideas for a distributed scanner and he got me all fired up again.

We were both more interested in a high-performance grid rather than a high-availability one; in case you’re not familiar with the concepts, let me explain the difference in simple terms.

Modes of operation

High-availability grid

Objective: Scan a lot of sites at the same time.
High-availability is essentially load balancing, which aims to distribute the workload in a way that best utilises all the nodes in the grid and thus avoids saturation (of either node resources or bandwidth).
Load balancing is simple under these circumstances: you just start the scan using the least burdened available node. Load balancing mid-scan is virtually impossible for this sort of system, and if you need it you’ve done something wrong in the first place.

You can sort of do that already, but it’d be up to you to choose which Dispatcher to use.

High-performance grid

Objective: Scan one site really fast.
This is a bit harder to implement, but a lot more fun and interesting, because you need to parallelise a single scan in such a way as to achieve increased performance.

Off the top of my head, the easiest way to do this is:

  1. divide the site’s pages into batches and assign the batches to peers — 50-70 pages per batch, for example
  2. poll the peers at regular intervals and when they’re done grab the results
  3. once all peers have finished, merge the results and send them back to the client
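The three steps above can be sketched like so (a toy example; the sitemap, batch size and peer names are all made up):

```ruby
# Toy sketch of the batching step: split the sitemap into batches of
# ~60 pages and assign them to peer Instances round-robin.
sitemap = (1..200).map { |i| "http://example.com/page-#{i}" }
peers   = %w(peer-1 peer-2 peer-3)

batches     = sitemap.each_slice(60).to_a  # 50-70 pages per batch
assignments = batches.each_with_index.map do |batch, i|
  { peer: peers[i % peers.size], urls: batch }
end

# Each peer scans its batch; once all are done the results are merged
# and sent back to the client (polling and merging omitted here).
```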

Arachni already aims for maximum bandwidth utilisation by using async requests, so in order to get something more out of it a few criteria need to be satisfied first:

  1. the site needs to be large enough (to gain some performance improvement from parallelism)
  2. the nodes need to have independent bandwidth lines (otherwise you’d get the same performance as if using async requests)
  3. the remote host must be able to handle the heat (even a single Instance can DoS a small server; a grid would be devastating)


Dispatcher DB

In both modes, all Dispatchers will need to be able to find any other Dispatcher in the grid, which means that a peer list needs to be maintained and be accessible at all times. The peer list can simply be a list of the URLs of all Dispatchers.

However, I’m not happy with adding external dependencies (like Redis) to an already sizeable system for the user’s sake; asking people to set up a high-availability DB would be too much.

So, my first thought was a DHT, but it doesn’t really meet the requirements; we don’t need a distributed hash table, we need:

  1. every Dispatcher to have its own copy of the peer list
  2. these lists to be kept up-to-date — Dispatcher arrivals and departures need to be accounted for, especially arrivals

So I had to figure this out on my own, and this is what I came up with:

Dispatcher grid

D: Dispatcher

And because my hand-writing is crappy, here’s a transcript (with a bit of extra stuff):
D2 is fired-up with D1 assigned as its neighbour.
(1) D2: announces self to D1
– Convergence –
D3 is fired-up with D1 assigned as its neighbour.
(2) D3: requests peer list from D1
(3) D1: sends peer list to D3
(4) D3: announces self to D1 and requests announcement propagation
(5) D1: propagates the announcement
– Convergence –
D4 is fired-up with D3 assigned as its neighbour.
(6) D4: requests peer list from D3
(7) D3: sends peer list to D4
(8) D4: announces self to D3 and requests propagation
(9) D3: propagates announcement to D2
(10) D3: propagates announcement to D1
– Convergence –

Convergence reached at steps: 1, 5 and 10
Number of required messages for convergence: 2 + node_num (for node_num > 2) — one peer list request, one peer list reply, one announcement, one propagation request and node_num - 2 propagations
Bandwidth usage for convergence: sizeof(1 + node_num messages) + sizeof(peer list)

(In case you’re wondering, convergence is reached when all nodes share the same knowledge.)

I’ve actually written the code for this and it seems to work.
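Here’s a toy, in-process model of that join sequence (real Dispatchers talk to each other over XMLRPC; everything below is simplified and the names are mine, not the actual implementation):

```ruby
# Each node keeps its own copy of the peer list; a newcomer grabs the
# list from its assigned neighbour, announces itself and asks the
# neighbour to propagate the announcement to everyone else.
class Node
  attr_reader :url, :peers

  def initialize(url)
    @url   = url
    @peers = []
  end

  # Fired up with an existing node assigned as our neighbour.
  def join(neighbour)
    @peers = neighbour.peers.dup << neighbour  # request + receive peer list
    neighbour.announce(self, propagate: true)  # announce + request propagation
  end

  def announce(newcomer, propagate: false)
    @peers.each { |peer| peer.announce(newcomer) } if propagate
    @peers << newcomer
  end
end
```

Replaying the transcript (D2 joins via D1, D3 via D1, D4 via D3) leaves every node with a complete peer list, i.e. convergence.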

Dispatcher switch

This acts like a railroad switch.
If the user has requested that the next scan be in high-performance mode, it’ll return an H.P. (high-performance) Instance instead of one straight out of the pool.
The H.P. Instance will have the same API as a regular Instance but it will operate as discussed in the “High-performance grid” section and the next section.

The default will be high-availability mode, which will return an Instance from the Dispatcher with the smallest workload (or based on user-provided criteria like weight, cost, etc.).
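As a sketch, the switch boils down to something like this (the `:workload` field and the method name are placeholders, not the real API):

```ruby
# Railroad-switch sketch: hand out an H.P. Instance when the user asks
# for high-performance mode, otherwise pick an Instance from the least
# burdened Dispatcher (high-availability mode, the default).
def dispatch(dispatchers, high_performance: false)
  return { mode: :high_performance, instance: :hp_instance } if high_performance

  best = dispatchers.min_by { |d| d[:workload] }
  { mode: :high_availability, instance: best[:url] }
end
```

A plain `dispatch(dispatchers)` call thus behaves exactly like today: least burdened Dispatcher wins.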

H.P. Instance

The High-Performance Instance will crawl the site, split the sitemap into batches of URLs and then assign each batch to other Instances.
The other instances will need to be from Dispatchers with a different Pipe-ID (i.e. different bandwidth lines) in order to take advantage of line aggregation.

It will then flush their output buffers and combine the messages into its own buffer in order to provide accurate progress information to the client.
Once all instances have finished their scans it will grab their reports, merge them into a single report and have it ready for the client.
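The Pipe-ID constraint could look something like this (a sketch; `:pipe_id` and `:workload` are hypothetical field names):

```ruby
# Pick at most one Dispatcher per Pipe-ID so each helper Instance sits
# on an independent bandwidth line; within a pipe, prefer the least
# burdened Dispatcher.
def pick_per_pipe(dispatchers)
  dispatchers.group_by { |d| d[:pipe_id] }
             .map { |_pipe, candidates| candidates.min_by { |d| d[:workload] } }
end
```

Batches would then only be handed to the Dispatchers this returns, so no two helpers share a line.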

Here’s an incomprehensible and confusing diagram (Substitute Supernode with H.P. Instance):

Client -> Dispatcher communication

I’m currently working on the H.P. Instance so I should have something to showcase in a few days.
When I do, I’ll let you guys know.


Posted in: /dev/random, Arachni, Projects




A couple of comments on this very interesting article:
* divide the site’s pages in batches and assign the batches to peers — 50-70 pages per batch for example: This means that you’ve already crawled the remote website using only one node, and then divide the load of the “injection” process? How are you going to manage the crawling/spidering part?
* poll the peers at regular intervals and when they’re done grab the results: To reduce network overhead, I would rather have the peers push the data back to the “master” when done.
* the site needs to be large enough: “Large enough” in this case would be more than 20 pages with parameters, correct? If properly tuned, the balancing *should* work for websites of “””any””” size.

All in all, I love it. It would be amazing to have something like this working and scanning “at the speed of light” ;)

Comment by Andres Riancho on 01/08/2011 1:13 pm

Yeah, the crawling will have to be done by the first node, which will then divide the sitemap into batches and pass them to the spawns.
Good thing is that since the crawling will have to be done before the audit starts, it can be done purely in async mode, which will be quite fast.

You’re right about the polling; to be honest, there are a lot more aspects that can be optimised, but I always go with: first make it work, then make it pretty.

As for your last remark, below a certain point it doesn’t make sense to spend all these resources to scan a site in 4-5 mins instead of 7 or 8 mins.
But if you wanted to then sure, you can use the high-performance mode even for smaller sites.

But there are still issues that can hold performance back, like sharing which elements have already been audited between Instances.
You’ll have to sync these every few seconds and hope for the best…

Finally, I’ll probably have to ditch XMLRPC over HTTP for JSON over a custom lightweight protocol — or some other lightweight approach.

So yeah there are a few hurdles that need to be overcome in order to reach maximum performance but it’ll get there.

Comment by Zapotek on 01/08/2011 1:48 pm
