Arachni grid, draft design
After the last release a few people noticed that I’ve been adding more and more distributed computing features, so I figured others may be interested in knowing where I’m heading with this.
In all honesty, the distributed features are just a way for me to have fun. The end result will eventually be an automatically deployed, load-balanced, high-performance grid, which sounds (and is) cool but would only be of use to a handful of organisations; it’s by no means a priority.
To give you some background, this is what happens when you use the XMLRPC infrastructure (WebUI or the arachni_xmlrpc script):
P: Instance pool
Real person talk:
1: The client issues a “dispatch” command to the Dispatcher — the Dispatcher then pops an Instance out of the pool.
2: The Dispatcher gives the popped Instance’s info to the Client (an authentication token is also included in that info).
3: The Client sends the scan options to the Instance — URL, modules, plugins, etc.
4: OK if the options were valid, an error if not.
5: While the Instance is busy (i.e. the scan is in progress), the Client asks it to flush the scanner’s output buffer and hand over its contents.
6: Request the scan report.
All these messages are encrypted with key “k” over SSL.
The pool is there to remove the overhead of Instance initialisation and avoid blocking during the “dispatch” call in message 1.
The Pool is populated on start-up and, after an Instance is popped, it replenishes itself to avoid starvation and, as a result, blocking.
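To make the pool mechanics concrete, here’s a minimal sketch of a self-replenishing pool; the class and method names are hypothetical, not Arachni’s actual API:

```ruby
# A minimal sketch of a self-replenishing Instance pool; everything here
# is illustrative, not Arachni's real implementation.
class InstancePool
  def initialize( size, &spawner )
    @spawner = spawner    # block that boots a fresh Instance
    @pool    = Queue.new
    size.times { replenish }
  end

  # Pop an Instance for dispatch and boot a replacement in the background,
  # so subsequent "dispatch" calls never block on initialisation.
  def pop
    instance = @pool.pop
    Thread.new { replenish }
    instance
  end

  private

  def replenish
    @pool << @spawner.call
  end
end

pool     = InstancePool.new( 5 ) { Object.new } # Object.new stands in for a real Instance
instance = pool.pop                             # returns immediately, no boot delay
```

The thread-safe `Queue` does the heavy lifting: popping is instant while the pool has members, and the replacement is booted off the caller’s thread.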
And this is how it’s been for quite some time.
After the new release though, Andres (w3af’s lead developer) and I were talking about plans and ideas for a distributed scanner and he got me all fired up again.
We were both more interested in a high-performance grid rather than a high-availability one and in case you’re not familiar with the concepts let me explain the difference in simple terms.
Modes of operation
High-availability grid
Objective: Scan a lot of sites at the same time.
High-availability is essentially load balancing, which aims to distribute the workload in a way that will best utilize all the nodes in the grid and thus avoid saturation (of either node resources or bandwidth).
Load balancing is simple under these circumstances: you just start the scan using the least burdened available node. Load balancing mid-scan is virtually impossible for this sort of system, and if you need it you’ve done something wrong in the first place.
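As a hypothetical illustration, picking the least burdened node could look something like this; the workload metric and the `Node` structure are assumptions, not Arachni’s actual interface:

```ruby
# Illustrative high-availability dispatch: pick the least burdened node
# before starting a scan. The metric is a made-up example.
Node = Struct.new( :url, :running_scans, :weight )

# Lower score means less burdened; 'weight' lets the user bias the
# choice using their own criteria (cost, capacity, etc.).
def least_burdened( nodes )
  nodes.min_by { |n| n.running_scans * n.weight }
end

grid = [
  Node.new( 'https://node1:7331', 4, 1.0 ),
  Node.new( 'https://node2:7331', 1, 1.0 ),
  Node.new( 'https://node3:7331', 1, 2.0 )
]

least_burdened( grid ).url # => "https://node2:7331"
```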
You can sort of do that already but it’d be up to you to choose the Dispatcher to use.
High-performance grid
Objective: Scan one site really fast.
This is a bit harder to implement, but a lot more fun and interesting, because you need to parallelise a single scan in such a way as to actually gain performance.
Off the top of my head, the easiest way to do this is:
- divide the site’s pages in batches and assign the batches to peers — 50-70 pages per batch for example
- poll the peers at regular intervals and when they’re done grab the results
- once all peers have finished, merge the results and send them back to the client
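The batching step above can be sketched roughly like this; `BATCH_SIZE` and the round-robin assignment are illustrative choices on my part, not the actual implementation:

```ruby
# Rough sketch: split a sitemap into fixed-size batches and round-robin
# them across the available peer Instances. All names are made up.
BATCH_SIZE = 50

def assign_batches( sitemap, peers )
  assignments = Hash.new { |h, k| h[k] = [] }
  sitemap.each_slice( BATCH_SIZE ).with_index do |batch, i|
    assignments[peers[i % peers.size]] << batch   # round-robin across peers
  end
  assignments
end

sitemap = (1..120).map { |i| "http://example.com/page#{i}" }
peers   = [ 'instance_1', 'instance_2' ]

assignments = assign_batches( sitemap, peers )
# 120 pages => batches of 50/50/20: instance_1 gets two batches, instance_2 one.
```

Polling the peers and merging the results would then be a matter of iterating over `assignments` and collecting each peer’s report once it finishes.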
Arachni already tries for maximum bandwidth utilisation by using async requests, so in order to get something more out of it a few criteria need to be satisfied first:
- the site needs to be large enough (to gain some performance improvement from parallelism)
- the nodes need to have independent bandwidth lines (otherwise you’d get the same performance as if using async requests)
- the remote host must be able to handle the heat (even a single Instance can DoS a small server; a grid would be devastating)
In both modes, all Dispatchers will need to be able to find any other Dispatcher in the grid, which means that a peer list needs to be maintained and be accessible at all times. The peer list can simply be a list of the URLs of all Dispatchers.
However, I’m not happy with adding external dependencies (like Redis) to an already sizeable system; for the users’ sake, asking people to set up a high-availability DB would be too much.
So my first thought was a DHT, but it doesn’t really meet the requirements; we don’t need a distributed hash table, we need:
- every Dispatcher to have its own copy of the peer list
- these lists to be kept up-to-date — Dispatcher arrivals and departures need to be accounted for, especially arrivals
So I had to figure this out on my own, and this is what I came up with:
And because my hand-writing is crappy, here’s a transcript (with a bit of extra stuff):
D2 is fired-up with D1 assigned as its neighbour.
(1) D2: announces self to D1
— Convergence —
D3 is fired-up with D1 assigned as its neighbour.
(2) D3: requests peer list from D1
(3) D1: sends peer list to D3
(4) D3: announces self to D1 and requests announcement propagation
(5) D1: propagates the announcement
— Convergence —
D4 is fired-up with D3 assigned as its neighbour.
(6) D4: requests peer list from D3
(7) D3: sends peer list to D4
(8) D4: announces self to D3 and requests propagation
(9) D3: propagates announcement to D2
(10) D3: propagates announcement to D1
— Convergence —
Convergence reached at steps: 1, 5 and 10
Number of required messages for convergence: 2 + node_num (for node_num > 2)
Bandwidth usage for convergence: sizeof(1 + node_num messages) + sizeof(peer list)
(In case you’re wondering, convergence is reached when all nodes share the same knowledge.)
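As a toy model, the join protocol from the transcript can be sketched like this; the class and method names are made up for illustration and peers hold object references instead of URLs, but the message flow matches the steps above:

```ruby
# Toy model of the join protocol: a new Dispatcher asks its assigned
# neighbour for the peer list, announces itself, and the neighbour
# propagates the announcement to every other peer it knows.
class Dispatcher
  attr_reader :url, :peers

  def initialize( url )
    @url   = url
    @peers = {}    # url => Dispatcher (in reality, just a list of URLs)
  end

  # Fired up with a neighbour assigned: request its peer list (2)/(3),
  # then announce ourselves and request propagation (4).
  def join( neighbour )
    neighbour.peers.each_value { |peer| add_peer( peer ) }
    add_peer( neighbour )
    neighbour.announce( self, propagate: true )
  end

  def announce( newcomer, propagate: false )
    add_peer( newcomer )
    return unless propagate
    # (5)/(9)/(10): forward the announcement to every other known peer.
    @peers.each_value { |peer| peer.announce( newcomer ) unless peer == newcomer }
  end

  private

  def add_peer( dispatcher )
    @peers[dispatcher.url] = dispatcher unless dispatcher.url == @url
  end
end

d1 = Dispatcher.new( 'd1' )
d2 = Dispatcher.new( 'd2' ); d2.join( d1 )
d3 = Dispatcher.new( 'd3' ); d3.join( d1 )
d4 = Dispatcher.new( 'd4' ); d4.join( d3 )
# Convergence: every Dispatcher now knows the other three.
```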
I’ve actually written the code for this and it seems to work.
This acts like a railroad switch.
If the user has requested that the next scan be in high-performance mode, it’ll return an H.P. (high-performance) Instance instead of one straight out of the pool.
The H.P. Instance will have the same API as a regular Instance but it will operate as discussed in the “High-performance grid” section or the next section.
The default will be high-availability mode, which will return an Instance from the Dispatcher with the smallest workload (or one selected based on user-provided criteria like weight, cost, etc.).
The High-Performance Instance will crawl the site, split the sitemap in batches of URLs and then assign each batch to other instances.
The other instances will need to be from Dispatchers with a different Pipe-ID (i.e. different bandwidth lines) in order to take advantage of line aggregation.
It will then flush their output buffers and combine the messages into its own buffer in order to provide accurate progress information to the client.
Once all instances have finished their scans it will grab their reports, merge them into a single report and have it ready for the client.
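The final merge step could look something like this; the report shape here is a guess for illustration, not Arachni’s actual report format:

```ruby
# Rough sketch of the merge step: fold each peer Instance's report into
# a single deduplicated report for the client.
def merge_reports( reports )
  {
    sitemap: reports.flat_map { |r| r[:sitemap] }.uniq,
    issues:  reports.flat_map { |r| r[:issues] }.uniq
  }
end

reports = [
  { sitemap: %w(/ /login),  issues: [ 'XSS in /login' ] },
  { sitemap: %w(/ /search), issues: [ 'SQLi in /search' ] }
]

merged = merge_reports( reports )
# merged[:sitemap] => ["/", "/login", "/search"]
```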
Here’s an incomprehensible and confusing diagram (Substitute Supernode with H.P. Instance):
I’m currently working on the H.P. Instance so I should have something to showcase in a few days.
When I do, I’ll let you guys know.