Ruby cache implementation
I recently found that my URL normalization methods were sucking up loads of CPU time and that caching these methods (with a simple Ruby Hash) was cutting the time of a 1000-page crawl to almost half; that was a great day as you can imagine, but for the time being I left it at that and continued working on the specs — which I’ve now gotten sick of.
After I finished adding specs for all Arachni’s core classes I went back to add a proper cache, after all I couldn’t just let these methods infinitely populate their cache hashes.
Never having been in that situation before I took a stroll over at Wikipedia to get me started. I soon discovered that what I needed was a Least Recently Used cache, thus I did a bit of googling to see if some kind soul had already written such a cache in Ruby. Luckily, I did find a few implementations that seemed to do the job and with those as a base I got cracking.
Sadly, the overhead of having to maintain ages for all entries after each access operation completely negated the performance improvements of caching — i.e. the crawl time was the same either way. I spent hours trying to write or find a better LRU algorithm but no dice.
So back to that Wikipedia page I went, trying to find some other algorithm that could suite my needs. Then, my eye fell on the Random Replacement one, which sounded nice. Since it randomly removes cache entries when it needs to make space for new ones there’s pretty much no overhead, nothing to keep track of — plus the description has a nice side-note that For its simplicity, it has been used in ARM processors so I figured why not.
Fortunately, it performed beautifully, fast and simple and no overhead.
But, as it usually is, that wasn’t enough (for me)… I figured that I should try to push it a bit further and update the RR algo to be cost-inclusive.
That is, each entry would belong to a cost class, (low, medium and high) with the cache giving precedence to entries of the lowest classes when it needs to make room for new entries.
Unfortunately though, the operations in question weren’t costly enough so as to justify the gains one would enjoy from introducing that tiny cost-checking overhead in the cache; soooo… no dice.
And this is why I’m writing this post, if someone out there finds (him|her)self in a similar situation and needs a cache implementation (or 3) you can just use mine and skip all the hassle.
I may have waisted a few hours but I ended up with 3 cache implementations and at least one of them works so overall it wasn’t too bad.
Here’s the code of the cache classes: https://github.com/Arachni/arachni/tree/experimental/lib/arachni/cache
And their RSpec tests: https://github.com/Arachni/arachni/tree/experimental/spec/arachni/cache
Leave a comment