Sunday, September 16, 2012

Week 3 (9/10-9/16) - Collatz Wrap-Up

As the first project wrapped up, my biggest frustration was figuring out how to write valid acceptance tests when generating them requires the very logic that is under test. I settled on the reasonable [I think] assumption that if my solution was accepted by Sphere at any point, then the acceptance tests I generated from that particular version of the code would be valid. I wrote acceptance tests that exercise both large and small ranges, and that run long enough for me to see a real improvement when I implemented my metacache. I decided to save a lazy cache for later ("if I get to it") because I've done similar work with dynamic programming, while a metacache is something new.

Many unit tests later, I finally got my metacache to work, and it is very satisfying to see how blazingly fast the program becomes when you make that change. I turned the program in with a less-than-optimal metacache, which was accepted by V2 of the Sphere problem.

However, after turning in the project, I pushed the cache all the way to the maximum source file size and got only a little less than a 2x improvement over the submitted version, which is small compared to the almost 20x improvement between no cache and the first version of the cache.

CACHING



So I wanted to take a moment to reflect on what I learned about caching from this project. Rule of thumb: a good time to use caching is when all or part of a result will not change as you perform more computations. In other words, if you've got some computation with a lot of intermediate results, and all of those results are entirely determined by the inputs to the algorithm, it may be beneficial to save them. A cache takes up space, but it can save a lot of time, and since memory is so abundant these days, space is not something you ALWAYS need to worry about. Looking up a cached result takes constant time (using an array or hash table), and even if the cache only gives you part of the answer, that can save a lot of time.
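To make that concrete, here is a minimal sketch of the kind of computation whose results never change: the Collatz cycle length. The names here are my own illustration, not the code I submitted. Every call walks the whole chain from scratch, even though, for example, the chain for 10 is a suffix of the chain for 3.

    #include <cassert>

    // Number of terms in the Collatz sequence from n down to 1.
    // Deterministic: the result depends only on n, so it is a
    // natural candidate for caching.
    int cycle_length (long long n) {
        assert(n > 0);
        int c = 1;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            ++c;
        }
        return c;
    }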

LAZY CACHE


Since I've done a bit of dynamic programming, the idea of a lazy cache seemed pretty familiar. Basically, you store intermediate results as you perform other computations, so that you can reuse that information when you need a particular result again.
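A minimal sketch of that idea over the cycle_length function above, assuming a std::unordered_map as the cache (my choice for illustration). The map starts empty and fills in lazily as queries arrive:

    #include <unordered_map>

    int cycle_length_lazy (long long n) {
        if (n == 1)
            return 1;
        static std::unordered_map<long long, int> cache;
        auto it = cache.find(n);
        if (it != cache.end())
            return it->second;                 // constant-time cache hit
        long long next = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        int c = 1 + cycle_length_lazy(next);   // compute once...
        cache[n] = c;                          // ...remember for next time
        return c;
    }

Because each value in the chain is cached on the way back up, later queries that touch any part of an already-seen chain return immediately.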

Storing intermediate computations is similar to storing the result of a method call in a local variable when your code uses that result more than once. You as a programmer might know that the method will return the same thing each time you call it, but most of the time the compiler won't be able to tell that nothing has changed, and the program will actually call that method again. So it is best to save a local copy of the result and use it in the various places; reading a local variable is much less expensive than a function call.
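For example (with a made-up stand-in for the expensive call, just to show the shape of it):

    #include <cmath>
    #include <iostream>

    // Stand-in for some deterministic but costly call.
    double expensive (double x) {
        return std::sqrt(x) * std::log(x + 1);
    }

    int main () {
        const double r = expensive(42.0);    // called exactly once
        std::cout << r << std::endl;         // reused here...
        std::cout << r * 2 << std::endl;     // ...and here, with no second call
        return 0;
    }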

I admit that working with a lazy cache might have given me an opportunity to learn something about the types offered by the STL, but after seeing the dramatic improvement my program got from the metacache, I thought that optimizing the metacache would give even better results, more quickly.

EAGER CACHE


On the other hand, you could spend the time up front, calculating everything before the program does anything useful. This is best when you can afford a long startup time and your program will run for a long time afterwards. Otherwise, you should think carefully before paying that cost.
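A minimal sketch of an eager cache for cycle lengths, assuming all inputs stay below some bound LIMIT (a name and value I made up for illustration). Everything is computed before the first query, and later entries reuse earlier ones:

    #include <vector>

    const long long LIMIT = 1000000;

    std::vector<int> build_table () {
        std::vector<int> table(LIMIT + 1, 0);
        table[1] = 1;
        for (long long i = 2; i <= LIMIT; ++i) {
            long long n = i;
            int steps = 0;
            while (n >= i) {               // walk until we hit a known entry
                n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
                ++steps;
            }
            table[i] = steps + table[n];   // splice in the cached tail
        }
        return table;
    }

After build_table returns, every query in range is a single array index.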

METACACHE


Here's why a metacache is cool. In some applications, a lot of the values you might want to compute during a run of the program don't actually change between runs. Wouldn't it make sense to do some of this computation-intensive work before the program even runs? Why not store the results in the program itself, so that the only one who has to do the computations is the person writing the program?
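A sketch of the mechanics (not my actual submission; the names are made up): a tiny generator program, run once by the programmer, that prints a C++ table to be pasted into the real solution's source file.

    #include <iostream>

    // Same cycle_length as in the earlier sketch.
    int cycle_length (long long n) {
        int c = 1;
        while (n != 1) {
            n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
            ++c;
        }
        return c;
    }

    int main () {
        std::cout << "const int CYCLE_LENGTH[] = {0";   // index 0 unused
        for (long long n = 1; n <= 1000; ++n)
            std::cout << ", " << cycle_length(n);
        std::cout << "};" << std::endl;
        return 0;
    }

The person writing the program pays the cost once; every future run of the solution just reads the array.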

In early programming classes, they tell you that storing results in your code is a bad thing to do, because it makes your code rely on work done outside the runtime of the program (not kosher for many programming contests, or for understanding algorithmic material). Sure, you could make a lookup table to solve every problem, but then your computer isn't really doing the work for you, or you've moved the work to a different program, which seemingly defeats the purpose of writing the original program. However, in a problem like the Collatz problem, where the range of inputs is so vast, a complete lookup table isn't even an option. Storing some of the results doesn't take the work away from your algorithm, but it does reduce the time-intensive work left to do at runtime.

In this iteration of the Collatz problem (finding the maximum cycle length over a range), many of the results can be folded into bins, each representing a large number of inputs, so the cache isn't even as large as storing every result would be.
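A sketch of how a query might use such bins. The names are again my own: MAX_CACHE is assumed to be a generated array, produced offline like the generator above but storing the maximum cycle length over each block of BIN consecutive inputs. Fully covered bins are table lookups; only the ragged edges of the range are computed at runtime.

    #include <algorithm>

    const long long BIN = 1000;        // bin width used by the generator
    extern const int MAX_CACHE[];      // generated offline, pasted into the source
    int cycle_length (long long n);    // as in the earlier sketch

    int max_cycle_length (long long lo, long long hi) {
        if (lo > hi) std::swap(lo, hi);
        int best = 0;
        long long n = lo;
        // ragged left edge, computed directly up to the first bin boundary
        while (n <= hi && (n - 1) % BIN != 0)
            best = std::max(best, cycle_length(n++));
        // bins fully inside the range come straight from the metacache
        for (; n + BIN - 1 <= hi; n += BIN)
            best = std::max(best, MAX_CACHE[(n - 1) / BIN]);
        // ragged right edge
        while (n <= hi)
            best = std::max(best, cycle_length(n++));
        return best;
    }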

In my case, a first-draft metacache improved runtime by nearly 20x. A second pass at optimizing the cache brought that to more than a 30x improvement over the non-cached version. That's awesome. This is a great take-away from this project.
