Sunday, June 28, 2009

Breaking the O(n^2) barrier... while drunk

(Complete solution in java to the Gattaca puzzle available here)

I had a pesky, seemingly insoluble programming task I've been dabbling with for about 3 weeks now in my downtime. Some fellow programmers have posted implications that they have created a correct solution, but until last night, I had been getting something wrong. I suspected my reading of the specs wasn't correct, my approach was skewed, or, God forbid, my choice of programming language was wrong. A little background:

Facebook has a "careers" link on its homepage, which I found one day when a grass-is-greener-on-the-other-side moment led me to poke around job app sites and see what was out there that may make me happier. [Predictably, nothing seemed more attractive or stable than AEP... at least, nothing that would let me stay in my current home]. On Facebook's careers page is also a link to "Puzzles". Curious, I investigated.

The Puzzles page has a list of programming languages (plus make and model, e.g., Gnu C++ version 5.2 instead of just C++) that their black-box puzzle evaluator supported, and a set of puzzles with input and output specifications for prospective programmers to try their hands at. Prominently displayed are a couple of pictures of smiling winners under the caption "solved and hired", implying that if I can write good solutions to all the puzzles, Facebook will fly me out to Palo Alto for a new, joyous life of coding for a social networking company.

After seeing that, I queried the missus "Hey, angel, want to move to Palo Alto?". "No." So that settled that, but some of the puzzles seemed interesting. One was a basic test to see if you can read a spec and write something that is no more complicated than a "Hello, World" program. Not being a tool, I skipped that and moved onto the next, which was reading an input file that had a number in it, counting from 1 to that number, and outputting one of three strings if the count was evenly divisible by 3, 5, or 15. A few minutes later I had some sample perl code to do that, submitted it, and later got an email back congratulating me on my working solution.

Then came the "Gattaca" puzzle. The bad news about submitting code for Facebook puzzles is that if your code doesn't pass their tests, they don't tell you why. It's all automated, and all the replies have the same text, to the tune of "sorry, that didn't work, try reading the specs or trying more test cases, and don't use any funny modules I might not have installed." Good advice, to be sure, but the one thing I've always had in all my coding jobs is good feedback. The error message. The log file. I've never been faced with mysterious, black box test cases, and no results other than a pass or fail. That's a real drag.

So I had been drowning in the shame of suddenly being sub-elite as my first four code submissions to the puzzle all failed, but contradictorily I was excited about all the cool optimizations I found in code rewrites. For example, an early version of the code took 6 minutes to handle a sample test of 100 genes in a 1000 character long DNA string, where my final version tackled the same set in a couple milliseconds.

The premise of the puzzle is that you have a string of DNA codes [ACGT], a list of the start and stop positions of possible genes, and a value, or "weight" for those genes. The genes may or may not overlap each other. The goal is to calculate a path through the entire DNA string that has the highest total genetic weight with no selected genes overlapping. The output is just the final total value. Here's an example, omitting the actual DNA string:
StartEndWeight
10155
122010
25303
35403
27375


In this case, the second gene overlaps the first, but none of the others touch them. So the algorithm must decide whether gene 1 or gene 2 is a better pick. Gene 2 has the highest value, so we go with that. Next, the third and fourth genes don't touch each other, but the fifth touches both of them. So if we use 5, we can't use 3 or 4. The values of the third and fourth genes total 6, whereas the fifth gene only has a value of 5, so we go with the third and fourth. Our optimal solution then, is genes 2, 3, and 4, with a total score of 16.

That simple solution can be eyeballed and you can say "well, yeah, dur dur dur, it's obviously 2, 3, and 4," and it's not too difficult to code that, but what happens when you run into nested genes that overlap from start to finish? Or what happens when you have a string 10,000 digits long with 800 genes to compare, and there's a time limit? Then you want to make your algorithm organized, efficient, and not waste any time doing the same compares more than once, and not end up with a giant binary tree that makes your time to solve be exponential to the input.

And in my case, since I've been doing this in the evening when Liberty is down sewing her latest creation, and I'm trying to wind down from the stressful day at work, how can I do all this when I have a couple Strongbows or Magic Hats, and possibly a shot or two of Port in me? It makes it harder, it really does. But I carried on with, or got carried away by solving the problem, occasionally falling asleep at the keyboard, whittling away at the problem 30 minutes here, 30 minutes there.

Each attempt had slightly better run-times, had more optimizations and cleaner code than the one before it, but my fourth solution attempt was where the dam started to break. It added the idea of first sorting the genes by start position, then recording the last gene the current gene overlapped instead of *every* gene it overlapped. This way, when totaling potential paths through the DNA string, when I got to a gene, I knew what the nearest gene was that could be added. Suddenly, I was able to analyze 10,000 genes in under a minute. What I hadn't known at the time, but would discover in reading the following week, was that I had just independently thought up the Sweep Line algorithm. But still, my times on test cases still didn't come anywhere near what some people in the forum reported.

So on and on I went, vowing not to submit another solution until it was the best damned thing that I could possibly write. I read through several wikipedia articles on sorting techniques and graph theory, I picked up a copy of the first volume of Knuth's "The Art of Computer Programming" (heavy -- I had heard it was heavy, but this was an order of magnitude more difficult to follow than I expected). I stared at spreadsheets organized this way, and that way. I ignored the problem and forgot my code until my brain was sufficiently randomized again, and attacked it anew. And then everything came together.

My final version attacks the problem by doing sort of a Postman's sort on the genes to sort by start position, a merge sort to quickly find each gene's highest overlap, and discards the idea of keeping track of all the potential paths. Keeping track of all that is a giant drain on speed and memory. Intead, I keep track of the highest score seen so far up to each gene's overlap point. When I get to the last gene, the last overlap score is the highest possible score. This one can tackle a test case of 1 million genes in about 40 seconds!

The result of all this was a simple email (below, with my final, pretty pathfinding algorithm) declaring my success, and blurb I can post to my Facebook wall, and a little smug satisfaction. It's good to flex the mental muscles, especially at times, like now, when my day job throws more red-tape problems than coding problems my way.

From Facebook stuff


...and some of the code that made it possible. I'd post the whole thing, but not only would none of my friends really care that much to see it, but I don't want to provide a method of cheating for anyone on the web searching for "gattaca facebook" on Google. This is the final chunk that totals up all the best scores from the pre-sorted genes analyzed for highest overlaps:
# Find best paths (performs lookaheads, so must be done separate from above)
for $gene (0 .. $zc) {
my $limit = $limits[$gene];
$limit = $zc unless defined $limit;
if ($limit > $high_limit) {
my $high_score = $scores[$high_limit];
$scores[$_] = $high_score for ($high_limit + 1 .. $limit);
$high_limit = $limit;
}
my $non_overlap_score = $scores[$gene - 1] + $weights[$gene];
if ($non_overlap_score > $scores[$limit]) {
for ($limit .. $high_limit) {
$scores[$_] = $non_overlap_score if ($non_overlap_score > $scores[$_]);
}
}
}

1 comment: