Sunday, August 08, 2010

Solving the Facebook “Gattaca” puzzle

If you're coming here looking for an algorithm or complete code to solve Facebook's Gattaca puzzle, then first, you're a damned cheater and nobody likes you. Second, they'll never hire you anyway once they realize you couldn't do it on your own. Third, you've come to the right place!

Google assures me that there are publicly available solutions for this puzzle already, and this particular puzzle has been posted in the Facebook careers section for a while now, so I'm going to break with their recommendation to not distribute complete solutions.


I first solved this puzzle in June 2009 using perl, and posted about it in this entry last year. That blog entry shows my considerable mulling of the problem, and a vague outline of how I solved it. Below is a better description of the algorithm I came up with, and complete code, this time in Java, and about 7 times faster.

The problem statement

You are tasked with finding a new gene in a DNA string by performing Exon chaining. You have a listing of a DNA sequence, and a table of where exons appear in the sequence, including their start and end positions, and a "weight" value determining their likelihood of being part of the new gene.

Your goal is to find the most probable collection of exons that make up the gene. Do this by calculating the best total weight of exons in the sequence, where no two exons in your collection overlap. Print out only the total weight. Here is an example of a small DNA string, where exons appear in it, and their likelihood of being important to the gene:

            1         2         3         4
   123456789012345678901234567890123456789012345
   AGCTAGCTACGTACGATCGACGATCGATCGATGCATCATGCATCG
E0 |--------|                                    Weight: 11
E1              |----|                           Weight:  5
E2                  |-----------------|          Weight: 13
E3                          |----|               Weight:  5
E4                                   |----|      Weight:  5

First, nothing overlaps E0, so that automatically goes into the collection. E2 overlaps E1, 3, and 4, but none of them touch each other. Therefore, E2's weight can be added, or the total weight of E1, 3, and 4 can. A quick peek at the numbers shows which is better: 5+5+5 = 15, which is greater than 13. Final score: 11 + 15 = 26.

When the Facebook puzzle bot runs your program, it passes it a file describing the DNA sequence, including how many characters are in it, followed by the characters themselves (80 columns to a line), followed by the number of exons to consider, and finally "triplets" that describe the start and stop points of the exons, and their weights. The file representing my example above would look like this:

45
AGCTAGCTACGTACGATCGACGATCGATCGATGCATCATGCATCG
5
1    10   11
14   19    5
18   36   13
26   31    5
35   40    5

Beat O(n^2)

The above example is a no-brainer, of course, but the problem gets harder to eyeball in proportion to the number of data points. The solution required by the Facebook puzzle requires a better than brute-force attack of the problem. Basically they give you a large set of exons, and time how long your program takes to calculate the best score. If your program considers all possible non-overlapping paths in the sequence, it will fail. Your program will only pass if it uses a better mathematical approach to the problem.

The approach I took involves sorting all the exons by start position (my example above happens to already by sorted that way, but the data sets the puzzle bot uses are not), then finding how far up the list each exon overlaps its neighbors. At that point, I iterate through the sorted, overlap identified list, and use look-aheads to propagate the best score seen so far to the overlapped exons.

Here is some pseudo-code to illustrate how it works:

initialize tempscore, highoverlap to 0
for each element e in sorted list of exons
  // Phase a, populate scores to highest overlap position
  if e's overlap is greater than highoverlap,
    set score of all cells between highoverlap and e's overlap with e's score
    set highoverlap to e's overlap
  end if

  // Phase b, compare last score + e's weight with e's score
  add e's weight to tempscore
  if tempscore is greater than overlap's score,
    for each element i between e's overlap and highoverlap,
      set i's score to tempscore if tempscore is greater
    end for i
  end if
  set tempscore to e's score
end for e
print tempscore

It becomes clearer why this works if you run my example dataset through the algorithm. Let's step through each element in the example list, and see what happens to the scores. The exon being considered has an asterisk(*) beside it

Step 1:
Exon    Weight  Overlap Score
E0*     11      0       11

After the first exon, E0, is considered, not much happens, mainly because it doesn't overlap anything. In phase b, tempscore becomes 11, and being that E0 is its own overlap point, its score becomes 11.

Step 2a:
Exon    Weight  Overlap Score
E0      11      0       11
E1*     5       2       11
E2      13      4       11

E0's score of 11 gets propagated up to E1's overlap point, E2.

Step 2b:
Exon    Weight  Overlap Score
E0      11      0       11
E1*     5       2       11
E2      13      4       16

E0's score of 11 is added to E1's weight of 5, and becomes E2's score, since E2 is the overlap point, and its old score of 11 was less than 16.

Step 3a:
Exon    Weight  Overlap Score
E1      5       2       11
E2*     13      4       16
E3      5       3       16
E4      5       4       16

E2's score gets propagated up to it's overlap point, E4.

Step 3b:
Exon    Weight  Overlap Score
E1      5       2       11
E2*     13      4       16
E3      5       3       16
E4      5       4       24

This is where it starts to get interesting. E1's score, 11, gets added to E2's weight, 13, for a total of 24. This seemingly wrong answer gets set as E4's score, the last overlap point of E2. Fortunately, this leaves E3's score alone.

Step 4:
Exon    Weight  Overlap Score
E2      13      4       16
E3*     5       3       21
E4      5       4       24

In phase a, nothing happens, as E3 is its own overlap. In phase b, E3 continues on the correct 11+5+5+5 path, setting its own score to 21, becoming a sort of accumulator, which will contend in the last step with the wrong 11 + 13 path.

Step 5:
Exon    Weight  Overlap Score
E3      5       3       21
E4*     5       4       26

In the last step, phase a always does nothing, since there are no more overlaps to consider. The last exon is always its own overlap, so only one comparison happens: is the last score plus my weight greater than my score? 21 + 5 is 26, which is greater than 24, so yes.

In very large datasets, this algorithm functions quite nicely. I tested a dataset of a million random genes on a million character long DNA sequence, and came to the correct answer in roughly 7 seconds, 1.8 seconds of which were spent finding the best path. The exact breakdown was as follows:

Seconds for each step
Parse file     2.391
Sort file      1.609
Find overlaps  1.031
Find paths     1.844
--------------------
Total          6.875

The code: Enter Java

I've been steeped in Java lately... no pun intended. I've decided to sharpen my Java skills on recreational projects, my current cure for boredom when my wife is doing homework for her Women's Studies or Art History classes. What I've been working on lately is going back over my Facebook perl solutions and re-implementing them in Java, trying along the way to make them less arcane looking, starting with the Gattaca puzzle.

My Gattaca solution contains one small class, and one static method, and functions more like a procedural program than an object-oriented one.

class Exon

To begin with, I created a class for the data points, giving them all the elements that my perl solution tracked using crazy data constructs like hashes of arrays.

class Exon implements Comparable<Exon> {
  int start, end, weight, overlapsTo, score;

  public Exon(int start) {
    this.start = start;
    this.score = 0;
  }

  public int compareTo (Exon e) {
    if (this == e) return 0;
    if (this.start < e.start) return -1;
    if (this.start > e.start) return 1;
    return 0;
  }
}

The constructor I'm using is odd, and is not something I would put into production code. I'm working in the main static method with a StreamTokenizer, to avoid unnecessary (time consuming) casts to Strings or StringBuffers when reading the input file. The tokenizer gives me one number at a time, so I decided to make the constructor use one input, leave the other variables accessible, and fill them in as I get each new token. This method also allows creating a dummy Exon object with only a start position to make handy use of Collections.binarySearch().

The start, end, and weight variables are directly from the problem statement. The overlapsTo variable refers to the last element in a list sorted by starting position that the Exon overlaps; this will tell us how far forward to carry a highscore variable, as shown in the example above.

I've implemented the Comparable interface and added a compareTo() method, which allows me to use Collections.sort() from the main program, to sort the exon ArrayList by start position, as well as Collections.binarySearch() to speed up searching for each exon's overlapsTo.

In the above class, I'm deliberately not performing input checking or securing any variables. The bot that tests your code guarantees correct input, and I don't plan on using the object class for anything else, so securing variables, using setters and getters, invariants, state checkers, etc., isn't needed.

class gattaca

public class gattaca {
  public static void main(String args[]) throws IOException {

    // Set up file reader, read how many digits to skip, skip them
    BufferedReader in = new BufferedReader (new FileReader(args[0]));

    // Read number of characters to skip (plus line breaks every 80 chars)
    long skip = Long.parseLong(in.readLine());
    skip += Math.ceil(skip/80);
    in.skip(skip);

    // Set up stream tokenizer *after* skipping the reader forward
    // Unexpected behavior happens when mixing tokenizer and reader calls
    StreamTokenizer st = new StreamTokenizer(in);
    st.parseNumbers();

    // Read number of exon description lines
    List<Exon> exonList = new ArrayList<Exon>();
    while(st.ttype != StreamTokenizer.TT_NUMBER) st.nextToken();
    int count = (int)st.nval;

    // Read exon descriptions, add to ArrayList
    Exon e;
    for (int i = 0; i < count; i++) {
      st.nextToken();
      e = new Exon((int)st.nval);
      st.nextToken();
      e.end = (int)st.nval;
      st.nextToken();
      e.weight = (int)st.nval;
      exonList.add(e);
    }
    in.close();

    // Sort ArrayList by start position
    Collections.sort(exonList);

    // Find each exon's last overlap
    Exon finder = new Exon(0);
    int end, next;
    for (int n = 0; n < count; n++) {
      e = exonList.get(n);
      end = e.end;
      finder.start = end;
      next = Collections.binarySearch(exonList, finder);
      if (next < 0) e.overlapsTo = -(next + 2);
      else {
        while (next < count && end >= exonList.get(next).start) next++;
        e.overlapsTo = next - 1;
      }
    }

    // Find best paths
    int highScore = 0;
    int nonOverlapScore;
    int lastScore = 0;
    int highestOverlap = 0;
    Exon overlap;
    for (int i = 0; i < count; i++) {
      e = exonList.get(i);
      if (e.overlapsTo > highestOverlap) {
        highScore = exonList.get(highestOverlap).score;
        for (int n = highestOverlap + 1; n <= e.overlapsTo; n++) exonList.get(n).score = highScore;
        highestOverlap = e.overlapsTo;
      }
      nonOverlapScore = lastScore + e.weight;
      if (nonOverlapScore > exonList.get(e.overlapsTo).score ) {
        for (int n = e.overlapsTo; n<= highestOverlap; n++) {
          overlap = exonList.get(n);
          if (nonOverlapScore > overlap.score) overlap.score = nonOverlapScore;
        }
      }
      lastScore = e.score;
    }
    System.out.println(lastScore);
  }

}

In my first build of the Java code, there were two major differences from the above. First, I didn't use StreamTokenizer to parse the file, I used regular expressions, which paid a bit of a penalty. That code looked like this:

Pattern triple = Pattern.compile("(\\d+)\\s+(\\d+)\\s+(\\d+)");
    while ((line = in.readLine()) != null) {
      Matcher tm = triple.matcher(line);
      if (tm.find()) {
        geneList.add(new Gene(tm.group(1), tm.group(2), tm.group(3)));
      }
    }

I had a different constructor in the Exon class then, which used three strings, and did an Integer.parseInt() on them. When I was trying to speed this routine up, I toyed with using char readers and building up a String to parse as an int, assuming that it would be closer to the bits on the wire, and hence speed things up quite a bit, but it didn't. The char reader and regex approach both parsed the file in roughly 4 seconds. Converting to the StreamReader approach cut that time nearly in half.

The second difference was in the "Find each exon's last overlap" section. My initial approach was to check the exon's end position, and then iterate forward one exon at a time until I found a start position greater than that, then go back one. That code looked like this:

for (int n = 0; n < count; n++) {
      g = geneList.get(n);
      breakFlag = false;
      overlapPos = n + 1;
      while (!breakFlag && overlapPos < count) {
        overlap = geneList.get(overlapPos);
        if (g.end >= overlap.start) overlapPos++;
        else breakFlag = true;
      }
      g.lastOverlap = overlapPos - 1;
    }

(I had an intermediary rewrite which took out the "breakFlag" variable, which I didn't save. I feel that using flag like that makes my code look amateurish. Both versions ran at roughly same speed.) Searching for overlaps this way on my million-exon file took a whopping 14 seconds(!) on average. Switching that to the binarySearch() method was the single most effective speedup, cutting the time for that step down to just over 1 second.

The algorithm that Collections.binarySearch() uses is getting the midpoint object of the list, and seeing if its comparator (the result of the object class's compareTo() method) matches the comparator of the search key. If it matches, the position of that object is returned. Otherwise, the appropriate side of the list is divided in half again, and the search continues. If the method gets to the point where the list can't be divided in half anymore, but an exact match hasn't been found, then -(index + 1) is returned. My approach to finding an exon's overlapsTo variable must, therefore, account for both possible return types:

next = Collections.binarySearch(exonList, finder);
      if (next < 0) e.overlapsTo = -(next + 2);
      else {
        while (next < count && end >= exonList.get(next).start) next++;
        e.overlapsTo = next - 1;
      }

For the negative return type (no match found), a little basic Algebra shows us that -(-(index + 1) + 2) = index - 1, or the index of the last exon that overlapped. For the case of the positive return, I'm failing back to my original method of iterating over the list, starting at the returned index, in case there is more than one exon with that start position.

The 14-fold speedup was surprising, and another example of the kind of payoff you can get from a little algorithm research, which is the overall theme of the Facebook career puzzles: Writing a solution that only gives the correct answer isn't good enough; you must also write one that consumes as few resources as possible. When Facebook deploys code that isn't as effective as possible, they have to buy more servers, and maintain them, and give them power, and lifecycle them.

If you're a software developer, the same is true for your company. Writing effective code has a direct impact on your company's cash position. I think about this sort of thing a lot, as part of my current job is refactoring code running on older servers. We migrate to lifecycle servers and retire the old hardware, and if I can make each application's footprint smaller, I don't overburden the new server, and need to go begging for more RAM and CPU a lot less. Which lets me say during performance reviews "the new code runs faster" and "now we need one less server in our virtual cluster", which translates to "you're actually in the black after hiring me because of my work; how's about throwing some of that savings back my way?"

Alright, enough about the tribulations of the application developer's world. So now I've given you most of what is needed to pass this Facebook puzzle, lacking only what method to incorporate the two classes together and get the puzzle bot to compile and execute everything correctly. I encourage you to learn from my code and my story, but to write your own program, which is always more fun than plagiarism.

6 comments:

  1. Hello Curtis,

    I am trying to solve the Gattaca puzzle using a similar algorithm as yours(I read the sorting technique and assumed that its the same algorithm. Haven't read the entire post though).
    I have tried submitting my solution 5-6 times, but its always failing with some bug.
    Do you have some example input files to share?
    I am not sure what scenario I am missing here.

    ReplyDelete
  2. I'll do you one better. David Eisenstat (great guy, helped me a lot when I was getting starting interfacing with the puzzle bot) wrote a java test case generator, available here.

    In that thread there are also sample command lines used to generate specific datasets, and what their correct solutions are. The sub-second time his program got finding a solution for the million-gene case I still find phenomenal.

    ReplyDelete
  3. Hi Curtis, I found this page while trying to figure out how people were solving the Gattaca puzzle in so little time. I recently got a good score from the puzzlebot for the Liarliar puzzle (411ms), but this one had me stumped. Your algorithm ('find best paths') is interesting, and surprisingly simple, I'm not sure I could have come up with it myself.

    I have a couple of observations; first, the pseudocode is slightly off compared to what you implemented:

    "set score of all cells between highoverlap and e's overlap with e's score"

    versus:

    highScore = exonList.get(highestOverlap).score;
    for (int n = highestOverlap + 1; n <= e.overlapsTo; n++) exonList.get(n).score = highScore;

    i.e. the intervening exons have their score set to the old highoverlap's score in the implementation, as opposed to e's score. The implementation is right, of course. Just FYI in case you want to fix the pseudocode.

    Second observation: you can avoid both sorting and (binary) searching. I have a C++-based solution that does neither (using your 'find best paths' algorithm). It gives the right answer for that million-gene test case of David's (968886) in around 1.7 seconds. I'm still trying to speed it up some more before I submit it to the puzzlebot. Granted C++ tends to run faster than Java, but your time could be closer to 4 seconds for that million-gene test case if you can manage to avoid sorting and searching.

    ReplyDelete
  4. Good catch, thanks. I'll edit the pseudocode to match... eventually. As for running my algorithm without sorting, wouldn't you end up with a brute-force solution? How are you iterating over the genes?

    ReplyDelete
  5. I'll clarify about not sorting. The exons *are* sorted by the time the 'find overlaps' phase is run, but I don't pay an O(n log n) sorting cost to achieve that, instead I pay just O(max(n, g)), i.e. linear in the number of base pairs or exons, whichever is larger. I believe it's called 'bucket sort', a classic approach for sorting things with discrete keys that fall into a well-defined range, as the start positions of the exons do.

    How I managed to not search is a classic space-versus-time tradeoff, by having an array of lists of exons that end at a given position, populating that as the results of the bucket sort are flattened, and using it to know whose overlaps field to set to what as you advance from one starting-position bucket to the next.

    ReplyDelete
  6. After reducing my runtime on the million-gene test case from 1.7s to 1.14s, I submitted it to the puzzlebot. I just got the answer back: 140.306ms on the longest test case.

    Moving on to 'smallworld' . . . Happy Holidays!

    ReplyDelete