Saturday, August 01, 2009

kNN, Finding the k nearest neighbors of points on a Cartesian plane

So I'm obsessed with yet another Facebook engineering puzzle. After solving Gattaca before my last blog post, I worked on User Bin Crash (where you have to calculate the best selection of items to toss out of an airplane to keep it from crashing), which I solved a couple weeks ago with little fanfare. The current puzzle is "Small World", whose problem statement begins thusly:

As a popular engineer, you know many people in your home city. While traveling around town, visiting your friends, you realize it would be really handy to have a program that tells you which of your friends are closest based upon which friend you are currently visiting.


The problem translates to a "nearest neighbor" problem variant known as "kNN", or k Nearest Neighbors. There is a host of wikipedia and scholarpedia articles related to this, however I never found a source with a good plain-English problem description that didn't also claim one search method or other as "The kNN algorithm", rather than acknowledging more than one good approach. Here's a bad approach:

The naive approach

For each point, find the distance to each other point, remembering the best 3 distances along the way. I was able to whip up something without much trouble (in perl, naturally) that does just that. Discounting the routine that parses the file into variables, and the one that summarizes all the results, the rest of the code looks like this:

sub add_to {
# See if point $f can displace any previously found nearest neighbors of $n.
my ($n, $f, $dist) = @_;
return if $n == $f;
my $best_p_n = $best_p[$n];  # Creating an array reference to cut down on the amount
my $best_d_n = $best_d[$n];  # of punctuation I need to search an array of arrays
for (0 .. $k) {                       # For each element in n's "best so far" list
if ($dist < $$best_d_n[$_]) {       # if f is closer,
splice (@{$best_d_n},$_,0,$dist); # put f just before current element,
splice (@{$best_p_n},$_,0,$f);
pop(@{$best_d_n});                # then get rid of the last element.
pop(@{$best_p_n});                
last;                             # and stop! (Because it would be closer than the next element, too)
}
}
}
for (my $n = 0; $n <= $points; $n++) {
for (my $f = $n + 1; $f <= $points; $f++) {
my $dist = sqrt( ($x[$n] - $x[$f]) ** 2 + ($y[$n] - $y[$f]) ** 2 );
add_to($n, $f, $dist);
add_to($f, $n, $dist);
}
}

The code snippet above performs $points2 compares, and half that many distance calculations. While the code may be optimized for speed by using simple data structures and not much fluff, it's still a lousy attack for the problem, as the times for my test cases show.

Time to process a file containing 100 points: .06 seconds
1,000 points: 5.46 seconds
10,000 points: 9 minutes, 17.94 seconds

According to some of the comments from users who are attempting this puzzle, a decent time for 100,000 points is 2 minutes, and all the heavy-hitters in the forums are hitting that in under 15 seconds. Clearly the naive approach is nowhere near the required speed to pass this puzzle, but I can still use it on small files as a control to make sure my "real" program gets the right answers.

An experiment in tweaking

If I haven't chased you away already, here's a little math:

If A > B, then A2 > B2. Substituting x and y for A2 and B2, we get...
If √x > √y, then x > y.

If the square root of something is greater than the square root of another thing, you can tell which is bigger without doing the square root. How does that help me? I'm using the Pythagorean theorem to calculate distances between two points by drawing a right triangle between them and calculating the hypotenuse. Perhaps you remember it from school? A2 + B2 = C2? I'm getting the real distance C after taking it's square root. In reality, I don't care what the real distance is, just how the distances compare to each other.

Under the assumption that it takes time to do a square root calculation, I took it out of my program to see what would happen. It still output the same answers, thankfully, and the code ran a little faster, but not much.

After removing the square root calculation:
Time to process a file containing 100 points: still .06 seconds
1,000 points: 5.37 seconds
10,000 points: 9 minutes, 2.78 seconds - 15 seconds faster, less than a 3% improvement.

Not a remarkable improvement, but it still shows that a small tweak to even one line of code can have a measurable impact. It's also interesting to know that my computer can perform 50 million square roots in 15 seconds. With that kind of power, why do computers still run sort of slow? All the tweaking that isn't done, in favor of 1960s style planned obsolescence. Get the new version out! It's hot! It's on fffff-eye-your! So what if it's slower than dirt and buggy, just wait 'til you see version 3 next year!

A brief rant

Normally I'd conclude that with a "...but I digress...", except I'm not digressing, this is exactly what my point is. I am thrilled that there exist companies that filter their programmers this way. "Show me you can tweak some shit before you expect us to even give your obviously BS, self-important, hubris-laden résumé a passing glance... bitch." And its to Facebook's advantage to do so, since they have so many eyes looking at them all the time. Shaving off a couple milliseconds on an app times millions of hits per hour can translate into them needing to buy less hardware, consume less electricity, require less bandwidth, crash less frequently. For all the bemoaning I and my peers have done about Facebook and its affect on lifestyles, yay them for taking this stuff seriously.

I imagine Google is the same way, I mean, look at Chrome: fast, lightweight, and they finally have a Linux dev version that quickly replaced Firefox for me, helping me continue the dream of running my computer with only 256 megs of ram far past its retirement age. And online, nobody uses "search engines" any more. They don't "go to Yahoo and search for...", kids don't even remember Altavista, and Lycos? Didn't they have that commercial with the dog? Who were they? No, everyone's on Google, the noun. They "google" things, the verb. In the 10 years I've been using Google, I've seen only a handful of instances where they went down, lost mail, threw an error on a search (requiring a single page refresh to fix), or performed slowly.

Not a bad track record for 10 years of ever-increasing use. Yes, their coders too have some game. Trusting fool that I am, I loaded them up with everything I could: spreadsheets with my budget plan, pictures of my family, all the emails of my wife and I courting, addresses of people I know, a few years of blogging, my daily weights for the last 3 years, medical files... Not because I'm a fanboy, but because their mail, maps, docs, blogger, and picasa services are all fantastic. And free, and fast, and have great uptimes. What's not to love? Good code just makes the world better.

Where was I?...

So back to my attempt at making good code, I realized well in advance that the naive approach was useful for nothing other than a control, so I spent about a week searching off and on for algorithms to attack kNN searches. The two items I kept bumping into were Cover Trees (which no one seems capable of describing very well outside of Set Theory terms and monstrously huge, undocumented C++ code), and kd-trees.

The kd-tree approach was similar to my own made up "bunch of boxes" and "closest corners" approaches, but it was more mathy than I could process with the amount of port I was consuming that night, so I bookmarked the Wikipedia article for future reference, and went about deciding which of my two approaches sounded more reasonable. Both of my approaches do pre-processing on the set of points to try to break them up into groups of rectangles. Points are most likely to have nearest neighbors inside their own box, and then after that you can exclude other boxes from being searched in if they are too far away.

I went with "closest corners", which I won't bother to describe in detail. It was faster than the naive approach, but it still ran too slowly for me to submit it to the puzzle bot. There was nothing for it, then, but to take a better (sober) look at kd-trees.

As it turns out, they're easy to grok, and interesting, despite how the Wikipedia article describes them.

Do tell

You sort all the points based on one axis, grab the median point and make it the root "node" of a binary tree. Everything before the median then goes down the left branch, everything greater to the right. Then you take all the points from a branch and sort them by the other axis, the one in the middle becomes the "node", and you have two more branches with half the remaining points. And so on until there is only one point at each node.

If you plotted the points on graph paper and drew line through the nodes, it would like something like this image, stolen from the article:



The end result is that you can guarantee as you travel down the tree that everything will be on one side of the plane or the other. If your nearest neighbors are closer than the distance to the line drawn by the node you're looking at, then you can stop searching down that side of the tree, culling out major chunks of work.

So after a few days of pounding, writing test cases, annoying the wife with constant mumbling and staring at hand-drawn charts, I finally came up with a decent solution. This outputs the same answers as the naive approach, in a fraction of the time. To whit:

Time to process a file containing 100 points: .05 seconds
1,000 points: .82 seconds
10,000 points: 12.06 seconds
100,000 points: 3 minutes, 11.16 seconds(The first version to return any answers at this many points before I killed the process because it was taking too long)

Not quite under the 2 minute mark, and the puzzle bot rejected it as predicted. So, I'll keep tweaking, or maybe break down and use a compiled language. I write a lot of java at work, and I can talk enough C# to fix broken .net apps, it's just, perl looks better to me.

And at the risk of encouraging plagiarism, here's the meat of my kd-tree implementation, containing a small variation on the "official" search method, and enough violations of perl purists' sense of best-practices to send them into a tizzy:
sub build_kd_tree {
my ($node,$y) = @_;
my $node_len = scalar @{$tree[$node]};

# $y is true when we sort on Y axis, false when sorting by X
my @sorted = $y ? sort {$y[$a] <=> $y[$b]} @{$tree[$node]}
: sort {$x[$a] <=> $x[$b]} @{$tree[$node]};
my $median = int ($node_len / 2 );
my $left_child  = 2 * $node + 1;
my $right_child = $left_child + 1;
for (my $n = 0; $n < $node_len; $n++) {
push @{$tree[$left_child]}, $sorted[$n] if $n < $median;
push @{$tree[$right_child]},$sorted[$n] if $n > $median;
}
$tree[$node] = $sorted[$median];  # Casts this node from array to scalar
undef @sorted; # So that we never keep more then twice the number of points in memory
$y = !$y;  # Flip $y from false to true, or vice-versa
build_kd_tree ($left_child, $y) if $tree[$left_child];  # Work next branch only if
build_kd_tree ($right_child,$y) if $tree[$right_child]; # there are points left
}

sub find_k_closest {
my ($base, $node, $axis) = @_;
my $np  = $tree[$node];
return if !defined $np;

my $bp  = $tree[$base];
if ($bp != $np) {
my $dist = ( ($x[$bp] - $x[$np]) ** 2 + ($y[$bp] - $y[$np]) ** 2 );
add_to($bp,$np,$dist);
}
my $left       = 2 * $node + 1;
my $right      = $left + 1;
my $right_flag = 1;

my $pos = $axis ? $y[$bp] - $y[$np] : $x[$bp] - $x[$np];
($left,$right) = ($right,$left) if $pos > 0;
my $intersect_dist = $pos ** 2;
$right_flag = 0 if ($intersect_dist > $best_d[$bp][$k]);

$axis = !$axis;
find_k_closest ($base, $left,$axis) if $base != $left;
find_k_closest ($base,$right,$axis) if $right_flag && $base != $right;
}

$tree[0] = [0 .. $points];
build_kd_tree (0,0);
my $tree_len = scalar(@tree);
find_k_closest(0,0,0);
my $l2 = log(2);
for (my $n = 1; $n < $tree_len; $n++) {
next unless exists $tree[$n];
my $f=int(log($n+1)/$l2)%2; # Go ahead, ask me what this does :)
find_k_closest($n,$n,$f);
find_k_closest($n,0,0);
}

Why the hit on perl purists? They tend to have a certain air about them that people in Mensa do (in fact, I compared the two once here). Over the years, I've received good feedback and coding advice from a popular perl hangout, perlmonks.org, but as I reached a level of competency I was comfortable with, I began to see more petty comments and advice littering the forum. An example is a script I posted just yesterday and the quick response to it from a man who seems to have issues with how one goes about opening files. I leave it to you to judge the merits of the code and the appropriateness of the unsolicited feedback and the budding debate that follows it: X9.37 parser.

And lastly, after recovering from last night's Whiskey Daredevils show, I'm off to attack hunger pangs, a dirty house, entertain my daughter and my mentee, and then to begin the arduous task of rewriting the damned kNN solver... bah!

4 comments:

  1. I'm confused by this:

    my @sorted = $y ? sort {$y[$a] <=> $y[$b]} @{$tree[$node]}
    : sort {$x[$a] <=> $x[$b]} @{$tree[$node]};

    and

    What are $y[$a], $x[$a] in the sort algorithm? I see that $y ia a parameter to your tree building function. But where is $x defined. I don't see if declare anywhere.

    ReplyDelete
  2. it's me again.. are you using the $y in two different contexts? I understand that it is a true/false value for sorting the array by x or y coordinates for the points. What I don't get is where and how you're storing the x and y coordinates for each point.

    ReplyDelete
  3. Emily,

    I didn't post the entire solution to the problem here, because I didn't want anyone to be able to win the Facebook puzzle (smallworld) by googling an answer instead of building one.

    There is a sub earlier in the script that parses out the input file, throwing the x coordinates into array @x, and the y to @y. It goes something like this:

    my @x = ();
    my @y = ();

    sub read_input {
    while(<>) {
    #integer space real space real
    if (/(\d+)\s+([\d.-]+)\s+([\d.-]+)/) {
    $points++;
    $label[$points] = $1;
    $x[$points] = $2;
    $y[$points] = $3;
    }
    }

    ReplyDelete
  4. Hey. I figured out that part of the code was missing after I read through your whole post. I didn't realize it was part of a Facebook coding contest right away.

    ReplyDelete