Friday, September 04, 2009

Fun with Base64

"I may be a sorry case, but I don't write jokes in base 13." - Douglas Adams

So I coded this new perl script... what, I lost you already?

If you're still reading, so I coded this new perl script to solve the problem of the day. My group was notified that some of our processes are filling up some webMethods server logs with what looks like debug messages. When you're dealing with an environment the size of our's, with dozens of code migrations each month, you run into things like this from time to time. "Oh, I left the test password in the SQL script, crap!" and so forth. You fix it and move on, and occasionally offer up a sacrificial lamb if there is a business disruption. We calls these "lessons learned" meetings.

Anyway, in this case we were able to check the timing of the log messages with the record of what services were called at what times, and found a match. The webMethods service at fault was invoking a java service that had as one of its lines:
output = Service.doInvoke( "pub.flow", "debugLog", input );

In theory this type of thing should be commented out or removed prior to code being migrated to production, but in the webMethods IDE, you have to sort of go out of your way to see raw java code, selecting each java service in turn, and visually scanning. There is no function in the IDE to search java code.

The question I wanted an answer to was "how many other services call debugLog this way?" The lack of a search feature in the IDE made this unanswerable, so we went to the filesystem to search the source code directly. Alas, the java services were XML files whose java sources were MIME encoded elements. Before I lose you completely, a little background...


Binary email?

So there's this innocent sounding sentence in RFC 821 (the specs for SMTP, the protocol that most dominates email traffic) that has a broad reaching effect: "The mail data may contain any of the 128 ASCII characters."

The world at the time of that specs writing was very different from our world. Terminals were mainly text-based, connection speeds were slow, net-enabled machines were just starting to be connected more than sporadically. Joe Becker was still years away from envisioning Unicode. Email and Usenet posts of the day were plain text messages - if you wanted to send a file to someone, you had to put it on an FTP or Gopher server somewhere.

Leaving your pirated computer program or your porn stash on a public FTP server somewhere was less than ideal. People wanted to share them directly with each other, attaching them to emails and newsgroup posts, the former giving you a modicum of privacy, the latter anonymity. The obvious problem to overcome, then, was that programs and image files (and database files, and zip/tgz files) were all binary files. "Binary" meaning they required 8-bit bytes, and could not be expressed with "the 128 ASCII characters", or 7-bit bytes.

So then, how to express binary files using only 7-bits? Or to be more precise, to avoid problems with early transfer protocols using some characters as flow control (x-on, x-off for ASCII systems, enq, stx, stb, etb, etx, eot on mainframes), how to express binary files using only printable characters?

The answer people settled on was to pick 64 printable characters that could pass through email and Usenet systems safely, and use them as a lookup table for number between 0 and 63 (binary 000000 and 111111). Then you use the simple math equation 8 x 3 = 6 x 4. So three 8-bit bytes can be expressed as four 6-bit numbers. The numbers would then be looked up to see their printable character counterparts, and those four characters would be attached to the email/newgroup post.

Wrap everything with some standard headers and trailers to identify where files start and stop, what the filename is, any other relevant info, and think of a clever way to handle the last one or two characters of the file if it isn't an even multiple of three bytes long, and you've got yourself an encoding method.

The two most common encodings are UUEncoding, developed for UUCP Mail (Unix to Unix Copy Protocol), the precursor to SMTP, and Base64 encoding, that old-timers like me usually refer to erroneously as "MIME encoding". MIME refers to much more than just encoding binary to ASCII, but Base64 is the default encoding formula for MIME messages, so... MIME encoding.

The 64 characters picked for UUE start at ASCII 32 (space) and continue for the next 63 characters. Look at an ASCII chart, and you'll see that while these are all printable characters, there is a lot of punctuation in there, which caused some programs ancillary problems during the transmission or decoding. UUE also includes an initial "length byte", which again starts at space and continues up through M, representing a line length of between 0 and 45. Since you want to use as few lines as possible, most UUEncoded files are instantly recognizable as being of uniform line length, and all starting with M.

Base64 avoids UUE's problems of too much punctuation and the wasted space of a line-length byte by picking A-Z, then a-z, then 0-9, then + and / as the 64 characters, and specifying simply "ignore anything that isn't one of these characters", which guards against formatting/line-length/whitespace problems.

Here's a quick visual of trying to send three characters, "50¢", using both of these encodings:
ASCII character |       5       |       0       |       ¢       |
8-bit decimal   |      [53]     |      [48]     |     [162]     |
Binary stream   |0 0 1 1 0 1 0 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0|
6-bit decimal   |    [13]   |    [19]   |    [02]   |    [34]   |
UUE string      |     -     |     3     |     "     |     B     |
Base64  string  |     N     |     T     |     C     |     i     |

Base64 encoding has found its way into other uses over the years. PGP encryption uses a technique called "ASCII armor" to both compress and Base64 encode encrypted files, so that you can safely send the files between Mainframes, Unix, Mac, and Windows systems with narry a worry about dropping carriage returns or ASCII/EBCDIC conversions causing your data to be undecryptable. It is also useful in embedding data that may contain markup characters(<, >, ", etc.) in an XML file, which was the problem I was running up against.

webMethods

To explain what I was seeing in webMethods without jeopardizing my job, I wrote a simple Java "Hello World" service to illustrate. The output is very simple when invoked from a browser:



It simply creates a variable called outMessage, and assigns a value to it of "Hello world!". I can see the raw java in the webMethods IDE; in fact, that's where I entered it in the first place. But I can't search from there, I have to already know where to look. The raw source code on the file system does little to improve matters:
$ cat java.frag
<?xml version="1.0" encoding="UTF-8"?>

<Values version="2.0">
<value name="name">helloWorld</value>
<value name="encodeutf8">true</value>
<value name="body">
SURhdGFDdXJzb3IgbXlDdXJzb3IgPSBwaXBlbGluZS5nZXRDdXJzb3IoKTsKSURhdGFVdGlsLnB1
dCggbXlDdXJzb3IsICJvdXRNZXNzYWdlIiwgIkhlbGxvIHdvcmxkISIpOwpteUN1cnNvci5kZXN0
cm95KCk7Cg==
</value>
</Values>

The "body" element is where the raw java would be, but as you can see, it is MIME encoded, and hence useless to standard search programs such as grep. Enter perl and a brainstorm.

perl

I asked myself "Why not write a quick grep-like program that decodes the MIME portion of xml "body" elements?" I whipped up some code to take a list of filenames and a search term, and output what filenames contained the search term, plus the entire block the term was found in. Only, I didn't have a handy "decode Base64" algorithm. I found a perl package to do this on CPAN, but it had C source code, which makes installing if you're not root a little painful, so I searched my old haunt Perlmonks for an alternate solution. What I found was simple, beautiful, and pretty funny.

Perl can intrinsically decode a UUEncoded string with the "unpack" function. Base64 and UUE are basically the same, just using a different 64 characters, and plus the pesky "length byte". Someone had solved both problems elegantly in just a few lines (which I've numbered to make the explanation easier):
1 sub p64 {
2       my $line = shift;
3       $line =~ tr#A-Za-z0-9+/##cd;                   # remove non-base64 chars
4       $line =~ tr#A-Za-z0-9+/# -_#;                  # convert to uuencoded format
5       my $len = pack("c", 32 + 0.75*length($line));  # compute length byte
6       return unpack("u", $len . $line);              # uudecode and return
7 }

Line 2 takes what is passed to the routine and calls it $line. Line 3 is a thing of beauty and hard to explain if you don't speak perl, but I'll do my best.

The "tr" function translates one set of characters to another. The basic format is tr#old#new#flags. In this case "old" is the set of characters in base64 (A through Z, a through z, 0 through 9, + and /), "new" is nothing, and flags c and d are added. The "d" flag means delete, the "c" flag means take the complenent of "old". In other words, delete everything *other than* the characters in Base64.

Line 4 uses "tr" again, but in a much more straightforward way. We translate the Base64 characters into the range of characters starting at space, and ending at underscore. This corresponds to the range of characters used in UUE encoding.

Line 5 calculates the length byte. The length refers to the length of the source line, not the encoded line, and our line is already encoded, so we have to calculate what the source length would have been. It's 3/4ths of the Base64 encoded line length. We take that number and add it to 32 (where ASCII space is), and the corresponding ASCII character is our length byte.

Line 6 brings it on home, concatenating the length byte and the transformed-to-UUE line, and calls the UUDecode function, returning the result to the caller.

Problem solved

The end result of all this was a completed "grepmim.pl" script that could decode Base64 on the fly in XML files, and output what it found that matches what you're looking for, and where it found it, as shown:
$ grepmim.pl Hello curtis/java/*
java.frag
IDataCursor myCursor = pipeline.getCursor();
IDataUtil.put( myCursor, "outMessage", "Hello world!");
myCursor.destroy();

So I used this to search our entire webMethods code repository, finding 42 instances of inappropriate debug logging. Yowza!

The Quote

If any of you understood the reference of the quote at the beginning of this entry before you read down to here, I'll buy you a beer, and we can geek out about Beowulf clusters, Han shooting first, and whether Ruby documentation should be written in Latin or Esperanto.

The above quote is in reference to a sequence in "Life, the Universe, and Everything" where Arthur Dent is trying to find the ultimate question to the universe (to which the answer is 42) by randomly pulling stone scrabble tiles out of a bag. The question he spells out is "What do you get if you multiply six by nine?" As it happens, 613 x 913 = 4213. However, as Mr. Adams assures us, he was unaware of this.

No comments:

Post a Comment