Tuesday, May 14, 2013

Ignoring the tyranny of official widgets

...and a small explanation of why that's important to me.

In 2009 I became immersed in the world of check image files, often referred to as "Check 21" files, in reference to the Check Clearing for the 21st Century Act, which gives a scanned image of a check legal credibility as a financial instrument. In our modern world of online bill pay, web 2.0, and pizza boys with credit card scanners that plug into their mobile phones, paying utility bills by mailing a check is slowly becoming anachronistic... or so you might think.

In reality, over 300,000 physical checks per month get mailed to our Canton remittance center, many from businesses with multi-thousand dollar electric bills - businesses large enough to employ an IT department to do ACH transactions, wire payments, what have you. But the physical checks keep being printed and mailed, for much the same reason that companies that have ever used a mainframe continue to use them despite the cool kids shouting "dinosaur!" or "RESTful!" or "node.js!!" at them: It's well understood and works reliably.

Our Canton center has hardware to create scanned images of these checks and create from them an X9.37 file, the older of the two ANSI Check 21 standards, which contains TIFF image data surrounded by text records in the EBCDIC character set. On a slow day, these files weigh in at 250 megs, and I have seen them as large as 750. For years, my code has delivered this file daily to the bank using FTPS, of all things, where each send contains a little bit of holding your breath. Because of the particulars of the FTPS protocol, it's possible for a "control" connection to time out while a data transfer is still going on. With files this size, if there is a network slowdown anywhere, a timeout can happen. Then we wait. If we get the return acknowledgement file from the bank with control totals in it, all is well. If not, it's a phone call to try to identity if the file is still processing or never made it.

Somewhere in this file is a check from a little old lady who recently got a cutoff notice for being 3 months late on her bill (I actually have no idea how our cutoff policy works, but if we cut off a little old lady's electricity for being less than 2 months behind, then we're a bunch of assholes, so I assume it's 3 or more, hopefully with trying to get them into a local assistance program like Virginia and Michigan use during the winter months). Anyway, if her check is processed, then she gets to have a working refrigerator the next month. And if something goes wrong with my process that I can't fix, then she's in real trouble.

I know of no actual little old lady on the brink like this, but the point is there are real people paying for our power who can be adversely affected if I can't get their money to the bank. The economy being what it is, likely some of these people are late on their bills and just scraping by. And I'm going to do every god damned thing I can to help them. And over the years, I've done exactly that, and both the bank IT folks and the users in Canton have grown to rely on my troubleshooting, where whatever management-imposed official "incident ticket" system is currently in favor is unceremoniously ignored in favor of just calling me directly. I encourgage this behavior in my customers, and it has adversely affected my work in the form of project delays for new work, or me putting off administrative work like timesheet entry, mandatory HR training (how not to let people tailgate you into the building, how not to fall down the stairs, etc.), and I suspect my matter-of-fact bluntness about operating this way cost me a promotion last year. Join me for a brew sometime if you want to hear me complain about this, otherwise... on with the story.

When this solution was first deployed, I worked closely with our network group, the bank's techs, had them capture network traffic and send me sniffer logs, had them pool over firewall rules looking for ways to shore up the connection, check for anywhere a timeout setting could be adjusted, dug for a way in my application infrastructure (webMethods Integration Server) to play with low-level network settings. And for the most part everything works famously now. But about twice a year something catastrophic happens. The bank's ISP wires something backwards, or the bank upgrades their software, or one of our routers overheats, or there's a cross-connect to production during a DR test. In a big IT system with error-prone humans poking at it, things go bad sometimes; it's inevitable.

The thing I pushed for the hardest kept receiving the most resistance: switching the transfer protocol to an HTTPS post. The bank could only do that via the AS/2 protocol, which would require some additional coding and testing, as our company had never used AS/2 before. I was all for it, loving new things and not being afraid to break new ground, but management, as is the nature of management, was conservative and change-averse, not wanting to dedicate funds to it, especially since [the mortgage meltdown|lower cost of shale gas|The Reorg] was going on.

Eventually, though, we got approval and funding. Unfortunately the timeline for the project would have had me finishing right before this year's "Lean Tranformation" would be affecting enterprise IT, and I didn't know if I'd have a job once the layoffs started. My worst-case scenario would be to implement something that I would not be around to monitor and patch, and go to bed everynight with the thought of the little old lady and her refrigerator hanging over my head like the sword of Damocles. Was my code healthy enough? Will the documentation I left behind be enough to help whoever supports this in the future?

So I made other work a priority and pushed the project schedule well into the other side of the upcoming reorg. As it turned out, my work kept me around yet again (this is reorg 4 since I was hired a decade ago), so I began work in earnest recently, learning the basics of how the webMethods products "Integration Server" and "Trading Networks" (TN) interact to perform AS/2. The long and short of it is they expect many small messages, and are very loose about how data structures, and even the raw data itself, is copied around in the file system and TN database. Nothing in the specs explicitly prevented sending very large files this way, but it turned out that this would cause a few problems, requiring me to ignore the tyranny of the official widget, and roll my own AS/2 delivery service that was more streamlined.

AS/2 via webMethods the vendor-defined way is pretty straight-forward: define some partner information in the Trading Networks space including SSL keys, server address and credentials, and IDs for the AS/2 envelope, and post a data stream to the service wm.EDIINT:send, which takes care of all the MIME and HTTPS legwork.

Early testing with this method worked famously; our end-users were able to push files to the service, and the bank received them in the proper format. Except, that is, for the largest test file, a 400 meg behemoth in the normal world, and average sized production file in my world. And so I dove down a bizarre troubleshooting rabbit hole trying to untangle what went wrong, and how to fix it.

The file in question failed with this error message:

Unrecognized input object - input     [EDIINT.000001.000005]

This was essentially TN saying "where's that object I'm supposed to stream to the bank? All I see is NULL". Our smaller files all had the following three objects in the TN transactions log:

...where the larger file was missing the largest obect, and the remaining variables looked different:

The mention of "tspace" in the bad transaction was interesting. This is where webMethods creates temporary files representing generic "stream" objects. Integration Server is essentially a java application server, working in much the same way as WebLogic, except that it makes an overt attempt to hide low-level details of the underlying java objects; buffered input streams, file readers, XML node iterators, and the like are all just generic "objects" in the development tool, where inputs can be dragged to built-in services in a visio-like design tool, and said services go through the trouble of figuring out what type of data they've been handed, and how to finagle it into what they really need.

This is pretty nice if you use amateur developers to code fungible widgets, and only expect the services to receive small messages. But if you have millions of dollars in a daily file that is always guaranteed to be larger than your JVM memory allocation, then it's not so nice.

So with tspace being my first lead, I decided to watch what happened in the tspace mount while the service was running. I wrote a quick perl one-liner to output the current time, a list of temp files, and free space on the mount, repeating every second. The last thing I saw before the error was this:

Wed May  8 11:47:34 2013
-rw-r-----    1 admin    admin     431082539 May  8 11:47 18DocRes.dat
-rw-r-----    1 admin    admin     431082539 May  8 11:47 19DocRes.dat
-rw-r-----    1 admin    admin     124847371 May  8 11:47 20DocRes.dat
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt    4194304   1966468   54%       10     1% /opt/tspace

First, crap! It was duplicating the 400 meg file not once, but twice! I duplicated this on a smaller file and saw that the file was indeed being triplicated, with the third file being slightly larger, as it contained the AS/2 envelope and signature. If this had run to completion with the large file, it would have used over a gig of temp space. As it was, the error happened when we were pretty close to exactly a gig. Hurm... The next refresh of the one-liner gave this:

Wed May  8 11:47:35 2013
-rw-r-----    1 admin    admin     431082539 May  8 11:47 18DocRes.dat
-rw-r-----    1 admin    admin     431082539 May  8 11:47 19DocRes.dat
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt    4194304   2088436   51%        9     1% /opt/tspace

The third file disappeared, leaving the other two waiting for the service to timeout before being removed. I was curious about exactly how much space was being used before the error, so later in the day I ran the service again, and as the third file was being generated, a took out the one second refresh and just had it refresh constantly until the file disappeard. This was the last refresh before the third file disappeared:

-rw-r-----    1 admin    admin     137757736 May  8 15:29 10DocRes.dat
-rw-r-----    1 admin    admin     431082539 May  8 15:28 8DocRes.dat
-rw-r-----    1 admin    admin     431082539 May  8 15:28 9DocRes.dat

Those file sizes total 999,922,814 bytes. Looking closer, every refresh had roughly a 100k filesize difference, and we were within 100k of 1 billion bytes. As it turned out, there was a setting in the integration server config declaring exactly this:

$ pwd
/IntegrationServer/config
$ grep tspace.[a-z]*= server.cnf
watt.server.tspace.location=/opt/tspace
watt.server.tspace.max=1000000000
$

So that explained what happened, if not why the hell the design would be to silently delete the most current file in tspace once the max was exceeded, without even throwing an error to the service writing to the file. On top of that, the tspace mount looked like this after all the temp files got removed:

$ cd /opt/tspace
$ ls -l
total 0
drwxr-xr-x    2 root     system          256 Dec 16 2011  lost+found
$ df -k .
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt    4194304   3772356   11%        5     1% /opt/tspace
$

Whoa! No files, but 400 megs are sapped from the filesystem. The file being creamed must have an adverse effect on garbage collection, preventing it from freeing the space from one of the file handles. Bouncing the server and grabbing a thread dump confirmed my suspicion:

"util.mime.getMimeInputStream" Id=105756 in TIMED_WAITING (running in native)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x45b345b3> (a java.io.PipedInputStream)

Many of the webMethods built-in services are Java only, with the source not being included in the packages. If they don't work correctly, you're hosed unless you can write the service yourself. The AS/2 enveloping and file transfer services are not examples of this, fortunately; instead, they are coded with "Flow" code - the drag and drop visual tool I described above. This means I can poke around in the code to see what makes it tick with relative ease.

After examining the code for EDIINT:send, the source of the problem was apparent. First, "send" calls "createMime":

createMime, in turn, calls "streamToBytesOrReservation". This is one of the services I alluded to which takes a random object and turns it into something the service can use. In this case, it wants to take any sort of reader and turn it into either a byte array or a temp file (reservation, the "Res" in "DocRes" files above refers to reservation). In our case, however, the source object was already a reservation, so this was unnecessary.

Further down in "send", a mapping step has a transformer that again invokes "streamToBytesOrReservation", dropping the output into the "ediintdata" field that shows up in the "Transaction Details" TN screen from above... if everything worked right, that is.

So now I knew what needed to be changed. My next step was to find the AS/2 syntax used in the actual HTTPS post, so that I could duplicate it without invoking the built-in AS/2 services. I ran a small dummy file through the old code and captured the final MIME envelope. I then used the most fantastic, flexible, and overlooked Unix tool of all time, commandline openssl, to emulate a session and see the server reponse:

$ openssl s_client -connect [server]:[port] -ssl3
CONNECTED(00000003)
depth=1 /C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)10/CN=VeriSign Class 3 Secure Server CA - G3
verify error:num=20:unable to get local issuer certificate
verify return:0
---
Certificate chain
 0 s:/C=US/ST=...[rest of certificate chain]
---
No client certificate CA names sent
---
SSL handshake has read 3585 bytes and written 270 bytes
---
New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : SSLv3
    Cipher    : DHE-RSA-AES256-SHA
    Session-ID: 518D0D911844B53DAED1DD24333FA90961FABF21876FA49BA07F741293A18A9B
    Session-ID-ctx:
    Master-Key: 8728B62C10CF721C0FCB82CB4A15F9A55151F74D896B18104686F5BA120884E2493753359C76E19926EBF394F641F2FF
    Key-Arg   : None
    Start Time: 1368198681
    Timeout   : 7200 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
---
POST /as2 HTTP/1.1
Host: [server]
User-Agent: CERN-LineMode/2.15 libwww/2.17b3
Authorization: Basic [encoded user:pass]
AS2-From: [AEP identifier]
AS2-To: [Bank identifier]
AS2-Version: 1.1
Message-ID: <1659527914.01368194059987.JavaMail.admin@[local server]>
Content-Type: multipart/signed;    boundary="----=_Part_6_1615028291.1368194059968";    protocol="application/pkcs7-signature"; micalg=sha1
Content-Length: 1137

------=_Part_6_1615028291.1368194059968
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: binary

line 1
line 2

------=_Part_6_1615028291.1368194059968
Content-Type: application/pkcs7-signature; name=smime.p7s
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename=smime.p7s

MIAGCSqGSIb3DQEHAqCAMIACAQExCzAJBgUrDgMCGgUAMIAGCSqGSIb3DQEHAQAAMYICAjCCAf4C
AQEwgacwgZkxCzAJBgNVBAYTAlVTMQ0wCwYDVQQIEwRPaGlvMREwDwYDVQQHEwhDb2x1bWJ1czEg
MB4GA1UEChMXQW1lcmljYW4gRWxlY3RyaWMgUG93ZXIxDDAKBgNVBAsTA0IyQjEcMBoGA1UEAxMT
cm9laWFzZDMxLmFlcHNjLmNvbTEaMBgGCSqGSIb3DQEJARYLZWRpQGFlcC5jb20CCQCwrdolXfCc
8zAJBgUrDgMCGgUAoIGxMBgGCSqGSIb3DQEJAzELBgkqhkiG9w0BBwEwHAYJKoZIhvcNAQkFMQ8X
DTEzMDUxMDEzNTQxOVowIwYJKoZIhvcNAQkEMRYEFAJEm9GL1ntHw6HGwzD8biAEwAYGMFIGCSqG
SIb3DQEJDzFFMEMwCgYIKoZIhvcNAwcwDgYIKoZIhvcNAwICAgCAMA0GCCqGSIb3DQMCAgFAMA0G
CCqGSIb3DQMCAgEoMAcGBSsOAwIHMA0GCSqGSIb3DQEBAQUABIGAP1VtnSKeh5c4cUpkNlNj1dkG
cBxSwotOKzIVuKHqQui0hfHrWtljMyeguqrVNR5oHhV/p9uKROKmhIxRB1kHR2JjSMa2lesJmMO1
JvvIYN8mvpiViT6SJ9mHy1LQm6+ydaHG0il3VLASoIszY6oAABnCWigs8oJ7P4NFpzMIqFUAAAAA
AAA=
------=_Part_6_1615028291.1368194059968--
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Fri, 10 May 2013 15:09:18 GMT
Connection: close
Content-Type: text/plain
Content-Length: 131

The message with msg id [<1659527914.01368194059987.JavaMail.admin@[local server]>] was received and has been handed off for processing.

The very last message is the ideal response I was looking for. Now I had everything I needed to create an equivalent service, "sendAS2", in my local package, that did everything "send" and "createMIME" does, minus the file duplication:

The proof is in the pudding, as they say, so after some small file tests to verify this was working, I threw the 400 meg file at the new service, and watched tspace to see what happened. As with the old service, at first it took about 5 minutes to transfer the file down from the Canton share drive:

$ perl -le 'while(1){print scalar localtime; print `ls -l *.dat; df -k .`; sleep(15)}'
Sat May 11 16:12:03 2013
*.dat not found
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt     4194304   4193320    1%        4     1% /opt/tspace

Sat May 11 16:12:18 2013
-rw-r-----    1 admin    admin      11547296 May 11 16:12 1DocRes.dat
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt     4194304   4182040    1%        5     1% /opt/tspace

...

Sat May 11 16:20:50 2013
-rw-r-----    1 admin    admin     431082539 May 11 16:20 1DocRes.dat
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt     4194304   3772340   11%        5     1% /opt/tspace

And then no other new temp file was created:

Sat May 11 16:21:05 2013
-rw-r-----    1 admin    admin     431082539 May 11 16:20 1DocRes.dat
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt     4194304   3772340   11%        5     1% /opt/tspace

Finally, after the HTTPS post completes, the temp file is removed, and the filesystem space is freed when the filehandle is closed:

Sat May 11 16:21:20 2013
*.dat not found
Filesystem    1024-blocks      Free %Used    Iused %Iused Mounted on
/dev/tspacemnt     4194304   4193320    1%        4     1% /opt/tspace

So between the new code managing temp files better, and adjusting the tspace.max config value to match the actual size of the tspace mount, we're in much better shape. To me, this is yet another example of the need to understand bits on the wire. The development framework that we've had forced on us for most of the last decade gives management the illusion that development can be drag and drop and pretty pictures, and that it's not necessary to have competent engineers with deep knowledge of the systems in the enterprise. That is simply not realistic.

Sometimes I reflect on why I have so much passion about this particular type of file. I take my work seriously, always have, and my customers get my full attention and all the effort I can give them to solve a problem, but this is a special case. As a child, my family lived pretty hand to mouth. My stepfather, when he lived with us, was often in trouble with the law or with drugs, and my mother worked two kitchen jobs trying to keep us afloat. Late notices on the kitchen counter were a common site, and mom was occasionally on the phone with various companies making payment plan arrangements.

A particularly stressful day for me as a young boy was when I was home alone, and there was a knock on the door from an electric company rep.

"Um, hi. I'm here to disconnect your power. Here's how you can make payment arrangements to get it turned back on."

Having no other intelligent reply to give, I said simply "ok. Thanks."

Mom got off work a few hours later and came home to a dark house, and was near tears. "I paid that bill."

Maybe there's not a little old lady who'll be cold tomorrow if my work isn't true. But there damn well won't be a little boy who feels helpless watching his mom beg for one more day because of an administrative fuck up. Not if I have anything to say about it.

No comments:

Post a Comment