Saturday, September 25, 2010

Concise Perl

One of my favorite tech blogs is the relatively new Ksplice Software Blog, which became popular thanks to its unique solution to the recent Linux kernel exploit. In late May they posted an entry called The top 10 tricks of Perl one-liners, which brought back a slew of memories from several years ago, when I was doing lots of system admin work from a Unix shell prompt. It also taught me a regular expression switch I had never come across before: \K. Commandline junkies being who they are, there was also ample discussion in the comments on everyone's favorite trick or idiom.

A few days after reading the article, I brought my eldest daughter, then 13, to work with me. She recently started high school, and was hanging out with me in the office completing a summer book summary assignment for an honors class due the first day of school. "Dad, what does 'concise' mean?" she asked me, while looking at the instructions. I replied something to the effect that it meant expressing a lot of detail in as few words as possible. The exchange reminded me of the above blog article, and also of some perl mojo I whipped up a couple weeks earlier at work.

Background

In our production Unix environment, we have shell accounts with read access to config files. We also have automatically logged superuser access for emergencies, under the conditions that we use it as infrequently as possible for critical problems only, and that we document the hell out of what we did and why for peer/management review.

In those situations, it is bad mojo to use a file editor when you're a superuser, as the logging mechanism is text-based, so the reviewer will see a bunch of garbage in the log representing your editing keystrokes. Instead we try to do everything with basic shell commands. Anything more complicated requires opening and documenting a change request, coordinating with the change control team, getting approval... time consuming if you're trying to take care of a hot issue.

As a middle ground between the basic shell commands and the change request, I've found that using a perl one-liner is acceptable, provided I can explain what it is doing. The reason I love perl is that it can be very concise and expressive, a good tool for tackling many problems that come up at work.

The situation I was faced with dealt with an MQ queue, and a config file that determined what services were subscribed to it. I received a call about a queue being backed up, so I took a peek at the config file of the joiner that watches the queue, to see where it was trying to write the queue's data to. I tested all the destination addresses, and one of them was invalid (retired without my group being notified, as it turned out). So I needed a quick way to comment out that data sink in an xml config file, but unfortunately the xml file already had comments within the data sink itself. To illustrate:

<sink>
    <type>db</type>
    <driver>oracle.jdbc.driver.OracleDriver</driver>
    <url>jdbc:oracle:thin:@(server):(port):(dbName)</url>
    <!--
      <user>old user</user>
      <password>old password</password>
    -->
    <user>new user</user>
    <password>new password</password>
    <insert>
        <table>owner.tableName</table>
        <value>field 1</value>
        <value>field 2</value>
        <value>...etc</value>
    </insert>
</sink>

In the above xml snippet (sanitized to remove company info) the "old user" and "old password" section has been commented out with <!-- and -->, the xml comment block. What I want to do is comment out the entire section, like so:

<!-- <sink>
    ...lines omitted
    <!--
      <user>old user</user>
      <password>old password</password>
    -->
    ...lines omitted
</sink> -->

This is now invalid xml, according to the specs. From the W3C xml specification:

For compatibility, the string "--" (double-hyphen) must not occur within comments.

Nested comments are illegal, then, according to the specs. So not only did I need to wrap the <sink> block in comment tags, I needed to remove the instances of -- that occur within it. Enter perl, and the following multi-faceted command:

perl -i.b -0777e 'for(split/(<si.+?nk>)/s,<>){if(/dbName/){s/--//g;$_="<!-- $_ -->"}print}' config.xml

There's a lot going on here, so I'll break it down one step at a time.

perl -i.b -e '(commands)' config.xml

The -i switch instructs perl to perform an in-place edit on the file (config.xml in this case). This means that whatever the script prints (without naming a filehandle) becomes the new file. The ".b" suffix means the original file is kept as config.xml.b as a precaution. The -e switch followed by '(commands)' means treat whatever is in single-quotes as an inline script.

perl -0777e '(commands)' config.xml

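This is "slurp mode": the -0777 switch unsets the input record separator, so the first read from the file returns the entire file as one string instead of a single line. (A read only lands in the special scalar variable $_ automatically when it is the condition of a while loop, or when the -n or -p switch is used; in this script the text is handed straight to split.)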

for(list or array){(commands)}

Iterate through each item of the list in question, aliasing the item to the special scalar $_ within the bracketed commands section. Because $_ is an alias, changing it changes the list element itself. The loop also localizes $_, so whatever value $_ held before the loop is restored once the loop ends. Behold the power of scope.

split /(<si.+?nk>)/s, <>

The "<>" reads from the files named on the command line (falling back to STDIN if there are none). Since we're in slurp mode, and the only file is config.xml, this single instruction reads the entire file and passes it to split as the string to be split. The "split" command returns a list, delimiting that string by whatever regular expression is passed to it. Surrounding the regex with parentheses will include the delimiters as elements of the list as well.

The regex /<si.+?nk>/s is where the mojo really happens. This grabs a chunk of text starting with "<si" plus at least one more character, ending with the first occurrence of "nk>" after that. The /s modifier lets "." match newlines, so the match can span multiple lines. Since the xml file in question contains no other tags that end with "nk", this grabs an entire <sink> to </sink> block. (Why doesn't it just grab the opening <sink> tag and stop? Because of the mandated extra character between i and n.)

So the complete split command reads in the entire file, splits it into sections delimited by <sink> blocks, and feeds the sections and the sink blocks into the for loop, executing the {(commands)} block on each element of the list.

if(/dbName/){s/--//g;$_="<!-- $_ -->"}print

This is where the text manipulation happens. If whatever $_ chunk I'm looking at contains "dbName", search for all occurrences of "--" and delete them. When finished, replace the entire block with "<!-- ", followed by the old block value, followed by " -->". The final print command outputs the current state of $_. If "dbName" wasn't included in the text, no manipulation happens.

Combining all these elements together, the file is read, divvied up into sink blocks and surrounding text, and any block containing dbName gets its comments removed, and enclosed in one giant comment block. All in 102 characters. That is the definition of concise. Running the command on the config file yielded the results I was hoping for, correcting the problem with the backed-up queue:

$ diff config.xml.b config.xml
18c18
< <sink>
---
> <!-- <sink>
22c22
<     <!--
---
>     <!
25c25
<     -->
---
>     >
34c34
< </sink>
---
> </sink> -->
$

I won't touch on perl too much in this blog since I'm mainly focusing on Java these days, but I encourage you to, if nothing else, explore regular expressions; they're very useful, and I'm continually surprised when I find my coding peers aren't familiar with them. The Ksplice blog entry from above is a nice intro to what commandline perl is capable of, and I also wrote a mid-level perl tutorial a few years ago in my Curtis Quarterly blog, where I keep odds and ends that don't have a more appropriate home.
