Monday, April 08, 2013

Stripping a UTF-8 byte order mark with Go

Recently I've been watching some videos on the Go programming language, and I was impressed enough to go install it and start tinkering. Right out of the gate, I ran into trouble getting the code on tour.golang.org's first page working locally. The code is simply:

Google's goal in this case seems to be showing seasoned coders on their first visit that the language has UTF-8 support built in. Notepad on Windows 7 does display CJK characters correctly out of the box, without needing to play around with language options in the control panel. It also lets you save files in a few Unicode-friendly formats, one of which is UTF-8. Unfortunately, Notepad always writes a byte-order mark (BOM) at the beginning of files saved in UTF-8 format, and this can't be disabled.
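To make the BOM concrete: in UTF-8 it is the three bytes EF BB BF at the very start of the file. Here is a minimal sketch of checking for it in Go. The helper name hasUTF8BOM is made up for illustration and is not part of any library.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// hasUTF8BOM (hypothetical helper) reports whether data begins
// with a UTF-8 byte-order mark.
func hasUTF8BOM(data []byte) bool {
	return bytes.HasPrefix(data, utf8BOM)
}

func main() {
	// Simulate what a BOM-prefixed source file looks like in memory.
	withBOM := append([]byte{0xEF, 0xBB, 0xBF}, "package main\n"...)
	plain := []byte("package main\n")

	fmt.Printf("% x\n", withBOM[:3])                    // ef bb bf
	fmt.Println(hasUTF8BOM(withBOM), hasUTF8BOM(plain)) // true false
}
```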

The Unicode standard permits a BOM at the start of a UTF-8 file but neither requires nor recommends one, so tools differ on whether they accept it. However you read the standard, the Go compiler does not accept it, and it reports an "illegal character" error when I attempt to compile the same program.

So if you're on Windows and want to code an app that uses Unicode characters, you're going to need another editing tool, or a program to strip the BOM before you compile. After a quick search, I found the top answer on this page to be an adequate Python solution, which allowed me to run the sample program, bringing me to my next problem:

Apparently, in the Western build of Windows 7, there is no double-byte character set (DBCS) support in the cmd.exe console. The "chcp 65001" business switched the console to the UTF-8 code page, which made it attempt to render the Chinese ideograms instead of printing their six UTF-8 bytes as separate characters. As you can see, the font had no glyphs for them, hence the blocks. I spent a while reading through other people's attempted solutions, but as far as I can tell no one has truly licked it.

The current build of Cygwin uses mintty as its terminal, which has UTF-8 support out of the box, so switching to Cygwin to run the program produced better results:

So with all those problems licked, I decided to write a BOM stripper in Go to use in the future instead of the Python one. Writing this simple program turned out to be a better tutorial in the language than the tour or the many Go videos on YouTube. Once again, just tinkering with a tool is a better teacher... for me, anyway. Without further ado, here's the BOM-stripper, my first Go program:

package main

import (
	"bytes"
	"fmt"
	"io/ioutil"
	"os"
)

func main() {
	// U+FEFF encoded as UTF-8.
	bom := []byte{0xef, 0xbb, 0xbf}

	if len(os.Args) < 2 {
		fmt.Println("Include file name to parse on command-line")
		return
	}
	fileName := os.Args[1]

	contents, err := ioutil.ReadFile(fileName)
	if err != nil {
		fmt.Println("Error reading file")
		fmt.Println(err)
		return
	}

	// HasPrefix is safe for files shorter than three bytes,
	// where slicing contents[:3] would panic.
	if !bytes.HasPrefix(contents, bom) {
		fmt.Println("No byte-order mark found")
		return
	}

	// Keep the original file as a backup before rewriting it.
	err = os.Rename(fileName, fileName+".bak")
	if err != nil {
		fmt.Println("Error renaming file")
		fmt.Println(err)
		return
	}

	// Write everything after the BOM back under the original name.
	err = ioutil.WriteFile(fileName, contents[len(bom):], 0644)
	if err != nil {
		fmt.Println("Error re-writing file")
		fmt.Println(err)
		return
	}
}

Here it is in action. I've taken the original program, stripped out the comments, and saved it with Notepad, re-introducing the BOM. This shows the original "illegal character" error, the change in byte size after bom.go is run, and it not making a second change to the file if run again.

I think I'm going to enjoy programming in Go. It seems to be a good combination of expressive, self-documenting, and low-level. I like how the common idiom "err = function()" makes you think about error handling at each stage. I'm kicking myself that I didn't start using Go earlier.

1 comment:

  1. utfbom package will facilitate this task. It detects BOM and removes it as necessary.
    https://github.com/dimchansky/utfbom
