A Princeton professor, finding a little time for himself in the summer academic lull, emailed an old friend a couple months ago. Brian Kernighan said hello, asked how their US visit was going, and dropped off hundreds of lines of code that could add Unicode support for AWK, the text-parsing tool he helped create for Unix at Bell Labs in 1977.
“I have tested this a fair amount but clearly more tests are needed,” Kernighan wrote in the email, posted in late May as a kind of pseudo-commit on the onetrueawk repo by longtime maintainer Arnold Robbins. “Once I figure out how … I will try to submit a pull request. I wish I understood git better, but in spite of your help, I still don’t have a proper understanding, so this may take a while.”
Kernighan is the “K” in AWK, a special-purpose language for extracting and manipulating text that was key to Unix’s pipeline features and interoperability between systems. A working awk utility (AWK is the language, awk the command to invoke it) is critical to both Single UNIX Specification and IEEE POSIX certification for interoperability. There are countless variants of awk, including modern derivations with support for Unicode, but “One True AWK,” sometimes known as nawk, is a kind of canonical version based on the 1988 book The AWK Programming Language, which Kernighan cowrote, and his subsequent input.
Kernighan is also the “K” in “K&R C,” the foundational 1978 book The C Programming Language that he cowrote with Dennis Ritchie and that sticks with programmers, mentally and in dog-eared paper form. C’s roots go much deeper. Kernighan had been teaching C to workers at Bell Labs and convinced its creator, Ritchie, to collaborate on a book to spread the knowledge. That book gave birth to “the one true brace style,” the endless debate that goes with it, and a structure that underpins many modern programming languages.
The onetrueawk repository, where Kernighan appeared in late May, is a relatively quiet place, with 21 contributors, 46 GitHub users watching, and commits coming every few months. As noted by The Register, Kernighan’s Unicode fix came to light mostly because it was mentioned in an interview with the professor by YouTube channel Computerphile.
“It’s always been an embarrassment that AWK only worked with ASCII, or maybe 8-bit inputs, but it doesn’t really handle Unicode at all,” Kernighan tells interviewer David Brailsford, himself a professor. “A few months ago, I spent some time working with (laughs) an incredibly old program. I have it at this point where it will actually handle UTF-8 input and output so that you can have regular expressions that, you know, pick up Japanese characters, things like that.”
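To see why a tool built around ASCII or 8-bit input stumbles on text like this, it helps to compare how the same string looks as characters and as UTF-8 bytes. The following is a minimal sketch in Python, used here purely for illustration. It is not how onetrueawk implements its Unicode support, and the sample string and variable names are made up for the example.

```python
import re

# Hypothetical sample input: three kanji, a space, three ASCII letters.
text = "日本語 awk"

# Character view: what a Unicode-aware tool counts.
char_count = len(text)

# Byte view: what a byte-oriented tool sees after UTF-8 encoding.
# Each of these kanji takes three bytes; ASCII characters take one.
raw = text.encode("utf-8")
byte_count = len(raw)

# In a regular expression, "." means "one character" on text
# but only "one byte" on raw bytes, which splits a kanji apart.
first_char = re.match(r".", text).group()
first_byte = re.match(rb".", raw).group()

# A character class over a range of code points (here the main CJK
# ideograph block) only works when the engine thinks in characters.
kanji_runs = re.findall(r"[\u4e00-\u9fff]+", text)

print(char_count, byte_count)   # 7 13
print(first_char, first_byte)   # 日 b'\xe6'
print(kanji_runs)               # ['日本語']
```

A byte-oriented matcher treats the first kanji as three unrelated bytes, so a single "." or a character range cannot line up with what the reader thinks of as one character. That mismatch is the general kind of problem a UTF-8-aware regular expression engine has to solve.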
Kernighan, now 80, offhandedly mentions in the interview that he has also patched something “quick and dirty” to let AWK handle CSV files.