Python and Unicode: Two Helpful Resources



In the spirit of the comments in Noobie McNoob here! and Getting started with python, I want to share something that I’ve just read that’s filled in some important gaps in my working knowledge of Python.

Some personal background for context:

I started writing Python code in 2004 (or 2005?) when I was working for a small software vendor making digital forensics tools. We needed to automate a native code app’s build process. (Non-automated build processes are a symptom of being at the rock bottom of software development maturity and capability…) I had heard of Python and wanted to try it out, so I wrote my build script in Python. It worked! I was very pleasantly surprised at how I could write something of tremendous value for my team, seemingly without undue effort and without having to undertake any formal training, because of the usefulness / usability of the Python documentation and other resources that I found via Google search. (Prior to that time I’d written working code, though not necessarily clean and elegant code, in C++, Java, and PHP.)

During the ensuing years I’ve used Python here and there, and I’ve continued to have great experiences, but I’ve never written Python code as a primary job function. I’ve still got a lot to learn about Python and writing clean code.

For almost three years now I’ve had occasional opportunities to write Python code to integrate disparate infosec tools and/or automate secops processes. Across these recent experiences I’ve had several encounters with non-ASCII text causing bugs in my code. These bugs proved harder to address than I would have liked, and the resolutions didn’t feel complete.

For example, I wrote some code to consume JSON files containing database connection info. These JSON files were sometimes generated by software running on Windows, and sometimes by software running on a Mac. Both sets of files were human-readable for me in Notepad++, but I had to do a lot of trial and error to get my code to read and parse them both successfully. The Windows JSON files were made up of bytes in one character encoding (UTF-16, if I recall correctly) and the Mac JSON files were made up of bytes in another (probably UTF-8?).
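In hindsight, something like the following would probably have saved me the trial and error. This is only a minimal sketch, assuming Python 3 and assuming the files are either UTF-8 or UTF-16 (with or without a byte order mark); the function name `load_json_bytes` and the sample data are made up for illustration.

```python
import codecs
import json


def load_json_bytes(raw: bytes):
    """Parse JSON from raw bytes that may be UTF-8 or UTF-16.

    Hypothetical helper: it assumes UTF-32 never shows up.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the leading byte order mark while decoding.
        return json.loads(raw.decode("utf-8-sig"))
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order.
        return json.loads(raw.decode("utf-16"))
    # No BOM: json.loads accepts bytes (Python 3.6+) and detects
    # UTF-8/UTF-16/UTF-32 from the pattern of null bytes.
    return json.loads(raw)


# Made-up connection info, encoded the way each platform might write it.
sample = {"host": "db.example", "user": "José"}
as_utf8 = json.dumps(sample, ensure_ascii=False).encode("utf-8")
as_utf16 = json.dumps(sample, ensure_ascii=False).encode("utf-16")

print(load_json_bytes(as_utf8) == load_json_bytes(as_utf16))  # True
```

The important part is reading the file in binary mode (for example with `pathlib.Path.read_bytes()`) so that the decoding decision is made explicitly rather than left to the platform default.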

I’ve also seen trouble when retrieving information from Active Directory that may have been UTF-8, or Windows-1252, or ISO 8859-1 - I can’t recall if I ever knew for sure.
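When the encoding really is unknown, one common heuristic is to try strict decoding with the likeliest candidates in order. The sketch below assumes Python 3; the function name `decode_best_effort` and the candidate order are my own choices, not anything Active Directory prescribes. It is a guess, not a substitute for knowing what encoding the source actually emits.

```python
def decode_best_effort(raw: bytes, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying strict decodes in order.

    Hypothetical helper. UTF-8 goes first because non-ASCII bytes from
    other encodings rarely happen to form valid UTF-8.
    """
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO 8859-1 (latin-1) maps every byte to a code point, so it never
    # fails. That makes it a last resort, not proof of correctness.
    return raw.decode("latin-1"), "latin-1"


print(decode_best_effort("café".encode("utf-8")))  # ('café', 'utf-8')
print(decode_best_effort(b"caf\xe9"))              # ('café', 'cp1252')
```

Note that Windows-1252 and ISO 8859-1 agree on most bytes, so a successful decode with either one does not tell you which was intended.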

At another time I was helping a collaborator write code to parse the bodies of e-mail messages retrieved through IMAP via a third-party library, and the message bodies seemed to vary in terms of their encoding. We wound up replacing characters that wouldn’t decode as UTF-8 instead of rendering them properly - ouch!
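To make the "ouch" concrete, here is a small sketch (Python 3, with made-up sample text) of what that kind of replacement does. It assumes the body was really Windows-1252, which is only my guess about what we were seeing.

```python
# Bytes as they might arrive: Windows-1252 text with an accent and curly quotes.
raw = "café “quoted”".encode("cp1252")

# What we effectively did: force UTF-8 and replace whatever fails to decode.
lossy = raw.decode("utf-8", errors="replace")

# What we should have done: decode with the encoding the bytes were written in.
correct = raw.decode("cp1252")

print(repr(lossy))    # 'caf\ufffd \ufffdquoted\ufffd'
print(repr(correct))  # 'café “quoted”'
```

Each U+FFFD in the lossy result is a character that is gone for good, which is why finding the declared charset of the message is worth the effort.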

All of these little episodes were lamentable and ironic in that in each case I was dealing with English text. Would I have done worse if non-English text had been present?

Great resources for learning to deal well with Unicode, etc:

I recently stumbled across a brief and eye-opening read. This is, for me, a very helpful and comprehensive introduction to the broad topic of Python and Unicode. I certainly would have done better with the above-mentioned projects if I had read this essay beforehand. Simple tips like the encouraged use of type() and repr(), as well as a clarified understanding of the difference between decode() and encode() - I can’t believe I hadn’t picked that up on my own - will certainly be helpful going forward.
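As a quick reminder to myself, here is how those tips look in practice. This is a minimal sketch assuming Python 3, where text is `str` and encoded data is `bytes`; the sample string is my own.

```python
text = "naïve café"              # str: a sequence of Unicode code points
data = text.encode("utf-8")      # encode(): str -> bytes

# type() and repr() show exactly what you are holding.
print(type(text), repr(text))    # <class 'str'> 'naïve café'
print(type(data), repr(data))    # <class 'bytes'> b'na\xc3\xafve caf\xc3\xa9'

# decode(): bytes -> str, using the same encoding that produced the bytes.
round_trip = data.decode("utf-8")
print(round_trip == text)        # True

# Lengths differ because each accented letter takes two bytes in UTF-8.
print(len(text), len(data))      # 10 12
```

The way I keep it straight: you encode text into bytes on the way out of the program, and decode bytes into text on the way in.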

One of the references cited in the essay above is something that I read years ago and am motivated to re-read now.

I love what Batchelder says:

“You don’t have to play whack-a-mole. Unicode isn’t simple, but it isn’t difficult either. With knowledge and discipline, you can deal with Unicode easily and with grace.”

That sounds good to me!