Dan's Musings

Byte Order Marks Must Die

The byte order mark is a terrible wrench that can gum up the gears. It is at the center of a rather old question in the Unicode community, with one answer being championed by Windows and the other answer by... Everyone else, maybe? Not sure.

The problem in question was summed up nicely in Joel Spolsky's seminal blog post on Unicode:

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

How to solve this problem?

Windows explains their position on this issue in a relatively recent blog post. It decided to always make sure it knew what encoding its text files were in. It does this with the byte order mark. See, byte order marks look different in every UTF encoding, including UTF-8:

Encoding Bytes
UTF-32 0000FEFF
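For anyone who wants to check the marks themselves, here is a minimal sketch that prints the byte order mark for each UTF encoding using the constants in Python's standard codecs module. The helper name bom_table is made up for illustration.

```python
import codecs

# Byte order marks as defined by the standard library's codecs module.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16 (big-endian)": codecs.BOM_UTF16_BE,
    "UTF-16 (little-endian)": codecs.BOM_UTF16_LE,
    "UTF-32 (big-endian)": codecs.BOM_UTF32_BE,
    "UTF-32 (little-endian)": codecs.BOM_UTF32_LE,
}


def bom_table():
    """Return (encoding name, uppercase hex of its BOM) pairs."""
    return [(name, bom.hex().upper()) for name, bom in BOMS.items()]


if __name__ == "__main__":
    for name, hex_bytes in bom_table():
        print(f"{name:24} {hex_bytes}")
```

Note that the UTF-8 entry has no byte order to mark at all: UTF-8 is a stream of single bytes, so its "BOM" can only ever serve as an encoding signature.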

Thus Windows decided to use the byte order mark for something it was never intended to do: mark the encoding of a plain text file even when byte order wasn't the problem.

This doesn't fully solve their problem anyway because it doesn't guarantee that the file is in fact plain text. Nevertheless programs that expect plain text still look for this mark on Windows. It's like their engineers all agreed that this was THE way to decide what encoding the text was encoded in.

The flaw with this plan is that it assumes that the only encodings that exist on this planet start with the letters UTF. Rob Pike puts it best:

The Unicode Standard defines an adequate character set but an unreasonable representation. It states that all characters are 16 bits wide and are communicated and stored in 16-bit units. It also reserves a pair of characters (hexadecimal FFFE and FEFF) to detect byte order in transmitted text, requiring state in the byte stream. (The Unicode Consortium was thinking of files, not pipes.) To adopt this encoding, we would have had to convert all text going into and out of Plan 9 between ASCII and Unicode, which cannot be done. Within a single program, in command of all its input and output, it is possible to define characters as 16-bit quantities; in the context of a networked system with hundreds of applications on diverse machines by different manufacturers, it is impossible.

(Emphasis mine.)

In deciding that BOMs were the answer to their plain text problems, Windows engineers left older encodings such as ISO-8859-1, ASCII, etc. out in the cold.
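To make the gap concrete, here is a rough sketch of BOM sniffing, assuming a program does nothing more than compare the first few bytes against the known marks. The function name sniff_bom is hypothetical, and real programs may layer other heuristics on top.

```python
import codecs

# Longer marks go first: the UTF-32 LE mark begins with the UTF-16 LE mark.
# (A UTF-16 LE file whose first character is U+0000 would still be
# misread as UTF-32 LE; BOM sniffing cannot tell those two apart.)
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

Feed it bytes encoded as ISO-8859-1 or plain ASCII and it returns None: the mark only ever answers the question for the UTF family, and says nothing about which of the older encodings a file might be in.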

This is "not a problem" on Windows, presumably because Windows Says So. In practice this is mostly true. Most editors that run on Windows (including VS Code, gVim and NeoVim QT, the ones I've had experience with) play ball: they use byte order marks by default on Windows, and often even use UTF-16LE unless they are configured to use something else like UTF-8.

The problem arose for me when I wrote a Bash script, then fired up a Linux VM and tried to run it. (This is a surprisingly common use case for professionals who code for Linux targets but are required to use Windows laptops by their IT department.) See, Bash does not treat a BOM as anything special, and totally chokes on it at the beginning of the file. But my editor reported that there was no problem, so it took a long time for me to figure out that there was something there that I didn't see. Even when I figured out that the BOM was there and set nobomb in Vim, it was a few more weeks of coding and silently placing these BOM characters in my Git repositories before I realized I needed to set it globally instead of just locally. All in all, weeks of pain.
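Since the editor won't show the mark, it helps to look at the raw bytes. Below is a small sketch, assuming the offending files carry a UTF-8 BOM and that the scripts of interest match a glob such as *.sh. The names has_utf8_bom, strip_utf8_bom and find_bom_files are invented for this example.

```python
import codecs
from pathlib import Path


def has_utf8_bom(data: bytes) -> bool:
    """True if the bytes begin with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the bytes with a leading UTF-8 BOM removed, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


def find_bom_files(root: str, pattern: str = "*.sh"):
    """Return a sorted list of files under root that start with a UTF-8 BOM."""
    found = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            with path.open("rb") as fh:
                # Only the first three bytes matter for this check.
                if fh.read(len(codecs.BOM_UTF8)) == codecs.BOM_UTF8:
                    found.append(path)
    return sorted(found)
```

With the mark in place, the first two bytes of a script are no longer #!, which is one reason a script can look perfect in an editor and still fail to run. Pointing find_bom_files at a repository checkout is a quick way to see how many marks have already slipped in.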

The only thing to do here, of course, is to configure your editor so that it never ever uses byte order marks.

All the places in my editor where byte order marks are forbidden.

You should further write your own cmdlet in PowerShell that takes standard input bytes and writes them out in UTF-8 without a BOM, and then always use it instead of the redirect operator (which uses UTF-16LE much of the time).
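The same idea can be sketched outside PowerShell. What follows is a Python stand-in, not a cmdlet, and it rests on an assumption: the input either starts with a UTF BOM or is already UTF-8. The names to_utf8_no_bom and convert_stream are hypothetical. It writes to a file path itself so that no redirect operator gets a chance to re-encode the output.

```python
import sys


def to_utf8_no_bom(data: bytes) -> bytes:
    """Re-encode BOM-marked UTF input (or plain UTF-8) as UTF-8 without a BOM."""
    # UTF-32 is tested first because its little-endian mark starts with FF FE.
    if data.startswith((b"\xff\xfe\x00\x00", b"\x00\x00\xfe\xff")):
        text = data.decode("utf-32")    # codec consumes the BOM
    elif data.startswith((b"\xff\xfe", b"\xfe\xff")):
        text = data.decode("utf-16")    # codec consumes the BOM
    else:
        text = data.decode("utf-8-sig")  # strips a UTF-8 BOM if present
    return text.encode("utf-8")


def convert_stream(src, dst) -> None:
    """Read all bytes from src and write them to dst as BOM-free UTF-8."""
    dst.write(to_utf8_no_bom(src.read()))


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: <some command> | python thisfile.py OUTPUT_PATH
    with open(sys.argv[1], "wb") as out:
        convert_stream(sys.stdin.buffer, out)
```

Input in an older encoding such as ISO-8859-1 will usually raise a UnicodeDecodeError here rather than be silently mangled, which seems like the right failure mode for a tool whose whole job is to keep surprises out of the byte stream.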

Text shouldn't have anything about its encoding inside of it. That information should be had from somewhere else so that, as Rob Pike pointed out, state is not introduced to the byte stream. Let's all do our part to render byte order marks unneeded and unused.