Byte order mark

From Wikipedia, the free encyclopedia

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional; if used, it should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.[1]

Because Unicode can be encoded as 16-bit or 32-bit integers, a computer receiving these encodings from arbitrary sources needs to know which byte order the integers are encoded in. The BOM gives the producer of the text a way to describe the text stream's endianness to the consumer of the text without requiring some contract or metadata outside of the text stream itself. Once the receiving computer has consumed the text stream, it presumably processes the characters in its own native byte order and no longer needs the BOM. Hence the need for a BOM arises in the context of text interchange, rather than in normal text processing within a closed environment.

Usage

If the BOM character appears in the middle of a data stream, Unicode says it should be interpreted as a "zero-width non-breaking space" (it inhibits line-breaking between word-glyphs). In Unicode 3.2, this usage is deprecated in favour of the "Word Joiner" character, U+2060.[1] This allows U+FEFF to be used only as a BOM.

UTF-8

The UTF-8 representation of the BOM is the byte sequence 0xEF, 0xBB, 0xBF. A text editor or web browser interpreting the text as ISO-8859-1 or CP1252 will display the characters ï»¿ for this.
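As a small illustration of the byte sequence above, the following Python sketch encodes U+FEFF as UTF-8 and then misreads the result as CP1252. The variable names are only illustrative.

```python
# Encode the BOM character (U+FEFF) as UTF-8.
bom_bytes = "\ufeff".encode("utf-8")      # b'\xef\xbb\xbf'

# The same three bytes as space-separated hexadecimal.
bom_hex = bom_bytes.hex(" ")              # 'ef bb bf'

# Misinterpreting those bytes as CP1252 gives the three visible
# characters a legacy-encoding viewer would show.
mojibake = bom_bytes.decode("cp1252")     # 'ï»¿'
```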

The Unicode Standard permits the BOM in UTF-8,[2] but neither requires nor recommends its use.[3] Byte order has no meaning in UTF-8,[4] so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8. The BOM may also appear when UTF-8 data is converted from other encodings that use a BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[5][6]

Reasons the standard does not advocate the UTF-8 BOM include:

  • To encourage conversion to Unicode. A BOM complicates programming. Heuristics can usually determine whether a stream of bytes is UTF-8 or not, if it is necessary to know.
  • A plain ASCII file is in UTF-8 encoding. Requiring a BOM makes an artificial distinction between ASCII and UTF-8.[citation needed]
  • A language parser that transparently handles bytes with the high bit set in certain free-text contexts (such as string literals or comments) but otherwise uses a syntax defined only by ASCII characters, is already able to read and process UTF-8 correctly, even if it is not designed for Unicode. However the BOM at the start would violate its syntax and cause a parsing error. This is true of almost all languages written for personal computers and designed to handle legacy encodings such as CP1252.
  • The presence of a BOM defeats software that does not anticipate it, such as programs that use pattern matching to look for specific bytes at the start of a text file. For example, Unix uses a shebang at the start of an interpreted script,[7] but will miss the shebang if the file contains a BOM. PHP will output the BOM as page content, which forces the HTTP headers to be sent if PHP is running in an HTTP environment. After that point, calls to functions that manipulate the HTTP headers will fail.[8]
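The shebang problem in the last point can be sketched in Python. The helper names `has_shebang` and `strip_utf8_bom` are hypothetical, and the naive check only approximates what a kernel does; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

# A shell script that was saved with a UTF-8 BOM in front of the shebang.
script = codecs.BOM_UTF8 + b"#!/bin/sh\necho hello\n"

def has_shebang(raw: bytes) -> bool:
    """Naive check that looks only at the first two bytes."""
    return raw.startswith(b"#!")

def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 BOM, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw

naive_result = has_shebang(script)                    # False: BOM hides "#!"
cleaned_result = has_shebang(strip_utf8_bom(script))  # True

# When decoding to text, the "utf-8-sig" codec drops a leading BOM,
# while plain "utf-8" keeps it as the character U+FEFF.
text_sig = script.decode("utf-8-sig")
text_plain = script.decode("utf-8")
```

Software that must accept files from BOM-writing editors could apply such a stripping step before any byte-pattern matching, at the cost of no longer round-tripping the file exactly.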

Despite this, Microsoft compilers[9] and interpreters, and many pieces of software on Microsoft Windows such as Notepad, will not correctly read UTF-8 text unless it contains only ASCII characters or starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

UTF-16

In UTF-16, a BOM (U+FEFF) may be placed as the first character of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream.

  • If the 16-bit units are represented in big-endian byte order, this BOM character will appear in the sequence of bytes as 0xFE followed by 0xFF. This sequence appears as the ISO-8859-1 characters þÿ in a text display that expects the text to be ISO-8859-1.
  • If the 16-bit units use little-endian order, the sequence of bytes will have 0xFF followed by 0xFE. This sequence appears as the ISO-8859-1 characters ÿþ in a text display that expects the text to be ISO-8859-1.

Programs expecting UTF-8 may show these characters or error indicators, depending on how they handle UTF-8 encoding errors. In all cases they will probably display the rest of the file as garbage, although a UTF-16 text containing only ASCII characters will remain fairly readable.

For the IANA registered charsets UTF-16BE and UTF-16LE, a byte order mark should not be used because the names of these character sets already determine the byte order. If encountered anywhere in such a text stream, U+FEFF is to be interpreted as a "zero width no-break space".
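The difference between the generic UTF-16 scheme and the explicitly named UTF-16BE and UTF-16LE charsets can be observed with Python's built-in codecs. This is only a sketch of one implementation's behaviour; the variable names are illustrative.

```python
# U+FEFF followed by "hi", serialised in each byte order.
be = "\ufeffhi".encode("utf-16-be")   # b'\xfe\xff\x00h\x00i'
le = "\ufeffhi".encode("utf-16-le")   # b'\xff\xfeh\x00i\x00'

# The generic "utf-16" decoder reads the BOM to choose the byte order
# and then discards it.
from_be = be.decode("utf-16")         # 'hi'
from_le = le.decode("utf-16")         # 'hi'

# The explicit "utf-16-be" decoder does not treat U+FEFF as a signature:
# it stays in the text as an ordinary character.
kept = be.decode("utf-16-be")         # '\ufeffhi'
```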

Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore the presumption of big-endian is widely ignored. When those same files are accessible on the Internet, on the other hand, no such presumption can be made. Searching for 16-bit characters in the ASCII range or just the space character (U+0020) is a method of determining the UTF-16 byte order.
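The ASCII-range heuristic mentioned above might be sketched as follows. The function name `guess_utf16_byte_order` and the simple majority rule are assumptions made for illustration; real detectors are usually more careful, and the guess is unreliable for text with few ASCII characters.

```python
from typing import Optional

def guess_utf16_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of BOM-less UTF-16 by counting code units
    that look like ASCII characters (one zero byte, one byte below 0x80)."""
    be_hits = 0
    le_hits = 0
    for i in range(0, len(data) - 1, 2):
        first, second = data[i], data[i + 1]
        if first == 0 and 0 < second < 0x80:
            be_hits += 1      # 00 xx: ASCII character in big-endian order
        elif second == 0 and 0 < first < 0x80:
            le_hits += 1      # xx 00: ASCII character in little-endian order
    if be_hits > le_hits:
        return "utf-16-be"
    if le_hits > be_hits:
        return "utf-16-le"
    return None               # no evidence either way
```

For example, `guess_utf16_byte_order("hello world".encode("utf-16-le"))` returns `"utf-16-le"`, because every code unit has its zero byte in the second position.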

UTF-32

Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
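One practical consequence of applying the UTF-16 rules to UTF-32 is that the little-endian UTF-32 BOM (FF FE 00 00) begins with the little-endian UTF-16 BOM (FF FE). A BOM sniffer therefore has to test the longer signatures first. The sketch below assumes a hypothetical helper `detect_bom`; the `codecs.BOM_*` constants are from the Python standard library.

```python
import codecs
from typing import Optional, Tuple

# Longer signatures come first, so FF FE 00 00 is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def detect_bom(data: bytes) -> Tuple[Optional[str], int]:
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

This ordering has a known limitation: UTF-16 LE text that starts with a BOM followed by U+0000 is byte-for-byte identical to a UTF-32 LE BOM, so such input would be reported as UTF-32 LE.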

Representations of byte order marks by encoding

This table illustrates how BOMs are represented as byte sequences and how they might appear in a text editor that is interpreting each byte as a legacy encoding (CP1252 and symbols for the C0 controls):

Encoding           Representation (hexadecimal)   Representation (decimal)   Bytes as CP1252 characters
UTF-8 [t 1]        EF BB BF                       239 187 191                ï»¿
UTF-16 (BE)        FE FF                          254 255                    þÿ
UTF-16 (LE)        FF FE                          255 254                    ÿþ
UTF-32 (BE)        00 00 FE FF                    0 0 254 255                ␀␀þÿ (␀ refers to the ASCII null character)
UTF-32 (LE)        FF FE 00 00                    255 254 0 0                ÿþ␀␀ (␀ refers to the ASCII null character)
UTF-7 [t 1]        2B 2F 76 38 [t 2]              43 47 118 56               +/v8
                   2B 2F 76 39 [t 2]              43 47 118 57               +/v9
                   2B 2F 76 2B [t 2]              43 47 118 43               +/v+
                   2B 2F 76 2F [t 2]              43 47 118 47               +/v/
                   2B 2F 76 38 2D [t 3]           43 47 118 56 45            +/v8-
UTF-1 [t 1]        F7 64 4C                       247 100 76                 ÷dL
UTF-EBCDIC [t 1]   DD 73 66 73                    221 115 102 115            Ýsfs
SCSU [t 1]         0E FE FF [t 4]                 14 254 255                 ␎þÿ (␎ represents the ASCII "shift out" character)
BOCU-1 [t 1]       FB EE 28                       251 238 40                 ûî(
GB-18030 [t 1]     84 31 95 33                    132 49 149 51              „1•3

  t 1. This is not literally a "byte order" mark, since the byte is also the code unit in these encodings and there is no byte order to resolve. The sequence can be used to indicate the encoding of the text which it is preceding, however.[4][10]
  t 2. In UTF-7, the fourth byte of the BOM, before encoding as base64, is 001111xx in binary. The final two bits, xx, are not specifically part of the BOM, but contain the first two bits of the first encoded character following the BOM. All four possible byte combinations are shown in the table, as well as a fifth which is used for an empty string.
  t 3. If no following character is encoded, 38 is used for the fourth byte and the following byte is 2D.
  t 4. SCSU allows other encodings of U+FEFF; the form shown is the signature recommended in UTR #6.[11]
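The four UTF-7 forms in the table can be derived by hand. The sketch below, which uses the hypothetical helper `utf7_bom_variants`, appends each possible pair of following bits to the 16 bits of U+FEFF, splits the resulting 18 bits into three 6-bit groups, and maps them through the standard base64 alphabet after the "+" shift character.

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, '+', '/'.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def utf7_bom_variants():
    """Return the four possible UTF-7 encodings of U+FEFF, one for each
    value of the two bits borrowed from the following character."""
    variants = []
    for xx in range(4):
        value = (0xFEFF << 2) | xx                    # 16 BOM bits + 2 bits
        sextets = [(value >> shift) & 0x3F for shift in (12, 6, 0)]
        variants.append("+" + "".join(B64[s] for s in sextets))
    return variants

variants = utf7_bom_variants()   # ['+/v8', '+/v9', '+/v+', '+/v/']

# With no following character the spare bits are zero, and "-" ends the
# base64 run, giving the fifth form shown in the table.
empty_form = variants[0] + "-"   # '+/v8-'
```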

References

  1. Unicode FAQ: UTF-8, UTF-16, UTF-32 & BOM
  2. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 2009-03-29. "Table 2-4. The Seven Unicode Encoding Schemes"
  3. "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 2008-11-30. "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
  4. "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Retrieved 2009-01-04.
  5. "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Retrieved 2012-07-14.
  6. Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed
  7. Markus Kuhn (2007). "UTF-8 and Unicode FAQ for Unix/Linux: What different encodings are there?". Retrieved 20 January 2009. "Adding a UTF-8 signature at the start of a file would interfere with many established conventions such as the kernel looking for “#!” at the beginning of a plaintext executable to locate the appropriate interpreter."
  8. http://www.w3.org/International/questions/qa-byte-order-mark
  9. Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. "However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI."
  10. STD 63: UTF-8, a transformation of ISO 10646 Byte Order Mark (BOM)
  11. UTR #6: Signature Byte Sequence for SCSU
