
UTF-8 Character Encoding

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, it takes its name from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit. UTF-8 is by far the most common encoding for the World Wide Web, accounting for over 96% of all web pages, and up to 100% for some languages, as of 2021.

History

The International Organization for Standardization (ISO) set out to create a universal multi-byte character set in 1989. The draft ISO 10646 standard contained an optional annex called UTF-1 that provided a byte-stream encoding of its 32-bit code points. In July 1992, the X/Open committee XoJIG was looking for a better encoding. Dave Prosser of Unix System Laboratories submitted a proposal for one with faster implementation characteristics, introducing the improvement that 7-bit ASCII characters would represent only themselves; all multi-byte sequences would include only bytes with the high bit set. The name File System Safe UCS Transformation Format (FSS-UTF) and most of the text of this proposal were later preserved in the final specification.

In August 1992, this proposal was circulated by an IBM X/Open representative to interested parties. A modification by Ken Thompson of the Plan 9 operating system group at Bell Labs made it somewhat less bit-efficient than the previous proposal but crucially allowed it to be self-synchronizing, letting a reader start anywhere and immediately detect byte sequence boundaries. It also abandoned the use of biases and instead added the rule that only the shortest possible encoding is allowed; the additional loss in compactness is relatively insignificant, but readers now have to look out for invalid encodings to avoid reliability and especially security issues. Thompson's design was outlined on September 2, 1992, on a placemat in a New Jersey diner with Rob Pike. In the following days, Pike and Thompson implemented it and updated Plan 9 to use it throughout, and then communicated their success back to X/Open, which accepted it as the specification for FSS-UTF.

[Figure: byte layouts of the FSS-UTF proposal (1992) and the final UTF-8 specification (1993)]

UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993. The Internet Engineering Task Force adopted UTF-8 in its Policy on Character Sets and Languages in RFC 2277 (BCP 18) for future Internet standards work, replacing Single Byte Character Sets such as Latin-1 in older RFCs. In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences.

Adoption

Since 2009, UTF-8 has been the most common encoding for the World Wide Web. The World Wide Web Consortium recommends UTF-8 as the default encoding in XML and HTML (and not merely using UTF-8, but also declaring it in metadata), "even when all characters are in the ASCII range ... Using non-UTF-8 encodings can have unexpected results". Many other standards support only UTF-8; for example, open JSON exchange requires it. As of February 2021, UTF-8 accounts on average for 96.4% of all web pages, and for 971 of the top 1,000 highest-ranked web pages (the next most popular encoding, ISO-8859-1, is used by 9 of those sites). These figures take into account that ASCII is valid UTF-8. For local text files, UTF-8 usage is lower, and many legacy single-byte encodings remain in use. This is primarily due to editors that will not display or write UTF-8 unless the first character in a file is a byte order mark (BOM), which makes it impossible for other software to adopt UTF-8 without being rewritten to ignore the BOM on input and add it on output.
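To illustrate the byte order mark problem, here is a minimal Python sketch (the file name demo.txt is invented for the example): the standard 'utf-8-sig' codec writes the BOM bytes EF BB BF on output and strips them on input, while the plain 'utf-8' codec does neither, which is exactly the mismatch described above.

    import codecs

    text = "naïve"

    # 'utf-8-sig' prepends the UTF-8 byte order mark EF BB BF when writing.
    with open("demo.txt", "w", encoding="utf-8-sig") as f:
        f.write(text)

    with open("demo.txt", "rb") as f:
        raw = f.read()
    assert raw.startswith(codecs.BOM_UTF8)  # file begins with EF BB BF

    # A plain utf-8 reader sees the BOM as a leading U+FEFF character;
    # 'utf-8-sig' skips it transparently.
    with open("demo.txt", encoding="utf-8") as f:
        assert f.read() == "\ufeff" + text
    with open("demo.txt", encoding="utf-8-sig") as f:
        assert f.read() == text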

Standards

There are several definitions of UTF-8 in standards documents. The first four listed below are current; the remaining entries have been superseded by them.

· RFC 3629 / STD 63 (2003), which establishes UTF-8 as a standard Internet protocol element

· RFC 5198, which defines UTF-8 NFC for Network Interchange (2008)

· ISO/IEC 10646:2014 §9.1 (2014)

· The Unicode Standard, Version 11.0 (2018)

· The Unicode Standard, Version 2.0, Appendix A (1996)

· ISO/IEC 10646-1:1993 Amendment 2 / Annex R (1996)

· RFC 2044 (1996)

· RFC 2279 (1998)

· The Unicode Standard, Version 3.0, §2.3 (2000), plus Corrigendum #1: UTF-8 Shortest Form (2000)

· Unicode Standard Annex #27: Unicode 3.1 (2001)

· The Unicode Standard, Version 5.0 (2006)

· The Unicode Standard, Version 6.0 (2010)

Encoding

Since the restriction of the Unicode code space to 21-bit values in 2003, UTF-8 has been defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point. The following table shows the structure of the encoding; the x characters are replaced by the bits of the code point.

    Code point range     Byte 1     Byte 2     Byte 3     Byte 4
    U+0000 – U+007F      0xxxxxxx
    U+0080 – U+07FF      110xxxxx   10xxxxxx
    U+0800 – U+FFFF      1110xxxx   10xxxxxx   10xxxxxx
    U+10000 – U+10FFFF   11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

For example, the Euro sign € (U+20AC, binary 0010000010101100) requires three bytes: its 16 bits fill the three-byte pattern as 11100010 10000010 10101100, that is, E2 82 AC in hexadecimal.
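To make the layout concrete, here is a minimal Python sketch (utf8_encode is an invented name, for illustration only) that encodes one code point by hand following the table, then checks the result against Python's built-in codec.

    def utf8_encode(cp: int) -> bytes:
        """Encode a single code point per the byte layouts above."""
        if cp < 0x80:                       # 1 byte:  0xxxxxxx
            return bytes([cp])
        elif cp < 0x800:                    # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        elif cp < 0x10000:                  # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            # Note: a full implementation would also reject the
            # surrogate range U+D800 to U+DFFF here.
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        elif cp <= 0x10FFFF:                # 4 bytes: 11110xxx 10xxxxxx ...
            return bytes([0xF0 | (cp >> 18),
                          0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("code point out of Unicode range")

    # The Euro sign U+20AC needs three bytes: E2 82 AC.
    assert utf8_encode(0x20AC) == "€".encode("utf-8") == b"\xE2\x82\xAC"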

Codepage Layout

The following table summarizes the usage of UTF-8 code units (individual bytes or octets) in a code page format. The upper half (0_ to 7_) covers bytes used only in single-byte codes, so it looks like a normal code page; the lower half covers continuation bytes (8_ to B_) and leading bytes (C_ to F_).

[Table: UTF-8 code page layout, with single-byte codes in rows 0_ to 7_, continuation bytes in rows 8_ to B_, and leading bytes in rows C_ to F_]
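As a small illustration of these byte classes, the sketch below (classify is an invented helper) determines a byte's role from its high bits alone; this is the property that makes UTF-8 self-synchronizing.

    def classify(b: int) -> str:
        """Name a UTF-8 code unit's role from its leading bits."""
        # Note: C0, C1, and F5 to FF can never appear in valid UTF-8.
        if b <= 0x7F:
            return "single-byte (ASCII)"        # rows 0_ to 7_: 0xxxxxxx
        if b <= 0xBF:
            return "continuation byte"          # rows 8_ to B_: 10xxxxxx
        if b <= 0xDF:
            return "leads a 2-byte sequence"    # rows C_, D_:   110xxxxx
        if b <= 0xEF:
            return "leads a 3-byte sequence"    # rows E_:       1110xxxx
        return "leads a 4-byte sequence"        # rows F_:       11110xxx

    for b in b"\xE2\x82\xAC":                   # the three bytes of U+20AC
        print(hex(b), classify(b))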

Overlong Encodings

In principle, it would be possible to inflate the number of bytes in an encoding by padding the code point with leading 0s. To encode the Euro sign € from the above example in four bytes instead of three, it could be padded with leading 0s until it was 21 bits long – 000 000010 000010 101100, and encoded as 11110000 10000010 10000010 10101100 (or F0 82 82 AC in hexadecimal). This is called an overlong encoding.
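A conforming decoder must reject such overlong forms. As a quick Python illustration (CPython's built-in codec enforces the shortest-form rule):

    # The valid three-byte encoding of U+20AC round-trips:
    assert b"\xE2\x82\xAC".decode("utf-8") == "€"

    # The overlong four-byte form F0 82 82 AC carries the same bits,
    # but a conforming decoder must reject it:
    try:
        b"\xF0\x82\x82\xAC".decode("utf-8")
    except UnicodeDecodeError as e:
        print("rejected overlong encoding:", e)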

Invalid Sequences and Error Handling

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

· invalid bytes

· an unexpected continuation byte

· a non-continuation byte before the end of the character

· the string ending before the end of the character (which can happen in simple string truncation)

· an overlong encoding

· a sequence that decodes to an invalid code point

Invalid UTF-8 has been used to bypass security validations in high-profile products including Microsoft's IIS web server and Apache's Tomcat servlet container. RFC 3629 states "Implementations of the decoding algorithm MUST protect against decoding invalid sequences." The Unicode Standard requires decoders to "...treat any ill-formed code unit sequence as an error condition. This guarantees that it will neither interpret nor emit an ill-formed code unit sequence."
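The sketch below exercises these failure modes against Python's built-in decoder (the sample byte strings are invented for illustration): strict decoding raises an error for every invalid sequence, while the standard 'replace' error handler substitutes U+FFFD so the ill-formed bytes are never interpreted as characters.

    samples = [
        b"\xFF",          # an invalid byte
        b"\x80abc",       # an unexpected continuation byte
        b"\xE2\x28\xA1",  # a non-continuation byte inside a sequence
        b"\xE2\x82",      # string truncated mid-character
        b"\xC0\xAF",      # an overlong encoding of '/'
        b"\xED\xA0\x80",  # surrogate U+D800: an invalid code point
    ]

    for raw in samples:
        try:
            raw.decode("utf-8")  # strict mode MUST reject invalid sequences
        except UnicodeDecodeError as e:
            # 'replace' maps each ill-formed unit to U+FFFD instead of
            # interpreting it, so processing can continue safely.
            fixed = raw.decode("utf-8", errors="replace")
            print(f"{raw!r}: {e.reason}; replaced -> {fixed!r}")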

References

"Chapter 2. General Structure". The Unicode Standard (6.0 ed.). Mountain View, California, US: The Unicode Consortium. ISBN 978-1-936213-01-6.

"Usage Survey of Character Encodings broken down by Ranking". w3techs.com. Retrieved 2021-02-19.

"The JavaScript Object Notation (JSON) Data Interchange Format". IETF. December 2017. Retrieved 16 February 2018.

Alvestrand, Harald (January 1998). IETF Policy on Character Sets and Languages. doi:10.17487/RFC2277. BCP 18.

"Specifying the document's character encoding". HTML5.2. World Wide Web Consortium. 14 December 2017. Retrieved 2018-06-03.

The Unicode Standard, Version 11.0 §3.9 D92, §3.10 D95, 2018.

Unicode Standard Annex #27: Unicode 3.1, 2001.

The Unicode Standard, Version 5.0 §3.9–§3.10 ch. 3, 2006.

The Unicode Standard, Version 6.0 §3.9 D92, §3.10 D95, 2010.

Marin, Marvin (2000-10-17). "Web Server Folder Traversal MS00-078".

Yergeau, F. (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. STD 63. RFC 3629. Retrieved 2020-08-20.

"Usage Statistics and Market Share of US-ASCII for Websites, August 2020". w3techs.com. Retrieved 2020-08-28.