Tuesday, October 14, 2008

Unicode, UTF-8, UTF-16, UCS-2 - In a Nutshell

I have noticed that lots of people have no clear idea what "Unicode", "UTF-8", "UTF-16" and "UCS-2" actually are, aside from the fact that they're somehow related to the display of foreign characters. The objective of this post is to briefly explain them and dispel some of the myths associated with them.

Unicode is a coding system used to represent characters from many languages (including Japanese and Chinese) without the need to change your system locale. If you've tried writing Kanji in Medusa, you know what I'm talking about. In Unicode, each character is given a unique number, called a code point. For example, the capital letter "A" is U+0041 (65 in decimal), and the Hiragana "ふ" is U+3075 (12405 in decimal). Code points are divided into planes of 65536 each for convenience. Almost all common characters are in plane 0 (also known as the Basic Multilingual Plane, or BMP), which goes from code point U+0000 to U+FFFF. All kanji are in planes 0 and 2.
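As a small illustration of code points and planes, here is a Python sketch. The helper name `describe` is made up for this example; `ord` is the standard built-in that returns a character's code point.

```python
def describe(ch: str) -> str:
    """Return the code point and plane of a single character."""
    cp = ord(ch)        # Unicode code point as an integer
    plane = cp >> 16    # each plane holds 0x10000 (65536) code points
    return f"U+{cp:04X} (decimal {cp}), plane {plane}"

print(describe("A"))            # capital letter A
print(describe("\u3075"))       # Hiragana "ふ"
print(describe("\U0002000B"))   # a CJK ideograph that lives outside the BMP
```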

UTF-8, UTF-16 and UCS-2 are simply techniques used to encode those values into text files. Windows helped create a myth that Unicode is UTF-16 by calling UTF-16 "Unicode" in applications such as Notepad - but the fact is that UTF-16 is as much Unicode as UTF-8 is.

UCS-2 (UCS = Universal Character Set) is an old encoding system that can store characters from the BMP by simply writing them as 16-bit values. The advantage of this system is that it's simple and covers most characters, but anything outside the BMP simply cannot be represented. That's why some Kanji have issues with some programs. An interesting consequence of UCS-2 is that it lets you store values that aren't characters at all, such as the ones reserved for surrogate pairs (see the next paragraph).
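Python has no built-in UCS-2 codec, so the following is only a sketch of the idea: a hypothetical `encode_ucs2` function (little-endian byte order is an arbitrary choice here) that writes one 16-bit value per code point and gives up on anything outside the BMP.

```python
import struct

def encode_ucs2(text: str) -> bytes:
    """Hypothetical UCS-2 encoder: one little-endian 16-bit value per code point."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # UCS-2 has no way to express code points beyond the BMP
            raise ValueError(f"U+{cp:04X} is outside the BMP")
        out += struct.pack("<H", cp)
    return bytes(out)

print(encode_ucs2("A\u3075").hex(" "))   # two characters, two 16-bit values
print(encode_ucs2("\ud800").hex(" "))    # a surrogate value is stored without complaint
```

Note that the second call succeeds even though U+D800 is not a character, which is the quirk described above.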

UTF-16 (UTF = Unicode Transformation Format) builds on UCS-2. Indeed, for characters on the BMP, UTF-16 is identical to UCS-2; the difference lies in planes above the BMP. UTF-16 is capable of representing characters in planes 1 through 16 (even though most of those planes are still unassigned) with a surrogate pair, that is, it uses two 16-bit values from a reserved range of the BMP (U+D800 to U+DFFF) to store a single character. This means that you can't measure the length of a UTF-16 string in characters by counting how many 16-bit values it has!
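The surrogate pair arithmetic can be sketched in a few lines of Python. The function names `to_surrogate_pair` and `utf16_units` are invented for this example; the `"utf-16-le"` codec is part of the standard library.

```python
from typing import Tuple

def to_surrogate_pair(cp: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 high/low surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only planes 1 through 16 need a surrogate pair")
    offset = cp - 0x10000              # 20 bits remain after removing the BMP
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def utf16_units(text: str) -> int:
    """Number of 16-bit values needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2

text = "A\U0002000B"                   # "A" plus one plane 2 ideograph
high, low = to_surrogate_pair(0x2000B)
print(f"{high:04X} {low:04X}")
print(len(text), utf16_units(text))    # characters vs. 16-bit values
```

The last line prints two different numbers for the same string, which is exactly why counting 16-bit values does not give you the number of characters.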

UTF-8 is similar to UTF-16 in that it is a variable-length encoding, but it works in bytes rather than 16-bit values. The original design allowed sequences of anything from 1 to 6 bytes, although no Unicode character is mapped to anything longer than 4 bytes in UTF-8. Similarly to how UTF-16 is identical to UCS-2 for the BMP, UTF-8 is identical to ASCII for the ASCII range (U+0000 to U+007F), making it "backwards compatible" with software that isn't Unicode-aware. It also means that text that is mostly composed of ASCII characters (such as, say, ASS subtitles) will be much shorter as UTF-8 than as other Unicode encodings. That's why Aegisub uses UTF-8 as its standard format.
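To see the variable length in practice, this Python sketch encodes one character from each length class. The sample characters and the helper name `utf8_length` are chosen just for this illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))

# One sample per length class: ASCII, Latin-1 range, BMP, beyond the BMP
samples = ["A", "\u00e9", "\u3075", "\U0002000B"]
for ch in samples:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")

# For pure ASCII text, the UTF-8 bytes are the ASCII bytes
print("Dialogue".encode("utf-8") == "Dialogue".encode("ascii"))
```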

Regardless of encoding differences, UTF-8 and UTF-16 can both represent ANY Unicode character. UTF-16 can sometimes be shorter than UTF-8, but that will, in practice, never be the case for ASS subtitles, even if they are entirely written in Japanese/Chinese, due to all the ASCII text involved in the format syntax.
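A quick way to check the size claim is to encode the same text both ways and compare. The dialogue line below is a made-up sample in the usual ASS event layout, not taken from a real script, and `sizes` is a name invented for this sketch.

```python
from typing import Tuple

def sizes(text: str) -> Tuple[int, int]:
    """Return (UTF-8 size, UTF-16 size) in bytes, ignoring any byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

japanese = "こんにちは、世界"
# Made-up sample line: ASCII format syntax followed by Japanese text
ass_line = "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,," + japanese

print("bare Japanese text:", sizes(japanese))   # UTF-16 wins here
print("full ASS line:     ", sizes(ass_line))   # UTF-8 wins here
```

For this sample, the ASCII fields far outnumber the Japanese characters, so the one-byte-per-character saving on the syntax outweighs the extra byte UTF-8 spends on each kana or kanji.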

18 comments:

  1. I am using Aegisub to translate to Hebrew,
    and when I open the file in Subtitle Workshop it turns into gibberish. *sigh*

    I use "Export Subtitle... [as Local]" so Subtitle Workshop will be able to read the ASS file.
  2. Why have the latest versions not included the Spanish language?

    I would also like to say thank you and good luck.
  3. @acro: That's because Subtitle Workshop doesn't support Unicode, so it misinterprets the UTF-8 that Aegisub generates as your local charset.

    @yANyZx: Because nobody has updated the Spanish translation to support the new version.
  4. Are you talking about the Aegisub2 Spanish translation? A long time ago I sent it to one of the programmers on IRC (can't remember if it was jfs...)

    Anyway, I can post it on my site if anyone needs it.
  5. Can I add subtitles in the Khmer (Cambodian) language? Can Aegisub read Khmer characters? I am working on converting Khmer subtitles from SRT to SUB/IDX, and this program is new to me.

If you need help with Aegisub or have a bug report please use our forum instead of leaving a comment here. If you have a feature request, please go to our UserVoice page.

You will get better help on our forum than in the blog comments.