I have noticed that lots of people have no idea what exactly "Unicode", "UTF-8", "UTF-16" and "UCS-2" are, aside from the fact that they're somehow related to displaying foreign characters. The objective of this post is to briefly explain them and dispel some of the myths associated with them.
Unicode is a character coding system that can represent characters from many languages (including Japanese and Chinese) without the need to change your language locale. If you've tried writing Kanji in Medusa, you know what I'm talking about. In Unicode, each character is given a unique number, called a code point. For example, the capital letter "A" is U+0041 (65 in decimal), and the Hiragana "ふ" is U+3075 (12405 in decimal). Code points are divided into planes of 65536 each for convenience. Almost all common characters are in plane 0 (also known as the Basic Multilingual Plane, or BMP), which goes from code points U+0000 to U+FFFF. Nearly all kanji in actual use are in plane 0; the rarer ones are in plane 2.
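To make the numbers above concrete, here is a minimal Python sketch. The helper name `plane_of` is made up for this example, and the third test character (U+2000B, a rare kanji from CJK Extension B) was picked only to show something outside the BMP.

```python
def plane_of(char):
    """Return the Unicode plane of a single character (plane 0 is the BMP)."""
    # Each plane holds 65536 (0x10000) code points, so the plane number
    # is the code point divided by 0x10000, rounded down.
    return ord(char) // 0x10000


# "A", Hiragana "fu", and a plane-2 kanji written as an escape sequence.
for ch in ("A", "ふ", "\U0002000B"):
    print(f"U+{ord(ch):04X}", ord(ch), "plane", plane_of(ch))
```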
UTF-8, UTF-16 and UCS-2 are simply techniques used to encode those values as bytes, for example in text files. Windows helped create a myth that Unicode is UTF-16 by calling UTF-16 "Unicode" in applications such as Notepad, but the fact is that UTF-16 is as much Unicode as UTF-8 is.
UCS-2 (UCS = Universal Character Set) is an old encoding system that can store characters from the BMP by simply writing them as 16-bit values. The advantage of this system is that it's simple and covers most characters, but anything outside the BMP cannot be represented at all. That's why some Kanji have issues with some programs. An interesting consequence of UCS-2 is that it lets you store values that aren't characters at all, such as the code points reserved for surrogates, U+D800 to U+DFFF (see the next paragraph).
UTF-16 (UTF = Unicode Transformation Format) builds on UCS-2. Indeed, for characters on the BMP, UTF-16 is identical to UCS-2; the difference lies in planes above the BMP. UTF-16 is capable of representing characters in planes 1 through 16 (even though most of those planes have no characters assigned yet) with a surrogate pair, that is, it uses two 16-bit values to store a character. This means that you can't measure the length of a UTF-16 string in characters by counting how many 16-bit values it has!
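As a rough sketch of how a surrogate pair is built, the Python below splits a code point above U+FFFF into its two 16-bit values using the standard UTF-16 arithmetic. The function names `to_surrogate_pair` and `utf16_code_units` are made up for this example.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP use a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


def utf16_code_units(text):
    """Count the 16-bit values needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


text = "A\U0002000B"                     # "A" plus one plane-2 kanji
print([hex(v) for v in to_surrogate_pair(0x2000B)])
print(len(text))                         # 2: Python counts code points
print(utf16_code_units(text))            # 3: the kanji takes two 16-bit values
```

The last two lines show the pitfall: the string has two characters but three 16-bit values.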
UTF-8 is a variable-length encoding like UTF-16, but built from 8-bit units. The original design allowed a character to take anything from 1 to 6 bytes, but since Unicode stops at U+10FFFF, no character needs more than 4 bytes in UTF-8. Similarly to how UTF-16 is identical to UCS-2 for the BMP, UTF-8 is identical to ASCII for the ASCII range (U+0000 to U+007F), making it "backwards compatible" with software that isn't Unicode-aware. It also means that text that is mostly composed of ASCII characters (such as, say, ASS subtitles) will be much smaller as UTF-8 than as other Unicode formats. That's why Aegisub uses UTF-8 as its standard format.
Regardless of encoding differences, UTF-8 and UTF-16 can both represent ANY Unicode character. UTF-16 can sometimes be shorter than UTF-8, but that will, in practice, never be the case for ASS subtitles, even if they are entirely written in Japanese/Chinese, due to all the ASCII text involved in the format syntax.
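The size difference is easy to check for yourself. Here is a small Python sketch; the helper `encoded_sizes` and the sample subtitle line are made up for illustration, not taken from a real script.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Pure Japanese text: 3 bytes per kana in UTF-8, 2 bytes in UTF-16.
print(encoded_sizes("こんにちは"))

# A made-up ASS dialogue line: the ASCII syntax around the text costs
# 1 byte per character in UTF-8 but 2 bytes per character in UTF-16.
line = "Dialogue: 0,0:00:01.00,0:00:03.00,Default,,0,0,0,,こんにちは"
print(encoded_sizes(line))
```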
Tuesday, October 14, 2008
Unicode, UTF-8, UTF-16, UCS-2 - In a Nutshell
Friday, October 10, 2008
Kanjimemo brainstorming and input request
漢字メモ?
[Text above added to try to attract some attention to this post]
Ever since I posted about Kanamemo, there have been quite a few requests for a "Kanjimemo", a tool based on the same idea, but for Kanji.
Even before, I had considered writing something like that... But I never started because I couldn't quite figure out all the details on how it'd work. In this post, I'll talk about some of the ideas that I had for it. If you're interested in a "Kanjimemo", please leave your feedback and suggestions in the comments!
Programming Language
First of all, I'm not sure which programming language to write it in. At first, I considered C++, since that would be the easiest for me and allow the maximum flexibility, at least as far as PCs are concerned. The problem is that I'm already fairly experienced with C++, and so it wouldn't be much of a learning experience (which is always a plus :)) unless I went for Direct3D.
Then I pondered about Java: with all the cell phones supporting J2ME, it seemed like a good idea - Kanjimemo on the go? Great! The real problem came when I realized that J2ME is *REALLY* limited - you often have less than 1 MB of heap memory available (!) for your application, which makes a program like Kanjimemo almost impossible to implement. I also lack a J2ME-enabled cell phone, so I couldn't even work on a J2ME port right away.
A few other languages crossed my mind. C# is something that I've always wanted to learn, but its cross-platform support is quite bad (I'm looking at you, Mono). It's also much slower than Java. Python is another "to learn" language, but I question the sanity of doing complicated data analysis in such a high-level, slow language... Plus all the horrible dependencies. Same goes for Ruby.
So, any thoughts on the "language barrier" might be useful.
Basics
On to how the program would ACTUALLY work... Learning kanji is nowhere near as easy as learning kana. The problem with kanji is that most of them have multiple (typically two) readings, depending on the word... but some (like 日, one of the most basic kanji) can have many more. So my idea is to have an algorithm that works like this:
- Select a group of five or so kanji for each level (like Kanamemo)
- For each of those kanji, mine EDICT for all words marked as [Common] that use it
- Perhaps attempt to extract the pronunciation of your kanji in that word? If that doesn't work, just go with individual words.
- Create a list of all the different unique pronunciations and associated words.
- Have the user learn all the unique pronunciations, preferably by using words that contain nothing but that kanji and kana.
- If there's no word with that kanji by itself, make sure that the user already "learned" all the other kanji in the word displayed.
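As a very rough sketch of the word-selection part of that list, the Python below takes the "just go with individual words" fallback and skips the pronunciation extraction. Everything here is an assumption for illustration: the function names, the `(word, reading, is_common)` tuple layout, and the tiny hand-written sample, which is not actual EDICT data or its real file format.

```python
def is_kana(ch):
    """True for Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF)."""
    return "\u3040" <= ch <= "\u30ff"


def candidate_words(entries, kanji, learned):
    """Pick common words that use `kanji` and no kanji the user hasn't learned.

    entries: iterable of (word, reading, is_common) tuples (hypothetical layout)
    learned: set of kanji the user already knows
    Words made of just the kanji plus kana come first.
    """
    candidates = []
    for word, reading, is_common in entries:
        if not is_common or kanji not in word:
            continue
        # Simplification: anything that is not kana counts as "another kanji".
        other_kanji = {ch for ch in word if ch != kanji and not is_kana(ch)}
        if other_kanji <= learned:
            candidates.append((len(other_kanji), word, reading))
    candidates.sort(key=lambda item: item[0])   # fewest extra kanji first
    return [(word, reading) for _, word, reading in candidates]


# Hand-written sample; the "common" flags are assumed, not looked up.
SAMPLE = [
    ("日", "ひ", True),
    ("日本", "にほん", True),
    ("毎日", "まいにち", True),
    ("日和", "ひより", False),
]

print(candidate_words(SAMPLE, "日", learned=set()))
print(candidate_words(SAMPLE, "日", learned={"本"}))
```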
Progression
Progression would work similarly to Kanamemo, with a new set of 5 kanji unlocked with each memorized set. Ideally, the user could choose profiles to control the order in which new kanji appear: perhaps follow the JLPT progression, or the Japanese school system progression, or how common a given kanji is, or a combination of them (e.g. start with all JLPT4 kanji sorted by frequency, then all JLPT3 sorted by frequency, etc). The user should also be able to supply a custom list of kanji to learn.
Given this system, it'd be possible to simply consider kana as being kanji, and have the program work in the same way for those, so you'd be entering actual Japanese words when learning kana. This has the advantage of making your Japanese reading skill progress.
Multiple fonts
One problem that I noticed with Kanamemo is that it was easy to just memorize the font glyph, as opposed to the more abstract shape of the kana. This could prove to be an issue with kana that look very different depending on how they're written (such as さ and ふ). This program would fix that problem by using different types of fonts (cursive, brush, type) randomly, or perhaps by forcing you to learn all the different variations before progressing.
Translation
Since the concept of the program is word-focused, it might feel strange to be learning how to read words without learning what they mean. If you're an anime watcher, then perhaps you already have a relatively big vocabulary of words, but you won't know all of them, and not everyone is an anime watcher. EDICT provides translations, but I'm not sure if just slapping the translations there will do any good... Thoughts on this?
Voice
Finally, it might be useful to have someone read the words out loud for you whenever you get them right. I'm not sure how hard it would be to add support for some third-party voice synthesizer, but it might be worth the trouble.
Other ideas
Perhaps the program should be designed to look more like a game? A little mascot cheering for you, a scrolling background, some background music? Perhaps this game could have multiple "stages" that you would do in alternating order: First learn to read the kanji, then what the word means, then perhaps a speed typing test? Maybe even a grammar test mode?
Development
Of course, what this needs the most right now are IDEAS! If you have any, please share them with us. If you know of somebody who might be interested in this sort of thing, link them to this page! If you want to help with the development itself, drop by IRC and let us know. The idea is that this should be an open, free project.
Wednesday, August 13, 2008
TrayDict: EDICT on your systray
A while ago, I posted about Kanamemo, a tool that I made to help me learn Hiragana and Katakana. Another tool that I've made to help me in the learning of the Japanese language was TrayDict:
TrayDict is a very simple application that sits on your systray until you bring it up with WinKey+Z. Then it lets you type a word in Kanji, Kana, Rōmaji or English, and it will search for it in EDICT (or any of the other supported dictionaries), returning every match, ordered by relevance.
The catch is: it's a complete hack. I made it in just a few days, without worrying at all about how well it'd work - I just wanted a tool for personal use. Even though I'm posting the link to it here, here are some things that you'll have to keep in mind:
- It has a bug that prevents Windows from shutting down as long as it's running. Right-click its icon in the systray and choose "Exit" to terminate it.
- There are no options at all. The shortcut is WinKey+Z, and that can't be changed. You can also bring it up/down via systray.
- This is no longer maintained, so don't bother sending bug reports and/or suggestions.
Anyway, here is the Win32 binary: Link
The source is available here, but the HEAD is completely broken (result of my experimentation with Gecko), so you'll have to dig through the repository for an older version.
Thursday, July 24, 2008
Kanamemo: a tool for the apprentice weeaboo
Back in 2006, when I decided to learn Hiragana and Katakana, I looked around for flashcard programs to help me in my task. After finding that none of them actually worked as I thought that they SHOULD, I decided to roll my own. The result is Kanamemo:
It teaches you Hiragana and/or Katakana (your choice) in "levels". Each level typically contains 5 different kana. The program shows you a kana and asks you to enter its Hepburn rōmaji transliteration. If you get it right, that kana gains 1 point. If you get it wrong, it loses 10 points, AND so does the kana that you confused it with. Once all kana of a given level have a score of at least 5 (or so, I don't remember the exact rules), you've learned them and the next level unlocks.
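The scoring rules described above could be sketched like this in Python. This is not Kanamemo's actual code: the names are made up, the threshold of 5 is the "or so" value from the description, and letting scores go negative is an assumption.

```python
UNLOCK_SCORE = 5   # "at least 5 (or so)": the exact rule is not known


def record_answer(scores, shown, answered):
    """Update `scores` after the user answers `answered` for the kana `shown`.

    scores:   dict mapping each unlocked kana to its current score
    answered: the kana whose romaji the user typed, or None if the input
              did not match any kana
    """
    if answered == shown:
        scores[shown] += 1
    else:
        scores[shown] -= 10
        if answered in scores:
            scores[answered] -= 10   # the kana it was confused with also loses


def level_cleared(scores, level_kana):
    """True once every kana of the level has reached the unlock score."""
    return all(scores[kana] >= UNLOCK_SCORE for kana in level_kana)


scores = {"あ": 0, "お": 0}
record_answer(scores, "あ", "あ")    # correct answer
record_answer(scores, "あ", "お")    # confused with another kana
print(scores, level_cleared(scores, ["あ", "お"]))
```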
It also never stops flashing old kana to you, but the probability of a given kana being picked is inversely proportional to how good you are at it - that way, it makes sure that you don't forget the ones that you learned earlier, while focusing on the ones that you struggle with.
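One way to get that kind of selection is a weighted random pick. The Python sketch below is an illustration under assumptions, not the actual Kanamemo formula: the weight `1 / (score + 1)`, the clamping of negative scores to zero, and the function names are all invented here.

```python
import random


def pick_weights(scores):
    """Give each kana a weight inversely proportional to its score.

    Scores at or below zero are clamped to 0, so the weakest kana
    get the maximum weight of 1.0 and division by zero is avoided.
    """
    return {kana: 1.0 / (max(score, 0) + 1) for kana, score in scores.items()}


def pick_kana(scores, rng=random):
    """Pick the next kana to show, favoring the ones with low scores."""
    weights = pick_weights(scores)
    kana = list(weights)
    return rng.choices(kana, weights=[weights[k] for k in kana], k=1)[0]


scores = {"あ": 0, "い": 9, "う": -10}
print(pick_weights(scores))
print(pick_kana(scores))
```

With these weights, a kana at score 9 is shown about a tenth as often as one at score 0, but it never disappears completely.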
I found that it works exceptionally well: I learned to read all of Hiragana and Katakana in 2 days, and if you're particularly diligent, you can probably do it in a day.
You can download a Win32 binary for it here. Please post a comment if you have any trouble with the runtimes (i.e. missing DLL errors, or "application hasn't been installed properly").
The source code for the program has been available at the Aegisub repository for a while. Here's a link.
I have been meaning to write a similar tool for kanji+words (by mining data from EDICT and KANJIDIC), but my sloth has been preventing me from doing so.
[EDIT] If you want to build it natively on Linux, see this.









