What is Unicode?
Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.
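For example, the single byte value 0xE4 denotes a different character under each of several legacy encodings, and the same character receives different byte values in different encodings. A short Python sketch (standard library codecs only) makes the conflict concrete:

    # One byte, three different characters, depending on the assumed encoding.
    raw = b"\xe4"
    print(raw.decode("latin-1"))    # 'ä' (ISO 8859-1, Western European)
    print(raw.decode("iso8859-7"))  # 'δ' (ISO 8859-7, Greek)
    print(raw.decode("cp1251"))     # 'д' (Windows-1251, Cyrillic)

    # Conversely, the same character is assigned different numbers.
    print("ä".encode("latin-1"))    # b'\xe4'
    print("ä".encode("cp437"))      # b'\x84' (original IBM PC code page)

Data labeled with the wrong encoding is silently misread, which is precisely the corruption described above.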
Unicode is changing all that!
Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.
Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
About the Unicode Consortium
The Unicode Consortium is a non-profit organization founded to develop, extend and promote use of the Unicode Standard, which specifies the representation of text in modern software products and standards. The membership of the consortium represents a broad spectrum of corporations and organizations in the computer and information processing industry. The consortium is supported financially solely through membership dues. Membership in the Unicode Consortium is open to organizations and individuals anywhere in the world who support the Unicode Standard and wish to assist in its extension and implementation.
For more information, see the Glossary, Unicode Enabled Products, Technical Introduction and Useful Resources.
Input methods
Because keyboard layouts cannot have simple key combinations for all characters, several operating systems provide alternative input methods that allow access to the entire repertoire.
In Microsoft Windows (since Windows 2000), the "Character Map" program (Start/Programs/Accessories/System Tools/Character Map) provides rich-text editing controls for all Unicode characters up to U+FFFF, by selection from a drop-down table, assuming that a Unicode font is selected. Word processing programs such as Microsoft Word have a similar control embedded (Insert/Symbol). More laboriously, where the code point of the desired character is known, it is possible to create Unicode characters by pressing Alt + PLUS + #, where # represents the hexadecimal code point up to FFFF; for example, Alt + PLUS + F1 will produce the Unicode character ñ. (This requires the EnableHexNumpad registry value to be set.) It also works in many other Windows applications, but not in applications that use the standard Windows edit control and make no special provision for this type of input. See Alt codes. To add Unicode characters to chart titles in Microsoft Excel, first type the title text into a worksheet cell, where the (Insert/Symbol) control can be used; the resulting text can then be cut and pasted into chart titles.
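The mapping between a hexadecimal code point and its character is easy to verify programmatically; a minimal Python check of the example above:

    import unicodedata

    # chr() maps a code point to its character; ord() is the inverse.
    print(chr(0x00F1))                    # 'ñ'
    print(unicodedata.name(chr(0x00F1)))  # 'LATIN SMALL LETTER N WITH TILDE'
    print(hex(ord("ñ")))                  # '0xf1'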
Apple Macintosh users have a similar feature in Mac OS 8.5 and later, including Mac OS X: an input method called 'Unicode Hex Input'. Hold down the Option key and type the four-hex-digit Unicode code point. Inputting code points above U+FFFF is done by entering surrogate pairs; the software converts each pair into a single character automatically. Mac OS X (version 10.2 and newer) also has a 'Character Palette', which allows users to visually select any Unicode character from a table organized numerically, by Unicode block, or by a selected font's available characters. The 'Unicode Hex Input' method must be activated in the International System Preferences in Mac OS X, or in the 'Keyboard' Control Panel in Mac OS 8.5 and later. Once activated, it must also be selected in the Keyboard menu (designated by the flag icon) before a Unicode code point can be entered.
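The arithmetic behind surrogate pairs is simple. The following Python sketch (the helper name is purely illustrative, and U+1D11E MUSICAL SYMBOL G CLEF is an arbitrary example) shows how a code point above U+FFFF splits into two:

    def to_surrogate_pair(cp):
        """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
        assert cp > 0xFFFF
        offset = cp - 0x10000            # a 20-bit value
        high = 0xD800 + (offset >> 10)   # top 10 bits -> high (lead) surrogate
        low = 0xDC00 + (offset & 0x3FF)  # low 10 bits -> low (trail) surrogate
        return high, low

    print([hex(x) for x in to_surrogate_pair(0x1D11E)])  # ['0xd834', '0xdd1e']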
GNOME provides a 'Character Map' utility (Applications/Accessories/Character Map) which displays characters ordered by Unicode block or by writing system, and allows searching by character name or extended description. Where the character's code point is known, it can be entered in accordance with ISO 14755: hold down Ctrl and Shift and enter the hexadecimal Unicode value, preceded by the letter U in GNOME 2.15 or later. The input code is a UTF-32 value; for example, typing Ctrl+Shift+100050 produces a character in Unicode private-use plane 16.
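Since each plane holds 0x10000 code points, the plane number is simply the code point shifted right by 16 bits, which is why U+100050 lands in plane 16; a one-line Python check:

    # Plane = code point >> 16; plane 16 is the Supplementary Private Use Area-B.
    print(0x100050 >> 16)  # 16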
At the X Input Method or GTK+ Input Module level, the input method editor SCIM provides a “raw code” input method to allow the user to enter the 4-digit hexadecimal Unicode value.
The Linux console allows Unicode characters to be entered by holding down Alt and typing the decimal code on the numeric keypad. (For this to work, the console should be placed in Unicode mode with unicode_start(1) and a suitable font selected with setfont(8).) The AltGr key allows the hexadecimal code to be entered instead, with the keypad keys from NumLock to Enter (read clockwise) standing for the digits A-F. ISO 14755-compliant input (Ctrl+Shift plus the hexadecimal code on the normal keys) is also available in the unicode keymap.
The Opera web browser, in version 7.5 and later, allows users to enter any Unicode character directly into a text field by typing its hexadecimal code and pressing Alt + x.
To input a Unicode character into a text box in Mozilla Firefox on Linux, type the hexadecimal character code while holding down the Ctrl and Shift keys.
In the Vim text editor, Unicode characters can be entered by pressing CTRL-V and then entering a key combination; for example, an em dash can typically be entered by typing CTRL-V, then "u2014". For more information, type ":help i_CTRL-V_digit" in Vim. (Note that the entered text will be Unicode only if the current encoding is set to UTF-8 or another Unicode encoding; type ":help encoding" in Vim for details.) Many Unicode characters can also be entered using digraphs; a table of such characters and their corresponding digraphs can be obtained with the ":digraphs" command (again, provided the current encoding is set to Unicode).
WordPad and Word 2002/2003 for Windows additionally allow Unicode characters to be entered by typing the hexadecimal code point, for example 014B for ŋ, and then pressing Alt + x to replace the string to the left with its Unicode character. Usefully, the reverse also applies: if the user places the cursor to the right of a non-ASCII character and presses Alt + x, the Microsoft software replaces the character with its hexadecimal Unicode code point. (Note that the key combination may vary; in the French version of Windows it is Alt + c.)
Several visual keyboards are available that make entering Unicode characters and symbols very easy.
Issues
East Asia
Some parties in Japan oppose Unicode and ISO/IEC 10646-1 in general, claiming technical limitations. People working on the Unicode standard regard such claims simply as misunderstandings of the Unicode standard and of the process by which it has evolved. The most common mistake, according to this view, involves confusion between abstract characters and their highly variable visual forms (glyphs). The next most common source of confusion is attributing to Unicode decisions made by earlier national standards organizations, which Unicode was not in a position to undo.
For example, it is claimed that, contrary to its policy, the Unicode Standard includes characters that differ only stylistically, such as ligatures and some accented letters. In fact, including such forms when they appear in legacy character encodings is a requirement of the Source Separation rule: one-to-one mappings must be provided between characters in existing legacy character sets and characters in Unicode, to facilitate lossless conversion to Unicode. Examples from ISO 8859-1 include æ, à, á, â, ã, ä, and å.
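The practical effect of the rule is that conversion between a legacy character set and Unicode round-trips without loss; a brief Python illustration with ISO 8859-1:

    # Each ISO 8859-1 byte has its own Unicode code point, so conversion
    # to Unicode and back loses nothing.
    legacy = bytes(range(0xE0, 0xE8))        # àáâãäåæç in ISO 8859-1
    text = legacy.decode("latin-1")          # convert to Unicode
    assert text.encode("latin-1") == legacy  # round-trips exactly
    print(text)                              # 'àáâãäåæç'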
Some Japanese computer programmers object to Unicode because it requires them to separate the use of '\' U+005C REVERSE SOLIDUS (backslash) and '¥' U+00A5 YEN SIGN, which was mapped to 0x5C in JIS X 0201; a great deal of legacy code depends on that mapping. (JIS X 0201 also replaces the tilde '~' 0x7E with the overline '¯', now 0xAF.) The separation of these characters has existed in ISO 8859-1 since long before Unicode.
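A hand-written decoder sketch (not a standard library codec; the names here are purely illustrative) shows the JIS X 0201 convention and why file paths on Japanese systems were traditionally displayed with yen signs:

    # Per JIS X 0201, byte 0x5C is the yen sign and 0x7E is an overline,
    # where ASCII has '\' and '~'. (The half-width katakana in the
    # 0xA1-0xDF range is omitted from this sketch.)
    JISX0201_OVERRIDES = {0x5C: "\u00A5", 0x7E: "\u203E"}

    def decode_jisx0201(data):
        return "".join(JISX0201_OVERRIDES.get(b, chr(b)) for b in data)

    print(decode_jisx0201(b"C:\\DIR\\FILE"))  # 'C:¥DIR¥FILE'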
Some have decried Unicode as a plot against Asian cultures perpetrated by Westerners with no understanding of the characters as used in Chinese, Korean, and Japanese, despite the fact that a majority of the experts in the Ideographic Rapporteur Group (IRG) come from those three regions. The IRG advises the Consortium and ISO on additions to the repertoire and on Han unification, the identification of forms in the three languages which can be treated as stylistic variations of the same historical character. Han unification has become one of the most controversial aspects of Unicode.
Unicode has also been criticized for failing to allow for older and alternative forms of kanji, which, critics argue, complicates the processing of ancient Japanese text and of uncommon Japanese names, although it follows the recommendations of Japanese language scholars and of the Japanese government. There have been several attempts to create an alternative to Unicode, among them TRON (which, although not widely adopted in Japan, is favored by some who need to handle historical Japanese text) and UTF-2000.
It is true that many older forms were not included in early versions of the Unicode standard, but Unicode 4.0 contains more than 70,000 Han characters, far more than any dictionary or any other standard, and work continues on adding characters from the early literature of China, Korea, and Japan. Some argue, however, that this is not satisfactory, pointing for example to the need to create new characters representing words in various Chinese dialects, more of which may be invented in the future.
An alternative approach, pursued by people such as Chu Bong-Foo, uses an encoding that carries information about the components of Han characters. For example, a Chinese computing system Chu built in 1991 already supported 60,000 Han characters while using only 80 KB of memory to generate glyphs from raw Cangjie codes. The argument against Unicode here is that its approach to Han characters amounts to assigning every English word its own code.
Despite these objections, the official character encoding standard of China, GB 18030, supports the same characters as Unicode, although it encodes them in a slightly different form.
Southeast Asia
Support for the Thai language has been criticized for its illogical ordering of Thai characters: the vowels written to the left of a consonant are stored before that consonant, in visual rather than phonetic order. This complication arises because Unicode inherited the arrangement of Thai Industrial Standard 620 (TIS-620), which worked the same way. The ordering complicates the Unicode collation process slightly, requiring table lookups to reorder characters before comparing them.
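A simplified Python sketch shows the kind of reordering involved (the real rules live in the Unicode Collation Algorithm's tables; the helper name here is illustrative): the preposed vowels U+0E40..U+0E44 must be swapped with the following consonant before comparison.

    # Thai preposed vowels are stored before the consonant they phonetically
    # follow; building a sort key means swapping each such pair first.
    PREPOSED = {chr(cp) for cp in range(0x0E40, 0x0E45)}

    def thai_sort_key(word):
        chars = list(word)
        for i in range(len(chars) - 1):
            if chars[i] in PREPOSED:
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
        return "".join(chars)

    # 'เก' is stored as [SARA E, KO KAI] but sorts as if written KO KAI + SARA E.
    print([hex(ord(c)) for c in thai_sort_key("\u0E40\u0E01")])  # ['0xe01', '0xe40']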
Indic scripts
Indic scripts such as Devanagari (used for Hindi) and Telugu are each allocated only 128 code points, matching the ISCII standard. Rendering Unicode Indic text correctly requires transforming the characters from their stored logical order into visual order, and forming ligatures out of components. Some local scholars have argued in favor of assigning Unicode code points to these ligatures, which goes against the practice for all other writing systems, including other Indic scripts and Arabic. Encoding any new ligatures in Unicode will not happen, in part because the set of ligatures is font-dependent, and Unicode is an encoding independent of font variation. The same kind of issue arose for the Tibetan script, where even the Chinese national standards organization failed to achieve a similar change. These problems are rightly the domain of rendering engines (Pango, Apple ATSUI, Microsoft Uniscribe, and others) and fonts; font developers have to learn to create the OpenType tables needed to handle reordering and complex ligatures.
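The logical-order model is easy to see in code. The Devanagari syllable कि (ki), for instance, stores its consonant first even though the vowel sign is drawn to its left; a small Python demonstration:

    import unicodedata

    # 'कि' (ki) in stored, logical order: the vowel sign follows the consonant
    # in memory although it is rendered to the LEFT of it.
    for ch in "\u0915\u093F":
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0915 DEVANAGARI LETTER KA
    # U+093F DEVANAGARI VOWEL SIGN I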
Some opponents of Unicode continue to claim that it cannot handle more than 65,536 characters, even though this limitation was removed in Unicode 2.0. Using either the surrogate mechanism or the equivalent variable-length UTF-8 encoding provides for 17 planes of 65,536 code points each, of which only a small number (the surrogates and a few noncharacters) are designated as unavailable for encoding characters. The most generous estimate of possible need suggests an upper bound of about a quarter of a million characters for all known languages and writing systems, while Unicode's code space holds well over a million code points.
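The code-space arithmetic is easy to verify:

    planes = 17
    per_plane = 0x10000               # 65,536 code points per plane
    total = planes * per_plane        # 1,114,112 code points in all
    surrogates = 0xE000 - 0xD800      # 2,048 code points reserved for UTF-16
    print(total, total - surrogates)  # 1114112 1112064 encodable scalar values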