Charsets and Unicode Identifiers in Java
Char Data Type
The char data type in a programming language represents a unit of text. Text data is represented as a sequence of characters, and a value of the char data type is simply the numeric value assigned to a character in a character set.
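As a small illustration of this idea, the Java sketch below prints the numeric value behind a character and turns a number back into a character. The class name CharAsNumber and the method numericValue are placeholders chosen for this example only.

```java
class CharAsNumber {
    // Returns the numeric value stored in a char.
    static int numericValue(char c) {
        return c; // a char widens to int without a cast
    }

    public static void main(String[] args) {
        char letter = 'A';
        System.out.println(numericValue(letter)); // numeric value of 'A'
        System.out.println((char) 66);            // character whose value is 66
    }
}
```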
Character Sets
What is a character set? A character set is a collection of units of text (characters), each of which is assigned a unique numeric value. There are various character sets available. The best known is ASCII (American Standard Code for Information Interchange), which defines only 128 characters (including the control characters), assigned to the numeric values 0 to 127. Not all characters in a character set are printable. Character sets also include control characters, e.g. carriage return, line feed, form feed, tab and bell. These have no printed form of their own; instead they affect the position of the next character or perform some similar function.
In the early days of computing, each OS supported a particular character set, most commonly ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code). Later, ASCII and its extensions were adopted by most OSs. The ASCII character set included only the commonly used Latin characters and control characters. Each extension of ASCII catered to the text-representation needs of a particular region or culture. Most of these extensions used the numeric values from 128 to 255 for the additional, region-specific characters.
Common examples of character sets that extend ASCII are those of the ISO 8859 series: 8859-1 (Latin-1) for Western European languages, 8859-2 (Latin-2) for Eastern European, 8859-3 (Latin-3) for Southern European, 8859-4 (Latin-4) for Northern European, 8859-5 (Cyrillic) for Russian and Bulgarian, 8859-6 for Arabic, 8859-7 for Greek, 8859-8 for Hebrew, 8859-9 (Latin-5) for Turkish, 8859-10 (Latin-6) for Northern European, 8859-11 for Thai, 8859-13 (Latin-7) for Baltic, 8859-14 (Latin-8) for Celtic, 8859-15 (Latin-9) for Western European and 8859-16 (Latin-10) for Eastern European.
There are other extensions of ASCII as well. ISO 8859-12 was reserved for the Devanagari script, but it was abandoned. The ISO 8859 series of character sets is summarized in Table 1. To cater to the Indic scripts, another extension of ASCII called ISCII (Indian Script Code for Information Interchange) was developed in 1988 and later revised in 1991.
ISO Number | Name | Region/Languages |
---|---|---|
8859-1 | Latin-1 | Western European |
8859-2 | Latin-2 | Eastern European |
8859-3 | Latin-3 | Southern European |
8859-4 | Latin-4 | Northern European |
8859-5 | Cyrillic | Russian, Bulgarian |
8859-6 | Arabic | Arabic |
8859-7 | Greek | Greek |
8859-8 | Hebrew | Hebrew |
8859-9 | Latin-5 | Turkish |
8859-10 | Latin-6 | Northern European |
8859-11 | Thai | Thai |
8859-12 | Devanagari | Abandoned |
8859-13 | Latin-7 | Baltic |
8859-14 | Latin-8 | Celtic |
8859-15 | Latin-9 | Western European |
8859-16 | Latin-10 | Eastern European |

Table – 1: ISO 8859 series of character sets summarized
All the character sets mentioned above are extensions of ASCII, i.e. they have the same characters as ASCII in the range 0 to 127. A quick summary of the ASCII characters is given in Table 2.
Problems in having multiple character sets
The char data type in many programming languages simply relied on the OS's interpretation of the numeric value of the char data, i.e. if the OS used a different character set, the same numeric value would be interpreted differently. For example, in Latin-1 (8859-1) the value EB (hex) represents the character ë, whereas in Greek (8859-7) the same value EB (hex) represents the character λ.
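The ambiguity described above can be reproduced in Java by decoding the same byte with two different charsets. The following is only a sketch: the class name SameByteDifferentCharsets and the helper decode are made up for this example, and because ISO-8859-7 is not among the charsets every Java platform is required to provide, the code checks for it before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class SameByteDifferentCharsets {
    // Decodes a single byte using the given charset.
    static String decode(byte b, Charset charset) {
        return new String(new byte[] { b }, charset);
    }

    public static void main(String[] args) {
        byte value = (byte) 0xEB;

        String latin1 = decode(value, StandardCharsets.ISO_8859_1);
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0));

        // ISO-8859-7 is optional on a Java platform, so check for it first.
        if (Charset.isSupported("ISO-8859-7")) {
            String greek = decode(value, Charset.forName("ISO-8859-7"));
            System.out.printf("ISO-8859-7: U+%04X%n", (int) greek.charAt(0));
        }
    }
}
```

The code points are printed in hex rather than as glyphs so that the result does not depend on the encoding of the terminal.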
Value in Hex | Character(s) Description |
---|---|
00 – 1F | Control Characters |
20 | Space |
21 – 2F | Punctuation ! " # $ % & ' ( ) * + , - . / |
30 – 39 | Digits 0 – 9 |
3A – 40 | Punctuation : ; < = > ? @ |
41 – 5A | Uppercase Letters A – Z |
5B – 60 | Punctuation [ \ ] ^ _ ` |
61 – 7A | Lowercase Letters a – z |
7B – 7E | Punctuation { \| } ~ |
7F | Control Character |

Table – 2: ASCII character set summarized
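The ranges in the ASCII summary can be cross-checked with the classification methods of java.lang.Character. This is a minimal sketch; the class name AsciiSummary and the method countControlCharacters are placeholders for this example.

```java
class AsciiSummary {
    // Counts the control characters among the 128 ASCII values.
    static int countControlCharacters() {
        int count = 0;
        for (int value = 0; value <= 0x7F; value++) {
            if (Character.isISOControl(value)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("Control characters: " + countControlCharacters());
        System.out.println("0x41 is upper case: " + Character.isUpperCase(0x41));
        System.out.println("0x61 is lower case: " + Character.isLowerCase(0x61));
        System.out.println("0x39 is a digit:    " + Character.isDigit(0x39));
    }
}
```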
Unicode
To solve this problem, a universal character set was designed in the form of Unicode. The first version of Unicode was introduced in 1991. The Unicode character set was designed to include all the characters of all the languages and scripts of the world. It is revised from time to time to add new characters from various regions, as well as languages and scripts that were not included in earlier versions. The character set is designed to use numeric values from 0 to 10FFFF (hex). It is also an extension of ASCII, so the values from 0 to 127 are the same as in ASCII. Most of the Indian scripts have been given a block of 128 characters each, starting from 0x0900. The Unicode blocks for the Indian scripts are based on ISCII 1988 and not on ISCII 1991. Unicode has no provision for removing or reassigning a character once it has been encoded, so newer versions can only add new characters or deprecate existing ones. The blocks for the South, Central and Southeast Asian scripts in Unicode are summarized in Tables 3 to 7.
Value Range in Hex | Script |
---|---|
0900 – 097F | Devanagari |
A8E0 – A8FF | Devanagari Extended |
1CD0 – 1CFF | Vedic Extensions |
0980 – 09FF | Bengali |
0A00 – 0A7F | Gurmukhi |
0A80 – 0AFF | Gujarati |
0B00 – 0B7F | Oriya |
0B80 – 0BFF | Tamil |
11FC0 – 11FFF | Tamil Supplement |
0C00 – 0C7F | Telugu |
0C80 – 0CFF | Kannada |
0D00 – 0D7F | Malayalam |

Table – 3: Official Scripts of India in Unicode
Value Range in Hex | Script |
---|---|
0780 – 07BF | Thaana |
0D80 – 0DFF | Sinhala |
11400 – 1147F | Newa |
0F00 – 0FFF | Tibetan |
1800 – 18AF | Mongolian |
11660 – 1167F | Mongolian Supplement |
1900 – 194F | Limbu |
ABC0 – ABFF | Meetei Mayek |
AAE0 – AAFF | Meetei Mayek Extensions |
16A40 – 16A6F | Mro |
118A0 – 118FF | Warang Citi |
1C50 – 1C7F | Ol Chiki |
11100 – 1114F | Chakma |
1C00 – 1C4F | Lepcha |
A880 – A8DF | Saurashtra |
11D00 – 11D5F | Masaram Gondi |
11D60 – 11DAF | Gunjala Gondi |
1E2C0 – 1E2FF | Wancho |

Table – 4: Other Modern Scripts of South and Central Asia in Unicode
Value Range in Hex | Script |
---|---|
11000 – 1107F | Brahmi |
10A00 – 10A5F | Kharoshthi |
11C00 – 11C6F | Bhaiksuki |
A840 – A87F | Phags-pa |
11C70 – 11CBF | Marchen |
10C00 – 10C4F | Old Turkic |
11A50 – 11AAF | Soyombo |
11A00 – 11A4F | Zanabazar Square |
10F00 – 10F2F | Old Sogdian |
10F30 – 10F6F | Sogdian |

Table – 5: Ancient Scripts of South and Central Asia in Unicode
Value Range in Hex | Script |
---|---|
A800 – A82F | Syloti Nagri |
11080 – 110CF | Kaithi |
11180 – 111DF | Sharada |
11680 – 116CF | Takri |
11580 – 115FF | Siddham |
11150 – 1117F | Mahajani |
11200 – 1124F | Khojki |
112B0 – 112FF | Khudawadi |
11280 – 112AF | Multani |
11480 – 114DF | Tirhuta |
11600 – 1165F | Modi |
119A0 – 119FF | Nandinagari |
11300 – 1137F | Grantha |
11900 – 1195F | Dives Akuru |
11700 – 1173F | Ahom |
110D0 – 110FF | Sora Sompeng |
11800 – 1184F | Dogra |

Table – 6: Other Historic Scripts of South and Central Asia in Unicode
Value Range in Hex | Script |
---|---|
0E00 – 0E7F | Thai |
0E80 – 0EFF | Lao |
1000 – 109F | Myanmar |
AA60 – AA7F | Myanmar Extended-A |
A9E0 – A9FF | Myanmar Extended-B |
1780 – 17FF | Khmer |
19E0 – 19FF | Khmer Symbols |
1950 – 197F | Tai Le |
1980 – 19DF | New Tai Lue |
1A20 – 1AAF | Tai Tham |
AA80 – AADF | Tai Viet |
A900 – A92F | Kayah Li |
AA00 – AA5F | Cham |
16B00 – 16B8F | Pahawh Hmong |
1E100 – 1E14F | Nyiakeng Puachue Hmong |
11AC0 – 11AFF | Pau Cin Hau |
10D00 – 10D3F | Hanifi Rohingya |

Table – 7: Scripts of Southeast Asia in Unicode
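Java can report which Unicode block a code point belongs to, which offers a way to spot-check block tables such as these. The sketch below uses Character.UnicodeBlock.of; the class name BlockLookup, the helper blockName and the sample code points are chosen only for illustration, and which blocks are recognised depends on the Unicode version supported by the running JDK.

```java
class BlockLookup {
    // Returns the name of the Unicode block containing the code point,
    // or "unassigned" if the running JDK knows no block for it.
    static String blockName(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "unassigned" : block.toString();
    }

    public static void main(String[] args) {
        int[] samples = { 0x0905, 0x0B85, 0x0E01, 0x11000 };
        for (int cp : samples) {
            System.out.printf("U+%04X -> %s%n", cp, blockName(cp));
        }
    }
}
```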
What is the size of char in C? In C, sizeof(char) is 1 byte by definition, and that byte is 8 bits wide on most platforms. The size of char in Java is 2 bytes. It is an unsigned 16-bit integral value, used for representing a UTF-16 code unit.
Why is the size of a char 2 bytes in Java? In C, char represents a character from the platform's local character set, which in most cases is some extension of ASCII. Most of these character sets have at most 256 characters, so they need only 1 byte. In Java, the char type represents characters from the Unicode character set using the UTF-16 encoding, whose code units are 16 bits wide. The details of the UTF encodings are given in the article on UTF.
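The size and range of Java's char can be read from constants on java.lang.Character. A minimal sketch (the class name CharSize is a placeholder, and Character.BYTES requires Java 8 or later):

```java
class CharSize {
    public static void main(String[] args) {
        System.out.println("bits per char:  " + Character.SIZE);
        System.out.println("bytes per char: " + Character.BYTES);
        // char is unsigned, so its range is 0 to 65535.
        System.out.println("min value: " + (int) Character.MIN_VALUE);
        System.out.println("max value: " + (int) Character.MAX_VALUE);
    }
}
```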
Let us understand Unicode. Unicode is a character set that has characters from all the languages of the world. There are various versions of the Unicode character set; at the time of this writing, the version was 13.0. The Unicode standard maps each character to a unique code-point value in the range 0 to 10FFFF (hex). This code-point range is divided into 17 planes of 65,536 (2^16) values each. The zeroth plane, i.e. the values 0 to FFFF (hex), is known as the BMP (Basic Multilingual Plane), and the other planes define the supplementary characters.
To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, a supplementary character is represented as a pair of 16-bit code units: the first from the high-surrogates range (D800 to DBFF hex) and the second from the low-surrogates range (DC00 to DFFF hex). The Unicode standard assigns no character to the code points D800 to DFFF (hex); they are reserved for surrogates. For characters in the range 0000 to FFFF (hex), the code-point value and the UTF-16 code unit are the same. The Java programming language represents text as sequences of 16-bit code units in the UTF-16 encoding, and its char type represents one such code unit.
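The surrogate mechanism can be observed with a few methods of java.lang.Character. The following sketch encodes a supplementary code point (U+11000, the first code point of the Brahmi block) and shows that one character occupies two char values; the class name SurrogateDemo and the helper toUtf16 are placeholders for this example.

```java
class SurrogateDemo {
    // Encodes one code point as UTF-16 code units (one or two chars).
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        int supplementary = 0x11000; // a code point outside the BMP
        char[] units = toUtf16(supplementary);
        System.out.printf("code units:     %d%n", units.length);
        System.out.printf("high surrogate: %04X%n", (int) units[0]);
        System.out.printf("low surrogate:  %04X%n", (int) units[1]);

        // length() counts code units; codePointCount() counts characters.
        String text = new String(units);
        System.out.println("length():         " + text.length());
        System.out.println("codePointCount(): " + text.codePointCount(0, text.length()));
    }
}
```

This is why code that must handle supplementary characters usually iterates over code points rather than over char values.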
Unicode Escapes in Java Source Code
The Java source code is a sequence of Unicode characters. The Java source code can contain characters from any language and not just characters from the ASCII character set. Most of the time the source code is encoded in some native character set, which is an extension of ASCII. Even in these cases the Java source code can include characters that are not part of the native character set. This is done by using the Unicode escape. In the source code we can specify any UTF-16 code unit by specifying the value as \u followed by four hexadecimal digits.
Identifiers in Java
In Java source code we declare several kinds of entities, and identifiers give them names. Identifiers name classes, interfaces, enums and annotations, as well as their members, such as fields and methods. They also name variables, the type parameters of a generic class, packages and sub-packages, and labels within a method or a block. Labels are the targets of the break and continue statements.
What are the rules for defining an identifier in Java? An identifier is a sequence of any number of “Java letters” and “Java digits” that starts with a “Java letter”. The sequence cannot match any keyword of the Java language, the boolean literals true and false, or the literal null. A “Java letter” is not just one of the letters A to Z and a to z from the ASCII character set; it also includes letters from other languages available in the Unicode character set. “Java letters” also include connecting punctuation characters such as ‘_’, currency symbols such as ‘$’, ‘₹’, ‘€’, ‘£’ and ‘¥’, and letter numbers such as the Roman numeral ‘Ⅹ’ (U+2169). “Java digits” include the digits used in the various scripts of the Unicode character set, not just the digits 0 to 9 from ASCII. Combining marks and non-spacing marks, which attach to a preceding character, are also allowed in an identifier, but only after the first character. The following shows a valid declaration using a Java identifier:
char अ = 'अ';
Here अ has been used as an identifier; since it is a letter in Hindi, this declaration is valid. But how do we use such characters in a Java source file created with a text editor in which only ASCII characters may be available? Before the compiler identifies the lines and tokens in a Java source file, it looks for Unicode escapes. The Java compiler works on Unicode characters, whereas a Java source file is normally encoded in ASCII or some extension of ASCII. While decoding the file to Unicode, the compiler first replaces each Unicode escape with the actual Unicode character. Using Unicode escapes, we can write the above declaration in an ASCII-encoded Java source file as shown below:
char \u0905 = '\u0905'; // 0905 is the hex value for the Hindi letter A
A Unicode escape is written as \u followed by four hexadecimal digits, which give the value of a UTF-16 code unit; for a character in the BMP this is the same as its code-point value in the Unicode character set.
The following code segment would not compile:
char ch = '\u000A'; // 000A is value for line feed
since this will be seen by the Java compiler as:
char ch = '
';
Instead, char ch = '\n'; should be used to get a character literal for the newline.
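The behaviour of Unicode escapes described above can be tried out with a small class. In this sketch (EscapeDemo and escapedValue are names invented for the example), both the field name and its value are written as escapes, while the line feed is written with the ordinary character escape.

```java
class EscapeDemo {
    // The field name and the literal are both written as escapes; after
    // translation the compiler sees the Devanagari letter A (U+0905).
    static char \u0905 = '\u0905';

    // Returns the numeric value of the field declared above.
    static int escapedValue() {
        return \u0905;
    }

    public static void main(String[] args) {
        System.out.printf("U+%04X%n", escapedValue());
        // A line feed has to be written with the character escape instead.
        char lineFeed = '\n';
        System.out.println((int) lineFeed);
    }
}
```

Note that Unicode escapes are also translated inside comments, so an escape for a line terminator placed in a comment can break the code just as it does inside a character literal.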
Continuing with examples of valid and invalid identifiers for Java, the declaration
String नमस्ते = "नमस्ते";
is valid in Java since it starts with a letter and is made up only of letters and combining marks, but
String ९नमस्ते = "नमस्ते";
is not valid since it starts with a digit (९ is the Devanagari digit nine). But
String ₹९नमस्ते = "नमस्ते";
would become valid since it now does not start with a digit, but a currency sign.
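These rules can be checked programmatically with Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The sketch below tests the three names from the examples above, written with Unicode escapes so that the file stays ASCII-only. The class name IdentifierCheck and the method looksLikeIdentifier are hypothetical, and the check deliberately ignores keywords and the literals true, false and null.

```java
class IdentifierCheck {
    // Returns true if the text starts with a "Java letter" and continues
    // with characters allowed in the rest of an identifier.
    // Keywords and the literals true, false and null are not checked here.
    static boolean looksLikeIdentifier(String text) {
        if (text.isEmpty()) {
            return false;
        }
        int first = text.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        for (int i = Character.charCount(first); i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        // The Hindi word used in the examples above, as escapes.
        String word = "\u0928\u092E\u0938\u094D\u0924\u0947";
        String digitNine = "\u096F"; // Devanagari digit nine
        String rupee = "\u20B9";     // Indian rupee sign

        System.out.println(looksLikeIdentifier(word));
        System.out.println(looksLikeIdentifier(digitNine + word));
        System.out.println(looksLikeIdentifier(rupee + digitNine + word));
    }
}
```

Iterating by code point rather than by char keeps the check correct for supplementary characters.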
Exercise
Define a class called नमस्तेदुनिया with a main method similar to the main method of the typical HelloWorld class. Use parameter name as आर्ग instead of args, and it should print “नमस्ते दुनिया” on the standard output instead of “Hello world”.
This can be done as given below:
Listing for class नमस्तेदुनिया
class नमस्तेदुनिया {
    public static void main(String[] आर्ग) {
        System.out.println("नमस्ते दुनिया");
    }
}
The above code compiles and runs successfully. You should be able to see the output if your terminal supports the required fonts.
It is not a good idea to use non-ASCII text directly in Java source code, since the interpretation of non-ASCII text largely depends on the encoding used by the native OS. So, if we want to include non-ASCII characters in a Java source file, we may be better off using Unicode escapes for all such characters. This may seem tedious, but it can be automated with the native2ascii utility, which shipped with the JDK up to version 8 (it was removed in JDK 9). This utility converts a text file in any of the supported encodings to ASCII by applying Unicode escapes to all the non-ASCII characters, and the same utility can reverse the conversion back to the native encoding. For example, if the code in the listing above is saved in a file named HelloWorldHindi.java using the UTF-8 encoding, then this Java source file can be converted to “ASCII only” with the command given below:
native2ascii -encoding utf-8 HelloWorldHindi.java HelloWorldHindi.java
This file can be converted back to UTF-8 encoded format with the help of command as given below:
native2ascii -encoding utf-8 -reverse HelloWorldHindi.java HelloWorldHindi.java
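To make the forward conversion concrete, here is a rough Java sketch of the kind of transformation described above. It is not the implementation of native2ascii, only an illustration under simple assumptions: every UTF-16 code unit above 0x7F is replaced by an escape and everything else is copied unchanged. The class name AsciiEscaper and the method escape are invented for this example.

```java
class AsciiEscaper {
    // Replaces every char above 0x7F with a Unicode escape and copies
    // ASCII characters unchanged.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c > 0x7F) {
                // Backslash, the letter u, then four hex digits.
                out.append(String.format("\\u%04x", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The declaration from the earlier example, with the letter as an escape.
        System.out.println(escape("char \u0905 = 'x';"));
    }
}
```

Because the loop works on char values, a supplementary character comes out as two escapes, one per surrogate, which is the form Java source code expects.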
Identifier Ignorable Characters
There are also some non-printable characters (most of them control characters) that the Java compiler ignores in an identifier. These characters are known as Java-identifier-ignorable, i.e. if such a character is used in an identifier it is ignored, so two different sequences of Unicode characters can denote the same identifier. For example:
class TestIdentifierIgnorable {
    public static void main(String[] args) {
        String str = "Hello world!";
        System.out.println(s\u0001tr); // \u0001 is Java Identifier Ignorable character
    }
}
In the above code listing, the variable name str in line 3 is the same as s\u0001tr used in line 4. So the above code is legal, compiles successfully and, when run, prints “Hello world!” on the standard output.
The following are the numeric values in hex of the Java-identifier-ignorable characters:
0, 1, 2, 3, 4, 5, 6, 7, 8, e, f, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1a, 1b, 7f, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 8a, 8b, 8c, 8d, 8e, 8f, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 9a, 9b, 9c, 9d, 9e, 9f, ad, 600, 601, 602, 603, 604, 605, 61c, 6dd, 70f, 8e2, 180e, 200b, 200c, 200d, 200e, 200f, 202a, 202b, 202c, 202d, 202e, 2060, 2061, 2062, 2063, 2064, 2066, 2067, 2068, 2069, 206a, 206b, 206c, 206d, 206e, 206f, feff, fff9, fffa, fffb,
110bd, 110cd, 13430, 13431, 13432, 13433, 13434, 13435, 13436, 13437, 13438, 1bca0, 1bca1, 1bca2, 1bca3, 1d173, 1d174, 1d175, 1d176, 1d177, 1d178, 1d179, 1d17a,
e0001, e0020, e0021, e0022, e0023, e0024, e0025, e0026, e0027, e0028, e0029, e002a, e002b, e002c, e002d, e002e, e002f, e0030, e0031, e0032, e0033, e0034, e0035, e0036, e0037, e0038, e0039, e003a, e003b, e003c, e003d, e003e, e003f, e0040, e0041, e0042, e0043, e0044, e0045, e0046, e0047, e0048, e0049, e004a, e004b, e004c, e004d, e004e, e004f, e0050, e0051, e0052, e0053, e0054, e0055, e0056, e0057, e0058, e0059, e005a, e005b, e005c, e005d, e005e, e005f, e0060, e0061, e0062, e0063, e0064, e0065, e0066, e0067, e0068, e0069, e006a, e006b, e006c, e006d, e006e, e006f, e0070, e0071, e0072, e0073, e0074, e0075, e0076, e0077, e0078, e0079, e007a, e007b, e007c, e007d, e007e, e007f
Note: To use any of the values above hex FFFF (supplementary characters), the value has to be encoded in UTF-16, which requires two char values (a high surrogate followed by a low surrogate). For example, the value hex 110BD is encoded as the two 16-bit values hex D804 and hex DCBD. This pair of values is known as a surrogate pair; the first value is the high surrogate and the second is the low surrogate. So in any identifier the pair of characters \uD804\uDCBD is ignored by the compiler, i.e. the identifier str is also equivalent to s\uD804\uDCBDtr. Details of the UTF-16 encoding are given in the article on UTF.
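Whether a given code point is identifier-ignorable can be queried with Character.isIdentifierIgnorable, and the surrogate pair from the note can be computed rather than worked out by hand. A minimal sketch follows; the class name IgnorableCheck and the helper isIgnorable are placeholders, and the exact set of ignorable characters depends on the Unicode version supported by the running JDK.

```java
class IgnorableCheck {
    // Returns true if the code point is ignored inside a Java identifier.
    static boolean isIgnorable(int codePoint) {
        return Character.isIdentifierIgnorable(codePoint);
    }

    public static void main(String[] args) {
        System.out.println(isIgnorable(0x0001)); // a control character
        System.out.println(isIgnorable('t'));    // an ordinary letter

        // Supplementary code points need the int overload; in source code
        // they are written as a surrogate pair.
        int cp = 0x110BD;
        System.out.println(isIgnorable(cp));
        System.out.printf("%04X %04X%n",
                (int) Character.highSurrogate(cp),
                (int) Character.lowSurrogate(cp));
    }
}
```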