Charsets and Unicode Identifiers in Java

Pravin Jain

Oct. 06, 20 · Analysis

Char Data Type

The char data type in a programming language represents a unit of text. Text data is represented as a sequence of characters, and a char value is simply the numeric value assigned to a character in a character set.

Character Sets

What is a character set? A character set is a collection of units of text (characters), each of which is assigned a unique numeric value. Many character sets exist. The best known is ASCII (American Standard Code for Information Interchange), which assigns 128 characters (including control characters) to the numeric values 0 to 127. Not all characters in a character set are printable: character sets also include control characters, e.g. carriage return, line feed, form feed, tab, and bell. These do not occupy a print position; instead they affect the position of the next character or perform some similar function.

In the early days of computing, each OS supported a particular character set, most commonly ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code). Later, ASCII and its extensions were adopted by most OSs. The ASCII character set included only the commonly used Latin characters and control characters. Each extension of ASCII catered to the text-representation needs of a particular region or culture. Most of these extensions used the numeric values from 128 to 255 for the additional, region-specific characters.

Common examples of character sets that extend ASCII are those of the ISO 8859 series: 8859-1 (Latin-1) caters to Western European languages, 8859-2 (Latin-2) to Eastern European, 8859-3 (Latin-3) to Southern European, 8859-4 (Latin-4) to Northern European, 8859-5 (Cyrillic) to Russian and Bulgarian, 8859-6 to Arabic, 8859-7 to Greek, 8859-8 to Hebrew, 8859-9 (Latin-5) to Turkish, 8859-10 (Latin-6) to Northern European, 8859-11 to Thai, 8859-13 (Latin-7) to Baltic, 8859-14 (Latin-8) to Celtic, 8859-15 (Latin-9) to Western European, and 8859-16 (Latin-10) to Eastern European languages.

There are other extensions of ASCII as well. ISO 8859-12 was reserved for the Devanagari script, but it was abandoned. For the Indic scripts, another extension of ASCII called ISCII (Indian Script Code for Information Interchange) was developed in 1988 and later revised in 1991. The ISO 8859 series of character sets is summarized in Table 1.

ISO Number	Name	Region/Languages
8859-1	Latin-1	Western European
8859-2	Latin-2	Eastern European
8859-3	Latin-3	Southern European
8859-4	Latin-4	Northern European
8859-5	Cyrillic	Russian, Bulgarian
8859-6	Arabic	Arabic
8859-7	Greek	Greek
8859-8	Hebrew	Hebrew
8859-9	Latin-5	Turkish
8859-10	Latin-6	Northern European
8859-11	Thai	Thai
8859-12	Devanagari	Abandoned
8859-13	Latin-7	Baltic
8859-14	Latin-8	Celtic
8859-15	Latin-9	Western European
8859-16	Latin-10	Eastern European

Table – 1: ISO 8859 series of character sets summarized

All the character sets mentioned above are extensions of ASCII, i.e. they assign the same characters as ASCII to the values 0 – 127. A quick summary of the ASCII characters is given in Table 2.

Problems in having multiple character sets

The char data type in many programming languages simply relies on the OS's interpretation of the numeric value of the char data, i.e. if the OS uses a different character set, the same numeric value is interpreted differently. For example, in Latin-1 (8859-1) the value EB (hex) represents the character ë, whereas in Greek (8859-7) the same value EB (hex) represents the character λ.
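The ambiguity described above can be reproduced in Java. The following is a minimal sketch: the class name SameByteDifferentCharset and its helper decode are made up for illustration, and it assumes that the running JDK provides the ISO-8859-7 charset (of the two charsets used here, only ISO-8859-1 is guaranteed on every Java platform).

```java
import java.nio.charset.Charset;

class SameByteDifferentCharset {
    // Decode a single byte using the named charset and return the resulting text.
    static String decode(int byteValue, String charsetName) {
        byte[] data = { (byte) byteValue };
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String latin1 = decode(0xEB, "ISO-8859-1");
        String greek = decode(0xEB, "ISO-8859-7");
        // Print code points rather than glyphs, so the result does not depend on the terminal.
        System.out.printf("ISO-8859-1: U+%04X%n", (int) latin1.charAt(0)); // U+00EB, e with diaeresis
        System.out.printf("ISO-8859-7: U+%04X%n", (int) greek.charAt(0));  // U+03BB, Greek small lambda
    }
}
```

The same byte yields two different Unicode characters, which is exactly the problem a single universal character set is meant to remove.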

Value in Hex	Character(s) Description
00 – 1F	Control Characters
20	Space
21 – 2F	Punctuations ! " # $ % & ' ( ) * + , - . /
30 – 39	Digits 0 – 9
3A – 40	Punctuations : ; < = > ? @
41 – 5A	Uppercase Letters A – Z
5B – 60	Punctuations [ \ ] ^ _ `
61 – 7A	Lowercase Letters a – z
7B – 7E	Punctuations { | } ~
7F	Control Character

Table – 2: ASCII character set summarized

Unicode

To solve this problem, a universal character set was designed in the form of Unicode. The first version of Unicode was introduced in 1991. The Unicode character set was designed to include all the characters of all the languages and scripts of the world. It is revised from time to time to add new characters from various regions and to cover languages and scripts that were not included in earlier versions. It uses numeric values from 0 to 10FFFF (hex). Unicode is also an extension of ASCII, so the values from 0 to 127 are the same as in ASCII. Most of the Indian scripts have been given a block of 128 characters each, starting from 0x0900. The Unicode blocks for the Indian scripts are based on ISCII 1988 and not on ISCII 1991. Unicode has no provision for removing or changing a character, so newer versions can only add characters, although existing characters may be deprecated. The blocks for the South, Central, and Southeast Asian scripts in Unicode are summarized in Tables 3 to 7.

Value Range in Hex	Script
0900 – 097F	Devanagari
A8E0 – A8FF	Devanagari Extended
1CD0 – 1CFF	Vedic Extensions
0980 – 09FF	Bengali
0A00 - 0A7F	Gurmukhi
0A80 – 0AFF	Gujarati
0B00 – 0B7F	Oriya
0B80 – 0BFF	Tamil
11FC0 – 11FFF	Tamil Supplement
0C00 – 0C7F	Telugu
0C80 – 0CFF	Kannada
0D00 – 0D7F	Malayalam

Table – 3: Official Scripts of India in Unicode

Value Range in Hex	Script
0780 – 07BF	Thaana
0D80 – 0DFF	Sinhala
11400 – 1147F	Newa
0F00 – 0FFF	Tibetan
1800 – 18AF	Mongolian
11660 – 1167F	Mongolian Supplement
1900 – 194F	Limbu
ABC0 – ABFF	Meetei Mayek
AAE0 – AAFF	Meetei Mayek Extensions
16A40 – 16A6F	Mro
118A0 – 118FF	Warang Citi
1C50 – 1C7F	Ol Chiki
11100 – 1114F	Chakma
1C00 – 1C4F	Lepcha
A880 – A8DF	Saurashtra
11D00 – 11D5F	Masaram Gondi
11D60 – 11DAF	Gunjala Gondi
1E2C0 – 1E2FF	Wancho

Table – 4: Other Modern Scripts of South and Central Asia in Unicode

Value Range in Hex	Script
11000 – 1107F	Brahmi
10A00 – 10A5F	Kharoshthi
11C00 – 11C6F	Bhaiksuki
A840 – A87F	Phags-pa
11C70 – 11CBF	Marchen
10C00 – 10C4F	Old Turkic
11A50 – 11AAF	Soyombo
11A00 - 11A4F	Zanabazar Square
10F00 – 10F2F	Old Sogdian
10F30 – 10F6F	Sogdian

Table – 5: Ancient Scripts of South and Central Asia in Unicode

Value Range in Hex	Script
A800 – A82F	Syloti Nagri
11080 – 110CF	Kaithi
11180 – 111DF	Sharada
11680 – 116CF	Takri
11580 – 115FF	Siddham
11150 – 1117F	Mahajani
11200 – 1124F	Khojki
112B0 – 112FF	Khudawadi
11280 – 112AF	Multani
11480 – 114DF	Tirhuta
11600 – 1165F	Modi
119A0 – 119FF	Nandinagari
11300 – 1137F	Grantha
11900 – 1195F	Dives Akuru
11700 – 1173F	Ahom
110D0 – 110FF	Sora Sompeng
11800 – 1184F	Dogra

Table – 6: Other Historic Scripts of South and Central Asia in Unicode

Value Range in Hex	Script
0E00 – 0E7F	Thai
0E80 – 0EFF	Lao
1000 – 109F	Myanmar
AA60 – AA7F	Myanmar Extended-A
A9E0 – A9FF	Myanmar Extended-B
1780 – 17FF	Khmer
19E0 – 19FF	Khmer Symbols
1950 – 197F	Tai Le
1980 – 19DF	New Tai Lue
1A20 – 1AAF	Tai Tham
AA80 – AADF	Tai Viet
A900 – A92F	Kayah Li
AA00 – AA5F	Cham
16B00 – 16B8F	Pahawh Hmong
1E100 – 1E14F	Nyiakeng Puachue Hmong
11AC0 – 11AFF	Pau Cin Hau
10D00 – 10D3F	Hanifi Rohingya

Table – 7: Scripts of Southeast Asia in Unicode
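The block ranges in the tables above can be checked from Java with Character.UnicodeBlock.of. The following is a small sketch; the class name BlockLookup and the helper blockOf are invented for this example, and which blocks are recognized depends on the Unicode version supported by the running JDK, so the newer blocks may come back as unknown on older releases.

```java
class BlockLookup {
    // Return the name of the Unicode block containing the code point,
    // or "unknown" if this JDK's Unicode tables do not assign it to a block.
    static String blockOf(int codePoint) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(codePoint);
        return block == null ? "unknown" : block.toString();
    }

    public static void main(String[] args) {
        System.out.println(blockOf(0x0905)); // a value inside 0900 - 097F
        System.out.println(blockOf(0x0B85)); // a value inside 0B80 - 0BFF
        System.out.println(blockOf(0x0E01)); // a value inside 0E00 - 0E7F
    }
}
```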

What is the size of char in C? In C, a char occupies 1 byte (normally 8 bits). In Java, a char occupies 2 bytes: it is an unsigned, 16-bit integral value used to represent UTF-16 code units.

Why is a char 2 bytes in Java? In C, char represents a character from the platform's local character set, which in most cases is some extension of ASCII. Most of these character sets have at most 256 characters, so 1 byte is enough. In Java, the char type represents characters from the Unicode character set using the UTF-16 encoding, whose code units are 16 bits wide. The details of the UTF encodings are given in the article on UTF.
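The size and the unsigned nature of the Java char can be confirmed with the constants on java.lang.Character. This is only an illustrative sketch; the class name CharFacts and the helper roundTrip are not part of any library.

```java
class CharFacts {
    // Narrow an int to char and widen it back, showing that char has no sign bit.
    static int roundTrip(int value) {
        return (char) value;
    }

    public static void main(String[] args) {
        System.out.println(Character.SIZE);            // 16 (bits in a char)
        System.out.println(Character.BYTES);           // 2 (bytes in a char, Java 8 and later)
        System.out.println((int) Character.MAX_VALUE); // 65535 (largest char value)
        System.out.println(roundTrip(-1));             // 65535 (-1 wraps around instead of staying negative)
    }
}
```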

Let us look at Unicode more closely. Unicode is a character set that has characters from all the languages of the world. There are various versions of the Unicode character set; at the time of this writing, the current version was 13.0. The Unicode standard maps each character to a unique code-point value in the range 0 to 10FFFF (hex). This code-point range is divided into 17 planes of 65536 (2¹⁶) values each. The zeroth plane, i.e. the values 0 to FFFF (hex), is known as the BMP (Basic Multilingual Plane), and the other planes hold the supplementary characters. To represent the complete range of characters using only 16-bit units, the Unicode standard defines an encoding called UTF-16. In this encoding, supplementary characters are represented as a pair of 16-bit code units: the first is from the high-surrogates range (D800 – DBFF hex) and the second is from the low-surrogates range (DC00 – DFFF hex). The code-point values D800 to DFFF (hex) are not assigned to any character and are reserved for surrogates. For characters in the range 0000 to FFFF (hex), the code-point value and the UTF-16 code unit are the same. The Java programming language represents text as sequences of 16-bit code units in the UTF-16 encoding, and its char type represents one such code unit.
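To make the surrogate mechanism concrete, here is a short sketch that encodes a code point both with the library call Character.toChars and by doing the UTF-16 arithmetic by hand. The class name SurrogateDemo and its two helper methods are made up for this illustration.

```java
class SurrogateDemo {
    // Encode a code point as UTF-16 code units, formatted as hex separated by spaces.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    // The same split done by hand for a supplementary code point (above hex FFFF).
    static String manualSplit(int codePoint) {
        int offset = codePoint - 0x10000;     // 20 significant bits remain
        int high = 0xD800 + (offset >> 10);   // top 10 bits go into the high surrogate
        int low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits go into the low surrogate
        return String.format("%04X %04X", high, low);
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x0905));   // 0905 (a BMP character needs one code unit)
        System.out.println(utf16Units(0x110BD));  // D804 DCBD
        System.out.println(manualSplit(0x110BD)); // D804 DCBD

        String text = new String(Character.toChars(0x110BD));
        System.out.println(text.length());                         // 2 code units
        System.out.println(text.codePointCount(0, text.length())); // 1 code point
    }
}
```

Note that String.length() counts code units, not characters, which is why the one-character string above reports a length of 2.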

Unicode Escapes in Java Source Code

Java source code is a sequence of Unicode characters, so it can contain characters from any language and not just characters from the ASCII character set. Most of the time the source file is encoded in some native character set that is an extension of ASCII. Even then, the source code can include characters that are not part of the native character set by using Unicode escapes: any UTF-16 code unit can be written as \u followed by four hexadecimal digits.
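Because Unicode escapes are translated before the compiler tokenizes the source, they work anywhere in the file, not only inside literals. The sketch below (class and method names are hypothetical) uses an escape once in a character literal and once in the middle of an identifier.

```java
class EscapeDemo {
    // The escape for hex 0041 and the literal letter A denote the same char.
    static boolean sameLetter() {
        return '\u0041' == 'A';
    }

    // The escape for hex 0061 is the letter a, so this declares a variable named abc.
    static int escapedIdentifier() {
        int \u0061bc = 42;
        return abc; // refers to the variable declared on the previous line
    }

    public static void main(String[] args) {
        System.out.println(sameLetter());        // true
        System.out.println(escapedIdentifier()); // 42
    }
}
```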

Identifiers in Java

In Java source code we define and declare several kinds of entities, and identifiers give them names. Identifiers name classes, interfaces, enums, and annotations, as well as their members, such as fields and methods. They also name variables, the type parameters of a generic class, packages and sub-packages, and labels within a method or block. Labels are the targets of the break and continue statements.

What are the rules for defining an identifier in Java? An identifier is a sequence of “Java letters” and “Java digits” of any length, and it must start with a “Java letter”. The identifier cannot be the same as any keyword of the Java language, the boolean literals true and false, or the literal null. A “Java letter” is not just the letters A – Z and a – z from the ASCII character set; it also includes letters from the other languages available in the Unicode character set. The “Java letter” also includes connecting punctuation characters like ‘_’, currency symbols like ‘$’, ‘₹’, ‘€’, ‘£’, ‘¥’, etc., and letter-like numeric characters such as the Roman numeral ‘Ⅹ’ (U+2169). Likewise, the “Java digit” includes the digits used in the various scripts of the Unicode character set, not just the ASCII digits 0 – 9. Combining marks and non-spacing marks, which are used to compose characters, are also allowed in an identifier, but only after the first character. The following declaration uses a valid Java identifier:

char अ = 'अ';

Here अ has been used as an identifier; since it is a letter in Hindi, this declaration is valid. But how do we use such characters in a Java source file created with a text editor in which only the ASCII characters are available? The Java compiler works on Unicode characters, while the source file is normally encoded in ASCII or some extension of ASCII. Before the compiler identifies the lines and tokens in a source file, it looks for Unicode escapes and replaces each one with the Unicode character it denotes. Using Unicode escapes, we can write the above declaration in a Java source file encoded in ASCII as shown below:

char \u0905 = '\u0905'; // 0905 is the hex value for the Hindi letter A

A Unicode escape is written as \u followed by four hexadecimal digits, which give the UTF-16 code-unit value of the character; for characters in the BMP this is the same as the code-point value.

The following code segment would not compile:

char ch = '\u000A'; // 000A is value for line feed

since this will be seen by the Java compiler as:

char ch = '
';

Instead, char ch = '\n'; should be used to obtain a character literal for newline.

Continuing with examples of valid and invalid identifiers in Java, the declaration

String नमस्ते = "नमस्ते";

is valid, since the identifier starts with a letter and contains only characters that are allowed in an identifier, but

String ९नमस्ते = "नमस्ते";

is not valid, since it starts with a digit (९ is the Devanagari digit nine). However,

String ₹९नमस्ते = "नमस्ते";

is valid, since it now starts not with a digit but with a currency sign.
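The three cases above can be checked programmatically with Character.isJavaIdentifierStart and Character.isJavaIdentifierPart. The following is a sketch rather than a complete validator: the class name IdentifierCheck and the method looksLikeIdentifier are invented here, and the check deliberately ignores the rule about keywords and the literals true, false, and null. The strings are written with Unicode escapes so that the file stays ASCII-only.

```java
class IdentifierCheck {
    // True if the text has the shape of a Java identifier.
    // Assumption: reserved words (keywords, true, false, null) are NOT rejected here.
    static boolean looksLikeIdentifier(String text) {
        if (text.isEmpty()) {
            return false;
        }
        int first = text.codePointAt(0);
        if (!Character.isJavaIdentifierStart(first)) {
            return false;
        }
        // Walk by code point so that supplementary characters are handled correctly.
        for (int i = Character.charCount(first); i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (!Character.isJavaIdentifierPart(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        String namaste = "\u0928\u092E\u0938\u094D\u0924\u0947"; // the Devanagari word used above
        String nine = "\u096F";  // Devanagari digit nine
        String rupee = "\u20B9"; // Indian rupee sign

        System.out.println(looksLikeIdentifier(namaste));                // true
        System.out.println(looksLikeIdentifier(nine + namaste));         // false
        System.out.println(looksLikeIdentifier(rupee + nine + namaste)); // true
    }
}
```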

Exercise

Define a class called नमस्तेदुनिया with a main method similar to the main method of the typical HelloWorld class. Use आर्ग as the parameter name instead of args; the program should print “नमस्ते दुनिया” on the standard output instead of “Hello world”.

This can be done as given below:

Listing for class नमस्तेदुनिया

class नमस्तेदुनिया {
    public static void main(String[] आर्ग) {
        System.out.println("नमस्ते दुनिया");
    }
}

The above code compiles and runs successfully. You should be able to see the output if your terminal supports the required fonts.

It is not a good idea to use non-ASCII text directly in Java source code, since its interpretation depends largely on the encoding used by the native OS. So, if we want to include non-ASCII characters in a Java source file, we may be better off using Unicode escapes for all such characters. This may seem a tedious task, but it is easily handled by the native2ascii utility, which shipped with the JDK up to version 8 (it was removed in JDK 9; with newer JDKs the source encoding can instead be stated explicitly with the -encoding option of javac). This utility converts a text file in any of the standard encodings to ASCII by applying Unicode escapes to all the non-ASCII characters, and it can also reverse the conversion back to the native encoding. For example, if the code in the listing above is saved in a file named HelloWorldHindi.java using the UTF-8 encoding, the file can be converted to “ASCII only” with the following command:

native2ascii -encoding utf-8 HelloWorldHindi.java HelloWorldHindi.java

The file can be converted back to the UTF-8 encoding with the following command:

native2ascii -encoding utf-8 -reverse HelloWorldHindi.java HelloWorldHindi.java
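The forward conversion is simple enough to sketch in a few lines of Java. This is not the implementation of native2ascii, only an approximation of what it does to each character; the class name AsciiEscaper and the method toAsciiEscapes are hypothetical.

```java
class AsciiEscaper {
    // Replace every char outside the ASCII range with a Unicode escape.
    // Works on UTF-16 code units, so a supplementary character becomes two escapes.
    static String toAsciiEscapes(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c); // ASCII characters are copied unchanged
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The compiler turns the escapes below into the Devanagari letter,
        // and the method turns that letter back into escape text.
        String declaration = "char \u0905 = '\u0905';";
        System.out.println(toAsciiEscapes(declaration));
    }
}
```

Working on code units rather than code points is a deliberate choice here, because a Unicode escape can only express one 16-bit code unit at a time.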

Identifier Ignorable Characters

There are also some non-printable characters (mostly control characters) that the Java compiler ignores in an identifier. These characters are known as Java-identifier-ignorable characters. If any such character is used in an identifier it is ignored, so two different sequences of Unicode characters can denote the same identifier, e.g.

class TestIdentifierIgnorable {
    public static void main (String[] args) {
        String str = "Hello world!";
        System.out.println(s\u0001tr); // \u0001 is Java Identifier Ignorable character
    }
}

In the above code listing, the variable name str declared in line 3 is the same as s\u0001tr used in line 4. So the above code is legal, compiles successfully and, when run, prints “Hello world!” on the standard output.

The following are the numeric values, in hex, of the Java-identifier-ignorable characters:

0, 1, 2, 3, 4, 5, 6, 7, 8, e, f, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 1a, 1b, 7f, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 8a, 8b, 8c, 8d, 8e, 8f, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 9a, 9b, 9c, 9d, 9e, 9f, ad, 600, 601, 602, 603, 604, 605, 61c, 6dd, 70f, 8e2, 180e, 200b, 200c, 200d, 200e, 200f, 202a, 202b, 202c, 202d, 202e, 2060, 2061, 2062, 2063, 2064, 2066, 2067, 2068, 2069, 206a, 206b, 206c, 206d, 206e, 206f, feff, fff9, fffa, fffb,

110bd, 110cd, 13430, 13431, 13432, 13433, 13434, 13435, 13436, 13437, 13438, 1bca0, 1bca1, 1bca2, 1bca3, 1d173, 1d174, 1d175, 1d176, 1d177, 1d178, 1d179, 1d17a

e0001, e0020, e0021, e0022, e0023, e0024, e0025, e0026, e0027, e0028, e0029, e002a, e002b, e002c, e002d, e002e, e002f, e0030, e0031, e0032, e0033, e0034, e0035, e0036, e0037, e0038, e0039, e003a, e003b, e003c, e003d, e003e, e003f, e0040, e0041, e0042, e0043, e0044, e0045, e0046, e0047, e0048, e0049, e004a, e004b, e004c, e004d, e004e, e004f, e0050, e0051, e0052, e0053, e0054, e0055, e0056, e0057, e0058, e0059, e005a, e005b, e005c, e005d, e005e, e005f, e0060, e0061, e0062, e0063, e0064, e0065, e0066, e0067, e0068, e0069, e006a, e006b, e006c, e006d, e006e, e006f, e0070, e0071, e0072, e0073, e0074, e0075, e0076, e0077, e0078, e0079, e007a, e007b, e007c, e007d, e007e, e007f

Note: To use any of the values above hex FFFF (the supplementary characters), the value has to be encoded in UTF-16, which requires two char values (a high surrogate followed by a low surrogate). For example, the value hex 110BD is encoded as the two 16-bit values hex D804 and hex DCBD. Such a pair of values is known as a surrogate pair: the first value is the high surrogate and the second is the low surrogate. So in any identifier the pair of characters \uD804\uDCBD will be ignored by the compiler, i.e. the identifier str is also equivalent to s\uD804\uDCBDtr. Details of how the UTF-16 encoding works are given in the article on UTF.
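Whether a particular code point is ignorable can be tested with Character.isIdentifierIgnorable. The sketch below uses it to reduce an identifier to the form in which it is compared; the class name IgnorableDemo and its methods are made up for this example, and the exact set of ignorable characters depends on the Unicode version supported by the running JDK.

```java
class IgnorableDemo {
    static boolean ignorable(int codePoint) {
        return Character.isIdentifierIgnorable(codePoint);
    }

    // Remove ignorable code points, leaving the characters that actually distinguish identifiers.
    static String stripIgnorable(String identifier) {
        StringBuilder out = new StringBuilder();
        identifier.codePoints()
                  .filter(cp -> !Character.isIdentifierIgnorable(cp))
                  .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(ignorable(0x0001)); // true, a control character
        System.out.println(ignorable(0x00AD)); // true, the soft hyphen
        System.out.println(ignorable('a'));    // false, an ordinary letter
        System.out.println(stripIgnorable("s\u0001tr")); // str

        // A supplementary ignorable character, built from its code point value.
        String withPair = "s" + new String(Character.toChars(0x110BD)) + "tr";
        System.out.println(stripIgnorable(withPair));
    }
}
```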
