Isode's whitepapers are licenced under a Creative Commons Licence.

For use by a computer, a character needs to be represented by a number. The problem is that, historically, a 7 or 8 bit number has been used to represent characters. This restricts the number of characters to 128 or 256. There are various ways in which this limit is overcome, and this note outlines these. It also has a brief discussion as to where these techniques are used in Directory and Messaging protocols.

Single Byte Character Sets

A single byte character set can represent up to 256 characters. To cover a wider range of characters, a number of different single byte character sets exist. The character set in use must be specified, because the same byte value represents different characters in different character sets. Sometimes the character set to use is determined by the 'locale' setting of the computer.

Examples of such character sets are the ISO 8859 sets:

  • ISO 8859-1 (Latin 1) - covers most western European languages
  • ISO 8859-2 (Latin 2) - covers most eastern European languages
  • ISO 8859-5 - Cyrillic
  • ISO 8859-6 - Arabic
  • ISO 8859-7 - Greek
  • ISO 8859-8 - Hebrew
  • ISO 8859-15 - modification of 8859-1 e.g. including the Euro sign
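As a small illustration of why the character set must be specified, the following Python sketch decodes the same byte values under several of the ISO 8859 sets listed above. It assumes that the codecs shipped with Python (named `iso8859_1`, `iso8859_5` and so on) follow those standards; the helper name `decode_with` is invented for this example.

```python
# Decode one byte value under several single byte character sets.
# The codec names are those used by Python's standard library.
CHARSETS = ["iso8859_1", "iso8859_5", "iso8859_7", "iso8859_15"]


def decode_with(byte_value, charsets=CHARSETS):
    """Return {charset name: character} for a single byte value."""
    data = bytes([byte_value])
    return {name: data.decode(name) for name in charsets}


if __name__ == "__main__":
    # 0x41 is in the ASCII range and is the same everywhere;
    # 0xA4 and 0xE4 depend on the character set chosen.
    for value in (0x41, 0xA4, 0xE4):
        print(hex(value), decode_with(value))
```

Byte values below 128 decode identically in all of these sets; only the upper half of the range differs.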

There are also proprietary character sets. For instance, while Microsoft Windows 3.1 typically uses one of the ISO 8859 character sets, MS DOS uses a 'code page', selected by the user. Each code page is such a single byte character set. The Macintosh OS uses its own single byte character set.

Many word processing programs conflate character sets and type styles, combining them into the concept of a 'font'. That is, changing font can mean an implicit change of character set as well as a change in the appearance of characters.

Multibyte Character Sets

A single byte character set can only represent up to 256 symbols, and this is insufficient for languages which use ideograms rather than an alphabet. The oldest method for overcoming this is to use a multibyte character set. In such a set, some characters are represented using two byte values rather than one. Note that the data is still a sequence of bytes, and some characters are still given by a single byte. These include control characters such as Carriage Return and Line Feed.

Examples of such multibyte character sets are:

  • GB 2312-1980 - Chinese
  • JIS X0208 - Japanese
  • JIS X0212 - Japanese
  • KS C 5601 - Korean
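The mixture of one byte and two byte characters can be seen with a short Python sketch. It assumes that Python's `gb2312` codec, which implements the usual byte encoding of GB 2312, is representative of the multibyte sets listed above; the function name `bytes_per_character` is invented for this example.

```python
# Show how many bytes each character of a string occupies in a
# multibyte character set. "gb2312" is Python's codec name.
def bytes_per_character(text, charset="gb2312"):
    """Return a list of (character, encoded bytes) pairs."""
    return [(ch, ch.encode(charset)) for ch in text]


if __name__ == "__main__":
    # A Latin letter, a Chinese ideogram and a Line Feed.
    for ch, encoded in bytes_per_character("A\u4e2d\n"):
        print(repr(ch), encoded.hex(), len(encoded), "byte(s)")
```

The Latin letter and the Line Feed each occupy one byte, while the ideogram occupies two.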

Another use of multibyte characters is in the use of two bytes to represent accented characters. This combines the character code for the unaccented character with a code for the accent. This is used in the standard T.61 character set.
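A rough Python sketch of this two byte scheme follows. Python has no T.61 codec, so the table below is an assumption: it covers only three accent codes, it assumes that the accent byte precedes the unaccented letter, and it treats bytes below 128 as ASCII, which is a simplification. The names `T61_ACCENTS` and `decode_t61_subset` are invented for this example.

```python
import unicodedata

# Assumed subset of T.61 non-spacing accent codes, mapped to the
# equivalent Unicode combining characters. Illustrative only.
T61_ACCENTS = {
    0xC1: "\u0300",  # grave accent
    0xC2: "\u0301",  # acute accent
    0xC8: "\u0308",  # diaeresis (umlaut)
}


def decode_t61_subset(data):
    """Decode ASCII bytes plus the accent codes in T61_ACCENTS."""
    out = []
    i = 0
    while i < len(data):
        byte = data[i]
        if byte in T61_ACCENTS:
            if i + 1 >= len(data):
                raise ValueError("accent code with no following letter")
            # Two bytes give one character: accent code, then base letter.
            base = chr(data[i + 1])
            out.append(unicodedata.normalize("NFC", base + T61_ACCENTS[byte]))
            i += 2
        elif byte < 0x80:
            out.append(chr(byte))
            i += 1
        else:
            raise ValueError("byte 0x%02x is not handled by this sketch" % byte)
    return "".join(out)


if __name__ == "__main__":
    print(decode_t61_subset(b"M\xc8uller"))
```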

Character Set Switching

If one wishes to have text in multiple languages, then no single character set described so far can support this. ISO 2022 is an international standard that gives a technique for supporting multiple character sets, and switching between them. This uses escape sequences and control characters to specify which character sets are to be used and where, in the sequence of bytes, each is to be assumed. There is an international register of character sets that can be used in this way. This includes the ISO 8859 sets and the multibyte sets listed above.
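The escape sequences can be made visible with Python's `iso2022_jp` codec, which implements one particular profile of ISO 2022. This is a sketch under that assumption, not a general ISO 2022 implementation; the function name `show_switches` is invented for this example.

```python
ESC = b"\x1b"  # the escape control character that starts each switch


def show_switches(text, charset="iso2022_jp"):
    """Encode text and count the character set switches it needed."""
    encoded = text.encode(charset)
    return encoded, encoded.count(ESC)


if __name__ == "__main__":
    # Latin letter, two Japanese characters, Latin letter.
    encoded, switches = show_switches("A\u65e5\u672cB")
    print(encoded, "switches:", switches)
```

The encoded form contains one escape sequence to switch into the Japanese multibyte set and another to switch back to ASCII. Every byte in the result is below 128, so the data can pass through 7 bit channels.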

'Wide' Character Sets

Another technique to accommodate a wide range of characters is to use a basic data value wider than 8 bits to hold the code for a character. There are two standard ones:

  • Unicode - using a 16 bit value
  • ISO 10646 - using a 31 bit value (normally held in 32 bits)

Unicode is aligned to the first 2^16 values in ISO 10646, known as the Basic Multilingual Plane (BMP), so Unicode is a subset of ISO 10646. ISO 10646 does not, at present, define any characters outside the BMP.

These also use multi-value sequences to represent accented characters. That is, the character sets support non-spacing characters which are displayed with the character to which they apply.
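The following Python sketch shows a multi-value sequence for an accented character, using the standard `unicodedata` module. The helper names `compose` and `decompose` are invented for this example.

```python
import unicodedata


def compose(text):
    """Combine base letters and non-spacing marks where possible (NFC)."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Split accented letters into base letter plus marks (NFD)."""
    return unicodedata.normalize("NFD", text)


if __name__ == "__main__":
    two_values = "u\u0308"  # 'u' followed by a non-spacing diaeresis
    one_value = compose(two_values)
    print(len(two_values), len(one_value), one_value == "\u00fc")
```

Both forms display as the same character, so software that compares strings usually has to bring them to a common form first.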

Handling characters as numerical quantities other than 8 bit bytes has consequences for programs, data and display systems.

Since data is normally transferred as a sequence of 8 bit bytes, there are issues with transferring Unicode and ISO 10646 character sequences. For example, in which order are the two 8 bit components of a Unicode character transferred?
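The byte order question can be illustrated in Python, whose `utf-16-be` and `utf-16-le` codecs write the two components in opposite orders. The byte order mark shown at the end is one common way of signalling the order; it is not discussed in this note and is included only as an illustration. The function name `byte_orders` is invented for this example.

```python
import codecs


def byte_orders(ch):
    """Return both possible byte orders for a 16 bit character."""
    return {
        "big-endian": ch.encode("utf-16-be"),
        "little-endian": ch.encode("utf-16-le"),
    }


if __name__ == "__main__":
    print(byte_orders("\u00fc"))  # u with umlaut, code 0x00FC
    # A leading byte order mark lets the receiver detect the order.
    marked = codecs.BOM_UTF16_BE + "\u00fc".encode("utf-16-be")
    print(marked, marked.decode("utf-16"))
```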

There is a way of encoding Unicode and ISO 10646 as a sequence of bytes. This is called UTF-8. Each character is encoded in a sequence of bytes that may be from one to six bytes long. The encoding has the important property that those characters which are also in ASCII are encoded as the same byte value that they have in ASCII, in the range 0 to 127. The encoding of non-ASCII characters uses only bytes in the range 128..255.
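These properties of UTF-8 can be checked with a short Python sketch. Note that Python's codec follows the later restriction of UTF-8 to the Unicode range, so it never produces more than four bytes per character. The function name `utf8_profile` is invented for this example.

```python
def utf8_profile(text):
    """Return a list of (character, UTF-8 bytes) pairs."""
    return [(ch, ch.encode("utf-8")) for ch in text]


def non_ascii_bytes_are_high(text):
    """Check that every byte of every non-ASCII character is >= 128."""
    return all(
        byte >= 0x80
        for ch, encoded in utf8_profile(text)
        if ord(ch) > 127
        for byte in encoded
    )


if __name__ == "__main__":
    for ch, encoded in utf8_profile("A\u00fc\u20ac"):
        print(repr(ch), encoded.hex(), len(encoded), "byte(s)")
    print(non_ascii_bytes_are_high("M\u00fcller \u20ac"))
```

Because bytes below 128 only ever stand for ASCII characters, software that looks only for ASCII delimiters in a byte sequence continues to work on UTF-8 data.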

Use of Character Sets in ASN.1

Abstract Syntax Notation 1 (ASN.1) defines various character string types. These are all based on octet strings (i.e. strings of 8 bit bytes). If the character set uses data values larger than 8 bits, the order of the components is defined.

These are types of interest:

  • NumericString - holds only digits 0 to 9
  • PrintableString - holds a restricted subset of ASCII
  • IA5String - very close to ASCII
  • TeletexString - uses a range of character sets, especially (T61string) those defined in T.61
  • BMPString - holds Unicode (16 bits)
  • UniversalString - holds ISO 10646 (32 bits)
  • GraphicString - holds displayable characters
  • GeneralString - any registered character set

IA5 is a 7 bit character set which is very nearly the same as US ASCII. Most applications treat it as the same as ASCII.

TeletexString, GraphicString and GeneralString all use the character set switching techniques defined by ISO 2022. GraphicString values do not contain control characters, and TeletexString values have a restricted range of character sets that can be used.
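As an illustration of how these types differ in range, the following Python sketch picks the narrowest of four of them that can hold a given text. It treats IA5String as ASCII, as most applications do, and leaves out TeletexString, GraphicString and GeneralString because their ISO 2022 switching is beyond a short example. The function name `narrowest_type` is invented for this example.

```python
import string

# The PrintableString repertoire: letters, digits, space and
# the punctuation characters ' ( ) + , - . / : = ?
PRINTABLE_STRING_CHARS = set(
    string.ascii_letters + string.digits + " '()+,-./:=?"
)


def narrowest_type(text):
    """Name the narrowest of four ASN.1 string types that can hold text."""
    if all(ch in PRINTABLE_STRING_CHARS for ch in text):
        return "PrintableString"
    if all(ord(ch) < 128 for ch in text):
        return "IA5String"  # treated here as ASCII
    if all(ord(ch) <= 0xFFFF for ch in text):
        return "BMPString"
    return "UniversalString"


if __name__ == "__main__":
    for sample in ("Isode Ltd", "user@example.com", "M\u00fcller"):
        print(sample, "->", narrowest_type(sample))
```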

Both X.400 Messaging and X.500 Directory services use some of these types.

Use of Character Sets in Messaging

X.400 items are based on ASN.1. X.400 addresses use values which are PrintableStrings or TeletexStrings. X.400 text messages can be IA5String, TeletexString or GeneralString.

Internet mail is based around ASCII. Only ASCII can be used in message headers. Addresses must be ASCII.

In some places in message headers non-ASCII can be included by using an encoding which specifies the character values in ASCII. For instance:

=?ISO-8859-1?Q?M=FCller?=

encodes the name Müller, where the second letter is actually a 'u' with an umlaut. (See RFC 2047).
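The encoded word above can be decoded with the `email.header` module from Python's standard library. This is a minimal sketch; the function name `decode_encoded_words` is invented for this example.

```python
from email.header import decode_header


def decode_encoded_words(header_value):
    """Turn RFC 2047 encoded words in a header value into text."""
    parts = []
    for value, charset in decode_header(header_value):
        # Decoded parts arrive as bytes plus the charset named in the
        # encoded word; unencoded headers arrive as plain strings.
        if isinstance(value, bytes):
            value = value.decode(charset or "ascii")
        parts.append(value)
    return "".join(parts)


if __name__ == "__main__":
    print(decode_encoded_words("=?ISO-8859-1?Q?M=FCller?="))
```

The encoded word carries its own character set name, so the receiver does not have to guess how to interpret the byte value FC.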

In message text, the character set used, if not ASCII, must be specified. Otherwise the receiver has no way of knowing what sequence of characters is represented by the sequence of bytes. The ISO 8859 sets are commonly used. The use of charset="ISO-2022-JP" means that the byte sequence can switch, using ISO 2022 techniques, between 4 character sets, including two Japanese multibyte character sets.
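A sketch of labelling message text with its character set, using `email.message.EmailMessage` from Python's standard library, is given below. The subject, body text and function name `build_message` are invented for this example.

```python
from email.message import EmailMessage


def build_message(body, charset):
    """Build a text message whose body is labelled with its charset."""
    msg = EmailMessage()
    msg["Subject"] = "Test"
    # The charset is recorded as a parameter of the Content-Type header.
    msg.set_content(body, charset=charset)
    return msg


if __name__ == "__main__":
    msg = build_message("Gr\u00fc\u00dfe aus K\u00f6ln\n", "iso-8859-1")
    print(msg["Content-Type"])
    print(msg.get_content())
```

The receiver reads the charset parameter to decide how to turn the body bytes back into characters.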

Use of Character Sets in Directory Services

A Directory Entry has a number of attributes. Each attribute type has a syntax. That syntax defines the kind of values which can be held. The most common attribute types have the syntax 'DirectoryString', which holds text.

X.500 is also based on ASN.1. In X.500 values of DirectoryString can be held in one of these types:

  • PrintableString
  • TeletexString
  • BMPString
  • UniversalString

The method of transfer of the value can distinguish between these, so the receiver knows how to interpret the value. TeletexString can hold some multibyte character sets, but it is believed that these are only Japanese.

When DirectoryString values are compared, it is the sequence of characters that is compared, not the sequence of bytes. Therefore you can compare, for instance, a value which is held in a PrintableString with one held in a BMPString.
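Comparing characters rather than bytes can be sketched in Python as follows. The sketch assumes that the content bytes of a BMPString are 16 bit values with the high byte first and those of a UniversalString are 32 bit values with the high byte first. TeletexString is omitted, and real Directory matching rules (for example ignoring case) are not shown. The names `ENCODINGS` and `same_characters` are invented for this example.

```python
# Assumed byte encodings of the content of each string type.
ENCODINGS = {
    "PrintableString": "ascii",
    "BMPString": "utf-16-be",
    "UniversalString": "utf-32-be",
}


def same_characters(a, b):
    """Compare two (type name, content bytes) values by characters."""
    (type_a, octets_a), (type_b, octets_b) = a, b
    text_a = octets_a.decode(ENCODINGS[type_a])
    text_b = octets_b.decode(ENCODINGS[type_b])
    return text_a == text_b


if __name__ == "__main__":
    printable = ("PrintableString", b"Smith")
    bmp = ("BMPString", "Smith".encode("utf-16-be"))
    print(printable[1] == bmp[1])            # the bytes differ
    print(same_characters(printable, bmp))   # the characters match
```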

LDAP is based on X.500. However, the values transferred are simpler. For attribute values with the syntax DirectoryString, in LDAPv2 (the most commonly implemented version at the time of writing) the value is either the PrintableString or TeletexString value as a sequence of bytes. (PrintableString characters are a subset of TeletexString characters, so there is no ambiguity). It is not defined how other values are transferred. So there is a problem with values that cannot be held in a TeletexString.

LDAPv3 (see RFC 2251 and RFC 2252) defines the value that is transferred for the DirectoryString syntax as using UTF-8. This is different from LDAPv2 but does mean that any ISO 10646 character can be represented.

Using Characters in Applications

It is generally straightforward in a server to deal with different character sets. If it is desired to compare values using different character sets, then knowledge of the different character sets is required. Such knowledge is also required if the server is to convert from one character set to another. Such conversion is commonly required in gateways between different messaging systems.

The main problems in dealing with character sets are in user applications. Enabling the user to input characters from different character sets, and rendering different characters on the display device, are non-trivial tasks. If the user application is to communicate with a server using a standard protocol, then the user application must convert from any proprietary internal form for the characters to one which is suitable for the protocol being used.

For instance, a Directory user agent in Poland might be using ISO 8859-2. It would need to convert from the character set received in values from the server to this for display. When the user types a key, or keys, giving a character value encoded using ISO 8859-2, if this is sent to the server, it must be converted in the reverse direction.
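The two conversions in this example can be sketched in Python. The sketch assumes that the server sends and expects UTF-8, as LDAPv3 does for DirectoryString values, and it replaces characters that ISO 8859-2 cannot represent with a question mark; a real user agent might prefer a different fallback. The names `LOCAL_CHARSET`, `from_server` and `to_server` are invented for this example.

```python
LOCAL_CHARSET = "iso8859_2"  # the character set used for display and input


def from_server(utf8_value):
    """Convert a UTF-8 value from the server into local bytes for display."""
    text = utf8_value.decode("utf-8")
    # Characters outside ISO 8859-2 become '?' rather than raising an error.
    return text.encode(LOCAL_CHARSET, errors="replace")


def to_server(local_bytes):
    """Convert locally typed bytes into UTF-8 for sending to the server."""
    return local_bytes.decode(LOCAL_CHARSET).encode("utf-8")


if __name__ == "__main__":
    received = "\u0141\u00f3d\u017a".encode("utf-8")  # the city name Lodz
    local = from_server(received)
    print(local.hex(), to_server(local) == received)
```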

Application building tools are beginning to perform some of this. Commonly, the application needs to use a standard character set (Unicode or UTF-8 are the common choices). The application builder then provides the means for displaying and inputting characters in a way that can be adapted to the local environment.

Copyright © 2008 Isode