The UTF8 structure

The UTF8 structure provides support for working with UTF-8 encoded strings. UTF-8 is a way to represent Unicode code points in an 8-bit character type while being backward compatible with the ASCII encoding for 7-bit characters. The encoding scheme uses one to four bytes as follows:

Wide Character Bits Byte 0 Byte 1 Byte 2 Byte 3

00000 00000000 0xxxxxxx

0xxxxxxx

00000 00000yyy yyxxxxxx

110yyyyy

10xxxxxx

00000 zzzzyyyy yyxxxxxx

1110zzzz

10yyyyyy

10xxxxxx

wwwzz zzzzyyyy yyxxxxxx

11110www

10zzzzzz

10yyyyyy

10xxxxxx

There are three additional well-formedness restrictions on UTF-8 encodings that were introduced in the Unicode 3.1 and 3.2 standards.

  • Characters cannot be larger than 0x10FFFF (the maximum code point).

  • Characters must be in the shortest encoding for the codepoint (e.g., using two bytes to encode an ASCII character is invalid).

  • Surogate pairs should be encoded as a single three-byte character instead of as two three-byte sequences.

Synopsis

signature UTF8
structure UTF8 :> UTF8

Interface

type wchar = word

val maxCodePoint : wchar

exception Incomplete
exception Invalid

val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader

val encode : wchar -> string

val isAscii : wchar -> bool
val toAscii : wchar -> char
val fromAscii : char -> wchar

val toString : wchar -> string

val size : string -> int

val size' : substring -> int

val explode : string -> wchar list
val implode : wchar list -> string

val map : (wchar -> wchar) -> string -> string
val app : (wchar -> unit) -> string -> unit
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
val all : (wchar -> bool) -> string -> bool
val exists : (wchar -> bool) -> string -> bool

Description

type wchar = word

The type of a Unicode code point.
Note that we use the word type for this because SML/NJ does not currently have a wide-character type. If such a type is introduced, then this type definition will likely change.

val maxCodePoint : wchar

The maximum code point in the Unicode character set (0wx10FFFF).

exception Incomplete

This exception is raised when certain operations are applied to incomplete strings (i.e., strings that end in the middle of multi-byte UTF-8 character encoding).

exception Invalid

This exception is raised when invalid UTF-8 encodings, such as non-shortest-length encodings, are encountered.

val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader

getu getc returns a wide-character reader for the character reader getc. The resulting reader raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding.

val encode : wchar -> string

encode wc returns the UTF-8 encoding of the wide character wc. This expression raises the Invalid exception if wc is greater than the maximum Unicode code point.

val isAscii : wchar -> bool

isAscii wc returns true if, and only if, wc is an ASCII character.

val toAscii : wchar -> char (* truncates to 7-bits *)

toAscii wc converts wc to an 8-bit character by truncating wc to its low seven bits.

val fromAscii : char -> wchar (* truncates to 7-bits *)

toAscii c converts the 8-bit character c to a wide character in the ASCII range (the high bit of c is ignored).

val toString : wchar -> string

toString wc returns a printable string representation of a wide character as a Unicode escape sequence.

val size : string -> int

size s returns the number of UTF-8 encoded Unicode characters in the string s. This expression raises the Incomplete exception if an incomplete character is encountered.

val size : string -> int

size s returns the number of UTF-8 encoded Unicode characters in the string s. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding.

val size' : substring -> int

size' ss returns the number of UTF-8 encoded Unicode characters in the substring ss. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding.

val explode : string -> wchar list

explode s returns the list of UTF-8 encoded Unicode characters that comprise the string s.

val implode : wchar list -> string

implode wcs returns the UTF-8 encoded string that represents the list wcs of Unicode code points. This expression raises the Invalid exception if it encounters an invalid encoding.

val map : (wchar -> wchar) -> string -> string

map f s maps the function f over the UTF-8 encoded characters in the string s to produce a new UTF-8 string. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding. It is equivalent to the expression

implode (List.map f (explode s))
val app : (wchar -> unit) -> string -> unit

app f s applies the function f to the UTF-8 encoded characters in the string s. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding. It is equivalent to the expression

List.app f (explode s)
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a

fold f init s folds a function from left-to-right over the UTF-8 encoded characters in the string. Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding. It is equivalent to the expression

List.foldl f init (explode s)
val all : (wchar -> bool) -> string -> bool

all pred s returns true if, and only if, the function pred returns true for all of the UTF-8 encoded characters in the string. It short-circuits evaluation as soon as a character is encountered for which pred returns false. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding. It is equivalent to the expression

List.all pred (explode s)

when s only contains complete characters.

val exists : (wchar -> bool) -> string -> bool

exists pred s returns true if, and only if, the function pred returns true for at least one UTF-8 encoded character in the string s. It short-circuits evaluation as soon as a character is encountered for which pred returns true. This expression raises the Incomplete exception if it encounters an incomplete UTF-8 character and it raises the Invalid exception if it encounters an invalid encoding. It is equivalent to the expression

List.exists pred (explode s)

when s only contains complete characters.

See Also