The `UTF8` structure provides support for working with UTF-8 encoded strings. UTF-8 is a way to represent Unicode code points in an 8-bit character type while being backward compatible with the ASCII encoding for 7-bit characters. The encoding scheme uses one to four bytes as follows:
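Wide Character Bits | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
---|---|---|---|---|
`00000000 00000000 0xxxxxxx` | `0xxxxxxx` | | | |
`00000000 00000yyy yyxxxxxx` | `110yyyyy` | `10xxxxxx` | | |
`00000000 zzzzyyyy yyxxxxxx` | `1110zzzz` | `10yyyyyy` | `10xxxxxx` | |
`000wwwzz zzzzyyyy yyxxxxxx` | `11110www` | `10zzzzzz` | `10yyyyyy` | `10xxxxxx` |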
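As a rough illustration of the byte layout used by this encoding scheme, the following sketch encodes a code point using only the Basis Library. The name `encodeSketch` is hypothetical and this is not the library's implementation of `encode`; in particular, it reports out-of-range values with `Fail` rather than `Invalid`, and it does not reject surrogate code points.

```sml
(* Hypothetical helper, not part of the UTF8 structure: encodes a code
 * point as one to four bytes following the UTF-8 byte layout. *)
fun encodeSketch (wc : word) : string = let
      (* convert a word in the range 0..255 to a character *)
      fun byte w = Char.chr (Word.toInt w)
      (* continuation byte 10xxxxxx: six payload bits starting at bit `shift` *)
      fun cont shift =
            byte (Word.orb (0wx80, Word.andb (Word.>> (wc, shift), 0wx3F)))
      (* leading byte: length tag combined with the high-order payload bits *)
      fun lead (tag, shift) = byte (Word.orb (tag, Word.>> (wc, shift)))
      in
        if wc <= 0wx7F
          then String.str (byte wc)
        else if wc <= 0wx7FF
          then String.implode [lead (0wxC0, 0w6), cont 0w0]
        else if wc <= 0wxFFFF
          then String.implode [lead (0wxE0, 0w12), cont 0w6, cont 0w0]
        else if wc <= 0wx10FFFF
          then String.implode [lead (0wxF0, 0w18), cont 0w12, cont 0w6, cont 0w0]
        else raise Fail "code point out of range"
      end
```

For example, `encodeSketch 0wx20AC` (the euro sign) evaluates to the three-byte string `"\226\130\172"`, that is, the bytes `E2 82 AC`.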
There are three additional well-formedness restrictions on UTF-8 encodings that were introduced in the Unicode 3.1 and 3.2 standards.
- Characters cannot be larger than `0x10FFFF` (the maximum code point).
- Characters must be in the shortest encoding for the code point (e.g., using two bytes to encode an ASCII character is invalid).
- A character that UTF-16 represents as a surrogate pair must be encoded as a single four-byte sequence instead of as two three-byte sequences (one per surrogate).
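The restrictions above can be expressed as a small predicate over a decoded sequence. This is only a sketch under stated assumptions: the names `minLength`, `isSurrogate`, and `isWellFormed` are hypothetical and not part of the `UTF8` structure, and the surrogate range `0xD800` to `0xDFFF` comes from the Unicode standard rather than from this page.

```sml
(* Hypothetical helpers, not part of the UTF8 structure. *)

(* number of bytes in the shortest UTF-8 encoding of a code point *)
fun minLength (wc : word) : int =
      if wc <= 0wx7F then 1
      else if wc <= 0wx7FF then 2
      else if wc <= 0wxFFFF then 3
      else 4

(* code points reserved for UTF-16 surrogates *)
fun isSurrogate (wc : word) : bool = 0wxD800 <= wc andalso wc <= 0wxDFFF

(* given the number of bytes a sequence occupied and the value it decoded
 * to, check the three well-formedness restrictions *)
fun isWellFormed (nBytes : int, wc : word) : bool =
      wc <= 0wx10FFFF                 (* within the Unicode code space *)
      andalso nBytes = minLength wc   (* shortest encoding only *)
      andalso not (isSurrogate wc)    (* no separately encoded surrogates *)
```

With these definitions, `isWellFormed (2, 0wx41)` is `false` (an overlong ASCII character), `isWellFormed (3, 0wxD83D)` is `false` (a surrogate encoded on its own), and `isWellFormed (4, 0wx1F600)` is `true`.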
Synopsis
signature UTF8
structure UTF8 :> UTF8
Interface
type wchar = word
val maxCodePoint : wchar
exception Incomplete
exception Invalid
val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
val encode : wchar -> string
val isAscii : wchar -> bool
val toAscii : wchar -> char
val fromAscii : char -> wchar
val toString : wchar -> string
val size : string -> int
val size' : substring -> int
val explode : string -> wchar list
val implode : wchar list -> string
val map : (wchar -> wchar) -> string -> string
val app : (wchar -> unit) -> string -> unit
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
val all : (wchar -> bool) -> string -> bool
val exists : (wchar -> bool) -> string -> bool
Description
type wchar = word
- The type of a Unicode code point. Note that we use the `word` type for this because SML/NJ does not currently have a wide-character type. If such a type is introduced, then this type definition will likely change.
val maxCodePoint : wchar
- The maximum code point in the Unicode character set (`0wx10FFFF`).
exception Incomplete
- This exception is raised when certain operations are applied to incomplete strings (i.e., strings that end in the middle of a multi-byte UTF-8 character encoding).
exception Invalid
- This exception is raised when invalid UTF-8 encodings, such as non-shortest-length encodings, are encountered.
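A caller will often want to tell truncated input apart from malformed input. The following is a minimal sketch of that distinction; the `result` datatype and the name `safeSize` are hypothetical, and the code assumes that the `UTF8` structure is in scope (for example, after loading `$/smlnj-lib.cm` through CM).

```sml
(* Hypothetical wrapper; assumes the UTF8 structure is in scope. *)
datatype result
  = Count of int     (* the string was well formed *)
  | Truncated        (* the string ended inside a multi-byte character *)
  | Malformed        (* the string contained an invalid encoding *)

fun safeSize (s : string) : result =
      Count (UTF8.size s)
        handle UTF8.Incomplete => Truncated
             | UTF8.Invalid => Malformed
```

For the well-formed string `"caf\195\169"`, `safeSize` returns `Count 4`. Going by the description of `Incomplete`, a string cut off after the `\195` byte would be expected to yield `Truncated`.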
val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
- `getu getc` returns a wide-character reader for the character reader `getc`. The resulting reader raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding.
val encode : wchar -> string
- `encode wc` returns the UTF-8 encoding of the wide character `wc`. This expression raises the `Invalid` exception if `wc` is greater than the maximum Unicode code point.
val isAscii : wchar -> bool
- `isAscii wc` returns `true` if, and only if, `wc` is an ASCII character.
val toAscii : wchar -> char (* truncates to 7-bits *)
- `toAscii wc` converts `wc` to an 8-bit character by truncating `wc` to its low seven bits.
val fromAscii : char -> wchar (* truncates to 7-bits *)
- `fromAscii c` converts the 8-bit character `c` to a wide character in the ASCII range (the high bit of `c` is ignored).
val toString : wchar -> string
- `toString wc` returns a printable string representation of a wide character as a Unicode escape sequence.
val size : string -> int
- `size s` returns the number of UTF-8 encoded Unicode characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding.
val size' : substring -> int
- `size' ss` returns the number of UTF-8 encoded Unicode characters in the substring `ss`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding.
val explode : string -> wchar list
- `explode s` returns the list of UTF-8 encoded Unicode characters that comprise the string `s`.
val implode : wchar list -> string
- `implode wcs` returns the UTF-8 encoded string that represents the list `wcs` of Unicode code points. This expression raises the `Invalid` exception if it encounters an invalid encoding.
val map : (wchar -> wchar) -> string -> string
- `map f s` maps the function `f` over the UTF-8 encoded characters in the string `s` to produce a new UTF-8 string. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `implode (List.map f (explode s))`.
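As an illustration of `map` and of its stated equivalence to `implode`, `List.map`, and `explode`, the following sketch upper-cases the ASCII letters of a UTF-8 string and leaves every other code point untouched. The names `upcase`, `upcaseAscii`, and `upcaseAscii'` are hypothetical, and the code assumes the `UTF8` structure is in scope.

```sml
(* Hypothetical example; assumes the UTF8 structure is in scope. *)

(* upper-case a code point if it is ASCII, otherwise leave it alone *)
fun upcase (wc : UTF8.wchar) : UTF8.wchar =
      if UTF8.isAscii wc
        then UTF8.fromAscii (Char.toUpper (UTF8.toAscii wc))
        else wc

(* direct version using UTF8.map *)
fun upcaseAscii (s : string) : string = UTF8.map upcase s

(* the same computation written out via explode and implode *)
fun upcaseAscii' (s : string) : string =
      UTF8.implode (List.map upcase (UTF8.explode s))
```

For the input `"caf\195\169"`, both functions return `"CAF\195\169"`: the two-byte encoding of U+00E9 passes through unchanged because `upcase` only rewrites ASCII code points.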
val app : (wchar -> unit) -> string -> unit
- `app f s` applies the function `f` to the UTF-8 encoded characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.app f (explode s)`.
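A typical use of `app` is to perform a side effect once per code point. The sketch below prints the printable representation of each code point on its own line; the name `printCodePoints` is hypothetical, the code assumes the `UTF8` structure is in scope, and the exact escape format produced by `toString` is not shown here because this page does not specify it.

```sml
(* Hypothetical example; assumes the UTF8 structure is in scope. *)
fun printCodePoints (s : string) : unit =
      UTF8.app (fn wc => print (UTF8.toString wc ^ "\n")) s
```

Calling `printCodePoints "a\195\169"` prints two lines, one for each of the two code points in the string.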
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
- `fold f init s` folds the function `f` from left to right over the UTF-8 encoded characters in the string `s`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.foldl f init (explode s)`.
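To show how a left-to-right fold relates to the `getu` reader, here is a sketch of a fold written directly on top of `getu` and `Substring.getc`, followed by a small use of the library's own `fold`. The names `foldSketch` and `countNonAscii` are hypothetical, and `foldSketch` is not claimed to be how the library implements `fold`; the code assumes the `UTF8` structure is in scope.

```sml
(* Hypothetical examples; assume the UTF8 structure is in scope. *)

(* a left-to-right fold built from a wide-character reader over substrings *)
fun foldSketch (f : UTF8.wchar * 'a -> 'a) (init : 'a) (s : string) : 'a = let
      val getWC = UTF8.getu Substring.getc
      fun loop (ss, acc) = (case getWC ss
             of NONE => acc
              | SOME (wc, ss') => loop (ss', f (wc, acc))
            (* end case *))
      in
        loop (Substring.full s, init)
      end

(* count the code points outside the ASCII range using the library's fold *)
fun countNonAscii (s : string) : int =
      UTF8.fold (fn (wc, n) => if UTF8.isAscii wc then n else n + 1) 0 s
```

For the string `"na\195\175ve caf\195\169"`, `countNonAscii` returns `2`, and `foldSketch (fn (_, n) => n + 1) 0` returns `10`, the same value as `UTF8.size` for that string.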
val all : (wchar -> bool) -> string -> bool
- `all pred s` returns `true` if, and only if, the function `pred` returns `true` for all of the UTF-8 encoded characters in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `false`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.all pred (explode s)` when `s` only contains complete characters.
val exists : (wchar -> bool) -> string -> bool
- `exists pred s` returns `true` if, and only if, the function `pred` returns `true` for at least one UTF-8 encoded character in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `true`. This expression raises the `Incomplete` exception if it encounters an incomplete UTF-8 character and it raises the `Invalid` exception if it encounters an invalid encoding. It is equivalent to the expression `List.exists pred (explode s)` when `s` only contains complete characters.
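The two predicates `all` and `exists` combine naturally with `isAscii`. The following sketch uses hypothetical names (`isAllAscii`, `hasNonAscii`) and assumes the `UTF8` structure is in scope.

```sml
(* Hypothetical examples; assume the UTF8 structure is in scope. *)

(* true when every code point in the string is ASCII *)
fun isAllAscii (s : string) : bool = UTF8.all UTF8.isAscii s

(* true when at least one code point lies outside the ASCII range *)
fun hasNonAscii (s : string) : bool = UTF8.exists (not o UTF8.isAscii) s
```

For `"cafe"`, `isAllAscii` is `true` and `hasNonAscii` is `false`; for `"caf\195\169"`, the results are reversed. Because of the short-circuiting described above, an ill-formed sequence that appears after the deciding character may go unreported.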