The `UTF8` structure provides support for working with UTF-8 encoded strings. UTF-8 is a way to represent Unicode code points in an 8-bit character type while being backward compatible with the ASCII encoding for 7-bit characters. The encoding scheme uses one to four bytes as follows:
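Wide Character Bits | Byte 0 | Byte 1 | Byte 2 | Byte 3 |
---|---|---|---|---|
00000000 00000000 0xxxxxxx | 0xxxxxxx | | | |
00000000 00000yyy yyxxxxxx | 110yyyyy | 10xxxxxx | | |
00000000 zzzzyyyy yyxxxxxx | 1110zzzz | 10yyyyyy | 10xxxxxx | |
000wwwzz zzzzyyyy yyxxxxxx | 11110www | 10zzzzzz | 10yyyyyy | 10xxxxxx |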
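
The one-to-four-byte layout can be made concrete with a small sketch written against the SML Basis Library only. The function `encodeSketch` and its local helpers are hypothetical names introduced for illustration; this is not the library's implementation of `encode`, and unlike a strict encoder it does not reject surrogate code points.

```sml
(* Hypothetical helper, not part of the UTF8 structure: encodes one code
 * point using the standard UTF-8 byte layout. *)
fun encodeSketch (w : word) : string = let
      (* continuation byte 10xxxxxx holding six bits of w, starting at bit `shift` *)
      fun cont shift = Word.orb (0wx80, Word.andb (Word.>> (w, shift), 0wx3F))
      (* lead byte: the length tag combined with the remaining high bits of w *)
      fun lead (tag, shift) = Word.orb (tag, Word.>> (w, shift))
      fun bytes ws = String.implode (List.map (fn b => Char.chr (Word.toInt b)) ws)
      in
        if w <= 0wx7F then bytes [w]
        else if w <= 0wx7FF then bytes [lead (0wxC0, 0w6), cont 0w0]
        else if w <= 0wxFFFF then bytes [lead (0wxE0, 0w12), cont 0w6, cont 0w0]
        else if w <= 0wx10FFFF
          then bytes [lead (0wxF0, 0w18), cont 0w12, cont 0w6, cont 0w0]
        else raise Domain
      end

(* U+20AC (euro sign) needs three bytes: 0xE2 0x82 0xAC *)
val euro = encodeSketch 0wx20AC
```

The cascade of range checks mirrors the four cases of the scheme: 7, 11, 16, and 21 payload bits respectively.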
Synopsis
signature UTF8
structure UTF8 :> UTF8
Interface
type wchar = word
val maxCodePoint : wchar
exception Incomplete
val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
val encode : wchar -> string
val isAscii : wchar -> bool
val toAscii : wchar -> char
val fromAscii : char -> wchar
val toString : wchar -> string
val size : string -> int
val explode : string -> wchar list
val implode : wchar list -> string
val map : (wchar -> wchar) -> string -> string
val app : (wchar -> unit) -> string -> unit
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
val all : (wchar -> bool) -> string -> bool
val exists : (wchar -> bool) -> string -> bool
Description
type wchar = word
The type of a Unicode code point. Note that the `word` type is used for this because SML/NJ does not currently have a wide-character type. If such a type is introduced, then this type definition will likely change.
val maxCodePoint : wchar
The maximum code point in the Unicode character set (`0wx10FFFF`).
exception Incomplete
This exception is raised when certain operations are applied to incomplete strings (i.e., strings that end with a partial UTF-8 character encoding).
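
As a minimal sketch of handling this exception, the wrapper below turns it into an option value. The name `safeSize` is hypothetical, and the code assumes the SML/NJ Library has been loaded so that the `UTF8` structure is in scope.

```sml
(* Assumes the SML/NJ Library is available, e.g. after
 *   CM.make "$/smlnj-lib.cm";
 * in the interactive system. *)

(* Hypothetical wrapper: NONE when the string ends in a partial character. *)
fun safeSize (s : string) : int option =
      SOME (UTF8.size s) handle UTF8.Incomplete => NONE

(* "\195\169" is the complete two-byte encoding of U+00E9 *)
val complete = safeSize "\195\169"
(* "\195" alone is a lead byte whose continuation byte is missing *)
val partial = safeSize "\195"
```

Going by the description above, `complete` should be `SOME 1` and `partial` should be `NONE`.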
val getu : (char, 'strm) StringCvt.reader -> (wchar, 'strm) StringCvt.reader
`getu getc` returns a wide-character reader for the character reader `getc`. The resulting reader will raise the `Incomplete` exception if it encounters an incomplete UTF-8 character.
val encode : wchar -> string
`encode wc` returns the UTF-8 encoding of the wide character `wc`. This expression raises the `Domain` exception if `wc` is greater than the maximum Unicode code point.
val isAscii : wchar -> bool
`isAscii wc` returns `true` if, and only if, `wc` is an ASCII character.
val toAscii : wchar -> char (* truncates to 7-bits *)
`toAscii wc` converts `wc` to an 8-bit character by truncating `wc` to its low seven bits.
val fromAscii : char -> wchar (* truncates to 7-bits *)
`fromAscii c` converts the 8-bit character `c` to a wide character in the ASCII range (the high bit of `c` is ignored).
val toString : wchar -> string
`toString wc` returns a printable string representation of the wide character `wc` as a Unicode escape sequence.
val size : string -> int
`size s` returns the number of UTF-8 encoded Unicode characters in the string `s`. This expression raises the `Incomplete` exception if an incomplete character is encountered.
val explode : string -> wchar list
`explode s` returns the list of UTF-8 encoded Unicode characters that comprise the string `s`.
val implode : wchar list -> string
`implode wcs` returns the UTF-8 encoded string that represents the list `wcs` of Unicode code points. This expression raises the `Domain` exception if any character in the list is greater than the maximum Unicode code point.
val map : (wchar -> wchar) -> string -> string
`map f s` maps the function `f` over the UTF-8 encoded characters in the string `s` to produce a new UTF-8 string. This expression raises the `Incomplete` exception if an incomplete character is encountered and the `Domain` exception if `f` returns a value that is greater than the maximum Unicode code point. It is equivalent to the expression `implode (List.map f (explode s))`.
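
The following sketch shows one way `map` might be combined with the ASCII conversion functions, and spells out the stated equivalence with `explode` and `implode`. The name `upperAscii` is hypothetical, and the code assumes the SML/NJ Library has been loaded.

```sml
(* Hypothetical helper: upper-case ASCII letters and leave every other
 * code point unchanged. *)
fun upperAscii (wc : UTF8.wchar) : UTF8.wchar =
      if UTF8.isAscii wc
        then UTF8.fromAscii (Char.toUpper (UTF8.toAscii wc))
        else wc

(* "caf" followed by U+00E9, which is encoded as the bytes 195 169 *)
val cafe = "caf\195\169"

val viaMap = UTF8.map upperAscii cafe
(* the same computation written with explode and implode *)
val viaList = UTF8.implode (List.map upperAscii (UTF8.explode cafe))
```

Guarding `toAscii` with `isAscii` matters here, because `toAscii` truncates to seven bits and would otherwise mangle non-ASCII code points.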
val app : (wchar -> unit) -> string -> unit
`app f s` applies the function `f` to the UTF-8 encoded characters in the string `s`. This expression raises the `Incomplete` exception if an incomplete character is encountered. It is equivalent to the expression `List.app f (explode s)`.
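
A small illustration of `app`, assuming the SML/NJ Library has been loaded: the hypothetical function `printCodePoints` prints one line per code point, using `toString` for the printable form.

```sml
(* Hypothetical helper: print the escape-sequence form of each code point
 * in s, one per line. *)
fun printCodePoints (s : string) : unit =
      UTF8.app (fn wc => print (UTF8.toString wc ^ "\n")) s

(* Prints three lines: one each for 'h', 'i', and U+00E9 *)
val () = printCodePoints "hi\195\169"
```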
val fold : ((wchar * 'a) -> 'a) -> 'a -> string -> 'a
`fold f init s` folds the function `f` from left to right over the UTF-8 encoded characters in the string `s`, starting with the initial value `init`. This expression raises the `Incomplete` exception if an incomplete character is encountered. It is equivalent to the expression `List.foldl f init (explode s)`.
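
As a sketch of a typical use of `fold`, the hypothetical function `countNonAscii` counts the code points that fall outside the ASCII range without building an intermediate list. It assumes the SML/NJ Library has been loaded.

```sml
(* Hypothetical helper: number of code points in s that are not ASCII. *)
fun countNonAscii (s : string) : int =
      UTF8.fold (fn (wc, n) => if UTF8.isAscii wc then n else n + 1) 0 s

(* "na" ^ U+00EF ^ "ve": five code points, one of them non-ASCII *)
val n = countNonAscii "na\195\175ve"
```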
val all : (wchar -> bool) -> string -> bool
`all pred s` returns `true` if, and only if, the function `pred` returns `true` for all of the UTF-8 encoded characters in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `false`. This expression raises the `Incomplete` exception if an incomplete character is encountered. It is equivalent to the expression `List.all pred (explode s)` when `s` only contains complete characters.
val exists : (wchar -> bool) -> string -> bool
`exists pred s` returns `true` if, and only if, the function `pred` returns `true` for at least one UTF-8 encoded character in the string `s`. It short-circuits evaluation as soon as a character is encountered for which `pred` returns `true`. This expression raises the `Incomplete` exception if an incomplete character is encountered. It is equivalent to the expression `List.exists pred (explode s)` when `s` only contains complete characters.
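
The two predicates can be sketched in use as follows. The names `isAllAscii` and `hasSupplementary` are hypothetical, and the code assumes the SML/NJ Library has been loaded.

```sml
(* Hypothetical helper: true when every code point in s is ASCII. *)
fun isAllAscii (s : string) : bool = UTF8.all UTF8.isAscii s

(* Hypothetical helper: true when s contains a code point above U+FFFF,
 * i.e. one that needs the four-byte form of the encoding. *)
fun hasSupplementary (s : string) : bool =
      UTF8.exists (fn (wc : UTF8.wchar) => wc > 0wxFFFF) s

val plain = isAllAscii "hello"
(* "\240\159\152\128" is the four-byte encoding of U+1F600 *)
val emoji = hasSupplementary "ok \240\159\152\128"
```

Because both functions short-circuit, an incomplete character that lies after the deciding character may never be examined, which is why the list-based equivalences are stated only for strings of complete characters.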