2 unstable releases

| Version | Date         |
|---------|--------------|
| 0.2.0   | Oct 10, 2020 |
| 0.1.0   | Oct 9, 2020  |
#1038 in Encoding
63,001 downloads per month
Used in 10 crates (4 directly)
74KB · 1.5K SLoC
UTF-16 string types
This crate provides two string types to work with UTF-16 encoded bytes; they are directly analogous to how String and &str work with UTF-8 encoded bytes.
UTF-16 can be encoded in little- or big-endian byte order. This crate identifies which encoding the types contain using a generic byte-order type parameter, so the main types exposed are:
&WStr<ByteOrder>
WString<ByteOrder>
These types aim to behave very similarly to the standard library &str and String types. While many APIs are already covered, feel free to contribute more methods.
Documentation is at docs.rs. Currently a lot of the documentation is rather terse; in those cases it is best to refer to the matching methods on the string types in the standard library. Feel free to contribute more exhaustive in-line docs.
lib.rs:
A UTF-16 little-endian string type.
This crate provides two string types to handle UTF-16 encoded bytes directly as strings: WString and WStr. They are to UTF-16 exactly like String and str are to UTF-8. Some of the concepts and functions here are rather tersely documented; in this case you can look up their equivalents on String or str and the behaviour should be exactly the same, only the underlying byte encoding is different.
Thus WString is a type which owns the bytes containing the string. Just like String and the underlying Vec it is built on, it distinguishes length (WString::len) and capacity (WString::capacity). Here length is the number of bytes used while capacity is the number of bytes the string can hold without reallocating.
The WStr type does not own any bytes; it can only point to a slice of bytes containing valid UTF-16. As such you will only ever use it as a reference, &WStr, just as you only use str as &str.
The WString type implements Deref<Target = WStr<ByteOrder>>.
UTF-16 ByteOrder
UTF-16 encodes to unsigned 16-bit integers (u16), denoting code units. However, different CPU architectures encode these u16 integers using different byte orders: little-endian and big-endian. Thus when handling UTF-16 strings you need to be aware of the byte order of the encoding; the encoding variants are commonly known as UTF-16LE and UTF-16BE respectively.
For this crate this means the types need to be aware of the byte order, which is done using the byteorder::ByteOrder trait as a generic parameter to the types: WString<ByteOrder> and WStr<ByteOrder>, commonly written as WString<E> and WStr<E> where E stands for "endianness".
This crate exports BigEndian, BE, LittleEndian and LE in case you need to denote the type:
```rust
use utf16string::{BigEndian, BE, WString};

let s0: WString<BigEndian> = WString::from("hello");
assert_eq!(s0.len(), 10);

let s1: WString<BE> = WString::from("hello");
assert_eq!(s0, s1);
```
As these types can be a bit cumbersome to write, they can often be inferred, especially with the help of the shorthand constructors like WString::from_utf16le, WString::from_utf16be, WStr::from_utf16le, WStr::from_utf16be and related. For example:
```rust
use utf16string::{LE, WStr};

let b = b"h\x00e\x00l\x00l\x00o\x00";

let s0: &WStr<LE> = WStr::from_utf16(b)?;
let s1 = WStr::from_utf16le(b)?;
assert_eq!(s0, s1);
assert_eq!(s0.to_utf8(), "hello");
```
Dependencies: ~120KB