encoding-next-types

1 unstable release

0.2.0	Jun 30, 2022

#35 in #charset

287 downloads per month
Used in encoding-next

MIT license

20KB
266 lines

Interface to the character encoding.
Raw incremental interface
Methods whose names start with raw_ constitute the raw incremental interface,
the lowest-level API available for encoders and decoders.
This interface divides the entire input into four parts:
- Processed bytes do not affect the future result.
- Unprocessed bytes may affect the future result
and can become part of a problematic sequence, depending on the future input.
- The problematic byte is the first byte that causes an error condition.
- Remaining bytes have been neither processed nor read,
so the caller should feed any remaining bytes again.
The following figure illustrates an example of successive raw_feed calls:
```
1st raw_feed   :2nd raw_feed   :3rd raw_feed
----------+----:---------------:--+--+---------
          |    :               :  |  |
----------+----:---------------:--+--+---------
processed  unprocessed             |  remaining
                              problematic
```
Since these parts can span multiple inputs to raw_feed,
raw_feed returns two offsets (one optional)
with which the caller can track the problematic sequence.
The first offset (the first usize in the tuple) points to the first unprocessed byte,
or is zero when the unprocessed bytes started before the current call.
(The first unprocessed byte can also be at offset 0,
which makes no difference to the caller.)
The second offset (the upto field in the CodecError struct), if any,
points to the first remaining byte.
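To make the two offsets concrete, here is a minimal sketch of a toy decoder that follows the description above. It is not the crate's implementation: ToyDecoder, its String output parameter, and the one-field CodecError are hypothetical stand-ins defined only for this example, and the decoder handles nothing but ASCII and two-byte UTF-8 sequences.

```rust
/// Hypothetical stand-in for the crate's error type.
/// Only the `upto` field is taken from the description above.
#[derive(Debug, PartialEq)]
struct CodecError {
    /// Offset of the first remaining byte in the current input.
    upto: isize,
}

/// Toy decoder (an assumption for illustration): ASCII plus two-byte UTF-8.
#[derive(Default)]
struct ToyDecoder {
    /// A lead byte that has been read but not yet processed.
    lead: Option<u8>,
}

impl ToyDecoder {
    /// Returns (offset of the first unprocessed byte, optional error).
    fn raw_feed(&mut self, input: &[u8], output: &mut String) -> (usize, Option<CodecError>) {
        // Stays 0 while a sequence started in an earlier call is still open.
        let mut processed = 0;
        for (i, &b) in input.iter().enumerate() {
            match self.lead.take() {
                Some(lead) if (0x80..=0xBF).contains(&b) => {
                    // Valid continuation byte: the pending sequence is complete.
                    let cp = ((lead as u32 & 0x1F) << 6) | (b as u32 & 0x3F);
                    output.push(char::from_u32(cp).expect("two-byte range is valid"));
                    processed = i + 1;
                }
                Some(_) => {
                    // The pending lead byte is problematic; `b` itself is a
                    // remaining byte, so the caller must feed it again.
                    return (processed, Some(CodecError { upto: i as isize }));
                }
                None if b < 0x80 => {
                    output.push(b as char);
                    processed = i + 1;
                }
                None if (0xC2..=0xDF).contains(&b) => {
                    // Lead byte: unprocessed until the next byte arrives.
                    self.lead = Some(b);
                }
                None => {
                    // Unsupported byte: remaining bytes start right after it.
                    return (processed, Some(CodecError { upto: (i + 1) as isize }));
                }
            }
        }
        (processed, None)
    }
}
```

Feeding b"a\xC3" and then b"\xA9b" to this sketch splits one two-byte sequence across two calls: the first call reports offset 1 because the lead byte is still unprocessed, and the second reports offset 2, the full input length, once the sequence has been completed.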
If the caller needs to recover from the error using the problematic sequence,
then the caller starts saving the unprocessed bytes when the first offset is less than the input length,
appends any new unprocessed bytes while the first offset is zero,
and discards the saved unprocessed bytes when the first offset becomes non-zero,
while again saving new unprocessed bytes when the first offset is less than the input length.
Then the caller checks for the error condition
and can use the saved unprocessed bytes for error recovery.
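The bookkeeping described above can be sketched as a small helper that the caller runs after every raw_feed call. The function name track_unprocessed and its parameters are hypothetical. Passing upto as Option<usize> is a simplification of the error's upto field, and cutting the saved bytes off at upto is an assumption made so that remaining bytes are never saved.

```rust
/// Updates the caller's buffer of saved unprocessed bytes after one raw_feed call.
/// `first_offset` is the first value returned by raw_feed; `upto` is the second
/// offset when an error was reported (hypothetical, simplified signature).
fn track_unprocessed(
    saved: &mut Vec<u8>,
    input: &[u8],
    first_offset: usize,
    upto: Option<usize>,
) {
    // A non-zero first offset means the previously saved bytes were processed.
    if first_offset > 0 {
        saved.clear();
    }
    // On an error, bytes from `upto` onward are remaining, not unprocessed.
    let end = upto.unwrap_or(input.len()).min(input.len());
    // Save new unprocessed bytes; with a zero offset this appends to the old ones.
    if first_offset < end {
        saved.extend_from_slice(&input[first_offset..end]);
    }
}
```

When raw_feed then reports an error, the buffer holds the bytes of the problematic sequence seen so far, which the caller can inspect for recovery.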
Alternatively, if the caller only wants to replace the problematic sequence
with a fixed string (like U+FFFD),
then it can just discard the first sequence and can emit the fixed string on an error.
It still has to feed the input bytes starting at the second offset again.
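The replacement strategy can be sketched as a loop around any raw_feed-like function. The name decode_replacing and the closure signature are hypothetical: the closure returns the first offset plus the error's upto value, if any, and stands in for a real decoder. The sketch also assumes that the decoder makes progress or resets its state on every error, since otherwise an upto of zero would repeat forever. Any finishing step a real decoder requires at the end of input is left out.

```rust
/// Decodes `input`, replacing every problematic sequence with U+FFFD.
/// `raw_feed` is a stand-in for a decoder's raw_feed method (hypothetical signature):
/// it returns (first offset, Some(upto) on error).
fn decode_replacing<F>(mut raw_feed: F, input: &[u8]) -> String
where
    F: FnMut(&[u8], &mut String) -> (usize, Option<isize>),
{
    let mut output = String::new();
    let mut rest = input;
    loop {
        // The first offset is ignored: the problematic bytes are not needed.
        let (_first_offset, err) = raw_feed(rest, &mut output);
        match err {
            None => break,
            Some(upto) => {
                output.push('\u{FFFD}');
                // Feed the remaining bytes again, starting at the second offset.
                // The clamp guards against offsets outside the current input.
                let next = upto.clamp(0, rest.len() as isize) as usize;
                rest = &rest[next..];
            }
        }
    }
    output
}
```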

No runtime deps