1 unstable release
0.2.0 (Jun 30, 2022)
Used in encoding-next
Interface to the character encoding.

Raw incremental interface

Methods whose names start with `raw_` constitute the raw incremental interface, the lowest-level API available for encoders and decoders.
This interface divides the entire input into four parts:

- Processed bytes do not affect the future result.
- Unprocessed bytes may affect the future result and, depending on future input, can turn out to be part of a problematic sequence.
- The problematic byte is the first byte that causes an error condition.
- Remaining bytes have been neither processed nor read, so the caller should feed them again.

The following figure illustrates an example of successive `raw_feed` calls:

    1st raw_feed   :2nd raw_feed   :3rd raw_feed
    ----------+----:---------------:--+--+---------
              |    :               :  |  |
    ----------+----:---------------:--+--+---------
    processed  unprocessed             |  remaining
                                  problematic

Since these parts can span multiple inputs to `raw_feed`, `raw_feed` returns two offsets (one of them optional) with which the caller can track the problematic sequence. The first offset (the first `usize` in the tuple) points to the first unprocessed byte, or is zero when the unprocessed bytes started before the current call. (The first unprocessed byte can also be at offset 0, which makes no difference to the caller.) The second offset (the `upto` field in the `CodecError` struct), if any, points to the first remaining byte.

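To make the two offsets concrete, here is a minimal sketch built around a toy decoder. `ToyDecoder` and `ToyError` are hypothetical stand-ins, not this crate's types: the toy handles only ASCII and two-byte UTF-8 sequences, writes into a plain `String`, and copies only the shape of the return value described above (a first offset plus an optional error carrying `upto`).

```rust
/// Hypothetical error type; it only mirrors the `upto` idea from the text.
#[derive(Debug, PartialEq)]
pub struct ToyError {
    /// Offset of the first remaining byte in the current input.
    pub upto: usize,
}

/// Toy decoder for ASCII plus two-byte UTF-8 sequences (an assumption made
/// for illustration, not the crate's decoder).
#[derive(Default)]
pub struct ToyDecoder {
    /// Lead byte that is still waiting for its continuation byte.
    lead: Option<u8>,
}

impl ToyDecoder {
    /// Returns the offset of the first unprocessed byte in `input` (zero if
    /// the unprocessed bytes started in an earlier call) and, on error, the
    /// offset of the first remaining byte.
    pub fn raw_feed(&mut self, input: &[u8], out: &mut String) -> (usize, Option<ToyError>) {
        let mut start = 0; // everything before `start` has been processed
        for (i, &b) in input.iter().enumerate() {
            match (self.lead.take(), b) {
                (None, 0x00..=0x7F) => {
                    out.push(char::from(b));
                    start = i + 1;
                }
                // A lead byte stays unprocessed until its continuation arrives.
                (None, 0xC2..=0xDF) => self.lead = Some(b),
                (Some(lead), 0x80..=0xBF) => {
                    let cp = (u32::from(lead & 0x1F) << 6) | u32::from(b & 0x3F);
                    out.push(char::from_u32(cp).expect("U+0080..=U+07FF is always valid"));
                    start = i + 1;
                }
                // Any other byte is the problematic byte; the rest is remaining.
                _ => return (start, Some(ToyError { upto: i + 1 })),
            }
        }
        (start, None)
    }
}
```

Feeding `b"a\xC3"` to this toy returns `(1, None)`: the lead byte at offset 1 is unprocessed. If the next call receives `b"(z"`, it returns a first offset of 0 and an error with `upto` equal to 1, so the caller resumes with `z`.
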
If the caller needs to recover from the error using the problematic sequence, it tracks the unprocessed bytes across calls: when the first offset is non-zero, it discards any previously saved bytes; when the first offset is less than the input length, it saves the bytes from that offset onward, appending them to the saved bytes if the first offset is zero. The caller then checks for the error condition and can use the saved unprocessed bytes for error recovery.

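The bookkeeping described above can be sketched as a small helper that is called once per `raw_feed` result. The name `track_unprocessed` and its signature are hypothetical; the sketch assumes both offsets are already available as `usize` values and that `first <= upto <= input.len()`.

```rust
/// Updates `saved` from one `raw_feed`-style result. On error, returns the
/// whole problematic sequence (saved bytes plus the bytes of this input up to
/// `upto`) and leaves `saved` empty.
pub fn track_unprocessed(
    saved: &mut Vec<u8>,
    input: &[u8],
    first: usize,
    upto: Option<usize>,
) -> Option<Vec<u8>> {
    // A non-zero first offset means the earlier unprocessed bytes have been
    // processed in the meantime, so they are no longer needed.
    if first > 0 {
        saved.clear();
    }
    // Without an error, everything from `first` to the end is unprocessed;
    // with an error, the problematic sequence ends at `upto`.
    let end = upto.unwrap_or(input.len());
    saved.extend_from_slice(&input[first..end]);
    // Hand the collected bytes to the caller only when an error occurred.
    upto.map(|_| std::mem::take(saved))
}
```

A caller would invoke this after every call, passing the same input slice it fed, and use the returned bytes (if any) to decide how to recover.
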
Alternatively, if the caller only wants to replace the problematic sequence with a fixed string (such as U+FFFD), it can simply discard the unprocessed bytes instead of saving them and emit the fixed string on an error. It still has to feed the input again, starting at the second offset.
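
A minimal sketch of this replacement strategy follows. `decode_with_replacement` and `ascii_raw_feed` are hypothetical names: the first takes any `raw_feed`-like function that returns the first offset and an optional second offset, and the second is a toy ASCII-only stand-in for a real decoder. The sketch assumes the second offset always advances past at least one byte, and it ignores bytes that are still unprocessed when the input ends.

```rust
/// Decodes `input` with a `raw_feed`-style function, emitting U+FFFD for each
/// problematic sequence and resuming at the second offset.
pub fn decode_with_replacement<F>(mut raw_feed: F, input: &[u8]) -> String
where
    F: FnMut(&[u8], &mut String) -> (usize, Option<usize>),
{
    let mut out = String::new();
    let mut rest = input;
    loop {
        match raw_feed(rest, &mut out) {
            // No error: the whole slice was consumed, so decoding is done.
            (_, None) => break,
            // Error: the first offset is ignored, the fixed string is
            // emitted, and the remaining bytes are fed again.
            (_, Some(upto)) => {
                out.push('\u{FFFD}');
                rest = &rest[upto..];
            }
        }
    }
    out
}

/// Toy stand-in for a decoder: accepts ASCII and reports the first non-ASCII
/// byte as problematic.
pub fn ascii_raw_feed(input: &[u8], out: &mut String) -> (usize, Option<usize>) {
    for (i, &b) in input.iter().enumerate() {
        if b.is_ascii() {
            out.push(char::from(b));
        } else {
            return (i, Some(i + 1));
        }
    }
    (input.len(), None)
}
```

With these definitions, `decode_with_replacement(ascii_raw_feed, b"ab\xFFcd")` yields `ab`, then U+FFFD, then `cd`.
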