4 releases

Version | Date |
---|---|
0.2.0 | Jan 30, 2023 |
0.1.2 | Jan 7, 2023 |
0.1.1 | Jan 6, 2023 |
0.1.0 | Jan 5, 2023 |
0.0.0 | |
sndjvu_format is a library for working with the transfer format for DjVu documents.
The "transfer format" is the canonical DjVu file format defined by the DjVu v3 standard. You can use this library to parse a DjVu file or create one programmatically. The lowest-level details of the format are abstracted away, but you still need to understand the structure of a DjVu document at the "chunk" level (see below) to use this library effectively.
Overview of the DjVu v3 document model
(This overview is not intended to substitute for reading the relevant parts of the DjVu v3 standard.)
A DjVu document is either single-page or multi-page. A single-page document consists of a single component; a multi-page document consists of zero or more components, plus some metadata.
DjVu components come in three types: `DJVU`, `DJVI`, and `THUM`. A `DJVU` component represents a page, a `DJVI` component holds data that's shared between several pages, and a `THUM` component holds thumbnail images for several pages. The single component of a single-page document must be of type `DJVU`.
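The component model described above can be sketched as a pair of Rust types. This is only an illustration of the document model: the names `ComponentKind`, `Document`, `from_id`, `single_page`, and `page_count` are hypothetical and are not taken from the sndjvu_format API. The sketch also assumes that a component type is identified by its four-character ASCII name.

```rust
/// The three component types of the DjVu v3 document model.
/// (Hypothetical type for illustration; not the sndjvu_format API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComponentKind {
    /// Represents a page.
    Djvu,
    /// Holds data shared between several pages.
    Djvi,
    /// Holds thumbnail images for several pages.
    Thum,
}

impl ComponentKind {
    /// Maps a four-character ID to a component type, if it names one.
    pub fn from_id(id: &[u8; 4]) -> Option<Self> {
        match id {
            b"DJVU" => Some(Self::Djvu),
            b"DJVI" => Some(Self::Djvi),
            b"THUM" => Some(Self::Thum),
            _ => None,
        }
    }
}

/// A document is either single-page or multi-page.
/// (Hypothetical type; the metadata of a multi-page document is omitted.)
#[derive(Debug, PartialEq, Eq)]
pub enum Document {
    SinglePage(ComponentKind),
    MultiPage(Vec<ComponentKind>),
}

impl Document {
    /// Builds a single-page document, rejecting any component that is not DJVU.
    pub fn single_page(kind: ComponentKind) -> Result<Self, String> {
        if kind == ComponentKind::Djvu {
            Ok(Document::SinglePage(kind))
        } else {
            Err(format!("single-page document needs a DJVU component, got {:?}", kind))
        }
    }

    /// Counts pages, that is, components of type DJVU.
    pub fn page_count(&self) -> usize {
        match self {
            Document::SinglePage(_) => 1,
            Document::MultiPage(kinds) => kinds
                .iter()
                .filter(|kind| **kind == ComponentKind::Djvu)
                .count(),
        }
    }
}
```

Under these assumptions, a multi-page document with one `DJVI`, two `DJVU`, and one `THUM` component has two pages, and an empty multi-page document has none.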
Every piece of data in a DjVu document is contained in a chunk, and each chunk has a type. Most chunks are contained in a component; the exceptions are the `DIRM` and `NAVM` chunks, which contain the metadata for a multi-page document. A chunk of type `INFO` can only appear at the start of a `DJVU` component (and is mandatory in that position); it describes some basic properties of the corresponding page, like its width and height in pixels. Other than the `INFO` chunk, the same types of chunk can appear in `DJVU` and `DJVI` components. A chunk of one of these types is called an element, and describes one aspect of the page or pages with which it is associated (image data, OCRed text, annotations, etc.).
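The placement rules for `INFO`, `DIRM`, and `NAVM` can be expressed as a small check over the chunk types of one component. This is a minimal sketch, not how sndjvu_format validates documents: `ComponentKind` and `check_chunk_order` are hypothetical names, and chunk types are assumed to be given as four-character strings. It checks only the rules stated above and says nothing about which elements are otherwise valid.

```rust
/// Component types of the DjVu v3 document model.
/// (Hypothetical type for illustration; not the sndjvu_format API.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComponentKind {
    Djvu,
    Djvi,
    Thum,
}

/// Checks the chunk types of a single component, in order, against the
/// placement rules for INFO, DIRM, and NAVM.
pub fn check_chunk_order(kind: ComponentKind, chunk_ids: &[&str]) -> Result<(), String> {
    for (index, id) in chunk_ids.iter().enumerate() {
        match *id {
            // DIRM and NAVM hold multi-page metadata and live outside components.
            "DIRM" | "NAVM" => {
                return Err(format!("{} is document metadata, not a component chunk", id));
            }
            // INFO is only allowed as the first chunk of a DJVU component.
            "INFO" if kind != ComponentKind::Djvu || index != 0 => {
                return Err(format!("INFO at position {} of a {:?} component", index, kind));
            }
            _ => {}
        }
    }
    // INFO is mandatory at the start of a DJVU component.
    if kind == ComponentKind::Djvu && chunk_ids.first() != Some(&"INFO") {
        return Err("a DJVU component must start with an INFO chunk".to_string());
    }
    Ok(())
}
```

For example, `check_chunk_order(ComponentKind::Djvu, &["INFO", "Sjbz", "TXTz"])` returns `Ok(())`, while the same list without the leading `INFO` is rejected. The element types `Sjbz` and `TXTz` are used here only as sample inputs.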