4 releases

0.2.0 Jan 30, 2023
0.1.2 Jan 7, 2023
0.1.1 Jan 6, 2023
0.1.0 Jan 5, 2023
0.0.0 Jul 23, 2022

#2366 in Parser implementations

MIT/Apache

140KB
3.5K SLoC

sndjvu_format is a library for working with the transfer format for DjVu documents.

The "transfer format" is the canonical DjVu file format defined by the DjVu v3 standard. You can use this library to parse a DjVu file or create one programmatically. The lowest-level details of the format are abstracted away, but you still need to understand the structure of a DjVu document at the "chunk" level (see below) to use this library effectively.

Overview of the DjVu v3 document model

(This overview is not intended to substitute for reading the relevant parts of the DjVu v3 standard.)

A DjVu document is either single-page or multi-page. A single-page document consists of a single component; a multi-page document consists of zero or more components, plus some metadata.

DjVu components come in three types: DJVU, DJVI, and THUM. A DJVU component represents a page, a DJVI component holds data that's shared between several pages, and a THUM component holds thumbnail images for several pages. The single component of a single-page document must be of type DJVU.
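The document and component rules above can be sketched as a small data model. This is only an illustration under stated assumptions: `ComponentKind`, `Document`, `is_well_formed`, and `page_count` are hypothetical names made up for this example and are not the types or functions that sndjvu_format exposes.

```rust
// Hypothetical types for illustration only; not the sndjvu_format API.

/// The three component types named in the overview.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComponentKind {
    /// A page.
    Djvu,
    /// Data shared between several pages.
    Djvi,
    /// Thumbnail images for several pages.
    Thum,
}

impl ComponentKind {
    /// Map the four-character type name used in the text to a kind.
    pub fn from_name(name: &str) -> Option<Self> {
        match name {
            "DJVU" => Some(Self::Djvu),
            "DJVI" => Some(Self::Djvi),
            "THUM" => Some(Self::Thum),
            _ => None,
        }
    }
}

/// A document reduced to the kinds of its components.
/// The multi-page metadata is left out of this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Document {
    SinglePage(ComponentKind),
    MultiPage(Vec<ComponentKind>),
}

impl Document {
    /// A single-page document must hold a DJVU component; a multi-page
    /// document may hold zero or more components of any kind.
    pub fn is_well_formed(&self) -> bool {
        match self {
            Document::SinglePage(kind) => *kind == ComponentKind::Djvu,
            Document::MultiPage(_) => true,
        }
    }

    /// Number of pages, counting only DJVU components.
    pub fn page_count(&self) -> usize {
        match self {
            Document::SinglePage(kind) => usize::from(*kind == ComponentKind::Djvu),
            Document::MultiPage(kinds) => kinds
                .iter()
                .filter(|kind| **kind == ComponentKind::Djvu)
                .count(),
        }
    }
}
```

In this model a multi-page document with components DJVI, DJVU, DJVU, THUM has two pages, since only DJVU components represent pages.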

Every piece of data in a DjVu document is contained in a chunk, and each chunk has a type. Most chunks are contained in a component; the exceptions are the DIRM and NAVM chunks, which contain the metadata for a multi-page document. A chunk of type INFO can only appear at the start of a DJVU component (and is mandatory in that position); it describes some basic properties of the corresponding page, like its width and height in pixels. Other than the INFO chunk, the same types of chunk can appear in DJVU and DJVI components. A chunk of one of these types is called an element, and describes one aspect of the page or pages with which it is associated (image data, OCRed text, annotations, etc.).
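The INFO placement rule can be expressed as a small check over the chunk types of one component. This is a minimal sketch, not how sndjvu_format validates input: `ComponentKind`, `InfoError`, and `check_info_placement` are hypothetical names, and chunks are reduced to their four-character type names.

```rust
// Hypothetical helper for illustration only; not the sndjvu_format API.

/// The two component kinds that hold elements.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComponentKind {
    Djvu,
    Djvi,
}

#[derive(Debug, PartialEq, Eq)]
pub enum InfoError {
    /// A DJVU component does not begin with an INFO chunk.
    MissingLeadingInfo,
    /// An INFO chunk appears somewhere other than the start of a DJVU component.
    MisplacedInfo { index: usize },
}

/// Check that INFO is the first chunk of a DJVU component and appears
/// nowhere else. `chunk_types` lists the component's chunk type names
/// in document order.
pub fn check_info_placement(
    kind: ComponentKind,
    chunk_types: &[&str],
) -> Result<(), InfoError> {
    // Index of the first chunk that must NOT be INFO.
    let start = match kind {
        ComponentKind::Djvu => {
            if chunk_types.first().copied() != Some("INFO") {
                return Err(InfoError::MissingLeadingInfo);
            }
            1
        }
        ComponentKind::Djvi => 0,
    };

    // Every remaining chunk is expected to be an element.
    match chunk_types[start..].iter().position(|ty| *ty == "INFO") {
        Some(offset) => Err(InfoError::MisplacedInfo { index: start + offset }),
        None => Ok(()),
    }
}
```

For example, a DJVU component whose chunk types are `INFO`, `Sjbz`, `TXTz` passes the check, while a DJVI component containing an `INFO` chunk is rejected. The element type names here are used only as sample input.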

No runtime deps

Features