#web-archive #warc #wacz #cdxj #save-the-internet

wacksy

Experimental library for writing WACZ archives

1 unstable release

Uses new Rust 2024

0.0.1-alpha Apr 5, 2025

#8 in #warc

Download history 103/week @ 2025-04-02 15/week @ 2025-04-09

118 downloads per month

MIT license

25KB
450 lines

What needs to go into a WACZ file, according to the example in the spec:

archive
└── data.warc.gz
datapackage.json
datapackage-digest.json
indexes
└── index.cdx.gz
pages
└── pages.jsonl
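
As a minimal sketch of the layout above, the listed paths can be written down as a constant and checked against the entry names of a candidate archive. `EXPECTED_PATHS` and `missing_paths` are hypothetical names for illustration, not part of the wacksy API, and the list only mirrors the example shown here rather than everything the spec allows.

```rust
/// Paths taken from the example layout listed above.
/// (Hypothetical constant for illustration; not part of wacksy.)
const EXPECTED_PATHS: [&str; 5] = [
    "archive/data.warc.gz",
    "datapackage.json",
    "datapackage-digest.json",
    "indexes/index.cdx.gz",
    "pages/pages.jsonl",
];

/// Returns every expected path that is absent from `entries`,
/// in the same order as `EXPECTED_PATHS`.
fn missing_paths(entries: &[&str]) -> Vec<&'static str> {
    EXPECTED_PATHS
        .iter()
        .copied()
        .filter(|expected| !entries.contains(expected))
        .collect()
}
```

Calling `missing_paths` with the entry names read from a zip file gives a quick list of what still has to be written before the archive matches the example.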

Operations chart

Broadly, what needs to be done: read the WARC file, create an index and then a datapackage (in that order), convert everything to bytes, and zip it up.

flowchart
    A@{ shape: lean-r, label: "WARC file"}
    B@{ shape: rect, label: "Create index" }
    C@{ shape: rect, label: "Create datapackage" }
    D@{ shape: rect, label: "Create datapackage digest" }
    E1@{ shape: lean-l, label: "Convert index to bytes" }
    E2@{ shape: lean-l, label: "Convert to bytes" }
    F@{ shape: lean-l, label: "Zip up the files" }
    G@{ shape: lean-r, label: "WACZ file"}
    A --> index
    subgraph index
    B --> E1
    end
    index --> datapackage
    subgraph datapackage
    C --> E2 --> D --> E2
    end
    style index stroke-dasharray: 5 5
    A --> F
    index --> F
    datapackage --> F --> G
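
The ordering in the chart can be sketched in plain Rust, under loud assumptions: every function and type name here is hypothetical and not taken from wacksy, the index step is a placeholder rather than a real CDXJ writer, gzip compression and the final zip step are left out, the pages file is skipped because the chart does not mention it, and the hash function is passed in by the caller because the standard library has no SHA-256. The JSON is also trimmed to a few fields; a real datapackage carries more, so check the spec before relying on it.

```rust
/// One file destined for the final zip: its path inside the WACZ and its bytes.
struct Entry {
    path: &'static str,
    bytes: Vec<u8>,
}

/// Placeholder for "Create index" + "Convert index to bytes".
/// A real index has one line per WARC record; this stand-in does not parse anything.
fn create_index(warc: &[u8]) -> Vec<u8> {
    format!("placeholder index for {} WARC bytes\n", warc.len()).into_bytes()
}

/// "Create datapackage": list each resource with its path, hash and size.
/// Trimmed JSON for illustration only.
fn create_datapackage(resources: &[&Entry], hash: &dyn Fn(&[u8]) -> String) -> Vec<u8> {
    let items: Vec<String> = resources
        .iter()
        .map(|e| {
            format!(
                r#"{{"path":"{}","hash":"{}","bytes":{}}}"#,
                e.path,
                hash(e.bytes.as_slice()),
                e.bytes.len()
            )
        })
        .collect();
    format!(r#"{{"resources":[{}]}}"#, items.join(",")).into_bytes()
}

/// "Create datapackage digest": hash the datapackage bytes themselves.
fn create_digest(datapackage: &[u8], hash: &dyn Fn(&[u8]) -> String) -> Vec<u8> {
    format!(
        r#"{{"path":"datapackage.json","hash":"{}"}}"#,
        hash(datapackage)
    )
    .into_bytes()
}

/// Runs the steps in chart order and returns the entries ready to be zipped.
fn build_entries(warc: Vec<u8>, hash: &dyn Fn(&[u8]) -> String) -> Vec<Entry> {
    // 1. The index is derived from the WARC bytes.
    let index = Entry { path: "indexes/index.cdx.gz", bytes: create_index(&warc) };
    let archive = Entry { path: "archive/data.warc.gz", bytes: warc };
    // 2. The datapackage describes the archive and the index, so it comes after both.
    let datapackage = Entry {
        path: "datapackage.json",
        bytes: create_datapackage(&[&archive, &index], hash),
    };
    // 3. The digest covers the finished datapackage bytes.
    let digest = Entry {
        path: "datapackage-digest.json",
        bytes: create_digest(&datapackage.bytes, hash),
    };
    // 4. Everything is now bytes; a zip writer would consume this list.
    vec![archive, index, datapackage, digest]
}
```

The point of the sketch is the dependency order: the datapackage cannot be built until the index bytes exist, and the digest cannot be built until the datapackage bytes exist, which is why the chart routes both through a "convert to bytes" step before zipping.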

Dependencies

~5.5–7.5MB
~138K SLoC