1 unstable release: 0.1.0 (Jul 6, 2022)
vobsubocr
vobsubocr is a blazingly fast and accurate DVD VobSub to SRT subtitle conversion tool.
Background
DVD subtitles are unfortunately encoded essentially as a series of images. This
presents problems when a text representation of the subtitle is needed, e.g. for
language learning. vobsubocr can alleviate this problem by generating SRT
subtitles from an input VobSub file, leveraging the power of Tesseract.
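To make the target format concrete, here is a minimal sketch of how recognized text could be serialized as SRT. The Cue type and both function names are hypothetical and chosen for illustration; this is not vobsubocr's internal code. It only relies on the standard SRT layout: a 1-based index, a timing line of the form HH:MM:SS,mmm --> HH:MM:SS,mmm, the text, and a blank line.

```rust
/// One recognized subtitle: display window in milliseconds plus the OCR'd text.
/// Hypothetical type for illustration; not vobsubocr's internal representation.
pub struct Cue {
    pub start_ms: u64,
    pub end_ms: u64,
    pub text: String,
}

/// Format milliseconds as an SRT timestamp, HH:MM:SS,mmm
/// (note the comma before the milliseconds).
pub fn srt_timestamp(ms: u64) -> String {
    let hours = ms / 3_600_000;
    let minutes = (ms / 60_000) % 60;
    let seconds = (ms / 1_000) % 60;
    let millis = ms % 1_000;
    format!("{:02}:{:02}:{:02},{:03}", hours, minutes, seconds, millis)
}

/// Render cues as SRT: a 1-based index, a timing line, the text, then a blank line.
pub fn to_srt(cues: &[Cue]) -> String {
    let mut out = String::new();
    for (i, cue) in cues.iter().enumerate() {
        out.push_str(&format!(
            "{}\n{} --> {}\n{}\n\n",
            i + 1,
            srt_timestamp(cue.start_ms),
            srt_timestamp(cue.end_ms),
            cue.text.trim()
        ));
    }
    out
}
```

Calling to_srt on a slice of cues returns the full file contents as a String, which can then be printed to stdout or written to a file.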
Installation
This package is not on crates.io yet, so you will have to clone and build with
cargo. You will need to have Tesseract's development libraries installed; see
the leptess readme for more details.
Usage
# Convert simplified Chinese vobsub subtitles and print them to stdout.
vobsubocr -l chi_sim shrek_chi.idx
# Convert English vobsub subtitles and write them to a file named "shrek_eng.srt".
vobsubocr -l eng -o shrek_eng.srt shrek_eng.idx
We can also specify more advanced configuration options for Tesseract with -c.
# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
vobsubocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx
How does it work/compare to similar tools?
The most comparable tool to vobsubocr is VobSub2SRT, but vobsubocr has
significantly better output, especially for non-English languages, mainly
because VobSub2SRT does little preprocessing of the image before sending it to
Tesseract. For example, Tesseract 4.0 expects black text on a white background,
which VobSub2SRT does not guarantee, but vobsubocr does.
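As an illustration of this kind of preprocessing, the sketch below binarizes an 8-bit grayscale buffer and inverts it when the background turns out to be dark. It is not vobsubocr's actual implementation: the function name, the fixed threshold parameter, and the majority-vote heuristic (the background is assumed to cover more pixels than the text) are all assumptions made for the example.

```rust
/// Binarize an 8-bit grayscale image and, if needed, invert it so the result
/// is dark text (0) on a light background (255).
///
/// Assumption for this sketch: the background covers more pixels than the
/// text, so the majority class after thresholding is treated as background.
pub fn to_black_on_white(pixels: &[u8], threshold: u8) -> Vec<u8> {
    // Step 1: threshold every pixel to pure black or pure white.
    let binary: Vec<u8> = pixels
        .iter()
        .map(|&p| if p >= threshold { 255 } else { 0 })
        .collect();

    // Step 2: count dark pixels; a mostly dark image means light-on-dark text.
    let dark = binary.iter().filter(|&&p| p == 0).count();
    let background_is_dark = dark * 2 > binary.len();

    // Step 3: invert only when the background is dark.
    if background_is_dark {
        binary.iter().map(|&p| 255 - p).collect()
    } else {
        binary
    }
}
```

An image that is already black on white passes through unchanged apart from the thresholding, so the function can be applied unconditionally before OCR.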
Additionally, vobsubocr splits each line into separate images to take
advantage of page segmentation method 7, which greatly improves accuracy of
non-English languages in particular.
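Page segmentation mode 7 tells Tesseract to treat the image as a single text line, so a multi-line subtitle has to be cut into one image per line first. One common way to do that is a horizontal projection: scan the rows and start a new line wherever ink appears after a blank row. The sketch below assumes a binarized, row-major buffer (0 = ink, 255 = background); the function name and the approach are illustrative and may differ from what vobsubocr actually does.

```rust
/// Find the row ranges of text lines in a binarized image (0 = ink,
/// 255 = background) stored row-major with the given width.
/// Returns half-open (start_row, end_row) ranges, top to bottom.
pub fn split_text_lines(pixels: &[u8], width: usize) -> Vec<(usize, usize)> {
    let mut lines = Vec::new();
    if width == 0 {
        return lines;
    }
    let height = pixels.len() / width;
    // Row index where the current text line began, if we are inside one.
    let mut start: Option<usize> = None;

    for row in 0..height {
        let has_ink = pixels[row * width..(row + 1) * width]
            .iter()
            .any(|&p| p == 0);
        match (has_ink, start) {
            // First inked row after a gap: a new line begins here.
            (true, None) => start = Some(row),
            // First blank row after ink: the current line ends here.
            (false, Some(s)) => {
                lines.push((s, row));
                start = None;
            }
            _ => {}
        }
    }
    // Close a line that runs to the bottom edge of the image.
    if let Some(s) = start {
        lines.push((s, height));
    }
    lines
}
```

Each returned range can then be cropped into its own image and recognized separately with the single-line segmentation mode.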
For more ways to improve the accuracy of Tesseract output, see the official Tesseract documentation.
Miscellaneous Notes
From my understanding, the chi_sim and chi_tra Tesseract models both work on
simplified and traditional Chinese text, but each automatically converts the
recognized text to its own form (simplified for chi_sim, traditional for chi_tra).