9 releases

Version | Date |
---|---|
0.2.1 | Aug 11, 2024 |
0.2.0 | Jul 18, 2024 |
0.1.8 | Jul 12, 2024 |
0.1.7 | Jun 4, 2024 |
0.1.3 | Feb 4, 2024 |
subtile-ocr
subtile-ocr is a blazingly fast and accurate DVD VobSub to SRT subtitle conversion tool. It started as a fork of vobsubocr.
Background
DVD subtitles are unfortunately encoded essentially as a series of images. This is a problem when you need a text representation of the subtitles, e.g. for language learning. subtile-ocr can alleviate this problem by generating SRT subtitles from an input VobSub file, leveraging the power of Tesseract.
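For readers unfamiliar with the target format, here is a minimal sketch of how recognized text and timings map onto SRT entries. It illustrates the SRT layout only and is not subtile-ocr's own code; the function names format_timestamp and to_srt, and the (start_ms, end_ms, text) tuple shape, are assumptions made for this example.

```python
def format_timestamp(ms: int) -> str:
    """Format a duration in milliseconds as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02}:{minutes:02}:{seconds:02},{millis:03}"


def to_srt(cues):
    """Render (start_ms, end_ms, text) tuples as SRT text.

    Each entry is a 1-based index, a timing line, and the subtitle text;
    entries are separated by a blank line.
    """
    blocks = []
    for index, (start, end, text) in enumerate(cues, start=1):
        timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
        blocks.append(f"{index}\n{timing}\n{text}\n")
    return "\n".join(blocks)
```

Calling to_srt with a list of cues returns a string that can be written directly to a .srt file.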
Installation
Install the latest release with cargo:
cargo install subtile-ocr
Or alternatively, install the development version from git:
cargo install --git https://github.com/gwen-lg/subtile-ocr
You will need Tesseract's development libraries installed; see the leptess readme for more details. If you use Nix, the included shell.nix provides an environment with all of the necessary dependencies.
Usage
# Convert simplified Chinese vobsub subtitles and print them to stdout.
subtile-ocr -l chi_sim shrek_chi.idx
# Convert English vobsub subtitles and write them to a file named "shrek_eng.srt".
subtile-ocr -l eng -o shrek_eng.srt shrek_eng.idx
We can also specify more advanced configuration options for Tesseract with -c.
# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
subtile-ocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx
How does it work/compare to similar tools?
The most comparable tool to subtile-ocr is VobSub2SRT, but subtile-ocr has significantly better output, especially for non-English languages, mainly because VobSub2SRT does not do much preprocessing of the image at all before sending it to Tesseract. For example, Tesseract 4.0 expects black text on a white background, which VobSub2SRT does not guarantee, but subtile-ocr does.
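To make the polarity requirement concrete, here is a minimal sketch of one way to force black text on a white background. It is an illustration, not subtile-ocr's actual preprocessing: the function name to_black_on_white, the fixed threshold of 128, and the assumption that the background covers more pixels than the text are all choices made for this example.

```python
def to_black_on_white(gray, threshold=128):
    """Binarize a grayscale image so that text is black (0) on white (255).

    `gray` is a list of rows, each a list of 0-255 intensities.
    Assumption: the background covers more pixels than the text, so if
    most pixels are dark after thresholding, the image is inverted.
    """
    binary = [[255 if px >= threshold else 0 for px in row] for row in gray]
    total = sum(len(row) for row in binary)
    dark = sum(px == 0 for row in binary for px in row)
    if dark * 2 > total:
        # Mostly dark means a dark background with light text: invert.
        binary = [[255 - px for px in row] for row in binary]
    return binary
```

An image that is already black on white passes through unchanged apart from the thresholding step.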
Additionally, subtile-ocr splits each line into separate images to take advantage of page segmentation method 7, which greatly improves accuracy of non-English languages in particular.
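Page segmentation mode 7 tells Tesseract to treat the image as a single text line, which is why each subtitle line needs its own image. The sketch below shows one simple way to do such a split, by grouping consecutive pixel rows that contain ink. It is an assumption-laden illustration rather than subtile-ocr's implementation: the function name split_lines is hypothetical, and the input is assumed to be a black-on-white binary image given as a list of rows.

```python
def split_lines(binary):
    """Split a black-on-white binary image into one sub-image per text line.

    `binary` is a list of rows, each a list of 0 (ink) or 255 (background).
    A row belongs to a text line if it contains at least one ink pixel;
    consecutive ink rows are grouped into one line image.
    """
    lines, current = [], []
    for row in binary:
        if 0 in row:
            current.append(row)
        elif current:
            # A blank row ends the line collected so far.
            lines.append(current)
            current = []
    if current:
        lines.append(current)
    return lines
```

Each returned sub-image could then be sent to Tesseract separately with page segmentation mode 7.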
The official Tesseract documentation covers how to improve the accuracy of its output.
Miscellaneous Notes
From my understanding, the chi_sim and chi_tra Tesseract models work on both simplified and traditional Chinese text, but automatically convert said text to their respective forms.