#subtitle #ocr #dvd #text-image #vobsub #pgs #command-line-tool

bin+lib subtile-ocr

Converts DVD VOB subtitles to SRT subtitles with Tesseract OCR

9 releases

0.2.1 Aug 11, 2024
0.2.0 Jul 18, 2024
0.1.8 Jul 12, 2024
0.1.7 Jun 4, 2024
0.1.3 Feb 4, 2024

GPL-3.0 license

72KB
1K SLoC

subtile-ocr

subtile-ocr is a fast and accurate tool for converting DVD VobSub subtitles to SRT. It started as a fork of vobsubocr.

Background

DVD subtitles are encoded as a series of images rather than as text. This is a problem whenever you need a text representation of the subtitles, e.g. for language learning. subtile-ocr addresses it by running Tesseract OCR on an input VobSub file and generating SRT subtitles from the result.
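To make the target format concrete, here is a minimal sketch of what one SRT entry looks like once the text has been recognized. The helper names `srt_timestamp` and `srt_cue` are hypothetical and are not part of subtile-ocr's API; the sketch only assumes the usual SRT layout of a sequence number, a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time range, the text, and a blank line.

```rust
/// Format a millisecond offset as an SRT timestamp: HH:MM:SS,mmm.
/// (Hypothetical helper, for illustration only.)
fn srt_timestamp(ms: u64) -> String {
    let h = ms / 3_600_000;
    let m = ms / 60_000 % 60;
    let s = ms / 1_000 % 60;
    let milli = ms % 1_000;
    format!("{h:02}:{m:02}:{s:02},{milli:03}")
}

/// Render one SRT cue: sequence number, time range, text, then a blank line.
/// (Hypothetical helper, for illustration only.)
fn srt_cue(index: usize, start_ms: u64, end_ms: u64, text: &str) -> String {
    format!(
        "{index}\n{} --> {}\n{text}\n\n",
        srt_timestamp(start_ms),
        srt_timestamp(end_ms)
    )
}

fn main() {
    // Example cue with made-up timings and text.
    print!("{}", srt_cue(1, 61_500, 64_250, "Recognized subtitle text"));
}
```

The OCR step supplies the text; the timings come from the VobSub index, which is why the tool takes the .idx file as input.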

Installation

Install the latest release with cargo:

cargo install subtile-ocr

Alternatively, install the development version from git:

cargo install --git https://github.com/gwen-lg/subtile-ocr

You will need Tesseract's development libraries installed; see the leptess readme for more details. If you use Nix, the included shell.nix provides an environment with all of the necessary dependencies.

Usage

# Convert simplified Chinese vobsub subtitles and print them to stdout.
subtile-ocr -l chi_sim shrek_chi.idx

# Convert English vobsub subtitles and write them to a file named "shrek_eng.srt".
subtile-ocr -l eng -o shrek_eng.srt shrek_eng.idx

We can also specify more advanced configuration options for Tesseract with -c.

# Convert subtitles and blacklist the specified characters from being (mistakenly) recognized.
subtile-ocr -l eng -c tessedit_char_blacklist='|\/`_~' shrek_eng.idx

How does it work/compare to similar tools?

The most comparable tool to subtile-ocr is VobSub2SRT, but subtile-ocr produces significantly better output, especially for non-English languages. The main reason is that VobSub2SRT does very little preprocessing of the image before sending it to Tesseract. For example, Tesseract 4.0 expects black text on a white background, which subtile-ocr guarantees and VobSub2SRT does not. Additionally, subtile-ocr splits each line of a subtitle into a separate image so that it can use page segmentation method 7 (treat the image as a single text line), which greatly improves accuracy, for non-English languages in particular.
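The two preprocessing ideas described above can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not subtile-ocr's actual implementation: the `Gray` type, the function names, the fixed threshold, and the assumption that the subtitle text is the lighter tone are all hypothetical.

```rust
/// A tiny grayscale image: one byte per pixel, row-major.
/// (Hypothetical type for illustration, not a subtile-ocr type.)
#[derive(Debug, Clone, PartialEq)]
struct Gray {
    width: usize,
    height: usize,
    pixels: Vec<u8>,
}

/// Binarize so that text becomes black (0) on a white (255) background.
/// Assumption: text pixels are the bright ones (light text on a dark background).
fn to_black_on_white(img: &Gray, threshold: u8) -> Gray {
    let pixels = img
        .pixels
        .iter()
        .map(|&p| if p >= threshold { 0 } else { 255 })
        .collect();
    Gray { width: img.width, height: img.height, pixels }
}

/// Find row ranges (start inclusive, end exclusive) that contain at least
/// one black pixel. Each range is treated as one line of text.
fn text_line_rows(img: &Gray) -> Vec<(usize, usize)> {
    let mut lines = Vec::new();
    let mut start: Option<usize> = None;
    for y in 0..img.height {
        let row = &img.pixels[y * img.width..(y + 1) * img.width];
        let has_ink = row.iter().any(|&p| p == 0);
        match (has_ink, start) {
            (true, None) => start = Some(y),
            (false, Some(s)) => {
                lines.push((s, y));
                start = None;
            }
            _ => {}
        }
    }
    // Close a line that runs to the bottom edge of the image.
    if let Some(s) = start {
        lines.push((s, img.height));
    }
    lines
}

/// Copy one row range into its own image, ready to be OCR'd as a single line.
fn crop_rows(img: &Gray, (start, end): (usize, usize)) -> Gray {
    Gray {
        width: img.width,
        height: end - start,
        pixels: img.pixels[start * img.width..end * img.width].to_vec(),
    }
}

fn main() {
    // 4x5 test image: bright "text" pixels in row 0 and rows 3-4, dark elsewhere.
    let img = Gray {
        width: 4,
        height: 5,
        pixels: vec![
            0, 255, 255, 0, //
            0, 0, 0, 0, //
            0, 0, 0, 0, //
            0, 200, 0, 0, //
            0, 0, 220, 0,
        ],
    };
    let bw = to_black_on_white(&img, 128);
    let lines: Vec<Gray> = text_line_rows(&bw)
        .into_iter()
        .map(|range| crop_rows(&bw, range))
        .collect();
    println!("found {} text line(s)", lines.len());
}
```

Each cropped image would then be passed to Tesseract on its own with page segmentation method 7. A real implementation would likely add padding around each line and choose the threshold from the subtitle palette rather than hard-coding it.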

Tesseract's official documentation describes further ways to improve the accuracy of its output.

Miscellaneous Notes

From my understanding, the chi_sim and chi_tra Tesseract models work on both simplified and traditional Chinese text, but automatically convert said text to their respective forms.

Dependencies

~11–22MB
~285K SLoC