Add support for LLM-based transcriptions#148
Conversation
|
Thanks for the contribution. I'll review it in the next days |
|
I tested this locally and it seems very inefficient since there's 1 call to the LLM API for each subtitle entry. That could result in thousand calls (depending on the subtitle) and also you need to be very patient. In Tesseract I grouped all the subtitle entries in 1 (rarely 2) PNG images, where I sorted the subtitles based on their size to maximize how many entries I can have in a single image. That works fine with Tesseract because its result gives me back the text coordinates, so I know which subtitle entry is what on the result. Maybe it's possible to add horizontal lines and vertical lines to this image to divide this big image in blocks and send 1 image to LLM API and ask to report back in the order block, and assemble back the final transcripted subtitles |
|
The problem about that approach is that the fast OCR models (the ones you probably wanna use as a hobbyst on a local machine) are not trained to give back coordinates. GLM-OCR runs at about 200-250 tokens, taking <100ms per subtitle on my Ryzen AI MAX 395 using llama.cpp with the Vulkan backend. That is comparable to Tesseract's speeds, yet properly recognizes uppercase "i" vs lowercase "L" for example. It, however, is not possible to ask for coordinates because the 900M model simply isn't trained for that. More capable models that can indeed process multiple and give back coordinates such as Qwen3.6-35B-A3B-GGUF models or Gemma 4 26B A4B are an option, but those are much slower at 30-40 tokens per second with 8 bit quantisation and taking ~800ms per subtitle. Time increases linearly with the amounts of tokens to generate, so using a simpler, single page model is gonna be always faster than bundling multiple with a coordinate-producing model. |
This PR adds support for calling a OCR model using any OpenAI-compatible endpoint. I'm personally using it with GLM-OCR, which yields a fairly satisfactory result on my local machine.
The changes are fairly minimal, implementing simply a new
LlmPgsToSrtRipperclass and options for that. No new libraries are needed, using the built-inhttp.client.