Skip to content

Add support for LLM-based transcriptions#148

Open
socram8888 wants to merge 1 commit into
ratoaq2:mainfrom
socram8888:llm
Open

Add support for LLM-based transcriptions#148
socram8888 wants to merge 1 commit into
ratoaq2:mainfrom
socram8888:llm

Conversation

@socram8888

Copy link
Copy Markdown

This PR adds support for calling a OCR model using any OpenAI-compatible endpoint. I'm personally using it with GLM-OCR, which yields a fairly satisfactory result on my local machine.

The changes are fairly minimal, implementing simply a new LlmPgsToSrtRipper class and options for that. No new libraries are needed, using the built-in http.client.

@ratoaq2

ratoaq2 commented May 10, 2026

Copy link
Copy Markdown
Owner

Thanks for the contribution. I'll review it in the next days

@ratoaq2

ratoaq2 commented May 15, 2026

Copy link
Copy Markdown
Owner

I tested this locally and it seems very inefficient since there's 1 call to the LLM API for each subtitle entry. That could result in thousand calls (depending on the subtitle) and also you need to be very patient.

In Tesseract I grouped all the subtitle entries in 1 (rarely 2) PNG images, where I sorted the subtitles based on their size to maximize how many entries I can have in a single image. That works fine with Tesseract because its result gives me back the text coordinates, so I know which subtitle entry is what on the result.

Maybe it's possible to add horizontal lines and vertical lines to this image to divide this big image in blocks and send 1 image to LLM API and ask to report back in the order block, and assemble back the final transcripted subtitles

@socram8888

socram8888 commented May 15, 2026

Copy link
Copy Markdown
Author

The problem about that approach is that the fast OCR models (the ones you probably wanna use as a hobbyst on a local machine) are not trained to give back coordinates.

GLM-OCR runs at about 200-250 tokens, taking <100ms per subtitle on my Ryzen AI MAX 395 using llama.cpp with the Vulkan backend. That is comparable to Tesseract's speeds, yet properly recognizes uppercase "i" vs lowercase "L" for example. It, however, is not possible to ask for coordinates because the 900M model simply isn't trained for that.

More capable models that can indeed process multiple and give back coordinates such as Qwen3.6-35B-A3B-GGUF models or Gemma 4 26B A4B are an option, but those are much slower at 30-40 tokens per second with 8 bit quantisation and taking ~800ms per subtitle.

Time increases linearly with the amounts of tokens to generate, so using a simpler, single page model is gonna be always faster than bundling multiple with a coordinate-producing model.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants