Add support for LLM-based transcriptions by socram8888 · Pull Request #148 · ratoaq2/pgsrip

socram8888 · 2026-05-08T10:23:35Z

This PR adds support for calling a OCR model using any OpenAI-compatible endpoint. I'm personally using it with GLM-OCR, which yields a fairly satisfactory result on my local machine.

The changes are fairly minimal, implementing simply a new LlmPgsToSrtRipper class and options for that. No new libraries are needed, using the built-in http.client.

ratoaq2 · 2026-05-10T18:46:29Z

Thanks for the contribution. I'll review it in the next days

ratoaq2 · 2026-05-15T08:03:14Z

I tested this locally and it seems very inefficient since there's 1 call to the LLM API for each subtitle entry. That could result in thousand calls (depending on the subtitle) and also you need to be very patient.

In Tesseract I grouped all the subtitle entries in 1 (rarely 2) PNG images, where I sorted the subtitles based on their size to maximize how many entries I can have in a single image. That works fine with Tesseract because its result gives me back the text coordinates, so I know which subtitle entry is what on the result.

Maybe it's possible to add horizontal lines and vertical lines to this image to divide this big image in blocks and send 1 image to LLM API and ask to report back in the order block, and assemble back the final transcripted subtitles

socram8888 · 2026-05-15T09:16:15Z

The problem about that approach is that the fast OCR models (the ones you probably wanna use as a hobbyst on a local machine) are not trained to give back coordinates.

GLM-OCR runs at about 200-250 tokens, taking <100ms per subtitle on my Ryzen AI MAX 395 using llama.cpp with the Vulkan backend. That is comparable to Tesseract's speeds, yet properly recognizes uppercase "i" vs lowercase "L" for example. It, however, is not possible to ask for coordinates because the 900M model simply isn't trained for that.

More capable models that can indeed process multiple and give back coordinates such as Qwen3.6-35B-A3B-GGUF models or Gemma 4 26B A4B are an option, but those are much slower at 30-40 tokens per second with 8 bit quantisation and taking ~800ms per subtitle.

Time increases linearly with the amounts of tokens to generate, so using a simpler, single page model is gonna be always faster than bundling multiple with a coordinate-producing model.

socram8888 force-pushed the llm branch from 308031f to db85886 Compare May 8, 2026 10:51

Add support for LLM-based transcriptions

8701e03

socram8888 force-pushed the llm branch from db85886 to 8701e03 Compare May 8, 2026 15:50

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for LLM-based transcriptions#148

Add support for LLM-based transcriptions#148
socram8888 wants to merge 1 commit into
ratoaq2:mainfrom
socram8888:llm

socram8888 commented May 8, 2026

Uh oh!

ratoaq2 commented May 10, 2026

Uh oh!

ratoaq2 commented May 15, 2026

Uh oh!

socram8888 commented May 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

socram8888 commented May 8, 2026

Uh oh!

ratoaq2 commented May 10, 2026

Uh oh!

ratoaq2 commented May 15, 2026

Uh oh!

socram8888 commented May 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

socram8888 commented May 15, 2026 •

edited

Loading