
Tokenizer may cause "string index out of range" on Japanese #31

@Saito-K03

Description

Hi, I use this software for extended real-time transcription sessions (2-3 hours), and I occasionally encounter the error attached below, which halts transcription.
As for when it occurs: it seems to happen whenever a � is still present at the "End of decoding loop" stage, though I haven't verified that this is a necessary and sufficient condition.
faster-whisper fixed this error in the pull request below, but the fix has never been applied to the original Whisper:
SYSTRAN/faster-whisper#111
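
For reference, the failing line indexes decoded_full without checking that the index is in range. Below is a minimal sketch of the guarded condition, adapted from the idea in that PR (not a verbatim copy of either codebase; the decode parameter stands in for the tokenizer's decode_with_timestamps):

```python
from typing import Callable, List, Tuple

REPLACEMENT_CHAR = "\ufffd"


def split_tokens_on_unicode_patched(
    tokens: List[int], decode: Callable[[List[int]], str]
) -> Tuple[List[str], List[List[int]]]:
    """Sketch: commit a word only once it no longer ends mid-character."""
    decoded_full = decode(tokens)

    words: List[str] = []
    word_tokens: List[List[int]] = []
    current_tokens: List[int] = []
    unicode_offset = 0

    for token in tokens:
        current_tokens.append(token)
        decoded = decode(current_tokens)

        try:
            # Position of the first incomplete character in this prefix,
            # mapped into decoded_full's coordinates.
            replacement_index = unicode_offset + decoded.index(REPLACEMENT_CHAR)
        except ValueError:
            replacement_index = None

        if replacement_index is None or (
            # Bounds check: when the token stream itself ends mid-character,
            # this index can land past the end of decoded_full; indexing it
            # unguarded is exactly the IndexError reported here.
            replacement_index < len(decoded_full)
            and decoded_full[replacement_index] == REPLACEMENT_CHAR
        ):
            words.append(decoded)
            word_tokens.append(current_tokens)
            current_tokens = []
            unicode_offset += len(decoded)

    return words, word_tokens
```

The log leading up to the crash: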

DEBUG	<|startoftranscript|><|ja|><|transcribe|><|notimestamps|>もう渡込みちゃうやばいんだよね。もう一回行く。もう一回行く。やったー。おー!われらにそうやった。いつからそうだった?いや、明日ですよ。え、そうやめっちゃ嬉
DEBUG	[998] most att frames
DEBUG	current tokens torch.Size([1, 65])
DEBUG	attention reaches the end: 998/1020
INFO	End of decoding loop
DEBUG	new_hypothesis: [1543, 6474, 1231, 9955, 7355, 11429, 41380, 161, 105]
INFO	Output: 。え、そうやめっちゃ�
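
For context, the trailing � in the Output line is expected with byte-level BPE: a single UTF-8 character can be split across tokens, so decoding a hypothesis that ends mid-character yields U+FFFD. A minimal illustration, using 嬉 (the last kanji in the log above):

```python
# 嬉 is three bytes in UTF-8 (e5 ac 89); decoding a prefix that cuts the
# character off yields the replacement character U+FFFD.
data = "嬉".encode("utf-8")

print(data[:2].decode("utf-8", errors="replace"))  # '�'  - byte still missing
print(data.decode("utf-8", errors="replace"))      # '嬉' - complete character
```

And the traceback: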
Traceback (most recent call last):
  File "/mnt/c/Users/usr/sample/SimulStreaming/simulstreaming_whisper_server.py", line 6, in <module>
    main_server(simul_asr_factory, add_args=simulwhisper_args)
  File "/mnt/c/Users/usr/sample/SimulStreaming/whisper_streaming/whisper_server.py", line 174, in main_server
    proc.process()
  File "/mnt/c/Users/usr/sample/SimulStreaming/whisper_streaming/whisper_server.py", line 105, in process
    o = self.online_asr_proc.process_iter()
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/Users/usr/sample/SimulStreaming/whisper_streaming/vac_online_processor.py", line 101, in process_iter
    ret = self.online.process_iter()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/Users/usr/sample/SimulStreaming/simulstreaming_whisper.py", line 220, in process_iter
    tokens = self.hide_incomplete_unicode(tokens)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/Users/usr/sample/SimulStreaming/simulstreaming_whisper.py", line 200, in hide_incomplete_unicode
    chars, _ = self.model.tokenizer.split_tokens_on_unicode(tokens)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/c/Users/usr/sample/SimulStreaming/simul_whisper/whisper/tokenizer.py", line 301, in split_tokens_on_unicode
    or decoded_full[unicode_offset + decoded.index(replacement_char)]
       ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
IndexError: string index out of range
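
If it helps with triage: as far as I can tell, the condition computes unicode_offset + decoded.index(replacement_char) and indexes decoded_full with it unguarded; when the token stream itself ends mid-character, that index can land one past the end of decoded_full. A hypothetical walk-through with made-up byte values (a fake decode stands in for the real tokenizer):

```python
def decode(token_bytes):
    # Fake byte-level decode standing in for decode_with_timestamps.
    return b"".join(token_bytes).decode("utf-8", errors="replace")

# "あ" is e3 81 82 and "い" is e3 81 84. The first token carries "あ" plus
# the first byte of "い"; the second token carries only the next byte, and
# the third byte never arrives, so the stream ends mid-character.
tokens = [b"\xe3\x81\x82\xe3", b"\x81"]

decoded_full = decode(tokens)  # 'あ�', length 2
unicode_offset = 0

# First prefix decodes to 'あ�'; the '�' at index 1 matches decoded_full,
# so this word is committed and the offset advances by its full length.
decoded = decode(tokens[:1])
assert decoded_full[unicode_offset + decoded.index("\ufffd")] == "\ufffd"
unicode_offset += len(decoded)  # now 2 == len(decoded_full)

# Next prefix is b'\x81' alone, which decodes to '�'. The lookup index is
# 2, one past the end of decoded_full:
decoded = decode([tokens[1]])
decoded_full[unicode_offset + decoded.index("\ufffd")]  # IndexError
```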
