Conversation
eiennohito left a comment
Using regexes is probably OK (but you should use the JVM API here, not the Scala one), but it is better to skip the first ~50% of the input document for language detection, as I have explained. The header section usually contains a lot of uninteresting content written only in ASCII, and filtering only scripts won't help that much. There are also comments, inline stylesheets, and other things we can ignore if we start language detection from the first tag after ~50% of the text content.
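As a rough illustration of the "first tag after ~50%" idea, here is a minimal sketch. The object and method names (`DetectionStart`, `startOffset`, `tail`) are hypothetical and not part of uzushio, and the 50% point is measured in raw characters of the document, which is an assumption about what "text content" means here.

```scala
import java.nio.CharBuffer

object DetectionStart {
  /** Index (relative to the start of `input`) of the first '<' at or after
    * `fraction` of the input. Falls back to the fraction point itself when
    * no tag follows it. */
  def startOffset(input: CharSequence, fraction: Double = 0.5): Int = {
    val len = input.length
    val mid = (len * fraction).toInt
    var i = mid
    while (i < len && input.charAt(i) != '<') i += 1
    if (i < len) i else mid
  }

  /** A view over the same characters that starts at the chosen offset.
    * Nothing is copied and the position of `input` is left untouched. */
  def tail(input: CharBuffer, fraction: Double = 0.5): CharBuffer = {
    val view = input.duplicate()
    view.position(view.position() + startOffset(input, fraction))
    view
  }
}
```

For `<head>aaaa</head><body>hello</body>` the midpoint falls at index 17, which is already the `<` of `<body>`, so detection would only see the body. Because `charAt` and `length` on a `CharBuffer` are relative to its current position, the same function works on a buffer that has already been partially consumed.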
output.put(char)
private def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
  // Convert the input to a string
  val content = input.toString
It is possible to avoid creating this string completely; you do not need it.
Pattern.matcher (https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#matcher(java.lang.CharSequence)) accepts a CharBuffer directly as input, because CharBuffer implements the CharSequence interface.
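A minimal sketch of what that could look like, assuming a single combined pattern is acceptable. The pattern itself and the helper `appendRange` are illustrative assumptions, not the code from this PR; only `Pattern.matcher(CharSequence)` and the `CharBuffer` methods are standard JDK API.

```scala
import java.nio.CharBuffer
import java.util.regex.Pattern

object CopyMeaningful {
  // Hypothetical pattern: script/style blocks, comments, then any other tag.
  private val Skip: Pattern = Pattern.compile(
    """(?is)<script\b.*?</script\s*>|<style\b.*?</style\s*>|<!--.*?-->|<[^>]*>""")

  def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
    // CharBuffer is a CharSequence, so no input.toString is needed.
    val m = Skip.matcher(input)
    var last = 0
    while (m.find()) {
      appendRange(input, last, m.start(), output) // text before the match
      last = m.end()
    }
    appendRange(input, last, input.length(), output) // trailing text
  }

  // Copies src[from, to) char by char; indices are relative to src.position().
  // Stops silently when dst is full instead of throwing.
  private def appendRange(src: CharBuffer, from: Int, to: Int, dst: CharBuffer): Unit = {
    var i = from
    while (i < to && dst.hasRemaining) {
      dst.put(src.charAt(i))
      i += 1
    }
  }
}
```

The matcher offsets and `charAt` are both relative to the buffer's current position, so they stay consistent with each other, and the input buffer's position is not advanced by this method.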
I changed the code according to your instructions, including:
1. Using the Java regex Pattern API
2. Starting the estimation from ~50% of the document onward
3. Dropping CSS, etc.
However, it still cannot recognize English properly.
Is the regex-based method too rudimentary? Should I use Jsoup?
I also tried it on a relatively big English corpus, but in the result only English is disappearing. I thought that was strange, so I tried a single English HTML document from this corpus, and the LangEstimator estimated it as English correctly. Is there anything I need to change in uzushio? Any insights would be appreciated.