Add support for English by ZeekYin · Pull Request #41 · WorksApplications/uzushio

ZeekYin · 2024-10-11T07:41:38Z

No description provided.

eiennohito

Using regexes is probably OK (though you should use the JVM API here, not the Scala one), but it is better to skip ~50% of the input document before language detection, as I have explained. The header section usually contains a lot of uninteresting, ASCII-only content, and filtering out only scripts won't help much. There are also comments, inline stylesheets, and other things we can ignore if we start language detection from the first tag after ~50% of the text content.
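
A minimal sketch of the "start from the first tag after ~50%" idea, for illustration only. The object and method names (DetectionWindow, startOffset, window) are hypothetical and not part of uzushio, and the sketch measures 50% of the raw characters rather than of the text content, which is a simplifying assumption.

```scala
import java.nio.CharBuffer

// Hypothetical helper, not part of uzushio.
object DetectionWindow {

  /** Index of the first '<' at or after `fraction` of the input,
    * or the fraction point itself if no tag follows it. */
  def startOffset(input: CharSequence, fraction: Double = 0.5): Int = {
    val from = (input.length * fraction).toInt
    var i = from
    while (i < input.length && input.charAt(i) != '<') i += 1
    if (i < input.length) i else from
  }

  /** View of the input starting at that offset. CharBuffer.subSequence
    * shares the underlying content, so no String is created. */
  def window(input: CharBuffer, fraction: Double = 0.5): CharBuffer = {
    val start = startOffset(input, fraction)
    input.subSequence(start, input.length)
  }
}
```

The returned buffer can then be passed to whatever step copies content for detection, so the header, with its mostly ASCII markup, is never inspected.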

eiennohito · 2024-10-11T07:50:08Z

lib/src/main/scala/com/worksap/nlp/uzushio/lib/lang/LangEstimation.scala

-        output.put(char)
+  private def copyMeaningfulContent(input: CharBuffer, output: CharBuffer): Unit = {
+    // Convert the input to a string
+    val content = input.toString


It is possible to avoid creating this string completely; you do not need it.
Pattern.matcher can take a CharBuffer directly as input, because CharBuffer implements the CharSequence interface: https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/regex/Pattern.html#matcher(java.lang.CharSequence)
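
As a rough illustration of this point, the sketch below runs a java.util.regex matcher directly over a CharBuffer and copies only the characters outside tags into an output buffer, without calling input.toString. The names TagSkipper, copyOutsideTags, and appendRange are hypothetical, and the tag pattern is a deliberately simple assumption rather than the pattern used in this PR.

```scala
import java.nio.CharBuffer
import java.util.regex.Pattern

// Hypothetical helper, not the PR's actual implementation.
object TagSkipper {
  // Simplistic tag pattern, assumed for illustration only.
  private val Tag = Pattern.compile("<[^>]*>")

  def copyOutsideTags(input: CharBuffer, output: CharBuffer): Unit = {
    // CharBuffer implements CharSequence, so no intermediate String is needed.
    val m = Tag.matcher(input)
    var last = 0
    while (m.find()) {
      appendRange(input, last, m.start(), output)
      last = m.end()
    }
    appendRange(input, last, input.length, output)
  }

  // Copies src[from, until) into dst, stopping early if dst is full.
  // charAt indices are relative to the buffer's position, matching the matcher's indices.
  private def appendRange(src: CharBuffer, from: Int, until: Int, dst: CharBuffer): Unit = {
    var i = from
    while (i < until && dst.hasRemaining) {
      dst.put(src.charAt(i))
      i += 1
    }
  }
}
```

The input buffer's position is left untouched, since charAt reads relative to the current position without advancing it.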

ZeekYin · 2024-10-20T04:45:03Z

I changed the code according to your instructions:

1. Used the Java regex Pattern API
2. Start estimation from ~50% of the document
3. Skip CSS, etc.

However, it still cannot recognize English properly.
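
For reference, one way the "dismiss CSS, etc." step could look with the JVM regex API is sketched below. This is an assumption-laden illustration, not the code in this PR: the object name NonContentFilter and the pattern are hypothetical, and a regex like this will not handle every malformed or nested case.

```scala
import java.util.regex.Pattern

// Hypothetical helper, not the PR's actual implementation.
object NonContentFilter {
  // Matches HTML comments plus whole <script> and <style> elements.
  // DOTALL lets '.' span newlines; CASE_INSENSITIVE covers <STYLE>, <Script>, etc.
  private val NonContent = Pattern.compile(
    "<!--.*?-->|<script\\b[^>]*>.*?</script\\s*>|<style\\b[^>]*>.*?</style\\s*>",
    Pattern.CASE_INSENSITIVE | Pattern.DOTALL
  )

  /** Replaces each non-content block with a single space. */
  def strip(html: CharSequence): String =
    NonContent.matcher(html).replaceAll(" ")
}
```

Replacing with a space rather than an empty string keeps words on either side of a removed block from being fused together.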

ZeekYin · 2024-10-22T01:51:10Z

Is the regex-based method too rudimentary? Should I use Jsoup?

ZeekYin · 2024-10-23T02:51:53Z

I also tried it on a relatively big English corpus, but I got this:

'language=ar'   'language=fr'  'language=ko'  'language=pt'  'language=uk'
'language=ast'  'language=ga'  'language=lt'  'language=ru'  'language=ur'
'language=be'   'language=gl'  'language=lv'  'language=sk'  'language=vi'
'language=bg'   'language=hi'  'language=mk'  'language=sq'  'language=zh'
'language=bn'   'language=is'  'language=mr'  'language=sr'   _SUCCESS
'language=cs'   'language=ja'  'language=mt'  'language=sv'
'language=el'   'language=km'  'language=oc'  'language=th'
'language=fa'   'language=kn'  'language=pl'  'language=tr'

Only English is missing. I found that strange, so I tried a single English HTML file from this corpus, and the LangEstimator correctly estimated it as English. Is there anything I need to change in uzushio? Any insights would be appreciated.

ZeekYin added 5 commits October 6, 2024 20:44

params changed for English

15c9a69

add support for ascii char

ddc2454

Update LangEstimation.scala

2d2a167

1

c6058f8

english detectable

99887b0

ZeekYin marked this pull request as ready for review October 11, 2024 07:43

eiennohito reviewed Oct 11, 2024

ZeekYin added 2 commits October 18, 2024 16:06

start judge from 50%~

572d0a6

Changed estimation method

c1f9214

1. Used java regex pattern 2. estimate from 50% ~ 3. dismiss css, etc

ZeekYin wants to merge 7 commits into WorksApplications:main from ZeekYin:main

2 participants
