Updated some entries in the file COUNTER_Robots_list and added 1636 n… #62
CRMGB wants to merge 3 commits into atmire:master from
Conversation
Added 1636 new entries, updated CHANGES.md with the new bots, and made changes and corrections to the convert_to_txt file.
@CRMGB there is a ton of duplication in your proposed additions. Not to mention, our list already has a much better regular expression for several of these. Also, your proposed additions have not escaped special characters like forward slashes and dots. And your patterns are strangely cut off at a certain line length or something? Lastly, some of your patterns would already be matched by existing, broader ones.

So this pull request is not in good shape as is. There may be some new bot patterns we can use from the other project, but in such a large list it is very difficult to verify them. Personally, I would prefer to have bots that have been verified from access log files directly.
No problem @alanorth. Regards.
…t this regular expression: '^java\/\d{1,2}.\d'
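The regular expression in that commit can be exercised directly; note that its dot is still unescaped, so it acts as a wildcard rather than a literal dot (case-insensitive matching is assumed here, as is common when applying the COUNTER list):

```python
import re

# Pattern quoted in the commit message above.
java_bot = re.compile(r"^java\/\d{1,2}.\d", re.IGNORECASE)

print(bool(java_bot.search("Java/1.8.0_191")))  # True: real Java client UA
print(bool(java_bot.search("Java/1x8")))        # also True: '.' matched 'x'
```

Escaping the dot as `\.` would restrict the match to version strings like `Java/1.8`.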
- Removed pattern duplications.
- Added escapes for special characters.
- Fixed the length of some patterns.
- Deleted patterns that were already matched by existing ones.

Updated CHANGES.md and the txt version respectively.
Force-pushed from 9eedfa3 to b22e6c6
Updated the robots list with 1636 new robots.
Most of the new entries have been gathered from https://github.com/monperrus/crawler-user-agents/blob/master/crawler-user-agents.json, using it as a guide and detecting new bots from our user-agents dataset.
This list is designed to be used as a REGEX pattern to identify crawlers/bots from our user-agent entries and exclude them from our metrics if detected.
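As a sketch of that intended usage, the patterns can be combined into a single case-insensitive alternation and applied to each user-agent string (the field name and entries below are assumptions for illustration, following the general layout of COUNTER_Robots_list.json):

```python
import re

# Illustrative entries; in practice these would be loaded from the JSON list.
entries = [
    {"pattern": "bot"},
    {"pattern": r"^java\/\d{1,2}\.\d"},
    {"pattern": "crawler"},
]

# One combined alternation, compiled once for efficiency.
robots = re.compile("|".join(e["pattern"] for e in entries), re.IGNORECASE)

def is_bot(user_agent: str) -> bool:
    """Return True when the hit should be excluded from metrics."""
    return robots.search(user_agent) is not None

print(is_bot("Googlebot/2.1"))                  # True: matches 'bot'
print(is_bot("Mozilla/5.0 (Windows NT 10.0)"))  # False: no pattern matches
```

Compiling the list into one regex is a common design choice here, since matching thousands of patterns individually against every log line would be considerably slower.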
The following files have been modified:
CHANGES.md
COUNTER_Robots_list.json
convert_to_txt
generated/COUNTER_Robots_list.txt