Commit d2451af

Merge branch 'master' into NUTCH-2455
2 parents 16f26f1 + f82959d commit d2451af

72 files changed: +6641 -780 lines changed

CHANGES.txt

Lines changed: 97 additions & 5 deletions
@@ -1,15 +1,107 @@
 # Nutch Change Log
 
-Nutch 1.14 Release (dd/mm/yyyy)
+Nutch 1.15 Release (dd/mm/yyyy)
 
 Comments
 
-Fellow committers, Nutch 1.14 contains a breaking change NUTCH-2046. Please use the note below and
-in the release announcement and keep it on top in this CHANGES.txt for the Nutch 1.14 release.
-* the bin/crawl script now expects the path to the seed to be preceded by -s
+Breaking Changes
+
+
+Nutch 1.14 Release 18/12/2017 (dd/mm/yyyy)
+
+- the bin/crawl script now expects the path to the seed to be preceded by -s (NUTCH-2046)
+
+Bug
+
+[NUTCH-2071] - A parser failure on a single document may fail crawling job
+[NUTCH-2235] - Classpath discrepancy with protocol-selenium in deploy mode
+[NUTCH-2269] - Clean not working after crawl
+[NUTCH-2295] - Nutch master docker container broken
+[NUTCH-2297] - CrawlDbReader -stats wrong values for earliest fetch time and shortest interval
+[NUTCH-2316] - Library conflict with Parser-Tika Plugin and Lib Folder
+[NUTCH-2317] - Plugin jars don't get added to classpath while running in local
+[NUTCH-2322] - URL not available for Jexl operations
+[NUTCH-2354] - Upgrade Hadoop dependencies to 2.7.4
+[NUTCH-2365] - HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain
+[NUTCH-2371] - Injector to support noFilter and noNormalize
+[NUTCH-2372] - Javadocs build failing.
+[NUTCH-2386] - BasicURLNormalizer does not encode curly braces
+[NUTCH-2391] - Spurious Duplications for MD5
+[NUTCH-2394] - Possible bugs in the source code
+[NUTCH-2398] - Fetcher saving redirected robots.txt under redirect target URL
+[NUTCH-2399] - indexer-elastic does not index multi-value fields (only the first value is indexed)
+[NUTCH-2401] - headings plugin does not trim values
+[NUTCH-2403] - Nutch Selenium: Wrong documentation about PhantomJS
+[NUTCH-2413] - Parsing fetcher to respect property "parse.filter.urls"
+[NUTCH-2420] - Bug in variable generate.max.count and fetcher.server.delay
+[NUTCH-2436] - Remove empty comment, and redundant semicolon from CommandRunner
+[NUTCH-2442] - Injector to stop if job fails to avoid loss of CrawlDb
+[NUTCH-2444] - HostDB CSV dumper to emit field header by default
+[NUTCH-2446] - URLFiltersCheck fix
+[NUTCH-2448] - Allow Sending an empty http.agent.version
+[NUTCH-2451] - protocol-ftp to resolve relative URL when following redirects
+[NUTCH-2452] - Problem retrieving encoded URLs via FTP?
+[NUTCH-2456] - Allow to index pages/URLs not contained in CrawlDb
+[NUTCH-2458] - TikaParser doesn't work with tika-config.xml set
+[NUTCH-2464] - Plugin headings: Headers That Contain HTML Elements Are Not Parsed
+[NUTCH-2465] - Broken Eclipse project. Classpaths and interactiveselenium should be fixed.
+[NUTCH-2472] - Sitemap processor does not honour db.ignore.external.links
+[NUTCH-2473] - Elasticsearch REST Indexer broken due to wrong depenency
+[NUTCH-2474] - CrawlDbReader -stats fails with ClassCastException
+[NUTCH-2478] - // is not a valid base URL
+[NUTCH-2483] - Remove/replace indirect dependencies to org.json
+
+Improvement
+
+[NUTCH-1763] - Improving comments on the Injector Class
+[NUTCH-2034] - CrawlDB filtered documents counter.
+[NUTCH-2035] - Regex filter using case sensitive rules.
+[NUTCH-2046] - The crawl script should be able to skip an initial injection.
+[NUTCH-2135] - Ant Eclipse build does not include protocol-interactiveselenium
+[NUTCH-2193] - Upgrade feed parser plugin to use rome 1.5
+[NUTCH-2216] - db.ignore.*.links to optionally follow internal redirects
+[NUTCH-2281] - Support non-default FileSystem
+[NUTCH-2296] - Elasticsearch Indexing Over Rest
+[NUTCH-2320] - URLFilterChecker to run as TCP Telnet service
+[NUTCH-2335] - Injector not to filter and normalize existing URLs in CrawlDb
+[NUTCH-2362] - Upgrade MaxMind GeoIP version in index-geoip
+[NUTCH-2368] - Variable generate.max.count and fetcher.server.delay
+[NUTCH-2370] - FileDumper: save JSON mapping file -> URL
+[NUTCH-2376] - Improve configurability of HTTP Accept* header fields
+[NUTCH-2378] - ChildFirst plugin classloader
+[NUTCH-2380] - indexer-elastic version upgrade to 5.3.0
+[NUTCH-2397] - Parser to add paragraph line breaks
+[NUTCH-2400] - Solr 6.6.0 compatibility
+[NUTCH-2406] - Sum up constants, make minor changes
+[NUTCH-2408] - CrawlDb: allow update from unparsed segments
+[NUTCH-2409] - Injector: complete command-line help and counters
+[NUTCH-2414] - Allow LanguageIndexingFilter to actually filter documents by language.
+[NUTCH-2430] - Complete plugin build configuration
+[NUTCH-2431] - URLFilterchecker to implement Tool-interface
+[NUTCH-2439] - Upgrade to Apache Tika 1.17
+[NUTCH-2443] - Extract links from the video tag with the parse-html plugin
+[NUTCH-2445] - Fetcher following outlinks to keep track of already fetched items
+[NUTCH-2463] - Enable sampling CrawlDB
+[NUTCH-2468] - should filter out invalid URLs by default
+[NUTCH-2470] - CrawlDbReader -stats to show quantiles of score
+[NUTCH-2477] - Refactor *Checker classes to use base class for common code
+[NUTCH-2480] - Upgrade crawler-commons dependency to 0.9
 
 New Feature
-[NUTCH-2046] - The crawl script should be able to skip an initial injection
+
+[NUTCH-1465] - Support sitemaps in Nutch
+[NUTCH-1932] - Automatically remove orphaned pages
+[NUTCH-2333] - Indexer for RabbitMQ
+[NUTCH-2338] - URLNormalizerChecker to run as TCP Telnet service
+[NUTCH-2415] - Create a JEXL based IndexingFilter
+[NUTCH-2433] - Html Parser: keep htmltag where the outlinks are found
+[NUTCH-2435] - New configuration allowing to choose whether to store 'parse_text' directory or not.
+[NUTCH-2484] - Extend indexer-elastic-rest to support languages
+
+Task
+
+[NUTCH-2181] - Add Webpage for 3rd Party Connectors/Libraries to Apache Nutch
+
 
 Nutch 1.13 Release 28/03/2017 (dd/mm/yyyy)
 Release Report: https://s.apache.org/wq3x

NOTICE.txt

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 Apache Nutch
-Copyright 2017 The Apache Software Foundation
+Copyright 2018 The Apache Software Foundation
 
 This product includes software developed by The Apache Software
 Foundation (http://www.apache.org/).

build.xml

Lines changed: 8 additions & 0 deletions
@@ -173,12 +173,14 @@
 <arg value="${javadoc.proxy.port}"/>
 
 <packageset dir="${src.dir}"/>
+<packageset dir="${plugins.dir}/any23/src/java/" />
 <packageset dir="${plugins.dir}/creativecommons/src/java"/>
 <packageset dir="${plugins.dir}/feed/src/java"/>
 <packageset dir="${plugins.dir}/headings/src/java"/>
 <packageset dir="${plugins.dir}/index-anchor/src/java"/>
 <packageset dir="${plugins.dir}/index-basic/src/java"/>
 <packageset dir="${plugins.dir}/index-geoip/src/java"/>
+<packageset dir="${plugins.dir}/index-jexl-filter/src/java"/>
 <packageset dir="${plugins.dir}/index-links/src/java"/>
 <packageset dir="${plugins.dir}/index-metadata/src/java"/>
 <packageset dir="${plugins.dir}/index-more/src/java"/>
@@ -624,12 +626,14 @@
 <arg value="${javadoc.proxy.port}"/>
 
 <packageset dir="${src.dir}"/>
+<packageset dir="${plugins.dir}/any23/src/java/" />
 <packageset dir="${plugins.dir}/creativecommons/src/java"/>
 <packageset dir="${plugins.dir}/feed/src/java"/>
 <packageset dir="${plugins.dir}/headings/src/java"/>
 <packageset dir="${plugins.dir}/index-anchor/src/java"/>
 <packageset dir="${plugins.dir}/index-basic/src/java"/>
 <packageset dir="${plugins.dir}/index-geoip/src/java"/>
+<packageset dir="${plugins.dir}/index-jexl-filter/src/java"/>
 <packageset dir="${plugins.dir}/index-links/src/java"/>
 <packageset dir="${plugins.dir}/index-metadata/src/java"/>
 <packageset dir="${plugins.dir}/index-more/src/java"/>
@@ -1030,6 +1034,8 @@
 <source path="${basedir}/src/java/" />
 <source path="${basedir}/src/test/" output="build/test/classes" />
 
+<source path="${plugins.dir}/any23/src/java/" />
+<source path="${plugins.dir}/any23/src/test/" />
 <source path="${plugins.dir}/creativecommons/src/java/" />
 <source path="${plugins.dir}/creativecommons/src/test/" />
 <source path="${plugins.dir}/feed/src/java/" />
@@ -1040,6 +1046,8 @@
 <source path="${plugins.dir}/index-basic/src/java/" />
 <source path="${plugins.dir}/index-basic/src/test/" />
 <source path="${plugins.dir}/index-geoip/src/java/" />
+<source path="${plugins.dir}/index-jexl-filter/src/java/" />
+<source path="${plugins.dir}/index-jexl-filter/src/test/" />
 <source path="${plugins.dir}/index-links/src/java/" />
 <source path="${plugins.dir}/index-links/src/test/" />
 <source path="${plugins.dir}/index-metadata/src/java/" />

conf/nutch-default.xml

Lines changed: 84 additions & 4 deletions
@@ -164,7 +164,7 @@
 
 <property>
 <name>http.agent.version</name>
-<value>Nutch-1.14-SNAPSHOT</value>
+<value>Nutch-1.15-SNAPSHOT</value>
 <description>A version string to advertise in the User-Agent
 header.</description>
 </property>
@@ -572,7 +572,7 @@
 <value>false</value>
 <description>If true, outlinks leading from a page to internal hosts or domain
 will be ignored. This is an effective way to limit the crawl to include
-only initially injected hosts, without creating complex URLFilters.
+only initially injected hosts or domains, without creating complex URLFilters.
 See 'db.ignore.external.links.mode'.
 </description>
 </property>
@@ -582,11 +582,21 @@
 <value>false</value>
 <description>If true, outlinks leading from a page to external hosts or domain
 will be ignored. This is an effective way to limit the crawl to include
-only initially injected hosts, without creating complex URLFilters.
+only initially injected hosts or domains, without creating complex URLFilters.
 See 'db.ignore.external.links.mode'.
 </description>
 </property>
 
+<property>
+<name>db.ignore.also.redirects</name>
+<value>true</value>
+<description>If true, the fetcher checks redirects the same way as
+links when ignoring internal or external links. Set to false to
+follow redirects despite the values for db.ignore.external.links and
+db.ignore.internal.links.
+</description>
+</property>
+
 <property>
 <name>db.ignore.external.links.mode</name>
 <value>byHost</value>
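As a reading aid for the `db.ignore.also.redirects` description in the hunk above, the decision can be modelled roughly as follows. This is a sketch and not Nutch's fetcher code: the function name `follow_redirect`, its keyword parameters, and the host-only comparison (loosely corresponding to `db.ignore.external.links.mode` = `byHost`) are assumptions made for illustration.

```python
from urllib.parse import urlsplit


def follow_redirect(source_url, target_url, *,
                    ignore_external=False,
                    ignore_internal=False,
                    ignore_also_redirects=True):
    """Hypothetical model of the documented behaviour, comparing hosts only."""
    if not ignore_also_redirects:
        # Redirects are followed despite db.ignore.external.links
        # and db.ignore.internal.links.
        return True
    source_host = urlsplit(source_url).hostname or ""
    target_host = urlsplit(target_url).hostname or ""
    if source_host == target_host:
        # Redirect stays on the same host: treated like an internal link.
        return not ignore_internal
    # Redirect leaves the host: treated like an external link.
    return not ignore_external
```

With `ignore_external=True`, a redirect from `a.example` to `b.example` is dropped under the default and followed once `ignore_also_redirects` is set to false.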
@@ -1054,6 +1064,14 @@
 Publisher implementation specific properties</description>
 </property>
 
+<!-- any23 plugin properties -->
+
+<property>
+<name>any23.extractors</name>
+<value>html-microdata</value>
+<description>Comma-separated list of Any23 extractors (a list of extractors is available here: http://any23.apache.org/getting-started.html)</description>
+</property>
+
 <!-- moreindexingfilter plugin properties -->
 
 <property>
@@ -1225,7 +1243,7 @@
 
 <property>
 <name>plugin.includes</name>
-<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
+<value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
 <description>Regular expression naming plugin directory names to
 include. Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
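The effect of the new `plugin.includes` value can be checked by testing plugin directory names against the old and new expressions. The sketch below assumes the whole directory name must match the expression (modelled here with Python's `re.fullmatch`); the helper name `included` is hypothetical, and the plugin names tested are ones mentioned elsewhere in this commit.

```python
import re

# Values copied from the removed and added <value> lines of the hunk above.
OLD = (r"protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
       r"|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)")
NEW = (r"protocol-http|urlfilter-(regex|validator)|parse-(html|tika)"
       r"|index-(basic|anchor)|indexer-solr|scoring-opic"
       r"|urlnormalizer-(pass|regex|basic)")


def included(plugin_dir, expression):
    """Return True if the plugin directory name matches the whole expression."""
    return re.fullmatch(expression, plugin_dir) is not None


for name in ("urlfilter-regex", "urlfilter-validator", "urlfilter-suffix"):
    print(name, included(name, OLD), included(name, NEW))
```

Under that assumption only `urlfilter-validator` changes status: it is excluded by the old value and included by the new one, which fits NUTCH-2468 ("should filter out invalid URLs by default").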
@@ -1406,6 +1424,12 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
 </property>
 -->
 
+<property>
+<name>tika.config.file</name>
+<value>tika-config.xml</value>
+<description>Nutch-specific Tika config file</description>
+</property>
+
 <property>
 <name>tika.uppercase.element.names</name>
 <value>true</value>
@@ -1608,6 +1632,34 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 </description>
 </property>
 
+<property>
+<name>lang.index.languages</name>
+<value></value>
+<description>If not empty, should be a comma separated list of language codes.
+Only documents with one of these language codes will be indexed.
+"unknown" is a valid language code, will match documents where language
+detection failed.
+</description>
+</property>
+
+<!-- index-jexl-filter plugin properties -->
+
+<property>
+<name>index.jexl.filter</name>
+<value></value>
+<description> A JEXL expression. If it evaluates to false,
+the document will not be indexed.
+Available primitives in the JEXL context:
+* status, fetchTime, modifiedTime, retries, interval, score, signature, url, text, title
+Available objects in the JEXL context:
+* httpStatus - contains majorCode, minorCode, message
+* documentMeta, contentMeta, parseMeta - contain all the Metadata properties.
+each property value is always an array of Strings (so if you expect one value, use [0])
+* doc - contains all the NutchFields from the NutchDocument.
+each property value is always an array of Objects.
+</description>
+</property>
+
 <!-- index-static plugin properties -->
 
 <property>
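The `lang.index.languages` description in the hunk above amounts to a small membership test. The following is a minimal sketch of that rule, not the LanguageIndexingFilter implementation: the function name `should_index` is hypothetical, and representing a failed detection as `None` is an assumption.

```python
def should_index(detected_lang, lang_index_languages=""):
    """Decide whether a document is indexed, per the property description.

    detected_lang is a language code, or None when detection failed
    (an assumed representation; it is mapped to the code "unknown").
    """
    allowed = {code.strip() for code in lang_index_languages.split(",")
               if code.strip()}
    if not allowed:
        # Empty property: no language-based filtering at all.
        return True
    return (detected_lang or "unknown") in allowed
```

For example, with the value `en,unknown`, English documents and documents whose language could not be detected are indexed, while a German document is skipped.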
@@ -2081,6 +2133,34 @@ visit https://wiki.apache.org/nutch/SimilarityScoringFilter-->
 <description>Default index to send documents to.</description>
 </property>
 
+<property>
+<name>elastic.rest.index.languages</name>
+<value></value>
+<description>
+A list of strings denoting the supported languages (e.g. `en,de,fr,it`).
+If this value is empty all documents will be sent to index ${elastic.rest.index}.
+If not empty the Rest client will distribute documents in different indices based on their `lang` property.
+Indices are named with the following schema: ${elastic.rest.index}${elastic.rest.index.separator}${lang} (e.g. `nutch_de`).
+Entries with an unsupported `lang` value will be added to index ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink} (e.g. `nutch_others`).
+</description>
+</property>
+
+<property>
+<name>elastic.rest.index.separator</name>
+<value>_</value>
+<description>
+Default value is `_`. Is used only if `elastic.rest.index.languages` is defined to build the index name (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${lang}).
+</description>
+</property>
+
+<property>
+<name>elastic.rest.index.sink</name>
+<value>others</value>
+<description>
+Default value is `others`. Is used only if `elastic.rest.index.languages` is defined to build the index name where to store documents with unsupported languages (i.e. ${elastic.rest.index}${elastic.rest.index.separator}${elastic.rest.index.sink}).
+</description>
+</property>
+
 <property>
 <name>elastic.rest.type</name>
 <value>doc</value>
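The index naming scheme described by the three `elastic.rest.index.*` properties above can be illustrated with a short sketch. It is not the indexer-elastic-rest code: the function name `resolve_index` is hypothetical, and the base index name `nutch` is an assumption inferred from the `nutch_de` and `nutch_others` examples in the descriptions, since the value of `elastic.rest.index` is not shown in this diff.

```python
def resolve_index(lang, *, index="nutch", languages="",
                  separator="_", sink="others"):
    """Pick the target index for a document with the given `lang` value."""
    supported = [code.strip() for code in languages.split(",") if code.strip()]
    if not supported:
        # elastic.rest.index.languages is empty: everything goes to one index.
        return index
    if lang in supported:
        return f"{index}{separator}{lang}"
    # Unsupported (or missing) language: route to the sink index.
    return f"{index}{separator}{sink}"


print(resolve_index("de", languages="en,de,fr,it"))
print(resolve_index("ja", languages="en,de,fr,it"))
```

The separator and sink only come into play once a language list is configured, which matches the "Is used only if `elastic.rest.index.languages` is defined" wording.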

conf/regex-urlfilter.txt.template

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@
 
 # skip image and other suffixes we can't yet parse
 # for a more extensive coverage use the urlfilter-suffix plugin
--\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
+-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
 
 # skip URLs containing certain characters as probable queries, etc.
 -[?*!@=]
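The suffix rule change is more than a shortening: the old pattern listed only all-lowercase and all-uppercase spellings, so mixed-case suffixes such as `.Jpg` slipped through. The sketch below compares both patterns in Python. Nutch evaluates these rules with Java regular expressions, so this only approximates the behaviour; it assumes the leading `-` is the rule sign rather than part of the pattern and that a rule applies when the pattern is found anywhere in the URL. The helper name `skipped` and the example URLs are hypothetical.

```python
import re

# Patterns from the removed and added rule lines, without the leading "-" sign.
OLD_RULE = re.compile(
    r"\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
    r"|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
    r"|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$")
NEW_RULE = re.compile(
    r"(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz"
    r"|mov|exe|jpeg|bmp|js)$")


def skipped(url, rule):
    """Return True if the exclusion rule would reject the URL."""
    return rule.search(url) is not None


for url in ("http://example.com/logo.JPG",
            "http://example.com/logo.Jpg",
            "http://example.com/page.html"):
    print(url, skipped(url, OLD_RULE), skipped(url, NEW_RULE))
```

Only the mixed-case URL is treated differently: the old rule lets it through, the new one rejects it.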

conf/tika-config.xml.template

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+<properties>
+<service-loader initializableProblemHandler="ignore"/>
+</properties>

default.properties

Lines changed: 5 additions & 3 deletions
@@ -14,9 +14,9 @@
 # limitations under the License.
 
 name=apache-nutch
-version=1.14-SNAPSHOT
+version=1.15-SNAPSHOT
 final.name=${name}-${version}
-year=2017
+year=2018
 
 basedir = ./
 src.dir = ./src/java
@@ -170,6 +170,7 @@ plugins.index=\
 org.apache.nutch.indexer.basic*:\
 org.apache.nutch.indexer.feed*:\
 org.apache.nutch.indexer.geoip*:\
+org.apache.nutch.indexer.jexl*:\
 org.apache.nutch.indexer.filter*:\
 org.apache.nutch.indexer.links*:\
 org.apache.nutch.indexer.metadata*:\
@@ -202,5 +203,6 @@ plugins.misc=\
 org.apache.nutch.collection*:\
 org.apache.nutch.analysis.lang*:\
 org.creativecommons.nutch*:\
-org.apache.nutch.microformats.reltag*
+org.apache.nutch.microformats.reltag*:\
+org.apache.nutch.any23*
 