Nutch 2.2 with ElasticSearch 1.x and HBase

This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04.

Prerequisites

Make sure the Java SDK 7 (JDK 7) is installed:

[code language=”bash”]
$ sudo apt-get install openjdk-7-jdk
[/code]

Then set JAVA_HOME by adding the following line at the bottom of ~/.bashrc:
[code language=”bash”]
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
[/code]
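(The JDK directory name might differ on your system, for example java-7-openjdk-amd64 on 64-bit installs; check /usr/lib/jvm.)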

Now either open a new terminal session or type:
[code language=”bash”]
$ source ~/.bashrc
[/code]
This loads the changes from that file.
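
If you want to double-check the result, here is a minimal sketch of a sanity check. The function name check_java_home is made up for this example and is not part of any tool mentioned here; it only assumes that a usable JDK has an executable at bin/java below JAVA_HOME.

```shell
# check_java_home is a hypothetical helper (not part of Nutch, HBase or the JDK).
# It checks the directory given as first argument, or $JAVA_HOME if none is given.
check_java_home() {
  java_home="${1:-${JAVA_HOME:-}}"
  if [ -z "$java_home" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  # Assumption: a usable JDK ships an executable at bin/java.
  if [ ! -x "$java_home/bin/java" ]; then
    echo "no executable found at $java_home/bin/java" >&2
    return 1
  fi
  echo "JAVA_HOME looks fine: $java_home"
}
```

After sourcing ~/.bashrc, calling check_java_home without arguments tests the current value of JAVA_HOME.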

Download Nutch 2.2.x

Download Nutch 2.2.1 (or the latest 2.2.x release) from:
https://nutch.apache.org/downloads.html

Unpack it and follow the steps described in the tutorial:
http://wiki.apache.org/nutch/Nutch2Tutorial

Download HBase

Nutch 2.2.1 is proven to work with HBase 0.90.4. This version is quite old (2011), so you might be tempted to try a newer one, but Nutch doesn’t support newer versions. Hopefully there will be an upgrade soon.

http://archive.apache.org/dist/hbase/hbase-0.90.4/

Download ElasticSearch

Download and unpack ElasticSearch 1.x from:

http://www.elasticsearch.org/overview/elkdownloads/

To run ElasticSearch with the default configuration just go to ES_HOME and type:
[code language=”bash”]
$ bin/elasticsearch
[/code]

Install HBase

Install HBase according to:
http://hbase.apache.org/book/quickstart.html

If you’re running on Ubuntu you need to change the file /etc/hosts.
Due to an internal problem in old versions of HBase with loopback addresses, localhost has to be specified as 127.0.0.1.
Sometimes (on Ubuntu) an entry uses 127.0.1.1 instead; change every such entry to 127.0.0.1.
Apparently this is fixed in newer versions of HBase, but you cannot use them with Nutch yet.
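
The change to /etc/hosts can be sketched as a small sed call. The function name fix_loopback is made up for this example; it does not touch the original file and only prints a corrected copy, so you can review it before replacing anything.

```shell
# fix_loopback is a hypothetical helper: it prints the hosts file given as
# first argument with every leading 127.0.1.1 rewritten to 127.0.0.1.
# The input file itself is left unchanged.
fix_loopback() {
  sed 's/^127\.0\.1\.1\([[:space:]]\)/127.0.0.1\1/' "$1"
}
```

A cautious way to use it is fix_loopback /etc/hosts > /tmp/hosts.fixed, then compare both files with diff and copy the new one over /etc/hosts with sudo once you are happy with it.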

Now you have to change the configuration in HBASE_HOME/conf/hbase-site.xml.
HBase and ZooKeeper need directories to store their data in. The default location is under /tmp, which is wiped when the computer restarts.
So create two folders, one for HBase and one for ZooKeeper, where they can save their data.

[code language=”xml”]
<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
</property>
[/code]

Just replace DIRECTORY with a folder of your choice. Don’t forget the file:// in front of your hbase.rootdir.
You need to specify a location on your local filesystem to run HBase in standalone mode (without HDFS).
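
Creating the two folders can be sketched like this. The function name make_hbase_dirs and the output format are made up for this example; the sub-folder names hbase and zookeeper simply follow the XML snippet above.

```shell
# make_hbase_dirs is a hypothetical helper. It creates the two data folders
# below the given base directory and prints the values for hbase-site.xml.
make_hbase_dirs() {
  base="$1"
  # file:// plus a relative path would not be a valid rootdir, so insist
  # on an absolute path.
  case "$base" in
    /*) ;;
    *) echo "please pass an absolute path" >&2; return 1 ;;
  esac
  mkdir -p "$base/hbase" "$base/zookeeper" || return 1
  echo "hbase.rootdir=file://$base/hbase"
  echo "hbase.zookeeper.property.dataDir=$base/zookeeper"
}
```

Call it with the folder you picked for DIRECTORY and paste the two printed values into the matching value elements.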

Now start HBase by running the following in HBASE_HOME:
[code language=”bash”]
$ ./bin/start-hbase.sh
[/code]

Now you can check the logs, which HBase writes to HBASE_HOME/logs by default.

Now open the HBase shell to test your installation:
[code language=”bash”]
$ ./bin/hbase shell
[/code]

You should be able to create a table:
[code language=”bash”]
> create 'test', 'ab'
[/code]

Expected output:
[code language=”bash”]
0 row(s) in 1.2200 seconds
[/code]

With the scan command you can list the entire content of the created table:
[code language=”bash”]
> scan 'test'
[/code]

If there are no errors, your HBase should be set up correctly.

Setting up Nutch to work with HBase and ElasticSearch 1.x

Go to your NUTCH_HOME and edit conf/nutch-site.xml.
Enable HBase as the backend database and set the agent name of your crawler:

[code language=”xml”]
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>

<property>
<name>http.robots.agents</name>
<value>My Private Spider Bot</value>
</property>
[/code]

Now set the dependency versions in NUTCH_HOME/ivy/ivy.xml. First make sure the gora-hbase dependency is not commented out:

[code language=”xml”]
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
[/code]
To make sure that the correct version of ElasticSearch is used you also need to change the default version to the one you want to use:
[code language=”xml”]
<dependency org="org.elasticsearch" name="elasticsearch" rev="1.1.1" conf="*->default"/>
[/code]

Now you need to edit one line of Java source code in:
NUTCH_HOME/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
The ElasticSearch API has changed since the version Nutch uses by default, so the call item.failed() has to be replaced with item.isFailed():
[code language=”java”]
if (item.isFailed()) {…}
[/code]
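
If you prefer not to open an editor, the same change can be sketched with sed. The function name patch_elastic_writer is made up for this example, and it assumes the call appears literally as item.failed() in the file. A backup copy with the suffix .bak is kept next to the original.

```shell
# patch_elastic_writer is a hypothetical helper. It assumes the old call is
# written exactly as "item.failed()" in the given file.
patch_elastic_writer() {
  file="$1"
  # Keep a backup so the change is easy to undo.
  cp "$file" "$file.bak" || return 1
  sed 's/item\.failed()/item.isFailed()/g' "$file.bak" > "$file"
}
```

You would call it with the path to ElasticWriter.java given above and then check the result with diff against the .bak file.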

Now edit NUTCH_HOME/conf/gora.properties and enable HBase as the default datastore:
[code language=”text”]
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
[/code]

Compile Nutch

Just go to your NUTCH_HOME directory and run:
[code language=”bash”]
$ ant runtime
[/code]

Once the build has succeeded you can start working.

Make sure HBase is running!

Now you can start crawling a website

Create a folder called e.g. ‘urls’ in NUTCH_HOME/runtime.
Create a file called seed.txt inside it and add all the URLs that you want to crawl, one per line.
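
Creating the seed file can be sketched as follows. The function name write_seeds is made up for this example, and any URLs you pass in are of course your own choice.

```shell
# write_seeds is a hypothetical helper. First argument: the seed folder.
# Remaining arguments: the URLs, written to seed.txt one per line.
write_seeds() {
  if [ "$#" -lt 2 ]; then
    echo "usage: write_seeds <dir> <url>..." >&2
    return 1
  fi
  dir="$1"
  shift
  mkdir -p "$dir" || return 1
  # printf repeats the format for every remaining argument.
  printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For example, write_seeds urls http://example.com/ creates urls/seed.txt with that single placeholder URL in it.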

For standalone mode (without Hadoop) go to NUTCH_HOME/runtime/local.

There you need to execute a pipeline of commands, all starting with bin/nutch:
http://wiki.apache.org/nutch/CommandLineOptions

[code language=”bash”]
1 $ bin/nutch inject <seed-url-dir>
2 $ bin/nutch generate -topN <n>
3 $ bin/nutch fetch -all
4 $ bin/nutch parse -all
5 $ bin/nutch updatedb
6 $ bin/nutch elasticindex <clustername> -all
[/code]

To check whether everything worked you can look at HBase (via the HBase shell):
[code language=”bash”]
> scan 'webpage'
[/code]
Then repeat steps 2-5 as often as you want and finally write everything to the index (step 6).
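
Repeating steps 2-5 by hand gets tedious, so here is a minimal sketch of a loop around them. The function name crawl_rounds and the NUTCH variable are made up for this example. It uses exactly the four commands from the pipeline above and assumes that bin/nutch exits with a non-zero status when a step fails.

```shell
# crawl_rounds is a hypothetical helper that repeats steps 2-5.
# Arguments: number of rounds, value for -topN.
# NUTCH may point to the nutch script; the default bin/nutch assumes you
# are in NUTCH_HOME/runtime/local.
crawl_rounds() {
  rounds="$1"
  topn="$2"
  i=1
  while [ "$i" -le "$rounds" ]; do
    echo "round $i of $rounds"
    # Stop at the first step that reports a failure.
    "${NUTCH:-bin/nutch}" generate -topN "$topn" || return 1
    "${NUTCH:-bin/nutch}" fetch -all || return 1
    "${NUTCH:-bin/nutch}" parse -all || return 1
    "${NUTCH:-bin/nutch}" updatedb || return 1
    i=$((i + 1))
  done
}
```

Run the inject step (1) once before calling, for instance, crawl_rounds 3 50, and run the elasticindex step (6) yourself afterwards.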

To check whether something has been written to the ElasticSearch index just execute:
[code language=”bash”]
$ curl -XGET 'http://localhost:9200/index/_search?q=*&pretty=true'
[/code]

There you should see the crawled and downloaded documents with the raw text and all the metadata in JSON format.

By default Nutch exports everything from the HBase table ‘webpage’ to an ElasticSearch index called ‘index’, where all ‘documents’ get the type ‘doc’.
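
Since the index and type names are known, the search can also be restricted to the type. This is only a sketch: the function name es_search_url is made up for this example and it does nothing more than build the URL. The defaults are the host from the curl call above and the names ‘index’ and ‘doc’.

```shell
# es_search_url is a hypothetical helper that only builds a search URL.
# Arguments (all optional): host:port, index name, type name.
es_search_url() {
  host="${1:-localhost:9200}"
  index="${2:-index}"
  doctype="${3:-doc}"
  echo "http://$host/$index/$doctype/_search?q=*&pretty=true"
}
```

With ElasticSearch running you can then call curl -XGET "$(es_search_url)" to list only the documents of type ‘doc’.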

Useful Links:

http://www.sigpwned.com/content/nutch-2-and-elasticsearch
http://etechnologytips.com/create-web-crawler-data-miner/
http://wiki.apache.org/nutch/CommandLineOptions
http://de.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine

Saskia Vola

Textmining, NLP and Elasticsearch consulting