Nutch 2.2 with ElasticSearch 1.x and HBase

This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04.

Prerequisites

Make sure the Java 7 JDK is installed:

[code language=”bash”]
$ sudo apt-get install openjdk-7-jdk
[/code]

Then set JAVA_HOME by adding the following line at the bottom of ~/.bashrc:
[code language=”bash”]
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
[/code]
(The JDK directory may differ on your system; on 64-bit Ubuntu it is typically /usr/lib/jvm/java-7-openjdk-amd64.)

Now either reconnect your terminal session or run:
[code language=”bash”]
$ source ~/.bashrc
[/code]
This loads the changes from that file into the current session.
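
As a quick sanity check after reloading the shell, a small helper along these lines can confirm that JAVA_HOME points at a usable JDK. This is only a sketch: the function name check_java_home is hypothetical, and it merely tests that the directory contains an executable bin/java.

```shell
# Hypothetical helper: verify that a JAVA_HOME candidate contains bin/java.
# Usage: check_java_home [directory]   (defaults to $JAVA_HOME)
check_java_home() {
  java_home="${1:-${JAVA_HOME:-}}"
  if [ -z "$java_home" ]; then
    echo "JAVA_HOME is not set"
    return 1
  fi
  if [ ! -x "$java_home/bin/java" ]; then
    echo "No executable java found in $java_home/bin"
    return 1
  fi
  echo "JAVA_HOME ok: $java_home"
}
```

Calling check_java_home without arguments checks the value currently exported in your shell.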

Download Nutch 2.2.x

Download the latest 2.2.x release (this guide uses 2.2.1) from:
https://nutch.apache.org/downloads.html

Unpack it and follow the steps described in the tutorial:
http://wiki.apache.org/nutch/Nutch2Tutorial

Download HBase

This setup is proven to work with HBase 0.90.4. That version is quite old (2011); you might try newer versions, but Nutch doesn’t support them. Hopefully there will be an upgrade soon.

http://archive.apache.org/dist/hbase/hbase-0.90.4/

Download ElasticSearch

Download and unpack ElasticSearch 1.x from:

http://www.elasticsearch.org/overview/elkdownloads/

To run ElasticSearch with the default configuration just go to ES_HOME and type:
[code language=”bash”]
$ bin/elasticsearch
[/code]

Install HBase

Install HBase according to:
http://hbase.apache.org/book/quickstart.html

If you’re running on Ubuntu, you need to change the file /etc/hosts.
Due to a problem in old versions of HBase with loopback IP addresses, localhost must resolve to 127.0.0.1.
On Ubuntu, some entries use 127.0.1.1 instead; change all of them to 127.0.0.1.
Apparently this is fixed in newer versions of HBase, but you cannot use them yet.
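
To see which entries would be affected before touching /etc/hosts, something like the following sketch can help. Both function names are hypothetical, and neither one modifies the file; they only print what they find.

```shell
# Hypothetical helpers for inspecting a hosts file; they never modify it.

# Print (with line numbers) every entry that uses 127.0.1.1.
find_loopback_aliases() {
  grep -n '^127\.0\.1\.1[[:space:]]' "${1:-/etc/hosts}"
}

# Print the file's content with 127.0.1.1 entries rewritten to 127.0.0.1.
preview_fixed_hosts() {
  sed 's/^127\.0\.1\.1\([[:space:]]\)/127.0.0.1\1/' "${1:-/etc/hosts}"
}
```

Run find_loopback_aliases first, compare with the output of preview_fixed_hosts, and then apply the change by editing /etc/hosts with root privileges.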

Now change the configuration in HBASE_HOME/conf/hbase-site.xml.
HBase and ZooKeeper need directories to store their data. The default location is under /tmp, which is typically cleared when the computer restarts.
So create two folders, one for HBase and one for ZooKeeper, where they can save their data.

[code language=”xml”]
<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
</property>
[/code]

Just replace DIRECTORY with a folder of your choice. Don’t forget the file:// prefix in front of your hbase.rootdir.
You need to specify a location on your local filesystem to run HBase in standalone mode (without HDFS).
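
The two folders can be created by hand, or with a small helper like the sketch below. The function name make_hbase_dirs is hypothetical, and it assumes you pass an absolute base path so that the printed hbase.rootdir value has the file:/// form.

```shell
# Hypothetical helper: create the HBase and ZooKeeper data folders under an
# absolute base directory and print the values to paste into hbase-site.xml.
make_hbase_dirs() {
  base="$1"
  mkdir -p "$base/hbase" "$base/zookeeper" || return 1
  echo "hbase.rootdir=file://$base/hbase"
  echo "hbase.zookeeper.property.dataDir=$base/zookeeper"
}
```

For example, make_hbase_dirs "$HOME/nutch-data" creates both folders in your home directory and prints the two values for the properties shown above.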

Now start HBase by running the following in HBASE_HOME:
[code language=”bash”]
$ ./bin/start-hbase.sh
[/code]

You can now check the logs at the location printed by the start script (HBASE_HOME/logs by default).

Next, open the HBase shell to test your installation:
[code language=”bash”]
$ ./bin/hbase shell
[/code]

Inside the shell, you should be able to create a table:
[code language=”bash”]
> create 'test', 'ab'
[/code]

Expected output:
[code language=”bash”]
0 row(s) in 1.2200 seconds
[/code]

The scan command lists the entire content of the created table:
[code language=”bash”]
> scan 'test'
[/code]

If there are no errors, your HBase should be set up correctly.
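
The same check can be scripted instead of typed interactively. The sketch below only prints a sequence of HBase shell commands; the function name, the default table name smoke_test and the column family cf are hypothetical, and it assumes that your HBase shell accepts commands piped in on standard input.

```shell
# Hypothetical helper: print HBase shell commands that create a throwaway
# table, write one cell, scan it, and remove the table again.
hbase_smoke_commands() {
  table="${1:-smoke_test}"
  family="${2:-cf}"
  cat <<EOF
create '${table}', '${family}'
put '${table}', 'row1', '${family}:a', 'value1'
scan '${table}'
disable '${table}'
drop '${table}'
exit
EOF
}
```

From HBASE_HOME you could then run hbase_smoke_commands | ./bin/hbase shell and look for errors in the output. A table has to be disabled before it can be dropped, which is why both commands appear.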

Setting up Nutch to work with HBase and ElasticSearch 1.x

Go to your NUTCH_HOME and edit conf/nutch-site.xml.
Enable HBase as the backend database and set the agent name that your crawler identifies itself with:

[code language=”xml”]
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>

<property>
<name>http.robots.agents</name>
<value>My Private Spider Bot</value>
</property>
[/code]

Now set the versions for the dependency manager in NUTCH_HOME/ivy/ivy.xml. First make sure the gora-hbase dependency is uncommented:

[code language=”xml”]
<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />
[/code]
To make sure that the correct version of ElasticSearch is used you also need to change the default version to the one you want to use:
[code language=”xml”]
<dependency org="org.elasticsearch" name="elasticsearch" rev="1.1.1" conf="*->default"/>
[/code]
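To double-check which versions ended up in ivy.xml, a text match like the following sketch can be used. The function name ivy_dependency_rev is hypothetical; because it is a plain text match, it also reports dependencies that are still inside an XML comment, so it does not replace a look at the file.

```shell
# Hypothetical helper: print the rev attribute of a named dependency in ivy.xml.
# Plain text match only: commented-out dependencies are reported as well.
ivy_dependency_rev() {
  name="$1"
  file="${2:-ivy/ivy.xml}"
  sed -n 's/.*name="'"$name"'"[[:space:]]*rev="\([^"]*\)".*/\1/p' "$file"
}
```

Run from NUTCH_HOME, ivy_dependency_rev elasticsearch should print the ElasticSearch version you configured, and ivy_dependency_rev gora-hbase the Gora HBase version.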

Now you need to edit one line of Java source code in:
NUTCH_HOME/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
The line that calls item.failed() needs to be changed, because the ElasticSearch API has changed since the version that Nutch uses by default. It should read:
[code language=”java”]
if (item.isFailed()) {…}
[/code]
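If you prefer not to edit the file by hand, the rename can be scripted. The following is a minimal sketch, assuming the call appears literally as item.failed() in the file; the function name patch_failed_call is hypothetical, and a backup copy is kept so the change can be reverted.

```shell
# Hypothetical helper: rewrite item.failed() to item.isFailed() in one file.
# A backup copy is kept next to the original as <file>.orig.
patch_failed_call() {
  file="$1"
  cp "$file" "$file.orig" || return 1
  sed 's/item\.failed()/item.isFailed()/g' "$file.orig" > "$file"
}
```

From NUTCH_HOME this would be called as patch_failed_call src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java. Compare the file with its .orig copy afterwards to confirm that only the intended line changed.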

Now edit NUTCH_HOME/conf/gora.properties and enable HBase as the default datastore:
[code language=”text”]
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
[/code]

Compile Nutch

Just go to your NUTCH_HOME directory and run:
[code language=”bash”]
$ ant runtime
[/code]

Once the build has finished successfully, you can start working.

Make sure HBase is running!

Now you can start crawling a website

Create a folder called e.g. ‘urls’ in NUTCH_HOME/runtime.
Create a file called seed.txt inside it and add all the URLs that you want to crawl, one per line.
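
The seed file can also be written from the command line. This is a small sketch with a hypothetical function name, write_seed_file; the URLs in the usage note are just examples.

```shell
# Hypothetical helper: write the given URLs, one per line, to <dir>/seed.txt.
# An existing seed.txt in that folder is overwritten.
write_seed_file() {
  dir="$1"
  shift
  mkdir -p "$dir" || return 1
  printf '%s\n' "$@" > "$dir/seed.txt"
}
```

For example, write_seed_file urls "http://nutch.apache.org/" "http://hbase.apache.org/" creates the folder if needed and writes two seed URLs.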

For standalone mode (not using Hadoop), go to NUTCH_HOME/runtime/local.

There you execute a pipeline of commands, all starting with bin/nutch:
http://wiki.apache.org/nutch/CommandLineOptions

[code language=”bash”]
1 $ bin/nutch inject <seed-url-dir>
2 $ bin/nutch generate -topN <n>
3 $ bin/nutch fetch -all
4 $ bin/nutch parse -all
5 $ bin/nutch updatedb
6 $ bin/nutch elasticindex <clustername> -all
[/code]

To check whether everything worked you can look at HBase (via the HBase shell):
[code language=”bash”]
> scan 'webpage'
[/code]
Then repeat steps 2-5 as often as you want and finally write everything to the index (step 6).
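
The six steps listed above can be wrapped in a loop so that steps 2-5 repeat a fixed number of times before indexing. This is only a sketch: the function name run_crawl, its argument order and the NUTCH override variable are hypothetical, and the commands are exactly the ones listed above.

```shell
# Hypothetical wrapper around the six steps listed above.
# NUTCH can be overridden, for example NUTCH=echo for a dry run.
run_crawl() {
  seed_dir="$1"      # folder containing seed.txt
  rounds="$2"        # how many times to repeat steps 2-5
  top_n="$3"         # value passed to generate -topN
  cluster="$4"       # ElasticSearch cluster name
  nutch="${NUTCH:-bin/nutch}"

  $nutch inject "$seed_dir" || return 1            # step 1
  i=1
  while [ "$i" -le "$rounds" ]; do
    $nutch generate -topN "$top_n" || return 1     # step 2
    $nutch fetch -all || return 1                  # step 3
    $nutch parse -all || return 1                  # step 4
    $nutch updatedb || return 1                    # step 5
    i=$((i + 1))
  done
  $nutch elasticindex "$cluster" -all              # step 6
}
```

Run from NUTCH_HOME/runtime/local, a call such as run_crawl ../urls 3 50 elasticsearch would do three crawl rounds and then index; the cluster name elasticsearch is the default of an unconfigured ElasticSearch node. Setting NUTCH=echo first prints the commands instead of running them.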

To check whether something has been written to the ElasticSearch index just execute:
[code language=”bash”]
$ curl -XGET 'http://localhost:9200/index/_search?q=*&pretty=true'
[/code]

There you should see the crawled and downloaded documents with the raw text and all the metadata in JSON format.

By default, Nutch writes everything from the HBase table ‘webpage’ to an index called ‘index’ and exports all documents to ElasticSearch with the type ‘doc’.
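
To see only how many documents arrived, the ElasticSearch _count endpoint can be queried instead of _search. The sketch below just builds the URL; the function name es_count_url is hypothetical, and its defaults assume a local node on port 9200 with the index and type names mentioned above.

```shell
# Hypothetical helper: build the _count URL for an index and type.
# Defaults: local node, index 'index', type 'doc'.
es_count_url() {
  host="${1:-localhost:9200}"
  index="${2:-index}"
  doc_type="${3:-doc}"
  printf 'http://%s/%s/%s/_count?pretty=true\n' "$host" "$index" "$doc_type"
}
```

With ElasticSearch running, curl -XGET "$(es_count_url)" should return a small JSON document containing a count field.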

Useful Links:

http://www.sigpwned.com/content/nutch-2-and-elasticsearch
http://etechnologytips.com/create-web-crawler-data-miner/
http://wiki.apache.org/nutch/CommandLineOptions
http://de.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine

9 Comments

  1. Sri Harsha

    Nice post.

    I am missing out the “elastic” package in the nutch src folder. Should I add it from any other third party website?

    Thank you

  4. Sam

    Thanks for the wonderful tutorial. I tried to set-up this in my local machine. All the steps are executed successfully. I am even able to see the data in ‘webpage’ table in HBase. However, I don’t see anything getting indexed to Elastic search engine. Can you please help me with finding the missing piece here. Thanks.

    Below are the command line logs for the last two steps.

    MACC1MNQNK5DTY3:local kalmesh$ bin/nutch index elasticsearch -all
    IndexingJob: starting
    Active IndexWriters :
    ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port (default 9300)
    elastic.index : elastic index command
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

    IndexingJob: done.
    MACC1MNQNK5DTY3:local kalmesh$
    MACC1MNQNK5DTY3:local kalmesh$ curl -X GET "http://localhost:9200/_search?q=*"
    {"took":1,"timed_out":false,"_shards":{"total":0,"successful":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
    MACC1MNQNK5DTY3:local kalmesh$

    • Saskia

      Hi Sam,

      how many documents do you have in HBase? If it’s fewer than 250, you won’t see them unless you change the setting elastic.max.bulk.docs in nutch-site.xml to 10 or something similar for testing:

      <property>
      <name>elastic.max.bulk.docs</name>
      <value>10</value>
      <description>The number of docs in the batch that will trigger a flush to elasticsearch.</description>
      </property>

      Kind regards,
      Saskia

  5. marc mceachern

    Hello Saska,

    This tutorial has been excellent, really clear. Finally got it working.

    I am using Fedora used the same versions of software you specified, downloaded from the various archives. Any deviation, I ran into issue. -Thanks!

  6. Raj

    Hi Saska,

    Very nice and clear article. I tried and it’s working fine but same setting not working with Nutch 2.3 REST API.
    I am facing an issue while generating segments
    POST job/create
    {
    "args": {
    "crawlId":"crawl04",
    "batchId":"1474149229176-4330"
    "curTime":1474149229176
    },
    "confId":"default",
    "crawlId":"crawl04",
    "type":"GENERATE"
    }
    java.lang.RuntimeException: job failed: name=[crawl04]generate: null, jobid=job_local1217831069_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    Using Nutch 2.3 , hbase-0.94.27 and ES 1.4.4

    Any idea?
