Nutch 2.2 with ElasticSearch 1.x and HBase

This document describes how to install and run Nutch 2.2.1 with HBase 0.90.4 and ElasticSearch 1.1.1 on Ubuntu 14.04.

Prerequisites

Make sure the Java SDK (JDK) 7 is installed:


$ sudo apt-get install openjdk-7-jdk

Then set JAVA_HOME by adding the following line at the bottom of ~/.bashrc:


export JAVA_HOME=/usr/lib/jvm/java-7-openjdk
(The JDK path might differ on your system; check the directories under /usr/lib/jvm.)

Now either reconnect with your terminal or type the following to load the changes in that file:


$ source ~/.bashrc
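
A quick way to confirm the setting took effect is to check that the directory actually contains a JDK. The following is only a sketch: check_java_home is a hypothetical helper name, not something that ships with Nutch or Ubuntu, and it assumes that a JDK is recognisable by an executable bin/javac.

```shell
# Sketch: verify that JAVA_HOME (or a given directory) looks like a JDK.
# check_java_home is a hypothetical helper, not part of any package.
check_java_home() {
    # $1: directory to check; defaults to the current JAVA_HOME
    jdk_dir="${1:-${JAVA_HOME:-}}"
    if [ -z "$jdk_dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    # A JRE-only install has no javac, so this also catches a wrong package.
    if [ ! -x "$jdk_dir/bin/javac" ]; then
        echo "No javac found under $jdk_dir; is this a JDK?" >&2
        return 1
    fi
    echo "JAVA_HOME looks fine: $jdk_dir"
}
```

Calling check_java_home without arguments in a fresh terminal checks the value exported in ~/.bashrc.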

Download Nutch 2.2.x

Download the latest release or 2.2.1 from:
https://nutch.apache.org/downloads.html

Unpack it and follow the steps described in the tutorial:
http://wiki.apache.org/nutch/Nutch2Tutorial

Download HBase

This setup is proven to work with HBase 0.90.4. That version is quite old (2011), so you might be tempted to try a newer one, but Nutch 2.2.1 doesn’t support newer versions. Hopefully there will be an upgrade soon.

http://archive.apache.org/dist/hbase/hbase-0.90.4/

Download ElasticSearch

Download and unpack ElasticSearch 1.x from:

http://www.elasticsearch.org/overview/elkdownloads/

To run ElasticSearch with the default configuration just go to ES_HOME and type:


$ bin/elasticsearch

Install HBase

Install HBase according to:
http://hbase.apache.org/book/quickstart.html

If you’re running on Ubuntu, you need to change the file /etc/hosts.
Due to an internal problem in old versions of HBase with loopback IP addresses, localhost has to be specified as 127.0.0.1.
On Ubuntu some entries use 127.0.1.1 instead; change all of them to 127.0.0.1.
Apparently this is fixed in newer versions of HBase, but you cannot use them yet.
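
To see which entries are affected before editing the file, something like the following can help. It is a sketch only: find_loopback_aliases is a hypothetical helper name, and it merely lists the lines; changing /etc/hosts itself still has to be done by hand with root rights.

```shell
# Sketch: list hosts-file entries that still use 127.0.1.1.
# find_loopback_aliases is a hypothetical helper name.
find_loopback_aliases() {
    # $1: hosts file to inspect; defaults to /etc/hosts
    # Prints matching lines with their line numbers; returns non-zero if none match.
    grep -n '^[[:space:]]*127\.0\.1\.1[[:space:]]' "${1:-/etc/hosts}"
}
```

If the function prints nothing, there is no 127.0.1.1 entry left to change.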

Now change the configuration in HBASE_HOME/conf/hbase-site.xml.
HBase and ZooKeeper need directories to save their data to. The default is under /tmp, which would be gone after restarting the computer.
So create two folders, one for HBase and one for ZooKeeper, where they can save their data.


<property>
<name>hbase.rootdir</name>
<value>file:///DIRECTORY/hbase</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/DIRECTORY/zookeeper</value>
</property>

Just replace DIRECTORY with a folder of your choice. Don’t forget file:// in front of your hbase.rootdir.
You need to specify a location on your local filesystem to run HBase in standalone mode (without HDFS).
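
Creating the two folders and writing the file can be scripted. This is a sketch under assumptions: write_hbase_site is a hypothetical helper, the data directory must be given as an absolute path, and the target file is overwritten, so back up an existing hbase-site.xml first.

```shell
# Sketch: create the data folders and write a minimal standalone hbase-site.xml.
# write_hbase_site is a hypothetical helper; it OVERWRITES the target file.
write_hbase_site() {
    # $1: absolute data directory (DIRECTORY above)
    # $2: path of the hbase-site.xml to write
    mkdir -p "$1/hbase" "$1/zookeeper" || return 1
    cat > "$2" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$1/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$1/zookeeper</value>
  </property>
</configuration>
EOF
}
```

For example, write_hbase_site /home/me/hbase-data "$HBASE_HOME/conf/hbase-site.xml", where both the path and the HBASE_HOME variable are placeholders for your own setup.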

Now start HBase by running the following in HBASE_HOME:


$ ./bin/start-hbase.sh

You can now check the logs; by default they are written to HBASE_HOME/logs.

Now use the HBase shell to test your installation.


$ ./bin/hbase shell

You should be able to create a table (type this at the HBase shell prompt):


> create 'test', 'ab'

Expected output:


0 row(s) in 1.2200 seconds

With the scan command you can list all the content of the created table:


> scan 'test'

If there are no errors, your HBase should be set up correctly.

Setting up Nutch to work with HBase and ElasticSearch 1.x

Go to your NUTCH_HOME and edit conf/nutch-site.xml.
Enable HBase as the backend database and give your crawler a name:


<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>

<property>
<name>http.robots.agents</name>
<value>My Private Spider Bot</value>
</property>

Now set the versions in the dependency manager file NUTCH_HOME/ivy/ivy.xml:


<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

To make sure that the correct version of ElasticSearch is used, you also need to change the default version to the one you want to use:


<dependency org="org.elasticsearch" name="elasticsearch" rev="1.1.1" conf="*->default"/>

Now you need to edit one line of Java source code in:
NUTCH_HOME/src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java
The call item.failed() needs to be changed to item.isFailed(), because the ElasticSearch API has changed since the version that Nutch uses by default. Afterwards the line should read:


if (item.isFailed()) {...}
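
If you prefer not to open an editor, the same change can be made with sed. This is a sketch: patch_elastic_writer is a hypothetical helper name, and it assumes the call appears literally as item.failed() in the file, as described above. The original file is kept next to it with a .bak suffix.

```shell
# Sketch: rename the item.failed() call to item.isFailed() in place.
# patch_elastic_writer is a hypothetical helper name.
patch_elastic_writer() {
    # $1: path to ElasticWriter.java
    # -i.bak edits in place and keeps a backup copy as <file>.bak
    sed -i.bak 's/item\.failed()/item.isFailed()/g' "$1"
}
```

Run it from NUTCH_HOME as patch_elastic_writer src/java/org/apache/nutch/indexer/elastic/ElasticWriter.java and check the result before compiling.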

Now edit NUTCH_HOME/conf/gora.properties and enable HBase as the default datastore:


gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

Compile Nutch

Just go to your NUTCH_HOME directory and run:


$ ant runtime

When the build was successful you can start working.

Make sure HBase is running!

Now you can start crawling a website

Create a folder called e.g. ‘urls’ in NUTCH_HOME/runtime.
Create a file called seed.txt inside it and add, one per line, all the URLs that you want to crawl.
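
The seed list can also be created from the command line. A minimal sketch, assuming a hypothetical helper called create_seed_file; the example URLs in the usage note are placeholders, and an existing seed.txt is overwritten.

```shell
# Sketch: write one URL per line into <dir>/seed.txt.
# create_seed_file is a hypothetical helper; it overwrites an existing seed.txt.
create_seed_file() {
    # $1: directory for the seed list; remaining arguments: the URLs to crawl
    seed_dir="$1"
    shift
    if [ "$#" -eq 0 ]; then
        echo "no URLs given" >&2
        return 1
    fi
    mkdir -p "$seed_dir" || return 1
    printf '%s\n' "$@" > "$seed_dir/seed.txt"
}
```

For example: create_seed_file urls http://example.com/ http://example.org/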

For standalone mode (not using Hadoop), go to NUTCH_HOME/runtime/local.

There you need to execute a pipeline of commands, all starting with bin/nutch. The options are described at:
http://wiki.apache.org/nutch/CommandLineOptions


1 $ bin/nutch inject <seed-url-dir>
2 $ bin/nutch generate -topN <n>
3 $ bin/nutch fetch -all
4 $ bin/nutch parse -all
5 $ bin/nutch updatedb
6 $ bin/nutch elasticindex <clustername> -all

To check whether everything worked you can look at HBase (via the HBase shell):


> scan 'webpage'

Then repeat steps 2-5 as often as you want and finally write everything to the index (step 6).
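
The six steps, including the repetition of steps 2-5, can be wrapped in a small script. This is only a sketch: run_crawl and the NUTCH variable are hypothetical names, the commands are exactly the ones listed above, and the script stops at the first command that fails. It is meant to be run from NUTCH_HOME/runtime/local.

```shell
# Sketch: inject once, repeat generate/fetch/parse/updatedb, then index.
# NUTCH is the command to run; set NUTCH=echo for a dry run that only prints the steps.
NUTCH="${NUTCH:-bin/nutch}"

run_crawl() {
    # $1: seed directory, $2: number of rounds of steps 2-5,
    # $3: topN per round, $4: ElasticSearch cluster name
    $NUTCH inject "$1" || return 1
    round=1
    while [ "$round" -le "$2" ]; do
        $NUTCH generate -topN "$3" || return 1
        $NUTCH fetch -all || return 1
        $NUTCH parse -all || return 1
        $NUTCH updatedb || return 1
        round=$((round + 1))
    done
    $NUTCH elasticindex "$4" -all
}
```

For example, run_crawl ../urls 2 50 elasticsearch does two rounds with up to 50 URLs each. The seed path and the numbers are placeholders; elasticsearch is the cluster name of an unconfigured ElasticSearch installation.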

To check whether something has been written to the ElasticSearch index just execute:


$ curl -XGET 'http://localhost:9200/index/_search?q=*&pretty=true'

There you should see the crawled and downloaded documents with the raw text and all the metadata in JSON format.

By default, Nutch saves everything from the HBase table ‘webpage’ to an index called ‘index’ and exports all ‘documents’ to ElasticSearch with the type ‘doc’.

Useful Links:

http://www.sigpwned.com/content/nutch-2-and-elasticsearch
http://etechnologytips.com/create-web-crawler-data-miner/
http://wiki.apache.org/nutch/CommandLineOptions
http://de.slideshare.net/digitalpebble/j-nioche-lucenerevoeu2013
https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-16/nutch-search-engine

9 Comments on “Nutch 2.2 with ElasticSearch 1.x and HBase”

  1. Sri Harsha

    Nice post.

    I am missing the “elastic” package in the Nutch src folder. Should I add it from a third-party website?

    Thank you

  4. Sam

    Thanks for the wonderful tutorial. I tried to set this up on my local machine. All the steps executed successfully. I can even see the data in the ‘webpage’ table in HBase. However, I don’t see anything getting indexed in ElasticSearch. Can you please help me find the missing piece here? Thanks.

    Below are the command line logs for the last two steps.

    MACC1MNQNK5DTY3:local kalmesh$ bin/nutch index elasticsearch -all
    IndexingJob: starting
    Active IndexWriters :
    ElasticIndexWriter
    elastic.cluster : elastic prefix cluster
    elastic.host : hostname
    elastic.port : port (default 9300)
    elastic.index : elastic index command
    elastic.max.bulk.docs : elastic bulk index doc counts. (default 250)
    elastic.max.bulk.size : elastic bulk index length. (default 2500500 ~2.5MB)

    IndexingJob: done.
    MACC1MNQNK5DTY3:local kalmesh$
    MACC1MNQNK5DTY3:local kalmesh$ curl -X GET "http://localhost:9200/_search?q=*"
    {"took":1,"timed_out":false,"_shards":{"total":0,"successful":0,"failed":0},"hits":{"total":0,"max_score":0.0,"hits":[]}}
    MACC1MNQNK5DTY3:local kalmesh$

    1. Saskia

      Hi Sam,

      how many documents do you have in HBase? If it’s fewer than 250, you won’t see them unless you change the setting elastic.max.bulk.docs in nutch-site.xml to 10 or something similar for testing.

      <property>
      <name>elastic.max.bulk.docs</name>
      <value>10</value>
      <description>The number of docs in the batch that will trigger a flush to elasticsearch.</description>
      </property>

      Kind regards,
      Saskia

  5. marc mceachern

    Hello Saska,

    This tutorial has been excellent, really clear. Finally got it working.

    I am using Fedora and used the same versions of the software you specified, downloaded from the various archives. With any deviation I ran into issues. Thanks!

  6. Raj

    Hi Saska,

    Very nice and clear article. I tried it and it’s working fine, but the same settings are not working with the Nutch 2.3 REST API.
    I am facing an issue while generating segments:
    POST job/create
    {
    "args": {
    "crawlId":"crawl04",
    "batchId":"1474149229176-4330",
    "curTime":1474149229176
    },
    "confId":"default",
    "crawlId":"crawl04",
    "type":"GENERATE"
    }
    java.lang.RuntimeException: job failed: name=[crawl04]generate: null, jobid=job_local1217831069_0002
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:55)
    Using Nutch 2.3 , hbase-0.94.27 and ES 1.4.4

    Any idea?
