nutch subcommand reference

The nutch command sets up the Java class environment for nutch to run. Subcommand aliases provide a convenient way to run specific nutch classes.

Each nutch subcommand is listed below along with the class name it aliases. The documentation comes from the version 0.6 code comments, usage descriptions, and the tutorial.

There is a canned crawl command for intranets:
$ bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log

Details are found in the Nutch tutorial file. A summary of the normal workflow for an internet crawl follows:

  1. Bootstrap the nutch database:
    $ nutchpath=/path/to/nutch
    $ datapath=/path/to/data
    $ mkdir ${datapath}/db # for the nutch page and link database
    $ mkdir ${datapath}/segments # subdirectories hold pages fetched and indexed as a unit
    $ ${nutchpath}/bin/nutch admin ${datapath}/db -create
    $ ${nutchpath}/bin/nutch inject ${datapath}/db -urlfile starting_urls.txt
  2. Iteratively build a set of data segments:
    1. Generate a segment subdirectory and a fetchlist from the database:
      $ ${nutchpath}/bin/nutch generate ${datapath}/db ${datapath}/segments
    2. Get the segment subdirectory name that was just created:
      (N is the iteration number 1,2,3,...)
      $ sN=`ls -d ${datapath}/segments/2* | tail -1`
      $ echo $sN
    3. Fetch the content of the segment from its fetchlist:
      $ ${nutchpath}/bin/nutch fetch $sN
    4. Update the nutch database with the fetched results:
      $ ${nutchpath}/bin/nutch updatedb ${datapath}/db $sN
    5. Run several iterations of link analysis to prioritize popular pages:
      $ ${nutchpath}/bin/nutch analyze ${datapath}/db 2 # use 5 for initial analysis
  3. Index each of the segments that you fetched:
    $ ${nutchpath}/bin/nutch index $sN
  4. Delete duplicate pages:
    $ ${nutchpath}/bin/nutch dedup ${datapath}/segments dedup.tmp
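
Steps 2 to 4 above can be wrapped in a small script. The following is a minimal sketch, not part of the Nutch distribution: the function names and the NUTCH variable are hypothetical, and it assumes the database has already been bootstrapped as in step 1. By default NUTCH is set to an echo command, so the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. NUTCH is a hypothetical override; by default the script just
# prints the commands it would run instead of executing them.
nutchpath=${nutchpath:-/path/to/nutch}
datapath=${datapath:-/path/to/data}
NUTCH=${NUTCH:-"echo ${nutchpath}/bin/nutch"}

# Print the newest segment directory. Segment names begin with a timestamp,
# so the last name in sorted order is the most recent one.
latest_segment() {
    ls -d "${datapath}"/segments/2* 2>/dev/null | tail -1
}

# One pass of step 2: generate, fetch, updatedb, analyze.
crawl_round() {
    $NUTCH generate "${datapath}/db" "${datapath}/segments" || return 1
    sN=$(latest_segment)
    # In dry-run mode no segment exists, so use a visible placeholder.
    sN=${sN:-"${datapath}/segments/SEGMENT"}
    $NUTCH fetch "$sN" &&
        $NUTCH updatedb "${datapath}/db" "$sN" &&
        $NUTCH analyze "${datapath}/db" 2
}

# Steps 2 to 4: run the given number of rounds, then index and dedup.
crawl() {
    round=1
    while [ "$round" -le "$1" ]; do
        crawl_round || return 1
        round=$((round + 1))
    done
    for seg in "${datapath}"/segments/2*; do
        if [ -d "$seg" ]; then
            $NUTCH index "$seg" || return 1
        fi
    done
    $NUTCH dedup "${datapath}/segments" dedup.tmp
}
```

Calling `crawl 3` prints the commands for three rounds. To execute them for real, set NUTCH to the path of bin/nutch before loading the script. Because $NUTCH is expanded unquoted, paths containing spaces are not supported by this sketch.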

admin = net.nutch.tools.WebDBAdminTool

The WebDBAdminTool is for Nutch administrators who need special access to the webdb. It allows for finer editing of the stored values.

Usage: java net.nutch.tools.WebDBAdminTool (-local | -ndfs <namenode:port>) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]

analyze = net.nutch.tools.LinkAnalysisTool

LinkAnalysisTool performs link-analysis by using the DistributedAnalysisTool. This single-process all-in-one tool is a wrapper around the more complicated distributed one.

Usage: java net.nutch.tools.LinkAnalysisTool (-local | -ndfs <namenode:port>) <db_dir> <numIterations>

crawl = net.nutch.tools.CrawlTool

Perform complete crawling and indexing given a set of root urls.

Usage: CrawlTool (-local | -ndfs <nameserver:port>) <root_url_file> [-dir d] [-threads n] [-depth i] [-showThreadID]

datanode = 'net.nutch.ndfs.NDFS$DataNode'

The NDFS class holds the NDFS client and server.

DataNode controls just one critical table: block -> BLOCK_SIZE stream of bytes

This info is stored on disk (the NameNode is responsible for asking other machines to replicate the data). The DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.

Usage: NDFS$DataNode <dataDir> <localMachine> <namenode:port>

dedup = net.nutch.indexer.DeleteDuplicates

Deletes duplicate documents in a set of Lucene indexes. Duplicates have either the same contents (via MD5 hash) or the same URL.

Usage: DeleteDuplicates (-local | -ndfs <namenode:port>) [-workingdir <workingdir>] <segmentsDir>
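
The content-hash half of duplicate detection can be illustrated outside Nutch. The sketch below is an analogy only: DeleteDuplicates works on Lucene indexes, whereas this hypothetical find_duplicates function compares plain files, and it assumes GNU md5sum is available. It does not cover duplicates by URL.

```shell
# List files under a directory whose content duplicates an earlier file.
# Assumes GNU md5sum; the first file seen for each hash is treated as the
# original and is not printed.
find_duplicates() {
    find "$1" -type f | sort | while IFS= read -r f; do
        printf '%s  %s\n' "$(md5sum < "$f" | cut -d' ' -f1)" "$f"
    done | awk '{ h = $1; sub(/^[^ ]+  /, ""); if (seen[h]++) print }'
}
```

Each line of output names a file whose MD5 hash matches that of a file listed earlier in sorted order.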

fetch = net.nutch.fetcher.Fetcher

The fetcher. Most of the work is done by plugins.

Usage: Fetcher (-local | -ndfs <namenode:port>) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] <dir>

fetchlist = net.nutch.pagedb.FetchListEntry

Usage: FetchListEntry (-local | -ndfs <namenode:port>) [ -recno N | -dumpurls ] segmentDir

generate = net.nutch.tools.FetchListTool

This class takes an IWebDBReader, computes a relevant subset, and then emits the subset.

Usage: FetchListTool (-local | -ndfs <namenode:port>) <db> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]

index = net.nutch.indexer.IndexSegment

Creates an index for the output corresponding to a single fetcher run.

Usage: IndexSegment (-local | -ndfs <namenode:port>) <segment_directory> [-dir <workingdir>]

inject = net.nutch.db.WebDBInjector

This class takes a flat file of URLs and adds them as entries into a pagedb. Useful for bootstrapping the system.

Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]

merge = net.nutch.indexer.IndexMerger

IndexMerger merges the indexes of several segments into a single output index.

Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...

mergesegs = net.nutch.tools.SegmentMergeTool

This class cleans up accumulated segment data and merges it into a single segment (or, optionally, multiple segments) with no duplicates.

The only prerequisite is a set of already fetched segments; they do not have to contain parsed content, since only fetcher output is required. This tool does not use DeleteDuplicates, but creates its own "master" index of all pages in all segments. It then walks sequentially through this index and keeps only the most recent version of each page for every unique URL or hash value.
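
The "most recent version per unique URL" rule can be sketched with ordinary text tools. This is an illustration under stated assumptions, not SegmentMergeTool's actual code: the records are made-up tab-separated "url, timestamp" lines, the function name newest_per_url is hypothetical, and only the URL key is handled (the real tool also keys on the content hash).

```shell
# Keep only the newest record per URL from "url<TAB>timestamp" lines on stdin.
# Records are sorted by URL, then by timestamp in descending order, and awk
# prints the first (newest) record it sees for each URL.
newest_per_url() {
    sort -t "$(printf '\t')" -k1,1 -k2,2nr | awk -F '\t' '!seen[$1]++'
}
```

For example, feeding it two records for the same URL with timestamps 10 and 20 leaves only the record with timestamp 20.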

If some of the input segments are corrupted, this tool will attempt to repair them using the net.nutch.segment.SegmentReader.fixSegment(NutchFileSystem, File, boolean, boolean, boolean, boolean) method.

The output segment can optionally be split on the fly into several segments of fixed length.

The newly created segment(s) can then optionally be indexed, so that they can either be merged with more new segments or used for searching as they are.

Old segments may optionally be removed, because all needed data has already been copied to the new merged segment. NOTE: this tool will also remove all corrupted input segments, which are not usable anyway. However, this option may be dangerous if you inadvertently included non-segment directories as input...

You may want to run SegmentMergeTool with all options turned on instead of following the manual procedures, i.e. to merge segments into the output segment(s), index them, and then delete the original segment data.

Usage: SegmentMergeTool (-local | -nfs ...) (-dir <input_segments_dir> | seg1 seg2 ...) [-o <output_segments_dir>] [-max count] [-i] [-ds]
  -dir <input_segments_dir>
      path to directory containing input segments
  seg1 seg2 seg3
      individual paths to input segments
  -o <output_segments_dir>
      (optional) path to directory which will contain output segment(s).
      NOTE: If not present, the original segments path will be used.
  -max count
      (optional) output multiple segments, each with maximum 'count' entries
  -i
      (optional) index the output segment when finished merging
  -ds
      (optional) delete the original input segments when finished
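
Because -ds deletes its inputs, it may be worth checking the input directory before running with all options turned on. The following sketch is one possible guard, not part of Nutch. It assumes that a fetched segment contains a ./fetcher/ or ./fetcher_output/ subdirectory. The function names and the NUTCH variable are hypothetical, and without NUTCH the merge command is printed rather than executed.

```shell
# Print entries of a segments directory that do not look like fetched segments.
# Assumption: a fetched segment contains ./fetcher/ or ./fetcher_output/.
suspicious_entries() {
    for entry in "$1"/*; do
        [ -e "$entry" ] || continue
        if [ ! -d "$entry/fetcher" ] && [ ! -d "$entry/fetcher_output" ]; then
            printf '%s\n' "$entry"
        fi
    done
}

# Merge, index, and delete originals only when nothing suspicious is present.
safe_merge() {
    bad=$(suspicious_entries "$1")
    if [ -n "$bad" ]; then
        printf 'refusing to run with -ds; check:\n%s\n' "$bad" >&2
        return 1
    fi
    ${NUTCH:-echo bin/nutch} mergesegs -dir "$1" -i -ds
}
```

Running `safe_merge segments` either reports the entries that need a closer look or proceeds with the merge.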

namenode = 'net.nutch.ndfs.NDFS$NameNode'

The NDFS class holds the NDFS client and server.

DataNode controls just one critical table: block -> BLOCK_SIZE stream of bytes

This info is stored on disk (the NameNode is responsible for asking other machines to replicate the data). The DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.

Usage: NDFS$NameNode <port> <namespace_dir>

ndfs = net.nutch.fs.TestClient

This class provides some NDFS administrative access.

Usage: java NDFSClient (-local | -ndfs <namenode:port>) [-ls <path>] [-du <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>] [-put <localsrc> <dst>] [-copyFromLocal <localsrc> <dst>] [-moveFromLocal <localsrc> <dst>] [-get <src> <localdst>] [-copyToLocal <src> <localdst>] [-moveToLocal <src> <localdst>]

parse = net.nutch.tools.ParseSegment

Parse contents in one segment.

It assumes that the given segment contains ./fetcher_output/, which is typically generated by a non-parsing fetcher run (i.e., the fetcher is started with option -noParsing).

Contents in one segment are parsed and saved in these steps:

  1. ./fetcher_output/ and ./content/ are looped together (possibly by multiple ParserThreads), and content is parsed for each entry. The entry number and resultant ParserOutput are saved in ./parser.unsorted.
  2. ./parser.unsorted is sorted by entry number, and the result is saved as ./parser.sorted.
  3. ./parser.sorted and ./fetcher_output/ are looped together. At each entry, ParserOutput is split into ParseData and ParseText, which are saved in ./parse_data/ and ./parse_text/ respectively. FetcherOutput is also updated with the parsing status and saved in ./fetcher/.

In the end, ./fetcher/ should be identical to the one produced by a fetcher run WITHOUT option -noParsing.

By default, the intermediate files ./parser.unsorted and ./parser.sorted are removed at the end, unless option -noClean is used. However, ./fetcher_output/ is kept intact.

Check Fetcher.java and FetcherOutput.java for further discussion.

Usage: ParseSegment (-local | -ndfs <namenode:port>) [-threads n] [-showThreadID] [-dryRun] [-logLevel level] [-noClean] dir
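
The directory layout described above suggests a way to find segments that still need parsing. The sketch below is an inference from that description rather than a documented interface: it treats a segment as unparsed when ./fetcher_output/ exists and ./fetcher/ does not. The function names and the NUTCH variable are hypothetical, and without NUTCH the parse commands are printed instead of executed.

```shell
# True when a segment holds raw fetcher output that has not been parsed yet.
# Assumption: parsing creates ./fetcher/ and leaves ./fetcher_output/ intact.
needs_parsing() {
    [ -d "$1/fetcher_output" ] && [ ! -d "$1/fetcher" ]
}

# Run the parse subcommand on every given segment that is still unparsed.
parse_pending() {
    for seg in "$@"; do
        if needs_parsing "$seg"; then
            ${NUTCH:-echo bin/nutch} parse "$seg"
        fi
    done
}
```

A typical call would pass all segment directories, for example `parse_pending segments/2*`.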

prune = net.nutch.tools.PruneIndexTool

This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more queries from a list of Lucene queries read from a file (defined in the standard config file, or explicitly overridden from the command line). Segments should already be indexed; segments that are missing indexes will be skipped.

NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of the available Lucene document fields is required. This can be obtained by reading the sources of the index-basic and index-more plugins, or by using tools like Luke. During query parsing a WhitespaceAnalyzer is used; this choice was made to minimize the side effects of the Analyzer on the final set of query terms. You can use the net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.

If an additional level of control is required, an instance of PruneChecker can be provided to check each document before it is deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided: PrintFieldsChecker prints the values of selected index fields, and StoreUrlsChecker stores the URLs of deleted documents to a file. Either of them can be activated by providing the respective command-line option.

The typical command-line usage is as follows:

  PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title
      This command will just print out the fields of matching documents.

  PruneIndexTool index_dir -queries queries.txt
      This command will actually remove all matching entries, according to the queries read from the queries.txt file.
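
The two commands above combine naturally into a preview-then-delete routine. The following is a minimal sketch using the prune alias. The function name and the NUTCH and CONFIRM variables are hypothetical, and without NUTCH the commands are printed instead of executed. The queries file is assumed to hold Lucene queries such as title:casino (a made-up example).

```shell
# Preview a prune run, and only delete when CONFIRM=yes is set.
# NUTCH and CONFIRM are hypothetical variables used by this sketch.
prune_with_preview() {
    index_dir=$1
    queries=$2
    # Dry run: print the url and title fields of matching documents.
    ${NUTCH:-echo bin/nutch} prune "$index_dir" -dryrun \
        -queries "$queries" -showfields url,title || return 1
    if [ "${CONFIRM:-no}" = yes ]; then
        # Real run: remove every document matching the queries.
        ${NUTCH:-echo bin/nutch} prune "$index_dir" -queries "$queries"
    fi
}
```

Running the preview first and inspecting its output before setting CONFIRM=yes reduces the risk of removing documents with an overly broad query.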

NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own net.nutch.net.URLFilter, or PruneDBTool (under construction...).

NOTE 3: This tool uses a low-level Lucene interface to collect all matching documents. For large indexes and broad queries this may result in high memory consumption. If you encounter OutOfMemory exceptions, try to narrow down your queries, or increase the heap size.

readdb = net.nutch.db.WebDBReader

The WebDBReader implements all the read-only parts of accessing our web database. All the writing ones can be found in WebDBWriter.

Usage: java net.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] | [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] | [-stats]

segread = net.nutch.segment.SegmentReader

This class holds together all data readers for an existing segment. Some convenience methods are also provided, to read from the segment and to reposition the current pointer.

Usage: SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or '-dir' option.
  -fix
      automatically fix corrupted segments
  -dump
      dump segment data in human-readable format
  -dumpsort
      dump segment data in human-readable format, sorted by URL
  -list
      print useful information about segments
  -nocontent
      ignore content data
  -noparsedata
      ignore parse_data data
  -noparsetext
      ignore parse_text data
  -dir segments
      directory containing multiple segments
  seg1 seg2 ...
      segment directories

segslice = net.nutch.segment.SegmentSlicer

This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.

Data is read sequentially from input segments, and appended to output segment until it reaches the target count of entries, at which point the next output segment is created, and so on.
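
The sequential fill behavior can be illustrated with plain text lines standing in for segment entries. This is a conceptual sketch only, with a hypothetical function name: it labels each input line with the number of the output slice it would land in, given a maximum count per slice.

```shell
# Assign input lines (stand-ins for segment entries) to numbered slices of at
# most $1 entries each; prints "slice_number<TAB>entry" for every input line.
slice_entries() {
    awk -v max="$1" '{ printf "%d\t%s\n", int((NR - 1) / max) + 1, $0 }'
}
```

With a maximum of 2, five entries end up in slices 1, 1, 2, 2, and 3, which mirrors how a new output segment is started each time the target count is reached.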

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.

NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.

Usage: SegmentSlicer (-local | -ndfs <namenode:port>) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or '-dir' option.
outputDir is always required.
  -o outputDir
      output directory for segments
  -max count
      (optional) output multiple segments, each with maximum 'count' entries
  -fix
      (optional) automatically fix corrupted segments
  -nocontent
      (optional) ignore content data
  -noparsedata
      (optional) ignore parse_data data
  -noparsetext
      (optional) ignore parse_text data
  -dir segments
      directory containing multiple segments
  seg1 seg2 ...
      segment directories

server = 'net.nutch.searcher.DistributedSearch$Server'

Implements the search API over IPC connections.

Usage: DistributedSearch$Server <port> <index dir>

updatedb = net.nutch.tools.UpdateDatabaseTool

This class takes the output of the fetcher and updates the page and link DBs accordingly. Eventually, as the database scales, this will be broken into several phases, each consuming and emitting batch files, but, for now, we're doing it all here.

Usage: UpdateDatabaseTool (-local | -ndfs <namenode:port>) [-max N] [-noAdditions] <db> <seg_dir> [ <seg_dir> ... ]

Version 0.3

Compiled by Rob Pettengill 05/07/07