The nutch command sets up the Java class environment for Nutch to run. Subcommand aliases provide a convenient way to run specific Nutch classes.
Each nutch subcommand is given below along with the class name it aliases. The documentation comes from the version 0.6 code comments, usage descriptions, and the tutorial.
There is a canned crawl command for intranets:
$ bin/nutch crawl urls -dir crawl.test -depth 3 >& crawl.log
Details are found in the Nutch tutorial file. A summary of the normal workflow for an internet crawl follows:
$ mkdir db # for the nutch page and link database
$ nutchpath=/path/to/nutch
$ datapath=/path/to/data
$ mkdir segments # subdirectories hold pages fetched and indexed as a unit
$ ${nutchpath}/bin/nutch admin ${datapath}/db -create
$ ${nutchpath}/bin/nutch inject ${datapath}/db -urlfile starting_urls.txt
$ ${nutchpath}/bin/nutch generate ${datapath}/db ${datapath}/segments
$ sN=`ls -d ${datapath}/segments/2* | tail -1`
$ echo $sN
$ ${nutchpath}/bin/nutch fetch $sN
$ ${nutchpath}/bin/nutch updatedb ${datapath}/db $sN
$ ${nutchpath}/bin/nutch analyze ${datapath}/db 2 #use 5 for initial analysis
$ ${nutchpath}/bin/nutch index $sN
$ ${nutchpath}/bin/nutch dedup ${datapath}/segments dedup.tmp
The WebDBAdminTool is for Nutch administrators who need special access to the webdb. It allows for finer editing of the stored values.
Usage: java net.nutch.tools.WebDBAdminTool (-local | -ndfs <namenode:port>) db [-create] [-textdump dumpPrefix] [-scoredump] [-top k]
LinkAnalysisTool performs link-analysis by using the DistributedAnalysisTool. This single-process all-in-one tool is a wrapper around the more complicated distributed one.
Usage: java net.nutch.tools.LinkAnalysisTool (-local | -ndfs <namenode:port>) <db_dir> <numIterations>
Perform complete crawling and indexing given a set of root urls.
Usage: CrawlTool (-local | -ndfs <nameserver:port>) <root_url_file> [-dir d] [-threads n] [-depth i] [-showThreadID]
The NDFS class holds the NDFS client and server.
DataNode controls just one critical table: block -> BLOCK_SIZE stream of bytes
This info is stored on disk (the NameNode is responsible for asking other machines to replicate the data). The DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.
Usage: NDFS$DataNode <dataDir> <localMachine> <namenode:port>
Deletes duplicate documents in a set of Lucene indexes. Duplicates have either the same contents (via MD5 hash) or the same URL.
Usage: DeleteDuplicates (-local | -ndfs <namenode:port>) [-workingdir <workingdir>] <segmentsDir>
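To make the content-hash criterion concrete, here is a minimal shell sketch that flags files whose contents share an MD5 digest. It only illustrates the idea of hashing contents to find duplicates. DeleteDuplicates itself works on Lucene indexes, and the `redundant_copies` function name and the use of GNU `md5sum` on plain files are assumptions made for this example.

```shell
# Illustration only: list files under directory $1 whose contents duplicate
# an earlier file (same MD5 digest). Assumes GNU md5sum and file names
# without whitespace. This is not how DeleteDuplicates is implemented.
redundant_copies() {
    md5sum "$1"/* |
        sort |                           # group identical digests together
        awk 'seen[$1]++ { print $2 }'    # skip the first file of each group
}
```

Calling `redundant_copies some_dir` prints one path per redundant copy, which leaves the first file of every group of identical contents unlisted.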
The fetcher. Most of the work is done by plugins.
Usage: Fetcher (-local | -ndfs <namenode:port>) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] <dir>
Usage: FetchListEntry (-local | -ndfs <namenode:port>) [ -recno N | -dumpurls ] segmentDir
This class takes an IWebDBReader, computes a relevant subset, and then emits the subset.
Usage: FetchListTool (-local | -ndfs <namenode:port>) <db> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
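After a fetchlist has been generated, the workflow near the top of this page picks the newest segment with `ls -d .../segments/2* | tail -1`. The sketch below wraps that step in a small helper. The `latest_segment` name is hypothetical, and it assumes, as that glob does, that segment directory names are timestamps starting with "2", so that lexical order matches creation order.

```shell
# Print the newest segment directory under $1, or nothing if there is none.
# Assumes timestamp-style directory names beginning with "2", as in the
# `segments/2*` glob used in the workflow summary.
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | sort | tail -1
}
```

It could be used as `sN=$(latest_segment ${datapath}/segments)` before the fetch step.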
Creates an index for the output corresponding to a single fetcher run.
Usage: IndexSegment (-local | -ndfs <namenode:port>) <segment_directory> [-dir <workingdir>]
This class takes a flat file of URLs and adds them as entries into a pagedb. Useful for bootstrapping the system.
Usage: WebDBInjector (-local | -ndfs <namenode:port>) <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> [...]]]
IndexMerger merges the indexes of one or more segments into a single output index.
Usage: IndexMerger (-local | -ndfs <nameserver:port>) [-workingdir <workingdir>] outputIndex segments...
This class cleans up accumulated segments data, and merges them into a single (or optionally multiple) segment(s), with no duplicates in it.
Its only prerequisite is a set of already fetched segments (they do not have to contain parsed content; only fetcher output is required). This tool does not use DeleteDuplicates, but creates its own "master" index of all pages in all segments. It then walks sequentially through this index and keeps only the most recent version of each page for every unique URL or hash value.
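The "most recent version per unique URL" selection can be pictured with a short shell sketch. This is an illustration under stated assumptions, not SegmentMergeTool's code: the input format (one `<url> <fetch_time>` pair per line) and the `latest_per_url` name are invented for the example, and it keys on the URL only, not on the content hash.

```shell
# Illustration only: keep the most recent record for each URL.
# Input on stdin: one "<url> <fetch_time>" pair per line (a made-up format,
# not Nutch's segment format); fetch_time is a number, larger means newer.
latest_per_url() {
    awk '
        !($1 in newest) || ($2 + 0) > newest[$1] {
            newest[$1] = $2 + 0   # remember the newest time seen for this URL
            record[$1] = $0       # and the full record that carried it
        }
        END { for (url in record) print record[url] }
    ' | sort
}
```

Each URL appears exactly once in the output, carrying the record with the largest fetch time.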
If some of the input segments are corrupted, this tool will attempt to repair them, using the net.nutch.segment.SegmentReader.fixSegment(NutchFileSystem, File, boolean, boolean, boolean, boolean) method.
The output segment can optionally be split on the fly into several segments of fixed length.
The newly created segment(s) can then optionally be indexed, so that they can either be merged with more new segments or used for searching as they are.
Old segments may optionally be removed, because all needed data has already been copied to the new merged segment. NOTE: this tool will also remove all corrupted input segments, which are not usable anyway. However, this option may be dangerous if you inadvertently included non-segment directories as input...
You may want to run SegmentMergeTool instead of following the manual procedures, with all options turned on, i.e. to merge segments into the output segment(s), index it, and then delete the original segments data.
Usage: SegmentMergeTool (-local | -nfs ...) (-dir <input_segments_dir> | seg1 seg2 ...) [-o <output_segments_dir>] [-max count] [-i] [-ds]
- -dir <input_segments_dir>: path to directory containing input segments
- seg1 seg2 seg3: individual paths to input segments
- -o <output_segments_dir>: (optional) path to directory which will contain output segment(s). NOTE: If not present, the original segments path will be used.
- -max count: (optional) output multiple segments, each with maximum 'count' entries
- -i: (optional) index the output segment when finished merging
- -ds: (optional) delete the original input segments when finished
The NDFS class holds the NDFS client and server.
DataNode controls just one critical table: block -> BLOCK_SIZE stream of bytes
This info is stored on disk (the NameNode is responsible for asking other machines to replicate the data). The DataNode reports the table's contents to the NameNode upon startup and every so often afterwards.
Usage: NDFS$NameNode <port> <namespace_dir>
This class provides some NDFS administrative access.
Usage: java NDFSClient (-local | -ndfs <namenode:port>) [-ls <path>] [-du <path>] [-mv <src> <dst>] [-cp <src> <dst>] [-rm <src>] [-put <localsrc> <dst>] [-copyFromLocal <localsrc> <dst>] [-moveFromLocal <localsrc> <dst>] [-get <src> <localdst>] [-copyToLocal <src> <localdst>] [-moveToLocal <src> <localdst>]
Parse contents in one segment.
It assumes that ./fetcher_output/ exists under the given segment; this directory is typically generated by a non-parsing fetcher run (i.e., the fetcher is started with the -noParsing option).
Contents in one segment are parsed and saved in these steps:
In the end, ./fetcher/ should be identical to the one produced by a fetcher run WITHOUT the -noParsing option.
By default, the intermediates ./parser.unsorted and ./parser.sorted are removed at the end, unless the -noClean option is used. However, ./fetcher_output/ is kept intact.
Check Fetcher.java and FetcherOutput.java for further discussion.
Usage: ParseSegment (-local | -ndfs <namenode:port>) [-threads n] [-showThreadID] [-dryRun] [-logLevel level] [-noClean] dir
This tool prunes existing Nutch indexes of unwanted content. The main method accepts a list of segment directories (containing indexes). These indexes will be pruned of any content that matches one or more queries from a list of Lucene queries read from a file (defined in the standard config file, or explicitly overridden on the command line). Segments should already be indexed; segments that are missing indexes will be skipped.
NOTE 1: Queries are expressed in Lucene's QueryParser syntax, so knowledge of the available Lucene document fields is required. This can be obtained by reading the sources of the index-basic and index-more plugins, or by using tools like Luke. During query parsing a WhitespaceAnalyzer is used; this choice was made to minimize side effects of the Analyzer on the final set of query terms. You can use the net.nutch.searcher.Query.main(String[]) method to translate queries in Nutch syntax to queries in Lucene syntax.
If an additional level of control is required, an instance of PruneChecker can be provided to check each document before it is deleted. The results of all checkers are logically AND-ed, which means that any checker in the chain can veto the deletion of the current document. Two example checker implementations are provided: PrintFieldsChecker prints the values of selected index fields, and StoreUrlsChecker stores the URLs of deleted documents to a file. Either can be activated by providing the respective command-line option.
The typical command-line usage is as follows:
PruneIndexTool index_dir -dryrun -queries queries.txt -showfields url,title
This command will just print out fields of matching documents.
PruneIndexTool index_dir -queries queries.txt
This command will actually remove all matching entries, according to the queries read from the queries.txt file.
NOTE 2: This tool removes matching documents ONLY from segment indexes (or from a merged index). In particular it does NOT remove the pages and links from WebDB. This means that unwanted URLs may pop up again when new segments are created. To prevent this, use your own net.nutch.net.URLFilter, or PruneDBTool (under construction...).
NOTE 3: This tool uses a low-level Lucene interface to collect all matching documents. For large indexes and broad queries this may result in high memory consumption. If you encounter OutOfMemory exceptions, try to narrow down your queries, or increase the heap size.
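Because pruning is destructive, it can help to always run the dry run first and delete only on explicit confirmation. The sketch below chains the two command lines shown above. The `prune_index` function and the `PRUNE` variable are hypothetical: `PRUNE` must be set to whatever command starts PruneIndexTool in your installation, which this page does not specify.

```shell
# PRUNE is the command that launches PruneIndexTool (an assumption; set it
# to match your installation).
PRUNE="${PRUNE:-PruneIndexTool}"

# Preview matching documents for index $1 using queries file $2.
# Entries are actually removed only when the third argument is "yes".
prune_index() {
    index_dir=$1
    queries=$2
    confirm=${3:-no}
    $PRUNE "$index_dir" -dryrun -queries "$queries" -showfields url,title || return 1
    if [ "$confirm" = "yes" ]; then
        $PRUNE "$index_dir" -queries "$queries"
    fi
}
```

Running `prune_index index_dir queries.txt` only prints the fields of matching documents; adding `yes` as a third argument performs the deletion after the preview.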
The WebDBReader implements all the read-only parts of accessing our web database. All the writing ones can be found in WebDBWriter.
Usage: java net.nutch.db.WebDBReader (-local | -ndfs <namenode:port>) <db> [-pageurl url] | [-pagemd5 md5] | [-dumppageurl] | [-dumppagemd5] | [-toppages <k>] | [-linkurl url] | [-linkmd5 md5] | [-dumplinks] | [-stats]
This class holds together all data readers for an existing segment. Some convenience methods are also provided, to read from the segment and to reposition the current pointer.
Usage: SegmentReader [-fix] [-dump] [-dumpsort] [-list] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or '-dir' option.
- -fix: automatically fix corrupted segments
- -dump: dump segment data in human-readable format
- -dumpsort: dump segment data in human-readable format, sorted by URL
- -list: print useful information about segments
- -nocontent: ignore content data
- -noparsedata: ignore parse_data data
- -noparsetext: ignore parse_text data
- -dir segments: directory containing multiple segments
- seg1 seg2 ...: segment directories
This class reads data from one or more input segments, and outputs it to one or more output segments, optionally deleting the input segments when it's finished.
Data is read sequentially from input segments, and appended to output segment until it reaches the target count of entries, at which point the next output segment is created, and so on.
NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.
NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice Lucene indexes. The proper procedure is first to create slices, and then to index them.
NOTE 3: if one or more input segments are in non-parsed format, the output segments will also use non-parsed format. This means that any parseData and parseText data from input segments will NOT be copied to the output segments.
Usage: SegmentSlicer (-local | -ndfs <namenode:port>) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] [-noparsetext] (-dir segments | seg1 seg2 ...)
NOTE: at least one segment dir name is required, or '-dir' option.
outputDir is always required.
- -o outputDir: output directory for segments
- -max count: (optional) output multiple segments, each with maximum 'count' entries
- -fix: (optional) automatically fix corrupted segments
- -nocontent: (optional) ignore content data
- -noparsedata: (optional) ignore parse_data data
- -noparsetext: (optional) ignore parse_text data
- -dir segments: directory containing multiple segments
- seg1 seg2 ...: segment directories
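The slicing behaviour described for SegmentSlicer (append entries to an output until it holds the target count, then start the next output) is the same idea as the standard `split` utility applied to lines. The sketch below is only an analogy on plain text: the `slice_entries` name and the one-entry-per-line input are assumptions, and real segments are not line-oriented files.

```shell
# Illustration only: cut a stream of entries (one per line on stdin) into
# files of at most $1 lines each, written to directory $2 as slice.aa,
# slice.ab, ... This stands in for SegmentSlicer's "-max count" behaviour.
slice_entries() {
    mkdir -p "$2" && split -l "$1" - "$2/slice."
}
```

Five entries sliced with a maximum of two per output yield three files, the last of which holds the single remaining entry.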
Implements the search API over IPC connections.
Usage: DistributedSearch$Server <port> <index dir>
This class takes the output of the fetcher and updates the page and link DBs accordingly. Eventually, as the database scales, this will be broken into several phases, each consuming and emitting batch files, but, for now, we're doing it all here.
Usage: UpdateDatabaseTool (-local | -ndfs <namenode:port>) [-max N] [-noAdditions] <db> <seg_dir> [ <seg_dir> ... ]
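The updatedb step closes the generate / fetch / updatedb cycle from the workflow summary, and that cycle can be repeated to fetch newly discovered pages. The following is a minimal sketch of such a loop, using only the subcommands shown in that summary. The `crawl_rounds` function and the `NUTCH` variable are hypothetical helpers, and the analyze, index, and dedup steps are deliberately left out.

```shell
# Path to the nutch launcher; override NUTCH to match your installation.
NUTCH="${NUTCH:-bin/nutch}"

# Run $3 rounds of generate / fetch / updatedb against db $1 and
# segments directory $2. Stops at the first failing step.
crawl_rounds() {
    db=$1
    segments=$2
    rounds=$3
    round=0
    while [ "$round" -lt "$rounds" ]; do
        "$NUTCH" generate "$db" "$segments" || return 1
        # Newest segment, assuming timestamp names that start with "2".
        seg=$(ls -d "$segments"/2* 2>/dev/null | tail -1)
        [ -n "$seg" ] || return 1
        "$NUTCH" fetch "$seg" || return 1
        "$NUTCH" updatedb "$db" "$seg" || return 1
        round=$((round + 1))
    done
}
```

For example, `crawl_rounds ${datapath}/db ${datapath}/segments 3` would run three rounds. Setting `NUTCH=echo` first prints the commands instead of running them, which is a cheap way to check the sequence.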
Version 0.3
Compiled by Rob Pettengill 05/07/07