How to make Nutch run in Eclipse?


[b][color=green][size=large]Nutch is an excellent open-source crawling framework: a little configuration is enough to start collecting data, and its flexible plugin mechanism makes it easy to extend whenever our needs grow. This article shows how to debug Nutch in local mode inside Eclipse. Getting it working in Eclipse first makes everything else much easier to learn: most people drive Nutch from the command line, which is simple and convenient but makes deep customization hard. So this article walks through basic debugging and building of Nutch.
[/size][/color][/b]

[b][color=olive][size=large]Let's get started with the basic steps.
[table]
|No.|Step|Notes
|1|Install Ant|used to compile the Nutch sources
|2|Download the Nutch source code|required
|3|Run ant in the Nutch source root and wait for the build to finish|builds Nutch
|4|Configure nutch-site.xml|required
|5|Run ant eclipse to generate an Eclipse project|import into Eclipse for debugging
|6|Move the conf directory to the top of the Eclipse build path|Nutch reads its configuration files from there at startup
|7|Run org.apache.nutch.crawl.Injector to inject the seed URLs|local debugging
|8|Run org.apache.nutch.crawl.Generator to generate a fetch list|local debugging
|9|Run org.apache.nutch.fetcher.Fetcher to fetch the pages in the generated segment|local debugging
|10|Run org.apache.nutch.parse.ParseSegment to parse the fetched content in the segment|local debugging
|11|Set up a Solr server|serves search queries
|12|Run org.apache.nutch.indexer.IndexingJob to index into Solr|local debugging
|13|Once indexing finishes, query Solr|verify the results
[/table]
[/size][/color][/b]
[b][color=olive][size=large]After the build, the project imported into Eclipse looks like the screenshot below; note that the conf folder is at the top of the build path:[/size][/color][/b]

[img]http://dl2.iteye.com/upload/attachment/0097/3205/931f90d0-b161-321a-a7b9-199d22819ad7.jpg[/img]
[color=green][size=large]The configuration in nutch-site.xml is as follows:[/size][/color]

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>mynutch</value>
      </property>
      <property>
        <name>http.robots.agents</name>
        <value>*</value>
        <description>The agent strings we'll look for in robots.txt files,
        comma-separated, in decreasing order of precedence. You should
        put the value of http.agent.name as the first agent name, and keep the
        default * at the end of the list. E.g.: BlurflDev,Blurfl,*
        </description>
      </property>
      <property>
        <name>plugin.folders</name>
        <value>./src/plugin</value>
        <description>Directories where nutch plugins are located. Each
        element may be a relative or absolute path. If absolute, it is used
        as is. If relative, it is searched for on the classpath.</description>
      </property>
    </configuration>
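
[color=green][size=large]These entries are ordinary Hadoop Configuration properties. As a quick sanity check (a minimal sketch; the class name is my own), you can read them back like this, which also shows why step 6 above matters:[/size][/color]

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    public class ShowConfig {
      public static void main(String[] args) {
        // NutchConfiguration.create() loads nutch-default.xml and then
        // nutch-site.xml from the classpath, so conf/ must come first.
        Configuration conf = NutchConfiguration.create();
        System.out.println(conf.get("http.agent.name")); // mynutch
        System.out.println(conf.get("plugin.folders"));  // ./src/plugin
      }
    }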

[b][color=green][size=large]Next, a quick note on the changes needed to run each class. Nutch runs here on top of Hadoop's local mode, so on Windows you have to work around Hadoop's file-permission checks, or the jobs will fail at runtime. A simple workaround: copy Hadoop's FileUtil class into the Eclipse project and disable its permission check. If you run on Linux, you can skip this step.[/size][/color][/b]
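
[color=green][size=large]A minimal sketch of that workaround, assuming the Hadoop 1.x sources that Nutch 1.8 builds against: copy org.apache.hadoop.fs.FileUtil from the Hadoop source tree into your project (same package, so it shadows the version in the jar) and gut only the method that throws when setting permissions fails:[/size][/color]

    // A sketch, not verbatim Hadoop source: in the copied
    // org.apache.hadoop.fs.FileUtil, keep everything else unchanged and
    // replace only this method, whose IOException aborts local jobs on Windows.
    private static void checkReturnValue(boolean rv, File p,
                                         FsPermission permission)
        throws IOException {
      // The original throws when setting local file permissions fails;
      // chmod-style permissions are unavailable on Windows, so ignore
      // the return value here instead of throwing.
    }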

[b][color=green][size=large]Before you start debugging, create a urls folder in the project root and put a seed file in it listing the URLs you want to crawl, e.g. a urls/seed.txt containing a single line such as http://nutch.apache.org/.[/size][/color][/b]

[b][color=green][size=large]In the Injector class, change the run method to:[/size][/color][/b]

    public int run(String[] args) throws Exception {
      // if (args.length < 2) {
      //   System.err.println("Usage: Injector <crawldb> <url_dir>");
      //   return -1;
      // }
      // Hardcoded for local debugging: the crawldb directory and the
      // folder holding the seed file.
      args = new String[] { "mydir", "urls" };
      try {
        inject(new Path(args[0]), new Path(args[1]));
        return 0;
      } catch (Exception e) {
        LOG.error("Injector: " + StringUtils.stringifyException(e));
        return -1;
      }
    }

[color=olive][size=large]In Generator, change the run method to:[/size][/color]

    public int run(String[] args) throws Exception {
      // if (args.length < 2) {
      //   System.out.println("Usage: Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm][-maxNumSegments num]");
      //   return -1;
      // }
      // Hardcoded for local debugging: crawldb and segments directory.
      // The trailing entries match none of the flags below, so the
      // option-parsing loop simply ignores them.
      args = new String[] { "mydir", "myseg", "6", "7", "" };
      Path dbDir = new Path(args[0]);
      Path segmentsDir = new Path(args[1]);
      long curTime = System.currentTimeMillis();
      long topN = Long.MAX_VALUE;
      int numFetchers = -1;
      boolean filter = true;
      boolean norm = true;
      boolean force = false;
      int maxNumSegments = 1;
      for (int i = 2; i < args.length; i++) {
        if ("-topN".equals(args[i])) {
          topN = Long.parseLong(args[i + 1]);
          i++;
        } else if ("-numFetchers".equals(args[i])) {
          numFetchers = Integer.parseInt(args[i + 1]);
          i++;
        } else if ("-adddays".equals(args[i])) {
          long numDays = Integer.parseInt(args[i + 1]);
          curTime += numDays * 1000L * 60 * 60 * 24;
        } else if ("-noFilter".equals(args[i])) {
          filter = false;
        } else if ("-noNorm".equals(args[i])) {
          norm = false;
        } else if ("-force".equals(args[i])) {
          force = true;
        } else if ("-maxNumSegments".equals(args[i])) {
          maxNumSegments = Integer.parseInt(args[i + 1]);
        }
      }
      try {
        Path[] segs = generate(dbDir, segmentsDir, numFetchers, topN, curTime, filter,
            norm, force, maxNumSegments);
        if (segs == null) return -1;
      } catch (Exception e) {
        LOG.error("Generator: " + StringUtils.stringifyException(e));
        return -1;
      }
      return 0;
    }

[color=green][size=large]In Fetcher's run method, change:[/size][/color]

    public int run(String[] args) throws Exception {
      String usage = "Usage: Fetcher <segment> [-threads n]";
      // Hardcoded for local debugging: the timestamped segment directory
      // Generator just created. Note the trailing "4" is not preceded by
      // "-threads", so the thread count still comes from the
      // fetcher.threads.fetch property (default 10).
      args = new String[] { "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541", "4" };
      // if (args.length < 1) {
      //   System.err.println(usage);
      //   return -1;
      // }
      Path segment = new Path(args[0]);
      int threads = getConf().getInt("fetcher.threads.fetch", 10);
      boolean parsing = false;
      for (int i = 1; i < args.length; i++) { // parse command line
        if (args[i].equals("-threads")) { // found -threads option
          threads = Integer.parseInt(args[++i]);
        }
      }
      getConf().setInt("fetcher.threads.fetch", threads);
      try {
        fetch(segment, threads);
        return 0;
      } catch (Exception e) {
        LOG.error("Fetcher: " + StringUtils.stringifyException(e));
        return -1;
      }
    }
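
[color=green][size=large]The segment path above is hardcoded to the timestamped directory that Generator printed on its previous run. As a convenience (a sketch of a helper that is not part of Nutch; the names are my own), you can pick the newest segment automatically, since segment names are yyyyMMddHHmmss timestamps and therefore sort lexicographically:[/size][/color]

    import java.io.File;

    public class LatestSegment {
      /** Returns the path of the most recent segment under segmentsDir. */
      public static String latest(String segmentsDir) {
        File[] dirs = new File(segmentsDir).listFiles();
        if (dirs == null)
          throw new IllegalStateException("no segments under " + segmentsDir);
        String best = null;
        for (File d : dirs) {
          // Segment names are timestamps (yyyyMMddHHmmss), so the
          // lexicographically largest name is the newest segment.
          if (d.isDirectory() && (best == null || d.getName().compareTo(best) > 0)) {
            best = d.getName();
          }
        }
        if (best == null)
          throw new IllegalStateException("no segments under " + segmentsDir);
        return segmentsDir + File.separator + best;
      }
    }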

[b][color=green][size=large]In ParseSegment's run method, change:[/size][/color][/b]

    public int run(String[] args) throws Exception {
      Path segment;
      String usage = "Usage: ParseSegment segment [-noFilter] [-noNormalize]";
      // if (args.length == 0) {
      //   System.err.println(usage);
      //   System.exit(-1);
      // }
      // Hardcoded for local debugging: the same segment Fetcher just filled.
      args = new String[] { "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541" };
      if (args.length > 1) {
        for (int i = 1; i < args.length; i++) {
          String param = args[i];
          if ("-nofilter".equalsIgnoreCase(param)) {
            getConf().setBoolean("parse.filter.urls", false);
          } else if ("-nonormalize".equalsIgnoreCase(param)) {
            getConf().setBoolean("parse.normalize.urls", false);
          }
        }
      }
      segment = new Path(args[0]);
      parse(segment);
      return 0;
    }

[color=olive][size=large]In IndexingJob's run method, change:[/size][/color]

    public int run(String[] args) throws Exception {
      // Hardcoded for local debugging: the crawldb plus the segment to index.
      args = new String[] { "mydir", "D:\\20140520nutchplugin\\apache-nutch-1.8\\myseg\\20140520120541" };
      if (args.length < 2) {
        System.err
            .println("Usage: Indexer <crawldb> [-linkdb <linkdb>] [-params k1=v1&k2=v2...] (<segment> ... | -dir <segments>) [-noCommit] [-deleteGone] [-filter] [-normalize]");
        IndexWriters writers = new IndexWriters(getConf());
        System.err.println(writers.describe());
        return -1;
      }
      final Path crawlDb = new Path(args[0]);
      Path linkDb = null;
      final List<Path> segments = new ArrayList<Path>();
      String params = null;
      boolean noCommit = false;
      boolean deleteGone = false;
      boolean filter = false;
      boolean normalize = false;
      for (int i = 1; i < args.length; i++) {
        if (args[i].equals("-linkdb")) {
          linkDb = new Path(args[++i]);
        } else if (args[i].equals("-dir")) {
          Path dir = new Path(args[++i]);
          FileSystem fs = dir.getFileSystem(getConf());
          FileStatus[] fstats = fs.listStatus(dir,
              HadoopFSUtil.getPassDirectoriesFilter(fs));
          Path[] files = HadoopFSUtil.getPaths(fstats);
          for (Path p : files) {
            segments.add(p);
          }
        } else if (args[i].equals("-noCommit")) {
          noCommit = true;
        } else if (args[i].equals("-deleteGone")) {
          deleteGone = true;
        } else if (args[i].equals("-filter")) {
          filter = true;
        } else if (args[i].equals("-normalize")) {
          normalize = true;
        } else if (args[i].equals("-params")) {
          params = args[++i];
        } else {
          // Anything that is not a recognized flag is treated as a segment path.
          segments.add(new Path(args[i]));
        }
      }
      try {
        index(crawlDb, linkDb, segments, noCommit, deleteGone, params,
            filter, normalize);
        return 0;
      } catch (final Exception e) {
        LOG.error("Indexer: " + StringUtils.stringifyException(e));
        return -1;
      }
    }

[color=green][size=large]In addition, at line 187 of SolrIndexWriter and line 54 of SolrUtils, hardcode the Solr server address as shown below. (SolrConstants.SERVER_URL is the solr.server.url property, so as an alternative you could set that property in nutch-site.xml instead of patching the code.)[/size][/color]

    // In SolrIndexWriter (around line 187):
    String serverURL = conf.get(SolrConstants.SERVER_URL);
    serverURL = "http://localhost:8983/solr/";

    // In SolrUtils (around line 54):
    // String serverURL = job.get(SolrConstants.SERVER_URL);
    String serverURL = "http://localhost:8983/solr";

[b][color=green][size=large]Following the steps above, adjust each class's hardcoded arguments before running it. Nutch jobs depend on one another: the input of one job is usually the output of the previous one. Run the five modified classes one by one in order, and the index ends up in Solr, as in the screenshot below:[/size][/color][/b]
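
[color=green][size=large]Instead of editing and re-running each class, another option (a sketch, relying on the fact that all five tools implement Hadoop's Tool interface in Nutch 1.8) is to drive the whole pipeline from a single main method, passing the same arguments to the unmodified run methods:[/size][/color]

    import org.apache.hadoop.util.ToolRunner;
    import org.apache.nutch.util.NutchConfiguration;

    public class LocalCrawlDriver {
      public static void main(String[] args) throws Exception {
        // Inject the seeds from urls/ into the crawldb "mydir".
        ToolRunner.run(NutchConfiguration.create(),
            new org.apache.nutch.crawl.Injector(),
            new String[] { "mydir", "urls" });
        // Generate a fetch list; this creates a timestamped segment under myseg.
        ToolRunner.run(NutchConfiguration.create(),
            new org.apache.nutch.crawl.Generator(),
            new String[] { "mydir", "myseg" });
        // The segment path must be the directory Generator just created
        // (hardcoded here; see the LatestSegment helper sketched above).
        String segment = "myseg/20140520120541";
        ToolRunner.run(NutchConfiguration.create(),
            new org.apache.nutch.fetcher.Fetcher(),
            new String[] { segment, "-threads", "4" });
        ToolRunner.run(NutchConfiguration.create(),
            new org.apache.nutch.parse.ParseSegment(),
            new String[] { segment });
        ToolRunner.run(NutchConfiguration.create(),
            new org.apache.nutch.indexer.IndexingJob(),
            new String[] { "mydir", segment });
      }
    }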

[img]http://dl2.iteye.com/upload/attachment/0097/3216/ea040145-4cb7-3a72-8ecc-62e27df73922.jpg[/img]

[b][color=green][size=large]Of course, we can also configure the analyzer/tokenization strategy to make retrieval more versatile and accurate.[/size][/color][/b]
