The Scrapy Crawling Framework, Part 1: Overview

小灰灰 2023-05-28 06:44

Scrapy is a web-scraping framework implemented in Python. This article gives an overview of Scrapy, explains how to install it, and walks through a simple scrapy shell example that fetches a page title, to give a first feel for how the tool is used.

Overview

Scrapy is a web-scraping framework implemented in Python, well suited to crawling websites and extracting structured data. Compared with Apache Nutch, which targets general-purpose search, it is smaller and more flexible. Key facts are summarized in the table below:

| Item | Details |
| --- | --- |
| Website | https://scrapy.org/ |
| Open/closed source | Open source |
| Source repository | https://github.com/scrapy/scrapy |
| Language | Python |
| Version used in this article | 2.0.1 (see the install log below) |

Installation

Scrapy can be installed directly with pip:

Command: pip install scrapy

This article's environment has Python 2 and Python 3 side by side, so pip3 is used instead. The installation log is shown below:

    liumiaocn:scrapy liumiao$ pip3 install scrapy
    Collecting scrapy
      Downloading
      ... (omitted)
    Successfully built protego PyDispatcher zope.interface
    Installing collected packages: six, pycparser, cffi, cryptography, pyasn1, pyasn1-modules, attrs, service-identity, protego, cssselect, pyOpenSSL, w3lib, PyDispatcher, incremental, constantly, Automat, PyHamcrest, zope.interface, idna, hyperlink, Twisted, lxml, parsel, queuelib, scrapy
    Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.8 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.0 parsel-1.5.2 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.0.1 service-identity-18.1.0 six-1.14.0 w3lib-1.21.0 zope.interface-5.0.1
    liumiaocn:scrapy liumiao$

Confirming the version

    liumiaocn:scrapy liumiao$ scrapy -h
    Scrapy 2.0.1 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command
    liumiaocn:scrapy liumiao$

Fetching a page title

A crawler is, in essence, a program that processes HTML. The simplest way to see what Scrapy can do is through scrapy shell, which provides an interactive console for scraping data and is also handy for debugging extraction logic.

Example: we want to fetch the title of the Scrapy homepage (https://scrapy.org/).

Step 1: run scrapy shell

Run the following command:

Command: scrapy shell https://scrapy.org/

    liumiaocn:scrapy liumiao$ scrapy shell https://scrapy.org/
    2020-03-28 05:38:09 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
    2020-03-28 05:38:09 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
    2020-03-28 05:38:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2020-03-28 05:38:09 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet Password: 5e36afd357190e93
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-03-28 05:38:09 [scrapy.core.engine] INFO: Spider opened
    2020-03-28 05:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1073043d0>
    [s]   item       {}
    [s]   request    <GET https://scrapy.org/>
    [s]   response   <200 https://scrapy.org/>
    [s]   settings   <scrapy.settings.Settings object at 0x1075d05d0>
    [s]   spider     <DefaultSpider 'default' at 0x107acbd90>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()                     Shell help (print this help)
    [s]   view(response)              View response in a browser
    >>>

Step 2: get the title with response.css

Type response.css('title') and press Enter; the matched title element appears in the output:

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Scrapy | A Fast and Powerful S...'>]
    >>>
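To make concrete what the selector above matches, here is a rough approximation using only Python's standard library. This is not how Scrapy works internally (its selectors build on parsel/lxml); it is just an illustrative sketch run against a hard-coded HTML snippet:

```python
from html.parser import HTMLParser

# Minimal stdlib parser that collects the text inside the first <title> element,
# approximating what response.css('title::text') extracts.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for the HTML that scrapy shell downloaded above
page = "<html><head><title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title></head><body></body></html>"
parser = TitleParser()
parser.feed(page)
print(parser.title)
```

The real selector does much more (full CSS/XPath support over a parsed DOM), but the end result for this query is the same title text.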

Extracting the full title element as a string:

    >>> response.css('title').extract_first()
    '<title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title>'
    >>>
