The Scrapy Crawling Framework, Part 1: Overview

小灰灰 2023-05-28 06:44

Scrapy is a web-scraping framework implemented in Python. This article gives an overview of Scrapy, explains how to install it, and walks through a simple scrapy shell example that fetches a page title, to give a first feel for how the tool is used.

Overview

Scrapy is a web-scraping framework implemented in Python, well suited to crawling websites and extracting structured data. Compared with Apache Nutch, which targets general-purpose search, it is smaller and more flexible. Key facts are summarized in the table below:

| Item | Details |
| --- | --- |
| Website | https://scrapy.org/ |
| Open/closed source | Open source |
| Source repository | https://github.com/scrapy/scrapy |
| Language | Python |
| Version used in this article | 2.0.1 (see the install log below) |

Installation

Scrapy can be installed directly with pip:

Command: pip install scrapy

This article's environment has Python 2 and Python 3 side by side, so pip3 is used instead. The installation log is shown below:

    liumiaocn:scrapy liumiao$ pip3 install scrapy
    Collecting scrapy
      Downloading
      ... (omitted)
    Successfully built protego PyDispatcher zope.interface
    Installing collected packages: six, pycparser, cffi, cryptography, pyasn1, pyasn1-modules, attrs, service-identity, protego, cssselect, pyOpenSSL, w3lib, PyDispatcher, incremental, constantly, Automat, PyHamcrest, zope.interface, idna, hyperlink, Twisted, lxml, parsel, queuelib, scrapy
    Successfully installed Automat-20.2.0 PyDispatcher-2.0.5 PyHamcrest-2.0.2 Twisted-20.3.0 attrs-19.3.0 cffi-1.14.0 constantly-15.1.0 cryptography-2.8 cssselect-1.1.0 hyperlink-19.0.0 idna-2.9 incremental-17.5.0 lxml-4.5.0 parsel-1.5.2 protego-0.1.16 pyOpenSSL-19.1.0 pyasn1-0.4.8 pyasn1-modules-0.2.8 pycparser-2.20 queuelib-1.5.0 scrapy-2.0.1 service-identity-18.1.0 six-1.14.0 w3lib-1.21.0 zope.interface-5.0.1
    liumiaocn:scrapy liumiao$

Confirming the version

    liumiaocn:scrapy liumiao$ scrapy -h
    Scrapy 2.0.1 - no active project

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

      [ more ]      More commands available when run from project directory

    Use "scrapy <command> -h" to see more info about a command
    liumiaocn:scrapy liumiao$

Fetching a page title

A crawler is, in essence, a program that processes HTML. The simplest way to see what Scrapy can do is through scrapy shell, which provides an interactive console for scraping data and is also handy for debugging extraction logic.

Example: we want to fetch the title of the Scrapy homepage (https://scrapy.org/).

Step 1: run scrapy shell

Run the following command:

Command: scrapy shell https://scrapy.org/

    liumiaocn:scrapy liumiao$ scrapy shell https://scrapy.org/
    2020-03-28 05:38:09 [scrapy.utils.log] INFO: Scrapy 2.0.1 started (bot: scrapybot)
    2020-03-28 05:38:09 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.5 (default, Nov 1 2019, 02:16:32) - [Clang 11.0.0 (clang-1100.0.33.8)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.8, Platform Darwin-19.2.0-x86_64-i386-64bit
    2020-03-28 05:38:09 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
    2020-03-28 05:38:09 [scrapy.crawler] INFO: Overridden settings:
    {'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
     'LOGSTATS_INTERVAL': 0}
    2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet Password: 5e36afd357190e93
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled extensions:
    ['scrapy.extensions.corestats.CoreStats',
     'scrapy.extensions.telnet.TelnetConsole',
     'scrapy.extensions.memusage.MemoryUsage']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled downloader middlewares:
    ['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
     'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
     'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
     'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
     'scrapy.downloadermiddlewares.retry.RetryMiddleware',
     'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
     'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
     'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
     'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
     'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
     'scrapy.downloadermiddlewares.stats.DownloaderStats']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled spider middlewares:
    ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
     'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
     'scrapy.spidermiddlewares.referer.RefererMiddleware',
     'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
     'scrapy.spidermiddlewares.depth.DepthMiddleware']
    2020-03-28 05:38:09 [scrapy.middleware] INFO: Enabled item pipelines:
    []
    2020-03-28 05:38:09 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
    2020-03-28 05:38:09 [scrapy.core.engine] INFO: Spider opened
    2020-03-28 05:38:10 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://scrapy.org/> (referer: None)
    [s] Available Scrapy objects:
    [s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
    [s]   crawler    <scrapy.crawler.Crawler object at 0x1073043d0>
    [s]   item       {}
    [s]   request    <GET https://scrapy.org/>
    [s]   response   <200 https://scrapy.org/>
    [s]   settings   <scrapy.settings.Settings object at 0x1075d05d0>
    [s]   spider     <DefaultSpider 'default' at 0x107acbd90>
    [s] Useful shortcuts:
    [s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
    [s]   fetch(req)                  Fetch a scrapy.Request and update local objects
    [s]   shelp()                     Shell help (print this help)
    [s]   view(response)              View response in a browser
    >>>

Step 2: get the title with response.css

Type response.css('title') and press Enter; the matched title element appears in the output:

    >>> response.css('title')
    [<Selector xpath='descendant-or-self::title' data='<title>Scrapy | A Fast and Powerful S...'>]
    >>>
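To make concrete what the selector above matches, here is a rough approximation using only Python's standard library. This is not how Scrapy works internally (its selectors build on parsel/lxml); it is just an illustrative sketch run against a hard-coded HTML snippet:

```python
from html.parser import HTMLParser

# Minimal stdlib parser that collects the text inside the first <title> element,
# approximating what response.css('title::text') extracts.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

# Stand-in for the HTML that scrapy shell downloaded above
page = "<html><head><title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title></head><body></body></html>"
parser = TitleParser()
parser.feed(page)
print(parser.title)
```

The real selector does much more (full CSS/XPath support over a parsed DOM), but the end result for this query is the same title text.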

Extracting the full title element as a string:

    >>> response.css('title').extract_first()
    '<title>Scrapy | A Fast and Powerful Scraping and Web Crawling Framework</title>'
    >>>
