Web Crawlers, Part 7 (Introduction to the Scrapy Framework)

冷不防 · 2024-04-17 05:53

Documentation: https://scrapy-chs.readthedocs.io/zh_CN/0.24/topics/signals.html

Introduction to the Scrapy framework

  • Scrapy is an application framework written in pure Python for crawling websites and extracting structured data, and it is used very widely.
  • That is the power of a framework: a user only has to customize a handful of modules to build a crawler that fetches page content and images, with very little effort.

Scrapy architecture


  • Scrapy Engine: handles the communication, signals, and data transfer between the Spider, Item Pipeline, Downloader, and Scheduler.
  • Scheduler: accepts the Requests sent over by the engine, organizes and enqueues them in a defined order, and hands them back when the engine asks for them.
  • Downloader: downloads every Request sent by the Scrapy Engine and returns the resulting Responses to the engine, which passes them on to the Spider for processing.
  • Spider: processes all Responses, extracts the data needed to fill the Item fields, and submits any URLs that need to be followed back to the engine, where they re-enter the Scheduler.
  • Item Pipeline: processes the Items produced by the Spider and carries out the post-processing (detailed analysis, filtering, storage, and so on).
  • Downloader Middlewares: components you can customize to extend the download functionality.
  • Spider Middlewares: components you can customize to extend and intercept the communication between the engine and the Spider (for example the Responses going into the Spider and the Requests coming out of it).

1. Installing the Scrapy framework

  • pip install scrapy

2. Creating a Scrapy project

scrapy startproject <project_name>

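For reference, scrapy startproject generates a project skeleton roughly like the one below (shown here for a project named Baidu, matching the later examples; exact file names may vary slightly between Scrapy versions):

Baidu/
    scrapy.cfg            # deployment configuration
    Baidu/                # the project's Python package
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py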

3. Creating a spider file

scrapy genspider <spider_name> <domain>

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    # spider name
    name = 'baidu'
    # domains the spider is allowed to crawl (more than one may be listed)
    allowed_domains = ['www.baidu.com']
    # start URLs (more than one may be listed)
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        """Callback invoked once a start URL has been fetched successfully.

        :param response: the response object
        """
        # parse() extracts the data, wraps it in an item, and passes it on to the pipelines
        pass
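As a purely illustrative sketch of that data flow, the spider below fills parse() in so that it extracts the page title with an XPath query and yields it as a plain dict, which Scrapy then passes to the item pipelines; the class name, spider name, and field names are made up for this example and are not part of the generated template:

# minimal sketch, not generated by scrapy genspider: yield the page title as an item
import scrapy


class TitleSpider(scrapy.Spider):
    name = 'title_example'                  # hypothetical spider name
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # response.xpath() returns a SelectorList; .get() takes the first match or None
        yield {
            'title': response.xpath('//title/text()').get(),
            'url': response.url,
        }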

Creating a spider from the crawl template

scrapy genspider -t crawl <spider_name> <domain>

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class TaobaoSpider(CrawlSpider):
    name = 'taobao'
    allowed_domains = ['www.taobao.com']
    start_urls = ['http://www.taobao.com/']

    # Rules extract the links that match the given regular expression
    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    # LinkExtractor arguments:
    #   allow=()             regexes a URL must match to be extracted
    #   deny=()              regexes a URL must not match (takes precedence over allow)
    #   allow_domains=()     domains URLs may be extracted from
    #   deny_domains=()      domains to skip (takes precedence over allow_domains)
    #   restrict_xpaths=()   only extract links inside the regions matched by these XPaths
    #   restrict_css=()      only extract links inside the regions matched by these CSS selectors
    #   unique=True          keep only one copy of duplicate URLs
    #   strip=True           strip leading/trailing whitespace from URLs (default True)
    #
    # Rule arguments:
    #   link_extractor            a LinkExtractor object
    #   callback=None             callback used for responses matched by this rule
    #   follow=None               whether to follow links found in those responses
    #   process_links=None        callback that can filter the extracted links
    #   process_request=identity  callback that can filter or modify the generated requests
    #
    # Note: never define a parse() callback in a CrawlSpider; it would override
    # the parent class's method and break the rule machinery.

    def parse_item(self, response):
        item = {}
        # item['domain_id'] = response.xpath('//input[@id="sid"]/@value').get()
        # item['name'] = response.xpath('//div[@id="name"]').get()
        # item['description'] = response.xpath('//div[@id="description"]').get()
        return item
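To make the LinkExtractor and Rule parameters listed above a little more concrete, here is a small sketch of an additional rule that only follows links found inside a hypothetical div.nav region and skips anything containing "login"; the selector and patterns are invented for illustration only:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# hypothetical rule: follow 'Items/' links found inside <div class="nav">,
# but skip any URL containing 'login' (deny takes precedence over allow)
nav_rule = Rule(
    LinkExtractor(
        allow=(r'Items/',),
        deny=(r'login',),
        restrict_css=('div.nav',),
        unique=True,
    ),
    callback='parse_item',
    follow=True,
)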

An item pipeline component is a standalone Python class; its process_item() method must be implemented:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


class BaiduPipeline(object):

    def __init__(self):
        # initialize resources here, e.g. MySQL or MongoDB connections
        pass

    def process_item(self, item, spider):
        """Handle an item passed on by a spider.

        :param item: the item object
        :param spider: the spider that produced it
        """
        return item

    def open_spider(self, spider):
        # optional: called when the spider is opened
        pass

    def close_spider(self, spider):
        # optional: called when the spider is closed; typically used to
        # close MySQL/MongoDB connections
        pass
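For a more concrete picture, below is a minimal sketch of a pipeline that appends every item to a JSON-lines file; the class name JsonLinesPipeline and the file name items.jl are made up for this example, and the class would still need to be registered in ITEM_PIPELINES:

# minimal sketch of a pipeline that writes items to a JSON-lines file (illustrative only)
import json


class JsonLinesPipeline(object):

    def open_spider(self, spider):
        # open the output file once, when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # write one JSON object per line and pass the item on unchanged
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # release the file handle when the spider finishes
        self.file.close()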

settings.py configuration

# -*- coding: utf-8 -*-

# Scrapy settings for Baidu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'Baidu'

SPIDER_MODULES = ['Baidu.spiders']
NEWSPIDER_MODULE = 'Baidu.spiders'

LOG_FILE = 'BaiduSpider.log'
LOG_LEVEL = 'INFO'
FEED_EXPORT_ENCODING = 'utf-8'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'Baidu (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 1
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'Baidu.middlewares.BaiduDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'Baidu.pipelines.BaiduPipeline': 300,
}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# While debugging, read responses from the local cache first to avoid re-sending every request
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0
HTTPCACHE_DIR = 'httpcache'
HTTPCACHE_IGNORE_HTTP_CODES = []
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
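One point worth noting: settings.py applies to the whole project. When a single spider needs different values (a longer DOWNLOAD_DELAY, say), Scrapy also supports per-spider overrides through the custom_settings class attribute; a minimal sketch with an invented spider looks like this:

import scrapy


class SlowSpider(scrapy.Spider):
    # hypothetical spider used only to illustrate per-spider setting overrides
    name = 'slow_example'
    start_urls = ['http://www.example.com/']

    # these values override settings.py for this spider only
    custom_settings = {
        'DOWNLOAD_DELAY': 3,
        'COOKIES_ENABLED': True,
    }

    def parse(self, response):
        pass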

Using Downloader Middleware

1. Setting a random proxy

1.1 Add a list of proxy IPs to settings.py

PROXIES = [
    'http://183.207.95.27:80', 'http://111.6.100.99:80', 'http://122.72.99.103:80',
    'http://106.46.132.2:80', 'http://112.16.4.99:81', 'http://123.58.166.113:9000',
    'http://118.178.124.33:3128', 'http://116.62.11.138:3128', 'http://121.42.176.133:3128',
    'http://111.13.2.131:80', 'http://111.13.7.117:80', 'http://121.248.112.20:3128',
    'http://112.5.56.108:3128', 'http://42.51.26.79:3128', 'http://183.232.65.201:3128',
    'http://118.190.14.150:3128', 'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
    'http://166.111.77.32:3128', 'http://42.202.130.246:3128', 'http://122.228.25.97:8101',
    'http://61.136.163.245:3128', 'http://121.40.23.227:3128', 'http://123.96.6.216:808',
    'http://59.61.72.202:8080', 'http://114.141.166.242:80', 'http://61.136.163.246:3128',
    'http://60.31.239.166:3128', 'http://114.55.31.115:3128', 'http://202.85.213.220:3128',
]

1.2 Add the following code to middlewares.py

import random

import scrapy
from scrapy import signals


class ProxyMiddleware(object):
    """Attach a random proxy to every outgoing request."""

    def __init__(self, ip):
        self.ip = ip

    @classmethod
    def from_crawler(cls, crawler):
        return cls(ip=crawler.settings.get('PROXIES'))

    def process_request(self, request, spider):
        ip = random.choice(self.ip)
        request.meta['proxy'] = ip

1.3 Finally, register the custom class under the downloader middleware settings, as shown below

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 543,
}
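Because free proxies die frequently, the same idea can optionally be extended with process_exception() so that a failed download is retried through a different proxy. A rough sketch of that variant (again assuming the PROXIES list from settings.py) might look like this:

import random


class RetryProxyMiddleware(object):
    """Sketch: pick a random proxy per request and pick a new one on download errors."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(proxies=crawler.settings.get('PROXIES'))

    def process_request(self, request, spider):
        request.meta['proxy'] = random.choice(self.proxies)

    def process_exception(self, request, exception, spider):
        # returning a Request from process_exception makes Scrapy reschedule it;
        # dont_filter=True keeps the duplicate filter from dropping the retry
        retry = request.replace(dont_filter=True)
        retry.meta['proxy'] = random.choice(self.proxies)
        return retry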

2. Setting a random User-Agent

2.1 Add the following to settings.py

MY_USER_AGENT = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
    "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
    "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
    "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
    "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
    "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
]

2.2 Add the following code to middlewares.py

import random

import scrapy
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class MyUserAgentMiddleware(UserAgentMiddleware):
    """Set a random User-Agent header on every outgoing request."""

    def __init__(self, user_agent):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            user_agent=crawler.settings.get('MY_USER_AGENT')
        )

    def process_request(self, request, spider):
        agent = random.choice(self.user_agent)
        request.headers['User-Agent'] = agent

2.3 The last step is to add our custom MyUserAgentMiddleware class to DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in middleware (note the module path is scrapy.downloadermiddlewares, plural)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.MyUserAgentMiddleware': 400,
}

The Crawler object hierarchy

settings: the crawler's settings manager, used as crawler.settings.get(name)
    set(name, value, priority='project')
    setdict(values, priority='project')
    setmodule(module, priority='project')
    get(name, default=None)
    getbool(name, default=False)
    getint(name, default=0)
    getfloat(name, default=0.0)
    getlist(name, default=None)
    getdict(name, default=None)
    copy()        # deep-copies the current settings
    freeze()
    frozencopy()

signals: the crawler's signal manager, used as crawler.signals.connect(receiver, signal)
    connect(receiver, signal)
    send_catch_log(signal, **kwargs)
    send_catch_log_deferred(signal, **kwargs)
    disconnect(receiver, signal)
    disconnect_all(signal)

stats: the crawler's stats collector, used as crawler.stats.get_value()
    get_value(key, default=None)
    get_stats()
    set_value(key, value)
    set_stats(stats)
    inc_value(key, count=1, start=0)
    max_value(key, value)
    min_value(key, value)
    clear_stats()
    open_spider(spider)
    close_spider(spider)

extensions: the extension manager, which tracks all enabled extensions
engine: the execution engine, which coordinates the crawler's core logic (scheduling, downloading, and the spider)
spider: the spider currently being crawled, an instance of the spider class supplied when the crawler was created
crawl(*args, **kwargs): instantiates the spider class, starts the execution engine, and starts the crawler

Scrapy built-in signals

    engine_started       # the engine has started
    engine_stopped       # the engine has stopped
    spider_opened        # a spider has been opened
    spider_idle          # a spider has gone idle
    spider_closed        # a spider has been closed
    spider_error         # a spider callback raised an error
    request_scheduled    # the engine scheduled a Request
    request_dropped      # the engine dropped a Request
    response_received    # the engine received a new Response from the downloader
    response_downloaded  # an HTTPResponse was downloaded
    item_scraped         # an item passed through every Item Pipeline without being dropped
    item_dropped         # an item was dropped by raising DropItem
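Tying this back to the signals documentation linked at the top: a spider (or an extension) normally hooks into these signals from its from_crawler() classmethod using crawler.signals.connect(). A minimal sketch that logs a line when the spider_closed signal fires (spider name invented for the example):

import scrapy
from scrapy import signals


class SignalDemoSpider(scrapy.Spider):
    name = 'signal_demo'                     # hypothetical spider name
    start_urls = ['http://www.example.com/']

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SignalDemoSpider, cls).from_crawler(crawler, *args, **kwargs)
        # run spider_closed() whenever the spider_closed signal is sent
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def spider_closed(self, spider):
        self.logger.info('Spider closed: %s', spider.name)

    def parse(self, response):
        pass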
