Python Web Scraping: Using the Scrapy Framework (2)

矫情吗；* 2022-05-26 06:54

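#### 1. Requirements ####

The previous article covered scraping data from a single web page. But what if the site spans many pages and we want the data from every one of them? The site we scrape next is an example: clicking the "next" link, boxed in red in the screenshot, takes you to the following page:  
![image description][70]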

#### 2. Writing the code ####

The following code shows how to scrape data from multiple pages:

    import scrapy
    
    
    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ['http://quotes.toscrape.com/tag/humor/']
    
        def parse(self, response):
            for quote in response.xpath('//div[@class="quote"]'):
                # Yield the scraped data so it can be written to a file
                yield {
                    'text': quote.xpath('span[@class="text"]/text()').extract_first(),
                    'author': quote.xpath('span/small[@class="author"]/text()').extract_first(),
                }
    
            # Find the URL that the "next" link points to (the href sits on the <a> inside the <li>)
            next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
            # The last page has no "next" link, so check before following it
            if next_page is not None:
                next_page = response.urljoin(next_page)  # build the full URL, explained below
                # callback=self.parse means the next response is also handled by self.parse
                yield scrapy.Request(next_page, callback=self.parse)

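The next-page URL read from the href attribute is a relative URL, not a complete one, as the screenshot below shows. The full URL should be [http://quotes.toscrape.com/tag/humor/page/2][http_quotes.toscrape.com_tag_humor_page_2], so it has to be joined with the current page's URL using response.urljoin.  
![image description][70 1]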
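
To see what the join does without running a spider, here is a minimal sketch using `urllib.parse.urljoin` from the standard library. As far as I know, `response.urljoin` resolves URLs the same way, with the response's own URL as the base. The relative path `/tag/humor/page/2/` and the helper name `join_next_page` are assumptions for illustration; check the actual href value in your browser's developer tools.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (same as start_urls in the spider).
base_url = "http://quotes.toscrape.com/tag/humor/"

# Assumed value of the "next" link's href: a relative path, not a full URL.
relative_href = "/tag/humor/page/2/"


def join_next_page(base, href):
    """Resolve a (possibly relative) href against the current page's URL."""
    return urljoin(base, href)


next_page_url = join_next_page(base_url, relative_href)
print(next_page_url)
```

If the href is already an absolute URL, `urljoin` returns it unchanged, so the join step is safe to keep either way.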

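#### 3. Running the code ####

Assuming the code above is saved as quotes\_spider.py, run the following in the Windows cmd command line: scrapy runspider quotes\_spider.py -o spider.json. The data scraped by the program is written to the file spider.json.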
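
Once the crawl finishes, a quick way to check the result is to load the file back in Python. The sketch below assumes spider.json contains a single JSON array of items with the `text` and `author` keys yielded by the spider; the function names `load_quotes` and `count_by_author` are hypothetical helpers, not part of Scrapy.

```python
import json


def load_quotes(path="spider.json"):
    """Load the list of scraped items from the JSON file written by -o."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_by_author(quotes):
    """Count how many scraped quotes belong to each author."""
    counts = {}
    for item in quotes:
        author = item.get("author")
        counts[author] = counts.get(author, 0) + 1
    return counts
```

Calling `count_by_author(load_quotes())` from the directory that contains spider.json gives a per-author tally, which makes it easy to confirm that quotes from every page were collected and not just those from the first one.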

[70]: /images/20220526/09f41d55d9504e71a5f48b79ed0b1c8d.png
[http_quotes.toscrape.com_tag_humor_page_2]: http://quotes.toscrape.com/tag/humor/page/2
[70 1]: /images/20220526/a45ef62e6dc243a0a4877967163cec2a.png
