Scrapy Image Crawling Example

电玩女神 2022-04-18 02:40

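Target: 360 photography images

Create the Scrapy project:

scrapy startproject images360

Create the spider:

scrapy genspider images images.so.com

Modify the code:

Modify the spider in images.py. The code is derived from analyzing the AJAX requests the page sends as you scroll down.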

    # -*- coding: utf-8 -*-
    import json
    from urllib.parse import urlencode

    from scrapy import Spider, Request

    from images360.items import ImageItem


    class ImagesSpider(Spider):
        name = 'images'
        # Generated by genspider. Note that the AJAX requests below go to
        # image.so.com; add that domain here if the offsite filter drops them.
        allowed_domains = ['images.so.com']
        start_urls = ['http://images.so.com/']

        def start_requests(self):
            # Fixed query parameters taken from the AJAX request
            data = {'ch': 'beauty', 'listtype': 'new'}
            base_url = 'https://image.so.com/zj?'
            for page in range(1, self.settings.getint('MAX_PAGE') + 1):
                data['sn'] = page * 30  # offset advances by 30 per page
                params = urlencode(data)
                url = base_url + params
                yield Request(url, self.parse)

        def parse(self, response):
            # The endpoint returns JSON; each entry under 'list' is one image
            result = json.loads(response.text)
            for image in result.get('list') or []:
                item = ImageItem()
                item['id'] = image.get('imageid')
                item['url'] = image.get('qhimg_url')
                item['title'] = image.get('group_title')
                item['thumb'] = image.get('qhimg_thumb_url')
                yield item
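
The request and parsing logic of the spider can be tried without running Scrapy. The following is a minimal sketch under stated assumptions: `build_page_urls`, `extract_items`, and `SAMPLE` are hypothetical names introduced here for illustration, and the sample payload values are made up (only the key names come from the spider above).

```python
import json
from urllib.parse import urlencode

BASE_URL = 'https://image.so.com/zj?'  # endpoint used by the spider above
PAGE_STEP = 30  # the spider advances `sn` by 30 per page


def build_page_urls(max_page):
    """Return the AJAX URLs for pages 1..max_page, mirroring start_requests."""
    urls = []
    for page in range(1, max_page + 1):
        params = {'ch': 'beauty', 'listtype': 'new', 'sn': page * PAGE_STEP}
        urls.append(BASE_URL + urlencode(params))
    return urls


def extract_items(response_text):
    """Map each entry of the JSON 'list' array to the four item fields."""
    result = json.loads(response_text)
    return [
        {
            'id': image.get('imageid'),
            'url': image.get('qhimg_url'),
            'title': image.get('group_title'),
            'thumb': image.get('qhimg_thumb_url'),
        }
        for image in result.get('list') or []
    ]


# Hypothetical payload: values are invented, only the key names are from the spider.
SAMPLE = json.dumps({'list': [{
    'imageid': 'abc123',
    'qhimg_url': 'https://example.com/t01abc.jpg',
    'group_title': 'sample title',
    'qhimg_thumb_url': 'https://example.com/thumb/t01abc.jpg',
}]})

if __name__ == '__main__':
    for url in build_page_urls(2):
        print(url)
    print(extract_items(SAMPLE))
```

Building the query string with `urlencode` rather than by hand keeps the parameters correctly escaped, and falling back to an empty list means a response without a `list` key yields no items instead of raising an error.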

Modify items.py to define the fields to collect:

    from scrapy import Item, Field


    class ImageItem(Item):
        # Collection/table name, for use by a storage pipeline
        collection = table = 'images'
        id = Field()
        url = Field()
        title = Field()
        thumb = Field()

Modify pipelines.py, using the built-in ImagesPipeline to save the images locally:

    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class ImagePipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # Use the last segment of the image URL as the file name
            url = request.url
            file_name = url.split('/')[-1]
            return file_name

        def item_completed(self, results, item, info):
            # Keep only the paths of images that downloaded successfully
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem('Image download failed')
            return item

        def get_media_requests(self, item, info):
            # Schedule one download request per item
            yield Request(item['url'])
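
The two pieces of plain Python logic inside the pipeline (deriving the file name and filtering the download results) can be checked in isolation. This is a rough sketch rather than the pipeline itself: `file_name_from_url` and `successful_paths` are hypothetical helper names, the URLs are placeholders, and a plain `Exception` stands in for the failure object Scrapy passes for a failed download.

```python
def file_name_from_url(url):
    """Return the last path segment of the URL, as file_path does."""
    return url.split('/')[-1]


def successful_paths(results):
    """Collect 'path' from every (ok, info) pair whose download succeeded."""
    return [info['path'] for ok, info in results if ok]


# Shape of `results` as received by item_completed: a list of
# (success, info_or_failure) tuples. The values here are invented.
results = [
    (True, {'url': 'https://example.com/img/t01abc.jpg', 'path': 't01abc.jpg'}),
    (False, Exception('download failed')),  # stand-in for a failure object
]

if __name__ == '__main__':
    print(file_name_from_url('https://example.com/img/t01abc.jpg'))
    print(successful_paths(results))
```

Because the file name is only the last URL segment, two images whose URLs end in the same segment would overwrite each other on disk; that is a consequence of the naming scheme used here.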

Finally, modify settings.py:

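    ROBOTSTXT_OBEY = False  # modified
    ITEM_PIPELINES = {
        'images360.pipelines.ImagePipeline': 300,
        # 'images360.pipelines.MongoPipeline': 301,
    }
    MAX_PAGE = 50  # number of pages the spider requests
    # ImagesPipeline stays disabled unless IMAGES_STORE is set; the directory
    # below is an example and was not part of the original settings.
    IMAGES_STORE = './images'
    # MONGO_URI = '192.168.6.23'
    # MONGO_DB = 'images360'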

Then run the spider:

scrapy crawl images
