Scraping Bing Images with Python

偏执的太偏执、 2022-10-06 00:33

First, install the third-party libraries used to fetch and parse web pages:

    pip install bs4
    pip install requests
    pip install lxml
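To confirm that the installation worked before running a long scrape, a quick check like the one below can help. It is only a minimal sketch: the helper name `missing_modules` is made up for illustration, and it only tests whether each module can be located for import, not whether a particular version is installed.

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Import names of the three packages installed above.
    missing = missing_modules(["bs4", "requests", "lxml"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported as missing, rerun the matching `pip install` command in the same Python environment that will run the script.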

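Then use the script below to scrape images from the Bing search engine for the keyword “戴帽子” ("wearing a hat"). The script is set to scrape 2000 images: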

    import os
    import re
    import time
    import urllib.parse
    import urllib.request

    from bs4 import BeautifulSoup

    header = {
        'User-Agent':
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'
    }
    url = "https://cn.bing.com/images/async?q={0}&first={1}&count={2}&scenario=ImageBasicHover&datsrc=N_I&layout=ColumnBased&mmasync=1&dgState=c*9_y*2226s2180s2072s2043s2292s2295s2079s2203s2094_i*71_w*198&IG=0D6AD6CBAF43430EA716510A4754C951&SFX={3}&iid=images.5599"


    def getImage(url, path, count):
        '''Download the full-size image at url and save it under path.'''
        try:
            time.sleep(0.5)
            urllib.request.urlretrieve(
                url, os.path.join(path, 'hat' + str(count + 1) + '.jpg'))
        except Exception as e:
            time.sleep(1)
            print("Failed to fetch this image, skipping...", e)
        else:
            print("Saved image " + str(count + 1))


    def findImgUrlFromHtml(html, rule, path, count):
        '''Find the full-size image URLs in a thumbnail list page, download
        them, and return the updated running count.'''
        soup = BeautifulSoup(html, "lxml")
        link_list = soup.find_all("a", class_="iusc")
        for link in link_list:
            # Search the parsed "m" attribute, where "&" is not escaped as "&amp;"
            result = re.search(rule, link.get("m", ""))
            if result is None:
                continue
            # Strip the leading '"murl":"' (8 characters) to get the image URL
            img_url = result.group(0)[8:]
            # Download the full-size image
            getImage(img_url, path, count)
            # count tracks processed results (including failed downloads),
            # because it is also used as the paging offset
            count += 1
        # This page is done; the caller loads the next one
        return count


    def getStartHtml(url, key, first, loadNum, sfx):
        '''Fetch one thumbnail list page and return its HTML.'''
        page = urllib.request.Request(url.format(key, first, loadNum, sfx),
                                      headers=header)
        with urllib.request.urlopen(page) as response:
            return response.read()


    if __name__ == '__main__':
        name = "戴帽子"  # search keyword ("wearing a hat")
        path = './imgs'  # directory where images are saved
        countNum = 2000  # number of images to scrape
        key = urllib.parse.quote(name)
        first = 1
        loadNum = 35
        sfx = 1
        count = 0
        rule = re.compile(r'"murl":"http\S[^"]+')
        os.makedirs(path, exist_ok=True)
        while count < countNum:
            html = getStartHtml(url, key, first, loadNum, sfx)
            new_count = findImgUrlFromHtml(html, rule, path, count)
            if new_count == count:
                # No image links on this page: stop instead of looping forever
                break
            count = new_count
            first = count + 1
            sfx += 1
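The script pulls each full-size image address out of the `murl` field attached to the `a.iusc` links. The sketch below illustrates that extraction step offline, using only the standard library. It assumes the `m` attribute holds a JSON object with a `murl` key, which is what the regular expression in the script implies. The names `MurlParser`, `extract_murls` and `SAMPLE_HTML` are hypothetical, and the sample markup is hand-written rather than captured from Bing.

```python
import json
from html.parser import HTMLParser


class MurlParser(HTMLParser):
    """Collect the "murl" value from every <a class="iusc" m="{...}"> tag."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag != "a" or "iusc" not in classes:
            return
        try:
            # HTMLParser has already unescaped &quot; and &amp; in the value.
            meta = json.loads(attributes.get("m") or "")
        except ValueError:
            return  # attribute missing or not valid JSON
        if isinstance(meta, dict) and "murl" in meta:
            self.urls.append(meta["murl"])


def extract_murls(html_text):
    """Return the full-size image URLs found in html_text."""
    parser = MurlParser()
    parser.feed(html_text)
    parser.close()
    return parser.urls


# Hand-written sample markup (an assumption, not real Bing output).
SAMPLE_HTML = (
    '<a class="iusc" m="{&quot;murl&quot;:'
    '&quot;https://example.com/a.jpg?x=1&amp;y=2&quot;}"></a>'
    '<a class="other" m="{}"></a>'
)

if __name__ == "__main__":
    print(extract_murls(SAMPLE_HTML))
```

Parsing the attribute as JSON avoids slicing a fixed number of characters off a regular-expression match, and it returns the address with `&` already unescaped.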

The scraping result looks like this:

[Image: screenshot of the scraped images]
