Crawling Baidu Images with Python

不念不忘少年蓝 · 2022-06-05 09:25

Baidu Images packet-capture data: [screenshot]
Parameter details: [screenshot]
Data parsing: [screenshots]
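Since the parameter screenshot is not reproduced here: paging is driven by three fields that also appear in the code below — pn (the result offset), rn (results per page, 30 here), and gsm (the offset in hexadecimal without the 0x prefix). A minimal sketch of how they are derived; the helper name is just for illustration:

    # Sketch: the paging fields sent to the acjson endpoint.
    # pn  = result offset (0, 30, 60, ...)
    # rn  = results per page
    # gsm = the offset in hex, without the leading "0x"
    def paging_params(start, length):
        return {
            "pn": str(start),
            "rn": str(length),
            "gsm": hex(start)[2:],
        }

    print(paging_params(60, 30))  # {'pn': '60', 'rn': '30', 'gsm': '3c'}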

    from urllib import request, parse
    from http import cookiejar
    import re
    import time

    # 1. Fetch one page of search results and download every thumbnail it contains
    def main(text, start, length):
        # gsm is the page offset in hex, without the leading "0x"
        s = hex(start)[2:]
        reqMessage = {
            "tn": "resultjson_com",
            "ipn": "rj",
            "ct": "201326592",
            "is": "",
            "fp": "result",
            "queryWord": text,
            "cl": "2",
            "lm": "-1",
            "ie": "utf-8",
            "oe": "utf-8",
            "adpicid": "",
            "st": "",
            "z": "",
            "ic": "",
            "word": text,
            "s": "",
            "se": "",
            "tab": "",
            "width": "",
            "height": "",
            "face": "",
            "istype": "",
            "qc": "",
            "nc": "",
            "fr": "",
            "cg": "head",
            "pn": str(start),      # start offset of this page
            "rn": str(length),     # number of results per page
            "gsm": s,
            "1511330964840": ""
        }
        # Carry cookies across requests
        cookie = cookiejar.CookieJar()
        cookie_support = request.HTTPCookieProcessor(cookie)
        opener = request.build_opener(cookie_support, request.HTTPHandler)
        request.install_opener(opener)
        reqData = parse.urlencode(reqMessage)
        req = request.Request("http://image.baidu.com/search/acjson?" + reqData, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"})
        data = request.urlopen(req).read()
        # Pull every thumbURL field out of the response text
        rm = re.compile(r'"thumbURL":"[\w/\\:.,;=&]*"')
        urls = re.findall(rm, data.decode())
        index = start + 1
        result = False
        for thumbURL in urls:
            # Strip the leading '"thumbURL":"' (12 chars) and the trailing quote
            url = thumbURL[12:len(thumbURL) - 1]
            downImg(url, "F:/file/baidu/" + str(index) + ".jpg")
            index += 1
            result = True
        return result

    # Download a single image
    def downImg(url, path):
        print(url)
        # Both User-Agent and Referer are required, otherwise Baidu refuses the request
        req = request.Request(url, headers={
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
            "Referer": "http://image.baidu.com/search/acjson"})
        data = request.urlopen(req).read()
        with open(path, "wb") as file:
            file.write(data)

    # Keep requesting pages of 30 until a page comes back empty
    a = 0
    while a != -1:
        result = main("美女图片", a * 30, 30)   # search keyword
        print("Pausing...")
        a += 1
        if not result:
            a = -1
        time.sleep(10)   # throttle requests to avoid being blocked
    print("Done")
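The regex above scrapes thumbURL values straight out of the response text. Since the endpoint requests JSON output (tn=resultjson_com), the same URLs could in principle be extracted with the json module instead. A minimal sketch, assuming the body parses as valid JSON and keeps the thumbnails under a top-level "data" list — that key name is an assumption, and the payload may not always be strictly valid JSON, hence the fallback:

    import json

    # Hedged alternative to the regex: parse the response body as JSON.
    # Assumes a top-level "data" list whose items carry a "thumbURL" field;
    # returns an empty list if the body does not parse, so the caller can
    # fall back to the regex approach.
    def extract_thumb_urls(raw_bytes):
        try:
            payload = json.loads(raw_bytes.decode("utf-8", errors="ignore"))
        except ValueError:
            return []
        return [item["thumbURL"] for item in payload.get("data", []) if item.get("thumbURL")]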

There is not much more to it. The points to watch: when downloading an image, the request headers must include both User-Agent and Referer, otherwise Baidu refuses the request. Also, each Baidu image link can only be accessed once — after the first access the link immediately becomes invalid. Finally, Baidu limits both the amount of data and the request rate, so only a limited number of images can be crawled in one run.
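Because each link dies after its first access and Baidu may reject requests outright, any individual download can fail mid-run. A small sketch of a more defensive version of downImg — the retry count and delay are arbitrary choices for illustration, not values from the original code:

    from urllib import request, error
    import time

    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36",
        "Referer": "http://image.baidu.com/search/acjson",
    }

    # Sketch: download with basic error handling so an expired or refused
    # link is logged and skipped instead of crashing the whole loop.
    def down_img_safe(url, path, retries=2, delay=3):
        for attempt in range(retries + 1):
            try:
                req = request.Request(url, headers=HEADERS)
                data = request.urlopen(req, timeout=10).read()
                with open(path, "wb") as f:
                    f.write(data)
                return True
            except (error.HTTPError, error.URLError) as e:
                print("download failed, attempt %d: %s" % (attempt + 1, e))
                time.sleep(delay)
        return False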
