[Python Crawler] Scrape the name and URL of every question in your Zhihu collections and save them to MySQL

冷不防 2022-07-18 16:21

Please credit the original source when reposting. The code is on GitHub (and may be updated): https://github.com/qqxx6661/python/

I was just starting out with Python, and this was a practice project. The code has little real practical value, since you can simply browse your collections on the site itself; there is no need for all this trouble. It is for learning purposes only.

PS: Don't crawl for long stretches, or Zhihu may block your IP. The proxy handling in this code is somewhat broken: Zhihu is served over HTTPS, so a plain HTTP proxy can't be used, and in practice requests still go out over the local IP even with the proxy configured.
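A quick way to see what the proxy is (or isn't) doing is to hit an IP-echo service over HTTPS with and without the proxy and compare the results. This is a minimal sketch, not part of the original script; it assumes httpbin.org/ip as the echo endpoint and reuses the sample proxies from the code below:

# Minimal sketch (not in the original script): check which IP
# HTTPS requests actually exit from. httpbin.org/ip is an assumed
# echo endpoint; any service that reports your IP would work.
import requests

proxies = {"http": "http://42.159.195.126:80",
           "https": "https://115.225.250.91:8998"}  # sample proxies from the script below

direct = requests.get('https://httpbin.org/ip').text
proxied = requests.get('https://httpbin.org/ip', proxies=proxies).text
print('direct:  ' + direct)
print('proxied: ' + proxied)
# If both report the same IP, the HTTPS proxy is being bypassed.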

This program uses (and gives a first look at):

1. Connecting to a database from Python: mysql-connector

2. Regular expressions with the re module (a small self-contained sketch follows this list)

3. requests usage: proxies, POST, GET, headers, and so on

4. Fetching the login captcha

5. Saving and reading files
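Item 2 is the backbone of the whole script: every piece of data is pulled out of raw HTML by re.findall with a capture group. Here is a minimal, self-contained sketch of that technique on a made-up HTML snippet (the sample links are invented):

# Minimal sketch of the extraction idea used throughout the script:
# a capture group in re.findall returns just the relative links.
import re

html = ('<a href="/collection/19551147">Favorites A</a>'
        '<a href="/collection/20098203">Favorites B</a>')  # made-up sample HTML
re_url = re.compile(r'<a href="(/collection/\d+?)"')
print(re.findall(re_url, html))  # ['/collection/19551147', '/collection/20098203']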

Screenshots of a run (images not reproduced here).

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import requests
import re
import time
from subprocess import Popen
import mysql.connector


class Spider:
    def __init__(self):
        print 'Initializing spider......'
        self.s = requests.session()
        # Proxy settings
        self.proxies = {"http": "http://42.159.195.126:80", "https": "https://115.225.250.91:8998"}
        try:
            r = self.s.get('http://www.baidu.com/', proxies=self.proxies)
        except requests.exceptions.ProxyError, e:
            print e
            print 'Proxy is dead, please edit the code and switch to another one'
            exit()
        print 'Proxy passed the check, good to go'
        # Headers for Zhihu
        self.headers = {
            'Accept': '*/*',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
            'X-Requested-With': 'XMLHttpRequest',
            'Referer': 'https://www.zhihu.com',
            'Accept-Language': 'zh-CN,zh;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE 2.X MetaSr 1.0',
            'Host': 'www.zhihu.com'
        }
        # Fetch the home page, pull the _xsrf token out of it, then grab a captcha
        r = self.s.get('https://www.zhihu.com', headers=self.headers, proxies=self.proxies)
        cer = re.compile('name=\"_xsrf\" value=\"(.*)\"', flags=0)
        strlist = cer.findall(r.text)
        _xsrf = strlist[0]
        captcha_url = 'https://www.zhihu.com/captcha.gif?r=' + str(int(time.time() * 1000))
        r = self.s.get(captcha_url, headers=self.headers)
        with open('code.gif', 'wb') as f:
            f.write(r.content)
        # Open the captcha image with the system default viewer
        Popen('code.gif', shell=True)
        # raw_input, not input: in Python 2, input() would eval the string
        captcha = raw_input('captcha: ')
        login_data = {  # fill in your own account info here
            '_xsrf': _xsrf,
            'phone_num': 'your_phone_number',
            'password': 'your_password',
            'remember_me': 'true',
            'captcha': captcha
        }
        # Log in to Zhihu
        r = self.s.post('https://www.zhihu.com/login/phone_num', data=login_data, headers=self.headers, proxies=self.proxies)
        print(r.text)

    def get_collections(self):  # fetch collections and dump them into MySQL
        # Collections created by the logged-in account
        r = self.s.get('http://www.zhihu.com/collections/mine', headers=self.headers, proxies=self.proxies)
        # Collections the account follows (alternative)
        #r = self.s.get('https://www.zhihu.com/collections', headers=self.headers, proxies=self.proxies)
        # Extract the collection URLs and names
        re_mine_url = re.compile(r'<a href="(/collection/\d+?)"')
        list_mine_url = re.findall(re_mine_url, r.content)
        re_mine_name = re.compile(r'\d"\s>(.*?)</a>')
        list_mine_name = re.findall(re_mine_name, r.content)
        # Turn the relative links into absolute URLs
        for i in range(len(list_mine_url)):
            list_mine_url[i] = 'https://www.zhihu.com' + list_mine_url[i]
        conn = mysql.connector.connect(user='root', password='123456', database='test')
        cursor = conn.cursor()
        for i in range(len(list_mine_url)):
            page = 0
            # One table per collection, named after the collection
            # (note: names containing spaces or punctuation will break this statement)
            sql = 'create table %s (id integer(10) primary key auto_increment, name varchar(100), address varchar(100))' % list_mine_name[i].decode("UTF-8")
            print sql
            cursor.execute(sql)
            conn.commit()
            while 1:
                # Walk the collection page by page until a page has no questions
                page += 1
                col_url = list_mine_url[i] + '?page=' + str(page)
                print 'Crawling:', col_url
                r = self.s.get(col_url, headers=self.headers, proxies=self.proxies)
                re_col_url = re.compile(r'href="(/question/\d*?)">.+?</a></h2>')
                list_col_url = re.findall(re_col_url, r.content)
                re_col_name = re.compile(r'\d">(.+?)</a></h2>')
                list_col_name = re.findall(re_col_name, r.content)
                if list_col_name:
                    for j in range(len(list_col_url)):
                        list_col_url[j] = 'https://www.zhihu.com' + list_col_url[j]
                    for j in range(len(list_col_url)):
                        sql = 'insert into %s (name, address) values ("%s", "%s")' % (list_mine_name[i].decode("UTF-8"), list_col_name[j].decode("UTF-8"), list_col_url[j])
                        cursor.execute(sql)
                        conn.commit()
                else:
                    print 'No more questions in this collection'
                    break
        cursor.close()


spider = Spider()
spider.get_collections()
print 'Crawler finished, go check the database'
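One weak spot worth noting: the script splices scraped strings into SQL with % formatting, so a question title containing a double quote will break the INSERT (and it is unsafe in general). mysql-connector supports parameterized queries for values (table names cannot be parameterized); a minimal sketch of the safer insert, assuming the same local test database and a hypothetical my_collection table:

# Minimal sketch (assumes the same local 'test' database as above;
# the table name 'my_collection' is hypothetical). Values are passed
# as execute() parameters, so quotes in titles cannot break the SQL.
import mysql.connector

conn = mysql.connector.connect(user='root', password='123456', database='test')
cursor = conn.cursor()
sql = 'insert into my_collection (name, address) values (%s, %s)'
cursor.execute(sql, ('How should I learn Python?', 'https://www.zhihu.com/question/123456'))
conn.commit()
cursor.close()
conn.close()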
