Scraping the Zhihu Hot List with a Python Crawler

今天药忘吃喽~ · 2021-07-27 00:57

Without further ado, here's the code! It requests the billboard page, extracts each entry's title with XPath, pulls the question links and excerpts out of the JSON embedded in the page source with two regexes, and appends everything to text files on the desktop.

```python
import requests
import re
from lxml import etree

# Regexes that pull question links and excerpts out of the JSON blob
# embedded in the page's HTML source
content_re = re.compile('"excerptArea":{"text":"(.*?)"}')
url_re = re.compile('"link":{"url":"(.*?)"')
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36 Edg/84.0.522.59"}

def get_hot(url):
    html = requests.get(url, headers=headers)
    soup = etree.HTML(html.text)
    # Each hot-list entry is an <a class="HotList-item"> element
    hots = soup.xpath('//a[@class="HotList-item"]')
    for h in hots:
        title = "标题:" + h.xpath('div/div[@class="HotList-itemTitle"]/text()')[0] + '\n'
        # encoding='utf-8' avoids mojibake with Chinese text on Windows
        with open("C:/Users/86135/Desktop/知乎热榜.txt", 'at', encoding='utf-8') as f:
            f.write(title)
    # Thumbnail URLs sit on the <img> tags inside each item's image container
    images = soup.xpath('//div[@class="HotList-itemImgContainer"]/img/@src')
    for i in images:
        with open("C:/Users/86135/Desktop/知乎图片链接.txt", 'at', encoding='utf-8') as f:
            f.write(i + '\n')
    # Question links and excerpts come straight from the embedded JSON
    urls = url_re.findall(html.text)
    for u in urls:  # renamed from `url` so it doesn't shadow the parameter
        with open("C:/Users/86135/Desktop/知乎热榜.txt", 'at', encoding='utf-8') as f:
            f.write(u + '\n')
    contents = content_re.findall(html.text)
    for c in contents:
        with open("C:/Users/86135/Desktop/知乎热榜.txt", 'at', encoding='utf-8') as f:
            f.write('#' + c + '\n')

if __name__ == '__main__':
    url = "https://www.zhihu.com/billboard"
    get_hot(url)
```
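
The two regexes work because the billboard page ships its data as a JSON blob inside the HTML source. If you would rather parse that JSON structurally instead of regexing it, here is a minimal sketch. It assumes the data sits in a `<script id="js-initialData">` tag and guesses the key path `initialState → topstory → hotList`; only `excerptArea.text` and `link.url` are confirmed by the regexes above, so verify the rest against the live page source, which Zhihu may change at any time:

```python
import json
import requests
from lxml import etree

headers = {"User-Agent": "Mozilla/5.0"}

def get_hot_json(url):
    html = requests.get(url, headers=headers)
    soup = etree.HTML(html.text)
    # Assumption: the billboard page embeds its initial state as JSON here
    raw = soup.xpath('//script[@id="js-initialData"]/text()')
    if not raw:
        return []  # layout changed; fall back to the regex approach above
    state = json.loads(raw[0])
    # Assumption: key path inferred from the page's JSON, not a documented API
    items = state.get("initialState", {}).get("topstory", {}).get("hotList", [])
    results = []
    for item in items:
        target = item.get("target", {})
        results.append({
            "title": target.get("titleArea", {}).get("text"),      # assumed key
            "url": target.get("link", {}).get("url"),              # matches url_re
            "excerpt": target.get("excerptArea", {}).get("text"),  # matches content_re
        })
    return results

if __name__ == "__main__":
    for entry in get_hot_json("https://www.zhihu.com/billboard"):
        print(entry)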

A sample run looks like this:

[screenshot of the output file omitted]
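
Even without the screenshot, the layout of 知乎热榜.txt follows directly from the write calls above: all titles first, then all question links, then all excerpts prefixed with `#`. Schematically (placeholder values, not real output):

```
标题:<title of hot question 1>
标题:<title of hot question 2>
...
<link URL of hot question 1>
<link URL of hot question 2>
...
#<excerpt of hot question 1>
#<excerpt of hot question 2>
```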
