Improving a Qiushibaike Scraper

Myth丶恋晨 2022-08-10 00:56

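Nothing better to do, so let's scrape some jokes!

I came across someone's code and, having some free time, modified it to scrape the text-only posts from Qiushibaike.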

    #!/usr/bin/env python3
    '''
    Text-joke scraper for qiushibaike.com
    '''
    import re
    import threading
    import time
    import urllib.request


    class Spider_Model():
        def __init__(self):
            self.page = 1        # number of the next page to download
            self.pages = []      # downloaded pages that have not been shown yet
            self.enable = False  # set to False to stop both loops

        def GetPage(self, page):
            '''Download one listing page and return its jokes as a list of strings.'''
            myurl = 'http://www.qiushibaike.com/textnew/page/' + page
            user_agent = 'Mozilla/5.0 (X11; Linux x86_64)'
            headers = {'User-Agent': user_agent}
            req = urllib.request.Request(myurl, headers=headers)
            myres = urllib.request.urlopen(req)
            mypage = myres.read()
            unicodepage = mypage.decode('utf-8')
            # Capture the text of each <div class="content"> up to the HTML
            # comment that follows it.
            myItems = re.findall(r'<div.*?class="content">(.*?)<!--.*?-->.*?</div>', unicodepage, re.S)
            Items = []
            for item in myItems:
                # Drop raw newlines, then turn <br/> tags into real line breaks.
                item = item.replace('\n', '')
                Items.append(item.replace('<br/>', '\n'))
            return Items

        def LoadPage(self):
            '''Background loop: keep up to two pages buffered ahead of the reader.'''
            while self.enable:
                if len(self.pages) < 2:
                    try:
                        mypage = self.GetPage(str(self.page))
                        self.page += 1
                        self.pages.append(mypage)
                    except Exception:
                        print('Cannot connect to the URL.')
                        time.sleep(1)  # wait before retrying instead of looping at full speed
                else:
                    time.sleep(1)

        def ShowPage(self, nowPage, page):
            '''Print one page item by item; Enter shows the next item, "quit" stops.'''
            print('\n\n############################ Page %d #################################\n\n' % page)
            for item in nowPage:
                print(item)
                myinput = input()
                if myinput == 'quit':
                    self.enable = False
                    break

        def start(self):
            page = self.page
            self.enable = True
            print('waiting..............')
            # Download in a background thread while this thread displays pages.
            threading.Thread(target=self.LoadPage, daemon=True).start()
            while self.enable:
                if self.pages:
                    nowpage = self.pages[0]
                    del self.pages[0]
                    self.ShowPage(nowpage, page)
                    page += 1
                else:
                    time.sleep(0.1)  # nothing buffered yet; avoid a busy loop


    if __name__ == '__main__':
        # --------- program entry point ---------
        print('''
    -------------------------------------------------
    xxxx
    x
    xxx
    xxx
    -------------------------------------------------
    ''')
        print('Press Enter to continue......')
        input()
        mymodel = Spider_Model()
        mymodel.start()

Everything is kept as simple as possible. No explanations, no walkthrough; criticism is welcome!
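
The extraction step in `GetPage` can be tried on its own, without touching the network. The sketch below (Python 3) applies the same regular expression to a small piece of markup. `SAMPLE_HTML` and the helper name `extract_items` are assumptions made up for this illustration; the real page markup may differ.

```python
import re

# Same pattern as in GetPage: capture the text inside each
# <div class="content"> up to the HTML comment that follows it.
CONTENT_RE = re.compile(r'<div.*?class="content">(.*?)<!--.*?-->.*?</div>', re.S)

# Hypothetical markup, written only for this example.
SAMPLE_HTML = '''
<div class="article">
<div class="content">
First line<br/>second line
<!--1376060000-->
</div>
</div>
<div class="article">
<div class="content">
Another joke
<!--1376060001-->
</div>
</div>
'''


def extract_items(html):
    """Return the text of every content block, with <br/> turned into newlines."""
    items = []
    for raw in CONTENT_RE.findall(html):
        text = raw.replace('\n', '')               # drop raw newlines from the HTML
        items.append(text.replace('<br/>', '\n'))  # keep the intended line breaks
    return items


items = extract_items(SAMPLE_HTML)
for item in items:
    print(item)
    print('---')
```

The pattern only matches blocks whose text is followed by an HTML comment, so if the site's markup changes, `re.findall` simply returns an empty list rather than raising an error.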

For details, see: http://blog.csdn.net/pleasecallmewhy/article/details/8932310
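
`LoadPage` and `start` in the script above form a small producer-consumer pair: a background thread keeps up to two pages buffered while the main thread displays them, with both sides polling a plain list. As an alternative sketch (Python 3), and not the code from the post, the same idea can be written with the standard `queue.Queue`, whose `put` and `get` block instead of polling. `fetch_page`, `loader` and `read_all` are hypothetical names, and `fetch_page` returns fake data so the example runs offline.

```python
import queue
import threading


def fetch_page(page_number):
    # Stand-in for GetPage (assumption): returns fake items instead of downloading.
    return ['joke %d-%d' % (page_number, i) for i in range(1, 3)]


def loader(buffer, first_page, last_page):
    """Producer: fetch pages in order; put() blocks while the buffer is full."""
    for page_number in range(first_page, last_page + 1):
        buffer.put((page_number, fetch_page(page_number)))
    buffer.put(None)  # sentinel: no more pages


def read_all(first_page=1, last_page=3):
    """Consumer: take pages from the buffer until the sentinel arrives."""
    buffer = queue.Queue(maxsize=2)  # at most two pages prefetched, as in LoadPage
    worker = threading.Thread(target=loader,
                              args=(buffer, first_page, last_page),
                              daemon=True)
    worker.start()
    shown = []
    while True:
        entry = buffer.get()  # blocks until a page is ready
        if entry is None:
            break
        shown.append(entry)
    worker.join()
    return shown


pages = read_all()
for page_number, items in pages:
    print('Page %d: %s' % (page_number, items))
```

With a bounded queue there is no need for `time.sleep` on either side, and the sentinel value gives the reader a clean way to learn that the loader has finished.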
