# Web Scraping: Data Extraction #

### Contents ###

* Part 1: The requests module
  * 1: Converting between cookieJar and dict
  * 2: Certificate verification issues
  * 3: Setting a timeout
  * 4: The retrying module
* Part 2: Data extraction
  * 2.1: Extracting JSON data with jsonpath
    * 1: jsonpath syntax rules
    * 2: Basic usage of jsonpath
    * 3: Crawling Lagou's city JSON data
    * 4: Crawling Douban movie data
  * 2.2: Extracting HTML and XML with BeautifulSoup
    * 1: soup attributes and methods
    * 2: Locating elements: find/find_all
    * 3: Locating elements: select + CSS selectors
    * 4: Crawling Sina military news
  * 2.3: Extracting data with XPath
    * 1: Basic XPath syntax
    * 2: Selecting specific nodes
  * 2.4: Extracting data with lxml
    * 1: Basic usage of lxml
    * 2: A deeper lxml exercise

### Part 1: The requests module ###

#### 1: Converting between cookieJar and dict ####

* 1: requests.utils.dict_from_cookiejar(): converts a cookieJar into a dict.
* 2: requests.utils.cookiejar_from_dict(): converts a dict into a cookieJar.

```python
import requests

"""
requests returns cookies as a cookieJar, which needs to be converted into a dict
"""
url = "https://www.baidu.com"
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
}
response = requests.get(url, headers=headers)
print(response.cookies)
cookieJar = response.cookies

# 1: convert the cookieJar into a dict
cookie_dict = requests.utils.dict_from_cookiejar(cookieJar)
print(cookie_dict)
print(type(cookie_dict))

# 2: convert the dict back into a cookieJar
cookieJar2 = requests.utils.cookiejar_from_dict(cookie_dict)
print(cookieJar2)
print(type(cookieJar2))
```

#### 2: Certificate verification issues ####

* Pass verify=False with the request to skip certificate verification and access the site directly.
* Requests sent with verify=False emit an InsecureRequestWarning; call urllib3.disable_warnings() to suppress it.

```python
import requests
# requests is built on top of urllib3
import urllib3

"""
How do we bypass a broken certificate and still access the site?
"""
# suppress all urllib3 warnings
urllib3.disable_warnings()

url = "https://sam.huat.edu.cn:8443/selfservice/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
}
# verify=False disables certificate verification
response = requests.get(url, headers=headers, verify=False)
# utf-8 raises a decode error on this page, so decode as GBK instead
print(response.content.decode('gbk'))
```

#### 3: Setting a timeout ####

* Pass the timeout parameter when sending the request; if no response arrives within that time, an exception is raised.

```python
import requests

"""
I don't want to wait too long
"""
url = "http://www.itcast.cn"
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
}
# timeout: raise an exception if the request takes too long
response = requests.get(url, timeout=1, headers=headers)
print(response.content.decode())
```
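As a follow-up, here is a minimal sketch of handling that exception instead of letting it crash the crawler (the URL is the placeholder from above and the 3-second limit is arbitrary): requests raises requests.exceptions.Timeout when the time limit is exceeded, so wrapping the call in try/except lets you log the failure or fall back.

```python
import requests

url = "http://www.itcast.cn"  # placeholder URL from the example above

try:
    # the request must finish within 3 seconds
    response = requests.get(url, timeout=3)
    print(response.status_code)
except requests.exceptions.Timeout:
    # raised when the server does not respond in time
    print("request timed out")
```

For automatic retries instead of a one-shot fallback, the retrying module below wraps this pattern in a decorator.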
#### 4: The retrying module ####

* Decorate the request function with @retry(stop_max_attempt_number=4); the argument caps the maximum number of attempts. If every attempt fails, the final exception propagates to the caller.

```python
import requests
from retrying import retry

# maximum number of attempts
@retry(stop_max_attempt_number=4)
def send_request():
    url = "http://www.google.com"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
    }
    print("sending request")
    response = requests.get(url, headers=headers, timeout=5)
    return response

if __name__ == '__main__':
    send_request()
```

### Part 2: Data extraction ###

#### 2.1: Extracting JSON data with jsonpath ####

##### 1: jsonpath syntax rules #####

* `$` : the root node
* `@` : the current node
* `.` or `[]` : take a child node
* `..` : recursive descent; select matching nodes regardless of depth
* `*` : wildcard; matches all element nodes

![jsonpath syntax reference][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center]

Case tests:

> Test expressions with the online tool at http://jsonpath.com/

1: Get the information of all authors:

![all authors][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 1]

2: Get all elements under store:

![all elements under store][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 2]

3: Get the third book: `$.store.book[2]`.

For the last book: `$.store.book[-1:]`; note that only the slice form works here, not the plain index `[-1]`. Alternatively, use `$.store.book[(@.length -1)]`, where `@` is the current node and the length minus one is the index of the last element.

For the first two books: `$.store.book[:2]`.

![the third book][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 3]

4: Get all elements that have an isbn property:

> `?()` is the filter operator.
> `@.isbn` selects the current nodes that have an isbn field.

![books with an isbn][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 4]
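The four case tests above can also be reproduced in Python. A quick sketch, assuming the same jsonpath package used throughout this section (`pip install jsonpath`) and a trimmed-down copy of the bookstore document that appears in full in the next subsection; the filter in the last line mirrors case 4:

```python
from jsonpath import jsonpath  # pip install jsonpath

# abbreviated bookstore document (full version in the next subsection)
book_dict = {"store": {"book": [
    {"author": "Nigel Rees", "price": 8.95},
    {"author": "Evelyn Waugh", "price": 12.99},
    {"author": "Herman Melville", "isbn": "0-553-21311-3", "price": 8.99},
    {"author": "J. R. R. Tolkien", "isbn": "0-395-19395-8", "price": 22.99},
], "bicycle": {"color": "red", "price": 19.95}}}

print(jsonpath(book_dict, expr="$..author"))           # 1: all authors
print(jsonpath(book_dict, expr="$.store.*"))           # 2: everything under store
print(jsonpath(book_dict, expr="$.store.book[2]"))     # 3: the third book
print(jsonpath(book_dict, expr="$.store.book[-1:]"))   #    the last book (slice only)
print(jsonpath(book_dict, expr="$..book[?(@.isbn)]"))  # 4: books that have an isbn
```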
##### 2: Basic usage of jsonpath #####

* author_list = jsonpath(book_dict, expr="$.store.book..author")
* argument 1: the dict to search
* argument 2: expr='the matching rule'

```python
from jsonpath import jsonpath

book_dict = {
    "store": {
        "book": [
            {"category": "reference",
             "author": "Nigel Rees",
             "title": "Sayings of the Century",
             "price": 8.95},
            {"category": "fiction",
             "author": "Evelyn Waugh",
             "title": "Sword of Honour",
             "price": 12.99},
            {"category": "fiction",
             "author": "Herman Melville",
             "title": "Moby Dick",
             "isbn": "0-553-21311-3",
             "price": 8.99},
            {"category": "fiction",
             "author": "J. R. R. Tolkien",
             "title": "The Lord of the Rings",
             "isbn": "0-395-19395-8",
             "price": 22.99}
        ],
        "bicycle": {
            "color": "red",
            "price": 19.95
        }
    }
}

# 1: extract all author names via a jsonpath rule
author_list = jsonpath(book_dict, expr="$.store.book..author")
# extract the value of a key at any depth
author_list2 = jsonpath(book_dict, expr="$..author")
print(author_list)
print(author_list2)

# 2: extract all prices
price_list = jsonpath(book_dict, expr="$..price")
print(price_list)
```

##### 3: Crawling Lagou's city JSON data #####

```python
import requests
from jsonpath import jsonpath
import json

url = "http://www.lagou.com/lbs/getAllCitySearchLabels.json"
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
}
response = requests.get(url, headers=headers)
# parse the JSON body into a dict
response_dict = response.json()

# 1: extract the city list with the extraction rule expr="$..name"
city_list = jsonpath(response_dict, expr="$..name")
print(city_list)

# 2: write the list to a file
with open("city.txt", "w", encoding="utf-8") as f:
    # note: ensure_ascii=False keeps the Chinese city names readable
    f.write(json.dumps(city_list, ensure_ascii=False, indent=4))
```

##### 4: Crawling Douban movie data #####

```python
"""
Analysis of the Douban movie crawl:
url: https://movie.douban.com/j/search_subjects?
query parameters:
    type: movie
    tag: 最新 ("latest"; the demo below uses 热门, "hot")
    page_limit: 20
    page_start: 0   # increases in steps of 20
response format: json
"""
import json
import requests


class MovieSpider(object):

    def __init__(self, tag, page_start):
        """initialize the request parameters"""
        self.url = "https://movie.douban.com/j/search_subjects"
        self.params_dict = {
            "type": "movie",
            "tag": tag,
            "page_limit": 20,
            "page_start": page_start
        }
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
        }

    def send_request(self):
        """send the request and return the parsed response"""
        # returns a response object
        response = requests.get(self.url, headers=self.headers, params=self.params_dict)
        # decode the response body into a JSON string
        response_json = response.content.decode()
        # parse it into a dict
        response_dict = json.loads(response_json)
        return response_dict

    def parser_response(self, response_dict):
        """parse the response"""
        # get the subjects list first
        subjects_list = response_dict['subjects']
        # collect only the fields we care about
        movie_dict_list = []
        for movie in subjects_list:
            my_dict = {}
            my_dict['movie_name'] = movie['title']
            my_dict['movie_url'] = movie['url']
            movie_dict_list.append(my_dict)
        return movie_dict_list

    def run(self):
        response_dict = self.send_request()
        movie_dict_list = self.parser_response(response_dict)
        return movie_dict_list


if __name__ == '__main__':
    # crawl the first 20 pages (20 movies per page)
    for i in range(0, 400, 20):
        spider = MovieSpider("热门", i)
        movie_dict_list = spider.run()
        print(movie_dict_list)
```

#### 2.2: Extracting HTML and XML with BeautifulSoup ####

* 1: Install `bs4: pip install bs4` and `pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple`
* 2: Pros: parsing HTML is simple and the API is very user-friendly; it supports CSS selectors, the HTML parser from the Python standard library, and lxml's XML parser.
* 3: Cons: lxml only traverses the parts of the document it needs, whereas Beautiful Soup works on the HTML DOM: it loads the whole document and parses the complete DOM tree, so its time and memory overhead are much higher and its performance is lower than lxml's.

##### 1: soup attributes and methods #####

* 1: Create a soup object: soup = BeautifulSoup(html, 'lxml')
* 2: Complete missing HTML tags and pretty-print: soup.prettify()
* 3: Get a whole tag element: print(soup.tagname)
* 4: Get an element's text: soup.title.get_text(); get the tag first, then call get_text().
* 5: Get an element's attribute: soup.a.get('href'); get the tag first, then call get().

```python
from bs4 import BeautifulSoup

"""
mainly used to parse HTML and XML
"""
# 1: prepare an HTML string:
html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

# 2: build the BeautifulSoup parse object
# specify the parser explicitly to silence the warning
soup = BeautifulSoup(html, 'lxml')

# 3: extract data from the parse object
# 3.1: the parse object can complete missing HTML tags
# print(soup.prettify())

# 3.2: get a whole tag, but only the first match:
# print(soup.a)
# print(type(soup.a))
# <a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>
# <class 'bs4.element.Tag'>

# 3.3: get an element's text content:
print(soup.title)
print(type(soup.title))
print(soup.title.get_text())
# <title>The Dormouse's story</title>
# <class 'bs4.element.Tag'>
# The Dormouse's story

# 3.4: get an element's attribute value:
# print(soup.p.get('class'))   # ['title']
```

##### 2: Locating elements: find/find_all #####

```python
"""
find returns the first element that matches;
find_all returns a list of all matching elements
"""
# 1: import the modules
from bs4 import BeautifulSoup
import re

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

# 1: find all a tags; returns a list
# print(soup.find_all(name='a'))

# 2: find all a and b tags
# print(soup.find_all(name=['a', 'b']))

# 3: find elements whose tag name matches a regex; returns a list of tags
# print(soup.find_all(name=re.compile('^b')))

# 4: find elements by attribute (note: attrs takes a dict)
# print(soup.find_all(attrs={"class": "title"}))

# 5: find elements via **kwargs keyword arguments
# note the trailing underscore in class_, because class is a Python keyword
print(soup.find_all(class_='sister'))
print(soup.find_all(id='link2'))
```

##### 3: Locating elements: select + CSS selectors #####

```python
from bs4 import BeautifulSoup

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html, 'lxml')

# 1: class selector; returns a list
print(soup.select('.sister'))

# 2: id selector; returns a list
print(soup.select('#link2'))

# 3: tag selector
print(soup.select('a'))

# 4: attribute selector
print(soup.select('a[class="sister"]'))

# 5: descendant selector: b tags inside p tags
print(soup.select('p b'))
# [<b>The Dormouse's story</b>]
```

##### 4: Crawling Sina military news #####

```python
"""
Crawl the titles and links of Sina military news and write them to a text file.
Analysis:
url: http://mil.news.sina.com.cn/roll/index.d.html?cid=57918&page=2
parameter: page is the page number
return value: an HTML page; the data lives in the markup as strings
note: testing shows there are only 12 pages of military news
"""
import requests
from bs4 import BeautifulSoup
import time


class XinLangNewsSpider(object):

    def __init__(self):
        # build the URLs for pages 1 through 12
        self.urls = ["http://mil.news.sina.com.cn/roll/index.d.html?cid=57918&page={}".format(i) for i in range(1, 13)]
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36'
        }

    def send_request(self, url):
        response = requests.get(url, headers=self.headers)
        # return the page as a string
        return response.content.decode()

    def parser_response(self, html_str):
        # 1: build the BeautifulSoup object
        soup = BeautifulSoup(html_str, 'lxml')
        # 2: use a descendant selector to get all news a tags
        a_list = soup.select(".linkNews li a")
        # 3: extract every title and link
        news_list = []
        for i in a_list:
            news_dict = {}
            # the tag's text content
            news_dict['news_title'] = i.get_text()
            # the tag's link
            news_dict['news_url'] = i.get("href")
            news_list.append(news_dict)
        return news_list

    def write_to_txt(self, news_list):
        with open("新浪军事新闻.txt", "a", encoding='utf-8') as f:
            for news_dict in news_list:
                f.write(news_dict['news_title'])
                f.write(' : ')
                f.write(news_dict['news_url'])
                f.write('\n')

    def run(self):
        for url in self.urls:
            # print(url)
            # time.sleep(2)
            html_str = self.send_request(url)
            news_list = self.parser_response(html_str)
            self.write_to_txt(news_list)


if __name__ == '__main__':
    xinlang_spider = XinLangNewsSpider()
    xinlang_spider.run()
```

Scrape results:

![scrape results][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 5]

#### 2.3: Extracting data with XPath ####

##### 1: Basic XPath syntax #####

![XPath syntax reference][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 6]

![XPath syntax reference (continued)][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 7]
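The tables above are screenshots, so here is a small runnable sketch of the most common forms, evaluated with lxml (covered in detail in part 2.4); the HTML snippet is made up for illustration:

```python
from lxml import etree

html_str = """
<div>
  <ul>
    <li class="item-1"><a href="link1.html">first item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a></li>
  </ul>
</div>
"""
root = etree.HTML(html_str)

print(root.xpath("//li"))                   # //       : nodes anywhere in the document
print(root.xpath("//li/a/text()"))          # text()   : the text inside the matched nodes
print(root.xpath("//li/a/@href"))           # @attr    : the value of an attribute
print(root.xpath("//li[@class='item-0']"))  # [@a='v'] : filter by attribute value
```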
href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(html, 'lxml') # 1:查找所有的a标签返回的是列表 # print(soup.find_all(name='a')) # 2:查找所有的a和b标签 # print(soup.find_all(name=['a', 'b'])) # 3: 通过正则表达式查找元素,返回的是列表套标签 # print(soup.find_all(name=re.compile('^b'))) # 4: 通过属性查找元素: # print(soup.find_all(attrs={"class", "title"})) # 5: 通过**kwargs 关键字参数查找元素 # 注意class_,表示python的class类。 print(soup.find_all(class_='sister')) print(soup.find_all(id='link2')) ##### 3: 查找定位元素:select + CSS样式选择器: ##### from bs4 import BeautifulSoup html = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title" name="dromouse"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """ soup = BeautifulSoup(html, 'lxml') # 1:使用类选择器:返回的是列表 print(soup.select('.sister')) # 2: 使用ID选择器,返回列表 print(soup.select('#link2')) # 3:使用标签选择器 print(soup.select('a')) # 4: 使用属性选择器: print(soup.select('a[class="sister"]')) # 5: 层级选择器:p标签下的b标签 print(soup.select('p b')) # [<b>The Dormouse's story</b>] ##### 4:爬取新浪军事新闻: ##### """ 爬取新浪网站的新闻,标题和链接。 写入到文本文件中 分析: url :http://mil.news.sina.com.cn/roll/index.d.html?cid=57918&page=2 参数:page表示第几页 返回值:抓取的是网页中的信息,返回的是标签得字符串形式 注意:测试发现军情信息只有12页的新闻 """ import requests from bs4 import BeautifulSoup import time class XinLangNewsSpider(object): def __init__(self): # 构造13页的网页链接 self.urls = ["http://mil.news.sina.com.cn/roll/index.d.html?cid=57918&page={}".format(i) for i in range(1, 13)] self.headers = { 'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.121 Mobile Safari/537.36' } def send_request(self, url): response = requests.get(url, headers=self.headers) # 将网页转换成字符串返回 return response.content.decode() def parser_response(self, html_str): # 1:创建BeautifulSoup对象 soup = BeautifulSoup(html_str, 'lxml') # 2:使用层级选择器获取所有新闻的a标签列表 a_list = soup.select(".linkNews li a") # 3:分组提取所有的网页标题和链接 news_list = [] for i in a_list: new_dict = { } # 获取标签的文本内容 new_dict['news_title'] = i.get_text() # 获取标签的链接 new_dict['news_url'] = i.get("href") news_list.append(new_dict) return news_list def write_to_txt(self, new_list): with open("新浪军事新闻.txt", "a", encoding='utf-8') as f: for new_dict in new_list: f.write(new_dict['news_title'] ) f.write(' : ') f.write(new_dict['news_url']) f.write('\n') def run(self): for url in self.urls: # print(url) # time.sleep(2) html_str = self.send_request(url) new_list = self.parser_response(html_str) self.write_to_txt(new_list) if __name__ == '__main__': xinlang_spider = XinLangNewsSpider() xinlang_spider.run() 抓取效果: ![在这里插入图片描述][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 5] #### <三>:Xpath提取数据: #### ##### 1: 简单的Xpath语法: ##### ![在这里插入图片描述][watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 6] 
#### 2.4: Extracting data with lxml ####

##### 1: Basic usage of lxml #####

* etree.HTML converts a string into an Element object. An Element has an xpath() method that returns its results as a list, and etree.HTML accepts both bytes and str input.

```python
from lxml import html

html_str = '''
<div>
    <ul>
        <li class="item-1"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
etree = html.etree

# 1: convert the string into an HTML Element object:
eroot = etree.HTML(html_str)

# 2: note: the parse object completes any missing HTML tags automatically.
# tostring() converts the Element back into a string; it returns bytes
# print(etree.tostring(eroot).decode())

# 3: extract tags or values with xpath expressions:
text_list = eroot.xpath("//li[@class='item-1']/a/text()")
url_list = eroot.xpath("//li[@class='item-1']/a/@href")
print(text_list)
print(url_list)
```

##### 2: A deeper lxml exercise #####

```python
"""
If we want to combine each link and title into one dict, what can go wrong?
"""
from lxml import html

etree = html.etree
text = '''
<div>
    <ul>
        <li class="item-1"><a href="link1.html">first item</a></li>
        <li class="item-1"><a>second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''

# <Solution 1>: extract the two lists separately and pair them up by index.
# eroot = etree.HTML(text)
# href_list = eroot.xpath("//li[@class='item-1']/a/@href")
# title_list = eroot.xpath("//li[@class='item-1']/a/text()")
# print(href_list)
# print(title_list)
# for href in href_list:
#     item = {}
#     item['href'] = href
#     item['title'] = title_list[href_list.index(href)]
#     print(item)
# Problem: if one of the links has no href attribute, the two lists have
# different lengths and every later pair becomes misaligned.

# <Solution 2>: group first, then extract the text and href within each group
eroot = etree.HTML(text)
# select the whole a tags, giving a list of Element objects
a_list = eroot.xpath("//li[@class='item-1']/a")

# continue extracting inside each group
item_list = []
for a in a_list:
    item = {}
    item["href"] = a.xpath("./@href")[0] if len(a.xpath("./@href")) > 0 else None
    item["title"] = a.xpath("./text()")[0] if len(a.xpath("./text()")) > 0 else None
    item_list.append(item)
print(item_list)
```
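One closing note on the second solution: selecting the smallest repeating unit first (here, each a tag) and then running relative expressions such as ./@href and ./text() inside it generalizes to most scraping tasks. A record with a missing field then gets None instead of silently shifting every later record by one position.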
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center]: /images/20221120/d78a4f77378c40fabf4add4a13e6c246.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 1]: /images/20221120/aa9ae77d4429449284931619f1ee9f58.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 2]: /images/20221120/d7372b01501f4aaa8e6432c4ebc2d52b.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 3]: /images/20221120/67417f489a8b4f939e2bfd8b75c537ad.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 4]: /images/20221120/63eb685777074b07b992003270e0db17.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 5]: /images/20221120/5cd539d0b2ea40dcbfbe78855af25c62.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 6]: /images/20221120/58063f1c988c43bc8805c2941ac4d300.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 7]: /images/20221120/e59295e6ae96464889da652b27de433a.png
[watermark_type_ZmFuZ3poZW5naGVpdGk_shadow_10_text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L3FxXzQxMzQxNzU3_size_16_color_FFFFFF_t_70_pic_center 8]: /images/20221120/66b873fd26b64a4b959e2cef4e6ca17d.png