【Web Scraping】Using BeautifulSoup4: common parsers, find() and find_all(), select()

素颜马尾好姑娘i 2022-11-21 11:21

#### 1.BeautifulSoup4 ####

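BeautifulSoup is a powerful HTML/XML parser, used mainly to parse HTML/XML documents and extract data from them.

**Pros:** Easy to use; supports CSS selectors, the HTML parser in the Python standard library, and the lxml HTML and XML parsers.

**Cons:** It builds and walks the entire DOM tree, so both time and memory overhead are high; performance is worse than lxml's.

Official API documentation  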
[https://beautifulsoup.readthedocs.io/zh\_CN/v4.4.0/][https_beautifulsoup.readthedocs.io_zh_CN_v4.4.0]

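#### 2.Basic usage ####

Object types  
Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.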
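
The four object kinds can be seen in a small sketch. The markup and variable names below are made up for illustration, and `html.parser` is assumed so that no extra package is needed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical snippet chosen so that all four object kinds appear.
doc = "<p class='intro'>Hello<!-- a hidden note --></p>"
soup = BeautifulSoup(doc, "html.parser")

p = soup.p                              # the <p> element
text_node, comment_node = p.contents    # its two children, in document order

print(type(soup).__name__)          # BeautifulSoup (the whole document)
print(type(p).__name__)             # Tag
print(type(text_node).__name__)     # NavigableString (plain text inside a tag)
print(type(comment_node).__name__)  # Comment (a special kind of NavigableString)
```

Because `Comment` subclasses `NavigableString`, an `isinstance(node, Comment)` check is a common way to skip comments when collecting text.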

Installation

    pip install beautifulsoup4
    pip install lxml  # needed for the 'lxml' parser used in the example below

Usage

    from bs4 import BeautifulSoup

    html = """
    <html><head><title>The Dormouse's story</title></head>
    <body>
    <p class="title"><b>The Dormouse's story</b></p>
    <p class="story">Once upon a time there were three little sisters; and their names were
    <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
    <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
    <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
    and they lived at the bottom of a well.</p>
    <p class="story">...</p>
    """

    soup = BeautifulSoup(html, 'lxml')

    print(soup.title)  # the <title> tag
    print(soup.title.get_text())  # text of the <title> tag
    print(soup.a.get("href"))  # href of the first <a>; a Tag can be used like a dictionary
    print(soup.p)  # the first <p> element

    print(soup.find(name='a'))  # find the first <a> element
    print(soup.find_all(name="a"))  # find all <a> elements

    # print(soup.prettify())  # print the whole document, nicely indented
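
The dictionary-like behaviour of a Tag mentioned in the comments above can be explored a little further. This is only a sketch on a single link from the sample document, assuming `html.parser`; the names `snippet` and `link` are illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
link = BeautifulSoup(snippet, "html.parser").a

print(link.name)          # tag name: a
print(link["href"])       # subscript access; raises KeyError if the attribute is missing
print(link.get("title"))  # .get() returns None for a missing attribute
print(link["class"])      # class is multi-valued, so the value is a list: ['sister']
print(link.attrs)         # all attributes as a plain dict
print(link.get_text())    # Elsie
```

Using `.get()` is the safer choice when scraping pages where an attribute may be absent on some elements.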

#### 3.Common parsers ####

<table> 
 <thead> 
  <tr> 
   <th>Parser</th> 
   <th>Usage</th> 
   <th>Advantages</th> 
   <th>Disadvantages</th> 
  </tr> 
 </thead> 
 <tbody> 
  <tr> 
   <td>Python standard library</td> 
   <td>BeautifulSoup(markup, "html.parser")</td> 
   <td>Built into Python<br>Decent speed<br>Lenient with malformed markup</td> 
   <td>Less lenient in old Python versions (before 2.7.3 / 3.2.2)</td> 
  </tr> 
  <tr> 
   <td>lxml HTML parser</td> 
   <td>BeautifulSoup(markup, "lxml")</td> 
   <td>Very fast<br>Lenient with malformed markup</td> 
   <td>Requires an external C library</td> 
  </tr> 
  <tr> 
   <td>lxml XML parser</td> 
   <td>BeautifulSoup(markup, "lxml-xml")<br> BeautifulSoup(markup, "xml")</td> 
   <td>Very fast<br>The only supported XML parser</td> 
   <td>Requires an external C library</td> 
  </tr> 
  <tr> 
   <td>html5lib</td> 
   <td>BeautifulSoup(markup, "html5lib")</td> 
   <td>Most lenient<br>Parses pages the same way a browser does<br>Creates valid HTML5</td> 
   <td>Very slow<br>Requires an external Python package</td> 
  </tr> 
 </tbody> 
</table>
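
Since lxml and html5lib are optional installs, one way to cope is to try parsers in order of preference and fall back to the built-in one. The helper `make_soup` below is hypothetical (it is not part of Beautiful Soup); it relies on `FeatureNotFound`, the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse with the first parser that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            continue  # this parser is not installed, try the next one
    raise RuntimeError("no usable parser found")


broken = "<a></p>"  # deliberately malformed markup
soup, used = make_soup(broken)
print(used)  # name of the parser that was actually used
print(soup)  # the repaired tree; its exact shape depends on the parser
```

Each parser repairs malformed markup in its own way, so results can differ between machines with different parsers installed. Naming the parser explicitly, as the table shows, keeps the output reproducible.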

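#### 4.find() and find\_all() ####

    find(name, attrs, recursive, string, **kwargs)
    find_all(name, attrs, recursive, string, limit, **kwargs)

Both methods search for elements, and as their signatures suggest, they are used in the same way. The difference is that find() returns only the first matching element, or None when nothing matches, while find\_all() returns every match, or an empty list when nothing matches. The `string` argument was called `text` before Beautiful Soup 4.4.0, so older code may still use the old name.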
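
The difference in return values matters when chaining calls. A minimal sketch, assuming `html.parser` and a made-up one-link document:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p><a id="link1" href="/elsie">Elsie</a></p>', "html.parser")

first = soup.find("a")               # a single Tag
missing = soup.find("table")         # None: nothing matched
every = soup.find_all("a")           # list-like result with one Tag
none_found = soup.find_all("table")  # empty list: nothing matched

print(missing)     # None
print(len(every))  # 1
print(none_found)  # []

# find() may return None, so guard before calling methods on the result.
title = missing.get_text() if missing is not None else ""

# find_all() is always safe to iterate, even when empty.
for tag in none_found:
    print(tag)
```

Forgetting the `None` check after find() is a frequent source of `AttributeError` in scrapers.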

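    import re

    soup.find_all(name='a')  # find all <a> tags
    soup.find_all(name=['a', 'b'])  # find all <a> tags and <b> tags
    soup.find_all(name=re.compile("^b"))  # find tags whose name starts with "b"

    soup.find_all("a", class_="sister")  # find <a> tags whose class is "sister"

    soup.find_all(attrs={"attribute-name": "value"})  # match on arbitrary attributes

    soup.find_all(string="Elsie")  # match by text content (`string` was named `text` before 4.4.0)

    soup.find_all(id='link2')  # find elements by id
    soup.find_all(class_="sister")  # find elements by class

    soup.find_all("a", limit=2)  # return at most 2 results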
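
To see what these filters return, here is a runnable sketch against a trimmed copy of the sample document. It assumes `html.parser`, which does not add `<html>` or `<body>` wrappers, so the regular expression only matches `<b>`. The variable names are illustrative.

```python
import re
from bs4 import BeautifulSoup

html = """
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_regex = [tag.name for tag in soup.find_all(re.compile("^b"))]
by_list = [tag.name for tag in soup.find_all(["a", "b"])]
by_attrs = soup.find_all(attrs={"id": "link2"})
limited = soup.find_all("a", limit=2)
hrefs = [a["href"] for a in soup.find_all("a", class_="sister")]

print(by_regex)                 # ['b']
print(by_list)                  # ['b', 'a', 'a', 'a'] (document order)
print(by_attrs[0].get_text())   # Lacie
print(len(limited))             # 2
print(hrefs)                    # the three example.com URLs
```

With the `lxml` parser the regular expression would also match the `<body>` tag that lxml adds, which is one reason results can vary by parser.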

#### 5.select() ####

select() finds elements using CSS selectors.

    print(soup.select(".sister"))  # elements with class "sister"; returns a list
    print(soup.select("#link1"))  # the element with id "link1"; returns a list
    print(soup.select("a"))  # all <a> tags; returns a list
    print(soup.select("p[class=title]"))  # <p> tags whose class is "title"
    print(soup.select("p #link2"))  # the element with id "link2" inside a <p> tag
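
select() always returns a list, while its companion `select_one()` returns the first match or None, mirroring find\_all() and find(). A small sketch, assuming `html.parser` and a two-link excerpt of the sample document:

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Child combinator: <a class="sister"> directly inside <p class="story">.
names = [a.get_text() for a in soup.select("p.story > a.sister")]
second = soup.select_one("#link2")          # first match as a Tag
nothing = soup.select_one("#no-such-id")    # None when nothing matches
empty = soup.select("table")                # empty list when nothing matches

print(names)           # ['Elsie', 'Lacie']
print(second["href"])  # http://example.com/lacie
print(nothing)         # None
print(empty)           # []
```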

For more on CSS selectors, see my earlier post:  
[【css】Commonly used CSS selectors][css_css]

[https_beautifulsoup.readthedocs.io_zh_CN_v4.4.0]: https://beautifulsoup.readthedocs.io/zh_CN/v4.4.0/
[css_css]: https://blog.csdn.net/qq_39147299/article/details/107933086