[Python Web Scraping for Absolute Beginners] Part 2: Parsing with XPath, JsonPath, and BeautifulSoup

青旅半醒 2023-10-09 21:08

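A web crawler is an automated program that retrieves information from the internet, and parsing the retrieved data is one of its most important steps. Python is a popular language for this work, with well-established libraries such as Scrapy, BeautifulSoup, and Requests. This article covers three parsing techniques available in Python: XPath, JsonPath, and BeautifulSoup.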

--------------------

### I. XPath ###

XPath is a query language for selecting nodes in XML or HTML documents. In Python, we can parse HTML or XML with the `lxml` library and then select the nodes we need with XPath expressions.

#### 1. Installation ####

Install with pip:

    pip install lxml

#### 2. Parsing HTML ####

We can load an HTML string with the `HTML()` function from the `lxml.etree` module, then call the `xpath()` method on the result to get the nodes we need.

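    from lxml import etree

    html = '''
    <!DOCTYPE html>
    <html>
      <head>
        <title>Test</title>
      </head>
      <body>
        <ul>
          <li><a href="http://www.baidu.com">Baidu</a></li>
          <li><a href="http://www.qq.com">QQ</a></li>
        </ul>
      </body>
    </html>
    '''

    # etree.HTML() parses an HTML string and returns the root element
    tree = etree.HTML(html)
    # Select every <a> element that sits inside a <ul>/<li>
    links = tree.xpath('//ul/li/a')

    for link in links:
        # .attrib is a dict-like view of the element's attributes
        print(link.attrib['href'])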

Output:

    http://www.baidu.com
    http://www.qq.com

#### 3. Parsing XML ####

To parse XML, we can load an XML string with `etree.fromstring()` (or read an XML file with `etree.parse()`), then select the nodes we need with the `xpath()` method.

    from lxml import etree

    xml = '''
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J.K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
    </bookstore>
    '''

    # etree.fromstring() parses an XML string and returns the root element
    tree = etree.fromstring(xml)
    # Select every <book> element in the document
    books = tree.xpath('//book')

    for book in books:
        # text() returns a list of text nodes, so take the first item
        print(book.xpath('./title/text()')[0], book.xpath('./price/text()')[0])

Output:

    Everyday Italian 30.00
    Harry Potter 29.99
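
Path expressions can also narrow a selection with a predicate on an attribute value. The sketch below illustrates the idea with the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath but needs no installation. It is a swapped-in stand-in for `lxml`, not the approach used above, and the helper name `titles_by_category` and the shortened sample data are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the bookstore sample (assumption: only the fields used here)
XML = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <price>29.99</price>
  </book>
</bookstore>
"""


def titles_by_category(xml_text, category):
    """Return the titles of all <book> elements with the given category.

    Hypothetical helper: the predicate [@category='...'] keeps only the
    elements whose attribute equals the requested value.
    """
    root = ET.fromstring(xml_text.strip())
    books = root.findall(f".//book[@category='{category}']")
    return [book.findtext("title") for book in books]


print(titles_by_category(XML, "children"))  # ['Harry Potter']
```

With `lxml`, the same predicate can be written directly in an `xpath()` call, for example `tree.xpath("//book[@category='children']/title/text()")`.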

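### II. JsonPath ###

JsonPath is a language for locating elements inside JSON data. Like XPath, it provides simple expressions for selecting elements from a JSON object. In Python, the `jsonpath_rw` library can be used to evaluate JsonPath expressions.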

#### 1. Installation ####

Install with pip:

    pip install jsonpath-rw

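#### 2. Parsing JSON ####

We can convert a JSON string into Python objects with `json.loads()`, build a JsonPath expression with `jsonpath_rw.parse()`, and call its `find()` method to get the matching elements. Each match exposes the matched data through its `value` attribute.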

    import json
    from jsonpath_rw import parse

    json_str = '''
    {
      "store": {
        "book": [
          {
            "category": "reference",
            "author": "Nigel Rees",
            "title": "Sayings of the Century",
            "price": 8.95
          },
          {
            "category": "fiction",
            "author": "Evelyn Waugh",
            "title": "Sword of Honour",
            "price": 12.99
          }
        ],
        "bicycle": {
          "color": "red",
          "price": 19.95
        }
      }
    }
    '''

    # Convert the JSON text into nested Python dicts and lists
    data = json.loads(json_str)
    # $ is the root; [*] matches every element of the "book" array
    books = parse('$.store.book[*]').find(data)

    for book in books:
        # Each match wraps the matched data in its .value attribute
        print(book.value['title'], book.value['price'])

Output:

    Sayings of the Century 8.95
    Sword of Honour 12.99
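
To make the meaning of an expression such as `$.store.book[*]` more concrete, here is a minimal plain-Python sketch of the kind of traversal it describes: follow each key in turn, and treat `*` as "every child". This is only an illustration under simplifying assumptions, not how `jsonpath_rw` is implemented. The function name `select` and the list-of-steps format are hypothetical.

```python
import json


def select(node, steps):
    """Return every value reached by following `steps` from `node`.

    Hypothetical helper: each step is a dict key, or "*" meaning
    "every element of a list or every value of a dict".
    """
    if not steps:
        return [node]
    head, rest = steps[0], steps[1:]
    if head == "*":
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            children = []  # scalars have no children
    elif isinstance(node, dict) and head in node:
        children = [node[head]]
    else:
        children = []  # key not present: no match on this branch
    matches = []
    for child in children:
        matches.extend(select(child, rest))
    return matches


data = json.loads("""
{
  "store": {
    "book": [
      {"title": "Sayings of the Century", "price": 8.95},
      {"title": "Sword of Honour", "price": 12.99}
    ],
    "bicycle": {"color": "red", "price": 19.95}
  }
}
""")

# Rough counterpart of the JsonPath expression $.store.book[*].title
titles = select(data, ["store", "book", "*", "title"])
print(titles)  # ['Sayings of the Century', 'Sword of Honour']
```

A real JsonPath library adds much more on top of this, such as index ranges and recursive descent, so the sketch is useful mainly for reading expressions, not for replacing the library.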

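### III. BeautifulSoup ###

BeautifulSoup is a Python library for parsing HTML and XML and extracting information from them. It converts even malformed HTML or XML into a well-formed tree and provides simple methods for searching and modifying that tree.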

#### 1. Installation ####

Install with pip:

    pip install beautifulsoup4

#### 2. Parsing HTML ####

We can parse HTML with the `BeautifulSoup` class from the `bs4` module, then get the nodes we need with the `select()` method, which takes a CSS selector.

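    from bs4 import BeautifulSoup

    html = '''
    <!DOCTYPE html>
    <html>
      <head>
        <title>Test</title>
      </head>
      <body>
        <ul>
          <li><a href="http://www.baidu.com">Baidu</a></li>
          <li><a href="http://www.qq.com">QQ</a></li>
        </ul>
      </body>
    </html>
    '''

    # 'html.parser' selects Python's built-in HTML parser as the backend
    soup = BeautifulSoup(html, 'html.parser')
    # CSS selector: every <a> nested inside an <li> inside a <ul>
    links = soup.select('ul li a')

    for link in links:
        # Attributes are read with dict-style indexing on the tag
        print(link['href'])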

Output:

    http://www.baidu.com
    http://www.qq.com
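
The `'html.parser'` argument tells BeautifulSoup to use Python's built-in HTML parser as its backend. As a rough illustration of what that parser offers on its own (a stream of callbacks rather than a searchable tree), the sketch below collects the same `href` values using only the standard library. The class name `LinkCollector` is hypothetical, and this is not a description of BeautifulSoup's internals.

```python
from html.parser import HTMLParser

HTML = """
<!DOCTYPE html>
<html>
  <body>
    <ul>
      <li><a href="http://www.baidu.com">Baidu</a></li>
      <li><a href="http://www.qq.com">QQ</a></li>
    </ul>
  </body>
</html>
"""


class LinkCollector(HTMLParser):
    """Hypothetical helper that records the href of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag just opened
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


collector = LinkCollector()
collector.feed(HTML)
print(collector.hrefs)  # ['http://www.baidu.com', 'http://www.qq.com']
```

Compared with this callback style, `soup.select('ul li a')` is shorter and can express the nesting condition directly, which is the main convenience BeautifulSoup adds.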

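#### 3. Parsing XML ####

To parse XML, we can use `BeautifulSoup` with the `'xml'` parser, which relies on the `lxml` library being installed. In the example below, the document is first parsed with `lxml`, then serialized with `etree.tostring()` and passed to the `BeautifulSoup` class.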

    from lxml import etree
    from bs4 import BeautifulSoup

    xml = '''
    <bookstore>
      <book category="cooking">
        <title lang="en">Everyday Italian</title>
        <author>Giada De Laurentiis</author>
        <year>2005</year>
        <price>30.00</price>
      </book>
      <book category="children">
        <title lang="en">Harry Potter</title>
        <author>J.K. Rowling</author>
        <year>2005</year>
        <price>29.99</price>
      </book>
    </bookstore>
    '''

    # Parse with lxml, then serialize the tree back to bytes
    tree = etree.fromstring(xml)
    # The 'xml' parser requires lxml to be installed
    soup = BeautifulSoup(etree.tostring(tree), 'xml')
    books = soup.select('book')

    for book in books:
        # select_one() returns the first match; .text is its text content
        print(book.select_one('title').text, book.select_one('price').text)

Output:

    Everyday Italian 30.00
    Harry Potter 29.99
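
In the XML examples, everything read from the document, including the contents of `<year>` and `<price>`, comes back as a string. A common next step is converting each record into typed values. The sketch below shows one way to do that. It parses with the standard library's `xml.etree.ElementTree` so that it runs without extra packages, and the names `Book` and `parse_books` are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from decimal import Decimal

XML = """
<bookstore>
  <book category="cooking">
    <title lang="en">Everyday Italian</title>
    <year>2005</year>
    <price>30.00</price>
  </book>
  <book category="children">
    <title lang="en">Harry Potter</title>
    <year>2005</year>
    <price>29.99</price>
  </book>
</bookstore>
"""


@dataclass
class Book:
    """Hypothetical record type for one <book> element."""
    title: str
    year: int
    price: Decimal


def parse_books(xml_text):
    """Convert every <book> element into a typed Book record."""
    root = ET.fromstring(xml_text.strip())
    books = []
    for node in root.iter("book"):
        books.append(Book(
            title=node.findtext("title"),
            year=int(node.findtext("year")),
            # Decimal keeps prices exact, unlike binary floats
            price=Decimal(node.findtext("price")),
        ))
    return books


books = parse_books(XML)
for book in books:
    print(book.title, book.year, book.price)
```

The same conversion applies to values obtained through `lxml` or BeautifulSoup, since `.text` and `text()` also yield strings there.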

--------------------

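### Summary ###

This article introduced three common parsing techniques in Python: XPath, JsonPath, and BeautifulSoup. With them, we can conveniently extract the information we need from HTML, XML, or JSON. In practice, choose the technique that fits the data format and the task at hand, and combine the tools as needed to make parsing more efficient.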
