Scrapy Example 2: Scraping a Premium Phone Number Site
Preface: this example walks through using the Scrapy crawler framework, saving the scraped data both to a MySQL database and to a JSON-format file.
Project analysis:
Project name: phone  Spider name: getphone  Target URL: http://www.jihaoba.com/escrow/ (Jihaoba / 集号吧)
Fields to extract:
//div[@class="numbershow"]/ul
Phone number:
./li[@class="number hmzt"]/a/@href
The phone number has to be pulled out of the href with a regular expression:
re(r"\d{11}")  # matches the 11-digit phone number
Price:
./li[@class="price"]/span/text()
Before storing in the database, strip the RMB currency symbol:
replace("¥", "")
Number location:
./li[@class="brand"]/text()
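The two cleaning steps above (an 11-digit regex on the href, stripping the ¥ sign from the price) can be sketched in plain Python. The href and price values here are made-up samples shaped like the site's markup, not real data:

```python
import re

# Hypothetical sample values, shaped like the site's markup
href = "/escrow/detail-13800138000.html"
price_text = "¥12345"

# Pull the 11-digit phone number out of the href,
# the same match the spider's .re(r"\d{11}") performs
number = re.search(r"\d{11}", href).group()

# Strip the RMB symbol before storing the price
price = price_text.replace("¥", "")

print(number)  # 13800138000
print(price)   # 12345
```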
Project steps:
Create the project
scrapy startproject phone
Create the spider
cd phone
scrapy genspider getphone jihaoba.com
Write the spider
import scrapy
from ..items import PhoneItem

class GetphoneSpider(scrapy.Spider):
    name = 'getphone'
    allowed_domains = ['jihaoba.com']
    start_urls = ['http://www.jihaoba.com/escrow/']

    def parse(self, response):
        uls = response.xpath('//div[@class="numbershow"]/ul')
        for ul in uls:
            phoneitem = PhoneItem()
            # Phone number: pulled out of the href with an 11-digit regex
            phoneitem['number'] = ul.xpath('li[@class="number hmzt"]/a/@href').re(r"\d{11}")[0]
            # Price: strip the RMB symbol before storing
            price = ul.xpath('li[@class="price"]/span/text()').extract()[0]
            phoneitem['price'] = price.replace("¥", "")
            # Number location
            phoneitem['brand'] = ul.xpath('li[@class="brand"]/text()').extract()[0]
            yield phoneitem
        # Follow the next page
        # next_url = "http://www.jihaoba.com" + response.xpath('//a[@class="m-pages-next"]/@href').extract_first()
        # yield scrapy.Request(next_url)
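The commented-out next-page code concatenates the href onto the site root by hand. A safer sketch uses standard-library urljoin, which also copes with absolute hrefs; the href value here is a hypothetical example:

```python
from urllib.parse import urljoin

# Hypothetical relative href, as it might appear in the next-page link
next_href = "/escrow/?page=2"
next_url = urljoin("http://www.jihaoba.com/escrow/", next_href)
print(next_url)  # http://www.jihaoba.com/escrow/?page=2
```

Inside the spider you could equivalently call `response.urljoin(next_href)`, which Scrapy provides for exactly this purpose.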
Define the item (items.py)
import scrapy

class PhoneItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    number = scrapy.Field()
    price = scrapy.Field()
    brand = scrapy.Field()
Set up the pipelines (pipelines.py)
import codecs
import json

import MySQLdb
from .settings import mysql_host, mysql_user, mysql_passwd, mysql_db, mysql_port

# Save items to the database
class putMySQLPipeline:
    def __init__(self):
        # Connect to the database
        self.connection = MySQLdb.connect(host=mysql_host, user=mysql_user,
                                          passwd=mysql_passwd, db=mysql_db,
                                          port=mysql_port, charset='utf8')
        # Get a cursor
        self.cursor = self.connection.cursor()

    def process_item(self, item, spider):
        sql = "insert into phonetable(number,price,brand) values(%s,%s,%s)"
        param = (item['number'], item['price'], item['brand'])
        self.cursor.execute(sql, param)  # parameterized query
        self.connection.commit()  # commit the transaction
        return item

    def __del__(self):
        self.cursor.close()
        self.connection.close()

# Save items as one JSON object per line
class JsonWithEncodingPipeline(object):
    def __init__(self):
        self.file = codecs.open("phone.json", "a", encoding="utf-8")

    def process_item(self, item, spider):
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"  # dict() converts the Item to a plain dict
        self.file.write(line)
        return item

    def __del__(self):
        self.file.close()
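The JSON pipeline writes one item per line and passes ensure_ascii=False so Chinese text stays readable instead of being escaped. A quick sketch of the output format, using a made-up item:

```python
import json

# Made-up item dict matching the PhoneItem fields
item = {"number": "13800138000", "price": "12345", "brand": "北京"}
line = json.dumps(item, ensure_ascii=False)
print(line)  # {"number": "13800138000", "price": "12345", "brand": "北京"}
```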
Configure settings (settings.py)
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# Pipeline priority (lower number = higher priority)
ITEM_PIPELINES = {
# 'phone.pipelines.PhonePipeline': 300,
'phone.pipelines.putMySQLPipeline': 300,
'phone.pipelines.JsonWithEncodingPipeline': 400
}
mysql_host = '192.168.142.200'  # database host IP
mysql_user = 'root'             # database user
mysql_passwd = '123456'         # database password
mysql_db = 'db01'               # database name
mysql_tb = 'phonetable'         # table name
mysql_port = 3306               # database port
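Scrapy runs every yielded item through the enabled pipelines in ascending priority order, so here each item reaches putMySQLPipeline (300) before JsonWithEncodingPipeline (400). A minimal simulation of that ordering, with stand-in pipeline classes (the real ones live in phone/pipelines.py):

```python
# Stand-in pipelines that just record the order they ran in
class MySQLStandIn:
    def process_item(self, item, spider):
        item["seen"].append("mysql")
        return item

class JsonStandIn:
    def process_item(self, item, spider):
        item["seen"].append("json")
        return item

# Sort by the priority value, as Scrapy does with ITEM_PIPELINES
pipelines = sorted([(400, JsonStandIn()), (300, MySQLStandIn())],
                   key=lambda p: p[0])
item = {"seen": []}
for _, p in pipelines:
    item = p.process_item(item, spider=None)
print(item["seen"])  # ['mysql', 'json']
```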
Create start.py
from scrapy import cmdline
cmdline.execute("scrapy crawl getphone".split())
Database
Create the database
create database db01;
Grant privileges
grant all privileges on db01.* to 'root'@'%' identified by '123456';
Create the data table (named phonetable, matching the INSERT statement in the pipeline)
create table phonetable (
    number bigint,
    price int,
    brand varchar(50)
);
Set the character encoding
Check the database character set
show create database db01;
Change the database character set
alter database db01 character set utf8;
Check the table character set
show create table phonetable;
Change the table character set
alter table phonetable character set utf8;
Run the project
Run start.py directly.