Data Cleaning: Processing String Data

String data processing

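pandas provides a set of string functions, but they only work on string-typed columns (dtype object in this dataset).
They are reached through the .str accessor on a Series.
Once there, the familiar string methods can be applied element-wise for data processing.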
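As a minimal sketch of that restriction, the snippet below uses a small made-up DataFrame (the column names `name` and `qty` are hypothetical and not part of MotorcycleData.csv). It shows `.str` working on a string column and raising `AttributeError` on a numeric one.

```python
import pandas as pd

# Hypothetical toy data, not taken from MotorcycleData.csv
toy = pd.DataFrame({"name": ["Alice", "bob"], "qty": [1, 2]})

# .str works element-wise on a string column
upper_names = toy["name"].str.upper().tolist()

# .str on a numeric column raises AttributeError
try:
    toy["qty"].str.upper()
    numeric_ok = True
except AttributeError:
    numeric_ok = False

print(upper_names, numeric_ok)
```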

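Function	Description
contains()	Returns a boolean for each string indicating whether it contains the given pattern
replace()	Replaces occurrences of a pattern in each string
lower()	Returns a copy of each string with all letters converted to lowercase
upper()	Returns a copy of each string with all letters converted to uppercase
split()	Splits each string on a delimiter and returns the list of pieces
strip()	Removes leading and trailing whitespace (or the given characters)
join()	Joins the elements of each list in the Series into one string, using the given separator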
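The functions in the table can be tried on a tiny Series before touching the real data. This is only an illustrative sketch: the sample values are made up, and `na=False` and `regex=False` are choices made here to keep the behaviour explicit rather than anything the original notebook uses.

```python
import pandas as pd

# Made-up sample values, including one missing entry
s = pd.Series([" Harley-Davidson ", "BMW", None])

stripped = s.str.strip()                                # remove surrounding whitespace
has_bmw = stripped.str.contains("BMW", na=False)        # missing values count as False
lowered = stripped.str.lower()                          # all letters to lowercase
uppered = stripped.str.upper()                          # all letters to uppercase
replaced = stripped.str.replace("-", " ", regex=False)  # literal replacement
parts = stripped.str.split("-")                         # list of pieces per row
rejoined = parts.str.join("/")                          # join each list with "/"

print(rejoined)
```

Note that `join()` here operates on the lists produced by `split()`: each row's list is joined into one string.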

import pandas as pd
import numpy as np
import os
os.getcwd()
'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据转换'
os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
df = pd.read_csv('MotorcycleData.csv', encoding='gbk')
df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	…	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	$11,412	McHenry, Illinois, United States	2013.0	16,000	Black	Harley-Davidson	Unspecified	Touring	…	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	$17,200	Fort Recovery, Ohio, United States	2016.0	60	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	…	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	$3,872	Chicago, Illinois, United States	1970.0	25,763	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	…	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	$6,575	Green Bay, Wisconsin, United States	2009.0	33,142	Red	Harley-Davidson	NaN	Touring	…	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	$10,000	West Bend, Wisconsin, United States	2012.0	17,800	Blue	Harley-Davidson	NO WARRANTY	Touring	…	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7493 entries, 0 to 7492
Data columns (total 22 columns):
Condition         7493 non-null object
Condition_Desc    1656 non-null object
Price             7493 non-null object
Location          7491 non-null object
Model_Year        7489 non-null float64
Mileage           7468 non-null object
Exterior_Color    6778 non-null object
Make              7489 non-null object
Warranty          5109 non-null object
Model             7370 non-null object
Sub_Model         2426 non-null object
Type              6011 non-null object
Vehicle_Title     268 non-null object
OBO               7427 non-null object
Feedback_Perc     6611 non-null object
Watch_Count       3517 non-null object
N_Reviews         7487 non-null object
Seller_Status     6868 non-null object
Vehicle_Tile      7439 non-null object
Auction           7476 non-null object
Buy_Now           7256 non-null object
Bid_Count         2190 non-null float64
dtypes: float64(2), object(20)
memory usage: 1.3+ MB
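# Price contains non-numeric characters ('$' and ','), so a direct conversion fails
# df['Price'].astype(float)
# The .str accessor supports slicing to extract characters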
df['Price'].str[1:3].head(5)
0    11
1    17
2    3,
3    6,
4    10
Name: Price, dtype: object
# First clean up the string: strip the '$' and store the result in a new column 价格 ("price")
df['价格'] = df['Price'].str.strip('$')
df['价格'].head(5)
0    11,412 
1    17,200 
2     3,872 
3     6,575 
4    10,000 
Name: 价格, dtype: object
# Remove the thousands separators
df['价格'] = df['价格'].str.replace(',', '')
df['价格'].head(5)
0    11412 
1    17200 
2     3872 
3     6575 
4    10000 
Name: 价格, dtype: object
# With '$' and ',' gone, the column converts to float
df['价格'] = df['价格'].astype(float)
df['价格'].head(5)
0    11412.0
1    17200.0
2     3872.0
3     6575.0
4    10000.0
Name: 价格, dtype: float64
df.dtypes
Condition          object
Condition_Desc     object
Price              object
Location           object
Model_Year        float64
Mileage            object
Exterior_Color     object
Make               object
Warranty           object
Model              object
Sub_Model          object
Type               object
Vehicle_Title      object
OBO                object
Feedback_Perc      object
Watch_Count        object
N_Reviews          object
Seller_Status      object
Vehicle_Tile       object
Auction            object
Buy_Now            object
Bid_Count         float64
价格                float64
dtype: object
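The strip, replace, and astype steps used on Price can also be bundled into one helper. The sketch below is one possible way to do it, not the notebook's own method: `clean_price` is a hypothetical name, and it swaps `astype(float)` for `pd.to_numeric(..., errors="coerce")` so that values that cannot be parsed become NaN instead of raising an error.

```python
import pandas as pd

def clean_price(prices: pd.Series) -> pd.Series:
    """Turn strings like '$11,412 ' into floats; unparseable values become NaN."""
    cleaned = (
        prices.str.replace("$", "", regex=False)  # literal '$', not a regex anchor
              .str.replace(",", "", regex=False)  # drop thousands separators
              .str.strip()                        # drop surrounding whitespace
    )
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample in the same shape as the Price column
sample = pd.Series(["$11,412 ", "$3,872", "N/A", None])
result = clean_price(sample)
print(result)
```

On the real data this would be called as `clean_price(df['Price'])`, assuming the column looks like the rows shown by `df.head(5)`.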
# Split each string on ',' and keep the first piece (the city)
df['Location'].str.split(',').str[0].head(5)
0          McHenry
1    Fort Recovery
2          Chicago
3        Green Bay
4        West Bend
Name: Location, dtype: object
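If all pieces of Location are wanted rather than just the first, `split` accepts `expand=True` and returns one column per piece. A small sketch using two Location values from the rows shown earlier; the column names `City`, `State`, and `Country` are assumptions made for illustration.

```python
import pandas as pd

loc = pd.Series([
    "McHenry, Illinois, United States",
    "Chicago, Illinois, United States",
])

# expand=True returns a DataFrame with one column per piece
parts = loc.str.split(",", expand=True)
parts.columns = ["City", "State", "Country"]  # hypothetical column names

# Splitting on ',' leaves a leading space on later pieces, so strip every column
parts = parts.apply(lambda col: col.str.strip())
print(parts)
```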
# Compute the length of each string (the result is float64 because Location has missing values)
df['Location'].str.len().head(5)
0    32.0
1    34.0
2    32.0
3    35.0
4    35.0
Name: Location, dtype: float64
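The lengths come back as floats because a missing Location yields a missing length. If integer lengths are preferred, one option (an addition here, not something the notebook does) is to cast to pandas' nullable `Int64` dtype, sketched below on made-up input.

```python
import pandas as pd

loc = pd.Series(["Chicago, Illinois, United States", None])

lengths = loc.str.len()           # the missing value stays missing
as_int = lengths.astype("Int64")  # nullable integer keeps the gap as <NA>
print(as_int)
```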