Data Cleaning: String Data Processing

逃离我推掉我的手 2023-05-22 08:43

String Data Processing

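  • Pandas provides vectorized string functions, but they work only on string-typed (object) columns
  • These functions are accessed through the .str accessor
  • The familiar Python string methods can then be used to clean the data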
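
To make the first point concrete, here is a minimal sketch of what happens when .str is used on a non-string column. The DataFrame `demo` and its columns are made-up toy data, not part of the motorcycle dataset used later in this post.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
demo = pd.DataFrame({"price": ["$1,200", "$950"], "year": [2013, 2016]})

# A string column exposes the .str accessor.
upper_price = demo["price"].str.upper()

# A numeric column does not: accessing .str raises AttributeError.
try:
    demo["year"].str.upper()
    numeric_has_str = True
except AttributeError:
    numeric_has_str = False

# Casting to str first makes the accessor available.
year_prefix = demo["year"].astype(str).str[:2]

print(numeric_has_str)        # False
print(year_prefix.tolist())   # ['20', '20']
```

If a numeric column really has to be treated as text, casting it with astype(str) first is one way to get the accessor back.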

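| Function | Description |
|---|---|
| contains() | Returns a boolean for each string indicating whether it contains the given pattern |
| replace() | Replaces occurrences of a pattern or substring |
| lower() | Returns a copy of each string with all letters converted to lowercase |
| upper() | Returns a copy of each string with all letters converted to uppercase |
| split() | Splits each string on a delimiter (whitespace by default) and returns the list of pieces |
| strip() | Removes leading and trailing whitespace (or the given characters) |
| join() | Returns a string that is the concatenation of all strings in the given sequence, joined by the specified separator |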
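
As a quick illustration of the functions in the table, the sketch below runs each one through the .str accessor. The Series `s` holds made-up sample values chosen only to exercise every method; it is not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample values, picked so that each method has something to do.
s = pd.Series(["  Harley-Davidson ", "BMW", "honda motor"])

has_dash = s.str.contains("-", regex=False)      # boolean Series
no_dash = s.str.replace("-", " ", regex=False)   # literal replacement
lowered = s.str.lower()
uppered = s.str.upper()
stripped = s.str.strip()                         # trims whitespace at both ends
words = stripped.str.split(" ")                  # one list of pieces per row
rejoined = words.str.join("_")                   # joins each list back into a string

print(has_dash.tolist())   # [True, False, False]
print(rejoined.tolist())   # ['Harley-Davidson', 'BMW', 'honda_motor']
```

Note that contains() returns booleans, so it is usually used as a filter mask, for example s[has_dash].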

    import pandas as pd
    import numpy as np
    import os

    # Show the current working directory
    os.getcwd()

Output:

    'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据转换'

    # Switch to the folder that holds the data file
    os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')
    # The CSV file is GBK-encoded
    df = pd.read_csv('MotorcycleData.csv', encoding='gbk')
    df.head(5)

|  | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | $11,412 | McHenry, Illinois, United States | 2013.0 | 16,000 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | $17,200 | Fort Recovery, Ohio, United States | 2016.0 | 60 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | $3,872 | Chicago, Illinois, United States | 1970.0 | 25,763 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | $6,575 | Green Bay, Wisconsin, United States | 2009.0 | 33,142 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | $10,000 | West Bend, Wisconsin, United States | 2012.0 | 17,800 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |

5 rows × 22 columns

    df.info()

Output:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7493 entries, 0 to 7492
    Data columns (total 22 columns):
    Condition         7493 non-null object
    Condition_Desc    1656 non-null object
    Price             7493 non-null object
    Location          7491 non-null object
    Model_Year        7489 non-null float64
    Mileage           7468 non-null object
    Exterior_Color    6778 non-null object
    Make              7489 non-null object
    Warranty          5109 non-null object
    Model             7370 non-null object
    Sub_Model         2426 non-null object
    Type              6011 non-null object
    Vehicle_Title     268 non-null object
    OBO               7427 non-null object
    Feedback_Perc     6611 non-null object
    Watch_Count       3517 non-null object
    N_Reviews         7487 non-null object
    Seller_Status     6868 non-null object
    Vehicle_Tile      7439 non-null object
    Auction           7476 non-null object
    Buy_Now           7256 non-null object
    Bid_Count         2190 non-null float64
    dtypes: float64(2), object(20)
    memory usage: 1.3+ MB

    # Price contains non-numeric characters ('$' and ','), so it cannot be converted directly
    # df['Price'].astype(float)  # raises ValueError
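
To see why the direct conversion fails, the following sketch reproduces the problem on a small Series. The three values are copied from the Price column shown earlier; the variable names are illustrative only.

```python
import pandas as pd

# Three values copied from the Price column in the table above.
prices = pd.Series(["$11,412", "$17,200", "$3,872"])

# A direct cast fails because '$' and ',' are not part of a valid float literal.
try:
    prices.astype(float)
    cast_failed = False
except ValueError:
    cast_failed = True

# pd.to_numeric with errors="coerce" does not raise, but it silently turns
# every unparseable value into NaN, so it is not a fix on its own.
coerced = pd.to_numeric(prices, errors="coerce")

print(cast_failed)            # True
print(coerced.isna().all())   # True
```

Either way, the strings have to be cleaned before the numeric conversion can succeed.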

    # The .str accessor supports slicing to extract characters
    df['Price'].str[1:3].head(5)

Output:

    0    11
    1    17
    2    3,
    3    6,
    4    10
    Name: Price, dtype: object

    # The strings have to be cleaned first: strip the '$' sign
    # ('价格' means "price"; the cleaned values go into this new column)
    df['价格'] = df['Price'].str.strip('$')
    df['价格'].head(5)

Output:

    0    11,412
    1    17,200
    2     3,872
    3     6,575
    4    10,000
    Name: 价格, dtype: object

    # Remove the thousands separators (regex=False treats ',' as a literal)
    df['价格'] = df['价格'].str.replace(',', '', regex=False)
    df['价格'].head(5)

Output:

    0    11412
    1    17200
    2     3872
    3     6575
    4    10000
    Name: 价格, dtype: object

    # Now that only digits remain, the column can be converted to float
    df['价格'] = df['价格'].astype(float)
    df['价格'].head(5)

Output:

    0    11412.0
    1    17200.0
    2     3872.0
    3     6575.0
    4    10000.0
    Name: 价格, dtype: float64
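
The three steps above (strip the '$', remove the commas, cast to float) can also be chained into one expression. The helper below is a sketch; the name `clean_price` is made up for this example and does not appear in the original notebook.

```python
import pandas as pd

def clean_price(col: pd.Series) -> pd.Series:
    """Turn strings such as '$11,412' into floats (hypothetical helper)."""
    return (
        col.str.strip("$")                       # drop the currency sign
           .str.replace(",", "", regex=False)    # drop thousands separators
           .astype(float)                        # convert the digits to float
    )

# Sample values copied from the Price column shown earlier.
sample = pd.Series(["$11,412", "$17,200", "$3,872"])
cleaned = clean_price(sample)
print(cleaned.tolist())   # [11412.0, 17200.0, 3872.0]
```

On the real data this would be called as df['价格'] = clean_price(df['Price']), assuming every value follows the same '$' plus digits format.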

    df.dtypes

Output:

    Condition          object
    Condition_Desc     object
    Price              object
    Location           object
    Model_Year        float64
    Mileage            object
    Exterior_Color     object
    Make               object
    Warranty           object
    Model              object
    Sub_Model          object
    Type               object
    Vehicle_Title      object
    OBO                object
    Feedback_Perc      object
    Watch_Count        object
    N_Reviews          object
    Seller_Status      object
    Vehicle_Tile       object
    Auction            object
    Buy_Now            object
    Bid_Count         float64
    价格                float64
    dtype: object

    # Split each string on commas and keep the first piece (the city)
    df['Location'].str.split(',').str[0].head(5)

Output:

    0          McHenry
    1    Fort Recovery
    2          Chicago
    3        Green Bay
    4        West Bend
    Name: Location, dtype: object
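
When all pieces are needed rather than just the first, split() can return them as separate columns with expand=True. This is a sketch on two values copied from the Location column; the column names City, State and Country are assumptions made for this example.

```python
import pandas as pd

# Two values copied from the Location column shown earlier.
loc = pd.Series([
    "McHenry, Illinois, United States",
    "Fort Recovery, Ohio, United States",
])

# Splitting on ", " (comma plus space) avoids leading spaces in the pieces;
# expand=True returns a DataFrame with one column per piece.
parts = loc.str.split(", ", expand=True)
parts.columns = ["City", "State", "Country"]   # hypothetical column names

print(parts["State"].tolist())   # ['Illinois', 'Ohio']
```

Splitting on ',' alone, as in the cell above, leaves a leading space in every piece after the first, which would then need a strip().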

    # Length of each string (missing values give NaN, hence the float64 result)
    df['Location'].str.len().head(5)

Output:

    0    32.0
    1    34.0
    2    32.0
    3    35.0
    4    35.0
    Name: Location, dtype: float64
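
The lengths come back as floats because Location has missing values (7491 non-null out of 7493 rows), and NaN cannot be stored in a plain integer column. The sketch below reproduces this on a two-element object Series and shows one possible way to get integers back, using the nullable Int64 dtype; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# One real value from the Location column plus a missing value.
loc = pd.Series(["McHenry, Illinois, United States", np.nan], dtype=object)

lengths = loc.str.len()
print(lengths.dtype)        # float64

# The nullable integer dtype keeps the missing entry as <NA>.
int_lengths = lengths.astype("Int64")
print(int_lengths.iloc[0])  # 32
```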
