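Python Data Preprocessing (1)
The stock list comes from Tushare or Tushare Pro, but it contains 3,707 companies, so I had to work out my own way of narrowing it down.
1 Pick the leading stocks in each sector
MySQL can pick the largest, smallest, latest, or top N records per group with a grouped ranking query. The query below keeps the three largest companies by total market capitalisation within each industry and market.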
select st1.code,st1.name,sb1.industry,st1.mktcap,sb1.market,st1.trade,st1.per,st1.pb
from stock_today st1
inner join stock_basic sb1 on sb1.symbol=st1.code
where 3> (
select count(*)
from stock_today st2
inner join stock_basic sb2 on sb2.symbol=st2.code
where sb1.industry=sb2.industry and sb1.market=sb2.market and st2.mktcap>st1.mktcap
)
order by sb1.industry,st1.mktcap
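The same top-N-per-group selection can be sketched in pandas once the rows are in a DataFrame. This is only a minimal sketch: the toy rows and the helper name `top_n_by_mktcap` are made up for illustration, and the column names simply mirror the query above.

```python
import pandas as pd

# Hypothetical toy data standing in for the joined stock_today / stock_basic rows.
df = pd.DataFrame({
    "code": ["A1", "A2", "A3", "A4", "B1", "B2"],
    "industry": ["bank", "bank", "bank", "bank", "steel", "steel"],
    "market": ["main", "main", "main", "main", "main", "main"],
    "mktcap": [500.0, 300.0, 800.0, 100.0, 50.0, 70.0],
})


def top_n_by_mktcap(frame, n=3):
    """Return the n largest-mktcap rows within each (industry, market) group."""
    # Sort so that the largest market caps come first, then keep n rows per group.
    ranked = frame.sort_values("mktcap", ascending=False)
    top = ranked.groupby(["industry", "market"]).head(n)
    # Same ordering as the SQL: by industry, then by market cap ascending.
    return top.sort_values(["industry", "mktcap"]).reset_index(drop=True)


leaders = top_n_by_mktcap(df, n=3)
print(leaders)
```

One difference to keep in mind: the SQL counts rows with a strictly larger market cap, so ties can return more than three rows per group, while `head(3)` always stops at three.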
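2 Plotting the P/E ratio with Matplotlib
References: a complete guide to Matplotlib boxplot() usage, examples of colour and line-style control in matplotlib, and data preprocessing with sklearn (standardization, normalization, regularization).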
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
import pandas as pd
def get_outlier(x,y,upper,lower):
    # Collect the points whose y value lies outside [lower, upper]
    o_x = []
    o_y = []
    for i in range(0,len(x)):
        if y[i]< lower or y[i]>upper:
            o_x.append(x[i])
            o_y.append(y[i])
    return o_x,o_y
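# figsize: width and height of the figure in inches (1 inch = 2.54 cm; an A4 sheet is about 21 x 29.7 cm)
# dpi: resolution of the figure in dots per inch (the default is 100 since Matplotlib 2.0; it was 80 in earlier versions)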
fig = plt.figure(figsize=(15,15),dpi=60)
# Data: P/E ratios
per=[35.676,33.163,53.553,38.374,36.239,18.501,43.831,16.49,32.333,42.884,21.019,32.684,50.282,53.996,23.175,34.972,21.294,24.31]
per=np.array(per)
mean = round(per.mean(),2)
x = np.arange(1,19,1)
# Scatter plot
# add_subplot arguments: number of rows, number of columns, position; 221 can also be written as 2, 2, 1
ax = fig.add_subplot(221)
ax.scatter(x, per, marker = 'o', color = 'green', s = 40, label = 'per')
ax.axhline(y=mean,label=str(mean))
ax.legend(loc='best')
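# z-score standardization
ss = StandardScaler()
# fit: computes the mean and variance of the training data, which are later used to transform it
# transform: only applies the transformation, using the stored mean and variance, so the result has zero mean and unit variance
# fit_transform: fits and transforms in one call; this rescales the data but does not turn it into a normal distribution
# reshape(-1, 1): -1 lets NumPy infer the number of rows; the number of columns is 1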
ss_per = ss.fit_transform(per.reshape(-1,1))
ax = fig.add_subplot(222)
# sklearn works with 2-D arrays, so the result is reshaped back to 1-D for plotting (on NumPy reshaping and memory use, see https://www.zhangshengrong.com/p/bYXxZKwlaZ/)
ax.scatter(x,ss_per.reshape(1,-1)[0], marker = 'o', color = 'green', s = 40, label = 'Third')
# Box plot
ax = fig.add_subplot(223)
ax.boxplot(per, showmeans=True)
df_per = pd.DataFrame(per)
# pandas.DataFrame.describe returns summary statistics of the data
# https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html
statistics = df_per.describe()
# https://www.cnblogs.com/linux-wangkun/p/5903380.html
QL = statistics.loc['25%']
QU = statistics.loc['75%']
IQR = QU - QL
# The upper fence lies above the upper quartile, the lower fence below the lower quartile
threshold_upper = QU + 1.5 * IQR
threshold_lower = QL - 1.5 * IQR
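# Note: QU and QL are pandas Series, hence the [0] lookup
# The quartiles are passed here, so this collects every point outside the box (below Q1 or above Q3),
# not only the points beyond the 1.5 * IQR fences
o_x,o_y = get_outlier(x,per,QU[0],QL[0])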
ax = fig.add_subplot(224)
ax.plot(x,per,marker='+')
ax.axhline(y=threshold_upper[0])
ax.axhline(y=QL[0],color='r')
ax.axhline(y=QU[0],color='c')
ax.axhline(y=threshold_lower[0])
# ax.plot(o_x,o_y)
for i in range(len(o_x)):
    ax.annotate(o_y[i],xy=(o_x[i],o_y[i]),xytext=(o_x[i],o_y[i]))
# Show the figure
plt.show()
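The loop in get_outlier can also be written with a NumPy boolean mask. The sketch below assumes x and y are array-like and of equal length; the name `get_outlier_mask` and the sample values are made up for illustration.

```python
import numpy as np


def get_outlier_mask(x, y, upper, lower):
    """Return the x and y values where y lies outside [lower, upper]."""
    x = np.asarray(x)
    y = np.asarray(y)
    # True wherever y falls below the lower bound or above the upper bound.
    mask = (y < lower) | (y > upper)
    return x[mask], y[mask]


# Hypothetical sample values, not taken from the stock data.
x = np.arange(1, 7)
y = np.array([10.0, 55.0, 30.0, 5.0, 42.0, 41.0])
o_x, o_y = get_outlier_mask(x, y, upper=42.0, lower=10.0)
print(o_x, o_y)
```

Values equal to a bound are kept as inliers, which matches the strict comparisons in the original loop.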
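As the figure below shows, matplotlib can draw a box plot, but the plot only gives a visual impression. On its own it is of limited use, and the numbers still need to be analysed with pandas.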
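As a rough sketch of what that pandas analysis could look like, the snippet below recomputes the quartiles and the usual 1.5 * IQR fences for the same P/E values. The variable names are my own, and the 1.5 multiplier is the common box plot convention rather than anything specific to this data.

```python
import pandas as pd

per = pd.Series(
    [35.676, 33.163, 53.553, 38.374, 36.239, 18.501, 43.831, 16.49, 32.333,
     42.884, 21.019, 32.684, 50.282, 53.996, 23.175, 34.972, 21.294, 24.31],
    name="per",
)

# Quartiles with pandas' default linear interpolation (the same as describe()).
q1 = per.quantile(0.25)
q3 = per.quantile(0.75)
iqr = q3 - q1

# Fences used by the standard box plot rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences, and points merely outside the box.
outliers = per[(per < lower_fence) | (per > upper_fence)]
outside_box = per[(per < q1) | (per > q3)]

print(f"Q1={q1:.3f} Q3={q3:.3f} IQR={iqr:.3f}")
print(f"fences: [{lower_fence:.3f}, {upper_fence:.3f}]")
print(f"outliers: {len(outliers)}, outside the box: {len(outside_box)}")
```

For this sample no value lies beyond the fences, so the box plot shows no outlier markers, while ten of the eighteen values fall outside the interquartile box.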
Standardizing the data does not change the shape of its distribution.
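A quick way to check this is to compute the z-scores by hand and compare a few shape-related properties before and after. This is a sketch using plain NumPy rather than StandardScaler; as far as I know StandardScaler also divides by the population standard deviation (ddof=0), which is what `per.std()` uses here. The `skewness` helper is my own.

```python
import numpy as np

per = np.array([35.676, 33.163, 53.553, 38.374, 36.239, 18.501, 43.831, 16.49,
                32.333, 42.884, 21.019, 32.684, 50.282, 53.996, 23.175, 34.972,
                21.294, 24.31])

# z-score: subtract the mean, divide by the population standard deviation.
z = (per - per.mean()) / per.std()


def skewness(values):
    """Third standardized moment, a simple measure of asymmetry."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5


same_order = np.array_equal(np.argsort(per), np.argsort(z))
same_skew = np.isclose(skewness(per), skewness(z))

print("mean:", round(float(z.mean()), 6), "std:", round(float(z.std()), 6))
print("ordering preserved:", same_order)
print("skewness preserved:", same_skew)
```

Location and scale change (mean 0, standard deviation 1), but the ordering of the points and the skewness stay the same, which is why the second scatter plot looks like the first with a relabelled y axis.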
To be continued.