Hyperparameter Tuning - 蒲公英云

1. Parameters of the GridSearchCV class

class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, 
iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')

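estimator: the estimator to tune (a classifier in the examples here)
param_grid: the parameters to tune, given as a dict (or a list of dicts) that maps parameter names to lists of candidate values. For example: param_grid = {'criterion': ['gini', 'entropy'],
'max_depth': [2,3,4,5,6],
'min_samples_split': [2,3,4,5,6],
'min_samples_leaf': [2,3,4,5,6]
}
scoring: the model evaluation metric; the accepted values are listed below
refit: defaults to True; once the search has finished, the estimator is fit once more on the whole data set passed to fit, using the best parameters found
cv: cross-validation setting; the default is 3-fold in the version documented here (0.19) and 5-fold from scikit-learn 0.22 onward
verbose: logging verbosity (int). 0: no output during training, 1: occasional output, >1: output for every sub-model
pre_dispatch='2*n_jobs': controls how many jobs are dispatched during parallel execution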
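The parameters above can be put together in a minimal sketch. The param_grid keys shown above are decision-tree parameters, so the sketch assumes a DecisionTreeClassifier; the iris data set and the reduced value lists are assumptions made here for illustration and do not come from the original text.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative data set (an assumption, not from the original text).
X, y = load_iris(return_X_y=True)

# Same grid format as described above; fewer values keep the search fast.
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4],
    'min_samples_split': [2, 4],
    'min_samples_leaf': [2, 4],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid=param_grid,
    scoring='accuracy',   # one of the accepted scoring names
    cv=5,                 # set explicitly; the default differs across versions
    refit=True,           # refit the best candidate on all of X, y
    verbose=0,            # no progress output
)
search.fit(X, y)

# Every combination of values is tried: 2 * 3 * 2 * 2 candidates.
n_candidates = len(search.cv_results_['params'])
print(n_candidates)
print(search.best_params_)
```

Setting cv explicitly is a deliberate choice here, since it makes the result independent of the installed scikit-learn version.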

The scoring parameter: evaluation metrics are available for classification, clustering, and regression.

['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score', 'average_precision', 'completeness_score',
 'explained_variance', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'fowlkes_mallows_score',
 'homogeneity_score', 'mutual_info_score', 'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
 'neg_mean_squared_log_error', 'neg_median_absolute_error', 'normalized_mutual_info_score', 'precision', 'precision_macro', 
'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 
'recall_samples', 'recall_weighted', 'roc_auc', 'v_measure_score']

For the meaning of each value, see: http://sklearn.apachecn.org/cn/0.19.0/modules/model_evaluation.html#scoring-parameter
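To make the scoring names more concrete, here is a small sketch showing that a name such as 'f1_macro' should behave like a scorer built by hand with make_scorer. The data set and the decision-tree classifier are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Built-in scorer selected by name.
by_name = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')

# Equivalent scorer built explicitly from the metric function.
macro_f1 = make_scorer(f1_score, average='macro')
by_callable = cross_val_score(clf, X, y, cv=5, scoring=macro_f1)

print(by_name)
print(np.allclose(by_name, by_callable))
```

The same string or scorer object can be passed as the scoring argument of GridSearchCV.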

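2. Using grid search:

Parameters that have to be set by hand, rather than learned from the data, are called hyperparameters; the process of choosing their values is called tuning.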
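As a brief illustration of the distinction, the sketch below sets three hyperparameters of an SVC by hand and then reads back an attribute that fit() learns from the data. The specific values are assumptions picked for the example.

```python
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)

# C, kernel and gamma are hyperparameters: chosen by us before training.
model = SVC(kernel='rbf', C=10, gamma=0.001)
hyper = model.get_params()
print(hyper['C'], hyper['kernel'], hyper['gamma'])

# Attributes ending in "_" are learned from the data by fit().
model.fit(X, y)
print(model.support_vectors_.shape)
```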

# Tune the hyperparameters of an estimator; choosing their values is called tuning
# 1. Grid search with GridSearchCV
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report 
from sklearn.model_selection import GridSearchCV    # the grid search class
# Each data point is an 8x8 pixel image of a digit.
digits = datasets.load_digits()
print(digits.data.shape)    # (1797, 64)
print(digits.data[:5,:])
# import matplotlib.pyplot as plt 
# plt.gray() 
# plt.matshow(digits.images[3]) 
# plt.show() 
n_samples = len(digits.images)
print(n_samples)    # 1797
X = digits.data
y = digits.target
print(y[:5])    # [0 1 2 3 4], the digit each sample represents
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
param_grid = [{
    'kernel':['rbf'],
    'gamma':[1e-3, 1e-4],
    'C':[1, 10,100,1000]
    },{
        'kernel':['linear'],
        'C':[1, 10,100,1000]
        }]
scores =['precision', 'recall']
for score in scores :
    print('score %s' %score)
    print('------')
    clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='%s_macro'%score)   
    clf.fit(X_train, y_train)
    print('best params is :')
    print(clf.best_params_)
    print('grid score')
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print('%.3f (+-/%0.03f) for %r'%(mean, std*2, params))
    print('-----')
    print('classification report')
    y_true, y_pred = y_test, clf.predict(X_test)
    report = classification_report(y_true, y_pred)
    print(report)

Output:

score precision
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score
0.986 (+-/0.016) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.959 (+-/0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.982 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.982 (+-/0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.988 (+-/0.017) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.982 (+-/0.025) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 1}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 10}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 100}
0.975 (+-/0.014) for {'kernel': 'linear', 'C': 1000}
-----
classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92
avg / total       0.99      0.99      0.99       899
score recall
------
best params is :
{'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
grid score
0.986 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1}
0.957 (+-/0.029) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 10}
0.981 (+-/0.028) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 10}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 100}
0.981 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 100}
0.987 (+-/0.019) for {'kernel': 'rbf', 'gamma': 0.001, 'C': 1000}
0.981 (+-/0.026) for {'kernel': 'rbf', 'gamma': 0.0001, 'C': 1000}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 1}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 10}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 100}
0.972 (+-/0.012) for {'kernel': 'linear', 'C': 1000}
-----
classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00        89
          1       0.97      1.00      0.98        90
          2       0.99      0.98      0.98        92
          3       1.00      0.99      0.99        93
          4       1.00      1.00      1.00        76
          5       0.99      0.98      0.99       108
          6       0.99      1.00      0.99        89
          7       0.99      1.00      0.99        78
          8       1.00      0.98      0.99        92
          9       0.99      0.99      0.99        92
avg / total       0.99      0.99      0.99       899
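The script above runs the whole search once per metric. As a possible alternative (supported since scikit-learn 0.19, and not what the original script does), a single search can record several metrics at once by passing a dict as scoring and naming the metric that selects the best candidate in refit. The dict keys 'precision' and 'recall' are labels chosen here for illustration, and the grid is reduced to keep the run short.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Reduced grid (an assumption made to shorten the example).
param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]}

# Dict keys are free-form labels; values are scoring names.
scoring = {'precision': 'precision_macro', 'recall': 'recall_macro'}

# refit names the metric used to pick best_params_ and best_estimator_.
clf = GridSearchCV(SVC(), param_grid, cv=5, scoring=scoring,
                   refit='precision')
clf.fit(X_train, y_train)

# With multiple metrics, result keys are suffixed with the dict labels.
results = clf.cv_results_
for p, r, params in zip(results['mean_test_precision'],
                        results['mean_test_recall'],
                        results['params']):
    print('precision %.3f  recall %.3f  for %r' % (p, r, params))
print(clf.best_params_)
```

This fits each candidate only once per fold, at the cost of having to choose a single metric for the final refit.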

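Attributes that are useful in practice, where clf denotes our GridSearchCV object:

clf.best_params_ returns the best parameter combination

clf.best_score_ returns the best mean cross-validated test score; its value is the same as clf.cv_results_['mean_test_score'][clf.best_index_].

clf.best_index_ returns the index, within the cv_results_ arrays, of the candidate with the best score

clf.best_estimator_ returns the best model

grid_scores_ has been deprecated since sklearn 0.18; use cv_results_ (below) instead

clf.cv_results_ returns the results of the cross-validated search. It is itself a dict with many entries; here is what clf.cv_results_.keys() contains for the search above: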

dict_keys(
['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time', 
'param_C', 'param_gamma', 'param_kernel', 'params', 
'split0_test_score', 'split1_test_score', 'split2_test_score', 'split3_test_score', 'split4_test_score',
'mean_test_score', 'std_test_score', 'rank_test_score', 
'split0_train_score', 'split1_train_score', 'split2_train_score', 'split3_train_score', 'split4_train_score', 
'mean_train_score', 'std_train_score'] )
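The relationships between these attributes can be checked with a short sketch. It reuses the digits data and SVC from the script above, with a reduced grid (an assumption made to keep the run short), and lists the candidates by rank_test_score.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = datasets.load_digits(return_X_y=True)
param_grid = {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10]}
clf = GridSearchCV(SVC(), param_grid, cv=5, scoring='precision_macro')
clf.fit(X, y)

results = clf.cv_results_

# best_score_ is the mean test score of the candidate at best_index_.
same = np.isclose(clf.best_score_,
                  results['mean_test_score'][clf.best_index_])
print(same)

# best_params_ is the parameter dict of that same candidate.
print(clf.best_params_ == results['params'][clf.best_index_])

# List candidates from best to worst using rank_test_score (1 = best).
order = np.argsort(results['rank_test_score'])
for i in order:
    print(results['rank_test_score'][i],
          '%.3f' % results['mean_test_score'][i],
          results['params'][i])
```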