The Most Detailed XGBoost Hands-On Guide Ever (Part 2)

yespon 2017-11-23


Author: 章華燕

Editor: 田旭

4

XGBoost Parameters in Detail

Before running XGBoost, three types of parameters must be set: general parameters, booster parameters, and task parameters:

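General parameters
Control which booster is used during boosting; the common boosters are the tree model (tree) and the linear model (linear model).
Booster parameters
Depend on which booster has been chosen.
Task parameters
Control the learning scenario; for example, a regression task may use different parameters than a ranking task.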
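
As a rough illustration of the three groups, the parameters can be kept in three separate dictionaries and merged before training. The grouping and the specific values below are illustrative assumptions, not recommended settings.

```python
# Illustrative grouping only; the values are placeholders, not tuned settings.
general_params = {"booster": "gbtree", "nthread": 4}
booster_params = {"eta": 0.1, "max_depth": 6, "subsample": 0.7}
task_params = {"objective": "binary:logistic", "eval_metric": "auc"}

# Merge the three groups into the single parameter dict used for training.
params = {**general_params, **booster_params, **task_params}
print(params["booster"], params["eta"], params["objective"])
```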

General Parameters

booster [default=gbtree]
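Two models are available: gbtree and gblinear. gbtree boosts with tree-based models, while gblinear boosts with linear models. The default is gbtree.
silent [default=0]
0 prints runtime messages; 1 runs in silent mode and prints nothing. The default is 0.
nthread
Number of threads XGBoost uses at run time. The default is the maximum number of threads available on the current system.
num_pbuffer
Size of the prediction buffer, normally equal to the number of training instances. The buffer stores the prediction results of the last boosting step. XGBoost sets it automatically; there is no need to set it by hand.
num_feature
Feature dimension used in boosting, equal to the number of features. XGBoost sets it automatically; there is no need to set it by hand.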

Parameters for Tree Booster

eta [default=0.3]
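Step-size shrinkage used in each update to prevent overfitting. After each boosting step the algorithm directly obtains the weights of the new features; eta shrinks these weights to make the boosting process more conservative. The default is 0.3.
Range: [0,1]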
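
To see why a smaller eta is more conservative, consider an idealized case (an assumption made purely for illustration): squared-error loss, with every new tree fitting the current residual exactly. Each round then removes only a fraction eta of whatever residual is left.

```python
def residual_after_rounds(initial_residual, eta, rounds):
    """Residual left when each round's tree matches the residual exactly
    and its output is scaled by eta before being added to the prediction."""
    residual = initial_residual
    for _ in range(rounds):
        tree_output = residual          # idealized perfect fit
        residual -= eta * tree_output   # shrink the step by eta
    return residual

for eta in (1.0, 0.3, 0.1):
    print(eta, residual_after_rounds(10.0, eta, rounds=5))
```

With eta = 1 the residual disappears in one step; with smaller values more rounds are needed, which is why a small eta is usually paired with a larger number of boosting rounds.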
gamma [default=0]
Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger the value, the more conservative the algorithm.
Range: [0,∞]
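
The role of gamma can be sketched with the second-order split gain from the XGBoost paper, where G and H are the sums of the first and second derivatives of the loss over the instances in a node. The function name and the numbers below are made up for illustration.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into two children, minus gamma."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma

# The same candidate split is accepted or rejected depending on gamma.
for gamma in (0.0, 1.0, 5.0):
    gain = split_gain(-4.0, 3.0, 5.0, 4.0, gamma=gamma)
    print(gamma, gain, "split" if gain > 0 else "no split")
```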
max_depth [default=6]
Maximum depth of a tree. The default is 6.
Range: [1,∞]
min_child_weight [default=1]
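Minimum sum of instance weights required in a child node. If a split would produce a leaf whose sum of instance weights is below min_child_weight, that node is not split any further. In linear regression this simply corresponds to the minimum number of instances required in each node. The larger the value, the more conservative the algorithm.
Range: [0,∞]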
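
In the XGBoost documentation the "sum of instance weights" is the sum of the second derivatives (hessians) of the loss in the node. A small sketch under that reading, with hypothetical helper names: for squared error each instance contributes 1, so the sum is just the instance count, while for logistic loss each instance contributes p(1 - p).

```python
def hessian_sum_squared_error(n_instances):
    # Hessian of 0.5 * (y - pred)^2 with respect to pred is 1 per instance.
    return float(n_instances)

def hessian_sum_logistic(probabilities):
    # Hessian of the logistic loss is p * (1 - p) per instance.
    return sum(p * (1.0 - p) for p in probabilities)

min_child_weight = 3
confident_leaf = [0.99] * 20   # 20 instances the model is already sure about

# Squared error: 20 instances easily clear the threshold.
print(hessian_sum_squared_error(20) >= min_child_weight)
# Logistic loss: the same 20 instances carry very little hessian mass.
print(hessian_sum_logistic(confident_leaf) >= min_child_weight)
```

This is why the same min_child_weight can behave very differently for regression and for classification.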
max_delta_step [default=0]
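Maximum delta step allowed for each tree's weight estimation. A value of 0 means no constraint; a positive value makes the update step more conservative. This parameter is usually unnecessary, but it may help in logistic regression when the classes are extremely imbalanced. Setting it to a value between 1 and 10 may help control the update.
Range: [0,∞]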
subsample [default=1]
Fraction of the full training set used to train each tree. Setting it to 0.5 means XGBoost randomly draws 50% of the samples to build each tree, which helps prevent overfitting.
Range: (0,1]
colsample_bytree [default=1]
Fraction of features sampled when building each tree. The default is 1.
Range: (0,1]
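
The two sampling parameters, subsample and colsample_bytree, can be pictured as choosing a subset of row indices and a subset of column indices before each tree is built. The helper below is a hypothetical sketch using Python's standard random module; XGBoost's internal sampling differs in its details.

```python
import random

def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, seed=0):
    """Pick the row and column indices one tree would be built on (illustrative)."""
    rng = random.Random(seed)
    n_rows_kept = max(1, int(n_rows * subsample))
    n_cols_kept = max(1, int(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_rows_kept))
    cols = sorted(rng.sample(range(n_cols), n_cols_kept))
    return rows, cols

rows, cols = sample_rows_and_columns(
    n_rows=10, n_cols=4, subsample=0.5, colsample_bytree=0.5)
print(rows, cols)
```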

Parameter for Linear Booster

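lambda [default=0]
L2 regularization penalty coefficient.
alpha [default=0]
L1 regularization penalty coefficient.
lambda_bias
L2 regularization on the bias term. The default is 0 (there is no L1 regularization on the bias, because it is not important there).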

Task Parameters

objective [ default=reg:linear ]
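Defines the learning task and the corresponding learning objective. The available objective functions are:
“reg:linear”: linear regression.
“reg:logistic”: logistic regression.
“binary:logistic”: logistic regression for binary classification; outputs a probability.
“binary:logitraw”: logistic regression for binary classification; outputs the raw score w^T x before the logistic transformation.
“count:poisson”: Poisson regression for count data; outputs the mean of the Poisson distribution. In Poisson regression, max_delta_step defaults to 0.7 (used to safeguard optimization).
“multi:softmax”: makes XGBoost handle multiclass classification with the softmax objective; the parameter num_class (number of classes) must also be set.
“multi:softprob”: same as softmax, but the output is a vector of ndata * nclass values, which can be reshaped into a matrix with ndata rows and nclass columns. Each row holds the probabilities of that sample belonging to each class.
“rank:pairwise”: makes XGBoost perform a ranking task by minimizing the pairwise loss.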
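
The difference between “multi:softmax” and “multi:softprob” can be sketched in plain Python: both apply the softmax function to the raw per-class scores, but softprob reports the whole probability row while softmax reports only the index of the largest entry. The raw scores below are made up for illustration.

```python
import math

def softmax(scores):
    """Convert raw per-class scores into probabilities that sum to 1."""
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

raw_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]   # 2 samples x 3 classes

softprob = [softmax(row) for row in raw_scores]              # one probability row per sample
softmax_labels = [row.index(max(row)) for row in softprob]   # most probable class per sample

print(softprob)
print(softmax_labels)
```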
base_score [ default=0.5 ]
Initial prediction score of all instances (the global bias). With a sufficient number of iterations, changing this value has little effect.
eval_metric [ default according to objective ]
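Evaluation metric for the validation data. Each objective has a default metric (rmse for regression, error for classification, mean average precision for ranking).
Users can add multiple evaluation metrics. Python users should pass the parameters as a list of parameter pairs rather than a map, so that a later 'eval_metric' entry does not override an earlier one.
The available choices are: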
“rmse”: root mean square error
“logloss”: negative log-likelihood
“error”: Binary classification error rate. It is calculated as #(wrong cases)/#(all cases). For the predictions, the evaluation will regard the instances with prediction value larger than 0.5 as positive instances, and the others as negative instances.
“merror”: Multiclass classification error rate.
“mlogloss”: Multiclass logloss.
“auc”: Area under the curve for ranking evaluation.
“ndcg”: Normalized Discounted Cumulative Gain
“map”: Mean average precision
“ndcg@n”, “map@n”: n can be assigned as an integer to cut off the top positions in the lists for evaluation.
“ndcg-”, “map-”, “ndcg@n-”, “map@n-”: In XGBoost, NDCG and MAP evaluate the score of a list without any positive samples as 1. Appending “-” to the metric name makes XGBoost evaluate these scores as 0, to be consistent under some conditions.
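
Two of the metrics above, “error” and “logloss”, are simple enough to write out by hand. The sketch below follows the definitions given in the list (threshold 0.5 for error, negative log-likelihood for logloss); the function names and sample values are illustrative, and XGBoost computes these metrics internally.

```python
import math

def error_rate(labels, preds, threshold=0.5):
    """#(wrong cases) / #(all cases); predictions above the threshold count as positive."""
    wrong = sum(1 for y, p in zip(labels, preds) if (p > threshold) != (y == 1))
    return wrong / len(labels)

def logloss(labels, preds, eps=1e-15):
    """Mean negative log-likelihood for binary labels in {0, 1}."""
    total = 0.0
    for y, p in zip(labels, preds):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)

labels = [1, 0, 1, 0]
preds = [0.9, 0.4, 0.3, 0.6]
print(error_rate(labels, preds), logloss(labels, preds))
```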
seed [ default=0 ]
Seed for the random number generator. The default is 0.
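
The advice under eval_metric about passing a list of pairs rather than a map comes down to how Python dictionaries work: a dict keeps one value per key, so a second 'eval_metric' silently replaces the first. A minimal sketch, with placeholder parameter values:

```python
params = {"objective": "binary:logistic", "eta": 0.1}

# A dict can hold only one value per key, so the second assignment replaces the first.
as_dict = dict(params)
as_dict["eval_metric"] = "auc"
as_dict["eval_metric"] = "logloss"

# A list of (name, value) pairs keeps both entries.
as_pairs = list(params.items()) + [("eval_metric", "auc"), ("eval_metric", "logloss")]

print(as_dict["eval_metric"])
print([value for name, value in as_pairs if name == "eval_metric"])
```

The list of pairs is then passed to the native training call in place of the dict.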

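5

XGBoost in Practice

XGBoost offers two kinds of interface, the native XGBoost interface and the scikit-learn interface, and it can handle both classification and regression tasks. This chapter therefore walks through four cases.

Classification with the native XGBoost interface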

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# read in the iris data
iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234565)

params = {
    'booster': 'gbtree',
    'objective': 'multi:softmax',
    'num_class': 3,
    'gamma': 0.1,
    'max_depth': 6,
    'lambda': 2,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

# Materialize the parameters as a list of (name, value) pairs
plst = list(params.items())

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 500
model = xgb.train(plst, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

# Compute the accuracy
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print('Accuracy: %.2f %% ' % (100 * cnt1 / (cnt1 + cnt2)))

# Plot feature importance
plot_importance(model)
plt.show()

The script prints the prediction accuracy and plots the feature importance:

Accuracy: 96.67 %

Regression with the native XGBoost interface

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Read the raw data from the file:
# column 8 holds the label, columns 10 onward hold the features
data = []
labels = []
with open('lppz5.csv', encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]

# Train with XGBoost
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

params = {
    'booster': 'gbtree',
    'objective': 'reg:gamma',
    'gamma': 0.1,
    'max_depth': 5,
    'lambda': 3,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    'min_child_weight': 3,
    'silent': 1,
    'eta': 0.1,
    'seed': 1000,
    'nthread': 4,
}

dtrain = xgb.DMatrix(X_train, y_train)
num_rounds = 300
plst = list(params.items())
model = xgb.train(plst, dtrain, num_rounds)

# Predict on the test set
dtest = xgb.DMatrix(X_test)
ans = model.predict(dtest)

# Plot feature importance
plot_importance(model)
plt.show()

The script plots the feature importance (the larger the value, the more important the feature).
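
The regression script stops at the predictions and does not score them. One way to do so, sketched here with hypothetical helper names and made-up numbers standing in for y_test and ans, is to compute the root mean squared error and the mean absolute error:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up numbers standing in for y_test and ans from the script above.
y_true_demo = [3.0, 5.0, 2.0, 7.0]
y_pred_demo = [2.0, 5.0, 4.0, 7.0]
print(rmse(y_true_demo, y_pred_demo), mae(y_true_demo, y_pred_demo))
```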

Classification with the scikit-learn interface

from sklearn.datasets import load_iris
import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# read in the iris data
iris = load_iris()

X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the model
model = xgb.XGBClassifier(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='multi:softmax')
model.fit(X_train, y_train)

# Predict on the test set
ans = model.predict(X_test)

# Compute the accuracy
cnt1 = 0
cnt2 = 0
for i in range(len(y_test)):
    if ans[i] == y_test[i]:
        cnt1 += 1
    else:
        cnt2 += 1

print('Accuracy: %.2f %% ' % (100 * cnt1 / (cnt1 + cnt2)))

# Plot feature importance
plot_importance(model)
plt.show()

The script prints the prediction accuracy and plots the feature importance:

Accuracy: 100.00 %

Regression with the scikit-learn interface

import xgboost as xgb
from xgboost import plot_importance
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split

# Read the raw data from the file:
# column 8 holds the label, columns 10 onward hold the features
data = []
labels = []
with open('lppz5.csv', encoding='UTF-8') as fileObject:
    for line in fileObject:
        line_split = line.split(',')
        data.append(line_split[10:])
        labels.append(line_split[8])

X = []
for row in data:
    row = [float(x) for x in row]
    X.append(row)

y = [float(x) for x in labels]

# Train with XGBoost
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = xgb.XGBRegressor(max_depth=5, learning_rate=0.1, n_estimators=160, silent=True, objective='reg:gamma')
model.fit(X_train, y_train)

# Predict on the test set
ans = model.predict(X_test)

# Plot feature importance
plot_importance(model)
plt.show()
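
Both regression scripts split each line of lppz5.csv by hand and will fail on a header line or an empty cell. A somewhat more defensive loader can be sketched with the standard csv module. The column positions (label in column 8, features from column 10) are taken from the scripts above; the function name and the in-memory sample data are hypothetical stand-ins for the real file.

```python
import csv
import io

def load_features_and_labels(lines, label_col=8, first_feature_col=10):
    """Parse CSV rows into (X, y), skipping rows that are short or non-numeric."""
    X, y = [], []
    for fields in csv.reader(lines):
        if len(fields) <= first_feature_col:
            continue                      # row too short to hold any feature
        try:
            label = float(fields[label_col])
            row = [float(v) for v in fields[first_feature_col:]]
        except ValueError:
            continue                      # e.g. a header line or an empty cell
        X.append(row)
        y.append(label)
    return X, y

# Hypothetical in-memory data standing in for the real file.
demo = io.StringIO(
    "a,b,c,d,e,f,g,h,target,j,f1,f2\n"
    "0,0,0,0,0,0,0,0,1.5,0,10,20\n"
    "0,0,0,0,0,0,0,0,2.5,0,30,40\n"
)
X, y = load_features_and_labels(demo)
print(X, y)
```

For the real data, an open file object can be passed in place of the StringIO buffer.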