Answers to: sklearn Boosting: cross-validation to find the optimal number of estimators without restarting every time



1 answer

Anonymous, 1 day ago


You can "hack" <code>AdaBoostClassifier</code> through inheritance so that it does not retrain estimators and stays compatible with many of the cross-validation functions in <code>sklearn</code> (it must be cross-validation that does not shuffle the data).

If you look at the source code in <code>sklearn.ensemble.weight_boosting.py</code>, you can see that retraining can be avoided by properly wrapping the behavior of <code>AdaBoostClassifier.fit()</code> and <code>AdaBoostClassifier._boost()</code>.

The problem with the cross-validation functions is that they clone the original estimator with <code>sklearn.base.clone()</code>, which makes deep copies of the estimator's parameters. Because of the deep copy, the estimator cannot "remember" its fitted estimators between cross-validation runs (<code>clone()</code> copies the contents of a reference, not the reference itself). The only way around this (at least the only way I can think of) is to use global state to keep track of old estimators between runs. The catch is that you have to compute a hash of your X features, which could be expensive!

Anyway, here is the hack of <code>AdaBoostClassifier</code> itself:

<pre class="lang-py prettyprint-override"><code>'''
adaboost_hack.py

Make a "hack" of AdaBoostClassifier in sklearn.ensemble.weight_boosting.py
that doesn't need to retrain estimators and is compatible with many sklearn
cross validation functions.
'''

import copy
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.base import clone

# Used to hold important variables between runs of cross validation.
# Note that sklearn cross validation functions use sklearn.base.clone()
# to make copies of the estimator sent to it as a function. The function
# sklearn.base.clone() makes deep copies of parameters of an estimator, so
# the only way to provide a way to remember previous estimators between
# cross validation runs is to use a global variable.
#
# We will use hash values of the split of X[:, 0] as keys for remembering
# previous estimators of a cv fold. Note, you can NOT use cross validators
# that randomly shuffle the data before splitting. This will cause different
# hashes.
kfold_hash = {}


class WarmRestartAdaBoostClassifier(AdaBoostClassifier):
    '''
    Keep track of old estimators, estimator weights, the estimator errors, and
    the next to last sample weight seen.

    Note that AdaBoostClassifier._boost() does NOT boost the last seen sample
    weight. Simple fix to this is to drop the last estimator and retrain it.

    Wrap AdaBoostClassifier.fit() to decide whether to throw away estimators or
    add estimators depending on the current number of estimators vs the number
    of old estimators. Also look at the possibility of using the global
    kfold_hash to get old values if use_kfold_hash == True.

    Wrap AdaBoostClassifier._boost() with behavior to record the next to last
    sample weight.
    '''

    def __init__(self, base_estimator=None, n_estimators=50, learning_rate=1.,
                 algorithm='SAMME.R', random_state=None,
                 next_to_last_sample_weight=None, old_estimators_=None,
                 use_kfold_hash=False):
        # old_estimators_ defaults to None rather than a shared mutable list;
        # fit() treats None and an empty list the same way.
        AdaBoostClassifier.__init__(self, base_estimator, n_estimators,
                                    learning_rate, algorithm, random_state)
        self.next_to_last_sample_weight = next_to_last_sample_weight
        self._last_sample_weight = None
        self.old_estimators_ = old_estimators_
        self.use_kfold_hash = use_kfold_hash

    def _boost(self, iboost, X, y, sample_weight, random_state):
        '''
        Record the sample weight.

        Parameters and return behavior same as that of
        AdaBoostClassifier._boost() as seen in
        sklearn.ensemble.weight_boosting.py

        Parameters
        ----------
        iboost : int
            The index of the current boost iteration.
        X : {array-like, sparse matrix} of shape = [n_samples, n_features]
            The training input samples. Sparse matrix can be CSC, CSR, COO,
            DOK, or LIL. COO, DOK, and LIL are converted to CSR.
        y : array-like of shape = [n_samples]
            The target values (class labels).
        sample_weight : array-like of shape = [n_samples]
            The current sample weights.
        random_state : RandomState
            The current random number generator

        Returns
        -------
        sample_weight : array-like of shape = [n_samples] or None
            The reweighted sample weights.
            If None then boosting has terminated early.
        estimator_weight : float
            The weight for the current boost.
            If None then boosting has terminated early.
        error : float
            The classification error for the current boost.
            If None then boosting has terminated early.
        '''
        fit_info = AdaBoostClassifier._boost(self, iboost, X, y, sample_weight,
                                             random_state)
        sample_weight, _, _ = fit_info
        self.next_to_last_sample_weight = self._last_sample_weight
        self._last_sample_weight = sample_weight
        return fit_info

    def fit(self, X, y):
        hash_X = None
        if self.use_kfold_hash:
            # Use a hash of X features in this kfold to access the global
            # information for this kfold.
            hash_X = hash(bytes(X[:, 0]))
            if hash_X in kfold_hash.keys():
                self.old_estimators_ = kfold_hash[hash_X]['old_estimators_']
                self.next_to_last_sample_weight = kfold_hash[hash_X]['next_to_last_sample_weight']
                self.estimator_weights_ = kfold_hash[hash_X]['estimator_weights_']
                self.estimator_errors_ = kfold_hash[hash_X]['estimator_errors_']

        # We haven't done any fits yet.
        if not self.old_estimators_:
            AdaBoostClassifier.fit(self, X, y)
            self.old_estimators_ = self.estimators_
        # The case that we throw away estimators.
        elif self.n_estimators < len(self.old_estimators_):
            self.estimators_ = self.old_estimators_[:self.n_estimators]
            self.estimator_weights_ = self.estimator_weights_[:self.n_estimators]
            self.estimator_errors_ = self.estimator_errors_[:self.n_estimators]
        # The case that we add new estimators.
        elif self.n_estimators > len(self.old_estimators_):
            n_more = self.n_estimators - len(self.old_estimators_)
            self.fit_more(X, y, n_more)

        # Record information in the global hash if necessary.
        if self.use_kfold_hash:
            kfold_hash[hash_X] = {'old_estimators_': self.old_estimators_,
                                  'next_to_last_sample_weight': self.next_to_last_sample_weight,
                                  'estimator_weights_': self.estimator_weights_,
                                  'estimator_errors_': self.estimator_errors_}

        return self

    def fit_more(self, X, y, n_more):
        '''
        Fits additional estimators.
        '''
        # Since AdaBoostClassifier._boost() doesn't boost the last sample
        # weight, we retrain the last estimator with its input sample weight.
        self.n_estimators = n_more + 1
        if self.old_estimators_ is None:
            raise Exception('Should have already fit estimators before calling fit_more()')
        self.old_estimators_ = self.old_estimators_[:-1]
        old_estimator_weights = self.estimator_weights_[:-1]
        old_estimator_errors = self.estimator_errors_[:-1]
        sample_weight = self.next_to_last_sample_weight
        AdaBoostClassifier.fit(self, X, y, sample_weight)
        self.old_estimators_.extend(self.estimators_)
        self.estimators_ = self.old_estimators_
        self.n_estimators = len(self.estimators_)
        self.estimator_weights_ = np.concatenate([old_estimator_weights, self.estimator_weights_])
        self.estimator_errors_ = np.concatenate([old_estimator_errors, self.estimator_errors_])
</code></pre>

Here is an example that lets you compare the timing and accuracy of the hack against the original <code>AdaBoostClassifier</code>. Note that the time to test the hack increases as we add estimators, but the training time does not. I found the hack to run much faster than the original, but I was not hashing a large number of X samples.

^{pr2}$
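The two constraints behind the hack (<code>clone()</code> discards fitted state, and fold hashes are only stable without shuffling) can be checked on their own. The snippet below is a minimal sketch and not part of the answer's code. The toy data, the seeds, and the helper name `fold_keys` are assumptions made for illustration, and it uses `sklearn.model_selection.KFold` from current scikit-learn releases.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import KFold

# Toy data; shapes and seed are arbitrary choices for illustration.
rng = np.random.RandomState(0)
X = rng.rand(30, 4)
y = (X[:, 0] > 0.5).astype(int)

# 1) clone() builds a new, unfitted estimator from copied parameters,
#    so fitted attributes such as estimators_ do not survive it.
fitted = AdaBoostClassifier(n_estimators=5, random_state=0).fit(X, y)
cloned = clone(fitted)
clone_is_unfitted = not hasattr(cloned, "estimators_")


def fold_keys(cv):
    """Hypothetical helper: hash the first column of each training fold."""
    return [hash(X[train_idx, 0].tobytes()) for train_idx, _ in cv.split(X)]


# 2) Without shuffling, KFold yields the same folds on every call, so the
#    hash of a training fold's first column is a stable lookup key.
keys_run1 = fold_keys(KFold(n_splits=3))
keys_run2 = fold_keys(KFold(n_splits=3))
stable_without_shuffle = keys_run1 == keys_run2

# 3) With shuffling and different seeds the folds differ, so the keys no
#    longer match and cached estimators would never be found again.
keys_a = fold_keys(KFold(n_splits=3, shuffle=True, random_state=1))
keys_b = fold_keys(KFold(n_splits=3, shuffle=True, random_state=2))
stable_with_shuffle = keys_a == keys_b
```

Hashing only the first column keeps the key cheap to compute, at the cost of assuming that two different folds never share an identical first column.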
