[Machine Learning] The Random Forest Algorithm

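Random Forest is an ensemble learning algorithm that combines the outputs of multiple decision trees to improve the accuracy and stability of predictions. It is widely used for both classification and regression, and is particularly well suited to data with nonlinear relationships between features or with noise.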

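In this article, we explain how random forests work and implement a basic random forest for regression in NumPy. We then show how to build the same model with Scikit-Learn.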

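How Random Forests Work

A random forest is an ensemble model made up of many decision trees. It is built in the following steps:

Bootstrap Sampling:
- Draw samples from the original dataset at random, with replacement, to create multiple different datasets. The same row may appear more than once in a dataset.
- Train one decision tree on each dataset.
Random Feature Selection:
- At each node split, randomly select a subset of the features and, among them, choose the feature and threshold that give the best split.
- This lowers the correlation between the trees and improves the model's ability to generalize.
Model Aggregation:
- For classification, the final class is decided by majority vote across the trees.
- For regression, the final prediction is the average of the trees' predictions.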
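
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than part of the implementation that follows: the sample size, the feature counts, and the per-tree outputs are hypothetical values chosen only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Step 1: bootstrap sampling. Draw n row indices with replacement.
n_samples = 1000
indices = rng.choice(n_samples, size=n_samples, replace=True)
unique_fraction = len(np.unique(indices)) / n_samples
# On average about 1 - 1/e (roughly 63%) of the rows appear at least once.
print("fraction of distinct rows in the bootstrap sample:", unique_fraction)

# Step 2: random feature selection. Pick the candidate features for one split.
n_features = 9
max_features = 3  # hypothetical choice for illustration
candidate_features = rng.choice(n_features, size=max_features, replace=False)
print("candidate features for this split:", candidate_features)

# Step 3a: aggregation for classification by majority vote.
class_votes = np.array([1, 0, 1, 1, 2])  # hypothetical labels from five trees
majority_class = np.bincount(class_votes).argmax()
print("majority vote:", majority_class)  # 1

# Step 3b: aggregation for regression by averaging.
tree_outputs = np.array([2.0, 2.5, 3.0, 2.5])  # hypothetical outputs from four trees
average_prediction = tree_outputs.mean()
print("average prediction:", average_prediction)  # 2.5
```

Because each bootstrap sample leaves out roughly a third of the rows, every tree sees a slightly different dataset, which is what makes the averaged result more stable than any single tree.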

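The advantages of random forests include:

- Averaging many trees reduces the variance of the model and improves generalization.
- A random forest is less prone to overfitting than a single decision tree, especially on large, noisy datasets.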
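
The variance reduction can be made concrete with a standard result, stated here under simplifying assumptions that the article itself does not make: the $B$ tree predictions $T_1, \dots, T_B$ are identically distributed with variance $\sigma^2$, and every pair has the same correlation $\rho$.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}
```

The second term shrinks as more trees are added, while the first term is limited by how correlated the trees are. This is why random feature selection, which lowers $\rho$, helps in addition to bootstrap sampling.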

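Building a Random Forest for Regression

We implement a simple random forest regression model step by step.

Data Generation

First, we generate some synthetic data so that we can test the model later.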

import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data
np.random.seed(0)
X = np.random.rand(100, 1) * 10  # feature
y = 2 * X.flatten() + np.sin(X.flatten()) * 5 + np.random.randn(100) * 0.5  # target

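Building a Single Decision Tree

In a random forest, we need to build many decision trees on bootstrap-sampled data. Here we implement the basic construction of a regression tree, using squared error as the split criterion: each split minimizes the total squared error of the two child nodes, which is equivalent to minimizing their sample-weighted mean squared error (MSE).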

# Split criterion: sum of squared errors (SSE), i.e. n * variance.
# Minimizing the children's total SSE is the same as minimizing their weighted MSE.
def sum_squared_error(y):
    return np.var(y) * len(y)

# Split the dataset on one feature at a threshold
def split_dataset(X, y, feature, threshold):
    left_mask = X[:, feature] <= threshold
    right_mask = ~left_mask
    return X[left_mask], y[left_mask], X[right_mask], y[right_mask]

# Find the best split feature and threshold
def best_split(X, y):
    best_sse = float("inf")
    best_feature, best_threshold = None, None
    for feature in range(X.shape[1]):
        thresholds = np.unique(X[:, feature])
        for threshold in thresholds:
            _, y_left, _, y_right = split_dataset(X, y, feature, threshold)
            # Skip splits that leave one side empty
            if len(y_left) == 0 or len(y_right) == 0:
                continue
            sse_split = sum_squared_error(y_left) + sum_squared_error(y_right)
            if sse_split < best_sse:
                best_sse = sse_split
                best_feature = feature
                best_threshold = threshold
    return best_feature, best_threshold

# Regression tree
class RegressionTree:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.tree = None

    def _build(self, X, y, depth):
        # Return a leaf (the mean target) when a stopping condition is met
        if len(y) < self.min_samples_split or depth >= self.max_depth:
            return np.mean(y)
        feature, threshold = best_split(X, y)
        if feature is None:
            return np.mean(y)
        left_X, left_y, right_X, right_y = split_dataset(X, y, feature, threshold)
        left_node = self._build(left_X, left_y, depth + 1)
        right_node = self._build(right_X, right_y, depth + 1)
        return {"feature": feature, "threshold": threshold, "left": left_node, "right": right_node}

    def fit(self, X, y):
        # Store the root once, so a tree that is a single leaf is also kept
        self.tree = self._build(X, y, 0)
        return self

    def predict_sample(self, x, tree):
        # A non-dict node is a leaf holding the predicted value
        if not isinstance(tree, dict):
            return tree
        if x[tree["feature"]] <= tree["threshold"]:
            return self.predict_sample(x, tree["left"])
        else:
            return self.predict_sample(x, tree["right"])

    def predict(self, X):
        return np.array([self.predict_sample(x, self.tree) for x in X])
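
To see how the split search behaves, here is a small worked example on five hand-made points. It is a self-contained sketch: the toy data and the helper name `sum_squared_error` are chosen for illustration, and the criterion is the same n-times-variance quantity used in the tree code above.

```python
import numpy as np

def sum_squared_error(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    return float(np.sum((values - values.mean()) ** 2))

# Hypothetical toy data: the target jumps between x = 3 and x = 4.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.0, 1.0, 1.0, 5.0, 5.0])

scores = {}
# The largest x is skipped because it would leave the right side empty.
for threshold in np.unique(x)[:-1]:
    left, right = y[x <= threshold], y[x > threshold]
    scores[float(threshold)] = sum_squared_error(left) + sum_squared_error(right)

best_threshold = min(scores, key=scores.get)
print(scores)
print("best threshold:", best_threshold)  # 3.0
```

Splitting at 3.0 separates the two plateaus exactly, so both children have zero squared error and that threshold wins.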

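Building the Random Forest Model

Building on the decision tree above, we construct the random forest by repeated bootstrap sampling, training one tree on each sample. Averaging the predictions of the trees makes the model more stable. Note that this basic version only uses bootstrap sampling. It does not subsample features at each split, which makes no difference here because the data has a single feature.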

class RandomForestRegressor:
    def __init__(self, n_estimators=10, max_depth=3, min_samples_split=2):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.trees = []

    def bootstrap_sample(self, X, y):
        indices = np.random.choice(len(y), len(y), replace=True)
        return X[indices], y[indices]

    def fit(self, X, y):
        self.trees = []
        for _ in range(self.n_estimators):
            X_sample, y_sample = self.bootstrap_sample(X, y)
            tree = RegressionTree(max_depth=self.max_depth, min_samples_split=self.min_samples_split)
            tree.fit(X_sample, y_sample)
            self.trees.append(tree)

    def predict(self, X):
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.mean(tree_predictions, axis=0)
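
The class above searches every feature at every split. For data with several features, the random feature selection step described earlier could be added along the following lines. This is a sketch under assumptions, not the article's implementation: the function name `best_split_random_features`, the `max_features` parameter, and the demo data are all introduced here for illustration.

```python
import numpy as np

def best_split_random_features(X, y, max_features, rng):
    """Search for the best split among a random subset of the features.

    Hypothetical helper: max_features and rng are names introduced for
    this sketch and do not appear in the article's code.
    """
    n_features = X.shape[1]
    k = min(max_features, n_features)
    candidates = rng.choice(n_features, size=k, replace=False)
    best_score, best_feature, best_threshold = float("inf"), None, None
    for feature in candidates:
        # Drop the largest value: it would leave the right child empty.
        for threshold in np.unique(X[:, feature])[:-1]:
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            # Total squared error of the two children (n * variance per side).
            score = np.var(left) * len(left) + np.var(right) * len(right)
            if score < best_score:
                best_score = score
                best_feature, best_threshold = int(feature), float(threshold)
    return best_feature, best_threshold

rng = np.random.default_rng(0)
X_demo = rng.random((200, 4))
y_demo = 3.0 * (X_demo[:, 2] > 0.5)  # hypothetical data: only feature 2 matters

# With all four features as candidates, the informative feature is found.
feature_all, threshold_all = best_split_random_features(X_demo, y_demo, 4, rng)
print(feature_all, threshold_all)

# With a single candidate, the split uses whichever feature was drawn.
feature_one, _ = best_split_random_features(X_demo, y_demo, 1, rng)
print(feature_one)
```

Restricting the candidates means some splits are made on weaker features, so individual trees get slightly worse, but they also differ more from one another, which is the point of the technique.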

Training and Prediction

Next, we fit the random forest to the data and plot its predictions:

# Initialize and train the random forest
forest = RandomForestRegressor(n_estimators=50, max_depth=4, min_samples_split=5)
forest.fit(X, y)

# Predict and plot the results
X_test = np.linspace(0, 10, 100).reshape(-1, 1)
y_pred = forest.predict(X_test)

plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred, color="red", label="Random forest prediction")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Random Forest Regression")
plt.legend()
plt.show()

Random Forests with Scikit-Learn

Scikit-Learn provides an easy-to-use RandomForestRegressor for quickly building and testing random forest models. We can use it to check our manual implementation.

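# Import under aliases so the Scikit-Learn names do not shadow the
# hand-written RandomForestRegressor class (or any helper of the same name)
from sklearn.ensemble import RandomForestRegressor as SkRandomForestRegressor
from sklearn.metrics import mean_squared_error as sk_mean_squared_error

# Random forest from Scikit-Learn
regressor = SkRandomForestRegressor(n_estimators=50, max_depth=4, min_samples_split=5, random_state=0)
regressor.fit(X, y)

# Predict, and compute the MSE on the training data
y_pred_sklearn = regressor.predict(X_test)
mse = sk_mean_squared_error(y, regressor.predict(X))
print("Training MSE:", mse)

# Visualize
plt.scatter(X, y, color="blue", label="Training data")
plt.plot(X_test, y_pred_sklearn, color="green", label="Scikit-Learn random forest prediction")
plt.xlabel("Feature")
plt.ylabel("Target")
plt.title("Scikit-Learn Random Forest Prediction")
plt.legend()
plt.show()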
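
The error printed above is measured on the same data the model was trained on, so it tends to look optimistic. A held-out estimate can be obtained roughly as follows. This is a sketch under assumptions: the data is regenerated with the same functional form as earlier but with 200 points and a different random generator, and the 25% test fraction is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor as SkRandomForestRegressor
from sklearn.metrics import mean_squared_error as sk_mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data with the same functional form as in the article
# (sample size and generator are assumptions made for this sketch).
rng = np.random.default_rng(0)
X_all = rng.random((200, 1)) * 10
y_all = 2 * X_all.ravel() + np.sin(X_all.ravel()) * 5 + rng.normal(size=200) * 0.5

# Hold out a quarter of the rows for evaluation.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

model = SkRandomForestRegressor(
    n_estimators=50, max_depth=4, min_samples_split=5, random_state=0
)
model.fit(X_train, y_train)

train_mse = sk_mean_squared_error(y_train, model.predict(X_train))
holdout_mse = sk_mean_squared_error(y_holdout, model.predict(X_holdout))
print("training MSE:", train_mse)
print("held-out MSE:", holdout_mse)
```

Comparing the two numbers gives a rough sense of how much the model's fit depends on having seen the data.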

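Summary

This article explained how random forests work, from the basic concepts to building decision trees on bootstrap samples, implemented a random forest for regression by hand, and compared it with Scikit-Learn's RandomForestRegressor. The strength of random forests lies in their ensemble strategy: averaging many trees that differ from one another improves generalization and reduces the risk of overfitting.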