基于机器学习的黄金价格预测案例

在这篇文章中，我们将从头开始做一个关于黄金价格预测的项目。要构建任何数据科学项目，我们必须遵循某些步骤，这些步骤不需要以相同的顺序进行。在我们的项目中，我们将按顺序完成这些步骤。

1. 问题表述

问题表述是我们在开始任何项目之前所做的最重要的步骤之一。必须对我们的数据科学项目的目标有一个明确的想法。在我们的例子中，这个项目的目标是分析黄金的价格。黄金的价格是不稳定的，它们随着时间的推移而迅速变化。我们这个项目的主要目的是预测每单位黄金的价格。

导入库

我们将在一个地方导入本文中使用的所有库，这样就不必每次使用时都导入，这将保存我们的时间和精力。

Pandas -一个构建在NumPy之上的Python库，用于有效的矩阵乘法和矩阵操作，它也用于数据清理，数据合并，数据整形和数据聚合
Numpy -一个用于数值数学计算和处理多维ndarray的Python库，它也有一个非常大的数学函数集合来操作这个数组
Matplotlib -它用于绘制2D和3D可视化图，它还支持各种输出格式，包括图形
Seaborn - seaborn库是在Matplotlib之上创建的，用于绘制漂亮的图。

import numpy as np 
import pandas as pd 
import seaborn as sns 
import matplotlib.pyplot as pltimport warnings
warnings.filterwarnings("ignore") 
sns.set_style("darkgrid", {"grid.color": ".6", "grid.linestyle": ":"})from sklearn.preprocessing import StandardScaler 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline 
from sklearn.linear_model import Lassofrom sklearn.ensemble import RandomForestRegressor 
from xgboost import XGBRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import GridSearchCV

加载数据集

# read dataset using pndas function
# use parse_dates argument to change datetime dtype
# 数据集链接: https://pan.baidu.com/s/1RiOzXhG4zzM6xeWqcmqY6g?pwd=htf4 提取码: htf4 
dataset = pd.read_csv("gold_price_data.csv",parse_dates=["Date"])


# information about the dataset
dataset.info()

输出

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2290 entries, 0 to 2289
Data columns (total 6 columns):#   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         0   Date     2290 non-null   datetime64[ns]1   SPX      2290 non-null   float64       2   GLD      2290 non-null   float64       3   USO      2290 non-null   float64       4   SLV      2290 non-null   float64       5   EUR/USD  2290 non-null   float64       
dtypes: datetime64[ns](1), float64(5)
memory usage: 107.5 KB

2. 数据预处理

缺失值/重复值

缺失值对我们的模型训练有非常严重的影响。有些模型，如线性回归，不适合含有缺失值的数据集。然而，有一些模型即使在缺失数据集，如随机森林，也能很好地工作。但在处理数据集时，首先处理缺失值始终是一个好的做法。此外，需要注意的一点是，当我们使用pandas加载数据时，它会自动检测null值并将其替换为NAN。

# Missing Values/Null Values Count
dataset.isna().sum().sort_values(ascending=False)

输出

Date       0
SPX        0
GLD        0
USO        0
SLV        0
EUR/USD    0
dtype: int64

相关性

我们应该始终检查数据集的两列之间是否存在任何相关性。如果两个或多个列相互关联，并且它们都不是目标变量，那么我们必须使用一种方法来删除这种相关性。一些流行的方法是PCA（主成分分析）。我们还可以删除两个列中的一个，或者使用这两个列创建一个新列。

# Calculate correlation matrix
correlation = dataset.corr()# Create heatmap
sns.heatmap(correlation, cmap='coolwarm',center=0, annot=True)# Set title and axis labels
plt.title('Correlation Matrix Heatmap')
plt.xlabel('Features')
plt.ylabel('Features')# Show plot
plt.show()

在这里插入图片描述
这里SLV和GLD这两个列与其他列相比彼此之间有很强的相关性，这里我们将删除SLV，因为GLD列也与我们的目标列有很大的相关性。

# drop SlV column
dataset.drop("SLV", axis=1,inplace=True)

我们将首先将Date列设置为dataframe的索引，使用日期作为索引将在绘制数据时增加优势

# reset the index to date column
dataset.set_index("Date", inplace=True)

我们将首先观察全年连续每一天的黄金价格变化。

# plot price of gold for each increasing day
dataset["EUR/USD"].plot()
plt.title("Change in price of gold through date")
plt.xlabel("date")
plt.ylabel("price")
plt.show()

在这里插入图片描述
通过这张图，我们无法很好地洞察黄金价格的变化。它看起来非常嘈杂，要看到数据中的趋势，我们必须使图表平滑。

使用移动平均绘制黄金价格走势

为了可视化数据中的趋势，我们必须对这条看起来非常嘈杂的线进行平滑处理。在我们的项目中，我们将使用pandas的rolling函数平均获取20个以前的数据点。这也被称为移动平均线。

# apply rolling mean with window size of 3
dataset["price_trend"] = dataset["EUR/USD"]\.rolling(window=20).mean()# reset the index to date column
dataset.reset_index("Date", inplace=True)# since we have used rolling method
# for 20 rows first 2 rows will be NAN
dataset["price_trend"].loc[20:].plot()# set title of the chart
plt.title("Trend in price of gold through date")# set x_label of the plot
plt.xlabel("date")
plt.ylabel("price")
plt.show()

在这里插入图片描述
现在图表看起来不那么嘈杂，在这里我们可以分析黄金价格的变化趋势。

分布情况

要查看数值列的分布，我们将在一个图中绘制每列的直方图，为此，我们将使用Matplotlib子图函数。

fig = plt.figure(figsize=(8, 8))# suptitle of the graph
fig.suptitle('Distribution of data across column')
temp = dataset.drop("Date", axis=1).columns.tolist()
for i, item in enumerate(temp):plt.subplot(2, 3, i+1)sns.histplot(data=dataset, x=item, kde=True)
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()

在这里插入图片描述
这里我们使用plt.figure初始化一个图形并设置其大小。需要注意的一点是，每当我们使用matplotlib.plot创建一个图时，它都会自动调用这个函数，默认值为figsize。我们使用sns.histplot函数创建直方图，kde等于True。数据分布看起来不错，但是，我们将使用pandas函数检查每列的偏度。

# skewness along the index axis
print(dataset.skew(axis=0, skipna=True))

输出

SPX        0.263937
GLD        0.375042
USO        0.981611
EUR/USD   -0.170009
dtype: float64

USO列的偏度最高，为0.98，因此这里我们将对该列应用平方根变换，以将其偏度降低到0。我们可以使用不同的变换函数来降低偏度，有些是对数变换，逆变换等。

# apply saquare root transformation
# on the skewed dataset
dataset["USO"] = dataset["USO"]\.apply(lambda x: np.sqrt(x))

处理离群值

离群值会对我们的模型产生非常坏的影响，就像在线性回归中，如果一个数据点是离群值，那么它会增加一个非常大的均方误差。去除异常值是EDA中一个很好的过程。一些模型，如决策树和集成方法，如随机森林，并没有那么多的离群值。但是，处理离群值始终是一个很好的做法。

绘制箱形图以显示异常值

箱形图在绘制数据的分布和偏度时非常有用，在绘制个体的离群数据点时也很有用，它们由代表25%至75%分位数范围内的点的方框组成。而方框中间的线代表中位数，方框末端的须线显示低于25%和75%的范围，不包括离群值。

fig = plt.figure(figsize=(8, 8))
temp = dataset.drop("Date", axis=1).columns.tolist()
for i, item in enumerate(temp):plt.subplot(2, 3, i+1)sns.boxplot(data=dataset, x=item, color='violet')
plt.tight_layout(pad=0.4, w_pad=0.5, h_pad=2.0)
plt.show()

在这里插入图片描述

可以清楚地看到，列’USO’中存在离群值，因此我们创建一个函数来规范化列中存在的离群值。

def outlier_removal(column):# Capping the outlier rows with Percentilesupper_limit = column.quantile(.95)# set upper limit to 95percentilelower_limit = column.quantile(.05)# set lower limit to 5 percentilecolumn.loc[(column > upper_limit)] = upper_limitcolumn.loc[(column < lower_limit)] = lower_limitreturn column

在这里，我们将列的上限设置为数据的95%，下限设置为5%。这意味着大于数据的95%百分位数的数据被归一化为数据的95%值，对于低于数据的5%的数据点相同。

# Normalize outliers in columns except Datedataset[['SPX', 'GLD', 'USO', 'EUR/USD']] = \dataset[['SPX', 'GLD', 'USO', 'EUR/USD']].apply(outlier_removal)

这里使用pandas的apply函数我们已经将outlier_removal函数应用于列的每一行。

3. 建立模型

在开始建模之前，我们必须将数据分为训练和测试，这样在训练数据之后，我们就可以看到我们的数据在学习模式和推广新数据点方面有多少。这也是一种方法，可以看出我们的模型没有学习数据中的噪声，或者说它没有过度拟合数据集。

# select the features and target variable
X = dataset.drop(['Date', 'EUR/USD'], axis=1)y = dataset['EUR/USD']
# dividing dataset in to train test
x_train, x_test,\y_train, y_test = train_test_split(X, y,test_size=0.2)

在这里，我们首先删除Date和目标变量，并将其存储在变量X中，这将是我们的自变量，我们也将目标变量存储在Y变量中。这里我们将数据以80：20的比例划分。我们可以根据我们的需要改变它。

标准化

在我们用数据训练模型之前，我们应该对数据进行缩放以进行标准化。缩放数据后，每列的平均值变为零，标准差变为1。它也被称为z分数归一化，因为我们从每个元素中减去列的平均值，然后将其除以列的标准差。它使所有列都具有相同的比例，并且可以直接相互比较。

# Create an instance of the StandardScaler
scaler = StandardScaler()# Fit the StandardScaler on the training dataset
scaler.fit(x_train)# Transform the training dataset
# using the StandardScaler
x_train_scaled = scaler.transform(x_train)
x_test_scaled = scaler.transform(x_test)

从简单的模型开始拟合数据，然后将其移动到复杂的模型，这始终是明智的。这样做的原因之一是简单模型需要更少的时间和存储来训练数据。此外，许多简单的模型比复杂的模型工作得更好，这些模型也比复杂的模型更容易解释。

Lasso回归

在这个模型中，我们使用了L1正则化的线性回归，在make_pipeline对象的帮助下，我们将使用2阶的Lasso回归。我们还将在每个模型中使用GridSearch对象来获得性能最佳的超参数并降低方差。

# Create a PolynomialFeatures object of degree 2
poly = PolynomialFeatures(degree=2)# Create a Lasso object
lasso = Lasso()# Define a dictionary of parameter
#values to search over
param_grid = {'lasso__alpha': [1e-4, 1e-3, 1e-2,1e-1, 1, 5, 10, 20, 30, 40]}# Create a pipeline that first applies 
# polynomial features and then applies Lasso regression
pipeline = make_pipeline(poly, lasso)# Create a GridSearchCV object with 
#the pipeline and parameter grid
lasso_grid_search = GridSearchCV(pipeline,param_grid, scoring='r2', cv=3)# Fit the GridSearchCV object to the training data
lasso_grid_search.fit(x_train_scaled, y_train)# Predict the target variable using
# the fitted model and the test data
y_pred = lasso_grid_search.predict(x_train_scaled)# Compute the R-squared of the fitted model on the train data
r2 = r2_score(y_train, y_pred)# Print the R-squared
print("R-squared: ", r2)# Print the best parameter values and score
print('Best parameter values: ',lasso_grid_search.best_params_)
print('Best score: ',lasso_grid_search.best_score_)

输出

R-squared:  0.8387982103677535
Best parameter values:  {'lasso__alpha': 0.0001}
Best score:  0.8355941438243204

在这里，我们正在拟合度的多元回归，然而，要使用多元回归的Lasso回归，我们必须使用sklearn的管道方法。我们还将使用网格搜索方法进行交叉验证，并为训练数据选择性能最佳的超参数。网格搜索是找到不会过拟合训练数据的模型的最佳方法之一。

我们在整个模型中使用了R平方评估矩阵。我们使用这个矩阵，因为我们想比较我们的模型，并选择哪一个是最好的。

RandomForestRegressor回归

在第二个模型中，我们将使用集成方法来拟合训练数据。与随机森林一样，它使用几个决策树来拟合数据，需要注意的一点是，在随机森林回归器中，m行用于训练，其总是小于n（m<n）。其中n是训练数据集中存在的原始列的总数，对于行点，随机森林也选择这些行的元素。

# Insiate param grid for which to search
param_grid = {'n_estimators': [50, 80, 100],'max_depth': [3, 5, 7]}# create instance of the Randomforest regressor
rf = RandomForestRegressor()# Define Girdsearch with random forest
# object parameter grid scoring and cv
rf_grid_search = GridSearchCV(rf, param_grid, scoring='r2', cv=2)# Fit the GridSearchCV object to the training datarf_grid_search.fit(x_train_scaled, y_train)# Print the best parameter values and score
print('Best parameter values: ', rf_grid_search.best_params_)
print('Best score: ', rf_grid_search.best_score_)

输出

Best parameter values:  {'max_depth': 7, 'n_estimators': 50}
Best score:  0.9463515663955014

在这里，我们使用了RandomForest回归器和Gridsearchcv，Gridsearch将有助于从50，80，100中选择最佳决策树数量。我们还指定了树的最大深度作为参数，可以是3，5或7。

最佳参数值表明，该模型在取100棵最大深度为7的决策树的平均结果时给出了最佳结果。

# Compute the R-squared of the
# fitted model on the test data
r2 = r2_score(y_test,rf_grid_search.predict(x_test_scaled))# Print the R-squared
print("R-squared:", r2)

输出

R-squared: 0.9519989288862035

这些模型被称为黑盒模型，因为我们无法可视化模型内部发生的事情，但我们将绘制数据集特征重要性的条形图。

features = dataset.drop("Date", axis=1).columns# store the importance of the feature
importances = rf_grid_search.best_estimator_.\feature_importances_indices = np.argsort(importances)# title of the graph
plt.title('Feature Importance')plt.barh(range(len(indices)),importances[indices],color='red',align='center')# plot bar chart
plt.yticks(range(len(indices)),[features[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()

在这里插入图片描述
这个特征重要性图显示USO柱在决定美元黄金价格方面发挥了重要作用（超过2倍）。

XGBoost回归模型

在Boosting技术中，数据被拟合在几个顺序的弱学习算法模型中，这些模型仅略优于随机猜测。在每个下一个连续模型中，为先前模型错误分类/回归的点提供更多权重。

在我们的模型中，我们将使用XGBoost模型来拟合训练数据集。

# Create an instance of the XGBRegressor model
model_xgb = XGBRegressor()# Fit the model to the training data
model_xgb.fit(x_train_scaled, y_train)# Print the R-squared score on the training data
print("Xgboost Accuracy =", r2_score(y_train, model_xgb.predict(x_train_scaled)))

输出

Xgboost Accuracy = 0.99813247897056

现在让我们使用测试数据来评估这个模型。

# Print the R-squared score on the test data
print("Xgboost Accuracy on test data =",r2_score(y_test,model_xgb.predict(x_test_scaled)))

输出

Xgboost Accuracy on test data = 0.9728930078121006

从图中可以看出，USO列在决定预测值方面起着主要作用。

4. 模型可解释性

在黑盒模型Boosting和Bagging中，我们将无法看到赋予这些列的实际权重，但是当我们预测单个向量时，我们可以使用一些库来计算赋予特定列的权重。我们将使用eli5包来演示模型的可解释性。您可以通过在终端中运行以下命令来安装此软件包。

pip install eli5

eli5是一个流行的Python库，用于调试和解释机器学习模型。我们将使用eli5来查看我们性能最好的模型的权重，这是XGBOOST在训练和测试准确性方面最好的模型。

import eli5 as eli
# weight of variables in xgboost model
# Get the names of the features
feature_names = x_train.columns.tolist()# Explain the weights of the features using ELI5
eli.explain_weights(model_xgb,feature_names=feature_names)

在这里插入图片描述

5. 保存模型

为了部署模型，我们将使用Python语言中的pickle库。我们将部署我们性能最好的模型，即XGBoost。Pickle是一个Python模块，用于序列化和重新序列化模型，即保存和加载模型。它存储Python对象，这些对象可以移动到磁盘（序列化），然后再次从磁盘移动到内存（序列化）。

# dump model using pickle library
import pickle# dump model in file model.pkl
pickle.dump(model_xgb, open('model.pkl', 'wb'))

现在我们可以使用pickle中的load函数将model_xgb加载到内存中来预测一个新的向量数据集。

基于机器学习的黄金价格预测案例

1. 问题表述

2. 数据预处理

3. 建立模型

4. 模型可解释性

5. 保存模型

相关资讯

热文排行

最新新闻

推荐新闻

热搜词