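Wine is a beverage with a long history and deep cultural roots. It has enthusiasts around the world, and its production techniques, varietal classification, and tasting methods have long been active areas of study. Advances in modern technology, particularly in artificial intelligence and machine learning, have opened up new research perspectives and applications for the wine industry. One such application, explored here, is classifying wine type with a feedforward neural network (FNN).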



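1. Dataset Overview

The Wine Quality dataset is freely available from the UCI Machine Learning Repository. It contains 12 variables, a few of which are described below:

  • Fixed acidity: Total acidity is divided into two groups: volatile acids and non-volatile (fixed) acids. This variable is expressed in g/dm³.
  • Volatile acidity: The volatile acids in the wine, mainly acetic acid; at high levels they give the wine a vinegar-like taste. This variable is expressed in g/dm³.
  • Citric acid: One of the fixed acids found in wine. It is expressed in g/dm³.
  • Residual sugar: The sugar that remains after fermentation stops or is stopped. It is expressed in g/dm³.
  • Chlorides: Likely a major contributor to a wine's salty taste. This variable is expressed in g/dm³.
  • Free sulfur dioxide: The portion of the sulfur dioxide added to the wine that remains free (unbound). This variable is expressed in mg/dm³.
  • Total sulfur dioxide: The sum of bound and free sulfur dioxide. This variable is expressed in mg/dm³.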

1.1 Getting the Data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, auc
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader, Dataset
from torchinfo import summary

# Read in white wine data
white = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv", sep=';')
# Read in red wine data
red = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv", sep=';')


data = pd.read_csv('wines.csv')
print(data.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  type  
0      9.4        5     1  
1      9.8        5     1  
2      9.8        5     1  
3      9.8        6     1  
4      9.4        5     1  
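The text does not show how wines.csv is produced from the two downloaded files. One plausible way, sketched below, is to stack the red and white frames and add a binary type column. The helper name combine_wines and the encoding (1 for red, 0 for white) are assumptions; the encoding is at least consistent with the first rows printed above, which carry type 1. The small frames are stand-ins so the sketch runs without a download.

```python
import pandas as pd

def combine_wines(red: pd.DataFrame, white: pd.DataFrame) -> pd.DataFrame:
    """Stack red and white wines and append a binary 'type' column.

    Assumption (not confirmed by the article): type = 1 marks red wine
    and type = 0 marks white wine.
    """
    red = red.assign(type=1)
    white = white.assign(type=0)
    return pd.concat([red, white], ignore_index=True)

# Tiny stand-in frames; the real ones come from the two CSV files.
red_demo = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_demo = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

wines_demo = combine_wines(red_demo, white_demo)
print(wines_demo)
```

With the real frames, combine_wines(red, white).to_csv('wines.csv', index=False) would write a file with the same 13-column layout as the one read above, with type as the last column.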

1.2 Inspecting the Data

The .info() method prints a summary of a DataFrame, including the index dtype, the columns and their dtypes, the non-null counts, and the memory usage.

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         6497 non-null   float64
 1   volatile acidity      6497 non-null   float64
 2   citric acid           6497 non-null   float64
 3   residual sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free sulfur dioxide   6497 non-null   float64
 6   total sulfur dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  type                  6497 non-null   int64  
dtypes: float64(11), int64(2)
memory usage: 660.0 KB
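Since the summary shows no missing values, a natural next check is class balance, because it sets the accuracy a trivial "always predict the majority class" model would reach. The sketch below assumes the class counts are 1,599 red (type 1) and 4,898 white (type 0), the sizes of the two UCI files, which add up to the 6,497 rows reported above. The Series is a stand-in for data['type'].

```python
import pandas as pd

# Assumed class counts: 1,599 red (type 1) and 4,898 white (type 0).
type_column = pd.Series([1] * 1599 + [0] * 4898, name="type")

counts = type_column.value_counts()          # rows per class
majority_share = counts.max() / counts.sum() # accuracy of always guessing the majority

print(counts.to_dict())
print(f"Majority-class baseline accuracy: {majority_share:.2%}")
```

Under these assumptions the baseline is roughly 75%, which is worth keeping in mind when reading accuracy figures from the first few training epochs.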


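2. Feature Engineering

2.1 Data Scaling (Standardization)

StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so every scaled feature has mean 0 and standard deviation 1. It does not change the shape of a feature's distribution, so the scaled values are not necessarily normally distributed.

# Create a StandardScaler instance
scaler = StandardScaler()
# Every column except the last ('type') is a feature; note that this includes 'quality'
features = data.iloc[:, :-1]
features_scaled = scaler.fit_transform(features)
# The last column ('type') is the binary target
target = data.iloc[:, -1].values.reshape(-1, 1)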
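To make the scaling step concrete, here is a small check that StandardScaler matches the z-score formula (x - mean) / std computed by hand, column by column. The array x is made-up illustration data, not part of the wine dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: three samples, two features on very different scales.
x = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# z-score by hand; np.std defaults to the population standard deviation (ddof=0),
# which is also what StandardScaler uses.
manual = (x - x.mean(axis=0)) / x.std(axis=0)
scaled = StandardScaler().fit_transform(x)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```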

2.2 Splitting the Data (Train/Test)

train_test_split splits arrays or matrices into random training and test subsets.

# Split the data set into training and test subsets
X_train, X_test, y_train, y_test = train_test_split(features_scaled, target, test_size=0.2, random_state=45)
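In the code above the scaler is fitted on all rows before the split, so the test rows contribute to the mean and standard deviation used for scaling. A common alternative, sketched here on synthetic stand-in data, is to split first and fit the scaler on the training rows only. The names X_raw, y_raw, and demo_scaler, as well as the use of stratify, are illustrative choices and not part of the original pipeline.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_raw = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # stand-in features
y_raw = (rng.random(200) < 0.25).astype(int)           # stand-in binary labels

# Split first; stratify keeps the class ratio similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=45, stratify=y_raw
)

demo_scaler = StandardScaler().fit(X_tr)   # statistics come from training rows only
X_tr_scaled = demo_scaler.transform(X_tr)
X_te_scaled = demo_scaler.transform(X_te)  # test rows reuse the training mean and std

print(X_tr_scaled.shape, X_te_scaled.shape)
```

On a dataset of this size the effect on the reported metrics is probably small, but the split-then-fit order keeps the test set strictly unseen.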

2.3 Converting the Data to Tensors

Convert the NumPy arrays to tensors.

# Convert the NumPy arrays to float32 tensors
X_train_tensor = torch.from_numpy(X_train).type(torch.Tensor)
X_test_tensor = torch.from_numpy(X_test).type(torch.Tensor)
y_train_tensor = torch.from_numpy(y_train).type(torch.Tensor).view(-1, 1)
y_test_tensor = torch.from_numpy(y_test).type(torch.Tensor).view(-1, 1)
print(X_train_tensor.shape, X_test_tensor.shape, y_train_tensor.shape, y_test_tensor.shape)
torch.Size([5197, 12]) torch.Size([1300, 12]) torch.Size([5197, 1]) torch.Size([1300, 1])

# Wrap the tensors in datasets and batch them with DataLoaders
train_dataset = TensorDataset(X_train_tensor, y_train_tensor)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_dataset = TensorDataset(X_test_tensor, y_test_tensor)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
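To see what the loaders actually yield, the sketch below builds a loader over zero-filled tensors with the same shapes as the training tensors above and counts the batches. With 5,197 rows and a batch size of 32, the loader produces ceil(5197 / 32) batches, and the last one is smaller than the rest. The tensors X_demo and y_demo are placeholders.

```python
import math
import torch
from torch.utils.data import TensorDataset, DataLoader

# Placeholder tensors with the training-set shapes printed above.
X_demo = torch.zeros(5197, 12)
y_demo = torch.zeros(5197, 1)

loader = DataLoader(TensorDataset(X_demo, y_demo), batch_size=32, shuffle=True)
batch_sizes = [inputs.shape[0] for inputs, _ in loader]

# Number of batches, the expected count, and the size of the final batch.
print(len(loader), math.ceil(5197 / 32), batch_sizes[-1])
```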

3. Building the Neural Network

3.1 Building the Feedforward Neural Network (FNN)

class FeedForwardNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        # input_dim: number of input features
        # hidden_dim: number of units in the hidden layer
        # output_dim: number of output units
        super(FeedForwardNN, self).__init__()  # initialize the parent class nn.Module
        # Fully connected layer fc1 maps input_dim-dimensional vectors to hidden_dim dimensions
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        # Fully connected layer fc2 maps hidden_dim-dimensional vectors to output_dim dimensions
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.activation = nn.ReLU()  # ReLU activation function

    def forward(self, x):
        # Forward pass; x is typically a batch of inputs
        out = self.activation(self.fc1(x))
        out = self.fc2(out)
        return out
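The forward pass of this network is just two affine maps with a ReLU in between: out = ReLU(x W1ᵀ + b1) W2ᵀ + b2. The sketch below checks that claim with two standalone nn.Linear layers of the same sizes (12 inputs, 8 hidden units, 1 output) on a random batch; the layer objects and the input here are illustrative, not the trained model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
fc1 = nn.Linear(12, 8)   # same sizes as the network's first layer
fc2 = nn.Linear(8, 1)    # same sizes as the network's second layer
x = torch.randn(4, 12)   # a random batch of 4 samples

with torch.no_grad():
    # The computation as the network's forward method performs it.
    layered = fc2(torch.relu(fc1(x)))
    # The same computation written out with explicit matrix products.
    hidden = torch.clamp(x @ fc1.weight.T + fc1.bias, min=0.0)
    manual = hidden @ fc2.weight.T + fc2.bias

print(layered.shape, torch.allclose(layered, manual, atol=1e-6))
```

The output has one raw score (logit) per sample; no sigmoid is applied inside the network.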

3.2 Defining the Model, Loss Function, and Optimizer

model = FeedForwardNN(input_dim=X_train_tensor.shape[1], hidden_dim=8, output_dim=1)
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy loss applied to raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)  # Adam optimizer
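BCEWithLogitsLoss expects raw logits, which is why the network has no sigmoid on its output. As a quick illustration, the sketch below shows that it gives the same value as applying a sigmoid and then BCELoss, and the same value as the binary cross-entropy formula written out by hand. The logits and targets are made-up numbers.

```python
import torch
import torch.nn as nn

logits = torch.tensor([[2.0], [-1.0], [0.5]])   # made-up raw model outputs
targets = torch.tensor([[1.0], [0.0], [1.0]])   # made-up binary labels

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# Binary cross-entropy written out: -mean(y * log(p) + (1 - y) * log(1 - p))
p = torch.sigmoid(logits)
loss_manual = -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()

print(loss_from_logits.item(), loss_from_probs.item(), loss_manual.item())
```

The fused version is generally preferred because it is more numerically stable for large-magnitude logits.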

3.3 Model Summary

summary(model, (32, X_train_tensor.shape[1])) # batch_size, input_size
Layer (type:depth-idx)                   Output Shape              Param #
FeedForwardNN                            [32, 1]                   --
├─Linear: 1-1                            [32, 8]                   104
├─ReLU: 1-2                              [32, 8]                   --
├─Linear: 1-3                            [32, 1]                   9
Total params: 113
Trainable params: 113
Non-trainable params: 0
Total mult-adds (Units.MEGABYTES): 0.00
Input size (MB): 0.00
Forward/backward pass size (MB): 0.00
Params size (MB): 0.00
Estimated Total Size (MB): 0.00
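The parameter counts in the summary can be reproduced by hand: a linear layer with n inputs and m outputs has n × m weights plus m biases. The sketch below compares that formula with a direct count on an equivalent nn.Sequential stack, which is used here only as a stand-in for the model.

```python
import torch.nn as nn

# Stand-in with the same layer sizes as the model: 12 -> 8 -> 1.
net = nn.Sequential(nn.Linear(12, 8), nn.ReLU(), nn.Linear(8, 1))

by_formula = (12 * 8 + 8) + (8 * 1 + 1)   # weights + biases for each linear layer
by_counting = sum(p.numel() for p in net.parameters() if p.requires_grad)

print(by_formula, by_counting)
```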

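4. Model Training and Visualization

4.1 Defining the Training and Evaluation Functions


def binary_accuracy(outputs, labels):
    # Map the logits to the (0, 1) range with the sigmoid function
    outputs = torch.sigmoid(outputs)
    # Compare with 0.5 to get the predicted class (0 or 1)
    predicted = (outputs > 0.5).float()
    # Count the correct predictions
    correct = (predicted == labels).float().sum()
    # Total number of samples in the batch
    total = labels.size(0)
    # Compute the accuracy
    accuracy = correct / total
    return accuracy

The code above defines binary_accuracy, which computes the accuracy for a binary classification task. It takes the model outputs (outputs) and the true labels (labels) and returns the resulting accuracy.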
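A small worked example may help here. The sketch below repeats the function in compact form and applies it to four made-up logits; it also checks that thresholding the sigmoid at 0.5 is the same as testing whether the logit is positive.

```python
import torch

def binary_accuracy(outputs, labels):
    # Same logic as above, condensed: sigmoid, threshold at 0.5, compare, average.
    predicted = (torch.sigmoid(outputs) > 0.5).float()
    return (predicted == labels).float().sum() / labels.size(0)

logits = torch.tensor([[2.0], [-1.0], [0.3], [-0.2]])  # made-up model outputs
labels = torch.tensor([[1.0], [0.0], [0.0], [1.0]])    # made-up true labels

acc = binary_accuracy(logits, labels)
# sigmoid(z) > 0.5 exactly when z > 0, so the two masks agree.
same_as_sign_test = torch.equal(torch.sigmoid(logits) > 0.5, logits > 0)

print(acc.item(), same_as_sign_test)
```

Here the predictions are 1, 0, 1, 0 against labels 1, 0, 0, 1, so two of four are correct.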

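def train(model, iterator, optimizer, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.train()  # make sure the model is in training mode
    for batch in iterator:
        optimizer.zero_grad()  # clear the gradients
        inputs, labels = batch  # unpack the inputs and labels
        outputs = model(inputs)  # forward pass
        # Compute the loss and accuracy
        loss = criterion(outputs, labels)
        acc = binary_accuracy(outputs, labels)
        loss.backward()  # backpropagate
        optimizer.step()  # update the weights
        # Accumulate the loss and accuracy
        epoch_loss += loss.item()
        epoch_acc += acc.item()
    # Average the loss and accuracy over the batches
    average_loss = epoch_loss / len(iterator)
    average_acc = epoch_acc / len(iterator)
    return average_loss, average_acc

The code above defines train, which trains the given model for one pass over the data. It takes the model, a data iterator, the optimizer, and the loss function, and returns the average loss and average accuracy over the batches.

def evaluate(model, iterator, criterion):
    epoch_loss = 0
    epoch_acc = 0
    model.eval()  # switch to evaluation mode (disables Dropout and similar layers)
    with torch.no_grad():  # no gradients are needed
        for batch in iterator:
            inputs, labels = batch
            outputs = model(inputs)  # forward pass
            # Compute the loss and accuracy
            loss = criterion(outputs, labels)
            acc = binary_accuracy(outputs, labels)
            # Accumulate the loss and accuracy
            epoch_loss += loss.item()
            epoch_acc += acc.item()
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The code above defines evaluate, which measures the model's performance on a given data iterator. It takes the model, the data iterator, and the loss function, and returns the average loss and average accuracy over the batches. It is usually called at regular intervals during training to monitor performance on a validation or test set, which shows how well the model generalizes and how training is progressing.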
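One subtlety in both functions is that they average the per-batch accuracies, dividing by len(iterator). When the last batch is smaller than the others, this is not exactly the same as the accuracy over all samples. The sketch below uses deliberately extreme, hypothetical numbers (two full batches all correct, one short batch all wrong) to make the difference visible.

```python
# Hypothetical batch sizes and correct-prediction counts, chosen to exaggerate the effect.
batch_sizes = [32, 32, 13]
batch_correct = [32, 32, 0]

per_batch_acc = [c / n for c, n in zip(batch_correct, batch_sizes)]

# What train/evaluate report: the unweighted mean of per-batch accuracies.
mean_of_batches = sum(per_batch_acc) / len(per_batch_acc)
# Accuracy over all samples: total correct divided by total samples.
sample_weighted = sum(batch_correct) / sum(batch_sizes)

print(round(mean_of_batches, 4), round(sample_weighted, 4))
```

In practice, with over a hundred batches and only one short batch, the gap is likely to be very small, so the batch average is a reasonable approximation.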

import copy

best_acc = 0
num_epochs = 100
train_losses = []
valid_losses = []
train_accs = []
valid_accs = []

for epoch in range(num_epochs):
    train_loss, train_acc = train(model, train_loader, optimizer, criterion)
    # The test set doubles as the validation set here
    valid_loss, valid_acc = evaluate(model, test_loader, criterion)
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)
    train_accs.append(train_acc)
    valid_accs.append(valid_acc)
    print(f'Epoch: {epoch+1:02}, Train Loss: {train_loss:.3f}, Train Acc: {train_acc * 100:.2f}%, Val. Loss: {valid_loss:.3f}, Val. Acc: {valid_acc * 100:.2f}%')
    if best_acc <= valid_acc:
        best_acc = valid_acc
        # Deep-copy the weights: state_dict() alone returns references that keep changing as training continues
        pth = copy.deepcopy(model.state_dict())


Epoch: 01, Train Loss: 0.619, Train Acc: 75.69%, Val. Loss: 0.614, Val. Acc: 73.90%
Epoch: 02, Train Loss: 0.598, Train Acc: 75.72%, Val. Loss: 0.594, Val. Acc: 73.90%
Epoch: 03, Train Loss: 0.577, Train Acc: 75.75%, Val. Loss: 0.572, Val. Acc: 73.90%
Epoch: 04, Train Loss: 0.553, Train Acc: 75.81%, Val. Loss: 0.548, Val. Acc: 73.90%
Epoch: 05, Train Loss: 0.526, Train Acc: 75.81%, Val. Loss: 0.520, Val. Acc: 73.90%
...
Epoch: 96, Train Loss: 0.024, Train Acc: 99.56%, Val. Loss: 0.046, Val. Acc: 99.16%
Epoch: 97, Train Loss: 0.024, Train Acc: 99.56%, Val. Loss: 0.046, Val. Acc: 99.16%
Epoch: 98, Train Loss: 0.024, Train Acc: 99.56%, Val. Loss: 0.046, Val. Acc: 99.16%
Epoch: 99, Train Loss: 0.024, Train Acc: 99.56%, Val. Loss: 0.046, Val. Acc: 99.16%
Epoch: 100, Train Loss: 0.025, Train Acc: 99.53%, Val. Loss: 0.046, Val. Acc: 99.16%
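When the best weights are kept in memory during a loop like the one above, it matters whether the stored object is a copy. The sketch below illustrates, on a tiny stand-in layer, that state_dict() returns tensors sharing storage with the live parameters, while copy.deepcopy gives an independent snapshot that can later be restored with load_state_dict. The in-place update is a stand-in for further optimizer steps.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Linear(3, 1)                        # tiny stand-in for the model

reference = net.state_dict()                 # shares storage with the live parameters
snapshot = copy.deepcopy(net.state_dict())   # independent copy of the weights

with torch.no_grad():
    net.weight.add_(1.0)                     # stand-in for further training updates

reference_tracks_model = torch.equal(reference["weight"], net.weight)
snapshot_unchanged = not torch.equal(snapshot["weight"], net.weight)
print(reference_tracks_model, snapshot_unchanged)

net.load_state_dict(snapshot)                # restore the saved weights
restored = torch.equal(net.weight, snapshot["weight"])
print(restored)
```

To keep the weights across sessions, the snapshot could also be written to disk with torch.save and read back with torch.load.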

4.2 Plotting the Loss and Accuracy Curves

# Plot the loss curves
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(train_losses, label='Train Loss')
plt.plot(valid_losses, label='Validation Loss')
plt.title('Train and Validation Loss')
plt.legend()
plt.grid(True)
# Plot the accuracy curves
plt.subplot(1, 2, 2)
plt.plot(train_accs, label='Train Accuracy')
plt.plot(valid_accs, label='Validation Accuracy')
plt.title('Train and Validation Accuracy')
plt.legend()
plt.grid(True)
plt.show()


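5. Model Evaluation and Visualization

5.1 Building the Prediction Function

Define a prediction function so that it can be reused easily.

# Define the prediction function
def prediction(model, test_loader):
    all_labels = []
    all_predictions = []
    all_predictions_prob = []
    model.eval()
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            predictions_prob = torch.sigmoid(outputs)
            predicted = (predictions_prob > 0.5).float()
            # Flatten each (batch, 1) tensor so the lists hold plain scalars
            all_labels.extend(labels.numpy().ravel())
            all_predictions.extend(predicted.numpy().ravel())
            all_predictions_prob.extend(predictions_prob.numpy().ravel())
    return all_labels, all_predictions, all_predictions_prob

The code above defines prediction, which runs the given model over a data loader (here, the test loader) and returns the true labels, the predicted classes, and the predicted probabilities. It is typically used once training is complete. The collected predictions can then be used to analyze the model further, for example by computing accuracy or plotting a confusion matrix. The same function can also be used in an application to predict on unseen data.

# Run the predictions on the test set
labels, predictions, predictions_prob = prediction(model, test_loader)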

5.2 Confusion Matrix

def plot_confusion_matrix(labels, predictions, classes):
    cm = confusion_matrix(labels, predictions)
    plt.figure(figsize=(8, 6))
    plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Blues)
    plt.title("Confusion Matrix")
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            plt.text(j, i, format(cm[i, j], 'd'),
                     horizontalalignment="center",
                     color="white" if cm[i, j] > thresh else "black")
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

The code above defines plot_confusion_matrix, which draws the confusion matrix for the given true labels and predictions. A confusion matrix is a tool for evaluating a classifier: for each true class, it shows how many samples were assigned to each predicted class.

classes = ['Class 0', 'Class 1']


plot_confusion_matrix(labels, predictions, classes)
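To read the plotted matrix, it helps to know scikit-learn's layout: rows are true classes and columns are predicted classes, so for a binary problem the flattened matrix is (TN, FP, FN, TP). The sketch below works through a made-up six-sample example and derives precision and recall for class 1 from those four counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()   # row = true class, column = predicted class

precision = tp / (tp + fp)    # of the samples predicted as 1, how many are really 1
recall = tp / (tp + fn)       # of the samples that are really 1, how many were found

print(cm.tolist(), round(precision, 3), round(recall, 3))
```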


5.3 ROC Curve and AUC

def plot_roc_curve(labels, predictions_prob):
    fpr, tpr, _ = roc_curve(labels, predictions_prob)
    roc_auc = auc(fpr, tpr)
    plt.figure()
    plt.plot(fpr, tpr, color='darkorange', lw=2, label='ROC curve (area = %0.2f)' % roc_auc)
    plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')  # chance-level diagonal
    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver operating characteristic')
    plt.legend(loc="lower right")
    plt.show()

# Plot the ROC curve
plot_roc_curve(labels, predictions_prob)
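The AUC value shown in the plot legend comes from integrating the ROC curve, which roc_curve builds by sweeping the decision threshold over the predicted probabilities. As a minimal illustration on four made-up samples, the sketch below computes the curve and its area, and cross-checks the area against roc_auc_score.

```python
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Made-up labels and predicted probabilities for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)                        # area under the curve (trapezoidal rule)
roc_auc_direct = roc_auc_score(y_true, y_score)

print(roc_auc, roc_auc_direct)
```

An AUC of 0.75 here means that three of the four (positive, negative) pairs are ranked correctly; only the pair (0.35, 0.4) is in the wrong order.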


5.4 Classification Report

from sklearn.metrics import classification_report
print(classification_report(labels, predictions))
              precision    recall  f1-score   support

         0.0       0.99      0.99      0.99       960
         1.0       0.98      0.98      0.98       340

    accuracy                           0.99      1300
   macro avg       0.98      0.99      0.99      1300
weighted avg       0.99      0.99      0.99      1300


