The Cornerstone of Machine Classification: Logistic Regression

Summary of the Core Ideas of Logistic Regression

1. Core Principle and Improvements

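Problem-driven:
Linear regression has two weaknesses as a classifier: its output is unbounded, and it is sensitive to extreme values. Logistic regression addresses both by passing the linear result through the Sigmoid function (a nonlinear mapping), which compresses it into the interval (0, 1). The output is read as the probability that the event occurs, which turns the model into a classifier.
$P(y=1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n)}}$
Smoothing extreme values:
The Sigmoid function saturates at both tails (its gradient approaches 0), so an extreme input changes the predicted probability only slightly and its influence on the parameter updates is damped. For example, a student with a perfect score in one subject but very low scores in the others still receives a reasonable probability.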
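The mapping from linear score to probability can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the coefficients and the student scores below are hypothetical values chosen only to show how the Sigmoid bounds the output.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued linear score z to a probability between 0 and 1."""
    # Numerically stable form: avoids overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_probability(x, beta0, betas):
    """P(y=1 | x) for one sample, given intercept beta0 and coefficients betas."""
    z = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return sigmoid(z)

# Hypothetical coefficients and student scores, chosen only for illustration.
beta0, betas = -4.0, [0.03, 0.02]
for scores in ([60, 70], [100, 20], [95, 90]):
    print(scores, round(predict_probability(scores, beta0, betas), 3))
```

However large or small the linear score becomes, the printed value stays between 0 and 1, which is what lets it be read as a probability.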

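2. Key Comparison with Linear Regression

Dimension	Linear regression	Logistic regression
Task type	Regression (predicts a continuous value, e.g. house price)	Classification (predicts a probability, e.g. whether a student is rated a merit student)
Output range	$(-\infty, +\infty)$	$(0, 1)$
Loss function	Mean squared error (MSE)	Cross-entropy
Optimization goal	Minimize prediction error	Maximize the likelihood of the data
Sensitivity to extreme values	High (unconstrained)	Low (damped by the Sigmoid)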

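3. Training and Evaluation: The Cross-Entropy Loss

Definition:
$\text{Loss} = -\frac{1}{N} \sum_{i=1}^N \left[ y_i \log(p_i) + (1-y_i)\log(1-p_i) \right]$
Here $y_i \in \{0, 1\}$ is the true label of sample $i$ and $p_i$ is its predicted probability of class 1.
Meaning:
- It measures the gap between the predicted probabilities and the true labels: the closer a prediction is to the true label, the smaller the loss.
- It severely penalizes confident wrong predictions (for example, predicting a probability of 0.9 when the true label is 0).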
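To make the penalty on confident mistakes concrete, the loss can be computed by hand. The following is a minimal sketch in plain Python; the `eps` clipping is an assumption added here to keep `log()` finite and is not part of the formula above.

```python
import math

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean cross-entropy between 0/1 labels and predicted P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip so log() never sees 0 or 1 (assumption)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)

# One sample whose true label is 0, predicted with increasing confidence in class 1.
for p in (0.1, 0.5, 0.9, 0.99):
    print(f"p={p:<4}  loss={binary_cross_entropy([0], [p]):.3f}")
```

The loss grows slowly while the prediction is merely uncertain and rises sharply as the wrong prediction becomes more confident.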

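4. Strengths and Weaknesses

Strengths	Weaknesses
✅ Outputs a probability, which supports risk-aware decisions (e.g. threshold tuning)	❌ Works well only when the data are linearly separable or close to it
✅ Computationally efficient and suitable for large datasets	❌ Nonlinear features (e.g. polynomial combinations) must be engineered by hand
✅ Highly interpretable (coefficients can be inspected for importance)	❌ Multicollinearity makes the parameter estimates unstable
✅ Supports regularization (L1/L2) to prevent overfitting	❌ Imbalanced classes require threshold adjustment or a resampling strategy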

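5. Example Applications

Education:
Overall student scores → probability of being rated a merit student (features: grades, conduct, practical activities)
Finance:
Stock features (price-to-earnings ratio, trading volume) → probability of a price rise; credit score → probability of default
Biometrics:
Fingerprint or face feature match score → probability of successful identity verification
Industry:
Sensor data (temperature, vibration) → probability of equipment failure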

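Key Takeaways

Logistic regression is essentially a linear model:
The Sigmoid function turns the linear score into a probability, but at the default threshold of 0.5 the decision boundary is still the hyperplane $\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n = 0$.
When it applies:
The data should be approximately linearly separable, or a linear relationship should be constructed through feature engineering. Nonlinear problems call for combining it with other methods (e.g. features generated by decision trees).
Practical value:
It provides a probability rather than a hard class label, which supports flexible business decisions (e.g. adjusting the threshold in financial risk control to balance risk against return).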
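The link between the probability threshold and the linear boundary can be checked numerically. The sketch below assumes nothing beyond the Sigmoid formula: thresholding the probability at t is the same as thresholding the linear score z at log(t / (1 - t)), so changing the threshold shifts the hyperplane but keeps it linear.

```python
import math

def logit(t: float) -> float:
    """Inverse of the Sigmoid: the linear score z at which P(y=1) equals t."""
    return math.log(t / (1.0 - t))

def classify(z: float, threshold: float = 0.5) -> int:
    """Predict 1 when the linear score z exceeds the score implied by the threshold."""
    return int(z > logit(threshold))

for t in (0.3, 0.5, 0.7):
    print(f"threshold={t}: predict 1 when z > {logit(t):+.3f}")
```

At t = 0.5 the cut-off is z = 0, which recovers the boundary stated above.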

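Case Study

Practicing with scikit-learn is an efficient way to learn logistic regression. The step-by-step guide below covers binary classification, multiclass classification, feature engineering, regularization, and practical techniques: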

1. Environment Setup

Make sure the following libraries are installed:

pip install numpy matplotlib scikit-learn

pip install pandas

pip install -U imbalanced-learn

2. Basic Binary Classification (Breast Cancer Dataset)

Load the data and train the model

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Load the data
data = load_breast_cancer()
X = data.data  # features (e.g. cell radius, texture)
y = data.target  # labels (0 = malignant, 1 = benign)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model (L2 regularization is the default)
model = LogisticRegression(max_iter=1000)  # more iterations to help the solver converge
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification report:\n", classification_report(y_test, y_pred))
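Conceptually, fitting a logistic regression means lowering the cross-entropy loss step by step. The sketch below uses plain batch gradient descent on a one-feature toy dataset; this is an assumption made for clarity and is not the solver scikit-learn uses by default (lbfgs), and it applies no regularization. The data, learning rate, and epoch count are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by batch gradient descent on cross-entropy."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean cross-entropy: average of (p_i - y_i) times the input.
        g0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        g1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1

# Toy, non-separable 1-D data made up for illustration.
xs = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
ys = [0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print("intercept:", round(b0, 3), "slope:", round(b1, 3))
print("P(y=1 | x=2):", round(sigmoid(b0 + b1 * 2.0), 3))
```

The update term `(p_i - y_i)` is small for samples the model already predicts well, so those samples contribute little to each step.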

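Predicted probabilities and threshold adjustment

# Get the predicted probabilities (columns: class 0, class 1)
y_proba = model.predict_proba(X_test)
print("Sample predicted probabilities:\n", y_proba[:3])  # first three rows

# Adjust the threshold (the default is 0.5)
# Stricter for class 1 (benign): more samples are labelled malignant, so fewer malignant cases are missed
custom_threshold = 0.6
y_pred_custom = (y_proba[:, 1] > custom_threshold).astype(int)

# Compare the results before and after the adjustment
print("Accuracy at default threshold:", accuracy_score(y_test, y_pred))
print("Accuracy at custom threshold:", accuracy_score(y_test, y_pred_custom))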
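Accuracy alone hides what a threshold change trades away. As a rough illustration, the sketch below counts true and false positives and negatives at several thresholds; the labels and probabilities are made up and do not come from the breast cancer model.

```python
def confusion_counts(y_true, p_pos, threshold):
    """Return (tp, fp, fn, tn) when class 1 is predicted for p_pos > threshold."""
    tp = fp = fn = tn = 0
    for y, p in zip(y_true, p_pos):
        pred = int(p > threshold)
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Made-up labels and class-1 probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
p_pos = [0.95, 0.80, 0.55, 0.58, 0.30, 0.65, 0.10, 0.45]

for t in (0.4, 0.5, 0.6):
    tp, fp, fn, tn = confusion_counts(y_true, p_pos, t)
    print(f"threshold={t}: TP={tp} FP={fp} FN={fn} TN={tn}")
```

Raising the threshold removes false positives for class 1 at the cost of new false negatives, so the right value depends on which error is more expensive.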

Analyzing feature importance

import pandas as pd

# Feature importance on the breast cancer dataset (ranked by absolute coefficient)
feature_importance = pd.DataFrame({
    'Feature': data.feature_names,
    'Coefficient': model.coef_[0]
}).sort_values(by='Coefficient', key=abs, ascending=False)

print(feature_importance.head(10))  # top 10 features

Output

Accuracy: 0.956140350877193
Confusion matrix:
 [[39  4]
 [ 1 70]]
Classification report:
               precision    recall  f1-score   support

           0       0.97      0.91      0.94        43
           1       0.95      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114

Sample predicted probabilities:
 [[1.49528194e-01 8.50471806e-01]
 [9.99999991e-01 8.60075153e-09]
 [9.97169798e-01 2.83020220e-03]]
Accuracy at default threshold: 0.956140350877193
Accuracy at custom threshold: 0.9824561403508771
                 Feature  Coefficient
0            mean radius     1.783323
11         texture error     1.627921
26       worst concavity    -1.532397
28        worst symmetry    -0.953797
27  worst concave points    -0.783198
25     worst compactness    -0.739000
6         mean concavity    -0.691336
20          worst radius     0.525571
21         worst texture    -0.522758
7    mean concave points    -0.469830
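A coefficient is easier to interpret as an odds ratio. Because the model is linear in the log-odds, a one-unit increase in a feature multiplies the odds p/(1-p) by e raised to that coefficient. The sketch below applies this to two coefficients copied from the output above; the helper name `odds_ratio` is introduced here for illustration.

```python
import math

def odds_ratio(coefficient: float) -> float:
    """Multiplicative change in the odds p/(1-p) for a one-unit increase in the feature."""
    return math.exp(coefficient)

# Coefficients copied from the output above.
for name, coef in [("mean radius", 1.783323), ("worst concavity", -1.532397)]:
    print(f"{name}: coefficient={coef:+.3f}, odds ratio={odds_ratio(coef):.3f}")
```

The features in this example were not standardized, so coefficient magnitudes reflect each feature's units and are not directly comparable across features.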

3. Multiclass Classification (Iris Dataset)

Logistic regression supports multiclass classification through OvR (One-vs-Rest) or Softmax:

# Multiclass classification (Iris dataset)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.datasets import load_iris

# Load the data
data = load_iris()
X = data.data  # 4 features (sepal length/width, petal length/width)
y = data.target  # 3 iris species

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model (multi_class='multinomial' enables Softmax)
# Note: multi_class is deprecated since scikit-learn 1.5; on recent versions the argument can be dropped
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', max_iter=1000)
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print("Classification report:\n", classification_report(y_test, y_pred))

Output

Classification report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00        10
           1       1.00      1.00      1.00         9
           2       1.00      1.00      1.00        11

    accuracy                           1.00        30
   macro avg       1.00      1.00      1.00        30
weighted avg       1.00      1.00      1.00        30
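The Softmax option generalizes the Sigmoid to more than two classes: each class gets its own linear score, and the scores are normalized into probabilities that sum to 1. A minimal sketch follows, with hypothetical scores for one sample.

```python
import math

def softmax(scores):
    """Turn one linear score per class into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical linear scores for the three iris classes of one sample.
probs = softmax([2.0, 1.0, -1.0])
print([round(p, 3) for p in probs], "-> predicted class", probs.index(max(probs)))
```

With two classes and scores `[z, 0]`, the first Softmax probability equals the Sigmoid of `z`, so binary logistic regression is the two-class special case.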

4. Feature Engineering: Handling Nonlinear Relationships

Polynomial features increase the model's capacity:

# Feature engineering: handling nonlinear relationships
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_classification

# Generate a synthetic two-feature dataset with 10% label noise
# (make_classification draws one cluster per class here; it does not produce a spiral)
X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1, flip_y=0.1, random_state=42)

# Build a pipeline: polynomial expansion + logistic regression
model = make_pipeline(
    PolynomialFeatures(degree=3),  # add polynomial features up to degree 3
    LogisticRegression(C=10, max_iter=1000)  # C=10 weakens the regularization
)
model.fit(X, y)

# Visualize the decision boundary
def plot_decision_boundary(model, X, y):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k')
    plt.show()

plot_decision_boundary(model, X, y)

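Output

[Figure: decision boundary of the degree-3 polynomial logistic regression, drawn over a scatter plot of the two features]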
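What the polynomial expansion step does can be sketched without any library. The function below is a hypothetical helper that illustrates the idea and is not scikit-learn's implementation: it lists every product of the input features up to the chosen degree, and the model stays linear in these new features while the boundary becomes curved in the original two.

```python
from itertools import combinations_with_replacement

def polynomial_features(x, degree):
    """Return all monomials of the entries of x up to the given degree, including the bias term 1."""
    features = []
    for d in range(degree + 1):
        for combo in combinations_with_replacement(range(len(x)), d):
            term = 1.0
            for i in combo:
                term *= x[i]
            features.append(term)
    return features

print(polynomial_features([2.0, 3.0], degree=2))  # 1, a, b, a^2, a*b, b^2
print(len(polynomial_features([2.0, 3.0], degree=3)), "features at degree 3")
```

The feature count grows quickly with the degree and the number of inputs, which is one reason the regularization strength is usually tuned together with the degree.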

5. Regularization and Hyperparameter Tuning

Regularization helps prevent overfitting. The following compares the effect of L1 and L2:

# Regularization and hyperparameter tuning: compare the L1 and L2 penalties
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly in plots

# Load the dataset (Iris as an example)
data = load_iris()
X = data.data
y = data.target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the data (regularization is sensitive to feature scale)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Compare the two penalties
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)  # L1 regularization
model_l2 = LogisticRegression(penalty='l2', C=0.1)  # L2 regularization

model_l1.fit(X_train_scaled, y_train)
model_l2.fit(X_train_scaled, y_train)

print("L1 sparsity:", np.sum(model_l1.coef_ != 0), "non-zero coefficients")
print("L2 sparsity:", np.sum(model_l2.coef_ != 0), "non-zero coefficients")

Output

L1 sparsity: 4 non-zero coefficients
L2 sparsity: 12 non-zero coefficients
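Why L1 leaves fewer non-zero coefficients than L2 can be seen from how each penalty shrinks a single coefficient. The sketch below uses the standard proximal (shrinkage) steps for the two penalties. It is an illustration of the mechanism only: the scikit-learn solvers used above do not literally apply these steps, and the weights and `lam` are hypothetical. Note also that scikit-learn's `C` is the inverse of the regularization strength, so a small `C` corresponds to a large `lam`.

```python
def l1_shrink(w: float, lam: float) -> float:
    """Soft-thresholding: the proximal step for the penalty lam * |w|."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0  # small coefficients are set exactly to zero

def l2_shrink(w: float, lam: float) -> float:
    """Proximal step for the penalty (lam / 2) * w**2: scales toward zero without reaching it."""
    return w / (1.0 + lam)

# Hypothetical coefficients and penalty strength, for illustration only.
weights = [1.5, -0.8, 0.2, -0.05]
lam = 0.3
print("L1:", [round(l1_shrink(w, lam), 3) for w in weights])
print("L2:", [round(l2_shrink(w, lam), 3) for w in weights])
```

L1 subtracts a fixed amount and clips at zero, which produces exact zeros, while L2 rescales every coefficient and keeps them all non-zero.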

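6. Handling Class Imbalance

Adjusting class weights

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Simulate imbalanced data (95% negative class, 5% positive class)
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)

model = LogisticRegression(class_weight='balanced')  # weight classes inversely to their frequency
model.fit(X, y)

# Report for the balanced-weight model, evaluated on the training data
print("Balanced weights report:", classification_report(y, model.predict(X), zero_division=0))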
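The weights behind `class_weight='balanced'` follow the formula given in the scikit-learn documentation, `n_samples / (n_classes * count of the class)`. The sketch below reproduces that arithmetic in plain Python; the helper name is hypothetical, and the class counts (947 and 53) are the support values reported for this dataset in this section.

```python
from collections import Counter

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Class counts taken from the support column of the report in this section.
y_example = [0] * 947 + [1] * 53
print(balanced_class_weights(y_example))
```

Each minority sample therefore counts many times more than a majority sample in the loss, which pushes the model toward higher recall on the rare class at the cost of precision.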

Oversampling (SMOTE)

from imblearn.over_sampling import SMOTE

# Generate a balanced dataset
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

model = LogisticRegression()
model.fit(X_resampled, y_resampled)

Output (from the class-weight example)

Balanced weights report:               precision    recall  f1-score   support

           0       0.99      0.86      0.92       947
           1       0.26      0.85      0.39        53

    accuracy                           0.86      1000
   macro avg       0.62      0.86      0.66      1000
weighted avg       0.95      0.86      0.89      1000
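The idea behind the SMOTE oversampling step is interpolation: a synthetic minority sample is placed at a random position on the segment between a real minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration under that assumption. The function name and data are hypothetical, and imbalanced-learn's `SMOTE` handles neighbour search and sampling in its own way (its default is `k_neighbors=5`).

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a random minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # k nearest minority neighbours of the base point (squared Euclidean distance), excluding itself
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))[:k]
    neighbour = rng.choice(neighbours)
    gap = rng.random()  # position along the segment from base to neighbour
    return [a + gap * (b - a) for a, b in zip(base, neighbour)]

# Made-up minority-class points, for illustration only.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.3], [3.0, 3.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(3)]
print(synthetic)
```

Because every synthetic point lies between two real minority points, oversampling fills in the minority region rather than duplicating existing rows.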
