Machine Learning [White-Board Derivations] (III) [Linear Classification]

Source: https://blog.csdn.net/weixin_53196770/article/details/146300291

4. Linear Classification

4.1. Typical Models of Linear Classification

  • Hard classification: the output is a discrete label such as 0 or 1;
    • Perceptron
    • Linear discriminant analysis (Fisher)
  • Soft classification: the output is a value between 0 and 1, interpreted as the probability of each class;
    • Probabilistic generative models: Gaussian discriminant analysis (GDA) and naive Bayes, which model the joint distribution $p(\vec{x},y)$;
    • Probabilistic discriminative models: logistic regression, which models the conditional distribution $p(y|\vec{x})$.

4.2. Perceptron

4.2.1. Basic Model

  Model:
$$
f(\hat{\vec{x}})=\text{sign}(\hat{\vec{x}}^TW),\qquad
\text{sign}(a)=\begin{cases}1,&a>0\\0,&a=0\\-1,&a<0.\end{cases}\tag{4.1}
$$

  Idea: error-driven learning. Randomly initialize $W$, then use the misclassified samples to adjust it.
  Strategy: define the loss function over the misclassified samples:
$$
\mathcal{L}(W)=\sum_{\hat{\vec{x}}_i\in D}-y_i\cdot\hat{\vec{x}}_i^TW,\qquad
D=\{\hat{\vec{x}}_i\mid f(\hat{\vec{x}}_i)\neq y_i\}.\tag{4.2}
$$
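  As a concrete illustration, below is a minimal NumPy sketch of this error-driven update (stochastic gradient descent on Eq. (4.2)). It assumes `X` already carries the bias column of $\hat{\vec{x}}$ and that labels are in $\{-1,+1\}$; the function name and hyperparameters are illustrative, not from the original note.

```python
import numpy as np

def perceptron_train(X, y, lr=1.0, n_epochs=100):
    """Error-driven perceptron: SGD on the loss in Eq. (4.2).

    X: (N, d) samples, already augmented with a bias column.
    y: (N,) labels in {-1, +1}.
    """
    W = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        n_errors = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ W) <= 0:     # misclassified (or on the boundary)
                W += lr * yi * xi      # step along the negative loss gradient
                n_errors += 1
        if n_errors == 0:              # converged: nothing left to correct
            break
    return W
```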

4.3. Linear Discriminant Analysis (Fisher)

4.3.1. Problem Definition

  For a binary classification problem, split the samples into $X_{c_1}=\{\hat{\vec{x}}_i\mid y_i=+1\}$ and $X_{c_2}=\{\hat{\vec{x}}_i\mid y_i=-1\}$, containing $N_1$ and $N_2$ samples respectively, with $N_1+N_2=N$.
  The goal is to find a projection direction $W$ such that, when all samples are projected onto it, the within-class scatter is small and the between-class scatter is large.

4.3.2. Derivation

  The projection of a sample $\hat{\vec{x}}_i$ onto the direction $W$ can be written as $z_i=\hat{\vec{x}}_i^TW$. Its mean and variance are
$$
\begin{aligned}
\bar{z}&=\frac{1}{N}\sum_{i=1}^N z_i=\frac{1}{N}\sum_{i=1}^N\hat{\vec{x}}_i^TW,\\
S_z&=\frac{1}{N}\sum_{i=1}^N(z_i-\bar{z})^2=\frac{1}{N}\sum_{i=1}^N(\hat{\vec{x}}_i^TW-\bar{z})^2.
\end{aligned}\tag{4.3}
$$

  Applying the above to the two classes separately gives the means $\bar{z}_1$, $\bar{z}_2$ and the variances $S_{z_1}$, $S_{z_2}$. To make the within-class scatter as small as possible and the between-class scatter as large as possible, define the objective as
$$
\mathcal{J}(W)=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}},\tag{4.4}
$$

so the model becomes the optimization problem
$$
W=\argmax_W\mathcal{J}(W).\tag{4.5}
$$

  Simplifying the objective:
$$
\begin{aligned}
(\bar{z}_1-\bar{z}_2)^2&=\left(\frac{1}{N_1}\sum_{i=1}^{N_1}\hat{\vec{x}}_i^TW-\frac{1}{N_2}\sum_{i=1}^{N_2}\hat{\vec{x}}_i^TW\right)^2\\
&=\left[(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW\right]^2\\
&=W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW,
\end{aligned}\tag{4.6}
$$

$$
\begin{aligned}
S_{z_1}&=\frac{1}{N_1}\sum_{i=1}^{N_1}(\hat{\vec{x}}_i^TW-\bar{z}_1)^2\\
&=\frac{1}{N_1}\sum_{i=1}^{N_1}W^T(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^TW\\
&=W^T\left[\frac{1}{N_1}\sum_{i=1}^{N_1}(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^T\right]W\\
&=W^TS_{C_1}W,
\end{aligned}\tag{4.7}
$$

and similarly $S_{z_2}=W^TS_{C_2}W$. The objective therefore becomes
$$
\begin{aligned}
\mathcal{J}(W)&=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}}\\
&=\frac{W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^T(S_{C_1}+S_{C_2})W}.
\end{aligned}\tag{4.8}
$$

  Now define the total within-class scatter $S_w$ and the between-class scatter $S_b$:
$$
\begin{aligned}
S_w&=S_{C_1}+S_{C_2},\\
S_b&=(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^T,
\end{aligned}\tag{4.9}
$$

so the objective can be expressed as
$$
\mathcal{J}(W)=\frac{W^TS_bW}{W^TS_wW}.\tag{4.10}
$$

Differentiating the objective with respect to $W$:
$$
\frac{\partial\mathcal{J}(W)}{\partial W}=2S_bW(W^TS_wW)^{-1}+W^TS_bW\cdot(-1)\cdot(W^TS_wW)^{-2}\cdot 2S_wW.\tag{4.11}
$$

Setting it to zero gives
$$
\begin{aligned}
W&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}S_bW\\
&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW\\
&=\frac{W^TS_wW\cdot(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}),
\end{aligned}\tag{4.12}
$$

and since the fraction in front is just a scalar, while we only care about the direction of $W$ rather than its length, we finally obtain
$$
W\propto S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}).\tag{4.13}
$$
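  Eq. (4.13) translates directly into code. The following is a minimal NumPy sketch (the function name and the per-class inputs `X1`, `X2` are illustrative); it solves a linear system rather than inverting $S_w$, which is the usual numerically safer choice.

```python
import numpy as np

def fisher_lda_direction(X1, X2):
    """Fisher discriminant direction W ∝ S_w^{-1}(mu1 - mu2), Eq. (4.13).

    X1: (N1, d) samples of class 1; X2: (N2, d) samples of class 2.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_w = S_C1 + S_C2 (biased per-class covariances)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    Sw = S1 + S2
    # Solve S_w W = (mu1 - mu2) instead of forming S_w^{-1} explicitly
    W = np.linalg.solve(Sw, mu1 - mu2)
    return W / np.linalg.norm(W)   # normalize: only the direction matters
```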

4.4. Logistic Regression

4.4.1. Basic Idea

  Logistic regression introduces a nonlinear activation function into linear regression so that the regression output is mapped to a value in $[0,1]$, which serves as the class probability. The model can therefore be viewed as modeling a conditional probability distribution: given an input $\hat{\vec{x}}$, it models $p(y|\hat{\vec{x}})$ and outputs a discrete value of $y$.

4.4.2. The Sigmoid Activation Function

  Definition:
$$
\sigma(z)=\frac{1}{1+e^{-z}}.\tag{4.14}
$$

Special values:

  • as $z\rightarrow -\infty$, $\sigma(z)\rightarrow 0$;
  • at $z=0$, $\sigma(z)=\frac{1}{2}$;
  • as $z\rightarrow \infty$, $\sigma(z)\rightarrow 1$.

Function graph:
[Figure: the sigmoid curve, an S-shaped function rising from 0 to 1 and crossing 1/2 at z = 0]

4.4.3. Model Derivation

  Following the conditional-probability modeling idea:
$$
\begin{aligned}
p_1&=p(y=1|\hat{\vec{x}})=\sigma(\hat{\vec{x}}^TW)=\frac{1}{1+e^{-\hat{\vec{x}}^TW}},\\
p_0&=p(y=0|\hat{\vec{x}})=1-\sigma(\hat{\vec{x}}^TW)=\frac{e^{-\hat{\vec{x}}^TW}}{1+e^{-\hat{\vec{x}}^TW}},
\end{aligned}\tag{4.15}
$$

so the whole model can be written as
$$
p(y|\hat{\vec{x}})=p_1^y\cdot p_0^{1-y},\tag{4.16}
$$

which reduces to $p_0$ when $y=0$ and to $p_1$ when $y=1$.

  Solve the model by maximum likelihood estimation:
$$
\begin{aligned}
\hat{W}&=\argmax_W\log p(Y|X)\\
&=\argmax_W\sum_{i=1}^N\left[y_i\cdot\log\sigma(\hat{\vec{x}}_i^TW)+(1-y_i)\cdot\log\bigl(1-\sigma(\hat{\vec{x}}_i^TW)\bigr)\right],
\end{aligned}\tag{4.17}
$$

whose gradient is
$$
\begin{aligned}
\nabla_W\mathcal{L}&=\sum_{i=1}^N\left[y_i\cdot\bigl(1-\sigma(\hat{\vec{x}}_i^TW)\bigr)\cdot\hat{\vec{x}}_i-(1-y_i)\cdot\sigma(\hat{\vec{x}}_i^TW)\cdot\hat{\vec{x}}_i\right]\\
&=\sum_{i=1}^N\left[y_i-\sigma(\hat{\vec{x}}_i^TW)\right]\cdot\hat{\vec{x}}_i,
\end{aligned}\tag{4.18}
$$

which can be used to update the model iteratively by gradient ascent on the log-likelihood.
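  A minimal batch-gradient-ascent sketch of Eq. (4.18) in NumPy follows; the function names, learning rate, and iteration count are illustrative choices, not part of the derivation. `X` is assumed to carry the bias column and `y` to hold labels in $\{0,1\}$.

```python
import numpy as np

def sigmoid(z):
    """Eq. (4.14)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_train(X, y, lr=0.1, n_iters=1000):
    """Batch gradient ascent on the log-likelihood, gradient from Eq. (4.18).

    X: (N, d) samples augmented with a bias column; y: (N,) labels in {0, 1}.
    """
    W = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (y - sigmoid(X @ W))  # sum_i (y_i - sigma(x_i^T W)) x_i
        W += lr * grad / len(X)            # averaged step, ascending the likelihood
    return W
```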

4.5. Gaussian Discriminant Analysis

4.5.1. Probabilistic Discriminative Models vs. Probabilistic Generative Models

  A probabilistic discriminative model computes the conditional density $p(y|\vec{x})$ directly and takes the $y$ that maximizes it as the classification result.
  A probabilistic generative model does not need the exact value of $p(y|\vec{x})$; it only asks which of $p(y=1|\vec{x})$ and $p(y=0|\vec{x})$ is larger. By Bayes' rule, $p(y|\vec{x})=\frac{p(\vec{x}|y)\cdot p(y)}{p(\vec{x})}$, so the objective becomes
$$
\begin{aligned}
\hat{y}&=\argmax_y p(y|\vec{x})\\
&=\argmax_y p(\vec{x}|y)\cdot p(y),
\end{aligned}\tag{4.19}
$$

where, if $p(y=1)=\phi$, then $p(y=0)=1-\phi$, so the prior can be written compactly as $p(y)=\phi^y\cdot(1-\phi)^{1-y}$.
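  The decision rule in Eq. (4.19) can be written generically, independent of the choice of class-conditional density. Below is a small sketch (all names are illustrative) that takes log-densities as callables; note that $p(\vec{x})$ is shared by all classes and drops out of the argmax.

```python
import numpy as np

def generative_classify(x, class_log_pdfs, priors):
    """Generative decision rule, Eq. (4.19):
    argmax_y [log p(x|y) + log p(y)]; the evidence p(x) cancels.

    class_log_pdfs: dict mapping label y -> callable returning log p(x|y).
    priors: dict mapping label y -> prior probability p(y).
    """
    scores = {y: log_pdf(x) + np.log(priors[y])
              for y, log_pdf in class_log_pdfs.items()}
    return max(scores, key=scores.get)
```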

4.5.2. The Gaussian Assumption

  Gaussian discriminant analysis assumes that the class-conditional distributions are Gaussian with a shared covariance matrix:
$$
\begin{aligned}
\vec{x}\,|\,y=1&\sim N(\vec{\mu}_1,\Sigma),\\
\vec{x}\,|\,y=0&\sim N(\vec{\mu}_2,\Sigma).
\end{aligned}\tag{4.20}
$$

Writing out the log-likelihood of the joint distribution gives
$$
\begin{aligned}
\mathcal{L}(\theta)&=\log\prod_{i=1}^N p(\vec{x}_i,y_i)\\
&=\sum_{i=1}^N\left[\log p(\vec{x}_i|y_i)+\log p(y_i)\right]\\
&=\sum_{i=1}^N\Bigl\{y_i\cdot\left[-\tfrac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)-\tfrac{p}{2}\log 2\pi-\tfrac{1}{2}\log|\Sigma|\right]\\
&\quad+(1-y_i)\cdot\left[-\tfrac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)-\tfrac{p}{2}\log 2\pi-\tfrac{1}{2}\log|\Sigma|\right]\\
&\quad+y_i\cdot\log\phi+(1-y_i)\cdot\log(1-\phi)\Bigr\},
\end{aligned}\tag{4.21}
$$

where $p$ is the dimension of $\vec{x}$.

4.5.3. Solving the Model

  First solve for the parameter $\phi$ by taking the partial derivative of the objective:
$$
\frac{\partial\mathcal{L}(\theta)}{\partial\phi}=\sum_{i=1}^N\left(\frac{y_i}{\phi}-\frac{1-y_i}{1-\phi}\right).\tag{4.22}
$$

Setting it to zero gives
$$
\hat{\phi}=\frac{1}{N}\sum_{i=1}^{N}y_i,\tag{4.23}
$$

i.e., the optimal prior is simply the empirical frequency of each label in the sample.

  Next solve for $\vec{\mu}_1$. Since every term not involving $\vec{\mu}_1$ differentiates to zero, keep only the relevant part of the objective:
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\argmax_{\vec{\mu}_1}\sum_{i=1}^N y_i\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&=\argmax_{\vec{\mu}_1}\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right],
\end{aligned}\tag{4.24}
$$

where $C_1=\{\vec{x}_i\}_{y_i=1}$. The solution is therefore the same as maximum likelihood estimation for a multivariate Gaussian (see Section 2.2.2 of the earlier note, Machine Learning [White-Board Derivations] (I)):
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\frac{\sum_{i=1}^N y_i\cdot\vec{x}_i}{N_1}=\frac{\sum_{\vec{x}_i\in C_1}\vec{x}_i}{N_1},\\
\hat{\vec{\mu}}_2&=\frac{\sum_{i=1}^N(1-y_i)\cdot\vec{x}_i}{N_2}=\frac{\sum_{\vec{x}_i\in C_2}\vec{x}_i}{N_2}.
\end{aligned}\tag{4.25}
$$

  Finally solve for $\Sigma$, again keeping only the terms of the objective that involve it:
$$
\begin{aligned}
\hat{\Sigma}=\argmax_{\Sigma}&\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&+\sum_{\vec{x}_i\in C_2}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)\right]-\frac{N}{2}\log|\Sigma|.
\end{aligned}\tag{4.26}
$$

Taking the derivative gives
$$
\begin{aligned}
\frac{d\mathcal{L}(\theta)}{d\Sigma}&=-\frac{N}{2}\Sigma^{-1}+\frac{1}{2}\sum_{\vec{x}_i\in C_1}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}\\
&\quad+\frac{1}{2}\sum_{\vec{x}_i\in C_2}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1},
\end{aligned}\tag{4.27}
$$

and setting it to zero yields
$$
\begin{aligned}
\hat{\Sigma}&=\frac{1}{N}\left[\sum_{\vec{x}_i\in C_1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T+\sum_{\vec{x}_i\in C_2}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\right]\\
&=\frac{N_1\cdot S_1+N_2\cdot S_2}{N},
\end{aligned}\tag{4.28}
$$

where $S_1$ and $S_2$ are the within-class sample covariances of the two classes.
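  All three closed-form estimates, Eqs. (4.23), (4.25) and (4.28), can be computed in a few lines. A minimal NumPy sketch follows (`gda_fit` is an illustrative name; labels are assumed to be in $\{0,1\}$):

```python
import numpy as np

def gda_fit(X, y):
    """Closed-form MLE for GDA: Eqs. (4.23), (4.25), (4.28).

    X: (N, d) samples; y: (N,) labels in {0, 1}.
    Returns (phi, mu1, mu2, Sigma) with a covariance shared across classes.
    """
    X1, X2 = X[y == 1], X[y == 0]
    N1, N2 = len(X1), len(X2)
    phi = N1 / len(X)                            # prior p(y=1), Eq. (4.23)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)  # class means, Eq. (4.25)
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1          # biased per-class covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / len(X)         # shared covariance, Eq. (4.28)
    return phi, mu1, mu2, Sigma
```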

4.6. Naive Bayes Classifier

4.6.1. Basic Idea

  All algorithms in the naive Bayes family are based on the naive Bayes assumption, also known as the conditional independence assumption: the features are assumed to be mutually independent given the class. Naive Bayes is the simplest probabilistic graphical model, and its modeling procedure is quite close to Gaussian discriminant analysis, so it is not repeated here.
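  To make the independence assumption concrete, here is a minimal Gaussian naive Bayes sketch (all names are illustrative): the class-conditional density factorizes into a product of per-feature one-dimensional Gaussians, which is exactly where this model differs from GDA's full covariance.

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Per-class prior plus per-feature 1-D Gaussian parameters.

    The conditional-independence assumption lets p(x|y) factorize over features.
    """
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # (prior, per-feature means, per-feature variances)
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0))
    return params

def gaussian_nb_predict(params, x):
    """argmax_y log p(y) + sum_j log p(x_j|y), cf. Eq. (4.19)."""
    best, best_score = None, -np.inf
    for c, (prior, mu, var) in params.items():
        # Variances assumed positive; in practice add smoothing to var.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best, best_score = c, score
    return best
```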
