4. Linear Classification
4.1. Typical Models of Linear Classification
- Hard classification: the output is a discrete result such as 0 or 1;
  - Perceptron
  - Linear discriminant analysis (Fisher)
- Soft classification: outputs a value in $[0,1]$ as the probability of each class;
  - Probabilistic generative models: Gaussian discriminant analysis (GDA) and naive Bayes, which mainly model $p(\vec{x},y)$
  - Probabilistic discriminative models: logistic regression, which mainly models $p(y\mid\vec{x})$
4.2. Perceptron
4.2.1. Basic Model
Model:
$$
\begin{aligned}
f(\hat{\vec{x}})&=\text{sign}(\hat{\vec{x}}^TW),\\
\text{sign}(a)&=\left\{\begin{matrix}1,&a>0\\0,&a=0\\-1,&a<0\end{matrix}\right.
\end{aligned}\tag{4.1}
$$
Idea: error-driven learning. Initialize $W$ randomly, then adjust it by examining the misclassified samples.
Strategy: compute the loss function over the misclassified samples:
$$
\begin{aligned}
\mathcal{L}(W)&=\sum_{\hat{\vec{x}}_i\in D}-y_i\cdot \hat{\vec{x}}_i^TW,\\
D&=\{\hat{\vec{x}}_i\}_{f(\hat{\vec{x}}_i)\neq y_i}.
\end{aligned}\tag{4.2}
$$
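As a rough illustration, here is a minimal NumPy sketch of the error-driven update (my own; the names `X_hat`, `y`, `lr` and the stopping rule are assumptions, not part of the original notes). Each pass moves $W$ along $y_i\cdot\hat{\vec{x}}_i$ for the misclassified samples, i.e. gradient descent on Eq. (4.2):

```python
import numpy as np

def perceptron_train(X_hat, y, lr=0.1, max_iter=1000):
    """Error-driven perceptron on the loss L(W) = sum_{i in D} -y_i * x_i^T W.

    X_hat : (N, p+1) augmented samples (a constant-1 column absorbs the bias)
    y     : (N,) labels in {+1, -1}
    """
    W = np.random.randn(X_hat.shape[1])        # random initialization of W
    for _ in range(max_iter):
        mis = np.sign(X_hat @ W) != y          # misclassified set D
        if not mis.any():                      # stop once every sample is correct
            break
        # dL/dW = -sum_{i in D} y_i * x_i, so descending the loss adds y_i * x_i
        W += lr * (y[mis][:, None] * X_hat[mis]).sum(axis=0)
    return W
```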
4.3. Linear Discriminant Analysis (Fisher)
4.3.1. Problem Definition
For a binary classification problem, split the samples into $X_{c_1}=\left\{\hat{\vec{x}}_i\mid y_i=+1\right\}$ and $X_{c_2}=\left\{\hat{\vec{x}}_i\mid y_i=-1\right\}$, containing $N_1$ and $N_2$ samples respectively, with $N_1+N_2=N$.
Find a projection direction $W$ such that, when all sample points are projected onto it, the within-class scatter is small and the between-class separation is large.
4.3.2. Derivation
The projection of a sample $\hat{\vec{x}}_i$ onto $W$ can be written as $z=\hat{\vec{x}}^T\cdot W$; its mean and variance are:
$$
\begin{aligned}
\bar{z}&=\frac{1}{N}\sum_{i=1}^N z_i=\frac{1}{N}\sum_{i=1}^N\hat{\vec{x}}_i^T\cdot W\\
S_z&=\frac{1}{N}\sum_{i=1}^N(z_i-\bar{z})^2=\frac{1}{N}\sum_{i=1}^N(\hat{\vec{x}}_i^T\cdot W-\bar{z})^2.
\end{aligned}\tag{4.3}
$$
Using the expressions above, compute the means $\bar{z}_1$, $\bar{z}_2$ and variances $S_{z_1}$, $S_{z_2}$ of the two classes separately. To make the within-class scatter as small as possible and the between-class separation as large as possible, define the objective function as
$$
\mathcal{J}(W)=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}},\tag{4.4}
$$
so the model becomes the optimization problem:
$$
W=\argmax_W\mathcal{J}(W),\tag{4.5}
$$
Simplify the objective function:
$$
\begin{aligned}
(\bar{z}_1-\bar{z}_2)^2&=\left(\frac{1}{N_1}\sum_{i=1}^{N_1}\hat{\vec{x}}_i^T\cdot W-\frac{1}{N_2}\sum_{i=1}^{N_2}\hat{\vec{x}}_i^T\cdot W\right)^2\\
&=\left[(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW\right]^2\\
&=W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW
\end{aligned}\tag{4.6}
$$

$$
\begin{aligned}
S_{z_1}&=\frac{1}{N_1}\sum_{i=1}^{N_1}(\hat{\vec{x}}_i^T\cdot W-\bar{z}_1)^2\\
&=\frac{1}{N_1}\sum_{i=1}^{N_1}W^T(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^TW\\
&=W^T\left[\frac{1}{N_1}\sum_{i=1}^{N_1}(\vec{x}_i-\bar{\vec{x}}_{C_1})(\vec{x}_i-\bar{\vec{x}}_{C_1})^T\right]W\\
&=W^TS_{C_1}W,
\end{aligned}\tag{4.7}
$$
and similarly $S_{z_2}=W^TS_{C_2}W$. The objective function therefore becomes
$$
\begin{aligned}
\mathcal{J}(W)&=\frac{(\bar{z}_1-\bar{z}_2)^2}{S_{z_1}+S_{z_2}}\\
&=\frac{W^T(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^T(S_{C_1}+S_{C_2})W}
\end{aligned}\tag{4.8}
$$
Define the total within-class scatter $S_w$ and the between-class scatter $S_b$:
$$
\begin{aligned}
S_w&=S_{C_1}+S_{C_2}\\
S_b&=(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^T
\end{aligned}\tag{4.9}
$$
The objective function can then be written as:
$$
\mathcal{J}(W)=\frac{W^TS_bW}{W^TS_wW},\tag{4.10}
$$
Differentiating the objective function gives:
$$
\begin{aligned}
\frac{\partial \mathcal{J}(W)}{\partial W}=&\,2S_bW(W^TS_wW)^{-1}\\
&+W^TS_bW\cdot(-1)\cdot(W^TS_wW)^{-2}\cdot 2S_wW,
\end{aligned}\tag{4.11}
$$
Setting it to zero yields
$$
\begin{aligned}
W&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}S_bW\\
&=\frac{W^TS_wW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW\\
&=\frac{W^TS_wW\cdot(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})^TW}{W^TS_bW}S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}),
\end{aligned}\tag{4.12}
$$
Since the factor in front of $S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2})$ is just a scalar, and we only care about the direction of $W$ rather than its length ($W$ can always be normalized to a unit vector), we finally obtain:
$$
W\propto S_w^{-1}(\bar{\vec{x}}_{C_1}-\bar{\vec{x}}_{C_2}).\tag{4.13}
$$
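A minimal sketch of Eq. (4.13) follows (my own illustration; the variable names `X1`, `X2` for the two classes' sample matrices are assumptions):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Projection direction W ∝ S_w^{-1} (mean_C1 - mean_C2), Eq. (4.13).

    X1, X2 : (N1, p) and (N2, p) sample matrices of the two classes.
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter matrices S_C1, S_C2 as in Eq. (4.7) (biased covariances)
    S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)
    S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
    S_w = S1 + S2
    W = np.linalg.solve(S_w, mu1 - mu2)   # S_w^{-1}(mu1 - mu2) without forming the inverse
    return W / np.linalg.norm(W)          # only the direction matters, so normalize
```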
4.4. Logistic Regression
4.4.1. Basic Idea
Introduce a nonlinear activation function into linear regression so that the regression output is mapped to a probability, interpreted as the class probability. The model can thus be viewed as modeling a conditional probability distribution: given the input $\hat{\vec{x}}$, model $p(y\mid\hat{\vec{x}})$ and output a discrete value of $y$.
4.4.2. Sigmoid Activation Function
Definition:
$$
\sigma(z)=\frac{1}{1+e^{-z}}\tag{4.14}
$$
Special values (checked numerically in the sketch after this list):
- As $z\rightarrow -\infty$, $\sigma(z)\rightarrow 0$;
- At $z=0$, $\sigma(z)=\frac{1}{2}$;
- As $z\rightarrow \infty$, $\sigma(z)\rightarrow 1$.
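A tiny numeric check of these values (my own sketch; nothing here is from the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-50.0))  # ~0    (z -> -inf)
print(sigmoid(0.0))    # 0.5   (z = 0)
print(sigmoid(50.0))   # ~1    (z -> +inf)
```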
Function plot: (S-shaped sigmoid curve; figure omitted)
4.4.3. Model Derivation
Following the idea of modeling the conditional probability:
$$
\begin{aligned}
p_1&=p(y=1\mid\hat{\vec{x}})=\sigma(\hat{\vec{x}}^TW)=\frac{1}{1+e^{-\hat{\vec{x}}^TW}}\\
p_0&=p(y=0\mid\hat{\vec{x}})=1-\sigma(\hat{\vec{x}}^TW)=\frac{e^{-\hat{\vec{x}}^TW}}{1+e^{-\hat{\vec{x}}^TW}},
\end{aligned}\tag{4.15}
$$
so the whole model can be written as
$$
p(y\mid x)=p_1^y\cdot p_0^{1-y},\tag{4.16}
$$
When $y=0$, $p=p_0$; when $y=1$, $p=p_1$.
Solve the model by maximum likelihood estimation:
$$
\begin{aligned}
\hat{W}&=\argmax_W\log p(Y\mid X)\\
&=\argmax_W\sum_{i=1}^N\left[y_i\cdot\log\sigma(\hat{\vec{x}}_i^TW)+(1-y_i)\cdot\log\left(1-\sigma(\hat{\vec{x}}_i^TW)\right)\right],
\end{aligned}\tag{4.17}
$$
Taking the gradient gives
$$
\begin{aligned}
\nabla_W\log p(Y\mid X)&=\sum_{i=1}^N\left[y_i\cdot\left(1-\sigma(\hat{\vec{x}}_i^TW)\right)\cdot\hat{\vec{x}}_i-(1-y_i)\cdot\sigma(\hat{\vec{x}}_i^TW)\cdot\hat{\vec{x}}_i\right]\\
&=\sum_{i=1}^N\left[y_i-\sigma(\hat{\vec{x}}_i^TW)\right]\cdot\hat{\vec{x}}_i,
\end{aligned}\tag{4.18}
$$
which can be used to update the model iteratively (gradient ascent on the log-likelihood).
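A minimal gradient-ascent sketch based on Eqs. (4.17)–(4.18) (the names `X_hat`, `y`, `lr`, `n_iter` and the fixed iteration count are assumptions of mine, not part of the original notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_fit(X_hat, y, lr=0.01, n_iter=2000):
    """Maximize the log-likelihood of Eq. (4.17) by gradient ascent.

    X_hat : (N, p+1) augmented samples
    y     : (N,) labels in {0, 1}
    """
    W = np.zeros(X_hat.shape[1])
    for _ in range(n_iter):
        # Eq. (4.18): gradient = sum_i (y_i - sigma(x_i^T W)) * x_i
        grad = X_hat.T @ (y - sigmoid(X_hat @ W))
        W += lr * grad               # ascent step, since the objective is maximized
    return W
```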
4.5. Gaussian Discriminant Analysis
4.5.1. Difference Between Probabilistic Discriminative and Generative Models
A probabilistic discriminative model directly computes the conditional probability $p(y\mid\vec{x})$ and takes the $y$ that maximizes it as the classification result.
A probabilistic generative model does not need the exact value of $p(y\mid\vec{x})$; it only asks which of $p(y=1\mid\vec{x})$ and $p(y=0\mid\vec{x})$ is larger. Using Bayes' rule $p(y\mid\vec{x})=\frac{p(\vec{x}\mid y)\cdot p(y)}{p(\vec{x})}$, the objective becomes:
$$
\begin{aligned}
\hat{y}&=\argmax_y p(y\mid\vec{x})\\
&=\argmax_y p(\vec{x}\mid y)\cdot p(y),
\end{aligned}\tag{4.19}
$$
where, if $p(y=1)=\phi$, then $p(y=0)=1-\phi$, so the prior can be written compactly as $p(y)=\phi^y\cdot(1-\phi)^{1-y}$.
4.5.2. Gaussian Assumption
In Gaussian discriminant analysis, the class-conditional distributions are assumed to be Gaussian:
$$
\begin{aligned}
\vec{x}\mid y=1&\sim N(\vec{\mu}_1,\Sigma)\\
\vec{x}\mid y=0&\sim N(\vec{\mu}_2,\Sigma),
\end{aligned}\tag{4.20}
$$
Writing the log-likelihood as the objective gives
$$
\begin{aligned}
\mathcal{L}(\theta)&=\log\prod_{i=1}^N p(\vec{x}_i,y_i)\\
&=\sum_{i=1}^N\left[\log p(\vec{x}_i\mid y_i)+\log p(y_i)\right]\\
&=\sum_{i=1}^N\Bigl\{y_i\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)-\frac{p}{2}\log 2\pi-\frac{1}{2}\log|\Sigma|\right]\\
&\quad+(1-y_i)\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)-\frac{p}{2}\log 2\pi-\frac{1}{2}\log|\Sigma|\right]\\
&\quad+y_i\cdot\log\phi+(1-y_i)\cdot\log(1-\phi)\Bigr\}.
\end{aligned}\tag{4.21}
$$
4.5.3. Solving the Model
First solve for the parameter $\phi$ by taking the partial derivative of the objective:

$$
\frac{\partial\mathcal{L}(\theta)}{\partial\phi}=\sum_{i=1}^N\left(\frac{y_i}{\phi}-\frac{1-y_i}{1-\phi}\right),\tag{4.22}
$$

Setting it to zero gives
$$
\hat{\phi}=\frac{1}{N}\sum_{i=1}^N y_i,\tag{4.23}
$$
that is, the optimal value of $\phi$ is simply the frequency of each label in the samples.
Next solve for $\vec{\mu}_1$. Since the terms unrelated to $\vec{\mu}_1$ vanish after differentiation, keep only the relevant part of the objective:
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\argmax_{\vec{\mu}_1}\sum_{i=1}^N y_i\cdot\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&=\argmax_{\vec{\mu}_1}\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right],
\end{aligned}\tag{4.24}
$$
where $C_1=\left\{\vec{x}_i\right\}_{y_i=1}$. The solution is therefore the same as the maximum likelihood estimate of a multivariate Gaussian (see Section 2.2.2 of the earlier note 机器学习[白板推导](一)), which gives
$$
\begin{aligned}
\hat{\vec{\mu}}_1&=\frac{\sum_{i=1}^N y_i\cdot\vec{x}_i}{N_1}=\frac{\sum_{\vec{x}_i\in C_1}\vec{x}_i}{N_1}\\
\hat{\vec{\mu}}_2&=\frac{\sum_{i=1}^N(1-y_i)\cdot\vec{x}_i}{N_2}=\frac{\sum_{\vec{x}_i\in C_2}\vec{x}_i}{N_2},
\end{aligned}\tag{4.25}
$$
Finally solve for $\Sigma$; again keep only the terms of the objective that involve it:
$$
\begin{aligned}
\hat{\Sigma}=&\argmax_{\Sigma}\sum_{\vec{x}_i\in C_1}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)\right]\\
&+\sum_{\vec{x}_i\in C_2}\left[-\frac{1}{2}(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)\right]-\frac{N}{2}\log|\Sigma|,
\end{aligned}\tag{4.26}
$$
Taking the derivative gives
$$
\begin{aligned}
\frac{d\mathcal{L}(\theta)}{d\Sigma}=&-\frac{N}{2}\Sigma^{-1}+\frac{1}{2}\sum_{\vec{x}_i\in C_1}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T\Sigma^{-1}\\
&+\frac{1}{2}\sum_{\vec{x}_i\in C_2}\Sigma^{-1}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\Sigma^{-1},
\end{aligned}\tag{4.27}
$$
Setting it to zero yields
$$
\begin{aligned}
\hat{\Sigma}&=\frac{1}{N}\left[\sum_{\vec{x}_i\in C_1}(\vec{x}_i-\vec{\mu}_1)(\vec{x}_i-\vec{\mu}_1)^T+\sum_{\vec{x}_i\in C_2}(\vec{x}_i-\vec{\mu}_2)(\vec{x}_i-\vec{\mu}_2)^T\right]\\
&=\frac{N_1\cdot S_1+N_2\cdot S_2}{N},
\end{aligned}\tag{4.28}
$$
where $S_1$ and $S_2$ are the within-class sample covariance matrices of the two classes.
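Collecting Eqs. (4.23), (4.25) and (4.28), here is a minimal estimation sketch (the variable names `X`, `y` and the $\{0,1\}$ label convention are assumptions of mine):

```python
import numpy as np

def gda_fit(X, y):
    """MLE for Gaussian discriminant analysis with a shared covariance.

    X : (N, p) samples;  y : (N,) labels in {0, 1}
    Returns phi, mu1 (class y=1), mu2 (class y=0), Sigma.
    """
    N = len(y)
    X1, X2 = X[y == 1], X[y == 0]
    N1, N2 = len(X1), len(X2)
    phi = N1 / N                                  # Eq. (4.23): label frequency
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # Eq. (4.25): class means
    S1 = (X1 - mu1).T @ (X1 - mu1) / N1           # within-class (biased) covariances
    S2 = (X2 - mu2).T @ (X2 - mu2) / N2
    Sigma = (N1 * S1 + N2 * S2) / N               # Eq. (4.28): weighted average
    return phi, mu1, mu2, Sigma
```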
4.6. Naive Bayes Classifier
4.6.1. Basic Idea
All algorithms in the naive Bayes family are based on the naive Bayes assumption, also known as the conditional independence assumption: the features are assumed to be mutually independent given the class. The naive Bayes model is the simplest probabilistic graphical model; its derivation is close to that of Gaussian discriminant analysis and is not repeated here (a small sketch of the factorized likelihood follows below).
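To make the conditional-independence assumption concrete, here is a small Gaussian naive Bayes sketch (my own illustration under assumed names; not from the original notes). The class-conditional density factorizes as $p(\vec{x}\mid y)=\prod_j p(x_j\mid y)$, and prediction follows Eq. (4.19):

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Fit a prior and per-feature 1-D Gaussians for each class
    (features are treated as conditionally independent given the class)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(y),        # prior p(y=c)
                     Xc.mean(axis=0),         # per-feature means
                     Xc.var(axis=0) + 1e-9)   # per-feature variances (jitter for stability)
    return params

def gaussian_nb_predict(params, x):
    """argmax_y log p(y) + sum_j log p(x_j | y), i.e. Eq. (4.19) with the factorized likelihood."""
    scores = {c: np.log(prior)
                 - 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
              for c, (prior, mu, var) in params.items()}
    return max(scores, key=scores.get)
```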