PSI: Population Stability Index (Continuous Variables)
- PSI (Population Stability Index): in risk control, a model typically stays in production for a long time after going live (usually more than a year). An unstable model directly undermines the soundness of decisions, so stability outweighs everything else. PSI measures how stable the distribution of a validation sample is relative to the distribution of the modeling sample. It is commonly used to screen feature variables and to evaluate model stability:
  - input variables must stay stable (variable monitoring)
  - model scores must stay stable (model monitoring)
- When building a model, one typically takes:
  - the training sample (In the Sample, INS) as the expected distribution
  - the validation sample as the actual distribution; validation samples include:
    - out-of-sample data (Out of Sample, OOS)
    - out-of-time data (Out of Time, OOT)
- PSI formula:

$$
\begin{aligned}
PSI &= \sum{(\text{actual share} - \text{expected share}) \ln\left(\frac{\text{actual share}}{\text{expected share}}\right)} \\
&= \sum\limits_{buckets} (\mathrm{actual\_pct} - \mathrm{expect\_pct}) \ln\left(\frac{\mathrm{actual\_pct}}{\mathrm{expect\_pct}}\right)
\end{aligned}
$$
where, in the code below, origin_percent is the share of samples in the current bucket for the expected (training) data, and new_percent is the share for the actual (new) data.
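As a quick sanity check on the formula, here is a minimal hand-computed example (the bucket shares below are made up purely for illustration):

```python
import numpy as np

# hypothetical bucket shares, for illustration only
expect_pct = np.array([0.10, 0.20, 0.40, 0.20, 0.10])  # expected (training) distribution
actual_pct = np.array([0.08, 0.22, 0.38, 0.22, 0.10])  # actual (validation) distribution

# PSI = sum over buckets of (actual - expected) * ln(actual / expected)
psi = np.sum((actual_pct - expect_pct) * np.log(actual_pct / expect_pct))
print(round(psi, 4))  # 0.0093, well inside the "stable" range (< 0.1)
```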
- Note: np.log defaults to base e. In information theory the base is commonly 2, so the unit of information is the bit; in machine learning the base is usually the natural constant e, so the unit is often called the nat.
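The choice of base only rescales the result by a constant factor, as a quick check shows (the numbers here are arbitrary):

```python
import numpy as np

ratio = 0.3 / 0.2
print(np.log(ratio))               # natural log: the result is in nats
print(np.log2(ratio))              # base-2 log: the result is in bits
print(np.log2(ratio) * np.log(2))  # bits * ln(2) recovers the value in nats
```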
- PSI value range

| PSI range | Stability | Suggested action |
| --- | --- | --- |
| 0 ~ 0.1 | Good | No change, or very little change |
| 0.1 ~ 0.25 | Slightly unstable | Some change; keep monitoring |
| > 0.25 | Unstable | Major change; analyze the feature |
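The table can also be encoded as a small convenience helper (a hypothetical utility, not part of the code that follows; the thresholds are taken from the table):

```python
def psi_level(psi: float) -> str:
    """Map a PSI value to the stability label in the table above."""
    if psi < 0.1:
        return 'good: no change or very little change'
    elif psi <= 0.25:
        return 'slightly unstable: keep monitoring'
    else:
        return 'unstable: analyze the feature'
```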
- Code
```python
import pandas as pd
import numpy as np
import scorecardpy as sc  # used below for supervised (tree / chimerge) binning
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tqdm import notebook

# create the data
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data,
                  columns=['_'.join(i.split()) for i in cancer.feature_names])
df['y'] = cancer.target

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, :-1], df['y'],
                                                    test_size=.2)
print(X_train.shape)  # (455, 30)
print(X_test.shape)   # (114, 30)
```
PSI calculation code
```python
def psi_calculate(origin, new, feature_name, origin_y=None, y_name=None,
                  buckets_type='cut', bins_num=10):
    """Compute the PSI of a single continuous variable.

    origin is the expected (training) data; new is the actual (new) data.
    Unsupervised binning ('cut', 'qcut') needs no target variable.
    Supervised binning ('tree', 'chimerge') uses the scorecardpy library
    and requires the target variable.

    Parameters
    ----------
    origin : DataFrame, expected (training) data
    new : DataFrame, actual (new) data
    feature_name : string, the continuous feature to compute PSI for
    origin_y : Series, target values (supervised binning only)
    y_name : string, name of the target variable
    buckets_type : string, binning method: 'cut', 'qcut', 'tree', 'chimerge'
    bins_num : int, number of bins

    Returns
    -------
    psi : float, the PSI value
    psi_df : DataFrame, per-bucket PSI details

    Examples
    --------
    Equal-frequency / equal-width binning (unsupervised):

    >>> psi, psi_df = psi_calculate(origin=X_train, new=X_test,
    ...                             feature_name='mean_radius',
    ...                             buckets_type='qcut', bins_num=10)

    Decision-tree / chi-merge binning (supervised):

    >>> psi, psi_df = psi_calculate(origin=X_train, new=X_test,
    ...                             feature_name='mean_radius',
    ...                             origin_y=y_train, y_name='y',
    ...                             buckets_type='chimerge')
    """
    origin = origin[[feature_name]]
    new = new[[feature_name]]
    if buckets_type == 'cut':  # equal-width binning
        origin_min = origin[feature_name].min()
        origin_max = origin[feature_name].max()
        binlen = (origin_max - origin_min) / bins_num  # width of each bin
        bins = [origin_min + i * binlen for i in range(1, bins_num)]  # inner cut points
        bins.insert(0, -float('inf'))
        bins.append(float('inf'))
        origin_cut = pd.cut(origin[feature_name], bins=bins).value_counts(sort=False).reset_index()
        new_cut = pd.cut(new[feature_name], bins=bins).value_counts(sort=False).reset_index()
        origin_cut.columns = ['buckets', 'origin_cnt']
        new_cut.columns = ['buckets', 'new_cnt']
    elif buckets_type == 'qcut':  # equal-frequency binning
        qcut_data = pd.qcut(origin[feature_name], q=bins_num, duplicates='drop')
        origin_cut = origin[feature_name].groupby(qcut_data).count().rename('origin_cnt')
        # take the bins from the grouped index; using retbins=True directly
        # would carry floating-point error
        qcut_bins = origin_cut.index.categories
        origin_cut = origin_cut.reset_index()
        new_cut = (new[feature_name]
                   .groupby(pd.cut(new[feature_name], bins=qcut_bins))
                   .count().rename('new_cnt').reset_index())
        origin_cut.columns = ['buckets', 'origin_cnt']
        new_cut.columns = ['buckets', 'new_cnt']
    elif buckets_type in ['tree', 'chimerge']:
        # supervised binning (decision tree or chi-merge) via scorecardpy;
        # both methods need the target variable
        origin_cut_data = pd.concat([origin[feature_name], origin_y], axis=1)
        origin_cut = sc.woebin(origin_cut_data, y=y_name, method=buckets_type)[feature_name]
        break_list = [float(i) for i in origin_cut.breaks.tolist()]
        break_list.insert(0, -np.inf)
        origin_cut = origin_cut[['bin', 'count']]
        new_cut = (new[feature_name]
                   .groupby(pd.cut(new[feature_name], bins=break_list, right=False))  # [left, right)
                   .count().rename('new_cnt').reset_index())
        new_cut[feature_name] = new_cut[feature_name].astype('str')
        new_cut[feature_name] = new_cut[feature_name].apply(lambda x: x.replace(' ', ''))
        origin_cut.columns = ['buckets', 'origin_cnt']
        new_cut.columns = ['buckets', 'new_cnt']
        # guard against floating-point error breaking the merge on cut points,
        # e.g. [-inf,0.14999999999999997) vs [-inf,0.15)
        origin_cut['buckets'] = new_cut['buckets']
    else:
        raise ValueError("buckets_type must be one of 'cut', 'qcut', 'tree', 'chimerge'")

    origin_cut['feature'] = feature_name
    new_cut['feature'] = feature_name
    origin_cut = origin_cut[['feature', 'buckets', 'origin_cnt']]
    new_cut = new_cut[['feature', 'buckets', 'new_cnt']]
    psi_df = pd.merge(origin_cut, new_cut, on=['feature', 'buckets'])
    # compute shares; add 1 to the numerator so no bucket share is 0
    # when computing PSI (the denominator cannot be 0 here)
    psi_df['origin_percent'] = (psi_df['origin_cnt'] + 1) / psi_df['origin_cnt'].sum()
    psi_df['new_percent'] = (psi_df['new_cnt'] + 1) / psi_df['new_cnt'].sum()
    psi_df['minus'] = psi_df['origin_percent'] - psi_df['new_percent']
    psi_df['log'] = np.log(psi_df['origin_percent'] / psi_df['new_percent'])
    psi_df['psi_bucket'] = psi_df['minus'] * psi_df['log']
    psi_df['psi'] = psi_df['psi_bucket'].sum()
    psi = psi_df['psi_bucket'].sum()
    return psi, psi_df
```
Compute PSI for a single variable (equal-frequency binning)
```python
# compute the PSI of a single feature
psi, psi_df = psi_calculate(X_train, X_test, buckets_type='qcut',
                            feature_name='mean_radius')
psi
# output: 0.11023008607241508
```
Compute PSI for a single variable (chi-merge binning as an example)

```python
psi, psi_df = psi_calculate(origin=X_train, new=X_test, feature_name='mean_radius',
                            origin_y=y_train, y_name='y', buckets_type='chimerge')
psi
# output: 0.02009471667376726
```
Compute PSI for all features (chi-merge binning as an example)

```python
# compute the PSI of every feature
psi_list = []
psi_df_list = []
for feature in notebook.tqdm(X_train.columns):
    print(feature)
    psi, psi_df = psi_calculate(origin=X_train, new=X_test, feature_name=feature,
                                origin_y=y_train, y_name='y', buckets_type='chimerge')
    psi_list.append((feature, psi))
    psi_df_list.append(psi_df)

psi = pd.DataFrame(psi_list, columns=['feature', 'psi'])
psi_df = pd.concat(psi_df_list, ignore_index=False)
```
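Following the thresholds in the table above, the per-feature results can then be screened for unstable candidates (a sketch; psi here is the summary DataFrame just built):

```python
# rank features by PSI and flag the unstable ones (> 0.25)
psi_sorted = psi.sort_values('psi', ascending=False)
unstable = psi_sorted.loc[psi_sorted['psi'] > 0.25, 'feature'].tolist()
print(unstable)  # candidates to investigate or drop before modeling
```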