Getting started
The most basic package for LASSO in R is glmnet (library(glmnet)).
For a pure LASSO analysis, alpha must be set to 1 (alpha = 0 gives ridge regression; values in between give the elastic net).
Standardize the data: LASSO is sensitive to feature scale, so standardize each feature to mean 0 and variance 1.
Pass the lambda.min or lambda.1se obtained from cv.glmnet to the final fit:
glmnet::glmnet(lambda = best_lambda)
library(glmnet)

# Load the data (mtcars as an example)
data(mtcars)
x <- as.matrix(mtcars[, -1])  # feature matrix (mpg is the response)
y <- mtcars$mpg

# Cross-validation to pick the optimal lambda (automatic LASSO)
cv_fit <- cv.glmnet(x, y, alpha = 1)
best_lambda <- cv_fit$lambda.min

# Fit the final model at the optimal lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda)

# List the selected variables (nonzero coefficients)
selected_vars <- rownames(coef(final_model))[coef(final_model)[, 1] != 0]
print(selected_vars)
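lambda.1se (the largest λ whose cross-validated error is within one standard error of the minimum) is the more conservative choice and usually keeps fewer variables; swapping it in is a one-line change:

# More parsimonious alternative: the one-standard-error rule
best_lambda_1se <- cv_fit$lambda.1se
sparser_model <- glmnet(x, y, alpha = 1, lambda = best_lambda_1se)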
Manually standardizing the feature matrix:
x_scaled <- scale(x)
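Note that glmnet already standardizes predictors internally by default (standardize = TRUE) and reports coefficients back on the original scale, so manual scaling is optional. If the matrix has been scaled by hand as above, the internal standardization can be switched off; a minimal sketch:

# x_scaled was produced by scale(x) above, so skip glmnet's internal standardization
fit_scaled <- cv.glmnet(x_scaled, y, alpha = 1, standardize = FALSE)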
Categorical variables: a comparison test
library(glmnet)
data(iris)
str(iris$Species)  # Species is a factor with 3 levels
df <- iris

# Approach 1: expand the factor into dummy variables with model.matrix()
design_matrix <- model.matrix(~ Species, data = df)
x <- as.matrix(data.frame(Sepal.Width = df$Sepal.Width,
                          Petal.Length = df$Petal.Length,
                          Petal.Width = df$Petal.Width,
                          design_matrix))
fit1 <- cv.glmnet(x = x, y = df$Sepal.Length)
fit1
plot(fit1)

# Approach 2: coerce the factor to a numeric code (1, 2, 3)
iris$Species_num <- as.numeric(iris$Species)
x2 <- as.matrix(iris[, c(2, 3, 4, 6)])  # column 6 is Species_num; column 5 is the factor itself
fit2 <- cv.glmnet(x = x2, y = iris$Sepal.Length)
fit2
plot(fit2)
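Coercing the factor to a numeric code (fit2) silently imposes an ordering on the levels, so the model.matrix() route (fit1) is generally the right one. If no level should be absorbed into the intercept, full dummy coding is a small variation; a sketch:

# One indicator column per species (no reference level)
design_full <- model.matrix(~ Species - 1, data = df)
head(design_full)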
Esophageal cancer example
# -----01-Lasso----
library(glmnet)

df <- read.csv("tab.csv")  # the data must be loaded before it is split

# 70/30 train/test split
set.seed(123)
train_index <- caret::createDataPartition(1:nrow(df), p = 0.7, list = TRUE)[["Resample1"]]
test_index <- setdiff(1:nrow(df), train_index)

# First, search for the tuning parameter with cv.glmnet() (below)
names(df)

# Columns 4-15 are categorical covariates: convert to factors
df[, 4:15] <- lapply(df[, 4:15], as.factor)
paste(names(df[, 4:15]), collapse = "+")  # assemble the formula terms for model.matrix()
design_matrix <- model.matrix(~ Smoking_status + Alcohol_consumption + Tea_consumption + Sex + Ethnic.group + Residence + Education + Marital.status + History_of_diabetes + Family_history_of_cancer + Occupation + Physical_Activity, data = df)

# Columns 16-48 are continuous variables: standardize to mean 0, sd 1
df[, 16:48] <- scale(df[, 16:48])
summary(df$AAvsEPA); sd(df$AAvsEPA)  # sanity check after scaling

x <- as.matrix(data.frame(df[, 16:48], design_matrix))

# 5-fold cross-validated LASSO logistic regression on the training set
fit1 <- cv.glmnet(x = x[train_index, ], y = df[train_index, ]$Group,
                  alpha = 1, nfolds = 5, type.measure = "mse", family = "binomial")
plot(fit1)
fit1
mean(fit1$cvm)
best_lambda <- fit1$lambda.1se
coefficients <- coef(fit1, s = best_lambda)
selected_vars <- rownames(coefficients)[coefficients[, 1] != 0]
print("Selected variables:")
print(selected_vars)

# Test-set prediction (probabilities; Group is assumed coded 0/1)
lasso_pred <- predict(fit1, s = best_lambda, newx = x[test_index, ], type = "response")
mse <- mean((lasso_pred - df[test_index, ]$Group)^2)
cat("Test MSE:", mse, "\n")

# Coefficient path on the full data (family = "binomial", not "cox": Group is binary)
fit <- glmnet(x, df$Group, family = "binomial", maxit = 1000)
plot(fit)

# Re-run glmnet on the training set with the same lambda sequence as cv.glmnet
final_model <- glmnet(x[train_index, ], df[train_index, ]$Group,
                      family = "binomial", lambda = fit1$lambda, alpha = 1)
plot(final_model,label = T)
plot(final_model, xvar = "lambda", label = TRUE)
plot(final_model, xvar = "dev", label = TRUE)
Feature selection
We found 44 potential features, including demographics and clinical and laboratory variables (Table 1). We performed feature selection using the least absolute shrinkage and selection operator (LASSO), which is among the most widely used feature selection techniques. LASSO constructs a penalty function that compresses some of the regression coefficients, i.e., it forces the sum of the absolute values of the coefficients to be less than some fixed value while setting some regression coefficients to zero, thus obtaining a more refined model. LASSO retains the advantage of subset shrinkage as a biased estimator that deals with data with complex covariance. This algorithm uses LassoCV, a fivefold cross-validation approach, to automatically eliminate factors with zero coefficients (Python version: sklearn 0.22.1).
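In symbols, the constraint this excerpt describes is $\sum_j |\beta_j| \le t$, equivalently the Lagrangian form that glmnet solves for the Gaussian case:

$$\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - x_i^\top \beta \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$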
2.2.2. Feature Selection.
Feature selection was performed using least absolute shrinkage and selection operator (LASSO) regression. The LASSO regression model improves prediction performance by adjusting the hyperparameter λ to compress regression coefficients toward zero and selecting the feature set that performs best in DN prediction. To determine the best λ, the value with the minimum mean cross-validated error under 10-fold cross-validation was selected.
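In glmnet terms, the selection rule described here corresponds to the following sketch (x and y stand in for the study's design matrix and binary DN outcome, which are not part of this document's data):

# 10-fold cross-validation; lambda.min is the λ with the minimum mean CV error
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10, family = "binomial")
lambda_best <- cv_fit$lambda.min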
Detailed steps were as follows: (1) Screening characteristic factors: First, R software (glmnet 4.1.2) was used to conduct the least absolute shrinkage and selection operator (LASSO) regression analysis and to adjust variable screening and model complexity. Then, the LASSO regression results were carried into a multifactor logistic regression analysis in SPSS, and the characteristic factors with p < 0.05 were retained. (2) Data division: A Python (0.22.1) random-number method was used to randomly divide the gout patients into a training set and a test set at a ratio of 7:3, with 491 in the training set and 211 in the test set. (3) Classified multi-model comprehensive analysis: eXtreme Gradient Boosting (XGBoost)