Venn Diagram Comparison of Boruta, FSelectorRcpp and GLMnet Algorithms

Feature selection is the process of extracting valuable features that have a significant influence on the dependent variable. This is still an active field of research and machine wandering. In this post I compare a few feature selection algorithms: traditional GLM with regularization, the computationally demanding Boruta, and the entropy-based filter from the FSelectorRcpp package (free of Java/Weka). Check out the comparison on a Venn diagram, carried out on data from the RTCGA factory of R data packages.

I would like to thank Magda Sobiczewska and pbiecek for the inspiration for this comparison. I had a chance to use Boruta and FSelectorRcpp in action. GLMnet is here only to improve the Venn diagram.

RTCGA data

Data used for this comparison come from RTCGA (http://rtcga.github.io/RTCGA/) and present gene expressions (RNASeq) from the sequenced human genome. Datasets with RNASeq are available via the RTCGA.rnaseq data package and were originally provided by The Cancer Genome Atlas. It’s a great set of over 20 thousand features (1 gene expression = 1 continuous feature) that might influence various aspects of human survival. Let’s use the data for breast cancer (Breast invasive carcinoma / BRCA), where we will try to find valuable genes that have an impact on the dependent variable denoting whether a sample of the collected readings came from tumor or normal, healthy tissue.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA.rnaseq")
library(RTCGA.rnaseq)
BRCA.rnaseq$bcr_patient_barcode <-
   substr(BRCA.rnaseq$bcr_patient_barcode, 14, 14)

The dependent variable, bcr_patient_barcode, is the TCGA barcode, which tells us whether a sample of the collected readings came from tumor or normal, healthy tissue (the 14th character of the code).
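To make the 14th-character rule concrete, here is a minimal sketch in base R. The two barcodes are made up for illustration and are not taken from BRCA.rnaseq. The mapping assumes the TCGA convention that the two-digit sample-type code starts at position 14, with codes 01-09 denoting tumor and 10-19 denoting normal samples, so the first digit alone separates the two classes.

```r
# Hypothetical TCGA-style barcodes, used only for illustration
barcodes <- c("TCGA-A1-A0SB-01A-11R-A144-07",
              "TCGA-BH-A0B3-11A-23R-A089-07")

# The 14th character is the first digit of the two-digit sample-type code
sample_flag <- substr(barcodes, 14, 14)

# Assumption: "0" marks tumor samples (codes 01-09),
# "1" marks normal samples (codes 10-19)
tissue <- factor(sample_flag,
                 levels = c("0", "1"),
                 labels = c("tumor", "normal"))

# Class counts; worth checking on real data before fitting any classifier
table(tissue)
```

On the real data the same table() call on the recoded column is a quick way to see how unbalanced the two classes are.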

Check another RTCGA use case: TCGA and The Curse of BigData.

GLMnet

Logistic regression, a model from the generalized linear models (GLM) family and a common first-attempt model for class prediction, can be extended with elastic-net regularization to provide prediction and variable selection at the same time. We can assume that features of no value will have coefficients equal to zero in the final model fitted with the best regularization parameter. A broader explanation can be found in the vignette of the glmnet package. Below is the code I use to extract valuable features with the extra help of cross-validation and parallel computing.

library(doMC)
registerDoMC(cores=6)
library(glmnet)
# fit the model
cv.glmnet(x = as.matrix(BRCA.rnaseq[, -1]),
          y = factor(BRCA.rnaseq[, 1]),
          family = "binomial",
          type.measure = "class",
          parallel = TRUE) -> cvfit
# extract names of features that have
# non-zero coefficients
names(which(
   coef(cvfit, s = "lambda.min")[, 1] != 0)
   )[-1] -> glmnet.features
# first name is intercept

The coef function extracts coefficients from the fitted model. The argument s specifies the regularization parameter for which we would like to extract them: lambda.min is the value for which the cross-validated misclassification error is minimal. You may also try lambda.1se.

plot(cvfit)
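The difference between lambda.min and lambda.1se is easier to see on a small example. The sketch below uses simulated data, not the BRCA set: the matrix x, the gene names and the helper selected() are hypothetical and exist only for this illustration. It fits the same kind of cross-validated model and extracts the selected features for both values of s.

```r
library(glmnet)

set.seed(1)
n <- 200
p <- 50

# Simulated "expression" matrix with hypothetical gene names
x <- matrix(rnorm(n * p), nrow = n, ncol = p,
            dimnames = list(NULL, paste0("gene", seq_len(p))))

# Assumption for the simulation: only the first three columns drive the class
y <- factor(rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3])))

cvfit_sim <- cv.glmnet(x, y, family = "binomial", type.measure = "class")

# Hypothetical helper: names of features with non-zero coefficients,
# with the intercept dropped by name rather than by position
selected <- function(fit, s) {
  beta <- coef(fit, s = s)[, 1]
  setdiff(names(beta)[beta != 0], "(Intercept)")
}

features_min <- selected(cvfit_sim, "lambda.min")
features_1se <- selected(cvfit_sim, "lambda.1se")

# lambda.1se is never smaller than lambda.min, so it penalizes at least as hard
c(lambda.min = cvfit_sim$lambda.min, lambda.1se = cvfit_sim$lambda.1se)
length(features_min)
length(features_1se)
```

Since lambda.1se applies a stronger penalty, it usually keeps a shorter list of features, which may be preferable when the goal is a compact set for comparison with other selection methods.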

A discussion about standardization for LASSO can be found here. I normally don’t do this, since I work with streaming data, for which checking assumptions, model diagnostics and standardization are problematic and still a rapidly developing field of research.
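For reference, glmnet exposes a standardize argument that defaults to TRUE, so the predictors are scaled internally before the penalty is applied and the coefficients are returned on the original scale. The sketch below is only an illustration on simulated data with two predictors of very different scales (the column names small and large are made up); it shows how to switch the behaviour off and compare the two regularization paths.

```r
library(glmnet)

set.seed(2)
n <- 200

# Two simulated predictors on very different scales (hypothetical names)
x <- cbind(small = rnorm(n, sd = 0.1),
           large = rnorm(n, sd = 10))

# Both predictors contribute equally once their scale is accounted for
y <- factor(rbinom(n, 1, plogis(10 * x[, "small"] + 0.1 * x[, "large"])))

# Default behaviour: predictors are standardized before penalization
fit_std <- glmnet(x, y, family = "binomial", standardize = TRUE)

# Penalize the raw columns instead
fit_raw <- glmnet(x, y, family = "binomial", standardize = FALSE)

# Number of non-zero coefficients at each step of the two paths
fit_std$df
fit_raw$df
```

Without standardization the penalty acts on raw coefficient sizes, so a small-scale predictor needs a large coefficient and tends to be shrunk harder; whether that matters depends on how comparable the feature scales are.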

Reposted from: http://r-addict.com/2016/06/19/Venn-Diagram-RTCGA-Feature-Selection.html

Time: 2024-08-25 13:40:12
