GeneSrF is a web-based tool for gene selection in classification problems that uses random forests. Two approaches to gene selection are provided: the first is targeted towards identifying small, non-redundant sets of genes with good predictive performance. The second is a more heuristic graphical approach that can be used to identify large sets of genes (including redundant genes) related to the outcome of interest. The first approach is described in detail in this technical report. The R code is available as an R package from CRAN or from this link.
Briefly, random forest is a classification algorithm developed by Leo Breiman (Breiman, 2001, 2003) that uses an ensemble of classification trees. Each tree is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging (bootstrap aggregation) and random variable selection for tree building. Each tree is unpruned (grown fully), so as to obtain low-bias trees; at the same time, bagging and random variable selection result in low correlation between the individual trees. The algorithm therefore yields an ensemble that can achieve both low bias and low variance (from averaging over a large ensemble of low-bias, high-variance, but weakly correlated trees). Random forest returns a prediction as the unweighted majority vote of the predictions from this ensemble of classification trees.
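The tree-ensemble scheme just described can be sketched outside R. The following is a minimal, illustrative Python/scikit-learn version; GeneSrF itself uses the R randomForest package, and the toy data and parameter choices below are assumptions for illustration only.

```python
# Illustrative sketch of the random forest scheme described above,
# using scikit-learn instead of the R randomForest package that
# GeneSrF actually uses. The toy data set is an assumption.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Toy expression matrix: 40 samples x 200 "genes"; class depends on gene 0.
X = rng.normal(size=(40, 200))
y = (X[:, 0] > 0).astype(int)

# Each tree is fully grown (no pruning) on a bootstrap sample, and each
# split considers a random subset of about sqrt(p) candidate variables.
rf = RandomForestClassifier(
    n_estimators=2000,      # number of trees in the ensemble
    max_features="sqrt",    # random variable selection at each split
    bootstrap=True,         # bagging: one bootstrap sample per tree
    oob_score=True,         # "out-of-bag" error estimate
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
```

The out-of-bag accuracy comes for free from bagging: each tree is evaluated on the samples it did not see during training, so no separate test set is needed for this internal estimate.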
Random forest has several characteristics that make it ideal for gene expression data:
This approach is described in more detail in the technical report. Essentially, we progressively eliminate the least important genes (where importance is based on the measures of variable importance returned by random forest itself) until no further improvement in the "out-of-bag" error rate can be obtained. We show in the above report that this approach leads to very small sets of genes with good predictive accuracy.
We use the set of parameters that were found to work well in Díaz-Uriarte and Alvarez de Andres (2005): 2000 trees, mtryFactor = 1, se = 1.
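The backward-elimination idea can be sketched as follows. This is a hedged, simplified Python/scikit-learn illustration; the real implementation is the R varSelRF package, and the toy data, drop fraction, and tree counts here are assumptions. In particular, this sketch re-ranks genes at every step and simply picks the size with the lowest OOB error, whereas varSelRF also applies the "se = 1" rule (choose the smallest set within one standard error of the minimum), which we skip for brevity.

```python
# Hedged sketch of backward gene elimination driven by the OOB error,
# in the spirit of varSelRF (the actual tool is the R varSelRF package).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))               # toy data: 40 samples, 200 genes
y = (X[:, 0] + X[:, 1] > 0).astype(int)

kept = np.arange(X.shape[1])                 # indices of genes still in the model
history = []                                 # (number of genes, OOB error) per step

while len(kept) >= 2:
    rf = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=0).fit(X[:, kept], y)
    history.append((len(kept), 1 - rf.oob_score_))
    # Drop the 20% of remaining genes with the lowest importance.
    # (For brevity we re-rank at each step; varSelRF by default ranks
    # once on the full forest.)
    order = np.argsort(rf.feature_importances_)
    n_drop = max(1, int(0.2 * len(kept)))
    kept = kept[order[n_drop:]]

best_n, best_err = min(history, key=lambda t: t[1])
print("best size:", best_n, "OOB error:", round(best_err, 3))
```

Because the OOB error is reused many times to pick the final set, it is optimistically biased, which is why the tool reports bootstrap-based error estimates as well.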
There are other approaches with similar objectives, such as those implemented in our tool Tnasas: with Tnasas we also try to minimize prediction error but, in most cases, the criteria used for ranking genes are univariate. In addition, the methods implemented in Tnasas are not targeted towards identifying potentially large sets of redundant and relevant genes (see next section).
The main objective here is to identify relevant genes for subsequent research; this involves obtaining a (probably large) set of genes that are related to the outcome of interest, and this set should include genes even if they perform similar functions and are highly correlated. Of course, there are many other alternatives; for instance, many gene-by-gene approaches, such as those implemented in Pomelo. However, the emphasis of the current tool is gene selection in the context of classification problems, and the use of variable importance from random forest allows us to account for interactions and additive relationships between genes (unlike gene-by-gene testing with, say, a t-test).
Our main approach here is to plot ordered variable importances, yielding plots that resemble the ``scree plots'' or ``scree graphs'' common in principal component analysis. This idea is similar to the ``importance spectrum'' plots in Friedman and Meulman. We want to compare the variable importance plot from the original data with variable importance plots generated when the class labels and the predictors are independent. Therefore we permute only the class labels, leaving intact the correlation structure of the predictors (and, of course, using the same random forest parameters). In this application, the number of random permutations is 50.
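The permutation scheme can be sketched as follows. This is an illustrative Python/scikit-learn version with made-up toy data and only 10 permutations (the tool uses 50, and computes importances with the R randomForest package).

```python
# Sketch of the null "importance spectrum" idea: compare the ordered
# importances from the real labels with spectra obtained after permuting
# ONLY the class labels, so the predictor correlation structure is kept.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 100))               # toy data: 40 samples, 100 genes
y = (X[:, 0] > 0).astype(int)

def importance_spectrum(X, y, seed=0):
    rf = RandomForestClassifier(n_estimators=300, random_state=seed).fit(X, y)
    return np.sort(rf.feature_importances_)[::-1]   # descending order

observed = importance_spectrum(X, y)

# Null spectra: same X, permuted labels (10 permutations for brevity).
null_top = [importance_spectrum(X, rng.permutation(y), seed=s)[0]
            for s in range(10)]

print("top observed importance:", observed[0])
print("max null top importance:", max(null_top))
```

Genes whose observed importance rises clearly above the envelope of the permutation spectra are the ones the plots flag as related to the outcome.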
Unpublished research with simulated data indicates that these plots can be used to identify even large and highly correlated sets of genes (i.e., the procedure can recover "important" genes even in situations of high collinearity).
In addition to returning lists of relevant genes, our tool also returns:
The file with the expression data (the covariates); generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.
The file for the covariates should be formatted as:
#Name	ge1	ge2	ge1	ge1	ge2
gene1	23.4	45.6	44	76	85.6
genW@	3	34	23	56	13
geneX#	23	25.6	29.4	13.2	1.98
These are the class labels (e.g., healthy or affected, or different types of cancer) that group the samples. Our predictor will try to predict these class labels.
Please note that we do not allow any data set with three or fewer cases in any class. Why? Because, on the one hand, any results from such data would be hard to believe; on the other hand, some cross-validation training samples could end up with zero elements from one of the classes.
Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever.
This is a simple example of a class labels file:
CL1 CL2 CL1 CL4 CL2 CL2 CL1 CL4
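To make the two formats above concrete, here is a minimal parsing sketch in Python. This is purely illustrative: the server-side parsing is GeneSrF's own (done in R/Python CGI) and may differ, and the tiny in-memory files below are assumptions modeled on the examples above.

```python
# Minimal sketch of parsing the two tab-separated input formats:
# covariates (rows = genes, columns = arrays) and a one-line labels file.
import io

covariates = io.StringIO(
    "#Name\tge1\tge2\tge1\tge1\tge2\n"
    "gene1\t23.4\t45.6\t44\t76\t85.6\n"
)
labels = io.StringIO("CL1\tCL2\tCL1\tCL4\tCL2\n")

header = covariates.readline().rstrip("\n").split("\t")[1:]   # array names
rows = {}
for line in covariates:
    fields = line.rstrip("\n").split("\t")
    rows[fields[0]] = [float(v) for v in fields[1:]]          # rows are genes

class_labels = labels.readline().rstrip("\n").split("\t")
print(len(header), "arrays;", len(rows), "genes;", len(class_labels), "labels")
```

Note that the number of class labels must match the number of array columns, and no missing values are allowed in either file.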
If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the gene names in the output. This information is returned from IDClight based on that provided by our IDConverter tool.
The out-of-bag (OOB) error rate vs. number of genes (number of variables) used by random forest. A thick red line shows the curve for the original data, and dotted black lines show the curves from the bootstrap samples. Please note that these OOB errors are biased downwards and should not be used as honest estimates of the error rate.
For each sample, its (average) out-of-bag class prediction, i.e., the out-of-bag prediction for each sample obtained when that sample was not used at all in the training process. We provide one such plot for each class: for each sample we plot the (average) posterior probability of that sample belonging to each class. Obviously, when there are only two classes, one plot is simply one minus the other. You would have excellent results if each sample has an out-of-bag (average) posterior probability very close to 1 for its true class and very close to zero for all other classes.
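The per-sample quantity plotted here can be sketched as follows. This is an illustrative Python/scikit-learn version (scikit-learn exposes the averaged out-of-bag votes as `oob_decision_function_`); GeneSrF computes the equivalent quantity with the R randomForest package, and the toy data are assumptions.

```python
# Sketch of per-sample out-of-bag class probabilities: row i holds the
# fraction of votes for each class among trees that did NOT see sample i.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 50))                # toy data: 40 samples, 50 genes
y = (X[:, 0] > 0).astype(int)

rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)

oob_prob = rf.oob_decision_function_         # shape (n_samples, n_classes)
print("sample 0 OOB P(class 0), P(class 1):", oob_prob[0])
# With two classes the columns sum to 1, so one plot is one minus the other.
```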
Plots of the variable importances of the genes from the original data compared to variable importances from data sets with the same gene expression data but randomly permuted class labels. For greater detail, we show the plot for all genes as well as for only the top 200 and top 30 genes.
We plot, for each of the top-ranked genes from the original sample, the probability that it is included among the top k genes (where in these figures k = 20, 100) from the bootstrap samples. These plots give a measure of our confidence in selecting gene g among the top k genes.
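The bootstrap selection-frequency computation behind these plots can be sketched as follows. This is a hedged Python/scikit-learn illustration with toy data, only 10 bootstrap resamples, and k = 20; the tool itself does this in R with its own bootstrap settings.

```python
# Sketch of top-k selection frequency: for each gene, the fraction of
# bootstrap resamples in which it lands among the k most important genes.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(4)
n, p, k = 40, 100, 20
X = rng.normal(size=(n, p))                  # toy data: 40 samples, 100 genes
y = (X[:, 0] > 0).astype(int)

def top_k(X, y, seed=0):
    rf = RandomForestClassifier(n_estimators=200, random_state=seed).fit(X, y)
    return set(np.argsort(rf.feature_importances_)[::-1][:k])

orig_top = top_k(X, y)                       # top-k genes in the original data

B = 10                                       # bootstrap resamples (tool uses more)
counts = np.zeros(p)
for b in range(B):
    idx = rng.integers(0, n, size=n)         # bootstrap resample of the samples
    for g in top_k(X[idx], y[idx], seed=b):
        counts[g] += 1

freq = counts / B                            # per-gene inclusion frequency
print("gene 0 in original top-k:", 0 in orig_top)
print("gene 0 bootstrap inclusion frequency:", freq[0])
```

A gene that is truly related to the outcome should keep appearing in the bootstrap top-k lists, so a high frequency is evidence that its high rank is not an artifact of the particular sample.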
The genes selected (variables used) when running the gene selection procedure, for finding small sets of genes, on the complete, original data.
The results from the bootstrap run. In particular:
You can download a single compressed file with all figures, both in the png format shown in the web results and in pdf format (which gives better quality for printing), together with the results (a single text file). The format is tar.gz, understood by GNU's tar, standard in GNU/Linux distributions, and also understood by the common uncompressors/unzippers in other operating systems.
It is now possible to send the results to PaLS. PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it may be easier to make biological sense of them, because you are "annotating" your results with additional biological information.
Scroll to the bottom of the main output, where you will find the PaLS icon and two lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep working with the very same gene list, modifying the options.
It probably rarely, if ever, makes sense to send to PaLS the list of genes selected in the main run. However, sending the second list can provide you with valuable information about the biological characteristics of the genes that tend to be selected in most of the bootstrap runs.
For a data set such as the leukemia data set of Golub et al (1999, Science, 286: 531-537), with 38 subjects and 3051 genes it takes less than 10 minutes when the servers are lightly loaded, and it takes about 1 hour and 15 minutes with the Prostate data set of Singh et al. (2002, Cancer Cell, 1: 203--209), with 102 subjects and 6033 genes. Of course, your mileage will vary.
Examples of several runs, one with fully commented results, are available here.
This program was developed by Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool uses Python for the CGI and R for the computations. The basic skeleton of the R code is the randomForest package, an R port by Andy Liaw and Matthew Wiener of the original Fortran code by Leo Breiman and Adele Cutler. The functions for gene selection are part of the varSelRF package. The R code also uses the following R packages: CGIwithR, by David Firth; snow, by Luke Tierney, A. J. Rossini, Na Li and H. Sevcikova; Rmpi, by Hao Yu; and rsprng, by Na (Michael) Li.
This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, Linux Virtual Server for web server load-balancing, and LAM/MPI for parallelization.
We want to thank the authors and contributors of these great (and open source) tools, which they have made available for all to use. If you find this tool useful, and since R and Bioconductor are developed by teams of volunteers, we suggest you consider making a donation to the R Foundation for Statistical Computing.
Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.
Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the directory names are not trivial to guess, so it is not easy for a third party to access your data.
In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, so it is also possible for somebody else to look at your data while you are uploading or downloading them.
This software is experimental in nature and is supplied "AS IS", without
obligation by the authors or the CNIO to provide accompanying services or
support. The entire risk as to the quality and performance of the software is
with you. The authors expressly disclaim any and all warranties regarding the
software, whether express or implied, including but not limited to warranties
pertaining to merchantability or fitness for a particular purpose.
Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99: 6562--6566.
Breiman L (2001) Random forests. Machine Learning 45: 5--32 (Tech. report).
Breiman L (2003) Manual--Setting Up, Using, And Understanding Random Forests V4.0.
Diaz-Uriarte, R (2007) GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 2007, 8:328.
Díaz-Uriarte R, Alvarez de Andrés, S (2005) Gene selection and classification of microarray data using random forest. In review. (tech. report.)
Efron B, Tibshirani RJ (1997). Improvements on cross-validation: the .632+ bootstrap method. J. American Statistical Association, 92: 548-560.
Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes (with discussion). J. Royal Statistical Society, Ser. B, 66: 815--850.
Liaw A, Wiener M (2002) Classification and regression by randomForest. R News, 2: 18--22.
Simon R, Radmacher MD, Dobbin K, McShane LM (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95: 14--18.