This document is licensed under a Creative Commons License.

GeneSrF

Introduction and purpose
Usage
Author and Acknowledgements
Terms of use
Privacy and Security
Disclaimer
References

Introduction and purpose

GeneSrF is a web-based tool for gene selection in classification problems that uses random forests. Two approaches for gene selection are used. The first is targeted towards identifying small, non-redundant sets of genes that have good predictive performance. The second is a more heuristic graphical approach that can be used to identify large sets of genes (including redundant genes) related to the outcome of interest. The first approach is described in detail in this technical report. The R code is available as an R package from CRAN or from this link.

Briefly, random forest is an algorithm for classification developed by Leo Breiman (Breiman, 2001, 2003) that uses an ensemble of classification trees. Each of the classification trees is built using a bootstrap sample of the data, and at each split the candidate set of variables is a random subset of the variables. Thus, random forest uses both bagging (bootstrap aggregation) and random variable selection for tree building. Each tree is unpruned (grown fully), so as to obtain low-bias trees; at the same time, bagging and random variable selection result in low correlation among the individual trees. The algorithm yields an ensemble that can achieve both low bias and low variance (from averaging over a large ensemble of low-bias, high-variance but low-correlation trees). Random forest returns as its prediction the unweighted majority of the predictions from this ensemble of classification trees.
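The three ingredients just described (a bootstrap sample per tree, a random subset of candidate variables per split, and an unweighted majority vote) can be sketched in a few lines of Python. This is only an illustration of the ideas, not the code used by GeneSrF; the function names are hypothetical, tree building itself is omitted, and the value mtry = 10 (the square root of 100 variables) is an assumption chosen for the example.

```python
import random
from collections import Counter

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag

def candidate_variables(n_variables, mtry, rng):
    """Random subset of variables that one split is allowed to consider."""
    return rng.sample(range(n_variables), mtry)

def majority_vote(tree_predictions):
    """Unweighted majority over the class predicted by each tree."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
in_bag, out_of_bag = bootstrap_indices(10, rng)
split_candidates = candidate_variables(n_variables=100, mtry=10, rng=rng)
print(majority_vote(["ALL", "AML", "ALL"]))  # ALL
```

The samples left out of a tree's bootstrap sample are the "out-of-bag" samples for that tree; they are what the out-of-bag error rate mentioned later in this help is computed from.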

Random forest has several characteristics that make it ideal for gene expression data:

Can be used when there are many more variables than observations.
Can be used for both two-class and multi-class problems.
Has good predictive performance even when most predictor variables are noise, and therefore it does not require a pre-selection of genes.
Does not overfit.
Can handle a mixture of categorical and continuous predictors.
Incorporates interactions among predictor variables.
The output is invariant to monotone transformations of the predictors.
There are high quality and free implementations: the original Fortran code from L. Breiman and A. Cutler, and an R package from A. Liaw and M. Wiener.
Returns measures of variable (gene) importance.
There is little need to fine-tune parameters to achieve excellent performance.

Approach 1: Finding small sets of genes

This approach is described in more detail in the technical report. Essentially, we progressively eliminate the least important genes (where importance is based on the measures of variable importance returned by random forest itself) until we can obtain no further improvements in the "out-of-bag" error rate. We show in the above report that this approach leads to very small sets of genes while retaining good predictive accuracy.

We use the set of parameters that were found to work well in Díaz-Uriarte and Alvarez de Andres (2005): 2000 trees, mtryFactor = 1, se = 1.
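As a rough illustration of how a setting such as se = 1 can be used to choose among the nested gene sets produced by the elimination, the sketch below picks the smallest set whose out-of-bag error is within se standard errors of the lowest error observed. This is an assumption about the rule, written for illustration only; the function name, the binomial standard error, and the example error rates are all hypothetical and are not taken from the GeneSrF code or from real data.

```python
import math

def pick_solution(oob_error_by_size, n_samples, se_factor=1.0):
    """Smallest gene-set size whose OOB error is within
    se_factor standard errors of the minimum OOB error."""
    best = min(oob_error_by_size.values())
    # Binomial standard error of an error rate estimated from n_samples cases.
    standard_error = math.sqrt(best * (1 - best) / n_samples)
    threshold = best + se_factor * standard_error
    return min(size for size, err in oob_error_by_size.items() if err <= threshold)

# Hypothetical OOB error rates for a sequence of shrinking gene sets.
oob_errors = {100: 0.10, 80: 0.08, 64: 0.08, 51: 0.05, 41: 0.06, 33: 0.07, 26: 0.12}
print(pick_solution(oob_errors, n_samples=100, se_factor=1.0))  # 33
```

With se_factor = 0 the rule reduces to taking the smallest set that attains the minimum error; larger values favour smaller sets of genes at the cost of a slightly higher out-of-bag error.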

There are other approaches with similar objectives, such as those implemented in our tool Tnasas: with Tnasas we try to minimize prediction error but, in most cases, the criteria used for ranking genes are univariate criteria. In addition, the methods implemented in Tnasas are not targeted towards identifying potentially large sets of redundant and relevant genes (see next).

Approach 2: Finding potentially large sets of genes

The main objective here is to identify relevant genes for subsequent research; this involves obtaining a (probably large) set of genes that are related to the outcome of interest, and this set should include genes even if they perform similar functions and are highly correlated. Of course, there are many other alternatives; for instance, many gene-by-gene approaches, such as those implemented in Pomelo. However, the emphasis of the current tool is gene selection in the context of classification problems, and the use of variable importance from random forest allows us to consider interactions and additive relationships between genes (in contrast to gene-by-gene testing with, say, a t-test).

Our main approach here is to plot ordered variable importances, yielding plots that resemble the "scree plots" or "scree graphs" common in principal component analysis. This idea is similar to the "importance spectrum" plots in Friedman and Meulman (2004). We want to compare the variable importance plot from the original data with variable importance plots that are generated when the class labels and the predictors are independent. Therefore we permute only the class labels, leaving intact the correlation structure of the predictors (and of course using the same parameters for random forest). In this application, the number of random permutations is 50.
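The permutation step can be sketched as follows. Only the vector of class labels is shuffled; the expression matrix is never touched, so the correlations among genes are preserved. The function names are hypothetical, and fitting a random forest to each permuted data set to obtain its importances is omitted here.

```python
import random

def permuted_label_sets(class_labels, n_permutations=50, seed=0):
    """Return n_permutations shuffled copies of the class labels."""
    rng = random.Random(seed)
    permuted = []
    for _ in range(n_permutations):
        shuffled = list(class_labels)  # copy, so the original order is kept
        rng.shuffle(shuffled)
        permuted.append(shuffled)
    return permuted

def ordered_importances(importances):
    """Importances sorted from largest to smallest, as drawn in the plot."""
    return sorted(importances, reverse=True)

labels = ["CL1", "CL2", "CL1", "CL2", "CL2", "CL1", "CL1", "CL2"]
null_labels = permuted_label_sets(labels)
print(len(null_labels))  # 50
```

Each permuted label vector would be paired with the unchanged expression data, and the ordered importances from those 50 forests would be drawn alongside the ordered importances from the original data.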

Unpublished research with simulated data indicates that these types of plots can be used to identify even large and highly correlated sets of genes (i.e., that the procedure can recover "important genes" even in situations of high collinearity).

Estimates of error rate and stability

In addition to returning lists of relevant genes, our tool also returns:

Assessment of the error rate of the gene selection method targeted towards selecting very small sets of genes. We apply the .632+ bootstrap method (Efron and Tibshirani, 1997) to the complete gene selection procedure, and thus we avoid "selection bias" and related problems (Ambroise and McLachlan, 2002; Simon et al., 2003). We do not perform such a process with the second approach (targeted towards identifying potentially large sets of redundant genes) because this latter approach is not really trying to build a prediction rule, but rather trying to uncover "interesting genes"; moreover, exactly which and how many genes to select is left to the judgment of the user (based on the plots and the number of genes one wants to select). The number of bootstrap replicates we use is 200.
Stability assessments. We provide two such assessments:
- How often the selected genes (from Approach 1) are also selected in bootstrap runs (with Approach 1).
- Selection probability plots, where we plot, for each of the top ranked genes from the original sample, the probability that it is included among the top ranked k genes from the (200) bootstrap samples. These plots can be a measure of our confidence in choosing the g gene among the top k genes.
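For readers unfamiliar with the .632+ estimator, the following sketch shows how it combines the resubstitution error with the error on samples left out of the bootstrap samples, following the formulas in Efron and Tibshirani (1997). It is a minimal illustration under stated assumptions, not the code run by GeneSrF: the function names are hypothetical, and the three input error rates in the example are made up.

```python
def no_information_rate(true_labels, predicted_labels):
    """Error expected if predictions were independent of the true classes."""
    n = len(true_labels)
    gamma = 0.0
    for cls in set(true_labels) | set(predicted_labels):
        p = true_labels.count(cls) / n        # observed proportion of cls
        q = predicted_labels.count(cls) / n   # predicted proportion of cls
        gamma += p * (1 - q)
    return gamma

def err_632_plus(resub_error, oob_bootstrap_error, gamma):
    """.632+ estimate from resubstitution error, leave-one-out
    bootstrap error and the no-information error rate gamma."""
    err1 = min(oob_bootstrap_error, gamma)
    if err1 > resub_error and gamma > resub_error:
        overfit = (err1 - resub_error) / (gamma - resub_error)
    else:
        overfit = 0.0
    weight = 0.632 / (1 - 0.368 * overfit)
    return (1 - weight) * resub_error + weight * err1

print(round(err_632_plus(0.0, 0.2, 0.5), 3))  # 0.148
```

When there is no sign of overfitting the weight stays at 0.632 (the plain .632 bootstrap); as the left-out error approaches the no-information rate, the weight moves towards 1 and the estimate relies entirely on the left-out samples.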

Usage

Input

Expression data file

The file with the expression data (the covariates); generally the gene expression data. In this file, rows represent variables (generally genes), and columns represent subjects, or samples, or arrays.

The file for the covariates should be formatted as follows:

Data should conform to the "genes in rows, patients (or arrays) in columns" layout. In other words, each row of the data file is supposed to represent a different gene or variable.
Use tab (\t) as the field separator within rows.
Use a newline (\n) between rows. It is also convenient to end the file with a single newline (\n).
Array names: if you want to name your arrays (useful for the output of the analyses) do as follows:
1. Place a line that starts with "#";
2. After the "#" put "Name" or "NAME" or "name" (don't say we don't give you choices);
3. Write the array names (separated by tabs).
The first column is assumed to contain the ID information for genes, marker, or whatever. This will be used to label the output (but it also means that whatever is in the first column is not used in the analyses).
You can have an arbitrary number of rows with comments. These rows must always start with a "#".
Gene names and array names MUST be unique. If they are not, the program will let you know. If you do not want to provide array names, that is OK, and we will name them with sequential integers starting at 1. If you do not want to provide gene names, then put some arbitrary labels on the first column (e.g., fill it with a sequence of integers). Why do we need gene and array names to be unique? Because in many steps, we need to provide either where we classify a given array (and what should we do if two or more arrays are named "A"?), or the genes used in the classifier (and what should we do if two or more genes are named "gene B"?).
Missing values are NOT allowed. You can use the preprocessor and do several things with your data before sending it to GeneSrF. We would probably recommend that you do imputation after eliminating genes with too many (more than, say, 20%?) missing values. In any case, how best to deal with missing values is not a trivial issue and is outside the scope of this help file.

This is a small covariate data file:

 
#Name     ge1      ge2      ge3      ge4      ge5
gene1   23.4    45.6    44      76      85.6 
genW@   3      34      23      56      13 
geneX#  23      25.6    29.4    13.2    1.98
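Before uploading, it can be useful to check a file against the rules above. The following Python sketch does so for a file read into a string; it is not the validator used by the server, the function name is hypothetical, and the small data set in the example is made up. A value that cannot be read as a number (including an empty field or "NA") is treated as missing.

```python
import math

def parse_expression_file(text):
    """Return (gene_ids, array_names, matrix); raise ValueError if invalid."""
    array_names, gene_ids, matrix = None, [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("\t")]
        if fields[0].startswith("#"):
            if fields[0][1:] in ("Name", "NAME", "name"):
                array_names = fields[1:]
            continue  # any other row starting with "#" is a comment
        # float() raises ValueError for "" or "NA", i.e. for missing values.
        values = [float(value) for value in fields[1:]]
        if any(math.isnan(value) for value in values):
            raise ValueError("missing values are not allowed")
        gene_ids.append(fields[0])  # first column is the gene ID
        matrix.append(values)
    if not matrix:
        raise ValueError("no data rows found")
    n_arrays = len(matrix[0])
    if any(len(row) != n_arrays for row in matrix):
        raise ValueError("all rows must have the same number of columns")
    if array_names is None:
        # Default array names: sequential integers starting at 1.
        array_names = [str(i) for i in range(1, n_arrays + 1)]
    if len(array_names) != n_arrays:
        raise ValueError("number of array names does not match the data")
    for kind, names in (("gene", gene_ids), ("array", array_names)):
        if len(set(names)) != len(names):
            raise ValueError(kind + " names must be unique")
    return gene_ids, array_names, matrix

example = "#Name\ta1\ta2\ta3\n# a comment\ngene1\t23.4\t45.6\t44\ngeneX\t3\t34\t23\n"
genes, arrays, data = parse_expression_file(example)
print(genes, arrays)  # ['gene1', 'geneX'] ['a1', 'a2', 'a3']
```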

Class file

These are the class labels (e.g., healthy or affected, or different types of cancer) that group the samples. Our predictor will try to predict these class labels.

Please note that we do not allow any data set with 3 or fewer cases in any class. Why? Because, on the one hand, any results from such data would be hard to believe; on the other hand, that would result in some cross-validation samples having some training samples with 0 elements from one of the classes.

Separate values by tab (\t), and finish the file with a carriage return or newline. No missing values are allowed here. Class labels can be anything you wish; they can be integers, they can be words, whatever.

This is a simple example of a class labels file:

  
CL1     CL2     CL1     CL4     CL2     CL2     CL1     CL4
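The rules for the class file can be checked in a similar way. The sketch below reads tab-separated labels and rejects missing labels and classes with 3 or fewer cases. It is an illustrative check, not the server's code; the function name and the labels used in the example are hypothetical.

```python
from collections import Counter

def parse_class_file(text, n_arrays=None):
    """Return (labels, counts); raise ValueError if the rules are violated."""
    labels = [
        field.strip()
        for line in text.splitlines()
        if line.strip()
        for field in line.split("\t")
    ]
    if not labels or any(label == "" for label in labels):
        raise ValueError("missing class labels are not allowed")
    if n_arrays is not None and len(labels) != n_arrays:
        raise ValueError("one class label is needed per array")
    counts = Counter(labels)
    too_small = sorted(cls for cls, n in counts.items() if n <= 3)
    if too_small:
        raise ValueError("classes with 3 or fewer cases: " + ", ".join(too_small))
    return labels, counts

class_text = "healthy\taffected\thealthy\taffected\thealthy\taffected\thealthy\taffected\n"
labels, counts = parse_class_file(class_text, n_arrays=8)
print(counts["healthy"], counts["affected"])  # 4 4
```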

Type of gene identifier and species

If you use any of the currently standard identifiers for your gene IDs for either human, mouse, or rat genomes, you can obtain additional information by clicking on the gene names in the output. This information is returned from IDClight based on that provided by our IDConverter tool.

Output

Plots

OOB error vs. num of genes

The out-of-bag (OOB) error rate vs. the number of genes (number of variables) used by random forest. The thick red line shows the original data, and the dotted black lines show the bootstrap samples. Please note that these OOB errors are biased downwards and should not be used as honest estimates of the error rate.

OOB predictions

For each sample, its (average) out-of-bag class prediction. Thus, this is the prediction for each sample obtained when that sample was not used at all in the training process. We provide one such plot for each of the classes. Thus, for each sample we plot the (average) posterior probability of that sample belonging to each class. Obviously, when there are only two classes, one plot is just 1 minus the other plot. You would have excellent results if each sample that belongs to a class has an out-of-bag (average) posterior probability very close to 1 for its true class and very close to zero for all other classes.

Importance spectrum plots

Plots of the variable importance of the genes from the original data compared to variable importances from data sets with the same gene expression data but randomly permuted class labels. For greater detail, we show the plot both for all genes and for only the first 200 and 30 genes.

Selection probability plot

We plot, for each of the top ranked genes from the original sample, the probability that it is included among the top ranked k genes (where in these figures k = 20, 100) from the bootstrap samples. These plots can be a measure of our confidence in choosing gene g among the top k genes.
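The quantity shown in these plots can be computed as in the sketch below: for each top-ranked gene from the original sample, the fraction of bootstrap samples in which that gene is among the top k. The function name and the toy rankings are hypothetical; in the application there would be 200 bootstrap rankings and k would be 20 or 100.

```python
def selection_probabilities(original_ranking, bootstrap_rankings, k):
    """For each of the top k genes of the original ranking, the fraction
    of bootstrap rankings in which that gene is among the top k."""
    top_k_sets = [set(ranking[:k]) for ranking in bootstrap_rankings]
    n_runs = len(top_k_sets)
    return {
        gene: sum(gene in top for top in top_k_sets) / n_runs
        for gene in original_ranking[:k]
    }

# Rankings list genes from most to least important.
original = ["g1", "g2", "g3", "g4"]
bootstraps = [
    ["g1", "g3", "g2", "g4"],
    ["g2", "g1", "g4", "g3"],
    ["g1", "g2", "g3", "g4"],
    ["g3", "g1", "g2", "g4"],
]
print(selection_probabilities(original, bootstraps, k=2))  # {'g1': 1.0, 'g2': 0.5}
```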

Results: text

Variable selection using all data

The genes selected (variables used) when running the gene selection procedure for finding small sets of genes on the complete, original data.

Bootstrap results

The results from the bootstrap run. In particular:

Bootstrap estimate of prediction error. This is an honest estimate of the error rate.
Error rate at random. This is the (smallest) error you can achieve by betting always on the most common class. In other words, this is the smallest error from using no information from the gene expression.
Number of vars in bootstrapped forests. The number of genes selected in the bootstrap runs. We show a summary over the 200 runs, including mean, median, 1st and 3rd quartiles, maximum and minimum.
Variable freqs. in bootstrapped models. For each of the genes (variables) selected in at least one bootstrap run, the frequency with which it was selected.
Variable freqs. of variables in forest from all data, and summary. For all the genes selected from the run on the complete, original data, the frequency with which those genes appear in the bootstrap runs, as well as some summary statistics.
Solutions frequencies in bootstrapped models. "Solution" is what we call the actual set of genes selected in a run. This might allow you to spot certain sets of genes that tend to get selected together, or sets of genes that are selected as alternatives to each other.
Mean class membership probabilities from out of bag samples. These are the average out-of-bag posterior probabilities plotted above.
Variable (gene) importances from original data. Gene importances for all genes from the original data, before any gene selection. Ordered from most to least important.
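Several of the quantities in this list are simple summaries, and a short sketch may make their definitions concrete: the error rate at random, the frequency of each gene across bootstrap runs, and the frequency of each "solution". The function names and the toy inputs are hypothetical and only illustrate the definitions given above.

```python
from collections import Counter

def error_rate_at_random(class_labels):
    """Error from always betting on the most common class."""
    counts = Counter(class_labels)
    return 1 - max(counts.values()) / len(class_labels)

def variable_frequencies(bootstrap_solutions):
    """Fraction of bootstrap runs in which each gene was selected."""
    n_runs = len(bootstrap_solutions)
    counts = Counter(gene for solution in bootstrap_solutions for gene in set(solution))
    return {gene: n / n_runs for gene, n in counts.most_common()}

def solution_frequencies(bootstrap_solutions):
    """Fraction of bootstrap runs that selected exactly each set of genes."""
    n_runs = len(bootstrap_solutions)
    counts = Counter(frozenset(solution) for solution in bootstrap_solutions)
    return {tuple(sorted(solution)): n / n_runs for solution, n in counts.most_common()}

labels = ["CL1"] * 5 + ["CL2"] * 3
solutions = [["g1", "g2"], ["g2", "g1"], ["g1"], ["g1", "g3"]]
print(error_rate_at_random(labels))  # 0.375
print(variable_frequencies(solutions)["g2"])  # 0.5
```

Treating each solution as a set means that the order in which genes are reported does not matter when counting how often a solution appears.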

Download figures and results

You can download a single compressed file with all figures, in both the png format shown in the web results and in pdf format (which gives better quality for printing), together with the results (a single text file). The format is tar.gz, understood by GNU's tar, standard in GNU/Linux distributions, and also understood by the common decompression tools in other operating systems.

Sending results to PaLS (New!!)

It is now possible to send the results to PaLS. PaLS "analyzes sets of lists of genes or single lists of genes. It filters those genes/clones/proteins that are referenced by a given percentage of PubMed references, Gene Ontology terms, KEGG pathways or Reactome pathways." (from PaLS's help). By sending your results to PaLS, it might be easier to make biological sense of your results, because you are "annotating" your results with additional biological information.

Scroll to the bottom of the main output, where you will find the PaLS icon and two lists. When you click on any of the links, the corresponding list of genes will be sent to PaLS. There, you can configure the options as you want (please consult PaLS's help for details) and then submit the list. In PaLS, you can always go back and keep playing with the very same gene list, modifying the options.

It probably rarely, if ever, makes sense to send to PaLS the list of the genes selected in the main run. However, sending the second list can provide you with valuable information about the biological characteristics of the genes that tend to be selected in most of the bootstrap runs.

How long does it take?

For a data set such as the leukemia data set of Golub et al. (1999, Science, 286: 531-537), with 38 subjects and 3051 genes, it takes less than 10 minutes when the servers are lightly loaded. With the Prostate data set of Singh et al. (2002, Cancer Cell, 1: 203-209), with 102 subjects and 6033 genes, it takes about 1 hour and 15 minutes. Of course, your mileage will vary.

Examples

Examples of several runs, one with fully commented results, are available here.

Author and Acknowledgements

This program was developed by Ramón Díaz-Uriarte, from the Bioinformatics Unit at CNIO. This tool uses Python for the CGI and R for the computations. The basic skeleton of the R code is the package randomForest, an R port by Andy Liaw and Matthew Wiener of the Fortran original by Leo Breiman and Adele Cutler. The functions for gene selection are part of the varSelRF package. The R code also uses the following R packages: CGIwithR, by David Firth; snow, by Luke Tierney, A. J. Rossini, Na Li and H. Sevcikova; Rmpi, by Hao Yu; and rsprng, by Na (Michael) Li.

This application is running on a cluster of machines using Debian GNU/Linux as operating system, Apache as web server, Linux Virtual Server for web server load-balancing, and LAM/MPI for parallelization.

We want to thank the authors and contributors of these great (and open source) tools that they have made available for all to use. If you find this useful, and since R and Bioconductor are developed by a team of volunteers, we suggest you consider making a donation to the R foundation for statistical computing.

Funding partially provided by Project TIC2003-09331-C02-02 of the Spanish Ministry of Education and Science. This application is running on a cluster of machines purchased with funds from the RTICCC.

Terms of use

You acknowledge that this Software is experimental in nature and is supplied "AS IS", without obligation by the authors, the CNIO's Bioinformatics Unit or the CNIO to provide accompanying services or support. The entire risk as to the quality and performance of the Software is with you. The CNIO and the authors expressly disclaim any and all warranties regarding the software, whether express or implied, including but not limited to warranties pertaining to merchantability or fitness for a particular purpose.
If you use GeneSrF for any publication, we would appreciate it if you could let us know and if you cite our program (you know, "credit where credit is due"). Please provide the reference to the publication (Diaz-Uriarte, R. 2007. GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 2007, 8:328) and the URL of the application (http://genesrf.iib.uam.es).
We appreciate it if you give us feedback concerning bugs, errors or misconfigurations. Complaints or suggestions are welcome.

Privacy and Security

Uploaded data sets are saved in temporary directories on the server and are accessible through the web until they are erased after some time. Anybody can access those directories; nevertheless, the names of the directories are not trivial, and thus it is not easy for a third person to access your data.

In any case, you should keep in mind that communications between the client (your computer) and the server are not encrypted at all, and thus it is also possible for somebody else to look at your data while you are uploading or downloading them.

Disclaimer

This software is experimental in nature and is supplied "AS IS", without obligation by the authors or the CNIO to provide accompanying services or support. The entire risk as to the quality and performance of the software is with you. The authors expressly disclaim any and all warranties regarding the software, whether express or implied, including but not limited to warranties pertaining to merchantability or fitness for a particular purpose.

References

Ambroise C, McLachlan GJ (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci USA 99: 6562-6566.

Breiman L (2001) Random forests. Machine Learning 45: 5-32 (Tech. report).

Breiman L (2003) Manual: Setting Up, Using, And Understanding Random Forests V4.0.

Díaz-Uriarte R (2007) GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest. BMC Bioinformatics 8: 328.

Díaz-Uriarte R, Alvarez de Andrés S (2005) Gene selection and classification of microarray data using random forest. In review. (tech. report.)

Efron B, Tibshirani RJ (1997) Improvements on cross-validation: the .632+ bootstrap method. J. American Statistical Association 92: 548-560.

Friedman JH, Meulman JJ (2004) Clustering objects on subsets of attributes (with discussion). J. Royal Statistical Society, Series B 66: 815-850.

Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2: 18-22.

Simon R, Radmacher MD, Dobbin K, McShane LM (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. Journal of the National Cancer Institute 95: 14-18.

Copyright

Last modified: 2006-12-21