GSAA - gene set association analysis

GSAASeqGP User Guide

Introduction


Gene Set Association Analysis for RNA-Seq (GSAASeqGP) is a toolset for gene set association analysis of RNA-Seq count data. GSAASeqGP identifies pathways/gene sets significantly associated with a disease or a phenotype by analyzing genome-wide patterns of gene expression variation measured by RNA-Seq technology.

GSAASeqGP is a Java/R-based desktop application that implements the methods described in:
Qing Xiong, Sayan Mukherjee, Terrence S. Furey. GSAASeqSP: A toolset for gene set association analysis of RNA-Seq data. Scientific Reports. 2014 Sep; 4:6347


Downloading and installing software


Software GSAASeqGP is released as a functionally independent module in our GSAA platform that is available for free download at http://gsaa.unc.edu

GSAASeqGP can run on any computer (Windows, Mac OS X, Linux etc.) with the following software installed.

1) Download and install Java 7 or higher. Java is available at http://java.sun.com/javase/downloads/index.jsp

2) Download and install R 2.15.1 or higher. R is available at http://www.r-project.org

3) Download and install R packages edgeR, DESeq, NOISeq, DEGseq, baySeq to [R_HOME]/library/

edgeR is available at http://www.bioconductor.org/packages/release/bioc/html/edgeR.html.

DESeq is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq.html.

NOISeq is available at http://www.bioconductor.org/packages/release/bioc/html/NOISeq.html.

DEGseq is available at http://www.bioconductor.org/packages/release/bioc/html/DEGseq.html.

baySeq is available at http://www.bioconductor.org/packages/release/bioc/html/baySeq.html.

4) Download and install rJava to [R_HOME]/library/

rJava is available at http://www.rforge.net/rJava

5) Setting up environment variables

Briefly, the R library (R.dll on Windows, libR.so on Linux, or libR.dylib on Mac OS X) and the JRI native library (jri.dll on Windows, libjri.so on Linux, or libjri.jnilib on Mac OS X) must be on java.library.path. To achieve this, either (1) specify -Djava.library.path on the java command line or (2) edit the environment variable "Path" on Windows, "LD_LIBRARY_PATH" on Linux, or "DYLD_LIBRARY_PATH" on Mac OS X. In addition, JRI.jar must be on the Java class path (edit the environment variable "CLASSPATH").

Windows:

Add the R_HOME variable (right-click the desktop icon Computer -> Properties -> Advanced system settings -> Environment variables). R_HOME should point to the location where R is installed.

Edit the Path variable and append the following two directories:
(i) [R_HOME]\library\rJava\jri\i386 (for 32-bit Java) or [R_HOME]\library\rJava\jri\x64 (for 64-bit Java), or [R_HOME]\library\rJava\jri if your installation has no architecture subdirectories, and
(ii) [R_HOME]\bin\i386 (for 32-bit Java) or [R_HOME]\bin\x64 (for 64-bit Java), or [R_HOME]\bin if your installation has no architecture subdirectories.
Note: in the first directory you should see a file "jri.dll" and in the second directory a file "R.dll". If not, append the directories that contain these files instead.

Edit the CLASSPATH variable and append the following:
(i) [R_HOME]\library\rJava\jri
Note: in this directory you should see a jar file "JRI.jar" (or JRIEngine.jar, JRI.jar and REngine.jar for the old JRI version). If not, append the directory that contains this file instead.

Linux:

Export the R installation path. R_HOME should point to the location where R is installed.
export R_HOME=your installation path

Add the path of the executable programs to PATH. By default, executables are located in ${R_HOME}/bin.
export PATH=${R_HOME}/bin:${PATH}

Add the path of the JRI native library to PATH. By default, this library is located in ${R_HOME}/library/rJava/jri.
export PATH=${R_HOME}/library/rJava/jri:${PATH}

Add the path of the JRI native library to LD_LIBRARY_PATH. By default, this library is located in ${R_HOME}/library/rJava/jri.
export LD_LIBRARY_PATH=${R_HOME}/library/rJava/jri:${LD_LIBRARY_PATH}

Add the path of the R library to LD_LIBRARY_PATH. By default, this library is located in ${R_HOME}/lib.
export LD_LIBRARY_PATH=${R_HOME}/lib:${LD_LIBRARY_PATH}

Add the path of the archive JRI.jar (or JRIEngine.jar, JRI.jar and REngine.jar for the old JRI version) to CLASSPATH. By default, JRI.jar is located in ${R_HOME}/library/rJava/jri.
export CLASSPATH=${R_HOME}/library/rJava/jri:${CLASSPATH}

For example, open .bashrc or .bash_profile and append the following:
export R_HOME=/opt/R-3.0.0/lib64/R
export PATH=${R_HOME}/bin:${R_HOME}/library/rJava/jri:${PATH}
export LD_LIBRARY_PATH=${R_HOME}/library/rJava/jri:${R_HOME}/lib:${LD_LIBRARY_PATH}
export CLASSPATH=${R_HOME}/library/rJava/jri:${CLASSPATH}

Mac OS X:

Export the R installation path. R_HOME should point to the location where R is installed.
export R_HOME=your installation path

Add the path of the executable programs to PATH. By default, executables are located in ${R_HOME}/bin.
export PATH=${R_HOME}/bin:${PATH}

Add the path of the JRI native library to PATH. By default, this library is located in ${R_HOME}/library/rJava/jri.
export PATH=${R_HOME}/library/rJava/jri:${PATH}

Add the path of the JRI native library to DYLD_LIBRARY_PATH. By default, this library is located in ${R_HOME}/library/rJava/jri.
export DYLD_LIBRARY_PATH=${R_HOME}/library/rJava/jri:${DYLD_LIBRARY_PATH}

Add the path of the R library to DYLD_LIBRARY_PATH. By default, this library is located in ${R_HOME}/lib.
export DYLD_LIBRARY_PATH=${R_HOME}/lib:${DYLD_LIBRARY_PATH}

Add the path of the archive JRI.jar (or JRIEngine.jar, JRI.jar and REngine.jar for the old JRI version) to CLASSPATH. By default, JRI.jar is located in ${R_HOME}/library/rJava/jri.
export CLASSPATH=${R_HOME}/library/rJava/jri:${CLASSPATH}

For example, open .bashrc, .bash_profile or launchd.conf and append the following:
export R_HOME=/Library/Frameworks/R.framework/Resources
export PATH=${R_HOME}/bin:${R_HOME}/library/rJava/jri:${PATH}
export DYLD_LIBRARY_PATH=${R_HOME}/library/rJava/jri:${R_HOME}/lib:${DYLD_LIBRARY_PATH}
export CLASSPATH=${R_HOME}/library/rJava/jri:${CLASSPATH}

If you cannot find .bashrc, .bash_profile or launchd.conf, create it.
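
A quick way to confirm the settings above is to check that the files named in this section exist where R_HOME says they should. The following is a minimal sketch for Linux and Mac OS X only, assuming the default layout described above (JRI files in library/rJava/jri, the R shared library in lib). The function names missing_jri_files and report are illustrative and are not part of GSAASeqGP.

```python
import os
from pathlib import Path

# Native library names as listed in this guide (Linux and Mac OS X only).
NATIVE_LIBS = {
    "linux": ("libjri.so", "libR.so"),
    "mac": ("libjri.jnilib", "libR.dylib"),
}


def missing_jri_files(r_home, system="linux"):
    """Return the expected files that do not exist under r_home.

    Assumes the default layout: JRI files in library/rJava/jri and the
    R shared library in lib.
    """
    jri_lib, r_lib = NATIVE_LIBS[system]
    root = Path(r_home)
    jri_dir = root / "library" / "rJava" / "jri"
    expected = [jri_dir / jri_lib, jri_dir / "JRI.jar", root / "lib" / r_lib]
    return [str(path) for path in expected if not path.is_file()]


def report(system="linux"):
    """Print a short diagnosis based on the R_HOME environment variable."""
    r_home = os.environ.get("R_HOME")
    if not r_home:
        print("R_HOME is not set")
        return
    missing = missing_jri_files(r_home, system)
    if missing:
        print("missing: " + ", ".join(missing))
    else:
        print("all expected files found")
```

Calling report("linux") or report("mac") in a new terminal session shows whether the exported R_HOME is visible and which of the three files, if any, could not be found.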

Note: If it still does not work, try a command like the following on a Linux machine:
java -Djava.library.path=${R_HOME}/library/rJava/jri -cp ${CLASSPATH}:/home/tcga/program/GSAA.jar -Xmx5000m xtools.gsea.GsaaSeqGP -gmx /home/tcga/data/c2.cp.v4.0.symbols.gmt -nperm 2000 -exp_file /home/tcga/data/gbm_tcga_rnaseq.gct -exp_template_file /home/tcga/data/gbm_tcga_rnaseq.cls -gsametric KS -demetric DESeq -permute gene_set -rnd_type no_balance -scoring_scheme weighted -norm MeanDiv -rpt_label gsaaseqgp_gbm_tcga_c2.cp.v4.0 -make_sets true -plot_top_x 20 -rnd_seed timestamp -save_rnd_lists false -set_max 500 -set_min 15 -zip_report false -out /home/tcga/result/gsaaseqgp -gui false


Getting Started


Starting GSAASeqGP Desktop Application


Unzip or untar the downloaded program file into a directory. Remember, lib and GSAA.jar must be in the same directory.

Windows users:
To launch GSAA, double-click the icon of the GSAA.jar file or use the command
java -Xmx1000m -jar full-path/GSAA.jar

Linux and Mac users:
To launch GSAA, use the command
java -Xmx1000m -jar full-path/GSAA.jar

The -Xmx parameter specifies the amount of memory available to Java. If you get an "out of memory" error message, increase 1000m to 2000m or more. GSAASeqGP has been successfully used with 20000m on a Linux server.
full-path is the complete path of the GSAA.jar file.

Example: java -Xmx1000m -jar C:/programs/gsaa/GSAA.jar


When GSAA starts, the main window appears. The main components of the user interface are as follows:




1. The navigation bar on the left, which provides quick access to common GSAA operations.

2. The Processes panel in the bottom left corner, which provides information about the status of your analyses.

3. The main panel on the right, which is used to display dialogs and results. When you start GSAA, the main panel displays the Home page. To open the GSAASeqGP page, click the icon "Run GSAASeqGP"; the GSAASeqGP tab appears next to the Home tab. To close the page, click the close (X) icon on the tab.

Exiting GSAASeqGP
To exit from GSAASeqGP, do one of the following:

1. Click the close (x) button on the top-right corner of the GSAASeqGP window.

2. Select File>Exit.

Getting Help


The GSAA web site is your primary source of help for GSAASeqGP. It includes the following resources:

1. Documentation. The GSAASeqGP documentation includes this User Guide.

2. Publications. The web site provides a link to the paper describing the algorithms.

If you cannot find the answers to your questions on our web site, contact us at qxiong@email.unc.edu.


Preparing Data Files for GSAASeqGP


When you use GSAASeqGP, you supply three data files: an expression dataset file, a phenotype labels file, and a gene sets file. The following table lists each type of data file and its valid file formats. All files are tab-delimited ASCII text files; they can be created and edited using any text editor.


Data File | Content | Format | Source
Expression dataset | Contains gene names, samples, and a count for each gene in each sample. The count values must be raw counts of sequencing reads. Expression data can come from any source. | gct | You create the file.
Phenotype labels | Contains phenotype labels and associates each sample with a phenotype. Only categorical labels are allowed in GSAASeqGP. | cls | You create the file.
Gene sets | Contains one or more gene sets. For each gene set, gives the gene set name and list of features (genes or probes) in that gene set. | gmx or gmt | You use the files on the Broad ftp site, export gene sets from the Molecular Signature Database (MSigDB) or create your own gene sets file.

You can create and edit GSAASeqGP files using Excel or any text editor. If you use Excel to create a tab-delimited text file: select File>Save As, enter the file name in quotes to preserve the file extension (for example, "lung.gct"), and select "Text (Tab delimited) (*.txt)" as the file type. Excel displays a message warning you that your file may contain features that are not compatible with this format and asks if you want to keep the workbook in this format. Click Yes to keep this format. Your file has now been saved. Exit from Excel. When Excel asks if you want to save your changes to this file, select No (you have already saved the file).

In addition, do not use hyphens (-) in the file names.

For descriptions and examples of the GSEA-related file formats gct, gmx, and gmt, see the GSEA User Guide and GSEA file formats. For GSAASeqGP file formats, see below.

RNA-Seq Data Format (*.gct)


The GCT format is a tab-delimited file format that describes an RNA-Seq dataset. It is organized as follows:




The first line contains comments describing the dataset. The first line must start with #.
Line format: # anything
Example: # kidney liver rnaseq dataset

The second line contains the number of genes and the number of samples.
Line format: (number of genes) (tab) (number of samples)
Example: 15584 14

The remainder of the data file contains count data for each of the genes. There is one row for each gene and one column for each of the samples. The number of rows and columns should agree with the number of genes and samples specified on line 2. Each row contains a gene name, a description (na means no description), and a count value for each sample.
Line format: (gene name) (tab) (description) (tab) (col 1 data) (tab) (col 2 data) (tab) ... (col N data)
Example: GDA na 87 52 79 90 60 93
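
As an illustration of the gene row layout above, the following sketch parses one tab-delimited row and checks that it holds a gene name, a description, and one raw count per sample. It is not part of GSAASeqGP; the function name parse_gene_row is hypothetical, and the check that counts are non-negative integers is an assumption derived from the requirement that values be raw read counts.

```python
def parse_gene_row(line, n_samples):
    """Parse one gene row of a GCT file into (name, description, counts).

    Raises ValueError if the row does not have n_samples count columns
    or if a count is not a non-negative integer.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != n_samples + 2:
        raise ValueError(
            f"expected {n_samples + 2} tab-separated fields, found {len(fields)}"
        )
    name, description = fields[0], fields[1]
    counts = [int(value) for value in fields[2:]]  # int() rejects values like 8.5
    if any(count < 0 for count in counts):
        raise ValueError(f"negative count in row for gene {name}")
    return name, description, counts


# The example row from this section, with tabs written out explicitly.
row = "GDA\tna\t87\t52\t79\t90\t60\t93"
name, description, counts = parse_gene_row(row, n_samples=6)
```

A row with normalized or fractional values (for example 8.5) fails the integer conversion, which is a cheap way to catch data that are not raw counts before starting an analysis.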

Phenotype Data Format (*.cls)


The CLS file format defines categorical phenotype (class or template) labels and associates each sample in the expression data with a label. Only two phenotypic classes, for example, tumor vs normal, are allowed. We recommend that you use the class of interest, for example tumor, as the first class in the CLS file.

The CLS file format uses spaces or tabs to separate the fields. It is organized as follows:




The first line of a CLS file contains numbers indicating the number of samples and number of classes (2). The number of samples should correspond to the number of samples in the associated GCT file.
Line format: (number of samples) (space) 2 (space) 1
Example: 30 2 1

The second line in a CLS file contains a user-visible name for each class. These are the class names that appear in analysis reports. The line should begin with a pound sign (#).
Line format: # (class 1 name) (space) (class 2 name)
Example: #Tumor Normal

The third line contains a class label for each sample. The label for the first class must be 1, and the label for the second class must be 2. The first label used is assigned to the first class named on the second line; the second unique label is assigned to the second class named. (Note: The order of the labels determines the association of class names and class labels, even if the class labels are the same as the class names.) The number of class labels specified on this line should be the same as the number of samples specified in the first line. The number of unique class labels specified on this line should be the same as the number of classes specified in the first line.
Line format: (sample 1 class) (space) (sample 2 class) (space) ... (sample N class)
Example: 1 1 1 ... 2 2
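
The three CLS lines can also be generated from a list of per-sample labels. The sketch below is a hypothetical helper, not part of GSAASeqGP. It follows the line formats given above and, as a conservative assumption, requires the first sample to carry label 1 so that label 1 is unambiguously tied to the first class name.

```python
def make_cls(labels, class_names):
    """Return the text of a two-class CLS file.

    labels: one label (1 or 2) per sample, in the sample order of the GCT file.
    class_names: (name of class 1, name of class 2).
    """
    if len(class_names) != 2:
        raise ValueError("exactly two class names are required")
    if set(labels) != {1, 2}:
        raise ValueError("labels must be 1 or 2, and both classes must occur")
    if labels[0] != 1:
        # Assumption: keeps label 1 mapped to the first class name.
        raise ValueError("the first sample should belong to class 1")
    header = f"{len(labels)} 2 1"
    names = "# " + " ".join(class_names)
    body = " ".join(str(label) for label in labels)
    return "\n".join([header, names, body]) + "\n"


cls_text = make_cls([1, 1, 1, 2, 2], ("Tumor", "Normal"))
```

Here cls_text holds the lines "5 2 1", "# Tumor Normal" and "1 1 1 2 2", which can be written to a .cls file with any text-mode file write.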


Loading Data


Click the icon “Load data” to open the Load data page.




There are several ways to load data:

  • Clicking the Browse for files button will allow you to select files from your file system and load them into GSAASeqGP. To select multiple files, use SHIFT-click or CTRL-click.
  • Clicking the Load last dataset used button will load the data used in the most recent analysis.
  • Drag-and-drop the files from a file browser window into the drag-and-drop pane. When the files that you want to load are listed in that pane, click the Load these files button. To remove files from the drag-and-drop pane, click the Clear button.
  • The Recently Used Files pane contains files that you have used previously. Double-click a file to load it.


Specifying Parameters


Click the icon “Run GSAASeqGP” to open the GSAASeqGP page. There are three categories of parameters in GSAASeqGP:
  • Required: Essential parameters which you must specify before the analysis can be run.
  • Basic: Additional parameters with standard defaults. Typically, accepting the defaults is ok. Click Show to see these parameters.
  • Advanced: Parameters that allow control of several more details of the GSAASeqGP algorithm and the java implementation. Typically, these do not need to be changed by most users. Click Show to see these parameters.
Place your cursor on a parameter name to see a brief description of the parameter.

Required Fields




Required fields lists parameters that are essential for the analysis. Enter values for these parameters before starting the analysis.

  • Gene sets database. Click the ellipsis (…) button and select one or more gene sets:
    • GeneMatrix (from website) lists the MSigDB gene sets available on the Broad ftp site. These gene set files may contain hundreds of gene sets. Use the Browse MSigDB Page to browse the gene sets and to create gene set files (gmx/gmt) containing only gene sets of interest.
    • GeneSets(grp) lists gene sets that GSAASeqGP has created in memory; for example, gene sets created using the Text Entry tab described below.
    • GeneMatrix (local gmx/gmt) lists the gene set files that you have loaded (see Loading Data).
    • Subsets lists each gene set in each gmx/gmt file that you have loaded.
    • Text Entry allows you to create a gene set by entering the genes for that gene set; enter one gene per line. The gene set is created in memory and deleted when you exit.
  • Number of permutations. Specify the number of permutations to perform in assessing the statistical significance of the association score. It is best to start with a small number, such as 10. After the analysis completes successfully, run it again with a full set of permutations.
  • Gene Expression dataset. Click the ellipsis (…) button to select an expression dataset file from a file browser window.
  • Expression phenotype labels. Click the ellipsis (…) button to select a phenotype labels file for the expression dataset from a file browser window.
  • Permutation type. Select the type of permutation to perform in assessing the statistical significance of the association score:
    • Gene_set. Random gene sets, size matched to the actual gene set, are created and their association scores calculated. These association scores are used to create a null distribution from which the significance of the actual association score (for the actual gene set) is calculated. This method is useful when you have too few samples to do phenotype permutations (that is, when you have fewer than seven (7) samples in any phenotype).
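
The gene set permutation described above can be sketched as follows. This is only an illustration of how a null distribution is built from random, size-matched gene sets; it is not the GSAASeqGP implementation. The names gene_set_null and mean_score are hypothetical, and mean_score is a placeholder statistic standing in for the real association score. The default seed 149 is the fixed seed value mentioned under Advanced Fields.

```python
import random
from statistics import mean


def mean_score(genes, gene_scores):
    """Placeholder statistic: mean gene score (not the GSAASeqGP association score)."""
    return mean(gene_scores[gene] for gene in genes)


def gene_set_null(gene_scores, set_size, n_perm, score_fn=mean_score, seed=149):
    """Score n_perm random gene sets, each with set_size genes drawn without replacement."""
    rng = random.Random(seed)
    all_genes = sorted(gene_scores)  # fixed order so a given seed is reproducible
    null = []
    for _ in range(n_perm):
        random_set = rng.sample(all_genes, set_size)
        null.append(score_fn(random_set, gene_scores))
    return null


# Toy per-gene scores (made-up values for illustration only).
toy_scores = {"G1": 3.0, "G2": 2.5, "G3": 1.0, "G4": 0.8, "G5": 0.3, "G6": 0.1}
null_scores = gene_set_null(toy_scores, set_size=3, n_perm=10)
```

The observed score of the actual gene set is then compared against null_scores; because every random set has the same size as the actual set, the comparison is not biased by gene set size.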

Basic Fields




Basic fields lists additional parameters with standard defaults. Typically, you use the default values for these parameters. Click Show/Hide to display and hide these parameters.

  • Analysis name. A short descriptive label for the analysis. The name cannot include spaces. This label is used as a prefix when naming the output report generated by the analysis (for example, my_analysis.GsaaSeq.1130510139575.rpt).
  • Metric for differential expression analysis. GSAASeqGP ranks genes based on their association scores and then analyzes that ranked list of genes. Use this parameter to select the metric used to score the genes. We have incorporated five methods for the RNA-seq differential expression analysis in GSAASeqGP: edgeR (Robinson, et al., 2010), DESeq (Anders and Huber, 2010), NOISeq (Tarazona, et al., 2011), DEGseq (Wang, et al., 2010), baySeq (Hardcastle, et al., 2010).
  • Metric for gene set analysis. Use this parameter to select the metric for gene set association analysis.
  • Association statistic. This option controls the value of p used in the association score calculation. A larger p gives higher weights to genes with extreme statistic values:
    • classic: p=0
    • weighted (default): p=1
    • weighted_p2: p=2
    • weighted_p1.5: p=1.5
  • Max size. After filtering from the gene sets any gene not in the expression dataset, gene sets larger than this are excluded from the analysis.
  • Min size. After filtering from the gene sets any gene not in the expression dataset, gene sets smaller than this are excluded from the analysis.
  • Save results in this folder. Path of the directory in which to place the analysis results. Existing results in this folder are not overwritten. By default, analysis results are saved in the GSAA output folder. To view this folder, select Help>Show GSAA output folder.
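
The Max size and Min size rules can be pictured with a short sketch: each gene set is first restricted to genes present in the expression dataset, and only then compared against the size limits. This is an illustration rather than the tool's code; filter_gene_sets is a hypothetical name, and the defaults 15 and 500 are taken from the -set_min and -set_max values in the example command shown earlier in this guide.

```python
def filter_gene_sets(gene_sets, dataset_genes, set_min=15, set_max=500):
    """Restrict gene sets to dataset genes, then apply the size limits.

    Returns (kept, rejected): kept maps set name to its filtered gene list,
    rejected lists the names of sets outside [set_min, set_max].
    """
    dataset = set(dataset_genes)
    kept, rejected = {}, []
    for name, genes in gene_sets.items():
        present = [gene for gene in genes if gene in dataset]
        if set_min <= len(present) <= set_max:
            kept[name] = present
        else:
            rejected.append(name)
    return kept, rejected


# Toy example with small limits so the effect is visible.
toy_sets = {"SET_A": ["G1", "G2", "G9"], "SET_B": ["G1", "G8", "G9"]}
kept, rejected = filter_gene_sets(toy_sets, ["G1", "G2", "G3"], set_min=2, set_max=3)
```

In the toy example SET_A keeps two genes and passes, while SET_B keeps only one gene after filtering and is rejected even though it originally listed three.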

Advanced Fields




Advanced fields lists parameters that control details of the GSAASeqGP algorithm and its Java implementation. Do not change the default values of these parameters unless you are conversant with the algorithm and its Java implementation. Click Show/Hide to display and hide these parameters.

  • Randomization mode. Method used to randomly assign phenotype labels to samples for phenotype permutations. Not used for gene set permutations.
    • no_balance (default). Permutes labels without regard to number of samples per phenotype. For example, if your dataset has 12 samples in class_a and 10 samples in class_b, any permutation of class_a has 12 samples randomly chosen from the dataset.
    • equalize_and_balance. Permutes labels by equalizing the number of samples per phenotype and then balancing the number of samples contributed by each phenotype. For example, if your dataset has 12 samples in class_a and 10 samples in class_b, any permutation of class_a has 10 samples: 5 randomly chosen from class_a and 5 randomly chosen from class_b.

      We recommend using no_balance (default), unless the number of samples per phenotype is highly unbalanced.
  • Normalization mode. Method used to normalize the association scores (AS) across analyzed gene sets:
    • MeanDiv (default): GSAASeqGP normalizes the association scores by dividing a given AS by the mean of its null distribution generated from a permutation procedure.
    • None: GSAASeqGP does not normalize the association scores.
  • Make detailed gene set report. Set to True (default) to create a detailed gene set report for each associated gene set.
  • Plot graphs for the top sets of each phenotype. Generates summary plots and detailed analysis results for the top x gene sets in each phenotype, where x is 20 by default. GSAASeqGP ranks gene sets by their FDR q-values, so the top gene sets are those with the smallest FDR.
  • Seed for permutation. Seed used to generate a random number for phenotype and gene set permutations: timestamp (default) or 149. The specific seed value (149) generates consistent results, which is useful when testing software.
  • Save random ranked lists. Set to True (default=false) to save the random ranked lists of genes created by phenotype permutations. When you save random ranked lists, for each permutation, GSAASeqGP saves the rank metric score for each gene (the score used to position the gene in the ranked list). Saving random ranked lists is memory intensive; therefore, this parameter is set to false by default.
  • Make a zipped file with all reports. Set to True (default=false) to create a zip file of the analysis results. The zip file is saved to the output folder with all of the other files generated by the analysis. This is useful for sharing analysis results.
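
The two randomization modes above differ only in how the samples of a permuted class are drawn. The sketch below reproduces the class_a examples given for no_balance and equalize_and_balance; it is an illustration under stated assumptions, not the GSAASeqGP implementation. The function name permuted_class_a is hypothetical, and when the equalized size is odd, giving the extra sample to class_b is an arbitrary choice made here.

```python
import random


def permuted_class_a(class_a, class_b, mode="no_balance", rng=None):
    """Return the samples assigned to class_a in one label permutation."""
    rng = rng or random.Random()
    if mode == "no_balance":
        # Same class size as the original, drawn from all samples.
        return rng.sample(class_a + class_b, len(class_a))
    if mode == "equalize_and_balance":
        # Equalize to the smaller class, then draw half from each phenotype.
        size = min(len(class_a), len(class_b))
        from_a = size // 2
        return rng.sample(class_a, from_a) + rng.sample(class_b, size - from_a)
    raise ValueError(f"unknown mode: {mode}")


class_a = [f"a{i}" for i in range(12)]
class_b = [f"b{i}" for i in range(10)]
rng = random.Random(149)
no_balance = permuted_class_a(class_a, class_b, "no_balance", rng)
balanced = permuted_class_a(class_a, class_b, "equalize_and_balance", rng)
```

With 12 samples in class_a and 10 in class_b, no_balance holds 12 samples drawn from the whole dataset, while balanced holds 10 samples, 5 from each original class.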

Buttons at the bottom of the page:

  • Reset. Restores the default values for all parameters.
  • Last. Loads the data used the last time you ran this analysis.
  • Command. Displays the command line used to run the analysis, as described in Running GSAASeqGP from the Command Line.
  • Low/Normal (cpu usage). Determines the amount of CPU dedicated to this analysis. To use your computer for other tasks while running GSAASeqGP in the background, choose Low. To complete your analysis more quickly, choose Normal.
  • Run. Starts the analysis.


Running Gene Set Association Analysis




Click Run to start the analysis.




Use the Processes panel at the lower left corner to view the status of analyses run in this session, including the currently running analysis:

1. The blue Running label indicates the currently running analysis. You can click on this label to pause or resume an analysis.

2. If a red Error appears, click on it for a description of the error.

3. When the analysis completes, click the green Success label to display the results in a web browser.


Interpreting GSAASeqGP Results


GSAASeqGP Statistics

Gene Set Association Score (AS)

The primary result of the gene set association analysis is the gene set association score (AS), which reflects the degree to which a gene set is overrepresented at the top of a ranked list of genes. GSAASeqGP calculates the AS by walking down the ranked list of genes, increasing a running-sum statistic when a gene is in the gene set and decreasing it when it is not. The magnitude of the increment depends on the association of the gene with the phenotype. The AS is the maximum deviation from zero encountered in walking the list. A positive AS indicates gene set enrichment at the top of the ranked list; a negative AS indicates gene set enrichment at the bottom of the ranked list.
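
The running-sum walk can be sketched in a few lines. The version below assumes a GSEA-style weighted Kolmogorov-Smirnov statistic, in which a hit adds |score|^p divided by the summed hit weights and a miss subtracts 1 divided by the number of non-member genes; the exact increments used by GSAASeqGP may differ. The exponent p corresponds to the Association statistic option (classic p=0, weighted p=1), and the function name is hypothetical.

```python
def running_association_score(ranked, gene_set, p=1.0):
    """Walk a ranked list and return (AS, running sums).

    ranked: list of (gene, score) pairs sorted by decreasing score.
    gene_set: iterable of gene names.
    """
    members = set(gene_set)
    hit_weights = [abs(score) ** p if gene in members else 0.0 for gene, score in ranked]
    n_hits = sum(1 for gene, _ in ranked if gene in members)
    n_miss = len(ranked) - n_hits
    total_hit_weight = sum(hit_weights)
    if n_hits == 0 or n_miss == 0 or total_hit_weight == 0:
        raise ValueError("need both member and non-member genes with non-zero weight")
    running, total = [], 0.0
    for (gene, _), weight in zip(ranked, hit_weights):
        if gene in members:
            total += weight / total_hit_weight  # increase on a hit
        else:
            total -= 1.0 / n_miss  # decrease on a miss
        running.append(total)
    # The AS is the running sum furthest from zero.
    return max(running, key=abs), running


toy_ranked = [("A", 3.0), ("B", 2.0), ("C", 1.0), ("D", 0.5)]
as_weighted, running = running_association_score(toy_ranked, {"A", "C"}, p=1.0)
as_classic, _ = running_association_score(toy_ranked, {"A", "C"}, p=0.0)
```

For this toy list the weighted running sums are 0.75, 0.25, 0.5, 0.0, so the AS is 0.75 and the peak sits at the first gene. With p=0 every hit counts equally and the AS drops to 0.5, which shows how a larger p rewards gene sets whose members have the most extreme scores.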

In the analysis results, the association plot provides a graphical view of the association score for a gene set:



  • The top portion of the plot shows the running AS for the gene set as the analysis walks down the ranked list. The score at the peak of the plot (the score furthest from 0.0) is the AS for the gene set. Because GSAASeqGP employs a non-directional differential expression score, gene sets with a distinct peak at the beginning (such as the one shown here) are generally the most interesting.
  • The middle portion of the plot shows where the members of the gene set appear in the ranked list of genes.
    The leading edge subset of a gene set is the subset of members that contribute most to the AS. For a positive AS (such as the one shown here), the leading edge subset is the set of members that appear in the ranked list prior to the peak score. For a negative AS, it is the set of members that appear subsequent to the peak score.
  • The bottom portion of the plot shows the value of the ranking metric as you move down the list of ranked genes. The ranking metric measures a gene’s association with a phenotype. The value of the ranking metric goes from positive to zero as you move down the ranked list.

Normalized Association Score (NAS)

By normalizing the gene set association score, GSAASeqGP accounts for differences in gene set size and in correlations between gene sets and the datasets; therefore, the normalized association scores (NAS) can be used to compare analysis results across gene sets.
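
A minimal sketch of the MeanDiv normalization, following its description under Advanced Fields (the AS divided by the mean of its null distribution). The function name is hypothetical, and the sketch assumes that the null scores share the sign of the observed AS, which is plausible for a non-directional score but is not stated in this guide.

```python
from statistics import mean


def normalize_mean_div(observed_as, null_scores):
    """Return the AS divided by the mean of its permutation null distribution."""
    null_mean = mean(null_scores)
    if null_mean == 0:
        raise ValueError("mean of the null distribution is zero")
    return observed_as / null_mean


nas = normalize_mean_div(0.75, [0.25, 0.5, 0.75])
```

In this toy case the null mean is 0.5, so an AS of 0.75 becomes a NAS of 1.5. A gene set whose size tends to produce large scores by chance has a larger null mean and is scaled down accordingly.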

Nominal P Value

GSAASeqGP uses a permutation test to evaluate the statistical significance of the AS assigned to a gene set. The statistical significance of the AS is estimated using a nominal P-value that is calculated relative to a null AS distribution generated by permutations.
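
As a simple illustration, an empirical nominal P-value can be computed as the fraction of null scores that are at least as extreme as the observed AS. This is a sketch of the general permutation-test idea with a hypothetical function name; the exact tail definition used by GSAASeqGP is not given in this guide.

```python
def nominal_p_value(observed_as, null_scores):
    """Fraction of null scores at least as extreme as the observed AS."""
    if not null_scores:
        raise ValueError("the null distribution is empty")
    if observed_as >= 0:
        extreme = sum(1 for score in null_scores if score >= observed_as)
    else:
        extreme = sum(1 for score in null_scores if score <= observed_as)
    return extreme / len(null_scores)


p_value = nominal_p_value(0.75, [0.2, 0.8, 0.5, 0.9])
```

Two of the four toy null scores reach 0.75, giving a P-value of 0.5. The smallest non-zero value this estimate can take is 1 divided by the number of permutations, which is one reason to rerun with a full set of permutations after a quick trial run.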

False Discovery Rate (FDR) and Family-Wise Error Rate (FWER)

GSAASeqGP uses FDR and FWER to correct for multiple hypothesis testing and control the proportion of false positives below a certain threshold.

Given m gene sets {S1,S2,...,Sm} and label permutations π=1,…,Π, the FDR for each gene set Si with NAS(Si)>=0 is calculated as


If NAS(Si)<0, the FDR is computed as


Where NAS(Sj, π) is the normalized association score for gene set j with label permutation π. NAS(Sj, π)+ and NAS(Sj, π)- denote positive and negative NAS(Sj, π), respectively. NAS(Sj) is the normalized association score for gene set j. NAS(Sj)+, NAS(Sj)- denote positive and negative NAS(Sj), respectively.
The FWER for a gene set Si with NAS(Si)>=0 is computed as


If NAS(Si)<0, the FWER is computed as
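
One way to write the four quantities above, assuming the GSEA-style definitions that the symbols in this section suggest, is given below. Here I(.) is the indicator function, and the sums over I(NAS >= 0) and I(NAS < 0) count the positive and negative scores denoted by the + and - superscripts. This is a sketch under that assumption and should be checked against the paper cited in the Introduction.

```latex
% FDR, positive side: NAS(S_i) >= 0
\mathrm{FDR}(S_i) =
  \frac{\displaystyle \sum_{\pi=1}^{\Pi}\sum_{j=1}^{m} I\big(NAS(S_j,\pi) \ge NAS(S_i)\big)
        \Big/ \sum_{\pi=1}^{\Pi}\sum_{j=1}^{m} I\big(NAS(S_j,\pi) \ge 0\big)}
       {\displaystyle \sum_{j=1}^{m} I\big(NAS(S_j) \ge NAS(S_i)\big)
        \Big/ \sum_{j=1}^{m} I\big(NAS(S_j) \ge 0\big)}

% FDR, negative side: NAS(S_i) < 0
\mathrm{FDR}(S_i) =
  \frac{\displaystyle \sum_{\pi=1}^{\Pi}\sum_{j=1}^{m} I\big(NAS(S_j,\pi) \le NAS(S_i)\big)
        \Big/ \sum_{\pi=1}^{\Pi}\sum_{j=1}^{m} I\big(NAS(S_j,\pi) < 0\big)}
       {\displaystyle \sum_{j=1}^{m} I\big(NAS(S_j) \le NAS(S_i)\big)
        \Big/ \sum_{j=1}^{m} I\big(NAS(S_j) < 0\big)}

% FWER, positive side: NAS(S_i) >= 0
\mathrm{FWER}(S_i) = \frac{1}{\Pi}\sum_{\pi=1}^{\Pi}
  I\Big(\max_{1 \le j \le m} NAS(S_j,\pi) \ge NAS(S_i)\Big)

% FWER, negative side: NAS(S_i) < 0
\mathrm{FWER}(S_i) = \frac{1}{\Pi}\sum_{\pi=1}^{\Pi}
  I\Big(\min_{1 \le j \le m} NAS(S_j,\pi) \le NAS(S_i)\Big)
```

In words, the FDR compares how often a permuted score is at least as extreme as NAS(Si) with how often an observed score is, each taken as a fraction of the scores of the same sign; the FWER is the fraction of permutations in which the most extreme permuted score over all gene sets reaches NAS(Si).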


GSAASeqGP Report



This section discusses the content of the report generated by the gene set association analysis:

  • Association with phenotype
  • Gene Set Details
  • Gene Markers
  • Other
  • Detailed Association Results
  • Gene Set Details Report

Association with Phenotype




The Association with Phenotype section shows results for gene sets that have a positive association score (gene sets that show enrichment at the top of the ranked list). In GSAASeqGP, a positive association score indicates association with either the first or the second phenotype, and a negative association score indicates no association with the phenotype.

For each phenotype, the report shows:

  • Number of associated gene sets that are significant, as indicated by a false discovery rate (FDR) of less than 25%. Typically, these are the gene sets most likely to generate interesting hypotheses and drive further research.
  • Number of associated gene sets with a nominal p value of less than 1% and of less than 5%. The nominal p value is not adjusted for gene set size or multiple hypothesis testing; therefore, it is of limited value for comparing gene sets.
  • Snapshot of top results. Displays association plots for the gene sets with the smallest FDR. By default, GSAASeqGP displays plots for the top 20 gene sets. To display a different number of plots, use the Plot graphs for the top sets of each phenotype parameter on the Run GSAASeqGP Page. For a description of the association plot, see Gene Set Association Score (AS).
  • Detailed association results provide a summary report of gene sets associated with this phenotype (html and excel formats).
  • Guide to interpret results displays this section of the documentation.

Gene Set Details




The Gene Set Details section of the analysis report provides information about the gene sets:

  • Number of gene sets filtered out of the analysis due to size, and the minimum and maximum gene set sizes used for the filter.
  • Number of gene sets used in the analysis.
  • List of analyzed gene sets. For each gene set, the report shows the original number of genes in the gene set, the number of genes in the gene set after filtering out those genes not in the expression dataset, and the status of the gene set. Status is either blank (the gene set was included in the analysis) or “Rejected” (the gene set was filtered out of the analysis).
Note: If all gene sets are filtered out, the analysis fails. Typically, this occurs for one of the following reasons:
  • The feature identifiers used for the expression dataset do not match those used in the gene sets.
  • After filtering out those genes not in the expression dataset, all of the gene sets are either larger than the maximum or smaller than the minimum gene set size allowed. You can use the Max Size and Min Size parameters on the Run GSAASeqGP Page to change the maximum and minimum gene set size.


Gene Markers




The Gene Markers section of the analysis report provides information about the ranked list of genes used for the analysis:

  • Number of features (genes) in the expression dataset.
  • Rank ordered list of genes in the dataset (Excel format), which includes the following information for each gene: name, gene symbol, gene title, and score.


Other




The final section of the report, Other, lists the analysis parameters. Knowing the parameters is critical for reproducing analysis results.


Detailed Association Results

From the Association with Phenotype section of the analysis report, you can click a link to display the detailed association results report, which lists all gene sets associated with this phenotype ordered by the false discovery rate (FDR):


  • GS. Gene set name. Click the gene set name for a detailed description of the gene set. For MSigDB gene sets, the description is the gene set page on the GSEA web site. For other gene sets, the description is provided by the author of the gene set.
  • GS DETAILS. For the top 20 gene sets, click the Details link to display the Gene Set Details Report. To generate the Details link for a different number of gene sets, use the Plot graphs for the top sets of each phenotype parameter on the Run GSAASeqGP Page.
  • SIZE. Number of genes in the gene set after filtering out those genes not in the expression dataset.
  • AS. Association score for the gene set; that is, the degree to which this gene set is overrepresented at the top or bottom of the ranked list of genes in the expression dataset.
  • NAS. Normalized association score; that is, the association score for the gene set after it has been normalized across analyzed gene sets.
  • NOM p-value. Nominal p value; that is, the statistical significance of the association score. The nominal p value is not adjusted for gene set size or multiple hypothesis testing; therefore, it is of limited use in comparing gene sets.
  • FDR q-value. False discovery rate; that is, the estimated probability that the normalized association score represents a false positive finding.
  • FWER p-value. Family-wise error rate; that is, a more conservatively estimated probability that the normalized association score represents a false positive finding.
  • RANK AT MAX. The position in the ranked list at which the maximum association score occurred. The more interesting gene sets achieve the maximum association score near the top or bottom of the ranked list; that is, the rank at max is either very small or very large.
  • LEADING EDGE. Displays the three statistics used to define the leading edge subset:
    • Tags. The percentage of gene hits before (for positive AS) or after (for negative AS) the peak in the running association score. This gives an indication of the percentage of genes contributing to the association score.
    • List. The percentage of genes in the ranked gene list before (for positive AS) or after (for negative AS) the peak in the running association score. This gives an indication of where in the list the association score is attained.
    • Signal. The association signal strength that combines the two previous statistics:

      $$\text{Signal} = \text{Tags} \times (1 - \text{List}) \times \frac{N}{N - N_h}$$

      where N is the number of genes in the list and Nh is the number of genes in the gene set. If the gene set is entirely within the first Nh positions in the list, then the signal strength is maximal or 100%. If the gene set is spread throughout the list, then the signal strength decreases towards 0%.

    These statistics describe the leading-edge subset of a single gene set. Use the Leading Edge analysis to analyze the overlap between multiple leading-edge subsets.
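
The Signal statistic can be checked by hand. The sketch below is illustrative only and is not part of GSAASeqGP; it assumes Signal is defined as Tags × (1 - List) × N / (N - Nh), as for the analogous statistic in GSEA, with Tags and List given as fractions between 0 and 1. The function name signal_strength is hypothetical.

```python
def signal_strength(tags, list_fraction, n_genes, n_set):
    """Combine the Tags and List statistics into the leading-edge Signal.

    tags          -- fraction of gene set hits in the leading edge (0..1)
    list_fraction -- fraction of the ranked list covered by the leading edge (0..1)
    n_genes       -- N, the number of genes in the ranked list
    n_set         -- Nh, the number of genes in the gene set

    Assumed formula: Tags * (1 - List) * N / (N - Nh).
    """
    if not 0 < n_set < n_genes:
        raise ValueError("expected 0 < Nh < N")
    return tags * (1.0 - list_fraction) * n_genes / (n_genes - n_set)


# A 100-gene set that occupies the first 100 of 10000 ranked positions:
# every hit is in the leading edge (Tags = 1.0) and List = 100 / 10000.
concentrated = signal_strength(1.0, 100 / 10000, 10000, 100)

# The same set with only half of its hits inside the first half of the list.
spread = signal_strength(0.5, 0.5, 10000, 100)

print(round(concentrated, 6), round(spread, 6))
```

The first case reaches the maximal signal of 100%, while the second falls to roughly 25%, which matches the behavior described for concentrated versus spread-out gene sets.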


Gene Set Details Report

From the Detailed Association Results table, click the Details link for a gene set to display a Gene Set Details report that contains the following:

  • A table showing the GSAASeqGP results for this gene set. The fields in this table are similar to those in the Detailed Association Results.
  • An association plot for this gene set, as described in Association Score (AS).
  • A table of genes in the gene set ordered by their position in the ranked list of genes. The analysis includes only those genes in the gene set that are also in the expression dataset. To display the table in Excel, click the plain text format link in the table header.


    • GENE. Gene name.
    • RANK IN GENE LIST. Position of the gene in the ranked list of genes.
    • RANK METRIC SCORE. Score used to position the gene in the ranked list.
    • RUNNING AS. Running association score; that is, the association score at this point in the ranked list of genes.
    • CORE ASSOCIATION. Genes with a Yes value in this column contribute to the leading-edge subset within the gene set. This is the subset of genes that contributes most to the association result. Use the Leading Edge analysis to analyze the overlap between multiple leading-edge subsets.
  • A histogram of the association scores for all permutations.
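
The relationship between RANK IN GENE LIST, RANK AT MAX, and the CORE ASSOCIATION column can be sketched as follows. This is an illustration rather than the GSAASeqGP implementation: it assumes 1-based ranks, assumes the gene at the peak position itself belongs to the leading edge, and uses hypothetical gene names and a hypothetical function name leading_edge.

```python
def leading_edge(gene_ranks, rank_at_max, score, n_genes):
    """Return (core_genes, tags, list_fraction) for one gene set.

    gene_ranks  -- dict mapping gene name to its position in the ranked
                   list (assumed 1-based)
    rank_at_max -- position of the peak running association score
    score       -- association score (AS); only its sign is used
    n_genes     -- N, the total number of genes in the ranked list
    """
    if score >= 0:
        # Positive AS: hits at or before the peak form the leading edge.
        core = [g for g, r in gene_ranks.items() if r <= rank_at_max]
        list_fraction = rank_at_max / n_genes
    else:
        # Negative AS: hits at or after the peak form the leading edge.
        core = [g for g, r in gene_ranks.items() if r >= rank_at_max]
        list_fraction = (n_genes - rank_at_max + 1) / n_genes
    core.sort(key=gene_ranks.get)
    tags = len(core) / len(gene_ranks)
    return core, tags, list_fraction


# Hypothetical gene set of four genes in a ranked list of 10000 genes.
ranks = {"GENE_A": 3, "GENE_B": 40, "GENE_C": 120, "GENE_D": 7800}
core, tags, list_fraction = leading_edge(ranks, rank_at_max=120, score=0.62, n_genes=10000)
print(core, tags, list_fraction)
```

Genes returned in core correspond to rows that would show Yes in the CORE ASSOCIATION column under these assumptions.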


Running GSAASeqGP from the Command Line


Syntax


To run GSAASeqGP from the command line, use a java command of the form:

java -cp full-path/GSAA.jar -Xmx5000m gsaa-tool parameters
  • -cp Points the CLASSPATH variable to the complete path of the GSAA.jar file. You do not need to set any other CLASSPATH variables.
  • -Xmx5000m Specifies the maximum amount of memory available to Java (here, 5000 MB).
  • gsaa-tool Specifies the analysis to use. For GSAASeqGP, use xtools.gsea.GsaaSeqGP.
  • parameters Specifies the analysis parameters. To find the parameters for an analysis, open the GSAASeqGP application, display the page that runs the analysis, enter the parameters that you want to use, and click the Command button at the bottom of the page. GSAASeqGP displays the command line used to run the analysis. If you omit a parameter, GSAASeqGP uses the default value as displayed in the GSAASeqGP application.
    • Paths to file names must be fully specified or relative to the execution directory. When creating batch files, you generally want to use full path names for all files.
    • File names are platform-specific and may require editing. For example, on Windows, a file name that contains spaces must be enclosed in quotation marks.
    • Files cannot be directly accessed from the GSEA ftp site. Download the desired gene set or array annotations files from the GSEA web site (http://www.broad.mit.edu/gsea/downloads.jsp) and reference the downloaded files in the command line.
    • Parameter values cannot include hyphens (-); therefore, file names cannot include hyphens. If necessary, change hyphens to underscores. For example, you cannot use -res my-dataset.gct, but must use -res my_dataset.gct instead.
    Optionally, use the -param_file parameter to specify a parameter file, which can contain any parameter except -param_file. If you specify the same parameter on the command line and in the parameter file, the value on the command line takes precedence. A parameter file is a text file that defines one parameter per line. Each line contains a parameter name (without the initial hyphen), a tab (not spaces), and the parameter value.
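
A parameter file in the format described above can be generated with a short script. The following is a minimal sketch, not a tool shipped with GSAASeqGP; the function name write_param_file and the file name gsaaseqgp_params.txt are hypothetical, and the parameter values are taken from the examples in this guide.

```python
import os
import tempfile


def write_param_file(path, params):
    """Write one 'name<TAB>value' line per parameter.

    Parameter names are written without the initial hyphen, and
    param_file itself is rejected because a parameter file cannot
    contain it.
    """
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for name, value in params.items():
            name = name.lstrip("-")
            if name == "param_file":
                raise ValueError("a parameter file cannot contain param_file")
            value = str(value)
            if "\t" in value or "\n" in value:
                raise ValueError(f"value for {name} contains a tab or newline")
            handle.write(f"{name}\t{value}\n")


# Write a small parameter file to a temporary directory and show its content.
with tempfile.TemporaryDirectory() as folder:
    param_path = os.path.join(folder, "gsaaseqgp_params.txt")
    write_param_file(param_path, {"nperm": 2000, "-set_max": 500, "set_min": 15})
    with open(param_path, encoding="utf-8") as handle:
        print(handle.read(), end="")
```

The resulting file would then be passed on the command line as -param_file followed by its full path; any parameter repeated on the command line overrides the value in the file.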
 

Parameters


The table below lists the command line options and their corresponding names in the graphical user interface (GUI).

Command line option    GUI name
-gmx                   Gene sets database
-nperm                 Number of permutations
-exp_file              Gene expression dataset
-exp_template_file     Expression phenotype labels
-permute               Permutation type
-rpt_label             Analysis name
-demetric              Metric for differential expression analysis
-gsametric             Metric for gene set association analysis
-scoring_scheme        Association statistic
-set_max               exclude larger sets
-set_min               exclude smaller sets
-out                   save results in this folder
-rnd_type              Randomization mode
-norm                  Normalization mode
-make_sets             Make detailed gene set report
-plot_top_x            Plot graphs for the top sets of each phenotype
-rnd_seed              Seed for permutation
-save_rnd_lists        Save random ranked lists
-zip_report            Make a zipped file with all reports
-gui                   graphical user interface


Examples


1. The following command line uses the in-house gene set collection in GSAASeqGP. Click HERE to see a list of available gene sets databases.

java -cp /home/tcga/program/GSAA.jar -Xmx10000m xtools.gsea.GsaaSeqGP -gmx Genesets:MSigDB.c2.cp.v5.0.symbols.gmt -nperm 2000 -exp_file /home/tcga/data/gbm_tcga_rnaseq.gct -exp_template_file /home/tcga/data/gbm_tcga_rnaseq.cls -gsametric KS -demetric edgeR -permute gene_set -rnd_type no_balance -scoring_scheme weighted -norm MeanDiv -rpt_label gsaaseqgp_gbm_tcga_c2.cp.v5.0 -make_sets true -plot_top_x 20 -rnd_seed timestamp -save_rnd_lists false -set_max 500 -set_min 15 -zip_report false -out /home/tcga/result/gsaaseqgp -gui false

2. The following command line assumes that you supply your own gene sets database file.

java -cp /home/tcga/program/GSAA.jar -Xmx10000m xtools.gsea.GsaaSeqGP -gmx /home/tcga/data/c2.cp.v5.0.symbols.gmt -nperm 2000 -exp_file /home/tcga/data/gbm_tcga_rnaseq.gct -exp_template_file /home/tcga/data/gbm_tcga_rnaseq.cls -gsametric KS -demetric edgeR -permute gene_set -rnd_type no_balance -scoring_scheme weighted -norm MeanDiv -rpt_label gsaaseqgp_gbm_tcga_c2.cp.v5.0 -make_sets true -plot_top_x 20 -rnd_seed timestamp -save_rnd_lists false -set_max 500 -set_min 15 -zip_report false -out /home/tcga/result/gsaaseqgp -gui false
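
Long command lines such as the ones above are easier to maintain when assembled from a table of parameters. The sketch below is one possible approach and is not part of GSAASeqGP; the function name build_command is hypothetical. It reuses the paths and values from example 2 and enforces the rule that parameter values cannot contain hyphens.

```python
import shlex


def build_command(jar_path, params, max_heap="10000m"):
    """Return the java argument list for a GSAASeqGP run.

    jar_path -- full path to GSAA.jar
    params   -- dict mapping option names (without the hyphen) to values
    max_heap -- value passed to -Xmx
    """
    args = ["java", "-cp", jar_path, f"-Xmx{max_heap}", "xtools.gsea.GsaaSeqGP"]
    for name, value in params.items():
        # Booleans are written as true/false, as in the examples above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        if "-" in text:
            raise ValueError(f"value for -{name} contains a hyphen: {text}")
        args += [f"-{name}", text]
    return args


command = build_command(
    "/home/tcga/program/GSAA.jar",
    {
        "gmx": "/home/tcga/data/c2.cp.v5.0.symbols.gmt",
        "nperm": 2000,
        "exp_file": "/home/tcga/data/gbm_tcga_rnaseq.gct",
        "exp_template_file": "/home/tcga/data/gbm_tcga_rnaseq.cls",
        "gsametric": "KS",
        "demetric": "edgeR",
        "set_max": 500,
        "set_min": 15,
        "out": "/home/tcga/result/gsaaseqgp",
        "gui": False,
    },
)

# Print a shell-quoted command line (shlex.join requires Python 3.8+).
# To launch the run instead, pass the list to subprocess.run(command, check=True).
print(shlex.join(command))
```

Parameters left out of the table fall back to the defaults shown in the GSAASeqGP application, as described in the Syntax section.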

Furey Lab | Mukherjee Lab | Department of Genetics | The University of North Carolina at Chapel Hill
Last updated: September 12, 2015
Copyright © 2012 UNC-CH