Sequence Data: RTA, SLIMS and Data Storage
Data analysis on the HiSeq2500 has improved greatly over the last couple of years, but the basic procedure has remained constant. The TIFF images acquired by the instrument’s camera are analyzed to identify the clusters and to assign fluorescent intensities to the clusters at each cycle. Following the computationally intensive image analysis, the fluorescent intensities are converted to nucleotide calls during the basecalling step, and quality scores are assigned to the basecalls at this point. Finally, the sequence files are collated, aligned to reference genomes, and exported in various formats. Since early 2009, image analysis and basecalling have taken place automatically and in real time on the computer that runs the sequencer; this is known as Real Time Analysis (RTA).
The image analysis and basecalling steps are automatic, although it still is important for the software to “tune” the analysis to a spectrally neutral sample. This is one reason we spike-in (add a low percentage of) PhiX to every lane of the flow cell. In contrast, the alignment function has a number of parameters that can be controlled by the user to provide more in-depth information about the generated sequence data. This occurs through modules within the CASAVA software package. We recommend that the interested user download the CASAVA 1.8 user guide (3.6 MB download) and CASAVA 1.8 changes to read about the different options. You can also download this condensed “cheat sheet” that summarizes some of the information. A particularly relevant part of the manual for all users is a description of the output files generated. These files are the raw material for your subsequent analyses, and will be the first set of files you see on SLIMS (see below) so it’s critical to understand what’s in them.
Other modules within CASAVA can be used to recognize indels, splice junctions, and other features while providing enhanced data display functionalities. We don’t run these other analyses in the Core, but the CASAVA software is freely available (you will need Linux help or knowledge), and we can provide you with access to GenomeStudio, so these are low-cost options if you want to try some sequence analysis on your own.
For the standard Core service option, we run the most basic alignment algorithm against the PhiX genome to ensure overall sequencing quality. If you require CASAVA demultiplexing, please email firstname.lastname@example.org to provide sample names and barcode sequences electronically, in an Excel file format.
Data Download: SLIMS
Following analysis of each run, users have access to parsed output through the Solexa LIMS (SLIMS), created in conjunction with the Bioinformatics Core (to get some idea of the look of this interface, log in here and use ‘email@example.com’ as the email and ‘slimsdemo’ as the password). A SLIMS account will be created for you on your first run, and information about how to set up and access your account will be sent to you via email. The main SLIMS page can be reached here.
The expected range for the MiSeq using v2 chemistry is 12-17 million reads passing filter, and 20-25 million reads with v3 chemistry. For the HiSeq2500, the expected range for rapid runs (2 lane flow cells) is 120-180 million reads passing filter per lane, and 150-200 million reads per lane in high output mode (8 lane flow cells). The total base output for the lane depends on the number of cycles for that run, and on whether it is a single or paired end read. For example, a PE300 run with 25 million good reads gives 15 Gb/lane (25 million x 2 x 300 bases), and a PE100 run with 150 million reads gives 30 Gb/lane (150 million x 2 x 100 bases), enough data for more than one sample per lane in most cases. Basecalling and demultiplexing are provided as part of the normal service; all other analyses, whether run through us or the Bioinformatics Core, are subject to additional charges. Users should be prepared to store their own sequence data. We will store processed data free of charge for one month. After this time data will be deleted unless you indicate otherwise on the sample submission form or via communication with us.
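The yield arithmetic in the examples above can be sketched in a few lines of Python. This is only an illustration: the function name `lane_yield_gb` and its parameters are made up for this example, and it assumes (as the examples above do) that a "read" passing filter in a paired end run contributes two sequences of the stated length.

```python
def lane_yield_gb(reads_passing_filter, read_length, paired_end=True):
    """Estimate the base output of one lane in gigabases (Gb).

    Assumption: for paired end runs each read passing filter is
    sequenced from both ends, so it yields 2 x read_length bases.
    """
    ends = 2 if paired_end else 1
    total_bases = reads_passing_filter * read_length * ends
    return total_bases / 1e9


# The two examples from the text above.
print(lane_yield_gb(25_000_000, 300))   # PE300 run, 25 million reads
print(lane_yield_gb(150_000_000, 100))  # PE100 run, 150 million reads
```

Dividing the result by the number of samples pooled in a lane gives a rough per-sample estimate, assuming the samples are evenly balanced.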
For the UNIX-savvy user it is possible to use rsync to acquire all your non-image data directly. Please go here to read how to do this and to find additional information about Bioshare. The Bioinformatics Core is continually gaining expertise in manipulating and analyzing these large sequence data sets, so for any assistance downstream of the initial basecalling these are the people to talk to.
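As a rough sketch of what such an rsync transfer might look like when scripted, the Python snippet below only assembles and prints the command line. The remote location and the local directory are placeholders, not the real Bioshare addresses; please take the actual host and path from the instructions linked above. The helper name `build_rsync_command` is likewise hypothetical.

```python
import shlex
from typing import List


def build_rsync_command(remote: str, local_dir: str, dry_run: bool = True) -> List[str]:
    """Assemble an rsync command that copies a remote run folder locally.

    -a keeps the directory structure and timestamps, -v lists the files,
    --partial keeps partially transferred files so an interrupted
    download can be resumed, and --dry-run only reports what would be
    copied without transferring anything.
    """
    cmd = ["rsync", "-a", "-v", "--partial"]
    if dry_run:
        cmd.append("--dry-run")
    cmd.extend([remote, local_dir])
    return cmd


# Placeholder values (assumptions, replace with your own details).
command = build_rsync_command("user@server.example.org:/path/to/run/", "./my_run/")
print(shlex.join(command))
```

Starting with a dry run is a cheap way to check the file list and the available disk space before committing to a transfer of many gigabytes. The printed command can be pasted into a terminal, or the list can be passed to `subprocess.run` once the placeholders have been replaced.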
The Genome Center Bioinformatics Core
Our neighbors from the Bioinformatics Core carry out sequence data analysis, statistical evaluations, consulting, and training to help you get the most out of your data.
Please contact us for complete analysis packages including sequencing and bioinformatics (e.g. differential gene expression, pathway analyses, variant calling) as well as for joint consultations with us and the Bioinformatics Core staff.
Please consider a consultation especially during the planning stage of a new project (free of charge).
Other Analysis Options / Bioinformatics Tools
For the do-it-yourselfer, many open-source tool options exist to carry out analyses on Illumina sequence data. The vast majority of these tools require the use of the command line and of UNIX/Linux operating systems. One exception is the Unipro UGENE suite of tools (http://ugene.net/), available for Windows, Mac, and Linux, which is operated through a graphical user interface. GALAXY is a web-based platform with a simple user interface that can help with many routine workflows. For limited data sets, the USEGALAXY service allows you to analyze data on their servers. Please note that the Bioinformatics Core offers GALAXY workshops and that a UC Davis GALAXY server is in beta-testing.
Worth mentioning is the BBtools suite of command line programs, which is developing into a “Swiss Army knife” of bioinformatics. All of its tools are fast and provide comprehensive reports and statistics on the resulting data.
The Small-RNA Workbench tool suite is very helpful for miRNA data analysis.
The section below is not intended to provide a comprehensive list of sites offering such software, but instead presents some of the tools developed and used by researchers here at the Genome Center:
- The UC Davis Bioinformatics Core has written a suite of open-source bioinformatics software, freely available here, which enables adapter trimming, QC of sequencing data, and demultiplexing.
- The Michelmore lab has developed a series of tools for processing and visualizing Illumina data. The site to visit is:
- The Comai lab Barcoded-Data-Preparation-Tools are very helpful for all kinds of de-multiplexing (both in-line barcodes and index-read barcodes) and read-filtering and trimming tasks.