Where can I find the UMIs in the Tag-Seq data? When and how should I trim my Tag-Seq data? What is the low complexity stretch in the Tag-Seq data?

By default, the Tag-Seq and Batch-Tag-Seq gene expression profiling data we generate incorporate Unique Molecular Identifiers (UMIs) in the sequence reads.

(This FAQ provides information on the usage of UMIs: https://dnatech.genomecenter.ucdavis.edu/faqs/should-i-remove-pcr-duplicates-from-my-rna-seq-data/).
Please note that UMIs enable additional, optional analysis steps; for many applications, the UMI information can be safely ignored. UMIs are especially beneficial for low RNA input samples as well as for ultra-deep sequencing.

The UMI occupies the first 6 bases of the read. It is followed by a constant 4 bp spacer (TATA), then by the 12 bp random priming sequence of the QuantSeq kit, and finally by the insert sequence. Together, the UMI, spacer, and random priming sequence span the first 22 bases of each read.
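The read layout described above can be illustrated with a minimal Python sketch. The function and class names (`split_tagseq_read`, `TagSeqRead`) are hypothetical and chosen only for this example; the segment lengths are the ones stated in this FAQ.

```python
from typing import NamedTuple

# Segment lengths as described in this FAQ.
UMI_LEN = 6       # Unique Molecular Identifier
SPACER_LEN = 4    # constant spacer (TATA)
PRIMER_LEN = 12   # QuantSeq random priming sequence


class TagSeqRead(NamedTuple):
    """Hypothetical container for the four parts of a Tag-Seq read."""
    umi: str
    spacer: str
    random_primer: str
    insert: str


def split_tagseq_read(seq: str) -> TagSeqRead:
    """Split a read sequence into UMI, spacer, random primer, and insert."""
    prefix_len = UMI_LEN + SPACER_LEN + PRIMER_LEN  # 22 bases in total
    if len(seq) < prefix_len:
        raise ValueError(f"read is shorter than the {prefix_len} bp prefix")
    spacer_start = UMI_LEN
    primer_start = spacer_start + SPACER_LEN
    return TagSeqRead(
        umi=seq[:spacer_start],
        spacer=seq[spacer_start:primer_start],
        random_primer=seq[primer_start:prefix_len],
        insert=seq[prefix_len:],
    )
```

For example, `split_tagseq_read("ACGTACTATAGGGGCCCCAAAATTTTGGGG").umi` returns `"ACGTAC"`, and the `insert` field holds the last 8 bases.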

If you are using a local aligner (such as STAR, most other RNA-seq aligners, or BWA-MEM), you do not need to trim these sequences, because such aligners can soft-clip read ends that do not match the reference.

If you are using a global aligner, transfer the 6 bp UMI into the read header and then trim the first 22 bases from the read. As with any other RNA-seq data, you should also trim any poly-A stretches (expected at the 3′ end) for global aligners. Poly-A stretches can be followed by the TruSeq adapter sequence “AGATCGGAAGAGCACACGTCTGAACTCCAGTCA”.
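To make the 3′ end trimming concrete, here is a simplified Python sketch. It is not a replacement for a dedicated trimming tool: it only finds exact matches to the start of the adapter and does not tolerate sequencing errors. The function name `trim_3prime` and the thresholds `min_poly_a` and `min_adapter_match` are assumptions made for this example.

```python
import re
from typing import Tuple

# TruSeq adapter sequence quoted in this FAQ.
TRUSEQ_ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"


def trim_3prime(seq: str, qual: str,
                min_poly_a: int = 10,
                min_adapter_match: int = 12) -> Tuple[str, str]:
    """Remove an exact-match adapter and a trailing poly-A stretch.

    min_poly_a and min_adapter_match are illustrative thresholds,
    not recommended values.
    """
    # Step 1: cut the read at the first exact match to the adapter start.
    adapter_pos = seq.find(TRUSEQ_ADAPTER[:min_adapter_match])
    if adapter_pos != -1:
        seq, qual = seq[:adapter_pos], qual[:adapter_pos]

    # Step 2: remove a run of at least min_poly_a A's at the new 3' end.
    match = re.search(r"A{%d,}$" % min_poly_a, seq)
    if match:
        seq, qual = seq[:match.start()], qual[:match.start()]
    return seq, qual
```

The adapter is removed first so that a poly-A stretch sitting directly in front of it ends up at the 3′ end of the read, where the second step can find it.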

We recommend transferring the UMI sequence information to the read header with UMI-tools, fastp, or custom scripts.
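As an illustration of what a custom script could look like, the following Python sketch moves the UMI into the read header and trims the first 22 bases. The function names are hypothetical. Appending the UMI to the read name after an underscore is an assumption made here to keep the header layout simple; check what your downstream deduplication tool expects.

```python
from typing import Iterable, Iterator, Optional, Tuple

UMI_LEN = 6    # UMI length stated in this FAQ
TRIM_LEN = 22  # 6 bp UMI + 4 bp spacer + 12 bp random priming sequence

Record = Tuple[str, str, str]  # (header, sequence, quality)


def move_umi_to_header(header: str, seq: str, qual: str) -> Optional[Record]:
    """Append the UMI to the read name and trim the 22 bp prefix.

    Returns None when no insert bases would remain after trimming.
    """
    if len(seq) <= TRIM_LEN:
        return None
    umi = seq[:UMI_LEN]
    # Keep any header comment (text after the first space) unchanged.
    name, sep, comment = header.partition(" ")
    return f"{name}_{umi}{sep}{comment}", seq[TRIM_LEN:], qual[TRIM_LEN:]


def process_fastq(lines: Iterable[str]) -> Iterator[str]:
    """Yield output FASTQ lines (without newlines) for 4-line input records."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times to read one record per step.
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        result = move_umi_to_header(header, seq, qual)
        if result is None:
            continue  # skip reads that are too short
        new_header, new_seq, new_qual = result
        yield from (new_header, new_seq, "+", new_qual)
```

For gzip-compressed files, `process_fastq` can be fed a handle opened with `gzip.open(path, "rt")`, and each yielded line can be written out followed by a newline.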

Tag-Seq data are strand-specific and have a “sense-strand” orientation. You can find more information on the sequencing kit (QuantSeq FWD) and protocol here.

Please note that a FastQC report of Tag-Seq data will clearly show a low-complexity stretch at bases 7 to 10 (sequence TATA). This is the spacer between the UMI and the random priming sequence, and it is expected.
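If you would like to confirm that this low-complexity stretch is the expected spacer, a quick check such as the Python sketch below can report the fraction of reads carrying TATA at bases 7 to 10. The function name `spacer_fraction` is hypothetical; the spacer sequence and position are those given in this FAQ.

```python
from typing import Iterable


def spacer_fraction(seqs: Iterable[str],
                    spacer: str = "TATA",
                    start: int = 6) -> float:
    """Return the fraction of reads with the spacer at bases 7 to 10.

    start is 0-based, so start=6 corresponds to base 7 of the read.
    """
    total = 0
    hits = 0
    for seq in seqs:
        total += 1
        if seq[start:start + len(spacer)] == spacer:
            hits += 1
    return hits / total if total else 0.0
```

A value close to 1.0 is what the read structure described above would predict; a much lower value may indicate that the data were already trimmed or come from a different library type.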

