Please note that when you open an Illumina FASTQ file, the first few thousand reads are expected to be of comparatively low quality and to frequently contain “N”s. An “N” means that the Illumina software was unable to make a base call at that position. The reads at the beginning and end of a sequence data file originate from the edges of the flow cell, where imaging is more difficult, so these reads show below-average quality. In many cases it is best to ignore the first and last 100,000 reads of an Illumina dataset, since they are not representative of the data as a whole.
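As an illustration of setting aside the reads at both ends of a file, here is a minimal Python sketch. It assumes a standard FASTQ layout of four lines per read and uses only the standard library. The function names (open_fastq, read_records, skip_edges, fraction_with_n) are hypothetical helpers invented for this example; they are not part of any Illumina software or other tool.

```python
import gzip
import itertools
from collections import deque


def open_fastq(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_records(handle):
    """Yield (header, sequence, plus, quality) tuples.

    Assumes each read occupies exactly four lines.
    """
    while True:
        record = [handle.readline().rstrip("\n") for _ in range(4)]
        if not record[0]:  # end of file (or a blank line where a header should be)
            return
        yield tuple(record)


def skip_edges(records, skip=100_000):
    """Yield records, dropping the first and the last `skip` of them."""
    # Drop the first `skip` records lazily.
    body = itertools.islice(records, skip, None)
    # Hold back up to `skip` records; whatever is still buffered
    # when the input ends is the tail that gets discarded.
    buffer = deque()
    for record in body:
        buffer.append(record)
        if len(buffer) > skip:
            yield buffer.popleft()


def fraction_with_n(records):
    """Return the fraction of reads whose sequence contains at least one 'N'."""
    total = 0
    with_n = 0
    for _header, sequence, _plus, _quality in records:
        total += 1
        if "N" in sequence:
            with_n += 1
    return with_n / total if total else 0.0
```

For example, with a hypothetical file name, `fraction_with_n(skip_edges(read_records(open_fastq("sample_R1.fastq.gz"))))` would report the share of N-containing reads in the central part of the file. Because the reads are streamed, memory use is bounded by the size of the skipped tail rather than by the size of the whole file.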
To verify the data quality of the entire dataset, we recommend running a program such as FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). FastQC summarizes the data quality and runs several other analyses on the data. Please note that these additional analyses presuppose that the data were generated by whole-genome shotgun sequencing, so they will usually produce warnings that are not applicable to other data types.
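For readers who script their quality checks, the following is a rough Python sketch of calling FastQC from a script. It assumes the fastqc executable is installed and on the PATH. The helper names (build_fastqc_command, run_fastqc) and the file and directory names in the usage note are hypothetical. The --outdir and --threads options are FastQC command-line options, but check `fastqc --help` for your installed version.

```python
import shutil
import subprocess
from pathlib import Path


def build_fastqc_command(fastq_files, outdir, threads=1):
    """Assemble the FastQC command line as a list of arguments."""
    return [
        "fastqc",
        "--outdir", str(outdir),
        "--threads", str(threads),
        *[str(f) for f in fastq_files],
    ]


def run_fastqc(fastq_files, outdir, threads=1):
    """Run FastQC on the given files and write the reports to `outdir`."""
    if shutil.which("fastqc") is None:
        raise FileNotFoundError("fastqc executable not found on PATH")
    # Create the output directory up front rather than relying on FastQC to do it.
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_fastqc_command(fastq_files, outdir, threads), check=True)
```

A call such as `run_fastqc(["sample_R1.fastq.gz"], "fastqc_reports", threads=2)` would then write the report files for that input into the fastqc_reports directory.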