Please note that when you open an Illumina FASTQ file, the first several thousand reads are expected to be of comparatively low quality and will frequently contain “N”s. An “N” means that the Illumina software was not able to make a base call at that position. The reads at the beginning and the end of the sequence data files originate from the edges of the flowcell, where imaging is more difficult, so these reads consistently show below-average quality. In many cases it is therefore best to ignore the first and last 100,000 reads of an Illumina dataset, since they are not representative of the data as a whole.
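As an illustration of the advice above, the following is a minimal Python sketch, not a tool provided with the data. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequence lines). The function names `read_fastq`, `trim_ends`, and `fraction_with_n` are hypothetical helpers introduced only for this example. The sketch drops a chosen number of reads from both ends of a file and reports the fraction of the remaining reads that contain an “N”.

```python
from collections import deque
from itertools import islice


def read_fastq(lines):
    """Yield (header, sequence, separator, quality) tuples.

    Assumption: every record occupies exactly four lines.
    """
    it = iter(lines)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(it, 4))
        if not record:
            return  # end of input
        if len(record) != 4 or not record[0].startswith("@"):
            raise ValueError("malformed FASTQ record")
        yield record


def trim_ends(records, skip):
    """Yield records, omitting the first and the last `skip` of them.

    The input is streamed; only `skip` records are held in memory.
    """
    it = iter(records)
    # Discard the first `skip` records.
    for _ in islice(it, skip):
        pass
    # Hold back `skip` records so the final ones are never emitted.
    held = deque()
    for record in it:
        held.append(record)
        if len(held) > skip:
            yield held.popleft()


def fraction_with_n(records):
    """Return the fraction of records whose sequence contains an 'N'."""
    total = 0
    with_n = 0
    for _header, sequence, _sep, _quality in records:
        total += 1
        if "N" in sequence:
            with_n += 1
    return with_n / total if total else 0.0
```

For a gzip-compressed file, the functions could be combined as `fraction_with_n(trim_ends(read_fastq(gzip.open(path, "rt")), 100_000))`, where `path` is a placeholder for your own file and `gzip` is the Python standard library module. If the file holds fewer than twice the skipped number of reads, nothing remains and the result is 0.0.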
To verify the data quality of the entire dataset, it is recommended to run a program like FASTQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/). FASTQC will summarize the data quality for you and run multiple other analyses on your data. Please note that these additional analyses expect that the data were generated by whole-genome shotgun sequencing; for other data types they will usually produce many warnings that are not applicable.