Modern Illumina Files

Each sample's files are in a directory named with the structure: Sample_[SAMPLENAME].

Many of the files will start with the file label: [SAMPLENAME]_[BARCODE]_L00[#]
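As a rough illustration of the file label structure, the sketch below splits a label into its parts by working from the right-hand end, since sample names may themselves contain underscores. The function name parse_file_label is hypothetical, and the sketch assumes the barcode contains no underscores and the lane field looks like L00[#].

```bash
# Split a file label of the form [SAMPLENAME]_[BARCODE]_L00[#] into its parts.
# Assumption: the barcode has no underscores; the sample name may have some.
parse_file_label() {
    local label="$1"
    local lane="${label##*_}"      # last underscore-separated piece, e.g. L004
    local rest="${label%_*}"       # everything before the lane
    local barcode="${rest##*_}"    # piece just before the lane
    local sample="${rest%_*}"      # whatever remains is the sample name
    echo "sample=$sample barcode=$barcode lane=${lane#L00}"
}

parse_file_label D14_WT_IgG_K4me3_IP_TGACCA_L004
# prints: sample=D14_WT_IgG_K4me3_IP barcode=TGACCA lane=4
```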

Overview of notable files in sample directories

[FILELABEL]_R[READNUM]_[FILENUMBER].fastq.gz
These are gzipped FASTQ files. The read number can be 1 or 2 and denotes which read the tags come from; paired end flowcells will have both read numbers. The file number is three digits and starts counting from 001. All paired end flowcells will have the same number of FASTQ files for read 1 and read 2.
filtered_[FILELABEL]_R[READNUM].fastq.gz
These are FASTQ files taken from the above files, collated, and stripped of reads not passing filter. Paired end flowcells will have two files, one for each read.
[FILELABEL].sorted.bam
This is a sorted BAM file containing all of the reads, aligned using ELAND.
[FILELABEL].uniques.sorted.bam
This is a sorted BAM file containing only uniquely mapping reads that pass filters, contain no Ns, and have no more than 2 mismatches. It is filtered from [FILELABEL].sorted.bam.
[FILELABEL].uniques.sorted.bam.bai
This is the index to the uniques bamfile, allowing random access.
[FILELABEL]_spot.txt
This file contains a SPOT score--the percentage of uniquely mapping tags in hotspots.
[FILELABEL]_spotdups.txt
Contains the duplication metrics calculated by Picard on the same randomly selected set of tags used by the SPOT score.
[FILELABEL]_uniques.bed.starch
This is a compressed BED file of the tags in the uniques BAM file; it might not be present if you have not specifically requested BED input. It can be uncompressed using the freely available unstarch program in the bedops toolset.
[FILELABEL]_75_20.[GENOME].bw
This is a density file in the bigwig format, suitable for use in a UCSC browser. This is the currently generated density file format, as it takes up less space than .wig files.
[FILELABEL]_75_20.[GENOME].wig
This is a density file in the wig format, suitable for use in a UCSC browser; there will also be a matching .wib file. Older data may have this format.
[FILELABEL].tagcounts.txt
This file lists the different tag counts for the sample; the count labels are defined under Tag count categories.
[FILELABEL]_R[READNUM]_fastqc
This is a directory produced by FastQC, a raw sequence quality control checker. You can read their help manual to get an idea of how to interpret the data. They have examples of a bad sequence report and a good sequence report.

Illumina Documentation

Because all samples coming from an Illumina sequencer are processed with some Illumina software, even if it's only getting the FASTQ data, here is the documentation for these packages:

Off-Line Basecaller v1.9.4
While we do not use the OLB for most flowcells, as basecalling is generally done on the machine, the manual for the OLB gives guidelines as to what happens for calibration and filtering.
CASAVA 1.8.2 Documentation
If you have results aligned with Illumina's aligner ELAND, this document could help with understanding the background of how the aligner functions.

Overview of file formats

FASTQ format

Each entry in Illumina's FASTQ files spans four lines:

  1. The sequence identifier, starting with @
  2. The sequence
  3. A + indicating the quality score identifier line
  4. The quality score -- Sanger format (Phred+33)
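To make the Phred+33 encoding concrete, here is a minimal sketch that turns one quality character into its numeric score by subtracting 33 from the character's ASCII code. The function name phred33_score is hypothetical.

```bash
# Decode one Phred+33 quality character into its numeric score.
phred33_score() {
    local char="$1"
    local code
    # printf with a leading single quote yields the character's ASCII code.
    code=$(printf '%d' "'$char")
    echo $(( code - 33 ))
}

phred33_score 'B'   # B is ASCII 66, so the score is 33
phred33_score '?'   # ? is ASCII 63, so the score is 30
```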

The sequence identifier is in the following format:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>

Element            Requirements                                    Description
<instrument>       Characters allowed: a-z, A-Z, 0-9, underscore   Instrument ID
<run number>       Numerical                                       Run number on instrument
<flowcell ID>      Characters allowed: a-z, A-Z, 0-9               The flowcell label--useful when writing in to us
<lane>             Numerical                                       Lane number
<tile>             Numerical                                       Tile number
<x-pos>            Numerical                                       X coordinate of cluster on the tile
<y-pos>            Numerical                                       Y coordinate of cluster on the tile
<read>             Numerical                                       Read number, either 1 or 2 (for paired end flowcells)
<is filtered>      Y or N                                          Y if the read is filtered, N otherwise. If the read is filtered, it should not be used.
<control number>   Numerical                                       0 when none of the control bits are on, otherwise it is an even number
<index sequence>   ACTG                                            The barcode sequence; empty if this was the only sample on a lane.

Here is Illumina's example read from the manual:

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
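As an illustration of the identifier layout, the following sketch pulls a few fields out of a sequence identifier by splitting on colons and the single space. The function name parse_fastq_header is hypothetical, and the sketch assumes the identifier follows the eleven-field layout described above.

```bash
# Split a FASTQ sequence identifier on ":" and " " and report selected fields.
# Field order: instrument, run, flowcell, lane, tile, x, y, read, filtered,
# control number, index sequence.
parse_fastq_header() {
    printf '%s\n' "${1#@}" | awk -F '[: ]' '{
        printf "flowcell=%s lane=%s tile=%s read=%s filtered=%s index=%s\n",
               $3, $4, $5, $8, $9, $11
    }'
}

parse_fastq_header '@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG'
# prints: flowcell=FC706VJ lane=2 tile=5 read=1 filtered=Y index=ATCACG
```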

Here is an example bash script to extract all the passing filter tags into one single file:

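# USAGE: bash filterfastq.bash D14_WT_IgG_K4me3_IP_TGACCA_L004_R1

for LANE in "$@"; do

    echo "FILTERING $LANE"

    for fastq in "$LANE"*.fastq.gz; do

        echo "FILTERING $fastq"

        # The third character of the identifier's second field is the
        # <is filtered> flag. N means the read passed the filter, so the
        # record is kept; Y means it was filtered, so the record is dropped.
        # Only the first field of each line is printed.
        zcat "$fastq" | \
            awk '{if (substr($2, 3, 1) == "N") {f=0; print $1} else if (substr($2, 3, 1) == "Y") {f=1} else if (f == 0) {print $1}}' \
            > "filtered_$(basename "$fastq" .gz)"

    done

    echo "COLLATING $LANE FASTQ"

    cat "filtered_$LANE"*.fastq | gzip -c > "filtered_$LANE.fastq.gz"
    rm "filtered_$LANE"*.fastq

done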

Paired End FASTQ

Remember that paired end sequencing produces matching FASTQ files, each containing the same number of sequences: [FILELABEL]_R1_[FILENUMBER].fastq.gz and [FILELABEL]_R2_[FILENUMBER].fastq.gz.
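One way to check that a read 1 file and its read 2 partner hold the same number of sequences is to count records in each, as in the sketch below. The function names count_fastq_records and check_pair are hypothetical, and the sketch assumes each record is exactly four lines.

```bash
# Count the records in a gzipped FASTQ file (four lines per record).
count_fastq_records() {
    local lines
    lines=$(gzip -dc "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Compare the record counts of a read 1 file and its read 2 partner.
check_pair() {
    local r1="$1"
    local r2="$2"
    local n1 n2
    n1=$(count_fastq_records "$r1")
    n2=$(count_fastq_records "$r2")
    if [ "$n1" -eq "$n2" ]; then
        echo "OK: $n1 reads in each file"
    else
        echo "MISMATCH: $r1 has $n1 reads, $r2 has $n2 reads"
        return 1
    fi
}

# Usage (file names are placeholders):
#   check_pair LABEL_R1_001.fastq.gz LABEL_R2_001.fastq.gz
```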

BED format

Our BED files are currently created using bamToBed on our BAM files. Older BED files might have fewer columns, but the first three required columns will always be present.

  1. chrom - The name of the chromosome (e.g. chr3, chrY, chr2_random) or scaffold (e.g. scaffold10671).
  2. chromStart - The starting position of the feature in the chromosome or scaffold. The first base in a chromosome is numbered 0.
  3. chromEnd - The ending position of the feature in the chromosome or scaffold. The chromEnd base is not included in the display of the feature. For example, the first 100 bases of a chromosome are defined as chromStart=0, chromEnd=100, and span the bases numbered 0-99.
  4. name - Defines the name of the BED line. This label is displayed to the left of the BED line in the Genome Browser window when the track is open to full display mode or directly to the left of the item in pack mode.
  5. mapping quality - taken from the BAM file, it represents −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer. NOTE: This differs from how UCSC's browser uses this column.
  6. strand - Defines the strand - either '+' or '-'.

Here is an example of BED lines:

chr1    10245   10280   HWI-ST700693_247:4:2208:18413:38502/1   31      +
chr1    10316   10351   HWI-ST700693_247:4:2313:1426:80330/1    12      -
chr1    13070   13105   HWI-ST700693_247:4:1311:20921:99196/1   32      +
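Because chromStart is 0-based and chromEnd is excluded, a tag's length is simply chromEnd minus chromStart. The sketch below uses that to append a length column while keeping only lines at or above a chosen mapping quality; the function name filter_bed_by_mapq and the threshold of 30 are illustrative choices, not part of our pipeline.

```bash
# Keep BED lines whose mapping quality (column 5) is at least the given
# minimum, and append the tag length (chromEnd - chromStart) as a 7th column.
filter_bed_by_mapq() {
    local min_mapq="$1"
    awk -v min="$min_mapq" 'BEGIN {OFS = "\t"}
        $5 >= min {print $1, $2, $3, $4, $5, $6, $3 - $2}'
}

# The three example lines above; only the first and third have quality >= 30,
# and both are 35 bases long.
filter_bed_by_mapq 30 <<'EOF'
chr1    10245   10280   HWI-ST700693_247:4:2208:18413:38502/1   31      +
chr1    10316   10351   HWI-ST700693_247:4:2313:1426:80330/1    12      -
chr1    13070   13105   HWI-ST700693_247:4:1311:20921:99196/1   32      +
EOF
```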

To save space, we have compressed our BED files using the starch program. They can be uncompressed using the freely available unstarch program in the bedops toolset. Our BED files have also been sorted using the sort-bed program, also in the bedops toolset.

BAM format

BAM is the compact, binary form of the SAM format. You can translate BAM files into SAM using samtools.

BAM files can be converted to BED files using bamToBed in the bedtools suite.

Unless otherwise specified or requested, BAM alignments are performed with ELAND.
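The conversions mentioned in this section can be wrapped roughly as follows. This is a sketch that assumes samtools and bedtools are installed and on the PATH; note that bedtools names the BAM-to-BED converter bamToBed. The wrapper names bam_to_sam and bam_to_bed are hypothetical.

```bash
# Convert a BAM file to SAM text, keeping the header lines (-h).
bam_to_sam() {
    samtools view -h "$1" > "$2"
}

# Convert BAM alignments to BED lines with bedtools.
bam_to_bed() {
    bamToBed -i "$1" > "$2"
}

# Usage (file names are placeholders):
#   bam_to_sam LABEL.uniques.sorted.bam LABEL.uniques.sorted.sam
#   bam_to_bed LABEL.uniques.sorted.bam LABEL.uniques.bed
```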

Tag count categories

The tag count file is formatted with a count label and then a number on each line. Definitions for the labels are below:

Label              Description
u                  uniquely matching
u-pf               uniquely matching and passing filter
u-pf-n             uniquely matching, passing filter, no Ns
u-pf-n-mm2         same as u-pf-n, but allows no more than 2 mismatches
u-pf-n-mm2-mito    same as u-pf-n-mm2, but also does not count matches to the mitochondrial chromosome
qc                 no matching done, QC failure
nm                 no match found
mm                 multiple matches
pf                 passes filter
total              total number of tags gathered

There are also counts for tags aligned to the individual chromosomes.
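As a sketch of how the tag count file might be read, the functions below look up a single label and express it as a percentage of the total. They assume each line holds a label and a number separated by whitespace; the function names tagcount and tagcount_percent are hypothetical.

```bash
# Print the count recorded for one label in a tag count file.
tagcount() {
    local label="$1"
    local file="$2"
    awk -v label="$label" '$1 == label {print $2}' "$file"
}

# Print a label's count as a percentage of "total", to one decimal place.
tagcount_percent() {
    local label="$1"
    local file="$2"
    awk -v label="$label" '
        $1 == label   {n = $2}
        $1 == "total" {total = $2}
        END { if (total > 0) printf "%.1f\n", 100 * n / total }
    ' "$file"
}

# Usage (file name is a placeholder):
#   tagcount_percent u-pf-n-mm2 LABEL.tagcounts.txt
```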

Illumina Export Files

Export files are deprecated in current CASAVA pipelines and are no longer available for new analyses, but older flowcells might have them as output.

The fields are as follows:

  1. Machine (Parsed from Run Folder name)
  2. Run Number (Parsed from Run Folder name)
  3. Lane
  4. Tile
  5. X Coordinate of cluster. As of RTA v1.6, OLB v1.6, and CASAVA v1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
  6. Y Coordinate of cluster. As of RTA v1.6, OLB v1.6, and CASAVA v1.6, the X and Y coordinates for each cluster are calculated in a way that makes sure the combination will be unique. The new coordinates are the old coordinates times 10, +1000, and then rounded.
  7. Index sequence or 0. For no indexing, or for a file that has not been demultiplexed yet, this field should have a value of 0.
  8. Read number (1 for single reads; 1 or 2 for paired ends or multiplexed single reads; 1, 2, or 3 for multiplexed paired ends)
  9. Called sequence of read
  10. Quality string--In symbolic ASCII format (ASCII character code = quality value + 64)
  11. Match chromosome--Name of chromosome match OR code indicating why no match resulted (RM = repeat masked, for example match against abundant sequences, NM = not matched)
  12. Match Contig--Gives the contig name if there is a match and the match chromosome is split into contigs (Blank if no match found)
  13. Match Position--Always with respect to forward strand, numbering starts at 1 (Blank if no match found)
  14. Match Strand--"F" for forward, "R" for reverse (Blank if no match found)
  15. Match Descriptor--Concise description of alignment (Blank if no match found)
  16. Single-Read Alignment Score--Alignment score of a single-read match, or for a paired read, alignment score of a read if it were treated as a single read. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. -1 for shadow reads.
  17. Paired-Read Alignment Score--Alignment score of a paired read and its partner, taken as a pair. Blank if no match found; any scores less than 4 should be considered as aligned to a repeat. Note that in single-ended analyses it is always blank.
  18. Partner Chromosome--Name of the chromosome if the read is paired and its partner aligns to another chromosome
  19. Partner Contig
  20. Partner Offset--If a partner of a paired read aligns to the same chromosome and contig, this number, added to the Match Position, gives the alignment position of the partner. If partner is a shadow read, this value is 0. If partner aligns to a different chromosome and/or contig, the number represents the absolute position of the partner. Blank for single-read analysis unless the record belongs to a part of a spliced RNA read.
  21. Partner Strand--To which strand did the partner of the paired read align? "F" for forward, "R" for reverse ("N" if no match found, blank for single-read analysis)
  22. Filtering--Did the read pass filtering? 0 - No, 1 - Yes.
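Two of the points above lend themselves to a short sketch: selecting the records that passed filtering, and the coordinate rescaling introduced in v1.6. The sketch assumes the export file is tab-delimited and that Filtering is the final field on each line; the function names export_passing_filter and new_coordinate are hypothetical.

```bash
# Keep only export records whose final field (Filtering) is 1.
export_passing_filter() {
    awk -F '\t' '$NF == 1'
}

# Rescale an old cluster coordinate: times 10, plus 1000, then rounded.
new_coordinate() {
    awk -v old="$1" 'BEGIN {printf "%d\n", int(old * 10 + 1000 + 0.5)}'
}

new_coordinate 12.34   # 12.34 * 10 + 1000 = 1123.4, which rounds to 1123

# Usage (file name is a placeholder):
#   zcat s_1_export.txt.gz | export_passing_filter
```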

Wig and Bigwig files

Density files can be in the wig or bigwig formats. We create our density files with a window of +/-75 basepairs once every 20 positions.
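To illustrate what a window of +/-75 basepairs every 20 positions means, here is a simplified sketch that counts tags falling within 75 bp of centers spaced 20 bp apart on a single chromosome. It is only an approximation for illustration: the function name tag_density is hypothetical, each tag is reduced to one position, and the actual density pipeline may count tags differently.

```bash
# Count tags within +/-half_window bp of window centers spaced step bp apart.
# Input: one tag position per line, all on the same chromosome.
# Output: center position and tag count, tab-separated, for non-empty windows.
tag_density() {
    local half_window="${1:-75}"
    local step="${2:-20}"
    awk -v w="$half_window" -v s="$step" '
        { pos[NR] = $1; if ($1 > max) max = $1 }
        END {
            for (c = 0; c <= max + w; c += s) {
                n = 0
                for (i = 1; i <= NR; i++)
                    if (pos[i] >= c - w && pos[i] <= c + w) n++
                if (n > 0) print c "\t" n
            }
        }'
}

# Usage: printf '100\n110\n300\n' | tag_density 75 20
```

The nested loop is quadratic and only meant to show the arithmetic; a real implementation would sweep sorted positions once.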

You can read more about the wig format here.

You can read more about the bigwig format here. That page also includes information on how to make bigwig files from wig files, and on extracting information from bigwig files.

Old Illumina System Files

Old flowcells (a year or more old) may have different files and a different file structure from those documented above.

Most files will start with the format: s_[LANE]

Old multiplex flowcells will also have their files separated into bin folders of three digits, such as: 001 or 004.

s_[LANE]_sequence.txt.gz
A gzipped FASTQ file--similar to the FASTQ format described here, but the identification tag is different, the quality score indicator line contains the identification tag, and the encoding for the quality score will depend on which version of the software was used for alignment.
s_[LANE]_export.txt.gz
These are alignment files in Illumina's deprecated export format.
uniques.lane1.[GENOME].bed.gz
These are gzipped BED files--similar to the BED format described here but instead of a name column, the tag is included.