Are there any statistics about how much sequence data is in IGSR?

We do not provide summary statistics that span the collections in IGSR. However, our index files, provided for each data collection on the FTP site, include information for each collection. The following describes the information available using the 1000 Genomes Project files as an example, however, similar files are available for the other data collections.

For raw data, a sequence.index file contains base and read counts for each of the active FASTQ files.

For the aligned data all BAM and CRAM files have BAS files associated with them. These contain read group level statistics for the alignment. We also provide this in a collected form in alignment index files. The alignment indices for the alignments of the 1000 Genomes Project data to GRCh38 are available on the FTP site. There is also an historic alignment indices directory, which contains a .hsmetrics file with the results of the Picard tool CalculateHsMetrics for all the exome alignments and summary files, which compare statistics between old and new alignment releases during the 1000 Genomes Project.

Related questions: