About VCF variant files

Answer:

Variants are released in VCF format. As these files have been released at different times, they use different versions of the format; the version is indicated in the file header. Our VCFs are multi-individual, with genotypes listed for each sample; we do not have individual-specific or population-specific VCFs.
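
As a quick check, the format version can be read from the first header line; for example (assuming a local bgzip-compressed copy of one of our files; the file name here is illustrative):

zcat ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | head -1

This prints a line such as ##fileformat=VCFv4.1.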

Are all the genotype calls in the 1000 Genomes Project VCF files bi-allelic?

No. While bi-allelic calling was used in earlier phases of the 1000 Genomes Project, multi-allelic SNPs, indels, and a diverse set of structural variants (SVs) were called in the final phase 3 call set. More information can be found in the main phase 3 publication from the 1000 Genomes Project and the structural variation publication. The supplementary information for both papers provides further detail.

In earlier phases of the 1000 Genomes Project, the programs used for genotyping were unable to genotype sites with more than two alleles. At such sites, the highest-frequency alternative allele was usually chosen and genotyped, taking depth of coverage, base quality and mapping quality into account; this was the approach used in phase 1. As methods improved over the course of the project, we recommend using the final phase 3 data in preference to earlier call sets.
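
If you want to see which records in a phase 3 file are multi-allelic, one option (a sketch assuming bcftools and a local copy of a phase 3 VCF; the file name is illustrative) is to select sites with three or more alleles:

# List records with at least 3 alleles (REF plus two or more ALTs)
bcftools view -m3 ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | grep -v '^##' | head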

Related questions:

Are the IGSR variants available in dbSNP?

Answer:

When studies are published, their variant call sets are submitted to the archives (dbSNP, DGVa, EVA, etc.).

The 1000 Genomes Project SNPs and short indels were all submitted to dbSNP and longer structural variants to the DGVa.

The accessions for data sets in the archives can be found in the accompanying publications (listed alongside the data collections).

Related questions:

Are the IGSR variants available in genome browsers?

Answer:

1000 Genomes Project data is available at both Ensembl and the UCSC Genome Browser.

Ensembl provides consequence information for the variants. The variants that are loaded into the Ensembl database and have consequence types assigned are displayed on the Variation view. Ensembl can also offer consequence predictions using their Variant Effect Predictor (VEP).

You can see individual genotype information in the Ensembl browser by looking at the Individual Genotypes section of the page, reached from the menu on the left-hand side.

Related questions:

Are the variant calls in IGSR phased?

Answer:

You can tell that a VCF file contains phased genotypes when the delimiter used in the GT field is the pipe symbol (|), e.g.:

#CHROM  POS     ID           REF  ALT  QUAL  FILTER  INFO                                         FORMAT  HG00096
10      60523   rs148087467  T    G    100   PASS    AC=0;AF=0.01;AFR_AF=0.06;AMR_AF=0.0028;AN=2  GT:GL   0|0:-0.19,-0.46,-2.28
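
To check phasing programmatically, one option (a sketch assuming bcftools and a local indexed VCF; the file name is illustrative) is to print the GT field and look at the separator:

# Print the genotype for every sample at the first few records;
# phased genotypes use | and unphased genotypes use /
bcftools query -f '[%GT\n]' ALL.chr13.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | head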

The VCF files produced by the final phase of the 1000 Genomes Project (phase 3) are phased. They can be found in the final release directory from the project and in the directory supporting the final publications.

Some other studies have also produced phased versions of their calls. These include the analysis of high-coverage data across 3,202 samples on GRCh38 completed by NYGC. Multiple sets of VCFs are available, including phased VCFs, linked to from the page for that collection.

Related questions:

Can I get phased genotypes and haplotypes for the individual genomes?

Answer:

Phased variant call sets are described in “Are the variant calls in IGSR phased?”.

You can obtain individual phased genotypes either through the Ensembl Data Slicer or by using a combination of tabix and VCFtools to subsample VCF files for a particular individual or list of individuals.

The Data Slicer has options to filter by individual and by population. The individual filter takes the individual names in the VCF header and presents them as a list, from which you can select one or more individuals before being given the final file. To filter by population, you must also provide a panel file which pairs individuals with populations; you are again presented with a list, from which you can select one or more populations, before being given the final file.

To use tabix, you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > /tmp/HG00098.20100804.genotypes.vcf.gz
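
More recent tools can do the same subsetting in a single step; for example, a sketch assuming bcftools and a local indexed copy of the file:

# Extract one sample over a region and write a compressed VCF
bcftools view -s HG00098 -r 17:1471000-1472000 -Oz -o HG00098.20100804.genotypes.vcf.gz ALL.2of4intersection.20100804.genotypes.vcf.gz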

Please also note that some studies, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), are now producing haplotype-resolved assemblies.

Related questions:

Can I map your variant coordinates between different genome assemblies?

Answer:

We have data presented on GRCh38, GRCh37 and NCBI36; please check the data portal to see which assembly a given data set is on. If you need variant calls on a particular assembly, it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers, as this will provide a definitive mapping.

If an rs number or equivalent is not available, there are tools available from both Ensembl and the NCBI to map coordinates between NCBI36, GRCh37 and GRCh38.
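
As one concrete command line approach (an illustrative sketch only: it assumes the CrossMap tool, a chain file describing the assembly mapping, and a copy of the target reference are available):

# Lift variants from GRCh37 to GRCh38 coordinates; file names are placeholders
CrossMap.py vcf GRCh37_to_GRCh38.chain.gz input.GRCh37.vcf GRCh38.fa output.GRCh38.vcf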

Related questions:

Can I use the IGSR data for imputation?

Answer:

The developers of Beagle, MaCH and IMPUTE2 have all created data sets based on the 1000 Genomes data for use in imputation.

Please look at the software’s website to find those files.

Related questions:

Do you have assembled FASTA sequences for samples?

Answer:

Recent projects, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), have produced haplotype-resolved assemblies. These are linked to from the page for that data collection, and more details can be found in the accompanying publication.

The 1000 Genomes Project did not create any assemblies from the genome sequence data it generated.

The Gerstein Lab at Yale University created a diploid version of the NA12878 sequence, which is available from the Gerstein website under NA12878_diploid. When used, groups should cite AlleleSeq: analysis of allele-specific expression and binding in a network framework, Rozowsky et al., Molecular Systems Biology 7:522.

You can create a FASTA file incorporating the variants from an individual with a VCFtools Perl script called vcf-consensus.

An example set of command lines would be:

#Extract the region and individual of interest from the VCF file you want to produce the consensus from
tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr17.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c > HG00098.vcf.gz

#Index the new VCF file so it can be used by vcf-consensus
tabix -p vcf HG00098.vcf.gz

#Run vcf-consensus and use --sample to apply sample-specific variants. If not given, all the variants are applied
cat ref.fa | vcf-consensus HG00098.vcf.gz --sample HG00098 > HG00098.fa
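
A more recent alternative (a sketch assuming samtools and bcftools rather than VCFtools; it reuses the region, sample and indexed VCF from the example above) is bcftools consensus:

# Extract the reference sequence for the region, then apply one haplotype
# of the sample's variants to it (the VCF must be bgzipped and indexed)
samtools faidx ref.fa 17:1471000-1472000 | bcftools consensus -s HG00098 -H 1 HG00098.vcf.gz > HG00098.fa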

You can get more support for VCFtools on their help mailing list.

Related questions:

Do you have structural variation data?

Answer:

The 1000 Genomes Project considered structural variation (variants longer than 50bp) based on short-read Illumina data in the publication by Sudmant et al. in 2015.

Structural variants are also considered in the analysis of high-coverage short-read data carried out by NYGC.

However, short-read data has limitations for assessing structural variation. The Human Genome Structural Variation Consortium (HGSVC) applied a variety of technologies to explore their ability to detect structural variation. This work has since been expanded, and other projects are now using these technologies to produce haplotype-resolved genome assemblies.

Related questions:

Does the 1000 Genomes Project use HapMap data?

Answer:

The 1000 Genomes Project shares some samples with the HapMap project; any sample which starts with NA was likely part of the HapMap project. In the pilot stages of the project, HapMap genotypes were also used to help quality control the data and identify sample swaps and contamination. Since phase 1, the HapMap data has not been used by the 1000 Genomes Project, and all genotypes were independently identified by 1000 Genomes.

Related questions:

How do I find out information about a single variant?

Answer:

Our VCF files contain global and super-population alternative allele frequencies. You can see this in our most recent release. For multi-allelic variants, each alternative allele frequency is presented in a comma-separated list.

An example record containing this information in its INFO column looks like:

1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP
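
To extract these frequencies in bulk, one option (a sketch assuming bcftools and a local phase 3 VCF; the file name is illustrative) is:

# Print position, ID, global AF and three of the super-population AFs per record
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/AF\t%INFO/EAS_AF\t%INFO/AFR_AF\t%INFO/EUR_AF\n' ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz | head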

If you want population-specific allele frequencies, you have three options:

  * For a single variant, you can look at the population genetics page for that variant in the Ensembl browser. This gives you pie charts and a table for a single site.
  * For a genomic region, you can use our allele frequency calculator tool, which gives a set of allele frequencies for selected populations.
  * If you would like sub-population allele frequencies for a whole file, you are best to use the vcftools command line tool.

This is done using a combination of two VCFtools Perl scripts called vcf-subset and fill-an-ac.

An example command set using files from our phase 1 release would look like:

grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list

vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac | bgzip -c > CEU.chr13.phase1.vcf.gz

Once you have this file, you can calculate the allele frequency by dividing AC (allele count) by AN (allele number).
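
For example, in the record shown earlier, AC=3050 and AN=5008, giving 3050/5008 ≈ 0.609, which matches the AF=0.609026 reported in the INFO column.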

Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.

Lists of identifiers

You can get information about a list of variant identifiers using Ensembl’s Biomart.

This YouTube video gives a tutorial on how to do it.

The basic steps are:

  1. Select the Ensembl Variation Database
  2. Select the Homo sapiens Short Variants (SNPs and indels excluding flagged variants) dataset
  3. Select the Filters menu from the left hand side
  4. Expand the General Variant Filters section
  5. Check the Filter by Variant Name (e.g. rs123, CM000001) [Max 500 advised] box
  6. Add your list of rs numbers to the box or browse for a file which contains this list
  7. Click on the Results Button in the headline section
  8. This should provide you with a table of results which you can also download in Excel or CSV format

If you would like the coordinates on GRCh38, you should use the main Ensembl site; if you would like the coordinates on GRCh37, you should use the dedicated GRCh37 site.

Related questions:

How do I find specific variant files?

Answer:

The easiest way to find the variant files you’re looking for is with the Data Portal. You can search for individuals, populations and data collections, and filter the files by data type and technologies. This will give you locations of the files, which you can use to download directly, or to export a list to use with a download manager.

The VCFs are organised by chromosome and contain genotypes for all the individuals in the data set. Chromosome X, Y and MT variants are available for the phase 3 variant set. The chrX calls were made in the same manner as the autosomal calls and are part of the integrated call set, including SNPs, indels and large deletions; note that males are shown as haploid for any chrX SNP outside the pseudoautosomal region (PAR). The chrY and chrMT calls were made independently and are presented in their own integrated files in the phase 3 FTP directory. ChrY has SNPs, indels and large deletions; chrMT has only SNPs and indels. For more details about how these call sets were made, please see the phase 3 paper.

Related questions:

How do I get a genomic region sub-section of your files?

Answer:

You can get a subsection of the VCF or BAM files using the Ensembl Data Slicer tool. It provides a web interface that takes the URL of any VCF or BAM file and the genomic location you wish to slice out. It can also filter the file for particular individuals or populations if you provide a panel file.

You can also subset VCFs using tabix on the command line, e.g.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768

Specifications for the VCF format, and a C++ and Perl tool set for VCF files, can be found at vcftools on SourceForge.

Please note that all our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than the chr1 style used by UCSC. If you request a subsection of a VCF file using a chromosome name in the chrN style, as shown below, it will not work.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768

You can subset alignment files with samtools on the command line, e.g.

samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 17:7512445-7513455

Samtools supports streaming files and piping commands together, using both local and remote files. You can get more help with samtools from the samtools help mailing list.

Related questions:

How do you calculate ancestral alleles?

Answer:

Information relating to ancestral alleles is available for phase three of the 1000 Genomes Project. The work on annotating ancestral alleles is described in Section 8.3 of the supplementary material of the publication accompanying that work.

Related questions:

How was imputation used in 1000 Genomes to fill in gaps in sequencing?

Answer:

In the original phase 1 and phase 3 sequencing of the 1000 Genomes individuals, many genomes were sequenced in full only at low coverage, so for some individuals some genotypes will be based on imputation.

This means that if an individual has no coverage at a particular location, but overall we have been able to determine that there is variation at that location, then we can statistically infer the genotype for that variant in that individual using haplotype information. This allows us to provide complete haplotypes for all the variation we discover.

The process used to create our genotypes first gave our merged sites and genotype likelihoods to Beagle to generate initial haplotypes (using 50 iterations across all samples); these were then refined using a modified version of Thunder (which used 300 states chosen by longest matching haplotype at each iteration, in addition to 100 randomly chosen states).

This process means we are unable to say precisely which sites used imputation to generate their genotype. Without this process, the approximate error rate for our heterozygous sites would be 20%, so you can estimate that 20% of our heterozygous sites will have been changed on the basis of imputation. The sites covered by our exome sequencing are our highest-accuracy sites and are the least likely to have been changed by this process. The converse is also true: any site without any sequence alignment will have been imputed. Other sites may have been given greater evidence on the basis of the imputation and refinement process. You can find the depth of coverage at any site using our BAM files, as in the sketch below.
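
For example (a sketch assuming samtools and a local indexed copy of one of the project's alignment files; the file and position are illustrative):

# Report per-base read depth at a single position in one sample's alignments
samtools depth -r 17:7512445-7512445 HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam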

You can find out more about this in our Phase 1 paper.

Related questions:

Where can I get consequence annotations for the IGSR variants?

Answer:

The final 1000 Genomes phase 3 analysis calculated consequences based on GENCODE annotation, and these can be found in the directory release/20130502/supporting/functional_annotation/.

Ensembl also provides consequence information for the variants. The variants that are loaded into the Ensembl database and have consequence types assigned are displayed on the Variation view. Ensembl can also offer consequence predictions using their Variant Effect Predictor (VEP).

Related questions:

Why are the coordinates of some variants different to what is displayed in other databases?

Answer:

Data from the 1000 Genomes Project has now been analysed multiple times and on different reference assemblies. In addition, further data sets have been generated and analysed. These studies, therefore, use different data, different reference assemblies and different methodologies, which can lead to different variant calls being made.

The studies that IGSR makes data available for are listed with their accompanying publications in our data portal. The publications can provide further details.

Related questions:

Why are there differences between the different analyses of the 1000 Genomes samples?

Answer:

The phase 1 variant list released in 2012 and the phase 3 variant list released in 2014 overlap, but phase 3 is not a complete superset of phase 1. Comparing variant positions between the two releases shows that 2.3M phase 1 sites are not present in phase 3. Of these 2.3M sites, 1.92M are SNPs; the rest are either indels or structural variants (SVs).

The difference between the two lists can be explained by a number of factors.

  1. Some phase 1 samples were not used in phase 3 for various reasons. If a sample was not part of phase 3, variants private to that sample will not be part of the phase 3 set.

  2. Our input sequence data is different. In phase 1 we had a mixture of read lengths (36bp to >100bp) and a mixture of sequencing platforms (Illumina, ABI SOLiD and 454). In phase 3 we only used data from the Illumina sequencing platform, with read lengths of 70bp or more. We believe that these calls are higher quality, and that variants excluded this way were probably not real.

  3. The first two reasons listed explain 548k missing SNPs, leaving 1.37M SNPs still to be explained.

    The phase 1 and phase 3 variant calling pipelines are different. Phase 3 had an expanded set of variant callers, used haplotype-aware variant callers and variant callers that used de novo assembly. It considered low coverage and exome sequence together rather than independently. Our genotype calling was also different, using ShapeIt2 and MVNcall, allowing integration of multi-allelic variants and complex events that weren't possible in phase 1.

    891k of the 1.37M phase 1 sites missing from phase 3 were not identified by any phase 3 variant caller. These 891k SNPs have a relatively high Ts/Tv ratio (1.84), which means these were likely missed in phase 3 because they are very rare, not because they are wrong; the increase in sample number in phase 3 made it harder to detect very rare events, especially if the extra 1,400 samples in phase 3 did not carry the alternative allele.

    481k of these SNPs were initially called in phase 3. 340k of them failed our initial SVM filter so were not included in our final merged variant set. 57k overlapped with larger variant events so were not accurately called. 84k sites did not make it into our final set of genotypes due to losses in our pipeline. Some of these sites will be false positives but we have no strong evidence as to which of these sites are wrong and which were lost for other reasons.

  4. The reference genomes used for our alignments are different. Phase 1 reads were aligned to the standard GRCh37 primary reference, including unplaced contigs. In phase 3 we added EBV and a decoy set to the reference to reduce mismapping. This will have reduced our false positive variant calling, as mismapping can lead to false SNP calls, although we cannot quantify this effect.

We have made no attempt to elucidate why our SV and indel numbers changed. Since the release of the phase 1 data, the algorithms to detect and validate indels and SVs have improved dramatically. By and large, we assume the indels and SVs in phase 1 that are missing from phase 3 were false positives in phase 1.

You can get more details about our comparison from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/phase1_sites_missing_in_phase3/

Related questions:

Why are there duplicate calls in the phase 3 call set?

Answer:

The phase 3 VCF files released in June 2014 contain overlapping and duplicate sites.

This is due to an error in the processing pipeline used when sets of variant calls were combined. During the pipeline, all multi-allelic sites were separated into individual lines in the VCF file, but the recombination process did not always succeed, leaving a small number of sites with overlapping or duplicate call records. This is most commonly seen on chromosome X.

The simplest solution is to ignore duplicate sites in any analysis. If you wish to use one or both of a pair of duplicate sites in your analysis, you should use the GRCh37 alignment files to recall the genotypes of the relevant individuals and resolve the conflict.
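
If you want to locate the duplicated sites first, one option (a sketch assuming bcftools and a local copy of a phase 3 file; the file name is illustrative) is:

# List chromosome/position pairs that appear in more than one record
bcftools query -f '%CHROM\t%POS\n' ALL.chrX.phase3_shapeit2_mvncall_integrated_v1b.20130502.genotypes.vcf.gz | sort | uniq -d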

Related questions:

Why can't I find one of your variants in another database?

Answer:

As is described in “Why are the coordinates of some variants different to what is displayed in other databases?”, different analyses may produce different results.

The publications accompanying the data collections list the accessions for the variant calls in the variation archives.

Related questions:

Why do some of your vcf genotype files have genotypes of ./. in them?

Answer:

Our August 2010 call set represents a merge of various independent call sets. Not all the call sets in the merge had genotypes associated with them, and the merge was carried out using a predefined set of rules; this has left some individuals, or whole variant sites, with no genotype, which is represented as ./. in VCF 4.0. In our November 2010 call set and all subsequent call sets, all sites have genotypes for all individuals for chr1-22 and X.
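
To count such missing genotypes in a file, one approach (a sketch assuming bcftools; the file name is illustrative) is:

# Count genotypes recorded as missing (./.) across all samples and sites
bcftools query -f '[%GT\n]' ALL.2of4intersection.20100804.genotypes.vcf.gz | grep -c '\./\.'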

Related questions:

Why do some variants in the phase1 release have a zero Allele Frequency?

Answer:

There are a small number of variants which have an allele count (AC) of 0 and an allele frequency (AF) of 0.

This is because the original sample list for phase 1 had 1094 samples on it. After our integrated genotyping process, 2 samples were discovered to have very discordant genotypes.

The decision was made to leave in any variant which had only been discovered in one or both of these individuals. The analysis group is still confident in these sites, but not in their genotypes. As a result, we are left with some variant sites where no sample carries the non-reference allele.
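
If you would rather exclude such sites from an analysis, one option (a sketch assuming bcftools and an illustrative phase 1 file name) is to require at least one non-reference allele:

# Keep only records where at least one sample carries a non-reference allele
bcftools view -c 1 -Oz -o chr13.polymorphic.vcf.gz ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz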

Related questions:

Why does a tabix fetch fail?

Answer:

There are two main reasons a tabix fetch might fail.

All our VCF files use plain integers and X/Y for their chromosome names, in the Ensembl style, rather than the chr1 style used by UCSC. If you request a subsection of a VCF file using a chromosome name in the chrN style, as shown below, it will not work.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768
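
The same fetch with an Ensembl-style chromosome name works:

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768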

Also, tabix does not fail when streaming remote files but instead simply stops streaming. This can lead to incomplete final lines, with unexpected numbers of columns, when streaming large sections of a file. The only way to avoid this is to download the file and work with it locally.

Related questions:

Why is the allele frequency different from allele count/allele number?

Answer:

In some early main project releases, the allele frequency (AF) was estimated using additional information like LD, mapping quality and haplotype information. This means that in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 and phase 3 releases, AF should always match AC/AN rounded to two decimal places.

Related questions: