About VCF variant files

Answer:

Variants are released in VCF format. As these have been released at different times, they are on different versions of the format - this will be indicated in the file heading. Our VCFs are multi-individual, with genotypes listed for each sample; we do not have individual or population specific VCFs.

Are all the genotype calls in the 1000 Genomes Project VCF files bi-allelic?

No. While bi-allelic calling was used in earlier phases of the 1000 Genomes Project, multi-allelic SNPs, indels, and a diverse set of structural variants (SVs) were called in the final phase 3 call set. More information can be found in the main phase 3 publication from the 1000 Genomes Project and the structural variation publication. The supplementary information for both papers provides further detail.

In earlier phases of the 1000 Genomes Project, the programs used for genotyping were unable to genotype sites with more than two alleles. In most cases, the highest frequency alternative allele was chosen and genotyped. Depth of coverage, base quality and mapping quality were also used when making this decision. This was the approach used in phase 1 of the 1000 Genomes Project. As methods were developed during the 1000 Genomes Project, it is recommended to use the final phase 3 data in preference to earlier call sets.

Related questions:

Can I BLAST against the sequences in IGSR?

Answer:

The 1000 Genomes raw sequence data represents more then 30,000x coverage of the human genome and there are no tools currently available to search against the complete data set. You can, however, use the Ensembl or NCBI BLAST services and then use these results to find 1000 Genomes Project variants in dbSNP.

Related questions:

Can I convert VCF files to PLINK/PED format?

Answer:

We provide a VCF to PED tool to convert from VCF to PLINK PED format. This tool has documentation for both the web interface and the Perl script.

An example Perl command to run the script would be:

perl vcf_to_ped_converter.pl -vcf ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/ALL.chr13.phase1_integrated_calls.20101123.snps_indels_svs.genotypes.vcf.gz
    -sample_panel_file ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/phase1_integrated_calls.20101123.ALL.sample_panel
    -region 13:32889611-32973805 -population GBR -population FIN

Related questions:

Can I get phased genotypes and haplotypes for the individual genomes?

Answer:

Phased variant call sets are described in “Are the variant calls in IGSR phased?”.

You can obtain individual phased genotypes through either the Ensembl Data Slicer or using a combination of tabix and VCFtools allows you to sub sample VCF files for a particular individual or list of individuals.

The Data Slicer has both filter by individual and population options. The individual filter takes the individual names in the VCF header and presents them as a list before giving you the final file. If you wish to filter by population, you also must provide a panel file which pairs individuals with populations, again you are presented with a list to select from before being given the final file, both lists can have multiple elements selected.

To use tabix you must also use a VCFtools Perl script called vcf-subset. The command line would look like:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 17:1471000-1472000 | perl vcf-subset -c HG00098 | bgzip -c /tmp/HG00098.20100804.genotypes.vcf.gz

Please also note that some studies, such as the second phase of the Human Genome Structural Variation Consortium (HGSVC), are now producing haplotype resolved asssemblies.

Related questions:

Can I map your variant coordinates between different genome assemblies?

Answer:

We have data presented on GRCh38, GRCh37 and NCBI36, please check the data portal to see what assembly the data is on. If you need variant calls to be in a particular assembly it is best to go to dbSNP, Ensembl or an equivalent archive using their rs numbers as this will provide a definitive mapping.

If an rs number or equivalent is not available there are tools available to map between NCBI36, GRCh37 and GRCh38 from both Ensembl and the NCBI

Related questions:

Can I query IGSR programmatically?

Answer:

Our data is in standard formats like SAM and VCF, which have tools associated with them. To manipulate SAM/BAM files look at SAMtools for a C based toolkit and links to APIs in other languages. To interact with VCF files look at VCFtools which is a set of Perl and C++ code.

Related questions:

Can I use the IGSR data for imputation?

Answer:

The developers of Beagle, Mach and Impute2 have all created data sets based on the 1000 Genomes data to use for imputation.

Please look at the software’s website to find those files.

Related questions:

Have you calculated linkage disequilibrium from these variants?

Answer:

The 1000 Genomes Project’s phase three publication in 2015, which analysed data from the original 1000 Genomes Project on GRCh37, used linkage disequilibrium calculations to assess GWAS Type I error rates. This work is described in that publication and the accompanying supplementary material.

To look at Linkage Disequilibrium between variants that may be of interest to you, we recommend using Ensembl’s Linkage Disequilibrium Calculator, which can be used with data from the 1000 Genomes Project or other studies.

Related questions:

How do I find out information about a single variant?

Answer:

Our VCF files contain global and super population alternative allele frequencies. You can see this in our most recent release. For multi allelic variants, each alternative allele frequency is presented in a comma separated list.

An example info column which contains this information looks like

1 15211 rs78601809 T G 100 PASS AC=3050;AF=0.609026;AN=5008;NS=2504;DP=32245;EAS_AF=0.504;AMR_AF=0.6772;AFR_AF=0.5371;EUR_AF=0.7316;SAS_AF=0.6401;AA=t|||;VT=SNP

If you want population specific allele frequencies you have three options: * For a single variant you can look at the population genetics page for a variant in the Ensembl browser. This gives you piecharts and a table for a single site. * For a genomic region you can use our allele frequency calculator tool which gives a set of allele frequencies for selected populations * If you would like sub population allele frequences for a whole file, you are best to use the vcftools command line tool.

This is done using a combination of two vcftools commands called vcf-subset and fill-an-ac

An example command set using files from our phase 1 release would look like

grep CEU integrated_call_samples.20101123.ALL.panel | cut -f1 > CEU.samples.list

vcf-subset -c CEU.samples.list ALL.chr13.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz | fill-an-ac |
    bgzip -c > CEU.chr13.phase1.vcf.gz
    </pre>

Once you have this file you can calculate your frequency by dividing AC (allele count) by AN (allele number).

Please note that some early VCF files from the main project used LD information and other variables to help estimate the allele frequency. This means in these files the AF does not always equal AC/AN. In the phase 1 and phase 3 releases, AC/AN should always match the allele frequency quoted.

Lists of identifiers

You can get information about a list of variant identifiers using Ensembl’s Biomart.

This YouTube video gives a tutorial on how to do it.

The basic steps are:

  1. Select the Ensembl Variation Database
  2. Select the Homo sapiens Short Variants (SNPs and indels excluding flagged variants) dataset
  3. Select the Filters menu from the left hand side
  4. Expand the General Variant Filters section
  5. Check the Filter by Variant Name (e.g. rs123, CM000001) [Max 500 advised] box
  6. Add your list of rs numbers to the box or browse for a file which contains this list
  7. Click on the Results Button in the headline section
  8. This should provide you with a table of results which you can also download in Excel or CSV format

If you would like the coordinates on GRCh38, you should use the main Ensembl site, however if you would like the coordinates on GRCh37, you should use the dedicated GRCh37 site.

Related questions:

How do I get a genomic region sub-section of your files?

Answer:

You can get a subsection of the VCF or BAM files using the Ensembl Data Slicer tool. This tool gives you a web interface requesting the URL of any VCF file and the genomic location you wish to get a sub-slice for. This tool also works for BAM files. This tool also allows you to filter the file for particular individuals or populations if you also provide a panel file.

You can also subset VCFs using tabix on the command line, e.g.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz 2:39967768-39967768

Specifications for the VCF format, and a C++ and Perl tool set for VCF files can be found at vcftools on sourceforge

Please note that all our VCF files using straight intergers and X/Y for their chromosome names in the Ensembl style rather than using chr1 in the UCSC style. If you request a subsection of a vcf file using a chromosome name in the style chrN as shown below it will not work.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768

You can subset alignment files with samtools on the command line, e.g.

samtools view -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/data/HG00154/alignment/HG00154.mapped.ILLUMINA.bwa.GBR.low_coverage.20101123.bam 17:7512445-7513455

Samtools supports streaming files and piping commands together both using local and remote files. You can get more help with samtools from the samtools help mailing list

Related questions:

What methods were used for generating alignments?

Answer:

Details of alignment methodology differ between data sets and the types of sequence data being aligned.

Full details of methodology can be found in the publications accompanying the data collections and, for unpublished alignments, in the README files placed with the data collections on our FTP site.

Related questions:

What tools can I use to download IGSR data?

Answer:

The 1000 Genomes data is available via ftp, http, Aspera and Globus. Any standard tool like wget or ftp should be able to download from our ftp or http mounted sites. There are no official torrents of the 1000 Genomes Project data sets.

How to download files using Aspera

Download Aspera

Aspera provides a fast method of downloading data. To use the Aspera service you need to download the Aspera connect software. This provides a bulk download client called ascp.

Command line

For the command line tool ascp, for versions 3.3.3 and newer, you need to use a command line like:

     ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./

For versions 3.3.2 and older, you need to use a command line like:

     ascp -i bin/aspera/etc/asperaweb_id_dsa.putty -Tr -Q -l 100M -P33001 -L- fasp-g1k@fasp.1000genomes.ebi.ac.uk:vol1/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz ./

Note, the only change between these commands is that for newer versions of ascp asperaweb_id_dsa.openssh replaces asperaweb_id_dsa.putty. You can check the version of ascp you have using:

   ascp --version

The argument to -i may also be different depending on the location of the default key file. The command should not ask you for a password. All the IGSR data is accessible without a password but you do need to give ascp the ssh key to complete the command.

Files on the ENA FTP

Some of the data we provide URLs for is hosted on the ENA FTP site. ENA provide information on using Aspera with their FTP site.

As an example of downloading a file from ENA, you could use a command line like:

ascp -i bin/aspera/etc/asperaweb_id_dsa.openssh -Tr -Q -l 100M -P33001 -L- 
era-fasp@fasp.sra.ebi.ac.uk:/vol1/fastq/ERR008/ERR008901/ERR008901_1.fastq.gz ./

Key files

If you are unsure of the location of asperaweb_id_dsa.openssh or asperaweb_id_dsa.putty, Aspera provide some documentation on where these will be found on different systems.

Ports

For the above commands to work with your network’s firewall you need to open ports 22/tcp (outgoing) and 33001/udp (both incoming and outgoing) to the following EBI IPs:

If the firewall has UDP flood protection, it must be turned off for port 33001.

Browser

Our aspera browser interace no longer works. If you wish to download files using a web interface we recommend using the Globus interface we present. If you are previously relied on the aspera web interface and wish to discuss the matter please email us at info@1000genomes.org to discuss your options.

How to download 1000 Genomes data with Globus Online?

The 1000 Genomes FTP site is available as an end point in the Globus Online system. In order to access the data you need to sign up for an account with Globus via their signup page. You must also install the Globus Connect Personal software and setup a personal endpoint to download the data too.

The 1000 Genomes data is hosted as EMBL-EBI collection called “EMBL-EBI Public Data”. Data from our FTP site can then be found under the 1000g directory within the EMBL-EBI Public Data collection. The collection can be found here

screen_shot

When you have setup your personal end point you should be able to start a transfer using their web interface.

The Globus website has support for setting up accounts, and installing the globus personal connect software.

Can I get 1000 Genomes data on the Amazon Cloud?

At the end of the 1000 Genomes Project, a large volume of the 1000 Genomes data (the majority of the FTP site) was available on the Amazon AWS cloud as a public data set.

At the end of the 1000 Genomes Project, the IGSR was established and the FTP site has been further developed since the conclusion of the 1000 Genomes Project, adding additional data sets. The Amazon AWS cloud reflects the data as it was at the end of the 1000 Genomes Project and does not include any updates or new data.

You can find more information about how to use the data in the Amazon AWS cloud on our AWS help page.

Related questions:

Where can I find a list of the sequencing and analysis done for each individual?

Answer:

Our data portal has a page for each sample. At the bottom of the page, the various data collections that the sample is present in are listed in tabs. Each tab then lists the available files for that sample, including seqeunce data, genotype arrays, alignments and VCFs.

An example is the page for NA12878. Sample IDs can be entered in the search box to locate a given sample.

To understand the data available for larger groups of samples, the samples and population tabs of the portal can be used to explore available data.

Related questions:

Why does a tabix fetch fail?

Answer:

There are two main reasons a tabix fetch might fail.

All our VCF files using straight intergers and X/Y for their chromosome names in the Ensembl style rather than using chr1 in the UCSC style. If you request a subsection of a VCF file using a chromosome name in the style chrN as shown below it will not work.

tabix -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804 ALL.2of4intersection.20100804.genotypes.vcf.gz chr2:39967768-39967768

Also tabix does not fail when streaming remote files but instead just stops streaming. This can lead to incomplete lines with final rows with unexpected numbers of columns when trying to stream large sections of the file. The only way to avoid this is to download the file and work with it locally.

Related questions:

Why is the allele frequency different from allele count/allele number?

Answer:

In some early main project releases the allele frequency (AF) was estimated using additional information like LD, mapping quality and Haplotype information. This means in these releases the AF was not always the same as allele count/allele number (AC/AN). In the phase 1 release AF should always match AC/AN rounded to two decimal places.

Related questions: