GENOME-WIDE ASSOCIATION STUDIES (GWAS) Data Repository

The National Institutes of Health (NIH) Genomic Data Sharing Policy expects that genomic research data from NIH-supported studies involving human specimens as well as non-human and model organisms will be submitted to appropriate data repositories. The list below provides examples of relevant databases.

NIH Data Repositories, NIH-Funded Databases, and NIH Database Collaborations

ArrayExpress: an NIH-funded database at the European Molecular Biology Laboratory-European Bioinformatics Institute that collects and disseminates microarray-based gene-expression data. Read more about ArrayExpress.

DNA Data Bank of Japan (DDBJ): a data bank organized by the National Institute of Genetics in Japan that collects sequence data. As a member of the International Nucleotide Sequence Database Collaboration, DDBJ exchanges data with GenBank at the NIH National Center for Biotechnology Information and the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute. Read more about DDBJ.

Database of Genotypes and Phenotypes (dbGaP): an NIH database at the National Center for Biotechnology Information originally designed to archive and distribute coded genotype, phenotype, exposure, and pedigree data from genome-wide association studies. dbGaP now accepts additional types of data such as copy number variants and large-scale sequencing. Read more about dbGaP.

Database of Short Genetic Variations (dbSNP): an NIH database at the National Center for Biotechnology Information that includes single nucleotide variations, microsatellites, and small-scale insertions and deletions. dbSNP provides population-specific frequency and genotype data, experimental conditions, molecular context, and mapping information for both neutral variations and clinical mutations. Read more about dbSNP.

Database of Genomic Structural Variation (dbVar): an NIH database at the National Center for Biotechnology Information for large-scale structural genomic variations--such as insertions, deletions, translocations, and inversions--and associated phenotype information. dbVar accepts germline and somatic human structural variant data as well as data from a diverse array of organisms, including agriculturally important plants and livestock. Read more about dbVar.

European Nucleotide Archive (ENA): a database at the European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI) that collects, maintains, and presents comprehensive sequencing information--including raw sequencing data, sequence assembly information, and functional annotation--as part of the permanent public scientific record. As a member of the International Nucleotide Sequence Database Collaboration, EMBL-EBI exchanges data with GenBank at the NIH National Center for Biotechnology Information and the DNA Data Bank of Japan. Read more about ENA.

FlyBase: an NIH-funded database for genetic and genomic information on the fruit fly Drosophila melanogaster and related fly species. It includes referenced sequence genomes, phenotypic and gene expression data, chromosome maps, and additional resources. Read more about FlyBase.

GenBank: an NIH genetic sequence database at the National Center for Biotechnology Information (NCBI) that provides an annotated collection of publicly available DNA sequences. As a member of the International Nucleotide Sequence Database Collaboration, NCBI exchanges GenBank data with the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute and the DNA Data Bank of Japan. Read more about GenBank.

Gene Expression Omnibus (GEO): an NIH data repository that archives and distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomic data. Read more about GEO.

Influenza Research Database (IRD): an NIH-funded database that provides genomic and proteomic data for influenza viruses as well as surveillance data and phenotypic characteristics of viruses isolated from extracts. Read more about IRD.

Mouse Genome Informatics (MGI): an NIH-funded international database for the laboratory mouse Mus musculus that provides data on gene characterization, allelic variants, gene expression, mouse tumor biology, strain-specific phenotypes and genotypes, and mammalian orthology. Read more about MGI.

Rat Genome Database (RGD): an NIH-funded database that serves as a repository of genetic and genomic data from the laboratory rat Rattus norvegicus and also provides curation of mapped positions for quantitative trait loci, known mutations, and other phenotypic data. Read more about RGD.

Sequence Read Archive (SRA): NIH's primary archive of high-throughput sequencing data at the National Center for Biotechnology Information (NCBI). SRA stores raw sequencing data as well as alignment information in the form of read placements on a reference sequence. As a member of the International Nucleotide Sequence Database Collaboration, NCBI exchanges SRA data with the European Nucleotide Archive at the European Molecular Biology Laboratory-European Bioinformatics Institute and the DNA Data Bank of Japan. Read more about SRA.

WormBase: an NIH-funded international consortium that provides accurate, current, accessible information concerning the genetics, genomics, and biology of Caenorhabditis elegans and related nematodes. Read more about WormBase.

Xenbase: an NIH-funded database that serves as a biology and genomics resource for research on the African frog species Xenopus laevis and Xenopus tropicalis. Read more about Xenbase.

Zebrafish Information Network (ZFIN): an NIH-funded database that collects, curates, and disseminates genetic, genomic, phenotypic, and developmental data about the zebrafish Danio rerio. Data represented in ZFIN are derived from three primary sources: curation of zebrafish publications, individual research laboratories, and collaborations with bioinformatics organizations. Read more about ZFIN.

Data Repositories Established as NIH Trusted Partners

For genomic data derived from human specimens, NIH may employ trusted third parties, or trusted partners, to meet infrastructure needs for data storage and/or to provide tools that are useful for genomic data analyses. A trusted partner is defined as a public or private, national or international organization that is able to meet core NIH standards for data quality and data management service protocols.

NIH Established Trusted Partners

Cancer Genomics Hub (CGHub): CGHub stores, catalogs, and facilitates research using cancer genome sequences, alignments, and mutation information from The Cancer Genome Atlas (TCGA) consortium and related projects.