NCI Cancer Genomics Cloud Pilots

Bringing data and computation together to create knowledge that accelerates cancer research and enables precision medicine

Cancer Genomics Cloud Pilots Extended

NCI Cancer Genomics Cloud (CGC) Pilots have been extended for one year! The scope of the activities for the upcoming year includes:

  • Incorporation of additional datasets maintained in the Genomic Data Commons (GDC), including the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) program and the Cancer Genome Characterization Initiative (CGCI)
  • Addition of two new data types: medical imaging and proteomic data
  • Updates to improve usability or add new tools for analysis and visualization

All three CGC Pilots are now available!

Read the blog post by Dr. Tony Kerlavage about how the CGC Pilots and the GDC fit into the National Cancer Knowledge System.

For upcoming events and conferences or mentions of the NCI Cloud in the news, visit the News and Events page.

Overview of the Program

The traditional model for analyzing genomic data involves individual researchers downloading data stored at a variety of locations, adding their own data, attempting to harmonize the data, and then computing over these data on local hardware. While this model has been successful for many years, it has become unsustainable given the enormous growth of biomedical data due to the prevalent use of next-generation sequencing technology in large scientific programs. The size of the data makes access and analysis difficult, in terms of both storage and computing capability, for anyone but the best-resourced institutions.

In response to these challenges, the NCI Center for Biomedical Informatics and Information Technology (CBIIT), in collaboration with the Center for Cancer Genomics, awarded three contracts to develop CGC Pilots to help meet the research community’s need to access and analyze high-quality, large-scale cancer genomic data and associated clinical information.

Key design principles for the CGC Pilots include: APIs for secure tool and data access, usability for biologists and clinicians as well as bioinformaticists and application developers, scalability, sustainability, extensibility to new data types without major refactoring, and open source, non-viral software licenses.

All three CGC Pilots have chosen to implement their systems through commercial cloud providers and are collaborating on adopting common standards. Beyond these commonalities, the three project teams have distinct system designs, data presentation, and analysis resources to serve the cancer research community.

The Cancer Genome Atlas Data

All three Cloud Pilots will host a core data set from The Cancer Genome Atlas (TCGA). TCGA is a comprehensive effort launched by NCI and the National Human Genome Research Institute in 2006 to accelerate the understanding of the molecular basis of cancer through the application of genome analysis technologies to matched tumor-normal pairs. TCGA has collected tissue samples from, and is characterizing, 33 different types of cancer, including 10 rare cancers. TCGA has successfully demonstrated that a national, shared infrastructure for the generation of cancer genomic data, where individual labs pool their efforts and contribute their data, enables researchers to make and validate important discoveries and achieve economies of scale.

All three Cloud Pilots will host these core TCGA data:

| Data Type | Description |
|---|---|
| Clinical | Available clinical information for each participant (may include demographic information, treatment information, survival data, etc.) |
| Biospecimen | Information on how samples from each participant were processed by the Biospecimen Core Resource Center (BCR) |
| DNA-Seq | Whole exome sequence for both tumor and normal sample for each participant; whole genome sequence for select participants |
| RNA-Seq | mRNA sequence for each participant's tumor sample |
| SNP array | Probe signals for each participant's tumor sample |
| Mutations and Variations | Somatic and germline mutation calls for each participant |

TCGA is expected to generate approximately 2.5 petabytes (PB) of data by its projected completion in 2016. Maintaining local copies of all of the data is not feasible, and downloads can take weeks or months to complete. For precision medicine to move forward, data access and computing resources must be made available to the broadest set of researchers possible.
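
To make the download-time claim concrete, the following is a rough back-of-the-envelope sketch rather than a measurement. It assumes decimal petabytes (1 PB = 10^15 bytes) and a link that sustains its full nominal rate for the entire transfer, which real networks rarely do. The function name and the three link speeds are illustrative assumptions, not figures from NCI.

```python
def transfer_days(size_bytes: float, link_bits_per_second: float) -> float:
    """Days needed to move size_bytes over a link sustaining the given rate."""
    seconds = size_bytes * 8 / link_bits_per_second  # bytes -> bits, then divide by rate
    return seconds / 86_400  # 86,400 seconds per day


# 2.5 PB, the projected TCGA total, using decimal petabytes (an assumption).
TCGA_BYTES = 2.5e15

# Hypothetical sustained link speeds, chosen only for illustration.
ASSUMED_LINKS = {
    "100 Mbit/s": 100e6,
    "1 Gbit/s": 1e9,
    "10 Gbit/s": 10e9,
}

for label, rate in ASSUMED_LINKS.items():
    print(f"{label}: about {transfer_days(TCGA_BYTES, rate):,.0f} days")
```

Under these assumptions, the full data set would take roughly 231 days over a sustained 1 Gbit/s link and about 23 days at 10 Gbit/s, which is consistent with the "weeks or months" estimate above.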

Meeting the Big Data Challenge: NCI Genomic Data Commons

The NCI Center for Cancer Genomics (CCG) was established to lead the NCI effort in generating critical datasets for cataloging the alterations seen in human tumors, coordinating data unification and sharing efforts, and supporting the development of analytical tools and computational approaches aimed at improving the understanding of large-scale, multidimensional data. The CCG supports several large-scale cancer genome research programs, including The Cancer Genome Atlas (TCGA) and programs run by the Office of Cancer Genomics (OCG). OCG runs two initiatives supporting the molecular characterization of cancer: the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) initiative and the Cancer Genome Characterization Initiative (CGCI).

The GDC is working with the Cloud Pilots to implement a comprehensive ecosystem for cancer genomics data that will serve as a cohesive model for large-scale genomic data management and analysis. The GDC will provide the Cloud Pilots with an authoritative NCI reference data set that they can access for high-performance computing.

Read more about the GDC

Cloud Pilots and the Genomic Data Commons: A Comprehensive Infrastructure for Cancer Genomic Data

The Cloud Pilots and the GDC have complementary goals, and the implementation teams are working together to promote interoperability among these systems.

Together, the systems create a cohesive model for large-scale genomic data management and analysis:

  • Data are generated through TCGA and other NCI-funded genomics programs.
  • Data are validated, aggregated, harmonized, stored, and made available for query and download through the GDC as the authoritative NCI cancer genomics dataset.
  • The Cloud Pilots provide the computational capacity to effectively analyze these data and allow researchers to bring their own data and tools to the cloud.

Benefits of the Programs

Together, the Cloud Pilots and the GDC provide the research community with many significant benefits:

  • Democratize access to high-quality standardized clinical, biospecimen, and molecular data
  • Enable researchers across the cancer community to access tools and to compute on large volumes of data, regardless of local resource constraints
  • Provide capabilities to search, visualize, and analyze researchers' own data in combination with TCGA data
  • Provide consistent, programmatic access to the data and the ability for researchers to bring their own tools to the data
  • Harmonize data and analysis pipelines for consistency and reproducibility
  • Ensure security and appropriate access to controlled data