ENCODE Pilot Project at UCSC

Genome Browser

Regions (hg18)

Regions (hg17)

Released Data

Downloads

Contributors

About the UCSC ENCODE pilot phase repository

The University of California Santa Cruz (UCSC) manages the official repository of sequence-related data submitted during the pilot phase (2003-2007) of the ENCODE project at NHGRI. Data analysis during the pilot phase was coordinated by the Ensembl group, a joint project of EBI and the Wellcome Trust Sanger Institute.

The data collected during the pilot phase can be accessed from the links on this page. To view the ENCODE data on the NCBI Build 36 human assembly (March 2006, hg18), click the Regions (hg18) link. To view data on the previous ENCODE reference assembly (NCBI Build 35, May 2004, hg17), which was the basis for the June 2007 publications in Nature and Genome Research, click the Regions (hg17) link.

ENCODE data submissions released by UCSC during the pilot phase are listed on the released data page. Primary ENCODE data are available from the NCBI GEO and EBI ArrayExpress public array data repositories. Ensembl provides an ENCODE resource page, and NHGRI provides the ENCODEdb portal.

We thank NHGRI and those who have contributed annotations and analyses to this project. This portal is maintained by the UCSC Genome Bioinformatics Group, a cross-departmental team within the Center for Biomolecular Science and Engineering (CBSE) at UCSC. Click the Contributors link for a complete list of ENCODE pilot phase data providers and the UCSC staff who develop and maintain this website.

The ENCODE project has now expanded to cover the full human genome. UCSC is the official Data Coordination Center for this endeavor. The new data are accessible on the most recent human assemblies in the Genome Browser, along with the sequence and annotation data for a large collection of genome assemblies.

About the ENCODE project pilot phase

Following the release of the completed human genome sequence in April 2003, the scientific community intensified its efforts to mine the data for clues about how the body works in health and in disease. A basic requirement for this understanding of human biology is the ability to identify and characterize sequence-based functional elements through experimentation and computational analysis. In September 2003, the NHGRI introduced the ENCODE project to facilitate the identification and analysis of the complete set of functional elements in the human genome sequence. During the initial pilot and technology development phases of the project, 44 regions (approximately 1% of the human genome) were targeted for analysis using a variety of experimental and computational methods. The aim was to assemble a comprehensive encyclopedia of the functional elements in these regions, showing their identity and precise location. The pilot project established protocols for scaling up to full-genome coverage and produced a wealth of data, elucidating elements such as protein-coding genes, transcription units, protein binding sites, conserved DNA elements, features of chromatin assembly and modification, and single nucleotide polymorphisms.

During the pilot phase, UCSC collected, processed, and released more than 500 ENCODE data sets representing a broad range of experimental methods and diverse tissues and cell lines. In addition to the two designated ENCODE cell lines, HeLa cervical carcinoma and GM06990 lymphoblastoid, more than 40 cell types are represented. A substantial proportion of the data is the product of chromatin immunoprecipitation (ChIP-chip) experiments used to determine binding sites for transcription factors. Eight groups have produced ChIP-chip data from four microarray platforms, investigating more than two dozen transcription factors and histone modifications. Several experimental groups have provided time course data and data from varied cell treatments. Other notable experimental data include localization of RNA transcription starts, identification of regions of DNaseI hypersensitivity, and temporal profiling of DNA replication.

Accompanying the ENCODE experimental data, UCSC also hosts the ENCODE high-quality gene set, provided by the Gencode project, and a variety of computationally derived annotations, including gene predictions from the ENCODE Gene Annotation Assessment Project (EGASP), pseudogene annotations from four projects, and RNA secondary structure predictions from two contributors. The comparative genomics tracks include multiple alignments of 28 vertebrate species in the ENCODE regions, produced with three sequence alignment methods and four different conservation algorithms. The Genome Browser provides a full set of genome-wide comparative genomics tracks that complement the ENCODE tracks, including a genome-wide multiple alignment covering nearly 30 vertebrate species.

You can find more information about the ENCODE pilot phase at UCSC in the news archives.

Conditions of Use

The sequence and annotation data displayed in the Genome Browser are freely available for academic, nonprofit, and personal use under the following conditions:

The general Conditions of Use for the UCSC Genome Browser apply.
The ENCODE Consortium Data Release Policy (2003-2007) applies to all pilot phase ENCODE data.