Re-processing of the data generated by the FANTOM5 project (hg38 v5)
===

All the data produced by the FANTOM5 project were originally processed on hg19 and mm9 for human and mouse, respectively. Following the recent update of the genome assembly and related information, we have reprocessed the FANTOM5 data here.

- target genome: hg38 (the raw set of hg38 sequences, except for alt, random, Un.)
- inquiries: fantom-help@gsc.riken.jp
- original data: http://fantom.gsc.riken.jp/5/datafiles/phase2.0

Special note on "analysis set"
---
Note that we are separately working on the hg38 "analysis set" (ftp://ftp.ncbi.nlm.nih.gov/genbank/genomes/Eukaryotes/vertebrates_mammals/Homo_sapiens/GRCh38/seqs_for_alignment_pipelines/), in which several duplicated regions are hard-masked and decoy sequences are included. For special analyses requiring such data, we advise waiting a few months for the reprocessing based on the "analysis set".

Updates
---
* Sep 18, 2015 initial release
* Jul 22, 2016 version 2 release
    * Added expression tables (count and TPM) based on remapped CAGE tags
    * Fixed several CAGE peak regions affected by issues in lift-over
* Apr 14, 2017 version 3 release
    * Added CAGE peaks newly identified by DPI clustering on the latest genome
* Aug 28, 2017 version 4 release
    * Revised sample metadata
    * Revised annotation/expression tables
    * Added DPI clustering result files
* Oct 27, 2017
    * Restructured CAGE_peaks directory
* Sep 18, 2018 version 5 release
    * Added reprocessed enhancer data

Data types
---
- CAGE read alignment: the raw HeliScope reads are aligned by delve (http://fantom.gsc.riken.jp/software/). The resulting alignments are provided in BAM format ( *.bam ) and are indexed ( *.bai ).
- CTSS (CAGE tag starting site): 5'-ends of the CAGE read alignments with mapping quality above 20 and percent identity 85% are counted at 1bp resolution.
  Genomic coordinates are formatted as BED, and the counts are given in the score column.
- experimental meta data: *sdrf.txt is a tab-delimited flat file describing the experimental details for each sample.

Directory and file names
---
Data files are located under the directory names as ..

- Technology is either hCAGE (CAGE sequencing on the HeliScope single molecule sequencer) or LQhCAGE (Low Quantity hCAGE). For details on the protocols used, please see [http://fantom.gsc.riken.jp/5/sstar/Protocols].
- The biological category is one of primary_cell, cell line, timecourse, fractionation or tissue.
- Part of each file name represents the sample name. The sample name is percent-encoded and concatenated with , , , , and the data types described above.

Reference
---
- FANTOM5 main papers
    * Forrest ARR, et al. A promoter-level mammalian expression atlas. Nature 507: 462–470 (2014)
    * Andersson R, et al. An atlas of active enhancers across human cell types and tissues. Nature 507: 455–461 (2014)
    * Arner E, et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 347: 1010–1014 (2015). http://www.sciencemag.org/cgi/doi/10.1126/science.1259418
- Data descriptor
    * Abugessaisa I, et al. FANTOM5 CAGE profiles of human and mouse reprocessed for GRCh38 and GRCm38 genome assemblies. Sci Data 4: 170107 (2017)
    * Noguchi S, et al. FANTOM5 CAGE profiles of human and mouse samples. Sci Data 4: 170112 (2017)
- FANTOM5 databases / data resource:
    * Lizio M, et al. Gateways to the FANTOM5 promoter level mammalian expression atlas. Genome Biol 16: 22 (2015)
- HeliScopeCAGE:
    * Kanamori-Katayama M, et al. Unamplified cap analysis of gene expression on a single-molecule sequencer. Genome Res 21: 1150–1159 (2011)
    * Itoh M, et al. Automated workflow for preparation of cDNA for cap analysis of gene expression on a single molecule sequencer.
      PLoS One 7: e30809 (2012)
- BAM: https://samtools.github.io/hts-specs/SAMv1.pdf
- BED: https://genome.ucsc.edu/FAQ/FAQformat.html#format1
- SDRF: http://isatab.sourceforge.net/format.html
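
As an illustration of the CTSS files described under "Data types" and the percent-encoded sample names described under "Directory and file names", the following Python sketch reads CTSS records and decodes a sample name. It is not part of the FANTOM5 pipeline. It assumes a six-column, tab-delimited BED layout (chrom, start, end, name, score, strand) with the tag count in the score column; the function name `read_ctss`, the example records and the example sample name are hypothetical.

```python
import csv
import io
from urllib.parse import unquote


def read_ctss(handle):
    """Yield (chrom, start, end, strand, count) from a CTSS BED stream.

    Assumes six tab-delimited BED columns, with the CAGE tag count
    stored in the score (fifth) column.
    """
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and BED header/comment lines.
        if not row or row[0].startswith(("#", "track", "browser")):
            continue
        chrom, start, end, _name, score, strand = row[:6]
        yield chrom, int(start), int(end), strand, int(score)


# Made-up records standing in for a real CTSS file.
example = io.StringIO(
    "chr1\t100\t101\tchr1:100..101,+\t3\t+\n"
    "chr1\t250\t251\tchr1:250..251,-\t1\t-\n"
)
records = list(read_ctss(example))
total = sum(count for *_, count in records)
print(total)  # 4

# Decode a (hypothetical) percent-encoded sample name.
print(unquote("CD14%2b%20monocytes"))  # CD14+ monocytes
```

For a file on disk, the same function can be given an open text handle, for example `gzip.open(path, "rt")` if the file is gzip-compressed.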