Practical guidelines for the comprehensive analysis of ChIP-seq data #
Structured data

About: ChIP-seq
Author: Timothy Bailey
Date Published: 2013-11-14
Publisher: PLOS Computational Biology
URL: http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003326

Summary #

Introduction to ChIP-seq Technology #

ChIP-seq was first described in 2007. It uses NGS to sequence the genomic regions bound by specific proteins (DNA-binding enzymes, histones, chaperones, or nucleosomes).

The Analysis of ChIP-seq Data #

Sequencing Depth #

For human samples, about 20 million reads is adequate.

The Preseq package can estimate how much additional sequencing is worthwhile. ENCODE software tools compute the PCR bottleneck coefficient (PBC).
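
As a rough illustration of what the PBC measures, the sketch below assumes the commonly used ENCODE definition: the number of genomic positions covered by exactly one uniquely mapped read, divided by the number of distinct positions covered by at least one. The function name and the `(chrom, position, strand)` tuple layout are assumptions made for this example, not part of any ENCODE tool.

```python
from collections import Counter


def pcr_bottleneck_coefficient(read_positions):
    """Return PBC = N1 / Nd for an iterable of (chrom, position, strand) tuples.

    N1: positions hit by exactly one uniquely mapped read.
    Nd: distinct positions hit by at least one uniquely mapped read.
    """
    counts = Counter(read_positions)
    if not counts:
        raise ValueError("no reads supplied")
    n_distinct = len(counts)
    n_single = sum(1 for c in counts.values() if c == 1)
    return n_single / n_distinct


# Toy input: one position is duplicated, two positions are seen once.
reads = [
    ("chr1", 100, "+"),
    ("chr1", 100, "+"),
    ("chr1", 250, "-"),
    ("chr2", 40, "+"),
]
pbc = pcr_bottleneck_coefficient(reads)
print(f"PBC = {pbc:.3f}")
```

A value close to 1 suggests a complex library; values well below 1 suggest that many reads are PCR duplicates.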

Read Mapping and Quality Metrics #

After quality filtering, reads are mapped with tools such as Bowtie, BWA, SOAP, or MAQ. Uniquely mapped reads should make up at least 70% of the total.

Reads from repetitive DNA regions can map to multiple locations; paired-end sequencing can mitigate this. Multi-mapping reads are usually excluded by peak-calling algorithms.

After mapping, the signal-to-noise ratio (SNR) should be checked, using tools such as CHANCE. Strand cross-correlation or IP enrichment estimation is examined; these values help detect the following ChIP-seq failures:

Insufficient enrichment by IP step
Poor fragment-size selection
Insufficient sequencing depth

Strand cross-correlation analysis can also be performed by peak callers such as SPP and MACS.
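
The idea behind strand cross-correlation can be sketched as follows: reverse-strand read starts tend to lie downstream of forward-strand read starts by roughly the fragment length, so the Pearson correlation between the two per-base count profiles peaks at that shift. This is a minimal toy version on small arrays, not the implementation used by SPP or MACS; the function name and inputs are assumptions for illustration.

```python
import numpy as np


def strand_cross_correlation(fwd, rev, max_shift):
    """Pearson correlation between forward-strand counts at position i and
    reverse-strand counts at position i + k, for k = 0..max_shift."""
    fwd = np.asarray(fwd, dtype=float)
    rev = np.asarray(rev, dtype=float)
    scores = []
    for k in range(max_shift + 1):
        a = fwd[: len(fwd) - k]  # forward counts at i
        b = rev[k:]              # reverse counts at i + k
        scores.append(np.corrcoef(a, b)[0, 1])
    return np.array(scores)


# Toy profiles: three binding sites, reverse-strand signal offset by 5 bp.
fwd = np.zeros(60)
rev = np.zeros(60)
for site in (10, 30, 45):
    fwd[site] = 5
    rev[site + 5] = 5

cc = strand_cross_correlation(fwd, rev, max_shift=10)
best_shift = int(np.argmax(cc))
print("estimated shift:", best_shift)
```

In a successful experiment the correlation at the fragment-length shift clearly exceeds the background level; a flat profile points to weak enrichment.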

Peak Calling #

The balance between sensitivity and specificity depends on choosing a peak-calling algorithm and a normalization method that suit the type of protein being ChIPed:

Point-source factors such as most TFs
Broadly enriched factors such as histone marks
Factors with both characteristics, such as RNA Pol II

Using a control sample is recommended to estimate the level of non-specific or background binding.
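
To make the role of the control concrete, here is a deliberately simplified sketch: the control count in a window is scaled to the ChIP library size and used as the expected value of a Poisson distribution, and the upper-tail probability of the observed ChIP count is reported. Actual peak callers use more elaborate background models; the function names, the pseudocount, and the single-window setup are all assumptions of this example.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Note: very small tail probabilities lose precision with this approach.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_enrichment_pvalue(chip_count, control_count,
                             chip_total, control_total, pseudocount=1.0):
    """Tail probability of the ChIP count given the depth-scaled control count."""
    scale = chip_total / control_total
    expected = (control_count + pseudocount) * scale
    return poisson_upper_tail(chip_count, expected)


# Two windows with the same control signal but different ChIP signal.
p_enriched = window_enrichment_pvalue(30, 5, 20_000_000, 20_000_000)
p_background = window_enrichment_pvalue(6, 5, 20_000_000, 20_000_000)
print(p_enriched, p_background)
```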

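Programs such as SIPeS are better optimized for paired-end sequencing.

Parameters such as the p-value or FDR threshold are strongly affected by the statistical model used, the sequencing depth, and the number of true binding sites. Using the same parameter values therefore does not guarantee that libraries are comparable.

A better approach is to threshold on the IDR (irreproducible discovery rate). Together with motif analysis, this helps in choosing the best peak-calling algorithm and in setting its parameters.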

Box 6. Normalization #

There are linear and non-linear normalization methods available to make the two samples "comparable".

Sequencing depth normalization (using total reads)
PeakSeq uses a scale factor estimated by linear regression over regions of ~10 kb
RPKM (reads per kilobase of sequence range per million mapped reads)
Non-linear normalization with LOESS (locally weighted regression); a modified version is implemented in MAnorm
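
The two simplest linear schemes in the list above can be written out directly. The sketch below computes a depth scale factor from total reads and the RPKM of a region; the function names and the example numbers are made up for illustration.

```python
def depth_scale_factor(total_reads_a, total_reads_b):
    """Factor by which sample B's counts are multiplied to match sample A's depth."""
    return total_reads_a / total_reads_b


def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


# 200 reads in a 2 kb region, from a library of 20 million mapped reads.
value = rpkm(200, 2_000, 20_000_000)
scale = depth_scale_factor(20_000_000, 10_000_000)
print(value, scale)
```

Both are global corrections: they assume that the difference between samples is fully explained by library size (and region length), which is why the non-linear alternatives exist.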

The R package POLYPHEMUS offers two normalization methods:

The non-linear method described above
Quantile normalization, which makes the distribution the same across samples
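
Quantile normalization itself is easy to sketch: each sample's values are replaced by the mean of the values that share the same rank across samples. The version below is a generic NumPy illustration (ties are broken arbitrarily by sort order) and is not taken from POLYPHEMUS; the function name and the toy matrix are assumptions.

```python
import numpy as np


def quantile_normalize(matrix):
    """Quantile-normalize a (regions x samples) matrix column by column."""
    x = np.asarray(matrix, dtype=float)
    order = np.argsort(x, axis=0)                  # row indices in ascending order, per sample
    mean_quantiles = np.sort(x, axis=0).mean(axis=1)  # mean value at each rank
    result = np.empty_like(x)
    for j in range(x.shape[1]):
        # The row holding rank r in sample j receives the r-th mean quantile.
        result[order[:, j], j] = mean_quantiles
    return result


counts = [[5, 4],
          [2, 1],
          [3, 6]]
normalized = quantile_normalize(counts)
print(normalized)
```

After normalization every column contains the same set of values, so only the ranking of regions within each sample differs.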

Normalization issues are, at present, not fully exploited although they might have a substantial impact on the results.

Assessment of Reproducibility #

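Replicate experiments are needed to check reproducibility. It is measured as the PCC (Pearson correlation coefficient) of the read counts in each genomic region. The PCC typically ranges from 0.3-0.4 (unrelated samples) to >0.9 (replicate samples).

Because the PCC is dominated by highly enriched regions, reproducibility in less enriched regions may be low. High signal in centromeres, telomeres, satellite repeats, and ENCODE and 1000 Genomes blacklisted regions should therefore be removed before computing the PCC.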
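
The effect of a single artifactual region on the PCC can be shown with a small example. The sketch assumes per-bin read counts for two samples and a boolean mask marking blacklisted bins; the function name and all numbers are invented for illustration.

```python
import numpy as np


def replicate_pcc(counts_a, counts_b, blacklisted):
    """Pearson correlation of per-bin read counts, ignoring blacklisted bins."""
    a = np.asarray(counts_a, dtype=float)
    b = np.asarray(counts_b, dtype=float)
    keep = ~np.asarray(blacklisted, dtype=bool)
    return float(np.corrcoef(a[keep], b[keep])[0, 1])


# Six bins; the fifth is an artifact with very high signal in both samples.
sample_1 = [10, 2, 8, 1, 5000, 4]
sample_2 = [1, 9, 2, 10, 5200, 6]
artifact = [False, False, False, False, True, False]

pcc_all = replicate_pcc(sample_1, sample_2, [False] * 6)
pcc_clean = replicate_pcc(sample_1, sample_2, artifact)
print(f"with artifact: {pcc_all:.3f}, without: {pcc_clean:.3f}")
```

With the artifact bin included the two samples look almost perfectly correlated; once it is masked, the remaining bins turn out to disagree.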

IDR analysis of the peaks from replicate experiments reveals their rank consistency.

Differential Binding Analysis #

A simple binary overlap of peak sets computed with bedtools is not an adequate way to compare peaks.

Two alternatives:

qualitative - implements hypothesis testing on multiple overlapping sets of peaks
quantitative - differential binding analysis between conditions based on the total counts of reads in peak regions or on the read densities.

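Comparing two samples directly without a control is not recommended, because of artefacts or differing chromatin structure.

To increase the sensitivity for detecting differentially bound regions, use a more lenient threshold.

A qualitative analysis that compares peak positions alone is not sufficient. Proper statistics (p-value, q-value) should be obtained with programs such as DBChIP, which uses read counts, or MAnorm, which uses read densities.

DIME assumes that the proportion of significant peaks is constant across the comparison, whereas MAnorm assumes that peaks common to both conditions do not change significantly.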
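
The assumption about common peaks can be turned into a normalization step, sketched below under simplifying assumptions: M (log2 ratio) and A (mean log2 intensity) are computed per peak, a straight line is fitted to the common peaks with ordinary least squares, and the fitted trend is subtracted from all peaks. This is an illustration of the idea only, not MAnorm's actual implementation; the function name, the pseudocount, and the toy counts are assumptions.

```python
import numpy as np


def ma_normalize(counts_1, counts_2, is_common, pseudocount=1.0):
    """Return (normalized M, A) after removing the M~A trend of common peaks."""
    c1 = np.asarray(counts_1, dtype=float) + pseudocount
    c2 = np.asarray(counts_2, dtype=float) + pseudocount
    common = np.asarray(is_common, dtype=bool)
    m = np.log2(c1 / c2)           # log2 fold change between conditions
    a = 0.5 * np.log2(c1 * c2)     # average log2 signal
    slope, intercept = np.polyfit(a[common], m[common], deg=1)
    return m - (slope * a + intercept), a


# Condition 1 was sequenced twice as deep; only the last peak truly differs.
cond_1 = [20, 40, 80, 160, 240]
cond_2 = [10, 20, 40, 80, 30]
common = [True, True, True, True, False]

m_norm, a_vals = ma_normalize(cond_1, cond_2, common, pseudocount=0.0)
print(np.round(m_norm, 3))
```

The common peaks end up with a normalized M of about 0, while the last peak keeps a residual log2 fold change of about 2 that is not explained by sequencing depth.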

Some programs perform better for particular ChIPed proteins: ChIPDiff works better for histone marks, and POLYPHEMUS for RNA Pol II.

Peak Annotation #

The aim of the annotation is to associate the ChIP-seq peaks with functionally relevant genomic regions, such as gene promoters, transcription start sites, intergenic regions, etc.

Peaks are stored in BED or GFF format; normalized read coverage is stored in WIG or bedGraph format.

bedtools can compute the distance from each peak to nearby landmarks such as TSSs, or identify the genes within a given distance. The results of this "location analysis" can be examined with CEAS or ChIPpeakAnno (a Bioconductor package), which can also correlate them with expression data or run a Gene Ontology analysis directly.
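
The core of such a location analysis is a nearest-landmark lookup. The sketch below handles one chromosome, takes a non-empty sorted list of TSS coordinates, and ignores strand; the function name and the coordinates are hypothetical, and dedicated tools handle chromosomes, strands, and intervals properly.

```python
import bisect


def nearest_tss(peak_summit, sorted_tss):
    """Return (nearest TSS, signed distance) for a summit on one chromosome.

    sorted_tss must be a non-empty list sorted in ascending order.
    A negative distance means the summit lies before the TSS in coordinates.
    """
    i = bisect.bisect_left(sorted_tss, peak_summit)
    # Only the neighbors on either side of the insertion point can be nearest.
    candidates = sorted_tss[max(i - 1, 0): i + 1]
    nearest = min(candidates, key=lambda t: abs(t - peak_summit))
    return nearest, peak_summit - nearest


tss_positions = [1_000, 5_000, 12_000]
for summit in (4_800, 13_000, 500):
    tss, distance = nearest_tss(summit, tss_positions)
    print(f"summit {summit}: nearest TSS {tss}, distance {distance}")
```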

Programs such as DAVID, GREAT, and GSEA can indicate whether the peaks are associated with specific biological processes.

Motif Analysis #

Motif analysis is useful beyond simply identifying the DNA-binding motif. If the motif is already known, it can also be used to confirm that the experiment succeeded.

Motif analysis is also useful with Histone modification ChIP-seq because it can discover unanticipated sequence signals associated with such marks.

Binding-site sequences are extracted as FASTA, aligned, and analyzed for motifs with MEME-ChIP or peak-motifs. The discovered motifs are then compared with other known motifs.
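
The first step, turning peaks into FASTA records, can be sketched as below. The example assumes the genome is available as an in-memory dictionary of chromosome sequences and that each peak is given as a `(chrom, summit)` pair with 0-based coordinates; both are simplifications for illustration, since real genomes are normally read from an indexed FASTA file.

```python
def peaks_to_fasta(peaks, genome, flank=50):
    """Build FASTA text with the sequence around each peak summit.

    Headers use 0-based, half-open coordinates (chrom:start-end), as in BED.
    Windows are clipped at the chromosome ends.
    """
    records = []
    for chrom, summit in peaks:
        seq = genome[chrom]
        start = max(0, summit - flank)
        end = min(len(seq), summit + flank + 1)
        records.append(f">{chrom}:{start}-{end}\n{seq[start:end]}\n")
    return "".join(records)


toy_genome = {"chrT": "ACGTACGTACGTACGTACGT"}
fasta_text = peaks_to_fasta([("chrT", 10)], toy_genome, flank=3)
print(fasta_text, end="")
```

The resulting text can be written to a file and passed to a motif discovery tool; windows centered on summits are usually short, because the motif is expected close to the point of maximal enrichment.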