Methods of integrating data to uncover genotype–phenotype interactions #
Structured data

About: Genotype and phenotype
Author: Marylyn D. Ritchie; Dokyoon Kim
Date Published: 2015-01-13
Publisher: Nature Reviews Genetics
URL: http://www.ncbi.nlm.nih.gov/pubmed/25582081

Key points #

Introduction #

Studies based on a single data type are limited in what they can reveal about complex traits. Two integrative strategies address this:

meta-dimensional analysis
multi-staged analysis (systems genomics approach)

Why integrate data? #

Interpreting biological complexity requires integrative tools that span data types. In breast cancer, for example, CNVs, common and rare SNVs, DNA methylation, and gene expression changes act together.

Challenges with individual data sets #

Before individual data sets are integrated, data quality, data scale or dimensionality, and potential confounding need to be considered for each of them.

Quality assurance and quality control #

QC pipelines are already established for genome-wide SNP arrays, for example:

eMERGE (Electronic Medical Records and Genomics)
GENEVA (Gene Environment Association Studies)

Data reduction #

Data can be reduced by filtering and data mining, using one of the following approaches.

extrinsic: use prior knowledge, for example with Biofilter
- e.g.) analyse gene expression only for genes related to the immune system
intrinsic: ReliefF, chi-square test, PCA, factor analysis, genetic algorithm
- e.g.) reduce the number of SNPs according to LD patterns
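As an illustration of an intrinsic filter, the sketch below ranks SNPs by a Pearson chi-square statistic against case/control status and keeps the top k. The SNP names, the toy genotypes, and the helper names `chi_square` and `top_k_snps` are hypothetical; a real pipeline would also compute p-values and correct for multiple testing.

```python
from collections import Counter

def chi_square(genotypes, labels):
    """Pearson chi-square statistic for a genotype-by-label contingency table."""
    n = len(genotypes)
    observed = Counter(zip(genotypes, labels))
    row_totals = Counter(genotypes)
    col_totals = Counter(labels)
    stat = 0.0
    for g in row_totals:
        for y in col_totals:
            expected = row_totals[g] * col_totals[y] / n
            stat += (observed[(g, y)] - expected) ** 2 / expected
    return stat

def top_k_snps(snp_columns, labels, k):
    """Rank SNPs by chi-square statistic and keep the k highest."""
    ranked = sorted(
        snp_columns,
        key=lambda name: chi_square(snp_columns[name], labels),
        reverse=True,
    )
    return ranked[:k]

# Toy data (hypothetical): genotypes coded 0/1/2, labels 1 = case, 0 = control.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
snps = {
    "snp_a": [2, 2, 1, 2, 0, 0, 1, 0],  # tracks case status
    "snp_b": [0, 1, 2, 0, 0, 1, 2, 0],  # same genotype distribution in both groups
    "snp_c": [1, 1, 1, 1, 1, 1, 1, 1],  # monomorphic, carries no signal
}
selected = top_k_snps(snps, labels, k=1)
print(selected)
```

Only `snp_a` differs between cases and controls in this toy set, so it is the one feature retained.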

Two molecular variability hypotheses have been proposed to explain complex traits:

Hypothesis A: DNA leads to variation in RNA and so on...
Hypothesis B: Combination of variation across all possible omic levels

For example, CNV, DNA methylation, and microRNA data can be combined and then reduced with ReliefF.

Confounding #

Confounding occurs when the effects of two or more factors are mixed so that the individual effects cannot be separated. Genetic, environmental, demographic, and other technical factors can all be confounded.

An overview of data integration #

Multi-staged analysis

Approach	Methods	Software, tools
Genomic variation analysis	eQTL, mQTL and causal analysis	Matrix eQTL and QTDT
Genomic variation analysis	Allele specific expression	AlleleSeq and ChIP-SNP
Domain knowledge guided analysis	Correlation and mapping variation to pathway	ANNOVAR, HaploReg, RegulomeDB

Meta-dimensional analysis

Approach	Methods	Software, tools
Concatenation-based	Grammatical evolution neural network	ATHENA
Concatenation-based	Bayesian network	WinBUGS
Concatenation-based	Multivariate Cox LASSO model	Glmpath
Transformation-based	Kernel-based	SKMsmo
Transformation-based	Graph-based semi-supervised learning	Graph-based semi-supervised learning
Model-based	Majority voting	ipred
Model-based	Ensemble classifier	Weka 3

Data integration: multi-stage analysis #

First find associations between the data types, then find associations between those data types and the phenotype of interest.

Genomic variation analysis approaches #

Three-stage or triangle method

Test SNPs for association with the phenotype, filtering on a genome-wide significance threshold
Test the significant SNPs for association with another level of omic data (eQTL, mQTL, pQTL)
Compare the results of step 2 with the phenotype of interest
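The three stages can be sketched as successive filters. This is a minimal illustration, not the procedure of any specific tool: it substitutes an absolute Pearson correlation with an arbitrary cutoff for a proper genome-wide significance test, and the names `triangle`, `cutoff`, `snp_1`, `gene_x` and the toy values are all hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)

def triangle(snps, expression, phenotype, cutoff=0.8):
    """Return (SNP, gene) pairs that survive all three stages."""
    # Stage 1: SNPs associated with the phenotype.
    stage1 = [s for s, g in snps.items() if abs(pearson(g, phenotype)) >= cutoff]
    # Stage 2: surviving SNPs associated with another omic level (here expression).
    stage2 = [
        (s, gene)
        for s in stage1
        for gene, e in expression.items()
        if abs(pearson(snps[s], e)) >= cutoff
    ]
    # Stage 3: the omic features from stage 2 associated with the phenotype.
    return [
        (s, gene)
        for s, gene in stage2
        if abs(pearson(expression[gene], phenotype)) >= cutoff
    ]

# Toy data (hypothetical) for six samples.
phenotype = [1.0, 1.2, 2.1, 2.0, 3.1, 2.9]
snps = {"snp_1": [0, 0, 1, 1, 2, 2], "snp_2": [1, 0, 2, 0, 1, 2]}
expression = {
    "gene_x": [5.0, 5.1, 6.0, 6.2, 7.1, 6.9],
    "gene_y": [3.0, 1.0, 2.0, 3.0, 1.0, 2.0],
}
hits = triangle(snps, expression, phenotype)
print(hits)
```

Each stage shrinks the candidate set, which is why the method only reports SNP and omic-feature pairs supported by all three edges of the triangle.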

Allele-specific expression approaches #

In a diploid genome, one allele can be expressed more strongly than the other. This is related to epigenetics.
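One common way to flag such allelic imbalance at a heterozygous site is to compare RNA-seq read counts for the two alleles against an expected 50:50 split. The sketch below uses an exact two-sided binomial test for that purpose. It is an assumption chosen for illustration, not the method of AlleleSeq or ChIP-SNP, and the read counts are invented.

```python
from math import comb

def allelic_imbalance_p(ref_reads, alt_reads):
    """Exact two-sided binomial p-value against equal expression of both alleles."""
    n = ref_reads + alt_reads
    # Probability of every possible reference-read count under p = 0.5.
    probs = [comb(n, k) * 0.5 ** n for k in range(n + 1)]
    observed = probs[ref_reads]
    # Sum all outcomes that are at most as likely as the observed one.
    p_value = sum(p for p in probs if p <= observed * (1 + 1e-9))
    return min(1.0, p_value)

# Hypothetical read counts at two heterozygous sites.
balanced = allelic_imbalance_p(10, 10)
skewed = allelic_imbalance_p(18, 2)
print(balanced, skewed)
```

A small p-value for the skewed site suggests allele-specific expression, although real analyses also have to account for reference mapping bias and multiple testing.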

Domain knowledge-guided approaches #

Results are interpreted by integrating them with resources such as ENCODE and KEGG. This biases the findings toward current knowledge.

Data integration: meta-dimensional analysis #

Suppose that a SNP matrix, a gene expression matrix, and a miRNA expression matrix are available for each patient. They can be integrated in the following ways.

Concatenation-based integration #

The matrices are merged into one large matrix. Examples:

Fridley et al.: joint relationship of mRNA gene expression and SNP genotype -> model by Bayesian integrative model -> predict a quantitative phenotype (e.g., drug cytotoxicity)
Mankoo et al.: Ovarian cancer CNV, DNA methylation, miRNA, Gene expression -> Multivariate Cox LASSO -> predict recurrence, survival

The open problem is how the data should be combined. For example, the data types take values on different scales:

SNP: 0, 1, 2
CNV: -2, -1, 0, 1, 2
Methylation: 0, 1
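One possible way to handle the differing scales, shown here only as an assumption and not as the approach of the cited studies, is to standardise every feature within its own data type before joining the blocks column-wise. The helper names and toy values are hypothetical, and methylation is assumed to be a continuous value between 0 and 1.

```python
from statistics import mean, pstdev

def zscore_columns(matrix):
    """Standardise each column of a samples-by-features matrix."""
    scaled_columns = []
    for col in zip(*matrix):
        mu, sd = mean(col), pstdev(col)
        # Constant columns carry no information and are mapped to 0.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in col])
    return [list(row) for row in zip(*scaled_columns)]

def concatenate(*blocks):
    """Join per-sample rows from several data types into one wide matrix."""
    return [sum((list(row) for row in rows), []) for rows in zip(*blocks)]

# Toy data (hypothetical): three samples, rows aligned across data types.
snp = [[0, 2], [1, 1], [2, 0]]          # genotypes coded 0, 1, 2
cnv = [[-2], [0], [2]]                  # copy-number states from -2 to 2
methylation = [[0.0], [0.5], [1.0]]     # assumed continuous, 0 to 1

combined = concatenate(*(zscore_columns(b) for b in (snp, cnv, methylation)))
for row in combined:
    print([round(v, 3) for v in row])
```

After scaling, the first SNP column and the CNV column become identical, which shows that the original difference between them was only one of units, not of information.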

Transformation-based integration #

An intermediate representation of the data is built, which can be a graph or a kernel matrix. Examples:

Lanckriet et al. (kernel-based integration): Amino acid sequence, hydropathy profile, Gene expression, Protein-protein interaction -> Protein function prediction
Tsuda et al. (graph-based semi-supervised learning): Protein function prediction
Kim et al. (graph-based integration): CNV, methylation, miRNA, Gene expression -> Cancer clinical outcomes prediction
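The kernel route can be sketched as follows: each data type is turned into a sample-by-sample similarity matrix, and the matrices are then combined. This minimal version uses a linear kernel, cosine normalisation, and an unweighted average. It is a simplification: the kernel-based studies cited above learn the kernel weights rather than fixing them, and the function names and toy values here are hypothetical.

```python
from math import sqrt

def linear_kernel(matrix):
    """Sample-by-sample matrix of dot products between feature rows."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]

def normalise(kernel):
    """Scale a kernel so every diagonal entry equals 1 (cosine normalisation).

    Assumes no sample has an all-zero feature row.
    """
    n = len(kernel)
    diag = [sqrt(kernel[i][i]) for i in range(n)]
    return [[kernel[i][j] / (diag[i] * diag[j]) for j in range(n)] for i in range(n)]

def combine(kernels):
    """Unweighted average of kernels defined on the same samples."""
    n = len(kernels[0])
    return [
        [sum(k[i][j] for k in kernels) / len(kernels) for j in range(n)]
        for i in range(n)
    ]

# Toy data (hypothetical): three samples, two data types with different scales.
expression = [[1.0, 2.0], [2.0, 4.0], [3.0, -1.0]]
methylation = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]

fused = combine([normalise(linear_kernel(m)) for m in (expression, methylation)])
for row in fused:
    print([round(v, 3) for v in row])
```

Because every data type ends up as a matrix of the same shape, the original feature scales no longer need to match, which is the main attraction of transformation-based integration.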

Model-based integration #

Each data type is used as a training set to build its own model, and a final model is then built from these models. Examples:

ATHENA (Analysis Tools for Heritable and Environmental Network Associations): CNV, Gene expression, methylation, miRNA -> Each Neural network models -> final integrative model -> Ovarian cancer survival prediction
Majority voting approach: drug resistance prediction
Bayesian network: Gene expression, metabolomic data, SNP genotype -> probabilistic causal network

Overfitting is a concern. This approach is appropriate only when concatenation-based and transformation-based integration are not possible.
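The majority voting idea can be sketched in a few lines: each data type has its own trained model, and the final call for a sample is the label most of those models agree on. The per-model predictions below are invented placeholders, and `majority_vote` is a hypothetical helper rather than the `ipred` interface.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-sample class labels from several models by majority vote.

    With an odd number of models and two classes there are no ties; otherwise
    the label seen first among the tied ones is returned.
    """
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical predictions (1 = resistant, 0 = sensitive) for four samples,
# one list per model trained on a single data type.
predictions = {
    "cnv": [1, 0, 1, 0],
    "expression": [1, 1, 0, 0],
    "methylation": [0, 1, 1, 0],
}
final = majority_vote(predictions.values())
print(final)
```

No single model is correct on its own for the first three samples, yet the combined vote is the same for all of them, which is the behaviour model-based integration relies on.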

Caveats and limitations #

Replication #

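In large-scale data analysis, it is important to be able to identify false positives (type I errors).

In genomic studies, replication can be checked by asking whether a variant at the same locus produces the same phenotype. This also has problems, however: GWAS SNPs are tag SNPs.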

Validation #

Findings can be confirmed with additional complementary or orthogonal experiments. Surveying the literature by text mining is another option. The GRAIL (Gene Relationships Among Implicated Loci) tool is also commonly used.

Crooke et al. (2006) predicted breast cancer risk using a theoretical pathway of oestrogen metabolism, a statistical model of gene-gene interactions, kinetic experiments, and a differential equation model.

Correlated variables #

Correlation between different data types is important material for interpreting the data. However...

Overfitting #

Ways to reduce overfitting:

cross-validation
receiver operating characteristic (ROC) curves and area under the curve (AUC)
Pareto optimization: find fittest model

These methods make it possible to develop models that balance sensitivity and specificity.
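Two of these checks are easy to sketch without any library: splitting samples into k folds for cross-validation, and computing the AUC from predicted scores. The fold splitter below uses contiguous folds for clarity (real analyses usually shuffle or stratify first), and the scores and labels are invented.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size

def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Seven samples split into three folds: every sample is tested exactly once.
folds = list(k_fold_indices(7, 3))
for train, test in folds:
    print("train", train, "test", test)

# Hypothetical held-out scores and true labels (1 = case, 0 = control).
example_auc = auc([0.9, 0.8, 0.35, 0.4, 0.1], [1, 1, 1, 0, 0])
print(round(example_auc, 3))
```

Evaluating the AUC only on the held-out fold, never on the training samples, is what makes the estimate sensitive to overfitting.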

Future directions #

Novel questions will be asked about the complex interplay of different types of omic data using new statistical and machine-learning approaches as more researchers think ‘outside the box’.

Conclusions #

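New statistical and computational methods keep emerging for analysing the genomic factors behind the architecture of complex traits. Studying a single data type is not enough, and a systems genomics approach is important, but no gold standard has emerged yet. Many methods have been proposed, and each one has its limits when used alone. It is therefore important to choose a method that suits the data types at hand.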