Author: Gene Kim, MyungHo Kim Advisor: Dr. Hsu Graduate: Ching-Wen Hong

Application of Support Vector Machine to detect an association between a disease or trait and multiple SNP variations

Outline • 1. Motivation • 2. Objective • 3. What’s SNP (single nucleotide polymorphism) • 4. How to find SNP variations • 5. A review of Support Vector Machine • 6. A representation of multiple SNP variations as a vector • 7. The remarks • 8. Inseparable Case • 9. Test results with clinical data • 10. Personal opinion

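Motivation • Studying the differences in each person's single nucleotide polymorphisms (SNPs) can help identify disease-causing genes and even predict whether a drug will be effective for a given individual. This in turn makes it possible to design tailor-made drugs and has a great impact on the development of new drugs. SNP research is a major trend in the biotechnology industry of the post-genomic era.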

Objective • We present a method for detecting whether there is an association between multiple SNP variations and a trait or disease. • The method exploits the Support Vector Machine (SVM), which has been attracting a lot of attention recently.

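What’s SNP • What is an SNP (single nucleotide polymorphism)? • Although the chromosomes of individuals of the same species differ very little, on average one mutation occurs in every 1,000 base pairs. These variations are called SNPs, and they are a reason why people differ in drug sensitivity, blood type, height, and so on. SNPs are also associated with cancer, cardiovascular disease, autoimmune disease, and other conditions. In Taiwan, 賽亞基因 (Vita Genomics) and National Taiwan University Hospital are currently collaborating on hepatitis C SNP research, trying to identify patients' SNPs in order to predict whether a drug will be effective for a patient.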

What’s SNP • A genetic marker is a location M1, M2, … in the DNA. • The different variants of DNA that different people have at a marker are called alleles, denoted by 1, 2, 3, …. The number of alleles per marker is small: typically fewer than ten (for so-called microsatellite markers) or exactly two (for SNPs).

How to find SNP variations • The problem of determining whether a set of SNP variations causes a specific disease or trait can be formulated as follows. For a given disease or trait: • 1. For each set of SNP variations, find its representation as a vector in a Euclidean space (from haplotype data, clinical data, …; this is discussed under "A representation of multiple SNP variations as a vector"). • 2. Find a systematic way of distinguishing the SNP genotypes of normal people from those of people with the disease or trait. • We will use the Support Vector Machine (SVM) to separate SNP vectors into two groups (normal, sick).

A review of Support Vector Machine • What is an SVM? • A family of learning algorithms for classifying objects into two classes. • Input: a training set $\{(x_1,y_1),\dots,(x_l,y_l)\}$ of objects $x_i \in \mathbb{R}^n$ (an n-dimensional vector space) and their known classes $y_i \in \{-1,+1\}$. • Output: a classifier $f:\mathbb{R}^n \to \{-1,+1\}$ that predicts the class $f(x)$ for any (new) object $x \in \mathbb{R}^n$.
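For the linear SVMs reviewed on the following slides, the output classifier has the form f(x) = sign(w·x + b). A minimal sketch of that input/output contract is shown here; the helper name `linear_classifier` and the example hyperplane are illustrative assumptions, not part of the original slides.

```python
from typing import Callable, Sequence


def linear_classifier(w: Sequence[float], b: float) -> Callable[[Sequence[float]], int]:
    """Return f: R^n -> {-1, +1} defined by f(x) = sign(w . x + b)."""
    def f(x: Sequence[float]) -> int:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Points exactly on the hyperplane are assigned to +1 (a convention chosen here).
        return 1 if score >= 0 else -1
    return f


# Hypothetical 2-dimensional example: the hyperplane x1 + x2 = 1.
f = linear_classifier(w=[1.0, 1.0], b=-1.0)
print(f([2.0, 2.0]))  # 1
print(f([0.0, 0.0]))  # -1
```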

A review of Support Vector Machine • (1) Linear SVM for separable training sets: • A training set $S=\{(x_1,y_1),\dots,(x_l,y_l)\}$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1,+1\}$.

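A review of Support Vector Machine • The optimal hyperplane is defined by the pair $(w,b)$. • Solve the optimization problem • $\min_{w,b}\ \tfrac{1}{2}\|w\|^2$ • s.t. $y_i(x_i\cdot w+b)-1\ge 0$, $i=1,\dots,l$ • This is a classical quadratic (convex) program, not a linear program, because the objective is quadratic in $w$.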
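To make the constraints concrete, the sketch below checks whether a candidate pair (w, b) satisfies y_i(x_i·w + b) − 1 ≥ 0 on a small training set and evaluates the objective ½‖w‖². It does not solve the quadratic program. The toy data and the two candidates are hypothetical; the margin width 2/‖w‖ is the standard geometric reading of the objective.

```python
import math


def is_feasible(w, b, data, tol=1e-12):
    """Check the hard-margin constraints y_i (x_i . w + b) - 1 >= 0 for every example."""
    return all(
        y * (sum(wj * xj for wj, xj in zip(w, x)) + b) - 1 >= -tol
        for x, y in data
    )


def objective(w):
    """The quantity being minimized: (1/2) ||w||^2."""
    return 0.5 * sum(wj * wj for wj in w)


def margin_width(w):
    """Distance between the hyperplanes w.x + b = +1 and w.x + b = -1, i.e. 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical separable training set in R^2.
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]

candidate_a = ([0.5, 0.5], -1.0)  # (2,2) and (0,0) lie exactly on the margin
candidate_b = ([1.0, 1.0], -2.0)  # also feasible, but with a larger norm

for w, b in (candidate_a, candidate_b):
    print(is_feasible(w, b, data), objective(w), round(margin_width(w), 3))
```

Both candidates are feasible, and the one with the smaller ½‖w‖² has the wider margin, which is why the program minimizes the norm of w.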

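A review of Support Vector Machine • (2) Linear SVM for non-separable training sets • Solve the optimization problem • $\min_{w,b,\varepsilon}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}\varepsilon_i$, where $C$ is a very large constant • s.t. $y_i(x_i\cdot w+b)-1+\varepsilon_i\ge 0$, $\varepsilon_i\ge 0$, $i=1,\dots,l$ • In the dual problem, the Lagrange multipliers $\alpha_i$ satisfy $0\le\alpha_i\le C$.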
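For a fixed candidate (w, b), the smallest slack that satisfies both constraints is ε_i = max(0, 1 − y_i(x_i·w + b)). The sketch below computes these slacks and the resulting objective value. It evaluates the objective only and does not solve the program. The data set, the candidate hyperplane, and the value C = 10 are hypothetical.

```python
def slacks(w, b, data):
    """Smallest eps_i with y_i (x_i . w + b) - 1 + eps_i >= 0 and eps_i >= 0."""
    return [
        max(0.0, 1.0 - y * (sum(wj * xj for wj, xj in zip(w, x)) + b))
        for x, y in data
    ]


def soft_margin_objective(w, b, data, C):
    """(1/2) ||w||^2 + C * sum(eps_i), using the minimal slacks for this (w, b)."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks(w, b, data))


# Hypothetical non-separable set: the last point is labelled -1 but lies
# on the segment between two +1 points, so no hyperplane separates the classes.
data = [((2.0, 2.0), +1), ((3.0, 3.0), +1), ((0.0, 0.0), -1), ((2.5, 2.5), -1)]
w, b, C = [0.5, 0.5], -1.0, 10.0

print(slacks(w, b, data))                    # [0.0, 0.0, 0.0, 2.5]
print(soft_margin_objective(w, b, data, C))  # 25.25
```

Only the misplaced point receives a nonzero slack, and a larger C makes that violation more expensive relative to the margin term.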

A representation of multiple SNP variations as a vector • Scheme • Given a disease or trait and a collection of SNP data that depend on genotype in a consistent way (haplotype data, clinical data), proceed in 7 steps: • 1. Assume that there is no environmental factor. • 2. Assume that the SNP locations are known for the disease or trait. • 3. Assume that there is a reference SNP data set (records of people in good health). • 4. Assign a vector to each SNP data set by giving scores based on its difference from the reference data.

A representation of multiple SNP variations as a vector • The dimension of the vector is the number of SNPs related to the disease or trait. • 5. Choose a training set for the disease or trait, in other words, SNP genotype data of the normal and sick populations. • 6. Using Step 4, compute the SNP vectors of the training data set $\{(x_i,y_i)\}$, where $x_i$ is an SNP vector and $y_i=1$ (sick) or $-1$ (normal). • 7. Use the SVM to obtain a hyperplane dividing the vectors into two groups (sick, normal).
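Steps 3 to 6 can be sketched as follows. The slides do not say how the difference scores are computed, so the rule used here (count the alleles in a genotype that are not matched by the reference genotype, giving 0, 1, or 2) is purely an illustrative assumption, as are the genotypes and person labels.

```python
from collections import Counter


def difference_score(genotype: str, reference: str) -> int:
    """Number of alleles in `genotype` not matched by `reference` (0, 1, or 2).

    Genotypes are unordered allele pairs such as "AG". This scoring rule is an
    assumption made for illustration, not the authors' scoring.
    """
    unmatched = Counter(genotype) - Counter(reference)
    return sum(unmatched.values())


def snp_vector(genotypes, reference):
    """Step 4: one score per SNP location, so the dimension equals the number of SNPs."""
    return [difference_score(g, r) for g, r in zip(genotypes, reference)]


# Step 3: hypothetical reference genotypes at three SNP locations.
reference = ["AA", "CC", "GG"]

# Step 5: hypothetical training population with labels (+1 sick, -1 normal).
people = {
    "person_1": (["AA", "CT", "GG"], -1),
    "person_2": (["AG", "TT", "GG"], +1),
}

# Step 6: SNP vectors of the training set {(x_i, y_i)}.
training_set = [(snp_vector(g, reference), y) for g, y in people.values()]
print(training_set)  # [([0, 1, 0], -1), ([1, 2, 0], 1)]
```

The resulting pairs are what an SVM solver would receive as input in Step 7.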

The remarks • 1. The reference data can be built by collecting SNP genotypes from the healthy, normal population. • 2. The hyperplane obtained can be regarded as a criterion: given a new data set, it can be used to test whether the person the data belong to is susceptible to the disease or trait. • 3. How an object is represented as a vector may be critical for making use of the SVM. How to incorporate domain knowledge into vector representations is one of the major issues. • 4. The idea of difference scoring could be applied to other data sets (visual data such as X-ray or MRI images, …), in particular to haplotype data, and to finding a linkage between SNPs and the disease or trait. • 5. Once a group of SNP patterns is identified, the contribution score of each of those SNPs to the disease or trait can be computed.
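Remarks 2 and 5 can be illustrated with a hyperplane (w, b) that is assumed to have been obtained already. The slides do not define the contribution score; taking each SNP's share of the total absolute weight, as below, is one possible choice and only makes sense when all SNP scores are on a comparable scale. The weights and the new person's vector are hypothetical.

```python
def decision_value(w, b, x):
    """Signed score w . x + b; a positive value falls on the 'sick' side of the hyperplane."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def contribution_scores(w):
    """Share of each SNP in the total absolute weight (an assumed definition)."""
    total = sum(abs(wj) for wj in w)
    return [abs(wj) / total for wj in w]


# Hypothetical hyperplane over three SNP difference scores.
w, b = [0.25, 1.5, -0.25], -1.0

new_person = [1, 2, 0]                       # SNP vector of someone outside the training set
print(decision_value(w, b, new_person) > 0)  # True
print(contribution_scores(w))                # [0.125, 0.75, 0.125]
```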

Inseparable Case • For the inseparable case, the iterated use of SVM enables us to divide a collection of labelled vectors into several cluster groups. • 1. Set a threshold value, say 80%. • 2. Use SVM to separate the collection of labelled vectors into two groups A and B. • 3. Check whether each group contains more than 80% of vectors labelled either 1 or -1. Suppose A does not; then apply SVM to A again to split it into two subgroups. • 4. Repeat this procedure until each subgroup has a majority of more than 80%. • 5. For each subgroup, figure out a range.
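The iteration can be sketched as a recursive split. To keep the example self-contained, `train_linear` is a simple subgradient-descent stand-in for a real SVM solver, not the solver used in the slides. The depth limit, the guard against a hyperplane that leaves one side empty, and the reading of step 5 as per-coordinate minimum and maximum are likewise assumptions added here.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear(data, epochs=200, lr=0.1, lam=0.01):
    """Stand-in for an SVM solver: subgradient descent on the regularized hinge loss."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for x, y in data:
            violated = y * (dot(w, x) + b) < 1     # margin constraint not met
            w = [wj * (1 - lr * lam) for wj in w]  # shrink w (regularization term)
            if violated:
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def purity(group):
    """Fraction of the group that carries its majority label."""
    positives = sum(1 for _, y in group if y == 1)
    return max(positives, len(group) - positives) / len(group)


def iterated_split(data, threshold=0.8, max_depth=5):
    """Steps 2-4: split with a hyperplane, then re-split any subgroup at or below the threshold."""
    if max_depth == 0:
        return [data]
    w, b = train_linear(data)
    side_a = [(x, y) for x, y in data if dot(w, x) + b >= 0]
    side_b = [(x, y) for x, y in data if dot(w, x) + b < 0]
    if not side_a or not side_b:  # the hyperplane did not split the data: stop here
        return [data]
    leaves = []
    for group in (side_a, side_b):
        if purity(group) > threshold:
            leaves.append(group)
        else:
            leaves.extend(iterated_split(group, threshold, max_depth - 1))
    return leaves


def ranges(group):
    """Step 5 (one interpretation): per-coordinate (min, max) of the subgroup."""
    return [(min(c), max(c)) for c in zip(*(x for x, _ in group))]


# Hypothetical labelled vectors that no single hyperplane separates (XOR-like layout).
data = [((0.0, 0.0), -1), ((1.0, 1.0), -1), ((0.0, 1.0), +1), ((1.0, 0.0), +1),
        ((0.1, 0.1), -1), ((0.9, 0.9), -1), ((0.1, 0.9), +1), ((0.9, 0.1), +1)]

for group in iterated_split(data):
    print(len(group), round(purity(group), 2), ranges(group))
```

A subgroup that hits the depth limit or cannot be split is returned as is, so the output may contain groups below the threshold; a production version would need to decide how to report those.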

Test results with clinical data • The clinical data are a data set of cardio-patient records: height, age, sex, weight, ethnic background, medical history, birthplace, blood pressure (systolic and diastolic), lipid measurements, etc. are converted to numerical values, and each record is labelled +1 for a patient with heart attack, stroke, or heart failure, and -1 otherwise. • We used Thorsten Joachims’ implementation of SVM.
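As an illustration of the numericalization and labelling step, the sketch below turns one patient record into a training line. It assumes that the implementation referred to is Joachims' SVM-light, whose training files contain one example per line in the form `<target> <index>:<value> ...` with feature indices starting at 1 in increasing order. The field names, the subset of fields, and the 0/1 encoding of sex are hypothetical.

```python
# Hypothetical subset of the clinical fields, in a fixed order that defines the feature index.
FIELDS = ["height_cm", "age", "sex", "weight_kg", "systolic", "diastolic"]


def label(record):
    """+1 for heart attack, stroke, or heart failure; otherwise -1."""
    events = ("heart_attack", "stroke", "heart_failure")
    return 1 if any(record.get(e, False) for e in events) else -1


def to_svmlight_line(record):
    """One training example in the '<target> <index>:<value> ...' text format."""
    parts = ["+1" if label(record) == 1 else "-1"]
    for index, name in enumerate(FIELDS, start=1):  # feature indices start at 1
        value = float(record[name])
        if value != 0.0:                            # zero-valued features may be omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


patient = {"height_cm": 172, "age": 61, "sex": 1, "weight_kg": 80,
           "systolic": 150, "diastolic": 95, "stroke": True}
print(to_svmlight_line(patient))  # +1 1:172 2:61 3:1 4:80 5:150 6:95
```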

Personal opinion • Applying the SVM is effective, but it is difficult to solve nonlinear problems with it. • How to incorporate domain knowledge into vector representations is one of the major issues.

