COT6930 Course Project
Outline • Gene Selection • Sequence Alignment
Why Gene Selection • Identify marker genes that characterize different tumor statuses. • Many genes are redundant and introduce noise that lowers classification performance. • Can eventually lead to a diagnosis chip. (“breast cancer chip”, “liver cancer chip”)
Gene Selection • Methods fall into three categories: • Filter methods • Wrapper methods • Embedded methods • Filter methods are the simplest and the most frequently used in the literature • Wrapper methods are likely the most accurate
Filter Method • Features (genes) are scored according to the evidence of their predictive power and then ranked. • The top s genes with the highest scores are selected and used by the classifier. • Scores: t-statistics, F-statistics, signal-to-noise ratio, … • The # of features selected, s, is then determined by cross-validation. • Advantage: fast and easy to interpret.
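As a minimal sketch of the filter approach (the function names `t_score` and `select_top_s` are illustrative, not from the slides), each gene can be scored by a two-sample t-statistic and the top s genes kept:

```python
import numpy as np

def t_score(X, y):
    """Score each gene by a two-sample t-statistic between two classes.

    X: (n_samples, n_genes) expression matrix; y: binary labels (0/1).
    Higher magnitude = more discriminative gene.
    """
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
    n0, n1 = len(X0), len(X1)
    return np.abs(m0 - m1) / np.sqrt(v0 / n0 + v1 / n1)

def select_top_s(X, y, s):
    """Return the indices of the s top-ranked genes."""
    scores = t_score(X, y)
    return np.argsort(scores)[::-1][:s]
```

In practice s would be chosen by cross-validation, as the slide notes; here it is passed in directly.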
Filter Method: Problem • Genes are considered independently, so redundant genes may be included. • Genes that are individually weak but jointly have strong discriminant power will be ignored. • Good single features do not necessarily form a good feature set • The filtering procedure is independent of the classification method • Features selected can therefore be used with any type of classifier
Wrapper Method • Iterative search: many feature subsets are scored based on classification performance and the best one is used. • Select a good subset of features • Subset selection: forward selection, backward selection, or their combinations. • Exhaustive search is impossible. • Greedy algorithms are used instead.
Wrapper Method: Problem • Computationally expensive • For each feature subset considered, a classifier is built and evaluated. • Exhaustive search is impossible • Greedy search only. • Prone to overfitting.
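A minimal sketch of wrapper-style greedy forward selection. The inner classifier here is an assumed choice (a leave-one-out nearest-centroid classifier) purely for illustration; the slides do not specify one. Note how every candidate subset requires retraining and re-evaluating the classifier, which is why wrappers are expensive:

```python
import numpy as np

def nearest_centroid_acc(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier on features X."""
    correct = 0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        Xt, yt = X[mask], y[mask]
        # class centroids from the training fold
        cents = {c: Xt[yt == c].mean(axis=0) for c in np.unique(yt)}
        pred = min(cents, key=lambda c: np.linalg.norm(X[i] - cents[c]))
        correct += pred == y[i]
    return correct / len(y)

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the single feature that
    most improves classification accuracy of the current subset."""
    chosen, remaining = [], list(range(X.shape[1]))
    while len(chosen) < k and remaining:
        best = max(remaining,
                   key=lambda j: nearest_centroid_acc(X[:, chosen + [j]], y))
        chosen.append(best)
        remaining.remove(best)
    return chosen
```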
Embedded Method • Attempt to jointly or simultaneously train both a classifier and a feature subset. • Often optimize an objective function that jointly rewards accuracy of classification and penalizes use of more features. • Intuitively appealing
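One classic example of an embedded method (my example, not necessarily the one the slides have in mind) is L1-penalized logistic regression: the penalty term drives the weights of uninformative genes exactly to zero, so feature selection happens jointly with training. A minimal proximal-gradient (ISTA) sketch:

```python
import numpy as np

def l1_logistic(X, y, lam=0.1, lr=0.1, iters=2000):
    """L1-penalized logistic regression trained by proximal gradient descent.

    Objective: mean logistic loss + lam * ||w||_1.
    The L1 penalty zeroes out weights of uninformative features, so the
    surviving nonzero weights ARE the selected feature subset. y in {0, 1}.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        grad = X.T @ (p - y) / n              # gradient of the smooth part
        w = w - lr * grad
        # soft-thresholding: the proximal step for the L1 penalty
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```

The selected genes are simply `np.nonzero(w)[0]`; the trade-off between accuracy and number of features is controlled by `lam`.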
Relief-F • Relief-F is a filter approach for feature selection • It builds on the original Relief algorithm
Relief-F • The original Relief can only handle binary classification problems. Relief-F extends it to handle multi-class problems
Relief-F • The diff() function handles both: • Categorical attributes (0/1 difference) • Numerical attributes (normalized absolute difference)
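A minimal sketch of the binary Relief weight update for numerical attributes (categorical diff() and the multi-class Relief-F extension are omitted; function and variable names are my own). For each sampled instance, the nearest same-class neighbour (hit) decreases attribute weights and the nearest other-class neighbour (miss) increases them:

```python
import numpy as np

def relief(X, y, m=None, seed=0):
    """Binary Relief: weight attributes by how well they separate
    nearest neighbours of different classes.

    diff() for a numerical attribute is the absolute difference,
    normalized by that attribute's value range.
    """
    n, a = X.shape
    m = m or n                      # number of sampled instances
    rng = np.random.default_rng(seed)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0           # avoid division by zero
    W = np.zeros(a)
    for i in rng.integers(0, n, size=m):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf               # exclude the instance itself
        same = y == y[i]
        hit = np.argmin(np.where(same, d, np.inf))   # nearest hit
        miss = np.argmin(np.where(~same, d, np.inf)) # nearest miss
        W -= np.abs(X[i] - X[hit]) / span / m
        W += np.abs(X[i] - X[miss]) / span / m
    return W
```

Each of the m iterations scans all n instances over a attributes, which matches the O(m·n·a)-style cost analyzed on the next slide.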
Relief-F Problem • Time complexity: m×(m×a + c×m×a + a) = O(cm²a) • Assume m=100, c=3, a=10,000 • This is about 300×10⁶ (3×10⁸) operations • Only considers one attribute at a time; cannot select a subset of jointly “good” genes
Solution: Parallel Relief-F • Version 1: • Cluster nodes run Relief-F in parallel, and the updated weight values are collected at the master. • Theoretical time complexity O(cm²a/p) • p is the # of cluster nodes
Parallel Relief-F • Version 2: • Cluster nodes run Relief-F in parallel, and each node directly updates the global weight values. • Each node also uses the current weight values when selecting nearest-neighbour instances • Theoretical time complexity O(cm²a/p) • p is the # of cluster nodes
Parallel Relief-F • Version 3 • Considers selecting a subset of important features • Compare the results of including vs. excluding a specific feature to understand the importance of a gene with respect to an existing subset of features • Discussion in private!
Outline • Gene Selection • Sequence Alignment • Given a dataset D with N=1000 sequences (e.g., 1000 each) • Given an input x, • Do pair-wise global sequence alignment between x and all sequences in D • Dispatch the jobs to cluster nodes • And aggregate the results
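The per-pair work is standard Needleman–Wunsch global alignment. A minimal score-only sketch (scoring parameters `match`, `mismatch`, `gap` are illustrative defaults, not values given in the slides), keeping only two DP rows so memory stays O(|t|):

```python
def global_align_score(s, t, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score via dynamic programming
    over a (|s|+1) x (|t|+1) table, kept two rows at a time."""
    n, m = len(s), len(t)
    prev = [j * gap for j in range(m + 1)]   # row 0: all-gap prefixes of t
    for i in range(1, n + 1):
        cur = [i * gap] + [0] * m            # column 0: all-gap prefix of s
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + sub,  # substitute/match
                         prev[j] + gap,      # gap in t
                         cur[j - 1] + gap)   # gap in s
        prev = cur
    return prev[m]
```

For the project setup above, the master would partition D into p chunks, each cluster node would compute `[global_align_score(x, s) for s in its_chunk]`, and the master would concatenate the score lists — the per-pair alignments are independent, so the job is embarrassingly parallel.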