Matlab Bioinformatics Toolkit Evaluation Kanishka Bhutani
What I expected?
• Local/global sequence alignments.
• Multiple sequence alignments.
• Choice of different scoring matrices (BLOSUM, PAM) for evaluation.
• Building hidden Markov models.
• Easy import of sequences from databases (PFAM, PDB, Swiss-Prot).
What I found?
• Most of the features.
• “Bonus”: microarray normalization tools, and microarray visualization tools including box plots and heat maps.
Any surprises?
• No “multiple sequence alignments”.
• Avg./std. dev. of hydrophobicity and solvent accessibility: is there a command?
• “proteinplot”: a GUI for protein structure analysis.
• Import your file to view it, select parameters, and display stats.
What all I tried?
• Local alignment, global alignment.
• For short sequences:
  swalign('seq1','seq2')
  nwalign('seq1','seq2')
  seq1, seq2: AA or NT sequences.
• For “imported” long sequences: convert the sequence into a vector of integer values.
  Commands: nt2int, aa2int
Pairwise sequence alignment
• S = getgenbank('NM_00001')
• M = getgenbank('NM_00002')
• Output: header and a sequence.
• K = nt2int(S.Sequence)
  B = nt2int(M.Sequence)
  [sc, align] = nwalign(K, B)
• sc: alignment score; align: aligned sequences.
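The alignment calls on the last two slides can be tried offline on short literal sequences. This is a minimal sketch, assuming the Bioinformatics Toolbox is installed. The sequences are made-up examples, and the 'Alphabet','NT' option is added on the assumption that integer vectors produced by nt2int should be scored as nucleotides rather than with nwalign's default amino acid alphabet.

```matlab
% Made-up example sequences (not from the slides).
aaSeq1 = 'VSPAGMASGYD';
aaSeq2 = 'IPGKASYD';

% Global (Needleman-Wunsch) and local (Smith-Waterman) alignment of
% amino acid sequences, with the scoring matrix chosen explicitly.
[globalScore, globalAlignment] = nwalign(aaSeq1, aaSeq2, 'ScoringMatrix', 'BLOSUM50');
[localScore,  localAlignment]  = swalign(aaSeq1, aaSeq2, 'ScoringMatrix', 'BLOSUM50');

% Nucleotide sequences converted to integer vectors, as on the slide.
ntSeq1 = 'ACGTTGCAAC';
ntSeq2 = 'ACGTGCATC';
K = nt2int(ntSeq1);
B = nt2int(ntSeq2);

% Assumption: tell nwalign the integers encode nucleotides.
[ntScore, ntAlignment] = nwalign(K, B, 'Alphabet', 'NT');

disp(globalAlignment);
disp(localAlignment);
disp(ntAlignment);
```

Swapping 'BLOSUM50' for another matrix name (for example 'PAM250') is one way to compare scoring matrices on the same pair of sequences.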
Getting sequences: very easy!
• 'getgenbank': retrieve sequence information from the GenBank database.
• 'getembl': retrieve sequence information from the EMBL database.
• 'getgenpept': retrieve sequence information from the GenPept database.
• 'gethmmprof': get an HMM from the PFAM database.
Experiment
• hmmodel = gethmmprof('PF00001')
Visualization of model
• showhmmprof(hmmodel,'scale','logodds')
Get GPCR seq's
• S = getgenbank('NM_024531')
• disp(S.Sequence)
Alignment of the seq's
• var = gethmmalignment('PF00001','type','seed')
• disp([char(var.Header) char(var.Sequence)])
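The retrieval steps from the last few slides can be strung together as follows. This is a sketch only, assuming the Bioinformatics Toolbox is installed and the PFAM service is reachable from the machine. The variable name seedAlignment and the limit of five sequences are illustrative choices, not from the slides.

```matlab
% Requires the Bioinformatics Toolbox and network access to PFAM.
hmmodel = gethmmprof('PF00001');
showhmmprof(hmmodel, 'Scale', 'logodds');

% Seed alignment for the same family: a struct array with
% Header and Sequence fields, one element per aligned sequence.
seedAlignment = gethmmalignment('PF00001', 'Type', 'seed');

% Print the first few entries: header on one line, sequence on the next.
numToShow = min(5, numel(seedAlignment));
for k = 1:numToShow
    fprintf('%s\n%s\n\n', seedAlignment(k).Header, seedAlignment(k).Sequence);
end
```

Printing entry by entry avoids building one large character matrix, which may be easier to read for long alignments.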
For GPCR Family C
• Similarly for different families.
• Multiple aligned sequences retrieved.
GUI proteinplot
• User friendly.
• Avg./std. dev. values for:
  Hydrophobicity.
  Secondary structure propensity (alpha helices or beta strands).
  Accessibility (accessible and buried residues).
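For the command-line question raised earlier, whole-sequence statistics can be computed by hand without the GUI. The sketch below uses the Kyte-Doolittle hydropathy scale typed in directly and a made-up sequence. It is an illustration of the idea and not a reproduction of what proteinplot computes internally.

```matlab
% Kyte-Doolittle hydropathy values, listed in the order of 'residues'.
residues = 'ARNDCQEGHILKMFPSTWYV';
kyteDoolittle = [ 1.8 -4.5 -3.5 -3.5  2.5 -3.5 -3.5 -0.4 -3.2  4.5 ...
                  3.8 -3.9  1.9  2.8 -1.6 -0.8 -0.7 -0.9 -1.3  4.2];

seq = 'MKTLLVAGGAVIL';   % made-up example sequence

% Map each residue to its index in 'residues'; unknown letters are skipped.
[known, idx] = ismember(upper(seq), residues);
values = kyteDoolittle(idx(known));

avgHydro = mean(values);
stdHydro = std(values);
fprintf('Mean hydropathy: %.2f, std. dev.: %.2f\n', avgHydro, stdHydro);
```

The same pattern works for any per-residue property once a 20-value table for it is available.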
Test a seq. with HMM
• Retrieve mGluR1 from GenBank:
  mgr = getgenbank('NM_012407')
  glusequence = mgr.Sequence
• Test it with the HMM model for class A:
  [a, sglu] = hmmprofalign(modelA, glusequence, 'showscore', true)
• Score = -203.53
• Seq =
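A fuller version of this test might look like the sketch below. It assumes the Bioinformatics Toolbox and network access, and it assumes that the class A model is the PF00001 profile fetched earlier. Because PFAM profiles model amino acids, the sketch also assumes the GenBank record carries a protein translation in its first CDS feature and scores that instead of the mRNA sequence. The names modelA and proteinSeq are illustrative.

```matlab
% Requires the Bioinformatics Toolbox and network access.
modelA = gethmmprof('PF00001');     % assumption: the class A profile from earlier
mgr = getgenbank('NM_012407');

% Assumption: the record's first CDS feature has a 'translation' field
% holding the protein sequence.
proteinSeq = mgr.CDS(1).translation;

% Align the sequence to the profile; 'ShowScore' also plots the scoring space.
[score, alignment] = hmmprofalign(modelA, proteinSeq, 'ShowScore', true);
fprintf('Alignment score: %.2f\n', score);
disp(alignment);
```

Scoring the same sequence against profiles of other families and comparing the scores is one possible way to approach sequence verification.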
Difficulties & questions
• No multiple sequence alignment.
• Demos: not very helpful.
• Difficult to view the sequences, as no “disp” command was found.
• Bugs: storing huge sequences (GPCR A) in a file gives a parsing error; the HMMprofdemo command stops abruptly and gives errors.
• Proteinplot (GUI) often hangs the machine.
• How to verify the sequences using the HMM models?
• How to find regular expression matches and highlight those positions?
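On the last question, base MATLAB's regexp can return match positions, which is enough for a simple text highlight. The sketch below uses a made-up sequence and an example motif pattern, and marks matches by letter case. All of these choices are illustrative.

```matlab
seq = 'MKNGTLLNCSAVNKTQ';   % made-up example sequence
pattern = 'N[^P][ST]';      % example motif: N, any residue except P, then S or T

% Start and end positions of every match.
[startIdx, endIdx] = regexp(seq, pattern, 'start', 'end');

% Highlight by lower-casing everything, then upper-casing the matches.
highlighted = lower(seq);
for k = 1:numel(startIdx)
    range = startIdx(k):endIdx(k);
    highlighted(range) = upper(highlighted(range));
end

disp(startIdx);
disp(highlighted);
```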
Suggestions for an experiment
• Given: an unknown sample dataset of proteins and a known dataset of proteins (with known structural information).
• Use the BLMT to extract “over-expressed” 4-grams in a protein sequence, or in a group of protein sequences, from the known set.
• Use the “search for regular expression” function in the Matlab toolkit to look for those 4-grams in the unknown proteins, and hence predict their structure.
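The counting-and-searching part of this proposal can be prototyped in base MATLAB. The sketch below replaces the BLMT step with a plain count of overlapping 4-grams, so it is a stand-in and not the BLMT method. The sequences and the minCount threshold are made up for illustration.

```matlab
knownSeqs   = {'MKTLLVAGKTLL', 'AGKTLLVAG'};   % made-up "known" set
unknownSeqs = {'QQKTLLPP', 'MMMMMMMM'};        % made-up "unknown" set
n = 4;

% Count every overlapping n-gram in the known set.
counts = containers.Map('KeyType', 'char', 'ValueType', 'double');
for s = 1:numel(knownSeqs)
    seq = knownSeqs{s};
    for p = 1:(length(seq) - n + 1)
        gram = seq(p:p+n-1);
        if isKey(counts, gram)
            counts(gram) = counts(gram) + 1;
        else
            counts(gram) = 1;
        end
    end
end

% Treat n-grams seen at least minCount times as "over-expressed"
% (a hypothetical threshold, not taken from the BLMT).
minCount = 3;
grams = keys(counts);
frequent = grams(cellfun(@(g) counts(g) >= minCount, grams));

% Look for each frequent n-gram in the unknown sequences.
for g = 1:numel(frequent)
    for u = 1:numel(unknownSeqs)
        hits = regexp(unknownSeqs{u}, frequent{g}, 'start');
        if ~isempty(hits)
            fprintf('%s found in unknown sequence %d at position(s) %s\n', ...
                frequent{g}, u, mat2str(hits));
        end
    end
end
```

A raw count threshold is the simplest possible notion of over-expression; a real experiment would probably compare counts against a background frequency.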