Biomedical informatics for proteomics

Biomedical informatics for proteomics Boguski, M. S. and M. W. McIntosh (2003). Nature 422(6928): 233-237. 指導老師 : 趙坤茂 Kun-Mao Chao 組員 : 施光偉蕭雅茵計佩岑葉衍陞葉欣綺鍾宇彥蘇鈺惠陳雲濤

Outline • Introduction • Study design and sample quality • Protein databases • Protein identification by database searching • Pattern matching without protein identification • Conclusions and future challenges

Introduction reporter:施光偉

Introduction • The subtitle:“Genes WereEasy”. • We have transitioned rapidly from a large but finite and completehuman genome to a seemingly infinite biological universe. • Proteomics is often referred to as a ‘post-genome’ science, but its antecedents actually predate the Human Genome Project by two to three decades. • Although medical informatics has until recently been largely detached from bioinformatics, the emergence of clinical genomics and proteomics increasingly requires the integrated analysis of genetic, cellular, molecular and clinical information and the expertise of pathologists, epidemiologists and biostatisticians.

Introduction • Proteomics is the latest functional genomics technology to capture our imagination and it is instructive to review some lessons learned during the earlier adoption of another functional genomics technology, namely gene expression analysis using microarrays and similar technologies. • There are many implications of biomedical informatics for proteomics, including multiple platform technologies, laboratory information-management systems, medical records systems, and documentation of clinical trial results for regulatory agencies. • In the present work, we confine our discussions to mass spectrometry-based proteomics, and to study design and data resources, tools and analysis in a research setting.

Introduction • Proteomics depends upon careful study design and high-quality biological samples, advanced information technologies. • Proteome analysis is at a much earlier stage ofdevelopment than genomics and gene expression (microarray) studies. • Fundamental issues involvingbiological variability, pre-analytic factors and analytical reproducibility remain to be resolved.

Study design and sample quality reporter:蕭雅茵

Glossary • Case-control and cohort study Observational studies: Case → O/X of the phenotype(case/control) Cohort → Participants based on O/X of risk factor of interest and over time for development of an outcome • Confounder/Confounding Distort an apparent relationship between an exposure and a phenotype of interest • Plasma: fluid, non-cellular Serum: protein solution remaining after blood coagulated • Pre-analytical variables Variables that present before laboratory test and data analysis • Randomized clinical trial Treatments are randomly assigned in order to prevent confounding

Study design and sample quality • Potter describes 4 study design • However, the distinction between observational and experimental design isn’t made as well as proteomics studies.

Study design and sample quality • Observational studies of gene expression and proteomic analysis involving human→①bias & confounding factor • Human plasma and serum proteomics are susceptible to observational biases→confused with a specific characteristic of the disease process→mislead • Each may induce a change in total protein concentrations by ± 10%. • Highlighting human serum proteome→nature but confounding variables may complicate finding

Study design and sample quality • No adjust for confounding even, only to have careful design and specimen ascertainment • ②quality ③number • Margolin has admonished that ”Scientists...need to avoid the tendency, often driven by the high price of some of the newer techniques, of running under-controlled experiments or experiments with fewer repeated conditions than would have been accepted with standard techniques.” • Proteomics discovery has no priori enumeration of targets and lacks described procedural structure.

Protein database reporter:計佩岑

Proteome Genome DNA mRNA Proteome Proteins

Protein databases • Collections protein sequences date back to the1960s. • Utilitarian goal of protein databases (1990s~today) • Minimal redundancy • Maximal annotation • Integration with other databases

Protein databases • Current molecular sequence databases are classified according to their evolutionary history inferred from sequence homology. • excellent tools for gene discovery, comparative genomics and molecular evolution • much work to be done to even minimally serve the needs of proteomics and integrative biological science

Protein databases • Today's principal protein databases emphasize • molecular • cellular features • annotation • are not well suited to represent physiology. • A more ideal database for plasma proteome studies would classify proteins from a functional, rather than an evolutionary, viewpoint

Data standards • Multiple or specialized file formats has hindered accessibility, information exchange and integration • eXtensibleMarkup Language (XML) • an Internet standard for describing structured and semistructured data • most of the main databasesmake their data available in XML and make it easy to publish and exchange XML data

Protein databases • PDB(Protein Data Bank ) • GenBank • SWISS-PROT • EMBL • HPRD(Human Protein Reference Database）

Protein identification by database searching reporter: 葉衍陞葉欣綺鍾宇彥

Purpose of Protein identification by database searching • NOT the species or remoteness of the relationship • infer similarity of function from similarity of sequence • study the evolution of protein families or domains • Different aims and therefore require different strategies and tools

Analysis of human serum • interested in identifying proteins they are not normally present • match between subsequences • weak similarities

Statistical significance • statistical significance is important, but not in the sense of the probability that two sequences are related by chance • deviates significantly from a normal range of values. • If it is met, one is then interested in attempting to demonstrate a significant correlation

影響database原因之一 DNA 1 mRNA 2 protein 1.Transcription Post translational modification 2.translation(proteolytic processing glycosylation, methylation, phosphorylation, Met切除, 雙硫鍵形成, acetylation, hydroxylation )

Post translational modification proteolytic processing • 移除訊號序列胺基酸殘基 • 移往特定細胞 • 特殊胜肽水解酶移除 glycosylation • Asn和 Ser或Thr • 主要場地內質網 • 有潤滑作用的含有寡糖類之鏈

Post translational modification methylation • 特定Lys殘基進行 • 某些肌肉蛋白、組蛋白、與色素細胞c phosphorylation • 多接在-OH 基的胺基酸 • 調控蛋白質酵素活性

Post translational modification Met切除 • N端的Met往往在多胜肽鏈合成前被切除(AUG) 雙硫鍵形成 • mRNAhas no coding acetylation • 組蛋白調控轉錄作用 hydroxylation • 膠原蛋白等

Peptide analysis • Error Tolerance • Scoring methods

Peptide analysis- Experimental process • Cut to mixture of short peptides • Specific: restriction enzyme • Mass Spectrometry • Detect the m/z of the compounds • Tandem mass spectrometry (MS/MS) • Fragments of specific m/z • Chromatography • Separation before MS

Tandem mass spectrometry http://en.wikipedia.org/wiki/Tandem_mass_spectrometry

Chromatography Dionex

Peptide analysis- Mass Spectrometry • Several Approach • Analytic peptide-mass fingerprint • used as profile • Compare with the predicted spectrum • match to database • De novo sequence interpretation • Manual interpretation by expert • Time consumption high

Consideration of Error Tolerance • Restriction enzyme non-specificity • Precursor charge errors • Get more than one charge in ionization • Isotope • Mass measurement errors • Related to accuracy of instrument • Unsuspected modifications • Ex: post-translational modification • Primary sequence variations • deletions, insertions, substitutions [2002] Error tolerant searching of uninterpreted tandem mass spectrometry data

Scoring methods description • In general, each scoring algorithm designates a quantity related to the probability that the candidate peptide could have produced the observed spectrum by chance • Ranking is required for high-throughput automated analysis

Example of peptide identification PB cannot be identified due to high variation Solutions: reduce the number of target peptides

Another challenge • Another automating proteomics challenge : the best match of a scoring algorithm is simply not good enough. • Establishing a criteria for acceptance overall therefore becomes the main focus of automated proteomics.

Scoring Threshld ,P value • It is generally assumed that higher-scoring assignments are more likely to be correct than lower-scoring assignments. • Threshold:i : Sensitivity ii : Specificity iii:Mixture , sequence data base • P values : If p values＜0.05 , 5% of all false tests will be misidentified as true.

Scoring Threshld ,P value Probability http://rating.com.vn/home/_/Y-nghia-cua-tri-so-P-tuc-P-value.26.1080

P value-like quantities • Keller et al. estimate the reference distributions of the correct and incorrect assignments within any experiment. • Keller et al. describe an approach that may allow a scoring algorithm to be converted into P value-like quantities that can then be used to control error rates.

*Pattern matching without protein identification *Conclusions and future challenges reporter:蘇鈺惠陳雲濤

Time-of-flight mass spectrometry(TOF)

Time-of-flight mass spectrometry • mass spectrometry • ions are accelerated by an electric field • velocity of the ion depends on the mass-to-charge ratio • Time is measured • Compared with known experimental parameter, we can get the ion of mass-to-charge ratio.

Time-of-flight mass spectrometry • Time-of-flight mass spectrometry (TOFMS) is a method of mass spectrometry in which ions are accelerated by an electric field of known strength. This acceleration results in an ion having the same kinetic energy as any other ion that has the same charge. The velocity of the ion depends on the mass-to-charge ratio. The time that it subsequently takes for the particle to reach a detector at a known distance is measured. This time will depend on the mass-to-charge ratio of the particle (heavier particles reach lower speeds). From this time and the known experimental parameters one can find the mass-to-charge ratio of the ion. The elapsed time from the instant a particle leaves a source to the instant it reaches a detector.from wikipedia

Principle & method • Ep=q*U • Ek=1/2*m*v^2 • Ek=Ep q*U=1/2*m*v^2 (v=d/t)t=k*sqrt(m/q) ; k=d/sqrt(2*U) • The velocity is determined by time-of-flight tube length(d) and time of the flight of the ion (t) v=d/t

application • Matrix-assisted laser desorption ionization time of flight spectrometry(MALDI-TOF) is a pulsed ionization technique that is readily compatible with TOF MS. • 1. ionize molecule via laser pulse • 2. separate molecule according to mass to charge ratio • 3. mainly used for detection of large biomolecule.

Component of MALDI-TOF

http://www.youtube.com/watch?v=gTRsaAnkRVU

Drawback of TOF • Each m/z value of the spectrum reflects the abundance of possibly many peptides having a similar mass. Thus, with complex mixtures, these TOF methods are not able to identify individual peptides.

Using TOF • When used with complex mixtures, analysis methods are intended to identify peaks, or features, of the spectrum that can segregate identifiable groups • When evaluating expression array, using Clustering methods, Pattern matching for alignment and peak identification.

expression array (圖) ig 2.

Cluster Algorithm • 一種分類的方法: 由一個基準點，描述其在有限範圍(Eps)內包含不少於MinPt個點的群集 • 範圍以歐幾里得距離或曼哈頓距離算之 • 用途廣泛: 諸如商業市場分析、生物分類研究、生醫資訊領域、Data mining、Machine Learning、圖像分析 • 種類:Partitioning MethodsHierarchical Methodsdensity-based methodsgrid-based methodsModel-Based Methods

Biomedical informatics for proteomics

Biomedical informatics for proteomics

Presentation Transcript

Personalized Biomedical Informatics

Proteomics Informatics David Fenyő

Proteomics Informatics David Fenyő

Biomedical Informatics

Computational BioMedical Informatics

Proteomics Informatics –

Proteomics Informatics –

Proteomics Informatics –

Biomedical Informatics Program

Personalized Biomedical Informatics

BIOMEDICAL INFORMATICS RESEARCH

Biomedical Informatics Hub

Biomedical informatics for proteomics

Biomedical Informatics

Biomedical Informatics Core

Using Proteomics for Biomedical Research

Biomedical Informatics

Selected Videos for Biomedical Informatics

Biomedical Engineering and Biomedical Informatics Program

Data Mining for Biomedical Informatics