milkER – a milk informatics resource Stephen Edwards BSc. University of Edinburgh BioNLP meeting 6th June 2005
Overview • Aims of milkER • milkER database • Text-mining • Potential targets
milkER aims • To amalgamate dispersed milk information into one resource, allowing more focused analysis of milk proteins in relation to dairy issues, health and disease.
A milk database • Knowledge on milk affects many industries • UniProt, GenBank excellent resources • Marsupial genomics database (New Zealand) • Glasgow genomics data • Chinese database • Polish bioactive peptide database • Food property database (commercial)
Milk components • Fat, carbohydrates, proteins, minerals • Growth factors, enzymes, enzyme inhibitors, immunoglobulins, allergens, disease factors, anti-bacterial proteins, opioids • How components come to be in milk: 1. Deliberate 2. Leakage from blood 3. Result of disease conditions 4. Engineered 5. Bacterial origin
milkER database • Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries
LOCUS       NM_173929    790 bp    mRNA    linear    MAM 27-OCT-2004
DEFINITION  Bos taurus lactoglobulin, beta (LGB), mRNA.
ACCESSION   NM_173929
VERSION     NM_173929.2  GI:31343239
KEYWORDS    .
SOURCE      Bos taurus (cow)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
            Bovinae; Bos.
REFERENCE   1  (bases 1 to 790)
  AUTHORS   Jayat,D., Gaudin,J.C., Chobert,J.M., Burova,T.V., Holt,C.,
            McNae,I., Sawyer,L. and Haertle,T.
  TITLE     A recombinant C121S mutant of bovine beta-lactoglobulin is more
            susceptible to peptic digestion and to denaturation by reducing
            agents and heating
  JOURNAL   Biochemistry 43 (20), 6312-6321 (2004)
   PUBMED   15147215
  REMARK    GeneRIF: Results suggest that the stability of beta-lactoglobulin
            arising from the hydrophobic effect is reduced by the C121S
            mutation so that unfolded or partially unfolded states are more
            favored.
ORIGIN
        1 actccactcc ctgcagagct cagaagcgtg atcccggctg cagccatgaa gtgcctcctg
       61 cttgccctgg ccctcacctg tggcgcccag gccctcatcg tcacccagac catgaagggc
…..
[Diagram: milkER population. Labels: Information retrieval (EMBL, UniProt, other databases); Information extraction (other sources, e.g. published tables); milkER; Web query]
milkER database • Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries • Library of literature on milk • User interface (www.milker.org.uk)
Text-mining • Machine ‘reading’ of text • Many techniques involved: • Tokenisation • Stemming (Activation → Activat) • POS tagging (Protein → noun) • Abbreviation expansion (CN → Casein) • Entity identification (Casein → protein) • Dictionary
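The steps listed on this slide can be sketched as a toy pipeline. This is only an illustration, not the milkER implementation: the abbreviation table, the entity dictionary and the suffix list are hypothetical stand-ins, the stemmer is a crude suffix stripper rather than a real stemming algorithm, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Hypothetical lookup tables, for illustration only.
ABBREVIATIONS = {"CN": "casein", "B-LG": "beta-lactoglobulin"}
ENTITY_DICTIONARY = {
    "casein": "protein",
    "beta-lactoglobulin": "protein",
    "diabetes": "disease",
}
# Crude suffix list; a real system would use a proper stemmer.
SUFFIXES = ("ions", "ion", "ing", "ed", "s")


def tokenise(text):
    """Split text into word tokens, keeping internal hyphens (e.g. B-LG)."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def stem(token):
    """Strip the first matching suffix, leaving at least three characters."""
    lower = token.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)]
    return lower


def expand_abbreviation(token):
    """Replace a known abbreviation with its long form."""
    return ABBREVIATIONS.get(token, token)


def identify_entity(token):
    """Look the token up in the dictionary; None if it is not an entity."""
    return ENTITY_DICTIONARY.get(token.lower())


def annotate(text):
    """Run tokenisation, abbreviation expansion, stemming and entity lookup."""
    annotated = []
    for token in tokenise(text):
        expanded = expand_abbreviation(token)
        annotated.append({
            "token": token,
            "expanded": expanded,
            "stem": stem(expanded),
            "entity": identify_entity(expanded),
        })
    return annotated


for row in annotate("CN activation"):
    print(row)
```

Each token ends up with its expanded form, a stem and an entity label (or None), which is the kind of annotated input the later extraction steps work on.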
“Increased levels of IgA antibodies to B-LG were found and were shown to be an independent risk marker for type 1 diabetes.” • Tokeniser / POS tagger: Increased [past participle] levels [plural noun] of [preposition] • Entity identification: IgA [antibody], B-LG [protein], diabetes [disease] • Parser: [IgA antibodies to B-LG] ‘MARKER’ [type 1 diabetes]
Information extraction • Rule-based • Trigger verbs such as ‘interact’, ‘bind’, ‘activate’ • Pattern: [protein] (0-5 words) [verb] (0-5 words) [protein] (Blaschke and Valencia, 2002) • Machine-learning • Statistical methods, Hidden Markov Models • Learn interfillers, the text lying between tagged entities (Bunescu et al, 2004)
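A minimal sketch of the rule-based pattern above, assuming the sentence has already been tokenised and its proteins tagged. The function names, the stem list and the placeholder names ProteinA and ProteinB are illustrative; this is not the Blaschke and Valencia system itself.

```python
# Stems of the trigger verbs named on the slide (interact, bind, activate).
TRIGGER_STEMS = ("interact", "bind", "activat")


def is_trigger(word):
    """True if the word starts with one of the trigger verb stems."""
    return word.lower().startswith(TRIGGER_STEMS)


def extract_interactions(tagged, max_gap=5):
    """Apply [protein] (0-5 words) [verb] (0-5 words) [protein].

    `tagged` is a list of (word, label) pairs, where label is "protein"
    or None. Only neighbouring protein mentions are paired, so the gap
    words never contain another protein.
    """
    protein_positions = [
        i for i, (_, label) in enumerate(tagged) if label == "protein"
    ]
    found = []
    for i, j in zip(protein_positions, protein_positions[1:]):
        for k in range(i + 1, j):
            left_gap = k - i - 1    # words between first protein and verb
            right_gap = j - k - 1   # words between verb and second protein
            if (is_trigger(tagged[k][0])
                    and left_gap <= max_gap
                    and right_gap <= max_gap):
                found.append((tagged[i][0], tagged[k][0], tagged[j][0]))
                break
    return found


# Made-up sentence with placeholder protein names.
sentence = [
    ("ProteinA", "protein"), ("strongly", None), ("binds", None),
    ("to", None), ("ProteinB", "protein"),
]
print(extract_interactions(sentence))  # [('ProteinA', 'binds', 'ProteinB')]
```

Matching on verb stems keeps the rule short but misses irregular forms such as "bound"; a rule of this kind also has no way to notice negation, which is one of the difficulties with natural language.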
Difficulties • Synonyms • Proteins and genes with same name • Funny names e.g. ERK-1/2, ‘and’ gene! • Variability of natural language • Compounded names • Co-ordination, negatives, speeling errors
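Evaluation • Precision (P) - what proportion of the output is correct • Recall (R) - what proportion of the correct items the system picks up • F-measure - combines P and R (their harmonic mean when both are weighted equally) • IE systems can achieve high scores, but not high enough to populate databases automatically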
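As a small worked example of these measures, the sketch below scores a set of extracted interactions against a hand-curated gold set. The function name and the placeholder protein pairs are illustrative only.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F) for predicted items against a gold set.

    beta=1.0 gives the balanced F-measure (harmonic mean of P and R).
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure


# Placeholder pairs: 4 predicted, 2 of them correct, 8 in the gold set.
predicted = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
gold = {("A", "B"), ("A", "C"), ("E", "F"), ("E", "G"),
        ("F", "G"), ("F", "H"), ("G", "H"), ("G", "I")}
p, r, f = precision_recall_f(predicted, gold)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")  # P=0.50 R=0.25 F=0.33
```

Here half of the output is correct, but only a quarter of the known interactions are found, and the F-measure sits closer to the weaker of the two scores.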
Text-mining uses • Aim to extract interactions and diseases • Swanson (Fish oil) • Srinivasan (Turmeric)
General model for discovering implicit links between topics • Starting topic: Turmeric → (inhibits) → Intermediate topic: Nuclear factor-kappa B → (involved in) → Terminal topic: Crohn’s disease • Diagram taken from Srinivasan et al, 2004
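The starting, intermediate and terminal topics of this model can be chained in a few lines, assuming relations have already been extracted as (subject, relation, object) triples. This is a simplified sketch of the general idea only; it does not reproduce the topic weighting or ranking used by Srinivasan et al, and the function name is hypothetical.

```python
from collections import defaultdict


def implicit_links(relations, start):
    """Find start -> intermediate -> terminal chains.

    A chain is kept only when the terminal topic is not already linked
    directly to the starting topic, i.e. the link is implicit.
    """
    outgoing = defaultdict(list)
    for subject, relation, obj in relations:
        outgoing[subject].append((relation, obj))

    direct = {obj for _, obj in outgoing.get(start, [])}
    chains = []
    for first_relation, middle in outgoing.get(start, []):
        for second_relation, end in outgoing.get(middle, []):
            if end != start and end not in direct:
                chains.append(
                    (start, first_relation, middle, second_relation, end)
                )
    return chains


# The two explicit relations shown in the diagram.
relations = [
    ("turmeric", "inhibits", "nuclear factor-kappa B"),
    ("nuclear factor-kappa B", "involved in", "Crohn's disease"),
]
for chain in implicit_links(relations, "turmeric"):
    print(" -> ".join(chain))
```

A chain found this way is a candidate hypothesis rather than a finding, and still has to be checked experimentally.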
Targets for text mining • Many milk relationships still require further investigation • Positive reasons - nutritional benefits - neonatal growth - antimicrobial activity - bioactive peptides
Targets for text mining (cont.) • Negative reasons - recent link with Alzheimer's - diabetes link - asthma - human reactions to cow hormones (e.g. Acne, Danby 2005) - drug transfer to milk and effects - allergic reactions/intolerance - toxic contaminants
milkER process • 897 protein, 772 DNA and 1232 RNA entries • Analyse references (1465 MEDLINE refs) • MeSH terms, GO terms etc. • POS tag • UMLS standardisation • Gene/protein dictionary • Extract relations
milkER interactions • Table of interacting proteins • Store as queryable XML strings? • Discover links between proteins and disease • Create hypotheses • Confirm experimentally
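One way the "queryable XML strings" idea might look is sketched below with Python's standard xml.etree.ElementTree module. The element and attribute names, the placeholder protein names and the reference identifier are all hypothetical; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET


def interaction_to_xml(protein_a, protein_b, verb, reference_id):
    """Serialise one extracted interaction as an XML string.

    The <interaction>/<protein> layout is an assumed schema.
    """
    root = ET.Element(
        "interaction", {"type": verb, "reference": str(reference_id)}
    )
    ET.SubElement(root, "protein", {"role": "agent"}).text = protein_a
    ET.SubElement(root, "protein", {"role": "target"}).text = protein_b
    return ET.tostring(root, encoding="unicode")


def interactions_involving(xml_strings, protein):
    """Return the parsed interactions that mention the given protein."""
    hits = []
    for xml_string in xml_strings:
        root = ET.fromstring(xml_string)
        names = [element.text for element in root.findall("protein")]
        if protein in names:
            hits.append(root)
    return hits


# Placeholder data only; not real interactions or reference IDs.
stored = [
    interaction_to_xml("ProteinA", "ProteinB", "binds", "ref-1"),
    interaction_to_xml("ProteinC", "ProteinD", "activates", "ref-2"),
]
for hit in interactions_involving(stored, "ProteinB"):
    print(hit.get("type"), hit.get("reference"))
```

Storing each interaction as a self-describing string keeps the table simple, at the cost of parsing every row at query time; whether that trade-off suits milkER is an open design question, as the slide's question mark suggests.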
Diabetes • Pancreas secretes hormones • Glucagon: increases conversion of glycogen → glucose • Insulin: increases conversion of glucose → glycogen; allows glucose into cells • “Condition where the amount of glucose in the blood is abnormally high as the body cannot use it adequately as fuel”
Diabetes • Affects 3-5% of industrialised populations • Type 1 (~10%) - genetic and environmental factors (e.g. diet) - decreased insulin production - mostly develops before age 20 • Type 2 (~90%) - resistance of body to insulin - normally develops after age 40 - often associated with high blood pressure, cholesterol and arterial disease
Selected quotes • “More research is needed on all aspects of lactation in women with diabetes.” - Reader D. et al, Curr Diab Rep. 2004 • “The effect of high protein intakes from different sources on glucose-insulin metabolism needs further study” - Hoppe et al, European Journal of Clinical Nutrition 2005 • “American children also tend to be heavier than those from European countries, skewing the [growth] charts further.” - The Scotsman, Sat 5 Feb 2005 • The government currently recommends that babies should be fed breast milk alone for the first six months; the WHO recommends two years.
Conclusions • Knowledge of milk vital in many areas • milkER aims to bring disparate milk data together • Text-mining can wade through large amounts of data to retrieve and discover vital information
Future work • Relation extraction of milk literature • Extend content of milkER to include interaction data • Create hypotheses for experimental work
Acknowledgements • Prof. Lindsay Sawyer • Dr. Carl Holt (Hannah Research Institute, Ayr) • Prof. Bonnie Webber (Informatics) • Dr. Alistair Kerr and Dr. Douglas Armstrong for technical support
References • Acne/milk - Acne and milk, the diet myth, and beyond (Danby, 2005) • Diabetes/milk - Milk and diabetes (Schrezenmeir et al, 2000) REVIEW - The role of β-casein variants in the induction of insulin-dependent diabetes (Elliott et al, 1997) • Text-mining - Natural language processing and systems biology (Cohen et al, 2004) REVIEW - Mining MEDLINE for implicit links between dietary substances and diseases (Srinivasan et al, 2004) - Learning to extract proteins and their interactions from MEDLINE abstracts (Bunescu et al, 2003)