Biological information extraction from natural language text

Chitta Baral, Arizona State University


  1. Biological information extraction from natural language text. Chitta Baral, Arizona State University.

  2. Goal
     • Extract 'simple' information from text.
     • This is somewhat simpler than complete natural language understanding.
     • Examples of 'simple' information (the structure is anticipated):
       • "John was in Phoenix in March" gives at(John, Phoenix, March)
       • "Protein x in the presence of enzyme y breaks down to components z and w" gives breaks_in_presence_of(x, y, [z, w])
     • Not so 'simple' information (meta-information, unanticipated or untargeted structure):
       • "John only visits cities where he has a friend"

  3. Main approach
     • Use extraction rules that can extract the targeted information.
       • Example: extract P(X, Y, Z) from a sentence if, in that sentence, X is a proper noun, Y is a verb that immediately follows X, and Z is a noun phrase that immediately follows Y.
     • Coming up with extraction rules:
       • Manually
       • Learning extraction rules
         • Develop your own learning program.
         • Cast your problem appropriately so as to use existing learning programs (such as Progol, FOIL, etc.).
         • Take an existing information extraction system and change it so that it is applicable to our case.
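
The example rule on this slide can be sketched in Prolog roughly as follows. This is a minimal illustration, not the author's code: it assumes a simplified, hypothetical encoding in which a sentence is a list of word(Tag, Text) and ng(Words) terms, and the predicate names extract/2, consecutive/2 and prefix_of/2 are made up for the sketch.

```prolog
% Hypothetical encoding: a sentence is a list of word(Tag, Text) and
% ng(Words) terms, where ng/1 wraps the words of a noun phrase.

% extract(+Sentence, -Fact): Fact is p(X, Y, Z) when a proper noun X is
% immediately followed by a verb Y and then by a noun phrase Z.
extract(Sentence, p(X, Y, Z)) :-
    consecutive([word(nnp, X), word(vb, Y), ng(Z)], Sentence).

% consecutive(+Pattern, +List): Pattern occurs as a contiguous run in List.
consecutive(Pattern, List) :-
    prefix_of(Pattern, List).
consecutive(Pattern, [_|Rest]) :-
    consecutive(Pattern, Rest).

% prefix_of(+Prefix, +List): List starts with the elements of Prefix.
prefix_of([], _).
prefix_of([X|Xs], [X|Ys]) :-
    prefix_of(Xs, Ys).

% Example query on a toy sentence (not taken from the slides):
% ?- extract([word(nnp, 'GeneA'), word(vb, activate),
%             ng([word(dt, the), word(nn, pathway)])], F).
```

The query should bind F to p('GeneA', activate, ...) with the noun phrase's word list as the third argument. Matching is done by unification, so the variables in the pattern pick up the matched text directly.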

  4. Learning extraction rules
     • Mark the parts of the text that are to be extracted.
     • Parse the text (with markings) and do part-of-speech tagging.
     • Extract a pattern.
     • Use the pattern on other text, and add conditions or modify the pattern to avoid false positives.
     • Repeat the above steps until acceptable performance is achieved.
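
One way to picture the "add conditions" step is a rule that carries an extra goal which must hold before a match is accepted. The sketch below is an assumption about how such a condition could be used, not something the slides spell out; the rule/3 store, the interaction_verb/1 whitelist and the word(Tag, Text) encoding are all hypothetical.

```prolog
% rule(Pattern, Fact, Condition): hypothetical rule store. Condition is a
% goal that must succeed for a match to count; `true` adds no restriction.
rule([word(nnp, A), word(vb, B), word(nnp, C)],
     interact(A, B, C),
     interaction_verb(B)).

% Hypothetical whitelist of verbs taken to signal an interaction.
interaction_verb(inhibit).
interaction_verb(induce).

% extract(+Sentence, -Fact): some rule's pattern occurs as a contiguous
% run in Sentence and the rule's condition holds.
extract(Sentence, Fact) :-
    rule(Pattern, Fact, Condition),
    contiguous(Pattern, Sentence),
    call(Condition).

% contiguous(+Pattern, +Sentence): Pattern matches at some position.
contiguous(Pattern, Sentence) :- starts_with(Pattern, Sentence).
contiguous(Pattern, [_|Rest]) :- contiguous(Pattern, Rest).

% starts_with(+Prefix, +List): List begins with the elements of Prefix.
starts_with([], _).
starts_with([X|Xs], [X|Ys]) :- starts_with(Xs, Ys).
```

With these definitions, a toy sentence such as [word(nnp, 'HMBA'), word(vb, inhibit), word(nnp, 'PCNA')] yields an interact/3 fact, while [word(nnp, 'John'), word(vb, visit), word(nnp, 'Phoenix')] matches the pattern but is rejected by the condition, which is the kind of false positive the refinement step is meant to remove.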

  5. An example
     • Sentence: "HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression, it could also induce apoptosis effectively that might be through the way of up-regulation of bax and bcl-2 gene expression."
     • Interaction(HMBA, inhibit, MEC-1 cell proliferation)
     • Interaction(HMBA, down-regulation, PCNA expression)

  6. Parsing and POS tagging
     [ word([tag = 'NNP', arg(1)], 'HMBA'),
       vg([ word([tag = 'MD'], 'could'),
            word([tag = 'VB', arg(2)], 'inhibit') ]),
       ng([arg(3)],
          [ word([tag = 'DT'], 'the'),
            word([tag = 'NNP'], 'MEC-1'),
            word([tag = 'NN'], 'cell'),
            word([tag = 'NN'], 'proliferation') ]),
       word([tag = 'IN'], 'by'),
       word([tag = 'NN'], 'down-regulation'),
       word([tag = 'IN'], 'of'),
       ng([ word([tag = 'NNP'], 'PCNA'),
            word([tag = 'NN'], 'expression') ]),
       word([tag = ','], ','),
       word([tag = 'PRP'], 'it'),
       vg([ word([tag = 'MD'], 'could'),
            word([tag = 'RB'], 'also'),
            word([tag = 'VB'], 'induce') ]),
       word([tag = 'NN'], 'apoptosis'),
       word([tag = 'RB'], 'effectively'),
       word([tag = 'WDT'], 'that'),
       vg([ word([tag = 'MD'], 'might'),
            word([tag = 'VB'], 'be') ]),
       word([tag = 'IN'], 'through'),
       ng([ word([tag = 'DT'], 'the'),
            word([tag = 'NN'], 'way') ]),
       word([tag = 'IN'], 'of'),
       word([tag = 'NN'], 'up-regulation'),
       word([tag = 'IN'], 'of'),
       word([tag = 'NN'], 'bax'),
       word([tag = 'CC'], 'and'),
       ng([ word([tag = 'JJ'], 'bcl-2'),
            word([tag = 'NN'], 'gene'),
            word([tag = 'NN'], 'expression') ]) ]

  7. An alternate way to code
     sentence(s).
     first(s, p1).
     next(p1, p2).   next(p2, p3).   next(p3, p4).   next(p4, p5).
     next(p5, p6).   next(p6, p7).   next(p7, p8).   next(p8, p9).
     next(p9, p10).  next(p10, p11). next(p11, p12). next(p12, p13).
     next(p13, p14). next(p14, p15). next(p15, p16). next(p16, p17).
     next(p17, p18). next(p18, p19). next(p19, p20). next(p20, empty).
     type(p1, word). tag(p1, nnp). content(p1, hmba). marked(p1, arg1).
     type(p2, vg). …
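
To show how a rule might be written over this flat, fact-based encoding, here is a small sketch. The facts describe a made-up three-token sentence (identifiers s1, t1 to t3 and the protein names are hypothetical, and are not a continuation of the elided facts above); token/2, reach/2 and interact/4 are illustrative names.

```prolog
% Toy facts in the style of the slide (sentence and identifiers made up).
sentence(s1).
first(s1, t1).
next(t1, t2).  next(t2, t3).  next(t3, empty).
type(t1, word).        type(t2, word).     type(t3, word).
tag(t1, nnp).          tag(t2, vb).        tag(t3, nnp).
content(t1, 'ProtA').  content(t2, bind).  content(t3, 'ProtB').

% token(?S, ?T): T is a position of sentence S, found by starting at the
% first position and following next/2 links.
token(S, T) :- first(S, F), reach(F, T).

% reach(+From, ?T): T is From itself or a later position; `empty` marks
% the end of the sentence and is never returned.
reach(T, T) :- T \== empty.
reach(F, T) :- next(F, N), N \== empty, reach(N, T).

% interact(?S, -X, -Y, -Z): a proper noun, a base-form verb and a proper
% noun occupy three consecutive word positions of sentence S.
interact(S, X, Y, Z) :-
    sentence(S), token(S, P1),
    type(P1, word), tag(P1, nnp), content(P1, X),
    next(P1, P2), type(P2, word), tag(P2, vb),  content(P2, Y),
    next(P2, P3), type(P3, word), tag(P3, nnp), content(P3, Z).
```

The query `?- interact(s1, X, Y, Z).` should bind X, Y and Z to 'ProtA', bind and 'ProtB'. A possible reason to prefer this encoding is that each token becomes a set of ground facts, which is the input style that learners such as Progol and FOIL work with.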

  8. POS tags
     • NNP: proper noun
     • MD: modal
     • VB: verb, base form
     • DT: determiner
     • NN: common noun
     • IN: preposition
     • PRP: personal pronoun
     • RB: adverb
     • WDT: wh-determiner
     • CC: coordinating conjunction
     • JJ: adjective

  9. Extracted interaction rule
     extract( [ word([tag = NNP], _h18724),
                word([tag = VB], _h18725),
                ng(_h18726) ],
              interact(_h18724, _h18725, _h18726),
              true ).

  10. Tagged text
      Interact(HMBA,
               [word([tag = MD], could), word([tag = VB], inhibit)],
               [word([tag = DT], the), word([tag = NNP], MEC-1), word([tag = NN], cell), word([tag = NN], proliferation)]).
      Interact(HMBA,
               down-regulation,
               [word([tag = NNP], PCNA), word([tag = NN], expression)]).

  11. Prolog code for learning extraction rules
      % S: tagged text, I: interaction fact, P: extraction pattern
      :- import append/3 from basics.
      % learn(+S): find an interaction in S, print it and store the rule.
      learn(S) :-
          find_interact(S, I, P),
          nl, write(I), nl, write(P),
          write_file(P, I).
      % A word marked arg(1) contributes word([T], A) to the pattern, with A
      % left unbound so that it is shared with the interact/3 fact.
      find_interact([word([T, arg(1)], _) | R], interact(A, B, C), P) :-
          pattern([word([T], A) | PR], P),
          find_interact(R, interact(A, B, C), PR).
      % More rules for find_interact.
      pattern(W, P) :- P = W.
      % write_file(+P, +I): append the rule extract(P, I, true) to extract.P.
      write_file(P, I) :-
          E = extract(P, I, true),
          open('extract.P', append, F),
          write(F, E), write(F, '.'), nl(F),
          close(F).
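
Because the slide shows only one find_interact clause, the following self-contained sketch illustrates one possible way to generalise a marked sentence into a rule of the form extract(Pattern, interact(A, B, C), true). It is an assumption about how the remaining cases could be handled, not the author's implementation: marked/4, collect/2, join/3, pick/4 and learn_rule/2 are hypothetical names, and the sketch simply keeps the three marked elements and drops everything else.

```prolog
% marked(+Element, -N, -General, -Var): Element carries the mark arg(N);
% General is Element with its text replaced by the fresh variable Var.
marked(word([tag = T, arg(N)], _), N, word([tag = T], V), V).
marked(ng([arg(N)], _), N, ng(V), V).

% collect(+Sentence, -Found): Found lists N-(General-Var) for every marked
% element, looking inside vg(...) verb groups as well.
collect([], []).
collect([vg(Ws)|Rest], Found) :- !,
    collect(Ws, Inner),
    collect(Rest, Outer),
    join(Inner, Outer, Found).
collect([E|Rest], [N-(G-V)|Found]) :-
    marked(E, N, G, V), !,
    collect(Rest, Found).
collect([_|Rest], Found) :-
    collect(Rest, Found).

% join(+Xs, +Ys, -Zs): Zs is Xs followed by Ys (a local append).
join([], Ys, Ys).
join([X|Xs], Ys, [X|Zs]) :- join(Xs, Ys, Zs).

% pick(+N, +Found, -General, -Var): look up the element marked arg(N).
pick(N, [N-(G-V)|_], G, V) :- !.
pick(N, [_|Rest], G, V) :- pick(N, Rest, G, V).

% learn_rule(+Sentence, -Rule): build a rule from the elements marked
% arg(1), arg(2) and arg(3); the variables tie pattern and fact together.
learn_rule(Sentence, extract([G1, G2, G3], interact(A, B, C), true)) :-
    collect(Sentence, Found),
    pick(1, Found, G1, A),
    pick(2, Found, G2, B),
    pick(3, Found, G3, C).

% Example query, using the start of the marked sentence from slide 6:
% ?- learn_rule([word([tag = 'NNP', arg(1)], 'HMBA'),
%                vg([word([tag = 'MD'], could),
%                    word([tag = 'VB', arg(2)], inhibit)]),
%                ng([arg(3)], [word([tag = 'NNP'], 'MEC-1'),
%                              word([tag = 'NN'], proliferation)])], R).
```

The query should produce a rule whose pattern is an NNP word, a VB word and an ng/1 term, each holding a variable that also appears in the interact/3 fact, which has the same shape as the extracted rule on slide 9.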

  12. A set of extraction patterns
      • extract([word([tag = 'NNP'], _h13664), word([tag = 'VB'], _h13665), word([tag = 'NNP'], _h13666)], interact(_h13664, _h13665, _h13666), true).
      • extract([word([tag = 'NNP'], _h62915), vg(_h62916), ng(_h62917)], interact(_h62915, _h62916, _h62917), true).
      • extract([word([tag = 'NNP'], _h112469), word([tag = 'NN'], _h112470), ng(_h112471)], interact(_h112469, _h112470, _h112471), true).
      • extract([word([tag = 'NNP'], _h161953), word([tag = 'NN'], _h161954), word([tag = 'NNP'], _h161955)], interact(_h161953, _h161954, _h161955), true).
      • extract([word([tag = 'VB'], _h17857), vg(_h17858), ng(_h17859)], interact(_h17857, _h17858, _h17859), true).
      • extract([word([tag = 'NNP'], _h42739), word([tag = 'NN'], _h42740), ng(_h42741)], interact(_h42739, _h42740, _h42741), true).
      • extract([word([tag = 'NNP'], _h44071), word([tag = 'NN'], _h44072), ng(_h44073)], interact(_h44071, _h44072, _h44073), true).
      • extract([word([tag = 'NNP'], _h16431), word([tag = 'NN'], _h16432), ng(_h16433)], interact(_h16431, _h16432, _h16433), true).

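  13. Code that extracts patterns
      :- load_dyn('extract.P').
      % An empty pattern matches whatever is left of the sentence.
      matcher(_, [], _).
      % The sentence head unifies with the pattern head: consume both.
      matcher([SH|ST], [SH|PT], _) :- matcher(ST, PT, _).
      % The heads are not identical: skip the sentence element.
      matcher([SH|ST], [PH|PT], _) :- SH \== PH, matcher(ST, [PH|PT], _).
      run(S) :- process(S).
      % Try every stored pattern on S and write each extracted fact;
      % `fail` forces backtracking, and the second clause ends the loop.
      process(S) :- extract(P, F, _), matcher(S, P, _), write_file(F), fail.
      process(_).
      % write_file(+I): append the fact I to interact.P.
      write_file(I) :-
          open('interact.P', append, File),
          write(File, I), write(File, '.'), nl(File),
          close(File).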
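
To see how matching binds the variables of a rule, the sketch below applies one rule to a short sentence without any file I/O. It is a simplified stand-in for the slide's code under stated assumptions: the rule is written inline in a hypothetical rule/1 fact rather than loaded from extract.P, subseq/2 and apply_rules/2 are illustrative names, and the example sentence is flat (the verb is not wrapped in a vg group).

```prolog
% A rule in the slide's format, stored inline so the sketch is
% self-contained.
rule(extract([word([tag = 'NNP'], A), word([tag = 'VB'], B), ng(C)],
             interact(A, B, C), true)).

% subseq(+Sentence, +Pattern): the pattern elements unify, in order, with
% elements of the sentence; sentence elements in between are skipped.
subseq(_, []).
subseq([E|Es], [E|Ps]) :- subseq(Es, Ps).
subseq([_|Es], [P|Ps]) :- subseq(Es, [P|Ps]).

% apply_rules(+Sentence, -Fact): Fact is produced by some stored rule
% whose pattern matches Sentence.
apply_rules(Sentence, Fact) :-
    rule(extract(Pattern, Fact, true)),
    subseq(Sentence, Pattern).

% Example query on a toy flat sentence:
% ?- apply_rules([word([tag = 'NNP'], 'HMBA'),
%                 word([tag = 'MD'], could),
%                 word([tag = 'VB'], inhibit),
%                 ng([word([tag = 'NN'], proliferation)])], F).
```

The query should bind F to an interact/3 term with 'HMBA', inhibit and the noun phrase's word list as arguments; the modal "could" is skipped because it does not unify with the VB element of the pattern. Like the slide's matcher, this allows gaps between matched elements, so a pattern can fire across unrelated material, which is one source of the false positives that added conditions are meant to filter out.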

  14. Applications of interest
      • Finding interactions between genes and proteins.
      • Given a set of genes, say obtained from microarray experiments, use the extracted information to get a rough idea of the various genes and proteins that interact with these genes.
      • Then build a pathway.
