Biological information extraction from natural language text
Chitta Baral, Arizona State University
Goal
• Extract 'simple' information from text.
• This is somewhat simpler than complete natural language understanding.
• Examples of 'simple' information (the structure is anticipated):
  • "John was in Phoenix in March" gives at(John, Phoenix, March)
  • "Protein x, in the presence of enzyme y, breaks down to components z and w" gives breaks_in_presence_of(x, y, [z, w])
• Not so 'simple' information (meta-information, or unanticipated or untargeted structure):
  • "John only visits cities where he has a friend"
Main approach
• Use extraction rules that extract the targeted information.
  • Example: extract P(X, Y, Z) from a sentence if, in that sentence, X is a proper noun, Y is a verb that immediately follows X, and Z is a noun phrase that immediately follows Y.
• Coming up with extraction rules:
  • Manually
  • Learning extraction rules
    • Develop your own learning program
    • Cast your problem appropriately so that existing learning programs (such as Progol, FOIL, etc.) can be used
  • Take an existing information extraction system and adapt it to our case
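The example rule above (proper noun, then verb, then noun phrase, each immediately following the previous one) can be sketched in a few lines. This is a Python illustration rather than the Prolog used later in the slides; the `(kind, tag, text)` chunk format, the function name `extract_triples`, and the sample sentence are assumptions made for the sketch.

```python
def extract_triples(chunks):
    """Apply the rule: proper noun X, verb Y right after X, noun phrase Z right after Y.

    chunks is a list of (kind, tag, text) tuples (an assumed format):
    kind is "word" or "ng" (noun group); tag is a Penn Treebank tag or None.
    """
    triples = []
    # Slide a window of three adjacent chunks over the sentence.
    for x, y, z in zip(chunks, chunks[1:], chunks[2:]):
        is_proper_noun = x[0] == "word" and x[1] == "NNP"
        is_verb = y[0] == "word" and y[1] is not None and y[1].startswith("VB")
        is_noun_phrase = z[0] == "ng"
        if is_proper_noun and is_verb and is_noun_phrase:
            triples.append((x[2], y[2], z[2]))
    return triples


# Hypothetical pre-chunked sentence: "John visited the old city in March"
sentence = [
    ("word", "NNP", "John"),
    ("word", "VBD", "visited"),
    ("ng", None, "the old city"),
    ("word", "IN", "in"),
    ("word", "NNP", "March"),
]
print(extract_triples(sentence))
```

Because the window is contiguous, any word between X and Y (for example a modal such as "could") prevents a match; handling such cases is what the added conditions and modified patterns in the learning loop are for.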
Learning extraction rules
• Mark the parts of the text that are to be extracted.
• Parse the marked text and do part-of-speech tagging.
• Extract a pattern from the marked, tagged text.
• Use the pattern on other text, and add conditions or modify the pattern to avoid false positives.
• Repeat the above steps until acceptable performance is achieved.
An example
• Sentence: HMBA could inhibit the MEC-1 cell proliferation by down-regulation of PCNA expression, it could also induce apoptosis effectively that might be through the way of up-regulation of bax and bcl-2 gene expression.
• Information to extract:
  • Interaction(HMBA, inhibit, MEC-1 cell proliferation)
  • Interaction(HMBA, down-regulation, PCNA expression)
Parsing and POS tagging
[ word([tag = 'NNP', arg(1)], 'HMBA'),
  vg([word([tag = 'MD'], 'could'), word([tag = 'VB', arg(2)], 'inhibit')]),
  ng([arg(3)], [word([tag = 'DT'], 'the'), word([tag = 'NNP'], 'MEC-1'), word([tag = 'NN'], 'cell'), word([tag = 'NN'], 'proliferation')]),
  word([tag = 'IN'], 'by'),
  word([tag = 'NN'], 'down-regulation'),
  word([tag = 'IN'], 'of'),
  ng([word([tag = 'NNP'], 'PCNA'), word([tag = 'NN'], 'expression')]),
  word([tag = ','], ','),
  word([tag = 'PRP'], 'it'),
  vg([word([tag = 'MD'], 'could'), word([tag = 'RB'], 'also'), word([tag = 'VB'], 'induce')]),
  word([tag = 'NN'], 'apoptosis'),
  word([tag = 'RB'], 'effectively'),
  word([tag = 'WDT'], 'that'),
  vg([word([tag = 'MD'], 'might'), word([tag = 'VB'], 'be')]),
  word([tag = 'IN'], 'through'),
  ng([word([tag = 'DT'], 'the'), word([tag = 'NN'], 'way')]),
  word([tag = 'IN'], 'of'),
  word([tag = 'NN'], 'up-regulation'),
  word([tag = 'IN'], 'of'),
  word([tag = 'NN'], 'bax'),
  word([tag = 'CC'], 'and'),
  ng([word([tag = 'JJ'], 'bcl-2'), word([tag = 'NN'], 'gene'), word([tag = 'NN'], 'expression')])
]
An alternate way to code
sentence(s).
first(s, p1).
next(p1, p2). next(p2, p3). next(p3, p4). next(p4, p5).
next(p5, p6). next(p6, p7). next(p7, p8). next(p8, p9).
next(p9, p10). next(p10, p11). next(p11, p12). next(p12, p13).
next(p13, p14). next(p14, p15). next(p15, p16). next(p16, p17).
next(p17, p18). next(p18, p19). next(p19, p20). next(p20, empty).
type(p1, word). tag(p1, nnp). content(p1, hmba). marked(p1, arg1).
type(p2, vg). …
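A flat fact encoding like the one above can be generated mechanically from a list of top-level sentence items. The following Python sketch (not the slides' Prolog) shows one way to do that; the dictionary keys and the function name `to_facts` are assumptions, and because the slide elides how the contents of groups such as `vg` are encoded, the sketch emits only their `type` fact.

```python
def to_facts(items, sentence_id="s"):
    """Emit Prolog-style facts for the top-level items of one sentence.

    Each item is a dict with a 'type' key and optional 'tag', 'content'
    and 'marked' keys (an assumed input format). Values are written as
    given, so they should already be valid lower-case Prolog atoms.
    """
    facts = [f"sentence({sentence_id})."]
    ids = [f"p{i}" for i in range(1, len(items) + 1)]
    if ids:
        facts.append(f"first({sentence_id}, {ids[0]}).")
    # Chain the positions together; the last one points to 'empty'.
    for current, following in zip(ids, ids[1:] + ["empty"]):
        facts.append(f"next({current}, {following}).")
    # Describe each position with whatever attributes it carries.
    for pid, item in zip(ids, items):
        facts.append(f"type({pid}, {item['type']}).")
        for key in ("tag", "content", "marked"):
            if key in item:
                facts.append(f"{key}({pid}, {item[key]}).")
    return facts


items = [
    {"type": "word", "tag": "nnp", "content": "hmba", "marked": "arg1"},
    {"type": "vg"},
]
for fact in to_facts(items):
    print(fact)
```

Compared with the nested list term, this encoding gives every position a name, which is convenient when the facts are handed to a relational learner as background knowledge.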
POS tags
• NNP - proper noun
• MD - modal
• VB - verb, base form
• DT - determiner
• NN - common noun
• IN - preposition
• PRP - personal pronoun
• RB - adverb
• WDT - wh-determiner
• CC - coordinating conjunction
• JJ - adjective
Extracted interaction rule
extract( [ word([tag = 'NNP'], _h18724),
           word([tag = 'VB'], _h18725),
           ng(_h18726) ],
         interact(_h18724, _h18725, _h18726),
         true ).
Tagged text
interact('HMBA',
         [word([tag = 'MD'], could), word([tag = 'VB'], inhibit)],
         [word([tag = 'DT'], the), word([tag = 'NNP'], 'MEC-1'), word([tag = 'NN'], cell), word([tag = 'NN'], proliferation)]).
interact('HMBA',
         'down-regulation',
         [word([tag = 'NNP'], 'PCNA'), word([tag = 'NN'], expression)]).
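Prolog code for learning extraction rules
• S: tagged text
• I: interaction fact
• P: extraction pattern

:- import append/3 from basics.

% learn(+S): derive a pattern P and an interaction fact I from the marked,
% tagged sentence S, print both, and append the resulting rule to extract.P.
learn(S) :-
    find_interact(S, I, P),
    nl, write(I),
    nl, write(P),
    write_file(P, I).

% A word marked arg(1) becomes the first pattern element. Its content is
% replaced by the variable A, which is shared with the interaction fact.
find_interact([word([T, arg(1)], _) | R], interact(A, B, C), P) :-
    pattern([word([T], A) | PR], P),
    find_interact(R, interact(A, B, C), PR).

% More rules for find_interact.

pattern(W, P) :- P = W.

% Append the rule extract(P, I, true) to the file extract.P.
% writeq keeps the quotes around atoms such as 'NNP' so the file can be read back.
write_file(P, I) :-
    E = extract(P, I, true),
    open('extract.P', append, F),
    writeq(F, E), write(F, '.'), nl(F),
    close(F).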
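Since only one `find_interact` clause is shown, the overall effect of the learning step is easiest to see on the earlier example: the marked items are kept, their contents are replaced by argument slots, and everything unmarked is dropped. The Python sketch below illustrates that idea under stated assumptions; the dictionary-based item format and the name `learn_pattern` are hypothetical, and the sketch is not a translation of the rules that the slide omits.

```python
def learn_pattern(items):
    """Generalize a marked, tagged sentence into an extraction pattern.

    Each item is a dict with 'kind' ("word", "vg", "ng"), an optional 'tag',
    an optional 'arg' marker (1, 2 or 3) and, for groups, a 'children' list
    (an assumed format). The result lists (kind, tag, arg) in text order.
    """
    pattern = []

    def walk(sequence):
        for item in sequence:
            if "arg" in item:
                # A marked item becomes a pattern element with an open slot.
                pattern.append((item["kind"], item.get("tag"), item["arg"]))
            elif "children" in item:
                # An unmarked group may still contain a marked word.
                walk(item["children"])

    walk(items)
    return pattern


# Opening of the HMBA sentence, with the same markings as in the parse.
marked_sentence = [
    {"kind": "word", "tag": "NNP", "text": "HMBA", "arg": 1},
    {"kind": "vg", "children": [
        {"kind": "word", "tag": "MD", "text": "could"},
        {"kind": "word", "tag": "VB", "text": "inhibit", "arg": 2},
    ]},
    {"kind": "ng", "arg": 3, "children": [
        {"kind": "word", "tag": "DT", "text": "the"},
        {"kind": "word", "tag": "NNP", "text": "MEC-1"},
    ]},
    {"kind": "word", "tag": "IN", "text": "by"},
]
print(learn_pattern(marked_sentence))
```

The result has the same shape as the extracted rule for this sentence: a proper-noun word, a base-form verb, and a noun group, each tied to one argument of the interaction.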
A set of extraction patterns
extract([word([tag = 'NNP'], _h13664), word([tag = 'VB'], _h13665), word([tag = 'NNP'], _h13666)], interact(_h13664, _h13665, _h13666), true).
extract([word([tag = 'NNP'], _h62915), vg(_h62916), ng(_h62917)], interact(_h62915, _h62916, _h62917), true).
extract([word([tag = 'NNP'], _h112469), word([tag = 'NN'], _h112470), ng(_h112471)], interact(_h112469, _h112470, _h112471), true).
extract([word([tag = 'NNP'], _h161953), word([tag = 'NN'], _h161954), word([tag = 'NNP'], _h161955)], interact(_h161953, _h161954, _h161955), true).
extract([word([tag = 'VB'], _h17857), vg(_h17858), ng(_h17859)], interact(_h17857, _h17858, _h17859), true).
extract([word([tag = 'NNP'], _h42739), word([tag = 'NN'], _h42740), ng(_h42741)], interact(_h42739, _h42740, _h42741), true).
extract([word([tag = 'NNP'], _h44071), word([tag = 'NN'], _h44072), ng(_h44073)], interact(_h44071, _h44072, _h44073), true).
extract([word([tag = 'NNP'], _h16431), word([tag = 'NN'], _h16432), ng(_h16433)], interact(_h16431, _h16432, _h16433), true).
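Code that applies the extraction patterns
:- load_dyn('extract.P').

% matcher(+Sentence, +Pattern, _): succeeds if the pattern elements occur in
% the sentence in order. Sentence items that differ from the next pattern
% element are skipped.
matcher(_, [], _).
matcher([SH|ST], [SH|PT], _) :- matcher(ST, PT, _).
matcher([SH|ST], [PH|PT], _) :- SH \== PH, matcher(ST, [PH|PT], _).

run(S) :- process(S).

% Failure-driven loop: try every stored pattern P against the sentence S and
% write each interaction fact F whose pattern matches.
process(S) :- extract(P, F, _), matcher(S, P, _), write_file(F), fail.
process(_).

% Append one interaction fact to the file interact.P.
write_file(I) :-
    open('interact.P', append, File),
    write(File, I), write(File, '.'), nl(File),
    close(File).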
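The matcher above accepts a sentence when the pattern elements appear in it in order, skipping items that do not match the next element. A rough Python analogue is sketched below to make that behaviour concrete. It is a simplification, not a port: Prolog unification and backtracking are replaced by a single left-to-right pass that returns only the first match, and the flat `(kind, tag, text)` sentence format and the name `match_with_gaps` are assumptions.

```python
def match_with_gaps(sentence, pattern):
    """Return the texts bound to the pattern elements, or None if no match.

    sentence: list of (kind, tag, text) tuples (an assumed, flattened format).
    pattern:  list of (kind, tag) pairs; a tag of None accepts any tag.
    Pattern elements must occur in order, but other items may lie between them.
    """
    bindings = []
    next_element = 0
    for kind, tag, text in sentence:
        if next_element == len(pattern):
            break  # every pattern element is already bound
        wanted_kind, wanted_tag = pattern[next_element]
        if kind == wanted_kind and (wanted_tag is None or tag == wanted_tag):
            bindings.append(text)
            next_element += 1
        # otherwise skip this sentence item and keep looking
    return tuple(bindings) if next_element == len(pattern) else None


sentence = [
    ("word", "NNP", "HMBA"),
    ("word", "MD", "could"),
    ("word", "VB", "inhibit"),
    ("ng", None, "the MEC-1 cell proliferation"),
]
pattern = [("word", "NNP"), ("word", "VB"), ("ng", None)]
print(match_with_gaps(sentence, pattern))
# → ('HMBA', 'inhibit', 'the MEC-1 cell proliferation')
```

Allowing gaps lets one pattern cover sentences with intervening words such as "could", at the cost of more false positives, which is where the extra conditions in the learning loop come in.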
Applications of interest
• Finding interactions between genes and proteins.
• Given a set of genes, say obtained from microarray experiments, use the extracted information to get a rough idea of the various genes and proteins that interact with these genes.
• Then build a pathway.
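As a rough illustration of the second step, the extracted interaction facts can be indexed by the genes of interest. The Python sketch below assumes the facts have already been reduced to `(agent, relation, target)` triples with normalized entity names; that normalization, the function name `interaction_partners`, and the sample data are assumptions, not something the slides describe.

```python
def interaction_partners(interactions, genes_of_interest):
    """Map each gene of interest to the set of entities it interacts with.

    interactions: iterable of (agent, relation, target) triples with
    normalized entity names (an assumption for this sketch).
    """
    partners = {gene: set() for gene in genes_of_interest}
    for agent, _relation, target in interactions:
        # Record the interaction from whichever side is a gene of interest.
        if agent in partners:
            partners[agent].add(target)
        if target in partners:
            partners[target].add(agent)
    return partners


# Hypothetical, already-normalized triples in the spirit of the HMBA example.
interactions = [
    ("HMBA", "down-regulation", "PCNA"),
    ("HMBA", "up-regulation", "bax"),
]
print(interaction_partners(interactions, ["PCNA", "bax"]))
```

The resulting neighbour sets are only a starting point for building a pathway; the relation labels, which this sketch ignores, would be needed to give the edges a direction or a sign.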