
LATE: Lisp Architecture for Text Engineering*

LATE: Lisp Architecture for Text Engineering*. Peter Szolovits MIT. *In homage to GATE, the General Architecture for Text Engineering. Desiderata for a Natural Language Processing Framework. Flexibility preferred over performance, aimed at furthering research





Presentation Transcript


  1. LATE: Lisp Architecture for Text Engineering* • Peter Szolovits • MIT *In homage to GATE, the General Architecture for Text Engineering

  2. Desiderata for a Natural Language Processing Framework
     • Flexibility preferred over performance, aimed at furthering research
     • Ability to deal with large corpora of documents
     • Persistence of analysis results for subsequent re-use
     • Stand-off style of annotations
     • Wrappers for existing language processing components, independent of their implementation language
     • Incorporation of tools for machine learning
     • General purpose programming language for developing workflows, pipelines, scaling and distribution
     • Dynamic programming language to make experimentation easy
     • “Convention over Configuration”

  3. LATE
     • Core programming environment is Common Lisp
       • Object-oriented, multiple inheritance
       • Garbage collection
       • Strong support for meta-programming (manifest types, meta-object protocol)
       • Interfaces to C, C++, and Java external programs
       • SOAP, HTTP, … interfaces to web services
     • Persistence is implemented in a relational database (now MySQL)
       • Unlimited storage for corpora, documents, annotations, classification and regression models, …
       • Cumulative development of annotations
       • Provides a conventional interface to external programs that can make use of LATE-developed interpretations, or add to them
       • Efficient support for indexed search and retrieval

  4. LATE, continued
     • Core entities
       • Corpus: a collection of documents, plus (possibly) a structure grammar describing how to interpret the section/subsection structure of the documents
       • Document: a single document, including its content as a UTF-8 string, plus meta-data about source, …; documents can belong to many corpora, for example for cross-validation studies
       • Annotation: a stand-off characterization of a substring (perhaps all) of a document’s content; annotations are of various types, organized as a taxonomy of subtypes
         • persistent annotation: annotations never to be deleted; e.g., result of human annotation
           • PHI mark, and type of PHI; e.g., name, address, MRN, …
           • “gold standard” diagnosis, procedure, medication, …
         • volatile annotation: machine-generated annotation
           • while in use, organized in an efficient red-black interval tree
       • Model: a stored version of a learned model, from a machine learning procedure
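The stand-off idea above can be sketched in a few lines. This is a minimal illustration in Python rather than LATE's Common Lisp; the class and field names (`Annotation`, `Document`, `ann_type`, `persistent`) are hypothetical and not taken from LATE. Offsets are treated as end-exclusive, which matches the spans printed on the later slides (19 29 covers the 10-character "2011-11-13"). The overlap query is a plain linear scan standing in for the red-black interval tree the slide mentions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: refers to the text by offsets and never edits it."""
    ann_type: str                 # e.g. "section-annotation", "lp-token"
    start: int                    # inclusive character offset
    end: int                      # exclusive character offset
    data: Optional[str] = None    # payload such as a CUI or a section label
    persistent: bool = False      # True for human "gold standard" marks


@dataclass
class Document:
    content: str
    annotations: List[Annotation] = field(default_factory=list)

    def text_of(self, ann: Annotation) -> str:
        """Recover the annotated substring from the untouched content."""
        return self.content[ann.start:ann.end]

    def overlapping(self, start: int, end: int,
                    ann_type: Optional[str] = None) -> List[Annotation]:
        """Linear scan for annotations that overlap [start, end).

        An interval tree would answer the same query without visiting
        every annotation; the scan keeps the sketch short.
        """
        return [a for a in self.annotations
                if a.start < end and start < a.end
                and (ann_type is None or a.ann_type == ann_type)]


doc = Document("Sex: M Service: Neonatology")
doc.annotations.append(Annotation("section-head", 0, 4, "sex"))
doc.annotations.append(Annotation("lp-token", 5, 6))
doc.annotations.append(Annotation("lp-token", 16, 27))
hits = doc.overlapping(0, 6)
```

Because annotations only carry offsets, several tools can annotate the same document independently, and machine-generated (volatile) annotations can be discarded without touching the human ones.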

  5. LATE: Volatile Annotations
     • Sections: represent a structural hierarchy of the structures and substructures in a document, as specified by a document structure grammar
       • list item annotations recognize enumerated lists within sections, a common form in which lists of drugs, procedures, etc. are reported
     • Sentences: we break text (within section boundaries) into sentences because many processing algorithms work on sentence-level text
     • Tokens: tokens are identified within text
       • different incorporated processing algorithms have slightly different definitions of token: e.g., “12/4/2008” vs. {“12”, “/”, “4”, “/”, “2008”}
       • some compound tokens have internal structure that we parse; e.g., T-99.4, spO2:92
       • parts of speech, determined by the UMLS Specialist lexicon, Link Grammar parser, and Brill tagger
     • UMLS-annotations: we map tokens and sequences of tokens to UMLS concept unique identifiers (CUI), type identifiers (TUI), MeSH terms, and an aggregate semantic type (problem, test, drug, ...)
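The two token issues mentioned above (competing tokenizations of a date, and compound tokens with internal structure) can be illustrated with a small Python sketch. The regular expressions and function names here are assumptions made for illustration; the slides do not show how LATE actually parses these tokens.

```python
import re

# Hypothetical pattern for a compound token: an alphanumeric label,
# a '-' or ':' separator, then a numeric value (e.g. "T-99.4", "spO2:92").
COMPOUND = re.compile(r"^(?P<label>[A-Za-z][A-Za-z0-9]*)[-:](?P<value>\d+(?:\.\d+)?)$")


def split_fine(token):
    """Fine-grained view: runs of digits and single punctuation marks."""
    return re.findall(r"\d+|[^\d\s]", token)


def parse_compound(token):
    """Return (label, value) for a compound token, or None if it is not one."""
    m = COMPOUND.match(token)
    if m is None:
        return None
    return m.group("label"), float(m.group("value"))


date_pieces = split_fine("12/4/2008")
temperature = parse_compound("T-99.4")
saturation = parse_compound("spO2:92")
```

Keeping both the coarse token and its parsed parts as separate stand-off annotations lets each downstream component use the granularity it expects.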

  6. Example
     • A discharge summary, de-identified, with synthesized names, dates, etc.
     • Dates are uniformly offset, to retain time relations

     Admission Date: 2011-11-13 Discharge Date: 2012-02-10
     Date of Birth: 2011-11-13 Sex: M
     Service: Neonatology
     HISTORY OF PRESENT ILLNESS: Baby Dean Leslie Baugher was born at
     26 and 6/7 weeks gestation by cesarean section to a 39 year
     old, Gravida II, Para I now II woman.
     PRENATAL SCREENS: A positive, antibody negative, Rubella
     immune; RPR nonreactive; hepatitis B surface antigen
     negative; group B strep unknown.
     This pregnancy was remarkable for cervical incompetence,
     leading to cerclage placement at 12 weeks. Mother was
     admitted on 2011-10-27 for preterm labor and treated with
     Nifedipine, tocolysis and bed rest and received a course of
     Betamethasone at that time. She had refractory preterm
     labor, thus leading to delivery.
     This infant emerged with good tone and cry and delivered
     spontaneously. Apgars were seven at one minute and eight at
     five minutes. Birth weight was 1,070 grams (50 to 75th
     percentile). His birth length was 37 cm (50 to 75th
     percentile) and his head circumference was 26.5 cm (50 to
     75th percentile). Discharge weight was 3,375 grams (50th
     percentile); length 49.5 (greater than 50th percentile); head
     circumference 36.5 (greater than 90th percentile).
     PHYSICAL EXAMINATION: On admission, examination revealed an
     extremely pre term infant; anterior fontanel was soft, flat.
     Non dysmorphic, intact palate. Chest with moderate
     retractions with spontaneous breaths, fair breath sounds ...

  7. Structure
     • Headings and subheadings are found, hierarchic structure is recognized
     • Only common headings are used in this structure “grammar”

     cl-user(22): (show-annotations tt :type 'section-annotation)
     0 12217 " Admission Date: 2011-11-13 ...d of Report) " DOC: discharge_summary;
     1 17 " Admission Date:" SH: admission_date;
     17 36 " 2011-11-13 " SA: admission_date;
     36 51 "Discharge Date:" SH: discharge_date;
     51 64 " 2012-02-10 " SA: discharge_date;
     64 79 " Date of Birth:" SH: date_of_birth;
     79 99 " 2011-11-13 " SA: date_of_birth;
     99 103 "Sex:" SH: sex;
     103 107 " M " SA: sex;
     107 116 " Service:" SH: service;
     116 130 " Neonatology " SA: service;
     130 158 " HISTORY OF PRESENT ILLNESS:" SH: history_of_present_illness;
     158 1215 " Baby Dean Leslie Baugher was bor...th percentile). " SA: history_of_present_illness;
     1215 1237 " PHYSICAL EXAMINATION:" SH: physical_examination;
     1237 1815 " On admission, examination reveal... and clavicles. " SA: physical_examination;
     1815 1832 " HOSPITAL COURSE:" SH: hospital_course;
     1832 9546 " 1.) Respiratory: Scranton was i...h his progress. " SA: hospital_course;
     9546 9570 " CONDITION AT DISCHARGE:" SH: discharge_condition;
     9570 9580 " Stable. " SA: discharge_condition;
     9580 9603 " DISCHARGE DISPOSITION:" SH: discharge_disposition;
     9603 10303 " Home with family. PRIMARY PEDIA... breast ad lib. " SA: discharge_disposition;
     10303 10316 " MEDICATIONS:" SH: medications;
     10316 11716 " Fer-in-Joshua and Poly-Vi-James a... Lane Hospital. " SA: medications;
     11716 11737 " DISCHARGE DIAGNOSES:" SH: discharge_diagnoses;
     11737 12217 " Former 26 and 08-13 premature mal...d of Report) " SA: discharge_diagnoses;
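A rough Python sketch of heading-driven sectioning follows. It assumes that SH and SA in the listing above mark a section heading and the text that follows it, which the slide does not state explicitly. The `HEADINGS` table and the `sectionize` function are hypothetical; the sketch is flat, whereas LATE's structure grammar also recognizes nested subsections.

```python
import re

# Hypothetical heading "grammar": literal heading text mapped to a label.
HEADINGS = {
    "Admission Date:": "admission_date",
    "Discharge Date:": "discharge_date",
    "Sex:": "sex",
    "Service:": "service",
}


def sectionize(text):
    """Return (start, end, kind, label) spans with end-exclusive offsets.

    "SH" spans cover a heading; "SA" spans cover the text from the end of
    that heading up to the next heading (or the end of the document).
    """
    # Longest headings first, so a heading that is a prefix of another
    # cannot shadow it in the alternation.
    alternatives = sorted(HEADINGS, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(h) for h in alternatives))
    heads = list(pattern.finditer(text))
    spans = []
    for i, m in enumerate(heads):
        label = HEADINGS[m.group()]
        body_end = heads[i + 1].start() if i + 1 < len(heads) else len(text)
        spans.append((m.start(), m.end(), "SH", label))
        spans.append((m.end(), body_end, "SA", label))
    return spans


text = ("Admission Date: 2011-11-13 Discharge Date: 2012-02-10 "
        "Sex: M Service: Neonatology")
spans = sectionize(text)
```

Restricting the table to common headings, as the slide notes, keeps false matches low at the cost of leaving unusual sections inside the preceding section's body.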

  8. Sentence Breaks
     • Within each section, sentence breaks are determined by a MAXENT algorithm from OPENNLP
     • The model was trained on a newspaper corpus, hence perhaps not appropriate for clinical text
       • but, it seems to work reasonably well
     • 139 “sentences” in example

     cl-user(24): (sentencize tt)
     nil
     cl-user(25): (show-annotations tt :type 'sentence-annotation)
     19 29 "2011-11-13" sent: nil;
     53 63 "2012-02-10" sent: nil;
     82 92 "2011-11-13" sent: nil;
     105 106 "M" sent: nil;
     118 129 "Neonatology" sent: nil;
     160 294 "Baby Dean Leslie Baugher was born ... I now II woman." sent: nil;
     296 439 "PRENATAL SCREENS: A positive, ant...B strep unknown." sent: nil;
     441 540 "This pregnancy was remarkable for ...ent at 12 weeks." sent: nil;
     541 697 "Mother was admitted on 2011-10-27 ...ne at that time." sent: nil;
     699 758 "She had refractory preterm labor, ...ing to delivery." sent: nil;
     760 831 "This infant emerged with good tone...d spontaneously." sent: nil;
     833 891 "Apgars were seven at one minute an...at five minutes." sent: nil;
     892 945 "Birth weight was 1,070 grams (50 t...5th percentile)." sent: nil;
     947 1061 "His birth length was 37 cm (50 to ...5th percentile)." sent: nil;
     1063 1214 "Discharge weight was 3,375 grams (...0th percentile)." sent: nil;
     1239 1337 "On admission, examination revealed... was soft, flat." sent: nil;
     1338 1368 "Non dysmorphic, intact palate." sent: nil;
     1370 1494 "Chest with moderate retractions wi...coarse crackles." sent: nil;
     1495 1540 "Heart was regular rate and rhythm, no murmur." sent: nil;
     1541 1564 "Pink and well perfused." sent: nil;
     1565 1625 "Abdomen soft and distended with th... umbilical cord." sent: nil;
     1626 1638 "Patent anus." sent: nil;
     1640 1706 "Normal preterm male genitalia with...ded bilaterally." sent: nil;
     1708 1742 "Age appropriate tone and reflexes." sent: nil;
     1743 1772 "Bruising of arms bilaterally." sent: nil;
     1774 1814 "Normal spines, limbs, hip and clavicles." sent: nil;
     1834 1910 "1.) Respiratory: Scranton was int...s of Surfactant." sent: nil;
     1911 1976 "He remained on SIMV 8until day of ... self-extubated." sent: nil;
     1978 2167 "He was then placed on continuous p...ned to room air." sent: nil;
     2169 2442 "He had a trial of diuretic therapy...ping the Diuril." sent: nil;
     2444 2552 "Baby was loaded with caffeine citr... day of life 48." sent: nil;

  9. Link Grammar Parser
     • Constraint-based lexicalized parser
       • Tokenizes
       • computes all possible links among word pairs
       • chooses linkages in which links do not cross
     • Example has 139 sentences, of which 135 parsed
       • Combinatorial explosion in 4
       • Multiple parses possible in many
     • Links are stored with the 2171 tokens

     cl-user(26): (link-parse tt)
     (135 4 0 0)
     cl-user(28): (length (annotations tt :type 'lp-token))
     2171
     cl-user(31): (show-annotations tt :type 'lp-token)
     19 29 "2011-11-13" lptok: 1;
     53 63 "2012-02-10" lptok: 1;
     82 92 "2011-11-13" lptok: 1;
     105 106 "M" lptok: 1;
     118 129 "Neonatology" lptok: 1;
     160 164 "Baby" lptok: 1;
     165 169 "Dean" lptok: 2;
     170 176 "Leslie" lptok: 3;
     177 184 "Baugher" lptok: 4;
     185 188 "was" lptok: 5;
     189 193 "born" lptok: 6;
     194 196 "at" lptok: 7;
     197 199 "26" lptok: 8;
     200 203 "and" lptok: 9;
     204 207 "6/7" lptok: 10;
     208 213 "weeks" lptok: 11;
     214 223 "gestation" lptok: 12;
     224 226 "by" lptok: 13;
     227 235 "cesarean" lptok: 14;
     236 243 "section" lptok: 15;
     244 246 "to" lptok: 16;
     247 248 "a" lptok: 17;
     249 251 "39" lptok: 18;
     252 256 "year" lptok: 19;
     257 260 "old" lptok: 20;
     260 261 "," lptok: 21;
     262 269 "Gravida" lptok: 22;
     270 272 "II" lptok: 23;
     272 273 "," lptok: 24;
     274 278 "Para" lptok: 25;
     279 280 "I" lptok: 26;
     281 284 "now" lptok: 27;
     285 287 "II" lptok: 28;
     288 293 "woman" lptok: 29;
     293 294 "." lptok: 30;
     296 304 "PRENATAL" lptok: 1;
     305 312 "SCREENS" lptok: 2;
     312 313 ":" lptok: 3;
     315 316 "A" lptok: 4;
     317 325 "positive" lptok: 5;
     325 326 "," lptok: 6;
     327 335 "antibody" lptok: 7;
     336 344 "negative" lptok: 8;
     ...
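The "links do not cross" condition can be stated precisely: with links drawn as arcs above the sentence, two links cross when exactly one endpoint of one lies strictly between the endpoints of the other. The Python sketch below checks that condition for a set of links given as pairs of word positions. The function names and the example link sets are illustrative assumptions, not output of the Link Grammar parser.

```python
from itertools import combinations


def links_cross(a, b):
    """True if links a and b, drawn as arcs above the words, intersect."""
    (a1, a2), (b1, b2) = sorted(a), sorted(b)
    return a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2


def is_planar(links):
    """True if no pair of links in the linkage crosses."""
    return not any(links_cross(a, b) for a, b in combinations(links, 2))


# Illustrative linkage over word positions 1..6 (not actual parser output).
nested = [(1, 2), (2, 5), (3, 5), (4, 5)]
# Adding a link from word 3 to word 6 cuts across the (2, 5) link.
crossing = nested + [(3, 6)]
```

Links that share an endpoint or nest inside one another are allowed; only interleaved endpoints violate the constraint.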

  10. Parsing examples
     • She had refractory preterm labor, thus leading to delivery.

     [Link Grammar linkage diagram; column alignment lost in extraction. Link labels shown: Os, MXsp, A, Xd, Xc, Ss, A, E, MVp, Jp]
     she had.v refractory.a preterm[?].a labor.n , thus leading.g to delivery.n .

     cl-user(36): (setq s9 (elt (annotations tt :type 'sentence-annotation) 9))
     #<sent 235584073 (699-758): "She had refractory preterm labor, ...ing to delivery.">
     cl-user(37): (setq st9 (annotations s9 :type 'lp-token))
     (#<lptok 235586149 (699-702):1: "She">
      #<lptok 235586148 (703-706):2: "had">
      #<lptok 235586147 (707-717):3: "refractory">
      #<lptok 235586146 (718-725):4: "preterm">
      #<lptok 235586145 (726-731):5: "labor">
      #<lptok 235586144 (731-732):6: ",">
      #<lptok 235586143 (733-737):7: "thus">
      #<lptok 235586142 (738-745):8: "leading">
      #<lptok 235586141 (746-748):9: "to">
      #<lptok 235586140 (749-757):10: "delivery"> ...)
     cl-user(38): (print-table (left-links (elt st9 1)))
     1 she Ss Ss S had.v-d 2 1 1 63
     cl-user(39): (print-table (right-links (elt st9 1)))
     2 had.v-d I*j I*j I labor.v 5 3 2 15
     2 had.v-d O Os Os preterm[?].n 4 2 2 15
     2 had.v-d MV MVg MVg leading.g 8 6 1 23
     2 had.v-d O Ou Ou labor.n-u 5 3 1 47

  11. Parsing examples
     • Baby Dean Leslie Baugher was born at 26 and 6/7 weeks gestation by cesarean section to a 39 year old, Gravida II, Para I now II woman.

     [Link Grammar linkage diagram, wrapped across two rows; column alignment lost in extraction. Link labels shown include MVp, Jp, GN, A, G, Ss, Pv, AN, MXs, Js, Xd, Ds, X, Xc]
     baby.n Dean.b Leslie.b Baugher was.v born.v at 26 and 6/7[?].a weeks.n gestation.n by cesarean[?].n [section] to a
     [39] year.n [old] , Gravida II , Para I.n [now] II [woman] .

  12. UMLS Lookup examples (only the good)
     208 223 "weeks gestation" TUI: T033; SEM: _finding; CUI: C1135241;
     227 243 "cesarean section" TUI: T061,T033; SEM: _finding,_procedure; CUI: C0007876,C0029535,C1384674,C2053588,C2114431; MeSH: E04.520.252.500;
     262 272 "Gravida II" TUI: T033; SEM: _finding; CUI: C0232997;
     274 278 "Para" TUI: T033; SP-POS: noun; SEM: _finding; CUI: C0030563; MeSH: G08.686.677,G08.686.785.760.769.472,N06.850.490.812.600;
     327 344 "antibody negative" TUI: T033; SEM: _finding; CUI: C0855852;
     346 353 "Rubella" TUI: T047,T116,T121,T129; SP-POS: noun; SEM: _medication,_disease; CUI: C0035920,C0035923; MeSH: C02.782.930.700.700,D20.215.894.899.779;
     354 360 "immune" TUI: T169; SP-POS: adj,noun; SEM: _modifier; CUI: C0439662;
     362 377 "RPR nonreactive" TUI: T034; SEM: _bodyparam; CUI: C0748443;
     379 406 "hepatitis B surface antigen" TUI: T121,T129,T059; SEM: _procparam,_medication; CUI: C0019168,C0201477,C2229745; MeSH: D23.050.327.495.500.475;
     407 415 "negative" TUI: T080,T033; SP-POS: noun,verb,adj; SEM: _finding,_modifier; CUI: C0205160,C1513916;
     417 430 "group B strep" TUI: T007; SEM: _modifier; CUI: C0579233; MeSH: B03.510.400.800.872.100;
     431 438 "unknown" TUI: T078,T169,T170,T056,T080,T121,T129,T032,T033,T098; SP-POS: adj,noun; SEM: _finding,_bodyparam,_medication,_modifier; CUI: C0439673,C1521803,C1546837,C1546841,C1547283,C1547294,C1547306,C1547312,...

  13. Database storage
     mysql> select * from annotations where document_id=31039 limit 30;
     +-----------+----------------+-------------+-------+-----+----------+-------+------+
     | id        | type           | document_id | start | end | data     | other | up   |
     +-----------+----------------+-------------+-------+-----+----------+-------+------+
     | 235586254 | cui-annotation |       31039 |   105 | 106 | C0024554 | NULL  | NULL |
     | 235586255 | cui-annotation |       31039 |   105 | 106 | C0221134 | NULL  | NULL |
     | 235586256 | cui-annotation |       31039 |   105 | 106 | C0227102 | NULL  | NULL |
     | 235586257 | cui-annotation |       31039 |   105 | 106 | C0369637 | NULL  | NULL |
     | 235586258 | cui-annotation |       31039 |   105 | 106 | C0439113 | NULL  | NULL |
     | 235586259 | cui-annotation |       31039 |   105 | 106 | C0439232 | NULL  | NULL |
     | 235586260 | cui-annotation |       31039 |   105 | 106 | C0441923 | NULL  | NULL |
     | 235586261 | cui-annotation |       31039 |   105 | 106 | C0456533 | NULL  | NULL |
     | 235586262 | cui-annotation |       31039 |   105 | 106 | C0456644 | NULL  | NULL |
     | 235586263 | cui-annotation |       31039 |   105 | 106 | C0475209 | NULL  | NULL |
     | 235586264 | cui-annotation |       31039 |   105 | 106 | C1553028 | NULL  | NULL |
     | 235586265 | cui-annotation |       31039 |   105 | 106 | C1553034 | NULL  | NULL |
     | 235586266 | cui-annotation |       31039 |   105 | 106 | C1706456 | NULL  | NULL |
     | 235586267 | cui-annotation |       31039 |   105 | 106 | C1706457 | NULL  | NULL |
     | 235586268 | cui-annotation |       31039 |   105 | 106 | C1883310 | NULL  | NULL |
     | 235586281 | cui-annotation |       31039 |   118 | 129 | C0027621 | NULL  | NULL |
     | 235586285 | cui-annotation |       31039 |   160 | 164 | C0021270 | NULL  | NULL |
     | 235586286 | cui-annotation |       31039 |   160 | 164 | C1550504 | NULL  | NULL |
     | 235586297 | cui-annotation |       31039 |   189 | 193 | C0004897 | NULL  | NULL |
     | 235586298 | cui-annotation |       31039 |   189 | 193 | C1301886 | NULL  | NULL |
     | 235586299 | cui-annotation |       31039 |   189 | 193 | C1704689 | NULL  | NULL |
     | 235586311 | cui-annotation |       31039 |   197 | 199 | C0227067 | NULL  | NULL |
     | 235586312 | cui-annotation |       31039 |   197 | 199 | C0450349 | NULL  | NULL |
     | 235586318 | cui-annotation |       31039 |   208 | 213 | C0439230 | NULL  | NULL |
     | 235586319 | cui-annotation |       31039 |   208 | 213 | C0439506 | NULL  | NULL |
     | 235586320 | cui-annotation |       31039 |   208 | 213 | C1561540 | NULL  | NULL |
     | 235586325 | cui-annotation |       31039 |   208 | 223 | C1135241 | NULL  | NULL |
     | 235586328 | cui-annotation |       31039 |   214 | 223 | C0032961 | NULL  | NULL |
     | 235586338 | cui-annotation |       31039 |   227 | 243 | C0007876 | NULL  | NULL |
     | 235586339 | cui-annotation |       31039 |   227 | 243 | C0029535 | NULL  | NULL |
     +-----------+----------------+-------------+-------+-----+----------+-------+------+
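The table layout above can be mimicked in a few lines to show how indexed retrieval by document and span might work. This is a sketch only: it uses Python's built-in `sqlite3` as a stand-in for MySQL, the column types and the index `ann_doc_span` are assumptions, and the meaning of the `other` and `up` columns is not stated on the slide. The three sample rows are copied from the query output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names follow the slide; types are guesses. "start" and "end" are
# quoted because END is an SQL keyword.
conn.execute('''
    CREATE TABLE annotations (
        id          INTEGER PRIMARY KEY,
        type        TEXT NOT NULL,
        document_id INTEGER NOT NULL,
        "start"     INTEGER NOT NULL,
        "end"       INTEGER NOT NULL,
        data        TEXT,
        other       TEXT,
        up          INTEGER
    )''')
# Assumed index to support lookups by document and character span.
conn.execute(
    'CREATE INDEX ann_doc_span ON annotations (document_id, "start", "end")')

rows = [
    (235586281, "cui-annotation", 31039, 118, 129, "C0027621", None, None),
    (235586325, "cui-annotation", 31039, 208, 223, "C1135241", None, None),
    (235586328, "cui-annotation", 31039, 214, 223, "C0032961", None, None),
]
conn.executemany("INSERT INTO annotations VALUES (?,?,?,?,?,?,?,?)", rows)


def overlapping(conn, document_id, start, end):
    """CUIs of annotations in a document that overlap [start, end)."""
    cur = conn.execute(
        'SELECT data FROM annotations '
        'WHERE document_id = ? AND "start" < ? AND "end" > ? '
        'ORDER BY "start", "end"',
        (document_id, end, start))
    return [r[0] for r in cur]


cuis = overlapping(conn, 31039, 208, 223)
```

Storing one row per (span, CUI) pair, as the slide's output suggests, means an ambiguous token such as "M" at 105-106 simply yields many rows over the same span.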

  14. Modeling Document Content
     • Statistical Natural Language Processing
     • Generate large numbers of features
       • Token-level features
         • words themselves, parts of speech, mapping to dictionary meanings, UMLS concepts (includes ICD-9, SNOMED, MeSH, ...)
         • n-tuples of features based on adjacent sets of token-level features
       • Syntactic features
         • Noun-phrase chunks, mapped as for tokens
         • Full parse (e.g., using link-parser grammar), yields n-tuples of syntactically linked tokens and phrases
       • Position in document, section, subsection
     • Generate, test, and then apply machine learning models that identify
       • names, locations, institutions, identifiers, addresses, phone numbers, ...
       • signs, symptoms, diagnoses, allergies, ...
       • tests, results, treatments, outcomes, medications, dosages, ...
     • Currently support wrappers for LIBSVM, WEKA learners
     • Goal: formalized representation of the meaningful content of the entire note
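As a small illustration of "n-tuples of features based on adjacent sets of token-level features", the Python sketch below slides a window over a sequence of per-token features. The function name `adjacent_ntuples` is hypothetical, and the words and part-of-speech tags are taken loosely from the parse example on slide 10; this is not LATE's feature generator.

```python
def adjacent_ntuples(features, n):
    """All runs of n adjacent items from a per-token feature sequence."""
    return [tuple(features[i:i + n]) for i in range(len(features) - n + 1)]


words = ["refractory", "preterm", "labor"]
tags = ["a", "a", "n"]   # adjective, adjective, noun

word_bigrams = adjacent_ntuples(words, 2)
tag_bigrams = adjacent_ntuples(tags, 2)
# Pair each word with its tag first to get n-tuples over combined features.
mixed_bigrams = adjacent_ntuples(list(zip(words, tags)), 2)
```

The same helper applies unchanged to any other token-level feature sequence, such as UMLS semantic types, which is one way large feature sets arise from a handful of annotators.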

  15. Steps for SHARPN
     • cTAKES performs many of the same tasks
     • Adopt UIMA/cTAKES framework
       • Learning curve
       • Reproduce some of current unique efforts
         • importers for specific data sets, annotators for complex tokens, use of features from link parser, ...
       • Suggest/develop incorporation of database-backed persistence
     • Alternative
       • Build data-level translation/interoperability; i.e.,
         • map UIMA type system to LATE type system
         • build import/export functions between XML representation of UIMA and database representation of LATE
         • incorporate LATE environment in UIMA environment
     • Will do small experiments to determine whether feasible
