Terminology mining at OCLC

Terminology mining at OCLC • Carol Jean Godby, Research Scientist, OCLC Online Computer Library Center, Inc. • Ohio Academic Library Association • May 19, 2006

Outline of this talk • The need for terminology • Sources of terminology • Why the terminology problem is hard • Three outcomes from terminology mining projects • Conclusions and recommendations

What terminology looks like: have, havei, havel, haven, havens, havera, haverty, havey, havice, havill, havilland, health care, health care coverage, health insurance, housing, housing policy, …, world trade, world trade accord, world trade agreement, world trade center, world trade center bombing

Two sources of terminology • From a human-managed resource • Controlled vocabulary • Subject classification scheme • Dictionary, gazetteer, encyclopedia, subject terminology list • Algorithmically extracted from text

Human-managed terminology • Strengths • It represents important and persistent concepts. • It is derived from linguistic or subject-matter expertise. • The form has been standardized. • It may provide a link to an ontology. • It promises interoperability between online resources and traditionally published materials. • Weaknesses • Literary warrant is based on traditionally published materials. • Human effort is required to keep it current. • The sources are not usually freely available and must be modified for use in automated systems.

Automatically extracted terminology • Strengths • It is timely. • The style is closer to an ordinary user’s vocabulary. • Coverage is not restricted to traditionally published material. • Weaknesses • The data is noisy and difficult to organize. • It is ephemeral. • The problem is ill-defined.

Mining terminology in four steps • Select a text • Assign part-of-speech tags • Apply the part-of-speech filter • Apply the post-extraction filter

Step 1: Select a text Andrew Newell Wyeth (born July 12, 1917) is an American realist painter, one of the best-known of the 20th century. He is sometimes referred to as the "Painter of the People" due to his popularity with the American public. Wyeth's favorite subject is the land and inhabitants around his hometown of Chadds Ford, Pennsylvania, and those near his summer home in Cushing, Maine.

Step 2: Assign Part-of-Speech Tags Andrew/noun Newell/noun Wyeth/noun (born/verb July/noun 12/noun, 1917/noun) is/verb an/article American/adjective realist/adjective painter/noun, one/pronoun of/preposition the/article best-known/adjective of/preposition the/article 20th/adjective century/noun. …..
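The word/tag output format above can be reproduced with a toy lookup tagger. This is only a minimal sketch: the `LEXICON` table is a hypothetical hand-built dictionary copied from the tags on the slide, and the `tokenize`, `tag`, and `render` helpers are illustrative names. Production taggers are statistical and choose tags from context rather than from a fixed table.

```python
import re

# Hypothetical hand-built lexicon, copied from the tags shown on the slide.
LEXICON = {
    "andrew": "noun", "newell": "noun", "wyeth": "noun",
    "is": "verb", "an": "article", "american": "adjective",
    "realist": "adjective", "painter": "noun",
    "one": "pronoun", "of": "preposition", "the": "article",
    "best-known": "adjective", "20th": "adjective", "century": "noun",
}


def tokenize(text):
    """Split text into word tokens, keeping internal hyphens, dropping punctuation."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)


def tag(text, default="noun"):
    """Return (token, tag) pairs; words missing from the lexicon get `default`."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokenize(text)]


def render(tagged):
    """Format tagged tokens in the word/tag style used on the slide."""
    return " ".join(f"{tok}/{pos}" for tok, pos in tagged)


sentence = ("Andrew Newell Wyeth is an American realist painter, "
            "one of the best-known of the 20th century.")
print(render(tag(sentence)))
```

Falling back to "noun" for unknown words is a crude default chosen for this sketch; it happens to suit proper names, which are the most likely words to be missing from a small lexicon.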

Step 3: Apply part-of-speech filters • Andrew Newell Wyeth (noun-noun-noun) • July 12 1917 (noun-noun-noun) • (an) American realist painter (adjective-noun-noun) • (one of the) best-known of the 20th century (adjective-adjective-preposition-article-adjective-noun) • Painter of the People (noun-preposition-article-noun) • Popularity with the American public (noun-preposition-article-adjective-noun) • Wyeth’s favorite subject (adjective-adjective-noun) • Land (noun) • Inhabitants around his hometown (noun-preposition-pronoun-noun) • Chadds Ford Pennsylvania (noun-noun-noun) • (his) Summer home in Cushing Maine (noun-noun-preposition-noun-noun)
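A part-of-speech filter of this kind can be sketched as a pass over the tagged tokens. The version below is an assumption-laden simplification, not the filter used in the talk: it keeps only maximal runs of adjectives and nouns that end in a noun, so it yields the short phrases and never the longer patterns containing prepositions and articles. The function name `noun_phrases` and the hand-tagged input are illustrative.

```python
def noun_phrases(tagged):
    """Collect maximal runs of adjectives/nouns, trimmed so each ends in a noun."""
    phrases, run = [], []
    # The sentinel pair forces the final run to be flushed.
    for token, pos in tagged + [("", "END")]:
        if pos in ("adjective", "noun"):
            run.append((token, pos))
            continue
        # Drop trailing adjectives: the phrase must end in its head noun.
        while run and run[-1][1] != "noun":
            run.pop()
        if run:
            phrases.append(" ".join(tok for tok, _ in run))
        run = []
    return phrases


# Hand-tagged input, following the tags shown in Step 2.
tagged = [
    ("Andrew", "noun"), ("Newell", "noun"), ("Wyeth", "noun"),
    ("is", "verb"), ("an", "article"), ("American", "adjective"),
    ("realist", "adjective"), ("painter", "noun"), ("one", "pronoun"),
    ("of", "preposition"), ("the", "article"), ("best-known", "adjective"),
    ("of", "preposition"), ("the", "article"), ("20th", "adjective"),
    ("century", "noun"),
]
print(noun_phrases(tagged))
# ['Andrew Newell Wyeth', 'American realist painter', '20th century']
```

Admitting prepositions and articles inside a run would recover the longer candidates such as "best-known of the 20th century", at the cost of many more one-off phrases.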

Issues in the part-of-speech filter • The implementations are mature, accessible in the open source community, and reasonably accurate. • But: • The filter must be designed, usually to select noun phrases. • The noun phrases must be normalized (e.g. trim leading articles and pronouns; eliminate punctuation). • The noun phrases may be long or short. Which do we choose? • "Land and inhabitants around his hometown" or "Land", "inhabitants", "hometown"? • "Best-known of the 20th century" or "20th century"?
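The normalization step and the long-versus-short choice can be illustrated with two small helpers. This is a sketch under stated assumptions: `LEADING_STOPWORDS` and `CONNECTORS` are hypothetical word lists chosen to cover the sample text, and a real system would rely on the tagger's article, pronoun, and preposition tags instead of fixed lists.

```python
import string

# Hypothetical word lists, just large enough for the sample text.
LEADING_STOPWORDS = {"a", "an", "the", "his", "her", "its", "their", "one"}
CONNECTORS = {"and", "of", "in", "around", "with"}


def normalize(phrase):
    """Strip punctuation at token edges and trim leading articles and pronouns."""
    tokens = [tok.strip(string.punctuation) for tok in phrase.split()]
    tokens = [tok for tok in tokens if tok]
    while tokens and tokens[0].lower() in LEADING_STOPWORDS:
        tokens.pop(0)
    return " ".join(tokens)


def short_phrases(phrase):
    """Break a long phrase at connectors and normalize each piece."""
    pieces, current = [], []
    for tok in normalize(phrase).split():
        if tok.lower() in CONNECTORS:
            pieces.append(" ".join(current))
            current = []
        else:
            current.append(tok)
    pieces.append(" ".join(current))
    return [p for p in (normalize(piece) for piece in pieces) if p]


print(normalize("(an) American realist painter,"))
# American realist painter
print(short_phrases("Land and inhabitants around his hometown"))
# ['Land', 'inhabitants', 'hometown']
```

Neither function decides which form is the better term; they only make both candidates available to a later filter.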

So… • By extracting noun phrases from text, the designer is already implementing a simple theory of terminology. • But: • The result is an overwhelming number of phrases that occur only once. • The short-phrase vs. long-phrase problem shows that terminology has no obvious formal boundaries. • In other words, we can’t identify terminology by part of speech alone.

Step 4: A post-extraction filter • The goal is to select terminology of interest, dramatically reducing the output from the part-of-speech filter. • Some criteria for a “good” term: • It is accurately identified and represented. • It is easily obtained. • It represents a persistent concept. • It reveals major or minor themes in the source document. • Possible outcomes: • Named entities • Lexicalized noun phrases • Statistically improbable phrases

Named entities • Goal • Identify and categorize the proper names in a text. • Results • Andrew Newell Wyeth (person), Chadds Ford (place), Pennsylvania (place), Cushing (place), Maine (place), July 12, 1917 (date). • “Painter of the People” – would be recognized, but special handling would be required to categorize it.
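The identify-and-categorize goal can be sketched with a lookup list and one date pattern. Everything here is illustrative: the `GAZETTEER` entries are hypothetical and cover only the sample text, and the `DATE` pattern handles only the "Month D, YYYY" style. Real named entity recognizers combine much larger lists with contextual rules or statistical models.

```python
import re

# Hypothetical gazetteer entries, just enough to cover the sample text.
GAZETTEER = {
    "Andrew Newell Wyeth": "person",
    "Chadds Ford": "place",
    "Pennsylvania": "place",
    "Cushing": "place",
    "Maine": "place",
}

# Dates written as "Month D, YYYY" only.
DATE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December) \d{1,2}, \d{4}\b"
)


def named_entities(text):
    """Return (surface string, category) pairs in order of appearance."""
    found = [(m.start(), m.group(), "date") for m in DATE.finditer(text)]
    for name, category in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append((m.start(), name, category))
    return [(surface, category) for _, surface, category in sorted(found)]


sample = ("Andrew Newell Wyeth (born July 12, 1917) paints the land around "
          "Chadds Ford, Pennsylvania, and Cushing, Maine.")
for surface, category in named_entities(sample):
    print(f"{surface} ({category})")
```

A phrase such as "Painter of the People" is absent from the list, so this sketch would miss it, which mirrors the special handling noted above.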

Named entities: scorecard • Accurate? • 95% accuracy is reported on systems that recognize personal, corporate, and geographic names. • Easily obtained? -- high • The named entity problem is conceptually simple and well-defined. • Most texts contain named entities. • Textual clues for names are machine-processable. • Software is mature and in the public domain. • Represent persistent concepts? -- high • Something is assigned a name because it is persistent. • Reveal major or minor document themes? -- medium to low • Documents that contain named entities may be about something else.

Lexicalized noun phrases • The goal is to identify common noun phrases that ‘name’ a persistent concept. • Language can be used either to describe or to name. • Descriptions • Are constructed from the rules of syntax for immediate use. • The forms are variable. • The meaning of a phrase is the sum of its words. • Names • Are stored in a mental dictionary and then retrieved as needed. • The forms are frozen. • The meaning of a phrase may not be easily inferred. So: • A lexicalized noun phrase has acquired word-like characteristics. • It can be precisely defined. • It can acquire other lexical meaning – connotations, “branding”, etc. • It is a candidate for inclusion in a dictionary, thesaurus, or term list.

Textual cues for lexicalized noun phrases • Weak positive contexts: • Lists: bread, milk, laundry detergent, kool-aid, tp, pasta sauce, olive oil • Compound noun modifiers: American realist painter, stock market quote • Strong positive contexts: study of, information about, professor of, department of, journal of, so-called, bibliography on • Examples: metadata applications, data processing, automatic classification, internet resources, digital watermarking, font readability, digital image processing • Strong negative contexts: very, -ly adverbs (extremely) • Examples: different things, few messages, good point, interesting example, appealing idea, small extension, terse document, simple kind
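One way to use the strong cues is to count how often a candidate phrase directly follows a positive or a negative context. The sketch below takes its cue words from the slide, but the scoring is an assumption made for illustration: each occurrence counts +1 or -1, and any word ending in "ly" is treated as an adverb, which is a crude approximation. The name `cue_score` is hypothetical.

```python
import re

# Cue words from the slide; the +1/-1 weighting is an arbitrary assumption.
POSITIVE_CUES = [
    "study of", "information about", "professor of", "department of",
    "journal of", "so-called", "bibliography on",
]
# "very" or any word ending in -ly (a rough stand-in for -ly adverbs).
NEGATIVE_PATTERN = r"(?:very|\w+ly)"


def cue_score(phrase, text):
    """Positive-cue occurrences minus negative-cue occurrences before `phrase`."""
    lowered = text.lower()
    target = re.escape(phrase.lower()) + r"\b"
    score = 0
    for cue in POSITIVE_CUES:
        score += len(re.findall(r"\b" + re.escape(cue) + r"\s+" + target, lowered))
    score -= len(re.findall(r"\b" + NEGATIVE_PATTERN + r"\s+" + target, lowered))
    return score


text = ("She is a professor of digital image processing. "
        "The journal of digital image processing is new. "
        "That is a very good point and an extremely good point.")
print(cue_score("digital image processing", text))  # 2
print(cue_score("good point", text))                # -2
```

A phrase that never appears next to a cue scores zero, so a score like this can only rank candidates that occur in cue-rich text, which is consistent with the point that the cues depend on subject domain and style of discourse.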

A lexicalized noun phrase: Recurrent erosion

Not a lexicalized noun phrase: Recurrent problem

Lexicalized noun phrases: scorecard • Accurate? – medium • 70-80% agreement with human judges • But there is a natural upper limit. • A text is an imperfect reflection of linguistic knowledge. • Terminology is in a constant state of flux. • Easily obtained? – medium-to-low • The easiest cues are the least accurate; others are dependent on certain subject domains and styles of discourse. • Software is not in the public domain. • Represent persistent concepts? – medium to high • High agreement with dictionaries and subject schemes • Reveal major or minor document themes? -- low • Documents that contain lexicalized noun phrases may be about something else.

“Statistically improbable phrases” • A list of noun phrases automatically extracted from the full text of a book on Amazon.com. • The phrases are common in the book but uncommon in Amazon’s book corpus. • The phrases represent a “lexical signature” of the book.
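The idea of "common in the book but uncommon in the corpus" can be approximated by ranking phrases on the ratio of their relative frequencies. This is a sketch only: Amazon's actual computation is not described here, and the add-one smoothing, the function name `improbable_phrases`, and the toy phrase counts are all assumptions made for illustration.

```python
from collections import Counter


def improbable_phrases(book_phrases, corpus_phrases, top_n=3):
    """Rank book phrases by relative frequency in the book vs. the corpus."""
    book = Counter(book_phrases)
    corpus = Counter(corpus_phrases)
    book_total = sum(book.values())
    corpus_total = sum(corpus.values())
    vocabulary = len(set(book) | set(corpus))

    def ratio(phrase):
        p_book = book[phrase] / book_total
        # Add-one smoothing so phrases unseen in the corpus do not divide by zero.
        p_corpus = (corpus[phrase] + 1) / (corpus_total + vocabulary)
        return p_book / p_corpus

    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(book, key=lambda p: (-ratio(p), p))[:top_n]


# Toy counts, invented for the example.
book_phrases = (["informational economy"] * 5 + ["network society"] * 4
                + ["new technology"] * 3)
corpus_phrases = (["new technology"] * 50 + ["network society"] * 2
                  + ["world trade"] * 40)
print(improbable_phrases(book_phrases, corpus_phrases, top_n=2))
# ['informational economy', 'network society']
```

"new technology" is frequent in the toy book but even more frequent in the toy corpus, so it ranks last, which is the behavior a lexical signature needs.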

Statistically improbable phrases: scorecard • Accurate? • Hard data unavailable, but output appears to be accurately parsed. • Easily obtained? -- high • The post-extraction filter is based on common information-retrieval metrics available in the public domain. • Represent persistent concepts? -- medium to low • Likely candidates: networking form, informational societies • Unlikely candidates: new spatial logic, instant wars, new technological paradigm • Reveal major or minor document themes? -- medium • Lorcan Dempsey on the SIPs for The Rise of the Network Society: “This is interesting. For example, clicking on informational economy gives a list of books that may of interest that would have been difficult to find otherwise. Of course, informational is a distinctive usage of Castell's. What did not show up was space of flows and space of places, phrases that are central to some of Castell's arguments in this book. The Books on related topics is a good list (this is a list based on number of shared SIPs), but it does not show the other two books in the trilogy of which this is the first part.”

Some observations and recommendations • The tests for good terminology show that the terminology mining problem is complex and needs to be decomposed. • The hierarchical structure of the problem suggests a development path. • The common early stages of processing (text preparation, part-of-speech tagging, noun phrase parsing) • Algorithms are well-understood and available in the public domain. • The post-extraction filters • Named entities • The problem is conceptually simple. • Software is mature and available in the public domain. • Lexicalized noun phrases • A comprehensive solution is expensive and error-prone, but there is low-hanging fruit. • Focus on domain-specific terminology. • In subjects with stable terminology, concentrate on automatic assignment of controlled vocabulary. • Document “aboutness” • Still an immature subject • Model with information retrieval metrics.

For more information • “ANNIE - Open Source Information Extraction from The University of Sheffield.” http://www.aktors.org/technologies/annie/ • Dempsey, Lorcan, 2005. “Amazon: making data work.” http://orweblog.oclc.org/archives/000658.html • “Natural Language Toolkit.” http://nltk.sourceforge.net/ • Godby, Carol Jean, 2002. A computational study of lexicalized noun phrases in English. PhD dissertation, Ohio State University Department of Linguistics. http://www.ohiolink.edu/etd/view.cgi?osu1017343683
