National Center for Text Mining Launch Event

National Center for Text MiningLaunch Event Ray R. Larson University of California, Berkeley School of Information Management and Systems Thanks to Dr. Eric Yen, Prof. Michael Buckland Dr. Rob Sanderson and Prof. Marti Hearst for parts of this presentation

Overview • The Grid, Text Mining and Digital Libraries • Cheshire3: Bringing Search and Text Mining to Grid-Based Digital Libraries • Other Related Berkeley work: The BioText Project

Grid Architecture -- (Dr. Eric Yen, Academia Sinica, Taiwan.) .…. High energy physics Chemical Engineering Climate Astrophysics Cosmology Combustion Applications Application Toolkits Grid Services Grid Fabric ..… Remote Computing Remote Visualization Collaboratories Remote sensors Data Grid Portals Grid middleware Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services

Grid Architecture (ECAI/AS Grid Digital Library Workshop) Digital Libraries High energy physics Bio-Medical Humanities computing Astrophysics Chemical Engineering Climate Cosmology Combustion … Applications Application Toolkits Grid Services Grid Fabric … Text Mining Remote Computing Remote Visualization Search & Retrieval Metadata management Collaboratories Remote sensors Data Grid Portals Grid middleware Protocols, authentication, policy, instrumentation, Resource management, discovery, events, etc. Storage, networks, computers, display devices, etc. and their associated local services

Grid-Based Digital Libraries • Large-scale distributed storage requirements and technologies, • Distributed Information Retrieval issues and algorithms, • Organizing distributed digital collections, • Shared Metadata – standards and requirements • Managing distributed digital collections, • Security and access control, • Collection Replication and backup.

Cheshire3 Overview • XML Information Retrieval Engine • 3rd Generation of the UC Berkeley Cheshire system, as co-developed at the University of Liverpool. • Uses Python for flexibility and extensibility, but imports C/C++ based libraries for processing speed • Standards based: XML, XSLT, CQL, SRW/U, Z39.50, OAI to name a few. • Grid capable. Uses distributed configuration files, workflow definitions and PVM (currently) to scale from one machine to thousands of parallel nodes. • Free and Open Source Software. (GPL Licence) • http://www.cheshire3.org/ (under development!)

Cheshire3 Server Overview Cheshire III SERVER N E T W O R K C O N F I G & C O N T R O L SERVER CONTROL USER INFO API A P A C H E I N T E R F A C E STAFF UI Native calls Normalization I N D E X I N G C L U S T E R I N G S E A R C H S C A N A U T H E N T I C A T I O N T R R X E A S C N L O S T R F D O R M S P H R A O N T D O L C E O R L CONFIG Z39.50 SOAP OAI SRW Client Fetch ID User/ Clients Put ID OpenURL UDDI WSRP REMOTE SYSTEMS (any protocol) OGIS JDBC DB API LOCAL DB XML INDEXES CONFIG & Metadata INFO RESULT SETS ACCESS INFO

Cheshire Additions for NaCTeM

Cheshire3 Object Model Document Document Document Document ProtocolHandler Server UserStore Database Transformer Record Record Record Record Index Parser Extracter Document Document Document Document Normaliser PreParser DocumentGroup IndexStore RecordStore

Cheshire3 Data Objects • DocumentGroup: • A collection of Document objects (e.g. from a file, directory, or external search) • Document: • A single item, in any format (e.g. PDF file, raw XML string, relational table) • Record: • A single item, represented as parsed XML • Query: • A search query, in the form of CQL (an abstract query language for Information Retrieval) • ResultSet: • An ordered list of pointers to records • Index: • An ordered list of terms extracted from Records

Cheshire3 Process Objects • PreParser: • Given a Document, transform it into another Document (e.g. PDF to Text, Text to XML) • Parser: • Given a Document as a raw XML string, return a parsed Record for the item. • Transformer: • Given a Record, transform it into a Document (e.g. via XSLT, from XML to PDF, or XML to relational table) • Extracter: • Extract terms of a given type from an XML sub-tree (e.g. extract Dates, Keywords, Exact string value) • Normaliser: • Given the results of an extracter, transform the terms, maintaining the data structure (e.g. CaseNormaliser)

Cheshire3 Abstract Objects • Server: • A logical collection of databases • Database: • A logical collection of Documents, their Record representations and Indexes of extracted terms. • Workflow: • A 'meta-process' object that takes a workflow definition in XML and converts it into executable code.

Cheshire3 Grid Tests • Running on an 30 processor cluster in Liverpool using PVM (parallel virtual machine) • Using 17 processors with one “master” and 16 “slave” processes we were able to parse and index MARC data at about 10000 records per second • On a similar setup 610 Mb of TEI data can be parsed and indexed in seconds

Other Work at Berkeley • BioText Project • Directed by Prof. Marti Hearst of SIMS • Currently working on a number of areas in NLP analysis and use of Bio-Medical texts • Developing new and efficient methods for • Abbreviation recognition Slide Credit: Prof Marti Hearst

BioText: Main Goals Sophisticated Text Analysis Annotations in Database Improved Search Interface Slide Credit: Prof Marti Hearst

BioText: A Two-Sided Approach Blast Medline Mesh SwissProt Word Net GO Journal Full Text Empirical Computational Linguistics Algorithms Sophisticated Database Design & Algorithms Slide Credit: Prof Marti Hearst

Computational Language Goals • Recognizing and annotating entities within textual documents • Identifying semantic relations among entities • To (eventually) be used in tandem with semi-automated reasoning systems. Slide Credit: Prof Marti Hearst

Computational Linguistics Goals • Mark up text with semantic relations • <protein1><inhibits><protein2> • <protein><binds-with><receptor> • <chemical><increases><level-of><chemical> Slide Credit: Prof Marti Hearst

Database Research Issues • Efficient querying and updating • Semi-structured information • Fuzzy synonyms • Collection subsets • Efficiently and effectively combining • Relational databases • Text databases • Layers of processing • Hierarchical Ontologies Slide Credit: Prof Marti Hearst

Abbreviation Recognition • Example of BioText work (Schwartz and Hearst 03)… • Fast, simple algorithm for recognizing abbreviation definitions. • Simpler and faster than the rest • Higher precision and recall • Idea: Work backwards from the end • Examples: • In eukaryotes, the key to transcriptional regulation of the Heat Shock Response is the Heat Shock Transcription Factor (HSF). • Gcn5-related N-acetyltransferase (GNAT) • Idea: use redundancy across abstracts to figure out abbreviation meaning even when definition is not present. Slide Credit: Prof Marti Hearst

National Center for Text Mining Launch Event

National Center for Text Mining Launch Event

Presentation Transcript

Text Mining

System Center 2012 SP1 Launch Event

NLP for Text Mining

Launch event

L.A.D. Launch Event

Text mining- text analytics- data mining

Text Mining

Text Mining

Text Mining

The National Centre for Text Mining

Text Mining

National Launch

Launch event

Launch event

Text Mining