An Environment for Data Analysis in Biomedical Domain: Information Extraction for Decision Support Systems. Pablo Matos, Leonardo Lombardi, Thiago Pardo, Cristina Ciferri, Marina Vieira, and Ricardo Ciferri presented by Thiago Pardo. USP NLP Group and UFSCar Database Group, São Carlos, BR.
Context and Motivation
An Environment for Data Analysis - IEA-AIE2010
• Many electronic documents report experiments, covering:
  • the treatment adopted
  • patients with some kind of disease
  • the number of patients enrolled in the treatment
  • symptoms and risk factors
  • positive and negative effects
• These reports appear in several transactions and journals
  • e.g., American Journal of Hematology, Blood, and Haematologica
Context and Motivation
• Nowadays, researchers and doctors are not able to process this huge number of documents
Context and Motivation
• These documents are in unstructured format, i.e., in plain textual form, especially in PDF
• It is necessary to transform these data from an unstructured to a structured format in order to submit them to an automatic knowledge discovery process
Goal
• Develop an environment called IEDSS-Bio for analyzing data in the biomedical domain, specifically Sickle Cell Anemia
• Support the expert in making decisions by:
  • extracting relevant information from biomedical documents
  • storing the information in a data warehouse (DW)
  • mining interesting knowledge from the DW
Contributions
• Theoretical:
  • Domain knowledge
  • Methodology of information extraction
• Practical:
  • Resources: collection of documents, dictionary, and rules
  • Tools: Converter, Information Extraction, Data Warehouse, and Data Mining systems
The Environment for Data Analysis
• How many patients had clinical improvement and were treated with the hydroxyurea drug?
• A significant number of patients under treatment with the hydroxyurea drug tend to have marrow depression.
Converter Module
Information Extraction Module
• Processed sections:
  • Abstract, Results, and Discussion (classes of positive and negative effects)
  • All sections (class of patient)
Sentence Classification
• Training: ML techniques learn the classes from
  • Negative Effect: several files of complication sentences
  • Positive Effect: several files of benefit sentences
  • Others: several files of other sentences
• Test: new text (TXT)
• Output: set of sentences classified into classes
Identification of Relevant Information: Dictionary
• Figure labels: Dictionary; Biomedical Database
Identification of Relevant Information: Rules
• Figure labels: Identification of Information Pipeline; Example of Sentences; Relevant Information
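The slides do not show the actual dictionary entries or rules, so the following is only a minimal sketch of how a dictionary lookup and one extraction rule could be combined. The dictionary contents, the regular expression, the function name `identify_information`, and the example sentence are all hypothetical and chosen for illustration.

```python
import re

# Hypothetical dictionary: surface term -> information category.
# The real IEDSS-Bio dictionary is not shown in the slides.
DICTIONARY = {
    "hydroxyurea": "treatment",
    "marrow depression": "negative effect",
}

# Hypothetical rule: a number followed by at most two alphabetic words
# and then "patients", e.g. "150 adult patients".
PATIENT_COUNT_RULE = re.compile(
    r"\b(\d+)\s+(?:[A-Za-z-]+\s+){0,2}?patients\b", re.IGNORECASE
)


def identify_information(sentence):
    """Return dictionary terms found in the sentence and a patient count, if any."""
    lowered = sentence.lower()
    terms = [(term, category) for term, category in DICTIONARY.items()
             if term in lowered]
    match = PATIENT_COUNT_RULE.search(sentence)
    patients = int(match.group(1)) if match else None
    return {"terms": terms, "patients": patients}


# Invented example sentence, not taken from the document collection.
example = ("A total of 150 adult patients were treated with hydroxyurea; "
           "marrow depression was reported.")
result = identify_information(example)
```

Simple substring matching is used here for brevity; a real dictionary module would more likely match on token boundaries and handle spelling variants.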
Experiments: Sentence Classification
• (1) How do human beings manually perform the sentence classification?
• (2) Is it feasible to automate the sentence classification task?
• (3) What kind of classification algorithm performs better in this task?
Manual classification by humans? (question 1)
• Annotation agreement on 50 sentences (Fleiss, 1971)
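The agreement figures themselves are not recoverable from this slide. As a rough sketch of the measure defined by Fleiss (1971), the function below computes kappa from a table in which `counts[i][j]` is the number of annotators who assigned sentence `i` to class `j`. The function name and the toy table are illustrative assumptions, not the annotation data from the experiment.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of per-item category counts.

    counts[i][j] = number of raters who assigned item i to category j.
    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
           for j in range(n_categories)]

    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)

    # Undefined (division by zero) when all ratings fall in one category.
    return (p_bar - p_e) / (1 - p_e)


# Toy table: 3 sentences, 3 annotators, 3 classes, full agreement.
toy = [[3, 0, 0], [0, 3, 0], [0, 0, 3]]
kappa = fleiss_kappa(toy)
```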
Is it feasible to automate this task? (question 2)
• Landis and Koch (1977)
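Landis and Koch (1977) proposed verbal labels for ranges of kappa, which is presumably how the agreement value was read on this slide. A minimal sketch of that scale follows; the function name is an assumption made for illustration.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the strength-of-agreement label of Landis and Koch (1977)."""
    if kappa < 0:
        return "poor"
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")
```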
What kind of classification algorithm performs better in this task? (question 3)
• Distribution of classes for each sample
Sentence Classification Process: training and testing phase (question 3)
• Bag-of-words model
• AVM configuration:
  • Minimum frequency = 2
  • Attributes: 1- to 3-grams
    • 1 if the n-gram occurs in the sentence (present)
    • 0 otherwise (absent)
• Not considered: stopword removal and stemming
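The configuration above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' tool: tokenization is a simple lowercase word match, "minimum frequency" is interpreted as the number of sentences containing the n-gram, and the three sentences are invented examples. No stopword removal or stemming is applied, in line with the slide.

```python
import re
from collections import Counter


def ngrams(sentence, n_min=1, n_max=3):
    """All word n-grams of length n_min..n_max (assumed tokenizer: lowercase \\w+)."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]


def build_vocabulary(sentences, min_freq=2):
    """Keep n-grams that occur in at least min_freq sentences (assumed reading)."""
    sentence_freq = Counter()
    for sentence in sentences:
        sentence_freq.update(set(ngrams(sentence)))
    return sorted(g for g, count in sentence_freq.items() if count >= min_freq)


def to_binary_row(sentence, vocabulary):
    """1 if the n-gram is present in the sentence, 0 otherwise."""
    present = set(ngrams(sentence))
    return [1 if g in present else 0 for g in vocabulary]


# Invented sentences, for illustration only.
sentences = [
    "Hydroxyurea reduced pain crises",
    "Hydroxyurea reduced hospital admissions",
    "Marrow depression was observed",
]
vocabulary = build_vocabulary(sentences)
matrix = [to_binary_row(s, vocabulary) for s in sentences]
```

Each row of `matrix` is one sentence and each column one n-gram, which is the attribute-value layout a classifier would be trained on.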
Evaluation (question 3)
• Partitioning method: 10-fold cross-validation
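As a sketch of the partitioning method, the code below splits the data into 10 folds, trains on 9, tests on the remaining one, and pools the correct predictions into an accuracy. The helper names, the fixed seed, and the majority-class stand-in classifier are assumptions for illustration; the slides do not say whether the folds were stratified, and this sketch does not stratify.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def cross_validate(rows, labels, train_fn, k=10, seed=0):
    """Accuracy pooled over all k test folds."""
    correct = 0
    for train, test in k_fold_indices(len(rows), k, seed):
        predict = train_fn([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(rows[i]) == labels[i] for i in test)
    return correct / len(rows)


def majority_class_trainer(train_rows, train_labels):
    """Stand-in classifier: always predicts the most frequent training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return lambda row: majority


# Toy data: 20 sentences, 15 labeled "other" and 5 labeled "positive".
rows = [[0]] * 20
labels = ["other"] * 15 + ["positive"] * 5
accuracy = cross_validate(rows, labels, majority_class_trainer)
```

Any real classifier can be plugged in through `train_fn`, as long as it returns a function that maps one row to a predicted label.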
Conclusions
• The proposed environment (Information Extraction and Decision Support System in Biomedical domain) aims at being a general environment for mining relevant information in the biomedical domain
• First experiments on sentence classification
  • one step of the whole process
  • very good results (95.9% accuracy) for papers about Sickle Cell Anemia (SCA)
• The task of sentence classification in the SCA domain is well defined and can be automated
Future Work
• Investigate the identification of treatment and symptom information in scientific papers
• Extract the relevant sentence pieces for populating our databases
  • using IE approaches, e.g., rule-based and dictionary-based
• Investigate the use of parallel processing to optimize the more time-consuming tasks
  • e.g., the application of data mining algorithms and analytical query processing
• Other biomedical areas may also benefit from our text mining approach
Questions?
References
ANTHONY, L.; LASHKIA, G. V. Mover: a machine learning tool to assist in the reading and writing of technical papers. IEEE Transactions on Professional Communication, v. 46, n. 3, p. 185-193, 2003.
FLEISS, J. L. Measuring nominal scale agreement among many raters. Psychological Bulletin, v. 76, n. 5, p. 378-382, 1971.
LANDIS, J. R.; KOCH, G. G. The measurement of observer agreement for categorical data. Biometrics, v. 33, n. 1, p. 159-174, 1977.
PINTO, A. C. S. et al. Technical Report "Sickle Cell Anemia". São Carlos: Department of Computer Science, Federal University of São Carlos, 2009. p. 16. Available at: <http://sca.dc.ufscar.br/download/files/report.sca.pdf>.