Machine learning techniques for detecting topics in research papers Amy Dai
The Goal Build a web application that allows users to easily browse and search papers
Project Overview • Part I – Data Processing • Convert PDF to text • Extract information from documents • Part II – Discovering topics • Index documents • Group documents by similarity • Learn underlying topics
Part I – Data Processing How do we extract information from PDF documents?
PDF to Text • Research papers are distributed as PDFs • A PDF is essentially an image: the computer sees colored lines and dots, not structured text • The conversion process loses some of the formatting
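A minimal sketch of this conversion step, assuming the Unix pdftotext tool listed later under Useful Tools is installed; the example file name is hypothetical:

```python
import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with the Unix pdftotext tool.

    Writes a .txt file next to the PDF and returns its contents.
    Layout information (columns, fonts, spacing) is largely lost.
    """
    txt_path = Path(pdf_path).with_suffix(".txt")
    # -layout tries to preserve the physical layout, which helps
    # the heuristic extraction rules applied in the next step.
    subprocess.run(["pdftotext", "-layout", pdf_path, str(txt_path)],
                   check=True)
    return txt_path.read_text(errors="replace")

# Hypothetical example:
# text = pdf_to_text("papers/spam_damn_spam.pdf")
```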
Getting what we need • Construct heuristic rules to extract each field • Title: the first line • Authors: between the title and the abstract • Abstract: the text preceded by “Abstract” • Keywords: the text preceded by “Keywords”
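One way such rules might look in Python using the re module from the Useful Tools slide; the exact patterns here are illustrative assumptions, not the script from the talk:

```python
import re

def extract_fields(text: str) -> dict:
    """Apply simple heuristic rules to the converted text.

    Per the slide: the title is the first non-empty line, the authors
    sit between the title and the abstract, the abstract follows the
    word "Abstract", and the keywords follow "Keywords".
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""

    # Text preceded by "Abstract", up to a blank gap or "Keywords".
    abstract = re.search(r"Abstract[:.\s]+(.*?)(?:\n\s*\n|Keywords|$)",
                         text, re.DOTALL | re.IGNORECASE)
    # Text preceded by "Keywords", up to the next blank gap.
    keywords = re.search(r"Keywords[:.\s]+(.*?)(?:\n\s*\n|$)",
                         text, re.DOTALL | re.IGNORECASE)
    # Everything between the title line and "Abstract" is treated
    # as the author block (names, affiliations, addresses).
    authors = re.search(re.escape(title) + r"(.*?)Abstract",
                        text, re.DOTALL)

    return {
        "title": title,
        "authors": authors.group(1).strip() if authors else "",
        "abstract": abstract.group(1).strip() if abstract else "",
        "keywords": keywords.group(1).strip() if keywords else "",
    }
```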
Can we predict names? • Named Entity Tagger by the Cognitive Computation Group at the University of Illinois Urbana-Champaign • Example paper header:
Spam, Damn Spam, and Statistics
Using statistical analysis to locate spam web pages
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA
fetterly@microsoft.com, manasse@microsoft.com, najork@microsoft.com
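The talk uses the Illinois group's tagger; as a rough stand-in for a sketch, NLTK (already listed under Useful Tools) ships its own named-entity chunker, which can pull PERSON spans out of a header like the one above:

```python
import nltk

# One-time model downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

def find_person_names(text: str) -> list[str]:
    """Return token spans tagged PERSON by NLTK's NE chunker."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "PERSON"]

print(find_person_names(
    "Dennis Fetterly Mark Manasse Marc Najork Microsoft Research"))
```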
Accuracy • To measure how well the extraction script worked: • (# extracted correctly + # needing minor changes) / total # of documents • Example: • 30 extracted correctly • 10 needed minor changes • 60 total documents • (30 + 10) / 60 = 66.7%
Part II – Learning Topics Can we use machine learning to discover underlying topics?
Indexing Documents • Index each document • Remove common words, leaving better descriptors for clustering • Compare word frequencies against a reference corpus • Brown Corpus: A Standard Corpus of Present-Day Edited American English, from the Natural Language Toolkit • Common-word removal reduces the index from 19,100 to 12,400 words • Documents contain between 100 and 1,700 words after common-word removal
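A sketch of common-word removal against the Brown corpus via NLTK; treating the top 5,000 most frequent Brown words as "common" is an illustrative cutoff, not a figure from the talk:

```python
import nltk
from nltk.corpus import brown
# nltk.download("brown")

# Treat the most frequent Brown-corpus words as "common" English words.
freq = nltk.FreqDist(w.lower() for w in brown.words())
common = {w for w, _ in freq.most_common(5000)}

def index_terms(doc_words):
    """Keep only words that are not common in everyday English."""
    return [w for w in (w.lower() for w in doc_words) if w not in common]
```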
Effect on Index Size • Changes in document index size for “Defining quality in web search results”
Keeping What’s Important • Words in abstract of “Defining quality in web search results”
Documents as Vectors • Represent each document as a numerical vector by weighting its words with tf-idf • Vectors are normalized to unit length • The vector dimension equals the size of the corpus index • Vectors are mostly sparse
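A small hand-rolled sketch of this representation, using the standard tf-idf weighting with unit-length normalization (the talk does not spell out its exact weighting variant):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length tf-idf vectors.

    Each vector has one dimension per index term, so its dimension
    equals the size of the corpus index; most entries are zero.
    """
    n = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])  # unit length
    return vocab, vectors
```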
Clustering using Machine Learning • Use unsupervised machine learning algorithms to cluster the documents: • K-means • Group Average Agglomerative (GAA) • Both compare documents with cosine similarity
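NLTK, listed under Useful Tools, provides both algorithms. A sketch with toy vectors standing in for the tf-idf vectors above; k = 3 matches the three groups in the results that follow:

```python
import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, cosine_distance

# Toy stand-ins for the unit-length tf-idf document vectors.
data = [numpy.array(v) for v in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                                 [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

# K-means with cosine distance; k must be chosen up front.
kmeans = KMeansClusterer(3, distance=cosine_distance, repeats=10)
print(kmeans.cluster(data, assign_clusters=True))

# Group Average Agglomerative clustering, with the dendrogram
# cut so that 3 clusters remain.
gaa = GAAClusterer(num_clusters=3)
print(gaa.cluster(data, assign_clusters=True))
```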
Clustering Results
Documents:
A: SpamRank – Fully Automatic Link Spam Detection
B: An Approach to Confidence Based Page Ranking for User Oriented Web Search
C: Spam, Damn Spam, and Statistics
D: Web Spam, Propaganda and Trust
E: Detecting Spam Web Pages through Content Analysis
F: A Survey of Trust and Reputation Systems for Online Service Provision
K-Means: Group 1: A | Group 2: B, C, D, E | Group 3: F
GAA: Group 1: B | Group 2: A, C, D, E | Group 3: F
Challenges • K-Means: choosing the number of clusters K • Group Average Agglomerative: choosing the depth at which to cut the dendrogram
Labeling Clusters • Compare term frequencies within a cluster against the whole collection • A word that is frequent both in the cluster and across the collection isn't a good discriminative label • A good label is frequent in the cluster but infrequent in the rest of the collection
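One reasonable way to formalize this comparison; the ratio score below is an assumption for illustration, not necessarily the talk's exact measure:

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top=3):
    """Score candidate labels: frequent inside the cluster,
    infrequent in the collection as a whole."""
    in_cluster = Counter(w for doc in cluster_docs for w in doc)
    collection = Counter(w for doc in all_docs for w in doc)
    cluster_total = sum(in_cluster.values())
    coll_total = sum(collection.values())
    # Ratio of in-cluster relative frequency to collection-wide
    # relative frequency; high values mark discriminative terms.
    score = {w: (c / cluster_total) / (collection[w] / coll_total)
             for w, c in in_cluster.items()}
    return sorted(score, key=score.get, reverse=True)[:top]
```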
Summary • Part I – Data Processing • PDF-to-text conversion isn't perfect, and its imperfections make information hard to extract • Documents don't follow one formatting standard, so heuristic rules are needed to extract info • Part II – Discovering topics • Indexes are large; to keep the important words, we need a good reference corpus to compare against • There are many clustering algorithms, and each has limitations • How do I choose the best label?
Ongoing work • Use bigrams, since keywords are often multi-word phrases (e.g., “Web search”, “adversarial information retrieval”, “web spam”) • Limit the number of topic labels by ranking them • Use an algorithm that clusters based on probability distributions, e.g. the logistic normal distribution
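For the bigram idea, NLTK's collocation finder is one option; the sample word sequence below is made up for illustration:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("detecting spam web pages through content analysis of "
         "adversarial web search results").split()

finder = BigramCollocationFinder.from_words(words)
# Rank candidate bigrams by pointwise mutual information.
print(finder.nbest(BigramAssocMeasures.pmi, 5))
```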
Useful Tools • pdftotext – Unix command for converting PDF to text • Python libraries • Unicode handling • re – regular expressions • NLTK – Natural Language Toolkit • Software and datasets for natural language processing • Used for the clustering algorithms and the reference corpus