Machine learning techniques for detecting topics in research papers Amy Dai
The Goal Build a web application that allows users to easily browse and search papers
Project Overview • Part I – Data Processing • Convert PDF to text • Extract information from documents • Part II – Discovering topics • Index documents • Group documents by similarity • Learn underlying topics
Part I – Data Processing How do we extract information from PDF documents?
PDF to Text • Research papers are distributed as PDFs • A PDF is essentially an image: the computer sees colored lines and dots, not structured text • The conversion process loses some of the formatting
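A minimal sketch of this conversion step, assuming the Unix pdftotext tool listed later under Useful Tools is installed; the example file name is hypothetical:

```python
import subprocess
from pathlib import Path

def pdf_to_text(pdf_path: str) -> str:
    """Convert a PDF to plain text with the Unix pdftotext tool.

    Writes a .txt file next to the PDF and returns its contents.
    Layout information (columns, fonts, spacing) is largely lost.
    """
    txt_path = Path(pdf_path).with_suffix(".txt")
    # -layout tries to preserve the physical layout, which helps
    # the heuristic extraction rules applied in the next step.
    subprocess.run(["pdftotext", "-layout", pdf_path, str(txt_path)],
                   check=True)
    return txt_path.read_text(errors="replace")

# Hypothetical example:
# text = pdf_to_text("papers/spam_damn_spam.pdf")
```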
Getting what we need • Construct heuristic rules to extract each field • Title: the first line • Authors: between the title and the abstract • Abstract: the text preceded by “Abstract” • Keywords: the text preceded by “Keywords”
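One way such rules might look in Python using the re module from the Useful Tools slide; the exact patterns here are illustrative assumptions, not the script from the talk:

```python
import re

def extract_fields(text: str) -> dict:
    """Apply simple heuristic rules to the converted text.

    Per the slide: the title is the first non-empty line, the authors
    sit between the title and the abstract, the abstract follows the
    word "Abstract", and the keywords follow "Keywords".
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    title = lines[0] if lines else ""

    # Text preceded by "Abstract", up to a blank gap or "Keywords".
    abstract = re.search(r"Abstract[:.\s]+(.*?)(?:\n\s*\n|Keywords|$)",
                         text, re.DOTALL | re.IGNORECASE)
    # Text preceded by "Keywords", up to the next blank gap.
    keywords = re.search(r"Keywords[:.\s]+(.*?)(?:\n\s*\n|$)",
                         text, re.DOTALL | re.IGNORECASE)
    # Everything between the title line and "Abstract" is treated
    # as the author block (names, affiliations, addresses).
    authors = re.search(re.escape(title) + r"(.*?)Abstract",
                        text, re.DOTALL)

    return {
        "title": title,
        "authors": authors.group(1).strip() if authors else "",
        "abstract": abstract.group(1).strip() if abstract else "",
        "keywords": keywords.group(1).strip() if keywords else "",
    }
```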
Can we predict names? • Named Entity Tagger by the Cognitive Computation Group at the University of Illinois Urbana-Champaign • Example paper header:
Spam, Damn Spam, and Statistics
Using statistical analysis to locate spam web pages
Dennis Fetterly, Mark Manasse, Marc Najork
Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA
fetterly@microsoft.com, manasse@microsoft.com, najork@microsoft.com
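The talk uses the Illinois group's tagger; as a rough stand-in for a sketch, NLTK (already listed under Useful Tools) ships its own named-entity chunker, which can pull PERSON spans out of a header like the one above:

```python
import nltk

# One-time model downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
# nltk.download("maxent_ne_chunker"); nltk.download("words")

def find_person_names(text: str) -> list[str]:
    """Return token spans tagged PERSON by NLTK's NE chunker."""
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    tree = nltk.ne_chunk(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees()
            if subtree.label() == "PERSON"]

print(find_person_names(
    "Dennis Fetterly Mark Manasse Marc Najork Microsoft Research"))
```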
Accuracy • To measure how well the extraction script worked: • (# extracted correctly + # needing minor changes) / total # of documents • Example: • 30 extracted correctly • 10 needed minor changes • 60 total documents • (30 + 10) / 60 = 66.7%
Part II – Learning Topics Can we use machine learning to discover underlying topics?
Indexing Documents • Index each document • Remove common words, leaving better descriptors for clustering • Compare word frequencies against a reference corpus • Brown Corpus: A Standard Corpus of Present-Day Edited American English, from the Natural Language Toolkit • Common-word removal reduces the index from 19,100 to 12,400 words • Documents contain between 100 and 1,700 words after common-word removal
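A sketch of common-word removal against the Brown corpus via NLTK; treating the top 5,000 most frequent Brown words as "common" is an illustrative cutoff, not a figure from the talk:

```python
import nltk
from nltk.corpus import brown
# nltk.download("brown")

# Treat the most frequent Brown-corpus words as "common" English words.
freq = nltk.FreqDist(w.lower() for w in brown.words())
common = {w for w, _ in freq.most_common(5000)}

def index_terms(doc_words):
    """Keep only words that are not common in everyday English."""
    return [w for w in (w.lower() for w in doc_words) if w not in common]
```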
Effect on Index Size • Changes in document index size for “Defining quality in web search results”
Keeping What’s Important • Words in abstract of “Defining quality in web search results”
Documents as Vectors • Represent each document as a numerical vector by weighting its words with tf-idf • Vectors are normalized to unit length • The vector dimension equals the size of the corpus index • Vectors are mostly sparse
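A small hand-rolled sketch of this representation, using the standard tf-idf weighting with unit-length normalization (the talk does not spell out its exact weighting variant):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into unit-length tf-idf vectors.

    Each vector has one dimension per index term, so its dimension
    equals the size of the corpus index; most entries are zero.
    """
    n = len(docs)
    vocab = sorted({w for doc in docs for w in doc})
    # Document frequency: in how many documents each term appears.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = [tf[w] * math.log(n / df[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])  # unit length
    return vocab, vectors
```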
Clustering using Machine Learning • Use unsupervised machine learning algorithms to cluster the documents: • K-means • Group Average Agglomerative (GAA) • Both compare documents with cosine similarity
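NLTK, listed under Useful Tools, provides both algorithms. A sketch with toy vectors standing in for the tf-idf vectors above; k = 3 matches the three groups in the results that follow:

```python
import numpy
from nltk.cluster import KMeansClusterer, GAAClusterer, cosine_distance

# Toy stand-ins for the unit-length tf-idf document vectors.
data = [numpy.array(v) for v in ([1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
                                 [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])]

# K-means with cosine distance; k must be chosen up front.
kmeans = KMeansClusterer(3, distance=cosine_distance, repeats=10)
print(kmeans.cluster(data, assign_clusters=True))

# Group Average Agglomerative clustering, with the dendrogram
# cut so that 3 clusters remain.
gaa = GAAClusterer(num_clusters=3)
print(gaa.cluster(data, assign_clusters=True))
```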
Clustering Results
Documents:
A: SpamRank – Fully Automatic Link Spam Detection
B: An Approach to Confidence Based Page Ranking for User Oriented Web Search
C: Spam, Damn Spam, and Statistics
D: Web Spam, Propaganda and Trust
E: Detecting Spam Web Pages through Content Analysis
F: A Survey of Trust and Reputation Systems for Online Service Provision
K-Means: Group 1: A | Group 2: B, C, D, E | Group 3: F
GAA: Group 1: B | Group 2: A, C, D, E | Group 3: F
Challenges • K-Means: choosing the number of clusters K • Group Average Agglomerative: choosing the depth at which to cut the dendrogram
Labeling Clusters • Compare term frequencies within a cluster against the whole collection • A word that is frequent both in the cluster and across the collection isn't a good discriminative label • A good label is frequent in the cluster but infrequent in the rest of the collection
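One reasonable way to formalize this comparison; the ratio score below is an assumption for illustration, not necessarily the talk's exact measure:

```python
from collections import Counter

def label_cluster(cluster_docs, all_docs, top=3):
    """Score candidate labels: frequent inside the cluster,
    infrequent in the collection as a whole."""
    in_cluster = Counter(w for doc in cluster_docs for w in doc)
    collection = Counter(w for doc in all_docs for w in doc)
    cluster_total = sum(in_cluster.values())
    coll_total = sum(collection.values())
    # Ratio of in-cluster relative frequency to collection-wide
    # relative frequency; high values mark discriminative terms.
    score = {w: (c / cluster_total) / (collection[w] / coll_total)
             for w, c in in_cluster.items()}
    return sorted(score, key=score.get, reverse=True)[:top]
```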
Summary • Part I – Data Processing • PDF-to-text conversion isn't perfect, and its imperfections make information hard to extract • Documents don't follow one formatting standard, so heuristic rules are needed to extract info • Part II – Discovering topics • Indexes are large; to keep the important words, we need a good reference corpus to compare against • There are many clustering algorithms, and each has limitations • How do I choose the best label?
Ongoing work • Use bigrams, since keywords are often multi-word phrases (e.g., “Web search”, “adversarial information retrieval”, “web spam”) • Limit the number of topic labels by ranking them • Use an algorithm that clusters based on probability distributions, e.g. the logistic normal distribution
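For the bigram idea, NLTK's collocation finder is one option; the sample word sequence below is made up for illustration:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

words = ("detecting spam web pages through content analysis of "
         "adversarial web search results").split()

finder = BigramCollocationFinder.from_words(words)
# Rank candidate bigrams by pointwise mutual information.
print(finder.nbest(BigramAssocMeasures.pmi, 5))
```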
Useful Tools • pdftotext – Unix command for converting PDF to text • Python libraries • Unicode handling • re – regular expressions • NLTK – Natural Language Toolkit • Software and datasets for natural language processing • Used for the clustering algorithms and the reference corpus