Information Retrieval – and projects we have done. Group Members: Aditya Tiwari (08005036), Harshit Mittal (08005032), Rohit Kumar Saraf (08005040), Vinay Surana (08005031). Guided by Prof. Pushpak Bhattacharyya.
Motivation • The Web, documents, and encyclopedias all hold a tremendous amount of data and information. As published, that information serves only the intent of the creator or collector of the data. • However, the same data can have other uses as well. The need is to mine the right information from the data and use it appropriately.
Applications • Web search (Google, Yahoo) • Question-answering (QA) systems such as Watson (developed by IBM) • Spam filtering • Automatic summarization • Cross-lingual retrieval en.wikipedia.org/wiki/Information_retrieval_applications
Information Retrieval • IR is the area of study concerned with searching for documents and for metadata about documents, as well as searching relational databases and the WWW. • The data objects that are retrieved can be images, documents, videos, mind maps, or music. en.wikipedia.org/wiki/Information_Retrieval
Wiki Mind Mapping • Harshit Mittal (IIT-B) h.mittal83@gmail.com • Aditya Tiwari (IIT-B) adi.tiwari27@gmail.com • Akhil Bhiwal (VIT University) bhiwalakhil@gmail.com
Project Idea • Represent textual information in a graphical form that is easier to understand and more intuitive to read. The visual representation should also summarize the text.
Research Goal • Use phrases to represent semantic information. • Represent the information in a given text hierarchically.
Mind maps • A mind map is a diagram used to represent words, ideas, tasks, or other items linked to and arranged around a central key word or idea. • Example Mind map in the next slide. http://en.wikipedia.org/wiki/Mind_maps
Mind map http://www.spicynodes.org/blog/2010/05/21/stuff-we-like-climate-change-mind-maps/
What’s the difficult part? • We can’t put the information from an article into a mind map as is. That would make the map incoherent and cluttered. • Phrase extraction • General rules of grammar don’t apply here.
Possible Solution • Develop new linguistic rules for representing text in visual form. • Use existing summarization tools to generate a summary and try to represent that summary in a mind map.
How we did it • Pulling the article out of the Wikipedia page section by section. • Parsing each section sentence by sentence using the Stanford parser. • Extracting “relevant” phrases using Tregex (another Stanford tool). • Putting these phrases into a mind map, section by section. http://nlp.stanford.edu/software/tregex.shtml
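The last step above, placing phrases into a mind map section by section, can be sketched as a small tree. This is only a minimal sketch, not the project's code: the `MindMapNode` class, the `build_mind_map` and `render` helpers, the section names, and the placeholder phrases are all hypothetical, and the extraction step (Stanford parser plus Tregex) is assumed to have already produced the phrases.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MindMapNode:
    """One node of the mind map: a label plus its child nodes."""
    label: str
    children: List["MindMapNode"] = field(default_factory=list)


def build_mind_map(title: str, phrases_by_section: Dict[str, List[str]]) -> MindMapNode:
    """Build a three-level tree: article title -> section -> phrase."""
    root = MindMapNode(title)
    for section, phrases in phrases_by_section.items():
        section_node = MindMapNode(section)
        section_node.children = [MindMapNode(p) for p in phrases]
        root.children.append(section_node)
    return root


def render(node: MindMapNode, depth: int = 0) -> List[str]:
    """Return the tree as indented outline lines, one per node."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical input standing in for the output of the extraction step.
sample = {
    "Etymology": ["name India is derived from Indus"],
    "History": ["placeholder phrase A", "placeholder phrase B"],
}
mind_map = build_mind_map("India", sample)
print("\n".join(render(mind_map)))
```

A fixed title, section, phrase layout like this gives exactly three levels of hierarchy; a deeper map would need phrases that can themselves carry children.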
Extraction of relevant information • Identifying the important subtrees in the parse tree of a sentence. • This was done using a few heuristics, such as: • Presence of a superlative adjective in a noun phrase http://nlp.stanford.edu/software/tregex.shtml
Extraction of relevant information • Presence of a cardinal number in a noun phrase http://nlp.stanford.edu/software/tregex.shtml
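The two noun-phrase heuristics above can be illustrated without the Stanford tools. The following is a minimal sketch in plain Python, assuming the parse is already available as a Penn-Treebank-style bracketed string. It looks for an NP with a JJS (superlative adjective) or CD (cardinal number) child, which is roughly what Tregex patterns such as `NP < JJS` and `NP < CD` would match. The patterns actually used in the project are not given in the slides, and the helper names and the hand-written parse below are hypothetical.

```python
import re
from typing import List, Union

Tree = Union[str, list]  # a leaf word, or [label, child, child, ...]


def parse_bracketed(text: str) -> Tree:
    """Parse a Penn-Treebank-style bracketed string into nested lists."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read() -> Tree:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # a leaf word
        node = [tokens[pos]]  # the label follows the open bracket
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing bracket
        return node

    return read()


def leaves(tree: Tree) -> List[str]:
    """Return the words under a subtree, left to right."""
    if isinstance(tree, str):
        return [tree]
    words: List[str] = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def noun_phrases_with(tree: Tree, child_tag: str) -> List[str]:
    """Collect the text of every NP that has `child_tag` as a direct child."""
    if isinstance(tree, str):
        return []
    found: List[str] = []
    has_child = any(isinstance(c, list) and c[0] == child_tag for c in tree[1:])
    if tree[0] == "NP" and has_child:
        found.append(" ".join(leaves(tree)))
    for child in tree[1:]:
        found.extend(noun_phrases_with(child, child_tag))
    return found


# Hand-written parse of a made-up sentence (not real Stanford parser output).
tree = parse_bracketed(
    "(S (NP (DT The) (JJS tallest) (NN tower))"
    " (VP (VBZ has) (NP (CD 40) (NNS floors))))"
)
print(noun_phrases_with(tree, "JJS"))  # ['The tallest tower']
print(noun_phrases_with(tree, "CD"))   # ['40 floors']
```

Only direct children are checked here; a number nested inside a quantifier phrase within the NP would need a descendant check instead.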
Extraction of relevant information • Matching a verb against the bag of verbs considered relevant for a particular section of the article. For example, for the history section, verbs like find, discover, settle, and decline were considered “more useful”, whereas verbs like derive and deduce were considered useful for some other section.
Extraction of relevant information • Example: The name India is derived from Indus. http://nlp.stanford.edu/software/tregex.shtml
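The verb-bag heuristic can be sketched as a lookup per section. This is a minimal sketch under stated assumptions: the verbs are the ones listed in the slides, but the section name "etymology", the tiny lemma table, and the `relevant_verbs` helper are hypothetical. A real system would take lemmas from a lemmatizer or from the tagged parse instead.

```python
from typing import Dict, List, Set

# Verb bags per section. The verbs come from the slides; the slides do not
# name the section that "derive" and "deduce" belong to, so "etymology" is
# an assumption made for this example.
SECTION_VERBS: Dict[str, Set[str]] = {
    "history": {"find", "discover", "settle", "decline"},
    "etymology": {"derive", "deduce"},
}

# Tiny hand-written lemma table (hypothetical stand-in for a lemmatizer).
LEMMAS: Dict[str, str] = {
    "found": "find", "discovered": "discover", "settled": "settle",
    "declined": "decline", "derived": "derive", "deduced": "deduce",
}


def relevant_verbs(sentence: str, section: str) -> List[str]:
    """Return lemmas of words in `sentence` that are in the section's verb bag."""
    bag = SECTION_VERBS.get(section, set())
    hits: List[str] = []
    for word in sentence.lower().split():
        word = word.strip(".,;:!?\"'")
        lemma = LEMMAS.get(word, word)
        if lemma in bag:
            hits.append(lemma)
    return hits


sentence = "The name India is derived from Indus."
print(relevant_verbs(sentence, "etymology"))  # ['derive']
print(relevant_verbs(sentence, "history"))    # []
```

The same sentence is kept for one section and ignored for another, which is the point of keeping a separate bag per section.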
Evaluation http://en.wikipedia.org/wiki/Precision_and_recall
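For reference, the standard definitions of precision and recall can be written as a short function. This is only a sketch of the textbook definitions; how the project mapped survey answers onto these two quantities is not described in the slides, and the item ids and counts below are made up for illustration, not the project's data.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = |hits| / |retrieved|; recall = |hits| / |relevant|."""
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 4 items shown in a mind map, 10 items judged important.
retrieved = {"f1", "f2", "f3", "x1"}
relevant = {f"f{i}" for i in range(1, 11)}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.75 0.3
```

Precision is high when most of what the mind map shows is relevant, while recall is high only when most of the relevant material makes it into the map.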
Evaluation • Survey based: • Asking one person to write 10 questions from a given article. • Asking another person to answer those questions with the help of the mind map. • Repeating the same exercise with the roles reversed for another article.
Observations • Pros: • The right information is extracted with high accuracy. • The concept of phrase extraction works well. • High precision values were obtained (between 0.5 and 0.75).
Observations • Cons: • Information presented in a mind map of low depth is cluttered. • Low recall values (0.2 to 0.4). • Node phrases are hard to link with their apt descriptions. • The heuristics defining “important phrases” need to be refined.
Limitations • The bag of words and the Tregex expressions are hand-coded instead of machine-learned. • Garbage phrases are being generated. • The hierarchy is limited to 3 levels.
Future work • Using machine learning to determine the important keywords for a given sentence. • Exploring the possibility of finding patterns in subtree expressions using a machine-learning approach. • Refinement of generated phrases.
References • http://en.wikipedia.org/wiki/Mind_maps • http://en.wikipedia.org/wiki/Precision_and_recall • Tool: Stanford Parser and Stanford Tregex Match http://nlp.stanford.edu/software/tregex.shtml
Vision Based Attribute Segmentation from Lists in Web Pages, by Rohit Kumar Saraf