Information Retrieval

Information Retrieval – and projects we have done. Group Members: Aditya Tiwari (08005036), Harshit Mittal (08005032), Rohit Kumar Saraf (08005040), Vinay Surana (08005031). Guided by Prof. Pushpak Bhattacharyya

Motivation • The web, documents, and encyclopedias all hold a tremendous amount of data and information. As made available, that information serves only the intent of the creator or collector of the data. • However, the same data/information can have other uses as well. The need is to mine the right information from the data and use it appropriately.

Information Retrieval

Applications • Web search: Google, Yahoo • Querying/QA systems such as Watson (developed by IBM) • Spam filtering • Automatic summarization • Cross-lingual retrieval en.wikipedia.org/wiki/Information_retrieval_applications

Information Retrieval • IR is the area of study concerned with searching for documents, and for metadata about documents, as well as searching relational databases and the WWW. • The data objects collected can be images, documents, videos, mind maps, or music. en.wikipedia.org/wiki/Information_Retrieval

Wiki Mind Mapping Harshit Mittal (IIT-B) h.mittal83@gmail.com Aditya Tiwari (IIT-B) adi.tiwari27@gmail.com Akhil Bhiwal (VIT University) bhiwalakhil@gmail.com

Project Idea • Represent textual information in a graphical form that is easier to understand and more intuitive to read. The visual representation should also summarize the text.

Research Goal • Use of phrases to represent semantic information. • Hierarchical representation of the information in a given text.

Mind maps • A mind map is a diagram used to represent words, ideas, tasks, or other items linked to and arranged around a central key word or idea. • Example Mind map in the next slide. http://en.wikipedia.org/wiki/Mind_maps

Mind map http://www.spicynodes.org/blog/2010/05/21/stuff-we-like-climate-change-mind-maps/

What’s the difficult part? • We can’t put the information from an article into a mind map as it is. That would make the map incoherent and cluttered. • Phrase extraction. • General rules of grammar don’t apply here.

Possible Solution • Develop new linguistic rules for representing text in visual form. • Use existing summarization tools to generate a summary and try to represent that in a mind map.

How we did it • Pulled the article out of the Wikipedia page section by section. • Parsed each section sentence by sentence using the Stanford parser. • Extracted “relevant” phrases using Tregex (another Stanford tool). • Put these phrases into a mind map, section by section. http://nlp.stanford.edu/software/tregex.shtml
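The last step can be pictured as a small tree: article title, then sections, then phrases. The sketch below is only an illustration of that shape, not the project's code; the `Node` class, the function names, the "Etymology" section name, and the sample phrases are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One mind-map node: a label plus its child nodes (hypothetical type)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def build_mind_map(title: str, phrases_by_section: Dict[str, List[str]]) -> Node:
    """Build a three-level tree: article title -> sections -> phrases."""
    root = Node(title)
    for section, phrases in phrases_by_section.items():
        section_node = Node(section, [Node(p) for p in phrases])
        root.children.append(section_node)
    return root


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented text lines, one node per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


# Made-up phrases for illustration only; real phrases would come from Tregex.
mind_map = build_mind_map("India", {"Etymology": ["name India", "derived from Indus"]})
print("\n".join(render(mind_map)))
```

A plain tree keeps each phrase attached to the section it came from, which matches the section-wise placement described in the last step.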

Extraction of relevant information • Identifying the important subtrees in the parse tree of a sentence. • This was done using a few heuristics, such as: • Presence of a superlative adjective in a noun phrase http://nlp.stanford.edu/software/tregex.shtml

Extraction of relevant information • Presence of a cardinal number in a noun phrase http://nlp.stanford.edu/software/tregex.shtml
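The two noun-phrase heuristics can be illustrated without the Stanford tools. The sketch below is a plain-Python stand-in, not the Tregex expressions used in the project: it reads a Penn Treebank style bracketing and returns every NP that contains a node with a given tag (`JJS` for superlative adjectives, `CD` for cardinal numbers). In Tregex the same idea is roughly the pattern `NP << JJS`. The helper names and the sample sentence are made up for this example.

```python
import re


def parse_tree(text):
    """Parse a bracketed parse into (label, children) tuples; words are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token              # a word (leaf)
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            children.append(read())
        pos += 1                      # consume the closing ")"
        return (label, children)

    return read()


def subtrees(tree):
    """Yield every labelled subtree, top-down."""
    if isinstance(tree, tuple):
        yield tree
        for child in tree[1]:
            yield from subtrees(child)


def words(tree):
    """Return the words under a node, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [w for child in tree[1] for w in words(child)]


def noun_phrases_with(tree, tags):
    """Return the text of each NP that has a descendant labelled with one of `tags`."""
    found = []
    for node in subtrees(tree):
        if node[0] != "NP":
            continue
        inner = [t[0] for child in node[1] for t in subtrees(child)]
        if any(label in tags for label in inner):
            found.append(" ".join(words(node)))
    return found


# Hand-written parse of a made-up sentence, standing in for parser output.
tree = parse_tree(
    "(S (NP (DT The) (JJS largest) (NN city))"
    " (VP (VBZ has) (NP (CD 12) (CD million) (NNS residents))))"
)
superlative_nps = noun_phrases_with(tree, {"JJS"})   # ['The largest city']
cardinal_nps = noun_phrases_with(tree, {"CD"})       # ['12 million residents']
```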

Extraction of relevant information • Matching a verb against the bag of verbs considered relevant for a particular article. For example: for the history section, verbs like find, discover, settle, and decline were considered “more useful”, whereas verbs like derive and deduce were considered useful for some other section.

Extraction of relevant information Ex : The name India is derived from Indus. http://nlp.stanford.edu/software/tregex.shtml
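The verb heuristic amounts to a set lookup. A minimal sketch follows, assuming the verbs have already been reduced to their base forms by some lemmatizer. The history verbs are the ones listed on the earlier slide; the "etymology" section name and its pairing with derive/deduce are assumptions, since the slides only say those verbs suit "some other section".

```python
# Hand-coded bags of verbs per section. "history" follows the slides;
# "etymology" is a hypothetical section name chosen for this example.
RELEVANT_VERBS = {
    "history": {"find", "discover", "settle", "decline"},
    "etymology": {"derive", "deduce"},
}


def is_relevant(verb_lemmas, section):
    """True if any verb lemma of the sentence is in the section's bag of verbs."""
    bag = RELEVANT_VERBS.get(section, set())
    return bool(bag & set(verb_lemmas))


# Verb lemmas of "The name India is derived from Indus." (assumed lemmatizer output).
example_verbs = ["be", "derive"]
in_etymology = is_relevant(example_verbs, "etymology")   # True
in_history = is_relevant(example_verbs, "history")       # False
```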

Code-Generated Mind Map

Evaluation http://en.wikipedia.org/wiki/Precision_and_recall
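For reference, precision is the fraction of extracted items that are relevant, and recall is the fraction of relevant items that were extracted. The sketch below computes both from two sets. How the survey answers were mapped to these sets is not stated in the slides, so the function name and the sample item labels are purely illustrative.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) for a set of extracted items against the relevant ones."""
    extracted, relevant = set(extracted), set(relevant)
    hits = len(extracted & relevant)
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up item labels: 4 extracted, 8 relevant, 2 in common.
extracted_items = {"a", "b", "c", "d"}
relevant_items = {"a", "b", "e", "f", "g", "h", "i", "j"}
precision, recall = precision_recall(extracted_items, relevant_items)
# precision = 2/4 = 0.5, recall = 2/8 = 0.25
```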

Evaluation • Survey-based: • One person generates 10 questions from a given article. • Another person answers those questions with the help of the mind map. • The same exercise is repeated with the roles reversed for another article.

Observations • Pros: • Extraction of the right information with high accuracy. • The concept of phrase extraction works well. • High precision values were obtained (between 0.5 and 0.75).

Observations • Cons: • Information presented in a mind map of low depth is cluttered. • Low recall values (between 0.2 and 0.4). • Linking of node phrases with their apt descriptions. • Heuristics defining “important phrases” need to be refined.

Limitations • The bag of words and the Tregex expressions are hand-coded instead of machine-learned. • Garbage phrases are being generated. • The hierarchy is limited to 3 levels.

Future work • Using machine learning to determine the important keywords for a given sentence. • Exploring the possibility of finding patterns in subtree expressions using a machine-learned approach. • Refinement of generated phrases.

References • http://en.wikipedia.org/wiki/Mind_maps • http://en.wikipedia.org/wiki/Precision_and_recall • Tools: Stanford Parser and Stanford Tregex Match http://nlp.stanford.edu/software/tregex.shtml

Vision-Based Attribute Segmentation from Lists in Web Pages, by Rohit Kumar Saraf
