Research Problems & Topics (Literature Domain)
CS598-CXZ Advanced Topics in IR Presentation
Feb 1, 2005
ChengXiang Zhai, Department of Computer Science, University of Illinois, Urbana-Champaign
Research Area Mining • A department spans many research areas; for example, Computer Science includes Artificial Intelligence, Machine Learning, Data Mining, Computer Vision, and more. What is the relationship between these areas? For example, Machine Learning is closely related to Data Mining and Computer Vision, and Data Mining is often correlated with Information Retrieval. Could we discover these relations from the Web? Could we find or anticipate newly emerging areas or interdisciplinary areas? • Users: students, faculty • Data: Faculty homepages are a good source; faculty members typically state their interests and list their publications there. If one professor has more than one interest, those areas are probably related. If two professors collaborate on a paper, their interests are probably related (see the sketch below). Such an application may help faculty and students find new interests. • Functions: Research area relation mining. • Challenges: How to recognize faculty members' interests? How to mine the relations?
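A minimal sketch of the co-occurrence idea above, assuming each faculty member's interests have already been extracted into a list (the extraction step is the open challenge and is not shown; the names and interest lists are illustrative):

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each faculty member's stated research interests,
# e.g. scraped and normalized from homepages (not shown here).
faculty_interests = {
    "prof_a": ["machine learning", "data mining"],
    "prof_b": ["data mining", "information retrieval"],
    "prof_c": ["machine learning", "computer vision"],
}

# Two areas are (weakly) related whenever one person lists both of them.
edge_weights = Counter()
for interests in faculty_interests.values():
    for a, b in combinations(sorted(set(interests)), 2):
        edge_weights[(a, b)] += 1

# Rank candidate area-area relations by how many faculty members link them.
for (a, b), w in edge_weights.most_common():
    print(f"{a} -- {b}: {w}")
```

Co-authorship links could be folded in the same way, by counting two professors' interest sets as co-occurring whenever they share a paper.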
Paper Classification & Organization • The problem is to classify published papers in the CS domain into sub-areas and organize them in time order. • Currently, a researcher who wants to know what has or has not been done in a field has to search the web in an ad hoc way, and it is easy to miss important publications this way. • If this task is done, researchers who want to do a literature survey in a specific area will benefit a lot. For example, if a data mining researcher wants to know what has been done on frequent pattern mining, he can input "frequent pattern mining" and all the relevant papers are returned in time order, making the literature survey much easier (a minimal sketch follows below). • The major challenge is how to summarize and classify the papers correctly; an interdisciplinary paper should be assigned to every related field.
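A minimal sketch of the classify-then-order idea, assuming a naive keyword matcher in place of a real trained classifier (the paper records and keyword lists are toy data):

```python
# Toy records; real input would be titles/abstracts with publication years.
papers = [
    {"title": "Mining frequent patterns without candidate generation", "year": 2000},
    {"title": "Fast algorithms for mining association rules", "year": 1994},
    {"title": "A language modeling approach to information retrieval", "year": 1998},
]

# Extremely naive "classifier": a paper belongs to every area whose keywords
# appear in its title.  A real system would train a text classifier instead,
# and interdisciplinary papers naturally fall into several areas.
areas = {
    "frequent pattern mining": ["frequent pattern", "association rule"],
    "information retrieval": ["information retrieval", "language model"],
}

def classify(title):
    t = title.lower()
    return [area for area, kws in areas.items() if any(k in t for k in kws)]

by_area = {}
for p in papers:
    for area in classify(p["title"]):
        by_area.setdefault(area, []).append(p)

# Present each area's papers in time order, oldest first.
for area, ps in by_area.items():
    print(area)
    for p in sorted(ps, key=lambda x: x["year"]):
        print(f"  {p['year']}  {p['title']}")
```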
Automatic Survey Generation • A useful tool for researchers would be a program that can automatically generate surveys on topics given by the user, and/or recommend the state-of-the-art technique that works best for the user's topic/problem. • The users of such a program are either researchers who want to find out the current state of the art in research areas new to them (or in entirely new research areas), or researchers and engineers who want to use existing methods/tools as building blocks for their research/products at a higher level. • The problem is not trivial because • (1) for new research topics, there is probably no survey paper published yet, and • (2) even for old research topics, the existing survey papers may already be outdated. Therefore, the data involved include not only existing survey papers on the given topic, but also the most recent research papers addressing the problem. • The problem is challenging in several aspects. • (1) Some topics (especially new ones) may not be well defined. When searching for relevant papers, the system needs to consider different ways of describing the problem beyond what the user provides (similar to query expansion; see the sketch below). • (2) Summarizing the methods proposed in different papers and comparing their pros and cons may be difficult. This may involve text summarization and information extraction. For example, can the system identify a benchmark for the given problem and compare the performance of different methods on it? • (3) If the user provides constraints/requirements, can the system recommend the method that best fits the user's needs? This may involve more sophisticated techniques.
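A minimal sketch of point (1), assuming a pseudo-relevance-feedback style of query expansion: the top-ranked documents for the user's topic description are treated as relevant, and their most distinctive terms are added to the query (the toy corpus and weighting are illustrative):

```python
import math
from collections import Counter

# Toy corpus standing in for abstracts already retrieved for a topic.
docs = [
    "survey of statistical language models for information retrieval",
    "query expansion with pseudo relevance feedback in ad hoc retrieval",
    "neural ranking models a survey and benchmark comparison",
]

def tokens(text):
    return text.lower().split()

# Pseudo-relevance feedback: treat the top-ranked documents as relevant and
# add their most distinctive terms to the original topic description.
def expand(query, top_docs, k=5):
    df, tf = Counter(), Counter()
    for d in top_docs:
        df.update(set(tokens(d)))
        tf.update(tokens(d))
    n = len(top_docs)
    # weight terms by a tf * idf-like score, skipping terms already in the query
    scored = {t: tf[t] * math.log(1 + n / df[t])
              for t in tf if t not in tokens(query)}
    return query.split() + sorted(scored, key=scored.get, reverse=True)[:k]

print(expand("language models retrieval", docs[:2]))
```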
Note Taking System • When a student studies a new subject, it is sometimes hard to distinguish what is important in the subject from what is not. By collecting many textbooks together with the class handouts and lecture notes, we might automatically generate notes for the subject, helping students learn a new subject quickly.
Integrated Information System for Bioinformatics Sources • Functional analysis, which studies how a biological entity is functionally related to other biological entities, is a major research issue in modern biology. To perform successful functional analysis, biologists must integrate data from multiple sources, which is usually carried out largely by hand. Hence, developing automatic techniques to integrate genomic data has become truly critical to successful functional analysis. • Users: biologists, bioinformaticians, etc. • Data involved: biomedical literature, biological entities, etc. • Functions to be developed: text search, relational query, etc.
Evolutive Text Mining • In literature collections, there may be hundreds of papers in each area every year. Concepts, problems, and technologies not only evolve over time within each field, but are also involved in interdisciplinary interactions. Taking concepts as an example: as time goes by, some concepts die out, some emerge, some are borrowed from other fields, some merge together, and some split. Concepts in different fields (collections, communities) may have different names but share analogous content and similar evolution paths. • If we can model the evolution of concepts/problems/technologies in one field, we can understand the evolution of that field well, and sometimes even predict its changes. For an even more ambitious scenario, suppose A, B, C, ... are techniques in field 1, and A′, B′, C′ are their analogous techniques in field 2. Suppose we discover two evolution paths: Field 1: A -> B -> (+D) -> C -> (+E) -> F; Field 2: A′ -> B′ -> (+D′) -> C′. C and C′ share a similar evolution process in fields 1 and 2. Does this indicate that introducing a technique E′ (analogous to E in field 1) might bring about the next development of C′ in field 2? • This would be very useful for scientific researchers. Using Comparative Text Mining, we can find analogous concepts in different fields, and if we can model the evolution of concepts well, this task becomes possible. • Users: Scientists, researchers • Data: Scientific literature, for example, Honeybee data and FlyBase data. • Functions: Finding analogous concepts across collections; modeling the evolution paths in each collection; comparing and making predictions with the paths from different collections. • Challenges: How to find a good model of concept evolution. How to use CTM to define analogous concepts.
Topic Evolution Discovery • Challenge: To discover how a topic has evolved through time • Users: Researchers in different fields, managers who want to streamline a company's processes by looking for inefficiencies, etc. • Data: Scientific literature, company documents • Method: In the simplest sense, it may be interesting to find the function parent_of(A, B), where A and B are documents and much of the content of B comes from (or is influenced by) A. With this function and a timestamp for each document, it should be possible to create a timeline that shows the lineage of a concept (a minimal sketch follows below).
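A minimal sketch of parent_of, assuming a crude definition in which A is a candidate parent of B when A predates B and their term overlap (cosine similarity) exceeds a threshold; the documents, years, and threshold are illustrative:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Hypothetical documents with timestamps (publication year).
docs = [
    ("d1", 1998, "probabilistic latent semantic analysis of text"),
    ("d2", 2003, "latent dirichlet allocation a generative topic model for text"),
    ("d3", 2004, "topic models applied to scientific literature"),
]

# parent_of(A, B): A predates B and much of B's content overlaps with A
# (here approximated by cosine similarity above a threshold).
def parent_of(a, b, threshold=0.2):
    (_, ya, ta), (_, yb, tb) = a, b
    return ya < yb and cosine(vec(ta), vec(tb)) >= threshold

# Emit the lineage edges; with timestamps these form a topic timeline.
for a in docs:
    for b in docs:
        if a is not b and parent_of(a, b):
            print(f"{a[0]} ({a[1]}) -> {b[0]} ({b[1]})")
```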
Paper Writing Support (Finding Related Work) • Help with research paper writing: when writing a paper, one tedious task is composing the related work section. Usually, I only have a few competing or reference papers, but the related work section needs a more thorough survey in order to avoid unnecessary arguments from reviewers. It would be good to have a system to which I can give some articles, or some paragraphs from the paper in progress, and which returns typical related works together with a rough organization of them by research topic. For example, given this note, the system might return some papers about searching in locally cached pages, some about email categorization, and some about paper retrieval and summarization. • Retrieving related papers may need content-based information retrieval techniques combined with a link-based approach over the collection obtained by expanding the citations in the given articles (see the sketch below). To give each paper a short summary, we may apply a summarization technique to each paper, or simply extract the sentences in other papers that mention the target paper. • Organizing the resulting papers can be achieved by classification against a well-defined research topic hierarchy, or by clustering if no such topic information is available in advance. • Users: Research paper writers • Data: Research papers • Function: Given some papers, return typical related works together with a summary of each paper (as reference material for composing the article) and topical information about each paper that helps organize the related papers.
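A minimal sketch of combining the content-based and link-based signals, assuming the candidate pool was already built by expanding the seed papers' citations; the toy texts, citation-proximity values, and the 0.7/0.3 weighting are assumptions:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Seed papers the author already has, plus a candidate pool that a real system
# would build by expanding the seeds' citation lists (toy data here).
seeds = ["email foldering and automatic categorization of messages"]
candidates = {
    "p1": "automatic text categorization with support vector machines",
    "p2": "searching personal web caches and local document stores",
    "p3": "a study of smoothing methods for language models in retrieval",
}
citation_links = {"p1": 2, "p2": 1, "p3": 0}   # citation-path proximity to the seeds

seed_vec = vec(" ".join(seeds))
scores = {
    pid: 0.7 * cosine(seed_vec, vec(text)) + 0.3 * (citation_links[pid] / 2)
    for pid, text in candidates.items()
}
# Return candidates ranked by the combined content + link score.
for pid in sorted(scores, key=scores.get, reverse=True):
    print(pid, round(scores[pid], 3))
```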
Automatic Identification of Related Literature • IR for literature may be among the most important applications, since the information is authoritative. Google and CiteSeer can index papers in PS and PDF form, and CiteSeer appears to automatically extract special fields from each document (e.g., title, author, bibliography). • An interesting next step would be to make browsing through the documents more tractable by automatically identifying related literature. • Possible ways to find related literature include word-level similarity (common keywords), bibliographic similarity, and the medium in which the paper appeared (same conference, same workshop, same author, etc.). So that the suggestions do not overwhelm the user, some user feedback seems necessary: if suggested literature from the same workshop is not relevant, the system might suggest documents using a different heuristic (see the sketch below).
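A minimal sketch of combining several heuristics with feedback-driven weights, assuming each heuristic produces a score in [0, 1]; the heuristic scores, weights, and the simple multiplicative update are illustrative assumptions:

```python
# Each heuristic returns a relatedness score in [0, 1] for a candidate paper;
# the input fields here are toy stand-ins for real similarity computations.
def keyword_overlap(seed, cand):   return cand["keyword_sim"]
def shared_references(seed, cand): return cand["biblio_sim"]
def same_venue(seed, cand):        return 1.0 if cand["venue"] == seed["venue"] else 0.0

heuristics = [keyword_overlap, shared_references, same_venue]
weights = [1.0, 1.0, 1.0]          # start with all heuristics trusted equally

def score(seed, cand):
    return sum(w * h(seed, cand) for w, h in zip(weights, heuristics))

# Simple feedback rule: if the user rejects a suggestion, down-weight every
# heuristic that voted strongly for it, so a different heuristic drives the
# next round of suggestions; accepted suggestions reinforce their heuristics.
def apply_feedback(seed, cand, relevant, rate=0.2):
    for i, h in enumerate(heuristics):
        if h(seed, cand) > 0.5:
            weights[i] *= (1 + rate) if relevant else (1 - rate)

seed = {"venue": "SIGIR"}
cand = {"keyword_sim": 0.8, "biblio_sim": 0.1, "venue": "SIGIR"}
print(score(seed, cand))
apply_feedback(seed, cand, relevant=False)
print(score(seed, cand))
```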
Limited Domain Question Answering • Help Windows programmers find solutions to technical problems. • Users: Windows programmers • Data: The MSDN library and knowledge base, and perhaps external sources such as articles on codeproject.com • Description: Develop a system to quickly help programmers find Win32 APIs or sample code that can solve a particular technical problem. If the terms used by the programmer to describe the problem are different from those used in the documentation, or if the solution is not explicitly stated in one place but scattered across other documentation, it may be difficult to find. For example, to find out how to convert a DOS 8.3 filename to a Windows long filename, searching MSDN for "convert file name 8.3 long" returns GetFullPathName, and one has to read its documentation carefully to discover that the API that actually does the job is GetLongPathName. This happens because "8.3" is never mentioned in the documentation of GetLongPathName, only as a side note in the documentation of GetFullPathName. It would be nice if the system could collect all this information together and give the programmer a direct answer. This is challenging because it may require sophisticated NLP analysis like that used in question answering.
Personal Literature Management • Researchers store many papers on their local disks, and it is sometimes hard to find a downloaded paper again. So it is important to organize these papers and provide functionality such as search to the user. • Every researcher will benefit from this tool. • The personal literature collection is the data this tool manages. • The functionality will include search (finding relevant papers) and classification. The user provides a hierarchy, and the system associates each paper with several tags automatically; some papers will be tagged by the user so that training data is available (see the sketch below). The challenge of this project is how to design and implement such a system and choose the search and classification algorithms best suited to a personal literature collection.
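A minimal sketch of the classification step, assuming a nearest-centroid assignment in which each hierarchy node is represented by the combined text of the few papers the user has already filed there; the node names and paper texts are illustrative:

```python
import math
from collections import Counter

def vec(text):
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# A few papers the user has already filed into the hierarchy (training data) ...
labeled = {
    "IR/retrieval models": ["language models for information retrieval"],
    "ML/classification":   ["support vector machines for text classification"],
}
# ... and the rest of the local collection, to be tagged automatically.
unlabeled = ["a study of smoothing in language model retrieval",
             "boosting methods for document classification"]

# Nearest-centroid assignment: each hierarchy node is represented by the
# combined text of its already-tagged papers.
centroids = {node: vec(" ".join(texts)) for node, texts in labeled.items()}
for doc in unlabeled:
    best = max(centroids, key=lambda n: cosine(centroids[n], vec(doc)))
    print(f"{doc!r} -> {best}")
```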
Topic-Specific Paper Rank • Users prefer good papers. Such papers fall into two types: good survey papers, which cover all the important topics of an area, and good technical papers, which set a new direction or address a specific problem thoroughly. However, whether a paper is good is area-dependent. For example, one user would like a good paper on Information Retrieval, while another would like a good paper on Data Mining. The question is how to rank papers according to their areas (see the sketch below). Such an application may also tell people that it is worth writing a new survey paper if no good survey paper can be found. • Users: researchers, scientists, graduate students. • Data: literature materials • Functions: Paper search and topic-specific ranking • Challenges: How to identify a paper as a good survey paper, how to identify a paper as a good technical paper, and how to classify a paper into a specific domain? How to use author information in the paper ranking?
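A minimal sketch of one possible approach (not from the slides): a topic-sensitive PageRank over the citation graph, where the random jump favors papers relevant to the user's area. The citation graph, topic weights, and parameters are toy assumptions; dangling papers that cite nothing are ignored in this simplified version.

```python
# Tiny citation graph: paper -> papers it cites (toy data).
cites = {
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": [],
    "p4": ["p3", "p1"],
}
# Topic relevance of each paper to the user's area (e.g. from a text
# classifier); these values are illustrative assumptions.
topic_weight = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.1}

# Topic-sensitive PageRank: the random jump lands on papers in proportion
# to their topic relevance, biasing authority scores toward the user's area.
def topic_pagerank(cites, bias, d=0.85, iters=50):
    nodes = list(cites)
    total_bias = sum(bias.values())
    rank = {n: 1 / len(nodes) for n in nodes}
    cited_by = {n: [m for m in nodes if n in cites[m]] for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - d) * bias[n] / total_bias
               + d * sum(rank[m] / len(cites[m]) for m in cited_by[n])
            for n in nodes
        }
    return rank

for paper, r in sorted(topic_pagerank(cites, topic_weight).items(),
                       key=lambda kv: -kv[1]):
    print(paper, round(r, 3))
```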
Starting Point for Research • Name: Starting Point for Research in Any Area • User: Any faculty member or student who is looking at entering a new area of research. • Data involved: All the papers and/or indexed summaries available on the web. • Function: Whenever researchers want to enter a new area, they face a big question: how, or from where, should I start? Finding an answer to this question can be difficult, or at least time consuming. It would be great if a system existed that could gather and summarize information about all the relevant papers (classic, highly referenced, cutting-edge, etc.) and also about all the people working in the related areas (including summaries and information about their publications, projects, affiliations, etc.). Such an intelligent system could generate a route through which the user gets all the information needed to start getting into the desired area.
Automatically Discover Cause-Effect Relationships • In the literature, facts stated as cause-effect relationships are common, especially in medical, legal, and historical texts. To collect them, a person needs to read all the related documents, remember most of the facts, and reason well. However, with the huge number of publications in each field today, no one can do that thoroughly; most successful attempts involve some luck in reaching the right documents at the right time. If this task were done automatically, much useful and perhaps surprising knowledge would no longer be missed, and on this basis we could build a new kind of expert system that works directly with knowledge in the form of literature (a minimal starting-point sketch follows below). • Users: Researchers, lawyers, historians. • Data: Existing literature, especially in medicine, law, history, and chemistry. • Challenges: Recognizing causes and effects and connecting them together is extremely hard.
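A minimal sketch of one common starting point (my assumption, not stated in the slides): lexical cue patterns that match explicit cause-effect phrasing. Real systems need far richer patterns, syntactic analysis, and evidence aggregation across many documents; the sentences and patterns here are toy examples.

```python
import re

# Toy sentences; real input would be medical / legal / historical literature.
sentences = [
    "Smoking causes lung cancer.",
    "The treaty led to a decade of economic growth.",
    "Vitamin C deficiency results in scurvy.",
]

# A handful of lexical cue patterns for explicit causal language.
patterns = [
    r"(?P<cause>.+?)\s+causes\s+(?P<effect>.+)",
    r"(?P<cause>.+?)\s+led to\s+(?P<effect>.+)",
    r"(?P<cause>.+?)\s+results in\s+(?P<effect>.+)",
]

for s in sentences:
    for p in patterns:
        m = re.match(p, s.rstrip("."), flags=re.IGNORECASE)
        if m:
            print(f"cause: {m.group('cause')!r:35} effect: {m.group('effect')!r}")
            break
```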
Literature Network • One topic is how to automatically build a "literature network" for a topic. In a literature network, every node is a paper related to the topic, and every edge between nodes is annotated with the relation between the two papers (i.e., why one paper cites the other and how the two papers are related). Such a literature network gives users a whole picture of the area, which makes literature surveys easier. Note that there are two major types of citation between papers: one cites a known technique (not necessarily related to the topic), and the other cites previous and related work. The first type of citation should not be included in the network (see the sketch below). • The users are researchers. • The data are conference papers, journal papers, and books. • The major challenge is how to identify the relations between two papers, which involves techniques from information extraction, summarization, and text categorization.
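A minimal sketch of filtering citation edges by type, assuming the sentence surrounding each citation is available and a crude cue-phrase classifier stands in for a real text categorizer; the contexts and cue list are illustrative:

```python
# Toy citation contexts: (citing paper, cited paper, sentence around the citation).
contexts = [
    ("p1", "p2", "We extend the topic-evolution approach of [p2] to multiple fields."),
    ("p1", "p3", "We use the standard EM algorithm [p3] to fit the model."),
    ("p4", "p1", "Closely related work on literature networks appears in [p1]."),
]

# Naive citation-type classifier: cue phrases suggesting the citation is
# "related / prior work" rather than a borrowed general-purpose technique.
RELATED_CUES = ("related work", "extend", "prior work", "builds on", "closely related")

def is_related_work(sentence):
    s = sentence.lower()
    return any(cue in s for cue in RELATED_CUES)

# Keep only related-work edges in the literature network, annotated with the
# sentence that explains why the two papers are connected.
network = [(a, b, sent) for a, b, sent in contexts if is_related_work(sent)]
for a, b, sent in network:
    print(f"{a} -> {b}: {sent}")
```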
Statistical Models for Peptide Tandem Mass Spectrometry Data Analysis • Description: Molecular biology has been revolutionized by the advent of high-throughput experimental methods that can investigate thousands of genes or proteins in parallel. With the great success of microarray analysis techniques for genomics, mass spectrometry based proteomics has become the next hot topic in the literature. However, unlike the reliable microarray-based analysis methods for genes, interpreting high-throughput peptide tandem mass spectrometry data is still an open problem. The large volume of data generated from these experiments is full of noise and governed by incompletely understood biochemical principles. How to utilize these data to extract useful information and knowledge remains a problem. • In this project, our long-term research goal is twofold: • 1. determine what kinds of proteins are present in the tissue samples. • 2. determine the quantitative ratios of different proteins. • Under the guidance of these two directions, many sub-goals could be derived, such as how to design efficient and effective scoring functions for sequence database searching, how to design probabilistic models to simulate interactions between different proteins, and how to derive useful features from raw peptide sequences and spectrum data. • The rough outline for this project is: • 1. conduct a literature survey and write a review report summarizing what other researchers are currently doing. • 2. identify one or two promising topics from the survey. • 3. conduct the research work and get some initial results. • 4. finish the course project paper. • Group: this could be a one-person project, since it requires some biology background about peptide tandem mass spectrometry experiments as well as machine learning knowledge; however, if other students are really interested, it may be expanded to a two-member group.
Possible Topics • Literature Access • Personal literature management • Summarization • Generating/Identifying Survey papers • “Starting points” • Literature Mining • Literature networks/Find related work • Research area mining • Topic evolution mining • Biology functional analysis/Question answering
Assignment 2 (for Literature Team) • Search on the web (starting with digital library conferences, JCDL, and summarization work) • Everyone identifies one or two most interesting papers that you would like to present • Send me your choices by this Saturday (Feb. 5) • Need one volunteer to present a literature paper on Feb. 10 • Possible choices: • Building domain-specific web collections for scientific digital libraries: a meta-search enhanced focused crawling method • Panorama: extending digital libraries with topical crawlers