Introduction to IR Research. ChengXiang Zhai, Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, Statistics, University of Illinois, Urbana-Champaign. http://www.cs.uiuc.edu/homes/czhai, czhai@cs.uiuc.edu
Outline • What is research? • How to prepare yourself for IR research? • How to identify and define a good IR research problem? • How to formulate and test IR research hypotheses? • How to write and publish an IR paper?
What is Research? • Research • Discover new knowledge • Seek answers to questions • Basic research • Goal: Expand man’s knowledge (e.g., which genes control social behavior of honey bees?) • Often driven by curiosity (but not always) • High impact examples: relativity theory, DNA, … • Applied research • Goal: Improve human condition (i.e., improve the world) (e.g., how to cure cancers?) • Driven by practical needs • High impact examples: computers, transistors, vaccinations, … • The boundary is vague; distinction isn’t important
Why Research? [Diagram labels: Basic Research, Applied Research, Application Development; Amount of knowledge, Advancement of Technology, Utility of Applications, Quality of Life; Curiosity, Funding]
Where’s IR Research? [Diagram labels: Basic Research, Applied Research, Application Development; Amount of knowledge, Advancement of Technology, Utility of Applications, Quality of Life; Funding; Information Science, Computer Science]
Where’s Your Position? Different positions benefit from different collaborators [Diagram labels: Basic Research, Applied Research, Application Development; Amount of knowledge, Advancement of Technology, Utility of Applications, Quality of Life]
Research Process • Identification of the topic (e.g., Web search) • Hypothesis formulation (e.g., algorithm X is better than Y = state-of-the-art) • Experiment design (measures, data, etc.) (e.g., retrieval accuracy on a sample of web data) • Test hypothesis (e.g., compare X and Y on the data) • Draw conclusions and repeat the cycle of hypothesis formulation and testing if necessary (e.g., Y is better only for some queries, now what?)
Typical IR Research Process • Look for a high-impact topic (basic or applied) • New problem: define/frame the problem • Identify weakness of existing solutions if any • Propose new methods • Choose data sets (often a main challenge) • Design evaluation measures (can be very difficult) • Run many experiments (need to have clear research hypotheses) • Analyze results and repeat the steps above if necessary • Publish research results
Research Methods • Exploratory research: Identify and frame a new problem (e.g., “a survey/outlook of personalized search”) • Constructive research: Construct a (new) solution to a problem (e.g., “a new method for expert finding”) • Empirical research: Evaluate and compare existing solutions (e.g., “a comparative evaluation of link analysis methods for web search”) • The “E-C-E cycle”: exploratory → constructive → empirical → exploratory → …
Types of Research Questions and Results • Exploratory (Framework): What’s out there? • Descriptive (Principles): What does it look like? How does it work? • Evaluative (Empirical results): How well does a method solve a problem? • Explanatory (Causes): Why does something happen the way it happens? • Predictive (Models): What would happen if xxx ?
Solid and High Impact Research • Solid work: • A clear hypothesis (research question) with conclusive result (either positive or negative) • Clearly adds to our knowledge base (what can we learn from this work?) • Implications: a solid, focused contribution is often better than a non-conclusive broad exploration • High impact = high-importance-of-problem * high-quality-of-solution • high impact = open up an important problem • high impact = close a problem with the best solution • high impact = major milestones in between • Implications: question the importance of the problem and don’t just be satisfied with a good solution, make it the best
What It Takes to Do Research • Curiosity: allows you to ask questions • Critical thinking: allows you to challenge assumptions • Learning: takes you to the frontier of knowledge • Persistence: so that you don’t give up • Respect for data and truth: ensures your research is solid • Communication: allows you to publish your work • …
Learning about IR • Start with an IR textbook (e.g., Manning et al., Grossman & Frieder, a forthcoming book from UMass, …) • Then read “Readings in IR” by Karen Sparck Jones, Peter Willett • And read papers recommended in the following article: http://www.sigir.org/forum/2005D/2005d_sigirforum_moffat.pdf • Read other papers published in recent IR/IR-related conferences
Learning about IR (cont.) • Getting more focused • Choose your favorite sub-area (e.g., retrieval models) • Extend your knowledge about related topics (e.g., machine learning, statistical modeling, optimization) • Stay at the frontier: • Keep monitoring literature in both IR and related areas • Broaden your view: Keep an eye on • Industry activities • Read about industry trends • Try out novel prototype systems • Funding trends • Read requests for proposals
Critical Thinking • Develop a habit of asking questions, especially why questions • Always try to make sense of what you have read/heard; don’t let any question pass by • Get used to challenging everything • Practical advice • Question every claim made in a paper or a talk (can you argue the other way?) • Try to write two opposite reviews of a paper (one mainly to argue for accepting the paper and the other for rejecting it) • Force yourself to challenge one point in every talk that you attend and raise a question
Respect Data and Truth • Be honest with the experiment results • Don’t throw away negative results! • Try to learn from negative results • Don’t twist data to fit your hypothesis; instead, let the hypothesis choose data • Be objective in data analysis and interpretation; don’t mislead readers • Aim at understanding/explanation instead of just good results • Be careful not to over-generalize (for both good and bad results); you may be far from the truth
Communications • General communication skills: • Oral and written • Formal and informal • Talk to people with different levels of background • Be clear, concise, accurate, and adaptive (elaborate with examples, summarize by abstraction) • English proficiency • Get used to talking to people from different fields
Persistence • Work only on topics that you are passionate about • Work only on hypotheses that you believe in • Don’t draw negative conclusions prematurely and give up easily • positive results may be hidden in negative results • In many cases, negative results don’t completely reject a hypothesis • Be comfortable with criticisms about your work (learn from negative reviews of a rejected paper) • Think of possibilities of repositioning a work
Optimize Your Training • Know your strengths and weaknesses • strong in math vs. strong in system development • creative vs. thorough • … • Train yourself to fix weaknesses • Find strategic partners • Position yourself to take advantage of your strengths
Part 3. How to identify and define a good IR research problem?
What is a Good Research Problem? • Well-defined: Would we be able to tell whether we’ve solved the problem? • Highly important: Who would care about the solution to the problem? What would happen if we don’t solve the problem? • Solvable: Is there any clue about how to solve it? Do you have a baseline approach? Do you have the needed resources? • Matching your strength: Are you in a good position to solve the problem?
Challenge-Impact Analysis • Axes: Level of Challenges (Known to Unknown) vs. Impact/Usefulness • High impact, high risk (hard): good long-term research problems • Difficult basic research problems, but questionable impact • High impact, low risk (easy): good short-term research problems • Low impact, low risk: bad research problems (may not be publishable) • Good applications: not interesting for research • “Entry point” problems
Optimizing “Research Return”: Pick a Problem Best for You • Best problems for you: the overlap of High (Potential) Impact, Your Passion, and Your Strength • Find your passion: If you don’t have to work/study for money, what would you do? • Test of impact: If you are given $1M to fund a research project, what would you fund? • Find your strength: If you don’t know your strength, at least avoid your weakness; acquire strength through training
How to Find a Problem? • Application-driven (Find a nail, then make a hammer) • Identify a need by people/users that cannot be satisfied well currently (“complaints” about current data/information management systems?) • How difficult is it to solve the problem? • No big technical challenges: do a startup • Lots of big challenges: write a research proposal • Identify one technical challenge as your topic • Formulate/frame the problem appropriately so that you can solve it • Aim at a completely new application/function (find a high-stakes nail)
How to Find a Problem? (cont.) • Tool-driven (Hold a hammer, and look for a nail) • Choose your favorite state-of-the-art tools • Ideally, you have a “secret weapon” • Otherwise, bring tools from area X to area Y • Look around for possible applications • Find a novel application that seems to match your tools • How difficult is it to use your tools to solve the problem? • No big technical challenges: do a startup • Lots of big challenges: write a research proposal • Identify one technical challenge as your topic • Formulate/frame the problem appropriately so that you can solve it • Aim at an important extension of the tool (find an unexpected application and use the best hammer)
How to Find a Problem? (cont.) • In practice, you do both in various kinds of ways • You talk to people in application domains and identify new “nails” • You take courses and read books to acquire new “hammers” • You check out related areas for both new “nails” and new “hammers” • You read visionary papers and the “future work” sections of research papers, and then take a problem from there • …
Three Basic Questions to Ask about an IR Problem • Who are the users? • Everyone vs. Small group of people • What data do we have? • Web (whole web vs. sub-web) • Email (public email vs. personal email) • Literature (general vs. special discipline) • Blog, forum, … • What functions do we want to support? • Information access vs. knowledge acquisition • Decision and task support • Example: Everyone (who has an Internet connection); The whole web (indexed by Google); Search (by keywords)
The Data-User-Service (DUS) Triangle Users Data Services
Many Ways to Connect the DUS Triangle! (Map of IR Applications) • Applications: Web Search, Literature Assistant, Enterprise Search, Opinion Advisor, Customer Rel. Man., … • Users: Customer Service People, UIUC Employees, Everyone, Scientists, Online Shoppers • Data: Web pages, Literature, Organization docs, Blog articles, Product reviews, …, Customer emails, … • Services: Task/Decision support, Search, Browsing, Alert, Mining
Today’s Search Engine Services Search Keyword Queries Bag of words User Data/Text
Where Do We Want to Be? Full-Fledged Text Info. Management Task Support Mining Access Personalization (User Modeling) Search History Entities-Relations Large-Scale Semantic Analysis Complete User Model Knowledge Representation Search Current Search Engine Keyword Queries Bag of words
High-Level Challenges in IR • How to make use of imperfect IR techniques to do something useful? • Save human labor (e.g., partially automate a task) • Create “add on” value (e.g., literature alert) • A lot of HCI issues (e.g., allowing users to control) • How to develop robust, effective, and efficient methods for a particular application? • Methods need to “work all the time” without failure • Methods need to be accurate enough to be useful • Methods need to be efficient enough to be useful
Challenge 1: From Search to Information Access • Search is only one way to access information • Browsing and recommendation are two other ways • How can we effectively combine these three ways to provide integrated information access? • E.g., artificially linking search results with additional hyperlinks, “literature pop-ups”…
Challenge 2: From Information Access to Task Support • The purpose of accessing information is often to perform some tasks • How can we go beyond information access to support a user at the task level? • E.g., automatic/semi-automatic email reply for customer service, literature information service for paper writing (suggest relevant citations, term definitions, etc), comparing prices for shoppers
Challenge 3: Support Whole Life Cycle of Information • A life cycle of information consists of “creation”, “storage”, “transformation”, “consumption”, “recycling”, etc • Most existing applications support one stage (e.g., search supports “consumption”) • How can we support the whole life cycle in an integrated way? • E.g., Community publication/subscription service (no need for crawling, user profiling)
Challenge 4: Collaborative Information Management • Users (especially similar users) often have similar information needs • Users who have explored the information space can share their experiences with other users • How to exploit the collective expertise of users and allow users to help each other? • E.g., allowing “information annotation” on the Web (“footprints”), collaborative filtering/retrieval
Look for New IR Research Questions • Driven by new data: X is a new type of data emerging (e.g., X = blog vs. news) • How is X different from existing types of data? • What new issues/problems are raised by X? • Are existing methods sufficient for solving old problems on X? If not, what are the new challenges? • What new methods are needed? • Are old evaluation measures adequate? • Driven by new users: Y is a set of new users (e.g., ordinary people vs. librarians) • How are the new users different from old ones? What new needs do they have? • Can existing methods work well to satisfy their needs? If not, what are the new challenges? • What new functions are appropriate for Y? • Driven by new tasks (not necessarily new users or new data): Z is a new task (e.g., social networking, online shopping) • What information management functions are needed to better support Z? • Can these new functions be reduced to old ones? If not, what are the new challenges?
General Steps to Define a Research Problem • Generate and Test • Raise a question • Novelty test: Figure out to what extent we know how to answer the question • There’s already an answer to it: Is the answer good enough? • Yes: not interesting, but can you make the question more challenging? • No: your research problem is how to get a better answer to the raised question • No obvious answer: you’ve got an interesting problem to work on • Tractability test: Figure out whether the raised question can be answered • I can see a way to answer it or potentially answer it: you’ve got a solvable problem • I can’t easily see a way to answer it: Is it because the question is too hard or you’ve not worked hard enough? Try to reframe the problem to make it easier • Evaluation test: Can you obtain a data set and define measures to test solutions/answers? • Yes: you’ve got a clearly defined problem to work on • No: can you think of any way to indirectly test the solutions/answers? Can you reframe the problem to fit the data? • Every time you reframe a problem, try to do all three tests again.
Rigorously Define Your Research Problem • Exploratory: what is the scope of exploration? What is the goal of exploration? Can you rigorously answer these questions? • Descriptive: what does it look like? How does it work? Can you formally define a principle? • Evaluative: can you clearly state the assumptions about data collection? Can you rigorously define measures? • Explanatory: how can you rigorously verify a cause? • Predictive: can you rigorously define what prediction is to be made?
Frame a New Computation Task • Define basic concepts • Specify the input • Specify the output • Specify any preferences or constraints
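As an illustration of the four steps above, a familiar task (ad hoc keyword retrieval) might be framed as follows. This is only a sketch: the names Document, RetrievalTask, term_overlap_ranker, and solve are hypothetical and were chosen for this example; they do not come from the slides.

```python
from dataclasses import dataclass
from typing import Callable, List


# Basic concepts: a document is an identifier plus its text (assumption).
@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str


# Input: a keyword query and a document collection.
# Constraint: return at most max_results documents.
@dataclass(frozen=True)
class RetrievalTask:
    query: str
    collection: List[Document]
    max_results: int = 10


# Output: a ranked list of document ids, best first.
Ranker = Callable[[str, List[Document]], List[str]]


def term_overlap_ranker(query: str, collection: List[Document]) -> List[str]:
    """Toy baseline: rank documents by the number of distinct query terms they contain."""
    q_terms = set(query.lower().split())
    scored = [
        (len(q_terms & set(doc.text.lower().split())), doc.doc_id)
        for doc in collection
    ]
    # Higher overlap first; ties broken by doc id so the output is deterministic.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in scored if score > 0]


def solve(task: RetrievalTask, ranker: Ranker) -> List[str]:
    """Apply a ranker to a task and enforce the result-size constraint."""
    return ranker(task.query, task.collection)[: task.max_results]
```

Writing the input, output, and constraints down as types makes it easier to check whether a proposed task is really new or reduces to a known one.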
From a new application to a clearly defined research problem • Try to picture a new system, thus clarify what new functionality is to be provided and what benefit you’ll bring to a user • Among all the system modules, which are easy to build and which are challenging? • Pick a challenge and try to formalize the challenge • What exactly would be the input? • What exactly would be the output? • Is this challenge really a new challenge (not immediately clear how to solve it)? • Yes, your research problem is how to solve this new problem • No, it can be reduced to some known challenge: are existing methods sufficient? • Yes, not a good problem to work on • No, your research problem is how to extend/adapt existing methods to solve your new challenge • Tuning the problem
Tuning the Problem • Axes: Level of Challenges (Known to Unknown) vs. Impact/Usefulness • Make an easy problem harder • Make a hard problem easier • Increase impact (more general)
“Short-Cut” for starting IR research • Scan most recently published papers to find papers that you like or can understand • Read such papers in detail • Track down background papers to increase your understanding • Brainstorm ideas of extending the work • Start with ideas mentioned in the future work part • Systematically question the solidness of the paper (have the authors answered all the questions? Can you think of questions that aren’t answered?) • Is there a better formulation of the problem? • Is there a better method for solving the problem? • Is the evaluation solid? • Pick one new idea and work on it
Formulate Research Hypotheses • Typical hypotheses in IR: • Hypothesis about user characteristics (tested with user studies or user-log analysis, e.g., clickthrough bias) • Hypothesis about data characteristics (tested by fitting actual data, e.g., Zipf’s law) • Hypothesis about methods (tested with experiments): • Method A works (or doesn’t work) for task B under condition C by measure D (feasibility) • Method A performs better than method A’ for task B under condition C by measure D (comparative) • Introducing baselines naturally leads to hypotheses • Carefully study existing literature to figure out where exactly you can make a new contribution (what do you want others to cite your work as?) • The more specialized a hypothesis is, the more likely it’s new, but a narrow hypothesis has lower impact than a general one, so try to generalize as much as you can to increase impact • But avoid over-generalizing (claims must be supported by your experiments) • Tuning hypotheses
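A hypothesis about data characteristics, such as Zipf’s law, can be checked by fitting actual data. The sketch below is a rough illustration rather than a rigorous fitting procedure: it estimates the slope of log(frequency) against log(rank) by ordinary least squares, where a slope near -1 is consistent with the classic form of the law. The function name zipf_slope is an assumption of this example, and least squares on log-log data is a crude estimator chosen here for simplicity.

```python
import math
from collections import Counter
from typing import Iterable


def zipf_slope(tokens: Iterable[str]) -> float:
    """Least-squares slope of log(frequency) versus log(rank) for a token stream."""
    # Term frequencies sorted from most to least frequent; rank 1 is the top term.
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct terms to fit a slope")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    # Ordinary least squares: slope = cov(x, y) / var(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

On a real collection, the tokens would come from whatever tokenizer the experiment uses, and the fit should be inspected visually as well, since the head and tail of the distribution often deviate from a straight line.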
Procedure of Hypothesis Testing • Clearly define the hypothesis to be tested (include any necessary conditions) • Design the right experiments to test it (experiments must match the hypothesis in all aspects) • Carefully analyze results (seek understanding and explanation rather than just description) • Unless you’ve got a complete understanding of everything, always attempt to formulate a further hypothesis to achieve better understanding
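For a comparative hypothesis (method A performs better than method A’ by some measure), one simple way to analyze per-query results is a paired sign test. The slides do not prescribe a particular test; the sign test is used here only because it is easy to state exactly, and the function name sign_test_p_value is an assumption of this sketch.

```python
from math import comb
from typing import Sequence


def sign_test_p_value(scores_a: Sequence[float], scores_b: Sequence[float]) -> float:
    """Two-sided exact sign test on paired per-query scores (ties are dropped)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired: one value per query for each method")

    wins = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    losses = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins + losses
    if n == 0:
        return 1.0  # every query tied, so there is no evidence either way

    # Under the null hypothesis each non-tied query is a fair coin flip.
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value says only that the direction of the difference is unlikely to be chance; looking at which queries A wins and loses is what leads to the further hypotheses mentioned above.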
Clearly Define a Hypothesis • A clearly defined hypothesis helps you choose the right data and right measures • Make sure to include any necessary conditions so that you don’t overclaim • Be clear about any justification for your hypothesis (testing a random hypothesis requires more data than testing a well-justified hypothesis)
Design the Right Experiments • Flawed experiment design is a common cause of rejection of an IR paper (e.g., a poorly chosen baseline) • The data should match the hypothesis • A general claim like “method A is better than B” would need a variety of representative data sets to prove • The measure should match the hypothesis • Multiple measures are often needed (e.g., both precision and recall) • The experiment procedure shouldn’t be biased • Comparing A with B requires using an identical procedure for both • Common mistake: baseline method not tuned or not tuned seriously • Test multiple hypotheses simultaneously if possible (for the sake of efficiency)
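The points about multiple measures and an identical procedure can be sketched as a single evaluation function that is applied unchanged to every method being compared. The data layout below (a run mapping query ids to ranked doc ids, and qrels mapping query ids to sets of relevant doc ids) and the function names are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def precision_recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> Tuple[float, float]:
    """Precision and recall of the top-k results for one query."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def evaluate(run: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int) -> Tuple[float, float]:
    """Mean precision@k and recall@k over all judged queries.

    Queries missing from the run count as empty result lists, so every
    method is scored on exactly the same query set.
    """
    if not qrels:
        raise ValueError("qrels must contain at least one query")
    per_query = [
        precision_recall_at_k(run.get(query_id, []), relevant, k)
        for query_id, relevant in qrels.items()
    ]
    mean_precision = sum(p for p, _ in per_query) / len(per_query)
    mean_recall = sum(r for _, r in per_query) / len(per_query)
    return mean_precision, mean_recall
```

Calling evaluate(run_a, qrels, k) and evaluate(run_b, qrels, k) with the same qrels and the same k keeps the comparison unbiased; the same discipline applies to parameter tuning, which should be given equal effort for the baseline.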