Relevance in big data: an IR perspective

Jian-Yun Nie, University of Montreal

Big data dream • Data available when needed • Data can be processed at a click • Finding needed data as required • Relating data from different sources • Understanding the data correctly • Making appropriate inference/prediction • …

Implicit assumptions underlying some dreams • Data are well structured • We understand what a field means and how it is connected to others • E.g. transactional data • Data are precisely valued • A value has a standard representation and a unique meaning • E.g. Date=20130325

What we can do under the assumptions • Retrieving exactly what we want • Formulate a query formally • The result is what we want • E.g. all of a client's transactions in US dollars, all people who traveled to Canada in 2012 • Discovering patterns and relations among data • Exploit data structure and values • E.g. People who buy bread also buy butter. • Connecting data from different sources • E.g. Finding a picture of the house of a person who bought a fridge a day ago

Some realities in big data • Data are often unstructured • Data values are expressed in flexible and imprecise ways • Date=2013/06/25 vs. 25/06/13 vs. 13年6月25号 (Chinese notation for 25 June 2013) • Date=beginning of summer in 2013 • Date=几天以前 (Chinese for "a few days ago") • Difficulties in making sense of data (understanding) • E.g. the word “china” in a text (the country or porcelain?) • Content of an image or a video sequence • Who/what is shown? • Data may still be too large to be processed completely • All the surveillance videos • All the texts on the web
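
The date examples above can be made concrete with a minimal sketch. It assumes a small hand-written list of formats (the names `CANDIDATE_FORMATS` and `normalize_date` are illustrative, not from the talk) and reads 25/06/13 as day/month/year, which is itself a guess. Imprecise values such as "beginning of summer in 2013" simply fall through, which is the point of the slide.

```python
from datetime import date, datetime
from typing import Optional

# Candidate formats, tried in order. Reading 25/06/13 as day/month/year
# is an assumption; the slide does not say which convention applies.
CANDIDATE_FORMATS = [
    "%Y%m%d",        # 20130325
    "%Y/%m/%d",      # 2013/06/25
    "%d/%m/%y",      # 25/06/13
    "%y年%m月%d号",   # 13年6月25号
]


def normalize_date(raw: str) -> Optional[date]:
    """Return a calendar date, or None when no known format applies."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this format does not fit; try the next one
    return None  # imprecise or unknown expression


if __name__ == "__main__":
    for raw in ["2013/06/25", "25/06/13", "13年6月25号",
                "beginning of summer in 2013", "几天以前"]:
        print(repr(raw), "->", normalize_date(raw))
```

A fixed format list only covers the well-formed variants; relative or vague expressions need interpretation rather than parsing.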

Format of big data • Unstructured texts • Webpages • Microblog posts • Automatic speech recognition (ASR) output • Multimedia data • Sensor data • Surveillance cameras • Pictures • … • How can we query these data?

Basic retrieval in big data • Query: • We can no longer always formulate a requirement in SQL or the like • E.g. finding posts relating to a social event • A picture of someone • We no longer query for data, but for information • Query = expression of some characteristics of the information • Retrieval: • Cannot rely on simple matches of values • There is no unique set of answers • Criterion: Relevance • How relevant is a piece of information to a query?

Data retrieval vs. Information retrieval • Data = structured • Query = exact formal specification • Retrieval = exact value match • Answer = exact set • Information = unstructured • Query = approximate specification • Retrieval = concept match • Answer = ranked list, more or less relevant
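
The contrast on this slide can be sketched in a few lines. The records, posts, and function names below are invented for illustration, and plain term overlap is used only as a crude stand-in for "concept match"; a real system would use a proper retrieval model.

```python
from typing import Dict, List, Tuple

# Hypothetical structured records (data retrieval side).
TRANSACTIONS = [
    {"client": "alice", "currency": "USD", "amount": 120},
    {"client": "alice", "currency": "CAD", "amount": 80},
    {"client": "bob", "currency": "USD", "amount": 45},
]

# Hypothetical unstructured posts (information retrieval side).
POSTS = {
    "p1": "huge crowd at the jazz festival tonight",
    "p2": "traffic jam downtown because of the festival",
    "p3": "new fridge delivered today",
}


def data_retrieval(rows: List[dict], **conditions) -> List[dict]:
    """Exact value match: a row is either in the answer set or not."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in conditions.items())]


def information_retrieval(docs: Dict[str, str],
                          query: str) -> List[Tuple[str, float]]:
    """Approximate match: score each document, return a ranked list."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = query_terms & set(text.lower().split())
        score = len(overlap) / len(query_terms) if query_terms else 0.0
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for stability.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    print(data_retrieval(TRANSACTIONS, client="alice", currency="USD"))
    print(information_retrieval(POSTS, "jazz festival"))
```

The first call returns an exact set; the second returns documents ordered by how well they match, with no sharp boundary between answers and non-answers.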

Dealing with textual data/information • A large part of available data (likely in Big Data) is textual • Webpages • Microblog discussions • Automatic speech recognition (ASR) output • … • IR has been in the big textual data era for some years

How do we process big textual data? • Infrastructure: computer clusters, cloud • Distributed computing: MapReduce • Understanding texts (NLP): indexing, information extraction, named entities, … • Retrieval models: define a matching and ranking function • Goal: approximate the user’s relevance judgments
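
As an illustration of how MapReduce and indexing fit together, here is a single-process sketch of building an inverted index in map, shuffle, and reduce steps. It only mimics the programming model; the function names are made up for this example and nothing here is distributed.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(doc_id: str, text: str) -> Iterator[Tuple[str, str]]:
    """Map: emit one (term, doc_id) pair per distinct term in a document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle: group all emitted doc ids by term."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for term, doc_id in pairs:
        grouped[term].append(doc_id)
    return grouped


def reduce_phase(term: str, doc_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce: turn the grouped doc ids into a sorted postings list."""
    return term, sorted(doc_ids)


def build_index(docs: Dict[str, str]) -> Dict[str, List[str]]:
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids)
                for term, ids in shuffle(pairs).items())


if __name__ == "__main__":
    index = build_index({"d1": "big data retrieval", "d2": "Big Data mining"})
    for term in sorted(index):
        print(term, index[term])
```

On a cluster, the map calls would run on different machines and the shuffle would move pairs across the network; the per-term logic stays the same.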

Relevance in IR • No formal definition of relevance • Relevance varies with the query, user, time, … • IR: defining models to approximate user relevance as closely as possible • Boolean model, language model • Learning-to-rank: learn an approximate function of relevance from samples • Learning from users (query logs)
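
To show what "a model that approximates relevance" can look like, here is a small sketch of one language-model variant: query likelihood with Jelinek-Mercer smoothing. The choice of smoothing, the value `lam=0.5`, and the toy documents are assumptions made for illustration; the talk only names the model family.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def rank_by_query_likelihood(docs: Dict[str, str], query: str,
                             lam: float = 0.5) -> List[Tuple[str, float]]:
    """Rank documents by log P(query | document), mixing each document's
    term distribution with the collection's (Jelinek-Mercer smoothing)."""
    doc_tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
    coll_tf: Counter = Counter()
    for tf in doc_tf.values():
        coll_tf.update(tf)
    coll_len = sum(coll_tf.values())

    scores = {}
    for doc_id, tf in doc_tf.items():
        doc_len = sum(tf.values())
        score = 0.0
        for term in query.lower().split():
            if coll_tf[term] == 0:
                continue  # term unseen everywhere: cannot separate documents
            p_doc = tf[term] / doc_len if doc_len else 0.0
            p_coll = coll_tf[term] / coll_len
            score += math.log(lam * p_doc + (1 - lam) * p_coll)
        scores[doc_id] = score
    # Highest log-likelihood first; ties broken by document id.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    toy_docs = {
        "d1": "big data retrieval models",
        "d2": "weather and traffic jam",
        "d3": "relevance in information retrieval",
    }
    for doc_id, score in rank_by_query_likelihood(toy_docs,
                                                  "information retrieval"):
        print(doc_id, round(score, 3))
```

The score is only a proxy for the user's judgment: the model never sees relevance directly, which is why learning from samples or query logs is listed as a complement.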

Examples of using web data • Usage frequency • What usage is correct in English? • Correlation between economic development and search behavior • Higher correlation with searches about the future → more developed • Using approximate statistics based on relevant data

Relevance as a basic concept in big data • Retrieving a relevant subset of data corresponding (more or less) to a criterion • User does not know how data is organized • Mining correlations/relationships between subsets of data / ranking lists • E.g. Are people active in microblogs also active in society? • Prediction using more abstract information concepts instead of data • E.g. youngsters familiar with IT will …

Assessing the quality of a retrieval model • Any IR system can retrieve a set of answers • A system is not very useful if it only happens to find some relevant ones for a query from time to time • Ideally, it should always do so • Implemented relevance ≈ User relevance • Criteria of quality • Desired answers vs. system answers • Precision + Recall • nDCG, …
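
The quality criteria named on this slide can be written down directly. The sketch below uses the standard set-based definitions of precision and recall, and one common form of nDCG (gain divided by log2 of rank + 1); the function names and the toy judgments are illustrative assumptions.

```python
import math
from typing import Dict, List, Set, Tuple


def precision_recall(retrieved: List[str],
                     relevant: Set[str]) -> Tuple[float, float]:
    """Precision = hits / retrieved, recall = hits / relevant."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: lower ranks contribute less."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))


def ndcg(ranking: List[str], judged_gain: Dict[str, float], k: int) -> float:
    """DCG of the system ranking divided by DCG of the ideal ranking.
    Documents without a judgment are treated as gain 0 (an assumption)."""
    actual = [judged_gain.get(doc_id, 0.0) for doc_id in ranking[:k]]
    ideal = sorted(judged_gain.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(actual) / best if best > 0 else 0.0


if __name__ == "__main__":
    system_ranking = ["a", "b", "c", "d"]
    judgments = {"a": 3.0, "c": 2.0, "e": 1.0}  # graded relevance
    print(precision_recall(system_ranking, set(judgments)))
    print(ndcg(system_ranking, judgments, k=4))
```

Note that recall and the ideal ranking both require knowing every relevant document, which is exactly what becomes hard at big-data scale.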

Quality of a system on big data • Retrieval system: Is the data retrieved relevant? • How can we assess the quality of retrieval in big data? • Precision is easy to assess • Recall is impossible to assess • nDCG is not always enough • May depend on the application

Should we assess the quality of a mining system? • Success story: A system can successfully mine the relation between “buying bread” and “buying butter”. • Informed mining: Humans make a hypothesis • However, what if the system also mines a large number of irrelevant relations at the same time? • Still useful: Can provide possible relations to human experts • More useful: find candidate relations of higher quality • Precision/Recall • Can we define standard answers (data or relations to be mined)? Let’s dream!
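
One way to picture precision for a mining system is sketched below: mine pairwise rules from toy shopping baskets using support and confidence thresholds, then compare the mined rules with a set of expert-validated rules. The baskets, thresholds, and the validated set are all invented for illustration; whether such "standard answers" can be defined in practice is the open question on the slide.

```python
from itertools import permutations
from typing import Dict, List, Set, Tuple

Rule = Tuple[str, str]  # (antecedent, consequent), e.g. ("bread", "butter")


def mine_pair_rules(baskets: List[Set[str]], min_support: float = 0.5,
                    min_confidence: float = 0.7
                    ) -> Dict[Rule, Tuple[float, float]]:
    """Return rules a -> b with their (support, confidence)."""
    n = len(baskets)
    items = sorted(set().union(*baskets))
    rules: Dict[Rule, Tuple[float, float]] = {}
    for a, b in permutations(items, 2):
        has_a = sum(1 for basket in baskets if a in basket)
        both = sum(1 for basket in baskets if a in basket and b in basket)
        support = both / n          # share of baskets containing a and b
        confidence = both / has_a   # share of a-baskets that also contain b
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules


def rule_precision(mined: Dict[Rule, Tuple[float, float]],
                   validated: Set[Rule]) -> float:
    """Fraction of mined rules that experts accepted (hypothetical set)."""
    if not mined:
        return 0.0
    return len(set(mined) & validated) / len(mined)


if __name__ == "__main__":
    toy_baskets = [
        {"bread", "butter"},
        {"bread", "butter", "milk"},
        {"bread", "jam"},
        {"milk"},
        {"bread", "butter"},
    ]
    mined = mine_pair_rules(toy_baskets)
    print(mined)
    print(rule_precision(mined, {("bread", "butter")}))
```

Lowering the thresholds produces more candidate rules and typically lower precision, which mirrors the trade-off between "still useful" and "more useful" above.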

From data processing to information processing • Data → Information → Knowledge • Data retrieval: exact representation • Information: flexible representation • Mining knowledge from data → Mining knowledge from information • E.g. bad weather → more traffic jams? • Relevance as a basic notion in big data • Access relevant data/information • Mining relevant properties on relevant data/information

Thanks
