LIS618 lecture 1 Thomas Krichel 2003-01-29
economic rationale for the traditional model • In the old days the cost of telecommunication was high. • database use costs • cost of communication • cost of access time to the database • the traditional model puts an upper bound on these costs
disintermediation • with access-time costs gone, the traditional model is under threat • there is disintermediation, where the librarian loses her role • but that may not be good news for information retrieval results • the user knows the subject matter best • the librarian knows searching best
Web searching • IR has received a lot of impetus from the web, which poses unprecedented search challenges. • with more and more data appearing on the web, DS may be a subject in decline • it is primarily concerned with non-web databases • there are more and more web-based methods of searching
Quote to think about • “Clearly, intermediated searching has passed its prime. No longer does a search require a searcher—at least not a professional one.” • By Barbara Quint, in “Quint’s Online”, a regular column she contributes to Information Today. • It appeared in the 2002-12 issue.
Quint’s points • It was foreseeable in 1992 that the public at large would be able to do online searching. • At the same time, the need for quality answers has grown. • Google ask-a service • Quality-filtered services will become more important.
Quint’s points • From the requirement for vetted sources, it does not follow that formal publishing and databases will flourish. • The current offerings will have to change. In the current databases, there is a lot that is already available for free, mixed with quality-controlled material. • Item-based pricing implies the same price for all items. • Subscription-based pricing still means that the user has to make the quality judgment.
Quint’s points • A good business news service would • extract reports from company sites • get external reviews of the reports • post statistical counts about the company • offer to put users and authors in touch so that authors could do private research for a user. • Publishers have direct offerings, and intermediated vending is in decline. • Example: CSA pulls out of DIALOG.
Main theory part • Literature: "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto • Don't buy it. It is not a good book.
before the IR process • the provider defines the data that is available: the documents that can be used, document operations, document structure, and the index • the user brings a user need and a degree of familiarity with the IR system
the IR process • the query expresses the user need in a query language • processing of the query yields retrieved documents • calculation of a relevance ranking • examination of the retrieved documents • possible relevance feedback cycle
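To make the steps above concrete, here is a minimal sketch in Python of one pass through the process. The document collection, the index, and the ranking rule are invented for illustration and are not taken from the lecture.

```python
# A minimal sketch of one pass through the IR process above. The documents,
# the index, and the ranking rule (count how many query terms a document
# contains) are invented for illustration; none of this is from the lecture.

def retrieve(query, index, documents):
    """Parse the query, retrieve matching documents, and rank them."""
    terms = query.lower().split()                          # query processing
    matched = {doc_id for t in terms for doc_id in index.get(t, ())}
    # relevance ranking: documents containing more query terms come first
    return sorted(matched,
                  key=lambda d: sum(t in documents[d] for t in terms),
                  reverse=True)

documents = {1: "economics of information retrieval",
             2: "web searching and retrieval"}
index = {"economics": [1], "information": [1], "retrieval": [1, 2],
         "web": [2], "searching": [2]}

print(retrieve("web retrieval", index, documents))   # -> [2, 1]
# examining the results could then lead to a reformulated query (the cycle)
```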
main problem • the user is not an expert at formulating a query • garbage in, garbage out: the retrieval yields poor results • ways out • design a very intuitive query interface • give expert guidance
taxonomy of classic IR models • Boolean, or set-theoretic: fuzzy set models, extended Boolean • vector, or algebraic: generalized vector model, latent semantic indexing, neural network model • probabilistic: inference network, belief network
summary • There are three basic types of models in classic information retrieval. • Extensions of these types are a matter of research concern and require good mathematical skills. • All classic models treat documents as individual pieces.
key aid: index • an index is a list of terms, each with a list of locations where the term is to be found. • The way locations are expressed usually depends on the form the indexed data takes. • for a book, it is usually the page number, e.g. "shmoo 34, 75" • for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209" • there is usually more than one location for a term.
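As a rough illustration of the file-based index just described, the sketch below builds an index that maps each term to the file name and the offset where the term starts; the file names and texts are invented.

```python
# A rough sketch of the file-based index described above: each term maps to
# (file name, offset of the term's first character). For plain ASCII text the
# character offset equals the byte offset. File names and texts are made up.

import re
from collections import defaultdict

def build_index(files):
    """files: dict mapping file name -> text; returns term -> [(file, offset)]."""
    index = defaultdict(list)
    for name, text in files.items():
        for match in re.finditer(r"\w+", text):
            index[match.group().lower()].append((name, match.start()))
    return index

files = {"index.html": "Thomas Krichel teaches LIS618",
         "cv.html": "Krichel maintains a CV"}
index = build_index(files)
print(index["krichel"])   # -> [('index.html', 7), ('cv.html', 0)]
```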
key aid: index terms • an index term is a part of the document that has a meaning of its own. • it is usually a noun. • retrieval based on index terms raises questions • the semantics of the query or document is lost • matching is done in the imprecise space of index terms • predicting relevance is a central problem • the IR model determines the process of relevance ranking
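A hedged sketch of how index terms might be picked out of a document: real noun detection would require part-of-speech tagging, so a small stop-word list stands in for it here; both the list and the example sentence are invented.

```python
# A crude sketch of extracting index terms. Detecting nouns properly would
# need part-of-speech tagging; a small stop-word list stands in for it here.

STOPWORDS = {"the", "a", "an", "of", "in", "is", "to", "and", "for"}

def index_terms(text):
    """Return candidate index terms of a document, in order of appearance."""
    words = (w.lower().strip(".,;:!?") for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

print(index_terms("The cost of access to the database is high."))
# -> ['cost', 'access', 'database', 'high']  (note: 'high' is not a noun,
# which shows why a stop-word list is only a rough substitute for tagging)
```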
basic concept: weight of index term • among all the nouns, not all appear to have the same relevance to the text • sometimes we have a simple measure of the importance of a term (example?) • more generally, for each index term and each document we can associate a weight with the term and the document. • usually, if the document does not contain the term, its weight is zero
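One simple measure of a term's importance is how often it occurs. The sketch below assumes, purely for illustration, that the weight of a term in a document is its relative frequency, with weight zero when the term is absent.

```python
# A minimal sketch of index-term weights, assuming (our choice, not the
# lecture's) that a term's weight in a document is its relative frequency;
# a term that does not occur gets weight zero.

from collections import Counter

def term_weights(document, vocabulary):
    """Return a weight for every vocabulary term with respect to the document."""
    words = document.lower().split()
    counts = Counter(words)
    total = len(words) or 1          # avoid division by zero for empty text
    return {term: counts[term] / total for term in vocabulary}

vocab = ["retrieval", "database", "web"]
print(term_weights("web searching and web retrieval", vocab))
# -> {'retrieval': 0.2, 'database': 0.0, 'web': 0.4}
```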
Boolean model • in the Boolean model, the weight of an index term for a document is 1 if the term appears in the document, and 0 otherwise. • This allows query terms to be combined with the Boolean operators AND, OR, and NOT • thus powerful queries can be written
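Because every weight in the Boolean model is either 0 or 1, each term can be identified with the set of documents that contain it, and Boolean queries become set operations. A minimal sketch under that reading, over a small invented collection:

```python
# A minimal sketch of the Boolean model: since weights are only 0 or 1, each
# term can be represented by the set of documents containing it, and AND, OR,
# NOT become set operations. The small collection below is invented.

documents = {1: "information retrieval on the web",
             2: "database searching for librarians",
             3: "web database searching"}

# posting sets: term -> set of document ids in which the term appears
postings = {}
for doc_id, text in documents.items():
    for term in text.lower().split():
        postings.setdefault(term, set()).add(doc_id)

def docs(term):
    """Documents whose weight for this term is 1."""
    return postings.get(term, set())

print(docs("web") & docs("database"))          # web AND database       -> {3}
print(docs("retrieval") | docs("searching"))   # retrieval OR searching -> {1, 2, 3}
print(docs("searching") - docs("web"))         # searching AND NOT web  -> {2}
```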
http://openlib.org/home/krichel Thank you for your attention!