Modern Information Retrieval Chap. 06: Text and Multimedia Languages and Properties (Introduction, Metadata and Text) 6.1, 6.2, 6.3
Introduction • Text • main form of communicating knowledge. • Document • loosely defined, denotes a single unit of information. • can be any physical unit • a file • an email • a Web Page
Introduction • Document • Syntax and structure • Semantics • Information about itself
Introduction • Document Syntax • Implicit, or expressed in a language (e.g., TeX) • Powerful languages: easier to parse, but difficult to convert to other formats. • Open languages are better (for interchange) • The semantics of natural-language text is not easy for a computer to understand • Trend: languages that provide information on structure, format, and semantics while being readable by both humans and computers
Introduction • New applications are pushing for formats in which information can be represented independently of style. • Style: defined by the author, but the reader may decide part of it • Style can include treatment of other media
Metadata • “Data about the data” • e.g., in a DBMS, the schema specifies the names of the relations, attributes, domains, etc. • Descriptive Metadata • Author, source, length • Dublin Core Metadata Element Set • Semantic Metadata • Characterizes the subject matter within the document contents • MEDLINE
Metadata • Metadata information on Web documents • cataloging, content rating, property rights, digital signatures • New standard: Resource Description Framework • description of Web resources to facilitate automated processing of information • nodes and attached attribute/value pairs • Metadescription of non-textual objects • keywords can be used to search the objects
RDF Model • A model is a collection of statements • Statement := (predicate, subject, object) • Predicate is a resource • Subject is a resource • Object is either a resource or a literal • [Figure: a statement, drawn as a Predicate arc from a Subject node to an Object node]
RDF model and natural language • Subject. In grammar, this is the noun or noun phrase that is the doer of the action. In the sentence “The company sells batteries,” the subject is “the company.” • Predicate. In grammar, this is the part of a sentence that modifies the subject and includes the verb phrase. In our sentence, the predicate is the phrase “sells” • Object. In grammar this is a noun that is acted upon by the verb. In our sentence, the object is the noun “batteries.”
XML vs. RDF • RDF is not just an XML dialect. • XML: • Has a tree structure data model. • Only nodes are labeled. • RDF: • Has a graph structure data model. • Both edges (properties) and nodes (subjects/objects) are labeled.
Linking Statements • The subject of one statement can be the object of another • Such collections of statements form a directed, labeled graph • [Figure: graph with nodes Ganji, CE, Sharif and http://ce.sharif.edu, and edges studentOF, departmentOF and hasHomePage]
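A minimal Python sketch of the statement model and of linked statements, using the Ganji / CE / Sharif resources from the figure. The `Statement` tuple, the `objects_of` and `follow` helpers, and the plain-string resource names are illustrative assumptions, not part of any RDF toolkit; the edge directions are taken from the RDF/XML example given for the same resources.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    """One RDF statement (triple)."""
    subject: str
    predicate: str
    obj: str  # a resource name or a literal value


# A model is just a collection of statements.
model = {
    Statement("Ganji", "studentOf", "CE"),
    Statement("CE", "departmentOf", "Sharif"),
    Statement("CE", "hasHomePage", "http://ce.sharif.edu"),
}


def objects_of(model, subject, predicate):
    """Return every object o such that (subject, predicate, o) is in the model."""
    return {s.obj for s in model
            if s.subject == subject and s.predicate == predicate}


def follow(model, start, path):
    """Walk a chain of predicates from `start` through the labeled graph."""
    frontier = {start}
    for predicate in path:
        frontier = {o for node in frontier
                    for o in objects_of(model, node, predicate)}
    return frontier


# "CE" is the object of the first statement and the subject of the other two,
# so the statements link into a path from Ganji to the department home page.
print(follow(model, "Ganji", ["studentOf", "hasHomePage"]))
```

Storing statements in a set mirrors the idea that a model is an unordered collection; a real store would index by subject or predicate instead of scanning.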
RDF Graph: ‘anonymous’ nodes • [Figure: graph with node labels Person, PersonName, Literal and Person12345, the property person.name, and first/last values Jonathan and Borden]
How can RDF be implemented • Usually RDF/XML syntax • However other notations are possible • e.g. Notation3: • Buddy Belden owns a business. • The business has a Web site accessible at http://www.c2i2.com/~budstv. • Buddy is the father of Lynne. • <#Buddy> <#owns> <#business>. • <#business> <#has-website> <http://www.c2i2.com/~budstv>. • <#Buddy> <#father-of> <#Lynne>.
Converting N3 to RDF • Jena toolkit can do such conversion
XML Syntax for RDF • RDF has an XML syntax that has a specific meaning: • Every Description element describes a resource • Every attribute or nested element inside a Description is a property of that resource • We can refer to resources by using URIs <rdf:Description about="some.uri/person/ganji"> <studentOf resource="some.uri/Sharif/CE"/> </rdf:Description> <rdf:Description about="some.uri/Sharif/CE"> <hasHomePage>http://ce.sharif.edu</hasHomePage> <departmentOf resource="some.uri/~Sharif"/> </rdf:Description>
RDF type • RDF predefined property • Its value – a resource that represents a category or class • Its subject – an instance of that category or class • prefix ex: URI: http://www.example.org/terms
Containers • Containers are collections • they allow grouping of resources (or literal values) • It is possible to make statements about the container (as a whole) or about its members individually • It is also possible to create collections based on URI patterns • for example, all files in a particular web site
RDF containers • Bag: (A resource having type rdf:Bag) • Represents an unordered list of resources or literals • Duplicate values are permitted • Sequence: (A resource having type rdf:Seq) • Represents an ordered list of resources or literals • Duplicate values are permitted • Alternatives: (A resource having type rdf:Alt) • Represents a group of resources or literals that are alternatives
Sequence example • [Figure: the resource http://www.w3.org/TR/REC-rdf-syntax has a dc:Creator node of rdf:type rdf:Seq, whose members rdf:_1 and rdf:_2 are “Ora Lassila” and “Ralph Swick”]
RDF Schema (RDFS) • RDF gives a formalism for meta data annotation, and a way to write it down in XML, but it does not give any special meaning to vocabulary such as subClassOf or type • RDF Schema allows you to define vocabulary terms and the relations between those terms • it gives “extra meaning” to particular RDF predicates and resources • this “extra meaning”, or semantics, specifies how a term should be interpreted
Core Classes & Properties • Core Classes: rdfs:Resource, rdfs:Literal, rdf:XMLLiteral, rdfs:Class, rdf:Property • Core Properties: rdf:type, rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain, rdfs:range, rdfs:label, rdfs:comment
RDFS Examples <Person,type,Class> <hasColleague,type,Property> <Professor,subClassOf,Person> <Carole,type,Professor> <hasColleague,range,Person> <hasColleague,domain,Person>
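The “extra meaning” RDFS attaches to type, subClassOf, domain and range can be sketched as a few inference rules applied until nothing new is derived. The Python below is a simplified illustration over the example triples, not a complete RDFS reasoner; the function name `rdfs_closure` and the statement `(Carole, hasColleague, Alice)` are assumptions added here to exercise the domain and range rules.

```python
def rdfs_closure(triples):
    """Apply four simplified RDFS rules until a fixed point is reached."""
    closure = set(triples)
    while True:
        new = set()
        for s, p, o in closure:
            for s2, p2, o2 in closure:
                # Instances of a class are instances of its superclasses.
                if p == "type" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "type", o2))
                # subClassOf is transitive.
                if p == "subClassOf" and p2 == "subClassOf" and s2 == o:
                    new.add((s, "subClassOf", o2))
                # The subject of a property belongs to the property's domain.
                if p2 == "domain" and s2 == p:
                    new.add((s, "type", o2))
                # The object of a property belongs to the property's range.
                if p2 == "range" and s2 == p:
                    new.add((o, "type", o2))
        if new <= closure:
            return closure
        closure |= new


schema = {
    ("Person", "type", "Class"),
    ("hasColleague", "type", "Property"),
    ("Professor", "subClassOf", "Person"),
    ("Carole", "type", "Professor"),
    ("hasColleague", "range", "Person"),
    ("hasColleague", "domain", "Person"),
}

# Hypothetical extra statement, not in the slide, to trigger domain/range.
data = schema | {("Carole", "hasColleague", "Alice")}

derived = rdfs_closure(data) - data
print(sorted(derived))
```

The nested scan is quadratic in the number of triples, which is acceptable for a handful of statements but not how a production reasoner would work.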
RDF/RDFS “Liberality” • No distinction between classes and instances (individuals) <Species,type,Class> <Lion,type,Species> <Leo,type,Lion> • Properties can themselves have properties <hasDaughter,subPropertyOf,hasChild> <hasDaughter,type,familyProperty> • No distinction between language constructors and ontology vocabulary, so constructors can be applied to themselves/each other <type,range,Class> <Property,type,Class> <type,subPropertyOf,subClassOf>
Problems with RDFS • RDFS too weak to describe resources in sufficient detail • No localised range and domain constraints • Can’t say that the range of hasChild is person when applied to persons and elephant when applied to elephants • No existence/cardinality constraints • Can’t say that all instances of person have a mother that is also a person, or that persons have exactly 2 parents • No transitive, inverse or symmetrical properties • Can’t say that isPartOf is a transitive property, that hasPart is the inverse of isPartOf or that touches is symmetrical • … • Difficult to provide reasoning support • No “native” reasoners for non-standard semantics • May be possible to reason via FO axiomatisation
RDF(S) tools • Read RDF data • Parsers: Jena, Redland, SWI-Prolog • Validators: W3C RDF validation service • Editors: IsaViz, RDF Author, RDFEd, InferEd • Store RDF data (XML format, triples or relational/oo DB) • Sesame, RSSDB, RDFLib • Use RDF data (applications, RSS news, etc.) • Manipulate RDF data (inference, query, etc.) • Jena RDQL, etc. • Example: SELECT ?person, ?knows WHERE (?x <http://xmlns.com/foap/knows> ?z), (?x <http://xmlns.com/foap/name> ?person), (?z <http://xmlns.com/foap/name> ?knows)
RDF Validators • RDF Validation Service • http://www.w3.org/RDF/Validator/ • In general all the RDF parsers do some kind of validation
References • RDF Resource Guide: • http://www.ilrt.bris.ac.uk/discovery/rdf/resources/ • http://www.w3.org/RDF • http://www.w3.org/RDF/Validator/
Text • Text coding in bits • EBCDIC, ASCII • Initially, 7 bits. Later, 8 bits • Unicode • 16 bits, to accommodate oriental languages
Text • Formats • No single format exists • IR system should retrieve information from different formats • Past: IR systems convert the documents • Today: IR systems use filters
Text • Formats • Formats for document interchange (RTF) • Formats for displaying (PDF, PostScript) • Formats for encoding email (MIME) • Compressed files • uuencode/uudecode, binhex
Text • Information Theory • The amount of information is related to the distribution of symbols in the document. • Entropy: $E = -\sum_{i=1}^{\sigma} p_i \log_2 p_i$, where $p_i$ is the probability of the $i$-th symbol and $\sigma$ is the alphabet size • The definition of entropy depends on the probabilities of each symbol. • Text models are used to obtain those probabilities
Text • Example - Entropy • 001001011011
Text • Example - Entropy • 111111111111
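The values for these two examples did not survive in the slides, so here is a short Python sketch that computes them. It assumes the simplest text model, in which each symbol's probability is its relative frequency in the string itself; the function name `entropy` is illustrative.

```python
from collections import Counter
from math import log2


def entropy(text):
    """Zero-order entropy in bits per symbol, using observed frequencies."""
    counts = Counter(text)
    total = len(text)
    # p * log2(1/p) summed over symbols, with p = count / total.
    return sum((c / total) * log2(total / c) for c in counts.values())


print(entropy("001001011011"))  # 1.0: six 0s and six 1s, one bit per symbol
print(entropy("111111111111"))  # 0.0: a single symbol carries no information
```

A model that looks at previous symbols could assign different probabilities, and therefore a different entropy, to the first string.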
Text • Modeling Natural Language • Symbols: separate words or belong to words • Symbols are not uniformly distributed • binomial model • Dependency of previous symbols • k-order markovian model • We can take words as symbols
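Text • Modeling Natural Language • Distribution of words inside documents • Zipf’s Law: the i-th most frequent word appears $1/i^{\theta}$ times as often as the most frequent word ($\theta = 1$ in the simplest form); hence, in a text of $n$ words with a vocabulary of $V$ words, the i-th most frequent word appears $n / (i^{\theta} H_V(\theta))$ times, where $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$ • Real data fits better with $\theta$ between 1.5 and 2.0
Text • Modeling Natural Language • Example - word distribution (Zipf’s Law) • V=1000, $\theta$ = 2 • most frequent word: n=300 • 2nd most frequent: n=76 • 3rd most frequent: n=33 • 4th most frequent: n=19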
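A small Python sketch of the generalized Zipf distribution, assuming the form in which the i-th most frequent word occurs $n / (i^{\theta} H_V(\theta))$ times with $H_V(\theta) = \sum_{j=1}^{V} 1/j^{\theta}$. The text size `n=500` is a hypothetical value, since the example does not state one, so the printed counts need not match the slide exactly; `zipf_counts` is an illustrative name.

```python
def zipf_counts(n, vocab_size, theta):
    """Expected occurrences of each word rank under generalized Zipf's Law."""
    # Normalizing constant: the harmonic number of order theta.
    harmonic = sum(1 / j ** theta for j in range(1, vocab_size + 1))
    return [n / (i ** theta * harmonic) for i in range(1, vocab_size + 1)]


# V and theta come from the example; n is an assumption.
counts = zipf_counts(n=500, vocab_size=1000, theta=2)
print([round(c) for c in counts[:4]])
```

With $\theta = 2$ each count is the top count divided by $i^2$, which is the pattern visible in the example (the 2nd word at roughly a quarter of the 1st, the 4th at roughly a sixteenth).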
Text • Modeling Natural Language • Number of distinct words • Heaps’ Law: $V = K n^{\beta}$, where $n$ is the text size in words and $V$ is the vocabulary size • The set of different words is bounded by a constant, but the limit is too high to be useful in practice
Text • Modeling Natural Language • Heaps’ Law example • K between 10 and 100, $\beta$ is less than 1 • example: n=400000, $\beta$ = 0.5 • K=25, V=15811 • K=35, V=22135
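The example figures can be checked with a few lines of Python, assuming Heaps' Law in the form $V = K n^{\beta}$. The function name `heaps_vocabulary` is illustrative, and the result is truncated to an integer to match the values shown.

```python
def heaps_vocabulary(n, k, beta):
    """Estimated vocabulary size for a text of n words under Heaps' Law."""
    return k * n ** beta


for k in (25, 35):
    # Truncate rather than round, which reproduces the example values.
    print(k, int(heaps_vocabulary(400_000, k, 0.5)))
# prints "25 15811" and "35 22135"
```

Because $\beta < 1$, the vocabulary grows sublinearly: multiplying the text size by four only doubles the vocabulary when $\beta = 0.5$.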
Text • Modeling Natural Language • Length of the words • defines total space needed for vocabulary • Heaps’ Law: length increases logarithmically with text size. • In practice, a finite-state model is used • space has p=0.2 • space cannot appear twice in a row • there are 26 letters
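One way to read the finite-state model is as a two-state generator: after a space the next character must be a letter, and after a letter the next character is a space with probability 0.2. The Python sketch below follows that reading; the uniform choice among the 26 letters, the fixed seed, and the name `generate_text` are assumptions made for illustration.

```python
import random
import string


def generate_text(length, p_space=0.2, seed=0):
    """Generate characters from a two-state model: 'after space' / 'after letter'."""
    rng = random.Random(seed)
    out = []
    after_space = True  # start as if a space was just emitted
    for _ in range(length):
        if not after_space and rng.random() < p_space:
            out.append(" ")
            after_space = True
        else:
            # Assumption: the 26 letters are equally likely.
            out.append(rng.choice(string.ascii_lowercase))
            after_space = False
    return "".join(out)


text = generate_text(10_000)
words = text.split()
print(sum(len(w) for w in words) / len(words))  # average word length
```

Under this reading a word is one forced letter plus a geometric number of further letters, so its expected length is 1 + 0.8/0.2 = 5 characters; the simulated average should land near that value.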
Text • Similarity Models • Distance Function • Should be symmetric and satisfy the triangle inequality • Hamming Distance • number of positions that have different characters • Example: reverse, receive
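A minimal Python sketch of the Hamming distance applied to the two example words. The function name `hamming` is illustrative, and raising an error on unequal lengths is a design choice made here, since the distance is only defined for strings of the same length.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(1 for x, y in zip(a, b) if x != y)


print(hamming("reverse", "receive"))  # 3: positions 3, 5 and 6 differ
```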
Text • Similarity Models • Edit (Levenshtein) Distance • minimum number of character insertions, deletions, and substitutions needed to make two strings equal • Example: survey, surgery • superior for modeling syntactic errors • extensions: weights, transpositions, etc.
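The edit distance is usually computed with dynamic programming. The Python sketch below is one standard formulation with unit cost for insertions, deletions, and substitutions (no weights or transpositions); the name `levenshtein` is illustrative.

```python
def levenshtein(a, b):
    """Edit distance with unit-cost insert, delete, and substitute."""
    # previous[j] holds the distance between the current prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb, or match
            ))
        previous = current
    return previous[-1]


print(levenshtein("survey", "surgery"))  # 2: substitute v with g, insert r
```

Keeping only two rows of the table uses memory proportional to the shorter dimension, while the running time stays proportional to the product of the two lengths.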
Text • Similarity Models • Longest Common Subsequence (LCS) • Example: survey, surgery; LCS: surey • For documents: lines as symbols (diff in Unix) • Time consuming
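The LCS of the two example words can be recovered with the classic dynamic-programming table, sketched below in Python. The name `lcs` is illustrative; the quadratic table is what makes the method time consuming on long inputs.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    # table[i][j] = length of the LCS of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover the subsequence.
    out = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("survey", "surgery"))  # surey
```

Comparing documents line by line, as diff does, applies the same idea with whole lines in place of characters.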
Conclusions • Text is the main form of communicating knowledge. • Documents have syntax, structure and semantics • Metadata: information about data • Formats of text • Modeling Natural Language • Entropy • Distribution of symbols • Similarity