
Superficial & Lexical level 1



Presentation Transcript


  1. Superficial & Lexical level 1 • Superficial level • What is a word • Lexical level • Lexicons • How to acquire lexical information

  2. Superficial level 1 • Textual preprocessing • Getting the document(s) • Accessing a DB • Accessing the Web (wrappers) • Getting the textual fragments of a document • Multimedia documents, Web pages, ... • Filtering out meta-information • HTML tags, XML tags, ...
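A minimal sketch of the "filtering out meta-information" step, assuming the document is an HTML string. It uses Python's standard `html.parser` module; the class name `TextExtractor` and the function `extract_text` are hypothetical names chosen for illustration, not something taken from the slides.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the textual fragments of an HTML document, tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = "<html><head><style>p{}</style></head><body><p>Hello <b>world</b></p></body></html>"
print(extract_text(page))
```

Real Web pages usually need more than this (navigation menus, boilerplate, encoding detection), which is where the wrappers mentioned on the slide come in.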

  3. Superficial level 2 • Text segmentation into paragraphs or sentences • Tokenization • orthographic vs grammatical word • Multiword terms • dates, formulas, acronyms, abbreviations, quantities (and units), idioms, ... • Named entities • NER, NEC, NERC • Unknown words • Language identification • Cited on the slide: Beeferman et al., 1999; Ratnaparkhi, 1998; Bikel et al., 1999; Borthwick, 1999; Mikheev et al., 1999; Elworthy, 1999; Adams and Resnik, 1997
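As a rough illustration of why tokenization is more than splitting on spaces, here is a small regex-based tokenizer sketch that keeps dates, dotted acronyms and numbers together as single tokens. The pattern and the name `tokenize` are assumptions made for this example; real tokenizers handle many more cases (units, idioms, named entities).

```python
import re

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_RE = re.compile(
    r"""
    \d{1,2}/\d{1,2}/\d{2,4}        # dates such as 12/03/1999
  | (?:[A-Za-z]\.){2,}             # dotted acronyms such as U.S.A.
  | \d+(?:[.,]\d+)*                # numbers such as 150,000 or 0.4
  | \w+(?:[-']\w+)*                # words, allowing inner hyphens and apostrophes
  | [^\w\s]                        # any other single non-space symbol
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into orthographic tokens using TOKEN_RE."""
    return TOKEN_RE.findall(text)


print(tokenize("The U.S.A. spent 150,000 dollars on 12/03/1999, didn't it?"))
```

Note that "didn't" comes out as one orthographic word here, although grammatically it corresponds to two words ("did" + "not"); that is the orthographic vs grammatical distinction on the slide.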

  4. Superficial level 3 • Vocabulary size (V) • Heaps' Law • $V = K N^{\beta}$ • K depends on the text, $10 \le K \le 100$ • N is the total number of words • $\beta$ depends on the language; for English, $0.4 \le \beta \le 0.6$ • Vocabulary grows sublinearly but does not saturate • $\beta$ tends to stabilize at about 1 MB of text (150,000 words) • [Figure: number of different words plotted against number of words]
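Heaps' Law can be made concrete with a short sketch: one function evaluates the prediction V = K * N^beta, the other measures the observed vocabulary growth of a token list so the two can be compared. The function names, and the values K = 10 and beta = 0.5 in the usage line, are illustrative assumptions picked from the ranges given on the slide.

```python
def heaps_vocabulary(n_tokens, k, beta):
    """Predicted vocabulary size V = K * N**beta."""
    return k * n_tokens ** beta


def vocabulary_growth(tokens):
    """Observed (N, V) pairs: V distinct words seen after reading N tokens."""
    seen = set()
    growth = []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth


# Prediction for one million tokens with assumed K = 10 and beta = 0.5.
print(heaps_vocabulary(1_000_000, k=10, beta=0.5))

# Observed growth on a toy token list.
print(vocabulary_growth(["a", "b", "a", "c"]))
```

Fitting K and beta to a real corpus would typically be done by linear regression on log V against log N, since log V = log K + beta * log N.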

  5. Superficial level 4 • word tokens vs word types • Statistical distribution of words in a document • Clearly non-uniform • Most common words cover more than 50% of occurrences • 50% of the words only occur once • ~12% of the document is formed by words occurring fewer than 4 times.
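The token/type distinction and the skewed distribution can be checked on any tokenized text with a few lines. This is a sketch with hypothetical function names; it counts tokens (occurrences), types (distinct words), the share of types that occur only once, and the share of occurrences covered by the k most frequent words.

```python
from collections import Counter


def type_token_stats(tokens):
    """Count tokens, types, and the fraction of types occurring exactly once."""
    counts = Counter(tokens)
    n_tokens = sum(counts.values())
    n_types = len(counts)
    hapaxes = sum(1 for freq in counts.values() if freq == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "hapax_share_of_types": hapaxes / n_types if n_types else 0.0,
    }


def coverage_of_top(tokens, k):
    """Fraction of all occurrences covered by the k most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = sum(freq for _, freq in counts.most_common(k))
    return top / total


sample = "the cat sat on the mat the end".split()
print(type_token_stats(sample))
print(coverage_of_top(sample, k=1))
```

On the toy sample, "the" is one type but three tokens, and five of the six types occur only once.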

  6. Superficial level 5 Zipf's law: sort the words occurring in a document by frequency. The product of a word's frequency (f) and its position in that ranking (r) is approximately constant: $f \cdot r \approx C$.
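A quick way to inspect Zipf's law on a token list is to rank words by frequency and look at the product of rank and frequency. The sketch below does just that; `zipf_table` is a hypothetical helper name, and the toy input is constructed so that the product is exactly constant, which real text only approximates.

```python
from collections import Counter


def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


toy = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in zipf_table(toy):
    print(row)
```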

  7. Lexical level 1 • Part of Speech (POS) • Formal property of a word-type determining its acceptable uses in syntax. • A POS can be seen as a class of words • A word-type can have several POS, a word-token only one • Plain categories • open, many elements, neologisms, independent and semantically rich classes • N, Adj, Adv, V • Functional categories • closed

  8. Lexical level 2 Lexicon • Repository of lexical information for human or computer use • Two aspects to consider • Representation of lexical information • Acquisition of lexical information

  9. Lexical level 3 Lexicon content • Orthographic transcription • Phonetic transcription • Inflection model • diathesis alternations, subcategorization frames • LOVE VTR (OBJLIST: SN). • LOVE • CAT = VERB • SUBCAT = <SN, SN>
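One possible way to hold such an entry in a program is a small record type. The sketch below encodes the LOVE example from the slide (CAT = VERB, SUBCAT = <SN, SN>); the class name, the field names and the `accepts_frame` helper are assumptions made for illustration, and the phonetic and inflection fields are left empty because the slide gives no values for them.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class LexicalEntry:
    """A toy lexicon entry; field names are illustrative, not a standard."""

    lemma: str                                # orthographic form
    cat: str                                  # part of speech
    subcat: Tuple[str, ...] = ()              # subcategorization frame
    phonetic: Optional[str] = None            # phonetic transcription, if known
    inflection_model: Optional[str] = None    # inflection paradigm, if known


LEXICON = {
    "LOVE": LexicalEntry(lemma="LOVE", cat="VERB", subcat=("SN", "SN")),
}


def accepts_frame(entry, frame):
    """True if the observed argument frame matches the entry's SUBCAT list."""
    return tuple(frame) == entry.subcat


print(LEXICON["LOVE"])
print(accepts_frame(LEXICON["LOVE"], ["SN", "SN"]))
```

This is the attribute/value style of representation; frames or unification-based formalisms would express the same information with inheritance or feature structures.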

  10. Lexical level 4 • POS • Argument structure • Semantic information • dictionaries => definition • lexicons => semantic types predefined in a hierarchy. • Lexical Relations • derivation • Equivalence with other languages

  11. Lexical level 5 Problems • Form • attribute/value pairs, binary or n-ary relations, coded values, open domain values… • Multiple assignments • One to many and many to one relations • Contextual dependencies … • Facets of features • Mandatory or optional, cardinality, default values • Grading • Exact values, preferences, probabilistic assignments.

  12. Lexical level 6 Representation • General purpose databases • Textual databases • Lexical databases • OO formalisms • OO databases • Frames • Unification-based formalisms

  13. Lexical level 7 Lexical information acquisition • Dictionaries • MRD (machine-readable dictionaries) • Predefined internal structure • Some degree of coding in some contents • Internal relations (synonymy, hyponymy, ...) • (sometimes) restricted vocabulary • Some systematicity in building definitions

  14. Lexical level 8 Information present in corpora • Collocations • Argument structure. • Frequency information • Context • Grammatical induction • Probabilistic analysis. • Lexical relations • Examples of use. • Selectional restrictions • Nominal compounds • Idioms, ...
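As a very rough illustration of extracting frequency information and collocation candidates from a corpus, the sketch below counts adjacent word pairs and keeps those above a threshold. Raw bigram frequency is only a crude stand-in chosen for this example; association measures are normally used to separate true collocations from pairs that are merely frequent. The function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def frequent_bigrams(tokens, min_count=2):
    """Count adjacent token pairs and keep those seen at least min_count times."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in bigram_counts.items() if n >= min_count}


corpus = "strong tea and strong tea and weak tea".split()
print(frequent_bigrams(corpus))
```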

  15. Lexical level 9 Corpus typology • Raw corpus • Horizontal or vertical corpus • Tagged corpora • Parenthesized (bracketed) corpora • Treebanks
