LIS 7450, Searching Electronic Databases. Basic: Database Structure & Database Construction. Dialog: Database Construction for Dialog (FYI). Deborah A. Torres
Database Structure: Organization of Data Elements and Records
Database Record
• Record: the basic unit of information in a database (file).
• Example: a bibliographic record contains descriptive information, e.g. author, title, and publisher.
Fields
• Field: a distinct part or section of a record (a unit of information within the record).
• Example of personnel record fields: employee’s name, unique identifier number, address, date of hire, etc.
Field Design Decisions
• For each field:
  • Decide what information is placed in the field and the format of that information (text, numeric)
  • Should there be subfields within the field?
• What to call the fields?
  • Field codes (abbreviations, numbering)
• Order of the fields
Example: MARC Record (a type of record you should be familiar with)
Record Fields & Codes
• The 100 field contains author information.
• The 245 field contains main title information.
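To make the idea of numbered field codes concrete, here is a minimal sketch of a record as a mapping from field code to content. It is a simplification, not a real MARC parser: actual MARC fields also carry indicators and subfield codes, and the author and title values are invented placeholders.

```python
# Field codes from the slide, with their standard MARC names.
FIELD_NAMES = {
    "100": "main entry, personal name",
    "245": "title statement",
}

# Hypothetical record; the values are placeholders, not real catalog data.
record = {
    "245": "An example title",
    "100": "Doe, Jane",
}

def describe(record):
    """List each field in field-code order with a readable label."""
    return [
        f"{code} ({FIELD_NAMES.get(code, 'unknown')}): {value}"
        for code, value in sorted(record.items())
    ]

for line in describe(record):
    print(line)
```

Sorting by field code reflects one possible answer to the "order of the fields" design decision.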
Other Design Decisions
• Hyphenated words
  • Example: home-school
• Stop words
  • High-frequency words not useful for searching
• Single words and phrases
  • Examples: library, library science, color of money
• Alternative spellings of words
  • Examples: color, colour
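Two of these decisions can be illustrated with a short sketch. The function name `index_terms`, the `split_hyphens` switch, and the `SPELLING_VARIANTS` table are assumptions made up for this example; a real database would make these choices once, at construction time.

```python
import re

# Assumed normalization table: map alternative spellings to one form.
SPELLING_VARIANTS = {"colour": "color"}

def index_terms(text, split_hyphens=True):
    """Return the terms that would be indexed for a piece of text."""
    if split_hyphens:
        pattern = r"[a-z]+"                # "home-school" -> "home", "school"
    else:
        pattern = r"[a-z]+(?:-[a-z]+)*"    # "home-school" stays one term
    words = re.findall(pattern, text.lower())
    return [SPELLING_VARIANTS.get(word, word) for word in words]

print(index_terms("Home-school colour"))
# ['home', 'school', 'color']
print(index_terms("Home-school colour", split_hyphens=False))
# ['home-school', 'color']
```

Whichever choice the database makes determines what a searcher has to type: if hyphenated words are split, a search for "home-school" as a single term finds nothing.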
Types of Databases
• Bibliographic: references and abstracts of published documents.
• Full-text: the complete text of articles, dictionary entries, codes of law, or other such documents.
• Directory: factual information about organizations, companies, products, people, or materials.
Types of Databases (continued)
• Numeric: data in a tabular or statistically manipulated form, often with some added text.
• Hybrid: a mix of record types. For example, a database may have full-text records for some publications and citations and abstracts for other source documents.
Database Construction: Basic Steps for Automatic Indexing of Text Documents
Six Basic Steps
Step 1: Parse text into words
Step 2: Compare to stoplist and eliminate stop words
Step 3: Stem content words (reduce to root words); skip this step if you decide not to stem
Step 4: Count stemmed word occurrences
Step 5: Create union list of terms
Step 6: Create data structure for specific retrieval techniques (e.g. an inverted file)
Example: Simple Set of 5 One-Sentence Documents (“D” stands for document)
D1: It is a dog eat dog world!
D2: While the world sleeps.
D3: Let sleeping dogs lie.
D4: I will eat my hat.
D5: My dog wears a hat.
Step 1: Parse Text into Words
Note: Some databases remove punctuation within words, such as the apostrophe in possessives; others preserve it. What difference would this make?
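A minimal sketch of Step 1 shows the difference the punctuation choice makes. The function name `parse` and the `keep_apostrophes` switch are assumptions for illustration, and the tokenizer handles plain a-z letters only.

```python
import re

def parse(text, keep_apostrophes=False):
    """Lowercase the text and split it into words."""
    if keep_apostrophes:
        pattern = r"[a-z]+(?:'[a-z]+)?"   # keeps "dog's" as one word
    else:
        pattern = r"[a-z]+"               # splits at every non-letter
    return re.findall(pattern, text.lower())

print(parse("It is a dog eat dog world!"))
# ['it', 'is', 'a', 'dog', 'eat', 'dog', 'world']
print(parse("The dog's hat"))
# ['the', 'dog', 's', 'hat']
print(parse("The dog's hat", keep_apostrophes=True))
# ['the', "dog's", 'hat']
```

With punctuation removed, a possessive is indexed under the plain noun plus a stray "s"; with punctuation preserved, it becomes a separate index term that a search for "dog" would miss unless stemming is applied.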
Step 2: Eliminate Stop Words
Stop words are content-free words: those not useful in determining the content of the document.
Examples: pronouns (I, my), prepositions (of, by, on), articles and other determiners (a, the, this)
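Step 2 amounts to a set-membership test. The stoplist below is an assumption, chosen to cover the content-free words in the five example documents; every database defines its own list.

```python
# Assumed stoplist for the example documents; real stoplists differ by database.
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}

def remove_stopwords(words):
    """Keep only the words that are not on the stoplist."""
    return [word for word in words if word not in STOPLIST]

d1_words = ["it", "is", "a", "dog", "eat", "dog", "world"]  # D1 after parsing
print(remove_stopwords(d1_words))
# ['dog', 'eat', 'dog', 'world']
```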
Types of Stemming Decisions
• No stemming: contract, contracts, contracted, contracting, contractor, contraction, contractual, contracture are all treated as different words.
• Weak stemming: removes inflections (-s, -es, -ed, -ing, -’s).
• Strong stemming: also removes derivations (-tion, -ly, -ally).
Stemming reduces words to a root variant; there are different stemming algorithms.
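As a rough illustration of weak stemming, the sketch below strips the inflectional suffixes listed above. It is a deliberately naive suffix stripper, not one of the published stemming algorithms, and the three-letter minimum stem length is an assumption added to avoid mangling short words such as "is".

```python
# Inflectional suffixes from the slide, longest first (straight apostrophe assumed).
WEAK_SUFFIXES = ("ing", "'s", "es", "ed", "s")
MIN_STEM_LENGTH = 3  # assumed guard so short words are left alone

def weak_stem(word):
    """Strip one inflectional suffix, if the remaining stem is long enough."""
    for suffix in WEAK_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word

for word in ["contract", "contracts", "contracted", "contracting", "contractor"]:
    print(word, "->", weak_stem(word))
# contract -> contract
# contracts -> contract
# contracted -> contract
# contracting -> contract
# contractor -> contractor
```

The derived form "contractor" survives because -or is a derivational suffix, which only a strong stemmer would remove.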
A bit more about stemming for searching…
• Some databases automatically search for all of the words that come from the same stem/root word unless you indicate that you only want the word you entered.
• Example: if you entered computer, the database would also search for computing, computers, computation, etc.
Step 4: Sort Words, Count Duplicates
• Sort into alphabetical order
• Count any duplicates
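A minimal sketch of the sort-and-count step, using the stemmed content words of D1 as input (this assumes "it", "is", and "a" were removed as stop words):

```python
from collections import Counter

d1_stems = ["dog", "eat", "dog", "world"]  # D1 after Steps 1-3

counts = Counter(d1_stems)        # count duplicates
for term in sorted(counts):       # alphabetical order
    print(term, counts[term])
# dog 2
# eat 1
# world 1
```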
Step 6: Create Inverted Index (inverted file)
Union list (unique terms): dog, eat, hat, let, lie, sleep, wear, world
Inverted index:
dog: D1 D3 D5
eat: D1 D4
hat: D4 D5
let: D3
lie: D3
sleep: D2 D3
wear: D5
world: D1 D2
The inverted index has pointers to the documents in which each word occurs.
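The whole pipeline can be sketched in a few lines for the five example documents. The stoplist and the very small stemmer (it strips only -ing and -s) are assumptions chosen for this example set; neither is a general-purpose implementation.

```python
import re
from collections import defaultdict

DOCS = {
    "D1": "It is a dog eat dog world!",
    "D2": "While the world sleeps.",
    "D3": "Let sleeping dogs lie.",
    "D4": "I will eat my hat.",
    "D5": "My dog wears a hat.",
}
STOPLIST = {"a", "i", "is", "it", "my", "the", "while", "will"}  # assumed

def stem(word):
    """Naive weak stemmer: strip -ing or -s if at least 3 letters remain."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_inverted_index(docs):
    """Map each stemmed term to the sorted list of documents containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())               # Step 1
        terms = {stem(w) for w in words if w not in STOPLIST}     # Steps 2-3
        for term in terms:
            index[term].append(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

index = build_inverted_index(DOCS)
print(list(index))  # the union list, in alphabetical order
for term, doc_ids in index.items():
    print(f"{term}: {' '.join(doc_ids)}")
```

Looking up a term is then a single dictionary access: `index["sleep"]` points to D2 and D3 without scanning any document text.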
Dialog Database Construction FYI: For those interested in Dialog
Dialog Database Construction
Step 1: Create a linear file of records received from the Information Provider. Assign sequential accession numbers to the records.
Step 2: Label the fields within the records: AU for Author, TI for Title, etc. If a field is word-indexed, also label the words within each field. Exclude stop words: AN, FOR, THE, AND, FROM, TO, BY, WITH
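Steps 1 and 2 might look something like the sketch below. This is an illustration only, not Dialog's actual software: the incoming records are invented placeholders, and labeling each word with its position in the field is an assumption about what "label the words" means.

```python
# Stop words as listed on the slide.
STOP_WORDS = {"AN", "FOR", "THE", "AND", "FROM", "TO", "BY", "WITH"}

# Hypothetical records from an information provider (placeholder values).
incoming = [
    {"AU": "Doe, Jane", "TI": "Searching for the facts"},
    {"AU": "Roe, Richard", "TI": "An index to the world"},
]

def build_linear_file(records):
    """Assign accession numbers and label the words of the TI field."""
    linear_file = []
    for accession, fields in enumerate(records, start=1):
        title_words = fields["TI"].upper().split()
        labeled = [
            (position, word)                      # assumed label: word position
            for position, word in enumerate(title_words, start=1)
            if word not in STOP_WORDS
        ]
        linear_file.append(
            {"accession": accession, "fields": fields, "TI_words": labeled}
        )
    return linear_file

linear_file = build_linear_file(incoming)
for entry in linear_file:
    print(entry["accession"], entry["TI_words"])
# 1 [(1, 'SEARCHING'), (4, 'FACTS')]
# 2 [(2, 'INDEX'), (5, 'WORLD')]
```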
Dialog Database Construction (continued)
Step 3: Create the Basic Index: all words and phrases from fields containing subject-related terms.
Step 4: Create the Additional Indexes: all terms from all remaining fields.
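The split between the Basic Index and the Additional Indexes can be sketched as two separate lookup tables. Everything here is an assumption for illustration: the records are placeholders, the choice of TI as the only subject-related field is arbitrary, and the `AU=` style key is just one way to keep non-subject terms apart by field code.

```python
from collections import defaultdict

STOP_WORDS = {"AN", "FOR", "THE", "AND", "FROM", "TO", "BY", "WITH"}
SUBJECT_FIELDS = {"TI"}  # assumed: which fields count as subject-related

# Hypothetical records keyed by accession number (placeholder values).
records = {
    1: {"AU": "Doe, Jane", "TI": "Searching for the facts"},
    2: {"AU": "Roe, Richard", "TI": "An index to the world"},
}

def build_indexes(records):
    """Return (basic_index, additional_indexes) as term -> accession numbers."""
    basic = defaultdict(set)
    additional = defaultdict(set)
    for accession, fields in records.items():
        for code, value in fields.items():
            if code in SUBJECT_FIELDS:
                for word in value.upper().split():
                    if word not in STOP_WORDS:
                        basic[word].add(accession)
            else:
                additional[f"{code}={value.upper()}"].add(accession)
    return dict(basic), dict(additional)

basic_index, additional_indexes = build_indexes(records)
print(sorted(basic_index))         # ['FACTS', 'INDEX', 'SEARCHING', 'WORLD']
print(sorted(additional_indexes))  # ['AU=DOE, JANE', 'AU=ROE, RICHARD']
```

Keeping the two apart means a subject search for WORLD never matches an author who happens to be named World.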