Hypertext (1) • Historically, text is sequential: read from beginning to end • Hypertext is non-sequential, with internal links from one part to another • The word hypertext was coined by Ted Nelson, who first published it in 1965. • First hypertext system, Xanadu, named for Coleridge’s magical world.
Hypertext (2) Links in hypertext give access to: • topics or information directly related to the current idea • notes, such as footnotes or endnotes • explanations of special words or phrases • biographical information about people behind the current idea
Claims about Hypertext • Represents large body of information organized into numerous fragments • Fragments relate to one another • User needs only a small fraction of the fragments at any time • Exists only in cooperation with the reader • Is a legitimate literary concept
Claims about Hypertext (2) • Integrates three technologies • Publishing (as a book publisher would) • Computing (as the infrastructure) • Broadcasting (over a computer network) • Depends on computer environment for high-speed transitions between nodes • Modelled by network ADT
Using Hypertext • Browser, or hypertext engine: a computer-based system that allows links to be followed easily • Navigation aids: parts of the user interface that provide a sense of location and direction • Notation: a convenient way of specifying links as a hypertext author
WWW as a Hypertext System • Browser: Netscape, for example • Navigational aids: • Forward, back, home • History list • Colored anchors • Consistent titles • Notation: HTML
Network ADT • Model of hypertext • Similar to tree ADT, but allows cycles • Links have an explicit direction, capturing the idea of going forward and going back
Network ADT (2) • Definition: A network is a collection of nodes and links between pairs of nodes such that • Each link has a direction. • Each node is reachable from any other node. However, the path is not necessarily unique. • No node is linked to itself. • There are no duplicate links in the same direction.
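The definition above can be sketched as a small class. This is only an illustration, not an implementation taken from the slides: the name `Network` and its methods are hypothetical, and the reachability check assumes a link may be travelled in either direction.

```python
from collections import deque


class Network:
    """Directed links between nodes; no self-links, no duplicate links."""

    def __init__(self):
        self.nodes = set()
        self.links = set()  # each link is an ordered pair (source, target)

    def add_node(self, node):
        self.nodes.add(node)

    def add_link(self, source, target):
        if source == target:
            raise ValueError("no node may be linked to itself")
        if (source, target) in self.links:
            raise ValueError("duplicate link in the same direction")
        self.nodes.update((source, target))
        self.links.add((source, target))

    def is_connected(self):
        """True if every node can be reached from any other node.

        Assumption: a link may be travelled in either direction.
        """
        if not self.nodes:
            return True
        neighbours = {node: set() for node in self.nodes}
        for source, target in self.links:
            neighbours[source].add(target)
            neighbours[target].add(source)
        start = next(iter(self.nodes))
        seen = {start}
        queue = deque([start])
        while queue:
            for nxt in neighbours[queue.popleft()] - seen:
                seen.add(nxt)
                queue.append(nxt)
        return seen == self.nodes


net = Network()
net.add_link("home", "about")
net.add_link("about", "home")     # opposite direction: a cycle, not a duplicate
net.add_link("about", "contact")
```

Adding `("home", "about")` a second time, or linking `"home"` to itself, raises `ValueError`, which mirrors the last two conditions of the definition.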
Network ADT (3) • Observations: • There is no hierarchy; all nodes are considered the same. (In a tree, the root is special.) • Links have direction, but reverse travel is possible. (One can go backwards on a link, or forwards on a link that goes in the opposite direction.) • Cycles are allowed.
Directed Graphs • Both networks and rooted trees are examples of a connected directed graph, sometimes called a digraph. • Formally, a digraph is a set of nodes and a set of links joining ordered pairs of nodes. The link (A,B) that joins A to B is different from the link (B,A) that joins B to A
Navigation in Sequential Text • Low level: • Punctuation • Fonts • Separation into sentences and paragraphs • High level: • Chapters, sections, subsections • Table of contents • Index
Navigation in Sequential Text (2) • Page layout • Page numbers • Running heads • Displayed text
Navigating in Hypertext • Issues: • Where am I? Have I been here before? When? • How did I get here? • Where can I go? • Anchors (or links) • Implicit anchors (or links): clipboard, glossary, calculator • Computed links: next train • Back • Forward • Home
Navigating in Hypertext (2) • Within a node: • Save to disk • Print • Annotate • Scroll • Zoom
Navigating in Hypertext (3) • User interface support • Give power to the users through • short response time • low cognitive load • path clues, perhaps decaying over time • Follow a path forward or backward • Return to a node
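The back, forward, and home aids, together with the questions "Have I been here before?" and "How did I get here?", can be sketched with two stacks and a set of visited nodes. The class `History` and its method names are hypothetical, and the two-stack design is an assumption rather than something the slides prescribe.

```python
class History:
    """Back/forward/home navigation over visited nodes."""

    def __init__(self, home):
        self.home = home
        self.current = home
        self.back_stack = []
        self.forward_stack = []
        self.visited = {home}

    def follow(self, node):
        """Follow a link to `node`; this discards the forward path."""
        self.back_stack.append(self.current)
        self.forward_stack.clear()
        self.current = node
        self.visited.add(node)

    def back(self):
        if self.back_stack:
            self.forward_stack.append(self.current)
            self.current = self.back_stack.pop()
        return self.current

    def forward(self):
        if self.forward_stack:
            self.back_stack.append(self.current)
            self.current = self.forward_stack.pop()
        return self.current

    def go_home(self):
        if self.current != self.home:
            self.follow(self.home)
        return self.current

    def been_here_before(self, node):
        return node in self.visited

    def path(self):
        """How did I get here? The nodes leading to the current one."""
        return self.back_stack + [self.current]


h = History("index")
h.follow("chapter1")
h.follow("glossary")
h.back()      # returns to "chapter1"
h.forward()   # returns to "glossary"
```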
Text Markup • Unified view of text and hypertext presentation • Foundation of all word processors • Describes all electronic manuscripts by • separating logical elements • specifying processing functions for these elements
Text Markup (2) • Originated by William Tunnicliffe (Sept. 1967), in talk advocating separating information content of document from format • Control formatting with embedded codes
Generalized Markup • Goal: allow editing, formatting, and retrieval systems to share documents • Devised by Goldfarb, Mosher, Lorie at IBM, 1969 • Formally defined • document types • explicit nested element structure • generic identifier associated with each element
SGML • Standard Generalized Markup Language • First draft standard, 1980 • ISO 8879, 1986 • Based on the tree ADT • Allows the description of a document, considered as a tree, to be embedded in the file containing the document
Functions of SGML • Tags documents in a formal language • Describes internal logical structures • Links files with an addressing scheme • Acts as a database language for text • Accommodates multimedia and hypertext • Provides a grammar for style sheets • Allows coded text reuse in surprising ways
Functions of SGML (2) • Represents documents independent of computing platform • Provides a standard for transferring documents among platforms and applications • Acts as a metalanguage for document types • Represents hierarchies • Extends to accommodate new document types
Generic Identifiers • Tagging vs. formatting • Tagging shows document structure • Formatting describes document display • Example: A paragraph is a sequence of closely connected sentences and can be delimited by a tag. A paragraph can be displayed with either • initial indenting or not • extra separation or not
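The paragraph example can be illustrated with a short sketch in which the structure (a list of paragraphs) stays fixed while the display choices vary. The function `render` and its parameters are hypothetical names chosen for this illustration.

```python
paragraphs = ["First paragraph of the text.", "Second paragraph of the text."]


def render(paragraphs, indent=False, extra_separation=False):
    """Format structurally tagged paragraphs according to display choices."""
    prefix = "    " if indent else ""
    separator = "\n\n" if extra_separation else "\n"
    return separator.join(prefix + p for p in paragraphs)


indented = render(paragraphs, indent=True)
separated = render(paragraphs, extra_separation=True)
```

Both results come from the same tagged content; only the formatting decisions differ.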
Generic Identifiers (2) • Syntax • Beginning: < identifier > • End: </ identifier > • Attribute list, with assigned values, may follow identifier
Generic Identifiers (3) • Typical identifiers: • p paragraph • q quotation • ol numbered (ordered) list • ul unnumbered list • li list item • b bold face • i italics
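The begin tag, end tag, and attribute syntax can be seen by reading a small fragment into a tree of nested elements. The sketch below uses Python's standard `html.parser.HTMLParser`, which handles HTML rather than general SGML; the class `TreeBuilder` and the list-based tree layout are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Collects elements into nested [identifier, attributes, children] lists."""

    def __init__(self):
        super().__init__()
        self.root = ["document", {}, []]
        self.open_elements = [self.root]

    def handle_starttag(self, tag, attrs):
        element = [tag, dict(attrs), []]
        self.open_elements[-1][2].append(element)
        self.open_elements.append(element)

    def handle_endtag(self, tag):
        # Close the innermost open element with this identifier, if any.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i][0] == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.open_elements[-1][2].append(data.strip())


builder = TreeBuilder()
builder.feed('<ol type="a"><li>first <b>bold</b></li><li>second</li></ol>')
builder.close()
tree = builder.root
```

Here `tree` holds one `ol` element with the attribute `type="a"` and two `li` children, the first of which contains a nested `b` element.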
Display of Text • ASCII codes for printing characters carry no information about display • Printed or displayed characters are described by their font.
Fonts • Fonts come in families, which are a group of fonts with similar design characteristics. • A font is a set of displayed characters in a particular design. To describe a font, we specify: • The font face, or type face, which is the design of the font. • The size, measured in points, which is the height of representative characters. • The appearance: bold, italic, underline, outline, shadow, small cap, redline, strikeout, etc.
Fonts (2) • Font families include standard modifications of a base font, such as italics and bold, to change the appearance. (This family is Times New Roman.) • Some families are sans serif, without the cross strokes accentuating the ends of the main strokes.
Fonts (3) • Typical examples of fonts are • Times New Roman • Arial • Century Schoolbook • Lucida Calligraphy • Verdana
Fonts (4) • The size of this font is 32 points • This is 54 points • This is 24 points • There are 72.27 traditional printer's points per inch (PostScript and desktop publishing use exactly 72 points per inch)
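As a quick worked conversion, the point sizes above can be turned into inches. The sketch uses the 72.27 points-per-inch figure by default; the 72-point desktop-publishing value is included for comparison and is not from the slides. The function name `points_to_inches` is hypothetical.

```python
POINTS_PER_INCH_PRINTER = 72.27  # traditional printer's point
POINTS_PER_INCH_DTP = 72.0       # PostScript / desktop-publishing point


def points_to_inches(points, points_per_inch=POINTS_PER_INCH_PRINTER):
    """Convert a font size in points to inches."""
    return points / points_per_inch


for size in (24, 32, 54):
    print(f"{size} pt = {points_to_inches(size):.3f} in")
```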
Fonts (5) To render a character in a font, one must know • The computer code (ASCII) of the character • The font name and properties Then the computer creates the glyph that represents the character in the specified font.
Fonts (6) In the process, the computer uses the • Baseline: the invisible line on which characters are aligned. • x-height: the actual height of the character x • Kerning: spacing between two letters. Note that in printing “wo” the “o” slides under the “w” to form and locate the glyph
Input devices for text • Keyboard • Scanning with optical character recognition • Hand printed • Hand written (cursive) • Machine printed • Voice recognition • Pen-based
Input errors • Human-based, e.g. • Typographic • Poor writing • Machine dependent • Small typeface differences: O vs. D • Limits of technology • Pre-existing errors
Automatic error correction • Keyboard input has about the same error rate as OCR at 98% accuracy combined with automatic correction • Automatic correction also helpful in: • Computer-aided authoring • Communication enhancement for disabled • Natural language responses • Database interaction • Example: MS Word AutoCorrect
Automatic spelling correction • Three increasingly difficult tasks: • Non-word detection: string in text not in dictionary • Isolated word correction: thier automatically becomes their • Context-dependent correction: here automatically becomes hear
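The first two tasks can be sketched in a few lines. The tiny `LEXICON`, the function names, and the choice of an edit distance that counts adjacent transpositions are all assumptions made for this illustration. The third task is not attempted: a real word used in the wrong place passes the lexicon check.

```python
import re

# Hypothetical, deliberately tiny lexicon.
LEXICON = {"their", "there", "here", "hear", "friends", "are", "we", "can", "you"}


def non_words(text, lexicon):
    """Task 1: report strings in the text that are not in the lexicon."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in lexicon]


def edit_distance(a, b):
    """Edits (insert, delete, substitute, transpose adjacent) turning a into b."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def correct_isolated(word, lexicon):
    """Task 2: suggest the closest lexicon word, ignoring context."""
    return min(sorted(lexicon), key=lambda w: edit_distance(word, w), default=None)


flagged = non_words("Thier friends are here", LEXICON)
suggestion = correct_isolated("thier", LEXICON)
missed = non_words("we can here you", LEXICON)  # "here" should be "hear"
```

The last call reports nothing, which is why context-dependent correction is the hardest of the three.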
General spelling correction • Can allow human intervention, e.g. choose the correct spelling from a list of candidates • No context-dependent, general-purpose correction tool exists yet.
Issues for spelling correction • Type of input device • Focus on adjacent keys: b vs. n • Focus on similar shapes: O vs. D • Interactive vs. automatic correction • How many choices are reasonable? (One for automatic correction.) • How accurate should guesses be? • Proper choice of dictionary
Word list choice • Use a lexicon: a word list appropriate to a particular topic • As opposed to a dictionary: a comprehensive list of words • Include provision for adding new words
Word list choice: Example 1 • Compare NY Times news wire text with Webster’s 7th Collegiate Dictionary • 8 million words in news wire text: • only 36% in dictionary • only 39% of dictionary words used in text
Example 1 (continued) • Of text words not in dictionary • 1/4 inflected forms (change in case, gender, tense) • 1/4 proper names • 1/6 hyphenated forms • 1/12 misspellings • 1/4 unresolved by investigators (new words, etc.) • How to handle proper names?
Example 2 • Corpus of 22 million words from a variety of genres • Effect of changing lexicon from 50,000 to 60,000 words? • Eliminated 1348 false rejections (words that are now included in the lexicon) • Created 23 false acceptances (misspellings that now occur in the lexicon and are therefore treated as correctly spelled)
Unintentionally correct spellings • Misuse of word: there for their, to for too • Typo: from for form • Quote from Mozart: I’ll see you in five minuets
Issues in detection • Given document as a sequence of words, lexicon as ordered list of words, report all document words not in lexicon, but: • How to handle upper case letters? • How to handle suffixes and prefixes? • What definition of word to use?
Issues in detection (2) • Upper case: Change all to lower case • Handles first word of sentence and proper names that are words: Bob Brown • Confuses: DEC (ok), Dec (abbreviation), dec (misspelling) • Must put back capitalization
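The lower-casing approach and the need to put capitalization back can be sketched as follows. The function names are hypothetical, and the lexicon is assumed to hold only lower-case entries.

```python
def check_word(word, lexicon):
    """Look up a word case-insensitively (lexicon holds lower-case words)."""
    return word.lower() in lexicon


def restore_case(original, replacement):
    """Give the replacement the capitalization pattern of the original."""
    if original.isupper() and len(original) > 1:
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


lexicon = {"bob", "brown", "their"}
sentence_start = restore_case("Thier", "their")   # capital restored
shouted = restore_case("THIER", "their")          # all capitals restored
dec_variants = [check_word(w, lexicon) for w in ("DEC", "Dec", "dec")]
```

All three entries of `dec_variants` are the same, which shows the confusion noted above: once case is discarded, the lookup cannot tell the acceptable forms from the misspelling.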
Types of errors • From keyboard input, 80% of misspellings are a single • Insertion • Deletion • Substitution, especially of nearby keys • Transposition • Few errors occur in the first letter • Mostly, length is the same or changes by 1
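Because these four error types are each undone by a single edit, one common way to find correction candidates is to generate every string one edit away from the misspelling and keep those that appear in the lexicon. This is a sketch of that general technique, not a method stated in the slides; the function names and the lower-case ASCII alphabet are assumptions.

```python
import string


def single_edits(word, alphabet=string.ascii_lowercase):
    """All strings one insertion, deletion, substitution, or transposition away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletions = {left + right[1:] for left, right in splits if right}
    transpositions = {left + right[1] + right[0] + right[2:]
                      for left, right in splits if len(right) > 1}
    substitutions = {left + c + right[1:]
                     for left, right in splits if right for c in alphabet}
    insertions = {left + c + right for left, right in splits for c in alphabet}
    return (deletions | transpositions | substitutions | insertions) - {word}


def candidates(misspelling, lexicon):
    """Lexicon words reachable from the misspelling by a single edit."""
    return single_edits(misspelling) & set(lexicon)


found = candidates("thier", {"their", "there", "tier"})
```

Here `found` contains `"their"` (a transposition) and `"tier"` (a deletion) but not `"there"`, which is two edits away.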
Suggestion Strategies • Words with same first letter first • Order rest by change in length
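One possible reading of this strategy is a two-part sort key. The function name is hypothetical; ordering both groups by length change and breaking ties alphabetically are assumptions, and the misspelling and candidates are assumed to be non-empty strings.

```python
def rank_suggestions(misspelling, candidates):
    """Same first letter first; within each group, smallest length change first."""
    return sorted(
        candidates,
        key=lambda w: (w[0] != misspelling[0], abs(len(w) - len(misspelling)), w),
    )


ranked = rank_suggestions("fron", ["iron", "front", "from"])
```

With this key, `"from"` and `"front"` precede `"iron"` because they share the first letter, and `"from"` precedes `"front"` because its length is unchanged.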
Types of errors (2) • Improper spacing: run-ons or splits • Significant unsolved problem • Cognitive • recieve for receive; procede for proceed • conspiricy for conspiracy; mispell for misspell • Phonetic • abiss for abyss; nacherly for naturally
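For the improper-spacing errors, a naive sketch can at least propose candidates: try every split point of a run-on token, and try joining adjacent tokens. The function names are hypothetical. The sketch makes no attempt to choose among competing splits, which is part of why the general problem remains hard.

```python
def split_run_on(token, lexicon):
    """Ways to split a run-on token into two lexicon words."""
    return [(token[:i], token[i:]) for i in range(1, len(token))
            if token[:i] in lexicon and token[i:] in lexicon]


def join_splits(tokens, lexicon):
    """Adjacent token pairs that form a lexicon word when joined."""
    return [(a, b, a + b) for a, b in zip(tokens, tokens[1:])
            if (a not in lexicon or b not in lexicon) and a + b in lexicon]


run_on = split_run_on("inthe", {"in", "the", "i"})
split = join_splits(["spel", "ling", "is", "hard"], {"spelling", "is", "hard"})
```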
Spelling Rules • I before E except after C • Only exceed, succeed, and proceed end in -ceed; all others end in -cede, except supersede