This proposal suggests enhancements for data frames version 3, including improvements to extracting mileage information and handling different types of data. It also introduces the idea of required context and discusses the internal representation, methods, canonicalization, inheritance, general constraints, and other related issues.
Data Frames Version 2
Year matches [2]
  constant { extract "\d{2}";
             context "([^\$\d]|^)\d{2}[^,\dkK]"; } 0.5,
           { extract "\d{2}";
             context "([^\$\d]|^)\d{2},[^\d]"; } 0.6,
           { extract "\d{2}";
             context "\b'\d{2}\b"; } 0.8;
end;
Mileage matches [8]
  constant { extract "\b[1-9]\d{1,2}k"; } 0.6,
           { extract "[1-9]\d?,\d{3}"; } 0.3;
  keyword "\bmiles\b", "\bmi\.", "\bmi\b";
end;
Also: except, substitute, filter phrases; lexicons
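To make the Version 2 syntax concrete, here is a minimal Python sketch of how the Mileage frame's value and keyword phrases might be applied to a piece of text. The function name `extract_mileage` is hypothetical, and the reading of the trailing numbers (0.6, 0.3) as per-phrase confidence values is an assumption based on the slide. Python's `re` module may also differ in detail from whatever regex engine the original system uses.

```python
import re

# Value phrases copied from the Mileage frame: (extract pattern, confidence).
MILEAGE_CONSTANTS = [
    (r"\b[1-9]\d{1,2}k", 0.6),
    (r"[1-9]\d?,\d{3}", 0.3),
]

# Keyword phrases copied from the Mileage frame.
MILEAGE_KEYWORDS = [r"\bmiles\b", r"\bmi\.", r"\bmi\b"]


def extract_mileage(text):
    """Return ([(value string, confidence), ...], keyword_found) for text.

    Hypothetical helper: the real recognizer's matching rules are not
    specified on the slide.
    """
    candidates = []
    for pattern, confidence in MILEAGE_CONSTANTS:
        for match in re.finditer(pattern, text):
            candidates.append((match.group(0), confidence))
    keyword_found = any(re.search(k, text) for k in MILEAGE_KEYWORDS)
    return candidates, keyword_found
```

Note that the second value phrase would also match a price such as "4,500", which may be one reason it carries the lower confidence value.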
• Still allow negation • Introduce idea of “required context” • Each phrase may be labeled • Strong separation of value and keyword phrases • Allow keyword to be specific to a subset of the value phrases for this data frame • Expressions are richer than regular expressions: they support Boolean and proximity operators, as well as lexicons and macros • Kimball’s Ontology Editor
Internal Representation • Replace SQL field length with arbitrary type field • This is the “internal representation” • Type is either lexical or nonlexical • Type could be the name of an object set in the ontology • Or it could be the name of a type in whatever language will be used to implement methods (more on this later), together with a units name (e.g. “miles”, “meters”, “grams”, “pounds”)
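One way to picture the proposed type field is as a small record that replaces the SQL field length. The sketch below is an assumption about shape only: the class name `InternalRepresentation` and its field names are hypothetical, not part of the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InternalRepresentation:
    """Hypothetical record for a data frame's internal representation."""

    type_name: str               # object set name, or an implementation-language type
    lexical: bool                # True for lexical, False for nonlexical
    units: Optional[str] = None  # e.g. "miles", "meters", "grams", "pounds"


# Example: Mileage stored as an implementation-language integer, in miles.
MILEAGE_REPRESENTATION = InternalRepresentation("int", lexical=True, units="miles")
```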
Methods • Add a method phrase to data frames • Conceptually they are restricted derived object sets and relationship sets • We only declare method signatures in data frames • Another language (e.g. Java) is used to define the method body • Our tool will generate a template in which the programmer can write method bodies • The template will have OO structures that allow read-only access to the seamless model/data instance • Keyword phrases may also apply to methods
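The generated template described above might look roughly like the following. The slides name Java as an example implementation language. Python is used here only to keep the sketches in one language, and every class and method name is hypothetical. The signature is declared abstractly (standing in for the data frame declaration), the programmer fills in the body, and the model/data instance is exposed through a read-only view.

```python
from abc import ABC, abstractmethod
from types import MappingProxyType


class MileageMethods(ABC):
    """Hypothetical generated template: signatures only, no bodies."""

    def __init__(self, instance):
        # Copy the data and wrap it so that method bodies cannot modify it.
        self._instance = MappingProxyType(dict(instance))

    @property
    def instance(self):
        """Read-only view of the model/data instance."""
        return self._instance

    @abstractmethod
    def to_kilometers(self, miles: float) -> float:
        """Signature as it would be declared in the data frame."""


class MileageMethodsImpl(MileageMethods):
    """Programmer-written method bodies."""

    def to_kilometers(self, miles: float) -> float:
        return miles * 1.609344  # international mile in kilometers
```

Attempting to assign into `instance` raises `TypeError`, which is one simple way to enforce the read-only access the slide calls for.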
Canonicalization Methods • Each value phrase may have an associated canonicalization method • The purpose is to convert the extracted value string into a common form • The data frame may have a default canonicalization method that applies if there is no individual method for a value phrase
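As an illustration of per-phrase and default canonicalization, the sketch below reuses the two Mileage value phrases from the Version 2 example. The first phrase has its own method and the second falls back to the frame's default. All function names are hypothetical, and the choice of a plain integer number of miles as the common form is an assumption.

```python
import re


def canonicalize_thousands(value):
    """Convert a string such as '87k' to 87000."""
    return int(value[:-1]) * 1000


def default_canonicalize(value):
    """Frame-level default: drop every non-digit, so '12,500' becomes 12500."""
    return int(re.sub(r"\D", "", value))


# (value phrase pattern, individual canonicalization method or None)
VALUE_PHRASES = [
    (r"\b[1-9]\d{1,2}k", canonicalize_thousands),
    (r"[1-9]\d?,\d{3}", None),  # no individual method: the default applies
]


def canonicalize(text):
    """Return the canonical value of the first matching phrase, or None."""
    for pattern, method in VALUE_PHRASES:
        match = re.search(pattern, text)
        if match:
            return (method or default_canonicalize)(match.group(0))
    return None
```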
Inheritance • Inheritance is defined more cleanly • Generalization/specialization will indicate inheritance hierarchy • The internal representation cannot be overridden in specializations • Multiple parents must have the same internal representation • Individual inherited phrases can be deleted or overridden • New phrases can be added • In the case of name conflict, we require fully qualified names to be used (no automatic disambiguation)
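The inheritance rules above can be sketched as follows. This is only one possible reading: the `DataFrame` class, its constructor arguments, and the `Parent.label` qualified-name format are all assumptions made for illustration.

```python
class DataFrame:
    """Hypothetical data frame with labeled phrases and inheritance."""

    def __init__(self, name, representation, phrases, parents=(), deleted=()):
        # A specialization cannot override the internal representation,
        # so every parent must agree with it.
        for parent in parents:
            if parent.representation != representation:
                raise ValueError(
                    f"{name}: representation differs from parent {parent.name}"
                )
        self.name = name
        self.representation = representation
        self.parents = tuple(parents)
        self.own = dict(phrases)      # new or overriding phrases, by label
        self.deleted = set(deleted)   # inherited labels to drop

    def phrases(self):
        """Return the effective phrases, keyed by label."""
        by_label = {}
        for parent in self.parents:
            for label, pattern in parent.phrases().items():
                by_label.setdefault(label, []).append((parent.name, pattern))
        effective = {}
        for label, sources in by_label.items():
            if len(sources) == 1:
                effective[label] = sources[0][1]
            else:
                # Name conflict: expose only fully qualified names.
                for parent_name, pattern in sources:
                    effective[f"{parent_name}.{label}"] = pattern
        for label in self.deleted:
            effective.pop(label, None)
        effective.update(self.own)
        return effective
```

For example, a `Mileage` frame specializing a `Distance` frame could delete the inherited `thousands` phrase, override `comma`, and add a new `plain` phrase, all while keeping the parent's internal representation.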
General Constraints • We may decide to implement a limited form of general constraint in the ontology • E.g. “Birth Date <= Death Date” • Or “Event Distance.toMiles() <= 26” • If so, we may want to implement operator overloading (something like C++) • The general constraint issue is not core to the current data frame discussion, but it has interesting ramifications
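To show what operator overloading would buy for such constraints, here is a small Python sketch. The `Distance` class, its conversion table, and the `event_distance_constraint` function are hypothetical. Only the constraint "Event Distance.toMiles() <= 26" comes from the slide.

```python
class Distance:
    """Hypothetical value type carrying a number and a units name."""

    _MILES_PER_UNIT = {
        "miles": 1.0,
        "kilometers": 0.621371192,
        "meters": 0.000621371192,
    }

    def __init__(self, value, units):
        if units not in self._MILES_PER_UNIT:
            raise ValueError(f"unknown units: {units}")
        self.value = value
        self.units = units

    def to_miles(self):
        return self.value * self._MILES_PER_UNIT[self.units]

    def __le__(self, other):
        # Overloaded <= compares in miles, whatever units each side uses.
        if isinstance(other, Distance):
            return self.to_miles() <= other.to_miles()
        return NotImplemented


def event_distance_constraint(event_distance):
    """The slide's constraint, written with an explicit conversion."""
    return event_distance.to_miles() <= 26
```

With the overloaded operator, the same constraint could be written as `event_distance <= Distance(26, "miles")`, leaving the unit conversion implicit.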
Other Issues • How to integrate methods and confidence values into record-assembly heuristics • Ontos system will have to be rewritten • Extract into model instance, not SQL tables • We can always generate database tables later if we’d like • Ontologies created graphically and stored as XML