Creating Dictionaries
What is a Dictionary? • CSPro data files are text files with no metadata, only data • A dictionary is needed to describe the contents of the data file • CSPro dictionaries: • End with the extension .dcf • Are text files that can be edited manually, though that is inadvisable • Are not dependent on the existence of a data entry application • Every CSPro application needs a dictionary • Multiple CSPro applications can share the same dictionary
CSPro Data Files • CSPro data files are: • Flat files (all data in a single file) • Text files (all data is stored in ANSI format and is human readable) • Fixed-format files (each item has a fixed length, and records are stored one per line) • Not tied to any specific file extension • An index is created for the data file to allow quick access to specific cases (file extension: .idx)
Identification Items • CSPro needs a way to differentiate between different cases (questionnaires) • Identification (ID) items uniquely identify all cases • Two cases in a single data file cannot have the same ID, but cases across data files can share IDs
Identification Items (continued) • Generally a questionnaire has geocodes or some other system of attributes that uniquely identifies each unit of enumeration • For censuses, these IDs are almost always geocodes • Example: Province – District – Division – Location – Sublocation – Enumeration Area – Household Number • For surveys, these ID sections are often more condensed • Example: Cluster – Household Number
Identification Items (continued) • It is common for the “identification section” of a questionnaire to have questions that do not help uniquely identify a household • Examples include: • Enumerator number • Household type • Urban/rural status • Some people prefer to make the ID section as small as possible, choosing the fewest items needed to ensure that each case is unique • Other people take a more liberal approach to ID fields, but CSPro limits how long the ID field can be (length: 127)
ID Examples • ID: Year; Item on Record: Winner of U.S. presidential election
  1996  William Jefferson Clinton
  2000  George Walker Bush
  2004  George Walker Bush
  2008  Barack Hussein Obama II
• ID: State, county; Item on Record: County name
  0101  Autauga [Alabama]
  5123  Weston [Wyoming]
Dictionary Fundamentals • Identification Items: value(s) to uniquely identify a case • Levels: a group of one or several records • Records: a group of one or several items • Items: a value, or variable, that is numeric or alphanumeric • Subitems: part of an item • Value Sets: a listing of valid values for an item
Dictionary Fundamentals (with a typical survey example) • Identification Items: value(s) to uniquely identify a case. Example: cluster number, household number • Levels: a group of one or several records. Example: household questionnaire, female questionnaires • Records: a group of one or several items. Example: housing characteristics, household roster, fertility questions • Items: a value, or variable, that is numeric or alphanumeric. Example: water access, roof type, …, sex, age, …, children ever born • Subitems: part of an item. Example: date of birth broken down into year, month, day • Value Sets: a listing of valid values for an item. Example: Sex: Male (1), Female (2)
Naming Dictionary Elements • Every element of a dictionary has two attributes, a name and a label • Name • You use the name to refer to the element while programming logic • Can be up to 32 characters but must start with a letter • Each dictionary element must have a unique name, and there are some names that are reserved for CSPro keywords • Label • A more thorough description of the element • Can be up to 255 characters and can contain punctuation and spacing • Often labels are the only documentation that anyone sees, so be sure to take care when creating labels
Naming Dictionary Elements (continued) • If you plan on writing a lot of programming logic, consider how long you make the names for elements • Three common approaches exist for naming elements when the questionnaire has each question numbered • Approach 1: P10_RELATIONSHIP, P11_SEX, P12_AGE • Approach 2: RELATIONSHIP, SEX, AGE • Approach 3: P10, P11, P12 • Remember that each element has a name and a label, and that they do not have to be (and probably should not be) the same value
Levels • Applications can have one or two levels • Most applications are and should be one-level applications, though some applications are better designed as two-level applications • Each level usually has its own questionnaire associated with it • The top-level can only have one questionnaire, while multiple questionnaires can exist at lower levels • Different sections on a questionnaire translate to multiple records, not multiple levels • How many levels do these questionnaires need? • Household questions, population questions, agriculture questions • Population questions, women of reproductive age questions
Records • Records are groupings of items, and generally translate to sections of a questionnaire • Examples of records in a census might be: housing record, population records, death records, emigrant records, agriculture record • A record can be optional, e.g., death records • A record can occur more than once per questionnaire, e.g., population records • When deciding how many times a record can occur, select the maximum possible reasonable value
Record Type • When a dictionary has more than one kind of record, each record must have a type value • The type value differentiates one record in a data file from the other records • You can specify particular values for the record types, or allow CSPro to assign these values automatically • If your dictionary has many records, you may need to increase the length of the record type (default length: 1)
Record Type in the Data File • This data file has two records: winner of the presidential election (1) and loser of the presidential election (2) • The ID item is the year of the election
  RT  ID    RECORD ITEMS
  1   1996  William Jefferson Clinton
  2   1996  Robert Joseph Dole
  1   2000  George Walker Bush
  2   2000  Albert Arnold Gore, Jr.
  2   2004  John Forbes Kerry
  1   2004  George Walker Bush
• Note that the order of the different records does not matter
Multiply-Occurring Records in the Data File • This data file has two records: winner of the presidential election (1, singly-occurring) and losers of the presidential election (2, multiply-occurring) • The ID item is the year of the election
  RT  ID    RECORD ITEMS
  1   1996  William Jefferson Clinton
  2   1996  Robert Joseph Dole
  2   1996  Henry Ross Perot
  1   2000  George Walker Bush
  2   2000  Albert Arnold Gore, Jr.
  2   2000  Ralph Nader
  2   2000  Patrick Joseph Buchanan
• Note that the order of the multiply-occurring records DOES matter
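A minimal sketch of how a program outside CSPro might group such lines into cases. It assumes a layout with no separators: a 1-character record type, a 4-digit year ID, then the candidate name. The function name `group_cases` and the field widths are assumptions made for illustration, not part of CSPro.

```python
# Assumed layout (hypothetical, mirroring the election example):
# column 1 = record type, columns 2-5 = year ID, the rest = candidate name.
LINES = [
    "11996William Jefferson Clinton",
    "21996Robert Joseph Dole",
    "21996Henry Ross Perot",
    "12000George Walker Bush",
    "22000Albert Arnold Gore, Jr.",
    "22000Ralph Nader",
    "22000Patrick Joseph Buchanan",
]


def group_cases(lines):
    """Group fixed-width lines into cases keyed by the ID item."""
    cases = {}
    for line in lines:
        record_type, case_id, name = line[0], line[1:5], line[5:].rstrip()
        case = cases.setdefault(case_id, {"winner": None, "losers": []})
        if record_type == "1":
            case["winner"] = name        # singly-occurring record
        elif record_type == "2":
            case["losers"].append(name)  # occurrences keep their file order
        else:
            raise ValueError(f"unknown record type: {record_type!r}")
    return cases


cases = group_cases(LINES)
```

Because the losers are appended in the order they are read, the first occurrence in the file stays the first occurrence in the case, which is why the order of multiply-occurring records matters.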
Items • Items (variables) describe the data for each question on a census or survey • Items have several properties: • Length: How many characters are needed to faithfully store all possible values for this question? • Data Type: Will this item contain only numeric values, or will it also store words or sentences? • Item Type: Is this a subitem? (use selectively) • Occurrences: Does this item repeat several times? (use selectively)
Items (continued) • Items have several properties: • Decimal: Will this item hold a decimal fraction? If so, how many digits are necessary to the right of the decimal point? • Decimal Character: If the numeric item holds a decimal fraction, should the item be saved to the data file with a decimal point? (This is a purely cosmetic indicator, though it does have bearing on the length of the item.) • Zero Fill: Do you want the unused spaces to the left of a number padded with zeroes?
Item Representations • This is the number 3.14 stored using various item attributes (quotation marks show the full width of each stored value)
• Numeric, Length: 4, Decimal: 2, Decimal Character: Yes, Zero Fill: Yes, stored as "3.14"
• Numeric, Length: 6, Decimal: 2, Decimal Character: Yes, Zero Fill: Yes, stored as "003.14"
• Numeric, Length: 6, Decimal: 2, Decimal Character: No, Zero Fill: Yes, stored as "000314"
• Numeric, Length: 6, Decimal: 2, Decimal Character: Yes, Zero Fill: No, stored as "  3.14"
• Numeric, Length: 6, Decimal: 3, Decimal Character: Yes, Zero Fill: No, stored as " 3.140"
• Alphanumeric, Length: 6, stored as "3.14  "
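The representations above can be reproduced with a short sketch. This is an illustration of the padding rules only, not CSPro's own code; the function name `format_numeric` and its parameters are hypothetical, and it handles non-negative values only.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_numeric(value, length, decimals=0, decimal_char=True, zero_fill=False):
    """Render a non-negative number as fixed-width text (illustrative only)."""
    number = Decimal(str(value))
    if number < 0:
        raise ValueError("this sketch handles non-negative values only")
    quantum = Decimal(1).scaleb(-decimals)  # e.g. 0.01 when decimals == 2
    text = str(number.quantize(quantum, rounding=ROUND_HALF_UP))
    if not decimal_char:
        text = text.replace(".", "")        # the decimal point is implied
    if len(text) > length:
        raise ValueError(f"{text!r} does not fit in {length} characters")
    pad = "0" if zero_fill else " "
    return text.rjust(length, pad)          # numeric items are right-justified


stored = format_numeric(3.14, length=6, decimals=2, decimal_char=False, zero_fill=True)
```

Dropping the decimal character frees one column, which is why that setting affects the item length even though it does not change the value.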
Subitems • People tend to overuse subitems, but they are useful when you intend to process data that makes up a small part of a larger value • Using logic you can access parts of items without having to make them subitems, but subitems can simplify processing, as well as satisfy value set checking while on a form • Example: • Item: Social Security Number, Length 11, composed of three subitems: • Area Number, digits 1-3 • Group Number, digits 5-6 • Serial Number, digits 8-11
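As a rough sketch, a subitem is just a named slice of its parent item. The layout table and the function `split_subitems` below are hypothetical names for illustration, and the sample number is made up; positions are 1-based to match the slide.

```python
# Hypothetical subitem layout: name -> (start position, length), 1-based.
SSN_SUBITEMS = {
    "AREA_NUMBER": (1, 3),
    "GROUP_NUMBER": (5, 2),
    "SERIAL_NUMBER": (8, 4),
}


def split_subitems(item_value, layout):
    """Cut a parent item into its subitems using 1-based start positions."""
    return {
        name: item_value[start - 1:start - 1 + length]
        for name, (start, length) in layout.items()
    }


parts = split_subitems("123-45-6789", SSN_SUBITEMS)  # made-up sample value
```

Positions 4 and 7 hold the hyphens, so they belong to the 11-character item but to none of the subitems.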
Value Sets • Value sets are optional and tell CSPro what values are considered acceptable for an item • If no value set is present, CSPro will accept all values for the item (within limits; e.g., numeric fields cannot contain letters) • If an item has multiple value sets, CSPro will use the first one to check the validity of keyed data • Using logic, the programmer can change which value set is active for an item, and can even generate a value set dynamically • Value sets can contain discrete values, and for numeric items, value sets can contain ranges • Value set ranges can overlap; this is common for tabulation applications • If many items share the same possible values, you can link the value sets so that modifying the value set of one item alters the value set for linked items
Value Set Examples • Sex:
  Label     From  To
  Male      1
  Female    2
• Age:
  Label     From  To
  Minor     0     17
  Teenager  13    19
  Adult     18    99
  Retiree   67    99
• The from/to values of each value set are what is stored in the keyed data file, not the value set labels
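A small sketch of how overlapping ranges behave, using the Sex and Age value sets from this slide. The data structures and the functions `labels_for` and `is_valid` are hypothetical; CSPro performs this check internally and its implementation is not shown here.

```python
# Each entry is (label, from, to); a discrete value has from == to.
SEX_VALUE_SET = [("Male", 1, 1), ("Female", 2, 2)]
AGE_VALUE_SET = [
    ("Minor", 0, 17),
    ("Teenager", 13, 19),
    ("Adult", 18, 99),
    ("Retiree", 67, 99),
]


def labels_for(value, value_set):
    """Return every label whose range contains the value."""
    return [label for label, low, high in value_set if low <= value <= high]


def is_valid(value, value_set):
    """A value is acceptable if it falls inside at least one range."""
    return bool(labels_for(value, value_set))


teen_labels = labels_for(15, AGE_VALUE_SET)
```

An age of 15 matches both Minor and Teenager, which is harmless for validity checking and useful when the same ranges drive tabulation categories.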
Special Values • CSPro has three “special values” that describe certain kinds of data • Not Applicable: the item is blank (e.g., date of menarche would not be asked of men) • Missing: the codebook had a value for missing (or not stated) and you assign this value to be missing • Default: the item has an invalid value (e.g., your program logic assigned a three-digit value to a two-digit field) • By default CSPro ensures that keyed data fits in the value set and is not blank, but if desired CSPro can accept blank data or out-of-range data
Documenting Dictionary Elements • To the left of every element in the dictionary editor is a small gray box under the column heading N • Clicking on this box brings up a field in which you can write notes about the dictionary element • These notes are stored in the dictionary file but are not visible during data entry • Consider making use of these notes, especially when working with partners on an application
Relative Positioning • By default, CSPro will automatically assign the starting position (column number) of each item in your dictionary • When creating a new dictionary, it is best to let CSPro generate these values • Inserting an item in between other items, or modifying the length of an item, will cause all the other items’ starting positions to automatically change • There will be no gaps in the data file • The default order in the data file will be: record type, ID items, record items in the order they appear on the screen
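The bookkeeping behind relative positioning can be sketched as a running total of item lengths. The function `assign_positions` and the item names (YEAR, PARTY, NAME) are hypothetical; the sketch only assumes the default order stated above: record type first, then ID items, then record items.

```python
def assign_positions(record_type_length, items):
    """Assign 1-based start columns to (name, length) pairs, leaving no gaps."""
    positions = {}
    next_start = 1 + record_type_length  # the record type occupies the first columns
    for name, length in items:
        positions[name] = next_start
        next_start += length
    return positions


before = assign_positions(1, [("YEAR", 4), ("NAME", 30)])
# Inserting a 1-character item shifts everything that follows it.
after = assign_positions(1, [("YEAR", 4), ("PARTY", 1), ("NAME", 30)])
```

Here NAME moves from column 6 to column 7 once PARTY is inserted, which is the automatic shift described above.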
Absolute Positioning • However, if you are creating a dictionary to match an existing data file, it may be necessary to select absolute positioning • With absolute positioning, you must specify the starting position (column number) of each item in your dictionary • It is your responsibility to make sure that items do not overlap • Gaps can exist in a data file
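Since absolute positioning leaves the overlap check to you, a quick sanity check along the following lines may help. It is a sketch with hypothetical names (`find_overlaps`, the sample layouts) and it ignores subitems, which overlap their parent item by design.

```python
def find_overlaps(items):
    """Return pairs of item names whose column ranges overlap.

    items is a list of (name, start, length) tuples with 1-based starts.
    """
    ordered = sorted(items, key=lambda item: item[1])
    overlaps = []
    for index, (name_a, start_a, length_a) in enumerate(ordered):
        end_a = start_a + length_a - 1
        for name_b, start_b, _ in ordered[index + 1:]:
            if start_b > end_a:
                break  # later items start even further right
            overlaps.append((name_a, name_b))
    return overlaps


good_layout = [("NAME", 1, 25), ("YEAR", 31, 4), ("RECORD_TYPE", 40, 1)]
bad_layout = [("NAME", 1, 25), ("YEAR", 24, 4)]
problems = find_overlaps(bad_layout)
```

The first layout has gaps between items, which is allowed; the second is rejected because YEAR starts inside NAME.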
Relative vs. Absolute Example • Relative:
  11996William Jefferson Clinton
  21996Robert Joseph Dole
• Absolute (one of many possibilities):
  William Jefferson Clinton   1996   1
  Robert Joseph Dole          1996   2
Modifying the Dictionary • Before a data entry operation begins, feel free to modify the dictionary • CSPro will detect changes between the dictionary and forms, so if you rename or delete a dictionary item, the field on the form will also be renamed or removed • However, once some data exists using a dictionary format, modifying the dictionary must be done with great care • In all cases, make backups of your dictionary before any modifications so that you always have a dictionary that can read data entered at any point in the data entry operation
Adding Fields to the Dictionary • If fields must be added to the dictionary after the data entry process has begun, one option is to simply add them to the end of any given record • The existing data will then have blanks for these new values, but it does not have to be reformatted and can be read by the new dictionary • However, if adding the fields to the end of a record is not practical, you can insert them in the record, but then all existing data must be reformatted to the new dictionary format
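A brief sketch of why appending works: a reader that treats columns beyond the end of a line as blanks can use the new layout on old lines. The function `read_item` and the column positions (NAME at column 6, a new PARTY item at column 36) are hypothetical and chosen only for illustration.

```python
def read_item(line, start, length):
    """Return a fixed-width item (1-based start); missing columns read as blanks."""
    padded = line.ljust(start - 1 + length)
    return padded[start - 1:start - 1 + length]


# A line written before the new item existed (record type, year, name).
old_line = "11996William Jefferson Clinton"

name = read_item(old_line, 6, 30).rstrip()  # existing item, unchanged position
party = read_item(old_line, 36, 1)          # new item appended at the end
```

The existing item is read from the same columns as before, and the new item simply comes back blank for old data. Inserting the item in the middle instead would shift NAME, so old lines would be misread until they are reformatted.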
Modifying Item Lengths • If, after the data entry process has begun, the length of some items will be increased, you must reformat the existing data files • However, if the length of some items will be decreased, it may be possible to use absolute positioning to make your old data files readable • Likewise, deleting an item from the dictionary can be done in a way that does not require reformatting, but again absolute positioning must be used
Dictionary Macros • By right-clicking on the dictionary name in the tree you can access the undocumented dictionary macros • Names and labels of dictionary items, or value sets, can be copied to Excel format, modified in Excel, and then pasted back to CSPro • This can be particularly useful if you want coworkers who do not know how to use CSPro to help with the creation of the dictionary, perhaps by adding values to the codebook (value sets)