SJTU CMGPD 2012 Methodological Lecture, Day 4: Household and Relationship Variables

Outline
• Existing household variables
  • Identifiers
  • Characteristics
  • Dynamics
  • Household relationship
• Creation of new variables
  • Use of bysort/egen
  • Household relationship variables

Identifiers
• HOUSEHOLD_ID
  • Identifies records associated with a household in the current register
• HOUSEHOLD_SEQ
  • The order of the current household (linghu) within the current household group (yihu)
• UNIQUE_HH_ID
  • Identifies records associated with the same household across different registers
  • New value assigned at time of household division
  • Each of the resulting households gets a new, different

Characteristics
• HH_SIZE
  • Number of living members of the household
  • Set to missing before 1789
• HH_DIVIDE_NEXT
  • Number of households in the next register that the members of the current household are associated with
  • 1 if no division
  • 0 if extinction
  • 2 or more if division
  • Set to missing before 1789

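The coding of HH_DIVIDE_NEXT can be collapsed into a labeled three-way outcome. This is a minimal sketch on made-up values, not part of the lecture: the variable hh_outcome and its value label are hypothetical names, and it assumes that "missing" is stored as a negative code (the tabulations later in the lecture filter on HH_DIVIDE_NEXT >= 0).

```stata
* Toy data: made-up HH_DIVIDE_NEXT values, one row per household
clear
input HH_DIVIDE_NEXT
-99
0
1
2
4
end

* Hypothetical recode: 0 = extinct, 1 = no division, 2 = divided
* Negative codes (assumed to mean "missing") stay missing
generate byte hh_outcome = .
replace hh_outcome = 0 if HH_DIVIDE_NEXT == 0
replace hh_outcome = 1 if HH_DIVIDE_NEXT == 1
replace hh_outcome = 2 if HH_DIVIDE_NEXT >= 2 & !missing(HH_DIVIDE_NEXT)

label define hh_outcome_lbl 0 "Extinct" 1 "No division" 2 "Divided"
label values hh_outcome hh_outcome_lbl
tabulate hh_outcome, missing
```
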
histogram HH_SIZE if PRESENT & HH_SIZE > 0, width(2) scheme(s1mono) fraction ytitle("Proportion of individuals") xtitle("Number of members")
This isn’t particularly appealing
• A log scale on the x axis would help
• In Stata, histogram forces fixed-width bins, even when the x scale is set to log
• We can collapse the data and plot using twoway bar or scatter

    table HH_SIZE, replace
    twoway bar table1 HH_SIZE if HH_SIZE > 0, xscale(log) scheme(s1mono) xlabel(0 1 2 5 10 20 50 100 150)

What if we would like to convert to fractions?
• Compute the total number of households by summing table1, then divide each value of table1 by the total
• sum(table1) returns the running sum of table1 up to the current observation
• total[_N] returns the value of total in the last observation, which is the grand total

    drop if HH_SIZE <= 0
    generate total = sum(table1)
    generate hh_fraction = table1/total[_N]
    twoway bar hh_fraction HH_SIZE if HH_SIZE > 0, xscale(log) scheme(s1mono) xlabel(0 1 2 5 10 20 50 100 150) ytitle("Proportion of households")

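The running-sum idiom is easiest to see on a handful of rows. A small sketch with made-up counts (the four values of table1 are invented for illustration):

```stata
* Toy data: made-up counts by household size
clear
input HH_SIZE table1
1 10
2 30
5 40
10 20
end

* sum() is a running sum down the rows: 10, 40, 80, 100
generate total = sum(table1)

* total[_N] is the running sum in the last row, i.e. the grand total (100)
generate hh_fraction = table1 / total[_N]
list HH_SIZE table1 total hh_fraction
```

Because sum() accumulates in the current row order, the grand total in total[_N] does not depend on how the rows are sorted, but the intermediate values do.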
Households as units of analysis
• The previous figures all treated individuals as the units of analysis
  • Every household was represented as many times as it had members
  • A household with 100 members would contribute 100 observations
  • In effect, the figures represent household size as experienced by individuals
• Sometimes we would like to treat households as units of analysis
  • So that each household only contributes one observation per register

Households as units of analysis
• One easy way is to create a flag variable that is set to 1 only for the first observation in each household
• Then select based on that flag variable for tabulations etc.
• This leaves the original individual-level data intact

    bysort HOUSEHOLD_ID: generate hh_first_record = _n == 1
    histogram HH_SIZE if hh_first_record & HH_SIZE > 0, width(2) scheme(s1mono) fraction ytitle("Proportion of households") xtitle("Number of members")

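The difference between the two units of analysis shows up even in a tiny example. A sketch on made-up data (three invented households with 3, 1 and 2 members); the variable hh_tag is a hypothetical name, and egen's tag() function is shown only as an alternative way to flag one record per household:

```stata
* Toy data: made-up households with 3, 1 and 2 members
clear
input HOUSEHOLD_ID HH_SIZE
101 3
101 3
101 3
102 1
103 2
103 2
end

* Flag one record per household, as on the slide
bysort HOUSEHOLD_ID: generate hh_first_record = _n == 1

* Alternative: tag() also flags one record per distinct HOUSEHOLD_ID
* (not necessarily the same record within the household)
egen hh_tag = tag(HOUSEHOLD_ID)

* Household size as experienced by individuals: mean of 3 3 3 1 2 2
summarize HH_SIZE

* Household size with households as units: mean of 3 1 2
summarize HH_SIZE if hh_first_record
```

The larger household dominates the individual-level mean because it contributes three rows, which is the weighting described above.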
Another approach to plotting trends
• We can plot average household size by year without ‘destroying’ the data with TABLE, REPLACE or COLLAPSE

    bysort YEAR: egen mean_hh_size = mean(HH_SIZE) if HH_SIZE > 0
    bysort YEAR: generate first_in_year = _n == 1
    twoway scatter mean_hh_size YEAR if first_in_year & YEAR >= 1775, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1775(25)1900)

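One thing to watch with this pattern: when egen is given an if condition, the rows that fail the condition get a missing result, and the row flagged as first in the year may be one of them. A possible workaround, sketched on made-up data, is to move the condition inside mean() with cond() so that every row in the year carries the mean. The variable naive_mean is a hypothetical name used only for the comparison.

```stata
* Toy data: one 1800 record has a non-positive HH_SIZE
clear
input YEAR HH_SIZE
1800 -1
1800 4
1800 6
1803 2
1803 4
end

* As on the slide: rows failing the if condition are left missing
bysort YEAR: egen naive_mean = mean(HH_SIZE) if HH_SIZE > 0

* Sketch of an alternative: cond() blanks out non-positive sizes inside
* the mean, so every row in the year receives the result
bysort YEAR: egen mean_hh_size = mean(cond(HH_SIZE > 0, HH_SIZE, .))
bysort YEAR: generate first_in_year = _n == 1

list YEAR HH_SIZE naive_mean mean_hh_size first_in_year, sepby(YEAR)
```

If the excluded row happens to be the flagged one, naive_mean is missing there and that year silently drops out of the scatter plot; mean_hh_size does not have this problem.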
Mean household size of individuals by age

    keep if AGE_IN_SUI > 0 & SEX == 2 & YEAR >= 1789 & HH_SIZE > 0
    bysort AGE_IN_SUI: egen mean_hh_size = mean(HH_SIZE)
    bysort AGE_IN_SUI: generate first_in_age = _n == 1
    twoway scatter mean_hh_size AGE_IN_SUI if first_in_age & AGE_IN_SUI <= 80, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1(5)85) xtitle("Age in sui")
    lowess mean_hh_size AGE_IN_SUI if first_in_age & AGE_IN_SUI <= 80, scheme(s1mono) ytitle("Mean household size of individuals") xlabel(1(5)85) xtitle("Age in sui") msize(small)

Household division: Individuals by next register

    . tab HH_DIVIDE_NEXT if PRESENT & NEXT_3 & HH_DIVIDE_NEXT >= 0

         Number of |
      household in |
          the next |
         available |
          register |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
                 1 |    789,250       94.98       94.98
                 2 |     33,000        3.97       98.95
                 3 |      5,815        0.70       99.65
                 4 |      1,812        0.22       99.87
                 5 |        383        0.05       99.91
                 6 |        314        0.04       99.95
                 7 |        196        0.02       99.98
                 8 |         34        0.00       99.98
                 9 |         82        0.01       99.99
                10 |         86        0.01      100.00
    ---------------+-----------------------------------
             Total |    830,972      100.00

Household division: Households by next register

    . bysort HOUSEHOLD_ID: generate first_in_hh = _n == 1
    . tab HH_DIVIDE_NEXT if PRESENT & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh

         Number of |
      household in |
          the next |
         available |
          register |      Freq.     Percent        Cum.
    ---------------+-----------------------------------
                 1 |    117,317       97.80       97.80
                 2 |      2,287        1.91       99.71
                 3 |        272        0.23       99.94
                 4 |         57        0.05       99.98
                 5 |          8        0.01       99.99
                 6 |          7        0.01      100.00
                 7 |          2        0.00      100.00
                 9 |          1        0.00      100.00
                10 |          1        0.00      100.00
    ---------------+-----------------------------------
             Total |    119,952      100.00

Household division: Example of a simple analysis

    generate byte DIVISION = HH_DIVIDE_NEXT > 1
    generate l_HH_SIZE = ln(HH_SIZE)/ln(1.1)
    logit DIVISION HH_SIZE YEAR if HH_SIZE > 0 & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh
    logit DIVISION l_HH_SIZE YEAR if NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh

    . logit DIVISION HH_SIZE YEAR if HH_SIZE > 0 & NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh

    Iteration 0:   log likelihood = -15419.716
    Iteration 1:   log likelihood = -14310.848
    Iteration 2:   log likelihood = -14127.244
    Iteration 3:   log likelihood = -14126.276
    Iteration 4:   log likelihood = -14126.276

    Logistic regression                               Number of obs   =     132688
                                                      LR chi2(2)      =    2586.88
                                                      Prob > chi2     =     0.0000
    Log likelihood = -14126.276                       Pseudo R2       =     0.0839

    ------------------------------------------------------------------------------
        DIVISION |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         HH_SIZE |   .0882472   .0016549    53.32   0.000     .0850036    .0914908
            YEAR |  -.0122989   .0005941   -20.70   0.000    -.0134633   -.0111345
           _cons |   18.23519   1.087218    16.77   0.000     16.10428     20.3661
    ------------------------------------------------------------------------------

    . logit DIVISION l_HH_SIZE YEAR if NEXT_3 & HH_DIVIDE_NEXT >= 0 & first_in_hh

    Iteration 0:   log likelihood = -15419.716
    Iteration 1:   log likelihood = -13953.268
    Iteration 2:   log likelihood = -13468.077
    Iteration 3:   log likelihood = -13463.036
    Iteration 4:   log likelihood = -13463.032
    Iteration 5:   log likelihood = -13463.032

    Logistic regression                               Number of obs   =     132688
                                                      LR chi2(2)      =    3913.37
                                                      Prob > chi2     =     0.0000
    Log likelihood = -13463.032                       Pseudo R2       =     0.1269

    ------------------------------------------------------------------------------
        DIVISION |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       l_HH_SIZE |   .1341566   .0023316    57.54   0.000     .1295867    .1387265
            YEAR |  -.0130866   .0005775   -22.66   0.000    -.0142185   -.0119547
           _cons |   17.75924   1.048066    16.94   0.000     15.70507    19.81342
    ------------------------------------------------------------------------------

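One way to read the l_HH_SIZE coefficient, assuming the standard logit interpretation with YEAR held fixed: because l_HH_SIZE is the logarithm of household size in base 1.1, a one-unit increase corresponds to a 10% larger household. The worked arithmetic below uses the estimate from the output above; the symbol \ell for l_HH_SIZE is introduced here only for the derivation.

```latex
\begin{align*}
\ell = \frac{\ln(\text{HH\_SIZE})}{\ln(1.1)}
  \quad &\Rightarrow \quad
  \text{multiplying HH\_SIZE by } 1.1 \text{ raises } \ell \text{ by } 1 \\[4pt]
\frac{\text{odds}(1.1 \times \text{HH\_SIZE})}{\text{odds}(\text{HH\_SIZE})}
  &= e^{\hat\beta} = e^{0.1341566} \approx 1.14 \\[4pt]
\frac{\text{odds}(2 \times \text{HH\_SIZE})}{\text{odds}(\text{HH\_SIZE})}
  &= e^{\hat\beta \, \ln 2 / \ln 1.1}
   = e^{0.1341566 \times 7.2725} \approx e^{0.976} \approx 2.65
\end{align*}
```

Under this reading, a 10% larger household has about 14% higher odds of dividing by the next register, and a household twice as large has odds roughly 2.65 times as high.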
Creating household variables
• bysort and egen are your friends
• Use HOUSEHOLD_ID to group observations of the same household in the same register
• Let’s start with a count of the number of live individuals in the household

    bysort HOUSEHOLD_ID: egen new_hh_size = total(PRESENT)

    . corr HH_SIZE new_hh_size if YEAR >= 1789
    (obs=1410354)

                 |  HH_SIZE new_hh~e
    -------------+------------------
         HH_SIZE |   1.0000
     new_hh_size |   1.0000   1.0000

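A small sketch of the same count on made-up data, using assert as a row-by-row check. The toy values are invented; on the real data the check would presumably need the same YEAR >= 1789 restriction as the correlation above, since HH_SIZE is set to missing before 1789.

```stata
* Toy data: PRESENT is 1 for living members, 0 otherwise (made-up values)
clear
input HOUSEHOLD_ID PRESENT HH_SIZE
101 1 2
101 1 2
101 0 2
102 1 1
end

* total() sums PRESENT within each household and copies the sum to every row
bysort HOUSEHOLD_ID: egen new_hh_size = total(PRESENT)

* assert stops with an error if any row disagrees
assert new_hh_size == HH_SIZE
```

A correlation of 1.0000 shows that the two variables move together; assert goes one step further and confirms they are equal in every row.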
Creating measures of age and sex composition of the household

    bysort HOUSEHOLD_ID: egen males_1_15 = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 1 & AGE_IN_SUI <= 15)
    bysort HOUSEHOLD_ID: egen males_16_55 = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 16 & AGE_IN_SUI <= 55)
    bysort HOUSEHOLD_ID: egen males_56_up = total(PRESENT & SEX == 2 & AGE_IN_SUI >= 56)
    bysort HOUSEHOLD_ID: egen females_1_15 = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 1 & AGE_IN_SUI <= 15)
    bysort HOUSEHOLD_ID: egen females_16_55 = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 16 & AGE_IN_SUI <= 55)
    bysort HOUSEHOLD_ID: egen females_56_up = total(PRESENT & SEX == 1 & AGE_IN_SUI >= 56)
    generate hh_dependency_ratio = (males_1_15+males_56_up+females_1_15+females_56_up)/HH_SIZE
    bysort AGE_IN_SUI: generate first_in_age = _n == 1
    bysort AGE_IN_SUI: egen mean_hh_dependency_ratio = mean(hh_dependency_ratio)
    twoway line mean_hh_dependency_ratio AGE_IN_SUI if first_in_age & AGE_IN_SUI >= 16 & AGE_IN_SUI <= 55, scheme(s1mono) ylabel(0(0.1)0.5) xlabel(16(5)55) ytitle("Household dependency ratio (Prop. <= 15 or >= 56 sui)") xtitle("Age in sui")

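If only the ratio is needed and not the six separate counts, the same quantity can be built from a single indicator. A sketch on made-up data; the variables dependent, n_dependent and n_present are hypothetical names, and the denominator is recomputed from PRESENT rather than taken from HH_SIZE so that the toy example is self-contained.

```stata
* Toy data: two made-up households (SEX: 1 = female, 2 = male)
clear
input HOUSEHOLD_ID PRESENT SEX AGE_IN_SUI
101 1 2 40
101 1 1 38
101 1 2 10
101 1 1 60
102 1 2 30
102 0 1 70
end

* Hypothetical helper: 1 for present members aged 1-15 or 56 and over
generate byte dependent = PRESENT & (inrange(AGE_IN_SUI, 1, 15) | (AGE_IN_SUI >= 56 & !missing(AGE_IN_SUI)))

* Household-level counts, copied to every member's row
bysort HOUSEHOLD_ID: egen n_dependent = total(dependent)
bysort HOUSEHOLD_ID: egen n_present = total(PRESENT)

generate hh_dependency_ratio = n_dependent / n_present
list, sepby(HOUSEHOLD_ID)
```

Household 101 has two dependents among four present members (ratio 0.5); in household 102 the 70-year-old is not present, so the ratio is 0.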
Numbers of individuals who co-reside with someone who holds a position

    . bysort HOUSEHOLD_ID: egen position_in_hh = total(PRESENT & HAS_POSITION > 0)
    . tab position_in_hh if PRESENT & YEAR >= 1789

    position_in |
            _hh |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  1,177,575       90.23       90.23
              1 |     87,517        6.71       96.94
              2 |     24,204        1.85       98.79
              3 |      8,019        0.61       99.41
              4 |      4,893        0.37       99.78
              5 |      1,712        0.13       99.91
              6 |        651        0.05       99.96
              7 |        241        0.02       99.98
              8 |        136        0.01       99.99
              9 |        101        0.01      100.00
    ------------+-----------------------------------
          Total |  1,305,049      100.00

    . replace position_in_hh = position_in_hh > 0
    (49183 real changes made)

    . tab position_in_hh if PRESENT & YEAR >= 1789

    position_in |
            _hh |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |  1,177,575       90.23       90.23
              1 |    127,474        9.77      100.00
    ------------+-----------------------------------
          Total |  1,305,049      100.00

RELATIONSHIP
• String that describes the relationship of the individual to the head of the household
  • Before 1789, describes relationship to head of yihu
• This is the basis of our kinship linkage
  • Automated linkage of children to their parents
  • Automated linkage of wives to their husbands
  • All based on processing of strings describing relationship

RELATIONSHIP: Core
• e is household head
• w is household head’s wife
• m is household head’s mother
• f is household head’s father (usually dead)
• 1yb, 2yb, 2ob etc. are head’s brothers
  • Older brothers of the head are unusual
• 1yz, 2yz, 2oz etc. are head’s unmarried sisters
• 1s, 2s, etc. are head’s sons
• 1d, 2d, etc. are the head’s unmarried daughters

RELATIONSHIP: Combining codes
• More distant relationships are built up from these core relationships by combining them
• Examples
  • ff is grandfather of head
  • fm is grandmother of head
  • f2yb is an uncle: father’s second younger brother
  • f2ybw is his wife
  • f2yb1s is a cousin: father’s 2nd younger brother’s 1st son
  • 3yb2s is a nephew: 3rd younger brother’s 2nd son
  • 3s2s is a grandson: 3rd son’s 2nd son
  • 3s2sw is his wife

RELATIONSHIP: Linking wives to husbands
• Strip the w off a married woman’s relationship and search the household for the remaining string
  • f2yb1sw -> search for f2yb1s
• Exceptions
  • For w, search for e
  • For m, search for f
  • For fm, search for ff
  • Etc.
• Basically, prepare a target string, and then use merge on HOUSEHOLD_ID and the target

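The target-string approach can be sketched as follows on a made-up household. This is an illustration under stated assumptions, not the project's actual linkage code: the variables target and HUSBAND_ID are hypothetical names, only the exceptions listed above are handled, and the lookup assumes each RELATIONSHIP string occurs at most once per household.

```stata
* Toy data: one made-up household
clear
input HOUSEHOLD_ID PERSON_ID str10 RELATIONSHIP
101 1 "e"
101 2 "w"
101 3 "f2yb1s"
101 4 "f2yb1sw"
101 5 "m"
end

* Target string: the RELATIONSHIP the husband would have
generate str10 target = ""
replace target = substr(RELATIONSHIP, 1, length(RELATIONSHIP) - 1) if substr(RELATIONSHIP, -1, 1) == "w"
* Exceptions from the slide
replace target = "e" if RELATIONSHIP == "w"
replace target = "f" if RELATIONSHIP == "m"
replace target = "ff" if RELATIONSHIP == "fm"

* Lookup file: every person, keyed by household and own RELATIONSHIP
preserve
keep HOUSEHOLD_ID PERSON_ID RELATIONSHIP
rename PERSON_ID HUSBAND_ID
rename RELATIONSHIP target
tempfile husbands
save `husbands'
restore

* Attach the husband's ID wherever the target is found in the household
merge m:1 HOUSEHOLD_ID target using `husbands', keep(master match) nogenerate
sort PERSON_ID
list PERSON_ID RELATIONSHIP target HUSBAND_ID
```

In this toy household the head's wife links to the head and the cousin's wife links to the cousin, while the head's mother stays unlinked because no record with relationship f is present.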
RELATIONSHIP: Linking children to fathers
• In most cases, strip off the last relationship code and look for the remainder
  • 1s1s -> look for 1s
  • ff2yb3s2s -> look for ff2yb3s
• Exceptions
  • e -> look for f
  • 2yb -> look for f
  • f2yb -> look for ff
• To link married women to their fathers-in-law, strip off w first, then convert to father’s relationship

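A rough sketch of how a father's target string might be derived, using only the examples on the slide. The generalization is an assumption: children (codes ending in s or d) look for the stripped remainder, or the head when nothing remains, and siblings (codes ending in b or z) look for the remainder plus f. The variables stem, last and father are hypothetical names, and codes not covered by these rules are left blank.

```stata
* Toy data: the relationship strings used as examples on the slide
clear
input str12 RELATIONSHIP
"1s1s"
"ff2yb3s2s"
"e"
"2yb"
"f2yb"
"3s"
end

* Strip the last code (digits followed by letters), e.g. "2s" or "2yb"
generate str12 stem = regexr(RELATIONSHIP, "[0-9]+[a-z]+$", "")
generate str1 last = substr(RELATIONSHIP, -1, 1)

generate str12 father = ""
* Children: father is the remainder; an empty remainder means the head
replace father = cond(stem == "", "e", stem) if inlist(last, "s", "d")
* Siblings: father is the father of the person they are a sibling of
replace father = stem + "f" if inlist(last, "b", "z")
* The head's own father
replace father = "f" if RELATIONSHIP == "e"

list RELATIONSHIP stem father
```

The resulting father string plays the same role as the target string in the wife-to-husband linkage: it can be matched against RELATIONSHIP within the household.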
RELATIONSHIP: Indicators of specific basic relationships to head

    generate head = RELATIONSHIP == "e"
    generate head_wife = RELATIONSHIP == "w"
    generate mother = RELATIONSHIP == "m"
    generate father = RELATIONSHIP == "f"

    . tab head SEX if PRESENT & SEX >= 1, row col

    +-------------------+
    | Key               |
    |-------------------|
    |     frequency     |
    |  row percentage   |
    | column percentage |
    +-------------------+

               |          Sex
          head |    Female       Male |     Total
    -----------+----------------------+----------
             0 |   539,935    671,972 | 1,211,907
               |     44.55      55.45 |    100.00
               |     98.69      78.90 |     86.64
    -----------+----------------------+----------
             1 |     7,148    179,658 |   186,806
               |      3.83      96.17 |    100.00
               |      1.31      21.10 |     13.36
    -----------+----------------------+----------
         Total |   547,083    851,630 | 1,398,713
               |     39.11      60.89 |    100.00
               |    100.00     100.00 |    100.00

RELATIONSHIP: Processing for distant relationships
• Strip out numbers, the seniority modifiers y and o, the wife marker w, etc.
• In a .do file, this will create a new variable with a stripped relationship

    generate new_RELATIONSHIP = RELATIONSHIP
    local for_removal "1 2 3 4 5 6 7 8 9 o y w"
    foreach x of local for_removal {
        replace new_RELATIONSHIP = subinstr(new_RELATIONSHIP,"`x'","",.)
    }

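Running the stripping loop on a few of the example codes from the earlier slides shows what the stripped variable looks like. A small sketch on made-up rows:

```stata
* Toy data: relationship strings taken from the earlier examples
clear
input str12 RELATIONSHIP
"f2yb1s"
"3yb2s"
"3s2sw"
"2ob"
"e"
end

* Same loop as on the slide: remove digits, o, y and w
generate new_RELATIONSHIP = RELATIONSHIP
local for_removal "1 2 3 4 5 6 7 8 9 o y w"
foreach x of local for_removal {
    replace new_RELATIONSHIP = subinstr(new_RELATIONSHIP, "`x'", "", .)
}

* f2yb1s -> fbs, 3yb2s -> bs, 3s2sw -> ss, 2ob -> b, e -> e
list
```

Because w is removed, a wife ends up with the same stripped code as her husband (3s2sw becomes ss), so SEX and marital status are needed to tell them apart.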
    generate brother = new_RELATIONSHIP == "b" & SEX == 2
    generate brothers_wife = new_RELATIONSHIP == "b" & SEX == 1 & MARITAL_STATUS != 2 & MARITAL_STATUS > 0
    generate sister = new_RELATIONSHIP == "z" & SEX == 1
    generate male_cousin = new_RELATIONSHIP == "fbs" & SEX == 2
    generate nephew = new_RELATIONSHIP == "bs" & SEX == 2

Proportions of different relationships by age

    generate brother = new_RELATIONSHIP == "b"
    bysort AGE_IN_SUI: egen males = total(SEX == 2 & PRESENT)
    bysort AGE_IN_SUI: egen brothers = total(SEX == 2 & brother & PRESENT)
    generate proportion_brothers = brothers/males
    by AGE_IN_SUI: generate first_in_age = _n == 1
    twoway line proportion_brothers AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are brother of a head") scheme(s1mono)

    bysort AGE_IN_SUI: egen heads = total(SEX == 2 & RELATIONSHIP == "e" & PRESENT)
    generate proportion_heads = heads/males
    twoway line proportion_heads AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are household head") scheme(s1mono)

    bysort AGE_IN_SUI: egen sons = total(SEX == 2 & new_RELATIONSHIP == "s" & PRESENT)
    generate proportion_sons = sons/males
    twoway line proportion_sons AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion of males who are son of a head") scheme(s1mono)

Relationship at first appearance

    bysort PERSON_ID (YEAR): generate fa_nephew = new_RELATIONSHIP[1] == "bs" & AGE[1] <= 10 & SEX == 2 & PRESENT
    bysort PERSON_ID (YEAR): generate fa_son = new_RELATIONSHIP[1] == "s" & AGE[1] <= 10 & SEX == 2 & PRESENT
    generate fa_nephew_head = fa_nephew & head
    generate fa_son_head = fa_son & head
    bysort AGE_IN_SUI: egen fa_sons = total(fa_son)
    bysort AGE_IN_SUI: egen fa_nephews = total(fa_nephew)
    bysort AGE_IN_SUI: egen fa_sons_head = total(fa_son_head)
    bysort AGE_IN_SUI: egen fa_nephews_head = total(fa_nephew_head)
    generate p_fa_sons_head = fa_sons_head/fa_sons
    generate p_fa_nephews_head = fa_nephews_head/fa_nephews
    twoway line p_fa_sons_head p_fa_nephews_head AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion") scheme(s1mono)
    twoway line p_fa_sons_head p_fa_nephews_head AGE_IN_SUI if AGE_IN_SUI >= 1 & AGE_IN_SUI <= 80 & first_in_age, ytitle("Proportion now head") scheme(s1mono) legend(order(1 "Appeared as sons of head" 2 "Appeared as nephews of head"))

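The key device here is the explicit subscript [1] inside a bysort group that is ordered by YEAR: it reads the value from each person's earliest record and applies it to all of that person's rows. A sketch on made-up data; it uses AGE_IN_SUI for the age at first appearance, which is an assumption about the intended variable, and omits the SEX and PRESENT conditions for brevity.

```stata
* Toy data: person 1 first appears as a nephew and later becomes head;
* person 2 first appears as a son (all values made up)
clear
input PERSON_ID YEAR str6 new_RELATIONSHIP AGE_IN_SUI
1 1830 "e" 35
1 1800 "bs" 5
1 1803 "bs" 8
2 1803 "s" 6
2 1800 "s" 3
end

* Within each person, sorted by YEAR, [1] refers to the earliest record
bysort PERSON_ID (YEAR): generate fa_nephew = new_RELATIONSHIP[1] == "bs" & AGE_IN_SUI[1] <= 10
bysort PERSON_ID (YEAR): generate fa_son = new_RELATIONSHIP[1] == "s" & AGE_IN_SUI[1] <= 10

list, sepby(PERSON_ID)
```

Person 1 keeps fa_nephew equal to 1 in the 1830 record, where the current relationship is e, which is what allows later headship to be related to status at first appearance.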