No, I haven’t decided to put all my thoughts on the project in Dick Cheney’s undisclosed location ™. I just haven’t had time to focus on this. When I was a grad student, I could just fiddle with stuff until 2 or 3 in the morning, but that doesn’t work anymore. Something about age, having teenagers in the house, being overcommitted, etc…
I thought I was so straightforward and careful, going through the 2700 records of the dataset for participants who were born in July 1963 or later (i.e., plausibly in 8th or 9th grade), adding variables for each year to recode the attendance, grade, and graduation to 1s and 0s, and then adding summary variables to indicate total years of observed attendance, regular high school graduation sometime in the first decade of the study, retentions in K-12 grades, and retentions specifically in 9th grade.
Then I filtered the set for only those participants who were in 9th grade for the first time in 1980 (i.e., in 8th in 1979 and 9th in 1980). And when I divided those participants into graduates vs. nongraduates, the average number of years with observed attendance fit roughly with what I expected: an average of 2.5 observed attendance years for nongraduates, an average of almost precisely 4 observed years of attendance for graduates, with the number of years distributed widely for nongraduates and the vast majority of graduates attending for 4 years. Wonderful! Great! Good step forward to the simulation process!
And then I thought I’d be clever and figure out how many participants had left school and then returned. So I calculated “return” variables for each year after 1979, summed them up, and then realized the problem: about 85% of participants “returned,” at least by this method.
No, that’s not what happened. I realized instantly that this was an artifact of a flawed assumption I had made a few steps above: recoding a year of survey nonparticipation as nonattendance. It’s easy for someone to skip a year of participation, be counted as nonattending, and then come back into participation and have an artifactual “return.” That sounds like a “so what? big deal” situation, except that it also screws up my estimates of years of attendance, because I am artificially deflating the number of years (or cycles) attending by counting survey nonparticipation as nonattendance. Shoot shoot shoot.
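A minimal sketch of the artifact, in Python rather than R. It assumes a made-up six-year record and a hypothetical coding scheme (1 = attended, 0 = did not attend, None = no interview that year), not the actual NLSY79 codes. Counting a skipped interview as nonattendance manufactures a "return" and pins the attendance count at its lowest possible value.

```python
def count_returns(history, missing_as_nonattendance):
    """Count transitions from not attending to attending."""
    returns = 0
    previous = None
    for status in history:
        if status is None:
            if not missing_as_nonattendance:
                continue  # leave the unobserved year out of the comparison
            status = 0    # the flawed recode: no interview becomes nonattendance
        if previous == 0 and status == 1:
            returns += 1
        previous = status
    return returns


def attendance_bounds(history):
    """Lowest and highest attendance counts consistent with the record."""
    observed = sum(1 for s in history if s == 1)
    unknown = sum(1 for s in history if s is None)
    return observed, observed + unknown


# Hypothetical participant: attends, skips one interview, attends twice, then leaves.
history = [1, None, 1, 1, 0, 0]

naive_returns = count_returns(history, missing_as_nonattendance=True)
careful_returns = count_returns(history, missing_as_nonattendance=False)
low_years, high_years = attendance_bounds(history)
print(naive_returns, careful_returns, low_years, high_years)
```

Skipping the unobserved year is only one possible treatment and does not solve the censoring problem. It just shows that the "return" comes from the recode rather than from the participant.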
This is a standard problem of data censoring with nonparticipation: both right censoring (censoring at the right end of the study timeline) from study attrition, and censoring in the middle of the timeline, when people who participate at the beginning of the survey and again in later years skip a year or more in between. But it creates some interesting problems and requires that I reread the NLSY79 interview protocols so I know how to interpret the vast majority of missing-variable codes (valid skips and skip-interview codes).
Late last week I received the exemption letter and downloaded the dataset for NLSY79 from the NLS Investigator.
Next step: familiarize myself with the sampling codes for R (an open-source statistical package similar to S-Plus). I wish R had been around when I was in grad school, because in several ways it’s simpler than either SAS or SPSS. Any loss of speed is largely irrelevant to those of us who do piddling social-science statistics. I’ve used R for a few pilot things, but this and another project I need to finish up will be the first two projects I’ll be using R for in a more intensive way, and it’s time to brush up on a new language.
I sent in the request for an IRB exemption certification for this pilot project, because secondary analysis of existing longitudinal data sets would not give me any identifiable information. There is an option for the NLSY79 to acquire geocoding (which requires permission from the Bureau of Labor Statistics as well as an IRB form), but that’s not needed for this particular project. We’ll see how long the response takes!
In odd moments in the last few weeks, I’ve been playing around with a standard demographic concept, the stationary population model. This is one of those things that don’t really exist in reality, a population with constant mortality and fertility rates, with no migration in or out, and where the population is the same every year (no natural increase). In essence, a stationary population model is like a stripped-down car, something with all the extras out of the way so you can look at the engine while it’s running. The question I’ve had is, if one looks at a stationary population model of high school, what can one say about a high school if one observes the total enrollment, the ninth-grade enrollment, the number of graduates, and the distribution of graduates by years in high school?
A few minutes of scribbling shows that the crude graduation rate (or the number of graduates divided by the total enrollment) is equal to the probability of graduating times the rate of new ninth graders entering every year. The probability of graduating and the number of new ninth graders are both interesting and unobserved quantities. Unfortunately, they’re also dependent on a crucial third unobserved quantity, the difference between the entering-ninth-grade rate and the proportion of the high school in ninth grade. (One way of interpreting this is the overestimate of entering 9th graders. Another interpretation is the proportion of total school life experienced in repeating ninth grade.)
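The identity can be checked with a little arithmetic. The sketch below is in Python and uses invented numbers for a single stationary high school. In particular, the count of true new ninth graders is assumed known here, although it is exactly the quantity that published data do not show.

```python
# Invented counts for one stationary high school (assumptions, not data).
total_enrollment = 1000
ninth_grade_enrollment = 300   # observed: includes students repeating ninth grade
new_ninth_graders = 270        # unobserved in practice: first-time ninth graders
graduates = 200                # observed: graduates per year

crude_graduation_rate = graduates / total_enrollment
entering_rate = new_ninth_graders / total_enrollment
graduation_probability = graduates / new_ninth_graders

# Crude graduation rate = probability of graduating x entering-ninth-grade rate.
product = graduation_probability * entering_rate

# The third unobserved quantity: the gap between the ninth-grade proportion
# and the entering rate, i.e. the share of enrollment repeating ninth grade.
ninth_grade_proportion = ninth_grade_enrollment / total_enrollment
repeat_share = ninth_grade_proportion - entering_rate

print(crude_graduation_rate, product, repeat_share)
```

With these numbers the ninth-grade proportion (0.30) overstates the entering rate (0.27) by 0.03, which is why the graduation probability cannot be read straight off the observed counts.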
Because my life is now booked, I’ve only spent odd moments away from a computer on this exercise, but the obvious next step is to generate some simulated stationary populations (e.g., bootstrap samples of the National Longitudinal Survey of Youth 1979, constrained to conform to a range of graduation probabilities) and then look for regularities in the relationships between the underlying population measures and what would normally be observed from published data. Given the inherent constraints on the true value of the entering-ninth-grade rate (between 0.25 and the observed ninth-grade proportion), and a few other things, I suspect that regularities exist.
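As a rough illustration of what such a simulation might look like, here is a minimal Python sketch. It does not bootstrap NLSY79. Instead it draws each entrant's years enrolled and graduation outcome from a made-up distribution (the `OUTCOMES` table is purely hypothetical), feeds equal-sized cohorts into a school every year, and then compares the observable rates with the underlying ones. It does not track grade level, so ninth-grade repetition is not modeled separately.

```python
import random

# Hypothetical (years_enrolled, graduated) outcomes with invented weights.
# A real run would resample survey participants instead.
OUTCOMES = [((4, True), 0.65), ((5, True), 0.08), ((1, False), 0.07),
            ((2, False), 0.08), ((3, False), 0.07), ((4, False), 0.05)]


def simulate(entrants_per_year=1000, years=30, burn_in=10, seed=0):
    rng = random.Random(seed)
    choices = [outcome for outcome, _ in OUTCOMES]
    weights = [weight for _, weight in OUTCOMES]
    max_years = max(duration for duration, _ in choices)
    enrolled = [0] * (years + max_years)
    graduates = [0] * (years + max_years)
    for start in range(years):
        cohort = rng.choices(choices, weights=weights, k=entrants_per_year)
        for duration, graduated in cohort:
            for t in range(start, start + duration):
                enrolled[t] += 1
            if graduated:
                graduates[start + duration - 1] += 1
    # Only use years after the school has filled up and before entry stops.
    window = range(burn_in, years)
    mean_enrollment = sum(enrolled[t] for t in window) / len(window)
    mean_graduates = sum(graduates[t] for t in window) / len(window)
    return {
        "entering_rate": entrants_per_year / mean_enrollment,
        "crude_graduation_rate": mean_graduates / mean_enrollment,
    }


result = simulate()
true_grad_probability = sum(w for (_, grad), w in OUTCOMES if grad)
true_mean_years = sum(d * w for (d, _), w in OUTCOMES)
print(result, true_grad_probability, true_mean_years)
```

Under these assumptions the simulated crude graduation rate should sit close to the graduation probability times the entering rate, and the entering rate should sit close to one over the mean years enrolled. Sampling noise keeps either from matching exactly.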
Then the next step is to move on to a stable population model, where you relax the zero-growth assumption and assume a constant growth rate. That’s important because school populations do not remain constant. (Neither does growth remain constant, but a stable population model introduces one level of complexity, and it’s loads easier to understand than the full-blown, “let the population do what it wants to” model.) The problem here is that one crucial number in a stable population model is a term that normally corresponds to the mean length of a generation. This has no clear interpretation in a model of high school enrollment, so that’s an interesting hurdle.
Incidentally, if anyone wants to jump ahead of me on this research program, feel free to dive in. The water’s fine, I’m not likely to follow up for some months, and there are some interesting payoffs. Among other things, in a stationary population model, the product of life expectancy at birth and the birth rate is always one. In the school parallel with a stationary population model, if you multiply the entering-ninth-grade rate by the average time spent in high school, you will always get one. From there and the data on graduates, it’s simple to calculate the average time spent in high school by those who eventually drop out.
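The last calculation can be sketched in a few lines of Python. The function name and every number below are invented for illustration, and the sketch assumes the true count of new ninth graders is known, which in practice it is not. The average time in high school for everyone is one over the entering rate, and it is also a weighted average of the graduates' and dropouts' times, so the dropouts' average falls out by subtraction.

```python
def dropout_mean_years(total_enrollment, new_ninth_graders, graduates,
                       graduate_mean_years):
    """Average years in high school for eventual dropouts, stationary case."""
    entering_rate = new_ninth_graders / total_enrollment
    mean_years_all = 1 / entering_rate          # entering rate x mean years = 1
    grad_probability = graduates / new_ninth_graders
    # mean_years_all = p * graduate_mean_years + (1 - p) * dropout_mean_years
    return ((mean_years_all - grad_probability * graduate_mean_years)
            / (1 - grad_probability))


# Invented example: 1000 enrolled, 270 new ninth graders and 200 graduates
# per year, graduates averaging 4.1 years in high school.
estimate = dropout_mean_years(1000, 270, 200, 4.1)
print(round(estimate, 2))
```

The same result can be read as leftover enrollment divided by the number of dropouts per year: (1000 - 200 × 4.1) / (270 - 200).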
Originally published on Sherman Dorn’s independent blog.