Sample Design
The primary sample of individuals for all Early Indicators data
sets was drawn using a one-stage cluster sampling procedure. For a target
sample size of 40,000 individuals, 303 Union Army companies
were chosen
randomly from the "Regimental Books" stored at the National Archives in
Washington, D.C. These books contain the records of over 20,000
companies. They were created by clerks during the Civil War and provide
the name, birth place, rank, personal description, age at enlistment, and
place of enlistment for the recruits in each company. These identifying
variables are crucial in linking the recruits to other historical
documents. All recruits in the selected companies were entered into the
sample.
The sample was restricted to white volunteer infantry regiments.
Commissioned officers, black recruits, and other branches of military
service were not sampled. Preliminary work done for the Early
Indicators project's grant proposal indicated that the sample was
representative of the contemporary white male population who served in
the Union Army. Because a large proportion of white men of military age
served in the Union Army, careful use of the sample should allow
extrapolation to the entire Northern white male population of military
age in the early 1860s.
The starting point for the construction of the life-cycle sample was the drawing of a sample of white
recruits who were mustered into the Union Army. During 1981 such a sample was randomly drawn from the
surviving regimental records of the Union Army at the National Archives in Washington. The technique employed
was a one-stage cluster sampling procedure. As is well known, a cluster sampling procedure does not bias the
estimates of the parameters of the population being sampled, but it makes the sample variance larger than it would
be in a sample based on the individual recruits (Cochran 1953). However, a sample based on companies has three
advantages over one based on individual recruits. First, since the principal objective of this project is not point or
interval estimation of means or comparable descriptive statistics, but multivariate analysis of the relationship
between factors inducing early age stress and variables reflecting middle and late age health and behavior,
moderately increased variance in the sample is an asset rather than a liability. Second, sampling by companies
rather than individuals greatly reduces the cost of linking individuals to other military records and to the pension
records. Third, a sample of companies makes it possible to separate company effects of exposure to military stress
from individual effects.
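The variance cost of clustering described above can be illustrated with the commonly used design-effect approximation, deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation of the variable of interest. The following is a minimal sketch, not a calculation from the project: the average company size of 120 and the rho values are assumptions chosen for illustration, and the formula itself assumes clusters of roughly equal size.

```python
def design_effect(avg_cluster_size: float, icc: float) -> float:
    """Approximate design effect for one-stage cluster sampling.

    deff = 1 + (m - 1) * rho, with m the average cluster size and
    rho the intraclass correlation (assumes roughly equal cluster sizes).
    """
    return 1.0 + (avg_cluster_size - 1.0) * icc


def effective_sample_size(n: int, deff: float) -> float:
    """Size of a simple random sample with the same variance."""
    return n / deff


AVG_COMPANY_SIZE = 120  # assumption: illustrative recruits per company
TARGET_N = 40_000       # target sample size stated in the text

# The rho values below are hypothetical; the source reports none.
for icc in (0.0, 0.01, 0.05):
    deff = design_effect(AVG_COMPANY_SIZE, icc)
    n_eff = effective_sample_size(TARGET_N, deff)
    print(f"rho={icc:.2f}  deff={deff:.2f}  effective n={n_eff:,.0f}")
```

When rho is zero the clustered sample is as precise as a simple random sample of the same size; the larger rho is for a given variable, the more the effective sample size shrinks.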
The sampling frame was the set of all companies in the complete list of white regiments and other
independent organizations presented in Dyer (1908). A number was assigned to each of the more than 20,000
companies and these numbers were arranged in the order in which they were drawn from a random number
generator. The descriptive books of the regiments containing the designated companies were requested from the
National Archives in the order that they were drawn. If a particular book had not survived, the book corresponding
to the next random number was called. Once a book was obtained, all of the information on all the recruits in the
designated company was typed into a portable terminal with storage capacity for a day's work (about 400
observations). At the end of the day all of the information in the terminal was transmitted to the computer at
Chicago where it was cleaned, coded, and organized into working files. This process was continued until a sample
of about 40,000 recruits was obtained.
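The drawing procedure described above (random ordering of company numbers, skipping companies whose books did not survive, stopping once the target is reached) can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the project's actual program: the function name `draw_companies`, the toy frame of 200 companies, the survival rule, and the uniform company size of 100 are all hypothetical.

```python
import random


def draw_companies(company_ids, surviving, company_size, target, seed=0):
    """Take companies in random order until `target` recruits are reached.

    company_ids  : iterable of company numbers (the sampling frame)
    surviving    : set of company numbers whose descriptive book survives
    company_size : dict mapping company number -> number of recruits
    """
    rng = random.Random(seed)
    order = list(company_ids)
    rng.shuffle(order)  # the order in which numbers are "drawn"

    selected, total = [], 0
    for cid in order:
        if total >= target:
            break
        if cid not in surviving:
            continue  # book missing: move on to the next random number
        selected.append(cid)
        total += company_size[cid]  # every recruit in the company is taken
    return selected, total


# Hypothetical toy frame: 200 companies, every fifth book lost,
# 100 recruits per company.
frame = range(1, 201)
surviving = {c for c in frame if c % 5 != 0}
sizes = {c: 100 for c in frame}

selected, total = draw_companies(frame, surviving, sizes, target=1000)
print(len(selected), total)
```

Because whole companies are entered, the final sample size can overshoot the target by up to one company, which is consistent with the text describing the target as "about 40,000".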
This work yielded 331 companies in 284 regiments. Thus about 11 percent of the regiments
and other independent organizations are represented, covering all of the states from which the Union Army
recruited white troops except Rhode Island. The 39,616 individuals are a 1.6 percent random sample of all whites
mustered into the Union Army (Dyer 1908).
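As a rough consistency check, the totals implied by these percentages can be backed out with simple arithmetic. The sketch below uses only the figures quoted in the text; because "about 11 percent" and "1.6 percent" are rounded, the implied totals are approximations, not counts taken from Dyer (1908).

```python
companies = 331     # companies in the sample (from the text)
regiments = 284     # regiments represented (from the text)
recruits = 39_616   # individuals in the sample (from the text)

regiment_share = 0.11   # "about 11 percent" (rounded in the source)
recruit_share = 0.016   # "1.6 percent" (rounded in the source)

# Totals implied by the rounded shares; approximate by construction.
implied_regiments = regiments / regiment_share
implied_population = recruits / recruit_share
avg_company_size = recruits / companies

print(f"implied regiments and organizations: {implied_regiments:,.0f}")
print(f"implied white recruits mustered:     {implied_population:,.0f}")
print(f"average recruits per company:        {avg_company_size:.1f}")
```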
Table 3 presents a number of statistics that can be used to assess how representative the recruits sample is
of the Union Army. Lines 1-6 compare estimates of some key behavioral characteristics. In each of these
comparisons the difference between the sample estimate and the figure obtained from the aggregate source is less
than one percent (varying between 1 and 9 per thousand). Lines 7-10, which compare the geographic distribution in
the recruits sample and in the aggregate source, show that the North Central region is somewhat overrepresented
and New England is somewhat underrepresented. This was due to the differences in the proportion of the
regiments in the two regions whose descriptive books were deposited in the National Archives. The issue could be
addressed either by postweighting or by adding additional New England companies (chosen by a random
procedure) to the recruit sample. However, for the multivariate procedures currently contemplated, the size of the
New England subsample is adequate. Various experiments with postweighting produced results that were virtually
the same as those obtained with the internal weights, a finding anticipated by the closeness of the statistics computed from the recruit
sample to those in the aggregate sources reported in lines 1-6 of Table 3.
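The postweighting mentioned above can be illustrated with a minimal poststratification sketch: each region's weight is its share in the aggregate source divided by its share in the sample. The region counts and population shares below are invented for illustration and are not the values in Table 3; the function name is likewise hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per stratum = population share / sample share."""
    n = sum(sample_counts.values())
    weights = {}
    for region, count in sample_counts.items():
        sample_share = count / n
        weights[region] = population_shares[region] / sample_share
    return weights


# Hypothetical figures, NOT taken from Table 3.
sample_counts = {"New England": 100, "North Central": 500, "Other": 400}
population_shares = {"New England": 0.2, "North Central": 0.4, "Other": 0.4}

weights = poststratification_weights(sample_counts, population_shares)
for region, w in weights.items():
    print(f"{region}: weight {w:.2f}")

# The weighted counts reproduce the original sample size.
weighted_n = sum(sample_counts[r] * weights[r] for r in sample_counts)
print(f"weighted n = {weighted_n:.0f}")
```

An underrepresented region receives a weight above 1 and an overrepresented region a weight below 1; when sample and population shares are already close, the weights stay near 1 and weighted results differ little from unweighted ones.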