Sample Design

 

The primary sample of individuals for all Early Indicators data sets was drawn using a one-stage cluster sampling procedure. For a target sample size of 40,000 individuals, 303 Union Army companies were chosen randomly from the "Regimental Books" stored at the National Archives in Washington, D.C. These books contain the records of over 20,000 companies. They were created by clerks during the Civil War and provide the name, birthplace, rank, personal description, age at enlistment, and place of enlistment for the recruits in each company. These identifying variables are crucial in linking the recruits to other historical documents. All recruits in the selected companies were entered into the sample.

The sample was restricted to white volunteer infantry regiments. Commissioned officers, black recruits, and other branches of military service were not sampled. Preliminary work done for the Early Indicators project's grant proposal indicated that the sample was representative of the contemporary white male population who served in the Union Army. Because a large proportion of white men of military age served in the Union Army, careful use of the sample should allow extrapolation to the entire Northern white male population of military age in the early 1860s.

The starting point for the construction of the life-cycle sample was the drawing of a sample of white recruits who were mustered into the Union Army. During 1981 such a sample was randomly drawn from the surviving regimental records of the Union Army at the National Archives in Washington. The technique employed was a one-stage cluster sampling procedure. As is well known, a cluster sampling procedure does not bias the estimates of the parameters of the population being sampled, but it makes the sample variance larger than it would be in a sample based on the individual recruits (Cochran 1953). However, a sample based on companies has three advantages over one based on individual recruits. First, since the principal objective of this project is not point or interval estimation of means or comparable descriptive statistics but multivariate analysis of the relationship between factors inducing early-age stress and variables reflecting middle- and late-age health and behavior, moderately increased variance in the sample is an asset rather than a liability. Second, sampling by companies rather than individuals greatly reduces the cost of linking individuals to other military records and to the pension records. Third, a sample of companies makes it possible to separate company effects of exposure to military stress from individual effects.
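The variance cost of cluster sampling noted above can be gauged with the standard design-effect approximation, deff = 1 + (m - 1) * rho, where m is the average cluster size and rho is the intraclass correlation of the variable of interest. The sketch below is illustrative only: the cluster size of 120 recruits per company is a round figure, and the value of rho is a hypothetical assumption, not an estimate from the project.

```python
def design_effect(mean_cluster_size, intraclass_corr):
    """Approximate variance inflation of a cluster sample relative to a
    simple random sample of the same size: 1 + (m - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * intraclass_corr


# Illustrative inputs (assumptions, not project estimates).
m = 120.0          # round figure for recruits per sampled company
rho = 0.01         # hypothetical intraclass correlation within a company
n = 40_000         # target number of recruits

deff = design_effect(m, rho)
effective_n = n / deff   # size of a simple random sample with equal precision

print(f"design effect: {deff:.2f}")
print(f"effective sample size: {effective_n:.0f}")
```

Under these assumed inputs, even a small within-company correlation roughly doubles the variance of a sample mean, which is why the gain or loss from clustering depends on the kind of analysis planned.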

The sampling frame was the set of all companies in the complete list of white regiments and other independent organizations presented in Dyer (1908). A number was assigned to each of the more than 20,000 companies, and these numbers were arranged in the order in which they were drawn from a random number generator. The descriptive books of the regiments containing the designated companies were requested from the National Archives in the order in which they were drawn. If a particular book had not survived, the book corresponding to the next random number was requested. Once a book was obtained, all of the information on all the recruits in the designated company was typed into a portable terminal with storage capacity for a day's work (about 400 observations). At the end of the day all of the information in the terminal was transmitted to the computer at Chicago, where it was cleaned, coded, and organized into working files. This process was continued until a sample of about 40,000 recruits was obtained.
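The drawing procedure described above can be sketched in a few lines of code. This is a minimal illustration, not the project's actual program: the function name, the company sizes, and the set of surviving books are all hypothetical, and Python's `random` module stands in for whatever random number generator was used in 1981.

```python
import random


def draw_cluster_sample(company_sizes, surviving, target, seed=0):
    """One-stage cluster sample of companies.

    company_sizes: dict of company number -> number of recruits (hypothetical)
    surviving:     set of company numbers whose descriptive book survives
    target:        stop once at least this many recruits are collected
    """
    rng = random.Random(seed)
    order = list(company_sizes)   # the sampling frame
    rng.shuffle(order)            # companies in random order of draw

    selected, total = [], 0
    for company in order:
        if total >= target:
            break
        if company not in surviving:
            continue              # book lost: go to the next random number
        selected.append(company)  # every recruit in the company is taken
        total += company_sizes[company]
    return selected, total


# Made-up frame: 20,000 companies of 100 to 139 recruits each,
# with one book in five assumed lost.
frame = {i: 100 + (i % 40) for i in range(20_000)}
surviving = {i for i in frame if i % 5 != 0}

selected, total = draw_cluster_sample(frame, surviving, target=40_000)
print(len(selected), "companies,", total, "recruits")
```

Because whole companies are taken, the final count overshoots the target by at most one company's worth of recruits.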

This work yielded 331 companies in 284 regiments. Thus about 11 percent of the regiments and other independent organizations are represented, covering all of the states from which the Union Army recruited white troops except Rhode Island. The 39,616 individuals are a 1.6 percent random sample of all whites mustered into the Union Army (Dyer 1908).

Table 3 presents a number of statistics that can be used to assess how representative the recruits sample is of the Union Army. Lines 1-6 compare estimates of some key behavioral characteristics. In each of these comparisons the difference between the sample estimate and the figure obtained from the aggregate source is less than one percent (varying between 1 and 9 per thousand). Lines 7-10, which compare the geographic distribution in the recruits sample and in the aggregate source, show that the North Central region is somewhat overrepresented and New England is somewhat underrepresented. This was due to differences between the two regions in the proportion of regiments whose descriptive books were deposited in the National Archives. The issue could be addressed either by postweighting or by adding New England companies (chosen by a random procedure) to the recruit sample. However, for the multivariate procedures currently contemplated, the size of the New England subsample is adequate. Various experiments with postweighting produced results that were virtually the same as those obtained with the internal weights, a finding anticipated by the closeness of the statistics computed from the recruit sample to those in the aggregate sources reported in lines 1-6 of Table 3.
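The postweighting option mentioned above can be illustrated with a small sketch. It assumes a simple post-stratification by region, in which each recruit's weight is the region's share in the aggregate source divided by its share in the sample; whether the project's experiments took exactly this form is not stated here. The region list and all figures are made up for illustration and are not taken from Table 3.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight for each region = population share / sample share."""
    n = sum(sample_counts.values())
    return {
        region: population_shares[region] / (count / n)
        for region, count in sample_counts.items()
    }


# Hypothetical figures, not from Table 3.
sample_counts = {
    "New England": 100,
    "North Central": 500,
    "Region C": 300,
    "Region D": 100,
}
population_shares = {
    "New England": 0.15,
    "North Central": 0.45,
    "Region C": 0.30,
    "Region D": 0.10,
}

weights = poststratification_weights(sample_counts, population_shares)
for region, w in weights.items():
    print(f"{region}: weight {w:.2f}")
```

An underrepresented region receives a weight above 1 and an overrepresented region a weight below 1, while the weighted sample size stays equal to the original one.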