
    How Data Are Cleaned

     

    A large body of military, medical, and socio-economic information covering the lifetimes of the recruits in the UA data set has been collected from military, census, medical, and pension records. The collected information has been divided into three main sets of variables, corresponding to the three main data sets:

    Specific information on the cleaning process applied to each variable can be found under the variable names on these lists. (However, see below on Surgeon’s Certificates Variables.)

     

    Data cleaning was the final step in the processing of the UA data. Cleaning took place at the Center for Population Economics at the University of Chicago. In general terms, cleaning involved the standardization of values: correcting spelling errors, standardizing variant or archaic spellings, and standardizing punctuation, the use of abbreviations, and word order. Equally important was the exclusion of values that did not fit the form or the logically possible content of a particular variable, such as a number entered for a variable that requires an alphabetic value. Whenever possible, errors of this sort have been corrected by consulting the original records, and this process continues today where needed. Some variables (such as residence and occupation) have also been coded in order to clarify and simplify the range of data values. Any coding of a variable is described in detail in the codebooks, which contain the most detailed information about the variables in the data sets.
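    The kind of standardization and exclusion described above can be illustrated with a minimal sketch. It is not the procedure actually used at the Center for Population Economics. The function name `clean_alphabetic_value`, the `SPELLING_VARIANTS` table, and the sample values are all hypothetical and are assumed here only for illustration. The sketch covers spelling, punctuation, and form checking; it does not attempt abbreviation or word-order standardization.

```python
import re

# Hypothetical lookup of variant or archaic spellings (an assumption for this
# sketch; these entries are not taken from the UA codebooks).
SPELLING_VARIANTS = {
    "labourer": "laborer",
    "farmar": "farmer",
}


def clean_alphabetic_value(raw):
    """Return (cleaned_value, needs_review) for a variable that requires text.

    needs_review is True when the entry does not fit the form of the variable
    and should be checked against the original record.
    """
    # Standardize whitespace, case, and leading or trailing punctuation.
    value = re.sub(r"\s+", " ", raw).strip().strip(".,;:").lower()

    # Exclude empty entries and entries made up only of digits and separators,
    # since a number cannot be valid for an alphabetic variable.
    if not value or re.fullmatch(r"[\d\s.,-]+", value):
        return None, True

    # Replace a known variant spelling with its standard form.
    return SPELLING_VARIANTS.get(value, value), False


print(clean_alphabetic_value("  Labourer. "))
print(clean_alphabetic_value("1862"))
```

    In this sketch an excluded value is flagged rather than silently dropped, which mirrors the practice of going back to the original records whenever possible.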

     

    Note on Surgeon’s Certificates Variables

    Data collection from the surgeon’s certificates presented special challenges. Variables in this data set are listed according to the “disease screens” through which the data were input. The disease screens consist mainly of separate input screens, each pertaining to a disease condition for which a recruit may have been examined. Most of the information in this data set is coded according to “answer classes” and “modifiers” devised by Dr Louis Nguyen. (For more information, see the description under The Surgeon’s Certificates Variables.)