Methods and software of the statistical process

Data integration

Record linkage

Record linkage is an important process for integrating data that come from different sources. Its purpose is to identify records that refer to the same real-world entity, which may be represented differently in each data source, even when unique identifiers are not available or are affected by errors.

In official statistics, record linkage is needed for several applications, for instance:

  • to enrich the information stored in different data sets;
  • to create, update and de-duplicate a frame;
  • to improve the data quality of a source;
  • to estimate the size of a population by the capture-recapture method;
  • to check the confidentiality of public-use microdata.
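
As a rough illustration of the capture-recapture application listed above, the sketch below uses the Lincoln-Petersen estimator. This is a standard textbook choice that is assumed here, not one named in the text. The function name and the counts are hypothetical; the number of records found in both lists is what record linkage would supply.

```python
def lincoln_petersen(n_first, n_second, n_linked):
    """Estimate population size from two lists and the number of linked records."""
    if n_linked == 0:
        raise ValueError("no linked records: the estimator is undefined")
    return n_first * n_second / n_linked

# Hypothetical counts: 1000 units in the first list, 800 in the second,
# 640 found in both lists by record linkage.
estimated_size = lincoln_petersen(1000, 800, 640)
```

Because the number of linked records sits in the denominator, missed links inflate the estimate and false links deflate it, which is one reason linkage quality matters for this application.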

The complexity of the whole linking process stems from several aspects: the lack of unique identifiers requires sophisticated statistical procedures; the huge amount of data to process calls for complex IT solutions; and the constraints of a specific application may require solving difficult linear programming problems.
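
To make the idea of linking without unique identifiers concrete, here is a minimal sketch that compares records on string similarity and accepts the best pairs first. It is a deliberate simplification: the field names, the threshold of 0.85 and the toy records are all assumptions, the score is a plain average of similarities and not a probabilistic linkage model, and the greedy one-to-one assignment only stands in for the linear programming formulation mentioned above.

```python
from difflib import SequenceMatcher

def similarity(rec_a, rec_b):
    """Average string similarity over the shared fields (1.0 means identical)."""
    fields = ("name", "surname", "birth_year")  # hypothetical linking variables
    scores = [
        SequenceMatcher(None, str(rec_a[f]).lower(), str(rec_b[f]).lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def link(source_a, source_b, threshold=0.85):
    """Greedy one-to-one linkage: accept the highest-scoring pairs first."""
    candidates = sorted(
        ((similarity(a, b), i, j)
         for i, a in enumerate(source_a)
         for j, b in enumerate(source_b)),
        reverse=True,
    )
    used_a, used_b, links = set(), set(), []
    for score, i, j in candidates:
        if score < threshold:
            break  # all remaining pairs score lower still
        if i not in used_a and j not in used_b:
            links.append((i, j))
            used_a.add(i)
            used_b.add(j)
    return sorted(links)

# Toy records with typing errors in the surnames.
source_a = [
    {"name": "Maria", "surname": "Rossi", "birth_year": 1980},
    {"name": "Luca", "surname": "Bianchi", "birth_year": 1975},
]
source_b = [
    {"name": "Luca", "surname": "Bianci", "birth_year": 1975},
    {"name": "Maria", "surname": "Rosi", "birth_year": 1980},
    {"name": "Anna", "surname": "Verdi", "birth_year": 1990},
]
links = link(source_a, source_b)  # pairs of (index in A, index in B)
```

Comparing every pair is quadratic in the number of records, which hints at why large files call for dedicated IT solutions.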

Statistical matching

The goal of statistical matching (sometimes called data fusion) is to integrate two or more data sources referring to the same population in order to explore the relationships between variables that are not jointly observed in any single source. The sources to be integrated consist of different, non-overlapping units, as usually happens when data from several sample surveys are integrated. In the typical situation there are two data sources, A and B: variables X and Y are observed in A, variables X and Z are observed in B, and the objective is to study the relationship between Y and Z by exploiting the common information in X. The objective of statistical matching can be macro or micro. In the first case, the interest is in one or more parameters that summarize the relationship between Y and Z (correlation coefficient, regression coefficient, contingency table, etc.). In the second case, the result of the integration is a synthetic data set in which all the variables of interest, X, Y and Z, are present.

The objectives of matching can be achieved by means of a parametric or nonparametric approach, or a mixture of them (mixed methods).

The parametric approach requires the specification of a model and the estimation of its parameters. In the absence of auxiliary information, the conditional independence of Y and Z given the common variables X is generally assumed. This assumption is rather strong and, unfortunately, it cannot be tested in the typical matching situation, because Y and Z are never observed together.
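
A small sketch may help show what conditional independence buys in a macro setting. It assumes a single common variable X and linear regressions of Y and Z on X (as under joint normality), in which case the covariance of Y and Z equals the product of the two regression slopes and the variance of X. The helper names and the toy data are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def covariance(u, v):
    """Unbiased sample covariance of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)

def slope(x, target):
    """Least-squares slope of target on x."""
    return covariance(x, target) / covariance(x, x)

# Hypothetical toy files: A observes (X, Y), B observes (X, Z).
x_a, y_a = [1, 2, 3, 4], [2, 3, 5, 6]
x_b, z_b = [2, 3, 4, 5], [1, 3, 4, 6]

# X is observed in both files, so its variance can use the pooled values.
var_x = covariance(x_a + x_b, x_a + x_b)

# Under conditional independence, Cov(Y, Z) = slope_YX * slope_ZX * Var(X).
cov_yz_cia = slope(x_a, y_a) * slope(x_b, z_b) * var_x
```

The resulting value is only as credible as the assumption itself: nothing in files A and B can confirm or contradict it.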

Nonparametric methods are usually applied when the objective is micro. In this case hot-deck imputation methods are frequently used. They aim at imputing the missing variables in the data set chosen as recipient (for instance A) by using the values observed in the data set chosen as donor (B). The donor unit for a given unit in A is the observation in B that is most similar in terms of the values of the common variables X. The mixed approach consists of two steps: 1) a parametric model is assumed, its parameters are estimated, and imputation is carried out; 2) a hot-deck imputation procedure is applied, which uses the imputed values to choose the donor observation.
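
The nearest-neighbour hot-deck step described above can be sketched as follows. This is a minimal illustration under stated assumptions: a single numeric common variable X, absolute difference as the distance, and toy records with hypothetical field names. Real applications typically use several common variables and may restrict how often a donor is reused.

```python
def nearest_donor(recipient_x, donors):
    """Return the donor record whose X value is closest to the recipient's."""
    return min(donors, key=lambda donor: abs(donor["x"] - recipient_x))

# Recipient file A observes (X, Y); donor file B observes (X, Z).
recipient_a = [
    {"x": 25, "y": "employed"},
    {"x": 61, "y": "retired"},
]
donor_b = [
    {"x": 23, "z": 1200},
    {"x": 40, "z": 2100},
    {"x": 64, "z": 1500},
]

# Synthetic file: every record of A plus the Z value of its nearest donor in B.
synthetic = [
    {**record, "z": nearest_donor(record["x"], donor_b)["z"]}
    for record in recipient_a
]
```

Each record in `synthetic` carries X, Y and Z, but the imputed Z depends on Y only through X, so the file implicitly reflects conditional independence.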

It is worth noting that an alternative approach can be used, based on quantifying the uncertainty inherent in the estimation of a particular parameter. This approach requires neither the conditional independence assumption nor auxiliary information on the non-identifiable parameters, i.e., those describing the relationship between Y and Z. The study of uncertainty does not lead to a point estimate but to a set of plausible estimates: the values of the parameters concerning Y and Z that are consistent with the estimates obtainable from the data at hand, i.e., those of the parameters concerning the pairs (Y, X) and (Z, X).
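
For a simple case, the set of plausible estimates can be written down explicitly. The sketch below assumes three single variables and takes the correlation between Y and Z as the parameter of interest: the requirement that the full correlation matrix be positive semidefinite bounds that correlation once the correlations of X with Y and of X with Z are fixed. The function name and the input values are hypothetical.

```python
from math import sqrt

def correlation_bounds(rho_xy, rho_xz):
    """Range of corr(Y, Z) compatible with the given corr(X, Y) and corr(X, Z)."""
    centre = rho_xy * rho_xz  # the value implied by conditional independence
    half_width = sqrt((1 - rho_xy ** 2) * (1 - rho_xz ** 2))
    return centre - half_width, centre + half_width

# Hypothetical estimates: corr(X, Y) from file A, corr(X, Z) from file B.
lower, upper = correlation_bounds(0.6, 0.8)
```

The interval shrinks as X becomes more strongly correlated with Y or Z, which shows why informative common variables matter.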

Applying statistical matching to data from complex surveys poses additional problems. In such circumstances, the sampling design must be taken into account, as well as the methods used to deal with nonsampling errors (coverage and non-response).