Coding of textual answers

Coding is generally performed when the survey questionnaire contains textual variables that refer to official classifications, which allow for national and/or international data comparability. Examples of such variables are Economic Activity (NACE), Occupation, Education and Places (of birth, of residence, etc.).

Coding means assigning a unique code to a textual answer according to a classification scheme. The level of detail of the matched code depends on the survey aims and/or the dissemination needs. Coding can be performed manually or through automated systems. Manual coding can be performed only at the end of the data collection phase, whereas automated systems can be run during or after data collection: in the first case it is called assisted coding (on-line coding), in the second case automated coding (batch coding).

With reference to GSBPM, coding belongs to sub-process 5.2 “Classify and code” of Phase 5 “Process”, which includes the activities needed to make data ready for analysis (Phase 6 “Analyse”). In the case of assisted coding, some of the activities of sub-process 5.2 can start before Phase 4 “Collect” ends, improving the timeliness of data delivery.

Coding is, in general, a very demanding activity of the survey process. Moreover, when it is performed manually it is also difficult to standardise, because coding results depend strictly on the coders. Although coders are well trained in the criteria and principles of each official classification, coding is influenced by the cognitive process of each coder, which may lead to different (subjective) interpretations and, therefore, to different codes for the same textual answer.

The use of specialised coding software can produce a considerable saving of time and resources and will also guarantee a higher standardisation level of the coding process, increasing the expected quality of the coding results.

As noted above, computer assisted coding can be divided into “automatic coding” and “assisted coding”, which differ in aims and coding process:

Automatic coding: the coding software analyses and codes, on the basis of a reference dictionary, a data file containing all the textual answers collected during the collection phase (batch coding). The aim is to look for and to assign a single code to each textual answer according to quality thresholds;
Assisted coding: the coding software is an interactive tool that helps the coder/respondent code the textual answer. The aim is to offer the user a wider set of possible matching codes from which to choose the correct one.
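
The difference between the two modes can be sketched as follows. This is a minimal illustration, not the method of any specific coding software: the toy dictionary, the codes, the 0.85 threshold and the use of Python's difflib similarity as the matching function are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Toy reference dictionary: description -> code (codes are made up for illustration).
DICTIONARY = {
    "primary school teacher": "T1",
    "secondary school teacher": "T2",
    "truck driver": "D1",
}

def similarity(a: str, b: str) -> float:
    """Stand-in similarity function returning a value between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def ranked(answer: str):
    """Return (score, code) pairs sorted from best to worst match."""
    scored = [(similarity(answer, desc), code) for desc, code in DICTIONARY.items()]
    return sorted(scored, reverse=True)

def automatic_code(answer: str, threshold: float = 0.85):
    """Batch mode: assign a single code only if the best match passes the quality threshold."""
    best_score, best_code = ranked(answer)[0]
    return best_code if best_score >= threshold else None

def assisted_candidates(answer: str, k: int = 2):
    """Assisted mode: propose the k best candidate codes and let the user choose."""
    return [code for _, code in ranked(answer)[:k]]
```

With this sketch, the vague answer "school teacher" is left uncoded in batch mode because no description passes the threshold, while assisted mode proposes both teacher codes for the user to pick from.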

The key component of any coding system, automated or assisted, is its information base: the reference dictionary, which contains the codes and texts of the official classification and is enriched with textual answers collected by Istat surveys (and correctly coded). To be processed by software, the reference dictionary has to undergo a number of standardisation operations aimed at producing analytic, synthetic and unambiguous descriptions. In general, the richer the dictionary, the higher the coding rate.
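
As an illustration of what such standardisation operations might look like, the sketch below lower-cases descriptions, strips accents and punctuation, drops a few uninformative words, and flags descriptions that end up mapped to more than one code. The stop-word list, the function names and the sample codes are assumptions for the example, not the rules actually applied to any official dictionary.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real application would tune it per classification.
STOP_WORDS = {"of", "the", "and", "in", "for", "a", "an"}

def standardise(text: str) -> str:
    """Lower-case, strip accents and punctuation, and drop stop words."""
    # Decompose accented characters and discard the combining marks.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Replace anything that is not a letter or digit with a space, then tokenise.
    tokens = re.sub(r"[^a-z0-9]+", " ", no_accents).split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)

def build_dictionary(entries):
    """Map each standardised description to a code and collect ambiguous descriptions."""
    dictionary, ambiguous = {}, set()
    for description, code in entries:
        key = standardise(description)
        if key in dictionary and dictionary[key] != code:
            # Same standardised text, different codes: needs manual review.
            ambiguous.add(key)
        dictionary.setdefault(key, code)
    return dictionary, ambiguous
```

Descriptions collected in `ambiguous` are the ones that would have to be reworded or removed before the dictionary is used for matching.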

Generally speaking, coding systems vary according to the algorithm used to match the textual answers with the dictionary descriptions. They can be classified as follows:

dictionary algorithms: they look for exact matches on the basis of key words (or groups of key words);
weighting algorithms: they look for partial or exact matches on the basis of similarity functions among texts that assign weights to each word according to its informative content;
sub-string algorithms: they look for partial or exact matches by processing portions of texts (bigrams or trigrams).
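
The three families can be contrasted on a toy dictionary. The sketch below is only one possible reading of each family: exact matching on the set of key words, a similarity that weights words by inverse document frequency, and a Dice coefficient on character trigrams. The dictionary, the codes and the specific weighting and similarity formulas are assumptions chosen for the example.

```python
import math
from collections import Counter

STOP_WORDS = {"of"}
# Toy dictionary: description -> code (codes are made up for illustration).
DICTIONARY = {
    "growing of rice": "A1",
    "growing of grapes": "A2",
    "retail sale of bread": "B1",
}

def keywords(text):
    """Set of informative words in a text."""
    return frozenset(w for w in text.lower().split() if w not in STOP_WORDS)

# Dictionary algorithm: exact match on the group of key words.
KEYWORD_INDEX = {keywords(d): code for d, code in DICTIONARY.items()}

def keyword_match(answer):
    return KEYWORD_INDEX.get(keywords(answer))

# Weighting algorithm: words that are rare in the dictionary carry more information.
DOC_FREQ = Counter(w for d in DICTIONARY for w in keywords(d))
WEIGHTS = {w: math.log(1 + len(DICTIONARY) / n) for w, n in DOC_FREQ.items()}

def weighted_score(answer, description):
    a, d = keywords(answer), keywords(description)
    return sum(WEIGHTS[w] for w in a & d) / sum(WEIGHTS[w] for w in d)

# Sub-string algorithm: compare character trigrams, which tolerates typos.
def trigrams(text):
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def trigram_score(answer, description):
    a, d = trigrams(answer), trigrams(description)
    return 2 * len(a & d) / (len(a) + len(d))  # Dice coefficient

def best_match(answer, score):
    """Return (code, score) of the best-scoring description, or (None, 0.0)."""
    best_score, best_code = 0.0, None
    for description, code in DICTIONARY.items():
        s = score(answer, description)
        if s > best_score:
            best_score, best_code = s, code
    return best_code, best_score
```

In this sketch "rice growing" is coded by the key-word match, "rice farm" only by the weighted match (a partial match on the informative word "rice"), and the misspelt "growng of rice" only by the trigram match.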

In addition, as far as assisted coding is concerned, there are three possible methods to consult (navigate) the reference dictionary:

tree search: it navigates the hierarchical structure of the classification, from the highest branch to the lowest one (leaf), which represents the most detailed code (highest number of digits) that can be assigned to a textual answer;
alphabetic search: it navigates inside the entire dictionary looking for the definition which is equal or the most similar to the textual answer to be coded;
mixed mode search: it makes an alphabetic search inside the selected classification branch.
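
A minimal sketch of the three navigation methods, assuming a toy classification in which each additional character of the code identifies one level of the hierarchy. The classification, the codes and the function names are hypothetical, and the alphabetic search is simplified to a substring test rather than a true similarity search.

```python
# Toy hierarchical classification: code -> description (made up for illustration).
CLASSIFICATION = {
    "A": "agriculture",
    "A1": "crop production",
    "A11": "growing of rice",
    "A12": "growing of grapes",
    "B": "trade",
    "B1": "retail trade",
    "B11": "retail sale of bread",
    "B12": "retail sale of rice",
}

def children(code: str = ""):
    """Tree search step: list the codes one level below the given branch."""
    return sorted(
        c for c in CLASSIFICATION
        if c.startswith(code) and len(c) == len(code) + 1
    )

def alphabetic_search(text: str, branch: str = ""):
    """Alphabetic search: descriptions containing the text, in alphabetical order."""
    needle = text.lower()
    hits = [
        (desc, code) for code, desc in CLASSIFICATION.items()
        if code.startswith(branch) and needle in desc
    ]
    return sorted(hits)

def mixed_search(branch: str, text: str):
    """Mixed mode: alphabetic search restricted to the selected branch."""
    return alphabetic_search(text, branch)
```

Calling `children("")`, then `children("A")`, then `children("A1")` walks down the tree to the leaves, while `mixed_search("B", "rice")` only returns the rice-related entry under the trade branch.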

The data collection technique strongly influences the choice of the search method. The main distinction is between interviewer-administered and self-administered modes. For the latter, where respondents are not trained in classifications and coding as interviewers are, it is extremely important to provide a coding system that is user friendly and guarantees high quality results.

The quality of the coding activity depends heavily on keeping both the dictionary content and the matching rules up to date (training phase). It is advisable to perform the training phase periodically, in general after coding the textual answers collected by a survey. To this end, after a coding application, it is important to:

verify the quality of the coded cases;
use the uncoded cases to update the coding application (dictionary and checking rules);
highlight any gaps in the classification used.

The following indicators can be used to evaluate the quality and performance of the two coding modes, assisted and automated:

Automated coding indicators:

efficacy/coding rate: ratio of “number of coded texts” to “total number of texts to be coded”;
accuracy: ratio of “number of correctly coded texts” to “number of coded texts”;
efficiency: unit coding time, i.e. the time needed to code a single text.
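
The automated coding indicators above can be computed as in the following sketch. It assumes that a verified reference code is available for each text (for example from expert review), that uncoded texts are represented by None, and that unit time is the elapsed time divided by the total number of texts processed; the function and field names are hypothetical.

```python
def automated_coding_indicators(assigned, reference, elapsed_seconds):
    """Compute coding rate, accuracy and unit coding time.

    assigned: codes produced by the system, None where no code was assigned.
    reference: verified correct codes, in the same order as `assigned`.
    elapsed_seconds: total processing time for the whole batch.
    """
    total = len(assigned)
    # Keep only the texts that actually received a code.
    coded = [(a, r) for a, r in zip(assigned, reference) if a is not None]
    correct = sum(1 for a, r in coded if a == r)
    return {
        "coding_rate": len(coded) / total,
        "accuracy": correct / len(coded) if coded else 0.0,
        "unit_time": elapsed_seconds / total,
    }
```

For a batch of four texts where three are coded and two of those are correct, the coding rate is 0.75 and the accuracy is 2/3; note that accuracy is measured on coded texts only, so the two indicators have to be read together.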

Assisted coding indicators:

average time to assign a single code;
coherence between each collected textual description and the assigned code.
