
Avoiding Garbage In, Garbage Out: The Importance of Data Quality—Part 3

by Sean Howard | May 03, 2016

In the two previous posts discussing data quality (Part 1, Part 2), we looked at the roles of input data and methodology in creating quality data. In this final post, we turn our attention to the third component: quality control, which is also referred to as quality assurance, or QA. Quality control involves assessing the model and evaluating the data it produces, and it should be done as frequently as possible.

Ensuring Quality Control

Quality control, or quality assurance (QA), really boils down to two key elements: comparison against authoritative data sources and judgment. Where authoritative data are available, it is quite straightforward to calibrate a model, test its results and assess its predictive accuracy. This is a highly valuable step in any modelling exercise; in essence, it is an extension of the cross-validation techniques used in statistics and machine learning. All good modellers build models, make predictions, measure the accuracy of those predictions and then refine their models accordingly. There is no skipping that last step.
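To make that calibrate-test-refine loop concrete, here is a minimal sketch in Python using scikit-learn. The synthetic data, model choice and tolerance are purely illustrative stand-ins, not EA's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-ins: predictor variables and an "authoritative"
# target series (e.g., census counts) to calibrate against.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

model = LinearRegression()

# 5-fold cross-validation: fit on four folds, predict the fifth,
# and score the predictions against the held-out values.
scores = cross_val_score(model, X, y, cv=5,
                         scoring="neg_mean_absolute_error")
mae = -scores.mean()
print(f"Cross-validated MAE: {mae:.3f}")

# The refinement step: if accuracy misses the target, change the
# model or the inputs and re-run. There is no skipping this step.
if mae > 0.6:  # illustrative tolerance only
    print("Accuracy below target; revisit the model specification.")
```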

The second element, judgment, is much more challenging and can be somewhat subjective. In our business, there is often a long lag between when we make our predictions and when authoritative data become available to validate them. In the case of DemoStats, we have to wait a minimum of five years to evaluate and measure accuracy.

At EA, we spend just as much time on quality control as we do building models. When we perform quality control on our data, we use our experience, domain knowledge and best judgment to test the reliability of our data and models. One way we do that is by building competing models against which we can test our core methodologies; a simple sketch of that comparison follows below. This process typically leads us to some very important questions: How many predictions are comparable? Why and where do the predictions differ? Which prediction is more believable? Are there systematic differences between the two predictions that we can leverage? And when new authoritative data become available, we compare the various methodologies we have maintained to determine whether the core methodology needs to change.
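As a sketch of what that comparison can look like in practice, the snippet below contrasts the predictions of two hypothetical competing models with pandas, flagging areas where they diverge and checking for a systematic offset. All names, numbers and thresholds are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical predictions from two competing methodologies
# for the same set of geographic areas.
rng = np.random.default_rng(7)
areas = [f"area_{i:03d}" for i in range(200)]
core = rng.normal(loc=1000, scale=150, size=200)             # core methodology
challenger = core + rng.normal(loc=20, scale=40, size=200)   # competing method

df = pd.DataFrame({"area": areas, "core": core, "challenger": challenger})
df["diff"] = df["challenger"] - df["core"]
df["pct_diff"] = df["diff"] / df["core"] * 100

# Where do the predictions differ? Flag large disagreements
# for a closer, judgment-driven look.
flagged = df[df["pct_diff"].abs() > 10]
print(f"{len(flagged)} of {len(df)} areas differ by more than 10%")

# Is there a systematic difference we can leverage? A mean offset
# well away from zero suggests a consistent bias in one method.
print(f"Mean difference: {df['diff'].mean():.1f}")
print(f"Std of difference: {df['diff'].std():.1f}")
```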

QA is an integral part of building datasets and of ensuring their quality. Our commitment to QA means that we are continually improving our methodologies and datasets. It also means that our researchers do not become complacent. Without a thorough QA process, it is easy for researchers to fall into the trap of reusing the same methodologies and data sources simply because they were used in the past. And the last thing any business wants is complacent researchers!

* * *

In this three-part series, we examined the challenges of creating quality data. We have come to understand that, without exception, no data are perfect, and that determining how clean the input data are is vital. When it comes to methodology, one size does not fit all, and there are trade-offs that must be intelligently weighed based on the nature of the data and how the data will be used. Finally, creating quality data requires that models be tested and assessed as frequently as possible, then adjusted based on the assessment and on new data. The quality of your business decisions rests on the quality of the analysis that drives them, and the quality of the analysis rests on the quality of the data. At EA, we never forget that fundamental relationship.

Thanks to Jessica Moore, Michael Weiss, Sandra Albanese and Tony Lea for their editorial contributions.