The amount of data being collected is growing exponentially, in academia as well as in business. Unfortunately, the quality of that data is often poor, leading to bad decisions and increased costs. Data cleaning, the process of detecting and correcting errors in a dataset, can mitigate this problem.
This research focuses on detecting these errors. Several (semi-)automated error detection tools are available, but it is unclear how well they perform under varying conditions and on different datasets.
From this problem, the main research question was formulated: how to choose a fitting error detection algorithm for a specific relational dataset?
To answer this question, a comparative study of error detection tools on relational data was conducted. Of the selected state-of-the-art tools, the interactive error detection tool Raha performed best.
Subsequently, an attempt was made to estimate the performance of error detection tools and particular configurations on unseen datasets, based on high-level profiles of these datasets. According to the qualitative and quantitative experiments in this research, the proposed estimators are effective. Moreover, the estimators were analyzed to make the behaviour of the error detection tools on the studied datasets more interpretable.
Ultimately, these performance estimators were used to generate suggested rankings of error detection strategies. The resulting system outperformed the baseline and produced valuable rankings. The proposed strategy ranking system can help practitioners and data experts choose a fitting error detection algorithm for a specific relational dataset.
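The core idea of such a ranking system can be sketched as follows. This is a minimal, hypothetical illustration (the function names, the nearest-neighbour estimator, and the example profiles are assumptions, not the thesis implementation): each strategy's performance on an unseen dataset is estimated from the performance it achieved on the most similar previously profiled dataset, and strategies are ranked by that estimate.

```python
# Hypothetical sketch, NOT the thesis implementation: rank error detection
# strategies for an unseen dataset via a nearest-neighbour lookup over
# high-level dataset profiles (represented here as simple feature tuples).
import math

def similarity(p, q):
    """Negative Euclidean distance between two dataset profiles."""
    return -math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def rank_strategies(unseen_profile, known):
    """known: list of (profile, {strategy_name: observed F1}) pairs."""
    # Find the most similar previously profiled dataset ...
    _, scores = max(known, key=lambda item: similarity(unseen_profile, item[0]))
    # ... and rank strategies by the F1 they achieved there (best first).
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative profiles (e.g. fraction of numeric cells, fraction of nulls).
known = [
    ((0.9, 0.1), {"raha": 0.85, "rule_based": 0.40}),
    ((0.2, 0.8), {"raha": 0.60, "rule_based": 0.75}),
]
print(rank_strategies((0.85, 0.15), known))  # → ['raha', 'rule_based']
```

In practice the estimator would be a learned model over richer profiles rather than a single nearest neighbour, but the interface is the same: profiles in, a ranked list of strategies out.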