Abstract: Data quality rules are key to detecting quality problems in databases. To discover data quality rules automatically from relational databases and use them to detect erroneous or abnormal data, this paper studies the form and evaluation measures of data quality rules and presents criteria for computing them based on data item groups and a confidence threshold. Algorithms for mining minimal data quality rules are given, together with the main idea of detecting data errors using those rules. The new rule form combines the confidence mechanism of association rules with the notation of conditional functional dependencies, so that functional dependencies, conditional functional dependencies and association rules are described in one format. Rules of this kind are concise, objective and complete, and they detect erroneous or abnormal data accurately. Compared with related work, the proposed algorithms have lower time complexity, and the discovered rules improve the detection rate. Experiments demonstrate the effectiveness and correctness of the proposed methods.
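As a rough illustration of the idea of confidence-based rules computed over data item groups, the following minimal sketch groups tuples by a left-hand-side attribute, keeps the dominant right-hand-side value of each group when its confidence reaches a threshold, and flags the tuples that disagree with it. This is not the paper's algorithm: the function names `mine_rules` and `detect_violations`, the `min_support` parameter, and the sample `zip`/`city` data are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def mine_rules(rows, lhs, rhs, min_conf=0.8, min_support=2):
    """Hypothetical helper: return {lhs_value: (rhs_value, confidence)}.

    Each group holds the tuples that share one value of `lhs`. A rule is kept
    when the most frequent `rhs` value in the group covers at least `min_conf`
    of the group and the group has at least `min_support` tuples.
    """
    groups = defaultdict(Counter)
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1

    rules = {}
    for lhs_value, counts in groups.items():
        total = sum(counts.values())
        rhs_value, hits = counts.most_common(1)[0]
        confidence = hits / total
        if total >= min_support and confidence >= min_conf:
            rules[lhs_value] = (rhs_value, confidence)
    return rules


def detect_violations(rows, lhs, rhs, rules):
    """Hypothetical helper: indices of tuples that contradict a mined rule."""
    return [
        i
        for i, row in enumerate(rows)
        if row[lhs] in rules and row[rhs] != rules[row[lhs]][0]
    ]


# Made-up sample relation: one tuple disagrees with the rest of its group.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Newark"},
    {"zip": "60601", "city": "Chicago"},
    {"zip": "60601", "city": "Chicago"},
]

rules = mine_rules(rows, "zip", "city", min_conf=0.8)
suspects = detect_violations(rows, "zip", "city", rules)
print(rules)
print(suspects)
```

With a confidence threshold of 1.0 the sketch only accepts groups with no exceptions, which corresponds to an exact dependency on that group; lowering the threshold tolerates some disagreement and turns the disagreeing tuples into candidates for error detection.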