An AND/OR TreeBased Algorithm for Checking Pestilent Ambiguity in Regular Expressions
DENG XuBin1, ZHU YangYong2,3
1.School of Information, Zhejiang University of Finance and Economics, Hangzhou 310018 2.Department of Computing and Information Technology, Fudan University, Shanghai 200433 3.Shanghai Center for Bioinformation Technology, Shanghai 201203
Abstract:During the construction of regular expressions (REs) for applications, introducing some beneficial ambiguities may simplify RE construction, while leaving some pestilent ambiguities in the RE will harm the correctness of matching. In order to treat these two categories of ambiguities in different ways, an algorithm based on AND/OR tree that checks and locates the pestilent ambiguities in REs is proposed. The algorithm is helpful to reducing the workload of debugging REs. Experiments show that the algorithm outperforms the present ambiguitychecking algorithm based on automaton not only in time and space behaviors, but also in practicality. A visualized RE editing and debugging environment based on the algorithm has been applied to build the first online integrated biological data warehouse of China.
[1] Chan C Y, Garofalakis M, Rastogi R. RE-Tree: An Efficient Index Structure for Regular Expressions. International Journal on Very Large Data Bases, 2003, 12(2): 102-119 [2] Embley D W, Campbell D M, Smith R D, et al. Ontology-Based Extraction and Structuring of Information from Data-Rich Unstructured Documents. In: Proc of the 7th International Conference on Information and Knowledge Management. Bethesda, USA, 1998, 52-59 [3] Laender A H F, Ribeiro-Neto B, da Silva A S. DEByE: Data Extraction by Example. Data and Knowledge Engineering, 2002, 40(2): 121-154 [4] Wu T H, Pottenger W M. A Semi-Supervised Algorithm for Pattern Discovery in Information Extraction from Textual Data. In: Proc of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining. Seoul, Korea, 2003, 117-123 [5] Heddad A, Brameier M, MacCallum R M. Evolving Regular Expression-Based Sequence Classifiers for Protein Nuclear Localisation. In: Proc of the 2nd European Workshop on Evolutionary Bioinformatics. Coimbra, Portugal, 2004, 31-40 [6] Vansummeren S. Unique Pattern Matching in Strings. 2003. http://arXiv.org/abs/cs/0302004 [7] Frisch A, Cardelli L. Greedy Regular Expression Matching. In: Proc of the 31st International Colloquium on Automata, Languages and Programming. Turku, Finland, 2004, 618-629 [8] Hosoya H. Regular Expression Pattern Matching: A Simpler Design. 2003. http://arbre.is.s.u-tokyo.ac.jp/~hahosoya/papers/ambig.ps