Vision Based Important Change Detection Method for Web Pages
SHI Cunhui1,2, YU Xiaoming1, LIU Yue1, JIN Xiaolong1,2, CHENG Xueqi1,2
1. Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190; 2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
Abstract: Detecting important changes, that is, changes to the essential content of a Web page, can effectively reduce duplicate indexing during Web crawling. To this end, a vision-based detection method is proposed that detects changes in different semantic regions of a page and compresses the page into a low-dimensional vector representation. The method captures the differing semantic importance of page regions from the user's perspective. Unlike existing methods, it does not depend on HTML analysis and is therefore also suitable for new media such as the mobile Internet. Experiments demonstrate the effectiveness of the proposed method.
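As a rough illustration of how a low-dimensional page representation could be used for change detection, the sketch below compares two page vectors by Euclidean distance and flags an important change when the distance exceeds a threshold. It is not the authors' implementation: the function names, the distance measure, and the threshold value are assumptions made for illustration, and the encoder that would turn a page snapshot into a vector is not shown.

```python
import math


def embedding_distance(u, v):
    """Euclidean distance between two equal-length page vectors."""
    if len(u) != len(v):
        raise ValueError("page vectors must have the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def is_important_change(old_vec, new_vec, threshold=0.5):
    """Flag an important change when the vectors are far apart.

    The threshold is a hypothetical value; in practice it would be
    tuned on labeled pairs of page snapshots.
    """
    return embedding_distance(old_vec, new_vec) > threshold


# Hypothetical vectors for two snapshots of the same page.
old_page = [0.10, 0.20, 0.30]
new_page = [0.10, 0.20, 0.95]
print(is_important_change(old_page, new_page))
```

Under this sketch, a crawler would re-index a page only when the flag is set, so that changes confined to regions the representation treats as unimportant do not trigger a new index entry.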
史存会, 俞晓明, 刘悦, 靳小龙, 程学旗. 基于视觉的网页重要变化检测方法[J]. 模式识别与人工智能, 2020, 33(11): 1004-1012.
SHI Cunhui, YU Xiaoming, LIU Yue, JIN Xiaolong, CHENG Xueqi. Vision Based Important Change Detection Method for Web Pages. Pattern Recognition and Artificial Intelligence, 2020, 33(11): 1004-1012.