Scene Text Removal Based on Multi-scale Attention Mechanism
HE Ping1, ZHANG Heng2, LIU Chenglin2,3
1. School of Computer Science and Technology, Anhui University, Hefei 230601; 2. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190; 3. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049
Abstract: Scene text removal is important for privacy protection and image editing in image communication. However, existing scene text removal models do not extract sufficiently robust features from images with complex backgrounds and multi-scale text, which leads to incomplete text detection and background repair. To address this problem, a scene text removal framework based on a multi-scale attention mechanism is proposed for robust background repair and text detection. The framework consists mainly of a background repair network and a text detection network that share a backbone network. In the background repair network, a texture adaptive module encodes channel and spatial features and adaptively integrates local and global features, effectively repairing shadow regions in text reconstruction. To improve text detection, a context aware module learns discriminative features that separate text from non-text regions in the image. In addition, to enlarge the receptive field of the network and improve the removal of multi-scale text, a multi-scale feature loss function is designed to optimize the background repair and text detection modules. Experimental results on the SCUT-SYN and SCUT-EnsText datasets show that the proposed method achieves state-of-the-art performance in text removal.
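The abstract does not give the form of the multi-scale feature loss, so the following is only an illustrative sketch of the general idea of comparing a repaired image with its ground truth at several resolutions. It assumes a weighted sum of mean absolute (L1) differences over average-pooled copies of the images; the function names `avg_pool` and `multi_scale_l1`, the pooling factors, and the weights are hypothetical choices made for this example, not the authors' formulation.

```python
import numpy as np


def avg_pool(img, factor):
    """Downsample an (H, W, C) array by averaging non-overlapping
    factor x factor blocks; trailing rows/columns that do not fill a
    block are cropped."""
    h, w, c = img.shape
    h2, w2 = h // factor, w // factor
    cropped = img[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor, c).mean(axis=(1, 3))


def multi_scale_l1(pred, target, factors=(1, 2, 4), weights=(1.0, 0.8, 0.6)):
    """Weighted sum of mean L1 errors at several scales.

    The factors and weights are assumptions for illustration only.
    """
    total = 0.0
    for factor, weight in zip(factors, weights):
        diff = avg_pool(pred, factor) - avg_pool(target, factor)
        total += weight * np.abs(diff).mean()
    return float(total)


# Toy usage: a random "ground truth" and a uniformly shifted prediction.
rng = np.random.default_rng(0)
target = rng.random((8, 8, 3))
pred = target + 0.5

print(multi_scale_l1(target, target))  # identical images give 0.0
print(multi_scale_l1(pred, target))    # grows with the size of the error
```

Coarser scales summarize larger neighbourhoods, so errors spread over a large text region still contribute at every level, while fine detail is penalized mainly at the full resolution.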
HE Ping, ZHANG Heng, LIU Chenglin. Scene Text Removal Based on Multi-scale Attention Mechanism[J]. Pattern Recognition and Artificial Intelligence, 2022, 35(7): 614-624.