Pattern Recognition and Artificial Intelligence
2022 Vol. 35, Issue 12, Published 2022-12-25

Deep Learning Based Image Understanding and Its Applications
1047 Image Inpainting with a Three-Stage Generative Network
SHAO Xinru, YE Hailiang, YANG Bing, CAO Feilong
Generating color, edges, and texture is a central focus of deep-learning-based image inpainting, yet methods for generating these three properties still need improvement. A three-stage generative network is proposed in which the three stages synthesize colors, edges, and textures, respectively. Specifically, at the HSV color generation stage, the global color of the image is reconstructed in the HSV color space to provide color guidance for inpainting. At the edge optimization stage, an edge learning framework is designed to obtain more accurate edge information. At the texture synthesis stage, a decoder with bidirectional feature fusion is designed to enhance image details. The three stages are connected in sequence, and each plays an important role in improving inpainting performance. Extensive experiments demonstrate the superiority of the proposed method over state-of-the-art methods.
2022 Vol. 35 (12): 1047-1063 [Abstract] ( 554 ) [HTML 1KB] [ PDF 5120KB] ( 494 )
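The entry above is an abstract only; as a rough PyTorch illustration of the staged design it describes (HSV color guidance, then edge refinement, then texture synthesis), the sketch below chains three small sub-networks. All layer choices, channel counts, and the mask-conditioning scheme are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ThreeStageInpainter(nn.Module):
    """Sketch of a three-stage generator: the stage order follows the
    abstract; every layer choice here is an assumption."""
    def __init__(self):
        super().__init__()
        # Stage 1: reconstruct global color in HSV space from the
        # masked image plus the inpainting mask (4 input channels).
        self.color_stage = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid())
        # Stage 2: estimate an edge map conditioned on the color guidance.
        self.edge_stage = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1), nn.Sigmoid())
        # Stage 3: synthesize the final texture from image, color, and edges.
        self.texture_stage = nn.Sequential(
            nn.Conv2d(7, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Sigmoid())

    def forward(self, masked_rgb, mask):
        hsv = self.color_stage(torch.cat([masked_rgb, mask], dim=1))
        edge = self.edge_stage(torch.cat([masked_rgb, hsv], dim=1))
        out = self.texture_stage(torch.cat([masked_rgb, hsv, edge], dim=1))
        return hsv, edge, out

hsv, edge, out = ThreeStageInpainter()(torch.rand(1, 3, 256, 256),
                                       torch.ones(1, 1, 256, 256))
```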
1064 Modal Invariance Feature Learning and Consistent Fine-Grained Information Mining Based Cross-Modal Person Re-identification
SHI Linbo, LI Huafeng, ZHANG Yafei, XIE Minghong
In existing cross-modal person re-identification methods, modal differences are reduced by aligning features or pixel distributions across modalities, but the discriminative fine-grained information of pedestrians is ignored. To obtain discriminative pedestrian features that are independent of modal differences, a cross-modal person re-identification method based on modal invariance feature learning and consistent fine-grained information mining is proposed. The method consists of two modules, modal invariance feature learning and semantically consistent fine-grained information mining, which jointly drive the feature extraction network to produce discriminative features. Specifically, the modal invariance feature learning module removes modal information from the feature map to reduce modal differences. The semantically consistent fine-grained information mining module applies channel grouping and horizontal segmentation to the person feature maps, achieving semantic alignment and fully mining discriminative fine-grained information. Experimental results show that the proposed method significantly outperforms state-of-the-art cross-modal person re-identification methods.
2022 Vol. 35 (12): 1064-1077 [Abstract] ( 310 ) [HTML 1KB] [ PDF 5413KB] ( 419 )
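The channel grouping and horizontal segmentation step has a direct tensor interpretation; the hypothetical helper below splits a person feature map into channel groups and horizontal stripes and pools each part into a local descriptor. Group and stripe counts are illustrative, not the paper's settings.

```python
import torch

def grouped_striped_features(fmap, groups=4, stripes=6):
    """Split a feature map into channel groups and horizontal stripes,
    then average-pool each part into a local descriptor (a sketch of
    channel grouping + horizontal segmentation)."""
    parts = []
    for g in fmap.chunk(groups, dim=1):        # channel grouping
        for s in g.chunk(stripes, dim=2):      # horizontal segmentation
            parts.append(s.mean(dim=(2, 3)))   # pool each stripe
    return torch.stack(parts, dim=1)

feats = grouped_striped_features(torch.rand(2, 256, 24, 8))
print(feats.shape)  # torch.Size([2, 24, 64]): 4*6 parts of 64 channels
```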
1078 Generalized Zero-Shot Image Classification Based on Reconstruction Contrast
XU Rui, SHAO Shuai, CAO Weijia, LIU Baodi, TAO Dapeng, LIU Weifeng
In generalized zero-shot image classification, generative models are often exploited to reconstruct visual or semantic information for further learning. However, methods based on variational autoencoders underutilize the reconstructed samples, and their representation performance suffers. Therefore, a generalized zero-shot image classification model based on reconstruction and contrastive learning is proposed. Firstly, two variational autoencoders encode the visual and semantic information into low-dimensional latent vectors of the same dimension, and the latent vectors are then decoded back into the two modalities. Next, projection modules project both the original visual information and the visual information reconstructed from the semantic latent vectors, and reconstruction contrastive learning is performed on the projected features. The proposed model maintains the reconstruction ability of the encoder, enhances its discriminative ability, and improves the transferability of pre-trained features to the generalized zero-shot task. The effectiveness of the model is verified on four benchmark datasets.
2022 Vol. 35 (12): 1078-1088 [Abstract] ( 251 ) [HTML 1KB] [ PDF 1449KB] ( 374 )
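One way to read "reconstruction contrastive learning" here is an InfoNCE-style loss that pairs each projected original visual feature with the visual feature reconstructed from the matching semantic latent vector. The sketch below assumes that reading; the dimensions, projection head, and temperature are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reconstruction_contrastive_loss(v_orig, v_recon, proj, tau=0.1):
    """InfoNCE over projected features: the i-th original visual
    feature should match the i-th reconstruction from the semantic
    latent (diagonal pairs are positives). A sketch only; the paper's
    exact loss may differ."""
    z1 = F.normalize(proj(v_orig), dim=1)
    z2 = F.normalize(proj(v_recon), dim=1)
    logits = z1 @ z2.t() / tau                 # pairwise similarities
    targets = torch.arange(z1.size(0))         # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

proj = nn.Linear(2048, 128)                    # assumed projection head
loss = reconstruction_contrastive_loss(torch.rand(8, 2048),
                                       torch.rand(8, 2048), proj)
```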
1089 Adjacent Feature Combination Based Adaptive Fusion Network for Infrared and Visible Images
XU Shaoping, CHEN Xiaojun, LUO Jie, CHENG Xiaohui, XIAO Nan
To obtain an infrared and visible fusion image with clear target edges and rich texture details, a fusion network model, adjacent feature combination based adaptive fusion network (AFCAFNet), is proposed. It builds on the classical feed-forward denoising convolutional neural network (DnCNN) backbone by improving the network architecture and the loss function. The feature channels of several adjacent convolutional layers in the first half of the DnCNN network are fully fused by expanding the number of channels, enhancing the model's ability to extract and transmit feature information. All batch normalization layers are removed to improve computational efficiency, and the rectified linear unit (ReLU) is replaced with the leaky ReLU to alleviate the vanishing-gradient problem. To better handle images with different scene contents, the gradient feature responses of the infrared and visible images are extracted with the VGG16 image classification model and, after normalization, serve as the weight coefficients for the infrared and visible images, respectively. These weight coefficients are applied to three loss terms: mean squared error, structural similarity, and total variation. Experimental results on benchmark databases show that AFCAFNet holds significant advantages in both subjective and objective evaluations. It delivers superior subjective visual quality, with clearer edges and richer texture details for specific targets, in better accordance with human visual perception.
2022 Vol. 35 (12): 1089-1100 [Abstract] ( 219 ) [HTML 1KB] [ PDF 2820KB] ( 242 )
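The adaptive weighting reduces to scaling per-image loss terms by normalized gradient responses. The sketch below shows the weighted combination for the MSE and total-variation terms; the SSIM term and the actual VGG16 gradient-response computation are omitted, and the weights and trade-off factor are assumed inputs, not the paper's values.

```python
import torch

def total_variation(x):
    """Anisotropic total variation of an image batch (NCHW)."""
    return (x[..., 1:, :] - x[..., :-1, :]).abs().mean() + \
           (x[..., :, 1:] - x[..., :, :-1]).abs().mean()

def fusion_loss(fused, ir, vis, w_ir, w_vis, tv_weight=0.1):
    """Weighted fidelity terms: w_ir and w_vis are assumed to be the
    normalized VGG16 gradient-feature responses of the two inputs.
    The paper also uses an SSIM term, omitted here for brevity."""
    fidelity = w_ir * (fused - ir).pow(2).mean() \
             + w_vis * (fused - vis).pow(2).mean()
    return fidelity + tv_weight * total_variation(fused)

loss = fusion_loss(torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64),
                   torch.rand(1, 1, 64, 64), w_ir=0.6, w_vis=0.4)
```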
1101 Image Super-Resolution Reconstruction Based on l1 Induced Lightweight Deep Networks
ZHANG Dabao, ZHAO Jianwei, ZHOU Zhenghua
Existing deep-learning-based super-resolution methods improve reconstruction performance by deepening networks. However, deeper networks bring a sharp increase in the number of weights, imposing a heavy burden on storage and training. Considering the sparsity of noise, the cost of training, and the sharpness of reconstructed edges, an image super-resolution reconstruction method based on l1 induced lightweight deep networks is proposed, integrating model-driven and data-driven ideas. Firstly, the split Bregman algorithm and the soft threshold operator are utilized to derive an effective iterative algorithm from the l1 reconstruction optimization model with an edge regularization term. Secondly, under the guidance of this iterative algorithm, a corresponding recursive deep network is designed for image reconstruction. The network is thus derived from the reconstruction optimization model, and its recursive structure reduces the number of weights through weight sharing. Experimental results show that the proposed method achieves good reconstruction performance with fewer network weights.
2022 Vol. 35 (12): 1101-1110 [Abstract] ( 293 ) [HTML 1KB] [ PDF 2411KB] ( 214 )
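The soft threshold operator named in the abstract is the proximal operator of the l1 norm, and unrolling an l1 iterative scheme into a weight-shared recursive network is a standard model-driven construction. The sketch below shows that general pattern only; the paper's split Bregman derivation with an edge regularizer is richer than this.

```python
import torch
import torch.nn as nn

def soft_threshold(x, lam):
    """Proximal operator of the l1 norm: sign(x) * max(|x| - lam, 0)."""
    return torch.sign(x) * torch.clamp(x.abs() - lam, min=0.0)

class UnrolledL1Net(nn.Module):
    """Generic unrolled l1 iteration with one shared conv pair, so the
    same weights are reused across iterations (hence few parameters).
    An illustrative sketch, not the paper's derived network."""
    def __init__(self, channels=32, iters=5, lam=0.01):
        super().__init__()
        self.encode = nn.Conv2d(1, channels, 3, padding=1)
        self.decode = nn.Conv2d(channels, 1, 3, padding=1)
        self.iters, self.lam = iters, lam

    def forward(self, y):
        x = y
        for _ in range(self.iters):                 # weight-shared recursion
            u = soft_threshold(self.encode(x), self.lam)
            x = y + self.decode(u)                  # residual update
        return x

out = UnrolledL1Net()(torch.rand(1, 1, 32, 32))
```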
1111 Chinese Lipreading Network Based on Vision Transformer
XUE Feng, HONG Zikun, LI Shujie, LI Yu, XIE Yincen
Lipreading is a multimodal task that converts lip videos into text, aiming to understand what a speaker expresses in the absence of sound. Existing lipreading methods adopt convolutional neural networks to extract visual features of the lips, capturing only short-distance pixel relationships and thus struggling to distinguish the lip shapes of similarly pronounced characters. To capture long-distance relationships between pixels in the lip region of video frames, an end-to-end Chinese sentence-level lipreading model based on the vision transformer (ViT) is proposed. The model's ability to extract visual spatio-temporal features from lip videos is improved by combining ViT with a gated recurrent unit (GRU). Firstly, the global spatial features of lip images are extracted by the self-attention module of ViT. Then, GRU models the temporal sequence of frames. Finally, a cascaded attention-based sequence-to-sequence model predicts Chinese pinyin and Chinese character utterances. Experimental results on the Chinese lipreading dataset CMLR show that the proposed model achieves a lower Chinese character error rate.
2022 Vol. 35 (12): 1111-1121 [Abstract] ( 528 ) [HTML 1KB] [ PDF 1611KB] ( 414 )
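The ViT-plus-GRU pipeline (per-frame global spatial features, then temporal modelling) can be sketched compactly. Below, a patch embedding and transformer encoder stand in for ViT, and the attention-based seq2seq decoder from the abstract is reduced to a linear head; all dimensions and the vocabulary size are assumptions.

```python
import torch
import torch.nn as nn

class ViTGRULipreader(nn.Module):
    """Per-frame ViT-style spatial encoder followed by a GRU over the
    frame sequence. A sketch: the paper's model ends in an
    attention-based seq2seq decoder rather than a linear head."""
    def __init__(self, dim=128, heads=4, depth=2, vocab=410):
        super().__init__()
        self.patch = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # 16x16 patches
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)      # e.g. pinyin token logits

    def forward(self, frames):                 # (B, T, 3, H, W)
        b, t = frames.shape[:2]
        x = self.patch(frames.flatten(0, 1))   # (B*T, dim, h', w')
        x = x.flatten(2).transpose(1, 2)       # (B*T, num_patches, dim)
        x = self.encoder(x).mean(dim=1)        # global spatial feature
        x, _ = self.gru(x.view(b, t, -1))      # temporal modelling
        return self.head(x)                    # per-frame logits

logits = ViTGRULipreader()(torch.rand(2, 8, 3, 64, 128))
```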
1122 Camouflaged Object Detection Network Based on Global Multi-scale Feature Fusion
TONG Xuwei, ZHANG Guangjian
In camouflaged object detection, the high similarity between an object's appearance and its background makes accurate segmentation difficult. In context-aware cross-level fusion networks, high-level semantic information is diluted and lost as it is transmitted to the shallow layers for fusion, reducing accuracy. To address this problem, a camouflaged object detection (COD) network based on global multi-scale feature fusion (GMF2Net) is proposed. Firstly, a global enhanced fusion module (GEFM) is designed to capture context information at different scales, and the high-level semantic information is then transmitted to the shallow layers through different fusion enhancement branches to reduce feature loss during multi-scale fusion. A location capture mechanism is designed in the high-level network to extract and refine the location of the camouflaged object, while feature extraction and fusion on high-resolution images are carried out in the shallow layers to enhance high-resolution details. Experiments on three benchmark datasets show that GMF2Net achieves better performance.
2022 Vol. 35 (12): 1122-1130 [Abstract] ( 553 ) [HTML 1KB] [ PDF 1812KB] ( 490 )
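The core idea, re-injecting high-level semantics into shallower feature maps so they are not diluted during top-down fusion, can be illustrated with a toy fusion step. Channel sizes and the fusion operation below are assumptions; the paper's GEFM and enhancement branches are more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalFusion(nn.Module):
    """Toy global fusion: the deepest (most semantic) feature map is
    upsampled and concatenated into every shallower map, then mixed
    with a 1x1 conv. A sketch of the idea, not the paper's GEFM."""
    def __init__(self, chans=(64, 128, 256)):
        super().__init__()
        self.mix = nn.ModuleList(
            nn.Conv2d(c + chans[-1], c, kernel_size=1) for c in chans[:-1])

    def forward(self, feats):                  # shallow -> deep feature maps
        top = feats[-1]
        fused = []
        for f, mix in zip(feats[:-1], self.mix):
            g = F.interpolate(top, size=f.shape[2:], mode="bilinear",
                              align_corners=False)
            fused.append(mix(torch.cat([f, g], dim=1)))  # inject semantics
        fused.append(top)
        return fused

feats = [torch.rand(1, 64, 88, 88), torch.rand(1, 128, 44, 44),
         torch.rand(1, 256, 22, 22)]
fused = GlobalFusion()(feats)
```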

Supervised by
China Association for Science and Technology
Sponsored by
Chinese Association of Automation
National Research Center for Intelligent Computing System
Institute of Intelligent Machines, Chinese Academy of Sciences
Published by
Science Press
 
Copyright © 2010 Editorial Office of Pattern Recognition and Artificial Intelligence
Address: No. 350 Shushanhu Road, Hefei, Anhui Province, P.R. China  Tel: 0551-65591176  Fax: 0551-65591176  Email: bjb@iim.ac.cn
Supported by Beijing Magtech  Email: support@magtech.com.cn