Feature Fusion Pyramid Network for End-to-End Scene Text Detection

Abstract

How to properly incorporate text characteristics such as multiple scales, arbitrary orientation, and aspect ratio into detection network design has become a hot topic in computer vision. The Feature Pyramid Network (FPN) is a typical method for robust text detection, in which the low-level and high-level feature maps retain spatial structure and global semantic information, respectively. However, its strict hierarchical structure fails to fuse low-level and high-level information in a way that improves the discriminative ability of the feature maps. To address this problem, we propose a novel feature fusion pyramid network for end-to-end scene text detection that fuses multi-modal information. The pyramid structure is divided into high-level and low-level layers, and channel and spatial attention modules are adopted to enhance the high-level and low-level feature representations by encoding channel-wise and spatial-wise context information, respectively. To reduce the information lost as features pass between layers, a special residual network is designed to provide a shortcut between high-level and low-level features, thereby realizing multi-modal feature fusion. Experiments show that the precision and recall of the proposed method reach 88.7%/82.1% on ICDAR2015, 77.0%/60.3% on ICDAR2017-MLT, and 85.3%/74.8% on MSRA-TD500.
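
To make the fusion idea in the abstract concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes parameter-free sigmoid gates in place of the learned attention modules, nearest-neighbour upsampling, equal channel counts in both levels, and plain element-wise addition standing in for the "special residual network". All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat):
    """Reweight each channel by a gate computed from its global average.

    feat is a list of C channels, each an H x W nested list.
    Assumption: the gate is parameter-free; a real module would learn it.
    """
    out = []
    for ch in feat:
        n = len(ch) * len(ch[0])
        gate = sigmoid(sum(map(sum, ch)) / n)
        out.append([[v * gate for v in row] for row in ch])
    return out


def spatial_attention(feat):
    """Reweight each spatial position by a gate from its cross-channel mean."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(feat[k][i][j] for k in range(c)) / c)
             for j in range(w)] for i in range(h)]
    return [[[feat[k][i][j] * gate[i][j] for j in range(w)]
             for i in range(h)] for k in range(c)]


def upsample_nearest(feat, factor):
    """Repeat every row and column `factor` times (nearest-neighbour)."""
    return [[[row[j // factor] for j in range(len(row) * factor)]
             for row in ch for _ in range(factor)]
            for ch in feat]


def fuse(low, high, factor=2):
    """Shortcut from the high-level to the low-level feature map.

    Low-level features get spatial attention, high-level features get
    channel attention, and the two are summed element-wise after the
    high-level map is upsampled to the low-level resolution.
    """
    low_att = spatial_attention(low)
    high_up = upsample_nearest(channel_attention(high), factor)
    return [[[a + b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(low_att, high_up)]


# Toy feature maps: 2 channels, 4x4 (low level) and 2x2 (high level).
low = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
high = [[[0.0] * 2 for _ in range(2)] for _ in range(2)]
fused = fuse(low, high)
print(len(fused), len(fused[0]), len(fused[0][0]))
```

The fused map keeps the low-level resolution, so fine spatial detail is preserved while the upsampled high-level map contributes semantic context through the shortcut.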

Publication
ACM Transactions on Asian and Low-Resource Language Information Processing
Yirui Wu
Young Professor, CCF Senior Member

My research interests include Computer Vision, Artificial Intelligence, Multimedia Computing, and Intelligent Water Conservancy.