Abstract
A self-attention-based method, the Vision Transformer (ViT), is applied to efficiently detect surface defects on steel plates. Each defect image is divided into N×N patches, where each patch corresponds to a word and the whole image serves as a sentence or paragraph in NLP. A ViT framework is constructed from a learnable embedding module with sequence length L and 12 multi-head attention layers. We train the proposed model on a surface-defect dataset. The experimental results show empirically that ViT achieves superior performance compared to alternative approaches.
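The patches-as-words idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 224×224 input size, 16×16 patch size, and the NumPy-based `image_to_patch_tokens` helper are all assumptions chosen for clarity.

```python
import numpy as np

def image_to_patch_tokens(image, patch_size):
    """Split a square image into non-overlapping patches and flatten each
    patch into a token vector, mirroring the ViT patch-embedding step.
    image: (H, W, C) array; H and W must be divisible by patch_size.
    (Hypothetical helper for illustration, not the paper's code.)"""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    n_h, n_w = h // patch_size, w // patch_size
    patches = (image
               .reshape(n_h, patch_size, n_w, patch_size, c)
               .transpose(0, 2, 1, 3, 4)   # group pixels by patch: (n_h, n_w, p, p, c)
               .reshape(n_h * n_w, patch_size * patch_size * c))
    return patches  # a sequence of N*N "words", one flattened vector per patch

# Example: a 224x224 single-channel defect image with 16x16 patches
img = np.zeros((224, 224, 1), dtype=np.float32)
tokens = image_to_patch_tokens(img, 16)
print(tokens.shape)  # (196, 256): 14*14 patches, each flattened to 16*16*1 values
```

In a full ViT, each flattened patch would then be linearly projected to the model dimension and a learnable class token prepended, giving the sequence of length L that the 12 multi-head attention layers process.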
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.