Detection and rectification method for bent QR code recognition using convolutional neural networks

This paper proposes a method for decoding a bent quick-response code attached to a cylinder. The proposed method consists of two-stage image rectification using the shape function employed in a finite-element-method-based deformation analysis and a pix2pix network, which is a type of generative adversarial network. Rectification based on the shape function requires eight feature points, called nodes, of the bent code. A stacked hourglass network, a convolutional neural network used for human pose estimation, is used to detect these eight nodes. The experimental results show that, compared with other methods, the proposed method can more accurately decode bent codes with larger degrees of curvature.


Introduction
Quick response (QR) codes are extensively used in our daily lives to represent various types of information, including product information. The process of reading such information can be broadly divided into detecting the QR code region in an image and decoding the QR code. Finder patterns, which have a characteristic black-and-white module ratio of 1:1:3:1:1, are utilized to detect QR codes; however, if a QR code is attached to a surface with a large degree of curvature, the ratio collapses and detection fails. Furthermore, even if a QR code is detected, decoding is hindered because the code is deformed. Alignment patterns, which were added in Model 2 of the QR code, are utilized to correct deformation; however, their correction capability is limited, and it is difficult to read a code on a surface with a large curvature. This study focuses on the detection of the QR code region on a cylinder with a large curvature and on rectification for proper decoding. Because it is easier to attach a QR code to the side surface of a cylinder than to a spherical or free-form surface, and because there are numerous cylindrical industrial products such as bottles, there is a significant need for accurate detection and decoding. When a QR code is attached to a cylindrical product, it is typically placed so that the bending effect is minimized. Because the bending of a cylinder surface is greatest along its circumference, this study covers the case where a QR code is attached so that two of its parallel sides are parallel to the direction of the cylinder's central axis.
Early studies on the detection of QR code regions on cylindrical surfaces were often based on a combination of image-processing techniques, such as the localization of bent finder patterns through binarization and corner detection [1]. Recently, however, methods using a convolutional neural network (CNN), such as region detection using a Faster R-CNN [2], have been proposed, enabling robust detection against complex backgrounds. Because CNNs outperform conventional image-processing techniques in object detection, the method proposed in this study detects the region by detecting eight feature points (nodes) on or near the contour of the region using a stacked hourglass network [3], a type of CNN used for human joint detection. The node coordinates are also used for rectification, as detailed in section 2.1.
The methods proposed for the rectification of bent QR codes can be broadly divided into image-processing-based and neural network (NN)-based approaches. Image-processing-based methods include a perspective projection from the 3D coordinates of the bent QR code to the rectified image [1,4] and rectification equations obtained from the geometrical relationship between a cylindrical surface and the camera [5,6]. However, certain

Methods
Prior to this study, a bent-image rectification method using a pix2pix network [12] made it possible to read QR codes with large curvatures. However, when the distortion in the peripheral part of the code image is high, the rectification fails (figure 1). Therefore, in this study, to enable the reading of codes with even larger curvatures, I have developed a method that corrects the distortion of the peripheral part mainly using the shape function before rectification through the pix2pix network. In this section, the details of the rectification using the shape function are described first, followed by the stacked hourglass network that outputs the heatmaps of the eight nodes input to this rectification process. Next, the pix2pix network, which applies a refinement as the second stage of rectification, is described, and finally, the overall flow of the two-stage rectification is elucidated.

First-stage rectification using shape function
The shape function interpolates values between discrete nodes on a 2D or 3D element (e.g., a quadrilateral or hexahedral element) and is often used in finite-element-method-based deformation analysis in the field of structural mechanics [10]. Using this function, the displacement of each point inside an element can be estimated when the displacements of the element nodes are known. It is therefore possible to correct the deformation by conducting an inverse calculation of the internal deformation of a quadrilateral element. In this study, the serendipity shape functions N_i (i = 1, …, 8) for eight nodes [10], namely the four vertices (corner nodes) of a quadrilateral element and the midpoint node of each side, are used for the first-stage rectification.
A square bounding box of the QR code is set as a quadrilateral element (figure 2, left). Let w1, …, w8 be the points on the bent QR code that correspond to the four vertices, v1, …, v4, and the four midpoints, v5, …, v8, of the bounding box.
Figure 1. Rectification results using the pix2pix network [12]. (e) Ground truth. The image in (c) is successfully rectified, whereas that in (d) fails the rectification near the periphery. Reproduced from [12], with permission from Springer Nature.
Consequently, the bent QR code can be rectified during the first stage using equation (2):

f_q(x) = f_b( Σ_{i=1}^{8} N_i(n(x)) w_i ),   (2)

where f_q and f_b are the image densities at the given coordinates of the rectified and bent QR code images, respectively; x denotes the coordinates of a point in the rectified QR code image; and n(·) denotes a linear function that normalizes x. Through the rectification, w1, …, w8 are converted to match v1, …, v8, and it is thus expected that the distortion of the peripheral part, which the pix2pix network in [12] fails to correct, will be corrected before the image is input into our pix2pix network. To calculate equation (2), it is necessary to determine the positions of w1, …, w8 from the bent QR code image. However, it is difficult to find the four midpoints w5, …, w8. In [9], the bent QR code is first roughly rectified using the four-vertex shape function, the four midpoints are then calculated from the rectification result, and a further rectification is applied to the resulting image with an eight-node shape function. However, there is no guarantee that the midpoints obtained in the first step are accurate; hence, there is a problem with the rectification accuracy. By contrast, our method directly estimates the positions of all eight nodes using a stacked hourglass network.
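As an illustration, the mapping of equation (2) can be sketched in NumPy. This is a minimal sketch under my assumptions (grayscale images, nearest-neighbour sampling, and a node ordering of four corners followed by four mid-side nodes); the paper does not specify these implementation details.

```python
import numpy as np

# Natural coordinates of the eight nodes on the reference square [-1, 1]^2.
# Assumed ordering: four corners first, then the four mid-side nodes.
XI  = np.array([-1.0,  1.0, 1.0, -1.0,  0.0, 1.0, 0.0, -1.0])
ETA = np.array([-1.0, -1.0, 1.0,  1.0, -1.0, 0.0, 1.0,  0.0])

def serendipity_N(xi, eta):
    """Eight-node serendipity shape functions N_i(xi, eta)."""
    N = np.empty(xi.shape + (8,))
    for i in range(8):
        if XI[i] != 0.0 and ETA[i] != 0.0:   # corner node
            N[..., i] = 0.25 * (1 + xi * XI[i]) * (1 + eta * ETA[i]) \
                        * (xi * XI[i] + eta * ETA[i] - 1)
        elif XI[i] == 0.0:                   # mid-side node on eta = +/-1
            N[..., i] = 0.5 * (1 - xi**2) * (1 + eta * ETA[i])
        else:                                # mid-side node on xi = +/-1
            N[..., i] = 0.5 * (1 + xi * XI[i]) * (1 - eta**2)
    return N

def rectify(bent, w):
    """First-stage rectification, f_q(x) = f_b(sum_i N_i(n(x)) w_i).
    bent : H x W grayscale image; w : (8, 2) node pixel coords as (x, y)."""
    H, W = bent.shape
    ys, xs = np.mgrid[0:H, 0:W]
    # n(.): linear map from pixel coordinates to natural coords in [-1, 1]
    xi  = 2.0 * xs / (W - 1) - 1.0
    eta = 2.0 * ys / (H - 1) - 1.0
    N = serendipity_N(xi, eta)               # H x W x 8
    src = N @ w                              # H x W x 2 source (x, y) coords
    sx = np.clip(np.rint(src[..., 0]).astype(int), 0, W - 1)
    sy = np.clip(np.rint(src[..., 1]).astype(int), 0, H - 1)
    return bent[sy, sx]                      # nearest-neighbour sampling
```

When the nodes w coincide with the corners and mid-sides of the image itself, the map reduces to the identity, which is a convenient sanity check.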

Stacked hourglass network for eight-node localization
An hourglass network is a CNN that downsamples and then upsamples images, using skip connections to preserve spatial information. The stacked hourglass network, in which multiple hourglass modules are connected in series, was proposed for human joint detection and is robust to changes in scale. I modified the stacked hourglass network to output heatmaps of the eight nodes instead of joint heatmaps and trained it from scratch using the procedure described in section 3. The input was a binarized image of the scene showing a bent QR code (left of figure 3). In this study, the structure combined two hourglasses (i.e., two stacks), and the input/output images had a pixel resolution of 512 × 512.

Second-stage rectification using pix2pix
An example of the rectification results from the first stage is shown in the second image of figure 3. As can be seen from the results, the QR code region is transformed into a rectangle; however, if the curvature is large, deformation remains inside the region, and the QR code cannot be decoded. Therefore, this deformation was corrected using the pix2pix network.
The pix2pix network is a CNN that transforms images from domain A to domain B. The network was trained using a generative adversarial network framework comprising a generator and a discriminator. This generator (referred to as pix2pix herein) was used in the second stage. In this study, the input/output images of pix2pix have a pixel resolution of 512 × 512. The third image in figure 3 is an example of a rectification result by pix2pix.

Procedure for detection and rectification of bent QR codes
The flow from the detection of the bent QR code region to its rectification is illustrated in figure 4. After the scene image containing the bent QR code is converted into a binary image, it is input into the stacked hourglass network, and eight heatmaps are obtained. From the heatmaps, the node coordinates w1, …, w8 of the bent QR code region are calculated, and equation (2) is applied to obtain the first-stage rectified image. Finally, to obtain a refined image, the rectified image is input into the pix2pix network.
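The step that converts the eight heatmaps into node coordinates can be sketched as a simple per-channel peak search. This is a minimal sketch; the paper does not state how the peak is extracted, so the argmax here is an assumption.

```python
import numpy as np

def heatmaps_to_nodes(heatmaps):
    """Convert stacked-hourglass output heatmaps (8 x H x W) into (x, y)
    node coordinates by taking each channel's peak (assumed extraction)."""
    n, h, w = heatmaps.shape
    flat = heatmaps.reshape(n, -1).argmax(axis=1)   # flat peak index per node
    return np.stack([flat % w, flat // w], axis=1)  # (8, 2) as (x, y)
```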

Network training procedure
The scene images containing bent QR codes needed to train the two networks (the stacked hourglass and pix2pix networks) are generated using OpenGL. The pseudo QR code [12] and the corresponding eight-node image (a binary image with eight white nodes on a black background) are separately attached to a virtual cylinder (see figure 5). Subsequently, two types of scene images (containing the pseudo QR code or the eight-node image) are generated by rendering the scene. The pseudo QR code has finder, alignment, and timing patterns; the other cells are randomly filled with white or black, and the code is created anew during each rendering. In this study, Model 2 version 8 of the QR code was used, and the code size was 49 × 49 cells. The curvature is varied by fixing the size of the pseudo QR code in the virtual space to L × L (L denotes the side length of the QR code, set to 114 in this training procedure) and randomly changing the radius R of the cylinder for each rendering. The degree of curvature is defined by L′/L, where L′ denotes the chord length of the bent QR code (shown in figure 5); in this study, 0.76 ≤ L′/L ≤ 0.96. Note that the smaller L′/L is, the greater the effect of bending. The viewing angle was randomly selected for each rendering from six values ranging from 40° to 90° in 10° increments, and perturbations within the range of ±5° were added. The rendering procedure is as follows:
Step 1) The VX × VY × VZ space (60 × 60 × 450 in this study) immediately in front of the area where the pseudo QR code and eight-node image are attached, and the AX × AY tangent plane (L/1.9 × L/1.9 in this study) right in front of that area, are the eye point space and aim point area of OpenGL, respectively (see figure 5). Divide the eye point space into a 3D grid with a mesh size of d (d = 15 in this study) and randomly select one point inside each voxel as an eye point. Likewise, divide the aim point area into a 2D grid with a mesh size of d and randomly select one point inside each cell as an aim point.
Step 2) For all combinations of the above eye and aim points, attach the pseudo QR code or the corresponding eight-node image to the cylinder and render the scene (in this study, the viewport has a resolution of 512 × 512). In the case of the pseudo QR code, a randomly selected texture is mapped to the cylinder and background; in the case of the eight-node image, the cylinder and background are filled with black. Determine the direction of the up vector such that the rendered QR code is randomly tilted within 10° from the direction parallel to the y-axis of the image.
Step 3) Delete images that do not show the entire code region and images in which the code region is too small (occupying 1/4 or less of the image in this study).
Step 4) Binarize the remaining images. In this study, an adaptive threshold was used for binarization.
Figure 5. Generation of images required for training. Bent pseudo QR code and bent eight-node images are obtained by rendering the scene including a pseudo QR code attached to a virtual cylinder and an eight-node image attached to it, respectively. The bent pseudo QR code image and the eight single-node images obtained by dividing the bent eight-node image are utilized to train the stacked hourglass network. The rectified pseudo QR code is obtained from the bent pseudo QR code image using equation (2) to create an image pair with a flat pseudo QR code (upper left in this figure), and the pair is then utilized to train the pix2pix network.
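The per-voxel (per-cell) random sampling of eye and aim points in step 1 can be sketched as follows. This is a minimal NumPy sketch; the exact grid origin and sampling distribution are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_grid_points(extent, d):
    """One uniformly random point inside each cell/voxel of a regular grid.
    extent: axis lengths, e.g. (60, 60, 450) for the eye point space or
    (L / 1.9, L / 1.9) for the aim point area; d: mesh size (15 in the
    paper for the eye point space)."""
    counts = [int(np.ceil(e / d)) for e in extent]
    # Lower corner of every cell, as an (n_cells, n_dims) array.
    origins = np.stack(
        np.meshgrid(*[np.arange(c) * d for c in counts], indexing="ij"),
        axis=-1,
    ).reshape(-1, len(extent))
    # Jitter each corner uniformly within its cell.
    return origins + rng.uniform(0.0, d, size=origins.shape)

eye_points = sample_grid_points((60, 60, 450), 15)  # 4 * 4 * 30 = 480 points
aim_points = sample_grid_points((60, 60), 15)       # 2D aim point grid
```

Every rendered image then corresponds to one (eye point, aim point) combination, which is what makes the viewpoint coverage dense and repeatable.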
Using this procedure, 10,700 images were obtained for each of the two scene types (scenes containing the pseudo QR code and scenes with only the eight nodes). For the test dataset, 2,000 images were prepared in the same manner by varying the eye point space, aim point area, and d.
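The adaptive-threshold binarization of step 4 can be sketched with a mean-based local threshold. This is a pure-NumPy variant; the paper does not specify which adaptive method or parameters were used, so `block` and `c` here are illustrative.

```python
import numpy as np

def adaptive_binarize(img, block=31, c=5):
    """Mean-based adaptive threshold (illustrative parameters): a pixel
    becomes 255 if it exceeds the mean of its block x block neighbourhood
    minus c, and 0 otherwise."""
    pad = block // 2
    padded = np.pad(img.astype(float), pad, mode="edge")
    # Integral image (with a zero border) for fast block sums.
    ii = padded.cumsum(0).cumsum(1)
    ii = np.pad(ii, ((1, 0), (1, 0)))
    s = (ii[block:, block:] - ii[:-block, block:]
         - ii[block:, :-block] + ii[:-block, :-block])
    local_mean = s / (block * block)
    return np.where(img > local_mean - c, 255, 0).astype(np.uint8)
```

A local threshold of this kind keeps the black/white cell structure readable even when shading varies across the curved surface, which is presumably why an adaptive rather than a global threshold was chosen.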

Stacked hourglass training
The scene image containing the pseudo QR code obtained through rendering (the bent pseudo QR code image in figure 5) was used as the input image to train the stacked hourglass network. The scene image of only the eight nodes corresponding to the scene image of the pseudo QR code (the bent eight-node image in figure 5) was divided into eight single-node images, which constitute the ground truth of the stacked hourglass network output. The sets of input images and ground truths numbered 10,700. The stacked hourglass network was trained for 100 epochs with a minibatch size of eight. The test errors of the node estimations are shown in figure 6.

Pix2pix training
A pseudo QR code rectified using equation (2) (see figure 5) is the input image for training the pix2pix network. The pseudo QR code before bending (figure 5, upper left) is the ground truth of the pix2pix network output. Two methods are available to calculate the eight node coordinates (w1, …, w8) required for equation (2). One is to input the bent pseudo QR code image obtained through rendering into the trained stacked hourglass network to obtain the heatmaps of the eight nodes; the other is to calculate the node coordinates directly from the scene image of only the eight nodes obtained through rendering (the bent eight-node image in figure 5). For comparison, 10,700 input images for training the pix2pix network were generated using both approaches; thus, the pix2pix network was trained in two ways. Hereafter, the approach using the stacked hourglass network is labelled pix2pix_1 and that using the bent eight-node image is labelled pix2pix_2. Each pix2pix network was trained for 150 epochs without minibatches.

Experimental evaluation
I conducted an experiment to detect and rectify five types of bent QR codes of Model 2 version 8. The information encoded in the QR codes was five types of strings (see figure 7). These QR codes were attached to actual cylinders with five degrees of curvature (L′/L = 0.90, 0.86, 0.81, 0.78, and 0.76) (upper part of figure 8). A smartphone camera with an image size of 1200 (W) × 1600 (H) was used for filming. Each QR code was filmed 100 times from randomly selected positions such that the code region occupied 50%-70% of the central 1200 (W) × 1200 (H) area of the image and appeared almost upright. Subsequently, each filmed image was resized to a pixel resolution of 512 × 512.
The code region was detected for the 2,500 images obtained, the bent code was rectified using the proposed method (middle and bottom of figure 8), and a decoding test was conducted. The proposed method was programmed in Python, and the Pyzbar library was used for decoding. A complete reading of the encoded string was considered a success, whereas an error in the decoded string or a failure to decode was considered a failure. There were no cases in which an incorrect string was read; therefore, the decoding result for each test was either a complete success or a complete failure. As mentioned in section 3.2, the two pix2pix networks were trained differently (pix2pix_1 and pix2pix_2), and the success rate of each decoding is shown in table 1. Furthermore, the table also shows the results of [12], which, as far as I know from the literature review, was able to read QR codes with the largest curvature. The detection rate of the code region using the proposed method (the stacked hourglass network in the first-stage rectification) was 100%. As listed in table 1, the proposed method using pix2pix_1 has a decoding success rate of 100% when L′/L ≥ 0.81. The rate is approximately halved at 0.78 and is almost 0% at 0.76. The method using pix2pix_2 has a higher success rate than [12], which could read the QR codes with the largest curvature in previous studies, but decoding failures occur from L′/L = 0.86, and the success rate is 0% at 0.78. However, for both methods, the L′/L values of the images output by the second-stage rectification (i.e., the pix2pix output) were all 1.00, indicating that the outlines were successfully squared. Nevertheless, there are cases in which decoding fails because of a problem in the QR code refinement (see figure 9). Therefore, I evaluated the code recovery performance by calculating the matching ratio between the network output image and the ground truth. The results are shown in table 2. For all methods, the matching ratio decreases as the curvature increases; however, pix2pix_1 shows the slowest decrease. Therefore, pix2pix_1 performs best in refinement.
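Under one plausible definition (the paper does not give its exact formula), the matching ratio between a network output and the ground truth can be computed as the fraction of pixels whose binarized values agree:

```python
import numpy as np

def matching_ratio(output, truth, thresh=128):
    """Fraction of pixels whose binarized values agree between the network
    output and the ground-truth QR code image (assumed definition)."""
    return float(((output >= thresh) == (truth >= thresh)).mean())
```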
Finally, the time required for the two-stage rectification was measured. The average time was approximately 0.96 s on a PC (Windows OS, 3.30-GHz Intel Core i9 CPU, Nvidia GTX 1080Ti GPU). Furthermore, the calculations for the first and second stages consumed approximately 49% and 51% of the total computation time, respectively.

Discussion
If there is an error in the estimated positions of the eight nodes (w1, …, w8) of the bent QR code, the first-stage rectification is affected by the error. For example, if w_n' is detected instead of node w_n, equation (2) transforms the bent QR code image so that (w1, …, w_n', …, w8) matches the eight nodes (v1, …, v_n, …, v8) of a square. Consequently, the area surrounded by the correct eight nodes (w1, …, w_n, …, w8) of the bent QR code is transformed into a distorted rectangle. Figure 6 shows that the average error of the node coordinates estimated by the trained stacked hourglass network is approximately four pixels in a 512 × 512 image. Therefore, some parts of the images used to train pix2pix_1 must have been distorted owing to errors in the eight nodes output by the stacked hourglass network. However, as a result of the training, pix2pix_1 should have learned to refine the QR code images resulting from the first-stage rectification, including such distortion. Conversely, pix2pix_2 was trained to refine images rectified using the coordinates of w1, …, w8 detected in the bent eight-node image obtained through rendering.
Figure 9. (Second) The square bounding the rectified code region is shown in red; note that the shape of this region is not square but slightly distorted. Pix2pix_1 succeeded in refining this, but pix2pix_2 failed. (Third and fourth) The second-stage rectification results with pix2pix_1 and pix2pix_2, respectively. Regardless of whether refinement succeeds or fails, the outline is square (i.e., L′/L = 1.00) after refinement. The fourth image, which could not be decoded, shows that many parts of the QR code pattern are not recovered correctly. (Right) Ground truth.
Table 1 notes. Proposed-1 and Proposed-2 are the methods using pix2pix_1 and pix2pix_2 for the second rectification, respectively. The degree of curvature used in [12] was converted into L′/L.
This training did not develop the ability to refine distortions caused by errors of the stacked hourglass network. However, when the proposed method is executed, the node positions estimated by the stacked hourglass network are used; therefore, the position estimation errors also affect the second-stage rectification when pix2pix_2 is used.
Therefore, it can be assumed that the decoding success rate of pix2pix_2 was lower than that of pix2pix_1 because pix2pix_2 could not refine images containing distortions caused by the output errors of the stacked hourglass network as well as pix2pix_1 could. An example is shown in figure 9. Equation (2), used in the first-stage rectification, works to match the eight nodes of the bent QR code to the eight nodes of a square, as described in section 2.1. Nevertheless, in the second image from the left in figure 9, the four vertices are not the vertices of a square. This may be due to an error in the eight nodes of the bent QR code estimated by the stacked hourglass network. This distortion must have caused pix2pix_2 to fail in the refinement (i.e., decoding failed), as shown in the fourth image from the left in figure 9. In contrast, the result of pix2pix_1 (third from the left) was successfully decoded.
In [4] and [12] (shown in table 1), the decoding success rate is reported for given degrees of curvature. When the degree of curvature used in [4] is converted into L′/L, that study achieved a success rate of 100% at L′/L = 0.93 and reported no degree of curvature larger than this. Therefore, considering the degrees of curvature reported in the literature, the proposed method using pix2pix_1 achieves a top-class recognition rate.
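For reference, the conversion from a cylinder radius to L′/L can be derived from the geometry in section 3 (my derivation, not stated explicitly in the paper): a code of side length L wrapped around a cylinder of radius R subtends the arc angle θ = L/R, and its chord length is L′ = 2R sin(θ/2).

```python
import math

def chord_ratio(L, R):
    """L'/L for a QR code of side length L wrapped around a cylinder of
    radius R (assumes the code lies flat on the arc, bent only
    horizontally, as in this study)."""
    theta = L / R                           # arc angle subtended by the code
    return 2.0 * R * math.sin(theta / 2.0) / L
```

The ratio approaches 1 as R grows (a nearly flat code) and decreases monotonically as the cylinder gets thinner.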
As shown in table 2, the matching ratio between the QR code pattern refined in the second stage and the ground truth is not always 100% even when decoding is successful. The bottom images of figure 8, except the left-most, show an example: an insufficiently transformed part can be observed around the right end. The QR code has an error-correction function for stains and defects, and such an image can still be decoded. However, if the defects are large, decoding fails.
As described in section 1, this study was conducted for the case where the QR code is bent only in the horizontal direction. In the future, the proposed method will be extended to the case where the QR code is attached in various orientations relative to the cylinder. Furthermore, future work will include extending the CNN models to versions other than version 8 QR codes and reducing the time required for two-stage rectification through the building of lightweight CNN models.

Conclusions
This paper proposed a two-stage rectification method for bent QR codes, using the shape function with eight nodes estimated by a stacked hourglass network for the first stage and a pix2pix network for the second stage. With this method, it is possible to recognize QR codes with a larger degree of curvature than with previous methods. The stacked hourglass and pix2pix networks can be trained using bent pseudo QR code images generated through OpenGL, and it is particularly effective to train the pix2pix network using images rectified with the nodes estimated by the trained stacked hourglass network.