Blind-Depth Light Field Super-Resolution

Light field cameras have drawn much attention due to their post-capture adjustment capabilities such as refocusing and 3D reconstruction. However, low resolution has been their bottleneck and limits their further application. To solve this problem, various methods have been proposed to increase the resolution, namely, to recover details and reconstruct a high resolution (HR) image of the scene from several low resolution (LR) observations. Most previous super-resolution (SR) frameworks depend on prior knowledge of depth information, which is an indispensable part of their SR approaches. However, it may be difficult to obtain precise depth information in some practical situations. In this paper, we propose a novel light field super-resolution (LFSR) framework independent of prior depth information. The framework combines a variational regularization-based SR approach with the light field refocusing process, which super-resolves the focused region while preserving the unfocused region, generating a series of multi-focus super-resolved images. We then employ a multi-focus image fusion algorithm based on the stationary wavelet transform (SWT) and finally obtain an all-in-focus HR image. Synthetic and real-world datasets are utilized to demonstrate the effectiveness of the proposed framework both quantitatively and visually.


Introduction
Light field cameras have drawn more and more attention from both industry and academia recently. Compared to traditional 2D cameras, a light field camera not only captures the accumulated intensity at each point, but also separates the intensity values for each ray direction. In other words, it captures the 3D scene geometry and reflectance properties [1], enabling powerful post-capture adjustments [31,32]. Light field cameras are thus becoming increasingly popular and have huge commercial potential.
To capture the light field signal, many camera architectures have been proposed, such as microlens array based designs [2], mask based designs [3], coded aperture based designs [4] and camera array based designs [5]. Among these, the microlens array based design (also called the plenoptic camera) is commercially available, for example from Lytro and Raytrix [6]; it places a microlens array between the main lens and the CCD sensor to sample the light field. However, the design faces a trade-off between spatial and angular resolution [1]. Since the total sensor resolution is limited, one can either opt for dense sampling in the angular (viewpoint) domain [7], or vice versa [2,8,9]. By comparison, the camera array based design has shown a potential advantage in image resolution. The design sets several independent cameras in an array to capture the parallax of the scene. The resolution of each sub-image caught by a camera can be higher, and the baseline (the distance between two neighbouring cameras) is adjustable. All these properties outperform the plenoptic camera. Pelican Imaging Corporation has developed an ultra-thin high performance monolithic camera array called PiCam, which can capture light fields and synthesize high resolution (HR) images along with a range image (scene depth) through integrated parallax detection and super-resolution (SR) [10]. Wang et al. [11] set 8 low resolution (LR) side cameras around a central high-quality single lens reflex (SLR) lens to form a small camera array based light field imaging device. With the development of camera miniaturization, camera array based light field imaging devices may surpass other professional imaging devices such as SLRs in the near future.
Besides the preceding advantages, light field cameras can utilize the parallax obtained by different sub-cameras to super-resolve the composited image [33,34,40,41]. SR is a technique for generating an HR image from one or several observed LR images [35,36,37,38,39]. Namely, the goal of SR is to increase the pixel density and use an appropriate reconstruction algorithm to estimate the true high-frequency components. Typical SR techniques utilize a maximum likelihood (ML) method [12], maximum a posteriori (MAP) method [13,14], iterative back-projection (IBP) method [15], frequency domain approach [16], or projection onto convex sets (POCS) method [17,18] as the reconstruction algorithm. However, all of the SR methods above assume that the sub-pixel shifts between the LR images are constants, meaning that all pixels in an image correspond to the same shift. Nevertheless, in the sub-images caught by an array of cameras, the shift of a pixel depends not only on the coordinate of the camera, but also on the depth of the scene point. Due to this spatially-variant property of the light field, traditional SR techniques need to be modified, and light field super-resolution (LFSR) techniques have been proposed. Bishop et al. [19] first designed a variational Bayesian framework to recover more information and super-resolve light field images. Wanner and Goldluecke [1,20] optimized a continuous variational framework and generated super-resolved novel views of a scene. Nava et al. [26] introduced the focal stack transform and produced an all-in-focus image of a scene.
Most of the previous work [1,6,10,11,19,20,21] depends on a depth map or disparity map to obtain the pixels' shearing shifts in the LR images. However, it may be difficult to obtain precise depth information in some practical situations, especially near edges or in richly-textured areas. The resulting estimation error in the depth map mismatches pixels and shifts, which reduces the sharpness of the SR result. As the light field refocusing process [22,31] suggests, depth information is actually not necessary for SR. In this paper, we propose a novel LFSR framework which does not utilize estimated depth information. The framework combines a variational regularization-based SR method with the light field refocusing process, generating an HR image with the focused region super-resolved and the unfocused region well-preserved. By varying the 'slope' of the refocusing process, a series of multi-focus super-resolved images can be produced. We then employ a stationary wavelet transform (SWT) based fusion algorithm to obtain an all-in-focus HR image of the scene. This paper is organized as follows: In Section 2 we formulate the light field degradation model and adapt the regularization-based SR model to the light field situation. In Section 3 we present our Blind-Depth LFSR framework systematically. Extensive experiments are conducted on both synthetic and real-world datasets to test the effectiveness of the proposed framework in Section 4. Finally, conclusions are drawn in Section 5.

Light Field Image Degradation Model
The HR image we need to reconstruct is a 2D projection of the 3D real-world scene. The light field sub-images caught by the cameras suffer from shearing shifts (caused by parallax), blurring (caused by optical distortions) and downsampling (caused by the limited CCD or CMOS pixel density). Due to the 3D property of the light field, objects at various depths correspond to different shifts in each sub-image. For objects near the cameras, the parallax between the cameras is obvious, which leads to larger shearing shifts in the sub-images; for objects in remoter areas, the parallax is tiny, which corresponds to smaller shearing shifts. Since the classical SR models [12][13][14][15][16][17][18] neglect this depth property, we take it into consideration and formulate the light field image degradation process as

$$ y_k = \sum_{i=1}^{p} D B F_{k,i} x_i + n_k, \qquad k = 1, 2, \dots, K, \qquad (1) $$

where $x$ is a lexicographically ordered vector representing the desired, ideal HR image captured by a selected reference camera (usually, but not necessarily, the centre camera), $D$ is the downsampling matrix, $B$ is the blurring matrix, $F_{k,i}$ is the shearing shift (warping) matrix for camera $k$ and depth layer $i$, and $n_k$ is additive noise. Since a refocused image at an arbitrary depth has a depth of field, i.e. there is a distance between the nearest and the furthest depth in focus, we divide the real-world scene into $p$ depth layers. $x_i$ is the $i$-th layer of $x$, with the pixels on that layer preserved and the pixels off the layer set to 0, so that

$$ x = \sum_{i=1}^{p} x_i. \qquad (2) $$

$y_k$ is the vectorized $k$-th observation caught by a sub-camera. Suppose that the observed sub-images are of size $n \times n$ and $m$ is the upsampling rate; the sizes of $y_k$, $x_i$ and $x$ are then $n^2 \times 1$, $m^2 n^2 \times 1$ and $m^2 n^2 \times 1$ respectively. As discussed above, pixels at different depths or caught by different cameras correspond to different shifts, which is why the warping matrix carries both indices $k$ and $i$.
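To make the degradation model concrete, the following minimal NumPy/SciPy sketch simulates one observation under (1). The per-layer translation model, the function names and the default parameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, shift

def degrade_layer(x_i, shift_vec, blur_sigma=0.5, factor=4):
    """One term of (1): shearing shift F_{k,i}, blur B, downsampling D."""
    warped = shift(x_i, shift_vec, order=3, mode='nearest')  # F_{k,i} x_i
    blurred = gaussian_filter(warped, sigma=blur_sigma)      # B (...)
    return blurred[::factor, ::factor]                       # D (...)

def observe(layers, layer_shifts, noise_std=0.0, factor=4):
    """Simulate the k-th LR observation y_k = sum_i D B F_{k,i} x_i + n_k.
    `layers` are the depth-layer images x_i; `layer_shifts` holds the
    (row, col) disparity of each layer as seen by camera k."""
    y_k = sum(degrade_layer(x_i, s, factor=factor)
              for x_i, s in zip(layers, layer_shifts))
    return y_k + noise_std * np.random.randn(*y_k.shape)
```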

Variational Regularization-Based SR Model
After the degradation model is formulated, we need to estimate $x$ from the observations $y_k$. As the dimension of $y_k$ is smaller than that of $x$, the problem is ill-posed and the solution is not unique. We therefore turn to a variational regularization-based SR method to solve the ill-posed LFSR problem. The objective function can be established as

$$ \hat{x} = \arg\min_{x} \left\{ E(x) + \lambda J(x) \right\}, \qquad (3) $$

where $\hat{x}$ is the estimate of the HR image. $E(x)$ is the fidelity term, an $L_2$-norm expression measuring the error between the estimate and the observations under the degradation model (1). $J(x)$ is the regularization term, which controls the smoothness of the solution. $\lambda$ is a scalar balancing fidelity against smoothness; it depends on the kind of regularization term and is generally set empirically.
We employ the BTV regularization term shown in (4), which is widely used and confirmed to perform well [23]:

$$ J(x) = \sum_{l=-P}^{P} \sum_{m=-P}^{P} \alpha^{|l|+|m|} \left\| x - S_x^{l} S_y^{m} x \right\|_1, \qquad (4) $$

where $S_x^l$ and $S_y^m$ shift $x$ by $l$ and $m$ pixels in the horizontal and vertical directions respectively. The scalar weight $\alpha$ $(0 < \alpha < 1)$ gives a spatially decaying effect to the summation of the regularization terms.
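The descent iteration introduced later in (10) needs the (sub)gradient of (4). A minimal sketch, using circular shifts via np.roll for brevity (a boundary-aware implementation would handle edges differently), could read:

```python
import numpy as np

def btv_gradient(x, P=2, alpha=0.7):
    """Subgradient of the BTV regularizer (4): sum over (l, m) of
    alpha^(|l|+|m|) (I - S_x^{-l} S_y^{-m}) sign(x - S_x^l S_y^m x)."""
    grad = np.zeros_like(x, dtype=float)
    for l in range(-P, P + 1):
        for m in range(-P, P + 1):
            if l == 0 and m == 0:
                continue
            shifted = np.roll(np.roll(x, l, axis=0), m, axis=1)  # S_x^l S_y^m x
            s = np.sign(x - shifted)
            # the transpose of a shift operator is the opposite shift
            back = np.roll(np.roll(s, -l, axis=0), -m, axis=1)
            grad += alpha ** (abs(l) + abs(m)) * (s - back)
    return grad
```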

Blind-Depth LFSR Framework
Due to the special property of light fields, the optimization problem formulated in (3) cannot be solved directly. Since the shearing shifts are related to the depth of the scene, depth-reliant LFSR techniques [1,6,10,11,19,20,21] first employ depth or disparity estimation to build a bridge between pixel locations and their corresponding shifts. However, as the refocusing technique in [22] shows, we do not need depth information to refocus the sub-images at various depths. Merely discretizing the slope into a series of values and traversing most depths of the scene is enough.
Inspired by the refocusing process, we propose a novel LFSR framework independent of depth information. We first refocus the sub-images and simultaneously project them onto a target HR buffer, using bicubic interpolation in the process (see the sketch after this paragraph). The refocused 'HR' image is then fed to a variational regularization-based SR algorithm as the initial value. After the iterations, the algorithm converges to an HR image with the focused region super-resolved and the unfocused region well-preserved. By traversing the preset slope parameter, we can obtain a series of multi-focus HR images. Finally, we use an SWT-based fusion method to generate an all-in-focus HR image of the scene. The overall process of the proposed framework is illustrated in Figure 1.
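A rough sketch of this first stage, shift-and-add refocusing projected onto an HR buffer, is given below. The camera-coordinate convention and the cubic-spline upsampling (standing in for bicubic interpolation) are assumptions for illustration:

```python
import numpy as np
from scipy.ndimage import shift, zoom

def refocus_to_hr(sub_images, cam_coords, slope, factor=4):
    """Shift-and-add refocusing projected directly onto an HR buffer.
    Each sub-image is upsampled, shifted by slope * (its camera offset
    from the reference view), and the shifted images are averaged."""
    acc = None
    for img, (u, v) in zip(sub_images, cam_coords):
        hr = zoom(img, factor, order=3)                    # upsample to HR grid
        hr = shift(hr, (slope * v * factor, slope * u * factor),
                   order=3, mode='nearest')                # disparity for this slope
        acc = hr if acc is None else acc + hr
    return acc / len(sub_images)
```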

Variational Regularization-Based LFSR Method
Since the initial input of the iteration is refocused at a particular depth, objects within the depth of field are clear while objects outside it are blurred, so the LFSR method needs to be adapted. We replace the global objective function in (3) with a local one:

$$ \hat{X}_i = \arg\min_{X_i} \left\{ \sum_{k=1}^{K} \left\| D B F_{k,i} X_i - Y_{i,k} \right\|_2^2 + \lambda J(X_i) \right\}, \qquad (5) $$

where $X_i$ is the desired, vectorized HR image focused on the $i$-th depth layer, with the region in focus clear and the region out of focus blurred, and $Y_{i,k}$ is the projection of $X_i$ into the LR space of camera $k$, with the focused region aligned and the unfocused region blurred.
We adopt the gradient descent method in [10] to obtain an approximate solution:

$$ \hat{X}_i^{(t+1)} = \hat{X}_i^{(t)} + \beta \left\{ \sum_{k=1}^{K} F_{k,i}^{T} B^{T} D^{T} \left( Y_{i,k} - D B F_{k,i} \hat{X}_i^{(t)} \right) - \lambda \nabla J\big(\hat{X}_i^{(t)}\big) \right\}. \qquad (10) $$

We employ the projection method proposed in [24]. Fully exploiting the sparseness of the degradation matrices, we can replace the matrix multiplications with operations on the image matrix, which greatly reduces the computational complexity.

During the iterations, we first project the estimated HR image into LR space and subtract the estimate from the LR images. Then we project the residuals onto the HR grid and correct the estimated HR image; $\beta$ is the step size balancing the weight of the correction. A sketch of this loop follows.
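In the compact sketch below, project_lr and project_hr stand for the assumed forward operator $DBF$ and its transpose, and btv_gradient is the sketch from Section 2; all of these names are placeholders:

```python
import numpy as np

def lfsr_iterate(x0, lr_images, project_lr, project_hr,
                 beta=0.05, lam=0.001, n_iter=10):
    """Gradient-descent iteration of (10): project the HR estimate into
    LR space, back-project the residuals onto the HR grid, and correct."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        correction = np.zeros_like(x)
        for k, y_k in enumerate(lr_images):
            residual = y_k - project_lr(x, k)        # LR-space residual
            correction += project_hr(residual, k)    # back-projection
        x += beta * (correction - lam * btv_gradient(x))
    return x
```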

Focused Region Recognition
After the iteration in (10), both the focused and the unfocused regions are 'super-resolved'. However, as Figure 2 shows, the unfocused region does not correspond to a correct shift in the refocusing process. In other words, it is corrected with misaligned residuals; as the results 'D3' and 'D5' in Figure 5 show, this produces artifacts in the unfocused regions, degrading the SR quality and damaging the final fusion result. Consequently, we have to employ a method to recognize the focused region and prevent the unfocused region from being damaged.
Inspired by the epipolar plane image (EPI) proposed in [1], [6], we present a novel approach. The focused region in LR space is aligned by the shearing shifts, so the fluctuation among its residuals in LR space is tiny. Due to the misalignment of the unfocused region, however, its residuals fluctuate much more strongly, which produces a larger variance. We therefore identify the in-focus region by the variance of the pixel residuals and map the variance to the weights of the correction matrix in the iteration.

Figure 3. Focused region recognition: (a) the 'Lego Knights' scene refocused on the two side knights; (b) the variance map of the residual matrix; (c) the plot of the transform proposed in (11); (d) the transformed weight map.
As Figure 3 (a) shows, we use the camera array light field datasets from Stanford University [27] and refocus them at the depth of the two side knights. Then we calculate the residuals and draw the variance map. From Figure 3 (b) we can see that the region in focus corresponds to smaller variance. We provide a mapping relationship, demonstrated in (11), to match the variance to the weights of the correction matrix, where $V$ is the variance at a certain pixel and $\theta$ is the weight of the corresponding element in the correction matrix.
The correction matrix can then be obtained by (12), where $\Theta$ is the weight vector matching the corresponding pixels of the residual vector in LR space. By (11), the variance map is transformed into the weight map depicted in Figure 3 (d).
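Since the exact transform (11) is not reproduced here, the sketch below uses an exponential decay as an illustrative stand-in for the variance-to-weight mapping; only its monotonically decreasing shape is taken from the text:

```python
import numpy as np

def residual_weight_map(residual_stack, tau=None):
    """Map the per-pixel variance of the K LR residual images to a
    correction weight theta: aligned (in-focus) pixels have small
    variance and get weight near 1; misaligned pixels get weight near 0.
    The exponential form is an illustrative stand-in for (11)."""
    V = np.var(residual_stack, axis=0)      # variance map, cf. Figure 3 (b)
    tau = np.mean(V) if tau is None else tau
    theta = np.exp(-V / tau)                # monotone decreasing mapping
    return theta                            # weight map, cf. Figure 3 (d)
```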

Multi-Focus Image Fusion
Multi-focus image fusion techniques address the problem that an image with all objects in focus cannot be obtained in a single snapshot due to the camera's limited depth of field; their goal is to create an all-in-focus image from multi-focus images [25]. Since the camera array has a large equivalent aperture, the algorithm is required to handle a large number of images with an extremely shallow depth of field. To solve this problem, we employ an SWT-based multi-focus image fusion algorithm, whose process is shown in Figure 4.
We assume that the multi-focus source images $I_n$ $(n = 1, 2, \dots, N)$ are decomposed into $k$ levels, each containing approximation components A, horizontal details H, vertical details V and diagonal details D. Suppose the size of $I_n$ is $P \times Q$; since the SWT avoids the downsampling process [26], the size of all SWT coefficients A, H, V, D at each level is $P \times Q$ as well. Different fusion rules are then adopted to select the coefficients.
Usually, the 'max absolute value (Max Abs)' rule is employed to select the detail coefficients, because the focused area always contains more edges and textures, whose detail coefficients have large absolute values. The Max Abs rule can be represented as

$$ H^{k}(i,j) = H^{k}_{n^{*}}(i,j), \quad n^{*} = \arg\max_{n \in \{1, \dots, N\}} \left| H^{k}_{n}(i,j) \right|, \qquad (13) $$

where $N$ is the total number of source multi-focus images, $n$ indexes the source frames from 1 to $N$, $(i,j)$ is the coordinate in the decomposed coefficient images, and $k$ is the decomposition level.
$H_n^k$ denotes the horizontal detail coefficients of image $n$ at the $k$-th level and $H^k$ denotes the selected horizontal detail coefficients at the $k$-th level. The same selection process is applied to the coefficients V and D.
As for the approximation components, which contain large-scale information, the 'average (Avg)' rule is commonly adopted:

$$ A^{k}(i,j) = \frac{1}{N} \sum_{n=1}^{N} A^{k}_{n}(i,j), \qquad (14) $$

where $A_n^k$ denotes the approximation coefficients of image $n$ at the $k$-th level and $A^k$ denotes the integrated approximation coefficients at the $k$-th level. Both rules have been confirmed efficient in practice. Finally, the integrated coefficients are inverse-transformed to obtain the output image $I_f$, as in the sketch below.
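A minimal PyWavelets sketch of the two fusion rules might look as follows; the wavelet 'db4' and the decomposition level are placeholder choices:

```python
import numpy as np
import pywt

def swt_fuse(images, wavelet='db4', level=3):
    """SWT-based multi-focus fusion: 'Avg' rule (14) for approximation
    bands, 'Max Abs' rule (13) for the H, V, D detail bands.
    Image sides must be divisible by 2**level for pywt.swt2."""
    coeffs = [pywt.swt2(img.astype(float), wavelet, level=level)
              for img in images]
    fused = []
    for lvl in range(level):
        cA = np.mean([c[lvl][0] for c in coeffs], axis=0)       # Avg rule
        bands = []
        for b in range(3):                                      # H, V, D
            stack = np.stack([c[lvl][1][b] for c in coeffs])
            pick = np.argmax(np.abs(stack), axis=0)             # Max Abs rule
            bands.append(np.take_along_axis(stack, pick[None], axis=0)[0])
        fused.append((cA, tuple(bands)))
    return pywt.iswt2(fused, wavelet)
```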

Experiments
In this section, both synthetic and real-world scenes are employed to test the effectiveness of the proposed framework. For the synthetic scenes, we use the datasets produced by the light field laboratory at Stanford University [27]. We first simulate the degradation process (shifting, blurring and downsampling) and then feed the simulated LR sub-images to the algorithms. We set the original centre-view sub-image as the reference image. By comparing the results with the reference image, we can obtain precise quantitative metrics, which is significant for the evaluation. Average Gradient (AG), Spatial Frequency (SF), Root Mean Square Error (RMSE), Peak Signal to Noise Ratio (PSNR) and Correlation Coefficient (Corr) are employed for the quantitative assessment [28], [29].
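For reference, common definitions of three of these metrics are sketched below; the exact formulas in [28], [29] may differ in normalization:

```python
import numpy as np

def psnr(ref, img, peak=255.0):
    """Peak Signal-to-Noise Ratio against the reference image."""
    mse = np.mean((ref.astype(float) - img.astype(float)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def average_gradient(img):
    """Average Gradient (AG): mean local gradient magnitude."""
    gx, gy = np.gradient(img.astype(float))
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def spatial_frequency(img):
    """Spatial Frequency (SF): combined row/column difference energy."""
    img = img.astype(float)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)
```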
The Stanford University light field datasets are produced by a 17×17 array of cameras. We choose the central 5×5 sub-images to run the experiments. We arbitrarily select the scenes named 'Lego Knights', 'Stanford Bunny' and 'Tarot Cards', each sub-image having a resolution of 1024×1024. For all datasets, we downsample the source sub-images to a resolution of 256×256, then apply a 3×3 Gaussian blurring kernel with variance 0.5 to simulate the blurring effect. Finally, we add zero-mean Gaussian white noise to each sub-image. In the LFSR process, we set the correction rate β to 0.05 and the regularization weight λ to 0.001, and the number of iterations is set to 10. In the fusion process, the SWT decomposition level is set to 7. All of the parameters are tuned to balance effectiveness and computational complexity. The bicubic interpolation method, the LFSR approach without focused region recognition (the rough LFSR method) and the proposed LFSR method are compared; the visual results are shown in Figure 5. From Figure 5 we can see that the bicubic interpolation method contributes little to super-resolution but causes no distortion in the fusion results. In contrast, the rough LFSR method produces highly super-resolved results in the focused region but causes severe distortion in the unfocused region, which obviously destroys the fusion results. By comparison, although the proposed LFSR framework loses some sharpness in the focused region, it preserves the unfocused region and generates a satisfactory fusion result without distortion.
Analysing the quantitative results in Table 1, we can also infer that the distortion caused by the rough LFSR method in the unfocused region severely degrades the fusion results. It not only introduces noise, which inflates the AG and SF, but also reduces fidelity, which lowers the PSNR and Corr and increases the RMSE. In contrast, our proposed method can handle detail-rich scenes and generates the results with the best quantitative metrics.
The experiments with synthetic scenes preliminarily demonstrate the advantage of our proposed framework. However, the simulation is not equivalent to the real degradation process, so experiments with real-world scenes are necessary to test the practical value of the framework. Since the single camera scanning approach is equivalent to the camera array approach for static scenes, we employ a self-developed scanning gantry with cameras mounted on it to capture real-world scenes in our own laboratory. The devices and the scenes are shown in Figure 6. In the experiments, we move the camera to 25 different positions to form a 5×5 array. With the Leica Q camera, the exposure time is set to 0.1 second and the aperture diameter is set to 2 mm. Since the Meizu (Meilan 5) smartphone camera has a fixed aperture (2.2 mm) which is unadjustable, we set the exposure time to 0.05 second to achieve a proper brightness. After camera calibration [30] and sub-image registration, the resolutions of the rectified sub-images are 2148×2048 for the Leica Q camera and 2560×1440 for the phone camera respectively. To reduce the computational complexity, we crop the central 1024×1024 pixels and super-resolve them to a 4096×4096 resolution. The level of the SWT decomposition is set to 10 and the other parameters are the same as in the synthetic scene experiments. The results are shown in Figure 7 and Figure 8. From the results of the real-world scene experiments, it can be seen that the proposed framework works well and produces results with the highest visual quality. In Figure 7 we observe the details 'date', 'remarks' and 'days of the week', and in Figure 8 we observe the numbers on the calendar. The proposed LFSR framework produces clearer results than the original central-view sub-images and the results of bicubic interpolation. Compared with the results of the rough LFSR method, our proposed framework performs better in reducing block distortion and generates much cleaner images. The two scenes with different cameras reasonably demonstrate the effectiveness and the practical value of our framework.

Conclusion
In this paper, we propose a novel LFSR framework free of prior estimated depth information. The refocused image is used as the initial value of the iteration, and a regularization-based SR algorithm is adapted to produce an HR image with the focused region super-resolved and the unfocused region preserved. By traversing the preset slope parameters, we obtain a series of multi-focus images. Finally, we use the SWT-based fusion algorithm to generate an all-in-focus HR image of the scene. Experiments on synthetic and real-world datasets demonstrate the effectiveness and practical value of the proposed framework.