Video Static Foreground Detection Method for Desktop Robotic Arm

Although there are many efficient methods for detecting the foreground in video frames, they are not fully applicable to the desktop robotic arm scenario. This paper presents a video static foreground detection method suited to this scenario. In our method, an auxiliary board is used to reduce false detections caused by camera-shake and work-area movement, and to effectively eliminate non-target foreground interference such as human hands. With the auxiliary board we can also accurately calculate the 2-dimensional position and posture of the foreground. The method is generally applicable to most desktop robotic arm scenarios. Furthermore, compared with traditional foreground detection methods, this paper presents the 3-frames threshold method to suppress the heavy noise that arises in multi-textured background environments. Finally, for the problem that classification is difficult when foreground colors are similar to background colors, this paper presents the difference degree to describe pixels.


Introduction
The idea of background subtraction [1] is to build a background model from the first frame of the video or from previous frames, and then to compare the current frame with the background model so as to separate foreground from background. These methods fall into two categories: (1) parametric techniques that use a parametric model for each pixel location [1] [2], and (2) sample-based techniques that build their model by aggregating previously observed values for each pixel location [3] [4]. These methods are widely used for object detection when the camera is fixed and the background is static, which suits the desktop robotic arm scenario. Unfortunately, this idea only detects the moving foreground. In the desktop robotic arm scenario, we need to detect the static foreground in order to obtain its static position and posture. In addition, we need to eliminate false detections of non-target foreground, such as the robotic arm and human hands while they interact with the work area. Furthermore, we need to overcome the camera-shake and work-area movement caused by the motion of the robotic arm.
Among static foreground detection methods, some directly compare the current frame with a background frame [5] [6], but these methods obtain all foreground, moving and static, and cannot distinguish whether the foreground is moving. Other methods use double foregrounds: they compute all foreground and the moving foreground separately, then subtract the moving foreground from all foreground to obtain the static foreground [7] [8]. However, these methods are only effective for continuously moving foreground; objects that stay still for a short time or move slowly are mistakenly detected. For example, in this application scenario, robotic arms and human hands are mistakenly detected when they pause while interacting with the work area.
In this paper, we add an auxiliary board to the work area of the robotic arm to neatly solve the above problems (the specific installation method is given in Section 2), and present a video static foreground detection method suitable for desktop robotic arms, which is divided into three steps. The first step is to extract and correct the area within the auxiliary board (the work area). The second step is foreground detection; in this step, a sample-based background subtraction method is designed for the desktop robotic arm scenario. The third step is to calculate the 2-dimensional position and posture of the foreground. What is new in this method is that the auxiliary board is introduced to solve the problem of static foreground detection and to eliminate the interference of non-target foreground. The method works smoothly under various conditions, such as lighting changes, foreground colors similar to background colors, and camera-shake or work-area movement.
The rest of the paper is organized as follows. Section 2 introduces the installation method of the desktop robotic arm. Section 3 presents the method in detail. Experimental results are given in Section 4. Section 5 gives a summary and future work.

Installation of Desktop Robotic Arm
(a) Common method (b) Method with auxiliary board added
Fig.1 Installation of the robotic arm
The eye-to-hand installation of a desktop robotic arm is as follows: the camera and the robotic arm are both fixed on the desktop, with the camera directly above the work area of the robotic arm, shooting downward. The specific installation is shown in Fig.1(a).
In our method, we need to install a rectangular auxiliary board (cardboard is used in this paper) in the camera's field of vision; it should occupy as much of the field of vision as possible, as shown in Fig.1(b). If the color of the auxiliary board is similar to the desktop, it is necessary to draw an obvious dividing line along the boundary edges of the auxiliary board, and no objects should be placed on the board boundary. When it is inconvenient to install an auxiliary board, an obvious rectangular boundary mark can be drawn on the desktop instead. The auxiliary board can be used directly as the workbench of the desktop robotic arm. If a separate workbench is used, it can be fixed anywhere on the auxiliary board, but the auxiliary board should be slightly larger than the workbench, and the boundary of the workbench must not extend beyond that of the auxiliary board. We stipulate that the work area of the desktop robotic arm is the area within the auxiliary board.

Extraction for Auxiliary Board Region
Although the camera is fixed, the background can still change because of camera-shake and work-area movement, which would invalidate the background subtraction method. We therefore fix the background by extracting, from each video frame, the image containing only the auxiliary board region, and restricting background subtraction to that image. In this way the background is fixed, and interference from outside the work area is eliminated, which also reduces the amount of computation. To extract the auxiliary board region, we must first extract the outer boundary contour of the auxiliary board, denoted C*, as shown in Fig.2. We take advantage of the obvious boundary between the auxiliary board and the desktop. First, we apply the Canny edge detector to the video frame, then find the contours of the resulting edges and denote them as the set {C1, C2, C3, ...}. Because the auxiliary board occupies the camera's field of view as much as possible during installation, we only need to find the contour with the largest area in {C1, C2, C3, ...}; with high probability this contour is C*.
Fig.2 Outer boundary contour of the auxiliary board
However, edge detection may introduce noise, so sometimes the extracted contour is unusable (not a clean and complete quadrilateral): it may be incomplete, contain obvious noise interference, or not be C* at all. Such mistakes would propagate into our subsequent detection, so we perform a correctness check. Because these mistakes occur with small probability, we can simply eliminate the affected frames; see Section 3.1.3 for details.
Next, we need to find the four vertices of the contour C*. We first fit a polygon to C* with the Douglas-Peucker algorithm [9], and then apply Shi-Tomasi corner detection [10] to obtain the coordinates of the four vertices of C*, denoted (x1, y1), (x2, y2), (x3, y3), (x4, y4), as shown in Fig.3(b). Finally, according to these four vertices, the pixels inside C* are cut out of the video frame as a quadrilateral image, denoted I_q, which is the auxiliary board region, as shown in Fig.3(c).
(a) Video frame (b) Contour C* and 4 vertices (c) Extraction result
Fig.3 Extraction process of the auxiliary board region

Correction for Auxiliary Board Region
Although the auxiliary board is rectangular, camera installation error means the optical axis of the camera cannot be guaranteed to be exactly perpendicular to the desktop, so the extracted I_q is an irregular quadrilateral image, as shown in Fig.4(a). This distorts the foreground, which is inconvenient for subsequent processing and for calculating the 2-dimensional position and posture of the foreground. Therefore, it is necessary to transform the irregular quadrilateral image into a rectangular image by a perspective transformation. In 3.1.1 we calculated the vertex coordinates (x1, y1), (x2, y2), (x3, y3), (x4, y4) of I_q. Let the four vertex coordinates of I_q after the perspective transformation be (0, 0), (w, 0), (w, h), (0, h), where

w = max(d((x1, y1), (x2, y2)), d((x4, y4), (x3, y3)))
h = max(d((x1, y1), (x4, y4)), d((x2, y2), (x3, y3)))

and d(., .) denotes the Euclidean distance between two vertices. From the four point pairs (x1, y1)-(0, 0), (x2, y2)-(w, 0), (x3, y3)-(w, h), (x4, y4)-(0, h) we can directly solve a system of linear equations for the homography matrix H of the perspective transformation, and then transform I_q into a rectangular image with H. In this process, a linear interpolation algorithm is used to fill the vacant parts of the rectangular image.
Finally, we get a rectangular image containing only the auxiliary board region, as shown in Fig.4(b). After that, we focus only on this rectangular region, which effectively reduces the impact of camera-shake and work-area movement. Moreover, given the size of the auxiliary board, we can accurately and easily calculate the 2-dimensional position and posture of the foreground.
(a) Before correction (b) After correction
Fig.4 Correction process of the auxiliary board region
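The "directly solve the linear equations" step can be made concrete. The sketch below solves for the 3x3 homography H (with h33 = 1) by stacking two linear equations per point pair; the example coordinates are assumptions. In practice a library routine such as OpenCV's `getPerspectiveTransform` does the same thing.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve the 8 unknowns of H (h33 = 1) from 4 point correspondences.
    Each pair (x, y) -> (u, v) contributes two linear equations."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def apply_h(H, p):
    """Apply H to a 2-D point in homogeneous coordinates."""
    q = H @ np.array([p[0], p[1], 1.0])
    return q[0] / q[2], q[1] / q[2]

# Irregular quadrilateral (from 3.1.1) mapped onto a w x h rectangle.
src = [(12, 8), (300, 20), (310, 230), (5, 240)]
w, h = 320, 240
dst = [(0, 0), (w, 0), (w, h), (0, h)]
H = homography_from_points(src, dst)
```

Warping I_q with H (plus linear interpolation for the vacant pixels) yields the rectified board image.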

Correctness Check for Auxiliary Board Region
In Section 3.1.1 we emphasized that C* may be unusable, which makes the extracted image wrong, so we need a correctness judgment for C*. In fact, one can easily see with the naked eye whether C* is usable (a clean and complete quadrilateral contour), but it is impossible to inspect every frame manually. So we take the first transformed image as the initial frame I_init and confirm its availability with the naked eye, use I_init to build the background model (see 3.2.1 for details), and then use I_init as a reference to check whether the image I_t at time t is available. The method is as follows: first, we resize I_t to the size of I_init; then we extract ORB features [11] from I_init and I_t and match their feature points with FLANN [12] to obtain a point-pair set {(p1, q1), (p2, q2), ..., (pn, qn)}. Because the background in this application scenario is static and I_init is the background frame, the matched feature-point pairs are all static background points, so the pixel coordinates of each pair should be identical. If the pixel coordinates of any pair are not equal, the obtained I_t is considered wrong and is eliminated:

retain I_t,    if for all i in {1, 2, ..., n}: position(pi) = position(qi)
eliminate I_t, if there exists i in {1, 2, ..., n}: position(pi) != position(qi)
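The retain/eliminate rule can be written as a small predicate. The sketch below assumes the ORB + FLANN matching has already produced the point-pair set; the pixel tolerance `tol` is our addition for robustness (the text requires exact equality, which `tol = 0` recovers).

```python
def frame_is_usable(pairs, tol=0.0):
    """pairs: [(p_i, q_i), ...] where p_i is a matched feature point in
    I_init and q_i the corresponding point in the resized I_t.
    Retain I_t only if every matched static-background pair sits at the
    same pixel coordinates (within tol)."""
    return all(abs(p[0] - q[0]) <= tol and abs(p[1] - q[1]) <= tol
               for p, q in pairs)
```

A frame whose matched background points have shifted (a wrongly extracted or incomplete board region) is thereby eliminated.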

Intrusion Detection for Auxiliary Board Region
In this method, we need to judge whether the foreground in the auxiliary board is static, and we also need to eliminate the interference of non-target foreground. Following the conventional idea, we would first detect all static foreground and then judge whether it is non-target foreground such as human hands or robotic arms; this requires an additional algorithm to decide whether a foreground is a target, which increases the complexity of the method. However, in this application scenario, the foreground objects to be detected are placed in the work area by human hands or the robotic arm, and the work area lies within the auxiliary board. The process of hands or the arm interacting with the work area is an intrusion, and an intrusion breaks the boundary contour C* of the rectangular auxiliary board (as shown in Fig.5(c)). We also assume that the foreground does not move by itself when no hand or arm is interacting with the work area. So we transform the two problems, judging whether the foreground is static and eliminating non-target foreground, into a single intrusion-detection problem, and intrusion detection in turn reduces to checking the completeness of C*. Judging the completeness of C* is very simple: on the first non-intrusion frame, we calculate the length of contour C* (as shown in Fig.5(b)) and denote it L0; if an intrusion occurs, the length changes (as shown in Fig.5(c)). We detect the foreground of I_t only when there is no intrusion, which solves both the static-foreground problem and the non-target-foreground interference problem.
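The completeness check thus reduces to comparing contour lengths. A minimal sketch follows, where the relative tolerance `tol` is an assumption on our part (the text only states that the length changes under intrusion):

```python
def intrusion_detected(length_t, length_ref, tol=0.01):
    """length_ref: length L0 of contour C* in the first non-intrusion frame;
    length_t: length of C* in the current frame. An intrusion by a hand or
    the robotic arm breaks the board boundary and changes the contour
    length, so any relative change beyond tol signals an intrusion."""
    return abs(length_t - length_ref) / length_ref > tol
```

Foreground detection on I_t is run only when this predicate is False.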

Background Modeling
Here, we build a background sample set for image I_t. We denote p_t(x, y) as the point of image I_t at coordinate (x, y). To initialize the background sample set quickly, we borrow from the ViBe algorithm [4] and randomly select N points in the 8-neighbourhood of point p(x, y) as the sample set of that point, denoted {s1(x, y), s2(x, y), ..., sN(x, y)}. We denote v(x, y) as the description value of point p(x, y). Unlike previous methods, in addition to the RGB information we add the difference degree d, that is:

v(x, y) = (R(x, y), G(x, y), B(x, y), d(x, y))

where

d(x, y) = |R(x, y) - G(x, y)| + |G(x, y) - B(x, y)| + |B(x, y) - R(x, y)|

The addition of d makes classification possible when the foreground colors are similar to the background colors, because d enlarges the distance in RGB color space: even when two pixels are very close in RGB, their d values can still differ substantially. At the same time, d resists brightness changes: when the brightness changes, the R, G, and B values increase or decrease together, so d does not change. Fig.6 illustrates this well.
(a) Background video frame (b) Video frame (c) Distance without d (d) Distance with d
Fig.6 Comparison results of adding d
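The description value and difference degree can be sketched directly from the definitions above; the brightness-invariance claim is easy to verify, since adding a constant offset to all three channels leaves the pairwise differences unchanged.

```python
def description(pixel):
    """Description value v = (R, G, B, d) with the difference degree
    d = |R - G| + |G - B| + |B - R|."""
    r, g, b = (int(c) for c in pixel)
    d = abs(r - g) + abs(g - b) + abs(b - r)
    return (r, g, b, d)
```

For example, (100, 120, 90) and its uniformly brightened version (130, 150, 120) share the same d = 60.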

Classification of Foreground and Background
This section introduces the foreground/background classification method. When the background is fixed, by the conventional idea, if a point is close to at least one sample point in its corresponding sample set, the point is considered a background point. For point p(x, y), we calculate the distance between it and the N sample points in the corresponding background sample set (the Euclidean distance between description values), and take the smallest of these distances, denoted D_min(x, y):

D_min(x, y) = min over i in {1, ..., N} of ||v(x, y) - v(si(x, y))||

If D_min(x, y) is greater than the dynamic threshold T(x, y), the pixel is judged as quasi-foreground (the dynamic threshold is described in 3.2.3):

sign(x, y) = 1, if D_min(x, y) > T(x, y)
sign(x, y) = 0, if D_min(x, y) <= T(x, y)

Considering the spatial correlation of pixels, the neighbourhood of a foreground pixel is more likely to also be foreground, so we vote over the 8-neighbourhood of each pixel: if more than half of the 8 neighbours are quasi-foreground, the point is classified as foreground. This effectively filters noise and holes.
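The classification and 8-neighbourhood vote can be sketched with numpy as follows. The array shapes and the vectorized vote are our own arrangement; `thresh` is the dynamic threshold map of 3.2.3.

```python
import numpy as np

def classify(frame_desc, samples_desc, thresh):
    """frame_desc: H x W x 4 description values of the current frame;
    samples_desc: H x W x N x 4 background sample descriptions;
    thresh: H x W dynamic threshold map. Returns the foreground mask."""
    # Minimum Euclidean distance to the N background samples per pixel.
    dist = np.linalg.norm(samples_desc - frame_desc[:, :, None, :], axis=-1)
    d_min = dist.min(axis=2)
    quasi = (d_min > thresh).astype(np.int32)  # 1 = quasi-foreground
    # 8-neighbourhood vote: box-sum of shifted copies, minus the pixel itself.
    padded = np.pad(quasi, 1)
    votes = sum(padded[dy:dy + quasi.shape[0], dx:dx + quasi.shape[1]]
                for dy in range(3) for dx in range(3)) - quasi
    # More than half of the 8 neighbours quasi-foreground -> foreground.
    return votes > 4
```

On a 5x5 test image with a 3x3 quasi-foreground block, the block centre (8 quasi-foreground neighbours) is kept as foreground while isolated border pixels are filtered out.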

Dynamic Threshold and Background Model Updating
To cope with changes in ambient light, the background sample set must be updated regularly. Here we again follow the ViBe [4] update strategy: when updating, a random sample point in the corresponding sample set is replaced by a background point, which keeps the life cycle of the samples smooth. At the same time, to avoid salt-noise foreground being absorbed into the background, only points detected as background in three consecutive frames update their background samples.
If the threshold is fixed, although simple, heavily textured scenes produce a lot of texture noise, as shown in Fig.7(c). This is because our foreground classification is pixel-based, so some pixels are inevitably misclassified, especially where the pixel-value gradient is large; we call these flicker points. A dynamic threshold is therefore necessary. Considering the temporal correlation of the image, we present the 3-frames threshold method. For point p(x, y), we take the maximum of D_min over the last three frames plus a preset radius R as the threshold:

T_t(x, y) = D_{t-1}(x, y) + R,                                        t = 1
T_t(x, y) = max(D_{t-1}(x, y), D_{t-2}(x, y)) + R,                    t = 2
T_t(x, y) = max(D_{t-1}(x, y), D_{t-2}(x, y), D_{t-3}(x, y)) + R,     t >= 3

In this way, the characteristics of recent frames are used reasonably, and the interference of texture noise is effectively avoided. Consistent with the background update, only points detected as background in three consecutive frames update their dynamic thresholds. The classification results after using the 3-frames threshold method are shown in Fig.7.
Fig.7 Comparison results of the 3-frames threshold method
Because the junction of foreground and background also has a large pixel gradient, flicker points appear easily there, and the same flicker points can appear inside the foreground.
These flicker points cause pixels inside the foreground to be updated as background samples, which produces ghosts at the flicker points when the foreground is moved or removed. Based on this, we present the reject-update-region strategy: all pixels in the convex hull of each foreground connected domain join the reject-update region. To avoid texture noise being added to the reject-update region, and considering that texture noise has a small area, a foreground whose connected domain is smaller than a preset area threshold is not added to the reject-update region. In this paper, this area threshold is set to 1/50 of the area of image I_t.
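The 3-frames threshold and its "three consecutive background frames" update rule can be sketched as below. The bookkeeping via a background-run counter is our interpretation, and the initial threshold value is an assumption.

```python
import numpy as np
from collections import deque

class DynamicThreshold:
    """3-frames threshold: T_t = max of D_min over the last three frames,
    plus the preset radius R. A pixel's threshold is refreshed only after
    it has been classified as background for three consecutive frames."""

    def __init__(self, shape, radius):
        self.radius = radius
        self.history = deque(maxlen=3)            # D_min maps of recent frames
        self.bg_run = np.zeros(shape, dtype=np.int32)
        self.thresh = np.full(shape, radius, dtype=np.float64)

    def update(self, d_min, is_background):
        # Count consecutive background classifications per pixel.
        self.bg_run = np.where(is_background, self.bg_run + 1, 0)
        self.history.append(d_min.copy())
        candidate = np.maximum.reduce(list(self.history)) + self.radius
        stable = self.bg_run >= 3
        self.thresh = np.where(stable, candidate, self.thresh)
        return self.thresh
```

For a pixel seen as background for three frames with D_min values 3, 5, 4 and R = 2, the threshold becomes max(3, 5, 4) + 2 = 7.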

Calculating the 2-dimensional Position and Posture of the Foreground
Based on the above foreground detection results, we can easily obtain the centre pixel coordinates and 2-dimensional angle of each foreground. The specific method is as follows. First, we find the outer contour point set of each foreground connected domain. Then we compute the convex hull of each point set. Finally, we use the Rotating Calipers algorithm to find the minimum enclosing rectangle of each convex hull. The centre of the minimum enclosing rectangle is taken as the pixel centre of the foreground, denoted (x_p, y_p), and the angle between the minimum enclosing rectangle and the X axis of the image is taken as the 2-dimensional posture of the foreground, denoted theta.
Next, we solve the physical 2-dimensional position and posture of each foreground. According to the principle of camera imaging, the relationship between a pixel coordinate (u, v) and its world physical coordinates (X, Y, Z) is:

u = (f / dx) * (X / Z) + u0
v = (f / dy) * (Y / Z) + v0

where dx and dy are the scale factors of the horizontal and vertical axes of the image, f is the focal length of the camera, and u0 and v0 are the offsets caused by the non-coincidence of the camera's optical centre with the image origin; all of these are constants. The imaging region is the desktop, which is a plane, so the Z coordinate can also be regarded as a constant. Therefore, the physical coordinates of the foreground are a linear mapping of its pixel coordinates, and the pixel angle equals the physical angle. The physical size (W, H) of the auxiliary board is known, and its pixel size (w, h) was obtained in 3.1.2. According to the above reasoning, the physical 2-dimensional position and posture of the foreground, with the upper left corner of the auxiliary board as the origin, are:

(X, Y, theta) = (x_p * W / w, y_p * H / h, theta)
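Since the mapping from rectified pixel coordinates to board coordinates is linear, the final computation is just a pair of scale factors; the example numbers below are assumptions.

```python
def pixel_to_physical(xp, yp, theta, w, h, board_w, board_h):
    """Map the foreground centre (xp, yp) in the w x h rectified image to
    physical coordinates on the board_w x board_h auxiliary board (origin
    at the board's upper-left corner). The 2-D angle theta is unchanged by
    the uniform scaling."""
    return xp * board_w / w, yp * board_h / h, theta
```

For a 320 x 240 rectified image of a 200 x 150 board, a foreground centred at pixel (160, 120) maps to the physical point (100.0, 75.0) with its angle preserved.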