3-D Object Detection and Recognition: Assisting Visually Impaired People in Daily Activities

To evaluate the detection of 3-D queried objects for the VIPs, we prepared ground-truth data for two phases. The first phase evaluates the table plane detection: the ground truth is prepared as in Sec. 2.2.4.2, and the 'EM1' measurement is used for the evaluation. The second phase evaluates the object detection: we prepared the ground-truth data and compute T1 for evaluating 3-D cylindrical object detection and T2 for evaluating 3-D spherical object detection, as presented in Sec. 4.1.4.2. To detect objects in the RGB images, we train the object classifier with the YOLO network; the number of classes and iterations are as in Sec. 4.1.4.3. All source code of the program is published at the link: 1

The proposed methods are evaluated in various scenarios, including datasets collected in lab environments and public datasets. The research works in the dissertation are composed of six chapters, as follows:

- Introduction: This chapter describes the main motivations and objectives of the study. We also present critical points of the research's context, and the constraints and challenges that we meet and address in the dissertation. Additionally, the general framework and main contributions of the dissertation are presented.

- Chapter 1: A Literature Review: This chapter mainly surveys existing aided systems for the VIPs. In particular, the related techniques for developing an aided system are discussed. We also present the relevant works on estimation algorithms and a series of techniques for 3-D object detection and recognition.

- Chapter 2: In this chapter, we describe a point cloud representation from data collected by a MS Kinect sensor. A real-time table plane detection technique for separating the interested objects from a certain scene is described. The proposed table plane detection technique is adapted to the contextual constraints. The experimental results confirm the effectiveness of the proposed method on both self-collected and public datasets.

- Chapter 3: This chapter describes a new robust estimator for primitive shape estimation from point cloud data. The proposed robust estimator, named GCSAC (Geometrical Constraint SAmple Consensus), utilizes geometrical constraints to choose good samples for estimating models. Furthermore, we utilize the contextual information to validate the estimation's results. In the experiments, the proposed GCSAC is compared with various RANSAC-based variations on both synthesized and real datasets.

- Chapter 4: This chapter describes the completed framework for locating and providing the full information of the queried objects. In this chapter, we exploit the advantages of recent deep learning techniques for object detection. Moreover, to estimate the full 3-D model of the queried object, we utilize GCSAC on the point cloud data of the labeled object. Consequently, we can directly extract the object's information (e.g., size, normal surface, grasping direction). This scheme outperforms existing approaches such as solely using 3-D object fitting or 3-D feature learning.

- Chapter 5: We conclude the works and discuss the limitations of the proposed methods. Research directions are also described for future works.

CHAPTER 1
LITERATURE REVIEW

In this chapter, we present surveys on the related works of aided systems for the VIPs and object detection methods in indoor environments. Firstly, relevant aiding applications for VIPs are presented in Sec. 1.1. Then, the state-of-the-art works on 3-D object detection and recognition are introduced and analyzed in Sec. 1.2. Finally, robust estimators and their applications in robotics and computer vision are presented in Sec. 1.3.
1.1 Aided systems supporting visually impaired people
1.1.1 Aided systems for navigation services
1.1.2 Aided systems for obstacle detection
1.1.3 Aided systems for locating the interested objects in scenes
1.1.4 Aided systems for detecting objects in daily activities
1.1.5 Discussions

1.2 3-D object detection and recognition from point cloud data
1.2.1 Appearance-based methods
1.2.2 Geometry-based methods
1.2.3 Discussions

1.3 Fitting primitive shapes: A brief survey
1.3.1 Linear fitting algorithms
1.3.2 Robust estimation algorithms
1.3.3 RANdom SAmple Consensus (RANSAC) and its variations
1.3.4 Discussions

CHAPTER 2
POINT CLOUD REPRESENTATION AND THE PROPOSED METHOD FOR TABLE PLANE DETECTION

A common situation in the activities of daily living of visually impaired people (VIPs) is to query an object (a coffee cup, a water bottle, and so on) on a flat surface. We assume that such a flat surface could be a table plane in a shared room or in a kitchen. To build a complete aided system supporting VIPs, the queried objects obviously need to be separated from the table plane in the current scene. In a general framework consisting of further steps, such as detection and estimation of the full model of the queried objects, table plane detection can be considered a pre-processing step. Therefore, this chapter is organized as follows: firstly, we introduce a representation of the point clouds, which combines the data collected by the Kinect sensor, in Section 2.1; we then present the proposed method for table plane detection in Section 2.2.

2.1 Point cloud representation

2.1.1 Capturing data by a Microsoft Kinect sensor

We collect data from the environment in order to build an aided system that helps the VIPs detect and grasp objects with simple geometrical structures on a table in an indoor environment. The color image and the depth image are captured from a MS Kinect sensor version 1.

2.1.2 Point cloud representation

The result of calibrating the images is the camera's intrinsic matrix H_m for projecting pixels from 2-D space to 3-D space, as follows:

    H_m = | f_x  0    c_x |
          | 0    f_y  c_y |
          | 0    0    1   |

where (c_x, c_y) is the principal point (usually the image center), and f_x and f_y are the focal lengths.

2.2 The proposed method for table plane detection

2.2.1 Introduction

Plane detection in 3-D point clouds is a critical task for many robotics and computer vision applications. In order to help visually impaired/blind people find and grasp interesting objects (e.g., a coffee cup, a bottle, a bowl) on the table, one has to find the table planes in the captured scenes. This work is motivated by such an adaptation, in which the acceleration data provided by the MS Kinect sensor are used to prune the extraction results. The proposed algorithm achieves real-time performance as well as a high detection rate of the table planes.

2.2.2 Related Work

2.2.3 The proposed method

2.2.3.1 The proposed framework

Our research context aims to develop object-finding and grasping-aided services for the VIPs. The proposed framework, as shown in Fig. 2.6, consists of four steps: down-sampling, organized point cloud representation, plane segmentation, and table plane classification. Because our work utilizes only the depth feature, a simple and effective method for down-sampling and smoothing the depth data is described below.

Figure 2.6: The proposed framework for table plane detection. The depth data from the Microsoft Kinect are down-sampled and represented as an organized point cloud; planes are then segmented and classified, using the acceleration vector, to find the table plane.

Given a sliding window (of size n × n pixels), the depth value of the center pixel D(x_c, y_c) is computed by Eq. 2.2:

    D(x_c, y_c) = (1/N) · Σ_{i=1}^{N} D(x_i, y_i)    (2.2)

where D(x_i, y_i) is the depth value of the i-th neighboring pixel of the center pixel (x_c, y_c), and N is the number of pixels in the n × n neighborhood (N = n × n − 1).
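To make these steps concrete, the following is a minimal sketch (our own illustration, not the dissertation's published code) of the pixel-to-3-D back-projection implied by the intrinsic matrix H_m of Sec. 2.1.2, and of the neighborhood averaging of Eq. 2.2. The Kinect v1 intrinsic values and the border handling are assumptions.

```python
import numpy as np

# Assumed (illustrative) Kinect v1 intrinsics; real values come from calibration.
FX, FY = 525.0, 525.0   # focal lengths f_x, f_y
CX, CY = 319.5, 239.5   # principal point (c_x, c_y), roughly the image center

def pixel_to_point(u, v, z):
    """Back-project pixel (u, v) with depth z (meters) into a 3-D camera-space point."""
    return np.array([(u - CX) * z / FX, (v - CY) * z / FY, z])

def smooth_depth(depth, n=3):
    """Eq. 2.2: replace each center pixel by the mean of its N = n*n - 1 neighbors.

    Border pixels are left unchanged here; that choice is ours, not the thesis's.
    """
    h, w = depth.shape
    r = n // 2
    out = depth.astype(np.float64).copy()
    for yc in range(r, h - r):
        for xc in range(r, w - r):
            window = depth[yc - r:yc + r + 1, xc - r:xc + r + 1].astype(np.float64)
            out[yc, xc] = (window.sum() - float(depth[yc, xc])) / (n * n - 1)
    return out
```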
2.2.3.2 Plane segmentation

The detailed process of the plane segmentation is given in (Holz et al., RoboCup, 2011).

2.2.3.3 Table plane detection/extraction

The results of the first step are the planes that are perpendicular to the acceleration vector. The y axis is then rotated so that it is parallel with the acceleration vector; the table plane is therefore the highest plane in the scene, i.e., the plane with the minimum y-value.
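A compact sketch of this classification step follows (our own illustration; the plane representation as a centroid-normal pair and the angular tolerance are assumptions):

```python
import numpy as np

def select_table_plane(planes, accel, angle_tol_deg=10.0):
    """Pick the table plane: among the planes whose normals are parallel to the
    acceleration (gravity) vector, return the highest one, i.e. the plane with
    the minimum y-value once the y axis is aligned with gravity.

    planes: list of (centroid, normal) pairs, each a 3-vector.
    accel:  acceleration vector from the Kinect, assumed to point downward.
    """
    g = accel / np.linalg.norm(accel)
    candidates = []
    for centroid, normal in planes:
        n = normal / np.linalg.norm(normal)
        angle = np.degrees(np.arccos(np.clip(abs(np.dot(n, g)), 0.0, 1.0)))
        if angle < angle_tol_deg:        # plane is horizontal w.r.t. gravity
            candidates.append(centroid)
    if not candidates:
        return None
    return min(candidates, key=lambda c: c[1])   # minimum y-value = highest plane
```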
2.2.4 Experimental results

2.2.4.1 Experimental setup and dataset collection

The first dataset is called 'MICA3D': a Microsoft Kinect version 1 is mounted on a person's chest, and the person moves around a table in the room. The distance between the Kinect and the center of the table is about 1.5 m; the height of the Kinect above the table plane is about 0.6 m; the height of the table plane itself is about 60-80 cm. We captured data of 10 different scenes, including a cafeteria, a showroom, a kitchen, and so on. These scenes cover common contexts in the daily activities of visually impaired people. The second dataset was introduced by (Richtsfeld et al., IROS, 2012). It contains calibrated RGB-D data of 111 scenes, each containing a table plane; the image size is 640x480 pixels.

2.2.4.2 Table plane detection evaluation method

Three evaluation measures are needed; they are defined as follows.

Evaluation measure 1 (EM1): This measure evaluates the difference between the normal vector extracted from the detected table plane and the normal vector extracted from the ground-truth data.

Evaluation measure 2 (EM2): With EM1, only one point (the center point of the ground truth) is used to estimate the angle. To reduce the influence of noise, more points are used for determining the normal vector of the ground truth: for EM2, 3 points (p1, p2, p3) are randomly selected from the ground-truth point cloud.

Evaluation measure 3 (EM3): The two evaluation measures presented above do not take into account the area of the detected table plane. Therefore, EM3 is proposed, inspired by the Jaccard index for object detection:

    r = (R_d ∩ R_g) / (R_d ∪ R_g)    (2.6)

2.2.4.3 Results

The comparative results of the three evaluation measures on the two datasets are shown in Tab. 2.2 and Tab. 2.3, respectively.

Table 2.2: The average result of detected table planes on our own dataset (%).

Approach          EM1     EM2     EM3     Average   Missing rate   Frames per second
First Method      87.43   87.26   71.77   82.15     1.2            0.2
Second Method     98.29   98.25   96.02   97.52     0.63           0.83
Proposed Method   96.65   96.78   97.73   97.0      0.81           5

Table 2.3: The average result of detected table planes on the dataset [3] (%).

Approach          EM1     EM2     EM3     Average   Missing rate   Frames per second
First Method      87.39   68.47   98.19   84.68     0.0            1.19
Second Method     87.39   68.47   95.49   83.78     0.0            0.98
Proposed Method   87.39   68.47   99.09   84.99     0.0            5.43

2.2.5 Discussions

In this work, we proposed a method for table plane detection using down-sampling, accelerometer data, and the organized point cloud structure obtained from the color and depth images of the MS Kinect sensor.

2.3 Separating the interested objects on the table plane
2.3.1 Coordinate system transformation
2.3.2 Separating the table plane and the interested objects
2.3.3 Discussions

CHAPTER 3
PRIMITIVE SHAPE ESTIMATION BY A NEW ROBUST ESTIMATOR USING GEOMETRICAL CONSTRAINTS

3.1 Fitting primitive shapes by GCSAC

3.1.1 Introduction

The geometrical model of an interested object can be estimated using from two to seven geometrical parameters, as in (Schnabel et al., 2007). RANdom SAmple Consensus (RANSAC) and its variants attempt to extract shape parameters that are as good as possible from data subject to heavy noise, under processing-time constraints. In particular, at each hypothesis of a RANSAC-based algorithm, a searching process is implemented that aims at finding good samples based on the constraints of the estimated model. To search for good samples, we define two criteria: (1) the selected samples must be consistent with the estimated model, verified via a rough inlier-ratio evaluation; (2) the samples must satisfy explicit geometrical constraints of the interested objects (e.g., cylindrical constraints).

3.1.2 Related work

3.1.3 The proposed new robust estimator

3.1.3.1 Overview of the proposed robust estimator (GCSAC)

To estimate the parameters of a 3-D primitive shape, the original RANSAC paradigm, shown in the top panel of Figure 3.2, randomly selects a Minimal Sample Subset (MSS) from a point cloud; model parameters are then estimated and validated. The algorithm is often computationally infeasible, and it is unnecessary to try every possible sample. Our proposed method (GCSAC, in the bottom panel of Figure 3.2) is based on the original version of RANSAC but differs in three major aspects. (1) At each iteration, the minimal sample set is constructed when the random sampling procedure is performed, so that probing the consensus data is easily achievable; in other words, a low pre-defined inlier threshold can be deployed as a weak condition of consistency, and after only a few random sampling iterations, candidates for good samples can be obtained. (2) The minimal sample sets consist of qualified samples that satisfy the geometrical constraints of the interested object. (3) The termination condition of the adaptive RANSAC algorithm of (Hartley et al., 2003) is adopted, so that the algorithm terminates as soon as a minimal sample set is found for which the required number of iterations is less than the number already performed.

Figure 3.2: Top panel: Overview of a RANSAC-based algorithm (random sampling of a minimal subset; geometrical parameter estimation; model evaluation via the inlier ratio or the negative log-likelihood; adaptive update of the number of iterations K by Eq. 3.2). Bottom panel: A diagram of the GCSAC implementation, which inserts a search for good samples using geometrical constraints whenever the roughly estimated inlier ratio w reaches a threshold w_t, and evaluates models via the negative log-likelihood as in MLESAC.
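The bottom panel of Fig. 3.2 can be summarized in the following high-level sketch. The helpers (estimate_model, inlier_ratio, neg_log_likelihood, search_good_samples) are placeholders for the steps named in the diagram, not routines from the dissertation, and the adaptive update of K follows Eq. 3.2 given below.

```python
import math
import random

def gcsac(points, s, w_t, p=0.99, k_max=1000):
    """Sketch of the GCSAC loop: weak consistency check, then constraint-guided
    re-sampling, then MLESAC-style evaluation with an adaptive iteration count."""
    best_model, best_cost = None, float("inf")
    k, K = 0, k_max
    while k < K:
        mss = random.sample(points, s)              # random minimal sample set
        model = estimate_model(mss)
        w = inlier_ratio(model, points)
        if w >= w_t:                                # weak condition of consistency
            mss = search_good_samples(points, mss)  # apply geometrical constraints
            model = estimate_model(mss)
        cost = neg_log_likelihood(model, points)    # MLESAC-style model evaluation
        if cost < best_cost:
            best_model, best_cost = model, cost
            w = inlier_ratio(best_model, points)
            if 0.0 < w < 1.0:                       # adaptive number of iterations
                K = min(k_max, math.log(1 - p) / math.log(1 - w ** s))  # Eq. 3.2
        k += 1
    return best_model
```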
To determine the termination criterion of the estimation algorithm, the well-known calculation of the number of sample selections K is given by Eq. 3.2:

    K = log(1 − p) / log(1 − w^s)    (3.2)

where p is the probability of finding a model describing the data, s is the minimal number of samples needed to estimate a model, and w is the percentage of inliers in the point cloud.

Figure 3.3: Geometrical parameters of a cylindrical object. (a)-(c) Explanation of the geometrical analysis to estimate a cylindrical object. (d)-(e) Illustration of the geometrical constraints applied in GCSAC. (f) Result of the estimated cylinder from a point cloud; blue points are outliers, red points are inliers.

3.1.3.2 Geometrical analyses and constraints for qualifying good samples

In the following sections, the principles of the 3-D primitive shapes are explained; based on the geometrical analysis, related constraints are given to select good samples. The normal vector of any point is computed following the approach in (Holz et al., 2011): at each point p_i, the k-nearest neighbors of p_i are determined within a radius r, and the normal vector of p_i then reduces to an analysis of the eigenvectors and eigenvalues of the covariance matrix C, as presented in Sec. 2.2.3.2.

a. Geometrical analysis for cylindrical objects

The geometrical relationships of the above parameters are shown in Fig. 3.3(a). A cylinder can be estimated from two points (p1, p2) (the two blue-squared points) and their corresponding normal vectors (n1, n2) (marked by the green and yellow lines). Let γ_c be the main axis of the cylinder (the red line), which is estimated by:

    γ_c = n1 × n2    (3.3)

To specify a centroid point I, we project the two parametric lines L1 = p1 + t·n1 and L2 = p2 + t·n2 onto the plane PlaneY (see Figure 3.3(b)), whose normal vector is estimated by the cross product γ_c × n1. The centroid point I is the intersection of L1 and L2 (see Figure 3.3(c)), and the radius Ra is set to the distance between I and p1 in PlaneY. A result of the estimated cylinder from a point cloud is illustrated in Figure 3.3(f); the height of the estimated cylinder is normalized to 1.

Figure 3.4: (a) Setting the geometrical parameters for estimating a cylindrical object from a point cloud as described above. (b) The estimated cylinder (green) from an inlier p1 and an outlier p2; as shown, it is an incorrect estimation. (c) The normal vectors n1 and n*2 on the plane π.

We first build a plane π that is perpendicular to PlaneY and contains n1; its normal vector is therefore n_π = n_PlaneY × n1, where n_PlaneY is the normal vector of PlaneY, as shown in Figure 3.4(a). In other words, n1 is nearly perpendicular to n*2, where n*2 is the projection of n2 onto the plane π. This observation leads to the criterion below, whose computation is sketched after the equation:

    c_p = argmin_{p2 ∈ U_n \ {p1}} (n1 · n*2)    (3.4)
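As a worked sketch of Eqs. 3.3 and 3.4 (our own illustration, assuming unit-length normals; candidates stands in for the neighborhood U_n, and the absolute value in the criterion is our symmetric reading of the dot product):

```python
import numpy as np

def cylinder_axis(n1, n2):
    """Eq. 3.3: the main axis of the cylinder is the cross product of the two
    point normals (undefined when the normals are parallel)."""
    axis = np.cross(n1, n2)
    norm = np.linalg.norm(axis)
    return axis / norm if norm > 1e-9 else None

def pick_second_sample(p1, n1, candidates):
    """Eq. 3.4: choose the candidate p2 whose normal, projected onto the plane
    pi (which contains n1 and is perpendicular to PlaneY), is closest to
    perpendicular to n1."""
    best, best_dot = None, float("inf")
    for p2, n2 in candidates:
        axis = cylinder_axis(n1, n2)
        if axis is None:
            continue
        n_plane_y = np.cross(axis, n1)           # normal of PlaneY (gamma_c x n1)
        n_pi = np.cross(n_plane_y, n1)           # normal of the plane pi
        n_pi /= np.linalg.norm(n_pi)
        n2_proj = n2 - np.dot(n2, n_pi) * n_pi   # project n2 onto the plane pi
        norm = np.linalg.norm(n2_proj)
        if norm < 1e-9:
            continue
        d = abs(np.dot(n1, n2_proj / norm))      # small value = nearly perpendicular
        if d < best_dot:
            best, best_dot = (p2, n2), d
    return best
```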
3.1.4 Experimental results of the robust estimator

3.1.4.1 Datasets for evaluation of the robust estimator

The first group consists of synthesized datasets of cylinders, spheres, and cones. In addition, we evaluate the proposed method on real datasets. For the cylindrical objects, the dataset is collected from a public dataset [1] that contains 300 objects belonging to 51 categories; it is named 'second cylinder'. For the spherical objects, the dataset consists of two balls collected from four real scenes. Finally, the point cloud data of the cone objects, named 'second cone', is collected from the dataset given in [4].

3.1.4.2 Evaluation measurements of the robust estimator

To evaluate the performance of the proposed method, we use the following measurements (a computational sketch follows the list):

- The relative error E_w of the estimated inlier ratio, where w_gt is the defined ground-truth inlier ratio and w is the inlier ratio of the estimated model. The smaller E_w is, the better the algorithm is.

- The total distance error S_d, calculated as the summation of the distances from every point p_j to the estimated model M_e.

- The processing time t_p, measured in milliseconds (ms). The smaller t_p is, the faster the algorithm is.

- The relative error of the estimated center (only for the synthesized datasets) E_d, the Euclidean distance between the estimated center E_e and the true one E_t.
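A brief sketch of how the first two measurements can be computed. The exact normalization of E_w is not spelled out in this summary, so the percentage form below, and the distance_to_model placeholder, are assumptions.

```python
def relative_inlier_error(w_est, w_gt):
    """E_w: relative error between the estimated and ground-truth inlier ratios,
    in percent (assumed normalization by the ground-truth ratio w_gt)."""
    return abs(w_est - w_gt) / w_gt * 100.0

def total_distance_error(points, model, distance_to_model):
    """S_d: summation of the distances from every point p_j to the estimated
    model M_e; distance_to_model is the point-to-primitive distance function."""
    return float(sum(distance_to_model(p, model) for p in points))
```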
3.1.4.3 Evaluation results of the new robust estimator

The performance of each method on the synthesized datasets is reported in Tab. 3.2. For the real datasets, the experimental results are reported in Tab. 3.3 for the cylindrical objects, and Table 3.4 reports the fitting results for the spherical and cone datasets.

Table 3.2: The average evaluation results on the synthesized datasets. The experiments were repeated 50 times for statistically representative results.

Dataset            Measure     RANSAC    PROSAC    MLESAC    MSAC      LOSAC     NAPSAC     GCSAC
'first cylinder'   Ew (%)      23.59     28.62     43.13     10.92     9.95      61.27      8.49
                   Sd          1528.71   1562.42   1568.81   1527.93   1536.47   3168.17    1495.33
                   tp (ms)     89.54     52.71     70.94     90.84     536.84    52.03      41.35
                   Ed (cm)     0.05      0.06      0.17      0.04      0.05      0.93       0.03
                   EA (deg.)   3.12      4.02      5.87      2.81      2.84      7.02       2.24
                   Er (%)      1.54      2.33      7.54      1.02      2.40      112.06     0.69
'first sphere'     Ew (%)      23.01     31.53     85.65     33.43     23.63     57.76      19.44
                   Sd          3801.95   3803.62   3774.77   3804.27   3558.06   3904.22    3452.88
                   tp (ms)     10.68     23.45     1728.21   9.46      31.57     2.96       6.48
                   Ed (cm)     0.05      0.07      1.71      0.08      0.21      0.97       0.05
                   Er (%)      2.92      4.12      203.60    5.15      17.52     63.60      2.61
'first cone'       Ew (%)      24.89     37.86     68.32     40.74     30.11     86.15      24.40
                   Sd          2361.79   2523.68   2383.01   2388.64   2298.03   13730.53   2223.14
                   tp (ms)     495.26    242.26    52525     227.57    1258.07   206.17     188.4
                   EA (deg.)   6.48      15.64     11.67     15.64     6.79      14.54      4.77
                   Er (%)      20.47     17.65     429.44    17.31     20.22     54.44      17.21

Table 3.3: Experimental results on the 'second cylinder' dataset. The experiments were repeated 20 times, then the errors were averaged.

Dataset                          Method    w (%)    Sd        tp (ms)   Er (%)
'second cylinder' (coffee mug)   MLESAC    9.94     3269.77   110.28    9.93
                                 GCSAC     13.83    2807.40   33.44     7.00
'second cylinder' (food can)     MLESAC    19.05    1231.16   479.74    19.58
                                 GCSAC     21.41    1015.38   119.46    13.48
'second cylinder' (food cup)     MLESAC    15.04    1211.91   101.61    21.89
                                 GCSAC     18.8     1035.19   14.43     17.87
'second cylinder' (soda can)     MLESAC    13.54    1238.96   620.62    29.63
                                 GCSAC     20.6     1004.27   16.25     27.7

Table 3.4: The average evaluation results on the 'second sphere' and 'second cone' datasets. The experiments were repeated 20 times for statistically representative results.

Dataset           Measure     RANSAC   PROSAC   MLESAC   MSAC     LOSAC   NAPSAC    GCSAC
'second sphere'   w (%)       99.77    99.98    99.83    99.80    99.78   98.20     100.00
                  Sd          29.60    26.62    29.38    29.37    28.77   35.55     11.31
                  tp (ms)     3.44     3.43     4.17     2.97     7.82    4.11      2.93
                  Er (%)      30.56    26.55    30.36    30.38    31.05   33.72     14.08
'second cone'     w (%)       79.52    71.89    75.45    71.89    80.21   38.79     82.27
                  Sd          126.56   156.40   147.00   143.00   96.37   1043.34   116.09
                  tp (ms)     10.94    7.42     13.05    9.65     96.37   25.39     7.14
                  EA (deg.)   38.11    40.35    35.62    25.39    29.42   52.64     23.74
                  Er (%)      77.52    77.09    74.84    75.10    71.66   76.06     68.84

3.1.5 Discussions

In this work, we have proposed GCSAC, a new RANSAC-based robust estimator for fitting primitive shapes to point clouds. The key idea of GCSAC is to combine consistency with the estimated model, checked via a rough inlier-ratio evaluation, with the geometrical constraints of the interested shapes; this strategy aims to select good samples for the model estimation. The proposed method was examined on primitive shapes such as cylinders, spheres, and cones, using both synthesized and real datasets. The results of the GCSAC algorithm were compared with those of various RANSAC-based algorithms and confirm that GCSAC works well even on point clouds with low inlier ratios. In the future, we will continue to validate GCSAC on other geometrical structures and evaluate the proposed method in real scenarios for detecting multiple objects.

3.2 Fitting objects using the context and geometrical constraints

3.2.1 Finding objects using the context and geometrical constraints

Let us consider a real scenario in the common daily activities of visually impaired people: they come to a cafeteria and give a query, "Where is a coffee cup?", as shown in Fig. 1.

3.2.2 The proposed method of finding objects using the context and geometrical constraints

We work in the context of developing object-finding-aided systems for the VIPs (as shown in Fig. 1).

3.2.2.1 Model verification using contextual constraints

3.2.3 Experimental results of finding objects using the context and geometrical constraints

3.2.3.1 Descriptions of the datasets for evaluation

The first dataset is constructed from a public one used in [3].

3.2.3.2 Evaluation measurements

3.2.3.3 Results of finding objects using the context and geometrical constraints

Table 3.5 compares the performances of the proposed method GCSAC and MLESAC.

Table 3.5: Average results of the evaluation measurements using GCSAC and MLESAC on the three datasets, without the context's constraint. The fitting procedures were repeated 50 times for statistical evaluations.

Dataset          Method    Ea (deg.)   Er (%)   tp (ms)
First dataset    MLESAC    46.47       92.85    18.10
                 GCSAC     36.17       81.01    13.51
Second dataset   MLESAC    47.56       50.78    25.89
                 GCSAC     40.68       38.29    18.38
Third dataset    MLESAC    45.32       48.48    22.75
                 GCSAC     43.06       46.9     17.14

3.2.4 Discussions

CHAPTER 4
DETECTING AND ESTIMATING THE FULL MODEL OF 3-D OBJECTS AND DEPLOYING THE APPLICATION

4.1 3-D object detection

4.1.1 Introduction

The interested objects are placed on the table plane and have simple geometrical structures (e.g., coffee mugs, jars, bottles, and soda cans are cylindrical; soccer balls are spherical). Our method exploits YOLO [2] for object detection in the RGB images, as it is a state-of-the-art detector with very high detection performance. The detected objects are then projected into the point cloud data (3-D data) to generate the full object models for grasping and describing the objects.
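A minimal sketch, under our assumptions, of this 2-D-to-3-D hand-off: the pixels inside a detected box are gathered from the organized point cloud, and the resulting points are what GCSAC fits a full model to. The (x, y, w, h) box format and the NaN convention for invalid depth are assumptions, not the dissertation's exact interfaces.

```python
import numpy as np

def crop_cloud(organized_cloud, box):
    """Collect the 3-D points whose pixels fall inside a detected 2-D box.

    organized_cloud: H x W x 3 array of points aligned with the RGB image.
    box: (x, y, w, h) in pixel coordinates, e.g. from a YOLO detection.
    """
    x, y, w, h = box
    patch = organized_cloud[y:y + h, x:x + w].reshape(-1, 3)
    valid = ~np.isnan(patch).any(axis=1)   # drop pixels with invalid depth
    return patch[valid]                    # input point cloud for GCSAC fitting
```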
Table 4.1: The average results of detecting spherical objects in the two stages on the first dataset.

Method   First stage                Second stage               Average processing time
         Recall (%)  Precision (%)  Recall (%)  Precision (%)  tp (s)/scene
PSM      62.23       48.36          60.56       46.68          1.05
CVFGS    56.24       50.38          48.27       42.34          1.2
DLGS     88.24       78.52          76.52       72.29          0.5

4.1.2 Related Work

4.1.3 Three different approaches for 3-D object detection in a complex scene

4.1.3.1 Geometry-based Primitive Shape detection Method (PSM)

This method applies the Primitive Shape detection Method (PSM) of (Schnabel et al.) to the point clouds of the objects.

4.1.3.2 Combination of clustering objects, Viewpoint Feature Histograms, and GCSAC for estimating full 3-D object models (CVFGS)

4.1.3.3 Combination of Deep Learning based detection and GCSAC for estimating full 3-D object models (DLGS)
