Advanced Computer Vision

Advanced Computer Vision Chapter 7 STRUCTURE FROM MOTION Presented by Prof. Chiou-Shann Fuh & Pradnya Borade 0988472377 r99922145@ntu.edu.tw Structure from Motion

Today’s Lecture
• What is structure from motion?
• Triangulation and pose
• Two-frame methods

What Is Structure from Motion?
• A topic in the study of visual perception.
• The process of finding the three-dimensional structure of an object by analyzing local motion signals over time.
• A method for creating 3D models from 2D pictures of an object.

Example
Figure: Picture 1 and Picture 2

Example (cont.)
Figure: 3D model created from the two images

Example
Figure: Structure from motion systems: orthographic factorization

Example
Figure: Line matching

Example
Figure: (a-e) Incremental structure from motion

Example
Figure: 3D reconstruction of Trafalgar Square

Example
Figure: 3D reconstruction of the Great Wall of China

Example
Figure: 3D reconstruction of the Old Town Square, Prague

7.1 Triangulation
• The problem of estimating a point's 3D location when it is seen from multiple cameras is known as triangulation.
• It is the converse of the pose estimation problem.
• Given the projection matrices, 3D points can be computed from their measured image positions in two or more views.

Triangulation (cont.)
• Find the 3D point $p$ that lies closest to all of the 3D rays corresponding to the 2D matching feature locations $\{x_j\}$ observed by cameras $\{P_j = K_j [R_j \mid t_j]\}$.

Triangulation (cont.)
Figure: 3D point triangulation by finding the point $p$ that lies nearest to all of the optical rays

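Triangulation (cont.)
• The rays originate at $c_j$ in a direction $\hat{v}_j = \mathcal{N}(R_j^{-1} K_j^{-1} x_j)$, where $\mathcal{N}$ normalizes a vector to unit length.
• The nearest point to $p$ on this ray, which is denoted as $q_j$, minimizes the distance $\| c_j + d_j \hat{v}_j - p \|^2$, which has a minimum at $d_j = \hat{v}_j \cdot (p - c_j)$.
• Hence, $q_j = c_j + (\hat{v}_j \hat{v}_j^T)(p - c_j) = c_j + (p - c_j)_{\parallel}$.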

Triangulation (cont.)
• An alternative formulation, which is more statistically optimal and can produce better estimates if some of the cameras are closer to the 3D point than others, is to minimize the residual in the measurement equations.

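Triangulation (cont.)
$x_j = \dfrac{p_{00}^{(j)} X + p_{01}^{(j)} Y + p_{02}^{(j)} Z + p_{03}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W}$
$y_j = \dfrac{p_{10}^{(j)} X + p_{11}^{(j)} Y + p_{12}^{(j)} Z + p_{13}^{(j)} W}{p_{20}^{(j)} X + p_{21}^{(j)} Y + p_{22}^{(j)} Z + p_{23}^{(j)} W}$
• $(x_j, y_j)$: the measured 2D feature location.
• $\{p_{00}^{(j)}, \ldots, p_{23}^{(j)}\}$: the known entries in camera matrix $P_j$.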

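Triangulation (cont.)
• The squared distance between $p$ and $q_j$ is $r_j^2 = \| (I - \hat{v}_j \hat{v}_j^T)(p - c_j) \|^2 = \| (p - c_j)_{\perp} \|^2$.
• The optimal value for $p$, which lies closest to all of the rays, can be computed as a regular least squares problem by summing over all the $r_j^2$ and finding the optimal value of $p$: $p = \left[ \sum_j (I - \hat{v}_j \hat{v}_j^T) \right]^{-1} \left[ \sum_j (I - \hat{v}_j \hat{v}_j^T)\, c_j \right]$.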
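
The closest-point-to-all-rays idea can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name `triangulate_rays` and the synthetic camera centers are assumptions, not part of the lecture. It sums the projectors perpendicular to each ray and solves the resulting 3x3 linear system.

```python
import numpy as np

def triangulate_rays(centers, directions):
    """Least-squares 3D point closest to a set of rays (hypothetical helper).

    centers:    (n, 3) camera centers c_j
    directions: (n, 3) ray directions v_j (need not be unit length)
    """
    centers = np.asarray(centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    v = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # One projector (I - v v^T) per ray; it removes the component along the ray.
    proj = np.eye(3)[None, :, :] - v[:, :, None] * v[:, None, :]
    A = proj.sum(axis=0)                       # sum_j (I - v_j v_j^T)
    b = np.einsum("nij,nj->i", proj, centers)  # sum_j (I - v_j v_j^T) c_j
    return np.linalg.solve(A, b)

# Synthetic, noise-free example: three cameras looking at one point.
p_true = np.array([0.5, -0.2, 4.0])
centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
directions = p_true - centers
p_est = triangulate_rays(centers, directions)
```

With noise-free rays the estimate coincides with the true point; with noisy rays it is the point minimizing the summed squared perpendicular distances.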

Triangulation (cont.)
• If we use homogeneous coordinates $p = (X, Y, Z, W)$, the resulting set of equations is homogeneous and is best solved using a singular value decomposition (SVD).
• If we set $W = 1$, we can use regular linear least squares, but the resulting system may be singular or poorly conditioned (e.g., when all of the viewing rays are parallel, as occurs for points far away from the cameras).

Triangulation (cont.)
• For this reason, it is generally preferable to parameterize 3D points using homogeneous coordinates, especially if we know that there are likely to be points at greatly varying distances from the cameras.
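
A minimal sketch of the homogeneous formulation follows, assuming two hypothetical cameras `P0` and `P1` with made-up intrinsics. Each measurement $(x_j, y_j)$ contributes two linear equations in $(X, Y, Z, W)$, obtained by multiplying the projection equations through by their denominator, and the SVD returns the null vector of the stacked system.

```python
import numpy as np

def triangulate_dlt(cameras, points_2d):
    """Homogeneous linear triangulation (hypothetical helper).

    cameras:   list of 3x4 projection matrices P_j
    points_2d: list of measured (x_j, y_j) locations
    Returns (X, Y, Z, W), defined only up to scale.
    """
    rows = []
    for P, (x, y) in zip(cameras, points_2d):
        rows.append(x * P[2] - P[0])  # x * (row 2 . p) - (row 0 . p) = 0
        rows.append(y * P[2] - P[1])  # y * (row 2 . p) - (row 1 . p) = 0
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)
    return vt[-1]  # right singular vector of the smallest singular value

def project(P, p_h):
    x = P @ p_h
    return x[:2] / x[2]

# Assumed intrinsics and a one-unit sideways baseline, for illustration only.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])

p_true = np.array([0.5, -0.2, 4.0, 1.0])
observations = [project(P0, p_true), project(P1, p_true)]
p_h = triangulate_dlt([P0, P1], observations)
p_est = p_h[:3] / p_h[3]  # safe here because the point is not at infinity
```

Keeping $W$ as an unknown means a point at or near infinity simply yields a small $W$ instead of an ill-conditioned system.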

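7.2 Two-Frame Structure from Motion
• So far in 3D reconstruction, we have always assumed that either the 3D point positions or the 3D camera poses are known in advance.
• Structure from motion is the simultaneous recovery of 3D structure and camera pose from image correspondences.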

Two-Frame Structure from Motion (cont.)
Figure: Epipolar geometry: the vectors $t = c_1 - c_0$, $p - c_0$, and $p - c_1$ are coplanar, which gives the basic epipolar constraint expressed in terms of the pixel measurements $x_0$ and $x_1$

Two-Frame Structure from Motion (cont.)
• The figure shows a 3D point $p$ being viewed from two cameras whose relative position can be encoded by a rotation $R$ and a translation $t$.
• Since we do not know anything about the camera positions, we can, without loss of generality, set the first camera at the origin $c_0 = 0$ and at a canonical orientation $R_0 = I$.

Two-Frame Structure from Motion (cont.)
• The observed location of point $p$ in the first image, $p_0 = d_0 \hat{x}_0$, is mapped into the second image by the transformation $d_1 \hat{x}_1 = p_1 = R p_0 + t = R (d_0 \hat{x}_0) + t$.
• $\hat{x}_j = K_j^{-1} x_j$: the ray direction vectors.

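Two-Frame Structure from Motion (cont.)
• Taking the cross product of both sides with $t$, in order to annihilate it on the right-hand side, yields $d_1 [t]_\times \hat{x}_1 = d_0 [t]_\times R \hat{x}_0$.
• Taking the dot product of both sides with $\hat{x}_1$ yields $d_0 \hat{x}_1^T ([t]_\times R) \hat{x}_0 = d_1 \hat{x}_1^T [t]_\times \hat{x}_1 = 0$.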

Two-Frame Structure from Motion (cont.)
• The right-hand side is a triple product with two identical entries, so it is zero.
• We therefore arrive at the basic epipolar constraint $\hat{x}_1^T E \hat{x}_0 = 0$.
• $E = [t]_\times R$: the essential matrix.

Two-Frame Structure from Motion (cont.)
• The essential matrix $E$ maps a point $\hat{x}_0$ in image 0 into a line $l_1 = E \hat{x}_0$ in image 1, since $\hat{x}_1^T l_1 = 0$.
• All such lines must pass through the second epipole $e_1$, which is therefore defined as the left singular vector of $E$ with a 0 singular value or, equivalently, the projection of the vector $t$ into image 1.

Two-Frame Structure from Motion (cont.)
• The transpose of these relationships gives us the epipolar line in the first image as $l_0 = E^T \hat{x}_1$ and $e_0$ as the zero-value right singular vector of $E$.
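
The relationships between $E$, epipolar lines, and epipoles can be checked numerically. The sketch below assumes an arbitrary relative pose (a small rotation about the y axis and a made-up translation); the helper `skew` and all numeric values are illustrative only.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]x, so that skew(t) @ v equals np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical relative pose of camera 1 with respect to camera 0.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
E = skew(t) @ R  # essential matrix

# One 3D point in camera-0 coordinates and its two normalized ray directions.
p0 = np.array([0.3, -0.4, 5.0])
p1 = R @ p0 + t
x0 = p0 / p0[2]
x1 = p1 / p1[2]

residual = x1 @ E @ x0  # epipolar constraint, zero without noise
l1 = E @ x0             # epipolar line in image 1

U, S, Vt = np.linalg.svd(E)
e1 = U[:, 2]  # left singular vector with zero singular value
e0 = Vt[2]    # right singular vector with zero singular value
```

In this noise-free setting, `e1` is parallel to `t`, `e0` is parallel to `R.T @ t`, and the line `l1` passes through `e1`.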

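Two-Frame Structure from Motion (cont.)
• Given the relationship $\hat{x}_1^T E \hat{x}_0 = 0$, how can we use it to recover the camera motion encoded in the essential matrix $E$?
• If we have $N$ corresponding measurements $\{(\hat{x}_{i0}, \hat{x}_{i1})\}$, with $\hat{x}_{i0} = (x_{i0}, y_{i0}, 1)$ and $\hat{x}_{i1} = (x_{i1}, y_{i1}, 1)$, we can form $N$ homogeneous equations in the nine elements of $E = \{e_{00}, \ldots, e_{22}\}$:
$x_{i0} x_{i1} e_{00} + y_{i0} x_{i1} e_{01} + x_{i1} e_{02} + x_{i0} y_{i1} e_{10} + y_{i0} y_{i1} e_{11} + y_{i1} e_{12} + x_{i0} e_{20} + y_{i0} e_{21} + e_{22} = 0$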

Two-Frame Structure from Motion (cont.)
$[\hat{x}_{i1} \hat{x}_{i0}^T] \otimes E = Z_i \otimes E = z_i \cdot f = 0$
• $\otimes$: element-wise multiplication and summation of matrix elements.
• $z_i$ and $f$: the vector forms of the $Z_i = \hat{x}_{i1} \hat{x}_{i0}^T$ and $E$ matrices.
• Given $N \geq 8$ such equations, we can compute an estimate (up to scale) for the entries of $E$ using a singular value decomposition (SVD).
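
The linear estimate of $E$ described above can be sketched as follows. The function name `eight_point`, the synthetic scene, and the chosen pose are assumptions for illustration; the code stacks one vectorized outer product per correspondence and takes the null vector of the stacked system.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def eight_point(x0, x1):
    """Linear estimate of E from N >= 8 correspondences (hypothetical helper).

    x0, x1: (N, 3) arrays of homogeneous ray directions (x, y, 1).
    """
    # Row i is z_i, the row-major vector form of Z_i = x_i1 x_i0^T.
    Z = np.einsum("ni,nj->nij", x1, x0).reshape(len(x0), 9)
    _, _, vt = np.linalg.svd(Z)
    return vt[-1].reshape(3, 3)  # f, reshaped back into a 3x3 matrix

# Synthetic noise-free scene: 12 points in front of both cameras.
rng = np.random.default_rng(0)
angle = 0.2
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.3, -0.2])
E_true = skew(t) @ R

p0 = rng.uniform(-1.0, 1.0, size=(12, 3)) + np.array([0.0, 0.0, 5.0])
p1 = p0 @ R.T + t
x0 = p0 / p0[:, 2:3]
x1 = p1 / p1[:, 2:3]

E_est = eight_point(x0, x1)

# E is only defined up to scale and sign, so compare normalized matrices.
E_est_n = E_est / np.linalg.norm(E_est)
E_true_n = E_true / np.linalg.norm(E_true)
err = min(np.linalg.norm(E_est_n - E_true_n), np.linalg.norm(E_est_n + E_true_n))
```

Because the equations are homogeneous, the recovered matrix matches the true one only after normalizing scale and sign.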

Two-Frame Structure from Motion (cont.)
• In the presence of noisy measurements, how close is this estimate to being statistically optimal?
• In the rows $z_i$ of the linear system, some entries are products of image measurements, such as $x_{i0} y_{i1}$, and others are direct image measurements (or even the constant 1).

Two-Frame Structure from Motion (cont.)
• If the measurements have noise, the terms that are products of measurements have their noise amplified by the other element in the product, which leads to poor scaling.
• To deal with this, it has been suggested that the point coordinates should be translated and scaled so that their centroid lies at the origin and their variance is unity; i.e., $\tilde{x}_i = s (x_i - \mu_x)$, $\tilde{y}_i = s (y_i - \mu_y)$

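Two-Frame Structure from Motion (cont.)
• such that $\sum_i \tilde{x}_i = \sum_i \tilde{y}_i = 0$ and $\sum_i \tilde{x}_i^2 + \sum_i \tilde{y}_i^2 = 2n$, where $n$ = number of points.
• Once the essential matrix $\tilde{E}$ has been computed from the transformed coordinates $\tilde{x}_{i0} = T_0 \hat{x}_{i0}$ and $\tilde{x}_{i1} = T_1 \hat{x}_{i1}$, the original essential matrix $E$ can be recovered as $E = T_1^T \tilde{E} T_0$.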
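
A sketch of the normalization step is given below. The helper names and the random point sets are assumptions, and `E_tilde` is just a random stand-in for a matrix estimated from normalized coordinates; the point is to show the transform $T$ and that denormalizing with $T_1^T \tilde{E} T_0$ preserves the epipolar residuals.

```python
import numpy as np

def normalizing_transform(pts):
    """3x3 similarity T mapping (x, y, 1) to (s(x - mu_x), s(y - mu_y), 1).

    pts: (n, 2) array. The scale s makes sum(x~^2) + sum(y~^2) equal 2n.
    """
    mu = pts.mean(axis=0)
    s = np.sqrt(2.0 * len(pts) / np.sum((pts - mu) ** 2))
    return np.array([[s, 0.0, -s * mu[0]],
                     [0.0, s, -s * mu[1]],
                     [0.0, 0.0, 1.0]])

def to_homogeneous(pts):
    return np.hstack([pts, np.ones((len(pts), 1))])

# Made-up pixel measurements in two images.
rng = np.random.default_rng(0)
pts0 = rng.uniform(0.0, 640.0, size=(20, 2))
pts1 = rng.uniform(0.0, 480.0, size=(20, 2))

T0 = normalizing_transform(pts0)
T1 = normalizing_transform(pts1)
n0 = to_homogeneous(pts0) @ T0.T  # rows are T0 @ x_i0
n1 = to_homogeneous(pts1) @ T1.T  # rows are T1 @ x_i1

# Stand-in for a matrix estimated from the normalized points (assumption).
E_tilde = rng.standard_normal((3, 3))
E = T1.T @ E_tilde @ T0  # matrix acting on the original coordinates

res_normalized = np.einsum("ni,ij,nj->n", n1, E_tilde, n0)
res_original = np.einsum("ni,ij,nj->n", to_homogeneous(pts1), E, to_homogeneous(pts0))
```

The two residual vectors agree, which is the algebraic reason the estimate can be computed in the well-scaled coordinates and then mapped back.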

Two-Frame Structure from Motion (cont.)
• Once the essential matrix has been recovered, the direction $\hat{t}$ of the translation vector $t$ can be estimated.
• The absolute distance between the two cameras can never be recovered from pure image measurements alone.
• In photogrammetry, knowledge about absolute camera and point positions or distances (ground control points) is required to establish the final scale, position, and orientation.

Two-Frame Structure from Motion (cont.)
• To estimate this direction $\hat{t}$, observe that under ideal noise-free conditions, the essential matrix $E$ is singular, i.e., $\hat{t}^T E = 0$.
• This singularity shows up as a singular value of 0 when an SVD of $E$ is performed: $E = [\hat{t}]_\times R = U \Sigma V^T = [u_0 \; u_1 \; \hat{t}] \, \mathrm{diag}(1, 1, 0) \, [v_0 \; v_1 \; v_2]^T$.
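
The scale ambiguity and the singular-value structure of $E$ can be illustrated with a small sketch. The pose values below are made up; the same rotation is combined with a unit translation and with a translation three times as long.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Hypothetical rotation about the z axis and a unit-length translation.
angle = 0.3
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle), np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])
t_unit = np.array([0.6, 0.0, 0.8])

E_a = skew(t_unit) @ R        # baseline of length 1
E_b = skew(3.0 * t_unit) @ R  # same direction, baseline of length 3

U_a, S_a, _ = np.linalg.svd(E_a)
U_b, S_b, _ = np.linalg.svd(E_b)

# The translation direction is the left singular vector with singular value 0.
t_hat_a = U_a[:, 2]
t_hat_b = U_b[:, 2]
```

Both matrices yield the same direction (up to sign), and they differ only by an overall scale factor that a homogeneous estimate of $E$ cannot observe, which is why the baseline length is not recoverable from image measurements alone.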

Pure Translation
Figure: Pure translation camera motion results in visual motion where all the points move towards (or away from) a common focus of expansion (FOE) $\mathbf{e}$. They therefore satisfy the triple product condition $(\mathbf{x}_0, \mathbf{x}_1, \mathbf{e}) = \mathbf{e} \cdot (\mathbf{x}_0 \times \mathbf{x}_1) = 0$.

Pure Translation (cont.)
• Known rotation: the resulting essential matrix $E$ is (in the noise-free case) skew symmetric and can be estimated more directly by setting $e_{ij} = -e_{ji}$ and $e_{ii} = 0$.
• Two points with parallax now suffice to estimate the FOE.

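Pure Translation (cont.)
• A more direct derivation of the FOE estimate can be obtained by minimizing the triple product $\sum_i (\mathbf{x}_{i0}, \mathbf{x}_{i1}, \mathbf{e})^2 = \sum_i ((\mathbf{x}_{i0} \times \mathbf{x}_{i1}) \cdot \mathbf{e})^2$,
• which is equivalent to finding the null space for the set of equations $(y_{i0} - y_{i1}) e_0 + (x_{i1} - x_{i0}) e_1 + (x_{i0} y_{i1} - y_{i0} x_{i1}) e_2 = 0$, where $\mathbf{x}_{i0} = (x_{i0}, y_{i0}, 1)$, $\mathbf{x}_{i1} = (x_{i1}, y_{i1}, 1)$, and $\mathbf{e} = (e_0, e_1, e_2)$.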
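
The null-space formulation of the FOE estimate can be sketched as below, assuming a synthetic scene under pure translation (no rotation). The function name `estimate_foe` and the numbers are illustrative; each row of the system is the cross product of a matched pair of homogeneous points.

```python
import numpy as np

def estimate_foe(x0, x1):
    """FOE as the null vector of the stacked cross products (hypothetical helper).

    x0, x1: (N, 3) homogeneous points (x, y, 1) in the two images.
    """
    # Row i is x_i0 x x_i1 = (y0 - y1, x1 - x0, x0*y1 - y0*x1).
    A = np.cross(x0, x1)
    _, _, vt = np.linalg.svd(A)
    return vt[-1]

# Synthetic pure-translation motion: p1 = p0 + t.
rng = np.random.default_rng(1)
t = np.array([0.2, -0.1, 1.0])
p0 = rng.uniform(-1.0, 1.0, size=(10, 3)) + np.array([0.0, 0.0, 6.0])
p1 = p0 + t
x0 = p0 / p0[:, 2:3]
x1 = p1 / p1[:, 2:3]

e = estimate_foe(x0, x1)
foe = e[:2] / e[2]  # image location of the focus of expansion
```

For this forward-moving camera the FOE lies at the image of the translation direction, `t[:2] / t[2]`.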

Pure Translation (cont.)
• In situations where a large number of points at infinity are available (i.e., when the camera motion is small compared to the distance to the objects), this suggests the following strategy.
• Pick a pair of points to estimate a rotation, hoping that both of the points lie at infinity (very far from the camera).
• Then compute the FOE and check whether the residual error is small and whether the motions towards or away from the epipole (FOE) are all in the same direction.

Pure Rotation
• Pure rotation results in a degenerate estimate of the essential matrix $E$ and of the translation direction.
• If we consider that the rotation matrix is known, the estimates for the FOE will be degenerate, since $\mathbf{x}_{i0} \approx \mathbf{x}_{i1}$, and hence the triple product condition $\mathbf{e} \cdot (\mathbf{x}_{i0} \times \mathbf{x}_{i1}) = 0$ is degenerate.

Pure Rotation (cont.)
• It is therefore prudent, before computing a full essential matrix, to first compute a rotation estimate $R$, potentially with just a small number of points.
• Then compute the residuals after rotating the points, before proceeding with a full $E$ computation.

Projective Reconstruction
• When we try to build a 3D model from photos taken by unknown cameras, we do not know ahead of time the intrinsic calibration parameters associated with the input images.
• Still, we can estimate a two-frame reconstruction, although the true metric structure may not be available.
• $\hat{x}_1^T E \hat{x}_0 = 0$: the basic epipolar constraint.

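Projective Reconstruction (cont.)
• In the uncalibrated case, we do not know the calibration matrices $K_j$, so we cannot use the normalized ray directions $\hat{x}_j = K_j^{-1} x_j$.
• We only have access to the image coordinates $x_j$, so the epipolar constraint becomes: $\hat{x}_1^T E \hat{x}_0 = x_1^T K_1^{-T} E K_0^{-1} x_0 = x_1^T F x_0 = 0$
• Fundamental matrix: $F = K_1^{-T} E K_0^{-1} = [e]_\times \tilde{H}$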

Projective Reconstruction (cont.)
• The left singular vector of $F$ associated with its smallest singular value indicates the epipole $e_1$ in image 1.
• The right singular vector associated with its smallest singular value is $e_0$.
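
The link between the essential and fundamental matrices can be checked numerically. The sketch below assumes made-up calibration matrices `K0` and `K1` and an arbitrary pose; it builds $F$ from $E$, evaluates the epipolar constraint directly on pixel coordinates, and reads the epipole in image 1 off the SVD.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Assumed intrinsics for the two cameras (illustrative values only).
K0 = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
K1 = np.array([[600.0, 0.0, 300.0], [0.0, 600.0, 250.0], [0.0, 0.0, 1.0]])

angle = 0.15
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.1, 0.3])
E = skew(t) @ R

# F = K1^{-T} E K0^{-1}
F = np.linalg.inv(K1).T @ E @ np.linalg.inv(K0)

# One 3D point projected to pixel coordinates in both images.
p0 = np.array([0.2, 0.5, 4.0])
p1 = R @ p0 + t
x0_pix = K0 @ (p0 / p0[2])
x1_pix = K1 @ (p1 / p1[2])
residual = x1_pix @ F @ x0_pix

U, S, Vt = np.linalg.svd(F)
e1 = U[:, 2]                    # epipole in image 1 (homogeneous, unit norm)
e1_expected = K1 @ t            # projection of t into image 1
e1_expected = e1_expected / np.linalg.norm(e1_expected)
```

The epipole recovered from $F$ is the pixel-space image of the translation vector, matching the calibrated case up to the change of coordinates by $K_1$.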

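Projective Reconstruction (cont.)
• To create a projective reconstruction of a scene, we pick any valid homography $\tilde{H}$ that satisfies $F = [e]_\times \tilde{H}$. Writing the SVD of $F$ as $F = U \Sigma V^T$, we get $\tilde{H} = U R_{90^\circ}^T \hat{\Sigma} V^T$, where $R_{90^\circ}$ is a rotation by 90° about the z axis.
• $\hat{\Sigma}$: the singular value matrix with the smallest value replaced by the middle value.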
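
One way to check the factorization $F = [e]_\times \tilde{H}$ numerically is sketched below. The intrinsics and pose are made up, and the sign applied to `e` (the determinant of `U`) is an implementation detail assumed here so that the identity holds exactly rather than up to sign.

```python
import numpy as np

def skew(t):
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

# Build a rank-2 fundamental matrix from an assumed pose and intrinsics.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.2, 0.1])
F = np.linalg.inv(K).T @ skew(t) @ R @ np.linalg.inv(K)

U, S, Vt = np.linalg.svd(F)

# Epipole in image 1; the det(U) factor fixes the sign convention.
e = np.linalg.det(U) * U[:, 2]

R90 = np.array([[0.0, -1.0, 0.0],
                [1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0]])

# Singular values with the smallest one replaced by the middle one.
S_hat = np.diag([S[0], S[1], S[1]])

H = U @ R90.T @ S_hat @ Vt
F_rebuilt = skew(e) @ H
```

`H` is full rank, unlike `F`, and multiplying it by the cross-product matrix of the epipole reproduces `F`.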

Self-calibration
• Self-calibration (auto-calibration) techniques have been developed for converting a projective reconstruction into a metric one, which is equivalent to recovering the unknown calibration matrix $K_j$ associated with each image.
• In the presence of additional information about the scene, different methods can be applied.
• If there are parallel lines in the scene, three or more vanishing points, which are the images of points at infinity, can be used to establish the homography for the plane at infinity, from which focal lengths and rotations can be recovered.

Self-calibration (cont.)
• In the absence of external information, consider all sets of camera matrices $P_j = K_j [R_j \mid t_j]$ projecting world coordinates $p_i = (X_i, Y_i, Z_i, W_i)$ into screen coordinates $x_{ij} \sim P_j p_i$.
• Consider transforming the 3D scene $\{p_i\}$ through an arbitrary $4 \times 4$ projective transformation $\tilde{H}$, yielding a new model consisting of points $p_i' = \tilde{H} p_i$.
• Post-multiplying each $P_j$ matrix by $\tilde{H}^{-1}$ still produces the same screen coordinates, and a new set of calibration matrices can be computed by applying RQ decomposition to the new camera matrix $P_j' = P_j \tilde{H}^{-1}$.
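
The projective ambiguity can be demonstrated with a short sketch. The camera, the point, and the transformation `H` below are made-up values, and `rq3` is a hypothetical helper that builds an RQ decomposition from NumPy's QR routine (a library RQ routine could be used instead).

```python
import numpy as np

def rq3(M):
    """RQ decomposition M = K @ Q of a 3x3 matrix (hypothetical helper).

    K is upper triangular with a positive diagonal, Q is orthogonal.
    """
    flip = np.eye(3)[::-1]  # reverses row (or column) order
    q_t, r_t = np.linalg.qr((flip @ M).T)
    K = flip @ r_t.T @ flip
    Q = flip @ q_t.T
    signs = np.sign(np.diag(K))
    return K * signs, signs[:, None] * Q  # make the diagonal of K positive

# Assumed calibrated camera P = K [R | t].
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
angle = 0.2
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle), np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, 0.1, 0.2])
P = K @ np.hstack([R, t[:, None]])
p = np.array([0.4, -0.3, 6.0, 1.0])

# An arbitrary invertible 4x4 projective transformation (illustrative).
H = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.1, -0.2, 0.05, 1.0]])

P_new = P @ np.linalg.inv(H)  # transformed camera
p_new = H @ p                 # transformed point

x = P @ p
x_new = P_new @ p_new
pix = x[:2] / x[2]
pix_new = x_new[:2] / x_new[2]

K_rec, R_rec = rq3(P[:, :3])      # recovers the original calibration
K_new, _ = rq3(P_new[:, :3])      # calibration implied by the new camera
```

Both models reproject to the same pixel, yet the calibration matrix extracted from the transformed camera differs from the original one, which is why extra constraints are needed to single out the metric reconstruction.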

Self-calibration (cont.)
• There is a technique that can recover the focal lengths $(f_0, f_1)$ of both images from the fundamental matrix $F$ in a two-frame reconstruction.
• Assume that the camera has zero skew, a known aspect ratio, and a known optical center.
• Most cameras have square pixels and an optical center near the middle of the image, and are much more likely to deviate from the simple camera model because of radial distortion.
• Problems occur when images have been cropped off-center.

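Self-calibration (cont.)
• Take the left and right singular vectors $\{u_0, u_1, v_0, v_1\}$ of the fundamental matrix $F$ and their associated singular values $\{\sigma_0, \sigma_1\}$ and form the equation: $\dfrac{u_1^T D_0 u_1}{\sigma_0^2 \, v_0^T D_1 v_0} = -\dfrac{u_0^T D_0 u_1}{\sigma_0 \sigma_1 \, v_0^T D_1 v_1} = \dfrac{u_0^T D_0 u_0}{\sigma_1^2 \, v_1^T D_1 v_1}$
• The two matrices are $D_j = K_j K_j^T = \mathrm{diag}(f_j^2, f_j^2, 1)$.

Self-calibration (cont.)
• The matrices $D_j$ encode the unknown focal lengths.
• Write the numerators and denominators as: $e_{ij0}(f_0^2) = u_i^T D_0 u_j = a_{ij} + b_{ij} f_0^2$ and $e_{ij1}(f_1^2) = \sigma_i \sigma_j \, v_i^T D_1 v_j = c_{ij} + d_{ij} f_1^2$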
