Interactive Media Systems, TU Wien

Evaluation and Design of Energy Functions for Global Stereo Matching

Research project in the area of Image and Video Analysis & Synthesis.

Keywords: Stereo, 3D Reconstruction, Computer Vision, Scene Modeling, Energy Function, Optimization.

About this Project

This project investigates and improves the modelling component of energy minimization techniques for stereo matching. One major contribution is a competitive performance evaluation among energy functions that have been proposed in the literature. In the second phase of the project, we use the knowledge gained in our evaluation study to develop novel energy functions. These energy functions are designed to deliver high-quality disparity maps that improve over the current state-of-the-art. These high-quality disparity maps are vital for a variety of applications, ranging from quality assurance, robotics and virtual reality to modern applications in the entertainment industry such as novel view synthesis.

Funding provided by

Austrian Science Fund (FWF) under contract no. P-19797.

Additional Information

This work was funded by the Austrian Science Fund (FWF) under project P19797-N13.



The project "Evaluation and Design of Energy Functions for Global Stereo Matching" focuses on one of the most challenging research topics in computer vision, namely stereo vision. Stereo vision approaches imitate the ability of the human visual system to infer depth from the surrounding environment using two eyes. In analogy to human depth perception, two slightly displaced cameras record the same scene. Roughly speaking, the left and right images are then “overlaid” to determine depth information. This basic principle is illustrated in Figure 1. The problem of “overlaying” the two input images to obtain the result of Figure 1(b) is known as the stereo matching problem.

Figure 1: The principle of 3D reconstruction via stereo vision. (a) Left and right views of a stereo pair. Due to the different viewpoints, corresponding points in the two images are displaced in the horizontal direction. (b) Amount of horizontal pixel displacement between the input views. (Large displacements are represented by bright pixels.) The amount of displacement is inversely proportional to a pixel’s depth in the scene. The image in (b) is therefore sufficient for generating the 3D scene reconstruction of (c).
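The inverse relationship between displacement and depth stated in the caption can be made concrete for a rectified camera pair. The following is a minimal sketch, assuming a pinhole camera model with the focal length given in pixels and the camera baseline in metres. The function name and the numeric values are illustrative assumptions and are not taken from the project.

```python
import numpy as np

def depth_from_disparity(disparity, focal_length_px, baseline_m):
    """Convert disparities (pixels) to depths (metres) via Z = f * B / d.

    Assumes a rectified stereo pair; zero disparity maps to infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > 0
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth

# Illustrative values only: doubling the disparity halves the depth.
example_depth = depth_from_disparity([10.0, 20.0, 0.0],
                                     focal_length_px=500.0,
                                     baseline_m=0.1)
```

Under these assumed values, a displacement of 10 pixels corresponds to a depth of 5 m and a displacement of 20 pixels to 2.5 m, which is why bright (large-displacement) pixels in Figure 1(b) are close to the cameras.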

Being able to solve the stereo matching problem is important in two respects. Firstly, it sheds light on how human depth perception might work. Secondly, there are numerous applications in computer vision. For example, automated 3D visualisation of terrain and cities has recently gained popularity. In this context, stereo-derived 3D models can be integrated into Google Earth to allow the user to take a walk in a 3D computer reconstruction of Vienna. In medical imaging, 3D reconstructions of organs created from multiple 2D (MRI) images can aid in diagnosis. Apart from visualisation, stereo reconstructions can be applied to robot navigation (autonomously driving cars), but also to assist handicapped (blind) people in navigating their environment. Other applications include 3D tracking (surveillance, pose estimation, human-computer interaction), depth segmentation (z-keying), industrial applications (quality assurance) and novel view generation (free viewpoint video), to name just a few. Basically, whenever one needs to infer geometric information from the surrounding world, stereo vision represents a low-cost and non-intrusive alternative to active devices, such as range finders.

During the last few years, stereo matching has advanced significantly with the introduction of new optimization algorithms. Energy minimization methods based on these optimization schemes currently show the best performance in stereo computation. However, while a lot of research effort has been put into the optimization problem of the energy minimization approach, the fact that the energy functions under consideration might represent an unsatisfactory model for the stereo problem has often been ignored. In the proposed project, we aim to advance the state of the art in stereo vision by investigating and improving the modelling component of energy minimization techniques. The project consists of two parts, namely an evaluation part and a design part.
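To illustrate what is meant by an energy function, the sketch below evaluates a commonly used form, a per-pixel data term plus a weighted smoothness term over neighbouring pixels. This is an assumed textbook formulation with a truncated linear smoothness penalty, not one of the specific energy functions evaluated or designed in the project. The names stereo_energy, smoothness_weight and truncation are hypothetical.

```python
import numpy as np

def stereo_energy(disparity, cost_volume, smoothness_weight=1.0, truncation=2):
    """Evaluate E(d) = sum_p C(p, d_p) + lambda * sum_(p,q) min(|d_p - d_q|, T).

    disparity:   (H, W) integer disparity map.
    cost_volume: (H, W, D) matching cost of each pixel at each disparity.
    The smoothness sum runs over horizontally and vertically adjacent pixels.
    """
    disparity = np.asarray(disparity, dtype=np.int64)
    cost_volume = np.asarray(cost_volume, dtype=np.float64)
    rows, cols = np.indices(disparity.shape)
    # Data term: cost of the disparity assigned to each pixel.
    data_term = cost_volume[rows, cols, disparity].sum()
    # Smoothness term: truncated absolute difference between neighbours.
    horizontal = np.abs(np.diff(disparity, axis=1))
    vertical = np.abs(np.diff(disparity, axis=0))
    smoothness_term = (np.minimum(horizontal, truncation).sum()
                       + np.minimum(vertical, truncation).sum())
    return float(data_term + smoothness_weight * smoothness_term)
```

An optimization algorithm searches for the disparity map that minimizes such a function. The modelling question addressed by the project is whether the minimum of the function actually corresponds to the correct disparity map.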


In the first part, a benchmark test among existing energy functions is performed. We have started our benchmark by focusing on the colour component of modern stereo algorithms [6]. This is motivated by the observation that many stereo researchers still simply convert the stereo pairs to grey-scale images, although colour is typically available. Because it has been unclear whether colour has a positive effect when using global methods, the colour information is often discarded deliberately. Therefore, we have concentrated on the question: "Does colour help to improve the performance of modern stereo matching approaches?". To answer this question, we have tested approximately 20 different colour-based energy functions on 30 ground truth image pairs. We have found a relatively large improvement when using colour in stereo matching. According to our benchmark, the best-performing colour space gives 25% fewer disparity errors than grey-scale matching.
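A small example shows why discarding colour can lose information. The sketch below compares a grey-scale matching cost with a colour matching cost for a single pixel pair. It assumes grey-scale conversion by averaging the RGB channels and an absolute-difference cost. These choices are illustrative assumptions and are not the colour spaces or energy functions tested in [6].

```python
import numpy as np

def grey_cost(left_rgb, right_rgb):
    """Absolute difference after converting both pixels to grey (channel mean)."""
    left_grey = np.mean(np.asarray(left_rgb, dtype=np.float64), axis=-1)
    right_grey = np.mean(np.asarray(right_rgb, dtype=np.float64), axis=-1)
    return np.abs(left_grey - right_grey)

def colour_cost(left_rgb, right_rgb):
    """Sum of absolute differences over the three colour channels."""
    left = np.asarray(left_rgb, dtype=np.float64)
    right = np.asarray(right_rgb, dtype=np.float64)
    return np.abs(left - right).sum(axis=-1)

# A pure red and a pure blue pixel have the same mean intensity.
red, blue = (255, 0, 0), (0, 0, 255)
grey_result = grey_cost(red, blue)      # cannot tell the pixels apart
colour_result = colour_cost(red, blue)  # clearly penalises the mismatch
```

In this constructed case the grey-scale cost is zero although the two pixels obviously do not match, whereas the colour cost is large.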

In the second part of the project, we focus on the design of new energy functions. These energy functions are designed to provide a better model of the stereo problem, and we therefore expect them to yield improved stereo matching results.

In [5], we have focused on the matting problem in stereo matching. At disparity borders, lens blur and image discretization generate pixels whose colour is a composite of foreground and background surfaces. The problem in stereo is that this composition differs across the stereo input views, and almost all stereo algorithms make a systematic mistake by assuming colour consistency in these regions. We have proposed an energy function that overcomes this problem by explicitly modelling pixel mattes. Handling mattes not only leads to an improved stereo model, but is also important for applications such as novel view generation (Figure 2).
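The effect can be illustrated with the standard compositing equation from the matting literature, C = alpha * F + (1 - alpha) * B. The sketch below is only a numerical illustration under that assumed model and does not reproduce the energy function of [5]. All colour values are made up.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Mixed pixel colour: C = alpha * F + (1 - alpha) * B."""
    fg = np.asarray(foreground, dtype=np.float64)
    bg = np.asarray(background, dtype=np.float64)
    return float(alpha) * fg + (1.0 - float(alpha)) * bg

# The same foreground border point, half covered (alpha = 0.5), is seen
# in front of different background points in the left and right views.
foreground = (200, 200, 200)
left_pixel = composite(0.5, foreground, background=(0, 0, 0))
right_pixel = composite(0.5, foreground, background=(100, 100, 100))
colour_difference = np.abs(left_pixel - right_pixel)
```

Although both pixels show the same foreground point, their colours differ by 50 intensity levels per channel in this example, so a matching cost that assumes colour consistency would penalise the correct correspondence.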

Figure 2: Novel view generation with pixel mattes. (a) Original image. (b) The image of (a) is transformed into a new perspective using the computed depth information. (c) Zoomed view. Because matting information is used, the lamp blends naturally with its new background. (d) The same result without handling the pixel mattes. Disturbing artifacts occur around the lamp’s border.

In [2], we have worked on improving the data term of energy functions by proposing a new aggregation scheme for window-based matching. Although we have only applied local optimization, our method ranks 10th in the Middlebury Benchmark and is the top performer among local methods in the Middlebury online table.
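For readers unfamiliar with window-based matching, the sketch below shows the basic idea with the simplest possible aggregation: costs are summed over a square window and each pixel then picks the disparity with the lowest aggregated cost (local, winner-takes-all optimization). This plain box window is a stand-in chosen for illustration. It is not the aggregation scheme proposed in [2], and the function names are hypothetical.

```python
import numpy as np

def box_aggregate(cost_slice, radius):
    """Sum costs over a (2r+1) x (2r+1) window, replicating border values."""
    height, width = cost_slice.shape
    padded = np.pad(cost_slice, radius, mode="edge")
    total = np.zeros((height, width), dtype=np.float64)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            total += padded[dy:dy + height, dx:dx + width]
    return total

def local_stereo(cost_volume, radius=1):
    """Aggregate each disparity slice, then choose the cheapest disparity per pixel."""
    cost_volume = np.asarray(cost_volume, dtype=np.float64)
    slices = [box_aggregate(cost_volume[:, :, d], radius)
              for d in range(cost_volume.shape[2])]
    return np.argmin(np.stack(slices, axis=2), axis=2)
```

Aggregation makes the decision at each pixel depend on its neighbourhood, so an isolated noisy cost is outvoted by the surrounding pixels. The difficulty, which more elaborate aggregation schemes address, is that a fixed square window also mixes costs across disparity borders.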

Finally, we have proposed a stereo algorithm that achieves results comparable to those of computationally expensive methods such as graph cuts at greatly reduced processing time [12]. The idea is to apply dynamic programming on special-purpose tree structures.
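The reason trees are attractive is that an energy with a data term and pairwise smoothness terms can be minimized exactly on a tree with two passes of dynamic programming. The sketch below demonstrates this on a generic tree given by a parent array, assuming a linear smoothness penalty. It does not reproduce the special-purpose tree structures of [12], and the function name and input layout are hypothetical.

```python
import numpy as np

def tree_dp(data_cost, parent, smoothness_weight=1.0):
    """Minimise sum_p C(p, d_p) + lambda * sum_p |d_p - d_parent(p)| on a tree.

    data_cost: (num_nodes, num_labels) array of per-node label costs.
    parent:    parent[0] = -1 for the root; parent[i] < i for every other node.
    Returns the optimal label of each node and the minimum energy.
    """
    data_cost = np.asarray(data_cost, dtype=np.float64)
    num_nodes, num_labels = data_cost.shape
    labels = np.arange(num_labels)
    # pairwise[a, b]: penalty for parent label a and child label b.
    pairwise = smoothness_weight * np.abs(labels[:, None] - labels[None, :])
    subtree_cost = data_cost.copy()
    best_child_label = np.zeros((num_nodes, num_labels), dtype=np.int64)
    # Leaves-to-root pass: children always have larger indices than parents.
    for node in range(num_nodes - 1, 0, -1):
        total = pairwise + subtree_cost[node][None, :]
        best_child_label[node] = np.argmin(total, axis=1)
        subtree_cost[parent[node]] += np.min(total, axis=1)
    # Root-to-leaves pass: read off the optimal labels.
    result = np.zeros(num_nodes, dtype=np.int64)
    result[0] = int(np.argmin(subtree_cost[0]))
    for node in range(1, num_nodes):
        result[node] = best_child_label[node][result[parent[node]]]
    return result, float(np.min(subtree_cost[0]))
```

The cost of this procedure grows linearly with the number of nodes, which is consistent with the reduced processing time reported above. In contrast, the full pixel grid contains cycles, so this exact two-pass scheme does not apply to it directly.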