Why 3D vision technology? What’s wrong with 2D?

15 January 2020 by Martin Affolter, Computer Vision Engineer

Advertima uses computer vision and machine learning to interpret people and understand their behavior in real time during their customer journey in physical spaces. To achieve best-in-class data quality and accurate results, three-dimensional (3D) vision is essential to all stages of our real-time analysis of the physical world. From tracking to scene understanding and attention estimation, 3D data is crucial for providing our customers with the information they need. In this blog post, I want to shed some light on the field of 3D vision from an industry perspective. Starting with an overview of human and machine 3D vision, I go on to highlight the limitations of 2D vision technology and conclude by showing why Advertima chose 3D vision technology.

1. 3D human and machine vision

Depth perception is an integral part of human vision. The most dominant and precise type of depth perception requires the simultaneous use of both eyes and a large overlap between the left and right fields of vision. This is known as stereo vision. Nature has always inspired technology, and vision is no exception: the digital analogue of nature’s stereo vision is the stereo camera. It consists of two cameras facing the same direction but separated by a horizontal displacement. Like human eyes, the two cameras produce two images of the same scene, viewed from slightly different positions. An object close to the cameras appears shifted by a large amount between the left and the right image, whereas the shift gets smaller the further away the object is. This shift, called disparity, is what makes it possible to calculate the object’s distance from the cameras.
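The relation between disparity and distance can be made concrete with the standard pinhole model for a rectified stereo pair, Z = f * B / d, where f is the focal length in pixels, B the distance between the two cameras and d the disparity in pixels. The following is a minimal sketch of that formula; the focal length, baseline and disparity values are illustrative assumptions, not the parameters of any Advertima sensor.

```python
def depth_from_disparity(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Distance in meters of a point seen by a rectified stereo pair (Z = f * B / d)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, 12 cm between the two lenses.
FOCAL_PX = 700.0
BASELINE_M = 0.12

near = depth_from_disparity(84.0, FOCAL_PX, BASELINE_M)  # large shift: close object
far = depth_from_disparity(8.4, FOCAL_PX, BASELINE_M)    # small shift: distant object
print(f"near: {near:.2f} m, far: {far:.2f} m")
```

Because distance is inversely proportional to disparity, a tenfold smaller shift corresponds to a tenfold larger distance, which is also why depth estimates become less precise for far-away objects.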

2. The limitations of 2D vision technology

In contrast to 3D vision technology, two-dimensional (2D) vision technology uses only one camera. The main difference between a 2D and a 3D sensor is that the former cannot provide depth information. To see why depth matters in real life, try catching a tennis ball thrown in your direction with one eye closed. Give it several tries, and you will find it hard to grab the ball in mid-air because you lack depth perception.

3. Why did Advertima choose 3D vision technology to interpret people and understand their behavior in the physical world?

At Advertima, we compute and estimate numerous features describing people in the physical world through the eyes of our sensors. These features include, among others, spatial position, velocity, age, gender, head pose, body appearance, face appearance, dwell time, walking path prediction, attention focus, anonymous person ID and zoning. Of these 12 features, more than half are supported by 3D depth information, which reflects the relevance of 3D data in our pipeline. The following examples focus on our product “Advertima Smart Signage” to showcase the importance of high-quality data. The key features of Smart Signage are providing consumer insight and delivering relevant content to the right audience at the right time. This in turn is only possible if the system has an accurate and complete picture of what is happening in front of the screen.

In reality, however, our solution is deployed in crowded shopping centers and busy retail stores. As such, picking the right person to receive the best next content is a delicate process. For example, we don’t want to pick people who are looking at their phones and completely ignoring our screens. We also don’t want to focus on people walking away from the screens, or walking so fast that they will be out of reach once the current clip is finished. In contrast, we do want to focus on people who are looking at our screens, passing the screens just as the next clip is ready, or standing in front of them. To deal with these complexities, we use a combination of tracking, zones of interest, walking path prediction, and attention and head pose estimation to score the relevance of each person. In the following, I will show how those decisive features depend on 3D information.

Tracking can be moved from 2D into 3D space to better resolve the ambiguities that arise when tracking people. From a camera’s perspective, one tracked person can occlude, i.e. partially cover, another person. Such a situation can lead to the two people being confused (an ID switch), especially if they are similar in color or shape and/or are moving in the same direction. Many of our experiments have confirmed that such ambiguities can often only be resolved in 3D space.
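The occlusion case can be illustrated with a toy example. The sketch below uses plain nearest-neighbor association, which is a deliberately simple stand-in and not a description of Advertima’s tracker; the track names, pixel coordinates and 3D positions are made up for illustration. Two people overlap in the image, so the pixel distance cannot tell them apart, while the distance in 3D can.

```python
import math

# Last known state of two tracked people: image position (u, v) in pixels and
# 3D position (x, y, z) in meters in the camera frame. All values are illustrative.
tracks = {
    "person_A": {"pixel": (640.0, 360.0), "xyz": (0.0, 0.0, 3.0)},
    "person_B": {"pixel": (648.0, 362.0), "xyz": (0.1, 0.0, 7.0)},
}

# A new detection that overlaps both people in the image but lies about 7 m away.
detection = {"pixel": (644.0, 361.0), "xyz": (0.05, 0.0, 6.9)}


def distances(detection, tracks, space):
    """Euclidean distance from the detection to every track in the given space."""
    return {tid: math.dist(t[space], detection[space]) for tid, t in tracks.items()}


def nearest_track(detection, tracks, space):
    """ID of the track closest to the detection in the given space."""
    d = distances(detection, tracks, space)
    return min(d, key=d.get)


print("2D distances:", distances(detection, tracks, "pixel"))  # both equal: ambiguous
print("3D distances:", distances(detection, tracks, "xyz"))    # clearly separated
print("3D match:", nearest_track(detection, tracks, "xyz"))
```

In this constructed scene the detection is equally far from both tracks in pixel space, whereas in 3D it is roughly 0.1 m from one person and almost 4 m from the other.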

Walking path prediction is the art of “fortune-telling”. Trajectories of people can be related to the scene they are moving in. This allows us to determine frequently used pathways, which leads to a better understanding of the scene at hand. Obstacles like columns, walls and desks can be derived implicitly from the movement patterns. Once the most frequented pathways are known, trajectories of other people can be predicted in real time. In this sense we can predict the future and estimate each person’s whereabouts over the next couple of seconds.
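As a rough illustration of learning pathways from observed trajectories, the sketch below divides the floor into grid cells and counts how often people move from one cell to the next. This first-order transition count is an assumed, simplified stand-in chosen for brevity, not Advertima’s prediction model; the grid cells and trajectories are invented.

```python
from collections import Counter, defaultdict


def learn_transitions(trajectories):
    """Count observed moves between consecutive floor-grid cells."""
    counts = defaultdict(Counter)
    for path in trajectories:
        for current_cell, next_cell in zip(path, path[1:]):
            counts[current_cell][next_cell] += 1
    return counts


def predict_path(counts, start_cell, steps):
    """Follow the most frequently observed move for a number of steps."""
    path = [start_cell]
    for _ in range(steps):
        options = counts.get(path[-1])
        if not options:
            break  # nobody was ever seen leaving this cell
        path.append(options.most_common(1)[0][0])
    return path


# Hypothetical past trajectories as sequences of (column, row) grid cells.
observed = [
    [(0, 0), (1, 0), (2, 0), (2, 1)],
    [(0, 0), (1, 0), (2, 0), (2, 1)],
    [(0, 0), (1, 0), (1, 1)],
]

model = learn_transitions(observed)
predicted = predict_path(model, (0, 0), steps=3)
print(predicted)
```

Cells that nobody ever walks through, for instance because a column stands there, simply never appear in the counts, which is one way obstacles can emerge implicitly from the data.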

Zone of interest divides the space in front of the screen into different relevance classes. A wide-angle camera combined with depth estimation up to 15 meters can cover a relatively large scene. However, not everything that is visible is necessarily relevant. Inside a shopping mall, the user of our solution may only be interested in attracting people entering or leaving a shop. Knowing the floor plan, camera position and camera orientation, we are able to define zones of interest on the virtual floor in 3D and concentrate our analytics on only the relevant people entering, leaving or dwelling within the zone. If you were to try defining zones in a 2D picture, you would completely confuse people in the foreground and background.
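Once people are located on the floor plane, deciding whether someone is inside a zone reduces to a point-in-polygon test. The sketch below uses the classic ray-casting test; the zone corners and the people’s floor coordinates are hypothetical and given in meters, with x to the right of the camera and z pointing away from it.

```python
def point_in_zone(point, polygon):
    """Ray-casting test: is the floor point (x, z) inside the polygon?"""
    x, z = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, z1 = polygon[i]
        x2, z2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (z1 > z) != (z2 > z):
            # x coordinate at which the edge crosses that line
            x_cross = x1 + (z - z1) * (x2 - x1) / (z2 - z1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical zone in front of a shop entrance, 2 to 5 meters from the camera.
entrance_zone = [(-1.5, 2.0), (1.5, 2.0), (1.5, 5.0), (-1.5, 5.0)]

# Two people straight ahead of the camera, one in the foreground, one far behind.
people = {"foreground": (0.0, 3.0), "background": (0.0, 9.0)}

in_zone = {name: point_in_zone(pos, entrance_zone) for name, pos in people.items()}
print(in_zone)
```

Both people stand in the same direction as seen from the camera, so only the depth coordinate decides that the first one is inside the zone and the second one is not.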

Attention / Head Pose is a metric that describes whether people are looking towards a given screen. We estimate three angles for every face describing the head orientation relative to the camera. Ultimately, this information is used to determine which object each visitor is focusing on. Imagine a scene with one camera and two screens, one on each side of the camera, with all devices facing the same direction. If a person stood in front of the camera, would it be possible to determine which screen he or she is looking at? The answer is yes, but only if the position of the person relative to the camera is known. Only by using the depth information of the stereoscopic camera are we able to determine that location and understand which screen or content draws the person’s interest.
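The two-screen scenario can be sketched as a simple ray intersection. The layout, the screen widths and the angle convention below are assumptions made for illustration, and only the yaw angle is used while pitch and roll are ignored. The gaze ray starts at the person’s position and is intersected with the plane that contains the camera and both screens.

```python
import math

# Hypothetical layout: camera at x = 0, one screen on each side, all in the plane z = 0.
# Each entry gives the horizontal extent of a screen in meters.
SCREENS = {"left": (-1.4, -0.6), "right": (0.6, 1.4)}


def attended_screen(person_x, person_z, yaw_deg):
    """Return the name of the screen hit by the gaze ray, or None.

    Assumed convention: yaw 0 means facing straight toward the screen plane,
    positive yaw turns the gaze toward +x.
    """
    if person_z <= 0 or abs(yaw_deg) >= 90:
        return None  # not in front of the plane, or looking away from it
    x_hit = person_x + person_z * math.tan(math.radians(yaw_deg))
    for name, (x_min, x_max) in SCREENS.items():
        if x_min <= x_hit <= x_max:
            return name
    return None


# Same head angle, different distance from the screens.
close_person = attended_screen(-1.0, 1.0, 20.0)
distant_person = attended_screen(-1.0, 5.5, 20.0)
print(close_person, distant_person)
```

With an identical head angle of 20 degrees, the person standing 1 m away is looking at the left screen, while the person standing 5.5 m away is looking at the right one. Without the person’s position, the angle alone cannot distinguish the two cases.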

You can see how each of these features relies on 3D location. So the question may arise why 3D cameras haven’t become the de facto standard and replaced every 2D camera by now. After all, 3D cameras offer everything 2D cameras do and provide depth information on top. The answer lies in the high computational costs associated with stereo vision. Compared to ordinary 2D sensors, a 3D camera requires more than twice the bandwidth and adds significant computational overhead in postprocessing. This follows directly from the fact that stereo vision always processes and combines two camera images, one from the left and one from the right sensor. At Advertima, however, we can afford the required computational effort thanks to extensive optimizations. We’ve invested countless hours to build a system that leverages all the 3D data in real time, despite running state-of-the-art neural network models. All of that runs on a local edge PC, so no information leaks from the customer. This has been an expensive and time-consuming effort, but a necessary one. In the end, we believe in providing the best quality to our customers.