Free3D creates 3D videos from three Azure Kinect cameras.

Abstract

This student research project explores the recording of full three-dimensional (3D) scenes using only three Azure Kinect cameras. By leveraging novel methodologies such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, the captured images are processed into a "4D data structure", enabling the creation of new videos from any perspective. The project's goal is to determine the extent to which visually high-quality 3D scenes can be generated with minimal equipment. The study highlights the potential of these advanced techniques to enhance, for example, online learning experiences by providing an accessible tool for creating immersive 3D content. Experiments demonstrated that using depth data from RGB-D cameras compensates for the reduced number of input images while maintaining high visual quality. The results show that combining static Gaussian Splatting backgrounds with point cloud data from the Azure Kinect cameras can produce impressive 3D scene reconstructions at reduced computational cost, making the technology more accessible and cost-effective.

Infrastructure

Hardware Setup

Time Lapse Setup

It takes approximately seven minutes for two people to set up the hardware, which consists of three Azure Kinect cameras on tripods, connected to each other by sync cables and each to its own laptop.

Kinect Setup

The illustration shows the setup of the three Azure Kinect cameras. They are daisy-chained with two AUX cables, each running from the Sync Out port of one camera to the Sync In port of the next. The camera that is connected only via Sync Out is the master; the other two are subordinates. Each camera is powered and connected to its laptop via the USB-C port.

Kinect-Set-Up
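For reference, the master/subordinate roles can be set when the cameras are configured. Below is a minimal sketch assuming the pyk4a Python bindings; the actual Free3D recorder may use the Azure Kinect SDK or k4arecorder directly, the delay value is only an illustrative choice, and in the real setup each camera is configured on its own laptop rather than both on one machine.

# Minimal sketch: wired-sync configuration for one master and one subordinate
# Azure Kinect, assuming the pyk4a Python bindings.
from pyk4a import PyK4A, Config, ColorResolution, DepthMode, FPS, WiredSyncMode

master = PyK4A(
    Config(
        color_resolution=ColorResolution.RES_1080P,
        depth_mode=DepthMode.NFOV_UNBINNED,
        camera_fps=FPS.FPS_30,
        synchronized_images_only=True,
        wired_sync_mode=WiredSyncMode.MASTER,          # only Sync Out is connected
    )
)

subordinate = PyK4A(
    Config(
        color_resolution=ColorResolution.RES_1080P,
        depth_mode=DepthMode.NFOV_UNBINNED,
        camera_fps=FPS.FPS_30,
        synchronized_images_only=True,
        wired_sync_mode=WiredSyncMode.SUBORDINATE,     # Sync In is connected
        subordinate_delay_off_master_usec=160,         # illustrative offset to avoid depth-laser interference
    )
)

# Subordinates should be started before the master so no sync pulses are missed.
subordinate.start()
master.start()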

Software Setup

Network Infrastructure

Our small setup was implemented without any DHCP server. We had only four IP addresses available, so we assigned them manually to the PCs. The PC with the IP address 192.168.10.1 uses the Windows network share feature to share a specific folder across the entire network. This allows all other clients to save their Kinect videos in that folder, ensuring that all video files are stored centrally. The folder has to be mounted as a network folder on every other client so that its path can be used in the code as the destination for the Kinect recordings.

Physical-Infrastructure
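As an illustration of how a client addresses that central folder, here is a minimal sketch; the UNC share name, folder layout and file naming are hypothetical and will differ in a concrete setup.

# Minimal sketch: building per-camera output paths inside the shared network folder.
# The share name "free3d" is hypothetical.
from pathlib import Path
from datetime import datetime

SHARED_RECORDINGS = Path(r"\\192.168.10.1\free3d\recordings")  # hypothetical share name

def recording_path(camera_name: str) -> Path:
    """Build a timestamped output path for one camera inside the central folder."""
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    out_dir = SHARED_RECORDINGS / timestamp
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"{camera_name}.mkv"

print(recording_path("kinect_master"))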

Basic Overview

In order to reproduce our results, it is important to use three Azure Kinect cameras. Each of these cameras is connected to a laptop or PC, as shown in the container ‘For each kinect’. The recorder runs on this device; it connects to the WebSocket server and needs access to a central file system where the recorded video is saved. The WebSocket server, the Calibrator and the Operator Client are executed on another laptop or PC, which also provides the central file system. Once all recorders and the Calibrator are connected, both recording and calibration can be started centrally from the Operator Client.

Infrastructure
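To make the control flow concrete, the following is a minimal sketch of a recorder client, assuming the Python websockets package and a simple text protocol with hypothetical "register"/"start"/"stop" messages; the actual Free3D protocol, port and message names may differ.

# Minimal sketch: a recorder client that waits for central start/stop commands.
# The server address, port and message strings are assumptions.
import asyncio
import websockets

SERVER_URI = "ws://192.168.10.1:8765"   # operator machine (hypothetical port)

async def recorder_client() -> None:
    async with websockets.connect(SERVER_URI) as ws:
        await ws.send("register:kinect_sub1")          # announce this recorder
        async for message in ws:
            if message == "start":
                print("start recording to the shared network folder")
            elif message == "stop":
                print("stop recording and flush the MKV file")

asyncio.run(recorder_client())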

Marker Based Calibration

ArUco Marker

An ArUco marker cube (17 cm edge length) with a unique marker ID on each side is used to calculate the cube's center from the markers tracked in the camera images. The cube's center is then set as the scene center for determining the camera positions and orientations. Due to the marker size and the lighting, tracking inaccuracies may occur. Over a ten-second recording at 30 frames per second, up to 300 orientation matrices and position vectors per marker ID are generated. An algorithm compares pairs of detected marker positions, and the median of these comparisons determines the camera position relative to the cube.

ArUco Marker
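The per-frame tracking step can be sketched with OpenCV's ArUco module as below (opencv-contrib-python, pre-4.7 API); the dictionary, the marker side length and the median step are assumptions, not the exact Free3D implementation.

# Minimal sketch: per-frame ArUco marker pose estimation and a robust median.
import cv2
import numpy as np

ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)  # assumed dictionary
MARKER_LENGTH = 0.17  # marker side length in metres (assumed; the cube edge is 17 cm)

def marker_poses(frame, camera_matrix, dist_coeffs):
    """Return {marker_id: (rvec, tvec)} for all markers detected in one frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, ARUCO_DICT)
    poses = {}
    if ids is not None:
        rvecs, tvecs, _ = cv2.aruco.estimatePoseSingleMarkers(
            corners, MARKER_LENGTH, camera_matrix, dist_coeffs
        )
        for marker_id, rvec, tvec in zip(ids.flatten(), rvecs, tvecs):
            poses[int(marker_id)] = (rvec.reshape(3), tvec.reshape(3))
    return poses

def robust_camera_position(per_frame_positions):
    """Median over all per-frame camera-position estimates (up to ~300 per marker)."""
    return np.median(np.asarray(per_frame_positions), axis=0)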

ChArUco Board

A ChArUco board was also created and printed on an A1 poster. The ArUco markers are 11.2 cm x 11.2 cm, and the chessboard squares are 15 cm x 15 cm. In a ten-second calibration video at 30 frames per second, up to 300 orientation matrices and position vectors are generated. OpenCV outputs a single board pose per frame, and the median of these results is used to determine the poster's position, which in turn determines the camera position.

ChArUco Board
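A corresponding sketch for the board pose, again with the pre-4.7 OpenCV ArUco API, is shown below; the square counts of the A1 poster and the dictionary are assumed, while the square and marker sizes follow the values above.

# Minimal sketch: ChArUco board pose estimation for one frame.
import cv2
import numpy as np

ARUCO_DICT = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)  # assumed dictionary
# CharucoBoard_create(squaresX, squaresY, squareLength, markerLength, dictionary)
BOARD = cv2.aruco.CharucoBoard_create(5, 7, 0.15, 0.112, ARUCO_DICT)   # 5x7 layout assumed

def board_pose(frame, camera_matrix, dist_coeffs):
    """Return (rvec, tvec) of the board in camera coordinates, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    corners, ids, _ = cv2.aruco.detectMarkers(gray, ARUCO_DICT)
    if ids is None:
        return None
    _, ch_corners, ch_ids = cv2.aruco.interpolateCornersCharuco(corners, ids, gray, BOARD)
    if ch_ids is None or len(ch_ids) < 4:
        return None
    rvec, tvec = np.zeros((3, 1)), np.zeros((3, 1))
    ok, rvec, tvec = cv2.aruco.estimatePoseCharucoBoard(
        ch_corners, ch_ids, BOARD, camera_matrix, dist_coeffs, rvec, tvec
    )
    return (rvec, tvec) if ok else None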

Post-Processing

From whole room to extracted center

RGB-D data from the MKV files is processed with the calibration data to form the point cloud. Open3D functions extract the central object or person: DBSCAN clustering selects the largest cluster, which is typically the desired object, and the ground is removed with a plane segmentation function. These steps isolate the object in the scene, although errors may occur depending on the scene.

PointCloud PlaneSegmentation
DB_Scan ExtractedObjectInCenter
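A minimal sketch of this extraction step with Open3D is shown below; the distance threshold, DBSCAN parameters and cluster selection are assumptions rather than the tuned Free3D settings.

# Minimal sketch: remove the ground plane, then keep the largest DBSCAN cluster.
import numpy as np
import open3d as o3d

def extract_center_object(pcd: o3d.geometry.PointCloud) -> o3d.geometry.PointCloud:
    # 1) Remove the ground with RANSAC plane segmentation (threshold assumed).
    _, ground_idx = pcd.segment_plane(distance_threshold=0.02,
                                      ransac_n=3,
                                      num_iterations=1000)
    pcd = pcd.select_by_index(ground_idx, invert=True)

    # 2) Cluster the remaining points with DBSCAN and keep the largest cluster,
    #    which is typically the object or person in the scene centre.
    labels = np.asarray(pcd.cluster_dbscan(eps=0.05, min_points=20))
    if labels.max() < 0:
        return pcd  # only noise found, return the cloud unchanged
    largest = np.argmax(np.bincount(labels[labels >= 0]))
    return pcd.select_by_index(np.where(labels == largest)[0])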

Combining Point Clouds and Gaussian Splats

To visualize the point clouds, we used Unity with a Gaussian Splat integration to display PLY files. We developed a custom plugin to animate and display point clouds in Unity. The picture below shows the successful combination, featuring the typical appearance of a Gaussian Splat with our point cloud in the center.

Unity
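As an illustration of the hand-over to Unity, the following sketch writes the per-frame point clouds to PLY files with Open3D so they can be loaded on the Unity side; the folder layout and file naming are hypothetical.

# Minimal sketch: export one PLY file per video frame for use in Unity.
import os
import open3d as o3d

def export_frames(frames, out_dir="unity_assets/pointcloud_frames"):
    """frames: iterable of open3d.geometry.PointCloud, one per recorded frame."""
    os.makedirs(out_dir, exist_ok=True)
    for i, pcd in enumerate(frames):
        o3d.io.write_point_cloud(os.path.join(out_dir, f"frame_{i:04d}.ply"), pcd)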

Related Links

There's a lot of excellent work that inspired us:

NeRF provided revolutionary results on novel view synthesis.

Im4D and 4k4D enable the creation of 3D videos through dynamic NeRFs.

Other work creates 3D videos through dynamic Gaussian Splatting: Dynamic 3D Gaussians.

BibTeX

@article{brunn2024free3d,
    author    = {Brunn, Felix and Christmeier, Jan and Häcker, Nick P. and Stempfle, Laura C. and Willmann, Lukas and Zakowski, Simon and Hahne, Prof. Dr. Uwe},
    title     = {Free3D - Free-viewpoint 3D video creation},
    year      = {2024}
}