3D spatial recognition is a fundamental technology that supports autonomous driving. For example, the accuracy of a vehicle's processing depends on the accuracy of the depth information around the vehicle body. Methods that geometrically measure the depth of the captured space by applying stereo vision to images taken by multiple cameras are becoming widely used, but they have difficulty measuring depth in poorly textured or occluded regions. On the other hand, the advent of deep-learning-based depth estimation from monocular images has made it possible to estimate depth in such areas. However, if the observation conditions differ between training and estimation, the estimation accuracy declines. This paper proposes a complementary method that integrates both approaches by using a convolutional autoencoder.
Camera work is a critical aspect of conveying the atmosphere and impressions in movies, and it plays a vital role in video analysis. This research proposes a method to estimate camera work from monocular videos by analyzing the optical flow within the video frames. Our method facilitates the estimation of camera work in videos featuring dynamic subjects by incorporating semantic segmentation. Additionally, it is capable of distinguishing between zoom and dolly movements, which previous works have not achieved. The method uses the relationship between image depth, optical flow, and image coordinates to perform such classification.
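One way to see why depth helps separate zoom from dolly is the pinhole camera model. A zoom scales every image coordinate by the same factor, so the radial flow per unit radius is identical at all pixels. A forward dolly by t_z moves a static point at depth Z by roughly x * t_z / Z, so the same quantity varies with depth. The following is a minimal sketch under those assumptions, not the classifier described in the abstract; the function names and the spread-comparison rule are hypothetical, and the samples are assumed to be static background pixels with coordinates measured from the principal point.

```python
from statistics import mean, pstdev

def radial_expansion(x, y, u, v):
    """Flow component along the radius, divided by the radius length."""
    return (x * u + y * v) / (x * x + y * y)

def relative_spread(values):
    """Standard deviation divided by |mean|; infinite when the mean is ~0."""
    m = mean(values)
    return pstdev(values) / abs(m) if abs(m) > 1e-12 else float("inf")

def classify_zoom_or_dolly(samples):
    """samples: list of (x, y, u, v, depth) for static background pixels.

    Zoom:  the expansion k is the same at every pixel.
    Dolly: k is roughly t_z / depth, so k * depth is the near-constant value.
    The rule below (pick whichever quantity is more constant) is an
    illustrative assumption and presumes the camera is actually moving.
    """
    usable = [s for s in samples if s[0] ** 2 + s[1] ** 2 > 0 and s[4] > 0]
    if len(usable) < 2:
        raise ValueError("need at least two usable samples")
    k = [radial_expansion(x, y, u, v) for x, y, u, v, _ in usable]
    k_times_depth = [ki * s[4] for ki, s in zip(k, usable)]
    if relative_spread(k) <= relative_spread(k_times_depth):
        return "zoom"
    return "dolly"
```

If all sampled pixels lie at the same depth, the two quantities are equally constant and the motions cannot be told apart this way, which is why depth variation in the scene matters.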
Catcher framing technique in baseball has recently gained attention in game analysis. This technique involves a catcher adjusting their catching motion to increase the likelihood of an umpire calling a pitch a strike. Its success is typically evaluated based on the strike rate at the boundary of the strike zone, calculated from pitch trajectory data obtained through a tracking system. However, evaluating catcher framing in games without a tracking system is challenging, and alternative methods based on different types of information are needed. This research proposes a method to detect the catcher's mitt movement trajectory during catcher framing, which is considered useful information apart from pitch trajectory. The method applies object detection, pose estimation and deep learning to videos of baseball pitching scenes.
Micro tunnel excavation is limited to straight paths because size constraints prevent humans from entering through the jacking pipes. A compact, wide-angle measurement system would enable the measurement of enclosed spaces with complex paths. This paper proposes a 3D estimation method that works on monocular images, without relying on training data, captured by a catadioptric imaging system consisting of an omnidirectional camera and a single spherical mirror. The applicability of our method is demonstrated by measuring the geometry of a pipe designed to imitate a propulsion pipe used in tunnel excavation.
At resource mining sites, the ground and bedrock are often drilled for geological surveys and for loading blasting explosives. However, extracting core samples from underground boreholes is time-consuming and labor-intensive, and the samples are difficult to evaluate quantitatively. Visualizing boreholes with computer vision technology allows engineers to evaluate the geological characteristics of underground rocks and to determine trajectories and overall budgets. With the help of VR (Virtual Reality) simulation, this research develops a multi-view fiberscope camera system that captures videos of a borehole and generates a high-quality 3D borehole model using Structure from Motion (SfM), a 3D photogrammetric technique.
In this study, we propose a method for detecting a player’s movement trajectory from video images in order to quantitatively evaluate the defensive range of an outfielder in baseball. The method identifies only the specific players of interest and estimates the homography that accounts for changes in the view angle by matching SIFT feature points, which allowed us to accurately estimate and visualize a player's movement trajectory.
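The role of the homography can be illustrated with a short sketch. It assumes that a 3x3 homography from image coordinates to field coordinates has already been estimated for each frame (for example from matched feature points) and that the player's image position in each frame is known; the function names are hypothetical and the estimation step itself is not shown.

```python
def apply_homography(H, point):
    """Map an image point (x, y) through a 3x3 homography H (nested lists)."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    if w == 0:
        raise ValueError("point maps to infinity under this homography")
    # Divide by the homogeneous coordinate to return to 2D.
    return (xh / w, yh / w)

def project_trajectory(homographies, detections):
    """Convert per-frame image detections into one field-coordinate trajectory.

    homographies[i] is the (assumed, pre-estimated) image-to-field homography
    of frame i, and detections[i] is the player's image position in frame i.
    """
    return [apply_homography(H, p) for H, p in zip(homographies, detections)]
```

Because every frame is mapped into the same field coordinate system, the resulting points can be connected into a trajectory even when the camera pans between frames.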
KEYWORDS: Video, Head, Cameras, Image quality, RGB color model, 3D modeling, Education and training, Deep learning, Image processing, 3D image processing
This paper introduces a method to generate a 4D portrait of a person that can be played over a long period. A 4D portrait is a free-viewpoint video of a person with temporal changes in facial expression. In our proposed method, the parameters that represent the person’s facial expressions and head poses are obtained from a video captured by a monocular RGB camera with a continuously moving viewpoint. A neural radiance field (NeRF) is trained from the captured video and the estimated parameters. Using the radiance field, the 4D portrait is generated based on the similarity of the person's facial expressions.
Owing to the impact of COVID-19, the venues for dancers to perform have shifted from the stage to the media. In this study, we focus on the creation of dance videos that allow audiences to feel a sense of excitement without disturbing their awareness of the dance subject and propose a video generation method that links the dance and the scene by utilizing a sound detection method and an object detection algorithm. The generated video was evaluated using the Semantic Differential method, and it was confirmed that the proposed method could transform the original video into an uplifting video without any sense of discomfort.
VR training is a sports training method in which users watch a video to recognize the situation, estimate the timing to move, and move their bodies according to the situation. In this paper, we propose a real-time visual feedback method that presents sports motion information during a VR experience of Alpine skiing. We prepare a carefully designed visual feedback panel that presents the user's center of gravity and head height as sports motion information. An HMD presents the skiing situation on the slopes in VR180 style. The load sensor of our preliminary system is placed under the user's feet and acquires the center of gravity position, and the tracking function of the HMD estimates the head height. In the evaluation experiment, we investigated the appropriate parameters for achieving good visibility of the visual feedback panel during VR training.
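How a load sensor can yield a center of gravity position may be sketched as a load-weighted average. The sketch assumes the sensor consists of several load cells at known positions under the user's feet; the abstract does not describe the hardware, so the layout and the function name are hypothetical.

```python
def center_of_gravity(cells):
    """Load-weighted mean position of a set of load cells.

    cells: list of ((x, y), load) pairs, where (x, y) is the cell position
    and load is the force it currently measures (same unit for all cells).
    """
    total = sum(load for _, load in cells)
    if total <= 0:
        raise ValueError("no load on the sensor")
    x = sum(pos[0] * load for pos, load in cells) / total
    y = sum(pos[1] * load for pos, load in cells) / total
    return (x, y)
```

For four cells at the corners of a unit square, equal loads give the center of the square, and shifting weight to one side moves the estimate toward that side.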
Laparoscopic surgery offers patients advantages such as a small incision range and quick postoperative recovery. Unfortunately, surgeons struggle to grasp 3D spatial relationships in the abdominal cavity. Methods have been proposed to present the 3D information of the abdominal cavity using AR or VR. Although 3D geometrical information is crucial for such methods, it is difficult to reconstruct dense 3D organ shapes using a feature-point-based 3D reconstruction method such as structure from motion (SfM) because of the appearance characteristics of organs (e.g., texture-less and glossy surfaces). Our research addresses this problem by estimating depth information from laparoscopic images using deep learning. We constructed a training dataset of paired RGB and depth images captured with an RGB-D camera, implemented a depth image generator based on a generative adversarial network (GAN), and generated a depth image from a single-shot RGB image. By calibrating the laparoscopic camera against the RGB-D camera, the laparoscopic image was transformed into an RGB image. We generated depth images by inputting the transformed laparoscopic images into the GAN generator. The scale parameter that relates the depth image to real-world dimensions was calculated by comparing the depth values with the 3D information estimated by SfM. Consequently, the density of the organ model was increased by back-projecting the depth image into 3D space.
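The scale-recovery step can be illustrated as follows. The abstract does not state which estimator is used, so this sketch assumes the median of per-point depth ratios, chosen here because it tolerates a few outlying SfM points; the function name is hypothetical.

```python
from statistics import median

def estimate_depth_scale(predicted_depths, sfm_depths):
    """Scale s such that s * predicted depth approximates the SfM depth.

    predicted_depths[i] and sfm_depths[i] are depths of the same scene point:
    one read from the generated depth image, one from the sparse SfM result.
    The median-of-ratios estimator is an assumption of this sketch.
    """
    ratios = [s / p for p, s in zip(predicted_depths, sfm_depths)
              if p > 0 and s > 0]
    if not ratios:
        raise ValueError("no valid depth correspondences")
    return median(ratios)
```

Multiplying the whole depth image by the returned scale puts it in the same units as the SfM reconstruction before back-projection.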
This paper proposes a shot detection method using the poses of a player in a badminton video sequence. In the proposed method, the hit timing is detected by focusing on the arm movements of the player and analysing the swing movement using skeletal information. The simple shot information is estimated by connecting player positions in the hit timing frame. By performing an experiment to verify hit timing detection, we confirmed that the detection is highly accurate and shot information detection is achieved.
We are researching navigation for visually impaired people. We propose a new interface that uses sound and vibration to support turn-by-turn navigation. In our proposed interface, the target path is divided into straight segments and points where the direction changes. The navigation instructions given by sound and vibration are carefully designed to give minimal yet sufficient cues to visually impaired people while they walk. We have implemented a preliminary system based on our proposal and conducted a user experiment with visually impaired participants. The results suggest that our proposed approach is useful for visually impaired people.
Pitching with correct form is essential for preventing injury and improving skills, yet it is not easy for athletes and instructors to check whether a pitcher is throwing with correct form. In this study, we record a pitcher from the catcher's direction with a monocular camera and estimate the pitcher's skeleton pose using OpenPose. We propose a new method that evaluates the pitching form by examining the estimated pose, using the SSE (Shoulder, Shoulder, Elbow) line as an evaluation index. When the upper body of the pitcher faces the batter, the SSE line should be straight. To find the frame at which the pitcher's body turns squarely to the batter, we use the distance between the shoulders in each video frame: the shape of the SSE line is measured at the frame where this distance is largest. Because the pitching motion is fast, we use a 240 fps camera. We evaluated the relationship between the shape of the SSE line and the shoulder distance, and we discuss the pitchers' pitching properties based on the evaluation.
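The frame selection and the SSE-line check can be sketched with plain 2D keypoints. The sketch assumes each frame is a dictionary of (x, y) keypoints with the hypothetical keys shown below (not OpenPose's own output format) and measures straightness as the angle at the throwing shoulder, where 180 degrees means the three joints are collinear.

```python
import math

def shoulder_distance(frame):
    """Image-space distance between the two shoulder keypoints."""
    return math.dist(frame["left_shoulder"], frame["right_shoulder"])

def sse_angle_deg(frame, throwing_side="right"):
    """Angle at the throwing shoulder between the other shoulder and the elbow."""
    other_side = "left" if throwing_side == "right" else "right"
    sx, sy = frame[f"{throwing_side}_shoulder"]
    ox, oy = frame[f"{other_side}_shoulder"]
    ex, ey = frame[f"{throwing_side}_elbow"]
    ax, ay = ox - sx, oy - sy          # shoulder -> other shoulder
    bx, by = ex - sx, ey - sy          # shoulder -> elbow
    cos_angle = (ax * bx + ay * by) / (math.hypot(ax, ay) * math.hypot(bx, by))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

def evaluate_pitch(frames, throwing_side="right"):
    """SSE angle at the frame where the shoulders appear farthest apart."""
    squared_up = max(frames, key=shoulder_distance)
    return sse_angle_deg(squared_up, throwing_side)
```

A value close to 180 at the selected frame would indicate a straight SSE line; how far from 180 counts as a problem is not specified in the abstract and is left open here.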
The availability of pedestrian location estimation is one of the critical issues in realizing a reliable navigation system for pedestrians in daily scenes. We propose a new pedestrian location estimation system that combines the image-retrieval approach we have developed with a SLAM (Simultaneous Localization and Mapping) approach. Both approaches need only a single camera unit as a sensor, and both estimate the location by computer vision. The problem is that running the two approaches simultaneously requires a high processing cost, which could make it impractical to run them on a single wearable computing unit. We solve the problem by executing the two approaches on two separate computers connected by a network. We have implemented a preliminary system that unites the two approaches in a hybrid fashion over two computers, and we measured its performance in typical daily scenes on our campus. The results are promising for further implementation.
A new method for estimating swimmer positions in swimming pool video is proposed. The video of a swimming race is taken from an upper row of the spectator seating, so that it covers the whole swimming pool. The video is transformed so that each lane can be analyzed along the lane direction. The foreground region, which includes both the swimmer and the water splash, is extracted by adaptive background modeling and by setting a mask region to cope with the influence of the non-planar water surface. Then, based on a color analysis of the water splash, the swimmer region is extracted. The position is estimated as the center of the Gaussian distribution of the swimmer region. The proposed method was applied to a nationwide swimming competition.
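If the "center of the Gaussian distribution" is read as the mean of a 2D Gaussian fitted to the swimmer pixels, its maximum-likelihood estimate is simply the mean pixel coordinate. The sketch below illustrates that reading on a binary mask; the mask format and the function name are assumptions.

```python
def swimmer_position(mask):
    """Mean (row, col) of the nonzero pixels in a binary swimmer mask.

    mask: list of rows, each a list of 0/1 values for one rectified lane.
    Returns None when no swimmer pixel is present.
    """
    pixels = [(r, c)
              for r, row in enumerate(mask)
              for c, value in enumerate(row) if value]
    if not pixels:
        return None
    n = len(pixels)
    mean_row = sum(r for r, _ in pixels) / n
    mean_col = sum(c for _, c in pixels) / n
    return (mean_row, mean_col)
```

With the lane rectified so that columns run along the lane direction, the column component of the result gives the swimmer's position along the lane.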
KEYWORDS: Cameras, 3D modeling, Video, Image segmentation, Imaging systems, Visualization, Video acceleration, Control systems, Video surveillance, Reliability
In this paper, we propose a new free-view video system that generates 3D video from an arbitrary point of view using multiple cameras. When target objects are captured by these cameras, the PC allocated to each capturing camera segments the objects and transmits the masks and color textures to a 3D modeling server via the system's network. The modeling server then generates 3D models of each object from the gathered masks. Finally, the server generates a 3D video at the designated point of view from the 3D model and texture information. In 3D modeling, a reliability-based shape-from-silhouette technique reconstructs a visual hull by carving a 3D space based on the intra-/inter-silhouette reliabilities. In final view rendering, we use a cinematographic camera control system and ARToolKit to control virtual cameras.
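The basic idea of shape-from-silhouette can be sketched as voxel carving: a voxel belongs to the visual hull only if it projects inside the silhouette in every view. The sketch below shows plain carving with toy orthographic projections and omits the reliability weighting mentioned above; all names are hypothetical.

```python
from itertools import product

def carve_visual_hull(voxels, views):
    """Keep the voxels whose projection lies inside every silhouette.

    voxels: iterable of (x, y, z) grid coordinates.
    views:  list of (project, silhouette) pairs, where project maps a voxel
            to a pixel and silhouette is the set of foreground pixels.
    """
    return [v for v in voxels
            if all(project(v) in silhouette for project, silhouette in views)]

def front_view(voxel):
    """Toy orthographic camera looking along the z axis."""
    x, y, _ = voxel
    return (x, y)

def top_view(voxel):
    """Toy orthographic camera looking along the y axis."""
    x, _, z = voxel
    return (x, z)

# A 2 x 2 x 2 voxel grid and two silhouettes (sets of foreground pixels).
grid = list(product(range(2), repeat=3))
views = [(front_view, {(0, 0), (1, 0)}), (top_view, {(0, 0), (0, 1)})]
hull = carve_visual_hull(grid, views)
```

A real system would replace the toy projections with calibrated camera projections, and a reliability-based variant would weigh each view's vote instead of requiring strict agreement.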
KEYWORDS: Image segmentation, 3D modeling, RGB color model, Video, Cameras, Optical engineering, Image processing algorithms and systems, Light sources and illumination, Video surveillance, Imaging systems
We propose a robust method to extract silhouettes of foreground objects from color-video sequences. To cope with various changes in the background, we model the background as a Laplace distribution and update it with a selective running average and static pixel observation. All pixels in the input video image are classified into four initial regions using background subtraction with multiple thresholds. Shadow regions are eliminated using color components, and the final foreground silhouette is extracted by smoothing the boundaries of the foreground and eliminating errors inside and outside of the regions. Experimental results show that the proposed algorithm works very well in various background and foreground situations.
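The background model and the multi-threshold classification can be illustrated per pixel. This sketch assumes a Laplace model with location mu and scale b for each pixel, three illustrative thresholds that split the normalized deviation into four classes, and a selective update applied only to pixels in the most background-like class; the threshold values, learning rate, and function names are assumptions rather than the paper's settings.

```python
def classify_pixel(value, mu, b, thresholds=(1.0, 3.0, 6.0)):
    """Return a class from 0 (confident background) to 3 (confident foreground).

    The deviation |value - mu| is normalized by the Laplace scale b and
    compared with three increasing thresholds, giving four initial regions.
    """
    deviation = abs(value - mu) / b
    return sum(deviation >= t for t in thresholds)

def update_background(value, mu, b, label, alpha=0.05, min_scale=1e-3):
    """Selective running average: adapt the model only on background pixels.

    mu follows the pixel value, and b follows the absolute deviation, which
    is the quantity whose mean estimates the Laplace scale parameter.
    """
    if label != 0:
        return mu, b                      # leave the model untouched
    new_mu = (1 - alpha) * mu + alpha * value
    new_b = (1 - alpha) * b + alpha * abs(value - mu)
    return new_mu, max(new_b, min_scale)  # keep the scale strictly positive
```

Skipping the update for non-background pixels keeps a slowly moving foreground object from being absorbed into the background model.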