This PDF file contains the front matter associated with SPIE Proceedings Volume 12674, including the Title Page, Copyright information, Table of Contents, and Conference Committee information.
In recent years, the remarkable progress in facial manipulation techniques has raised social concerns due to their potential malicious usage and has received considerable attention from both industry and academia. While current deep learning-based face forgery detection methods have achieved promising results, their performance often degrades drastically when tested in non-trivial situations under realistic perturbations. This paper proposes to leverage information in the frequency domain, particularly the phase spectrum, to better differentiate between deepfakes and authentic images. Specifically, a new augmentation method called degradation-based amplitude-phase switch (DAPS) is proposed, which disregards the sensitive amplitude spectrum of a forged facial image and forces the detection network to focus on phase components during training. Extensive evaluation results from a realistic assessment framework show that the proposed augmentation method significantly improves the robustness of the two deepfake detectors analyzed and consistently outperforms other augmentation approaches under various perturbations.
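A minimal sketch of the amplitude-phase switch idea, following the abstract's description (amplitude taken from a degraded copy, phase kept from the original, so a detector trained on the result must rely on phase); the Gaussian-blur degradation and the function name are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def daps_augment(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    """img: (H, W) grayscale face image; returns a phase-preserving mix."""
    degraded = gaussian_filter(img, sigma=sigma)   # assumed degradation
    spec_orig = np.fft.fft2(img)
    spec_degr = np.fft.fft2(degraded)
    # amplitude from the degraded copy, phase from the original
    mixed = np.abs(spec_degr) * np.exp(1j * np.angle(spec_orig))
    return np.real(np.fft.ifft2(mixed))
```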
Image analytics solutions are essential for next-generation manufacturing, as they can compute, in real time, important KPIs related to production, maintenance, and quality that are otherwise not obtainable from machine data. One main challenge that reduces the performance of an image analytics model arises when, during inference, it encounters an image whose statistical characteristics differ substantially from those of the images it was trained on. This is called data drift. In commercial applications this is addressed by assembling a very large dataset such as ImageNet, which contains many images per class, and training a model over it, so that there is little to no drift during inference. However, image analytics solutions in factories are usually custom developed for a specific situation, such as one line/product, and do not scale when the situation changes due to drift, which can be caused by changes in lighting, camera placement, different camera specs on a new line, dust or oil on the camera lens, occlusion by human workers, and a multitude of other reasons. For factory settings it is not possible to create an ImageNet-type dataset, given the sheer volume of different and moving parts on a factory shopfloor, some of which are very specific to a given factory. Instead, monitoring and compensating for this data drift can detect and resolve the degradation in image analytics model performance. In this paper, the proposed solution detects drift in images during inference using a convolutional variational autoencoder and compensates for this drift with minimal system integration, so that the solution easily scales to a wide range of changing conditions.
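A hedged sketch of the drift-detection step (the VAE architecture, its forward signature, and the 3-sigma threshold are illustrative assumptions; the paper's compensation step is not shown):

```python
import torch

def reconstruction_error(vae: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Per-image reconstruction MSE under the trained convolutional VAE."""
    with torch.no_grad():
        x_hat, _mu, _logvar = vae(x)   # assumed forward() -> (recon, mu, logvar)
    return ((x - x_hat) ** 2).flatten(1).mean(dim=1)

def is_drifted(err: torch.Tensor, train_mean: float, train_std: float) -> torch.Tensor:
    """Flag images whose error falls far outside the training distribution."""
    return err > train_mean + 3.0 * train_std
```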
In recent years, computing-based video processing workflows have become an integral part of the motion picture industry. These workloads are highly data- and compute-intensive, requiring capable hardware to achieve the required performance. Currently, general-purpose CPUs and GPUs are used to accelerate video processing functions in motion picture workflows. While such devices are highly suited to software-driven video processing workflows, they consume large amounts of energy in these tasks. In this work, we present a case for deploying an FPGA-based accelerator as an energy-efficient alternative to general-purpose hardware in high-resolution motion picture video processing, using an ingest module as a case study. We show that an FPGA-based accelerator for decoding 8K OpenEXR B44 video frames, designed using a commercial high-level synthesis workflow and executing on a PCIe-connected Alveo U50 device, outperforms a highly parallel, CPU-optimised inbuilt B44 decoder implementation in terms of energy consumption per frame decoded. In our experiments, the FPGA-accelerated B44 decoder was able to decode 8K frames with 47.9 ms latency while consuming 0.98 J of energy per frame, compared to the 58.3 ms achieved by a high-end Intel Core i7-11700K CPU while consuming 4.5 J per frame, when averaged over 1000 runs. We further show that this offload can be seamlessly integrated into state-of-the-art motion picture tools such as NUKE with minimal effort. With FPGAs becoming mainstream in cloud servers, we envision that this work paves the way for more efficient integration and utilisation of custom hardware and FPGAs in compute-intensive motion picture workflows.
Hyperspectral anomaly detection is an active topic in remote sensing research. Researchers have proposed many detection methods based on spatial differences to detect anomalous targets. However, due to the low spatial resolution of images or human manipulation, the spatial differences of targets in practical applications are often insufficient to provide reliable support, which reduces the accuracy of anomaly detection. To solve this problem and exploit the high spectral resolution unique to hyperspectral images, this paper proposes a hyperspectral anomaly detection method based on spectral difference extraction. Specifically, spectral derivatives are introduced to extract bands exhibiting spectral differences between the tested pixels and surrounding sample pixels to form a combined image, and the corresponding Mahalanobis distance is calculated to obtain suspected-anomaly results. Then, the suspected-anomaly results are subjected to anomaly assessment through the kurtosis value to obtain the final detection result. In addition, this paper identifies the corresponding suspected-anomaly part of the results through the corresponding atoms in the background dictionary. Experiments on real datasets show the effectiveness of the proposed method compared with other state-of-the-art methods.
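A minimal sketch of the Mahalanobis-distance scoring step described above, with the spectral-derivative band selection omitted for brevity; the small ridge term is an assumption for numerical stability:

```python
import numpy as np

def mahalanobis_score(pixel: np.ndarray, background: np.ndarray) -> float:
    """pixel: (bands,) test spectrum; background: (n_samples, bands) neighbors."""
    mu = background.mean(axis=0)
    cov = np.cov(background, rowvar=False)
    # ridge keeps the covariance invertible when samples are few
    inv = np.linalg.inv(cov + 1e-6 * np.eye(cov.shape[0]))
    d = pixel - mu
    return float(np.sqrt(d @ inv @ d))
```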
This paper explores the use of Mixed Reality in live television shows by allowing remote participants to “teleport” into a virtual studio. The solution utilizes background extraction (BE) and super-resolution (SR) modules to extract remote participants from their videos and composite them seamlessly into the studio footage, allowing for participation in live TV programs using standard mobile devices or webcams. This paper aims to investigate the impact of capturing devices and background settings on the output videos from the end user’s point of view. The results of the study are presented and discussed with a focus on the effectiveness of the BE and SR components in response to variations in capturing devices and background settings.
Recently, manufacturers of electronic devices such as mobile phones and displays have introduced various audiovisual assistive technologies in their products for people with disabilities. Since the birth of TV, however, the "watching experience" for the visually impaired, for whom television is the top-ranked leisure activity, has unfortunately not improved: they have been limited to understanding the screen through sound alone or listening to a voice that describes it. With the goal of improving this experience for the visually impaired with blurry vision, the world's first visual-aid algorithm is proposed and implemented on mass-production TVs, and its effectiveness is demonstrated through clinical trials with many low-vision participants. The visual-aid algorithm emphasizes features of the picture that are important to human vision, including edges, color, and contrast, so that low-vision viewers with significantly reduced contrast sensitivity can better understand the screen. Clinical trials were essential to validate the approach, and their results show that the proposed algorithm is meaningful for the visually impaired. The clinical trials and our simulation experiments indicate that the TV viewing experience of the visually impaired, and ultimately their quality of life, can be improved.
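As a hedged illustration of the kind of edge and contrast emphasis described (not the authors' production algorithm), an unsharp-mask pass followed by a robust contrast stretch:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def emphasize(frame: np.ndarray, edge_gain: float = 1.5) -> np.ndarray:
    """frame: (H, W) luminance. Boost edges and stretch global contrast."""
    f = frame.astype(np.float32)
    base = gaussian_filter(f, sigma=2.0)
    sharp = f + edge_gain * (f - base)        # unsharp-mask edge emphasis
    lo, hi = np.percentile(sharp, (2, 98))    # robust contrast stretch
    return np.clip((sharp - lo) / max(hi - lo, 1e-6), 0.0, 1.0)
```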
Photon-counting x-ray computed tomography (PCCT) is useful for selecting optimal-energy photons to image various portions of the target object, and we performed fundamental experiments of PCCT to carry out gadolinium (Gd) K-edge CT using Gd-based contrast media. The scanner mainly consists of an x-ray generator with a 0.1-mm-focus tube, a turntable, a cadmium-telluride (CdTe) flat panel detector (FPD) with pixel dimensions of 100 μm, and a personal computer. An object on the turntable is irradiated by the x-ray generator, 720 radiograms are taken using the FPD, and tomograms are reconstructed. We used 1.3× magnification tomography, the effective pixel dimensions were approximately 80 μm, and Gd K-edge CT was carried out using Gd-based contrast media at a tube voltage of 100 kV, a tube current of 0.40 mA, and a threshold energy of 50 keV.
To image blood vessels, we performed fundamental experiments of a red-ray computed tomography (RRCT) scanner using 650-nm-laser and high-sensitivity-photodiode (PD) modules. The line laser beam irradiates an object, and the photons penetrating the object are detected by the PD module through a 1.0-mm-diameter graphite pinhole and a 0.7-mm-diameter, 5-mm-length graphite collimator for the PD. The spatial resolution was primarily determined by the collimator diameter for the PD and was approximately 0.7×0.7 mm². RRCT was performed by repeating reciprocating translations and rotations of the object, and the ray-sampling translation and rotation steps were 0.1 mm and 0.5°, respectively. The image contrast was regulated using the digital amplifier, and the visible diameter of the object was 0.5 mm.
We present a fast, simple, and parallelizable deconvolution algorithm for the real-time deblurring of one- or two-dimensional signals (i.e., images) degraded by defocus or bokeh-like blur. The proposed algorithm runs in linear time and performs significantly faster than other popular deconvolution methods tested, bringing the deblurring time down to under 10 ms for full-HD images. It has a simple software implementation, requiring no Fourier transforms or dynamic memory allocation. Its parallel design makes it especially suitable for GPU acceleration. For one-dimensional noise-free signals, the algorithm is proven to converge exactly to the original unblurred signal.
In the field of image acquisition, Dynamic Vision Sensors (DVS) present an innovative methodology, capturing only the variations in pixel brightness instead of absolute values and thereby revealing unique features. Given that the primary deployment of DVS is within embedded systems characterized by constrained transmission and storage capabilities, data compression becomes significant. Nonetheless, such compression could potentially compromise the efficacy of computer vision (CV) applications. This study investigates the implications of a lossy compression technique, premised on a point cloud representation, for event data in CV tasks. Multiple scenarios under various compression intensities are applied to event data, and the experiments indicate the feasibility of attaining reduced bitrates while incurring minimal impact on CV task performance.
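A hedged sketch of the point-cloud view of event data that such compression builds on: each DVS event (x, y, timestamp, polarity) becomes a 3D point whose time axis is quantized before being handed to a point-cloud codec. The quantization step and array layout are illustrative assumptions:

```python
import numpy as np

def events_to_point_cloud(events: np.ndarray, t_step_us: float = 100.0) -> np.ndarray:
    """events: (N, 4) array of [x, y, t_us, polarity] -> (N, 3) points.

    Time is quantized to t_step_us so a lossy point-cloud codec sees a
    regular third axis; polarity could be carried as a point attribute.
    """
    x, y, t = events[:, 0], events[:, 1], events[:, 2]
    return np.stack([x, y, np.round(t / t_step_us)], axis=1)
```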
To perform energy-dispersive x-ray computed tomography (EDCT), we constructed a computer program to amplify the digital values of raw radiograms. The CT scanner consists of an x-ray generator with a 0.1-mm-focus tube, a turntable, a flat panel detector (FPD), and a personal computer (PC). An object on the turntable is irradiated by the x-ray generator, 720 radiograms at 1.3× magnification are taken by the FPD, and tomograms are reconstructed using the PC. Using the digital amplifier, the object projections obtained using low-energy photons disappeared with increasing amplification factor at a constant maximum value, and the effective energy increased with the amplification factor owing to beam hardening. Using the beam-hardening CT (BHCT) scanner, high-contrast tomography of various objects was performed by controlling the effective energy. In particular, fine blood vessels were observed by K-edge CT using iodine media.
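A minimal sketch of the digital-amplifier behavior described above: raw radiogram values are multiplied by a gain and clipped at a constant maximum, so heavily attenuated low-energy projections saturate away and the effective energy rises. The 16-bit ceiling is an assumption:

```python
import numpy as np

def amplify_radiogram(raw: np.ndarray, gain: float, max_val: int = 65535) -> np.ndarray:
    """Multiply raw detector values by `gain`, clipping at a constant maximum."""
    return np.minimum(raw.astype(np.float64) * gain, float(max_val))
```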
One of the biggest challenges in Meta's video delivery system is device fragmentation. Due to the large user base of Meta's Family of Apps, our supported devices range from the single-core Galaxy Y to the latest Galaxy S22, and from the first-generation iPad mini to the iPhone 14 Pro. Moreover, for devices that do not support hardware decoding of more advanced codecs, such as AV1, we have to rely on software decoders, which require high compute power and memory bandwidth. It would be ideal if we could deliver high-resolution AVC or VP9 encoded ABR lanes along with low-resolution AV1 encoded ABR lanes for the same video, so the client device can choose which one to play based on its compute capacity. In addition, when users upload videos that are already encoded with advanced codecs, such as VP9 or HEVC, while generating ABR encoding ladders from the uploaded video using AVC, we also want to deliver the original uploaded video as a passthrough to maximize quality. In both use cases, we need to support streaming an ABR manifest with multiple encoded bitstreams from different codecs and play them smoothly on the client side. On top of that, we can also optimize the ABR encoding and delivery system to select ABR lanes encoded with different codecs for different bitrate targets. In this paper, we describe how we implemented end-to-end mixed codec manifest support and deployed the solution in production. The proposed approach effectively generalizes encoding selection into a filtering pipeline, evaluates device capacity via a feedback loop, and adapts to bandwidth estimation, viewport size, and device capacity.
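A hedged sketch of the filtering idea the abstract describes: given mixed-codec ABR lanes, keep only those the device can actually decode, then let the player pick by bandwidth. The field names and the capability check are illustrative assumptions, not Meta's production schema:

```python
from dataclasses import dataclass

@dataclass
class Lane:
    codec: str      # e.g. "avc", "vp9", "av1"
    height: int     # rendition height in pixels
    bitrate: int    # bits per second

def playable_lanes(lanes: list[Lane], device_codecs: dict[str, int]) -> list[Lane]:
    """Keep lanes whose codec the device supports at that resolution.

    device_codecs maps codec name -> max height the device can decode,
    e.g. {"avc": 2160, "av1": 720} for a phone with software-only AV1.
    """
    return [l for l in lanes if l.height <= device_codecs.get(l.codec, 0)]
```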
Super-resolution video coding describes the process of coding video at a lower resolution and upsampling the result. This process is included in the AV1 standard, which ensures the same super-resolution process is employed on all receiving devices. Regrettably, the design is limited to horizontal scaling with a maximum scale factor of two. In this paper, we analyze the benefit of enabling two-dimensional upsampling with larger scale factors. Additionally, we consider the value of sending residual information to correct the super-resolution output. Results show a 6.3% and 5.6% improvement in coding efficiency for UHD SDR and UHD HDR content, respectively.
In video codecs, CNN-based models have shown huge promise in two related tasks: in-loop restoration and frame super-resolution. In our previous work, we presented a framework that uses a common CNN architecture with downloadable model parameters for both of these tasks, together with a preliminary performance study in which encoder-side selection of the scale factor was left as future work. The advantage of a common architecture with switchable parameters is that a single hardware inference engine can be utilized in all cases of same-resolution and super-resolution restoration, thereby limiting implementation costs. In this paper, we fully integrate this framework into the under-development AV2 video codec from the Alliance for Open Media (AOM). We also implement an algorithm for encoder-side selection of the super-resolution scale factor. With this implementation, we are able to achieve combined compression improvements of up to −3.5% (AI) and −3.9% (RA) in BD-rate PSNR-Y and up to −7.8% (AI) and −7.9% (RA) in BD-rate VMAF, with an inference cost as low as 1500 MACs/pixel.
In this paper, we present an encoder-aware motion-compensated temporal pre-processing filter (EA-MCTF) that adapts the filter on a block basis according to spatio-temporal content properties and block-level encoding parameters. Sample parameters include the block-level QP, the variance and mean-squared error of the motion-compensated block difference, the slice types of adjoining frames, and the frequency with which a block is used as a reference. Applying the EA-MCTF to an HEVC encoder yields -12.4% average VMAF BD-rate savings over unfiltered encodings; furthermore, the EA-MCTF yields superior BD-rate and computational complexity performance over the MCTF available in the HEVC reference software.
Smooth prediction modes and angular intra prediction modes are two major types of intra prediction modes in the AV1 video codec, designed to reduce spatial redundancy in video signals. Smooth prediction modes are particularly effective for blocks with a smooth gradient, while angular intra prediction modes predict pixel values as a weighted average of neighboring pixels along different angular directions within the block. This paper proposes extensions to these modes to improve intra coding performance for a next-generation video codec beyond AV1. The first extension refines the smooth modes by considering the geometric distance between each sample and its reference pixels to achieve more precise prediction. The second extension refines the distribution of intra prediction angles in AV1 such that the angles are denser around the vertical and horizontal modes and coarser around the diagonal directions. The third extension applies Intra Bi-Prediction (IBP) to a subset of prediction angles, which implicitly allows the codec to choose between the IBP-on and IBP-off cases. Experimental results show that the proposed methods achieve up to 0.3% luma and chroma average BD-rate savings with no encoding time increase for the all-intra configuration when compared to the research-v4.0.0 tag of the AVM reference software, which is being developed for exploring next-generation video coding beyond AV1. Notably, the highest observed coding gain was up to 1.5% for 4K video sequences.
JPEG XS is a lightweight, low-latency image coding standard developed for the transmission of video streams over IP. The third edition of JPEG XS, currently under development in ISO, adds a frame buffer and temporal predictive coding in the wavelet domain to improve its coding efficiency on screen content, and thus improves JPEG XS for remote desktop applications. Due to bandwidth constraints, the frame buffer itself needs to be compressed, and thus the frame buffer bandwidth needs to be considered for interoperability between implementations. This paper reports on core experiments conducted in JPEG (ISO/IEC SC29 WG1) to design profiles and levels for the third edition, and on how WG1 is currently considering profiling the screen content coding extensions of JPEG XS.
In the AV1 local warped motion mode, the warped motion parameters of the current block are derived by fitting a model to nearby motion vectors using least squares. This paper extends this mechanism by adding two new warped motion modes (called WARP_EXTEND and WARP_DELTA), which provide different ways to compute a local warped motion model. In the proposed WARP_EXTEND mode, a warped motion model is constructed by smoothly extending the motion of neighboring blocks into the current block, with a modification based on the signaled motion vector. In the proposed WARP_DELTA mode, the parameters of the warped motion model of the current coding block are first predicted from previously encoded/decoded blocks; the differences between the current block's model parameters and the predicted model parameters are then signaled. For each block, a warped motion reference list (WRL) is maintained to store all of the predicted model parameters. The WRL is generated from the corner motion vectors (MVs) and from the warped motion models of the spatial neighboring blocks. A model parameter bank is also maintained to store the model parameters of previously decoded blocks of the current frame. If there are not enough candidates from the spatial neighborhood to fill the WRL, warped motion models from the model parameter bank are inserted into the WRL. Besides the WARP_EXTEND and WARP_DELTA modes, this paper also proposes a separate single-reference prediction mode in which the motion vector (MV) of the current block is predicted from the WRL. Simulation results show -1.18% (YUV) and -1.53% (YUV) compression gains compared to the existing AV1 warped motion in random access and low delay configurations, respectively.
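A hedged sketch of the least-squares fitting step that local warped motion builds on: a 6-parameter affine model fitted to nearby motion-vector samples. The parameterization and sample layout here are illustrative assumptions, not the codec's exact derivation:

```python
import numpy as np

def fit_affine_warp(pts: np.ndarray, mvs: np.ndarray) -> np.ndarray:
    """Least-squares affine warp from MV samples.

    pts: (N, 2) pixel positions; mvs: (N, 2) motion vectors at those
    positions. Returns (a, b, c, d, tx, ty) with
    mv_x = a*x + b*y + tx and mv_y = c*x + d*y + ty.
    """
    n = pts.shape[0]
    A = np.zeros((2 * n, 6))
    A[0::2, 0:2] = pts    # x-equations: a, b
    A[0::2, 4] = 1.0      # x-equations: tx
    A[1::2, 2:4] = pts    # y-equations: c, d
    A[1::2, 5] = 1.0      # y-equations: ty
    b = mvs.reshape(-1)   # [mvx0, mvy0, mvx1, mvy1, ...]
    params, *_ = np.linalg.lstsq(A, b, rcond=None)
    return params
```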
Wedge mode is a crucial compound prediction mode for predicting blocks that contain moving-object boundaries in the AV1 video codec. The current AV1 design incorporates 16 block-shape-adaptive modes, utilizing a set of handcrafted one-dimensional blending masks that are predefined and applied when blending two predictors. However, this design lacks the flexibility to cater to the various types of moving-object boundaries found in real-world scenarios. This paper proposes an extended wedge mode to replace the original design. The proposed method features novel components such as relaxed limitations on block sizes and the number of modes, mathematically derived nonlinear two-dimensional masks for the blending process, and more efficient mode signaling. Experimental results, evaluated with the AOM reference software under the AOM common test conditions, demonstrate that the proposed method provides an average YUV Bjøntegaard Delta rate reduction of 0.3% for random access and 0.7% for low delay configurations without a major increase in complexity.
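A hedged sketch of wedge-style compound blending: a smooth two-dimensional mask, here a sigmoid of the signed distance to an oriented edge, mixes two predictors. The sigmoid form and sharpness are illustrative assumptions, not the proposed mask derivation:

```python
import numpy as np

def wedge_blend(p0: np.ndarray, p1: np.ndarray, angle: float, offset: float,
                sharpness: float = 0.5) -> np.ndarray:
    """Blend predictors p0/p1 of shape (H, W) across an oriented soft edge."""
    h, w = p0.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # signed distance of each pixel to the wedge line through the block center
    d = (xs - w / 2) * np.cos(angle) + (ys - h / 2) * np.sin(angle) - offset
    m = 1.0 / (1.0 + np.exp(-sharpness * d))   # smooth 0..1 mask
    return m * p0 + (1.0 - m) * p1
```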
This paper improves the inter prediction method of the AOMedia Video Model (AVM) by introducing a sub-block-based motion vector (MV) refinement method. In the proposed method, if a block is coded in a compound mode with bidirectional reference frames, the MVs of the block are refined before producing the final prediction. First, a predicted block is divided into a number of non-overlapping sub-blocks. Then, for each sub-block, offset motion vectors are searched by minimizing the sum of absolute differences (SAD) between the two predicted signals P0 and P1 (where P0 and P1 are the predicted blocks from references 0 and 1, respectively). To simplify the search, a two-step integer search is proposed instead of a full search. In the first step, 9 offset MVs are searched (the initial MV and its 8 neighbors). The offset that produces the minimum SAD in the first step is selected as the center for the second step, in which additional searching is conducted around the best offset found in the first step. A block-level flag is conditionally signaled in the bitstream to indicate whether the proposed method is used. The proposed method is implemented in the AVM reference software research-v4.0.0. Simulation results show that the proposed method achieves -0.47% (YUV) and 0.90% (VMAF) BD-rate gains compared to AVM research-v4.0.0.
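A hedged sketch of the two-step integer search: the SAD between the two motion-compensated predictions is evaluated at the initial offset and its 8 neighbors, then once more around the best candidate. The mirrored offsets and the predictor-fetch callbacks are illustrative assumptions:

```python
import numpy as np

NEIGHBORS = [(0, 0), (-1, -1), (-1, 0), (-1, 1),
             (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def two_step_search(pred0_at, pred1_at):
    """pred{0,1}_at(dx, dy) -> sub-block prediction shifted by an offset MV.

    Returns the integer (dx, dy) offset minimizing SAD(P0, P1); the offset
    is applied with mirrored sign to the second reference (an assumption).
    """
    def sad(dx, dy):
        return np.abs(pred0_at(dx, dy).astype(np.int64)
                      - pred1_at(-dx, -dy).astype(np.int64)).sum()

    best = min(NEIGHBORS, key=lambda o: sad(*o))                  # step 1
    cands = [(best[0] + dx, best[1] + dy) for dx, dy in NEIGHBORS]
    return min(cands, key=lambda o: sad(*o))                      # step 2
```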
Recent advances in image compression have made it possible and desirable for image quality to approach the visually lossless range. Nevertheless, the most commonly used subjective visual quality assessment protocols, e.g., those reported in ITU-R Rec. BT.500, have been found ineffective for evaluating images with visual quality ranging from high to nearly visually lossless. In this context, the JPEG Committee initiated a renewed activity on the Assessment of Image Coding, also referred to as JPEG AIC, aiming at standardizing new subjective and objective image quality assessment methodologies applicable in the quality range from high to nearly visually lossless. For this purpose, a Call for Contributions on Subjective Image Quality Assessment was released with a deadline in April 2023, and a Call for Proposals on Objective Image Quality Assessment is expected to be issued in the near future. This paper provides an overview of the submissions to the Call for Contributions and presents the recent advances in this activity, as well as future directions.
Image-to-Image (I2I) transformations have been an integral part of video processing workflows, with applications in image synthesis for virtual productions, segmentation, and matting, among others. Over the years, deep learning-based approaches have enabled new methods and tools for automating parts of the processing pipeline, reducing the human effort involved in post-production workflows. These compute-intensive models are often accelerated through on-premise or in-cloud GPU instances to improve responsiveness and latency, while expending large amounts of energy in performing these complex transformations. In this work, we present an approach for optimising the energy efficiency of I2I deep-learning models using quantised neural networks accelerated on a server-style FPGA. We use deep learning-based alpha background matting as the I2I application, implemented as a U-Net conditional Generative Adversarial Network. The model is trained and quantised using the Vitis-AI flow from AMD/Xilinx and deployed on a data centre class Alveo U50 FPGA device. Our results show that the quantised model on the FPGA achieves 1.14× higher inference throughput while consuming 11× less energy per inference than a GPU-accelerated version of the model on an RTX 3080 Ti, while generating nearly identical results with an average IoU > 0.95 across multiple user images at 1080p and 4K resolutions. Additionally, offloads to the FPGA device can be seamlessly integrated into widely used motion picture tools like NUKE with minimal effort. With most cloud providers integrating heterogeneous platforms (including FPGAs) into their systems, we envision that this work paves the way for more efficient utilisation of custom-precision deep-learning models and FPGA acceleration in deep learning-based motion picture workflows.
This paper proposes a novel steganographic method that employs a feedback mechanism to improve the efficiency and stealth of data hiding within the Discrete Cosine Transform (DCT) coefficients of JPEG images. The method enhances the correlation between the hidden message and the cover image while minimizing perceptible changes to the image. The system starts by dividing the cover image into blocks and applying the DCT to each. It then evaluates the correlation between the hidden message and the DCT coefficients to identify potential data embedding points. A trained decision-rules algorithm then chooses the optimal data embedding technique, considering factors such as the size and location of the DCT coefficient within the image blocks. Different embedding techniques are employed. The system subsequently generates feedback based on metrics such as image quality and data detectability, refining the decision rules' effectiveness over time. By employing this dynamic approach, our system adaptively improves the data hiding process, enhancing capacity and minimizing detectability. This work opens new doors in the realm of steganography, presenting an intelligent system capable of adaptively embedding data with optimized stealth and efficiency.
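A hedged, minimal sketch of one embedding technique such a system could select: hiding one bit in the least significant bit of a mid-frequency quantized DCT coefficient per 8×8 block, skipping zero coefficients (a common JPEG-steganography convention, not necessarily the paper's rule set):

```python
import numpy as np

def embed_bit(block_q: np.ndarray, bit: int, pos: tuple = (4, 3)) -> np.ndarray:
    """Embed `bit` in the LSB of one quantized DCT coefficient.

    block_q: (8, 8) int array of quantized DCT coefficients. Zero
    coefficients are skipped to limit statistical impact. Simplified:
    real systems must also handle coefficients that shrink to zero.
    """
    out = block_q.copy()
    c = int(out[pos])
    if c != 0:
        out[pos] = (c & ~1) | bit if c > 0 else -(((-c) & ~1) | bit)
    return out
```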
Despite significant progress in recent years, deep face recognition is often treated as a “black box” and has been criticized for lacking explainability. It is becoming increasingly important to understand the characteristics and decisions of deep face recognition systems to make them more acceptable to the public. Explainable face recognition (XFR) refers to the problem of interpreting why a recognition model matches a probe face with one identity over others. Recent studies have explored the use of visual saliency maps as an explanation mechanism, but they often lack a deeper analysis in the context of face recognition. This paper starts by proposing a rigorous definition of explainable face recognition (XFR) that focuses on the decision-making process of the deep recognition model. Based on that definition, a similarity-based RISE algorithm (S-RISE) is then introduced to produce high-quality visual saliency maps for a deep face recognition model. Furthermore, an evaluation approach is proposed to systematically validate the reliability and accuracy of general visual saliency-based XFR methods.
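A hedged sketch of the similarity-based RISE idea: random masks occlude the probe face, each mask is weighted by the similarity between the masked probe's embedding and the matched identity's embedding, and the weighted average of masks forms the saliency map. The mask-generation details follow the spirit of the original RISE paper and are assumptions here, not S-RISE's exact recipe:

```python
import numpy as np

def s_rise(embed, probe: np.ndarray, ref_emb: np.ndarray,
           n_masks: int = 2000, p_keep: float = 0.5) -> np.ndarray:
    """embed(img) -> L2-normalized feature vector; probe: (H, W, C) image."""
    h, w = probe.shape[:2]
    rng = np.random.default_rng(0)
    saliency = np.zeros((h, w))
    for _ in range(n_masks):
        # coarse random binary mask, upsampled by repetition
        coarse = (rng.random((8, 8)) < p_keep).astype(np.float32)
        mask = np.kron(coarse, np.ones((h // 8 + 1, w // 8 + 1)))[:h, :w]
        sim = float(embed(probe * mask[..., None]) @ ref_emb)  # cosine similarity
        saliency += sim * mask
    return saliency / n_masks
```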
The demand for data storage has been growing exponentially over the past decades. Current techniques have significant shortcomings, such as high resource requirements and a lack of sufficient longevity. In contrast, research on DNA-based storage has been advancing notably due to its low environmental impact, larger capacity, and longer lifespan. This led to the development of compression methods that adapted the binary representation of legacy JPEG images into a quaternary base of nucleotides following the biochemical constraints of current synthesis and sequencing mechanisms. In this work, we show that DNA can also be leveraged to efficiently store images compressed with neural networks even without retraining, by combining a convolutional autoencoder with a Goldman encoder. The proposed method is compared to the state of the art, resulting in higher compression efficiency on two different datasets when evaluated by a number of objective quality metrics.
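A hedged sketch of the rotation step used by Goldman-style encoders to map digital data to nucleotides: each base-3 symbol selects one of the three nucleotides that differ from the previously emitted one, which avoids homopolymer runs. The conversion of the compressed bitstream to trits, and the starting reference nucleotide, are simplifications:

```python
NUCS = "ACGT"

def goldman_encode(trits: list[int]) -> str:
    """Map base-3 symbols to DNA, never repeating the previous nucleotide."""
    seq, prev = [], "A"  # assumed starting reference nucleotide
    for t in trits:
        choices = [n for n in NUCS if n != prev]   # exactly 3 options per step
        prev = choices[t]
        seq.append(prev)
    return "".join(seq)

# e.g. goldman_encode([0, 2, 1, 0]) -> "CTCA"
```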
In today’s online video delivery systems, videos are streamed and displayed on various devices with different screen sizes, from large-screen UHD and HD TVs to smaller-screen devices such as mobile phones and tablets. A video will be perceived differently depending on the device’s screen size, pixel density, and viewing distance. Quality models that can estimate the relative differences in the perceptual quality of a video across devices can be used to understand end-user QoE, design optimal encoding ladders for a multi-screen delivery environment, and design better rate-adaptation algorithms. We previously presented the BC-KU Multi-Screen dataset, consisting of subjective scores for different contents encoded at different resolution-bitrate pairs and viewed on three different devices. This paper presents several contributions extending the earlier dataset, which are of interest to the multimedia quality of experience (QoE) community. We first present an in-depth statistical analysis of the previously unpublished individual subjective ratings of the Multi-Screen dataset. To better understand the relative differences in MOS scores, we present and analyze various demographic information about the test participants. We then evaluate the performance of twelve quality metrics using five different performance measures. Individual subjective ratings, analysis scripts, and results are available as an open-source dataset. We believe the newly contributed results, files, and scripts will help in analyzing and designing improved, low-complexity parametric models for multi-screen video delivery systems.
Recent years have seen tremendous growth and advancement in the field of super-resolution algorithms for both images and videos. Such algorithms are mainly based on deep learning technologies and are primarily used for upsampling lower-resolution images and videos, often outperforming traditional upsampling algorithms. Using such advanced upscaling algorithms on the client side can result in significant bandwidth and storage savings, as the client can simply request lower-resolution images/videos and then upscale them to the required (higher) display resolution. However, performance analysis of these algorithms has been limited to a few datasets that are not representative of modern adaptive bitrate video streaming applications. Moreover, they often consider only scaling artefacts, so their performance in the presence of typical compression artefacts is not known. In this paper, we evaluate the performance of such AI-based upscaling algorithms on different datasets in the context of a typical adaptive streaming system. Different content types, video compression standards, and renditions are considered. Our results indicate that the advantage of AI-based video upsampling algorithms over traditional upsampling algorithms, measured objectively in terms of PSNR and SSIM, is insignificant. However, more detailed analysis in terms of other advanced quality metrics, as well as subjective tests, is required for a comprehensive evaluation.
In today's digital world, the secure transmission of sensitive information such as images is of paramount importance. Image data often contains private and sensitive information, so protecting it from unauthorized access and interception is a critical challenge. This article presents an encrypted image transmission scheme based on a chaotic dynamic configuration of multiple displacements supported by the saturated nonlinear function (SNLF) and implemented on a multiprocessor system-on-chip (MPSoC) platform using Python. The main activity was the realization of a chaotic SNLF system on the Xilinx PYNQ-Z1 FPGA (MPSoC) board, programmed interactively in Python on Jupyter Notebook, with the purpose of implementing a secure communication system. The main contribution is the successful synchronization of a system with several chaotic attractors in a master-slave topology. Another important contribution is the rapid implementation on the PYNQ-Z1 FPGA of a robust secure communication system, based on the chaotic SNLF attractor, capable of resisting comprehensive attacks. Among the results obtained, a grayscale or RGB image is encrypted with chaos and a transmission key, the encrypted image is sent through the system's state variables, and from these the receiving system reconstructs the encrypted image and recovers the sent image without loss of information. By realizing this digital image processing architecture on an MPSoC, it becomes possible to program and synthesize other types of algorithms for digital processing and real-time video applications.
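As a hedged, minimal illustration of chaos-based image encryption in the same spirit, using a logistic map as a stand-in keystream generator (the paper's scheme uses an SNLF-based multi-attractor system with master-slave synchronization, not this map):

```python
import numpy as np

def chaotic_keystream(n: int, x0: float = 0.7, r: float = 3.99) -> np.ndarray:
    """Logistic-map keystream; x0 and r act as the shared secret key."""
    x, out = x0, np.empty(n, dtype=np.uint8)
    for i in range(n):
        x = r * x * (1.0 - x)
        out[i] = int(x * 256) & 0xFF
    return out

def xor_image(img: np.ndarray, key=(0.7, 3.99)) -> np.ndarray:
    """Encrypt or decrypt a uint8 image (XOR is its own inverse)."""
    ks = chaotic_keystream(img.size, *key).reshape(img.shape)
    return img ^ ks
```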
The Codec Working Group of the Alliance for Open Media is working on creating next-generation royalty-free video coding technology beyond AV1. This paper proposes a parity-hiding method to optimize the coefficient coding of transform blocks. The parity of the quantization level at the top-left position may be derived from information in the other non-zero quantization levels within the same transform block. As a result, the quantization level at the top-left position, with its parity hidden, may be coded at half of its level. The proposed method has been implemented on top of a recent release of the AOM reference software. Experimental results show that the proposed approach brings 0.26%/0.87%/0.85% BD-rate savings in the all-intra test and 0.29%/0.70%/0.46% BD-rate savings in the random-access test for the Y/U/V components, with marginal encoding and decoding time changes.
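A hedged sketch of the decoder-side derivation: the hidden parity of the top-left level is recovered from the other levels, so only half the level needs to be coded. The specific parity rule (sum of absolute values of the other non-zero levels) is one plausible convention, not necessarily the proposal's exact rule:

```python
def decode_top_left(coded_half: int, others: list[int]) -> int:
    """Reconstruct the top-left quantization level magnitude.

    coded_half: the transmitted value (level >> 1); others: the remaining
    levels of the block in scan order. The encoder must adjust one of the
    other levels whenever the derived parity would not match.
    """
    parity = sum(abs(l) for l in others if l != 0) % 2
    return (coded_half << 1) | parity
```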
Accurately identifying multiple sclerosis (MS) lesions in magnetic resonance imaging (MRI) of the brain and spinal cord is a challenging task due to variations in location, size, and shape, as well as anatomical differences among individuals. The number and volume of these lesions play a crucial role in assessing the severity of MS, monitoring disease progression, and evaluating the effectiveness of new drugs in clinical trials. Manual segmentation, while used previously, is not ideal due to its reliance on expert knowledge, its time-consuming nature, and its susceptibility to variation among different experts. To address these challenges, several automatic methods for segmenting MS lesions have been proposed. This research presents an innovative unsupervised methodology for the accurate identification and categorization of white matter lesions in MRI scans of patients with MS. The methodology combines state-of-the-art computer vision-based image processing techniques, leveraging the capabilities of CVIPtools. Through the integration of preprocessing, segmentation, feature extraction, and pattern classification stages, our algorithm achieves over 90% accuracy in lesion detection and classification. Employing the k-Nearest Neighbor algorithm for pattern classification, the algorithm achieves a success rate of 90.63% for lesion classification and 93.33% for non-lesion classification. This approach holds significant promise for enhancing the accuracy and efficacy of white matter lesion analysis, aiding in the early detection and monitoring of MS.
This paper presents an automated system for the classification of white matter lesions (WMLs) on the brain surface in MRI images of multiple sclerosis (MS) patients. The proposed method utilizes deep learning techniques, specifically the ResNet18 architecture, and incorporates gray-level linear modification methods for image preprocessing. The objective is to accurately classify WMLs using a deep learning model, thereby reducing the workload of radiologists and improving the accuracy of MS diagnosis. The study demonstrates the potential of deep learning models to automate the detection and analysis of WMLs in MRI images of MS patients. By applying gray-level linear modification methods, the visibility of lesion pixels is enhanced, facilitating their recognition. The ResNet18 architecture, trained using transfer learning, achieves an accuracy of up to 93% in classifying MRI images into two categories: those with and without WMLs. The results indicate that the proposed system offers an efficient tool for radiologists, enabling them to streamline the classification of WMLs and enhance the accuracy of MS diagnosis. The incorporation of gray-level enhancement techniques provides improved visualization of lesion pixels, aiding in their identification and analysis.
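A hedged sketch of the transfer-learning setup the abstract describes: a torchvision ResNet18 pretrained on ImageNet, with its final fully connected layer replaced for the two-class (WML / no WML) problem. The optimizer, learning rate, and weights choice are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained backbone, new two-class head
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)   # WML vs. no WML

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()
# a standard training loop over preprocessed MRI slices follows
```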
Purpose-built silicon for hyper-scaled video platforms is becoming mainstream as developers move beyond common video IP cores and commodity chip designs. A new generation of video processing units (VPUs) powered by Application-Specific Integrated Circuits (ASICs) combines the essential encoding, decoding, and transcoding functionality with an on-chip deep neural network engine for AI and ML framework integration. This paper explores the transformative impact of custom ASICs on data center video processing, examining their pivotal role in meeting the ever-evolving demands of this dynamic landscape. We discuss design trade-offs for data center workloads using VPUs that involve multiple priorities, such as improving video quality while maintaining low bitrates using AI and ML applications enabled by silicon-powered video encoding and processing stacks. The paper also showcases practical applications of AI and ML that are currently infeasible using software alone.
We are in an era of immense growth in visual processing, where intelligent, interactive, and immersive visual experiences are delivered from the cloud anywhere, anytime, and on any device. The demand for delivering premium visual experiences from the cloud brings new challenges in providing high-resolution, high-quality, high-density, and yet low-latency media. The Intel® Flex Series GPU is a breakthrough solution designed for delivering premium visual experiences from the edge and cloud using Intel's long-standing Intel® Quick Sync Video technology. It provides a flexible and robust GPU solution, the industry's most open, for the intelligent visual cloud, including broad codec support across common formats in broadcasting and creation, and pre- and post-processing capability without compromising visual quality. This paper highlights how the Flex Series GPU improves flexibility and scalability and lowers the total cost of ownership (TCO) for cloud and edge computing across a wide variety of visual services. It discusses the latest Intel® Quick Sync Video architecture, its broad support for popular media tools, APIs, and frameworks, and its future vision. It also showcases several software innovations on the Flex Series GPU, including support for large generative AI and visual AI model architectures, 8K60 real-time transcoding with Intel® Deep Link Hyper Encode technology on FFmpeg, low-latency cloud gaming, and a license-free virtualization solution. The paper also covers Intel's oneAPI, which empowers developers to deliver open, portable code across Intel's CPUs and GPUs to maximize visual services throughput.
Videos uploaded to Meta's Family of Apps are transcoded into multiple bitstreams of various codec formats, resolutions, and qualities to provide the best video quality across a wide variety of devices and connection bandwidth constraints. On Facebook alone, there are more than 4 billion video views per day, and to handle video processing at this scale, we needed a solution that can deliver the best possible video quality with the shortest encoding time, all while being energy efficient, programmable, and scalable. In this paper, we present the Meta Scalable Video Processor (MSVP), which performs video processing at quality on par with software solutions but at a small fraction of the compute time and energy. Each MSVP ASIC offers a peak SIMO (Single Input Multiple Output) transcoding performance of 4K at 15 fps at the highest quality configuration and can scale up to 4K at 60 fps at the standard quality configuration. This performance is achieved at ~10 W of PCIe module power. We achieved a throughput gain of ~9x for H.264 compared against libx264 software encoding, and ~50x for VP9 compared with the libvpx speed 2 preset. Key components of MSVP transcoding include video decoding, scaling, encoding, and quality metric computation. In this paper, we go over the ASIC architecture of MSVP and the design of its individual components, and compare perf/W versus quality against software encoders in standard industry use.
There has been considerable progress in implicit neural representations for upscaling an image to any arbitrary resolution. However, existing methods are based on defining a function that predicts the RGB value from just four specific loci. Relying on just four loci is insufficient, as it loses fine details from the neighboring region(s). We show that taking the semi-local region into account leads to an improvement in performance. In this paper, a new technique called Overlapping Windows on Semi-Local Region (OW-SLR) is applied to an image to obtain any arbitrary resolution by taking the coordinates of the semi-local region around a point in the latent space. This extracted detail is used to predict the RGB value of a point. We illustrate the technique by applying the algorithm to Optical Coherence Tomography-Angiography (OCT-A) images and show that it can upscale them to arbitrary resolutions. The technique outperforms the existing state-of-the-art methods on the OCT500 dataset. OW-SLR provides better results for classifying retinal images as healthy or diseased (e.g., diabetic retinopathy) from a given set of OCT-A images. The project page is available at https://rishavbb.github.io/ow-slr/index.html
Through optical equipment such as the ophthalmoscope, it is possible to visualize and image the inner surface of the eye, where the main structures of the retina can be observed. Visual analysis of the retinal vasculature is widely used by ophthalmologists for the prevention, diagnosis, and monitoring of retinal diseases. Nevertheless, in pathologies that generate an opacity in the crystalline lens (such as cataracts), visualizing the blood vessels becomes difficult because of the lack of contrast in the fundus image. In this work, a multiscale decomposition method based on Weighted Least Squares (WLS) optimization is applied to cataractous eye fundus images with the aim of obtaining better blood-vessel-to-background contrast. The proposed scheme is evaluated on a publicly available cataract eye fundus dataset. The experimental results show a marked visual improvement in contrast and in the restoration of blood-vessel pixels, while maintaining adequate saturation and lighting for visual analysis. The improved visibility of the vasculature represents a potential benefit in the ophthalmic analysis of patients with cataracts, since the vascular morphology can be observed in greater detail while relevant image features are preserved.
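A single decomposition level in the spirit of the well-known WLS edge-preserving filter can be sketched as follows; the lambda/alpha values and the detail-boost factor are illustrative assumptions, not the parameters of this paper.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve

    def wls_smooth(img, lam=1.0, alpha=1.2, eps=1e-4):
        """Edge-preserving WLS smoothing of a grayscale image in [0, 1].

        Solves (I + L) u = g, where L is a graph Laplacian whose edge weights
        lam / (|grad log g|^alpha + eps) are small across strong edges, so the
        base layer stays sharp at vessel boundaries.
        """
        h, w = img.shape
        n = h * w
        idx = np.arange(n).reshape(h, w)
        log_g = np.log(img + eps).ravel()

        rows, cols, vals = [], [], []
        for a, b in [(idx[:, :-1], idx[:, 1:]),    # horizontal neighbors
                     (idx[:-1, :], idx[1:, :])]:   # vertical neighbors
            a, b = a.ravel(), b.ravel()
            wgt = lam / (np.abs(log_g[a] - log_g[b]) ** alpha + eps)
            rows += [a, b]; cols += [b, a]; vals += [-wgt, -wgt]
        L = sp.coo_matrix((np.concatenate(vals),
                           (np.concatenate(rows), np.concatenate(cols))),
                          shape=(n, n)).tocsr()
        L = L - sp.diags(np.asarray(L.sum(axis=1)).ravel())  # zero row sums
        A = (sp.eye(n, format="csr") + L).tocsr()
        return spsolve(A, img.ravel()).reshape(h, w)

    # One decomposition level; boosting the detail layer raises vessel contrast:
    #   base = wls_smooth(gray)
    #   enhanced = np.clip(base + 1.5 * (gray - base), 0.0, 1.0)
    # A multiscale scheme repeats this with increasing lam values.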
The volume of User-Generated Content (UGC) has increased in recent years, and assessing its quality remains challenging: so far, state-of-the-art metrics do not exhibit a very high correlation with perceptual quality. In this paper, we explore state-of-the-art metrics that extract and combine natural scene statistics and deep neural network features, and we experiment with introducing saliency maps to better capture perceptual relevance. We train and test our models on public datasets, namely YouTube-UGC and KoNViD-1k. Preliminary results indicate that high correlations are achieved using deep features alone, while adding saliency does not always boost performance. Our results and code will be made publicly available to serve as a benchmark for the research community and can be found on our project page.
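As a hedged illustration of one way deep features and saliency can be combined, the sketch below pools a convolutional feature map with saliency weights; the ResNet-18 backbone and the pooling scheme are assumptions of this sketch, not necessarily the authors' models.

    import torch
    import torchvision

    # Truncate a ResNet to its last convolutional feature map.
    backbone = torchvision.models.resnet18(weights=None)
    features = torch.nn.Sequential(*list(backbone.children())[:-2])

    def saliency_weighted_descriptor(frame, saliency):
        """frame: (1, 3, H, W) tensor; saliency: (1, 1, H, W) map in [0, 1]."""
        with torch.no_grad():
            fmap = features(frame)                      # (1, 512, h, w)
        sal = torch.nn.functional.interpolate(
            saliency, size=fmap.shape[-2:], mode="bilinear", align_corners=False)
        sal = sal / (sal.sum(dim=(-2, -1), keepdim=True) + 1e-8)
        # Salient regions contribute more to the pooled frame descriptor.
        return (fmap * sal).sum(dim=(-2, -1))           # (1, 512)

Frame descriptors of this kind would then be aggregated over time and regressed against subjective quality scores.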
Vein pattern recognition is a novel method to reliably identify or authenticate a person. It uses infrared images of the palm, wrist, or fingers, which show the network of veins under the skin. This paper presents a Convolutional Neural Network (CNN) for classifying infrared images of hand vein patterns, trained on the public PolyU database. The CNN classifies 6000 hand vein patterns with an accuracy of 92.81%. Its performance is also compared with invariant moment descriptors: vein pattern recognition on the raw images using k-Nearest Neighbors (k-NN) with invariant Zernike moments achieves an accuracy of 99.97%.
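A minimal sketch of the moment-descriptor baseline, assuming the mahotas implementation of Zernike moments and a 1-NN classifier; the radius, degree, and binarization threshold are illustrative, not the paper's settings.

    import numpy as np
    import mahotas
    from sklearn.neighbors import KNeighborsClassifier

    def zernike_descriptor(img, radius=64, degree=8):
        """Rotation-invariant Zernike moment magnitudes of a binarized vein image."""
        binary = img > img.mean()   # illustrative threshold; a tuned one likely works better
        return mahotas.features.zernike_moments(binary, radius, degree=degree)

    # X_*: stacked descriptors; y_*: subject labels (arrays built by the caller).
    # knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)
    # accuracy = knn.score(X_test, y_test)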
A non-contact device for measuring oxygen saturation during wound healing was fabricated using a multi-spectral camera and LEDs of two different wavelengths (660 nm and 940 nm). The performance of the system was evaluated by creating a wound model and measuring oxygen saturation during the healing process. The results showed a decrease in oxygen saturation due to hypoxia immediately after wounding, which gradually returned to normal as the wound healed. Moreover, the accuracy of the system was compared with that of commercial veterinary oximeters, and the root mean square error (RMSE) was found to meet the clinical criterion (≤ 3.5%).
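The standard two-wavelength estimate behind such devices is the ratio-of-ratios; a minimal sketch follows. The calibration coefficients are illustrative placeholders, not the paper's values (real systems fit them against a reference oximeter, as the comparison above suggests).

    import numpy as np

    def spo2_ratio_of_ratios(red, nir, a=110.0, b=25.0):
        """Estimate SpO2 (%) from 660 nm and 940 nm intensity time series.

        Classic ratio-of-ratios: R = (AC_red/DC_red) / (AC_nir/DC_nir),
        followed by a linear calibration SpO2 = a - b * R.
        """
        def perfusion(x):
            dc = np.mean(x)          # non-pulsatile (DC) component
            ac = np.std(x)           # pulsatile (AC) amplitude proxy
            return ac / dc
        R = perfusion(red) / perfusion(nir)
        return np.clip(a - b * R, 0.0, 100.0)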
The paper deals with the problem of aligning two point clouds. Traditionally, iterative or variational methods are used to solve such problems; however, they become ineffective when the clouds contain a large number of points or when a series of tasks requires real-time cloud mapping. For such tasks it is more appropriate to use neural network techniques and deep learning methods. Aligning two point clouds is understood as finding the displacement vector between them and the rotation matrix of one cloud relative to the other. First, the point clouds are reduced to zero displacement by a suitable transformation (e.g., centering). To find the rotation matrix for the transformed clouds, the authors propose a simple neural network implementation of the ICP algorithm consisting of two stages, each formed by a neural network. At the first stage, a two-layer probabilistic network acts as a metric classifier. Its first layer is composed of radial-basis elements (Gaussians); the Gaussian activation function makes it possible to interpret the output of this layer as the probability that points of the two superimposed clouds are close. The second layer is competitive. As a result, the points of the two clouds are ranked by their degree of proximity. The proximity-sorted point clouds are passed to the second, single-layer neural network, where the rotation matrix is computed by a learning procedure based on the Hebb rule. For small point clouds (fewer than 10 thousand points), it is more appropriate to use a pseudoinverse rule (calculation via the Moore-Penrose pseudoinverse) derived from the Hebb rule. The output of the second stage is the rotation matrix, from which the displacement vector of the original point clouds is easily computed, as sketched below. Evaluation of the proposed method showed good alignment on samples from the ModelNet40 database.
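For matched, centered point sets, the pseudoinverse variant of the second stage reduces to a least-squares fit of the rotation. The sketch below assumes correspondences already produced by the first-stage network; the final SVD projection onto the rotation group is an addition of this sketch, since a raw least-squares solution need not be orthogonal.

    import numpy as np

    def rotation_from_matches(X, Y):
        """Least-squares rotation between matched, centered point sets.

        X, Y: (N, 3) arrays with X[i] matched to Y[i] (the role played by the
        probabilistic/competitive network above). The pseudoinverse step
        mirrors the Moore-Penrose variant of the Hebb rule; the SVD step then
        projects the result onto the nearest proper rotation.
        """
        M = np.linalg.pinv(X) @ Y          # unconstrained least-squares map
        U, _, Vt = np.linalg.svd(M)
        d = np.sign(np.linalg.det(U @ Vt))
        return U @ np.diag([1.0, 1.0, d]) @ Vt

    # Usage: center both clouds, estimate R, then recover the translation.
    # Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    # R = rotation_from_matches(Xc, Yc)
    # t = Y.mean(0) - X.mean(0) @ R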
Point cloud registration is a central problem in many mapping and monitoring applications, such as 3D model reconstruction, computer vision, and autonomous driving. Generating maps of the environment is often referred to as the Simultaneous Localization and Mapping (SLAM) problem. Note that some point clouds in the considered set may not overlap. In this paper, we propose an algorithm that aligns multiple point clouds using an effective pairwise registration followed by a non-iterative global refinement algorithm. Computer simulation results are provided to illustrate the performance of the proposed method.
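The paper's global refinement is not reproduced here, but the following minimal sketch shows the composition step that turns pairwise registrations into global poses, which such a refinement would then adjust.

    import numpy as np

    def global_poses(pairwise):
        """pairwise[i]: 4x4 rigid transform mapping points of cloud i+1 into
        the frame of cloud i (the output of a pairwise registration). Returns
        the transform taking each cloud into the frame of cloud 0."""
        poses = [np.eye(4)]
        for T in pairwise:
            poses.append(poses[-1] @ T)   # chain the pairwise estimates
        return poses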
Extraction of local geometrical features is an important task in 2D image classification and segmentation. In recent years, convolutional neural networks have been the common approach in this field. Usually, the neighborhood of each pixel is used to collect local geometrical information, and the information for each pixel is stored in a matrix; a Convolutional Auto-Encoder (CAE) is then utilized to extract the main geometrical features. In this paper, we propose a CAE-based neural network to solve the problem of extracting local geometrical features from noisy images. Computer simulation results are provided to illustrate the performance of the proposed method.
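A minimal sketch of a denoising CAE over pixel-neighborhood patches; the 16x16 patch size and layer widths are illustrative assumptions, not the authors' architecture.

    import torch
    import torch.nn as nn

    class PatchCAE(nn.Module):
        """Convolutional autoencoder over small pixel-neighborhood patches.

        The bottleneck activations serve as local geometrical features; training
        on noisy inputs against clean targets (a denoising CAE) suits noisy images.
        """
        def __init__(self, feat_dim=16):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),        # 16x16 -> 8x8
                nn.Conv2d(16, feat_dim, 3, stride=2, padding=1), nn.ReLU(),  # 8x8 -> 4x4
            )
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(feat_dim, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):                 # x: (B, 1, 16, 16) patches
            z = self.encoder(x)
            return self.decoder(z), z

    # Training step (illustrative): reconstruct clean patches from noisy ones.
    # model = PatchCAE()
    # recon, feats = model(noisy)
    # loss = nn.functional.mse_loss(recon, clean)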
The paper deals with the design of a fast algorithm for computing the hopping discrete cosine transform in equidistant signal windows using a recursive relationship between transform spectra. The discrete cosine transform is widely used in digital signal processing, for example in image coding, spectral analysis, feature extraction, and filtering. The short-time transform is suitable for adaptive processing and time-frequency analysis of quasi-stationary data. A hopping transform is a transform computed on a fixed-size window that slides over the signal with an integer hop step; it can be employed for time-frequency analysis and adaptive processing of quasi-stationary data such as speech, biomedical, radar, and communication signals. The performance of the algorithm with respect to computational cost and execution time is compared with that of conventional sliding and fast algorithms.
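For reference, the direct (non-recursive) hopping DCT that the proposed algorithm accelerates can be sketched as follows; the paper's recursive spectrum update, which replaces the per-window transform, is not reproduced here.

    import numpy as np
    from scipy.fft import dct

    def hopping_dct(signal, window=64, hop=8):
        """Direct hopping DCT: a DCT-II over a fixed-size window sliding with
        an integer hop step. This is the baseline the recursive algorithm
        speeds up by updating each spectrum from the previous window's
        spectrum instead of recomputing the full transform."""
        starts = range(0, len(signal) - window + 1, hop)
        return np.stack([dct(signal[s:s + window], type=2, norm="ortho")
                         for s in starts], axis=1)   # (window, num_hops)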
Recently, there has been substantial progress in the field of deep learning, which has led to compelling advances in most semantic tasks of computer vision, such as classification, detection, and segmentation. Point cloud registration aligns two or more point clouds by estimating the relative transformation between them. The Iterative Closest Point (ICP) algorithm and its variants have relatively good computational efficiency but are known to be susceptible to local minima, and therefore rely on the quality of the initialization. In this paper, we propose a neural network based on the Deep Closest Point (DCP) network to solve the point cloud registration problem for incongruent point clouds. Computer simulation results are provided to illustrate the performance of the proposed method.
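The differentiable core of DCP-style registration (soft correspondences from feature similarity followed by an SVD solve) can be sketched as below; the feature embeddings are assumed given, since DCP obtains them from a learned encoder.

    import numpy as np

    def soft_svd_registration(X, Y, fx, fy, temperature=0.1):
        """DCP-style rigid fit: soft matches from feature similarity + SVD.

        X: (N, 3), Y: (M, 3) point clouds; fx: (N, D), fy: (M, D) learned
        point embeddings, assumed given here.
        """
        scores = fx @ fy.T / temperature
        P = np.exp(scores - scores.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)     # soft match of each X point in Y
        Y_soft = P @ Y                        # (N, 3) pseudo-correspondences
        Xc, Yc = X - X.mean(0), Y_soft - Y_soft.mean(0)
        U, _, Vt = np.linalg.svd(Xc.T @ Yc)   # Kabsch on soft correspondences
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
        t = Y_soft.mean(0) - R @ X.mean(0)
        return R, t                           # aligns X to Y: x -> R @ x + t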
In this study, the main goal is to improve the performance of existing computer-aided diagnosis systems by proposing new processing methods. We use the public CBIS-DDSM dataset for training and validation; it consists of screening mammograms with benign and malignant tumors, with all pathologies carefully selected and verified by a radiologist. The dataset also includes ROI masks and pathology bounding boxes, as well as labels corresponding to the diagnosis class of each pathology. To achieve better results on this dataset, we transform the data into a more efficient representation using autoencoders, in order to obtain features with low intra-class and high inter-class variance, and apply LDA to the encoded features to classify the pathologies. Automated pathology detection is not considered in this article, which focuses on the classification task itself. The entire pipeline consists of the following steps: feature extraction using pathology segmentation; dividing the data into two clusters; feature transformation using linear discriminant analysis to minimize intra-class variance; and finally, classification of the pathologies. The results of this study for the classification of pathologies using various deep learning methods are presented and discussed.
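A minimal sketch of the encode-then-LDA step with scikit-learn; the file names are hypothetical and the trained encoder itself is omitted.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Z: bottleneck codes from the trained autoencoder, one row per ROI;
    # y: benign (0) / malignant (1) labels. File names are hypothetical.
    Z = np.load("encoded_features.npy")
    y = np.load("labels.npy")

    lda = LinearDiscriminantAnalysis().fit(Z, y)
    # LDA both projects onto the discriminant axis (low intra-class, high
    # inter-class variance) and yields the final benign/malignant decision.
    Z_proj = lda.transform(Z)
    preds = lda.predict(Z)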