Abstract: Session MMSP-2

MMSP-2.1
Automated Generation of News Content Hierarchy By Integrating Audio, Video, and Text Information
Qian Huang,
Zhu Liu,
Aaron Rosenberg,
David Gibbon,
Behzad Shahraray (AT&T Labs - Research)
This paper addresses the problem of generating a semantically meaningful content hierarchy by integrating information from different media. The goal is to automatically construct a compact yet meaningful abstraction of the multimedia data that can serve as an effective index table, allowing users to browse through large amounts of data in a non-linear fashion with flexibility, efficiency, and confidence. We propose an integrated solution, in the context of news broadcasts, that simultaneously utilizes cues from video, audio, and text to achieve this goal. Experimental results are presented and discussed.

MMSP-2.2
Finding Presentations in Recorded Meetings Using Audio and Video Features
Jonathan Foote,
John Boreczky,
Lynn Wilcox (FX Palo Alto Laboratory, Inc.)
This paper describes a method for finding segments in video-recorded meetings that correspond to presentations. These segments serve as indexes into the recorded meeting. The system automatically detects intervals of video that correspond to presentation slides. We assume that only one person speaks during an interval when slides are detected. Thus these intervals can be used as training data for a speaker spotting system. An HMM is automatically constructed and trained on the audio data from each slide interval. A Viterbi alignment then resegments the audio according to speaker. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. This allows the individual presentations in the video to be identified from the location of each presenter's speech. Results are presented for a corpus of six meeting videos.
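
A rough sketch of the clustering step described above, in Python: detected slide intervals are summarized acoustically and grouped so that intervals from the same presenter share a label. The abstract does not specify the features or the distance used, so the mean-MFCC summary, the distance threshold, and all names here are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_slide_intervals(interval_mfccs, distance_threshold=2.0):
    """interval_mfccs: list of (n_frames, n_coeffs) MFCC arrays, one per
    detected slide interval. Returns one speaker label per interval; the
    number of distinct labels estimates the number of presenters."""
    # Summarize each interval by its mean feature vector (a simple
    # stand-in for a full acoustic distance between intervals).
    means = np.stack([m.mean(axis=0) for m in interval_mfccs])
    clusterer = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide
        distance_threshold=distance_threshold,
        linkage="average",
    )
    return clusterer.fit_predict(means)       # e.g. [0, 0, 1, 1, 2]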

MMSP-2.3
Video Content Extraction and Representation Using Joint Audio and Video Processing
Caterina Saraceno (PRIP Vienna University of Technology)
Computer technology allows for large collections of digitally archived material. At the same time, the increasing availability of potentially interesting data makes it difficult to retrieve the desired information. Currently, access to such information is limited to textual queries or to characteristics such as color or texture. The demand for new solutions that allow common users to easily access, store, and retrieve relevant audio-visual information is becoming urgent. One possible solution to this problem is to organize the audio-visual data hierarchically so as to create a nested indexing structure that provides efficient access to relevant information at each level of the hierarchy. This work presents an automatic methodology for extracting and hierarchically representing the semantics of the content, based on a joint audio and visual analysis. Descriptions extracted from each medium (audio, video) are used to recognize higher-level meaningful structures, such as specific types of scenes or, at the highest level, correlations beyond the temporal organization of the information, reflecting classes of visual, audio, or audio-visual content. Once a hierarchy is extracted from the data analysis, a nested indexing structure can be created to access relevant information at a specific level of detail, according to the user's requirements.
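
A loose illustration of such a nested indexing structure, in Python: each node covers a time span at one level of the hierarchy and points to finer-grained children, so a user can descend only as deep as needed. All names and fields are hypothetical, not taken from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexNode:
    label: str                  # e.g. "dialog scene" or "anchor shot"
    start: float                # segment start time, in seconds
    end: float                  # segment end time, in seconds
    children: List["IndexNode"] = field(default_factory=list)

    def find(self, t, depth):
        """Descend up to `depth` levels toward the node containing time t."""
        node = self
        for _ in range(depth):
            inside = [c for c in node.children if c.start <= t < c.end]
            if not inside:
                break
            node = inside[0]
        return node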

MMSP-2.4
Unsupervised Clustering of Ambulatory Audio and Video
Brian P Clarkson,
Alex Pentland (MIT Media Lab)
A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audio/visual input. We find that our system can, without any prior labeling of data, cluster the audio/visual data into events, such as passing through doors and crossing the street. We also hierarchically cluster these events into scenes, obtaining clusters that correlate with visiting the supermarket or walking down a busy street.
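
One plausible reading of this two-level grouping, sketched in Python: frame-level audio/visual features are quantized into event labels, and fixed-length windows are then clustered into scenes by their event histograms. The feature pipeline, window length, and cluster counts are illustrative assumptions; the authors' models are described in the paper.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def events_and_scenes(features, n_events=20, n_scenes=5, win=100):
    """features: (n_frames, n_dims) joint audio/visual feature matrix."""
    # Level 1: quantize frames into discrete event labels.
    events = KMeans(n_clusters=n_events, n_init=10).fit_predict(features)
    # Level 2: describe each window by its event histogram, then
    # hierarchically cluster the windows into scenes.
    hists = [np.bincount(events[s:s + win], minlength=n_events) / win
             for s in range(0, len(events) - win + 1, win)]
    scenes = AgglomerativeClustering(n_clusters=n_scenes).fit_predict(
        np.stack(hists))
    return events, scenes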

MMSP-2.5
Summarizing Video using a Shot Importance Measure and a Frame-Packing Algorithm
Shingo Uchihashi,
Jonathan Foote (FX Palo Alto Laboratory, Inc.)
This paper presents methods for generating compact pictorial summarizations of video. By calculating a measure of shot importance, video can be summarized by de-emphasizing or discarding less important information, such as repeated or common scenes. In contrast to other approaches that present keyframes for each shot, this measure allows summarization by presenting only the most important shots. Selected keyframes can also be resized depending on their relative importance. We present an efficient packing algorithm that constructs a pictorial representation from differently sized keyframes. The result is a compact and visually pleasing summary reminiscent of a comic book.
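
A back-of-the-envelope shot importance measure in this spirit, in Python: shots are grouped into clusters of similar shots, and a shot scores highly when it is long but its cluster accounts for little of the total running time, so repeated or common material is de-emphasized. The exact weighting is an assumption rather than the paper's measure, and the packing algorithm is omitted.

import math

def shot_importance(shot_lengths, shot_clusters):
    """shot_lengths[i]: duration of shot i in seconds;
    shot_clusters[i]: id of the cluster of shots similar to shot i.
    Returns one importance score per shot."""
    total = sum(shot_lengths)
    weight = {}  # fraction of total running time spent in each cluster
    for length, c in zip(shot_lengths, shot_clusters):
        weight[c] = weight.get(c, 0.0) + length / total
    # Long shots from rare clusters score highest; keyframes can then be
    # sized in proportion to these scores before packing.
    return [length * math.log(1.0 / weight[c])
            for length, c in zip(shot_lengths, shot_clusters)]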

MMSP-2.6
Video Classification Using Transform Coefficients
Andreas Girgensohn,
Jonathan Foote (FX Palo Alto Laboratory, Inc.)
This paper describes techniques for classifying video frames using statistical models of reduced DCT or Hadamard transform coefficients. When decimated in time and reduced using truncation or principal component analysis, transform coefficients taken across an entire frame image allow rapid modeling, segmentation, and similarity calculation. Unlike color-histogram metrics, this approach models image composition and works on grayscale images. Modeling the statistics of the transformed video frame images gives a likelihood measure that allows video to be segmented, classified, and ranked by similarity for retrieval. Experiments are presented that show an 87% correct classification rate across different classes. Applications are presented, including a content-aware video browser.
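
A compact sketch of such a pipeline, in Python: a whole-frame 2-D DCT of a grayscale image is truncated to its low-order coefficients, and a diagonal-covariance Gaussian per class provides the likelihoods used for classification and similarity ranking. PCA reduction and the Hadamard variant are omitted, and all parameter choices are assumptions.

import numpy as np
from scipy.fftpack import dct

def frame_features(gray, k=8):
    """2-D DCT of a grayscale frame, truncated to the top-left k x k block."""
    coeffs = dct(dct(gray.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    return coeffs[:k, :k].ravel()

class GaussianFrameClassifier:
    def fit(self, feats_by_class):
        """feats_by_class: {label: (n_frames, n_feats) feature array}."""
        self.models = {c: (x.mean(axis=0), x.var(axis=0) + 1e-6)
                       for c, x in feats_by_class.items()}
        return self

    def log_likelihood(self, f, c):
        mu, var = self.models[c]
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (f - mu) ** 2 / var)

    def predict(self, f):
        # Most likely class; the same scores rank frames by similarity.
        return max(self.models, key=lambda c: self.log_likelihood(f, c))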

MMSP-2.7
On the Choice of Transforms for Data Hiding in Compressed Video
Mahalingam Ramkumar (Department of Electrical and Computer Engineering, NJIT. Newark, NJ),
Ali N Akansu (Department of Electrical and Computer Engineering, NJIT, Newark, NJ),
Aydin A Alatan (Department of Electrical and Computer Engineering, NJIT)
We present an information-theoretic approach to estimating the number of bits that can be hidden in compressed video. We show how embedding the message signal in a suitable transform domain, rather than the spatial domain, can significantly increase the data hiding capacity. We compare the data hiding capacities achievable with different block transforms and show that the choice of transform depends on the robustness required. While transforms with good energy compaction properties (such as the DCT or wavelets) are better choices when the robustness requirement is low, transforms with poorer energy compaction (such as the Hadamard or Hartley transforms) are preferable when higher robustness is needed.
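
An idealized version of this kind of capacity estimate, in Python, treating each transform band as a parallel Gaussian channel with capacity 0.5 * log2(1 + P_embed / P_noise) and summing over bands. The embedding power budget and the noise model are illustrative assumptions, not the authors' derivation.

import numpy as np

def hiding_capacity_bits(embed_power, noise_power):
    """embed_power[i]: message energy allowed in band i by a perceptual
    or compression budget; noise_power[i]: expected requantization or
    attack noise in band i. Returns an estimate in bits per block."""
    embed_power = np.asarray(embed_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return float(np.sum(0.5 * np.log2(1.0 + embed_power / noise_power)))

# Toy usage with made-up band energies for a 4-band transform:
capacity = hiding_capacity_bits(embed_power=[8, 2, 1, 1],
                                noise_power=[1, 1, 1, 1])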

MMSP-2.8
V2ID: Virtual Visual Interior Design System
Zhibin Lei (Bell Laboratories),
Yufeng Liang (Rutgers University),
Weicong Wang (Rutgers University)
In this paper we propose a novel system for semantic feature extraction and retrieval in interior design and decoration applications. The system, V2ID (Virtual Visual Interior Design), uses colored texture and spatial edge layout to obtain simple information about the global room environment. We address the domain-specific segmentation problem in our application and the techniques for obtaining semantic features from a room environment. We also discuss heuristics for using these features (color, texture, or shape) to retrieve objects from an existing database. Finally, a resynthesized room environment, combining the original scene with novel objects from the database, is created for animation and virtual room walkthroughs.
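
A simple illustration of the retrieval heuristics mentioned above, in Python, using a color histogram as the descriptor and L1 distance for matching; the descriptor choice and the distance are assumptions for illustration only.

import numpy as np

def color_histogram(rgb, bins=8):
    """rgb: (h, w, 3) uint8 image region -> normalized joint histogram."""
    h, _ = np.histogramdd(rgb.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def retrieve(query_hist, database, k=3):
    """database: list of (name, histogram) pairs for stored objects.
    Returns the k nearest object names by L1 histogram distance."""
    scored = sorted(database,
                    key=lambda item: np.abs(item[1] - query_hist).sum())
    return [name for name, _ in scored[:k]]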

MMSP-2.9
Simulating MPEG-2 Transport Stream Transmission over Wireless ATM
Andreas Kassler,
Oliver Schirpf,
Peter Schulthess (University of Ulm, Dept. of Distributed Systems, Oberer Eselsberg, 89069 Ulm, Germany)
In this paper we simulate the transmission of MPEG-2 Transport Stream (TS) packets over a wireless ATM network. Based on a finite-state radio channel model for the physical layer of a wireless ATM link, including the characteristics of the ATM and MAC layers, different packing schemes for encapsulating MPEG-2 Transport Stream packets in ATM Adaptation Layer 5 (AAL5) PDUs are evaluated. We analyze the performance with respect to both delay and visual quality in terms of PSNR, based on a cell error ratio (CER) calculated for each state of the radio model. The statistics show that the 1TP scheme (one MPEG-2 TS packet per AAL5 PDU) outperforms all other packing schemes in terms of visual quality. At medium channel quality (38 dB), quality is judged to be good, although the CER may be as high as 25%.
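
For reference, the packing arithmetic behind the compared schemes can be checked directly: a TS packet is 188 bytes, and an AAL5 PDU adds an 8-byte trailer and is padded to a multiple of the 48-byte ATM cell payload. The Python helper below is a worked example of that arithmetic, not the authors' simulator.

import math

TS_BYTES = 188       # one MPEG-2 Transport Stream packet
AAL5_TRAILER = 8     # AAL5 PDU trailer
CELL_PAYLOAD = 48    # ATM cell payload

def cells_per_pdu(ts_packets_per_pdu):
    """ATM cells needed to carry one AAL5 PDU (padded to a cell multiple)."""
    pdu = ts_packets_per_pdu * TS_BYTES + AAL5_TRAILER
    return math.ceil(pdu / CELL_PAYLOAD)

# 1TP: 188 + 8 = 196 bytes -> 5 cells, with 44 bytes of padding.
# 2TP: 376 + 8 = 384 bytes -> exactly 8 cells, no padding.
# 2TP wastes less bandwidth, but one errored cell discards two TS
# packets, consistent with 1TP's better visual quality reported above.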