Abstract: Session MMSP-2

MMSP-2.1
Automated Generation of News Content Hierarchy By Integrating Audio, Video, and Text Information
Qian Huang,
Zhu Liu,
Aaron Rosenberg,
David Gibbon,
Behzad Shahraray (AT&T Labs - Research)
This paper addresses the problem of generating a semantically meaningful content hierarchy by integrating information from different media. The goal is to automatically construct a compact yet meaningful abstraction of the multimedia data that can serve as an effective index table, allowing users to browse through large amounts of data in a non-linear fashion with flexibility, efficiency, and confidence. We propose an integrated solution, in the context of news broadcasts, that simultaneously utilizes cues from video, audio, and text to achieve this goal. Experimental results are presented and discussed.

MMSP-2.2
Finding Presentations in Recorded Meetings Using Audio and Video Features
Jonathan Foote,
John Boreczky,
Lynn Wilcox (FX Palo Alto Laboratory, Inc.)
This paper describes a method for finding segments in video-recorded meetings that correspond to presentations. These segments serve as indexes into the recorded meeting. The system automatically detects intervals of video that correspond to presentation slides. We assume that only one person speaks during an interval when slides are detected. Thus these intervals can be used as training data for a speaker spotting system. An HMM is automatically constructed and trained on the audio data from each slide interval. A Viterbi alignment then resegments the audio according to speaker. Since the same speaker may talk across multiple slide intervals, the acoustic data from these intervals is clustered to yield an estimate of the number of distinct speakers and their order. This allows the individual presentations in the video to be identified from the location of each presenter's speech. Results are presented for a corpus of six meeting videos.
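
A rough sketch of the clustering step described above, in Python: detected slide intervals are summarized acoustically and grouped so that intervals from the same presenter share a label. The abstract does not specify the features or the distance used, so the mean-MFCC summary, the distance threshold, and all names here are illustrative assumptions, not the authors' implementation.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_slide_intervals(interval_mfccs, distance_threshold=2.0):
    """interval_mfccs: list of (n_frames, n_coeffs) MFCC arrays, one per
    detected slide interval. Returns one speaker label per interval; the
    number of distinct labels estimates the number of presenters."""
    # Summarize each interval by its mean feature vector (a simple
    # stand-in for a full acoustic distance between intervals).
    means = np.stack([m.mean(axis=0) for m in interval_mfccs])
    clusterer = AgglomerativeClustering(
        n_clusters=None,                      # let the threshold decide
        distance_threshold=distance_threshold,
        linkage="average",
    )
    return clusterer.fit_predict(means)       # e.g. [0, 0, 1, 1, 2]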

MMSP-2.3
Video Content Extraction and Representation Using Joint Audio and Video Processing
Caterina Saraceno (PRIP Vienna University of Technology)
Computer technology allows for large collections of digitally archived material. At the same time, the increasing availability of potentially interesting data makes it difficult to retrieve the desired information. Currently, access to such information is limited to textual queries or to characteristics such as color or texture. The demand for new solutions that allow common users to easily access, store, and retrieve relevant audio-visual information is becoming urgent. One possible solution to this problem is to organize the audio-visual data hierarchically so as to create a nested indexing structure that provides efficient access to relevant information at each level of the hierarchy. This work presents an automatic methodology for extracting and hierarchically representing the semantics of the content, based on a joint audio and visual analysis. Descriptions extracted from each medium (audio, video) are used to recognize higher-level meaningful structures, such as specific types of scenes or, at the highest level, correlations beyond the temporal organization of the information, reflecting classes of visual, audio, or audio-visual content. Once a hierarchy is extracted from the data analysis, a nested indexing structure can be created to access relevant information at a specific level of detail, according to the user's requirements.
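
A loose illustration of such a nested indexing structure, in Python: each node covers a time span at one level of the hierarchy and points to finer-grained children, so a user can descend only as deep as needed. All names and fields are hypothetical, not taken from the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class IndexNode:
    label: str                  # e.g. "dialog scene" or "anchor shot"
    start: float                # segment start time, in seconds
    end: float                  # segment end time, in seconds
    children: List["IndexNode"] = field(default_factory=list)

    def find(self, t, depth):
        """Descend up to `depth` levels toward the node containing time t."""
        node = self
        for _ in range(depth):
            inside = [c for c in node.children if c.start <= t < c.end]
            if not inside:
                break
            node = inside[0]
        return node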

MMSP-2.4
Unsupervised Clustering of Ambulatory Audio and Video
Brian P Clarkson,
Alex Pentland (MIT Media Lab)
A truly personal and reactive computer system should have access to the same information as its user, including the ambient sights and sounds. To this end, we have developed a system for extracting events and scenes from natural audio/visual input. We find that our system can, without any prior labeling of data, cluster the audio/visual data into events, such as passing through doors and crossing the street. We also hierarchically cluster these events into scenes, obtaining clusters that correlate with visiting the supermarket or walking down a busy street.
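
One plausible reading of this two-level grouping, sketched in Python: frame-level audio/visual features are quantized into event labels, and fixed-length windows are then clustered into scenes by their event histograms. The feature pipeline, window length, and cluster counts are illustrative assumptions; the authors' models are described in the paper.

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def events_and_scenes(features, n_events=20, n_scenes=5, win=100):
    """features: (n_frames, n_dims) joint audio/visual feature matrix."""
    # Level 1: quantize frames into discrete event labels.
    events = KMeans(n_clusters=n_events, n_init=10).fit_predict(features)
    # Level 2: describe each window by its event histogram, then
    # hierarchically cluster the windows into scenes.
    hists = [np.bincount(events[s:s + win], minlength=n_events) / win
             for s in range(0, len(events) - win + 1, win)]
    scenes = AgglomerativeClustering(n_clusters=n_scenes).fit_predict(
        np.stack(hists))
    return events, scenes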

MMSP-2.5
Summarizing Video using a Shot Importance Measure and a Frame-Packing Algorithm
Shingo Uchihashi,
Jonathan Foote (FX Palo Alto Laboratory, Inc.)
This paper presents methods for generating compact pictorial summarizations of video. By calculating a measure of shot importance, video can be summarized by de-emphasizing or discarding less important information, such as repeated or common scenes. In contrast to other approaches that present keyframes for each shot, this measure allows summarization by presenting only the most important shots. Selected keyframes can also be resized depending on their relative importance. We present an efficient packing algorithm that constructs a pictorial representation from differently sized keyframes. The result is a compact and visually pleasing summary reminiscent of a comic book.
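
A back-of-the-envelope shot importance measure in this spirit, in Python: shots are grouped into clusters of similar shots, and a shot scores highly when it is long but its cluster accounts for little of the total running time, so repeated or common material is de-emphasized. The exact weighting is an assumption rather than the paper's measure, and the packing algorithm is omitted.

import math

def shot_importance(shot_lengths, shot_clusters):
    """shot_lengths[i]: duration of shot i in seconds;
    shot_clusters[i]: id of the cluster of shots similar to shot i.
    Returns one importance score per shot."""
    total = sum(shot_lengths)
    weight = {}  # fraction of total running time spent in each cluster
    for length, c in zip(shot_lengths, shot_clusters):
        weight[c] = weight.get(c, 0.0) + length / total
    # Long shots from rare clusters score highest; keyframes can then be
    # sized in proportion to these scores before packing.
    return [length * math.log(1.0 / weight[c])
            for length, c in zip(shot_lengths, shot_clusters)]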

MMSP-2.6
Video Classification Using Transform Coefficients
Andreas Girgensohn,
Jonathan Foote (FX Palo Alto Laboratory, Inc.)
This paper describes techniques for classifying video frames using statistical models of reduced DCT or Hadamard transform coefficients. When decimated in time and reduced using truncation or principal component analysis, transform coefficients taken across an entire frame image allow rapid modeling, segmentation, and similarity calculation. Unlike color-histogram metrics, this approach models image composition and works on grayscale images. Modeling the statistics of the transformed video frame images gives a likelihood measure that allows video to be segmented, classified, and ranked by similarity for retrieval. Experiments are presented that show an 87% correct classification rate across different classes. Applications are presented, including a content-aware video browser.
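
A compact sketch of such a pipeline, in Python: a whole-frame 2-D DCT of a grayscale image is truncated to its low-order coefficients, and a diagonal-covariance Gaussian per class provides the likelihoods used for classification and similarity ranking. PCA reduction and the Hadamard variant are omitted, and all parameter choices are assumptions.

import numpy as np
from scipy.fftpack import dct

def frame_features(gray, k=8):
    """2-D DCT of a grayscale frame, truncated to the top-left k x k block."""
    coeffs = dct(dct(gray.astype(float), axis=0, norm="ortho"),
                 axis=1, norm="ortho")
    return coeffs[:k, :k].ravel()

class GaussianFrameClassifier:
    def fit(self, feats_by_class):
        """feats_by_class: {label: (n_frames, n_feats) feature array}."""
        self.models = {c: (x.mean(axis=0), x.var(axis=0) + 1e-6)
                       for c, x in feats_by_class.items()}
        return self

    def log_likelihood(self, f, c):
        mu, var = self.models[c]
        return -0.5 * np.sum(np.log(2 * np.pi * var) + (f - mu) ** 2 / var)

    def predict(self, f):
        # Most likely class; the same scores rank frames by similarity.
        return max(self.models, key=lambda c: self.log_likelihood(f, c))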

MMSP-2.7
On the Choice of Transforms for Data Hiding in Compressed Video
Mahalingam Ramkumar (Department of Electrical and Computer Engineering, NJIT. Newark, NJ),
Ali N Akansu (Department of Electrical and Computer Engineering, NJIT, Newark, NJ),
Aydin A Alatan (Department of Electrical and Computer Engineering, NJIT)
We present an information-theoretic approach to estimating the number of bits that can be hidden in compressed video. We show how embedding the message signal in a suitable transform domain, rather than the spatial domain, can significantly increase the data hiding capacity. We compare the data hiding capacities achievable with different block transforms and show that the choice of transform depends on the robustness required. While transforms with good energy compaction properties (such as the DCT or wavelets) are better choices when the robustness requirement is low, transforms with poorer energy compaction (such as the Hadamard or Hartley transforms) are preferable when higher robustness is needed.
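
An idealized version of this kind of capacity estimate, in Python, treating each transform band as a parallel Gaussian channel with capacity 0.5 * log2(1 + P_embed / P_noise) and summing over bands. The embedding power budget and the noise model are illustrative assumptions, not the authors' derivation.

import numpy as np

def hiding_capacity_bits(embed_power, noise_power):
    """embed_power[i]: message energy allowed in band i by a perceptual
    or compression budget; noise_power[i]: expected requantization or
    attack noise in band i. Returns an estimate in bits per block."""
    embed_power = np.asarray(embed_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    return float(np.sum(0.5 * np.log2(1.0 + embed_power / noise_power)))

# Toy usage with made-up band energies for a 4-band transform:
capacity = hiding_capacity_bits(embed_power=[8, 2, 1, 1],
                                noise_power=[1, 1, 1, 1])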

MMSP-2.8
V2ID: Virtual Visual Interior Design System
Zhibin Lei (Bell Laboratories),
Yufeng Liang (Rutgers University),
Weicong Wang (Rutgers University)
In this paper we propose a novel system for semantic feature extraction and retrieval in interior design and decoration applications. The system, V2ID (Virtual Visual Interior Design), uses colored texture and spatial edge layout to obtain simple information about the global room environment. We address the domain-specific segmentation problem in our application and the techniques for obtaining semantic features from a room environment. We also discuss heuristics for using these features (color, texture, or shape) to retrieve objects from an existing database. Finally, a resynthesized room environment, combining the original scene with novel objects from the database, is created for animation and virtual room walkthroughs.
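
A simple illustration of the retrieval heuristics mentioned above, in Python, using a color histogram as the descriptor and L1 distance for matching; the descriptor choice and the distance are assumptions for illustration only.

import numpy as np

def color_histogram(rgb, bins=8):
    """rgb: (h, w, 3) uint8 image region -> normalized joint histogram."""
    h, _ = np.histogramdd(rgb.reshape(-1, 3), bins=(bins,) * 3,
                          range=((0, 256),) * 3)
    return h.ravel() / h.sum()

def retrieve(query_hist, database, k=3):
    """database: list of (name, histogram) pairs for stored objects.
    Returns the k nearest object names by L1 histogram distance."""
    scored = sorted(database,
                    key=lambda item: np.abs(item[1] - query_hist).sum())
    return [name for name, _ in scored[:k]]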

MMSP-2.9
Simulating MPEG-2 Transport Stream Transmission over Wireless ATM
Andreas Kassler,
Oliver Schirpf,
Peter Schulthess (University of Ulm, Dept. of Distributed Systems, Oberer Eselsberg, 89069 Ulm, Germany)
In this paper we simulate the transmission of MPEG-2 Transport Stream (TS) packets over a wireless ATM network. Based on a finite-state radio channel model for the physical layer of a wireless ATM link, including the characteristics of the ATM and MAC layers, different packing schemes for encapsulating MPEG-2 Transport Stream packets in ATM Adaptation Layer 5 (AAL5) PDUs are evaluated. We analyze the performance with respect to both delay and visual quality in terms of PSNR, based on a cell error ratio (CER) calculated for each state of the radio model. The statistics show that the 1TP scheme (one MPEG-2 TS packet per AAL5 PDU) outperforms all other packing schemes in terms of visual quality. At medium channel quality (38 dB), quality is judged to be good, although the CER may be as high as 25%.
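
For reference, the packing arithmetic behind the compared schemes can be checked directly: a TS packet is 188 bytes, and an AAL5 PDU adds an 8-byte trailer and is padded to a multiple of the 48-byte ATM cell payload. The Python helper below is a worked example of that arithmetic, not the authors' simulator.

import math

TS_BYTES = 188       # one MPEG-2 Transport Stream packet
AAL5_TRAILER = 8     # AAL5 PDU trailer
CELL_PAYLOAD = 48    # ATM cell payload

def cells_per_pdu(ts_packets_per_pdu):
    """ATM cells needed to carry one AAL5 PDU (padded to a cell multiple)."""
    pdu = ts_packets_per_pdu * TS_BYTES + AAL5_TRAILER
    return math.ceil(pdu / CELL_PAYLOAD)

# 1TP: 188 + 8 = 196 bytes -> 5 cells, with 44 bytes of padding.
# 2TP: 376 + 8 = 384 bytes -> exactly 8 cells, no padding.
# 2TP wastes less bandwidth, but one errored cell discards two TS
# packets, consistent with 1TP's better visual quality reported above.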