Authors:
Qian Huang,
Zhu Liu,
Aaron E Rosenberg,
David Gibbon,
Behzad Shahraray,
Page (NA) Paper number 2373
Abstract:
This paper addresses the problem of generating semantically meaningful
content by integrating information from different media. The goal is
to automatically construct a compact yet meaningful abstraction of the
multimedia data that can serve as an effective index table, allowing
users to browse large amounts of data in a non-linear fashion
with flexibility, efficiency, and confidence. We propose an integrated
solution in the context of news broadcast that simultaneously utilizes
cues from video, audio, and text to achieve the goal. Some experimental
results are presented and discussed in the paper.
Authors:
Jonathan Foote,
John Boreczky,
Lynn Wilcox,
Page (NA) Paper number 1490
Abstract:
This paper describes a method for finding segments in video-recorded
meetings that correspond to presentations. These segments serve as
indexes into the recorded meeting. The system automatically detects
intervals of video that correspond to presentation slides. We assume
that only one person speaks during an interval when slides are detected.
Thus these intervals can be used as training data for a speaker spotting
system. An HMM is automatically constructed and trained on the audio
data from each slide interval. A Viterbi alignment then resegments
the audio according to speaker. Since the same speaker may talk across
multiple slide intervals, the acoustic data from these intervals is
clustered to yield an estimate of the number of distinct speakers and
their order. This allows the individual presentations in the video
to be identified from the location of each presenter's speech. Results
are presented for a corpus of six meeting videos.
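The speaker-counting step described above can be sketched with agglomerative clustering over per-interval acoustic features. The feature vectors, separation, and distance threshold below are illustrative assumptions, not the paper's actual front end:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical per-interval feature vectors (e.g. mean cepstra):
# three slide intervals from one presenter, two from another.
rng = np.random.default_rng(0)
speaker_a = rng.normal(0.0, 0.1, size=(3, 12))
speaker_b = rng.normal(3.0, 0.1, size=(2, 12))
intervals = np.vstack([speaker_a, speaker_b])

# Agglomerative clustering of the interval features; cutting the
# dendrogram at a distance threshold estimates the number of
# distinct speakers and which intervals they share.
tree = linkage(intervals, method="average", metric="euclidean")
labels = fcluster(tree, t=1.0, criterion="distance")
n_speakers = len(set(labels))
```

Intervals assigned the same cluster label can then be pooled as training data for one speaker model, as in the HMM resegmentation the abstract describes.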
Authors:
Caterina Saraceno,
Page (NA) Paper number 1685
Abstract:
Computer technology allows for large collections of digital archived
material. At the same time, the increasing availability of potentially
interesting data makes it difficult to retrieve the desired information.
Currently, access to such information is limited to textual queries
or characteristics such as color or texture. The demand for new solutions
allowing common users to easily access, store and retrieve relevant
audio-visual information is becoming urgent. One possible solution
to this problem is to hierarchically organize the audio-visual data
so as to create a nested indexing structure which provides efficient
access to relevant information at each level of the hierarchy. This
work presents an automatic methodology to extract and hierarchically
represent the semantics of the content, based on joint audio and
visual analysis. Descriptions of each medium (audio, video) are used
to recognize higher-level meaningful structures, such as specific
types of scenes or, at the highest level, correlations beyond the
temporal organization of the information, reflecting classes of
visual, audio, or audio-visual content. Once a hierarchy is extracted
from the data analysis, a nested indexing structure can be created
to access relevant information at a specific level of detail, according
to the user requirements.
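A nested indexing structure of the kind described can be pictured as a small hierarchy that is descended one level per query key. The level names and time intervals below are invented for illustration:

```python
# Hypothetical nested index: program -> scene class -> time intervals.
index = {
    "news program": {
        "anchor scenes": [(0.0, 12.5), (310.0, 322.0)],
        "report scenes": [(12.5, 310.0)],
    }
}

def lookup(index, *keys):
    """Descend the hierarchy one level per key; supplying more keys
    accesses the data at a finer level of detail."""
    node = index
    for key in keys:
        node = node[key]
    return node

anchor_intervals = lookup(index, "news program", "anchor scenes")
```

A user needing only coarse access stops at a shallow level; deeper keys reach specific audio-visual segments, matching the level-of-detail access the abstract proposes.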
Authors:
Brian P Clarkson,
Alex Pentland,
Page (NA) Paper number 2385
Abstract:
A truly personal and reactive computer system should have access to
the same information as its user, including the ambient sights and
sounds. To this end, we have developed a system for extracting events
and scenes from natural audio/visual input. We find our system can
(without any prior labeling of data) cluster the audio/visual data
into events, such as passing through doors and crossing the street.
Also, we hierarchically cluster these events into scenes and get clusters
that correlate with visiting the supermarket, or walking down a busy
street.
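The unsupervised event clustering can be sketched as clustering a stream of ambient feature vectors without labels; contiguous runs of one cluster label then mark an event. The two-event feature stream below is a synthetic stand-in for real audio/visual input:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical ambient feature stream: two segments with distinct
# audio/visual statistics (e.g. indoors, then a busy street).
rng = np.random.default_rng(1)
indoors = rng.normal(0.0, 0.2, size=(50, 4))
street = rng.normal(2.0, 0.2, size=(50, 4))
stream = np.vstack([indoors, street])

# Cluster with no prior labeling of the data; each contiguous run
# of a single label corresponds to one event.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(stream)
```

Clustering these event labels again at a longer time scale would give the scene-level grouping (supermarket visit, street walk) the abstract reports.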
Authors:
Shingo Uchihashi,
Jonathan Foote,
Page (NA) Paper number 1494
Abstract:
This paper presents methods of generating compact pictorial summarizations
of video. By calculating a measure of shot importance video can be
summarized by de-emphasizing or discarding less important information,
such as repeated or common scenes. In contrast to other approaches
that present keyframes for each shot, this measure allows summarization
by presenting only the most important shots. Selected keyframes can
also be resized depending on their relative importance. We present
an efficient packing algorithm that constructs a pictorial representation
from differently-sized keyframes. This results in a compact and visually
pleasing summary reminiscent of a comic book.
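A shot-importance measure of this flavor can be sketched by weighting each shot's duration by the rarity of its cluster, so repeated or common scenes are de-emphasized. The shot data and the exact weighting are assumptions patterned on the idea, not the published formula:

```python
import math

# Hypothetical shots: (scene cluster, duration in seconds). Shots in
# common clusters (repeated scenes) should rank low; rare, long shots high.
shots = [("anchor", 10.0), ("anchor", 8.0), ("field", 20.0), ("graphic", 3.0)]

total = sum(d for _, d in shots)
cluster_time = {}
for c, d in shots:
    cluster_time[c] = cluster_time.get(c, 0.0) + d

def importance(cluster, duration):
    # Duration weighted by the log-rarity of the shot's cluster.
    weight = cluster_time[cluster] / total
    return duration * math.log(1.0 / weight)

ranked = sorted(shots, key=lambda s: importance(*s), reverse=True)
```

Keyframes for the top-ranked shots could then be sized in proportion to their importance before packing, as the abstract describes.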
Authors:
Andreas Girgensohn,
Jonathan Foote,
Page (NA) Paper number 1492
Abstract:
This paper describes techniques for classifying video frames using
statistical models of reduced DCT or Hadamard transform coefficients.
When decimated in time and reduced using truncation or principal component
analysis, transform coefficients taken across an entire frame image
allow rapid modeling, segmentation, and similarity calculation. Unlike
color-histogram metrics, this approach models image composition and
works on grayscale images. Modeling the statistics of the transformed
video frame images gives a likelihood measure that allows video to
be segmented, classified, and ranked by similarity for retrieval. Experiments
are presented that show an 87% correct classification rate for different
classes. Applications are presented including a content-aware video
browser.
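The reduced-transform classification can be sketched as truncating each frame's 2-D DCT to a few low-frequency coefficients and comparing Gaussian log-likelihoods per class. The frame sizes, class statistics, and single full-covariance Gaussian per class are illustrative assumptions:

```python
import numpy as np
from scipy.fft import dctn
from scipy.stats import multivariate_normal

def frame_features(frame, k=3):
    """Truncated low-frequency 2-D DCT coefficients of a grayscale frame."""
    return dctn(frame, norm="ortho")[:k, :k].ravel()

rng = np.random.default_rng(2)
# Hypothetical grayscale training frames for two classes.
bright = [frame_features(rng.normal(200, 5, (8, 8))) for _ in range(20)]
dark = [frame_features(rng.normal(50, 5, (8, 8))) for _ in range(20)]

def fit(feats):
    # Gaussian model of the coefficient statistics, with a small
    # jitter to keep the covariance invertible.
    f = np.array(feats)
    return f.mean(0), np.cov(f.T) + 1e-6 * np.eye(f.shape[1])

models = {c: fit(f) for c, f in [("bright", bright), ("dark", dark)]}

def classify(frame):
    # Rank classes by likelihood of the frame's reduced coefficients.
    x = frame_features(frame)
    return max(models, key=lambda c: multivariate_normal(*models[c]).logpdf(x))

label = classify(rng.normal(195, 5, (8, 8)))
```

The same per-class likelihoods can also rank frames by similarity for retrieval, which is how the abstract links classification and search.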
Authors:
Mahalingam Ramkumar,
Ali N Akansu,
Aydin A Alatan,
Page (NA) Paper number 2460
Abstract:
We present an information-theoretic approach to obtain an estimate
of the number of bits that can be hidden in compressed video. We show
how embedding the message signal in a suitable transform domain rather
than the spatial domain can significantly increase the data hiding
capacity. We compare the data hiding capacities achievable for different
block transforms and show that the choice of the transform depends
on the robustness required. While it is better to choose transforms
with good energy compaction properties (such as the DCT or wavelet
transforms) when the robustness requirement is low, transforms with
poorer energy compaction (such as the Hadamard or Hartley transforms)
are preferable for higher robustness.
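An information-theoretic capacity estimate of this kind can be sketched by treating each transform coefficient as a Gaussian channel. The power values below are illustrative, not figures from the paper:

```python
import math

def hiding_capacity_bits(msg_power, noise_power, n_coeffs):
    """Shannon capacity per coefficient, 0.5*log2(1 + S/N), summed
    over the usable transform coefficients of a block."""
    per_coeff = 0.5 * math.log2(1.0 + msg_power / noise_power)
    return n_coeffs * per_coeff

# Raising the robustness requirement acts like stronger attack/
# quantization noise, shrinking the number of bits that can be hidden.
low_robustness = hiding_capacity_bits(4.0, 1.0, 64)   # mild noise
high_robustness = hiding_capacity_bits(4.0, 16.0, 64)  # strong noise
```

Under this model, a transform changes how message power and noise are distributed across coefficients, which is why the best transform depends on the robustness required.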
Authors:
Zhibin Lei,
Yufeng Liang,
Weicong Wang,
Page (NA) Paper number 2113
Abstract:
In this paper we propose a novel system of semantic feature extraction
and retrieval for interior design and decoration application. The system,
V^2ID (Virtual Visual Interior Design), uses colored texture and spatial
edge layout to obtain simple information about global room environment.
We address the domain specific segmentation problem in our application
and the techniques for obtaining semantic features from a room environment.
We also discuss heuristics for making use of these features (color,
texture or shape) to retrieve objects from an existing database. The
final resynthesized room environment with original scene and novel
object from database is created for the purpose of animation and virtual
room walkthrough.
Authors:
Andreas Kassler, University of Ulm, Dept. Distributed Systems, Oberer Eselsberg, 89069 Ulm, Germany (Germany)
Oliver Schirpf, University of Ulm, Dept. Distributed Systems, Oberer Eselsberg, 89069 Ulm, Germany (Germany)
Peter Schulthess, University of Ulm, Dept. Distributed Systems, Oberer Eselsberg, 89069 Ulm, Germany (Germany)
Page (NA) Paper number 1380
Abstract:
Within this paper we simulate the transmission of MPEG-2 Transport
Stream (TS) Packets over a wireless ATM network. Based on a finite
state radio channel model for the physical layer of a wireless ATM
link including the characteristics of the ATM and MAC layer, different
packing schemes are evaluated for encapsulating MPEG-2 Transport Stream
packets in ATM Adaptation Layer 5 (AAL5) PDUs. We analyze the performance
with respect to both delay and visual quality in terms of PSNR based
on a calculated cell error ratio (CER) for each given state of the
radio model. The statistics show that the 1TP (one MPEG-2 TS per AAL5
PDU) scheme outperforms all other packing schemes in terms of visual
quality. At medium channel quality (38 dB), quality is judged to be
good, although the CER may be as high as 25%.
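The bandwidth side of the packing trade-off can be computed directly from the standard sizes involved (188-byte TS packets, 48-byte ATM cell payloads, 8-byte AAL5 trailer); the loss side follows because one errored cell discards the whole AAL5 PDU:

```python
import math

TS_PACKET = 188      # bytes in one MPEG-2 Transport Stream packet
AAL5_TRAILER = 8     # AAL5 CPCS trailer bytes
CELL_PAYLOAD = 48    # ATM cell payload bytes

def cells_per_pdu(ts_packets_per_pdu):
    """ATM cells needed for one AAL5 PDU holding N TS packets
    (payload plus trailer, padded up to whole cells)."""
    payload = ts_packets_per_pdu * TS_PACKET + AAL5_TRAILER
    return math.ceil(payload / CELL_PAYLOAD)

# Packing more TS packets per PDU wastes less padding, but a single
# lost cell then discards more TS packets at once, which is why the
# 1TP scheme wins on visual quality over an error-prone radio link.
one_tp = cells_per_pdu(1)   # cells to carry 1 TS packet
two_tp = cells_per_pdu(2)   # cells to carry 2 TS packets
```

Here 1TP spends 5 cells per TS packet versus 4 for 2TP, so its advantage is purely in limiting the damage of each cell error, consistent with the abstract's finding.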