Authors:
Timothy J Wark,
Sridha Sridharan,
Vinod Chandran,
Page (NA) Paper number 1839
Abstract:
This paper investigates the use of lip information, in conjunction
with speech information, for robust speaker verification in the presence
of background noise. It has been previously shown in our own work,
and in the work of others, that features extracted from a speaker's
moving lips hold speaker dependencies which are complementary to
speech features. We demonstrate that the fusion of lip and speech information
allows for a highly robust speaker verification system which outperforms
either sub-system alone. We present a new technique for
determining the weighting to be applied to each modality so as to optimize
the performance of the fused system. Given a correct weighting, lip
information is shown to be highly effective for reducing the false
acceptance and false rejection error rates in the presence of background
noise.
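As a rough illustration of this kind of modality-weighted score fusion, a minimal Python sketch follows; it is not the paper's weighting technique, and the accept threshold, grid search, and function names are illustrative assumptions.

```python
import numpy as np

def fuse_scores(speech_score, lip_score, alpha):
    """Weighted-sum fusion: alpha weights the speech modality, 1 - alpha the lips."""
    return alpha * speech_score + (1.0 - alpha) * lip_score

def pick_alpha(speech_scores, lip_scores, labels, candidates=np.linspace(0, 1, 21)):
    """Grid-search the fusion weight that minimises the mean of the false
    acceptance and false rejection rates on held-out numpy arrays of scores
    (labels: 1 = client, 0 = impostor). The 0.5 accept threshold is arbitrary."""
    best_alpha, best_err = 0.5, np.inf
    for a in candidates:
        accept = fuse_scores(speech_scores, lip_scores, a) > 0.5
        fa = np.mean(accept[labels == 0])    # impostors accepted
        fr = np.mean(~accept[labels == 1])   # clients rejected
        if 0.5 * (fa + fr) < best_err:
            best_alpha, best_err = a, 0.5 * (fa + fr)
    return best_alpha
```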
Authors:
Marc Liévin, Signal and Image Laboratory, LIS-INPG (France)
Franck Luthon, Signal and Image Laboratory, LIS-INPG (France)
Page (NA) Paper number 2303
Abstract:
An unsupervised algorithm for segmenting a speaker's lips is presented
in this paper. A color video sequence of the speaker's face is acquired
under natural lighting conditions and without any particular make-up.
First, a logarithmic color transform is performed from RGB to HI (hue,
intensity) color space and sequence-dependent parameters are evaluated.
Second, a statistical approach based on Markov random field modeling
segments the mouth shape using the red-hue-predominant region and motion
in a spatiotemporal neighborhood. Simultaneously, a Region Of Interest
(ROI) is automatically extracted. Third, the speaker's lip shape is
extracted from the final hue field, with good-quality results in this
challenging situation.
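A minimal sketch, assuming a numpy RGB frame, of the general idea of a logarithmic color transform followed by red-hue thresholding; the paper's exact HI transform, the Markov random field relaxation, and the motion term are not reproduced, and the adaptive threshold shown is purely hypothetical.

```python
import numpy as np

def log_hue_intensity(rgb):
    """Simplified stand-in for a logarithmic RGB -> (hue, intensity) transform;
    the paper's exact HI transform is not reproduced here."""
    log_rgb = np.log(rgb.astype(np.float64) + 1.0)   # +1 avoids log(0)
    intensity = log_rgb.mean(axis=-1)
    # Crude "red predominance" pseudo-hue: log R against the mean of log G and log B.
    hue = log_rgb[..., 0] - 0.5 * (log_rgb[..., 1] + log_rgb[..., 2])
    return hue, intensity

def red_hue_mask(rgb, threshold=None):
    """Binary lip-candidate field from red-hue predominance; a real system
    would refine this field with a Markov random field and motion information."""
    hue, _ = log_hue_intensity(rgb)
    if threshold is None:
        threshold = hue.mean() + hue.std()   # hypothetical sequence-dependent choice
    return hue > threshold
```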
Authors:
Patrice Delmas,
Pierre-Yves Coulon,
Vincent Fristot,
Page (NA) Paper number 2312
Abstract:
Active contours, or snakes, are widely used in object segmentation for
their ability to integrate feature extraction and pixel candidate
linking in a single energy-minimizing process. However, their sensitivity
to parameter values and to initialization is an equally well-known problem.
The performance of snakes can be enhanced by a better initialization close
to the desired solution. We present here a fine mouth region of interest
(ROI) extraction using the gray-level image and the corresponding gradient
information. We link this technique with an original snake method: the
automatic snake uses spatially varying coefficients so as to retain a
mouth-like shape throughout its evolution. Our experiments on a large
image base prove the robustness of the mouth ROI extraction and automatic
snake algorithms to changes of speaker. The main application of our
algorithms is video-conferencing.
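For illustration only, a toy gray-level-gradient ROI estimate in Python; the paper's mouth ROI extraction and automatic snake are considerably more elaborate, and the projection/threshold heuristic and the fraction parameter below are assumptions.

```python
import numpy as np

def mouth_roi(gray, frac=0.25):
    """Toy ROI estimate: project the gray-level gradient energy onto rows and
    columns and keep the span where each profile exceeds a fraction of its
    maximum. 'frac' and the projection heuristic are illustrative assumptions."""
    gy, gx = np.gradient(gray.astype(np.float64))
    energy = gx ** 2 + gy ** 2

    def span(profile):
        idx = np.flatnonzero(profile > frac * profile.max())
        return int(idx[0]), int(idx[-1])

    top, bottom = span(energy.sum(axis=1))
    left, right = span(energy.sum(axis=0))
    return top, bottom, left, right   # candidate box for snake initialisation
```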
Authors:
Mohammed Yeasin,
Subhasis Chaudhuri, Indian Institute of Technology, Bombay (India)
Page (NA) Paper number 1218
Abstract:
Analysis of a dynamic hand gesture requires processing a spatio-temporal
image sequence whose actual length varies with each instantiation of
the gesture. We propose a novel, vision-based system for automatic
interpretation of a limited set of dynamic hand gestures. The temporal
signature of the hand motion is extracted from the performed gesture
and subsequently analyzed by a finite state machine to automatically
interpret the gesture.
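A minimal sketch of interpreting a quantised temporal signature with a finite state machine; the states, symbols, and gestures below are invented for illustration and do not correspond to the paper's gesture set.

```python
from dataclasses import dataclass

@dataclass
class GestureFSM:
    """Consume a quantised temporal signature (e.g. per-frame motion direction
    codes) and report the gesture whose accepting state is reached."""
    transitions: dict          # (state, symbol) -> next state
    accepting: dict            # state -> gesture label
    start: str = "idle"

    def interpret(self, signature):
        state = self.start
        for symbol in signature:
            state = self.transitions.get((state, symbol), self.start)
        return self.accepting.get(state, "unknown")

# Hypothetical two-gesture example: a left-right "wave" versus an upward "raise".
fsm = GestureFSM(
    transitions={
        ("idle", "L"): "wave1", ("wave1", "R"): "wave2", ("wave2", "L"): "wave2",
        ("idle", "U"): "raise1", ("raise1", "U"): "raise2",
    },
    accepting={"wave2": "wave", "raise2": "raise"},
)
print(fsm.interpret(["L", "R", "L"]))   # -> "wave"
```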
Authors:
Athanasios Mouchtaris,
Panagiotis Reveliotis,
Chris Kyriakakis,
Page (NA) Paper number 1799
Abstract:
Immersive audio systems are being envisioned for applications that
include teleconferencing and telepresence; augmented and virtual reality
for manufacturing and entertainment; air traffic control, pilot warning,
and guidance systems; displays for the visually-impaired; distance
learning; and professional sound and picture editing for television
and film. The principal function of such systems is to synthesize,
manipulate and render sound fields in real time. In this paper we examine
several signal processing considerations in spatial sound rendering
over loudspeakers. We propose two methods that can be used to implement
the necessary filters for generating virtual sound sources based on
synthetic head-related transfer functions with the same spectral characteristics
as those of the real source.
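As a rough sketch of generating a virtual source by filtering a mono signal with a pair of head-related impulse responses (binaural form only): the loudspeaker crosstalk-cancellation stage and the paper's synthetic-HRTF design are not shown, and the toy HRIRs are assumptions.

```python
import numpy as np

def render_virtual_source(mono, hrir_left, hrir_right):
    """Filter a mono signal with left/right head-related impulse responses to
    obtain a two-channel feed; loudspeaker playback would additionally need a
    crosstalk-cancellation stage, which is not shown."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    out = np.zeros((max(len(left), len(right)), 2))
    out[:len(left), 0] = left
    out[:len(right), 1] = right
    return out

# Toy HRIRs: a pure interaural time and level difference (purely illustrative).
fs = 44100
mono = np.random.randn(fs)              # one second of test noise
itd = int(0.0004 * fs)                  # ~0.4 ms interaural time difference
hrir_left = np.zeros(itd + 1); hrir_left[0] = 1.0
hrir_right = np.zeros(itd + 1); hrir_right[itd] = 0.7
binaural = render_virtual_source(mono, hrir_left, hrir_right)
```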
Authors:
Norbert K Strobel,
Rudolf Rabenstein,
Page (NA) Paper number 1651
Abstract:
This paper proposes a solution to the problem of robust speaker localization
under adverse acoustic conditions. The approach is based on the classification
of time delay estimates. Two classification techniques are investigated
in detail: maximum likelihood (ML) classification and classification
based on histogram comparison. Their performance under adverse acoustic
conditions is compared with that of the traditional approach, which
uses time delay estimates directly to infer speaker positions.
Experiments indicate that the ML classification method provides little
improvement over the traditional method. On the other hand, using histogram
classification, we can improve the probability of correct speaker localization
by more than 60% compared to either the traditional approach or the
ML classification technique.
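A minimal sketch of classifying a block of time delay estimates by histogram comparison, assuming precomputed, equally binned, normalised reference histograms for each candidate position; histogram intersection is used here as a stand-in for the paper's comparison criterion.

```python
import numpy as np

def histogram_classify(delays, reference_histograms, bins):
    """Pick the candidate speaker position whose stored reference histogram
    best matches the histogram of the observed time delay estimates.
    The references are assumed to use the same bins and to be normalised."""
    hist, _ = np.histogram(delays, bins=bins)
    hist = hist / (hist.sum() + 1e-12)
    scores = {pos: np.minimum(hist, ref).sum()      # histogram intersection
              for pos, ref in reference_histograms.items()}
    return max(scores, key=scores.get)
```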
Authors:
Panayiotis G Georgiou,
Panagiotis Tsakalides,
Chris Kyriakakis,
Page (NA) Paper number 1817
Abstract:
In this paper we address the problem of sound source localization in
the presence of impulsive noise for application in immersive telepresence
and teleconferencing. Traditional Gaussian modeling of noise signals
fails when the signals exhibit impulsive behavior. A new model is used,
namely the Symmetric alpha-Stable (SaS), which can better account for
the outliers that exist in real-world signals. Real data is used to
compare the performance of both the Gaussian and the alpha-stable models.
We demonstrate that the alpha-stable model gives a much better approximation
to the noise signal than the Gaussian model. Furthermore, we study
the problem of Time Delay Estimation (TDE) and we demonstrate the shortcomings
of TDE techniques based on second-order statistics when the noise is
of SaS nature. We propose an alternative to second-order based methods,
based on Fractional Lower-Order Statistics, and demonstrate the achieved
improvement via simulation experiments.
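For illustration, a minimal fractional lower-order cross-statistic for time delay estimation, with p below the characteristic exponent alpha of the noise; this is a generic sketch of the FLOS idea, not the specific estimator evaluated in the paper, and it assumes equal-length microphone signals.

```python
import numpy as np

def flos_tde(x, y, max_lag, p=1.2):
    """Time delay estimate from a fractional lower-order cross-statistic
    (choose p < alpha of the noise). Setting p = 2 recovers ordinary
    second-order cross-correlation."""
    def frac(v):
        return np.sign(v) * np.abs(v) ** (p - 1.0)

    lags = np.arange(-max_lag, max_lag + 1)
    stats = []
    for lag in lags:
        if lag >= 0:
            a, b = x[lag:], y[:len(y) - lag]
        else:
            a, b = x[:lag], y[-lag:]
        stats.append(np.mean(a * frac(b)))
    return int(lags[np.argmax(np.abs(stats))])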
Authors:
Anssi P Klapuri, Signal Processing Laboratory, Tampere University of Technology (Finland)
Page (NA) Paper number 1334
Abstract:
A system was designed, which is able to detect the perceptual onsets
of sounds in acoustic signals. The system is general in regard to the
sounds involved and was found to be robust for different kinds of signals.
This was achieved without assuming regularities in the positions of
the onsets. In this paper, a method is first proposed for determining
the beginnings of sounds that exhibit onset imperfections, i.e., whose
amplitude envelope does not rise monotonically. We then describe
the system itself, which utilizes band-wise processing and a psychoacoustic
model of intensity coding to combine the results from the separate
frequency bands. The performance of the system was validated by applying
it to the detection of onsets in musical signals ranging from rock
music to classical and big band recordings.
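A crude band-wise onset-detection sketch in Python (assuming scipy for the band-pass filters); the psychoacoustic intensity-coding model and the handling of onset imperfections described above are not reproduced, and the band edges, frame length, and threshold are arbitrary.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def bandwise_onsets(x, fs, bands=((80, 300), (300, 1200), (1200, 5000)),
                    frame=0.01, threshold=0.1):
    """Rectify each band, take the frame-wise log energy, mark frames where the
    positive first difference of the log envelope exceeds a threshold, and merge
    detections across bands. Returns onset times in seconds."""
    hop = int(frame * fs)
    onsets = set()
    for lo, hi in bands:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        env = np.abs(sosfilt(sos, x))
        n_frames = len(env) // hop
        energy = env[: n_frames * hop].reshape(n_frames, hop).mean(axis=1)
        diff = np.diff(np.log(energy + 1e-10))
        onsets.update(np.flatnonzero(diff > threshold) + 1)
    return sorted(t * hop / fs for t in onsets)
```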
Authors:
Juyang Weng, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 (USA)
Yong-Beom Lee, Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824 (USA)
Colin H. Evans, Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824 (USA)
Page (NA) Paper number 2205
Abstract:
This paper introduces the developmental approach to speech learning,
motivated by human cognitive development from infancy to adulthood.
Central to the developmental approach is what is called the developmental
algorithm. We introduce AA-learning as a basic learning mode for our
developmental algorithm. The developmental algorithm enables the system
to learn new tasks without a need for reprogramming. Some experimental
results for AA-learning using our developmental algorithm are presented.
Authors:
Phillip L DeLeon, New Mexico State University (USA)
Cormac J Sreenan,
Page (NA) Paper number 1487
Abstract:
Receiver playout buffers are required to smooth network delay variations
for multimedia streams. Playout buffer algorithms, such as those commonly
used on the Internet, autoregressively measure the network delay and its
variation and adjust the buffer delay accordingly to avoid packets
arriving too late. In this work, we attempt to adjust the buffer delay
based on a "prediction" of the network delay and a similar measure
of variation. The philosophy here is that the use of an accurate prediction
will adjust the buffer delay more effectively by tracking rapid fluctuations
more accurately. A proper buffer delay can yield either a lower total
end-to-end delay for a fixed packet-lateness percentage or fewer late
packets for a fixed total end-to-end delay (or both), and both are
important metrics for applications such as IP telephony. We present
a playout algorithm based on a simple normalized least-mean-square
(NLMS) adaptive predictor and demonstrate using Internet packet traces
that it can yield reductions in average total end-to-end delays.
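A minimal sketch of the prediction-based idea: a one-step NLMS predictor of network delay plus a smoothed-error safety margin standing in for the variation measure. The margin factor and smoothing constant are assumptions, not the paper's algorithm.

```python
import numpy as np

def nlms_playout_delays(measured_delays, order=8, mu=0.5, eps=1e-6, safety=4.0):
    """One-step NLMS prediction of the next network delay plus a smoothed
    absolute-error margin; 'safety' and the 0.875/0.125 smoothing are assumptions."""
    w = np.zeros(order)          # predictor coefficients
    history = np.zeros(order)    # most recent delays, newest first
    var = 0.0                    # smoothed absolute prediction error
    playout = []
    for d in measured_delays:
        pred = w @ history                       # predicted delay of this packet
        playout.append(pred + safety * var)      # playout (buffer) delay target
        err = d - pred
        w += mu * err * history / (history @ history + eps)   # NLMS update
        var = 0.875 * var + 0.125 * abs(err)
        history = np.roll(history, 1)
        history[0] = d
    return np.array(playout)
```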