Session: MMEDIA-L2
Time: 3:30 - 5:30, Thursday, May 10, 2001
Location: Room 250 D
Title: Signal Processing for Media Integration
Chair: Aggelos Katsaggelos

3:30, MMEDIA-L2.1
NEW APPROACHES TO AUDIO-VISUAL SEGMENTATION OF TV NEWS FOR AUTOMATIC TOPIC RETRIEVAL
U. IURGEL, R. MEERMEIER, S. EICKELER, G. RIGOLL
This paper presents two new real-time approaches to segmentation of TV news shows into topics. The goal of this research work is the high-precision retrieval of topics from TV news. For that purpose, the detection of correct topic boundaries is of great importance. We introduce a stochastic and a rule-based topic model based on HMMs. The former combines features from the visual as well as from the audio channel of the news show, whereas the latter uses the video channel only. They are compared to the detection of topics using only the audio channel, which is common for many other approaches. The paper contains the following innovations: 1) The detected segment boundaries correspond directly to topics and not to video or audio cuts, as in most other segmentation methods. 2) An advanced stochastic topic model is introduced that uses audio as well as video features. 3) The introduced HMM-based approaches both outperform the audio-based approach. One algorithm has a very good topic boundary detection rate, whereas the other minimizes the number of wrongly inserted boundaries without missing too many real boundaries.

3:50, MMEDIA-L2.2
SPEECH RETRIEVAL WITH VIDEO PARSING FOR TELEVISION NEWS PROGRAMS
H. MENG, X. TANG, P. HUI, X. GAO, Y. LI
We have been working on speech retrieval from Chinese (Cantonese) television news programs. The use of automatic speech recognition for audio indexing produces imperfect transcriptions, and recognition errors affect retrieval performance. A news story typically contains a brief report by the anchor person(s) in the studio, as well as news footage from the field. Investigation shows that our recognizer performs better when indexing audio from the studio, compared to that from the field. In order to automatically extract the "reliable" audio segments for speech retrieval, we attempt to detect studio-to-field transitions by means of video parsing. Our study is based on 146 news stories collected from local television Cantonese news programs. We formulated a known-item retrieval task and adopted the average inverse rank (AIR) as our evaluation metric. Retrieval is performed based on syllable bigram units, augmented with skipped syllable bigrams. Retrieval using the entire audio track of each news story gave AIR=0.759. With the incorporation of video parsing, we performed retrieval based only on the studio recordings, which produced AIR=0.768.
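
The evaluation metric and indexing units named in this abstract can be illustrated with a minimal sketch. It assumes that the average inverse rank is the mean of 1/rank of the known item over all queries, with a contribution of 0 when the item is not retrieved, and that a "skipped" bigram pairs syllables that are two positions apart. The function names and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def average_inverse_rank(ranks: Sequence[Optional[int]]) -> float:
    """Mean of 1/rank over queries; rank is 1-based, None if item not retrieved.

    Assumption: a missed item contributes 0 to the average.
    """
    if not ranks:
        raise ValueError("at least one query is required")
    return sum(0.0 if r is None else 1.0 / r for r in ranks) / len(ranks)


def syllable_bigrams(syllables: Sequence[str],
                     include_skipped: bool = True) -> List[Tuple[str, str]]:
    """Adjacent syllable pairs, optionally augmented with skip-one pairs."""
    units = list(zip(syllables, syllables[1:]))
    if include_skipped:
        # Assumption: a skipped bigram joins syllable i with syllable i + 2.
        units += list(zip(syllables, syllables[2:]))
    return units


# Hypothetical ranks of the target story for three queries.
air = average_inverse_rank([1, 2, 4])

# Hypothetical syllable sequence standing in for a recognized utterance.
units = syllable_bigrams(["s1", "s2", "s3"])
```

Under this definition a value of 1.0 would mean the known item was ranked first for every query, so the reported scores of 0.759 and 0.768 sit on a scale from 0 to 1.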

4:10, MMEDIA-L2.3
VIDEO SCOUTING: AN ARCHITECTURE AND SYSTEM FOR THE INTEGRATION OF MULTIMEDIA INFORMATION IN PERSONAL TV APPLICATIONS
R. JASINSCHI, N. DIMITROVA, T. MCGEE, L. AGNIHOTRI, J. ZIMMERMAN
Currently available Personal Video Recorders find and store whole TV programs. Our system, Video Scouting, not only finds and stores programs; it also automatically segments and indexes story segments from the programs according to viewers' profiles. The extracted descriptions serve the viewers' content information requests for program segment selection, e.g., "play the three-minute interview with Hillary Clinton." To achieve this, the system combines information from the audio, visual, and transcript domains in a probabilistic framework based on Bayesian networks. In this paper we describe the overall architecture and a system implementation, and we discuss some experimental results.

4:30, MMEDIA-L2.4
THE EFFECT OF TEXT IN STORYBOARDS FOR VIDEO NAVIGATION
M. CHRISTEL, A. WARMACK
A storyboard is a presentation scheme for abstracting information in a digital video clip based on imagery. This paper describes a series of storyboard interfaces with added transcript text features. These interfaces are used in a controlled experiment focusing on the utility of transcript text in storyboards for news video navigation. We wished to explore whether such text resulted in improvements in video navigation, and, if so, whether the amount of text and its synchronization with video imagery affected the navigation task. The text-augmented storyboards performed significantly better than storyboards with no text. Full transcript text produced benefits when presented as a block, whereas reduced contextual text descriptions produced benefits when aligned with storyboard image rows.

4:50, MMEDIA-L2.5
MAJOR CAST DETECTION IN VIDEO USING BOTH AUDIO AND VISUAL INFORMATION
Z. LIU, Y. WANG
Major casts, for example the anchor persons or reporters in news broadcast programs and the principal characters in movies, play an important role in video, and their occurrences provide good indices for organizing and presenting video content. This paper describes a new approach for automatically generating the list of major casts in a video sequence based on multiple modalities, specifically both speaker and face information. A list of major casts is created and ordered by the cumulative temporal and spatial presence of the corresponding casts. Preliminary simulation results show that the detected major casts are meaningful and the proposed approach is promising.

5:10, MMEDIA-L2.6
SEQUENCE FAMILIES SETS CONSTRUCTED FROM QUADRATIC CONGRUENCE CODES FOR USE IN SECURE SPREAD SPECTRUM WATERMARKING FOR MULTIMEDIA
C. LIM, S. ABEYSEKERA, S. AMARASINGHE
Digital watermarking for multimedia information has been widely adopted as a measure to detect copyright infringement. Many watermarking schemes use only one watermarking signature to embed hidden information. This paper introduces new multiple sequence families with good auto- and cross-correlation properties, which are well suited to digital watermarking. It is also envisaged that the good cross-correlation property of these multiple families of sequences allows multiple signatures to be used in a watermarking scheme. Multiple signatures increase the robustness of the watermarking scheme, and retrieving independent signatures is straightforward owing to the low cross-correlation of the member sequences.
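
Why low cross-correlation makes independent signatures easy to retrieve can be shown with a small sketch. It does not construct quadratic congruence codes; as a stand-in it assumes three short, mutually orthogonal +1/-1 sequences, an additive embedding with a hypothetical strength `alpha`, and a normalized correlation detector. All names and values are illustrative only.

```python
from typing import List, Sequence


def embed(host: Sequence[float], signatures: Sequence[Sequence[int]],
          alpha: float) -> List[float]:
    """Additively embed every signature into the host with strength alpha."""
    marked = list(host)
    for sig in signatures:
        for i, chip in enumerate(sig):
            marked[i] += alpha * chip
    return marked


def detect(marked: Sequence[float], signature: Sequence[int]) -> float:
    """Normalized correlation between the marked signal and one signature."""
    return sum(m * s for m, s in zip(marked, signature)) / len(signature)


# Stand-in signatures (assumption): mutually orthogonal +1/-1 sequences,
# not members of the sequence families described in the abstract.
s1 = [1, -1, 1, -1, 1, -1, 1, -1]
s2 = [1, 1, -1, -1, 1, 1, -1, -1]
s3 = [1, 1, 1, 1, -1, -1, -1, -1]

host = [0.0] * 8          # hypothetical host signal, zero for clarity
alpha = 0.5               # hypothetical embedding strength
marked = embed(host, [s1, s2], alpha)

present = detect(marked, s1)   # signature that was embedded
absent = detect(marked, s3)    # signature that was not embedded
```

Because the stand-in sequences have zero cross-correlation, the detector response for an embedded signature depends only on `alpha`, and the response for an absent signature is zero. With a real host signal and sequences that only have low, not zero, cross-correlation, the responses would be approximate and a decision threshold would be needed.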