MUSCLE Showcase: Content Analysis Showcase and Evaluation Web Portal
(MUSCLE NoE)
The goal of this showcase is to bring together and demonstrate the wide range of semantic analysis and annotation capabilities present within MUSCLE. Based on contributions from the participants, a video database was compiled from short TV recordings of different genres (e.g. news reports, music clips, commercials, etc.). These were integrated and shared by all team members as well as provided in the evaluation web portal.
Showcase participants will perform whatever semantic extraction and analysis (single- or multimodal) they can apply to the videos, such as all kinds of low-level feature extraction, face detection, moving-object detection, fire and smoke detection, logo detection, music genre analysis, music segment clustering, speech recognition, text detection and recognition, etc. To achieve this, partners are allowed to use all kinds of algorithms, additional external information, as well as additional data they may have and use within their own labs to enhance the information extracted from the video.
The CASEWP web portal is particularly meant to evaluate the results of temporal segmentation tools. The web portal will provide all the necessary environment resources (e.g. free data sets and annotations) and will enable evaluation of state-of-the-art methods outside the constrained timelines of scientific evaluation campaigns. The goal here is to offer the research community an online evaluation tool which objectively measures various temporal segmentation results “on demand”, and which indirectly promotes the best technology.
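The portal's scoring method is not specified here, but a typical objective measure for temporal segmentation compares detected segment boundaries against ground-truth annotations within a small tolerance window. The following is a minimal sketch of such a measure; the function name, argument layout, and the 0.5-second tolerance are illustrative assumptions, not part of the portal's actual interface:

```python
def boundary_precision_recall(detected, reference, tolerance=0.5):
    """Score detected segment boundaries (in seconds) against a
    ground-truth annotation. A detection counts as correct if it
    falls within `tolerance` seconds of a not-yet-matched
    reference boundary."""
    matched = 0
    used = set()
    for d in detected:
        for i, r in enumerate(reference):
            if i not in used and abs(d - r) <= tolerance:
                matched += 1
                used.add(i)
                break
    precision = matched / len(detected) if detected else 0.0
    recall = matched / len(reference) if reference else 0.0
    return precision, recall
```

With `detected = [10.2, 35.0, 61.1]` and `reference = [10.0, 35.4, 80.0]`, two of three detections match, giving precision and recall of 2/3 each; averaging such scores over a test set yields the kind of objective "on demand" comparison described above.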
Results of the showcase will demonstrate the wide range of current semantic analysis algorithms developed within MUSCLE on a single common database.
The result of this e-team in the first stage will thus be a practical, state-of-the-art showcase of the feature extraction and semantic annotation capabilities and competences within MUSCLE.
The results will be:
- a video to be used for analysis, serving as a common, joint showcase of what we can achieve by applying all our techniques from different perspectives
- a set of different features for audio, video, images, and text extracted from this video that may later on be used by other partners within MUSCLE
- a comprehensive set of annotations performed on this video using the variety of machine learning techniques employed by MUSCLE members
Collaboration and Vision
The ambitious goal of this e-Team is to integrate the activities and competencies of all partners active in extracting information from multimedia data in single and multiple modalities.
To this end we have started to collaboratively build a joint dataset including recordings from different national TV
stations and private holdings.
These are now to be distributed among the participating partners, with each partner first applying their own analysis techniques individually. From this stage on, collaboration will happen on two levels:
- On the one hand, a number of MUSCLE partners are working on related aspects, using similar techniques for different purposes, or different approaches to achieve the same goal. Since all partners will now be working on a heterogeneous but joint dataset, we envision a number of research exchange visits, ranging from short-term (2-3 days) to longer-term (2 weeks), in order to gain a better understanding of each other's techniques and the algorithms/tools developed and used, so that actual tools can be exchanged and used by more people in the individual labs, specifically at the graduate and PhD student levels.
- On the other hand, a lot of potential rests in the wealth of individual features and single semantic elements extracted by the individual partners. By combining all the resulting pieces of information, much semantic information may be learned at a higher level. This will allow all partners to build upon the work of the other groups, utilizing features and semantics that they could not possibly have created within their own institutions based on the locally available know-how. At this level we envisage a set of longer-term exchange visits or workshop meetings by 3-4 teams in a single place in order to analyze the potential of integrating the different feature sets, and to come up with new, challenging proposals on higher levels of semantics to be learned from the data available.
Overall, this e-Team shall foster the integration and mutual understanding of each other's algorithms and approaches by jointly working on a common real-world data source, sharing data, algorithms and results.
- Vienna University of Technology: Andreas Rauber, Thomas Lidy, Robert Neumayer
- University of Amsterdam: Cees Snoek, Nicu Sebe
- Cambridge University: Julien Fauqueur, Ryan Anderson, Nick Kingsbury
- AIIA - AUTH, Thessaloniki: Costas Kotropoulos, Emmanouil Benetos
- Univ. Surrey: Bill Christmas
- TAU: Arie Yeredor
Contribution of partners
- All: all partners will contribute to recording and building a joint video from heterogeneous video sources (preferably TV recordings of news broadcasts, music video clips, and commercials) in order to have a wide range of different characteristics, objects and events present in the video.
Subsequently, the video will be segmented into its audio and video streams, and a set of keyframes will be extracted. These will be used separately or in combination to allow partners to apply their respective image, audio and video indexing tools to extract features.
Specifically, we are looking for approx. 10 minutes of each of the following:
- Sports TV recordings (single sport as well as sport news)
- Music video clips
- Soap operas
- Movies (Color, B/W, different genres,...)
- Home Movies (birthdays, vacations, ...)
Recordings should be taken, if possible, from national TV stations rather than international ones to get different flavours of broadcasting, different languages, etc. Recordings should be MPEG-1 encoded. All partners willing to participate should drop me an email specifying which recordings they will be able to provide, so I can compile a list and see what we will be getting. We'll then arrange for data upload and combination into a single video.
- TUWIEN-IFS: TUWIEN will focus on the analysis of audio features from the resulting video stream. Specifically, we will extract a range of features comprising:
- Rhythm Patterns
- Statistical Spectrum Descriptors
- Roughness Features
- a set of standard features such as energy, zero-crossings, etc.
These will then be used to analyze the audio stream both statistically and with machine learning techniques (clustering, classification) to identify:
- speech/audio/speech+audio separation, classification into male/female speech
- classification into different music genres
- cluster analysis of the different audio characteristics present
- integration and comparison with other audio sources and standard reference sets, e.g. from the MIREX competition
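To illustrate the "standard features" category above, short-time energy and zero-crossing rate can be computed directly from the raw samples. This is a minimal pure-Python sketch; the function name and frame size are illustrative assumptions, not TUWIEN's actual implementation:

```python
def frame_features(samples, frame_size=1024):
    """Compute short-time energy and zero-crossing rate for each
    non-overlapping frame of a mono audio signal (sequence of floats).
    Returns a list of (energy, zcr) tuples, one per frame."""
    features = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # mean squared amplitude of the frame
        energy = sum(s * s for s in frame) / frame_size
        # a zero crossing occurs when consecutive samples change sign
        zcr = sum(
            1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
        ) / (frame_size - 1)
        features.append((energy, zcr))
    return features
```

For a pure 440 Hz sine at an 8 kHz sampling rate, the energy per frame is close to 0.5 (the mean square of a unit sine) and the zero-crossing rate close to 2 x 440 / 8000 crossings per sample; real features such as Rhythm Patterns build on considerably richer spectral processing.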
- UoA: The University of Amsterdam will focus on the automatic detection of 30-50 high-level semantic concepts in the video stream, such as table, car, crowd, and mountain. Specifically, we will analyse the visual modality using our Wiccest features, which combine color invariance with natural image statistics. In case other participants make a speech transcript available, we will perform a multimodal analysis to further improve the performance of the detected concepts.
In addition, we can also provide basic video analysis techniques such as:
- camera motion results
- shot-detection results
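Shot detection of the kind listed above is commonly based on thresholding the difference between colour histograms of consecutive frames. The following is a minimal sketch of that general idea, not UoA's actual method; the grey-level histogram, bin count, and threshold are illustrative assumptions:

```python
def detect_shots(frames, bins=16, threshold=0.4):
    """Return frame indices where a shot cut is likely, by comparing
    normalised grey-level histograms of consecutive frames.
    `frames` is a list of 2-D lists of pixel values in [0, 255]."""
    def histogram(frame):
        counts = [0] * bins
        n = 0
        for row in frame:
            for px in row:
                counts[min(px * bins // 256, bins - 1)] += 1
                n += 1
        return [c / n for c in counts]

    cuts = []
    prev = None
    for i, frame in enumerate(frames):
        hist = histogram(frame)
        if prev is not None:
            # L1 distance between histograms; a large jump suggests a cut
            if sum(abs(a - b) for a, b in zip(hist, prev)) > threshold:
                cuts.append(i)
        prev = hist
    return cuts
```

For example, a sequence of two dark frames followed by two bright frames yields a single detected cut at the first bright frame; production systems add gradual-transition handling and adaptive thresholds on top of this basic scheme.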
- CAM: We propose to perform the unified extraction of different types of low-level multiscale features which cover the major types of visual saliency: blobs, corners, edges, and ridges. Detected features can be visualised as an overlay on the video. These features can serve as a strong basis for various future higher-level tasks such as object recognition and scene classification. The tradeoff between detection accuracy and speed will be considered depending on the real-time constraints of the application.
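As a simple illustration of the edge component of such low-level features, the per-pixel gradient magnitude of a greyscale image can be computed with central differences. This sketch is only a toy single-scale example, not CAM's actual multiscale detector:

```python
def gradient_magnitude(image):
    """Approximate per-pixel gradient magnitude of a greyscale image
    (list of rows of floats) using central differences; border pixels
    are left at zero. Large values indicate edge-like saliency."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y][x + 1] - image[y][x - 1]) / 2.0
            gy = (image[y + 1][x] - image[y - 1][x]) / 2.0
            out[y][x] = (gx * gx + gy * gy) ** 0.5
    return out
```

On an image containing a vertical step from 0 to 100, the response peaks along the step and is zero in flat regions; multiscale detectors repeat this kind of differentiation on smoothed versions of the image to also capture blobs and ridges.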
- other partners: work description will be added as soon as the data is made available for analysis.
Tentative Work Plan
- Months 1-2: data acquisition: discussion of restrictions concerning the video to be created to ensure a certain heterogeneity of content, duration, etc. Subsequently, all partners will record a short video sequence, which will then be joined to form a single video for analysis. This video will be made available to eTeam partners in several forms (e.g. video formats, audio only, set of images).
- Months 3-7: Partners will analyze the data using their standard techniques, extracting the set of features they have implemented, and sharing these with the other partners. Results will be presented and discussed during a meeting at the end of this period, and a joint showcase will be developed.
- Months 8-12: The various feature sets extracted will be combined and used to extract higher-level and more robust semantic descriptors. In order to ensure know-how transfer as well as foster collaboration, a set of pairwise exchanges (1-2 weeks) is foreseen in this period. Results stemming from the combination of different feature sets will be published at international conferences, as well as demonstrated as a prototype showcase.
Current status of Work
- April 3: Work on data collection has started - recordings are being made in the period of April 3-7, with date synchronization particularly for the news segments. Recordings are collected from a range of national TV stations as well as private (home) video holdings in Dutch, German, English, French, Greek, Italian, and Hebrew.
- April 21: Data has been transferred to a central server - we are about to start with cleansing and data unification.
- This will be followed by data integration, producing a compiled data set for analysis as well as the preparation of individual data streams (video, audio-only, images) - data should be available by May/June.
- June 28: the data submitted so far has been cleansed and is now available on the MUSCLE data server for MUSCLE members to download. Currently available formats are:
- original video streams
- MPEG-1 video streams
- MP3 audio streams
We are still looking for partners able to extract a series of images (one image every 10 to 30 seconds) to form the image retrieval test set for CAS.
Also available on the server is the first set of feature vectors, namely a range of audio descriptors for the individual pieces. More feature sets will be added as they are provided by e-team members.
Dept. of Software Technology and Interactive Systems
Vienna Univ. of Technology
Favoritenstr. 9 - 11 / 188
A - 1040 Wien