Abstracts
14:00 – 15:30 |
Opening Session & Keynote Speakers
Workshop Chair: Catherine Quinquis (France Telecom, France) |
Prof Jens Blauert, Professor Emeritus (University of Bochum,
Germany); Models of the Binaural Hearing System: The Precedence Effect
MODELS OF THE BINAURAL HEARING SYSTEM
(co-authored by Jonas Braasch)
– Prominent Features of Binaural Hearing
– Architecture of a Model of Binaural Hearing
– The Jeffress Processor
– The Lindemann/Gaik Extensions
– Interpreting Binaural Activity
– The Effect of Interaural Incoherence
– Binaural Speech Enhancement
– Problems of Current Binaural Models
– Future Work
THE PRECEDENCE EFFECT
(co-authored by Jonas Braasch)
The acoustic modality is of paramount importance for human inter-individual
communication. Consequently, the human auditory system is highly
differentiated and able to perform sophisticated tasks such as the
identification, recognition and segregation of concurrent sound sources in
acoustically adverse conditions, e.g. in reverberant or noisy environments. To
this end, the different stages of the system at the peripheral, sub-cortical
and cortical levels act in a coordinated manner.
In this lecture we take the auditory Precedence Effect as an example to
discuss the role of the different stages of the auditory system in complex
sound-localisation tasks. Further, we consider different strategies of
modelling auditory functions. |
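As a rough illustration of the Jeffress-type cross-correlation stage named in the outline above, the following Python sketch estimates an interaural time difference (ITD) as the lag that maximises the cross-correlation of the two ear signals. The synthetic input, sampling rate and lag range are assumptions made for this example only; the code is not taken from the lecture material.

    import numpy as np

    def estimate_itd(left, right, fs, max_itd=1e-3):
        """Estimate the interaural time difference as the lag (within a
        physiologically plausible range of about +/-1 ms) that maximises the
        cross-correlation of the two ear signals, in the spirit of a
        Jeffress-type coincidence (delay-line) model."""
        max_lag = int(max_itd * fs)
        lags = np.arange(-max_lag, max_lag + 1)
        corr = [np.dot(np.roll(left, lag), right) for lag in lags]
        best_lag = lags[int(np.argmax(corr))]
        return best_lag / fs                     # seconds; positive when the right ear lags

    fs = 48000
    src = np.random.randn(int(0.05 * fs))        # broadband test signal, 50 ms
    left = src
    right = np.roll(src, 12)                     # right ear delayed by 12 samples (0.25 ms)
    print(estimate_itd(left, right, fs) * 1e3, "ms")   # expected: about 0.25 ms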
Prof Sabine Meunier (CNRS LMA, Marseille, France)
Loudness can be defined as the subjective intensity of a sound, that is, how
strong a sound seems to a listener. Although this definition seems to link
loudness only with sound intensity, loudness depends on other parameters of
the sound. The links between loudness, sound pressure level, frequency,
bandwidth and duration are well known. Based on research carried out over many
years, loudness models have been developed, some of which are now included in
standards. Nowadays, research focuses on the loudness of non-stationary sounds
and on the effect of context. The question of which psychophysical method is
best suited to measure loudness is still a topical one.
In this presentation, the relationship between loudness, sound pressure
level, frequency, bandwidth and duration will be shown and the questions
addressed by current research will be presented. |
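One of the loudness/level relationships the presentation will review can be illustrated with the classic textbook rule that loudness in sones roughly doubles for each 10 phon increase above 40 phon; the sketch below is a minimal example of that rule only, not the speaker's loudness model.

    def sones_from_phons(loudness_level_phon):
        """Textbook approximation, valid above roughly 40 phon: 40 phon equals
        1 sone, and loudness roughly doubles for each 10 phon increase."""
        return 2.0 ** ((loudness_level_phon - 40.0) / 10.0)

    for phon in (40, 50, 60, 70, 80):
        print(phon, "phon ->", sones_from_phons(phon), "sone")
    # 1, 2, 4, 8, 16 sone: each 10 phon step (10 dB for a 1 kHz tone)
    # is heard as roughly twice as loud.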
Master of Ceremonies: Jean-Yves Monfort (France Telecom, France) |
15:30 – 16:00 |
Coffee break |
16:00 – 17:30 |
Round Table with SDOs (Standards Development Organizations) |
09:00 – 10:30 |
SESSION 1: Loudness
Coordinator: Gerald Lecucq (Alcatel, France) |
Sridhar Kalluri (Starkey Hearing Research Center, USA): Effect on
sound quality of extending the bandwidth of amplification to high
frequencies in hearing-impaired listeners
While it is becoming possible for hearing aids to provide a broader frequency
range of amplification than was possible in the past, there is little
consistent objective evidence that a greater audible bandwidth gives
perceptual benefit to hearing-impaired (HI) listeners. This study investigates whether
extending the bandwidth of reception to high frequencies gives an
improvement of sound quality to HI listeners, as it does for normal-hearing
listeners.
We address this question by asking 10 moderate HI listeners to rate their
preference in terms of sound quality of different upper frequency limits of
amplification (4, 6, 8, 10 and 12 kHz) in paired-comparison trials. Subjects
rate the quality of 3 music samples and 1 speech sample, with all samples
selected to have significant high-frequency content. Stimuli are amplified
linearly according to a high-frequency version of the CAMEQ prescription
that compensates for the subject’s hearing loss.
Inconsistent findings from past studies regarding the efficacy of increasing
bandwidth may be due to insufficient audibility of high-frequency energy.
The problem stems from the difficulty of verifying sound levels at the ear
drum at high frequencies. The present study addresses verification of
audibility by measuring sound levels in each subject with a probe microphone
placed approximately 2-3 mm from the ear drum. The proximity of the probe
tip to the ear drum helps overcome the across-subject variability in the
levels of high-frequency components of sound due to individual differences
in ear-canal geometry. The study also verifies audibility by measuring the
ability of individual subjects to discriminate the different bandwidth
conditions for every stimulus sample used in the assessment of sound
quality.
We will discuss the results of the experiment and its implications for the
bandwidth of amplification in moderate HI listeners. |
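A minimal, hypothetical sketch of how paired-comparison preferences of the kind described above could be tallied into a per-condition score follows; the trial data and the simple win-proportion scoring are invented for illustration and are not the authors' analysis.

    from collections import Counter

    # Hypothetical trial outcomes: (preferred upper limit, rejected upper limit) in kHz.
    trials = [(10, 4), (8, 4), (12, 6), (10, 6), (8, 6), (12, 10), (6, 4), (10, 8)]

    wins = Counter(preferred for preferred, _ in trials)
    appearances = Counter()
    for a, b in trials:
        appearances[a] += 1
        appearances[b] += 1

    # Preference score: proportion of comparisons each bandwidth condition won.
    for condition in (4, 6, 8, 10, 12):
        score = wins[condition] / appearances[condition]
        print(condition, "kHz:", wins[condition], "of", appearances[condition], "->", round(score, 2))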
Arnault Nagle (France Telecom, France): Assessment of audio
codecs in the context of VoIP audio conferencing: Monaural vs diotic
listening
In VoIP audio conferencing, audio rendering is usually proposed over either
handsets or headphones, which means two distinct kinds of listening
condition: monaural or diotic. The goal of our study is to determine whether
listening over the monaural or diotic condition has an impact on the
perceived quality of speech processed by VoIP codecs.
We performed two ACR tests: one in narrowband and one in wideband. Each test
had two sessions: one with monaural listening and the other with diotic
listening. Both tests were performed using 32 different listeners, divided
into four groups of eight listeners each. The processed speech material was
presented randomly to each group, seated in an acoustically conditioned
sound room following the P.800 requirements. The speech material used was
extracted from the France Telecom French speech database, which consists of
simple, meaningful short sentences recorded in a quiet environment.
The listening level was not the same in the monaural and diotic conditions, in
order to keep the loudness equivalent. It was set to 79 dB SPL for the
monaural condition, whereas a decrease of 10 dB was applied per channel over
the whole bandwidth for the diotic condition (69 dB SPL).
It is shown that the listening condition has a significant effect on the
perceived codec quality. For diotic listening, quality is judged more severely
when speech is degraded, for instance by packet loss or a low bit rate. Diotic
listening seems to help subjects better discriminate degradations. In
addition, the difference in listening level between the monaural and diotic
conditions tends to mask noise defects, which points out the potential weight
of the listening level in quality evaluation.
Depending on the codec, that is, on the degradation introduced (packet loss or
bit rate), the impact can be more or less pronounced, resulting in shifts in
codec ranking between the two listening modes. Conversely, in comparison with
monaural listening, diotic listening highlights the benefits of high-quality
codecs. These results suggest that audio codecs should be chosen carefully for
each use case. |
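The 79 dB SPL monaural versus 69 dB SPL per-channel diotic levels quoted above amount to a fixed correction for binaural loudness summation; the tiny sketch below just restates that arithmetic, with the 10 dB figure taken from this abstract rather than from any general loudness model.

    def diotic_level_per_channel(monaural_level_db_spl, summation_correction_db=10.0):
        """Per-channel presentation level for diotic listening intended to match
        the loudness of a monaural presentation, assuming the fixed correction
        used in the study described above (an assumption of this example)."""
        return monaural_level_db_spl - summation_correction_db

    print(diotic_level_per_channel(79.0))   # -> 69.0 dB SPL per channel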
10:30 – 11:00 |
Coffee break |
11:00 – 12:30 |
SESSION 2: Modelling: binaural, spatialisation
Coordinator: Thomas Sporer (Fraunhofer, Germany) |
Gunilla Berndtsson (Ericsson Research, Sweden): Creation of
test material that simulates the stereo capture of a teleconference site
We consider audio- and videoconferencing to be a key application for
wideband and super wideband stereo codecs. Hence it is crucial that these
codecs perform well on stereo audio signals captured at the participating
sites. In order to test the performance of these codecs for this key
application it is important to have good test material that is
representative of such teleconferencing sessions.
In this contribution we begin by discussing the audio scene to be captured
and its key spatial audio components, which are the reverberated signals of
the main speakers in the room and background noise containing both diffuse
components and spatially placed components such as interfering talkers. We
then discuss the captured stereo image of this audio scene and the most
important spatial characteristics that need to be preserved in order to
deliver a good stereo image of the audio scene. We go on to describe the main
stereo capture methods used and the kind of spatial characteristics their
stereo images are able to deliver.
The contribution closes by proposing several concrete audio scenes that we
feel are representative of a teleconferencing session and methods for
creating test material that simulates the proposed audio scenes. |
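One hypothetical way to synthesise such a test scene, sketched below, is to place a (reverberated) talker in the stereo image with a constant-power pan and add independent noise in each channel as a stand-in for the diffuse background; the pan law, levels and signals are assumptions of this example, not the authors' capture method.

    import numpy as np

    def pan_constant_power(mono, azimuth):
        """Place a mono signal in the stereo image; azimuth runs from
        -1.0 (hard left) to +1.0 (hard right)."""
        theta = (azimuth + 1.0) * np.pi / 4.0          # map to 0..pi/2
        return np.cos(theta) * mono, np.sin(theta) * mono

    fs, dur = 32000, 2.0                               # assumed super-wideband-ish rate
    n = int(fs * dur)
    talker = np.random.randn(n)                        # stand-in for a reverberated talker
    left_t, right_t = pan_constant_power(talker, azimuth=-0.5)   # slightly to the left

    noise_gain = 10.0 ** (-20.0 / 20.0)                # diffuse floor 20 dB below the talker
    left = left_t + noise_gain * np.random.randn(n)
    right = right_t + noise_gain * np.random.randn(n)
    stereo = np.stack([left, right], axis=1)           # stereo test item for the codec under test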
Peter Hughes (BT Group, UK): Conferencing with Spatial Audio
Most people take part in telephone conferences from time to time, and will
be very familiar with both their benefits (practical multi-party
communications) and drawbacks (stilted conversations, difficulty in
recognizing who is talking, frequent poor-quality audio, etc.). In addition,
teleconferencing can be fatiguing due to the telephony quality, the relatively
long duration of teleconferences compared to normal telephone calls, and the
majority of the time being spent listening rather than talking.
Contrast this to real life, where we hear sounds from all around us and our
two ears enable us to not only locate where a sound is coming from and turn
to face it if required, but also to filter it out from background noise or
other talkers. This gives us the ability to focus on single conversations
amongst many, the so-called 'cocktail party effect'.
To investigate the benefits of employing spatial sound in audio conferencing
systems, a PC-based SIP VoIP client called the “Senate” has been developed
with the following key features:
- PC-based audio client capable of playing both streamed speech and
local sound files in a spatial environment.
- Spatial sound using HRTF-based 3D audio processing or cinematic 5-channel
reproduction.
- Wideband speech using the AMR-WB wideband coder.
- Named talkers with graphic icons in a virtual room.
- Visual indication of who is talking.
- Audio Smileys: the ability to mix sound effects and other audio into the
transmitted audio stream.
This paper will discuss a number of topics based on the Senate including the
benefits of spatial audio conferencing, extensions into the consumer market,
implementation issues including efficient network usage and some ideas for
user interfaces for both PCs and other communications devices. |
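The HRTF-based rendering mentioned in the feature list boils down to convolving each talker's mono signal with a left/right head-related impulse response pair and summing the results; the sketch below shows only that core operation, assumes the HRIR arrays are already available, and is not the Senate implementation.

    import numpy as np
    from scipy.signal import fftconvolve

    def spatialise(mono, hrir_left, hrir_right):
        """Render a mono talker at the position encoded by an HRIR pair
        (both impulse responses are assumed to have the same length)."""
        return np.stack([fftconvolve(mono, hrir_left),
                         fftconvolve(mono, hrir_right)], axis=1)

    def mix_conference(talkers, hrirs):
        """Spatialise each talker with its own HRIR pair and sum into one binaural mix."""
        rendered = [spatialise(sig, hl, hr) for sig, (hl, hr) in zip(talkers, hrirs)]
        length = max(r.shape[0] for r in rendered)
        mix = np.zeros((length, 2))
        for r in rendered:
            mix[:r.shape[0], :] += r
        return mix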
Mansoor Hyder (University of Tübingen, Germany): 3D Telephony
Telephony is a well-established and important tool for interpersonal
communication. Despite the revolutionary expansion in the use of telephony
brought about by IP-based services and mobile phones, telephony as a medium
has stagnated.
The basic principle of a microphone and speaker has not changed. The major
limitation of today's telephony systems is that the location of a person
speaking cannot be identified. This adds to the problem of poor quality,
especially in multi-user scenarios. This research therefore aims to extend
telephony into the third dimension. This will enable users to locate sound
sources in space; after all, our ears and perceptual abilities are naturally
binaural, something that has not yet been exploited by the telecommunications
industry.
Building on IP-based telephony, advanced codecs, and recent developments in
micro-mechanical tracking sensor technology, components are emerging that make
a fully fledged 3D telephony system feasible at modest cost.
In this work we describe the design of a system using innovative 3D audio
rendering based on Uni-Verse, head tracking using a MEMS sensor, and an
IP-based VoIP protocol. We will also give an overview of the work in progress
on the implementation of our prototype.
This 3D phone can be used in a conferencing solution, where conference calls
would be more realistic because the participants could identify who is
talking by locating the origin of the sound. In addition, many non-verbal
cues, such as head or body movements, can be heard through the resulting
changes in acoustic delays and echoes. |
12:30 – 14:00 |
Lunch break |
14:00 – 15:30 |
SESSION 3: Artificial Head, Ear and Mouth
Coordinator: Luc Madec (B&K, Denmark) |
Hans Gierlich (HEAD acoustics GmbH, Germany): Optimum frequency
response characteristics for Wideband Terminals
In ETSI standards ES 202 739 and ES 202 740, a new testing technique for the
measurement of wideband terminals is introduced. Tolerance masks are given
for the sending and receiving frequency response characteristics. As an
important new concept in these standards, the free-field reference point,
rather than the ERP, is used for determining the response characteristics in
the receiving direction. Nevertheless, the question remains open as to what
extent the frequency response tolerance masks in the sending as well as the
receiving direction can be relaxed without compromising good wideband
transmission performance.
Subjective tests have been carried out in order to derive the impact of
non-optimum receiving frequency response characteristics on the perceived speech
sound quality. Different experiments are described and the results are
discussed. Based on the test results and respective frequency response
characteristics, a tolerance mask is proposed which guarantees a maximum
speech sound quality in receiving direction, assuming impairments solely
stemming from different frequency response characteristics.
In an additional set of experiments, the listening quality in sending
direction was assessed under different types of background noise, and using
different types of wideband terminals. The aim of this investigation was to
find desirable sending frequency response characteristics with and without
background noise at the near end, and to possibly give general
recommendations in case of speech with near end background noise. The
subjective experiments are introduced and the results will be discussed. |
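A small, hypothetical sketch of how a measured response could be checked against a tolerance mask of the kind discussed above is given below; the mask values are placeholders and are neither the masks of ES 202 739/740 nor the mask proposed in the talk.

    # Hypothetical tolerance mask: (frequency in Hz, lower limit in dB, upper limit in dB).
    mask = [(100, -6.0, 6.0), (200, -4.0, 4.0), (1000, -3.0, 3.0),
            (4000, -3.0, 3.0), (7000, -6.0, 6.0)]

    def check_against_mask(response_db, mask):
        """response_db maps frequency (Hz) to the measured response relative to
        the target (dB); returns the mask points that are violated."""
        violations = []
        for freq, lower, upper in mask:
            value = response_db.get(freq)
            if value is not None and not (lower <= value <= upper):
                violations.append((freq, value, lower, upper))
        return violations

    measured = {100: -2.1, 200: 0.5, 1000: 1.2, 4000: -3.8, 7000: 2.0}
    print(check_against_mask(measured, mask))   # -> [(4000, -3.8, -3.0, 3.0)]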
Gaetan Lorho, David Isherwood (Nokia Corporation): Acoustic impedance characteristics of artificial ears for telephonometric use
"Artificial ears are an integral part of the audio design process for
telephony devices such as mobile phones. The mechanical and
electro-acoustical characteristics of these artificial ears should primarily
provide an overall acoustic impedance similar to that of the average human
ear over a given frequency range. This paper presents work conducted within
the ITU-T Study Group 12 to quantify the degree of similarity between human
ears and a subset of ITU-T Rec. P.57 Type 3 artificial ears with respect to
their acoustic impedance when measured using a mobile phone-like device. |
15:30 – 16:00 |
Coffee break |
16:00 – 17:30 |
SESSION 4: Terminals characteristics and teleconferencing
Coordinator: Hans Gierlich (HEAD acoustics GmbH, Germany) |
Pascal Huart (Cisco, France): User perception and end-point
characteristics
Phone acoustic characteristics can only be adjusted to a limited extent
using embedded real-time signal processing; therefore, the endpoint audio
performance should be considered from the earliest stages of the design.
The presentation intends to cover some endpoint characteristic limitations
and the end-user perception of band extensions. The limitations considered
concern the transducers as well as the mechanical and industrial design of a
typical enterprise phone. |
Christian Hoene (University of Tübingen, Germany): An
Open-Source Softphone for Network Musical Performances
Playing musical instruments over the telephone is very demanding, because
the quality requirements are far higher than those of a normal conversation.
First, the acoustic latency or "mouth-to-ear" delay must be about 20 ms,
because acoustic waves travel a distance of about 7 meters in 20 ms. Any
larger delay, and hence distance, makes it difficult for musicians to keep
synchronized [1]. Second, the transmission shall provide high-quality
reproduction of sound that is very faithful to the original.
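The 20 ms and 7 m figures above are simply the delay budget multiplied by the speed of sound; a one-line check, with the room-temperature speed of sound as the only assumption:

    SPEED_OF_SOUND_M_PER_S = 343.0           # approximate value in air at about 20 degrees C
    delay_s = 0.020                          # 20 ms mouth-to-ear budget quoted above
    print(SPEED_OF_SOUND_M_PER_S * delay_s)  # -> 6.86 m, i.e. roughly the 7 meters quoted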
Network music performance solutions have been presented [2][3] using, for
example, the Ultra-Low-Delay codec of Fraunhofer IIS [4]. However, in
today's world of Internet telephony, softphones can be downloaded and used
for free, VoIP-to-VoIP calls can be made without paying any fees, and many
VoIP and SIP applications are available as open source. Thus, it will be
difficult to make network music performances a success story if one has to
pay for the license of a patented codec.
In this work, we present our open-source softphone, in which we combine the
open-source phone "Ekiga" [5] with the Bluetooth "SBC" audio codec [6][7]
and a packet loss concealment algorithm based on ITU G.711 Appendix I [8].
The latter we extend to support the full bandwidth. Our softphone solution
allows high-quality stereo audio at very low algorithmic delays and modest
compression rates. All algorithms are free of royalties and their source
code is available.
In addition, we show the results of subjective MUSHRA tests [9] and
objective assessment using ITU P.862.2 [10] and BS.1387-1 [11] on the
audio quality of our solution, testing the performance of the packet loss
concealment in cases of speech, singing and audio source material. We also
measure the coding performance of Bluetooth SBC while varying encoding
parameters such as the number of subbands, quantization bits and compression
modes.
Finally, we conclude with an outlook on further tasks in research and
standardization. |
Hans Gierlich (HEAD acoustics GmbH, Germany): Echo perception
in wideband telecommunication scenarios
So far, subjective tests leading to echo loss requirements have been
conducted mostly in narrowband telecommunication systems. It can be assumed
that the requirements in wideband telecommunication systems may be
different, first, due to the higher quality expectation of the users and
second due to the different perception of high frequency echo components.
Furthermore one-dimensional instrumental parameters such as weighted
terminal coupling loss (TCLw) cannot adequately describe echo impairments.
In addition, frequency dependent and temporal echo impairments may have to
be taken into account.
In a subjective test different types of echo impairments introduced using an
echo simulation were investigated. The subjective testes were conducted
according to ITU-T Rec. P.831. The wideband terminal was simulated including
the typical sidetone path. The test conditions and the test procedure will
be described in detail. The results of the subjective test will be discussed
and conclusions will be drawn on the required spectral echo attenuation. |
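To see why a single figure such as TCLw cannot capture spectral echo structure, the hedged sketch below collapses a frequency-dependent echo loss curve into one number by a plain in-band power average; this is a simplified stand-in, not the actual ITU-T weighting procedure, and the example curves are invented.

    import numpy as np

    def single_number_echo_loss(freqs_hz, loss_db, band=(300.0, 3400.0)):
        """Collapse a frequency-dependent echo loss curve into one number by
        averaging the echo power over the band and converting back to dB
        (a simplified stand-in for TCLw-style single-number figures)."""
        freqs_hz = np.asarray(freqs_hz, dtype=float)
        loss_db = np.asarray(loss_db, dtype=float)
        in_band = (freqs_hz >= band[0]) & (freqs_hz <= band[1])
        mean_echo_power = np.mean(10.0 ** (-loss_db[in_band] / 10.0))
        return -10.0 * np.log10(mean_echo_power)

    freqs = np.arange(100, 8001, 100)
    flat = np.full(freqs.shape, 40.0)                   # 40 dB of echo loss everywhere
    tilted = np.where(freqs < 2000, 50.0, 37.0)         # much weaker loss above 2 kHz
    print(single_number_echo_loss(freqs, flat))         # 40.0 dB
    print(single_number_echo_loss(freqs, tilted))       # also about 40 dB, despite the tilt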
09:00 – 10:30 |
SESSION 5: Test methodologies: extensions, new parameters, test signals,
calibration
Coordinator: Slawek Zielinski (University of Surrey, UK) |
Alexander Raake (Deutsche Telekom Laboratories, Germany):
Conversational speech quality of spatialized audio conferences
In previous listening tests, the advantage of a spatialized over a non-spatialized
sound rendering in multiparty audio conferencing has been proven, for
example, in terms of a higher speech intelligibility, better speaker
identification, higher focal assurance (retaining who said what in the
conference) and user preference. However, only very few studies have
addressed the potential advantages of spatial audio in an actual
conversation situation. In this presentation, we describe a conversation
test method for assessing the speech quality of audio conferences with
remote interlocutors. The method is based on a set of realistic conversation
test scenarios: The first set aims at audio conferences held in a business
context, the second set at conferences held in a private or spare time
setting; at this stage, the test scenarios are applicable to conferences
with three interlocutors. The paper reports on the results of two
conversation test series carried out with the business set of the
conversation test scenarios ("3CTS", 3-party Conversation Test Scenarios).
The test results show a limited quality differentiation of spatialized
versus non-spatialized speech, and also of narrowband, wideband and fullband
speech (diotic or dichotic presentation). In our presentation, we analyze
the possible reasons for this observation based on different technical and
non-technical criteria. |
Thierry Etamé (France Telecom, France): Characterization of the
multidimensional perceptive space for current speech and sound codecs
The purpose of our work is to produce a reference system that can simulate
and calibrate degradations of speech and audio codecs which are currently
used on telecommunications networks, for subjective assessment tests of
voice quality. At first, 20 wideband codecs are evaluated through subjective
tests with the general goal of producing the multidimensional perceptive
space underlying the perception of current degradations. Then, from a
verbalization task, it appears that the identified attributes are
clear/muffle, high-frequency noise, noise on speech and hiss. Finally, these
dimensions are characterized with correlates such as spectral centroid,
spectral flatness measure, Mean Opinion Score and correlation coefficient. |
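As a minimal sketch of two of the signal-based correlates named above, spectral centroid and spectral flatness can be computed from a magnitude spectrum as follows; the framing and normalisation are common textbook choices and the test signals are arbitrary, so this is illustrative rather than the study's exact procedure.

    import numpy as np

    def spectral_centroid(signal, fs):
        """Amplitude-weighted mean frequency of the magnitude spectrum, in Hz."""
        spectrum = np.abs(np.fft.rfft(signal))
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        return np.sum(freqs * spectrum) / np.sum(spectrum)

    def spectral_flatness(signal):
        """Ratio of geometric to arithmetic mean of the power spectrum:
        close to 1 for noise-like spectra, close to 0 for tonal ones."""
        power = np.abs(np.fft.rfft(signal)) ** 2 + 1e-12     # floor avoids log(0)
        return np.exp(np.mean(np.log(power))) / np.mean(power)

    fs = 16000
    t = np.arange(0, 0.5, 1.0 / fs)
    tone = np.sin(2 * np.pi * 1000 * t)          # tonal: flatness near 0, centroid near 1 kHz
    noise = np.random.randn(t.size)              # noise-like: flatness near 1
    print(spectral_centroid(tone, fs), spectral_flatness(tone))
    print(spectral_centroid(noise, fs), spectral_flatness(noise))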
Yu Jiao (University of Surrey, UK): Towards consistent assessment of audio quality of systems with different available bandwidth
Historically, different methods were used for the assessment of quality of
narrow-band speech, wide-band speech and broad-band audio signals.
Consequently, various assessment techniques were developed and
compartmentalised according to the bandwidth of associated applications. In
the near future, the distinction between audio systems based on their
bandwidth may become blurred and the boundaries between them may even be
completely removed, since new telecommunication systems will allow users to
transmit and reproduce not only speech but also music and sound effects.
In addition, systems will be capable of reproduction of binaural and
multichannel audio signals, making it possible to render accurate 3D audio
scenes. These developments pose new challenges for both objective and
subjective assessment of audio quality in a consistent manner and there is a
need for the development of new, more universal standards for audio
assessment.
In this presentation it will be shown that the traditional methods of
subjective speech quality assessment, such as the ones described in the
ITU-T P.800 Recommendation, could be combined with the methods that are
commonly used in audio quality assessment, e.g. the one standardised in
the ITU-R BS.1534 Recommendation. However, an important problem of defining
a fixed frame-of-reference has to be addressed in this new development,
which could be achieved by means of a direct anchoring technique. A live
demonstration of the computer interface based on the new method will be made
during the presentation. |
10:30 – 11:00 |
Coffee break |
11:00 – 12:30 |
Wrap-up session and conclusions
Coordinator: Jean-Yves Monfort (France Telecom, France) |