This page demonstrates the asynchronous audio source separation algorithm used in our group's entry to the 2018 Signal Separation Evaluation Campaign (SiSEC 2018) and in the forthcoming IWAENC 2018 paper "Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling". It aggregates audio data from separate microphone arrays, each of which has a slightly different sample rate. An advantage of our proposed algorithm is that it works with microphone arrays that move relative to each other. It is therefore suitable for wearable microphone array devices.
Reference: Ryan M. Corey and Andrew C. Singer, "Speech separation using partially asynchronous microphone arrays without resampling," International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, September 2018.
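To illustrate why slightly different sample rates matter, the short Python sketch below computes how far two nominally identical clocks drift apart over time. The sample rate, offset, and duration are hypothetical values chosen for illustration. They are not taken from the experiment, and the function names are our own.

```python
import math

def drift_in_samples(nominal_rate_hz, offset_ppm, elapsed_s):
    """Samples by which two devices disagree after elapsed_s seconds."""
    return nominal_rate_hz * offset_ppm * 1e-6 * elapsed_s

def phase_drift_rad(frequency_hz, offset_ppm, elapsed_s):
    """Phase error of a tone at frequency_hz caused by the same clock offset."""
    time_error_s = offset_ppm * 1e-6 * elapsed_s
    return 2.0 * math.pi * frequency_hz * time_error_s

# Hypothetical values: 16 kHz nominal rate, 50 parts-per-million offset, 10 s.
samples_apart = drift_in_samples(16000.0, 50.0, 10.0)
phase_at_1khz = phase_drift_rad(1000.0, 50.0, 10.0)
```

Under these assumed values the two recordings slip by several samples within seconds, enough to invert the relative phase of a 1 kHz component. A filter that treats all arrays as one synchronous array would therefore lose coherence between devices.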
The speech data used in this experiment were taken from the VCTK corpus and played through six loudspeakers in the University of Illinois Augmented Listening Laboratory. Each of three human listeners wore two in-ear microphones and six hat-mounted microphones, for a total of eight microphones per listener. All three listeners moved continuously during the recording. The room layout and the hat-mounted array are shown below.
The spatial statistics of each source were inferred from separate training recordings. These statistics were used to design four different source separation filters, shown in the table below. The first two rows are the unprocessed mixture and clean* source signals. The middle two rows are conventional static beamformers. The last two rows are the proposed time-varying source separation technique.
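As a rough illustration of how spatial statistics from training recordings can be turned into a separation filter, the NumPy sketch below estimates a spatial covariance matrix for each source in a single frequency bin and builds a static multichannel Wiener filter. This corresponds in spirit to the conventional static beamformers, not to the proposed time-varying technique. The synthetic signals, the function names, and the diagonal loading value are all assumptions made for the example.

```python
import numpy as np

def spatial_covariance(snapshots):
    """Sample covariance for one frequency bin; snapshots is (mics, frames)."""
    return snapshots @ snapshots.conj().T / snapshots.shape[1]

def wiener_filter(cov_target, cov_interference, loading=1e-6):
    """Static multichannel Wiener filter W = R_s (R_s + R_n)^{-1}."""
    num_mics = cov_target.shape[0]
    cov_mix = cov_target + cov_interference
    # Diagonal loading (illustrative value) keeps the solve well conditioned.
    cov_mix = cov_mix + loading * np.trace(cov_mix).real / num_mics * np.eye(num_mics)
    # cov_mix is Hermitian, so W = (cov_mix^{-1} R_s^H)^H.
    return np.linalg.solve(cov_mix, cov_target.conj().T).conj().T

rng = np.random.default_rng(0)
num_mics, num_frames = 8, 2000

def random_complex(*shape):
    """Unit-variance circular complex Gaussian samples."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)

# Hypothetical acoustic transfer vectors for one target and one interferer.
steer_target = random_complex(num_mics, 1)
steer_interf = random_complex(num_mics, 1)

def simulate_target():
    return steer_target * random_complex(1, num_frames)

def simulate_interference():
    # Point interferer plus a small amount of sensor noise.
    return steer_interf * random_complex(1, num_frames) + 0.1 * random_complex(num_mics, num_frames)

# "Training": each source is observed separately to estimate its statistics.
W = wiener_filter(spatial_covariance(simulate_target()),
                  spatial_covariance(simulate_interference()))

# "Test": apply the fixed filter to a new mixture of the same sources.
test_target = simulate_target()
test_interf = simulate_interference()
estimate = W @ (test_target + test_interf)

error_before = np.mean(np.abs(test_interf) ** 2)
error_after = np.mean(np.abs(estimate - test_target) ** 2)
```

Because the filter is fixed, it only works while the transfer vectors stay the same between training and test. That assumption fails when listeners move, which is the situation the time-varying technique is meant to handle.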
The samples below represent the separated sources as heard by Listener 2 (bottom left). The clips are in stereo and best experienced using headphones. Because the listener continuously moved his head up and down and from side to side during the recording, the sound sources should appear to move.
Binaural pair, single listener
8-microphone hat, single listener
Asynchronous binaural pairs, three listeners
Asynchronous 8-channel hats, three listeners
* "Clean" samples were recorded separately, so their head movement patterns differ from those in the recorded mixture.