Toshihiko Shiraishi / Yokohama National University
The objective of speech separation is to extract target speech from a mixture of sounds. Traditional speech separation algorithms rely on time-frequency (T-F) masks. Recently, deep neural networks, generally referred to as deep learning, have been used to estimate T-F masks, substantially improving separation performance. However, a conventional deep learning approach that achieves high separation performance requires a large number of parameters because its network is built from recurrent neural networks (RNNs), which have recurrent and fully connected layers. This is a disadvantage because low memory consumption is desirable when speech separation algorithms are deployed in practical applications. To address this shortcoming, we propose a novel network architecture that balances high separation performance with low memory cost. The proposed network is composed of convolutional neural networks (CNNs), whose sparse connectivity reduces the number of parameters relative to RNNs. In addition, to achieve high separation performance, we designed the network so that it can learn the features of speech. We evaluated the separation performance of each network on a two-speaker separation task. In the simulation results, the proposed network was competitive with the conventional network that achieved the highest separation performance, despite reducing the number of parameters by more than 80 %. These results indicate that the proposed network is superior to the conventional one when both separation performance and memory cost are taken into account.
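The parameter-count advantage of convolution over recurrence described above can be illustrated with a back-of-the-envelope calculation. The sketch below compares the per-layer parameter counts of a standard LSTM layer and a 1-D convolutional layer; the layer sizes (512 channels, kernel size 3) are hypothetical and chosen for illustration only, not taken from the paper's architecture.

```python
def lstm_params(input_size: int, hidden_size: int) -> int:
    # An LSTM layer has 4 gates; each gate carries an input weight matrix,
    # a recurrent (hidden-to-hidden) weight matrix, and a bias vector.
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    # A 1-D convolution shares its kernel across time, so its parameter
    # count is independent of the sequence length (sparse connectivity).
    return out_channels * in_channels * kernel_size + out_channels

# Hypothetical layer sizes for illustration only.
rnn_count = lstm_params(input_size=512, hidden_size=512)
cnn_count = conv1d_params(in_channels=512, out_channels=512, kernel_size=3)

print(f"LSTM layer:   {rnn_count:,} parameters")
print(f"Conv1d layer: {cnn_count:,} parameters")
print(f"Reduction:    {1 - cnn_count / rnn_count:.1%}")
```

Even at this toy scale the convolutional layer needs far fewer parameters per layer; the paper's reported reduction of more than 80 % additionally reflects the overall network design, not just this per-layer effect.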