WhispEar: A Bi-directional Framework for Scaling Whispered Speech Conversion via Pseudo-Parallel Whisper Generation

Abstract

Whispered speech lacks a fundamental frequency because the vocal folds do not vibrate and there is no periodic excitation, which degrades its acoustic cues. This makes it challenging to recover natural speech that preserves the speaker's timbre from whispered input. Existing Whisper-to-Normal (W2N) methods rely heavily on parallel data, yet such data are extremely scarce. Semantic representations have proven effective in Text-to-Speech. We observe that, despite large acoustic discrepancies between whispered and normal speech, their high semantic similarity allows conversion in both directions. Motivated by this, we propose WhispEar, a bidirectional framework supporting both W2N and Normal-to-Whisper (N2W) conversion. The N2W module generates high-quality pseudo-whisper data from large-scale speech datasets. Experiments show that WhispEar significantly outperforms baselines. In addition, we release wEar, to date the largest bilingual (ZH-EN) dataset in this field. Its effectiveness is further validated through ablation experiments.
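As a minimal illustration of the missing periodicity mentioned in the abstract (this is not part of WhispEar), the sketch below compares the peak normalized autocorrelation of a harmonic signal with that of white noise. The two synthetic signals are crude stand-ins for voiced and whispered excitation, and the sample rate, frame length, pitch search range, and all function names are assumptions chosen for the example.

```python
import math
import random

SAMPLE_RATE = 8000  # Hz; illustrative value, not taken from the paper
NUM_SAMPLES = 400   # one 50 ms frame


def voiced_like(f0=125.0):
    """Sum of three harmonics of f0: a crude stand-in for periodic glottal excitation."""
    return [
        sum(math.sin(2 * math.pi * k * f0 * n / SAMPLE_RATE) / k for k in (1, 2, 3))
        for n in range(NUM_SAMPLES)
    ]


def whisper_like(seed=0):
    """Seeded Gaussian white noise: a crude stand-in for aperiodic, noise-like excitation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(NUM_SAMPLES)]


def periodicity(x, f_min=50.0, f_max=200.0):
    """Peak normalized autocorrelation over the lags that correspond to f_min..f_max."""
    min_lag = int(SAMPLE_RATE / f_max)
    max_lag = int(SAMPLE_RATE / f_min)
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        head = x[: len(x) - lag]
        tail = x[lag:]
        num = sum(p * q for p, q in zip(head, tail))
        den = math.sqrt(sum(p * p for p in head) * sum(q * q for q in tail))
        if den > 0:
            best = max(best, num / den)
    return best


voiced_score = periodicity(voiced_like())
whisper_score = periodicity(whisper_like())
print(f"voiced-like peak autocorrelation:  {voiced_score:.2f}")
print(f"whisper-like peak autocorrelation: {whisper_score:.2f}")
```

The harmonic signal repeats exactly every 64 samples, so its score is close to 1, while the noise signal has no lag at which it repeats and scores much lower. A pitch tracker run on whispered input faces the second case, which is why a W2N system has to generate pitch instead of extracting it.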

The following examples showcase WhispEar's capabilities.

Zero-shot Whisper-to-Normal Conversion (English)

In this section, we present zero-shot whisper-to-normal (W2N) conversion results on unseen English speakers. Given whispered input without pitch information, WhispEar reconstructs natural-sounding normal speech with stable prosody and improved intelligibility. Even with limited paired training data, the model maintains consistent performance, demonstrating robustness and generalization to unseen speakers.

Zero-shot Whisper-to-Normal Conversion (Chinese)

We further evaluate zero-shot W2N conversion on Mandarin Chinese. Whispered Mandarin poses an additional challenge because lexical tones, which are normally carried by pitch, are lost. Despite the missing pitch cues, WhispEar restores natural tonal contours and preserves linguistic content. The results show that our unified semantic modeling generalizes across languages and speaking modes.

Zero-shot Normal-to-Whisper Conversion (English)

WhispEar also supports zero-shot normal-to-whisper (N2W) synthesis. Given normal English speech, the model generates realistic whispered speech while preserving speaker characteristics and linguistic content. This direction is crucial for scalable pseudo-parallel data generation, enabling the creation of large whispered datasets without manual recording.

Zero-shot Normal-to-Whisper Conversion (Chinese)

We additionally demonstrate zero-shot N2W conversion in Mandarin Chinese. The generated whispered speech exhibits stable articulation and natural spectral characteristics, closely matching real whispered recordings. These results confirm that the learned semantic representation remains consistent across languages and supports high-quality bidirectional conversion.

wEar Dataset and Scalability Analysis

We release wEar, the largest and most diverse bilingual whispered-normal parallel dataset to date, covering both English and Chinese. Beyond the dataset release, we conduct systematic scalability experiments using large-scale pseudo-parallel data generated by our N2W pipeline. The results show clear performance gains as the amount of training data grows, validating WhispEar’s ability to make effective use of synthetic whispered data and highlighting a practical pathway for scaling whispered speech conversion.

Comparison of whispered speech conversion datasets in terms of language coverage, total duration, number of parallel pairs, and number of speakers.

Dataset          Language(s)   Total Duration (h)   Pairs   Speakers
---------------------------------------------------------------------
Others
wTIMIT           EN            26                   19k     48
CHAINs           EN            3                    1k      36
iWhisper         CN            15                   --      80
wSPIRE           EN            18                   --      88
AISHELL6         CN            30                   20k     167
Whisper40        CN            6                    1k      36
---------------------------------------------------------------------
Ours
wEar (Real)      CN            18                   4k      146
wEar (Pseudo)    EN,CN         3026                 600k    230
wEar (Total)     EN,CN         3044                 604k    230
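
To make the scale of the pseudo-parallel portion concrete, the short sketch below recomputes the wEar totals from the table rows and derives a few ratios. Only the numbers come from the table; the variable and function names are illustrative, and because the pair counts are rounded to thousands in the table, the per-pair durations are approximate.

```python
# Values copied from the wEar rows of the table above
# (pair counts are rounded to thousands there).
WEAR = {
    "real": {"hours": 18, "pairs": 4_000},
    "pseudo": {"hours": 3026, "pairs": 600_000},
}
LARGEST_PRIOR_HOURS = 30  # AISHELL6, the largest of the other datasets listed


def summarize(subsets):
    """Aggregate hours and pairs, and derive simple ratios from them."""
    total_hours = sum(s["hours"] for s in subsets.values())
    total_pairs = sum(s["pairs"] for s in subsets.values())
    return {
        "hours": total_hours,
        "pairs": total_pairs,
        # Fraction of the total duration that is synthetic (N2W-generated).
        "pseudo_hour_share": subsets["pseudo"]["hours"] / total_hours,
        # Approximate mean utterance length per subset, in seconds.
        "avg_pair_seconds": {
            name: s["hours"] * 3600 / s["pairs"] for name, s in subsets.items()
        },
        # Size relative to the largest previously listed dataset, by duration.
        "scale_vs_largest_prior": total_hours / LARGEST_PRIOR_HOURS,
    }


summary = summarize(WEAR)
print(f"total: {summary['hours']} h, {summary['pairs']} pairs")
print(f"pseudo share of duration: {summary['pseudo_hour_share']:.1%}")
print(f"scale vs. largest prior dataset: {summary['scale_vs_largest_prior']:.0f}x")
```

The recomputed totals match the wEar (Total) row, and the ratios show that nearly all of the duration comes from the pseudo-parallel subset.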

Scaling experiment with pseudo-parallel data. Blue curves denote models trained with pseudo-data pretraining only, while red curves denote models further fine-tuned on aligned real data after pretraining.

1. toWhisper: https://github.com/zeta-chicken/toWhisper