Secure Speech Communication



In the digital age, where information is readily available, fake or manipulated content is a growing threat. Experts predict that within the next six years, more than 90% of all digital content will have been manipulated to some degree. This prediction highlights the need for effective countermeasures. Manipulated content can cause severe damage, whether it is used for propaganda, disinformation, or cybercrime, and its effects are far-reaching: it can sway public opinion, disrupt democratic processes, and damage reputations.

Despite advances in detection techniques, current technology detects only about 65% of fake information, leaving the remaining 35% to circulate online undetected. This is a significant challenge because the tools needed to create and distribute convincing fake content are becoming increasingly accessible, and detection will only get harder as those tools evolve. The problem is particularly acute for speech signals, which can be manipulated or synthesized into fake audio recordings (deepfakes) that are difficult to distinguish from genuine recordings.

Given the harm that fake information can cause, effective countermeasures are crucial. This research proposes solutions specifically for the protection of speech signals and the detection of manipulated ones, which are increasingly used to spread fake information. Fake speech signals present unique challenges: they can be highly complex, which makes them difficult to detect with traditional methods, and the difficulty is likely to grow as deepfake technology improves. By addressing these challenges with methods that detect fake speech and prevent its spread, this research aims to contribute to a safer digital landscape.


Related Publications


  1. F0 Modification via PV-TSM Algorithm for Speaker Anonymization Across Gender.
    Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

    2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7–10 November 2022.

    Speaker anonymization has been developed to protect personally identifiable information while retaining other encapsulated information in speech. Datasets, metrics, and protocols for evaluating speaker anonymization have been defined in the Voice Privacy Challenge (VPC). However, existing privacy metrics focus on evaluating general speaker individuality anonymization, which is represented by an x-vector. This study aims to investigate the effect of anonymization on the perception of gender. Understanding how anonymization causes gender transformation is essential for various applications of speaker anonymization. We propose speaker anonymization methods across genders based on phase-vocoder time-scale modification (PV-TSM). In addition to the VPC evaluation, we developed a gender classifier to evaluate a speaker's gender anonymization. The objective evaluation results showed that our proposed methods can successfully anonymize gender. In addition, our proposed methods outperformed the signal processing-based baseline methods in anonymizing speaker individuality represented by the x-vector in ASVeval while maintaining speech intelligibility.

  2. Speaker Anonymization by Pitch Shifting Based on Time-Scale Modification.
    Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

    2nd Symposium on Security and Privacy in Speech Communication, joined with the 2nd VoicePrivacy Challenge Workshop, 23–24 September 2022, as a satellite to Interspeech 2022, Incheon, Korea.

    The increasing usage of speech in digital technology raises a privacy issue because speech contains biometric information. Several methods of dealing with this issue have been proposed, including speaker anonymization or de-identification. Speaker anonymization aims to suppress personally identifiable information (PII) while keeping the other speech properties, including linguistic information. In this study, we utilize time-scale modification (TSM) speech signal processing for speaker anonymization. Speech signal processing approaches are significantly less complex than the state-of-the-art x-vector-based speaker anonymization method because they do not require a training process. We propose anonymization methods using two major categories of TSM: a synchronous overlap-add (SOLA)-based algorithm and phase vocoder-based TSM (PV-TSM). To evaluate our proposed methods, we utilize the standard objective evaluation introduced in the VoicePrivacy challenge. The results show that our method based on PV-TSM balances privacy and utility metrics better than the baseline systems, especially when evaluated with an automatic speaker verification (ASV) system in the anonymized enrollment and anonymized trials (a-a) scenario. Further, our method outperformed the x-vector-based speaker anonymization method, which has limitations in its complex training process, low privacy in the a-a scenario, and low voice distinctiveness.

  3. Speaker Anonymization by Modifying Fundamental Frequency and X-Vectors Singular Value.
    Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Shunsuke Kidani, and Masashi Unoki.

    Computer Speech & Language, Elsevier, vol. 73, 101326, 2022.

    Speaker anonymization is a method of protecting voice privacy by concealing individual speaker characteristics while preserving linguistic information. The VoicePrivacy Challenge 2020 was initiated to generalize the task of speaker anonymization. In the challenge, two frameworks for speaker anonymization were introduced; in this study, we propose a method of improving the primary framework by modifying the state-of-the-art speaker individuality feature (namely, x-vector) in a neural waveform speech synthesis model. Our proposed method is constructed based on x-vector singular value modification with a clustering model. We also propose a technique of modifying the fundamental frequency and speech duration to enhance the anonymization performance. To evaluate our method, we carried out objective and subjective tests. The overall objective test results show that our proposed method improves the anonymization performance in terms of the speaker verifiability, whereas the subjective evaluation results show improvement in terms of the speaker dissimilarity. The intelligibility and naturalness of the anonymized speech with speech prosody modification were slightly reduced (less than 5% of word error rate) compared to the results obtained by the baseline system.

  4. Speech Watermarking by McAdams Coefficient Scheme Based on Random Forest Learning.
    Candy Olivia Mawalim and Masashi Unoki.

    Entropy, MDPI, vol. 23, no. 10, 2021.

    Speech watermarking has become a promising solution for protecting the security of speech communication systems. We propose a speech watermarking method that uses the McAdams coefficient, which is commonly used for frequency harmonics adjustment. The embedding process was conducted using bit-inverse shifting. We also developed a random forest classifier that uses features related to frequency harmonics for blind detection. An objective evaluation was conducted to analyze the performance of our method in terms of the inaudibility and robustness requirements. The results indicate that our method satisfies the speech watermarking requirements with a 16 bps payload under normal conditions and numerous non-malicious signal processing operations, e.g., conversion to Ogg or MP4 format.

  5. Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method.
    Candy Olivia Mawalim and Masashi Unoki.

    2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, December 2021.

    Speaker anonymization aims to suppress speaker individuality to protect privacy in speech while preserving the other aspects, such as speech content. One effective solution for anonymization is to modify the McAdams coefficient. In this work, we propose a method to improve the security of speaker anonymization based on the McAdams coefficient by using a speech watermarking approach. The proposed method consists of two main processes: one for embedding and one for detection. In the embedding process, two different McAdams coefficients represent the binary bits "0" and "1". The watermarked speech is then obtained by frame-by-frame bit inverse switching. Subsequently, the detection process is carried out by a power spectrum comparison. We conducted objective evaluations of the anonymization with reference to the VoicePrivacy 2020 Challenge (VP2020) and of the speech watermarking with reference to the Information Hiding Challenge (IHC), and found that our method could satisfy the blind detection, inaudibility, and robustness requirements in watermarking. It also significantly improved the anonymization performance in comparison to the secondary baseline system in VP2020.

  6. X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System.
    Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, and Masashi Unoki.

    Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, pp. 1703–1707, October 2020.

    Anonymizing speaker individuality is crucial for ensuring voice privacy protection. In this paper, we propose a speaker individuality anonymization system that uses singular value modification and statistical-based decomposition on an x-vector with ensemble regression modeling. An anonymization system requires speaker-to-speaker correspondence (each speaker corresponds to a pseudo-speaker), which may be possible by modifying significant x-vector elements. The significant elements were determined by singular value decomposition and variant analysis. Subsequently, the anonymization process was performed by an ensemble regression model trained using x-vector pools with clustering-based pseudo-targets. The results demonstrated that our proposed anonymization system effectively improves objective verifiability, especially in the anonymized trials and anonymized enrollments setting, while preserving intelligibility scores similar to those of the baseline system introduced in the VoicePrivacy 2020 Challenge.
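
Publications 4 and 5 above both build on the McAdams coefficient, with two different coefficients standing for the bits "0" and "1". As a rough illustration only, the sketch below shows the commonly described McAdams transformation, in which the angle of each complex pole of a frame's linear-prediction (LPC) polynomial is raised to the power of the coefficient. The function names, the example coefficient values 0.8 and 1.2, and the direct bit-to-coefficient mapping are assumptions made for illustration; the sketch does not reproduce the papers' bit-inverse switching, LPC analysis, resynthesis, or detection stages.

```python
import numpy as np

def mcadams_shift(lpc, alpha):
    """Raise the angle of each complex LPC pole to the power alpha.

    lpc   : LPC polynomial coefficients [1, a1, ..., ap] of one frame
    alpha : McAdams coefficient (alpha = 1 leaves the frame unchanged)
    """
    poles = np.roots(lpc)
    is_complex = np.abs(poles.imag) > 1e-12
    angles = np.angle(poles)
    # Apply the power to |angle| and restore the sign so that
    # conjugate pole pairs stay conjugate; real poles are left alone.
    shifted = np.sign(angles) * np.abs(angles) ** alpha
    new_angles = np.where(is_complex, shifted, angles)
    new_poles = np.abs(poles) * np.exp(1j * new_angles)
    # Rebuild the (real, monic) polynomial from the moved poles.
    return np.real(np.poly(new_poles))

def embed_bits(frame_lpcs, bits, alpha0=0.8, alpha1=1.2):
    """Hypothetical per-frame mapping: bit 0 -> alpha0, bit 1 -> alpha1."""
    alphas = (alpha0, alpha1)
    return [mcadams_shift(a, alphas[b]) for a, b in zip(frame_lpcs, bits)]

# Toy frame with a single conjugate pole pair at radius 0.9, angle 0.5 rad.
lpc = np.real(np.poly([0.9 * np.exp(0.5j), 0.9 * np.exp(-0.5j)]))
marked = embed_bits([lpc, lpc], [0, 1])
```

In this toy example the pole radius is preserved and only the pole angle moves, which is why the transformation shifts formant positions without changing the stability of the synthesis filter. A complete system would still need to estimate the LPC coefficients from each speech frame and resynthesize the frame from the residual.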