Secure Speech Communication

In the digital age, where information is readily available, the threat of fake or manipulated content is becoming increasingly concerning. Experts predict that in the next six years, more than 90% of all digital content will have been manipulated to some degree. This alarming statistic highlights the need for effective solutions to combat the rising threat of fake information. Manipulated content has the potential to cause severe damage, whether it is used for propaganda, disinformation, or cybercrime. The implications of this are far-reaching, with the potential to impact public opinion, disrupt democracy, and damage reputations.

Despite advancements in detection techniques, current technology is only able to detect about 65% of fake information, with the remaining 35% slipping through the internet undetected. This represents a significant challenge, as the technology required to create and distribute fake information is becoming increasingly accessible. The ability to create high-quality, convincing fake information with ease is a growing concern, and as technology continues to evolve, the challenge of detecting fake information will become even greater. This is particularly true when it comes to speech signals, which can be manipulated to create fake audio recordings or deepfakes that are difficult to distinguish from genuine recordings.

Given the potential harm that fake information can cause, effective countermeasures are crucial. This research proposes solutions for protecting genuine speech signals and detecting manipulated ones, which are increasingly used to spread fake information. Detecting fake speech presents unique challenges: the signals are highly complex, which makes manipulation difficult to detect with traditional methods, and the task is likely to become harder as deepfake technology improves. This research addresses these challenges by proposing solutions that detect fake speech signals and prevent their spread, ultimately contributing to a safer digital landscape.

Related Publications

IH&MMSec

Robust Multilingual Audio Deepfake Detection Through Hybrid Modeling.

Candy Olivia Mawalim, Yutong Wang, Aulia Adila, Shogo Okada, Masashi Unoki

The 13th ACM Workshop on Information Hiding and Multimedia Security, San Jose, CA, June 18-20.

Abstract

The increasing sophistication of AI-generated human voice poses a significant threat, demanding robust detection systems that can generalize effectively across diverse linguistic environments and synthesis techniques. In response to the SAFE Challenge, this paper introduces a novel approach to multilingual audio deepfake detection. Our primary contribution lies in the comprehensive study of deepfake detection using a multilingual speech corpus encompassing 17 languages and a broad spectrum of synthesis methods and acoustic conditions, designed to enable more realistic and challenging evaluations. To optimally utilize this diverse data, we propose a hybrid detection model that synergistically combines the strengths of end-to-end RawNet and AASIST architectures with language-agnostic representations learned from a multilingual self-supervised learning model. Additionally, we explore the efficacy of RawBoost data augmentation in enhancing robustness against real-world noise. Our experimental evaluation demonstrates promising generalization in generated audio detection, achieving approximately 73% balanced accuracy across multilingual data and unseen synthesis algorithms.
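The balanced accuracy reported above is the mean of the per-class recalls, which keeps a detector from scoring well simply by favoring the majority class. The following is a minimal sketch of that metric, not the challenge's official scoring code; the label convention (1 = generated, 0 = real) and the toy data are assumptions made for illustration.

```python
def balanced_accuracy(labels, predictions):
    """Mean of per-class recall for binary labels (assumed: 1 = generated, 0 = real)."""
    recalls = []
    for cls in (0, 1):
        # Indices of all trials whose true label is this class.
        idx = [i for i, y in enumerate(labels) if y == cls]
        if not idx:
            raise ValueError(f"no trials with label {cls}")
        correct = sum(1 for i in idx if predictions[i] == cls)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Hypothetical imbalanced toy set: 8 real clips, 2 generated clips.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
predictions = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

plain_accuracy = sum(y == p for y, p in zip(labels, predictions)) / len(labels)
score = balanced_accuracy(labels, predictions)
print(plain_accuracy, score)
```

On this toy set the plain accuracy is 0.9 while the balanced accuracy is 0.75, because one of the two generated clips is missed.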
ATSIP

InaSAS: Benchmarking Indonesian Speech Antispoofing Systems

Candy Olivia Mawalim, Sarah Azka Arief, and Dessi Puji Lestari

APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 3, e203.

Abstract

Voice-based biometric systems are vulnerable to spoofing attacks, where attackers can deceive the systems with synthetic or replayed voice samples. To address this vulnerability, we introduce the InaSpoof-v1 dataset, which is a comprehensive benchmark for Indonesian language spoofing detection. We evaluate the state-of-the-art countermeasure models on this dataset, highlighting the challenges posed by the diversity of the Indonesian language and the impacts of demographic factors. Our experimental results demonstrate the effectiveness of the end-to-end AASIST model for synthesized speech attack countermeasures and residual networks (ResNet) for replay attack detection. To improve future systems, we emphasize the importance of considering demographic factors and addressing the challenges posed by real-world scenarios.
ICASSP

Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems.

Candy Olivia Mawalim, Aulia Adila, and Masashi Unoki.

The 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad, India (Accepted).

Abstract

Speaker anonymization techniques are crucial for safeguarding user privacy in voice-based applications. However, these methods are susceptible to adversarial attacks that can compromise their effectiveness. This paper proposes attacker systems that leverage the power of fine-tuned TitaNet-Large and ECAPA-TDNN models to identify the original speaker from anonymized speech generated by various anonymization methods. Both pre-trained models are renowned for their state-of-the-art ability to extract robust speaker embeddings. Fine-tuning these models with anonymized speech enables them to identify underlying patterns in anonymized speech. We evaluated the proposed attacker systems against multiple anonymization techniques that performed effectively in a series of voice privacy challenges. Our experimental results underscore the effectiveness of the fine-tuned TitaNet-Large model in breaking through these anonymization methods, as indicated by the reduced equal error rate (EER). This highlights the importance of robust and adaptive anonymization strategies to counter such emerging semi-informed threats.
SEALP

Indonesian Speech Content De-Identification in Low Resource Transcripts.

Rifqi Naufal Abdjul, Dessi Puji Lestari, Ayu Purwarianti, Candy Olivia Mawalim, Sakriani Sakti, and Masashi Unoki.

The Second Workshop in South East Asian Language Processing, Co-located with COLING 2025, Abu Dhabi (Online), January 19-24, 2025.

Abstract

Advancements in technology and the increased use of digital data threaten individual privacy, especially in speech containing Personally Identifiable Information (PII). Therefore, systems that can remove or process privacy-sensitive data in speech are needed, particularly for low-resource transcripts. These transcripts are minimally annotated or labeled automatically, which is less precise than human annotation. However, using them can simplify the development of de-identification systems in any language. In this study, we develop and evaluate an efficient speech de-identification system. We create an Indonesian speech dataset containing sensitive private information and design a system with three main components: speech recognition, information extraction, and masking. To enhance performance in low-resource settings, we incorporate transcription data in training, use data augmentation, and apply weakly supervised learning. Our results show that our techniques significantly improve privacy detection performance, with approximately 29% increase in F1 score, 20% in precision, and 30% in recall with minimally labeled data.
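The last of the three components named above, masking, can be illustrated in isolation. The sketch below assumes that the speech recognition and information extraction stages have already produced a transcript and a list of non-overlapping character spans; the function name `mask_pii`, the span format, and the entity labels are hypothetical and are not taken from the paper.

```python
def mask_pii(transcript, entities):
    """Replace each (start, end, label) character span with a "[LABEL]" placeholder.

    Assumes the spans do not overlap.
    """
    masked = transcript
    # Process spans from right to left so earlier offsets stay valid.
    for start, end, label in sorted(entities, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


# Hypothetical recognizer output and extracted entities.
transcript = "my name is Budi and I live in Bandung"
entities = []
for text, label in (("Budi", "PERSON"), ("Bandung", "LOCATION")):
    start = transcript.index(text)
    entities.append((start, start + len(text), label))

masked = mask_pii(transcript, entities)
print(masked)  # my name is [PERSON] and I live in [LOCATION]
```

Replacing spans from the end of the string backward avoids recomputing offsets after each substitution.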
JAES

Forensic speech enhancement: Toward Reliable Handling of Poor-Quality Speech Recordings Used as Evidence in Criminal Trials

Helen Fraser, Vincent Aubanel, Robert C. Maher, Candy Olivia Mawalim, Xin Wang, Peter Počta, Emma Keith, Gérard Chollet, and Karla Pizzi.

Journal of the Audio Engineering Society, 2024 (Accepted).

Abstract

This paper proposes an innovative interdisciplinary approach to evaluating the effectiveness of forensic speech enhancement (FSE). FSE faces unique challenges arising from a range of factors, including poor recording quality, highly variable conditions from case to case, and content uncertainty. Despite these difficulties, FSE is commonly admitted in court, and can significantly influence the outcome of criminal trials. Current FSE practices are hindered by unrealistic expectations from courts, which often assume that enhanced audio inherently clarifies content. In fact, FSE can have the undesired opposite effect, potentially resulting in unfair prejudice, when, for example, it increases the credibility of a misleading transcript. The proposed interdisciplinary project advocates for a better consideration of speech perception factors, particularly those related to transcription. It aims to bridge the gap between FSE and forensic transcription by promoting a combined approach to enhancing and accurately transcribing forensic audio. By developing a position statement on FSE capabilities, the project seeks to establish realistic standards and foster collaboration among researchers and practitioners. This effort aims to ensure reliable, accountable forensic audio evidence, aligning with forensic science standards and improving the effectiveness of the justice system.
APSIPA

Detecting Spoof Voices in Asian Non-Native Speech: An Indonesian and Thai Case Study.

Aulia Adila, Candy Olivia Mawalim, and Masashi Unoki

The 16th annual conference organized by Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024. (Accepted)

Abstract

TBD
O-COCOSDA

Analysis of Pathological Features for Spoof Detection.

Myat Aye Aye Aung, Hay Mar Soe Naing, Aye Mya Hlaing, Win Pa Pa, Kasorn Galajit, and Candy Olivia Mawalim

The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024. (Accepted)

Abstract

TBD
O-COCOSDA

UCSYSpoof: A Myanmar Language Dataset for Voice Spoofing Detection.

Hay Mar Soe Naing, Win Pa Pa, Aye Mya Hlaing, Myat Aye Aye Aung, Kasorn Galajit, and Candy Olivia Mawalim

The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024. (Accepted)

Abstract

TBD
ICAICTA

Indonesian Speech Anti-Spoofing System: Data Creation and CNN Models.

Sarah Azka Arief, Candy Olivia Mawalim, and Dessi Puji Lestari

The 11th International Conference on Advanced Informatics: Concepts, Theory, and Applications (ICAICTA 2024), Singapore, 28-30 Sept 2024.

Abstract

Biometric systems are prone to spoofing attacks. While research in speech anti-spoofing has been progressing, there is a limited availability of diverse language datasets. This study aims to bridge this gap by developing an Indonesian spoofed speech dataset, which includes replay attacks, text-to-speech, and voice conversion. This dataset forms the foundation for creating an Indonesian speech anti-spoofing system. Subsequently, light convolutional neural network (LCNN) and residual network (ResNet) models, based on convolutional neural networks (CNN), were developed to evaluate the dataset. The input features used are linear frequency cepstral coefficients (LFCC). Both models demonstrate remarkably low minDCF and EER scores approaching zero. The results also exhibit exceptional scores with 4-fold cross validation, showing strong initial performance with no signs of overfitting. However, models trained solely on Common Voice or Prosa.ai datasets performed poorly in cross-source tests, suggesting generalization issues due to a lack of diversity in the dataset. This highlights the need for further improvement and continued research in Indonesian speech spoof detection.
RISP

Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.

Aulia Adila, Candy Olivia Mawalim, Takuto Isoyama, and Masashi Unoki.

Journal of Signal Processing, Research Institute of Signal Processing, 2024, Volume 28 Issue 6 Pages 309-313.

Abstract

A reliable speech watermarking technique must balance satisfying four requirements: inaudibility, robustness, blind detectability, and confidentiality. A previous study proposed a speech watermarking technique based on direct spread spectrum (DSS) using a linear prediction (LP) scheme, i.e., LP-DSS, that could simultaneously satisfy these four requirements. However, an inaudibility issue was found due to the incorporation of a blind detection scheme with frame synchronization. In this paper, we investigate the feasibility of utilizing a psychoacoustical model, which simulates auditory masking, to control the suitable embedding level of the watermark signal to resolve the inaudibility issue in the LP-DSS scheme. Evaluation results confirmed that controlling the embedding level with the psychoacoustical model, with a constant scaling factor setting, could balance the trade-off between inaudibility and detection ability with a payload up to 64 bps.
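The core idea of direct spread spectrum watermarking with blind detection can be sketched in a few lines. This is a deliberately simplified illustration and not the LP-DSS scheme itself: it embeds directly in the time-domain signal rather than in the linear prediction residue, uses a fixed embedding level `alpha` instead of a psychoacoustical model, and omits frame synchronization. The frame length, seed values, and noise-like host signal are all assumptions.

```python
import random


def pn_sequence(length, seed):
    """Pseudo-random +/-1 spreading sequence shared by embedder and detector."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]


def embed(host, bits, pn, alpha):
    """Add +alpha*pn (bit 1) or -alpha*pn (bit 0) to each frame; one bit per frame."""
    n = len(pn)
    out = list(host)
    for k, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        for i in range(n):
            out[k * n + i] += alpha * sign * pn[i]
    return out


def detect(signal, num_bits, pn):
    """Blind detection: the sign of each frame's correlation with the PN sequence."""
    n = len(pn)
    bits = []
    for k in range(num_bits):
        corr = sum(signal[k * n + i] * pn[i] for i in range(n))
        bits.append(1 if corr > 0 else 0)
    return bits


# Hypothetical setup: a noise-like host signal stands in for speech.
rng = random.Random(0)
frame_len = 4096
payload = [1, 0, 1, 1, 0, 0, 1, 0]
host = [rng.gauss(0.0, 0.1) for _ in range(frame_len * len(payload))]
pn = pn_sequence(frame_len, seed=1234)

watermarked = embed(host, payload, pn, alpha=0.02)
recovered = detect(watermarked, len(payload), pn)
print(recovered == payload)
```

The detector needs only the shared PN sequence, not the original host signal, which is what makes the detection blind. Raising `alpha` makes detection more reliable but the watermark more audible; that trade-off is what the psychoacoustical model in the abstract is used to control.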
iSAI-NLP

ThaiSpoof: A Database for Spoof Detection in Thai Language.

Kasorn Galajit, Thunpisit Kosolsriwiwat, Candy Olivia Mawalim, Pakinee Aimmanee, Waree Kongprawechnon, Win Pa Pa, Anuwat Chaiwongyen, Teeradaj Racharak, Hayati Yassin, Jessada Karnjana, Surasak Boonkla, and Masashi Unoki.

The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

Abstract

Automatic speaker verification (ASV) is widely used in applications and security systems. However, these systems are vulnerable to various direct and indirect access attacks, which weaken their authentication capability. Research in spoofed speech detection contributes to strengthening these systems. However, spoofing detection studies are limited to only some languages because suitable datasets are lacking. This paper focuses on a Thai language dataset for spoof detection. The dataset consists of genuine speech signals and various types of spoofed speech signals. The spoofed speech is generated using text-to-speech tools for the Thai language, synthesis tools, and tools for speech modification. To showcase the utilization of this dataset, we implement a simple model based on a convolutional neural network (CNN) taking linear frequency cepstral coefficients (LFCC) as its input. We trained, validated, and tested the model on our dataset, referred to as ThaiSpoof. The experimental results show that the accuracy of the model is 95% and the equal error rate (EER) is 4.67%. These results indicate that the ThaiSpoof dataset has the potential to support spoof detection studies.
iSAI-NLP

Voice Contribution on LFCC feature and ResNet-34 for Spoof Detection.

Khaing Zar Mon, Kasorn Galajit, Candy Olivia Mawalim, Jessada Karnjana, Tsuyoshi Isshiki, and Pakinee Aimmanee.

The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

Abstract

Recently, biometric authentication has seen significant advancement, particularly in speaker verification. Despite this progress, compelling evidence highlights the continued vulnerability of this technology to malicious spoofing attacks. This vulnerability calls for developing specialized countermeasures to identify various attack types. This paper specifically focuses on detecting replay, speech synthesis, and voice conversion attacks. As our spoof detection strategy’s front-end feature extraction method, we used linear frequency cepstral coefficients (LFCC). We utilized ResNet-34 to classify between genuine and fake speech. We evaluated the proposed method, which integrates LFCC with ResNet-34, on the physical access (PA) and logical access (LA) partitions of the ASVspoof 2019 dataset. In our approach, we compare the use of the entire utterance for feature extraction in both PA and LA datasets with an alternative method that involves extracting features from a specific percentage of the voice segment within the utterance for classification. In addition, we conducted a comprehensive evaluation by comparing our proposed method with the established baseline techniques, LFCC-GMM and CQCC-GMM. The proposed method demonstrates promising performance, achieving an equal error rate (EER) of 3.11% and 3.49% for the development and evaluation datasets, respectively, in PA attacks. In the LA attack evaluation, the proposed method achieves an EER of 0.16% and 6.89% for the development and evaluation datasets, respectively. The proposed method shows promising results in identifying spoof attacks for both PA and LA attacks.
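The equal error rate used throughout these abstracts is the operating point at which the false acceptance rate (spoofed trials accepted) equals the false rejection rate (genuine trials rejected). The sketch below approximates it by sweeping every observed score as a threshold; it is not the ASVspoof evaluation toolkit, and the scores are made-up values in which a higher score means "more likely genuine".

```python
def equal_error_rate(genuine_scores, spoof_scores):
    """Approximate EER: sweep thresholds and return the point where FAR and FRR are closest."""
    best_gap, eer = None, None
    for t in sorted(set(genuine_scores) | set(spoof_scores)):
        # False acceptance: a spoofed trial scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: a genuine trial scored below the threshold.
        frr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical detector scores for five genuine and five spoofed trials.
genuine = [0.9, 0.8, 0.7, 0.6, 0.3]
spoof = [0.1, 0.2, 0.4, 0.5, 0.65]

eer = equal_error_rate(genuine, spoof)
print(eer)
```

For these toy scores, a threshold of 0.6 rejects one genuine trial and accepts one spoofed trial, so both error rates are 1/5 and the EER is 20%.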
APSIPA

Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection.

Haowei Cheng, Candy Olivia Mawalim, Kai Li, Lijun Wang, and Masashi Unoki.

The 15th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023), Taipei, Taiwan, 31 October - 3 November 2023.

Abstract

Deep-fake speech detection aims to develop effective techniques for identifying fake speech generated using advanced deep-learning methods. It can reduce the negative impact of malicious production or dissemination of fake speech in real-life scenarios. Although humans can distinguish between genuine and fake speech relatively easily thanks to human auditory mechanisms, it is difficult for machines to distinguish them correctly. One major reason for this challenge is that machines struggle to effectively separate speech content from human vocal system information. Common features used in speech processing face difficulties in handling this issue, hindering the neural network from learning the discriminative differences between genuine and fake speech. To address this issue, we investigated spectro-temporal modulation representations in genuine and fake speech, which simulate the human auditory perception process. Next, the spectro-temporal modulation was fed to a light convolutional neural network with bidirectional long short-term memory for classification. We conducted experiments on the benchmark datasets of the Automatic Speaker Verification and Spoofing Countermeasures Challenge 2019 (ASVspoof2019) and the Audio Deep synthesis Detection Challenge 2023 (ADD2023), achieving equal error rates of 8.33% and 42.10%, respectively. The results showed that spectro-temporal modulation representations can distinguish genuine from deep-fake speech and have adequate performance on both datasets.
APSIPA

F0 Modification via PV-TSM Algorithm for Speaker Anonymization Across Gender.

Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7-10 November 2022.

Abstract

Speaker anonymization has been developed to protect personally identifiable information while retaining other encapsulated information in speech. Datasets, metrics, and protocols for evaluating speaker anonymization have been defined in the Voice Privacy Challenge (VPC). However, existing privacy metrics focus on evaluating general speaker individuality anonymization, which is represented by an x-vector. This study aims to investigate the effect of anonymization on the perception of gender. Understanding how anonymization caused gender transformation is essential for various applications of speaker anonymization. We proposed speaker anonymization methods across genders based on phase-vocoder time-scale modification (PV-TSM). Subsequently, in addition to the VPC evaluation, we developed a gender classifier to evaluate a speaker's gender anonymization. The objective evaluation results showed that our proposed method can successfully anonymize gender. In addition, our proposed methods outperformed the signal processing-based baseline methods in anonymizing speaker individuality represented by the x-vector in ASVeval while maintaining speech intelligibility.

SPSC-SIG

Speaker Anonymization by Pitch Shifting Based on Time-Scale Modification.

Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

2nd Symposium on Security and Privacy in Speech Communication joined with 2nd VoicePrivacy Challenge Workshop September 23 & 24 2022, as a satellite to Interspeech 2022, Incheon, Korea.

Abstract

The increasing usage of speech in digital technology raises a privacy issue because speech contains biometric information. Several methods of dealing with this issue have been proposed, including speaker anonymization or de-identification. Speaker anonymization aims to suppress personally identifiable information (PII) while keeping the other speech properties, including linguistic information. In this study, we utilize time-scale modification (TSM) speech signal processing for speaker anonymization. Speech signal processing approaches are significantly less complex than the state-of-the-art x-vector-based speaker anonymization method because they do not require a training process. We propose anonymization methods using two major categories of TSM: the synchronous overlap-add (SOLA)-based algorithm and phase vocoder-based TSM (PV-TSM). For evaluating our proposed methods, we utilize the standard objective evaluation introduced in the VoicePrivacy challenge. The results show that our method based on the PV-TSM balances privacy and utility metrics better than baseline systems, especially when evaluating with an automatic speaker verification (ASV) system in anonymized enrollment and anonymized trials (a-a). Further, our method outperformed the x-vector-based speaker anonymization method, which has limitations in its complex training process, low privacy in an a-a scenario, and low voice distinctiveness.

Elsevier

Speaker Anonymization by Modifying Fundamental Frequency and X-Vectors Singular Value.

Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Shunsuke Kidani, and Masashi Unoki.

Computer Speech & Language, Elsevier, vol. 73, 101326, 2022.

Abstract

Speaker anonymization is a method of protecting voice privacy by concealing individual speaker characteristics while preserving linguistic information. The VoicePrivacy Challenge 2020 was initiated to generalize the task of speaker anonymization. In the challenge, two frameworks for speaker anonymization were introduced; in this study, we propose a method of improving the primary framework by modifying the state-of-the-art speaker individuality feature (namely, x-vector) in a neural waveform speech synthesis model. Our proposed method is constructed based on x-vector singular value modification with a clustering model. We also propose a technique of modifying the fundamental frequency and speech duration to enhance the anonymization performance. To evaluate our method, we carried out objective and subjective tests. The overall objective test results show that our proposed method improves the anonymization performance in terms of the speaker verifiability, whereas the subjective evaluation results show improvement in terms of the speaker dissimilarity. The intelligibility and naturalness of the anonymized speech with speech prosody modification were slightly reduced (less than 5% of word error rate) compared to the results obtained by the baseline system.

MDPI

Speech Watermarking by McAdams Coefficient Scheme Based on Random Forest Learning.

Candy Olivia Mawalim, and Masashi Unoki.

Entropy, MDPI, vol. 23, no. 10, 2021.

Abstract

Speech watermarking has become a promising solution for protecting the security of speech communication systems. We propose a speech watermarking method that uses the McAdams coefficient, which is commonly used for frequency harmonics adjustment. The embedding process was conducted, using bit-inverse shifting. We also developed a random forest classifier, using features related to frequency harmonics for blind detection. An objective evaluation was conducted to analyze the performance of our method in terms of the inaudibility and robustness requirements. The results indicate that our method satisfies the speech watermarking requirements with a 16 bps payload under normal conditions and numerous non-malicious signal processing operations, e.g., conversion to Ogg or MP4 format.

APSIPA

Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method.

Candy Olivia Mawalim, and Masashi Unoki.

2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, December 2021.

Abstract

Speaker anonymization aims to suppress speaker individuality to protect privacy in speech while preserving the other aspects, such as speech content. One effective solution for anonymization is to modify the McAdams coefficient. In this work, we propose a method to improve the security of speaker anonymization based on the McAdams coefficient by using a speech watermarking approach. The proposed method consists of two main processes: one for embedding and one for detection. In the embedding process, two different McAdams coefficients represent binary bits "0" and "1". The watermarked speech is then obtained by frame-by-frame bit inverse switching. Subsequently, the detection process is carried out by a power spectrum comparison. We conducted objective evaluations of the anonymization with reference to the VoicePrivacy 2020 Challenge (VP2020) and of the speech watermarking with reference to the Information Hiding Challenge (IHC), and found that our method could satisfy the blind detection, inaudibility, and robustness requirements in watermarking. It also significantly improved the anonymization performance in comparison to the secondary baseline system in VP2020.
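The McAdams transformation underlying this method shifts the angles of the complex poles of a frame's linear prediction filter: each pole angle φ becomes φ raised to the power of the McAdams coefficient α, while the pole radius is kept. The sketch below shows only that pole mapping and a per-frame choice of α driven by the watermark bit. LPC analysis, resynthesis, the bit inverse switching step, and the detector are omitted, and the two coefficient values are hypothetical rather than those used in the paper.

```python
import cmath


def mcadams_shift(poles, alpha):
    """Map each complex pole angle phi to sign(phi) * |phi|**alpha, keeping the radius."""
    shifted = []
    for p in poles:
        if p.imag == 0:
            shifted.append(p)  # real poles are left untouched
            continue
        radius, phi = cmath.polar(p)
        new_phi = (1 if phi > 0 else -1) * abs(phi) ** alpha
        shifted.append(cmath.rect(radius, new_phi))
    return shifted


def embed_bits(frame_poles, bits, alpha_for_zero=0.8, alpha_for_one=0.9):
    """Apply one of two McAdams coefficients to each frame, selected by its bit."""
    return [
        mcadams_shift(poles, alpha_for_one if bit else alpha_for_zero)
        for poles, bit in zip(frame_poles, bits)
    ]


# Hypothetical poles of one frame: a conjugate pair and one real pole.
pole = cmath.rect(0.9, 0.5)
frame = [pole, pole.conjugate(), 0.5 + 0j]

marked = embed_bits([frame, frame], bits=[0, 1])
print([round(cmath.phase(p), 3) for p in marked[0]])
print([round(cmath.phase(p), 3) for p in marked[1]])
```

Treating the sign of the angle separately keeps conjugate pole pairs conjugate after the shift, so the modified filter still has real coefficients.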

Interspeech

X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System.

Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, and Masashi Unoki.

Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, pp. 1703–1707, October 2020.

Abstract

Anonymizing speaker individuality is crucial for ensuring voice privacy protection. In this paper, we propose a speaker individuality anonymization system that uses singular value modification and statistical-based decomposition on an x-vector with ensemble regression modeling. An anonymization system requires speaker-to-speaker correspondence (each speaker corresponds to a pseudo-speaker), which may be possible by modifying significant x-vector elements. The significant elements were determined by singular value decomposition and variant analysis. Subsequently, the anonymization process was performed by an ensemble regression model trained using x-vector pools with clustering-based pseudo-targets. The results demonstrated that our proposed anonymization system effectively improves objective verifiability, especially in the anonymized-trials and anonymized-enrollments setting, while preserving intelligibility scores similar to those of the baseline system introduced in the VoicePrivacy 2020 Challenge.
