Publications

* denotes equal contribution and joint lead authorship.

Journal Book Chapter Conference

2025

APSIPA

Phoneme-Specific Challenges to Intelligibility in Hearing Impairment Under Noisy Condition

Denawati Junia, Candy Olivia Mawalim

The 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA 2025), Shangri-la, Singapore, October 22-24, 2025.

Abstract PDF URL

Hearing impairment significantly reduces speech intelligibility, particularly in noisy acoustic environments, due to impaired auditory sensitivity and phoneme recognition. This study investigates whether speech intelligibility can be accurately predicted by integrating phoneme-level error patterns in noisy conditions. Using the Clarity prediction challenge dataset, phoneme-level errors were computed based on the international phonetic alphabet transcriptions and quantified by word error rate (WER). Our key findings reveal that high-frequency fricatives (/f/,/Z/) and affricates (/dZ/), along with voiced phonemes (/g/,/O/), showed the highest average WER. This indicates their particular vulnerability to masking and the effects of highfrequency hearing loss. While higher SNR generally improves intelligibility, we observed a weak correlation (ρ ≈ 0.20), underscoring the critical role of individual hearing loss profiles. To further our analysis, we used the five most challenging phonemes and SNR as features to predict speech intelligibility with Random Forest and XGBoost models. This approach yielded slightly better prediction performance compared to the Hearing Aid Speech Perception Index.
APSIPA

Study on Signal Processing Techniques in Protecting Voice Personae Against Speech Synthesis Systems

Nopparut Li, Candy Olivia Mawalim, Masashi Unoki

The 17th Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA 2025), Shangri-la, Singapore, October 22-24, 2025.

Abstract PDF URL

Recent advancements in speech synthesis have enabled the generation of natural-sounding speech signals that can closely mimic specific speakers, raising serious concerns about the misuse of voice recordings for impersonation and fraud. Although researchers have extensively studied spoof speech detection, they can only implement these approaches once the spoofed content has been generated. We propose a method based on F0 component elimination and compare it with conventional filtering and artificial reverberation for the impact on the quality of synthesized speech signals generated using the TorToiSe TTS model, as well as the perceptual quality of the modified speech signals. Results show that the proposed method is able to reduce the identifiability of synthesized speech signals with minimal impact on speech quality, offering a promising direction for voice personae protection against speech synthesis systems.
HAI

Multimodal Classification of Co-speech Gesture Pragmatical function in Storytelling

Jinqian Zhang, Sixia Li, Candy Olivia Mawalim, Shogo Okada

The 13th International Conference on Human-Agent Interaction (HAI 2025), Yokohama, Japan, November 10-13, 2025.

Abstract

TBD
WASPAA

Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction

Xiajie Zhou, Candy Olivia Mawalim, Masashi Unoki

The IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA 2025), Granlibakken Tahoe, Tahoe City, CA, 12-15 October 2025.

Abstract PDF

The diverse perceptual consequences of hearing loss severely impede speech communication, but standard clinical audiometry, which is focused on threshold-based frequency sensitivity, does not adequately capture deficits in frequency and temporal resolution. To address this limitation, we propose a speech intelligibility prediction method that explicitly simulates auditory degradations according to hearing loss severity by broadening cochlear filters and applying low-pass modulation filtering to temporal envelopes. Speech signals are subsequently analyzed using the spectro-temporal modulation (STM) representations, which reflect how auditory resolution loss alters the underlying modulation structure. In addition, normalized cross-correlation (NCC) matrices quantify the similarity between the STM representations of clean speech and speech in noise. These auditory-informed features are utilized to train a Vision Transformer–based regression model that integrates the STM maps and NCC embeddings to estimate speech intelligibility scores. Evaluations on the Clarity Prediction Challenge corpus show that the proposed method outperforms the hearing-aid speech perception index (HASPI) in both mild and moderate-to-severe hearing loss groups, with a relative root mean squared error reduction of 16.5% for the mild group and a 6.1% reduction for the moderate-to-severe group. These results highlight the importance of explicitly modeling listener-specific frequency and temporal resolution degradations to improve speech intelligibility prediction and provide interpretability in auditory distortions.
Clarity

Integrating Linguistic and Acoustic Cues for Machine Learning-Based Speech Intelligibility Prediction in Hearing Impairment

Candy Olivia Mawalim, Xiajie Zhou, Huy Quoc Nguyen, Masashi Unoki

The 6th Clarity Workshop on Improving Speech-in-Noise for Hearing Devices (Clarity-2025), Delft, The Netherlands, 22nd August 2025.

Abstract PDF Poster

Speech intelligibility prediction for individuals with hearing loss is paramount for advancing hearing aid technology. Leveraging recent breakthroughs in ASR foundation models, particularly Whisper, we fine-tuned a Whisper model for speech intelligibility prediction. Our approach incorporates data augmentation using impulse responses from diverse everyday environments. This study investigates the effective integration of linguistic and acoustic cues to enhance the prediction of finetune ASR models, aiming to compensate for both hearing loss and information loss during signal downsampling. Our goal is to improve speech intelligibility prediction, especially in noisy conditions. Experiments demonstrate that integrating these cues is beneficial. Furthermore, employing a weighted average ensemble model, which balances predictions from left and right audio channels and considers both stable and unstable linguistic and acoustic cues, significantly improved prediction performance, reducing the RMSE by approximately 2 and enhancing the Pearson correlation coefficient (ρ) by around 0.05.
Clarity

Lightweight Speech Intelligibility Prediction with Spectro-Temporal Modulation for Hearing-Impaired Listeners

Xiajie Zhou, Candy Olivia Mawalim, Huy Quoc Nguyen, Masashi Unoki

The 6th Clarity Workshop on Improving Speech-in-Noise for Hearing Devices (Clarity-2025), Delft, The Netherlands, 22nd August 2025.

Abstract PDF

Hearing loss leads to reduced frequency resolution and impaired temporal resolution, making it difficult for listeners to distinguish similar sounds and perceive speech dynamics in noise. To capture these perceptual degradations, we employ spectrotemporal modulation (STM) analysis as the core feature representation. This study proposes a speech intelligibility prediction framework that uses STM representations as input to lightweight convolutional neural network (CNN) models. We design two models: STM-CNN-SE (E020a), which incorporates squeeze-and-excitation (SE) block, and STM-CNN-ECA (E020b), which uses efficient channel attention (ECA) block and richer input features. Compared to the HASPI, experiments on the CPC3 development dataset show that E020a and E020b reduce root-mean-square error (RMSE) by 11.2% and 12.6%, respectively. These results demonstrate the effectiveness of STM-based CNN architectures for speech intelligibility prediction under hearing loss conditions.
IH&MMSec

Robust Multilingual Audio Deepfake Detection Through Hybrid Modeling.

Candy Olivia Mawalim, Yutong Wang, Aulia Adila, Shogo Okada, Masashi Unoki

The 13th ACM Workshop on Information Hiding and Multimedia Security, San Jose, CA, June 18-20.

Abstract PDF URL

The increasing sophistication of AI-generated human voice poses a significant threat, demanding robust detection systems that can generalize effectively across diverse linguistic environments and synthesis techniques. In response to the SAFE Challenge, this paper introduces a novel approach to multilingual audio deepfake detection. Our primary contribution lies in the comprehensive study of deepfake detection using a multilingual speech corpus encompassing 17 languages and a broad spectrum of synthesis methods and acoustic conditions, designed to enable more realistic and challenging evaluations. To optimally utilize this diverse data, we propose a hybrid detection model that synergistically combines the strengths of end-to-end RawNet and AASIST architectures with language-agnostic representations learned from a multilingual self-supervised learning model. Additionally, we explore the efficacy of RawBoost data augmentation in enhancing robustness against real-world noise. Our experimental evaluation demonstrates promising generalization in generated audio detection, achieving approximately 73% balanced accuracy across multilingual data and unseen synthesis algorithms.
ATSIP

InaSAS: Benchmarking Indonesian Speech Antispoofing Systems

Candy Olivia Mawalim, Sarah Azka Arief, and Dessi Puji Lestari

APSIPA Transactions on Signal and Information Processing: Vol. 14: No. 3, e203.

Abstract DOI PDF

Voice-based biometric systems are vulnerable to spoofing attacks, where attackers can deceive the systems with synthetic or replayed voice samples. To address this vulnerability, we introduce the InaSpoof-v1 dataset, which is a comprehensive benchmark for Indonesian language spoofing detection. We evaluate the state-ofthe-art countermeasure models on this dataset, highlighting the challenges posed by the diversity of the Indonesian language and the impacts of demographic factors. Our experimental results demonstrate the effectiveness of the end-to-end AASIST model for synthesized speech attack countermeasures and residual networks (ResNet) for replay attack detection. To improve future systems, we emphasize the importance of considering demographic actors and addressing the challenges posed by real-world scenarios.
CEAI

Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents

Candy Olivia Mawalim, Chee Wee Leong, Guy Sivan, Hung-Hsuan Huang, and Shogo Okada.

Computers and Education: Artificial Intelligence, Vol. 8, June 2025, 100386.

Abstract DOI PDF

This study introduces a novel method for explainable speaking skill assessment that utilizes a unique dataset featuring video recordings of conversational interviews for high-stakes outcomes (i.e., admission to high schools and universities). Unlike traditional automated speaking assessments that prioritize accuracy at the expense of interpretability, our approach employs a new multimodal dataset that integrates acoustic and linguistic features, visual cues, turn-taking patterns, and expert-derived scores quantifying various speaking skill aspects observed during interviews with young adolescents. This dataset is distinguished by its open-ended question format, which allows for varied responses from interviewees, providing a rich basis for analysis. The experimental results demonstrate that fusing interpretable features, including prosody, action units, and turn-taking, significantly enhances the accuracy of spoken English skill prediction, achieving an overall accuracy of 83% when a machine learning model based on the light gradient boosting algorithm is used. Furthermore, this research underscores the significant influence of external factors, such as interviewer behavior and the interview setting, particularly on the coherence aspect of spoken English proficiency. This focus on an innovative dataset and interpretable assessment tools offers a more nuanced understanding of speaking skills in high-stakes contexts than that offered by previous studies.
MTI

Influence of Personality Traits and Demographics on Rapport Recognition Using Adversarial Learning

Wenqing Wei, Sixia Li, Candy Olivia Mawalim, Xiguang Li, Kazunori Komatani, and Shogo Okada.

Multimodal Technol. Interact. 2025, 9(3), 18.

Abstract DOI PDF

The automatic recognition of user rapport at the dialogue level for multimodal dialogue systems (MDSs) is a critical component of effective dialogue system management. Both the dialogue systems and their evaluations need to be based on user expressions. Numerous studies have demonstrated that user personalities and demographic data such as age and gender significantly affect user expression. Neglecting users’ personalities and demographic data will result in less accurate user expression and rapport recognition. To the best of our knowledge, no existing studies have considered the effects of users’ personalities and demographic data on the automatic recognition of user rapport in MDSs. To analyze the influence of users’ personalities and demographic data on dialogue level user rapport recognition, we first used a Hazummi dataset which is an online dataset containing users’ personal information (personality, age, and gender information). Based on this dataset, we analyzed the relationship between user rapport in dialogue systems and users’ traits, finding that gender and age significantly influence the recognition of user rapport. These factors could potentially introduce biases into the model. To mitigate the impact of users’ traits, we introduced an adversarial-based model. Experimental results showed a significant improvement in user rapport recognition compared to models that do not account for users’ traits. To validate our multimodal modeling approach, we compared it to human perception and instruction-based Large Language Models (LLMs). The results showed that our model outperforms that of human and instruction-based LLM models.
IEEE Access

Speech Intelligibility Prediction Using Binaural Processing for Hearing Loss

Xiajie Zhou, Candy Olivia Mawalim, and Masashi Unoki.

IEEE Access, Jan 2025.

Abstract DOI PDF

As the global issue of hearing loss becomes increasingly severe, developing effective speech intelligibility prediction methods is crucial for improving the performance of hearing aids. However, current methods struggle in noisy environments and overlook individual differences in hearing loss between ears, which impacts prediction accuracy. Therefore, this study proposes a non-intrusive speech intelligibility prediction method that incorporates the binaural processing for hearing loss. The proposed method simulates the multi-stage binaural processing of the outer, middle, and inner ear and integrates binaural cues through an equalization-cancellation model to mitigate masking effects in noisy environments. Key features extracted from speech signals serve as inputs for a hybrid speech intelligibility model combining long short-term memory (LSTM) and light gradient boosting machine (LightGBM) models. The proposed method captures the critical features of speech signals, especially in challenging environments and for different types of hearing loss. Experimental results show that, compared to the baseline system of the second Clarity Prediction Challenge (CPC2) dataset, the proposed method achieves an 8.3% reduction in root mean squared error (RMSE). Notably, the proposed method reduces RMSE by 12.8% when predicting inconsistent hearing loss compared to listeners with consistent hearing levels, confirming the potential of combining hearing loss modeling with binaural processing.
NAACL

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines.

G. Winata, F. Hudi, P. A. Irawan, D. Anugraha, R. A. Putri, Y. Wang, A. Nohejl, U. Prathama, N. Ousidhoum, A. Amriani, A. Rzayev, A. Das, A. Pramodya, A. Adila, B. Willie, C. O. Mawalim, C. C. Lam, D. Abolade, E. Chersoni, E. Santus, et al.

The 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics (NAACL 2025), Albuquerque, New Mexico, pp. 3242-3264.

Abstract ArXiv PDF URL

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over \emph{1 million data points}, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release a knowledge base with annotated food entries and images along with the VQA data.
ICASSP

Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems.

Candy Olivia Mawalim, Aulia Adila, and Masashi Unoki.

The 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2025), Hyderabad, India.

Abstract PDF DOI

Speaker anonymization techniques are crucial for safeguarding user privacy in voice-based applications. However, these methods are susceptible to adversarial attacks that can compromise their effectiveness. This paper proposes attacker systems that leverage the power of fine-tuned TitaNet-Large and ECAPA-TDNN models to identify the original speaker from anonymized speech generated by various anonymization methods. Both pre-trained models are renowned for their stateof-the-art ability to extract robust speaker embeddings. Finetuning these models with anonymized speech enables them to identify underlying patterns in anonymized speech. We evaluated the proposed attacker systems against multiple anonymization techniques that performed effectively in a series of voice privacy challenges. Our experimental results underscore the effectiveness of the fine-tuned TitaNet-Large model in breaking through these anonymization methods, as indicated by the reduced equal error rate (EER). This highlights the importance of robust and adaptive anonymization strategies to counter such emerging semiinformed threats.
SEALP

Indonesian Speech Content De-Identification in Low Resource Transcripts.

Rifqi Naufal Abdjul, Dessi Puji Lestari, Ayu Purwarianti, Candy Olivia Mawalim, Sakriani Sakti, and Masashi Unoki.

The Second Workshop in South East Asian Language Processing, Co-located with COLING 2025, Abu Dhabi (Online), January 19-24, 2025.

Abstract PDF URL

Advancements in technology and the increased use of digital data threaten individual privacy, especially in speech containing Personally Identifiable Information (PII). Therefore, systems that can remove or process privacy-sensitive data in speech are needed, particularly for low-resource transcripts. These transcripts are minimally annotated or labeled automatically, which is less precise than human annotation. However, using them can simplify the development of de-identification systems in any language. In this study, we develop and evaluate an efficient speech de-identification system. We create an Indonesian speech dataset containing sensitive private information and design a system with three main components: speech recognition, information extraction, and masking. To enhance performance in low-resource settings, we incorporate transcription data in training, use data augmentation, and apply weakly supervised learning. Our results show that our techniques significantly improve privacy detection performance, with approximately 29% increase in F1 score, 20% in precision, and 30% in recall with minimally labeled data.

2024

JAES

Forensic speech enhancement: Toward Reliable Handling of Poor-Quality Speech Recordings Used as Evidence in Criminal Trials

Helen Fraser, Vincent Aubanel, Robert C. Maher, Candy Olivia Mawalim, Xin Wang, Peter Počta, Emma Keith, Gérard Chollet, and Karla Pizzi.

Journal of the Audio Engineering Society, c., vol. 72, no. 11, pp. 748–753, Nov. 2024.

Abstract DOI PDF

This paper proposes an innovative interdisciplinary approach to evaluating the effectiveness of forensic speech enhancement (FSE). FSE faces unique challenges arising from a range of factors, from poor recording quality, highly variable conditions from case to case, and content uncertainty. Despite these difficulties, FSE is commonly admitted in court, and can significantly influence the outcome of criminal trials. Current FSE practices are hindered by unrealistic expectations from courts, which often assume that enhanced audio inherently clarifies content. In fact, FSE can have the undesired opposite effect, potentially resulting in unfair prejudice, when, for example, it increases the credibility of a misleading transcript. The proposed interdisciplinary project advocates for a better consideration of speech perception factors, particularly those related to transcription. It aims to bridge the gap between FSE and forensic transcription by promoting a combined approach to enhancing and accurately transcribing forensic audio. By developing a position statement on FSE capabilities, the project seeks to establish realistic standards and foster collaboration among researchers and practitioners. This effort aims to ensure reliable, accountable forensic audio evidence, aligning with forensic science standards and improving the effectiveness of the justice system.
IEEE Access

MAG-BERT-ARL for Fair Automated Video Interview Assessment.

Bimasena Putra, Kurniawati Azizah, Candy Olivia Mawalim, Ikhlasul Akmal Hanif, Sakriani Sakti, Chee Wee Leong, and Shogo Okada.

IEEE Access, 2024.

Abstract DOI PDF

Potential biases within automated video interview assessment algorithms may disadvantage specific demographics due to the collection of sensitive attributes, which are regulated by the General Data Protection Regulation (GDPR). To mitigate these fairness concerns, this research introduces MAG-BERT-ARL, an automated video interview assessment system that eliminates reliance on sensitive attributes. MAG-BERT-ARL integrates Multimodal Adaptation Gate and Bidirectional Encoder Representations from Transformers (MAG-BERT) model with the Adversarially Reweighted Learning (ARL). This integration aims to improve the performance of underrepresented groups by promoting Rawlsian Max-Min Fairness. Through experiments on the Educational Testing Service (ETS) and First Impressions (FI) datasets, the proposed method demonstrates its effectiveness in optimizing model performance (increasing Pearson correlation coefficient up to 0.17 in the FI dataset and precision up to 0.39 in the ETS dataset) and fairness (reducing equal accuracy up to 0.11 in the ETS dataset). The findings underscore the significance of integrating fairness-enhancing techniques like ARL and highlight the impact of incorporating nonverbal cues on hiring decisions.
APSIPA

Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features.

Malik Akbar Hashemi Rafsanjani, Candy Olivia Mawalim, Dessi Puji Lestari, Sakriani Sakti, and Masashi Unoki

The 16th annual conference organized by Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024.

Abstract PDF DOI

Anomalous sound detection (ASD) crucially prevents industrial accidents by distinguishing normal and abnormal machine sounds. Previous research utilizing timbral and short-term features attained a notable F1 score of 0.920. However, relying solely on supervised learning models is impractical due to the difficulty of acquiring anomaly data. This study focuses on developing an unsupervised learning model for ASD, emphasizing prominent timbral features. We also investigate human voice disorder (HVD)-related features, which are potentially linked to human perception of anomalous sounds in machines. We conducted a comparative analysis using 5-fold cross-validation to evaluate our proposed method, with the area under receiver operating characteristic (ROC AUC) as the metric. The proposed ASD method using timbral and HVD-related features significantly improved AUC by 10.87% compared to the baseline system in the DCASE Challenge 2020.
APSIPA

Anomalous Sound Detection Based on Time Domain Gammatone Filterbank and IDNN Model.

Primanda Adyatma Hafiz, Candy Olivia Mawalim, Dessi Puji Lestari, Sakriani Sakti, and Masashi Unoki

The 16th annual conference organized by Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024.

Abstract PDF DOI

Anomalous sound detection (ASD) systems distinguish normal and abnormal machinery conditions on the basis of sound. While most ASD systems rely on the log Mel spectrogram, it lacks sufficient frequency resolution and performs suboptimally for rapidly changing sounds. Alternatively, the Gammatone spectrogram, extracted using a time domain Gammatone filterbank, offers enhanced spectrogram resolution. This study investigates the effectiveness of the time domain Gammatone spectrogram for ASD. To optimally learn the time domain Gammatone spectrogram features, an Interpolation Deep Neural Network (IDNN) model was proposed as the detection model. This model detects nonstationary sound frames highly reliably. An evaluation was conducted using MIMII dataset with area under receiver operating characteristic curve (ROC AUC) as the metric. Experimental results showed that our proposed method achieved ROC AUC of 92.5%, outperforming the log Mel spectrogram feature by 5.9 percentage points.
APSIPA

Detecting Spoof Voices in Asian Non-Native Speech: An Indonesian and Thai Case Study.

Aulia Adila, Candy Olivia Mawalim, and Masashi Unoki

The 16th annual conference organized by Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024.

Abstract PDF DOI

This study focuses on building effective spoofing countermeasures (CMs) for non-native speech, specifically targeting Indonesian and Thai speakers. We constructed a dataset comprising both native and non-native speech to facilitate our research. Three key features—MFCC, LFCC, and CQCC—were extracted from the speech data, and three classic machine learning-based classifiers—CatBoost, XGBoost, and GMM—were employed to develop robust spoofing detection systems using the native and combined (native and non-native) speech data. This resulted in two types of CMs: Native and Combined. The performance of these CMs was evaluated on both native and nonnative speech datasets. Our findings reveal significant challenges faced by Native CM in handling non-native speech, highlighting the necessity for domain-specific solutions. The proposed method shows improved detection capabilities, demonstrating the importance of incorporating non-native speech data into the training process. This work lays the foundation for more effective spoofing detection systems in diverse linguistic contexts.
O-COCOSDA

Analysis of Pathological Features for Spoof Detection.

Myat Aye Aye Aung, Hay Mar Soe Naing, Aye Mya Hlaing, Win Pa Pa, Kasorn Galajit, and Candy Olivia Mawalim

The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024.

Abstract PDF DOI

Deepfake speech, an advanced use of speech synthesis technology, presents a considerable challenge due to its highly realistic sound and the complexities involved in detecting it. The selection and analysis of effective features are essential for enhancing spoof detection capabilities. This study focuses on analyzing pathological features within the Myanmar Spoof Dataset. Spoofed speech in the dataset is created using five distinct techniques: vocoder methods with HiFiGAN and Parallel WaveGAN, pre-trained voice conversion via FreeVC, and GMM-based and Differential GMM-based voice conversion (GMMVC_DIFF) methods. In this paper, we perform a comparative analysis of pathological features, including Harmonics-to-Noise Ratio (HNR), six types of jitter, and seven shimmer features. Additionally, Cepstral Peak Prominence (CPP) features were assessed under both voiced and unvoiced speech conditions. The analysis demonstrates that these features show substantial variations across different spoofing techniques. Specifically, the voice conversion methods GMMVC_DIFF show notable differences in these features. These results highlight the pivotal role of these pathological features in enhancing the precision of future spoof detection systems.
O-COCOSDA

UCSYSpoof: A Myanmar Language Dataset for Voice Spoofing Detection.

Hay Mar Soe Naing, Win Pa Pa, Aye Mya Hlaing, Myat Aye Aye Aung, Kasorn Galajit, and Candy Olivia Mawalim

The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024.

Abstract PDF DOI

Automatic Speaker Verification (ASV) is widely used in voice-based security mechanisms. It involves accepting or rejecting a person identification based on the individual's voice, a unique biometric feature. However, it faces many challenges and is vulnerable to direct or indirect attacks. Spoof voice detection is also an important component in secure voice authentication systems. Unfortunately, there is no spoof detection system using Myanmar language dataset. Spoof detection systems are important for many languages, including Myanmar, as they prevent fraud and misinformation, maintain trust, cultural and linguistic relevance, etc. Therefore, this paper proposes a Myanmar spoof voice dataset called UCSYSpoof, which contains both real and spoofed speech signals. End - to-end speech synthesis, vocoder-based speech reconstruction, and voice conversion techniques were used to generate the spoof speech based on 12,000 genuine speech signals. To demonstrate the impact of proposed dataset, a simple spoof detection model is implemented using long short-term memory (LSTM) and convolutional neural network (CNN) classifiers with linear frequency cepstral coefficients (LFCC) and Mel frequency cepstral coefficients (MFCC) features. Based on the empirical results, using CNN with LFCC and MFCC features achieves the comparable results on proposed dataset. The results show that the detection model has Fl-score of 0.99 and an equal error rate (EER) of 0.004, respectively.
ICAICTA

Indonesian Speech Anti-Spoofing System: Data Creation and CNN Models.

Sarah Azka Arief, Candy Olivia Mawalim, and Dessi Puji Lestari

The 11th International Conference on Advanced Informatics: Concepts, Theory, and Applications (ICAICTA 2024), Singapore, 28-30 Sept 2024.

Abstract PDF DOI

Biometric systems are prone to spoofing attacks. While research in speech anti-spoofing has been progressing, there is a limited availability of diverse language datasets. This study aims to bridge this gap by developing an Indonesian spoofed speech dataset, which includes replay attacks, text-to- speech, and voice conversion. This dataset forms the foundation for creating an Indonesian speech anti-spoofing system. Subsequently, light convolutional neural network (LCNN) and residual network (ResNet) models, based on convolutional neural networks (CNN), were developed to evaluate the dataset. The input features used are linear frequency cepstral coefficients (LFCC). Both models demonstrate remarkably low minDCF and EER scores approaching zero. The results also exhibit exceptional scores with 4-fold cross validation, showing strong initial performance with no signs of overfitting. However, models trained solely on Common Voice or Prosa.ai datasets performed poorly in cross-source tests, suggesting generalization issues due to a lack of diversity in the dataset. This highlights the need for further improvement and continued research in Indonesian speech spoof detection.
ICMI

Do We Need to Watch It All? Efficient Job Interview Video Processing with Differentiable Masking.

Hung Le, Sixia Li, Candy Olivia Mawalim, Hung-Hsuan Huang, Chee Wee Leong, and Shogo Okada

The 26th International Conference on Multimodal Interaction (ICMI 2024), San José, Costa Rica, 4-8 Nov 2024.

Abstract PDF DOI

With technological advancements in transmitting and storing large video files, more and more organizations are incorporating asynchronous video interviews as part of their personnel selection process. Automatic evaluation of these videos is a challenging machine learning setting because the samples are composed of time series input data but only one overall label is available. It is unclear which segments of the time series input (i.e., videos) are the most important ones for prediction. Not all nonverbal features, spoken words, and utterances contribute equally to the prediction; some segments of the videos might even introduce noise to the model. Processing all multimodal information is therefore inefficient. To address this challenge, we propose a framework to model the content of the answer via the full transcription and the speaking patterns of the interviewee via short clips. Our model learns to automatically select the most informative segment by previewing the acoustic modality using a technique called differentiable masking. The results show that our method outperforms existing approaches while being more efficient since only partial multimodal data are processed, and the interpretability of the model is enhanced.
ACII

Incremental Multimodal Sentiment Analysis on HAI Based on Multitask Active Learning with Inter-Annotator Agreement.

Thus Karnjanapatchara, Sixia Li, Candy Olivia Mawalim, Kazunori Komatani, and Shogo Okada

The 12th International Conference on Affective Computing and Intelligent Interaction (ACII 2024), Glasglow, Scotland, UK, 15-18 September 2024.

Abstract PDF DOI

Multimodal sentiment analysis (MSA) is critical in developing empathetic and adaptive multimodal dialogue systems or conversational agents that can naturally interact with users by recognizing sentiment and engagement. Addressing the challenges to collect labeled data for MSA in human agent interaction (HAI), this study introduces an innovative approach combining active learning and multitask learning. Our efficient sentiment recognition model leverages active learning to select informative data for learning models, significantly reducing the labor-intensive data labeling process. Furthermore, we employ multitask learning to improve annotation (label) quality by evaluating alignment with true labels and inter-annotator agreement, thus enhancing the reliability of sentiment annotations. We evaluate the proposed multitask and active learning method using a human-agent multimodal dialogue dataset including various kinds of sentiment annotations, which are publicly available. The experimental results demonstrate that by learning to predict the agreement score, multitask learning becomes better than single-task learning at capturing the nuances and uncertainties in the data. This study lays the groundwork for incremental learning strategies in MSA, aiming to adaptively understand user sentiments in human-agent interactions.
RISP

Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.

Aulia Adila, Candy Olivia Mawalim, Takuto Isoyama, and Masashi Unoki.

Journal of Signal Processing, Research Institute of Signal Processing, 2024, Volume 28 Issue 6 Pages 309-313.

Abstract DOI PDF

A reliable speech watermarking technique must balance satisfying four requirements: inaudibility, robustness, blind detectability, and confidentiality. A previous study proposed a speech watermarking technique based on direct spread spectrum (DSS) using a linear prediction (LP) scheme, i.e., LP-DSS, that could simultaneously satisfy these four requirements. However, an inaudibility issue was found due to the incorporation of a blind detection scheme with frame synchronization. In this paper, we investigate the feasibility of utilizing a psychoacoustical model, which simulates auditory masking, to control the suitable embedding level of the watermark signal to resolve the inaudibility issue in the LP-DSS scheme. Evaluation results confirmed that controlling the embedding level with the psychoacoustical model, with a constant scaling factor setting, could balance the trade-off between inaudibility and detection ability with a payload up to 64 bps.
Interspeech

Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments?

Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

The 25th Interspeech Conference, Kos Island, Greece, 1-5 September 2024.

Abstract PDF DOI Code Demo

Recent advancements in speech enhancement techniques have ignited interest in improving speech quality and intelligibility. However, the effectiveness of recently proposed methods is unclear. In this paper, a comprehensive analysis of modern deep learning-based speech enhancement approaches is presented. Through evaluations using the Deep Suppression Noise and Clarity Enhancement Challenge datasets, we assess the performances of three methods: Denoiser, DeepFilterNet3, and FullSubNet+. Our findings reveal nuanced performance differences among these methods, with varying efficacy across datasets. While objective metrics offer valuable insights, they struggle to represent complex scenarios with multiple noise sources. Leveraging ASR-based methods for these scenarios shows promise but may induce critical hallucination effects. Our study emphasizes the need for ongoing research to refine techniques for diverse real-world environments.
IEAI

Exploring a Cutting-Edge Convolutional Neural Network for Speech Emotion Recognition

Navod Neranjan Thilakarathne, Kasorn Galajit, Candy Olivia Mawalim, and Hayati Yassin.

The 5th International conference on Industrial Engineering and Artificial Intelligence (IEAI 2024), Chulalongkorn University, Bangkok, Thailand, 24-26 April 2024.

Abstract PDF

In light of the ongoing expansion of humancomputer interaction, advancements in the comprehension and interpretation of human emotions are of the utmost importance. SER, representing speech emotion recognition, is a critical element in this context as it enables computational systems to comprehend the emotions of humans. Throughout the years, SER has employed a variety of techniques, including well-established speech analysis and classification methods. However, in recent years, techniques powered by deep learning have been suggested as a viable substitute for conventional SER methods, owing to their more encouraging outcomes in comparison to the aforementioned methods. In this regard, by utilizing a novel Convolutional Neural Network (CNN) model designed to assess and categorize seven emotional states based on speech signals retrieved from three distinct databases, this research presents a novel approach to SER that yields 88.76% accuracy. The purpose of this research is to enhance the accuracy and efficiency of emotion identification, with the end goal of boosting applications in fields such as interactive voice response systems, mental health monitoring, and personalized digital assistants.
NCSP

Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.

Aulia Adila, Candy Olivia Mawalim, Isoyama Takuto, and Masashi Unoki.

The 2024 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing, Hawaii, 27 February - 1 March 2024.

Abstract PDF

A reliable speech watermarking technique must balance satisfying the four requirements: inaudibility, robustness, blind-detectability, and confidentiality. The previous study proposed an LP-DSS scheme that could simultaneously satisfy these four requirements. However, the inaudibility issue happened due to the blind detection scheme with frame synchronization. In this paper, we investigate the feasibility of utilizing a psychoacoustical model to control the suitable embedding level of the watermark signal to resolve the inaudibility issue that arises in the LP-DSS scheme. A psychoacoustical model simulates the auditory masking phenomenon that “mask” signals below the masking curve to be imperceptible to human ears. Results of the evaluation confirmed that the controlled embedding level from the psychoacoustical model balanced the trade-off between inaudibility and detection ability with payload up to 64 bps.

2023

iSAI-NLP

ThaiSpoof: A Database for Spoof Detection in Thai Language.

Kasorn Galajit, Thunpisit Kosolsriwiwat, Candy Olivia Mawalim, Pakinee Aimmanee, Waree Kongprawechnon, Win Pa Pa, Anuwat Chaiwongyen, Teeradaj Racharak, Hayati Yassin, Jessada Karnjana, Surasak Boonkla, and Masashi Unoki.

The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

Abstract DOI IEEE PDF

Many applications and security systems have widely applied automatic speaker verification (ASV). However, these systems are vulnerable to various direct and indirect access attacks, which weakens their authentication capability. The research in spoofed speech detection contributes to enhancing these systems. However, the study in spoofing detection is limited to only some languages due to the need for various datasets. This paper focuses on a Thai language dataset for spoof detection. The dataset consists of genuine speech signals and various types of spoofed speech signals. The spoofed speech dataset is generated using text-to-speech tools for the Thai language, synthesis tools, and tools for speech modification. To showcase the utilization of this dataset, we implement a simple model based on a convolutional neural network (CNN) taking linear frequency cepstral coefficients (LFCC) as its input. We trained, validated, and tested the model on our dataset referred to as ThaiSpoof. The experimental result shows that the accuracy of model is 95%, and equal error rate (EER) is 4.67%. The result shows that our ThaiSpoof dataset has the potential to develop for helping in spoof detection studies.
iSAI-NLP

Voice Contribution on LFCC feature and ResNet-34 for Spoof Detection.

Khaing Zar Mon, Kasorn Galajit, Candy Olivia Mawalim, Jessada Karnjana, Tsuyoshi Isshiki, and Pakinee Aimmanee.

The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

Abstract DOI IEEE PDF

Recently, biometric authentication has been significant advancement, particularly in speaker verification. While there have been significant advancements in this technology, compelling evidence highlights the continued vulnerability of this technology to malicious spoofing attacks. This vulnerability calls for developing specialized countermeasures to identify various attack types. This paper specifically focuses on detecting replay, speech synthesis, and voice conversion attacks. As our spoof detection strategy’s front-end feature extraction method, we used linear frequency cepstral coefficients (LFCC). We utilized ResNet-34 to classify between genuine and fake speech. By integrating LFCC with ResNet-34, We evaluated the proposed method using the ASVspoof 2019 dataset, PA (Physical Access), and LA (Logical Access). In our approach, we compare the use of the entire utterance for feature extraction in both PA and LA datasets with an alternative method that involves extracting features from a specific percentage of the voice segment within the utterance for classification. In addition, we conducted a comprehensive evaluation by comparing our proposed method with the established baseline techniques, LFCC-GMM and CQCC-GMM. The proposed method demonstrates promising performance, achieving equal error rate (EER) of 3.11% and 3.49% for the development and evaluation datasets, respectively, in PA attacks. In LA attacks evaluation, the proposed method performs EER of 0.16% and 6.89% for the development and evaluation datasets, respectively. The proposed method shows promising results in identifying spoof attacks for both PA and LA attacks.
APSIPA

Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection.

Haowei Cheng, Candy Olivia Mawalim, Kai Li, Lijun Wang, and Masashi Unoki.

The 15th Asia-Pasific Signal and Information Processing Association (APSIPA ASC 2023), Taipei, Taiwan, 31 October - 3 November 2023.

Abstract DOI IEEE PDF

Deep-fake speech detection aims to develop effective techniques for identifying fake speech generated using advanced deep-learning methods. It can reduce the negative impact of malicious production or dissemination of fake speech in real-life scenarios. Although humans can relatively easy to distinguish between genuine and fake speech due to human auditory mechanisms, it is difficult for machines to distinguish them correctly. One major reason for this challenge is that machines struggle to effectively separate speech content from human vocal system information. Common features used in speech processing face difficulties in handling this issue, hindering the neural network from learning the discriminative differences between genuine and fake speech. To address this issue, we investigated spectro-temporal modulation representations in genuine and fake speech, which simulate the human auditory perception process. Next, the spectro-temporal modulation was fit to a light convolutional neural network bidirectional long short-term memory for classification. We conducted experiments on the benchmark datasets of the Automatic Speaker Verification and Spoofing Countermeasures Challenge 2019 (ASVspoof2019) and the Audio Deep synthesis Detection Challenge 2023 (ADD2023), achieving an equal-error rate of 8.33\% and 42.10\%, respectively. The results showed that spectro-temporal modulation representations could distinguish the genuine and deep-fake speech and have adequate performance in both datasets.
APSIPA

Incorporating the Digit Triplet Test in A Lightweight Speech Intelligibility Prediction for Hearing Aids.

Xiajie Zhou, Candy Olivia Mawalim, and Masashi Unoki.

The 15th Asia-Pasific Signal and Information Processing Association (APSIPA ASC 2023), Taipei, Taiwan, 31 October - 3 November 2023.

Abstract DOI IEEE PDF

Recent studies in speech processing often utilize sophisticated methods for solving a task to obtain high-accuracy results. Although high performance could be achieved, the methods are too complex and require high-performance computational power that might not be available for a wide range of researchers. In this study, we propose a method to incorporate the low dimensional and the recent state-of-the-art acoustic features for speech processing to predict the speech intelligibility in noise for hearing aids. The proposed method was developed based on the stack regressor on various traditional machine learning regressors. Unlike other existing works, we utilized the results of the digit triplet test, which is usually used to measure the hearing ability in the existence of noise, to improve the prediction. The evaluation of our proposed method was carried out by using the first Clarity Prediction Challenge dataset. This dataset is utilized for speech intelligibility prediction that consists of speech signals output of hearing aids that were arranged in various simulated scenes with interferers. Our experimental results show that the proposed method could improve speech intelligibility prediction. The results also show that the digit triplet test results are beneficial for speech intelligibility prediction in noise.
APAC

Non-Intrusive Speech Intelligibility Prediction Using an Auditory Periphery Model with Hearing Loss.

Candy Olivia Mawalim, Benita Angela Titalim, Shogo Okada, and Masashi Unoki.

Applied Acoustics, 2023.

Abstract DOI PDF

Speech intelligibility prediction methods are necessary for hearing aid development. However, many such prediction methods are categorized as intrusive metrics because they require reference speech as input, which is often unavailable in real-world situations. Additionally, the processing techniques in hearing aids may cause temporal or frequency shifts, which degrade the accuracy of intrusive speech intelligibility metrics. This paper proposes a non-intrusive auditory model for predicting speech intelligibility under hearing loss conditions. The proposed method requires binaural signals from hearing aids and audiograms representing the hearing conditions of hearing-impaired listeners. It also includes additional acoustic features to improve the method’s robustness in noisy and reverberant environments. A two-dimensional convolutional neural network with neural decision forests is used to construct a speech intelligibility prediction model. An evaluation conducted with the first Clarity Prediction Challenge dataset shows that the proposed method performs better than the baseline system.
IEEE Access

A Ranking Model for Evaluation of Conversation Partners Based on Rapport Levels.

Takato Hayashi, Candy Olivia Mawalim, Ryo Ishii, Akira Morikawa, Takao Nakamura, and Shogo Okada.

IEEE Access, 2023.

Abstract DOI PDF

Our proposed ranking model ranks conversation partners based on self-reported rapport levels for each participant. The model is important for tasks that recommend interaction partners based on user rapport built in past interactions, such as matchmaking between a student and a teacher in one-to-one online language classes. To rank conversation partners, we can use a regression model that predicts rapport ratings. It is, however, challenging to learn the mapping from the participants' behavior to their associated rapport ratings because a subjective scale for rapport ratings may vary across different participants. Hence, we propose a ranking model trained via preference learning (PL). The model avoids the subjective scale bias because the model is trained to predict ordinal relations between two conversation partners based on rapport ratings reported by the same participant. The input of the model is multimodal (acoustic and linguistic) features extracted from two participants' behaviors in an interaction. Since there is no publicly available dataset for validating the ranking model, we created a new dataset composed of online dyadic (person-to-person) interactions between a participant and several different conversation partners. We compare the ranking model trained via preference learning with the regression model by using evaluation metrics for the ranking. The experimental results show that preference learning is a more suitable approach for ranking conversation partners. Furthermore, we investigate the effect of each modality and the different stages of rapport development on the ranking performance.
EUSIPCO

Auditory Model Optimization with Wavegram-CNN and Acoustic Parameter Models for Nonintrusive Speech Intelligibility Prediction in Hearing Aids.

Candy Olivia Mawalim, Benita Angela Titalim, Shogo Okada, and Masashi Unoki.

The 31st European Signal Processing Conference (EUSIPCO 2023), Helsinki, Finland.

Abstract DOI PDF

Nonintrusive speech intelligibility (SI) prediction is essential for evaluating many speech technology applications, including hearing aid development. In this study, several factors related to hearing perception are investigated to predict SI. In the proposed method, we integrated a physiological auditory model from two ears (binaural EarModel), wavegram-CNN model and acoustic parameter model. The refined EarModel does not require clean speech as input (blind method). In EarModel, the perception caused by hearing loss is simulated based on audiograms. Meanwhile, the wavegram-CNN and acoustic parameter models represent the factors related to the speech spectrum and acoustics, respectively. The proposed method is evaluated based on the scenario from the 1st Clarity Prediction Challenge (CPC1). The results show that the proposed method outperforms the intrusive baseline MBSTOI and HASPI methods in terms of the Pearson coefficient (ρ), RMSE, and R2 score in both closed-set and open-set tracks. Based on the results from listener-wise evaluation results, the average $\rho$ could be improved by more than 0.3 using the proposed method.
LNCS

Inter-person Intra-modality Attention Based Model for Dyadic Interaction Engagement Prediction.

Xiguang Li, Shogo Okada, and Candy Olivia Mawalim.

The 25th HCI International Conference, HCII 2023, Copenhagen, Denmark, July 23–28, 2023.

Abstract DOI PDF

With the rapid development of artificial agents, more researchers have explored the importance of user engagement level prediction. Real-time user engagement level prediction assists the agent in properly adjusting its policy for the interaction. However, the existing engagement modeling lacks the element of interpersonal synchrony, a temporal behavior alignment closely related to the engagement level. Part of this is because the synchrony phenomenon is complex and hard to delimit. With this background, we aim to develop a model suitable for temporal interpersonal features with the help of the modern data-driven machine learning method. Based on previous studies, we select multiple non-verbal modalities of dyadic interactions as predictive features and design a multi-stream attention model to capture the interpersonal temporal relationship of each modality. Furthermore, we experiment with two additional embedding schemas according to the synchrony definitions in psychology. Finally, we compare our model with a conventional structure that emphasizes the multimodal features within an individual. Our experiments showed the effectiveness of the intra-modal inter-person design in engagement prediction. However, the attempt to manipulate the embeddings failed to improve the performance. In the end, we discuss the experiment result and elaborate on the limitations of our work.
LNCS

Investigating the Effect of Linguistic Features on Personality and Job Performance Predictions.

Hung Le, Sixia Li, Candy Olivia Mawalim, Hung-Hsuan Huang, Chee Wee Leong, and Shogo Okada.

The 25th HCI International Conference, HCII 2023, Copenhagen, Denmark, July 23–28, 2023.

Abstract DOI PDF

Personality traits are known to have a high correlation with job performance. On the other hand, there is a strong relationship between language and personality. In this paper, we presented a neural network model for inferring personality and hirability. Our model was trained only from linguistic features but achieved good results by incorporating transfer learning and multi-task learning techniques. The model improved the F1 score 5.6% point on the Hiring Recommendation label compared to previous work. The effect of different Automatic Speech Recognition systems on the performance of the models was also shown and discussed. Lastly, our analysis suggested that the model makes better judgments about hirability scores when the personality traits information is not absent.
JMUI

Personality Trait Estimation in Group Discussions using Multimodal Analysis and Speaker Embedding.

Candy Olivia Mawalim, Shogo Okada, Yukiko I. Nakano, and Masashi Unoki.

Journal on Multimodal User Interfaces, 2023.

Abstract DOI PDF

The automatic estimation of personality traits is essential for many human-computer interface (HCI) applications. This paper focused on improving Big Five personality trait estimation in group discussions via multimodal analysis and transfer learning with the state-of-the-art speaker individuality feature, namely, the identity vector (i-vector) speaker embedding. The experiments were carried out by investigating the effective and robust multimodal features for estimation with two group discussion datasets, i.e., the Multimodal Task-Oriented Group Discussion (MATRICS) (in Japanese) and Emergent Leadership (ELEA) (in European languages) corpora. Subsequently, the evaluation was conducted by using leave-one-person-out cross-validation (LOPCV) and ablation tests to compare the effectiveness of each modality. The overall results showed that the speaker-dependent features, e.g., the i-vector, effectively improved the prediction accuracy of Big Five personality trait estimation. In addition, the experimental results showed that audio-related features were the most prominent features in both corpora.

2022

ACM-MUM

Multimodal Analysis for Communication Skill and Self-Efficacy Level Estimation in Job Interview Scenario.

Tomoya Ohba*, Candy Olivia Mawalim*, Shun Katada, Haruki Kuroki, and Shogo Okada.

The 21st International Conference on Mobile and Ubiquitous Multimedia (ACM MUM 2022), Lisbon, Portugal, 27--30 November 2022.

Abstract DOI PDF

An interview for a job recruiting process requires applicants to demonstrate their communication skills. Interviewees sometimes become nervous about the interview because interviewees themselves do not know their assessed score. This study investigates the relationship between the communication skill (CS) and the self-efficacy level (SE) of interviewees through multimodal modeling. We also clarify the difference between effective features in the prediction of CS and SE labels. For this purpose, we collect a novel multimodal job interview data corpus by using a job interview agent system where users experience the interview using a virtual reality head-mounted display (VR-HMD). The data corpus includes annotations of CS by third-party experts and SE annotations by the interviewees. The data corpus also includes various kinds of multimodal data, including audio, biological (i.e., physiological), gaze, and language data. We present two types of regression models, linear regression and sequential-based regression models, to predict CS, SE, and the gap (GA) between skill and self-efficacy. Finally, we report that the model with acoustic, gaze, and linguistic features has the best regression accuracy in CS prediction (correlation coefficient r = 0.637). Furthermore, the regression model with biological features achieves the best accuracy in SE prediction (r = 0.330).
SST

OBISHI: Objective Binaural Intelligibility Score for the Hearing Impaired.

Candy Olivia Mawalim, Benita Angela Titalim, Masashi Unoki, and Shogo Okada.

SST2022, The 18th Australasian International Conference on Speech Science and Technology, Canberra, Australia, 13--16 December 2022.

Abstract PDF

Speech intelligibility prediction for both normal hearing and hearing impairment is very important for hearing aid development. The Clarity Prediction Challenge 2022 (CPC1) was initiated to evaluate the speech intelligibility of speech signals produced by hearing aid systems. Modified binaural short-time objective intelligibility (MBSTOI) and hearing aid speech prediction index (HASPI) were introduced in the CPC1 to understand the basis of speech intelligibility prediction. This paper proposes a method to predict speech intelligibility scores, namely OBISHI. OBISHI is an intrusive (non-blind) objective measurement that receives binaural speech input and considers the hearing-impaired characteristics. In addition, a pre-trained automatic speech recognition (ASR) system was also utilized to infer the difficulty of utterances regardless of the hearing loss condition. We also integrated the hearing loss model by the Cambridge auditory group and the Gammatone Filterbank-based prediction model. The total evaluation was conducted by comparing the predicted intelligibility score of the baseline MBSTOI and HASPI with the actual correctness of listening tests. In general, the results showed that the proposed method, OBISHI, outperformed the baseline MBSTOI and HASPI (improved approximately 10% classification accuracy in terms of F1 score).
APSIPA

Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters.

Benita Angela Titalim*, Candy Olivia Mawalim*, Shogo Okada, and Masashi Unoki.

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7--10 November 2022.

Abstract PDF

Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. The work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel and acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is more accurately predicted. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM, as additional acoustic parameters for estimating the difficulty levels of given utterances, are included to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves the SI prediction effects of the baseline and hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating the eGeMAPS and WavLM can significantly contribute to the prediction model by increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both closed-set and open-set tracks.
APSIPA

F0 Modification via PV-TSM Algorithm for Speaker Anonymization Across Gender.

Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7-10 November 2022.

Abstract PDF

Speaker anonymization has been developed to protect personally identifiable information while retaining other encapsulated information in speech. Datasets, metrics, and protocols for evaluating speaker anonymization have been defined in the Voice Privacy Challenge (VPC). However, existing privacy metrics focus on evaluating general speaker individuality anonymization, which is represented by an x-vector. This study aims to investigate the effect of anonymization on the perception of gender. Understanding how anonymization caused gender transformation is essential for various applications of speaker anonymization. We proposed speaker anonymization methods across genders based on phase-vocoder time-scale modification (PV-TSM). Subsequently, in addition to the VPC evaluation, we developed a gender classifier to evaluate a speaker's gender anonymization. The objective evaluation results showed that our proposed method can successfully anonymize gender. In addition, our proposed methods outperformed the signal processing-based baseline methods in anonymizing speaker individuality represented by the x-vector in ASVeval while maintaining speech intelligibility.
SPSC-SIG

Speaker Anonymization by Pitch Shifting Based on Time-Scale Modification.

Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

2nd Symposium on Security and Privacy in Speech Communication joined with 2nd VoicePrivacy Challenge Workshop September 23 & 24 2022, as a satellite to Interspeech 2022, Incheon, Korea.

Abstract DOI PDF

The increasing usage of speech in digital technology raises a privacy issue because speech contains biometric information. Several methods of dealing with this issue have been proposed, including speaker anonymization or de-identification. Speaker anonymization aims to suppress personally identifiable information (PII) while keeping the other speech properties, including linguistic information. In this study, we utilize time-scale modification (TSM) speech signal processing for speaker anonymization. Speech signal processing approaches are significantly less complex than the state-of-the-art x-vector-based speaker anonymization method because it does not require a training process. We propose anonymization methods using two major categories of TSM, synchronous overlap-add (SOLA)-based algorithm and phase vocoder-based TSM (PV-TSM). For evaluating our proposed methods, we utilize the standard objective evaluation introduced in the VoicePrivacy challenge. The results show that our method based on the PV-TSM balances privacy and utility metrics better than baseline systems, especially when evaluating with an automatic speaker verification (ASV) system in anonymized enrollment and anonymized trials (a-a). Further, our method outperformed the x-vector-based speaker method, which has limitations in its complex training process, low privacy in an a-a scenario, and low voice distinctiveness.
Elsevier

Speaker Anonymization by Modifying Fundamental Frequency and X-Vectors Singular Value.

Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Shunsuke Kidani, and Masashi Unoki.

Computer Speech & Language, Elsevier, vol. 73, 101326, 2022.

Abstract DOI PDF Video

Speaker anonymization is a method of protecting voice privacy by concealing individual speaker characteristics while preserving linguistic information. The VoicePrivacy Challenge 2020 was initiated to generalize the task of speaker anonymization. In the challenge, two frameworks for speaker anonymization were introduced; in this study, we propose a method of improving the primary framework by modifying the state-of-the-art speaker individuality feature (namely, x-vector) in a neural waveform speech synthesis model. Our proposed method is constructed based on x-vector singular value modification with a clustering model. We also propose a technique of modifying the fundamental frequency and speech duration to enhance the anonymization performance. To evaluate our method, we carried out objective and subjective tests. The overall objective test results show that our proposed method improves the anonymization performance in terms of the speaker verifiability, whereas the subjective evaluation results show improvement in terms of the speaker dissimilarity. The intelligibility and naturalness of the anonymized speech with speech prosody modification were slightly reduced (less than 5% of word error rate) compared to the results obtained by the baseline system.

2021

MDPI

Speech Watermarking by McAdams Coefficient Scheme Based on Random Forest Learning.

Candy Olivia Mawalim, and Masashi Unoki.

Entropy, MDPI, vol. 23, no. 10, 2021.

Abstract DOI PDF

Speech watermarking has become a promising solution for protecting the security of speech communication systems. We propose a speech watermarking method that uses the McAdams coefficient, which is commonly used for frequency harmonics adjustment. The embedding process was conducted, using bit-inverse shifting. We also developed a random forest classifier, using features related to frequency harmonics for blind detection. An objective evaluation was conducted to analyze the performance of our method in terms of the inaudibility and robustness requirements. The results indicate that our method satisfies the speech watermarking requirements with a 16 bps payload under normal conditions and numerous non-malicious signal processing operations, e.g., conversion to Ogg or MP4 format.
TOMM

Task-independent Recognition of Communication Skills in Group Interaction Using Time-series Modeling.

Candy Olivia Mawalim, Shogo Okada, and Yukiko I. Nakano.

ACM Transactions on Multimedia Computing Communications and Applications, vol. 17, no. 4, pp. 122:1-122:27, 2021.

Abstract DOI PDF

Case studies of group discussions are considered an effective way to assess communication skills (CS). This method can help researchers evaluate participants’ engagement with each other in a specific realistic context. In this article, multimodal analysis was performed to estimate CS indices using a three-task-type group discussion dataset, the MATRICS corpus. The current research investigated the effectiveness of engaging both static and time-series modeling, especially in task-independent settings. This investigation aimed to understand three main points: first, the effectiveness of time-series modeling compared to nonsequential modeling; second, multimodal analysis in a task-independent setting; and third, important differences to consider when dealing with task-dependent and task-independent settings, specifically in terms of modalities and prediction models. Several modalities were extracted (e.g., acoustics, speaking turns, linguistic-related movement, dialog tags, head motions, and face feature sets) for inferring the CS indices as a regression task. Three predictive models, including support vector regression (SVR), long short-term memory (LSTM), and an enhanced time-series model (an LSTM model with a combination of static and time-series features), were taken into account in this study. Our evaluation was conducted by using the R2 score in a cross-validation scheme. The experimental results suggested that time-series modeling can improve the performance of multimodal analysis significantly in the task-dependent setting (with the best R2 = 0.797 for the total CS index), with word2vec being the most prominent feature. Unfortunately, highly context-related features did not fit well with the task-independent setting. Thus, we propose an enhanced LSTM model for dealing with task-independent settings, and we successfully obtained better performance with the enhanced model than with the conventional SVR and LSTM models (the best R2 = 0.602 for the total CS index). In other words, our study shows that a particular time-series modeling can outperform traditional nonsequential modeling for automatically estimating the CS indices of a participant in a group discussion with regard to task dependency.
APSIPA

Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method.

Candy Olivia Mawalim, and Masashi Unoki.

2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, December 2021.

Abstract IEEE PDF

Speaker anonymization aims to suppress speaker individuality to protect privacy in speech while preserving the other aspects, such as speech content. One effective solution for anonymization is to modify the McAdams coefficient. In this work, we propose a method to improve the security for speaker anonymization based on the McAdams coefficient by using a speech watermarking approach. The proposed method consists of two main processes: one for embedding and one for detection. In embedding process, two different McAdams coefficients represent binary bits "0" and "1". The watermarked speech is then obtained by frame-by-frame bit inverse switching. Subsequently, the detection process is carried out by a power spectrum comparison. We conducted objective evaluations with reference to the VoicePrivacy 2020 Challenge (VP2020) and of the speech watermarking with reference to the Information Hiding Challenge (IHC) and found that our method could satisfy the blind detection, inaudibility, and robustness requirements in watermarking. It also significantly improved the anonymization performance in comparison to the secondary baseline system in VP2020.

2020

APSIPA

Speech Information Hiding by Modification of LSF Quantization Index in CELP Codec.

Candy Olivia Mawalim, Shengbei Wang, and Masashi Unoki.

2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, December 2020.

Abstract IEEE PDF

A prospective method for securing digital speech communication is by hiding the information within the speech. Most of the speech information hiding methods proposed in prior research are lacking in robustness when dealing with the encoding process (e.g. the code-excited linear prediction (CELP) codec). The CELP codecs provide a codebook that represents the encoded signal at a lower bit rate. As essential features in speech coding, line spectral frequencies (LSFs) are generally included in the codebook. Consequently, LSFs are considered as a prospective medium for information hiding that is robust against CELP codecs. In this paper, we propose a speech information hiding method that modifies the least significant bit of the LSF quantization obtained by a CELP codec. We investigated the feasibility of our proposed method by objective evaluation in terms of detection accuracy and inaudibility. The evaluation results confirmed the reliability of our proposed method with some further potential improvement (multiple embedding and varying segmentation lengths). The results also showed that our proposed method is robust against several signal processing operations, such as resampling, adding Gaussian noise, and several CELP codecs (i.e., the Federation Standard-1016 CELP, G.711, and G.726).
Interspeech

X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System.

Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, and Masashi Unoki.

Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, pp. 1703–1707, October 2020.

Abstract DOI PDF

Anonymizing speaker individuality is crucial for ensuring voice privacy protection. In this paper, we propose a speaker individuality anonymization system that uses singular value modification and statistical-based decomposition on an x-vector with ensemble regression modeling. An anonymization system requires speaker-to-speaker correspondence (each speaker corresponds to a pseudo-speaker), which may be possible by modifying significant x-vector elements. The significant elements were determined by singular value decomposition and variant analysis. Subsequently, the anonymization process was performed by an ensemble regression model trained using x-vector pools with clustering-based pseudo-targets. The results demonstrated that our proposed anonymization system effectively improves objective verifiability, especially in anonymized trials and anonymized enrollments setting, by preserving similar intelligibility scores with the baseline system introduced in the VoicePrivacy 2020 Challenge.
Springer

Audio Information Hiding Based on Cochlear Delay Characteristics with Optimized Segment Selection.

Candy Olivia Mawalim and Masashi Unoki.

Advances in Intelligent Systems and Computing, Springer, vol. 1145, 2020.

Abstract DOI

Audio information hiding (AIH) based on cochlear delay (CD) characteristics is a promising technique to deal with the trade-off between inaudibility and robustness requirements effectively. However, the use of phase-shift keying (PSK) for blindly detectable AIH based on CD characteristics caused abrupt phase changing (phase spread spectrum), which leads to bad inaudibility. This paper proposed the technique to reduce the spread spectrum from PSK by segment selection process with spline interpolation optimization. Objective evaluation to measure the detection accuracy (BDR) and inaudibility (PEAQ and LSD) was carried out with 102 various genre music clips dataset. Based on the evaluation result, our proposed method could successfully reduce the spread spectrum caused by PSK by having improvement on inaudibility test with adequate detection accuracy up to 1024 bps.

2019

RISP

Feasibility of Audio Information Hiding Using Linear Time Variant IIR Filters Based on Cochlear Delay.

Candy Olivia Mawalim and Masashi Unoki.

Journal of Signal Processing, Research Institute of Signal Processing, vol. 23, no. 4, 2019.

Abstract DOI PDF

A reported technique for cochlear delay (CD) based audio information hiding achieved imperceptibility in non-blind approaches. However, the phase shift keying (PSK) technique was utilized with a blind method, causing drastically phase changes that imply perceptible information was inserted. This paper presents an investigation on the feasibility of hiding information in the linear time variant (LTV) system. We adapted a CD filter design for the LTV system in the embedding scheme. The detection scheme was conducted using instantaneous chirp-z transformation (CZT). Objective tests for checking the imperceptibility (PEAQ and LSD) and for checking data payload (bit detection rate (BDR)) were conducted to evaluate our method. Our experimental results supported the feasibility of utilizing the CD-based audio information hiding in the LTV system. ln addition, detection regarding both imperceptibility and data payload was better in our method than in the previous blind method.
LNCS

Multimodal BigFive Personality Trait Analysis using Communication Skill Indices and Multiple Discussion Types Dataset.

Candy Olivia Mawalim, Shogo Okada, Yukiko I. Nakano, and Masashi Unoki.

Springer LNCS Social Computing and Social Media: Design, Human Behavior, and Analytics, Springer, vol. 11578, 2019.

Abstract DOI PDF

This paper focuses on multimodal analysis in multiple discussion types dataset for estimating BigFive personality traits. The analysis was conducted to achieve two goals: First, clarifying the effectiveness of multimodal features and communication skill indices to predict the BigFive personality traits. Second, identifying the relationship among multimodal features, discussion type, and the BigFive personality traits. The MATRICS corpus, which contains of three discussion task types dataset, was utilized in this experiment. From this corpus, three sets of multimodal features (acoustic, head motion, and linguistic) and communication skill indices were extracted as the input for our binary classification system. The evaluation was conducted by using F1-score in 10-fold cross validation. The experimental results showed that the communication skill indices are important in estimating agreeableness trait. In addition, the scope and freedom of conversation affected the performance of personality traits estimator. The freer a discussion is, the better personality traits estimator can be obtained.

2017

ICEEI

POS-based reordering rules for Indonesian-Korean statistical machine translation.

Candy Olivia Mawalim, Dessi Puji Lestari, and Ayu Purwarianti.

6th International Conference on Electrical Engineering and Informatics (ICEEI), pp. 1-6, 2017.

Abstract IEEE

In SMT system, reordering problem is one of the most important and difficult problems to solve. The problem becomes definitely serious due to the different grammatical pattern between source and target language. The previous research about reordering model in SMT use the distortion-based reordering approach. However, this approach is not suitable for Indonesian-Korean translation. The main reason is because the word order between Indonesian and Korean are mostly reversed. Therefore, in this study, we develop a source-side reordering rules by using POS tag and word alignment information. This technique is promising to solve the reordering problem based on the experimental result. By applying 130 reordering rules in ID-KR and 50 reordering rules for KR-ID translation, the quality of translation in term of BLEU score increases 1.25% for ID-KR translation and 0.83% for KR-ID translation. Besides, combining this reordering rules with Korean verb formation rules for ID-KR translation can increase the BLEU score from 38.07 to 49.46 (in 50 simple sentences evaluation).
PACLIC

Rule-based Reordering and Post-Processing for Indonesian-Korean Statistical Machine Translation.

Candy Olivia Mawalim, Dessi Puji Lestari, and Ayu Purwarianti.

In Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pages 287–295, Cebu, Phillippines, 2017.

Abstract ACL PDF

In SMT system, reordering problem is one of the most important and difficult problems to solve. The problem becomes definitely serious due to the different grammatical pattern between source and target language. The previous research about reordering model in SMT use the distortion-based reordering approach. However, this approach is not suitable for Indonesian-Korean translation. The main reason is because the word order between Indonesian and Korean are mostly reversed. Therefore, in this study, we develop a source-side reordering rules by using POS tag and word alignment information. This technique is promising to solve the reordering problem based on the experimental result. By applying 130 reordering rules in ID-KR and 50 reordering rules for KR-ID translation, the quality of translation in term of BLEU score increases 1.25% for ID-KR translation and 0.83% for KR-ID translation. Besides, combining this reordering rules with Korean verb formation rules for ID-KR translation can increase the BLEU score from 38.07 to 49.46 (in 50 simple sentences evaluation).

Publications

* denotes equal contribution and joint lead authorship.

2025

Phoneme-Specific Challenges to Intelligibility in Hearing Impairment Under Noisy Condition

Study on Signal Processing Techniques in Protecting Voice Personae Against Speech Synthesis Systems

Multimodal Classification of Co-speech Gesture Pragmatical function in Storytelling

Modeling Multi-Level Hearing Loss for Speech Intelligibility Prediction

Integrating Linguistic and Acoustic Cues for Machine Learning-Based Speech Intelligibility Prediction in Hearing Impairment

Lightweight Speech Intelligibility Prediction with Spectro-Temporal Modulation for Hearing-Impaired Listeners

Robust Multilingual Audio Deepfake Detection Through Hybrid Modeling.

InaSAS: Benchmarking Indonesian Speech Antispoofing Systems

Beyond accuracy: Multimodal modeling of structured speaking skill indices in young adolescents

Influence of Personality Traits and Demographics on Rapport Recognition Using Adversarial Learning

Speech Intelligibility Prediction Using Binaural Processing for Hearing Loss

WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines.

Fine-tuning TitaNet-Large Model for Speaker Anonymization Attacker Systems.

Indonesian Speech Content De-Identification in Low Resource Transcripts.

2024

Forensic speech enhancement: Toward Reliable Handling of Poor-Quality Speech Recordings Used as Evidence in Criminal Trials

MAG-BERT-ARL for Fair Automated Video Interview Assessment.

Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features.

Anomalous Sound Detection Based on Time Domain Gammatone Filterbank and IDNN Model.

Detecting Spoof Voices in Asian Non-Native Speech: An Indonesian and Thai Case Study.

Analysis of Pathological Features for Spoof Detection.

UCSYSpoof: A Myanmar Language Dataset for Voice Spoofing Detection.

Indonesian Speech Anti-Spoofing System: Data Creation and CNN Models.

Do We Need to Watch It All? Efficient Job Interview Video Processing with Differentiable Masking.

Incremental Multimodal Sentiment Analysis on HAI Based on Multitask Active Learning with Inter-Annotator Agreement.

Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.

Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments?

Exploring a Cutting-Edge Convolutional Neural Network for Speech Emotion Recognition

Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.

2023

ThaiSpoof: A Database for Spoof Detection in Thai Language.

Voice Contribution on LFCC feature and ResNet-34 for Spoof Detection.

Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection.

Incorporating the Digit Triplet Test in A Lightweight Speech Intelligibility Prediction for Hearing Aids.

Non-Intrusive Speech Intelligibility Prediction Using an Auditory Periphery Model with Hearing Loss.

A Ranking Model for Evaluation of Conversation Partners Based on Rapport Levels.

Auditory Model Optimization with Wavegram-CNN and Acoustic Parameter Models for Nonintrusive Speech Intelligibility Prediction in Hearing Aids.

Inter-person Intra-modality Attention Based Model for Dyadic Interaction Engagement Prediction.

Investigating the Effect of Linguistic Features on Personality and Job Performance Predictions.

Personality Trait Estimation in Group Discussions using Multimodal Analysis and Speaker Embedding.

2022

Multimodal Analysis for Communication Skill and Self-Efficacy Level Estimation in Job Interview Scenario.

OBISHI: Objective Binaural Intelligibility Score for the Hearing Impaired.

Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters.

F0 Modification via PV-TSM Algorithm for Speaker Anonymization Across Gender.

Speaker Anonymization by Pitch Shifting Based on Time-Scale Modification.

Speaker Anonymization by Modifying Fundamental Frequency and X-Vectors Singular Value.

2021

Speech Watermarking by McAdams Coefficient Scheme Based on Random Forest Learning.

Task-independent Recognition of Communication Skills in Group Interaction Using Time-series Modeling.

Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method.

2020

Speech Information Hiding by Modification of LSF Quantization Index in CELP Codec.

X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System.

Audio Information Hiding Based on Cochlear Delay Characteristics with Optimized Segment Selection.

2019

Feasibility of Audio Information Hiding Using Linear Time Variant IIR Filters Based on Cochlear Delay.

Multimodal BigFive Personality Trait Analysis using Communication Skill Indices and Multiple Discussion Types Dataset.

2017

POS-based reordering rules for Indonesian-Korean statistical machine translation.

Rule-based Reordering and Post-Processing for Indonesian-Korean Statistical Machine Translation.