Publications

* denotes equal contribution and joint lead authorship.


2024

  1. Forensic Speech Enhancement: Toward Reliable Handling of Poor-Quality Speech Recordings Used as Evidence in Criminal Trials.
    Helen Fraser, Vincent Aubanel, Robert C. Maher, Candy Olivia Mawalim, Xin Wang, Peter Počta, Emma Keith, Gérard Chollet, and Karla Pizzi.

    Journal of the Audio Engineering Society, 2024 (Accepted).

    This paper proposes an innovative interdisciplinary approach to evaluating the effectiveness of forensic speech enhancement (FSE). FSE faces unique challenges arising from a range of factors, including poor recording quality, highly variable conditions from case to case, and content uncertainty. Despite these difficulties, FSE is commonly admitted in court and can significantly influence the outcome of criminal trials. Current FSE practices are hindered by unrealistic expectations from courts, which often assume that enhanced audio inherently clarifies content. In fact, FSE can have the undesired opposite effect, potentially resulting in unfair prejudice, when, for example, it increases the credibility of a misleading transcript. The proposed interdisciplinary project advocates for a better consideration of speech perception factors, particularly those related to transcription. It aims to bridge the gap between FSE and forensic transcription by promoting a combined approach to enhancing and accurately transcribing forensic audio. By developing a position statement on FSE capabilities, the project seeks to establish realistic standards and foster collaboration among researchers and practitioners. This effort aims to ensure reliable, accountable forensic audio evidence, aligning with forensic science standards and improving the effectiveness of the justice system.
  2. MAG-BERT-ARL for Fair Automated Video Interview Assessment.
    Bimasena Putra, Kurniawati Azizah, Candy Olivia Mawalim, Ikhlasul Akmal Hanif, Sakriani Sakti, Chee Wee Leong, and Shogo Okada.

    IEEE Access, 2024.

    Potential biases within automated video interview assessment algorithms may disadvantage specific demographics due to the collection of sensitive attributes, which are regulated by the General Data Protection Regulation (GDPR). To mitigate these fairness concerns, this research introduces MAG-BERT-ARL, an automated video interview assessment system that eliminates reliance on sensitive attributes. MAG-BERT-ARL integrates Multimodal Adaptation Gate and Bidirectional Encoder Representations from Transformers (MAG-BERT) model with the Adversarially Reweighted Learning (ARL). This integration aims to improve the performance of underrepresented groups by promoting Rawlsian Max-Min Fairness. Through experiments on the Educational Testing Service (ETS) and First Impressions (FI) datasets, the proposed method demonstrates its effectiveness in optimizing model performance (increasing Pearson correlation coefficient up to 0.17 in the FI dataset and precision up to 0.39 in the ETS dataset) and fairness (reducing equal accuracy up to 0.11 in the ETS dataset). The findings underscore the significance of integrating fairness-enhancing techniques like ARL and highlight the impact of incorporating nonverbal cues on hiring decisions.
  3. Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features.
    Malik Akbar Hashemi Rafsanjani, Candy Olivia Mawalim, Dessi Puji Lestari, Sakriani Sakti, and Masashi Unoki

    The 16th annual conference organized by the Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024. (Accepted)

    TBD
  4. Anomalous Sound Detection Based on Time Domain Gammatone Filterbank and IDNN Model.
    Primanda Adyatma Hafiz, Candy Olivia Mawalim, Dessi Puji Lestari, Sakriani Sakti, and Masashi Unoki

    The 16th annual conference organized by the Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024. (Accepted)

    TBD
  5. Detecting Spoof Voices in Asian Non-Native Speech: An Indonesian and Thai Case Study.
    Aulia Adila, Candy Olivia Mawalim, and Masashi Unoki

    The 16th annual conference organized by the Asia-Pacific Signal and Information Processing Association (APSIPA 2024), Galaxy International Convention Center, Macau, China, 3-6 Dec 2024. (Accepted)

    TBD
  6. Analysis of Pathological Features for Spoof Detection.
    Myat Aye Aye Aung, Hay Mar Soe Naing, Aye Mya Hlaing, Win Pa Pa, Kasorn Galajit, and Candy Olivia Mawalim

    The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024. (Accepted)

    TBD
  7. UCSYSpoof: A Myanmar Language Dataset for Voice Spoofing Detection.
    Hay Mar Soe Naing, Win Pa Pa, Aye Mya Hlaing, Myat Aye Aye Aung, Kasorn Galajit, and Candy Olivia Mawalim

    The 27th International Conference of the Oriental COCOSDA (O-COCOSDA 2024), National Yang Ming Chiao Tung University, Hsinchu, Taiwan, 17-19 Oct 2024. (Accepted)

    TBD
  8. Indonesian Speech Anti-Spoofing System: Data Creation and CNN Models.
    Sarah Azka Arief, Candy Olivia Mawalim, and Dessi Puji Lestari

    The 11th International Conference on Advanced Informatics: Concepts, Theory, and Applications (ICAICTA 2024), Singapore, 28-30 Sept 2024.

    Biometric systems are prone to spoofing attacks. While research in speech anti-spoofing has been progressing, there is a limited availability of diverse language datasets. This study aims to bridge this gap by developing an Indonesian spoofed speech dataset, which includes replay attacks, text-to-speech, and voice conversion. This dataset forms the foundation for creating an Indonesian speech anti-spoofing system. Subsequently, light convolutional neural network (LCNN) and residual network (ResNet) models, based on convolutional neural networks (CNN), were developed to evaluate the dataset. The input features used are linear frequency cepstral coefficients (LFCC). Both models demonstrate remarkably low minDCF and EER scores approaching zero. The results also exhibit exceptional scores with 4-fold cross-validation, showing strong initial performance with no signs of overfitting. However, models trained solely on Common Voice or Prosa.ai datasets performed poorly in cross-source tests, suggesting generalization issues due to a lack of diversity in the dataset. This highlights the need for further improvement and continued research in Indonesian speech spoof detection.
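    As a rough illustration of the kind of front end described above, the sketch below computes LFCC features with plain NumPy/SciPy (STFT, linear-frequency triangular filterbank, log, DCT). It is not the authors' code, and the frame and filterbank settings are assumptions; the resulting feature matrix would then be fed to an LCNN or ResNet classifier.

        # Illustrative LFCC front end: power spectrum -> linear-frequency
        # triangular filterbank (vs. mel for MFCC) -> log -> DCT.
        import numpy as np
        from scipy.signal import stft
        from scipy.fftpack import dct

        def lfcc(signal, sr=16000, n_fft=512, hop=160, n_filters=20, n_ceps=20):
            _, _, spec = stft(signal, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
            power = np.abs(spec) ** 2                              # (n_fft//2 + 1, frames)

            # Triangular filters spaced linearly over the frequency axis.
            edges = np.floor(np.linspace(0, n_fft // 2, n_filters + 2)).astype(int)
            fbank = np.zeros((n_filters, n_fft // 2 + 1))
            for m in range(1, n_filters + 1):
                l, c, r = edges[m - 1], edges[m], edges[m + 1]
                fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
                fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

            log_energy = np.log(fbank @ power + 1e-10)             # (n_filters, frames)
            return dct(log_energy, axis=0, norm="ortho")[:n_ceps]  # (n_ceps, frames)

        feats = lfcc(np.random.randn(16000))  # 1 s of placeholder audio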
  9. Do We Need to Watch It All? Efficient Job Interview Video Processing with Differentiable Masking.
    Hung Le, Sixia Li, Candy Olivia Mawalim, Hung-Hsuan Huang, Chee Wee Leong, and Shogo Okada

    The 26th International Conference on Multimodal Interaction (ICMI 2024), San José, Costa Rica, 4-8 Nov 2024. (Accepted)

    TBD
  10. Incremental Multimodal Sentiment Analysis on HAI Based on Multitask Active Learning with Inter-Annotator Agreement.
    Thus Karnjanapatchara, Sixia Li, Candy Olivia Mawalim, Kazunori Komatani, and Shogo Okada

    The 12th International Conference on Affective Computing and Intelligent Interaction (ACII 2024), Glasgow, Scotland, UK, 15-18 September 2024. (Accepted)

    TBD
  11. Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.
    Aulia Adila, Candy Olivia Mawalim, Takuto Isoyama, and Masashi Unoki.

    Journal of Signal Processing, Research Institute of Signal Processing, vol. 28, no. 6, pp. 309-313, 2024.

    A reliable speech watermarking technique must balance satisfying four requirements: inaudibility, robustness, blind detectability, and confidentiality. A previous study proposed a speech watermarking technique based on direct spread spectrum (DSS) using a linear prediction (LP) scheme, i.e., LP-DSS, that could simultaneously satisfy these four requirements. However, an inaudibility issue was found due to the incorporation of a blind detection scheme with frame synchronization. In this paper, we investigate the feasibility of utilizing a psychoacoustical model, which simulates auditory masking, to control the suitable embedding level of the watermark signal to resolve the inaudibility issue in the LP-DSS scheme. Evaluation results confirmed that controlling the embedding level with the psychoacoustical model, with a constant scaling factor setting, could balance the trade-off between inaudibility and detection ability with a payload up to 64 bps.
  12. Are Recent Deep Learning-Based Speech Enhancement Methods Ready to Confront Real-World Noisy Environments?
    Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

    The 25th Interspeech Conference, Kos Island, Greece, 1-5 September 2024.

    Recent advancements in speech enhancement techniques have ignited interest in improving speech quality and intelligibility. However, the effectiveness of recently proposed methods is unclear. In this paper, a comprehensive analysis of modern deep learning-based speech enhancement approaches is presented. Through evaluations using the Deep Noise Suppression and Clarity Enhancement Challenge datasets, we assess the performances of three methods: Denoiser, DeepFilterNet3, and FullSubNet+. Our findings reveal nuanced performance differences among these methods, with varying efficacy across datasets. While objective metrics offer valuable insights, they struggle to represent complex scenarios with multiple noise sources. Leveraging ASR-based methods for these scenarios shows promise but may induce critical hallucination effects. Our study emphasizes the need for ongoing research to refine techniques for diverse real-world environments.
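    The evaluation style described above is intrusive: each enhanced file is scored against its clean reference. A minimal sketch of such scoring with the third-party pesq and pystoi packages follows; the packages, file names, and 16 kHz sampling rate are assumptions, not details from the paper.

        # Score one enhanced utterance against its clean reference.
        import soundfile as sf
        from pesq import pesq      # ITU-T P.862 PESQ; "wb" mode expects 16 kHz audio
        from pystoi import stoi    # short-time objective intelligibility

        ref, sr = sf.read("clean.wav")     # placeholder path to the clean reference
        enh, _ = sf.read("enhanced.wav")   # placeholder path to the enhanced output
        n = min(len(ref), len(enh))        # align lengths before scoring
        ref, enh = ref[:n], enh[:n]

        print("PESQ (wb):", pesq(sr, ref, enh, "wb"))
        print("STOI     :", stoi(ref, enh, sr, extended=False))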
  13. Exploring a Cutting-Edge Convolutional Neural Network for Speech Emotion Recognition
    Navod Neranjan Thilakarathne, Kasorn Galajit, Candy Olivia Mawalim, and Hayati Yassin.

    The 5th International Conference on Industrial Engineering and Artificial Intelligence (IEAI 2024), Chulalongkorn University, Bangkok, Thailand, 24-26 April 2024.

    In light of the ongoing expansion of human-computer interaction, advancements in the comprehension and interpretation of human emotions are of the utmost importance. Speech emotion recognition (SER) is a critical element in this context, as it enables computational systems to comprehend human emotions. Throughout the years, SER has employed a variety of techniques, including well-established speech analysis and classification methods. However, in recent years, techniques powered by deep learning have been suggested as a viable substitute for conventional SER methods, owing to their more encouraging outcomes. In this regard, by utilizing a novel Convolutional Neural Network (CNN) model designed to assess and categorize seven emotional states based on speech signals retrieved from three distinct databases, this research presents a novel approach to SER that yields 88.76% accuracy. The purpose of this research is to enhance the accuracy and efficiency of emotion identification, with the end goal of boosting applications in fields such as interactive voice response systems, mental health monitoring, and personalized digital assistants.
  14. Study on Inaudible Speech Watermarking Method Based on Spread-Spectrum Using Linear Prediction Residue.
    Aulia Adila, Candy Olivia Mawalim, Takuto Isoyama, and Masashi Unoki.

    The 2024 RISP International Workshop on Nonlinear Circuits, Communications and Signal Processing, Hawaii, 27 February - 1 March 2024.

    A reliable speech watermarking technique must balance four requirements: inaudibility, robustness, blind detectability, and confidentiality. A previous study proposed an LP-DSS scheme that could simultaneously satisfy these four requirements. However, an inaudibility issue arose from the blind detection scheme with frame synchronization. In this paper, we investigate the feasibility of utilizing a psychoacoustical model to control the suitable embedding level of the watermark signal to resolve the inaudibility issue that arises in the LP-DSS scheme. A psychoacoustical model simulates the auditory masking phenomenon, in which signals below the masking curve are imperceptible to human ears. Results of the evaluation confirmed that controlling the embedding level with the psychoacoustical model balanced the trade-off between inaudibility and detection ability with a payload of up to 64 bps.

2023

  1. ThaiSpoof: A Database for Spoof Detection in Thai Language.
    Kasorn Galajit, Thunpisit Kosolsriwiwat, Candy Olivia Mawalim, Pakinee Aimmanee, Waree Kongprawechnon, Win Pa Pa, Anuwat Chaiwongyen, Teeradaj Racharak, Hayati Yassin, Jessada Karnjana, Surasak Boonkla, and Masashi Unoki.

    The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

    Many applications and security systems have widely applied automatic speaker verification (ASV). However, these systems are vulnerable to various direct and indirect access attacks, which weaken their authentication capability. Research in spoofed speech detection helps strengthen these systems; however, spoofing detection studies cover only a few languages due to the limited availability of datasets. This paper focuses on a Thai language dataset for spoof detection. The dataset consists of genuine speech signals and various types of spoofed speech signals. The spoofed speech dataset is generated using text-to-speech tools for the Thai language, synthesis tools, and tools for speech modification. To showcase the utilization of this dataset, we implement a simple model based on a convolutional neural network (CNN) taking linear frequency cepstral coefficients (LFCC) as its input. We trained, validated, and tested the model on our dataset, referred to as ThaiSpoof. The experimental results show that the accuracy of the model is 95% and the equal error rate (EER) is 4.67%. These results show that our ThaiSpoof dataset has the potential to support spoof detection studies.
  2. Voice Contribution on LFCC feature and ResNet-34 for Spoof Detection.
    Khaing Zar Mon, Kasorn Galajit, Candy Olivia Mawalim, Jessada Karnjana, Tsuyoshi Isshiki, and Pakinee Aimmanee.

    The 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing and The International Conference on Artificial Intelligence and Internet of Things (iSAI-NLP 2023).

    Biometric authentication has recently seen significant advancement, particularly in speaker verification. Despite this progress, compelling evidence highlights the continued vulnerability of this technology to malicious spoofing attacks. This vulnerability calls for developing specialized countermeasures to identify various attack types. This paper specifically focuses on detecting replay, speech synthesis, and voice conversion attacks. As the front-end feature extraction method of our spoof detection strategy, we used linear frequency cepstral coefficients (LFCC), and we utilized ResNet-34 to classify genuine and fake speech. We evaluated the proposed method on the PA (Physical Access) and LA (Logical Access) partitions of the ASVspoof 2019 dataset. In our approach, we compare the use of the entire utterance for feature extraction in both PA and LA datasets with an alternative method that extracts features from a specific percentage of the voice segment within the utterance for classification. In addition, we conducted a comprehensive evaluation by comparing our proposed method with the established baseline techniques, LFCC-GMM and CQCC-GMM. The proposed method demonstrates promising performance, achieving an equal error rate (EER) of 3.11% and 3.49% for the development and evaluation datasets, respectively, in PA attacks. In the LA evaluation, the proposed method achieves an EER of 0.16% and 6.89% for the development and evaluation datasets, respectively. The proposed method shows promising results in identifying both PA and LA spoof attacks.
  3. Analysis of Spectro-Temporal Modulation Representation for Deep-Fake Speech Detection.
    Haowei Cheng, Candy Olivia Mawalim, Kai Li, Lijun Wang, and Masashi Unoki.

    The 15th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023), Taipei, Taiwan, 31 October - 3 November 2023.

    Deep-fake speech detection aims to develop effective techniques for identifying fake speech generated using advanced deep-learning methods. It can reduce the negative impact of malicious production or dissemination of fake speech in real-life scenarios. Although humans can relatively easily distinguish between genuine and fake speech thanks to human auditory mechanisms, it is difficult for machines to distinguish them correctly. One major reason for this challenge is that machines struggle to effectively separate speech content from human vocal system information. Common features used in speech processing face difficulties in handling this issue, hindering the neural network from learning the discriminative differences between genuine and fake speech. To address this issue, we investigated spectro-temporal modulation representations in genuine and fake speech, which simulate the human auditory perception process. Next, the spectro-temporal modulation representation was fed to a light convolutional neural network with bidirectional long short-term memory for classification. We conducted experiments on the benchmark datasets of the Automatic Speaker Verification and Spoofing Countermeasures Challenge 2019 (ASVspoof2019) and the Audio Deep synthesis Detection Challenge 2023 (ADD2023), achieving an equal-error rate of 8.33% and 42.10%, respectively. The results showed that spectro-temporal modulation representations can distinguish genuine and deep-fake speech and achieve adequate performance on both datasets.
  4. Incorporating the Digit Triplet Test in A Lightweight Speech Intelligibility Prediction for Hearing Aids.
    Xiajie Zhou, Candy Olivia Mawalim, and Masashi Unoki.

    The 15th Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC 2023), Taipei, Taiwan, 31 October - 3 November 2023.

    Recent studies in speech processing often utilize sophisticated methods for solving a task to obtain high-accuracy results. Although high performance can be achieved, these methods are complex and require high-performance computational power that might not be available to a wide range of researchers. In this study, we propose a method that incorporates low-dimensional and recent state-of-the-art acoustic features to predict speech intelligibility in noise for hearing aids. The proposed method was developed based on a stacked regressor built on various traditional machine learning regressors. Unlike other existing works, we utilized the results of the digit triplet test, which is usually used to measure hearing ability in the presence of noise, to improve the prediction. The evaluation of our proposed method was carried out using the first Clarity Prediction Challenge dataset. This dataset, used for speech intelligibility prediction, consists of speech signals output by hearing aids in various simulated scenes with interferers. Our experimental results show that the proposed method improves speech intelligibility prediction and that the digit triplet test results are beneficial for speech intelligibility prediction in noise.
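    A minimal sketch of the stacked-regressor idea, assuming scikit-learn and synthetic placeholder data: several classical regressors feed a meta-regressor, and the listener's digit triplet test (DTT) score is simply appended to the acoustic feature vector. This illustrates the structure only, not the paper's feature set or tuning.

        # Stacked regression with a DTT score appended as an extra input feature.
        import numpy as np
        from sklearn.ensemble import StackingRegressor, RandomForestRegressor
        from sklearn.linear_model import Ridge
        from sklearn.svm import SVR

        rng = np.random.default_rng(0)
        acoustic = rng.normal(size=(200, 40))        # placeholder acoustic features
        dtt = rng.uniform(-10, 0, size=(200, 1))     # placeholder DTT scores (dB SNR)
        X = np.hstack([acoustic, dtt])
        y = rng.uniform(0, 100, size=200)            # intelligibility (% words correct)

        model = StackingRegressor(
            estimators=[("ridge", Ridge()), ("rf", RandomForestRegressor()), ("svr", SVR())],
            final_estimator=Ridge(),
        )
        model.fit(X, y)
        print(model.predict(X[:3]))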
  5. Non-Intrusive Speech Intelligibility Prediction Using an Auditory Periphery Model with Hearing Loss.
    Candy Olivia Mawalim, Benita Angela Titalim, Shogo Okada, and Masashi Unoki.

    Applied Acoustics, 2023.

    Speech intelligibility prediction methods are necessary for hearing aid development. However, many such prediction methods are categorized as intrusive metrics because they require reference speech as input, which is often unavailable in real-world situations. Additionally, the processing techniques in hearing aids may cause temporal or frequency shifts, which degrade the accuracy of intrusive speech intelligibility metrics. This paper proposes a non-intrusive auditory model for predicting speech intelligibility under hearing loss conditions. The proposed method requires binaural signals from hearing aids and audiograms representing the hearing conditions of hearing-impaired listeners. It also includes additional acoustic features to improve the method’s robustness in noisy and reverberant environments. A two-dimensional convolutional neural network with neural decision forests is used to construct a speech intelligibility prediction model. An evaluation conducted with the first Clarity Prediction Challenge dataset shows that the proposed method performs better than the baseline system.
  6. A Ranking Model for Evaluation of Conversation Partners Based on Rapport Levels.
    Takato Hayashi, Candy Olivia Mawalim, Ryo Ishii, Akira Morikawa, Takao Nakamura, and Shogo Okada.

    IEEE Access, 2023.

    Our proposed ranking model ranks conversation partners based on self-reported rapport levels for each participant. The model is important for tasks that recommend interaction partners based on user rapport built in past interactions, such as matchmaking between a student and a teacher in one-to-one online language classes. To rank conversation partners, we can use a regression model that predicts rapport ratings. It is, however, challenging to learn the mapping from the participants' behavior to their associated rapport ratings because a subjective scale for rapport ratings may vary across different participants. Hence, we propose a ranking model trained via preference learning (PL). The model avoids the subjective scale bias because the model is trained to predict ordinal relations between two conversation partners based on rapport ratings reported by the same participant. The input of the model is multimodal (acoustic and linguistic) features extracted from two participants' behaviors in an interaction. Since there is no publicly available dataset for validating the ranking model, we created a new dataset composed of online dyadic (person-to-person) interactions between a participant and several different conversation partners. We compare the ranking model trained via preference learning with the regression model by using evaluation metrics for the ranking. The experimental results show that preference learning is a more suitable approach for ranking conversation partners. Furthermore, we investigate the effect of each modality and the different stages of rapport development on the ranking performance.
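    The pairwise idea behind preference learning can be sketched as follows (PyTorch, placeholder features): a scorer is trained so that, of two partners rated by the same participant, the preferred one receives the higher score, which removes the participant-specific rating scale. This is an illustrative RankNet-style objective, not the paper's model.

        # Pairwise preference learning: maximize log sigmoid(score_preferred - score_other).
        import torch
        import torch.nn as nn

        scorer = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
        opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

        # Each pair: multimodal features of the preferred and the less-preferred
        # partner, both rated by the same participant (placeholder tensors).
        x_pref = torch.randn(128, 64)
        x_other = torch.randn(128, 64)

        for _ in range(100):
            margin = scorer(x_pref) - scorer(x_other)
            loss = torch.nn.functional.softplus(-margin).mean()  # = -log sigmoid(margin)
            opt.zero_grad()
            loss.backward()
            opt.step()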
  7. Auditory Model Optimization with Wavegram-CNN and Acoustic Parameter Models for Nonintrusive Speech Intelligibility Prediction in Hearing Aids.
    Candy Olivia Mawalim, Benita Angela Titalim, Shogo Okada, and Masashi Unoki.

    The 31st European Signal Processing Conference (EUSIPCO 2023), Helsinki, Finland.

    Nonintrusive speech intelligibility (SI) prediction is essential for evaluating many speech technology applications, including hearing aid development. In this study, several factors related to hearing perception are investigated to predict SI. In the proposed method, we integrated a physiological auditory model of two ears (binaural EarModel), a wavegram-CNN model, and an acoustic parameter model. The refined EarModel does not require clean speech as input (blind method). In EarModel, the perception caused by hearing loss is simulated based on audiograms. Meanwhile, the wavegram-CNN and acoustic parameter models represent the factors related to the speech spectrum and acoustics, respectively. The proposed method is evaluated based on the scenario from the 1st Clarity Prediction Challenge (CPC1). The results show that the proposed method outperforms the intrusive baseline MBSTOI and HASPI methods in terms of the Pearson coefficient (ρ), RMSE, and R2 score in both closed-set and open-set tracks. Based on listener-wise evaluation results, the average ρ could be improved by more than 0.3 using the proposed method.
  8. Inter-person Intra-modality Attention Based Model for Dyadic Interaction Engagement Prediction.
    Xiguang Li, Shogo Okada, and Candy Olivia Mawalim.

    The 25th HCI International Conference, HCII 2023, Copenhagen, Denmark, July 23–28, 2023.

    With the rapid development of artificial agents, more researchers have explored the importance of user engagement level prediction. Real-time user engagement level prediction assists the agent in properly adjusting its policy for the interaction. However, the existing engagement modeling lacks the element of interpersonal synchrony, a temporal behavior alignment closely related to the engagement level. Part of this is because the synchrony phenomenon is complex and hard to delimit. With this background, we aim to develop a model suitable for temporal interpersonal features with the help of the modern data-driven machine learning method. Based on previous studies, we select multiple non-verbal modalities of dyadic interactions as predictive features and design a multi-stream attention model to capture the interpersonal temporal relationship of each modality. Furthermore, we experiment with two additional embedding schemas according to the synchrony definitions in psychology. Finally, we compare our model with a conventional structure that emphasizes the multimodal features within an individual. Our experiments showed the effectiveness of the intra-modal inter-person design in engagement prediction. However, the attempt to manipulate the embeddings failed to improve the performance. In the end, we discuss the experiment result and elaborate on the limitations of our work.
  9. Investigating the Effect of Linguistic Features on Personality and Job Performance Predictions.
    Hung Le, Sixia Li, Candy Olivia Mawalim, Hung-Hsuan Huang, Chee Wee Leong, and Shogo Okada.

    The 25th HCI International Conference, HCII 2023, Copenhagen, Denmark, July 23–28, 2023.

    Personality traits are known to have a high correlation with job performance. On the other hand, there is a strong relationship between language and personality. In this paper, we presented a neural network model for inferring personality and hirability. Our model was trained only on linguistic features but achieved good results by incorporating transfer learning and multi-task learning techniques. The model improved the F1 score by 5.6 percentage points on the Hiring Recommendation label compared to previous work. The effect of different Automatic Speech Recognition systems on the performance of the models was also shown and discussed. Lastly, our analysis suggested that the model makes better judgments about hirability scores when personality trait information is available.
  10. Personality Trait Estimation in Group Discussions using Multimodal Analysis and Speaker Embedding.
    Candy Olivia Mawalim, Shogo Okada, Yukiko I. Nakano, and Masashi Unoki.

    Journal on Multimodal User Interfaces, 2023.

    The automatic estimation of personality traits is essential for many human-computer interface (HCI) applications. This paper focused on improving Big Five personality trait estimation in group discussions via multimodal analysis and transfer learning with the state-of-the-art speaker individuality feature, namely, the identity vector (i-vector) speaker embedding. The experiments were carried out by investigating the effective and robust multimodal features for estimation with two group discussion datasets, i.e., the Multimodal Task-Oriented Group Discussion (MATRICS) (in Japanese) and Emergent Leadership (ELEA) (in European languages) corpora. Subsequently, the evaluation was conducted by using leave-one-person-out cross-validation (LOPCV) and ablation tests to compare the effectiveness of each modality. The overall results showed that the speaker-dependent features, e.g., the i-vector, effectively improved the prediction accuracy of Big Five personality trait estimation. In addition, the experimental results showed that audio-related features were the most prominent features in both corpora.

2022

  1. Multimodal Analysis for Communication Skill and Self-Efficacy Level Estimation in Job Interview Scenario.
    Tomoya Ohba*, Candy Olivia Mawalim*, Shun Katada, Haruki Kuroki, and Shogo Okada.

    The 21st International Conference on Mobile and Ubiquitous Multimedia (ACM MUM 2022), Lisbon, Portugal, 27-30 November 2022.

    An interview for a job recruiting process requires applicants to demonstrate their communication skills. Interviewees sometimes become nervous about the interview because they do not know how their performance will be assessed. This study investigates the relationship between the communication skill (CS) and the self-efficacy level (SE) of interviewees through multimodal modeling. We also clarify the difference between effective features in the prediction of CS and SE labels. For this purpose, we collect a novel multimodal job interview data corpus by using a job interview agent system where users experience the interview using a virtual reality head-mounted display (VR-HMD). The data corpus includes annotations of CS by third-party experts and SE annotations by the interviewees. The data corpus also includes various kinds of multimodal data, including audio, biological (i.e., physiological), gaze, and language data. We present two types of regression models, linear regression and sequential-based regression models, to predict CS, SE, and the gap (GA) between skill and self-efficacy. Finally, we report that the model with acoustic, gaze, and linguistic features has the best regression accuracy in CS prediction (correlation coefficient r = 0.637). Furthermore, the regression model with biological features achieves the best accuracy in SE prediction (r = 0.330).
  2. OBISHI: Objective Binaural Intelligibility Score for the Hearing Impaired.
    Candy Olivia Mawalim, Benita Angela Titalim, Masashi Unoki, and Shogo Okada.

    The 18th Australasian International Conference on Speech Science and Technology (SST 2022), Canberra, Australia, 13-16 December 2022.

    Speech intelligibility prediction for both normal hearing and hearing impairment is very important for hearing aid development. The Clarity Prediction Challenge 2022 (CPC1) was initiated to evaluate the speech intelligibility of speech signals produced by hearing aid systems. Modified binaural short-time objective intelligibility (MBSTOI) and hearing aid speech prediction index (HASPI) were introduced in the CPC1 to understand the basis of speech intelligibility prediction. This paper proposes a method to predict speech intelligibility scores, namely OBISHI. OBISHI is an intrusive (non-blind) objective measurement that receives binaural speech input and considers the hearing-impaired characteristics. In addition, a pre-trained automatic speech recognition (ASR) system was also utilized to infer the difficulty of utterances regardless of the hearing loss condition. We also integrated the hearing loss model by the Cambridge auditory group and the Gammatone Filterbank-based prediction model. The total evaluation was conducted by comparing the predicted intelligibility score of the baseline MBSTOI and HASPI with the actual correctness of listening tests. In general, the results showed that the proposed method, OBISHI, outperformed the baseline MBSTOI and HASPI (improved approximately 10% classification accuracy in terms of F1 score).
  3. Speech Intelligibility Prediction for Hearing Aids Using an Auditory Model and Acoustic Parameters.
    Benita Angela Titalim*, Candy Olivia Mawalim*, Shogo Okada, and Masashi Unoki.

    2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7-10 November 2022.

    Objective speech intelligibility (SI) metrics for hearing-impaired people play an important role in hearing aid development. The work on improving SI prediction also became the basis of the first Clarity Prediction Challenge (CPC1). This study investigates a physiological auditory model called EarModel and acoustic parameters for SI prediction. EarModel is utilized because it provides advantages in estimating human hearing, both normal and impaired. The hearing-impaired condition is simulated in EarModel based on audiograms; thus, the SI perceived by hearing-impaired people is more accurately predicted. Moreover, the extended Geneva Minimalistic Acoustic Parameter Set (eGeMAPS) and WavLM, as additional acoustic parameters for estimating the difficulty levels of given utterances, are included to achieve improved prediction accuracy. The proposed method is evaluated on the CPC1 database. The results show that the proposed method improves on the SI prediction performance of the baseline and the hearing aid speech prediction index (HASPI). Additionally, an ablation test shows that incorporating the eGeMAPS and WavLM can significantly contribute to the prediction model by increasing the Pearson correlation coefficient by more than 15% and decreasing the root-mean-square error (RMSE) by more than 10.00 in both closed-set and open-set tracks.
  4. F0 Modification via PV-TSM Algorithm for Speaker Anonymization Across Gender.
    Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

    2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7-10 November 2022.

    Speaker anonymization has been developed to protect personally identifiable information while retaining other encapsulated information in speech. Datasets, metrics, and protocols for evaluating speaker anonymization have been defined in the Voice Privacy Challenge (VPC). However, existing privacy metrics focus on evaluating general speaker individuality anonymization, which is represented by an x-vector. This study aims to investigate the effect of anonymization on the perception of gender. Understanding how anonymization caused gender transformation is essential for various applications of speaker anonymization. We proposed speaker anonymization methods across genders based on phase-vocoder time-scale modification (PV-TSM). Subsequently, in addition to the VPC evaluation, we developed a gender classifier to evaluate a speaker's gender anonymization. The objective evaluation results showed that our proposed method can successfully anonymize gender. In addition, our proposed methods outperformed the signal processing-based baseline methods in anonymizing speaker individuality represented by the x-vector in ASVeval while maintaining speech intelligibility.
  5. Speaker Anonymization by Pitch Shifting Based on Time-Scale Modification.
    Candy Olivia Mawalim, Shogo Okada, and Masashi Unoki.

    The 2nd Symposium on Security and Privacy in Speech Communication, joint with the 2nd VoicePrivacy Challenge Workshop, a satellite event of Interspeech 2022, Incheon, Korea, 23-24 September 2022.

    The increasing usage of speech in digital technology raises a privacy issue because speech contains biometric information. Several methods of dealing with this issue have been proposed, including speaker anonymization or de-identification. Speaker anonymization aims to suppress personally identifiable information (PII) while keeping the other speech properties, including linguistic information. In this study, we utilize time-scale modification (TSM) speech signal processing for speaker anonymization. Speech signal processing approaches are significantly less complex than the state-of-the-art x-vector-based speaker anonymization method because they do not require a training process. We propose anonymization methods using two major categories of TSM: a synchronous overlap-add (SOLA)-based algorithm and phase vocoder-based TSM (PV-TSM). For evaluating our proposed methods, we utilize the standard objective evaluation introduced in the VoicePrivacy challenge. The results show that our method based on the PV-TSM balances privacy and utility metrics better than baseline systems, especially when evaluated with an automatic speaker verification (ASV) system in anonymized enrollment and anonymized trials (a-a). Further, our method outperformed the x-vector-based speaker anonymization method, which has limitations in its complex training process, low privacy in the a-a scenario, and low voice distinctiveness.
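    The PV-TSM route to pitch modification can be pictured as "time-stretch, then resample": the phase-vocoder stretch changes duration without changing pitch, and resampling back to the original length then scales F0. The snippet below uses librosa's phase-vocoder stretch as a stand-in and sketches only this underlying operation, not the proposed anonymization pipeline; the example audio and the shift factor are placeholders.

        # Pitch shifting via time-scale modification (illustrative sketch).
        import librosa
        from scipy.signal import resample

        def tsm_pitch_shift(y, alpha=1.3):
            # Stretch to alpha times the length (pitch unchanged), then resample
            # back to the original length, which scales F0 by alpha.
            stretched = librosa.effects.time_stretch(y, rate=1.0 / alpha)
            return resample(stretched, len(y))

        y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
        shifted = tsm_pitch_shift(y, alpha=1.3)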
  6. Speaker Anonymization by Modifying Fundamental Frequency and X-Vectors Singular Value.
    Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, Shunsuke Kidani, and Masashi Unoki.

    Computer Speech & Language, Elsevier, vol. 73, 101326, 2022.

    Speaker anonymization is a method of protecting voice privacy by concealing individual speaker characteristics while preserving linguistic information. The VoicePrivacy Challenge 2020 was initiated to generalize the task of speaker anonymization. In the challenge, two frameworks for speaker anonymization were introduced; in this study, we propose a method of improving the primary framework by modifying the state-of-the-art speaker individuality feature (namely, x-vector) in a neural waveform speech synthesis model. Our proposed method is constructed based on x-vector singular value modification with a clustering model. We also propose a technique of modifying the fundamental frequency and speech duration to enhance the anonymization performance. To evaluate our method, we carried out objective and subjective tests. The overall objective test results show that our proposed method improves the anonymization performance in terms of the speaker verifiability, whereas the subjective evaluation results show improvement in terms of the speaker dissimilarity. The intelligibility and naturalness of the anonymized speech with speech prosody modification were slightly reduced (less than 5% of word error rate) compared to the results obtained by the baseline system.
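    The core x-vector manipulation can be pictured with a few lines of NumPy: decompose a pool of x-vectors, perturb the leading singular values, and rebuild the vectors. The fixed scaling used below is a placeholder; the paper determines the modification with clustering rather than a constant factor.

        # Conceptual sketch of x-vector singular value modification.
        import numpy as np

        xvectors = np.random.randn(200, 512)   # placeholder pool (utterances x dims)
        U, s, Vt = np.linalg.svd(xvectors, full_matrices=False)

        s_mod = s.copy()
        s_mod[:10] *= 0.5                      # attenuate the most significant components
        anonymized = U @ np.diag(s_mod) @ Vt   # reconstructed, modified x-vectors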

2021

  1. Speech Watermarking by McAdams Coefficient Scheme Based on Random Forest Learning.
    Candy Olivia Mawalim, and Masashi Unoki.

    Entropy, MDPI, vol. 23, no. 10, 2021.

    Speech watermarking has become a promising solution for protecting the security of speech communication systems. We propose a speech watermarking method that uses the McAdams coefficient, which is commonly used for frequency harmonics adjustment. The embedding process was conducted, using bit-inverse shifting. We also developed a random forest classifier, using features related to frequency harmonics for blind detection. An objective evaluation was conducted to analyze the performance of our method in terms of the inaudibility and robustness requirements. The results indicate that our method satisfies the speech watermarking requirements with a 16 bps payload under normal conditions and numerous non-malicious signal processing operations, e.g., conversion to Ogg or MP4 format.
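    The McAdams transform that carries the watermark bits warps the angles of the LPC poles by a coefficient alpha; embedding then amounts to choosing between two alpha values per frame. Below is a frame-level sketch of that warp, assuming librosa's LPC; it omits the bit-inverse shifting embedder and the random forest detector, and the alpha values and frame are placeholders.

        # Frame-level McAdams warping: pole angles phi are mapped to phi**alpha.
        import numpy as np
        import librosa
        from scipy.signal import lfilter

        def mcadams_warp(frame, alpha, order=16):
            a = librosa.lpc(frame, order=order)       # LPC coefficients [1, a1, ..., a_order]
            residual = lfilter(a, [1.0], frame)       # inverse filtering -> excitation
            poles = np.roots(a)
            warped = np.array([
                np.abs(p) * np.exp(1j * np.sign(np.angle(p)) * np.abs(np.angle(p)) ** alpha)
                if np.imag(p) != 0 else p             # warp only complex poles
                for p in poles
            ])
            a_new = np.real(np.poly(warped))
            return lfilter([1.0], a_new, residual)    # resynthesize with warped envelope

        frame = np.random.randn(400).astype(np.float32)  # placeholder speech frame
        bit0 = mcadams_warp(frame, alpha=0.9)            # e.g., one coefficient per bit value
        bit1 = mcadams_warp(frame, alpha=1.1)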
  2. Task-independent Recognition of Communication Skills in Group Interaction Using Time-series Modeling.
    Candy Olivia Mawalim, Shogo Okada, and Yukiko I. Nakano.

    ACM Transactions on Multimedia Computing Communications and Applications, vol. 17, no. 4, pp. 122:1-122:27, 2021.

    Case studies of group discussions are considered an effective way to assess communication skills (CS). This method can help researchers evaluate participants’ engagement with each other in a specific realistic context. In this article, multimodal analysis was performed to estimate CS indices using a three-task-type group discussion dataset, the MATRICS corpus. The current research investigated the effectiveness of engaging both static and time-series modeling, especially in task-independent settings. This investigation aimed to understand three main points: first, the effectiveness of time-series modeling compared to nonsequential modeling; second, multimodal analysis in a task-independent setting; and third, important differences to consider when dealing with task-dependent and task-independent settings, specifically in terms of modalities and prediction models. Several modalities were extracted (e.g., acoustics, speaking turns, linguistic-related movement, dialog tags, head motions, and face feature sets) for inferring the CS indices as a regression task. Three predictive models, including support vector regression (SVR), long short-term memory (LSTM), and an enhanced time-series model (an LSTM model with a combination of static and time-series features), were taken into account in this study. Our evaluation was conducted by using the R2 score in a cross-validation scheme. The experimental results suggested that time-series modeling can improve the performance of multimodal analysis significantly in the task-dependent setting (with the best R2 = 0.797 for the total CS index), with word2vec being the most prominent feature. Unfortunately, highly context-related features did not fit well with the task-independent setting. Thus, we propose an enhanced LSTM model for dealing with task-independent settings, and we successfully obtained better performance with the enhanced model than with the conventional SVR and LSTM models (the best R2 = 0.602 for the total CS index). In other words, our study shows that a particular time-series modeling can outperform traditional nonsequential modeling for automatically estimating the CS indices of a participant in a group discussion with regard to task dependency.
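    The "enhanced time-series model" combines sequential and static information; a minimal sketch of that pattern is shown below (PyTorch, placeholder dimensions and data): an LSTM summarizes per-frame multimodal features, and the summary is concatenated with static aggregate features before regression. It illustrates the structure only, not the MATRICS configuration.

        # LSTM summary of time-series features concatenated with static features.
        import torch
        import torch.nn as nn

        class EnhancedLSTMRegressor(nn.Module):
            def __init__(self, seq_dim=50, static_dim=20, hidden=64):
                super().__init__()
                self.lstm = nn.LSTM(seq_dim, hidden, batch_first=True)
                self.head = nn.Linear(hidden + static_dim, 1)

            def forward(self, seq, static):
                _, (h, _) = self.lstm(seq)           # h: (1, batch, hidden)
                return self.head(torch.cat([h[-1], static], dim=1)).squeeze(-1)

        model = EnhancedLSTMRegressor()
        score = model(torch.randn(8, 100, 50), torch.randn(8, 20))  # 8 participants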
  3. Improving Security in McAdams Coefficient-Based Speaker Anonymization by Watermarking Method.
    Candy Olivia Mawalim, and Masashi Unoki.

    2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Tokyo, Japan, December 2021.

    Speaker anonymization aims to suppress speaker individuality to protect privacy in speech while preserving the other aspects, such as speech content. One effective solution for anonymization is to modify the McAdams coefficient. In this work, we propose a method to improve the security of McAdams coefficient-based speaker anonymization by using a speech watermarking approach. The proposed method consists of two main processes: one for embedding and one for detection. In the embedding process, two different McAdams coefficients represent the binary bits "0" and "1". The watermarked speech is then obtained by frame-by-frame bit-inverse switching. Subsequently, the detection process is carried out by a power spectrum comparison. We conducted objective evaluations of the anonymization with reference to the VoicePrivacy 2020 Challenge (VP2020) and of the speech watermarking with reference to the Information Hiding Challenge (IHC), and found that our method could satisfy the blind detection, inaudibility, and robustness requirements in watermarking. It also significantly improved the anonymization performance in comparison to the secondary baseline system in VP2020.

2020

  1. Speech Information Hiding by Modification of LSF Quantization Index in CELP Codec.
    Candy Olivia Mawalim, Shengbei Wang, and Masashi Unoki.

    2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Auckland, New Zealand, December 2020.

    A prospective method for securing digital speech communication is by hiding the information within the speech. Most of the speech information hiding methods proposed in prior research lack robustness when dealing with the encoding process (e.g., the code-excited linear prediction (CELP) codec). CELP codecs provide a codebook that represents the encoded signal at a lower bit rate. As essential features in speech coding, line spectral frequencies (LSFs) are generally included in the codebook. Consequently, LSFs are considered a prospective medium for information hiding that is robust against CELP codecs. In this paper, we propose a speech information hiding method that modifies the least significant bit of the LSF quantization obtained by a CELP codec. We investigated the feasibility of our proposed method by objective evaluation in terms of detection accuracy and inaudibility. The evaluation results confirmed the reliability of our proposed method with some further potential improvement (multiple embedding and varying segmentation lengths). The results also showed that our proposed method is robust against several signal processing operations, such as resampling, adding Gaussian noise, and several CELP codecs (i.e., the Federal Standard-1016 CELP, G.711, and G.726).
  2. X-Vector Singular Value Modification and Statistical-Based Decomposition with Ensemble Regression Modeling for Speaker Anonymization System.
    Candy Olivia Mawalim, Kasorn Galajit, Jessada Karnjana, and Masashi Unoki.

    Interspeech 2020, 21st Annual Conference of the International Speech Communication Association, Virtual Event, Shanghai, China, pp. 1703–1707, October 2020.

    Anonymizing speaker individuality is crucial for ensuring voice privacy protection. In this paper, we propose a speaker individuality anonymization system that uses singular value modification and statistical-based decomposition on an x-vector with ensemble regression modeling. An anonymization system requires speaker-to-speaker correspondence (each speaker corresponds to a pseudo-speaker), which may be possible by modifying significant x-vector elements. The significant elements were determined by singular value decomposition and variant analysis. Subsequently, the anonymization process was performed by an ensemble regression model trained using x-vector pools with clustering-based pseudo-targets. The results demonstrated that our proposed anonymization system effectively improves objective verifiability, especially in anonymized trials and anonymized enrollments setting, by preserving similar intelligibility scores with the baseline system introduced in the VoicePrivacy 2020 Challenge.
  3. Audio Information Hiding Based on Cochlear Delay Characteristics with Optimized Segment Selection.
    Candy Olivia Mawalim and Masashi Unoki.

    Advances in Intelligent Systems and Computing, Springer, vol. 1145, 2020.

    Audio information hiding (AIH) based on cochlear delay (CD) characteristics is a promising technique for dealing effectively with the trade-off between inaudibility and robustness requirements. However, the use of phase-shift keying (PSK) for blindly detectable AIH based on CD characteristics caused abrupt phase changes (spectrum spreading), which degrade inaudibility. This paper proposes a technique to reduce the spectrum spreading caused by PSK through a segment selection process with spline interpolation optimization. Objective evaluations measuring detection accuracy (BDR) and inaudibility (PEAQ and LSD) were carried out on a dataset of 102 music clips of various genres. Based on the evaluation results, our proposed method successfully reduced the spectrum spreading caused by PSK, improving inaudibility while maintaining adequate detection accuracy at payloads up to 1024 bps.

2019

  1. Feasibility of Audio Information Hiding Using Linear Time Variant IIR Filters Based on Cochlear Delay.
    Candy Olivia Mawalim and Masashi Unoki.

    Journal of Signal Processing, Research Institute of Signal Processing, vol. 23, no. 4, 2019.

    A reported technique for cochlear delay (CD) based audio information hiding achieved imperceptibility in non-blind approaches. However, the phase shift keying (PSK) technique used in the blind method caused drastic phase changes, making the inserted information perceptible. This paper presents an investigation of the feasibility of hiding information in a linear time-variant (LTV) system. We adapted a CD filter design for the LTV system in the embedding scheme. The detection scheme was conducted using instantaneous chirp-z transformation (CZT). Objective tests for checking the imperceptibility (PEAQ and LSD) and the data payload (bit detection rate (BDR)) were conducted to evaluate our method. Our experimental results supported the feasibility of utilizing CD-based audio information hiding in an LTV system. In addition, our method performed better than the previous blind method in terms of both imperceptibility and data payload.
  2. Multimodal BigFive Personality Trait Analysis using Communication Skill Indices and Multiple Discussion Types Dataset.
    Candy Olivia Mawalim, Shogo Okada, Yukiko I. Nakano, and Masashi Unoki.

    Social Computing and Social Media: Design, Human Behavior, and Analytics, Lecture Notes in Computer Science, Springer, vol. 11578, 2019.

    This paper focuses on multimodal analysis of a multiple-discussion-type dataset for estimating BigFive personality traits. The analysis was conducted to achieve two goals: first, clarifying the effectiveness of multimodal features and communication skill indices for predicting the BigFive personality traits; and second, identifying the relationship among multimodal features, discussion type, and the BigFive personality traits. The MATRICS corpus, which contains datasets of three discussion task types, was utilized in this experiment. From this corpus, three sets of multimodal features (acoustic, head motion, and linguistic) and communication skill indices were extracted as the input for our binary classification system. The evaluation was conducted using the F1-score in 10-fold cross-validation. The experimental results showed that the communication skill indices are important in estimating the agreeableness trait. In addition, the scope and freedom of conversation affected the performance of the personality trait estimators. The freer a discussion is, the better the personality trait estimator that can be obtained.

2017

  1. POS-based reordering rules for Indonesian-Korean statistical machine translation.
    Candy Olivia Mawalim, Dessi Puji Lestari, and Ayu Purwarianti.

    6th International Conference on Electrical Engineering and Informatics (ICEEI), pp. 1-6, 2017.

    In SMT systems, reordering is one of the most important and difficult problems to solve. The problem becomes especially serious when the grammatical patterns of the source and target languages differ. Previous research on reordering models in SMT used a distortion-based reordering approach. However, this approach is not suitable for Indonesian-Korean translation, mainly because the word order of Indonesian and Korean is largely reversed. Therefore, in this study, we develop source-side reordering rules using POS tags and word alignment information. The experimental results show that this technique is promising for solving the reordering problem. By applying 130 reordering rules for ID-KR and 50 reordering rules for KR-ID translation, translation quality in terms of BLEU score increases by 1.25% for ID-KR translation and 0.83% for KR-ID translation. Moreover, combining these reordering rules with Korean verb formation rules for ID-KR translation increases the BLEU score from 38.07 to 49.46 (in an evaluation on 50 simple sentences).
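    As a toy illustration of source-side, POS-driven reordering, the sketch below applies a single rule that matches a POS pattern and emits a permutation of the matched positions; the rule, tags, and example sentence are invented for illustration and are not among the paper's 130 rules.

        # Apply one POS-pattern reordering rule across a tagged sentence.
        def apply_rule(tokens, tags, pattern, order):
            n = len(pattern)
            i = 0
            out_tok, out_tag = [], []
            while i < len(tokens):
                if tags[i:i + n] == pattern:
                    out_tok += [tokens[i + j] for j in order]  # permute the matched span
                    out_tag += [tags[i + j] for j in order]
                    i += n
                else:
                    out_tok.append(tokens[i])
                    out_tag.append(tags[i])
                    i += 1
            return out_tok, out_tag

        tokens = ["saya", "membaca", "buku"]   # "I read (a) book"
        tags = ["PRP", "VB", "NN"]
        # Invented rule: move the verb after its object, approximating SOV order.
        print(apply_rule(tokens, tags, ["VB", "NN"], [1, 0]))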
  2. Rule-based Reordering and Post-Processing for Indonesian-Korean Statistical Machine Translation.
    Candy Olivia Mawalim, Dessi Puji Lestari, and Ayu Purwarianti.

    Proceedings of the 31st Pacific Asia Conference on Language, Information and Computation, pp. 287-295, Cebu, Philippines, 2017.

    In SMT systems, reordering is one of the most important and difficult problems to solve. The problem becomes especially serious when the grammatical patterns of the source and target languages differ. Previous research on reordering models in SMT used a distortion-based reordering approach. However, this approach is not suitable for Indonesian-Korean translation, mainly because the word order of Indonesian and Korean is largely reversed. Therefore, in this study, we develop source-side reordering rules using POS tags and word alignment information. The experimental results show that this technique is promising for solving the reordering problem. By applying 130 reordering rules for ID-KR and 50 reordering rules for KR-ID translation, translation quality in terms of BLEU score increases by 1.25% for ID-KR translation and 0.83% for KR-ID translation. Moreover, combining these reordering rules with Korean verb formation rules for ID-KR translation increases the BLEU score from 38.07 to 49.46 (in an evaluation on 50 simple sentences).