CCS '18 Paper #870 Reviews and Comments
===========================================================================
Paper #870: Void: Voice liveness detection through a spectrogram analysis of voice commands


Review #870A
===========================================================================
* Updated: Jul 13, 2018, 4:12:08 PM GMT

Overall merit
-------------
2. Weak reject

Brief paper summary (2-3 sentences)
-----------------------------------
The paper presents a liveness detection system that analyzes spectral power patterns of voice signals to detect voice spoofing attacks. Two attack vectors are evaluated: replay attacks (replaying previously recorded samples), and hidden voice commands and inaudible commands. Experiments conducted on several voice datasets show that the system can detect these attacks with higher accuracy than other liveness detection schemes.

Strengths
---------
• With the increasing use of voice authentication apps, detecting spoofing attacks is of significant value, and the idea of using spectral power patterns of voice signals is interesting and novel.
• This study aims to identify voice spoofing attacks without using any additional hardware. It runs a lightweight classification algorithm with a small number of features and shows promising results.
• The evaluation setup for playing back the samples seems reasonable and realistic. An attacker could have similar tools and access to the victim's recordings and launch the attack in the proximity of the phone.

Weaknesses
----------
• The feature set does not show a significant difference between human voices and voices replayed through loudspeakers in the lower frequencies above 1 kHz.
• The most crucial part of this paper is the classification stage, which is responsible for distinguishing replayed voice samples from original human voices. In this study, only a few classifiers have been tested, of which SVM was finally used as the classifier for the two classes (live-human and replayed). I am not convinced that this system can detect voice spoofing attacks. It would be better to employ more models and classifiers to test the accuracy of the system and the extracted feature set.
• I did not understand the value of the work in evaluating hidden voice commands and inaudible commands. These techniques are supposed to fail in the presence of voice authentication, even if no liveness detection is deployed. Instead, testing the method against other types of voice synthesis attacks could be explored.

Detailed comments for the authors
---------------------------------
• The performance degrades from one dataset to another, specifically in detecting live-human voices. Could you elaborate on your insights into this difference? Have you considered testing other professionally recorded voice datasets as the baseline of the study? They should perhaps show the highest accuracy.
• What was the setup for the hidden voice command experiments? Did you try to hide the commands in audio samples from the victim user? Without your liveness detection approach, did you notice that the voice authentication systems would fail when the inaudible commands and hidden voice commands were played? My understanding is that such attacks target the speech recognition engine and therefore should be detected by the voice authentication systems. So how is liveness detection helpful in this setting?
• Apart from the replay attack, several other types of voice synthesis attacks have been introduced, and they seem to be able to replicate the victim's voice (e.g., Festvox voice conversion, Lyrebird voice synthesis). What is the motivation for testing only the replay attack? How would your system work against other voice synthesis attacks?
• This approach could perhaps be extended to other devices such as virtual assistant devices. It would be interesting to discuss this possibility in the paper.

Post-rebuttal comments:
Thanks for providing the responses to reviewers' concerns. The motivation to center on replay attacks is still unconvincing to me (the motivating scenario is not a replay attack, as far as I understood). The system has not been tested at all against voice synthesis attacks, so I do not really believe the claim that it could work against such attacks (theoretically it may seem to, but no tests have been done to prove or disprove this).

The focus on hidden voice commands still seems out of place. Most voice assistants already deploy voice biometrics (especially for sensitive commands, which is what we care about), which should already be able to defeat hidden voice attacks since they do not match the features of a given user's voice (hidden voices are designed to defeat speech recognition, not speaker recognition).

The founding premise of the defense (differentiating between mechanical speakers and humans) is weak. The paper does not show why a motivated attacker cannot build speakers that match the features of live human speech. This is an extremely weak argument for a CCS paper: "As for speakers that go beyond such specs, we argue that such speakers are not commonly available, and it would be very difficult and expensive for attackers to carry such speakers around."


Review #870B
===========================================================================

Overall merit
-------------
3. Weak accept

Brief paper summary (2-3 sentences)
-----------------------------------
This paper proposes a voice liveness detection system based on a spectral power analysis of voice commands. The system is evaluated on two datasets and achieves accuracies of over 99% and 98% in detecting voice replay attacks, with equal error rates of less than 1% and 5.1%, respectively.

Strengths
---------
Voice liveness detection is an important research topic as voice assistants become prevalent. This paper introduces a liveness detection system named Void. Compared with existing methods, Void shows the following strengths:
1. It does not require any additional hardware and only needs to analyze the signals recorded by the microphone.
2. It does not require the user to put the microphone physically near his/her mouth.
3. It is more efficient and lightweight than existing machine learning based approaches (e.g., those in the 2017 ASVspoof challenge).
This paper is well written, and the system is evaluated with abundant samples from humans and loudspeakers.

Weaknesses
----------
The method relies on the imperfections of loudspeakers, and a motivated attacker could invest in "perfect" speakers to bypass the detection.

Detailed comments for the authors
---------------------------------
The key insight of this paper is to utilize the 'imperfections' of audio recording and playback systems to detect replay attacks. Thus, some major questions naturally follow: what are these imperfections, how do they deteriorate audio quality, and to what extent is a replay attack associated with these imperfections?
It seems that this paper gives an answer to only the second question. The authors did not quantify the fundamental imperfections of speakers (especially the high-quality ones) that enable the system, which reduces the credibility of the proposed method and its performance against motivated attackers.

The imperfections of speakers are determined by device quality. Ideally, a perfect speaker with a flat frequency response would not possess such imperfections. Although an ideal speaker is not practically possible, a motivated attacker could utilize high-quality speakers to minimize the imperfections. Thus, the lowest bar for the proposed liveness detection system is its performance against high-quality speakers. However, the content on high-quality speakers in this paper is very confusing and unconvincing. For example:
1. In section 4.3 the authors first admit that high-quality speakers do not show power patterns similar to low-quality speakers and that their patterns are non-deterministic compared with live humans, but later they suddenly claim, with fig. 4, that high-quality speakers differ from humans in the normalized signal power, without giving any explanation. In fig. 4, some major frequency components are missing for the speakers. Such a result is extremely abnormal because it is not supposed to happen with high-quality speakers.
2. It is unclear how fig. 4 is made. It is supposed to be a normalized version of fig. 3, but they look very different. It is also unclear why the normalized signal power has to be used in the first place. What is the key difference between low- and high-quality speakers that results in the difference between fig. 3 and fig. 4? The same problem applies to fig. 6.
3. The experimental setup for fig. 3 and fig. 4 is insufficient to prove the key insights. For example, in fig. 3, the mixed frequency response of the entire replay system is used to prove the non-flat response of speakers. In fig. 4, it is unknown whether the same microphone is used to record the humans and the speakers.

Some other concerns include:
1. Lack of security analysis. Is it practically possible for an attacker to find a recording/playback equipment set that can bypass Void? If yes, what would the expense be? If not, why?
2. Lack of definition and explanation of the exponential and linear decay. It is not evident that the linear power decay pattern exists.
3. Only the distance of the replay attack is evaluated. It is unclear how the system performs in detecting a human user at different distances.
4. In table 3, the feature extraction time of 274.16 seconds is too long for a single voice sample. Isn't it supposed to be for all the samples in the dataset? If yes, how long does it take for the real-time detection of one command (feature extraction plus testing), assuming the system has been trained beforehand?
5. According to [16][18], inaudible voice commands do not shift the peak mean location to 3.9 kHz.
6. Some relevant studies on replay attack detection through channel noise and frequency response are not mentioned in the related work.

I have no doubt that the proposed system will work for replay attacks with low-quality speakers. However, the biggest challenge and shortcoming of this paper is in detecting high-quality speakers. I suggest the authors provide concrete evidence on the fundamental characteristics and signal analysis of high-quality speakers, or the contribution might be a little oversold.
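For concreteness, the kind of evidence I have in mind could be as simple as the following rough sketch (my own illustration, not code from the paper; the file names, band widths, and parameters are hypothetical), which compares the normalized per-band spectral power of a live recording against a replayed recording of the same command:

    # Compare normalized per-band spectral power of a live vs. a replayed
    # recording of the same command. A flat-response replay chain should
    # leave these band profiles nearly identical; deviations quantify the
    # "imperfections" the paper relies on. Assumes 16-bit WAV input.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import welch

    def normalized_band_power(path, n_bands=20, fmax=8000):
        rate, samples = wavfile.read(path)
        samples = samples.astype(np.float64)
        if samples.ndim > 1:                      # mix stereo down to mono
            samples = samples.mean(axis=1)
        freqs, psd = welch(samples, fs=rate, nperseg=2048)
        mask = freqs <= fmax                      # keep 0..fmax Hz
        freqs, psd = freqs[mask], psd[mask]
        edges = np.linspace(0, fmax, n_bands + 1)
        band_power = np.array([psd[(freqs >= lo) & (freqs < hi)].sum()
                               for lo, hi in zip(edges[:-1], edges[1:])])
        # Normalize so bands sum to 1, making profiles comparable
        # regardless of absolute playback volume.
        return band_power / band_power.sum()

    live = normalized_band_power("live_command.wav")          # hypothetical file
    replayed = normalized_band_power("replayed_command.wav")  # hypothetical file
    print(np.abs(live - replayed))   # per-band distortion added by the replay chain

Reporting this kind of per-band deviation for each class of loudspeaker (including studio monitors) would make the claims about high-quality speakers much easier to assess.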
Review #870C
===========================================================================

Overall merit
-------------
2. Weak reject

Brief paper summary (2-3 sentences)
-----------------------------------
Home assistant devices take any voice command they hear, regardless of who gives it. This paper attempts to differentiate between humans and audio speakers (e.g., a stereo).

Strengths
---------
-This is an area where attacks are increasingly being demonstrated as practical.
-Some interesting experiments!

Weaknesses
----------
-Limited novelty given previously published work.
-The authors don't look at attacks that use different audio codecs.

Detailed comments for the authors
---------------------------------
This is very interesting work, and it tries to tackle a very interesting problem. The increasing frequency with which devices are being sold with voice-only interfaces means that papers like this are very important. That said, I have a few issues that I think should be addressed.

Novelty: The authors correctly point out their relation to [7], and I certainly see that this is an improvement over that work. However, methodologically, the real difference appears to be simply that you picked better machine learning algorithms. Your extraction and training times are certainly lower, but at the end of the day the detection time is virtually identical. The authors also miss the following paper, which identifies similar characteristics but does not rely on machine learning: Blue et al., "Hello, Is It Me You're Looking For? Differentiating Between Human and Electronic Speakers for Voice Interface Security," In Proceedings of the ACM Conference on Security and Privacy in Wireless and Mobile Networks (WiSec), 2018.

The experiments here are very good, but I would recommend that the authors also consider two other scenarios. First, from the list of speakers given in the appendix, there does not appear to be any experiment that tests against a television (even though the attack in the intro specifically references one as the source). Secondly, the authors do not consider that an attacker may use alternative encodings (e.g., anything other than lossless .wav) to avoid the power profiling they do. This would make the work much stronger, especially given related work.

Lastly, the authors perform their tests from a very limited distance (all within one meter). It wasn't clear that this was realistic, given that devices across the room have been demonstrated as capable of executing these attacks. Given the degradation of frequencies through air, are your observations and training consistent in a more realistic environment?


Review #870D
===========================================================================
* Updated: Jul 5, 2018, 6:20:12 PM GMT

Overall merit
-------------
3. Weak accept

Brief paper summary (2-3 sentences)
-----------------------------------
With the growing popularity of voice assistants, authentication through recognizing speakers is gaining increasing importance. Speaker recognition can be defeated through voice replay attacks -- that is, making a recording of an authorized person giving the desired command and then replaying it through a speaker. To foil such attacks, this work builds a voice liveness detector that leverages differences between humans speaking in real life and audio played through speakers in terms of the spectral power observable at different frequencies. Based on the non-linearity of human voices in real life (an exponential power decay at around 1 kHz) and the higher number of peak frequencies, the authors use an SVM based on these features to classify voices as live or recorded.
They collect voice samples from 120 people they recruited, and they evaluate their SVM on these samples, as well as on voice samples from an additional 42 people collected as part of the previously compiled ASVspoof dataset. For the former set, the SVM performs exceedingly well, while it also performs respectably on the latter.

Strengths
---------
+Sensible approach to an important and timely topic
+SVM has many performance advantages over the neural networks used in prior work
+Collect a large test set that is fairly comprehensive in varying background noise, distance, and types of speakers used for replay attacks
+Also evaluate against a test set collected by others
+Comprehensive experiments include training and testing on different data (7.3.3)
+Also detect dolphin attacks (commands issued on inaudible frequencies)
+Clearly written

Weaknesses
----------
-Does the microphone have an impact?
-Published frequency response charts of devices are ignored
-Can this scheme be gamed by adjusting recordings to mirror the characteristic exponential decay of live voices?
-Loudspeakers tested do not include audiophile versions with a flat frequency response ... combination of speakers
-High (14.2%) FAR on ASVspoof
-Compared to ASVspoof, the authors' dataset focuses far more on built-in speaker playback

Detailed comments for the authors
---------------------------------
Overall, I found this paper's reliance on the differences in relative power of particular audio frequencies to detect replay attacks and dolphin attacks (commands issued on inaudible frequencies) against speaker-recognition systems to be a sensible approach to solving an important and timely problem, as authentication increasingly must be carried out over audio channels with the spread of voice assistants.

In my opinion, the greatest strength of this work is the large number of variables tested in the experiments. The comprehensive experiments varied the background noise, the recording distance, the human speaker's gender, the human speakers themselves, the loudspeakers used (both built-in and standalone speakers), and other factors. Seeing that the factors other than the loudspeaker generally had minimal impact on the detection accuracy was a very helpful contribution. Crucially, the authors' decision to test their approach both on data they collected themselves in the large number of varying circumstances mentioned above and on the previously collected ASVspoof dataset was a major strength of the paper. I especially liked Section 7.3.3's experiment training and testing on different data, and seeing that the detection rate dropped quite a bit was both realistic and a good choice to include in the paper. I also like that the paper used the same techniques to detect dolphin attacks, which received quite a bit of attention over the last year. In short, this paper's techniques, as evaluated, are able to detect both replay attacks and dolphin attacks, which are the best-known attacks on speaker-recognition systems.

This work relies on SVMs, whereas prior approaches have utilized deep learning. Notably, deep learning requires greater training time, greater classification time, and a much larger model, so I applaud the successful use of an SVM for this task. The paper's limited performance benchmarks rightly emphasize this advantage. Finally, the paper was clearly written and generally easy to follow.
Despite the paper's flaws, detailed below, I think it makes a nice contribution in moving the literature toward better being able to detect audio replay attacks on speaker-recognition systems.

While I think the work makes a valuable contribution overall, it does have some shortcomings. A clear oversight in this work is that the microphones used for recording voices in the first place are ignored. While it is valuable to know what a commodity smartphone microphone picks up and whether liveness can be detected with it, I think it would also be nice to consider what microphones with a flatter frequency response are able to do. I feel that more detail and greater emphasis should be given to the fact that "for recording human voices, we used three different laptops with a sound card manufactured from different vendors." Built-in sound cards on laptops, as well as the microphones built into the Samsung Galaxy S8 and iPhone 8, are far from the ideal recording method for a determined adversary. I feel that this should be mentioned, and the paper should also be more precise and detailed about how exactly the recordings were made (I was confused about whether laptops or phones were used for the different recordings), as well as whether the recordings were compressed (e.g., as mp3 files). In short, I am left wondering: is it the loudspeaker or is it the microphone that leads to the relative frequency differences that the SVM leverages in detecting liveness?

Relatedly, manufacturers of all but the most low-end loudspeakers and microphones publish the frequency response of their devices. Since the main audio features this work leverages relate to the relative power in different frequency ranges, I was surprised that the authors did not explicitly use this information in their attack experiments. That is, one could compensate for the frequency response of a speaker in different ranges by adjusting and equalizing the recording for the frequencies where the response is low or high. Building on this point, it seems like the proposed Void detection mechanism could be gamed by adjusting recordings to mirror the characteristic exponential decay of live voices (a rough sketch of what I mean appears before the additional comments below). The work does not currently address this at all, seemingly implying that the differences are fundamental features of replay attacks. It seems to me, though, that these are not inherent qualities. While I think it is ultimately very useful to build a liveness detector that foils simple replay attacks of the type likely to be deployed by "adversaries" sharing a home or office with a user, I think it is also important to discuss (or, ideally, evaluate in follow-up work) whether the features Void utilizes can be bypassed by a clever adversary with EQ software.

I think it's also worth noting that the 14.2% FAR on ASVspoof is quite high. Other than more closely matching the training data to the test data, do the authors have any other ideas for reducing the FAR in future work?

In addition, I feel that the authors somewhat mischaracterize the tested devices as "high-quality standalone speakers." The types of speakers tested -- Creative A60, Logitech S120, Bose mini bluetooth speaker 2 -- are low-end, commodity speakers for gaming or informally listening to music. A large number of audiophile speakers or speakers designed for music mixing/mastering are designed to have a very flat frequency response and cost many hundreds of dollars/euros, and it seems like these should have been tested. At the very least, the paper should accurately acknowledge the class of speakers tested.
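For concreteness, here is a rough sketch of the equalization attack I have in mind (my own illustration, not something evaluated in the paper; the speaker response values and file names are made up):

    # Pre-equalize a stolen recording with the inverse of a loudspeaker's
    # published frequency response, so that the replayed audio more closely
    # matches the spectral shape of live speech. All values are hypothetical.
    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import firwin2, lfilter

    rate, samples = wavfile.read("stolen_command.wav")   # hypothetical recording
    samples = samples.astype(np.float64)
    if samples.ndim > 1:                                  # mix stereo down to mono
        samples = samples.mean(axis=1)
    nyq = rate / 2.0

    # Hypothetical published speaker response, as (frequency in Hz, gain in dB).
    resp_hz = np.array([0.0, 100.0, 1000.0, 4000.0, nyq])
    resp_db = np.array([0.0, -6.0, 0.0, 3.0, -4.0])
    inverse_gain = 10 ** (-resp_db / 20.0)                # invert the per-band gain

    # Linear-phase FIR filter approximating the inverse response
    # (frequencies normalized to the Nyquist frequency, as firwin2 expects).
    taps = firwin2(513, resp_hz / nyq, inverse_gain)
    equalized = lfilter(taps, [1.0], samples)

    # Rescale and save the pre-equalized sample for replay through that speaker.
    equalized = np.int16(equalized / np.max(np.abs(equalized)) * 32767)
    wavfile.write("equalized_command.wav", rate, equalized)

Whether Void's features survive this kind of pre-equalization is exactly the evaluation I would like to see.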
Additional comments are the following:
--Could the plots on the right of Figure 1 and Figure 2 have consistent y axes? If they did, the difference between them would look less stark.
--In Figure 6, can the model number, or at least a citation to the model, be given in place of "Bose speaker" and "Yamaha speaker"? Is the Bose speaker the "Bose mini bluetooth speaker 2" listed elsewhere? That is crucially different from the high-end Bose speakers one might imagine from the manufacturer's name alone.
--Is it possible to come up with a better name for the approach than "Void," a term already overloaded in computer science?
--I wonder if artifacts of audio compression might also help differentiate pre-recorded voices that are played back. This seems like an obvious extension.
--Table 4 has a column named "detection" whose name is somewhat meaningless, though it seems to mean "classification correct" based on the other numbers in the table.
--Why was a second dataset of 20 university affiliates collected after the dataset of 100 IT professionals? That is, why the differences and the after-the-fact additional data collection?
--Could Table 3 also report the practical disk space used for both the prior approaches and Void?
--Can these be reworded, since they are very confusing?: "lower frequencies above 1k," "higher frequencies below 1k" (e.g., give the frequency ranges being discussed)
--Typos include "An SVM classifies dataset" and "clealry."

======================
Post-rebuttal comments
======================
I thank the authors for their rebuttal. Unfortunately, after reading the rebuttal, I am no longer enthusiastic about this paper for the reasons listed below:

1) The rebuttal and paper refer to "high-quality 5.1 and 2.1 channel speakers," which are distinct from the types of monitor speakers (search for, e.g., "studio monitors" with a "flat frequency response") that are indeed widely available (and used in recording studios around the world) and that might not have the same frequency decay patterns that Void leverages. Related to this point, the rebuttal does not fully address whether an attacker who is aware of the scheme could attempt to game it by compensating through EQ. It is not obvious to me why equalization intended to compensate for the characteristics of hardware would necessarily require hardware updates. Without a meaningful evaluation of an attacker trying to evade the system, I cannot advocate for this work.

2) While 15 different recording devices might be indicated in Table 1 (though it is not specified which is which), were they all effectively the same microphone (i.e., a commodity laptop microphone)? There is a wide array of professional microphones used by recording studios, and these likely could have performed better.

3) The claim that voice synthesis attacks could be detected because they must be played through loudspeakers is, particularly in light of the two points above, an unsupported claim.


Response by Author [Muhammad Ejaz Ahmed] (596 words)
---------------------------------------------------------------------------
WRT R3's comments on novelty "... better machine learning ..." and R1's comment "... the classification stage ...," our contributions extend beyond the optimal selection of classification algorithms.
We carefully analyzed the spectral power patterns of human voices and of voices replayed through loudspeakers (Section 4), and identified key features that enable a lightweight and fast voice liveness detection algorithm. The models presented in [7] focus merely on maximizing classification accuracy and do not consider real-world latency and resource constraints (Section 3.1), resulting in solutions that may not be acceptable in real-world voice assistant implementations. In contrast, we focused on identifying a minimal number of lightweight yet highly effective features to minimize both training and testing time, while still achieving competitively high accuracy. Void is much faster in training and testing than [7], and uses a significantly smaller number of features (Table 3). To demonstrate practicality, we also evaluated Void using our own dataset, which is about 13 times larger than the ASVspoof dataset (Section 7.3).

WRT R3's comment "... limited distance (all within one meter) ...," we clarify that Void was tested against three different speaker distances (Section 7.3.1): 15, 130, and 260 cm.

WRT R3's comment "... different audio codecs ...," we surmise that codecs would have negligible impact on Void's performance because, even under lossy compression, the pattern of transformed signals (Section 5.1) is not significantly affected. We did test Void under two scenarios implicitly related to compression (not reported in the paper): testing with shorter 0.5-second commands, and varying sampling frequencies (8, 16, and 44.1 kHz). For 0.5-second voice commands, the average EER was 1.7%, and for full-length commands, the average EER was 0.57%. We did not observe noticeable degradation in Void's performance.

WRT R3's comment on the missing paper (Blue et al.), it is not available online yet. We will review the paper once it becomes available.

WRT R1's comment "... value of the work in evaluating the hidden ...," voice biometric authentication solutions are known to achieve about 80-90% accuracy (with background noise), indicating that they could still misclassify hidden or inaudible commands. Void could be used as a complementary solution to detect such misclassified attacks.

WRT R1's comment "... better to employ more models ...," we also tested other representative models such as GMM and KNN (Section 5.3), but SVM performed the best. We did not experiment with CNN/DNN/RNN models because they are computationally expensive.

WRT R1's comment "... voice synthesis attacks ...," Void would also be effective against voice synthesis attacks because the synthesized samples would have to be played through loudspeakers.

WRT R1's comment "... feature set does not show a significant difference ...," for this reason we additionally identified key features for higher power frequencies (Algo. 3).

WRT R4's comment "... to mirror the characteristic exponential decay ...," yes, an adversarial machine learning attack could try to mimic the characteristics of live-human voices. However, such an attack would require an expensive hardware update because Void uses hardware-level features of existing loudspeakers.

WRT R4's comment "... microphone have an impact ...," we tested with numerous microphone configurations (14 loudspeakers and 15 microphones), as shown in Table 1.

WRT R2's comment "... perfect speakers ..." and R4's comment "... do not include audiophile ...," we did experiment with high-quality 5.1 and 2.1 channel speakers (Appendix C), showing that Void performs well against them.
As for speakers that go beyond such specs, we argue that such speakers are not commonly available, and it would be very difficult and expensive for attackers to carry such speakers around.
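Finally, to give a concrete picture of the kind of lightweight feature-plus-SVM pipeline described above, the following is a toy sketch with synthetic feature values (illustration only; these are not the actual Void features, parameters, or code):

    # Toy stand-ins for a small number of spectral-power features (e.g., a
    # low-frequency power decay statistic, a spectral peak count, and a
    # high-power frequency ratio). All numbers are synthetic.
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    def synthetic_features(n, replayed):
        decay = rng.normal(0.5 if replayed else 0.9, 0.1, n)
        peaks = rng.normal(6 if replayed else 12, 2.0, n)
        ratio = rng.normal(0.6 if replayed else 0.3, 0.05, n)
        return np.column_stack([decay, peaks, ratio])

    X = np.vstack([synthetic_features(500, False), synthetic_features(500, True)])
    y = np.concatenate([np.zeros(500), np.ones(500)])   # 0 = live, 1 = replayed

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0)
    clf = SVC(kernel="rbf", C=10, gamma="scale").fit(X_train, y_train)
    print("toy accuracy:", clf.score(X_test, y_test))

Because the feature vector is only a handful of values per command, both training and per-command classification remain fast, which is the design goal stated above.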