The following is a summary and discussion of the paper Adversarial Machine Learning in Image Classification: A Survey Toward the Defender's Perspective, by Machado et al.
Summary
- It is quite easy to fool machine learning models built for image classification by perturbing an image. Models often suffer from the Clever Hans effect: they latch onto spurious cues in the data rather than the features that actually define a class.
Taxonomy of Adversarial Images (Perturbations)
- Scope:
- Individual: Crafting a perturbation for each specific image
- General/Universal: Perturbations are generated without considering a specific input - this is more realistic and done more frequently
- Visibility:
- Optimal: Not noticeable by the human eye, and the classifier is confident in its incorrect prediction
- Indistinguishable: Not noticeable by the human eye, but unable to evade the model
- Visible: Able to be spotted by humans and still fools the model
- Physical: Added to real-world objects (for example, stop signs and traffic lights in self-driving applications)
- Fooling Images: Humans are unable to tell what the image represents, but the model classifies it with high confidence
- Noise: Non-malicious perturbations that nevertheless cause the image to be misclassified
- Measurement: the L0, L1, L2, and L∞ norms can be applied to measure the difference between the original and perturbed images
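A minimal sketch of these measurements (assuming both images are same-shaped numpy arrays in [0, 1]):

```python
import numpy as np

def perturbation_norms(original, perturbed):
    """Measure the distance between an original and a perturbed image."""
    delta = (perturbed - original).ravel()
    return {
        "L0": np.count_nonzero(delta),   # number of pixels changed
        "L1": np.abs(delta).sum(),       # total absolute change
        "L2": np.linalg.norm(delta),     # Euclidean distance
        "Linf": np.abs(delta).max(),     # largest single-pixel change
    }

# Example: a 28x28 image with a small random perturbation
img = np.random.rand(28, 28)
adv = img + np.random.uniform(-0.01, 0.01, img.shape)
print(perturbation_norms(img, adv))
```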
Taxonomy of Attacks
- A threat model describes the specific attacks that a defense (e.g., an antivirus, AV) is designed to protect against
- Attacker's Knowledge: white-, black-, and grey-box attacks
- Attacker's Influence:
- Poisoning Attacks: the attacker attempts to influence the training process of a model by corrupting the training data (e.g., with mislabeled samples) so that the model's behavior deviates
- Evasion Attacks: the attacker creates adversarial samples that can evade the trained model. Typically the attacker builds a surrogate model (grey-box attacks)
- Attack Specificity:
- Targeted Attacks: the attacker creates adversarial samples for a specific AV/machine learning model such that each sample is classified into a chosen class
- Untargeted Attacks: the attacker only wants to evade the model, i.e., have the AV mislabel the adversarial sample (any incorrect class will do)
- Attack Computation:
- Iterative: high computational cost (such as the genetic-algorithm approach we saw previously), but arguably more query-efficient (and so less likely to be detected by the AVs)
- Single Step: apply a gradient-ascent approach, deriving the perturbation from a single gradient step. This typically requires a large change to the original malware sample; the approach is usually less effective, but has a much lower computational cost
- Security Violations:
- Availability violation: Any attack that affects the model's usability, leading to a denial of service.
- Integrity: occurs when a sample is able to evade the model; the model still detects other samples correctly but is unable to detect the adversarial samples specifically
- Privacy: Attackers attempt to gain knowledge about the model (the architecture, the training/validation data, etc.)
- Different Approaches to Attacking:
- Gradient Attacks: typically done in a white-box setting, using the model's gradient to perform gradient ascent - not very realistic in practice
- Score-Based Attacks: use the model's output scores to guide perturbations, often via a surrogate model - a black/grey-box approach that is more realistic
- Decision-Based: build up small perturbations step by step using rejection sampling, relying only on the model's final decision (see the toy sketch after this list)
- Approximation Attacks: mimic the model's output with differentiable functions, then apply a gradient-based attack to that approximation
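A toy sketch of the decision-based idea (all names and values here are illustrative, not from the paper): only the model's final label is available, so random perturbations are proposed and kept only when the result evades the model:

```python
import numpy as np

def decision_based_attack(predict_label, x, true_label, sigma=0.05, steps=500):
    """Toy decision-based attack via rejection sampling.

    `predict_label` returns only the model's final decision (no scores or
    gradients). Random perturbations are proposed and accepted only when
    the perturbed image is misclassified; the step size shrinks over time
    so the adversarial image stays close to the original.
    """
    x_adv = x.copy()
    for _ in range(steps):
        candidate = np.clip(x_adv + np.random.normal(0.0, sigma, x.shape), 0.0, 1.0)
        if predict_label(candidate) != true_label:  # accept: evades the model
            x_adv = candidate
            sigma *= 0.95                           # take smaller steps over time
    return x_adv
```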
Attacking Algorithms (there are many; here are a few covered in the presentation)
- Fast Gradient Sign Method (FGSM): low computational cost and creates a large perturbation, BUT most of the time it does not evade the model (sketched after this list)
- Basic Iterative Method (BIM): an iterative version of FGSM - perturbations are generated in much smaller steps, and an upper bound is set on the total perturbation
- DeepFool: find the closest decision boundary and perturb the image just enough to cross it
- Carlini and Wagner Attack (C&W): an optimization algorithm that minimizes the L2 difference between the original and perturbed images while still producing an image that fools the classifier
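A minimal PyTorch sketch of FGSM and BIM, assuming `model` is a differentiable classifier and pixel values live in [0, 1]; the `eps`, `alpha`, and `steps` values are illustrative, not recommendations:

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps=0.03):
    """Single-step FGSM: one large step in the direction of the gradient sign."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def bim(model, x, y, eps=0.03, alpha=0.005, steps=10):
    """Basic Iterative Method: many small FGSM steps, kept within an eps ball."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        loss.backward()
        with torch.no_grad():
            x_adv = x_adv + alpha * x_adv.grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project back into the eps ball
            x_adv = x_adv.clamp(0, 1)                 # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv

# Usage (hypothetical model and batch): x_adv = bim(model, images, labels)
```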
Taxonomy of Defense
- Proactive or Reactive:
- Proactive: Developing robust classifiers against adversarial attacks
- Reactive: Develop specific models to detect malicious images, etc.
- Gradient Masking:
- Develop a defensive model with a smooth gradient, forcing attackers to perform a greater number of queries against the model
- There are many approaches a defender can use:
- Vanishing Gradients - use a very deep NN so that gradients become extremely small
- Shattered Gradients - introduce steps that are not differentiable
- Adversarial Training - include adversarial samples in the training data
- Defensive Distillation: train a model to output soft labels, then train a second model to predict those soft labels - the resulting model has a smooth gradient (see the sketch after this list)
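A hedged sketch of defensive distillation in PyTorch (the `teacher`, `student`, `optimizer`, and temperature `T` are assumed names and an illustrative value): the teacher's temperature-softened outputs become the soft labels the student learns to match:

```python
import torch
import torch.nn.functional as F

T = 20.0  # distillation temperature (illustrative value)

def soft_labels(teacher, x):
    """Teacher's temperature-softened class probabilities (the soft labels)."""
    with torch.no_grad():
        return F.softmax(teacher(x) / T, dim=1)

def distillation_loss(student_logits, soft_targets):
    """Cross-entropy between the student's softened output and the soft labels."""
    log_probs = F.log_softmax(student_logits / T, dim=1)
    return -(soft_targets * log_probs).sum(dim=1).mean()

# Training step sketch (assumes trained `teacher`, fresh `student`, an
# `optimizer` over the student's parameters, and a batch `x`):
# loss = distillation_loss(student(x), soft_labels(teacher, x))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```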
- Some other Defensive Approaches:
- Develop a model to distinguish between adversarial and real samples
- Compute the distributions of legitimate and adversarial images
- Preprocess images to remove perturbations before classification (see the sketch after this list)
- Ensemble of Models
- Proximity Measurements: check that the output label of a model matches the labels of similar samples (ones that follow the same "path" through a Neural Network)
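One simple form of the preprocessing defense is median filtering, which can squeeze out small perturbations before classification. A minimal sketch (assuming a grayscale HxW numpy image in [0, 1]; the filter size is an illustrative choice):

```python
import numpy as np
from scipy.ndimage import median_filter

def preprocess(image, size=3):
    """Smooth out small perturbations with a median filter before classifying."""
    return median_filter(image, size=size)

# The defended pipeline classifies the filtered image instead of the raw input:
# prediction = model.predict(preprocess(adversarial_image))
```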
Why are Adversarial Samples able to bypass Models?
- Model lacks generalization
- Models behave largely linearly due to the activation functions used in Deep Neural Networks, so they can be exploited by attackers easily (the linearity hypothesis)
- Boundary Tilting: Adversarial samples are generated by perturbing legitimate samples until they cross the boundary
- High dimensional data is very complex, thus models are vulnerable regardless of training techniques.
- Training Dataset is not representative
- Extracted features are not robust and can easily be exploited by attackers
Discussion
- Why are we looking at malware detection from the image-domain perspective?
- A lot of knowledge from other domains can be transferred to domains that at first glance do not seem to apply. Obviously there is not a "direct" mapping from one domain to the other, BUT there is loads of knowledge transfer
- The Brazilian Army released this paper? Why is this information in public view now?
- The work they are doing now is not classified - they have likely developed methods to defend against these attacks, and have likely come up with attacks that evade systems
- What is missing from this paper: Explainable AI. The presenter talks about some of their research on using explainable AI in network security.
- Able to check whether the predicted outcome is aligned with the model's behavior/response
- Can utilize Shapley values, which score how much each feature contributes positively or negatively to an outcome. If the Shapley score is low - i.e., low credibility from the main model - then we can rely on a secondary model for the result. This is a reactive approach (see the sketch below).
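A hedged sketch of that reactive fallback (the data, models, and threshold are all illustrative stand-ins, and the `shap` package is assumed): Shapley values are computed for the primary model's prediction, and when their total magnitude is low, the secondary model's answer is used instead:

```python
import numpy as np
import shap  # Shapley-value attributions
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative stand-in data and models (not from the paper)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
primary = RandomForestClassifier(random_state=0).fit(X, y)
secondary = LogisticRegression(max_iter=1000).fit(X, y)
explainer = shap.TreeExplainer(primary)

def predict_with_fallback(x, threshold=0.1):
    """Defer to the secondary model when the primary's attributions look weak.

    The total magnitude of the SHAP values acts as a rough credibility score;
    `threshold` is an illustrative cutoff, not a recommended value.
    """
    sample = x.reshape(1, -1)
    shap_vals = explainer.shap_values(sample)
    credibility = np.abs(np.asarray(shap_vals)).sum()
    model = primary if credibility >= threshold else secondary
    return model.predict(sample)[0]

print(predict_with_fallback(X[0]))
```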
- Is having a smooth gradient good or bad?
- A smooth gradient allows you to hide information about the gradient. Smoother gradients mean that an attacker will need to perform a greater number of queries against a model to learn about it.
- Moving target defense is also good, since it forces the attacker to continually re-learn a changing model
- Bagging models can also be applied as a defensive technique (see the sketch below)
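A minimal scikit-learn sketch (synthetic data standing in for real image/malware features): bagging trains many classifiers on bootstrap samples and aggregates their votes, so a single perturbation is less likely to fool every member at once:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

# Illustrative feature matrix standing in for image/malware features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Default base estimator is a decision tree; each of the 25 trees sees a
# different bootstrap sample, and predictions are aggregated by voting.
ensemble = BaggingClassifier(n_estimators=25, random_state=0).fit(X, y)
print(ensemble.predict(X[:5]))
```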
- Unlike in the malware domain, physical attacks are possible in the image domain. For example, for self-driving cars, attackers can cover up stop signs, modify traffic lights, etc.
- A SQL injection attack can even be applied physically (e.g., a crafted string on a license plate that is read and stored by an automatic plate reader)
- Physical attacks can also be performed in the facial recognition domain by wearing a mask, makeup, etc., which relates to the security domain as well