Pop Quiz! Can a Large Language Model Help With Reverse Engineering?

April 01, 2023

The following is a summary of the paper Pop Quiz! Can a Large Language Model Help With Reverse Engineering? by Pearce et al.


  • The question that this paper tried to answer is: Can a Large Language Model (LLM) help with reverse engineering software?

  • What is the goal of reverse engineering?

    • To understand how an existing piece of software works: what its goal is and how it achieves that goal.
    • You want to be able to understand what the inputs to the program are, what the outputs are, etc.
  • Analysts will often reverse engineer malware to understand how it works, how it is able to evade detection (if it is adversarial).

  • Control systems may also be reverse engineered.

    • This can be vital for maintaining legacy code and reimplementing its logic on newer devices.
  • The way the researchers approached this problem is by developing a "quizzing framework":

    • They leverage OpenAI’s code-davinci-001, focusing on C code (or decompiled C code; binaries were decompiled using the tool Ghidra).
    • They provide the model with the code and ask open-ended questions regarding the code's functionality, variables, etc.
    • The authors then judged each of the model's answers as either correct or incorrect.
    • The authors first set the temperature parameter to 0, then set it to 1, and compared the correctness of the model's answers under the two settings.
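
To make the quizzing idea concrete, here is a minimal sketch of how such a prompt could be assembled. The `// Q:` / `// A:` layout and the function name `build_quiz_prompt` are my own assumptions for illustration, not the exact format used in the paper.

```python
def build_quiz_prompt(code: str, question: str) -> str:
    """Put the code first, then the question, and leave the answer open.

    The "// Q:" and "// A:" markers are a hypothetical layout: the idea is
    that a completion model continues the text after "// A:".
    """
    return f"{code.rstrip()}\n\n// Q: {question}\n// A:"


if __name__ == "__main__":
    snippet = "int f(int x) { return x + 1; }"
    print(build_quiz_prompt(snippet, "What does the function f return?"))
```

The completion the model returns after `// A:` is what a human would then mark as correct or incorrect.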
  • Given a piece of malware, the LLM was typically able to tell correctly what it was doing when the temperature was set to 0. When the temperature was 1, the LLM typically responded incorrectly.

    • Even after randomizing the code (putting random variable names), the LLM was able to identify what was happening!
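
As a rough illustration of what "randomizing the code" can look like, the sketch below swaps a given set of identifiers for random names. It is a regex-based simplification I am assuming for the example (it does not parse C, so it would also rename matches inside strings or comments), not the tool the authors used.

```python
import random
import re
import string


def random_name(rng: random.Random, length: int = 8) -> str:
    """Return a random identifier that starts with a letter."""
    first = rng.choice(string.ascii_letters)
    rest = "".join(
        rng.choice(string.ascii_letters + string.digits) for _ in range(length - 1)
    )
    return first + rest


def randomize_identifiers(source: str, identifiers, seed: int = 0):
    """Replace each listed identifier with a random name (whole words only)."""
    rng = random.Random(seed)
    mapping = {}
    used = set()
    for name in identifiers:
        new = random_name(rng)
        while new in used:  # keep the replacement names distinct
            new = random_name(rng)
        used.add(new)
        mapping[name] = new
    if not mapping:
        return source, mapping
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in mapping) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], source), mapping


if __name__ == "__main__":
    src = "int add(int total, int step) { return total + step; }"
    obfuscated, mapping = randomize_identifiers(src, ["total", "step"])
    print(obfuscated)
```

The logic of the program is unchanged; only the hints carried by meaningful names are gone.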
  • Given software describing a PID controller, the LLM can identify the goal of the software, describe its inputs, etc.

  • When the code is compiled with different optimization flags, the greater the optimization (after decompiling the code, I am assuming), the less useful the LLM responses are!

  • The above approach was considered a "pop quiz" of the LLM by the researchers; afterwards, they developed a more systematic approach to evaluate the LLM.

    • They tested the LLM with malware code, industrial control systems, and real-world malware code from vx-underground
    • True/False questions were structured in the following manner: The above code (does not) have
    • Some open-ended questions were included as well
    • Two sampling parameters were tuned for the model: temperature (which affects the probability distribution over each token) and top_p; they set temperature to 0.4 and top_p to 1
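
For intuition about these two knobs, here is a small sketch of the textbook formulation: temperature divides the logits before the softmax, and top_p keeps the smallest set of most likely tokens whose cumulative probability reaches the threshold. The toy logits are made up, and this is not OpenAI's actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all mass on the best token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set()
    cumulative = 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
    for t in (0, 0.4, 1.0):
        print(t, softmax_with_temperature(toy_logits, t))
```

With top_p set to 1 nothing is filtered out, so in the setup above temperature is the only parameter shaping the sampling distribution.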
  • Using a more systematic approach, the researchers reached the following conclusions:

    • The LLM is much better suited to respond to "in domain" questions about the code provided. For example, the LLM gives poor responses to T/F questions about control systems when it is given malware code, and vice versa.
    • The LLM correctly answered just under half of the questions.
    • When you add randomization to the code, the correctness of the responses starts to go down significantly.
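
One way to see the in-domain versus out-of-domain gap is to group the judged answers by whether the question's domain matches the code's domain. The sketch below assumes each record is a `(code_domain, question_domain, judged_correct)` tuple; that record layout and the sample records are invented for illustration and are not the paper's data.

```python
from collections import defaultdict


def accuracy_by_match(records):
    """Return the fraction of correct answers for in-domain and out-of-domain questions."""
    counts = defaultdict(lambda: [0, 0])  # key -> [correct, total]
    for code_domain, question_domain, judged_correct in records:
        key = "in-domain" if code_domain == question_domain else "out-of-domain"
        counts[key][0] += int(judged_correct)
        counts[key][1] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


if __name__ == "__main__":
    # Invented records, only to show the shape of the input.
    toy_records = [
        ("malware", "malware", True),
        ("malware", "control", False),
        ("control", "control", True),
        ("control", "malware", False),
    ]
    print(accuracy_by_match(toy_records))
```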
  • This was true zero-shot testing, since code-davinci-001 does not have specific knowledge of cybersecurity and control systems. It truly shows how powerful these models are, as the model was able to answer about half of the questions correctly.


  • What is code-davinci?

    • It is one of the foundational models developed by OpenAI (ironically not so open anymore, but that's a side note). This model is what tools such as GitHub Copilot are built on.
  • Google used reCAPTCHA for two purposes:

      1. To actually verify that you are a human
      2. Once Google knew you were human, it used your answers for supervised learning from human feedback. For example, humans identified letters that computers were unable to recognize at the time, which was required to digitize many books. Another example is identifying vehicle-related objects such as buses, traffic lights, stop signs, etc. This was specifically used for Waymo, the self-driving car project.
  • This type of model can be treated as an Oracle:

    • It will lower the barrier to entry to software engineering/programming and allow users without technical knowledge to learn what a program does
    • Users are able to ask high-level questions, so anyone can query the model.
    • Before this model can be considered an Oracle, it must improve a lot
  • How can such a model be improved? Pipeline!

    • Having a pipeline will allow us to better process the code! We can decompile the software for static analysis and also execute the code to perform dynamic analysis
  • What is the biggest advancement of GPT-4?

    • The ability to add plugins specific to your use case! This allows you to tailor your experience!
  • There is a huge split in the community over whether language models like GPT-3/4 are a HUGE overfit or actually truly intelligent, leading to artificial general intelligence!

  • What else can language models be used for?

    • Log analysis?
    • Finding bugs in code, such as the use of insecure functions or buffer overflows
    • As long as something can be defined as language, we can utilize an LLM to deal with it!
  • What are some security specific use cases?

    • LLMs can create tailor-made phishing attacks for specific users
    • Replace problem words that are easily caught by spam detection systems (to develop adversarial attacks)
    • Get the sentiment of emails (is this email trying to convince you to perform an action?); try to detect intention of the email
  • Making security for humans:

    • One great example of this is replacing a user's password with a pattern they need to draw. Internally, the pattern will be stored as a numeric passcode, but the user will enter it by drawing the pattern. A pattern is easier to recall than a numeric sequence!
      • This raises the question: wouldn't it be easier to hack? By simply observing a person draw the pattern, wouldn't it be easier to remember it and break into that person's phone?
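
The pattern-to-passcode idea can be sketched as follows. I am assuming a 3x3 grid whose dots are numbered 1 to 9 from left to right and top to bottom, and that each dot may be visited once; real lock screens may number or constrain the dots differently.

```python
def pattern_to_passcode(pattern, rows=3, columns=3):
    """Convert a drawn pattern, given as (row, col) dots in visit order, to digits.

    Assumes dots are numbered 1..rows*columns, left to right, top to bottom
    (a hypothetical numbering chosen for this example).
    """
    seen = set()
    digits = []
    for row, col in pattern:
        if not (0 <= row < rows and 0 <= col < columns):
            raise ValueError(f"dot ({row}, {col}) is outside the grid")
        if (row, col) in seen:
            raise ValueError(f"dot ({row}, {col}) was visited twice")
        seen.add((row, col))
        digits.append(str(row * columns + col + 1))
    return "".join(digits)


if __name__ == "__main__":
    # An "L" shape: down the left column, then along the bottom row.
    l_shape = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]
    print(pattern_to_passcode(l_shape))  # prints 14789
```

The user only has to remember the "L" shape, while the system stores and compares the digit string.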
  • Playground is a more "free" version of ChatGPT. You can ask Playground to develop a malware program, BUT you are unable to ask ChatGPT the same question without hearing about the ethics

Written by Sidharth Baveja
Master of Computer Science Student at Texas A&M
Send me an email if you would like to get in touch: sidharthbav at gmail dot com