Using CodeQL variant analysis to find format string vulnerabilities - Part 1

Code review and static code analysis are things I really enjoy doing, sometimes for fun and sometimes for bread and butter. CodeQL is used for variant analysis: you model a vulnerable code pattern as a query and then search the codebase for every place that matches it. In this blog post I take a small example C program and play around with CodeQL to find exactly the calls that contain a vulnerable format string expression.
Running the program with the input "%08x.%08x" shows two scenarios where the format string vulnerability is successfully exploited:

Enter your name:%08x.%08x
Executing function 1
%08x.%08x
Executing function 2
1a8571fe
Executing function 3
64ed32a0.00000000  ====> Exploited 
Executing function 4
1a8571fe.1a857180  ====> Exploited
Executing function 1
I am hardcoded
Executing function 3
I am hardcoded but safe

So my objective is to play around with CodeQL and write a few queries that detect the code pattern where user input is used directly as the format string. My query should therefore flag the lines in func3 (Line 16) and func4 (Lines 23 and 24) as potentially exploitable, because the user-controlled string is passed to printf() and sprintf() without a separate format string.
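To make the target pattern concrete, here is a minimal C sketch. It is not the program analysed in this post; the function names render_as_format and render_as_data and the two DEMO_ARG values are hypothetical and exist only for illustration. The stand-in arguments give up to two %x conversions something defined to read, so the sketch avoids the undefined behaviour that real vulnerable code has when it passes no arguments at all.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in values: they give up to two %x conversions
 * something defined to read, so this demo has no undefined behaviour. */
#define DEMO_ARG1 0xdeadbeefu
#define DEMO_ARG2 0x00001337u

/* Unsafe pattern: the caller's string is used AS the format string. */
int render_as_format(char *out, size_t size, const char *user)
{
    assert(out != NULL && user != NULL);
    return snprintf(out, size, user, DEMO_ARG1, DEMO_ARG2);
}

/* Safe pattern: the format is a literal and the string is only data. */
int render_as_data(char *out, size_t size, const char *user)
{
    assert(out != NULL && user != NULL);
    return snprintf(out, size, "%s", user);
}
```

With the input "%08x.%08x", render_as_format fills the buffer with "deadbeef.00001337", while render_as_data copies the input through unchanged. In the real program there are no stand-in arguments, so the same conversions print whatever happens to be in registers or on the stack, which is what the "Exploited" lines in the output above show.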

  • First, let us create the database: codeql database create ./database/vulnc --language="cpp" --command="gcc format_string.c -o format.o" --source-root="./app/"
  • Next, I set up the workspace by installing VS Code and the vscode-codeql extension
  • Then I imported the generated database into the workspace
  • Now I am ready to begin.

Trial 1 - Find all printf & sprintf calls

import cpp

from FunctionCall f
where f.getTarget().getName().regexpMatch("(printf|sprintf)")
select f

All occurrences of printf and sprintf are found.

I don't really want the results for Line 17 and Line 25, as they only print a hardcoded newline and pose no threat.

Trial 2 - Remove all the newline-only calls

import cpp

from FunctionCall fc
where fc.getTarget().getName().regexpMatch("(printf|sprintf)")
and not fc.getArgument(0).getValue() = "\n"
select fc

That got rid of the newlines, and the results are down to 6.



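Trial 3 - Find all printf & sprintf calls where the format argument is a variable. This means no call with a hardcoded format string should be detected

import cpp

// The format string is argument 0 of printf but argument 1 of sprintf
// (argument 0 of sprintf is the destination buffer, which is always a variable).
from FunctionCall fc, int i
where
  (fc.getTarget().getName() = "printf" and i = 0 or fc.getTarget().getName() = "sprintf" and i = 1) and
  fc.getArgument(i) instanceof VariableAccess
select fc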



With this query I could detect all the locations where a potential format string vulnerability could exist.


Testing the model - Let's try another variant of the format string bug and check whether our model can detect it.

I added a new function to the existing code and rebuilt the database to verify that the model also works for other forms of printf call.

void func5(char *str){
    printf(str,"AAA");
}
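
Passing an extra argument does not make this variant safe, because str is still the format string. The sketch below mirrors func5 but writes into a buffer with snprintf so the result is easy to inspect; that change and the name func5_render are my own assumptions, not part of the original program.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors func5, but renders into a caller-supplied buffer instead of
 * printing to stdout (an assumption made purely for easy inspection). */
int func5_render(char *out, size_t size, const char *str)
{
    assert(out != NULL && str != NULL);
    /* str is still interpreted as the format; "AAA" is only its first argument. */
    return snprintf(out, size, str, "AAA");
}
```

If the caller supplies "%s", the buffer ends up holding "AAA" rather than the two characters the user typed, so the user is still steering the conversions. A string with more conversions than there are arguments would read past them, which is the same undefined behaviour as in the calls with no arguments.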



While this CodeQL query looks flawless for my code, it does not fit many real cases. I read some CodeQL articles on typical printf and sprintf scenarios where flow analysis is needed to decide whether the format string is actually user controlled. The query also does not cover the whole printf family of functions. So in the next post we will talk about flow analysis. Let me know if you spot any mistakes or improvements that can be made.
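
As a small reference for the "printf family" gap, the sketch below records which argument position holds the format string for some common C and POSIX formatting functions. The table is deliberately incomplete, and the names format_fn, FORMAT_FNS and format_arg_index are hypothetical; a query covering the family would need to check the matching position for each function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Function name and the zero-based index of its format string argument. */
struct format_fn {
    const char *name;
    int format_index;
};

/* A partial list; real code bases often add their own printf-like wrappers. */
static const struct format_fn FORMAT_FNS[] = {
    {"printf", 0},   {"vprintf", 0},
    {"fprintf", 1},  {"vfprintf", 1},
    {"sprintf", 1},  {"vsprintf", 1},
    {"dprintf", 1},  {"vdprintf", 1},
    {"syslog", 1},
    {"snprintf", 2}, {"vsnprintf", 2},
};

/* Returns the format argument index for a known function, or -1 otherwise. */
int format_arg_index(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof FORMAT_FNS / sizeof FORMAT_FNS[0]; i++) {
        if (strcmp(FORMAT_FNS[i].name, name) == 0)
            return FORMAT_FNS[i].format_index;
    }
    return -1;
}
```

For example, format_arg_index("snprintf") returns 2 because the destination buffer and its size come first, while an unrelated function such as puts returns -1.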


References:
https://help.semmle.com/QL/ql-support/ql-training/
https://codeql.github.com/docs/ql-language-reference/formulas/
https://codeql.github.com/docs/codeql-language-guides/analyzing-data-flow-in-cpp/