At the risk of sounding really stupid. Can someone ELI5 what's going on here and...

0x0 · on Nov 7, 2014

It starts with an invalid .jpg (literally a text file containing "hello"), and by trying over and over, changing random bytes and tracing the execution of the decoder program as it is fed the corrupted input, it will drill deeper and deeper into the program until it has gotten far enough that the input is actually a valid .jpg, without any human input.

Fuzzing like this is a very effective technique for finding (security) bugs in programs that parse input, because you will quickly end up with "impossible" input nobody thought to check for (but is close enough that it won't be rejected outright), and whoops there's your buffer overflow.

In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff. And don't forget, this will work with any type of program and file type, not just .jpgs and the djpeg binary.

ulber · on Nov 7, 2014

>In this particular case, the fuzzer is going beyond just throwing random input, as it considers which changes to the input trigger new code paths in the target binary, and therefore should have a higher success rate in triggering bugs compared to just trying random stuff.

To expand on this, techniques like this are called whitebox fuzzing (or maybe graybox in afl's case). In their extreme whitebox fuzzers even incorporate constraint solvers to directly solve inputs that take the program to previously unexplored paths. One very impressive project is the SAGE whitebox fuzzer [1,2,3] that's in production use at Microsoft (an internal project sadly). I work in the related field of automated test generation, but all my tools are very much research-grade. However, in SAGE they've done all the work of figuring out how 24/7 whitebox fuzzing can be integrated into the development process. I am somewhat envious of the researchers getting to work in an environment where that is possible. If you're interested I very much recommend reading the papers on SAGE.

[1] Poster about SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[2] An approachable article on SAGE: http://research.microsoft.com/en-us/um/people/pg/public_psfi...

[3] The paper with all the details: http://research.microsoft.com/en-us/projects/atg/ndss2008.pd...

f- · on Nov 8, 2014

The main problem with SAGE is that at least outside Microsoft, it exists just as a series of (very enthusiastic) papers :-)

So, while I suspect it's very cool, it's also a bit of a no-op for everybody else. It's also impossible to independently evaluate the benefits: for example, its performance cost, the amount of fine-tuning and configuration required for each target, the relative gains compared to less sophisticated instrumented fuzzing strategies, etc.

contingencies · on Nov 7, 2014

Great links, thanks.

3rd3 · on Nov 7, 2014

Why does it use fuzzed input in the first place? Couldn’t one just use random input from the beginning instead? It would be effectively equivalent but fuzzing of a "hello" string seems to be roundabout.

0x0 · on Nov 7, 2014

Well, "hello" is pretty random. :) It was probably just used for dramatic effect in the demo, and you have to start with something - of course even a 0 byte file would be enough.

You could also have a started with a valid .jpg with lots of complicated embedded exif metadata sections etc, and have a good chance of triggering bugs in those code paths without having to "discover exif" first.

sp332 · on Nov 7, 2014

From the article: it works without any special preparation: there is nothing special about the "hello" string.

Houshalter · on Nov 8, 2014

He said it took a day to find good jpg images. If you started the program with a valid input, then it would take much less time to explore the other code paths.

Zikes · on Nov 7, 2014

In this case "hello" was just a pseudorandom starter to seed the fuzzer.

malanj · on Nov 7, 2014

This is how I understand it:

1) JPEG file structure is complex (much more so than a BMP file for example). 2) Imagine you didn't know what it looked, but had a tool that could "read" them. djpeg in this case 3) What the fuzzer does is "peek" at how the djpeg tool reacts to random "fuzzed" data. It's smart about it, understanding when a bit of data makes the tool do something new (code path) 4) After millions of iterations it "learns" how JPEG file structure works and can generate valid JPEGs.

This might not be very relevant for JPEGs, given that you can just Google their structure. This is cool because it shows how the tool could figure out (or find a weakness in) something where you don't know how it works.

Someone · on Nov 7, 2014

It doesn't do #4; it just finds one input that makes the program exit with exit code zero (for success)

It is better to look at it as a maze solver that tries to generate instructions for a robot that will lead that robot through the maze, while that robot has its own weird way of interpreting the instructions. It it generates) makes

gear54rus · on Nov 7, 2014

This describes a program (afl) which is given another program (djpeg, some kind of JPEG parser - takes arbitrary data and decides if its JPEG or not) to play with.

afl starts feeding it random inputs gradually gaining limited understanding about what it accepts and what it does not.

Eventually, it gathers enough info (based on strace'ing, google it, and its output) to start pulling JPEG images out of thin air like it's nothing.

The catch is, that this afl program learns this all by itself, and it can do so with any other parser program (like ELF parsers, GIF parsers, whatever). It is a general purpose tool.

Also, similar concept can be applied to things like web servers, or any other servers (since they parse what clients send them).

bane · on Nov 7, 2014

It works kind of like http://reverseocr.tumblr.com/ does, but it monitors code paths in the "recognizer" until the software doing the recognition goes to where you want it. It's like a genetic algorithm meats code-path testing way of exercising your code.

phkahler · on Nov 7, 2014

>> Can someone ELI5 what's going on here...

Evolution.