Getting started with osquery development on Linux using CLion and QtCreator

Here’s a quick how-to to answer a question that was posted in the osquery Slack.

Setting up the base development environment

What follows is just a TLDR containing the bare minimum to get to the second part of the post. The osquery project has a lot of good documentation that you should really check out.

Prerequisites

osquery can run on pretty much any distribution, but for development I’d suggest going with Ubuntu. The bootstrap procedure will take care of everything, but you will need at least the following packages installed: sudo, make, python, ruby, bash (your login shell doesn’t matter) and of course git.

Getting the source and initializing the environment

  1. Clone the repository: git clone https://github.com/facebook/osquery.git
  2. Bootstrap the environment: make sysprep

The bootstrap procedure will actually perform the following tasks:

  • Update the system
  • Install the required packages from the repository
  • Initialize the third-party git submodule referenced by the CMake project
  • Install the osquery toolchain and the required libraries inside /usr/local/osquery using Linuxbrew

Using CMake without the Makefile wrapper

When running the make command, the provided Makefile will set a couple of environment variables that CMake will in turn automatically pick up when it is launched to configure the project.

export PATH="/usr/local/osquery/bin:${PATH}"

export DEPS_DIR="/usr/local/osquery"
export LDFLAGS="-L${DEPS_DIR}/legacy/lib -L${DEPS_DIR}/lib -B${DEPS_DIR}/legacy/lib -rtlib=compiler-rt -fuse-ld=lld"

export CC="${DEPS_DIR}/bin/clang"
export CXX="${DEPS_DIR}/bin/clang++"

If everything is working as intended, you should now be able to compile the project without calling make:

$ mkdir osquery-build
$ cd osquery-build
$ cmake ../osquery
$ cmake --build . --config Release -- -j `nproc`

Setting up the IDE

QtCreator

Open the Options dialog from the Tools menu, then adjust each page. Make sure to always select Apply, or the new settings will not show up on the next page.

CMake page

Add the CMake binary from /usr/local/osquery/bin/cmake and name it osquery CMake. Make sure to also enable Autorun CMake and Auto-create build directories.

Compilers page

Add the C++ and C Clang binaries located inside /usr/local/osquery/bin:

Name: osquery Clang (C), osquery Clang (C++)
Compiler path: /usr/local/osquery/bin/clang, /usr/local/osquery/bin/clang++
Platform linker flags: -L/usr/local/osquery/legacy/lib -L/usr/local/osquery/lib -B/usr/local/osquery/legacy/lib -rtlib=compiler-rt -fuse-ld=lld

Kits page

Name: osquery
Compiler: select the compilers we just added.
CMake tool: select the CMake binary we just added.

Environment: Use the following snippet

CC=/usr/local/osquery/bin/clang
CXX=/usr/local/osquery/bin/clang++
LDFLAGS="-L/usr/local/osquery/legacy/lib -L/usr/local/osquery/lib -B/usr/local/osquery/legacy/lib -rtlib=compiler-rt -fuse-ld=lld"
PATH=/usr/local/osquery/bin:${PATH}

Opening the osquery project

You should now be able to directly open the root CMakeLists.txt file.

Here are a couple of additional tweaks you can apply once you have opened the project.

Open the Build page inside the Projects settings and then:

  1. Add -j <cpu count> under Build Steps, Tool arguments
  2. Add SKIP_BENCHMARKS=1 and/or SKIP_TESTS=1 under Build Environment

Additional notes

By default QtCreator uses CMake to display the project files; you can change this using the combo box at the top left corner, above the explorer. I like to use the File System, OSQUERY view. You can also disable the bread crumbs if you click on the filter icon.

Using the clang model

The Clang model plugin can be enabled from the Help, About Plugins dialog. Keep in mind that while it’s more advanced than the standard one, it is also hard to make it work reliably.

The osquery toolchain is unusual, as it doesn’t really work on its own; we need to be able to specify some additional settings, and to do so we have to define the QTC_CLANG_NO_DIAGNOSTIC_CHECK=1 environment variable and restart QtCreator.

To configure the code model, open the Projects, Clang Code Model page and select Custom in the selector located at the top of the screen. Copy the default profile and name it osquery.

The following are my settings:

-Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-unused-macros -Wno-newline-eof -Wno-exit-time-destructors -Wno-global-constructors -Wno-gnu-zero-variadic-macro-arguments -Wno-documentation -Wno-shadow -Wno-switch-enum -Wno-missing-prototypes -Wno-used-but-marked-unused -march=x86-64 -mno-avx -I/usr/local/osquery/legacy/include -I/usr/local/osquery/include -I/usr/local/osquery/include/c++/v1 -I/usr/local/osquery/legacy/include -I/usr/local/osquery/include -I/usr/local/osquery/lib/clang/6.0.0/include

CLion

First of all, configure the variables:

  1. Open the CMakeLists.txt file and select Open as Project.
  2. Open the Settings dialog from the File menu, and then click on CMake under Build, Execution, Deployment.
  3. Click on the small folder icon on the right of the Environment input field.

Variable    Value

LDFLAGS     -L/usr/local/osquery/legacy/lib -L/usr/local/osquery/lib -B/usr/local/osquery/legacy/lib -rtlib=compiler-rt -fuse-ld=lld
CXX         /usr/local/osquery/bin/clang++
CC          /usr/local/osquery/bin/clang
PATH        /usr/local/osquery/bin:<YOUR_ORIGINAL_PATH>

You should now reset your CMake cache and select the all target:

  1. Click on Reset Cache and Reload Project from the Tools, CMake menu.
  2. Select Build All from the top-right.

You should now be able to build osquery!

DisARMing a Raspberry Pi - BSides San Francisco CTF 2017

Before we start, you should grab a copy of the challenge file from the CTF write-ups 2017 repository page.

The executable we are going to analyze is an ELF for ARM architecture; I took this chance to go and fetch my Raspberry Pi from my dusty drawers to put it to good use. I have installed the Ubuntu MATE for ARM image, but you can probably use whichever distribution you like.

In order to debug the program, I have used the ARM debug server that comes with IDA Pro (dbgsrv/armlinux_server) to connect to the Raspberry Pi; in case you don’t have it, take a look at either Voltron or radare2.

Let’s get started! Or so I thought; the executable crashes right after the first line of printed text - this is actually not a good start.

root@raspberrykoma:/# ./disarming.arm
To disarm, you need a secret code and some validators!
*** Error in `./disarming.arm': free(): invalid next size (fast): 0x555aa430 ***
Please enter secret code: Aborted

.text:099E  PUSH {R1-R3}
.text:09A0  PUSH {R7,LR}
.text:09A2  SUB SP, SP, #0x2C
.text:09A4  ADD R7, SP, #0
.text:09A6  STR R0, [R7,#0x2C+var_28]
.text:09A8  LDR R3, [R7,#0x2C+var_28]
.text:09AA  MOV R0, R3
.text:09AC  BL sub_E68
.text:09B0  MOV R3, R0
.text:09B2  AND.W R3, R3, #1
.text:09B6  STR R3, [R7,#0x2C+var_28]
.text:09B8  ADD.W R3, R7, #0x38
.text:09BC  STR R3, [R7,#0x2C+arg]
.text:09BE  LDR R0, [R7,#0x2C+varg_r1]
.text:09C0  BLX strdup                 ; first allocation
.text:09C4  MOV R3, R0
.text:09C6  STR R3, [R7,#0x2C+s]
.text:09C8  LDR R0, [R7,#0x2C+s]
.text:09CA  BLX strlen                 ; size passed to malloc()
.text:09CE  MOV R3, R0
.text:09D0  MOV R0, R3
.text:09D2  BLX malloc                 ; second allocation
.text:09D6  MOV R3, R0
.text:09D8  STR R3, [R7,#0x2C+ptr]
.text:09DA  LDR R3, [R7,#0x2C+ptr]
.text:09DC  STR R3, [R7,#0x2C+var_C]
.text:09DE  LDR R0, [R7,#0x2C+varg_r1]
.text:09E0  BLX strlen
.text:09E4  MOV R3, R0
.text:09E6  ADDS R3, #3
.text:09E8  LSRS R3, R3, #2
.text:09EA  STR R3, [R7,#0x2C+var_8]
.text:09EC  B loc_A18
.text:0A76  LDR R0, [R7,#0x2C+s]
.text:0A78  BLX free
.text:0A7C  LDR R0, [R7,#0x2C+ptr]     ; crash!
.text:0A7E  BLX free

This function allocates two buffers, the first one using strdup() and the second one using malloc(). The crash happens when attempting to release the latter, clearly indicating a fault in how the program handles its allocations. Specifically, the indexes used here fall outside the boundaries of the memory it has requested, so the instruction at address 0x09F2 overwrites vital heap metadata. The buffer in question is not really used for anything useful, so I decided to patch the STR instruction with a NOP opcode (0x00, 0xBF) to avoid the issue.

I’m not going to spend more time on this function, as the program is now able to run just fine; should you wish to investigate why the printed strings are not visible with your hex editor, this is where you should start looking.

We can finally start the program and see what it asks for:

root@raspberrykoma:/# ./disarming.arm
Please enter secret code: test
Please enter first validator: test

Both parameters are read inside the main() function, using two fgets() calls at addresses 0x0AE4 and 0x0B50. One thing to keep in mind is that this function stops at the newline character and includes it in the returned data (take a look at the code at address 0x0B0E).

Nothing here suggests how long the input values should be, as the stack-allocated buffer used to collect the input is pretty big; this means we’ll have to find another way to determine the size of the parameters.

// 0x0B54 - 0x0B6C
validator = strtol(validator_string, NULL, strlen(secret_code_string));

// 0x0B74 - 0x0B76
counter = 0;

// 0x0BA8 - 0x0BBC
while (strlen(secret_code) > counter)
{
    // 0x0B7C - 0x0B8E
    r1 = secret_code[counter];

    // 0x0B90 - 0x0B94
    r3 = fibs[r1];

    // 0x0B98 - 0x0B9A
    validator -= fibs[r1];

    // 0x0B9E - 0x0BA4
    counter++;
}

The validator value is first parsed using the strtol() function, and then transformed using an array named fibs which happens to contain the Fibonacci sequence.

Notice how the length of the secret code is used as the conversion base; since the most common bases are 8, 10 and 16, we will probably have to enter 8, 10 or 16 characters for the secret code. This is also an important clue about the validator, as we now know that it must be a number.

// 0x0BBE - 0x0BCE
dword_0AC = validator;

// 0x0BD0 - 0x0BF4
dword_0AC = xlate[secret_code[0]] ^ 'F' | dword_0AC;

// 0x0BF6 - 0x0C1C
dword_0AC = xlate[secret_code[1]] ^ 'l' | dword_0AC;

// 0x0C1E - 0x0C44
dword_0AC = xlate[secret_code[2]] ^ 'g' | dword_0AC;

// 0x0C46 - 0x0C6C
dword_0AC = xlate[secret_code[3]] ^ 'G' | dword_0AC;

The first four characters of the secret code are used as indexes in yet another sequence, this time named xlate; you will need a debugger if you want to inspect those values, as they are initialized during startup before the call to the main() entry point.

We can already guess what we need to enter here; we just have to find which indexes into the xlate array contain the characters used in the xor operation, but we’ll look into this matter later, when we have a better overview of what is happening inside this function.
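The four xor/or statements above can be condensed into a loop. What follows is only a sketch under stated assumptions: the XlateTable type and the check_prefix() name are made up for illustration, and the real table contents have to be dumped from the running process with a debugger.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the xlate array; the real contents are only
// available at runtime, so this sketch takes the table as a parameter.
using XlateTable = std::array<std::uint8_t, 256>;

// Accumulates the four xor results into `state`, like dword_0AC does.
// The result is zero only if `state` was zero on entry and the first four
// translated characters are exactly 'F', 'l', 'g' and 'G'.
std::uint32_t check_prefix(const XlateTable &xlate,
                           const std::string &secret_code,
                           std::uint32_t state)
{
    static const std::uint8_t expected[4] = {'F', 'l', 'g', 'G'};

    assert(secret_code.size() >= 4);

    for (std::size_t i = 0; i < 4; ++i)
    {
        const auto index = static_cast<std::uint8_t>(secret_code[i]);
        state |= static_cast<std::uint32_t>(xlate[index] ^ expected[i]);
    }

    return state;
}
```

Because the results are combined with a bitwise or, a single mismatch anywhere is enough to leave a non-zero bit behind and make the final comparison fail.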

// 0xC6E - 0xC76
if (dword_0AC == 0)
{
    // 0xC78 - 0xC86
    r3 = strlen(secret_code);
}
else
{
    // 0x0C88
    r3 = 8;
}

// 0x0C8A - 0x0C94
counter = r3 - 4;

// 0x0C98 - 0x0CA4
read_buffer[counter] = 0;

// 0x0CDC - 0x0CE4
while (counter > 0)
{
    // 0x0CA8 - 0x0CB8
    r2 = secret_code[counter + 4];

    // 0x0CBA - 0x0CC2
    r1 = xlate[r2];

    // 0x0CC4 - 0x0CD0
    r3 = read_buffer[counter] = r1;

    // 0x0CD2 - 0x0CD8
    counter--;
}

It should be pretty clear by now that the dword_0AC variable should always evaluate to zero; the initial value for counter will then be based on the secret code length.

The read_buffer was initially used to get the parameters from the user, so it will still contain the last value read by the fgets() call, which happens to be the validator.

// 0x0CE6 - 0x0CEE
if (dword_0AC == 0)
{
    // 0x0CF0 - 0x0CFE
    r2 = strlen(secret_code_string);
}
else
{
    // 0x0D02
    r2 = 8;
}

// 0x0D04 - 0x0D14
converted_read_buffer = strtoull(read_buffer, NULL, r2);

If everything works as expected, we will once again be using the secret code length as the conversion base for the strtoull() call. We don’t know what the correct secret code is yet, but we at least know that it must contain what is needed for the previous transformation loop to build a number.

// 0x0D18 - 0x0D24
r3 = 0x76673250 ^ converted_read_buffer;

// 0x0D26 - 0x0D2E
r2 = r3 | dword_0AC;

// 0x0D30 - 0x0D34
dword_0AC = r2;

// 0x0D36 - 0x0D3E
if (dword_0AC == 0)
{
    // 0x0D5A - 0x0D64
    return 0;
}
else
{
    // 0x0D40 - 0x0D46
    printf("Sorry, no flag for you!");

    // 0x0D4A
    sub_994(); // this just calls exit(1);
}

The transformed read buffer must evaluate to 0x76673250, as it’s the only way to keep the dword_0AC variable at zero; this last clue finally allows us to reconstruct the secret code and the validator value needed to get the flag.

Since the conversion base can be controlled by changing the length of the secret code, we have to decide how to express the value needed by the xor operation. One thing to keep in mind is that the digits we are going to use must be present inside the xlate sequence, or we will not be able to emit them during the transformation loop.

The first attempt went with the decimal base (1986474576) which requires the following digits: 1456789.

digit   xlate[] index

1       N/A
4       N/A
5       60  '<'
6       116 't'
7       66  'B'
8       122 'z'
9       N/A

Not having much luck with that, I tried again with base 16:

digit   xlate[] index

0       50  '2'
2       80  'P'
2       103 'g'
3       78  'N'
5       60  '<'
6       116 't'
7       66  'B'

The hexadecimal encoding is not only capable of expressing the value we need, but also gives us two different indexes to choose from for the digit ‘2’! I have chosen to go with the letter ‘g’, and built the final string: BttBNg<2. Having decided to go with base 16 means that the secret code will need to be exactly 16 bytes long.
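To double check the string, the transformation can be replayed offline. This is a minimal sketch, assuming only the xlate entries listed in the base 16 table above (the rest of the array is not reproduced here); translate() and decode_suffix() are hypothetical helper names, not functions from the binary.

```cpp
#include <cstdint>
#include <cstdlib>
#include <map>
#include <string>

// Known xlate entries (input character -> emitted digit), copied from the
// base 16 table; every other entry of the real array is left out.
const std::map<char, char> kKnownXlate = {
    {'2', '0'}, {'P', '2'}, {'g', '2'}, {'N', '3'},
    {'<', '5'}, {'t', '6'}, {'B', '7'}};

// Replaces each input character with the digit xlate would emit for it.
// Throws std::out_of_range for characters missing from the partial table.
std::string translate(const std::string &input)
{
    std::string output;
    for (const char c : input)
        output.push_back(kKnownXlate.at(c));

    return output;
}

// Parses the translated digits the way the program does, using base 16.
std::uint32_t decode_suffix(const std::string &input)
{
    const std::string digits = translate(input);
    return static_cast<std::uint32_t>(
        std::strtoull(digits.c_str(), nullptr, 16));
}
```

With these entries, both BttBNg<2 and BttBNP<2 translate to the digits 76673250, which is exactly the constant used by the final xor.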

We now need to retrieve the first four characters; this is not going to be hard, as they are clearly visible on the right-hand side of the xor operators at addresses 0x0BD0 - 0x0C6C.

char    xlate[] index
F       33  '!'
l       39  '\''
g       114 'r'
G       118 'v'

We have found another portion of the secret code, and we now have 12 bytes; the issue is that they are not enough if we want the strtol()/strtoull() functions to work correctly, since the length of the secret code is also the conversion base. I added some ‘X’ characters on the right, and here’s the final string: !'rvBttBNg<2XXXX.

The validator is the last missing piece of the puzzle; the program starts from the value entered by the user, and then transforms it using the Fibonacci sequence along with the secret code characters. This means that you will need to enter a different validator if you have used a different padding than mine. The same also applies if you have used the ‘P’ character in the xlate transformation.
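The relation between the secret code and the validator can be sketched as follows. The function names and the way the table is passed in are assumptions made for illustration; the actual fibs values must be taken from the binary, and the arithmetic is assumed to wrap around at 32 bits.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Mirrors the loop at 0x0B7C - 0x0BBC: one table entry is subtracted from
// the validator for each character of the secret code. `fibs` is a stand-in
// for the table found in the binary and must cover every character used.
std::uint32_t transform_validator(std::uint32_t validator,
                                  const std::string &secret_code,
                                  const std::vector<std::uint32_t> &fibs)
{
    for (const unsigned char c : secret_code)
        validator -= fibs.at(c);  // unsigned arithmetic wraps modulo 2^32

    return validator;
}

// The check only passes when the transformed validator is zero, so the value
// to enter is the (wrapping) sum of the table entries selected by the code.
std::uint32_t required_validator(const std::string &secret_code,
                                 const std::vector<std::uint32_t> &fibs)
{
    return 0u - transform_validator(0u, secret_code, fibs);
}
```

Changing a single character of the secret code, padding included, changes the sum, which is why the validator depends on the exact string entered.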

The value we need in this case should be 0xE3AC8677, but the strtol() function used to convert the validator returns a signed long, which is 32 bits wide on this platform; the value is out of range, so the function would just return LONG_MAX. We can work around this issue by expressing the value we need as a negative integer: -1c537989.
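The clamping behaviour is easy to reproduce on a desktop machine. The sketch below emulates a strtol() that returns a 32-bit long, as on the 32-bit ARM target, on top of strtoll(); the strtol32() and as_unsigned() helpers are hypothetical and exist only for this illustration.

```cpp
#include <cstdint>
#include <cstdlib>
#include <limits>

// Emulates strtol() with a 32-bit long: out-of-range results are clamped to
// the closest representable value, like LONG_MAX/LONG_MIN on the target.
std::int32_t strtol32(const char *text, int base)
{
    const long long parsed = std::strtoll(text, nullptr, base);

    if (parsed > std::numeric_limits<std::int32_t>::max())
        return std::numeric_limits<std::int32_t>::max();

    if (parsed < std::numeric_limits<std::int32_t>::min())
        return std::numeric_limits<std::int32_t>::min();

    return static_cast<std::int32_t>(parsed);
}

// Reinterprets the signed result as the unsigned 32-bit value the checks see.
std::uint32_t as_unsigned(std::int32_t value)
{
    return static_cast<std::uint32_t>(value);
}
```

Here strtol32("e3ac8677", 16) is clamped to 0x7FFFFFFF, while strtol32("-1c537989", 16) yields the bit pattern 0xE3AC8677 once reinterpreted as unsigned.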

We have found everything we need, and we should now be able to get the flag:

Please enter secret code: !'rvBttBNg<2XXXX
Please enter first validator: -1c537989
Good enough: FlgG76673250

Solving SmokeStack, from the third Flare-On Challenge

Note: This article was published right after the third Flare-On Challenge ended.

Official writeups can be found here: 2016 Flare-On Challenge solutions from fireeye.com

SmokeStack is the fifth level of the third edition of the Flare-On Challenge organized by FireEye. I’ve decided to write a post about it because this is one of the two levels I’ve enjoyed the most (the other being CHIMERA).

I will be using the assembly I’ve annotated from the start to make things easier to understand. A little warning: this post is (really) verbose, as I’ve included the assembly code in its entirety.

Let’s get started!

The application expects the user to pass 10 characters as the first parameter.

; function starts at virtual address 0x00402F30
_main proc
    ; ...

    ; 0x00402F76
    cmp     [ebp+argc], 1
    jle     __exit

    ; 0x00402F80
    mov     eax, [ebp+argv]
    mov     ecx, [eax+4]
    push    ecx
    call    _strlen
    jl      __exit

    ; ...
_main endp

Each character is then extended to a 2-byte value and copied to a global buffer. The code is sometimes pretty verbose, which is a clear indication that it was not compiled with optimization flags.

; function starts at virtual address 0x00402F30
_main proc
    ; ...

    ; 0x00402F9C
    mov     [ebp+i], 0
    jmp     short __vm_stack_initialization_loop

    ; 0x00402FAE
__vm_stack_initialization_loop:
    cmp     [ebp+i], 0Ah
    jge     short __start_vm_execution_loop

    ; 0x00402FB4
    mov     eax, [ebp+argv]
    mov     ecx, [eax+4]

    mov     edx, [ebp+i]
    movsx   ax, byte ptr [ecx+edx]

    mov     ecx, [ebp+i]
    mov     vm_stack[ecx*2], ax

    jmp     short __vm_stack_initialization_loop_condition

    ; 0x00402FA5
__vm_stack_initialization_loop_condition:
    mov     edx, [ebp+i]
    add     edx, 1
    mov     [ebp+i], edx

    ; ...
_main endp

Take a look at the cross references and notice how this is the only place where the program accesses this buffer directly, as the rest of the code will make use of the following two functions to access it:

; function starts at virtual address 0x00401000
VMStack_push proc value:word
    push    ebp
    mov     ebp, esp

    mov     ax, vm_stack_pointer
    add     ax, 1
    mov     vm_stack_pointer, ax

    movzx   ecx, vm_stack_pointer
    mov     dx, [ebp+value]
    mov     vm_stack[ecx*2], dx

    pop     ebp
    retn
VMStack_push endp

; function starts at virtual address 0x00401080
VMStack_pop proc
    push    ebp
    mov     ebp, esp
    push    ecx

    movzx   eax, vm_stack_pointer
    mov     cx, vm_stack[eax*2]
    mov     [ebp+word], cx

    mov     dx, vm_stack_pointer
    sub     dx, 1
    mov     vm_stack_pointer, dx

    mov     ax, [ebp+word]

    mov     esp, ebp
    pop     ebp
    retn
VMStack_pop endp

The counter grows when a value is saved, and decreases when a value is removed; it is pretty obvious that this is some kind of LIFO stack implementation.
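The same behaviour can be modeled in a few lines. This is a simplified sketch, not a decompilation: the VMStack type is hypothetical, and the capacity is an assumption, since the size of the original buffer is not shown here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Simplified model of the two accessors; the capacity of 256 words is an
// assumption made for this sketch.
struct VMStack
{
    std::array<std::uint16_t, 256> slots{};
    std::uint16_t pointer = 0;  // index of the element currently on top

    // Like VMStack_push: advance the pointer first, then store the value.
    void push(std::uint16_t value)
    {
        pointer = static_cast<std::uint16_t>(pointer + 1);
        assert(pointer < slots.size());
        slots[pointer] = value;
    }

    // Like VMStack_pop: read the top value, then move the pointer back.
    std::uint16_t pop()
    {
        assert(pointer < slots.size());
        const std::uint16_t value = slots[pointer];
        pointer = static_cast<std::uint16_t>(pointer - 1);
        return value;
    }
};
```

Since the pointer is incremented before the store, it always refers to the last element written, so a pop right after a push returns the value that was just saved.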

Back to the entry point: the buffer has been populated with our string and the first initialization phase ends at virtual address 0x00402FCF. The next function we’re going to enter is located at virtual address 0x00401610 and will soak up most of the execution time - this is where we are going to focus our efforts.

The first part is not really interesting, as it’s just initialization of values and function pointers.

; function starts at address 0x00401610
VMMain proc
    ;
    ; initialization
    ;

    push    ebp
    mov     ebp, esp

    call    InitializeVMOpcodeHandlers

    xor     eax, eax
    mov     vm_register_A, ax

    xor     ecx, ecx
    mov     vm_register_B, cx

    mov     edx, 9
    mov     vm_stack_pointer, dx

    xor     eax, eax
    mov     vm_instruction_pointer, ax

    ;
    ; main loop
    ;

    ; 0x0040163D
__vm_execution_loop:
    movzx   ecx, vm_instruction_pointer
    movzx   edx, vm_code_size
    cmp     ecx, edx
    jge     short __last_vm_instruction_reached

    ; 0x0040164F
    call    VMFetchAndExecuteNextOpcode
    jmp     short __vm_execution_loop

    ; 0x00401656
__last_vm_instruction_reached:

    ; the virtual machine exit code is taken from the first register
    mov     ax, vm_register_A
    pop     ebp
    retn
VMMain endp

; function starts at virtual address 0x00401570
InitializeVMOpcodeHandlers proc
    push    ebp
    mov     ebp, esp

    mov     vm_opcode_handlers, VMOpcodeHandler_push
    mov     vm_opcode_handlers+4, VMOpcodeHandler_pop
    mov     vm_opcode_handlers+8, VMOpcodeHandler_add
    mov     vm_opcode_handlers+0Ch, VMOpcodeHandler_sub
    mov     vm_opcode_handlers+10h, VMOpcodeHandler_RotateRight
    mov     vm_opcode_handlers+14h, VMOpcodeHandler_RotateLeft
    mov     vm_opcode_handlers+18h, VMOpcodeHandler_xor
    mov     vm_opcode_handlers+1Ch, VMOpcodeHandler_not
    mov     vm_opcode_handlers+20h, VMOpcodeHandler_eq
    mov     vm_opcode_handlers+24h, VMOpcodeHandler_sel
    mov     vm_opcode_handlers+28h, VMOpcodeHandler_jmp
    mov     vm_opcode_handlers+2Ch, VMOpcodeHandler_pushRegister
    mov     vm_opcode_handlers+30h, VMOpcodeHandler_mov
    mov     vm_opcode_handlers+34h, VMOpcodeHandler_nop

    pop     ebp
    retn
InitializeVMOpcodeHandlers endp

The second part of the routine will loop until a counter reaches the end, each time calling the following function:

; function starts at virtual address 0x00401540
VMFetchAndExecuteNextOpcode proc
    push    ebp
    mov     ebp, esp
    push    ecx

    movzx   eax, vm_instruction_pointer
    mov     cx, ds:vm_instructions[eax*2]
    mov     [ebp+opcode], cx

    movzx   edx, [ebp+opcode]
    mov     eax, vm_opcode_handlers[edx*4]
    call    eax

    mov     esp, ebp
    pop     ebp
    retn
VMFetchAndExecuteNextOpcode endp

Let’s get an overview of what is happening:

  • The function pointer array is used in conjunction with indexes found in a statically allocated buffer that has been initialized at compile-time.
  • A counter is incremented each time one of those function pointers is used.
  • As we already noticed, all values are extended (or truncated) to 16-bit words.

We can now draw some conclusions:

  • Each one of the indexes used to access the function pointer array is in fact an opcode.
  • The counter that is incremented at each call is the instruction pointer. It is updated by the handlers rather than by the fetch routine because opcodes do not all have the same length, as some of them require an immediate value.
  • You have probably already guessed it by now, but the global memory buffer is the virtual machine stack.
  • The virtual machine is heavily stack-based, and operates on 2-byte words.
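The fetch and dispatch mechanism described above can be sketched with a toy virtual machine. Everything here is illustrative: the VM type, the handler names and the growable stack are assumptions, and only three of the fourteen opcodes are implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy virtual machine state; the names are illustrative, not from the binary.
struct VM
{
    std::vector<std::uint16_t> code;   // opcodes and immediates, 16 bits each
    std::vector<std::uint16_t> stack;  // simplified stack that grows on demand
    std::uint16_t ip = 0;              // instruction pointer, in words
};

using Handler = void (*)(VM &);

// push <immediate>: two words long, so the pointer advances by two.
void op_push(VM &vm)
{
    vm.stack.push_back(vm.code.at(vm.ip + 1u));
    vm.ip = static_cast<std::uint16_t>(vm.ip + 2);
}

// add: pops two words and pushes their 16-bit sum.
void op_add(VM &vm)
{
    assert(vm.stack.size() >= 2);
    const std::uint16_t first = vm.stack.back();
    vm.stack.pop_back();
    const std::uint16_t second = vm.stack.back();
    vm.stack.pop_back();
    vm.stack.push_back(static_cast<std::uint16_t>(first + second));
    vm.ip = static_cast<std::uint16_t>(vm.ip + 1);
}

void op_nop(VM &vm)
{
    vm.ip = static_cast<std::uint16_t>(vm.ip + 1);
}

// Fetches one opcode at a time and lets its handler move the pointer.
// Returns the top of the stack (or 0) once the end of the code is reached;
// the original returns its first register instead.
std::uint16_t run(VM &vm)
{
    // Same indexes as the handler table above: 0 = push, 2 = add, 13 = nop.
    Handler handlers[14] = {};
    handlers[0] = op_push;
    handlers[2] = op_add;
    handlers[13] = op_nop;

    while (vm.ip < vm.code.size())
    {
        const std::uint16_t opcode = vm.code[vm.ip];
        assert(opcode < 14 && handlers[opcode] != nullptr);
        handlers[opcode](vm);
    }

    return vm.stack.empty() ? std::uint16_t{0} : vm.stack.back();
}
```

Feeding it the words 0, 2, 0, 3, 2, 13 (push 2, push 3, add, nop) leaves 5 on top of the stack, with the instruction pointer sitting right past the last word.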

And Now for Something Completely Different: a dump of the analyzed opcode handlers. I have put the sel opcode at the top of the list since it’s the most unusual one.

; this function pops three values, and uses the last one to decide
; which of the other two is pushed back onto the stack.
;
; function starts at virtual address 0x00401360
VMOpcodeHandler_sel proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    call    VMStack_pop
    mov     [ebp+third_word], ax

    movzx   eax, [ebp+third_word]
    cmp     eax, 1
    jnz     short loc_401399

    movzx   ecx, [ebp+first_word]
    push    ecx             ; value
    call    VMStack_push

    add     esp, 4
    jmp     short loc_4013A6

loc_401399:
    movzx   edx, [ebp+second_word]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

loc_4013A6:
    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_sel endp

; function starts at virtual address 0x00401180
VMOpcodeHandler_RotateRight proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch
    push    esi

    call    VMStack_pop

    mov     [ebp+first_word], ax
    call    VMStack_pop

    mov     [ebp+second_word], ax
    movzx   eax, [ebp+second_word]
    movzx   ecx, [ebp+first_word]
    sar     eax, cl

    movzx   edx, [ebp+second_word]
    movzx   ecx, [ebp+first_word]
    mov     esi, 10h
    sub     esi, ecx
    mov     ecx, esi
    shl     edx, cl
    or      eax, edx
    and     eax, 0FFFFh
    mov     [ebp+result], ax

    movzx   edx, [ebp+result]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    pop     esi
    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_RotateRight endp

; function starts at virtual address 0x004011F0
VMOpcodeHandler_RotateLeft proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch
    push    esi

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    movzx   eax, [ebp+second_word]
    movzx   ecx, [ebp+first_word]
    shl     eax, cl

    movzx   edx, [ebp+second_word]
    movzx   ecx, [ebp+first_word]
    mov     esi, 10h
    sub     esi, ecx
    mov     ecx, esi
    sar     edx, cl
    or      eax, edx
    and     eax, 0FFFFh
    mov     [ebp+value], ax

    movzx   edx, [ebp+value]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    pop     esi
    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_RotateLeft endp

; function starts at virtual address 0x00401030
VMOpcodeHandler_push proc
    push    ebp
    mov     ebp, esp
    push    ecx

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    movzx   ecx, vm_instruction_pointer
    mov     dx, ds:vm_instructions[ecx*2]
    mov     [ebp+immediate], dx

    movzx   eax, [ebp+immediate]
    push    eax             ; value
    call    VMStack_push
    add     esp, 4

    mov     cx, vm_instruction_pointer
    add     cx, 1
    mov     vm_instruction_pointer, cx

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_push endp

; function starts at virtual address 0x004010C0
VMOpcodeHandler_pop proc
    push    ebp
    mov     ebp, esp

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    call    VMStack_pop

    pop     ebp
    retn
VMOpcodeHandler_pop endp

; function starts at virtual address 0x004010E0
VMOpcodeHandler_add proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    movzx   eax, [ebp+first_word]
    movzx   ecx, [ebp+second_word]
    add     eax, ecx
    mov     [ebp+result], ax

    movzx   edx, [ebp+result]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_add endp

; function starts at virtual address 0x00401130
VMOpcodeHandler_sub proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    movzx   eax, [ebp+second_word]
    movzx   ecx, [ebp+first_word]
    sub     eax, ecx
    mov     [ebp+result], ax

    movzx   edx, [ebp+result]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_sub endp

; function starts at virtual address 0x00401260
VMOpcodeHandler_xor proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    movzx   eax, [ebp+first_word]
    movzx   ecx, [ebp+second_word]
    xor     eax, ecx
    mov     [ebp+result], ax

    movzx   edx, [ebp+result]
    push    edx             ; value
    call    VMStack_push

    add     esp, 4

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_xor endp

; function starts at virtual address 0x004012B0
VMOpcodeHandler_not proc
    push    ebp
    mov     ebp, esp

    sub     esp, 8

    call    VMStack_pop
    mov     [ebp+word], ax

    movzx   eax, [ebp+word]
    not     eax
    and     eax, 0FFFFh
    mov     [ebp+result], ax

    movzx   ecx, [ebp+result]
    push    ecx             ; value
    call    VMStack_push

    add     esp, 4

    mov     dx, vm_instruction_pointer
    add     dx, 1
    mov     vm_instruction_pointer, dx

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_not endp

; function starts at virtual address 0x00401300
VMOpcodeHandler_eq proc 
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    call    VMStack_pop
    mov     [ebp+first_word], ax

    call    VMStack_pop
    mov     [ebp+second_word], ax

    movzx   eax, [ebp+first_word]
    movzx   ecx, [ebp+second_word]
    cmp     eax, ecx
    jnz     short loc_40132F

    mov     edx, 1
    mov     [ebp+result], dx
    jmp     short loc_401335

loc_40132F:
    xor     eax, eax
    mov     [ebp+result], ax

loc_401335:
    movzx   ecx, [ebp+result]
    push    ecx             ; value
    call    VMStack_push

    add     esp, 4

    mov     dx, vm_instruction_pointer
    add     dx, 1
    mov     vm_instruction_pointer, dx

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_eq endp

; function starts at virtual address 0x004013C0
VMOpcodeHandler_jmp proc
    push    ebp
    mov     ebp, esp

    call    VMStack_pop
    mov     vm_instruction_pointer, ax

    pop     ebp
    retn
VMOpcodeHandler_jmp endp

; function starts at virtual address 0x004013D0
VMOpcodeHandler_pushRegister proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    movzx   ecx, vm_instruction_pointer
    mov     dx, ds:vm_instructions[ecx*2]
    mov     [ebp+opcode_parameter], dx

    movzx   eax, [ebp+opcode_parameter]
    mov     [ebp+opcode_parameter_alias], eax
    cmp     [ebp+opcode_parameter_alias], 3
    ja      short __opcode_handler_end

    mov     ecx, [ebp+opcode_parameter_alias]
    jmp     ds:off_401464[ecx*4]

__read_accumulator:
    mov     dx, vm_register_A
    mov     [ebp+word], dx
    jmp     short __opcode_handler_end

__read_base_stack_pointer:
    mov     ax, vm_register_B
    mov     [ebp+word], ax
    jmp     short __opcode_handler_end

__read_stack_pointer:
    mov     cx, vm_stack_pointer
    mov     [ebp+word], cx
    jmp     short __opcode_handler_end

__read_instruction_pointer:
    mov     dx, vm_instruction_pointer
    mov     [ebp+word], dx

__opcode_handler_end:
    movzx   eax, [ebp+word]
    push    eax
    call    VMStack_push

    add     esp, 4

    mov     cx, vm_instruction_pointer
    add     cx, 1
    mov     vm_instruction_pointer, cx

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_pushRegister endp

; function starts at virtual address 0x00401480
VMOpcodeHandler_mov proc
    push    ebp
    mov     ebp, esp

    sub     esp, 0Ch

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    movzx   ecx, vm_instruction_pointer
    mov     dx, ds:vm_instructions[ecx*2]
    mov     [ebp+vm_register_id], dx

    call    VMStack_pop
    mov     [ebp+value], ax

    movzx   eax, [ebp+vm_register_id]
    mov     [ebp+vm_register_id_copy], eax
    cmp     [ebp+vm_register_id_copy], 3
    ja      short __increment_instruction_pointer

    mov     ecx, [ebp+vm_register_id_copy]
    jmp     ds:off_401510[ecx*4]

__set_accumulator:
    mov     dx, [ebp+value]
    mov     vm_register_A, dx
    jmp     short __increment_instruction_pointer

__set_base_stack_pointer:
    mov     ax, [ebp+value]
    mov     vm_register_B, ax
    jmp     short __increment_instruction_pointer

__set_stack_pointer:
    mov     cx, [ebp+value]
    mov     vm_stack_pointer, cx
    jmp     short __increment_instruction_pointer

__set_instruction_pointer:
    mov     dx, [ebp+value]
    mov     vm_instruction_pointer, dx

__increment_instruction_pointer:
    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    mov     esp, ebp
    pop     ebp
    retn
VMOpcodeHandler_mov endp

; function starts at virtual address 0x00401520
VMOpcodeHandler_nop proc
    push    ebp
    mov     ebp, esp

    mov     ax, vm_instruction_pointer
    add     ax, 1
    mov     vm_instruction_pointer, ax

    pop     ebp
    retn
VMOpcodeHandler_nop endp
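
To make the last two handlers easier to follow, here is roughly what VMOpcodeHandler_mov and VMOpcodeHandler_nop do when written as C++. This is only a sketch: the VirtualMachine struct, its field names and the std::vector used in place of VMStack_pop are my own stand-ins, not symbols recovered from the binary.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical model of the VM state. The field names mirror the labels in
// the listings; the std::vector stack is a stand-in for VMStack_push/pop.
struct VirtualMachine
{
    std::uint16_t register_a = 0;
    std::uint16_t register_b = 0;
    std::uint16_t stack_pointer = 0;
    std::uint16_t instruction_pointer = 0;
    std::vector<std::uint16_t> instructions; // 16-bit words
    std::vector<std::uint16_t> stack;
};

// Mirrors VMOpcodeHandler_mov: pop a value into the selected register.
void OpcodeMov(VirtualMachine &vm)
{
    // step over the opcode and read the register id stored after it
    vm.instruction_pointer += 1;
    const std::uint16_t register_id = vm.instructions[vm.instruction_pointer];

    const std::uint16_t value = vm.stack.back();
    vm.stack.pop_back();

    switch (register_id)
    {
        case 0: vm.register_a = value; break;
        case 1: vm.register_b = value; break;
        case 2: vm.stack_pointer = value; break;
        case 3: vm.instruction_pointer = value; break;
        default: break; // ids above 3 leave every register untouched
    }

    // the final increment also runs after a write to the instruction pointer
    vm.instruction_pointer += 1;
}

// Mirrors VMOpcodeHandler_nop: only the instruction pointer advances.
void OpcodeNop(VirtualMachine &vm)
{
    vm.instruction_pointer += 1;
}
```

Note that a write to register 3 still falls through to the final increment, exactly like the __set_instruction_pointer label falls through to __increment_instruction_pointer in the listing.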

You should now have a good understanding of how each opcode works, but you will probably agree that debugging this is anything but comfortable. For this reason, I chose to dump the virtual machine code (see virtual address 0x0040A140) and write a simple disassembler.

#include <cstdint>
#include <iostream>
#include <fstream>
#include <vector>
#include <stdexcept>
#include <string>
#include <iomanip>

const std::vector<std::string> mnemonics =
{
    "push", "pop", "add", "sub",
    "trm1", "trm2", "xor", "not",
    "eq", "sel", "jmp", "push",
    "mov", "nop"
};

int main(int argc, char *argv[]);
const char *GetRegisterName(std::uint16_t register_id);

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        std::cout << "Usage:\n";
        std::cout << "smokestack_disasm dump" << std::endl;

        return 1;
    }

    std::vector<std::uint8_t> buffer;

    try
    {
        std::fstream input_file;
        input_file.open(argv[1], std::ios_base::in |
            std::ios_base::binary);

        if (!input_file)
            throw std::runtime_error("Failed to open the input file");

        input_file.seekg(0, std::ios_base::end);
        if (!input_file)
            throw std::runtime_error("Seek failed");

        std::streamsize input_file_size = input_file.tellg();
        if (!input_file)
            throw std::runtime_error("Failed to get the file size");

        input_file.seekg(0);
        if (!input_file)
            throw std::runtime_error("Seek failed");

        // resize() throws std::bad_alloc on failure, which is handled by
        // the catch block below
        buffer.resize(static_cast<std::size_t>(input_file_size));

        input_file.read(reinterpret_cast<char *>(buffer.data()),
            input_file_size);

        if (!input_file)
            throw std::runtime_error("Failed to read the file");

        input_file.close();
    }

    catch (const std::exception &exception)
    {
        std::cout << exception.what() << std::endl;
        return 1;
    }

    const std::uint8_t *ptr = buffer.data();

    while (ptr < buffer.data() + buffer.size())
    {
        std::uint32_t instruction_pointer = (ptr - buffer.data()) / 2;

        std::cout << std::hex << std::setfill('0') << std::setw(4)
            << instruction_pointer;

        std::cout << "\t\t";

        std::cout << std::hex << std::setfill('0') << std::setw(2) <<
            static_cast<int>(*ptr);

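        // guard against bytes that do not map to a known opcode
        if (*ptr < mnemonics.size())
            std::cout << "\t" << mnemonics[*ptr] << " ";
        else
            std::cout << "\t??? ";

        // opcodes that take an immediate parameter need to
        // increment the instruction pointer twice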
        switch (*ptr)
        {
            // push <immediate>
            case 0:
            {
                ptr += 2;

                std::uint16_t value = 
                    *reinterpret_cast<const std::uint16_t *>(ptr);

                std::cout << "0x" << std::hex << std::setfill('0') <<
                    std::setw(4) << value;

                // also show ascii encoding
                if (value >= 0x20 && value <= 0x7E)
                {
                    std::cout << " ; '" << static_cast<char>(value)
                        << "'";
                }

                break;
            }

            // push <register_id>
            case 11:
            {
                ptr += 2;

                std::uint16_t value =
                    *reinterpret_cast<const std::uint16_t *>(ptr);

                std::cout << GetRegisterName(value);

                break;
            }

            // mov <register_id>, stack[sp]
            case 12:
            {
                ptr += 2;

                std::uint16_t value =
                    *reinterpret_cast<const std::uint16_t *>(ptr);

                std::cout << GetRegisterName(value);
                std::cout << ", ST(0)";

                break;
            }

            default:
                break;
        }

        std::cout << std::endl;

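        // print the address again as a separator after a jump instruction;
        // the opcode is read back from the buffer because ptr may already
        // point to an immediate operand
        if (buffer[instruction_pointer * 2] == 10)
        {
            std::cout << std::hex << std::setfill('0') << std::setw(4)
                << instruction_pointer << std::endl;
        }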

        ptr += 2;
    }

    return 0;
}

const char *GetRegisterName(std::uint16_t register_id)
{
    switch (register_id)
    {
        case 0:
            return "ax";
        
        case 1:
            return "bp";
        
        case 2:
            return "sp";
        
        case 3:
            return "ip";
        
        default:
            throw std::runtime_error("Invalid register id");
    }
}

The following is the full output of the disassembler, including my own comments.

0000    00  push 0x0021
0002    02  add           ; \ adds 0x21 to the last character in the
0003    00  push 0x0091   ; / program argument
0005    08  eq 
0006    00  push 0x0016
0008    00  push 0x000c   ; \ this is what we should take. last char
000a    09  sel           ; / is: 0x91 - 0x21 = 'p'
000b    0a  jmp 
000b
000c    0b  push ax       ; \
000e    00  push 0x000c   ; | ax is set to 0 during startup
0010    02  add           ; | ax = ST(0) = 0 + 0x0c
0011    0c  mov ax, ST(0) ; /
0013    00  push 0x001d   ; \
0015    0a  jmp           ; / we're going to jump to address 0x001d
0015
0016    0b  push ax
0018    00  push 0x0063
001a    02  add 
001b    0c  mov ax, ST(0)
001d    00  push 0x0018   ; \
001f    06  xor           ; | next character: 0x54 ^ 0x18 = 'L'
0020    00  push 0x0054   ; |
0022    08  eq            ; /
0023    00  push 0x0033   ; \
0025    00  push 0x0029   ; | we're going to jump to 0x0029
0027    09  sel           ; |
0028    0a  jmp           ; /
0028
0029    0b  push ax       ; \
002b    00  push 0x002c   ; | ax is still 0x0C; result is 0x38
002d    02  add           ; | and is saved to ax again
002e    0c  mov ax, ST(0) ; /
0030    00  push 0x003d   ; \
0032    0a  jmp           ; / we're going to jump to 0x003d
0032
0033    00  push 0x000e
0035    01  pop 
0036    0b  push ax
0038    00  push 0x0059
003a    02  add 
003b    0c  mov ax, ST(0)
003d    0b  push ax       ; } 0x38 is pushed again on stack
003f    00  push 0x0000   ; \
0041    0c  mov bx, ST(0) ; / bx = 0x0000
0043    00  push 0x0009   ; \
0045    0c  mov ax, ST(0) ; / ax = 0x0009
0045
0047    0b  push bx       ; \
0049    00  push 0x0002   ; |
004b    02  add           ; | bx += 0x0002
004c    0c  mov bx, ST(0) ; /
004e    0b  push ax       ; \
0050    00  push 0x0001   ; |
0052    03  sub           ; | ax -= 0x0001
0053    0c  mov ax, ST(0) ; /
0055    0b  push ax       ; \
0057    00  push 0x0000   ; | condition: ax == 0x0000
0059    08  eq            ; /
005a    00  push 0x0047   ; \
005c    00  push 0x0060   ; | false: resume the loop (0x0047).
005e    09  sel           ; | true: leave the loop (0x0060). bx is set
005f    0a  jmp           ; /        to (2 * 9)
005f
0060    0c  mov ax, ST(0) ; } ax = 0x0038 again; char = 0x005d + 0x0012 = 0x006f ('o')
0062    0b  push bx       ; } push 0x0012
0064    03  sub           ; } 0x006f - 0x0012 = 0x005d
0065    00  push 0x005d   ; \
0067    08  eq            ; |
0068    00  push 0x007c   ; | the condition must be true and we need
006a    00  push 0x006e   ; | to jump to 0x006e
006c    09  sel           ; |
006d    0a  jmp           ; /
006d
006e    0b  push ax       ; } push 0x0038
0070    00  push 0x0007   ; \
0072    03  sub           ; | ax = 0x0038 - 0x0007 = 0x0031
0073    0c  mov ax, ST(0) ; /
0075    00  push 0x005b   ; \
0077    0c  mov bx, ST(0) ; / bx = 0x005b
0079    00  push 0x0087   ; \
007b    0a  jmp           ; / jmp 0x0087
007b
007c    00  push 0x0036   ; '6'
007e    0c  mov bx, ST(0)
0080    0b  push ax
0082    0b  push bx
0084    02  add 
0085    0c  mov bx, ST(0)
0087    0b  push bx       ; \ (bx = 0x005b)
0089    00  push 0x0058   ; | 0x0058 + 0x005b = 0x00b3
008b    02  add           ; /
008c    06  xor           ; } 0x00b3 ^ 0x004a = 0x00f9 -> char 'J'
008d    00  push 0x00f9   ; \
008f    08  eq            ; |
0090    00  push 0x00a0   ; | jmp (xor_result == 0x00f9 ? 0x96 : 0xa0)
0092    00  push 0x0096   ; |
0094    09  sel           ; |
0095    0a  jmp           ; /
0095
0096    0b  push ax       ; \ (ax = 0x0031)
0098    00  push 0x004d   ; |
009a    06  xor           ; | ax = 0x0031 ^ 0x004d = 0x007c
009b    0c  mov ax, ST(0) ; /
009d    00  push 0x00ae   ; \
009f    0a  jmp           ; / jmp 0x00ae
009f
00a0    00  push 0x0323
00a2    00  push 0x012b
00a4    03  sub 
00a5    0c  mov bx, ST(0)
00a7    0b  push ax
00a9    0b  push bx
00ab    02  add 
00ac    0c  mov bx, ST(0)
00ae    0c  mov bx, ST(0) ; } bx = character 'b'
00b0    0b  push bx       ; } push 0x0062
00b2    0b  push bx       ; \
00b4    00  push 0x0001   ; | bx -= 0x0001
00b6    03  sub           ; |
00b7    0c  mov bx, ST(0) ; /
00b9    00  push 0x0003   ; \
00bb    02  add           ; / ST(0) += 0x0003
00bc    0b  push bx       ; \
00be    00  push 0x0000   ; |
00c0    08  eq            ; |
00c1    00  push 0x00b2   ; | loop while bx != 0x0000
00c3    00  push 0x00c7   ; |
00c5    09  sel           ; |
00c6    0a  jmp           ; /
00c6
00c7    07  not           ; \ ST(0) = 0x62 + (0x62 * 3) = 0x188
00c7                      ; / not(0x188) = 0xfe77
00c8    00  push 0xfe77   ; \
00ca    08  eq            ; |
00cb    00  push 0x00d8   ; | condition must be true
00cd    00  push 0x00d1   ; | to jump to 0x00d1
00cf    09  sel           ; |
00d0    0a  jmp           ; /
00d0
00d1    0b  push ax       ; \
00d3    00  push 0x0058   ; |
00d5    02  add           ; | ax = 0x007c + 0x0058 = 0x00d4
00d6    0c  mov ax, ST(0) ; /
00d8    00  push 0x0003   ; \
00da    04  trm1          ; | x: character at position 4
00db    00  push 0x008c   ; |
00dd    02  add           ; | condition = ((x >> 0x0003) | 
00de    00  push 0x6094   ; |     (x << (0x0010 - 0x0003)) & 0xffff
00e0    08  eq            ; |     + 0x8C) == 0x6094
00e1    00  push 0x00ee   ; |
00e3    00  push 0x00e7   ; | we need to jump to 0x00e7
00e5    09  sel           ; |
00e6    0a  jmp           ; /
00e6
00e7    0b  push ax
00e9    00  push 0x00e7
00eb    02  add 
00ec    0c  mov ax, ST(0)
00ee    0b  push bx       ; \
00f0    02  add           ; | bx is 0x0000
00f1    00  push 0x000c   ; |
00f3    06  xor           ; | the next word we're going to use is
00f4    00  push 0x0074   ; | the character at position 3
00f6    08  eq            ; |
00f7    00  push 0x0107   ; | if ((0x0000 + 'x') ^ 0x000c == 0x0074)
00f9    00  push 0x00fd   ; |     jmp 0x00fd <- take this jump
00fb    09  sel           ; | else
00fc    0a  jmp           ; /     jmp 0x0107
00fc
00fd    0b  push ax       ; \
00ff    00  push 0x0009   ; | ax = 0x00d4 - 0x0009 = 0x00cb
0101    03  sub           ; |
0102    0c  mov ax, ST(0) ; /
0104    00  push 0x011d   ; \
0106    0a  jmp           ; / jmp 0x011d
0106
0107    00  push 0x000a
0109    0c  mov bx, ST(0)
010b    0b  push bx
010d    00  push 0x0001
010f    03  sub 
0110    0c  mov bx, ST(0)
0112    0b  push bx
0114    00  push 0x0000
0116    08  eq 
0117    00  push 0x010b
0119    00  push 0x011d
011b    09  sel 
011c    0a  jmp 
011c
011d    00  push 0x0006   ; \ trm2(0x0006, character at position 2)
011f    05  trm2          ; |
0120    00  push 0x1dc0   ; | condition = (shl('w', 0x0006) |
0122    08  eq            ; |     sar('w', 0x10 - 0x0006)) &
0122                      ; |     0xffff == 0x1dc0;
0122                      ; |
0123    00  push 0x0133   ; | if (condition)
0125    00  push 0x0129   ; |     jmp 0x0129 <- take this jump
0127    09  sel           ; | else
0128    0a  jmp           ; /     jmp 0x0133
0128
0129    0b  push ax       ; \
012b    00  push 0x0071   ; |
012d    02  add           ; | ax = 0x00cb + 0x0071 = 0x013c
012e    0c  mov ax, ST(0) ; /
0130    00  push 0x013d   ; \
0132    0a  jmp           ; / jmp 0x013d
0132
0133    0b  push ax
0135    00  push 0x0077   ; 'w'
0137    02  add 
0138    0c  mov ax, ST(0)
013a    00  push 0x013d
013c    0a  jmp 
013c
013d    00  push 0x0016   ; \
013f    02  add           ; | this is the character at position 1
0140    00  push 0x000e   ; |
0142    03  sub           ; | condition = (0x0016 + 'Y' -
0143    00  push 0x0061   ; |      0x000E) == 0x0061;
0145    08  eq            ; |
0146    00  push 0x0153   ; | if (condition)
0148    00  push 0x014c   ; |     jmp 0x014c
014a    09  sel           ; | else
014b    0a  jmp           ; /     jmp 0x0153
014b
014c    0b  push ax       ; \ 
014e    00  push 0x002c   ; |
0150    03  sub           ; | ax = 0x013c - 0x002c = 0x0110
0151    0c  mov ax, ST(0) ; /
0153    0c  mov bx, ST(0) ; } bx = 'k' -> character at position 0
0155    0b  push bx
0157    00  push 0x212c
0159    0b  push bx       ; \
015b    00  push 0x0001   ; | this loop subtracts 0x07 from 0x212c
015d    03  sub           ; | for 'k' (0x6b) times
015e    0c  mov bx, ST(0) ; |
0160    00  push 0x0007   ; | bx--;
0162    03  sub           ; /
0163    0b  push bx       ; \
0165    00  push 0x0000   ; | if (bx == 0)
0167    08  eq            ; |     jmp 0x016e
0168    00  push 0x0159   ; | else
016a    00  push 0x016e   ; |     jmp 0x0159
016c    09  sel           ; |
016d    0a  jmp           ; /
016d
016e    00  push 0x01ca   ; \
0170    06  xor           ; | if ((loop_result ^ 0x01ca) == 0x1ff5)
0171    00  push 0x1ff5   ; |     jmp 0x017a
0173    08  eq            ; | else
0174    00  push 0x0181   ; |     jmp 0x0181
0176    00  push 0x017a   ; |
0178    09  sel           ; |
0179    0a  jmp           ; /
0179
017a    0b  push ax       ; \
017c    00  push 0x0012   ; | ax = 0x0110 + 0x0012 = 0x0122
017e    02  add           ; |
017f    0c  mov ax, ST(0) ; /
0181    0d  nop

The correct string is then the one that allows the virtual machine to execute the program to the end: kYwxCbJoLp. There’s no need to analyze the rest of the executable; pass those characters back to the program and it will print our flag: A_p0ppu$H&_a_Jmp@flare-on.com.
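
The arithmetic in the comments can be double-checked by undoing each comparison in code. The following sketch recomputes every character from the constants in the listing; the function names are hypothetical, and the rotation directions are taken from the trm1/trm2 comments above rather than from the handlers themselves.

```cpp
#include <cstdint>
#include <string>

// 16-bit rotations; count must be between 1 and 15
std::uint16_t RotateLeft16(std::uint16_t value, unsigned count)
{
    return static_cast<std::uint16_t>((value << count) | (value >> (16 - count)));
}

std::uint16_t RotateRight16(std::uint16_t value, unsigned count)
{
    return static_cast<std::uint16_t>((value >> count) | (value << (16 - count)));
}

// Undoes each check of the VM program, from the last character to the first.
std::string RecoverKey()
{
    std::string key(10, '?');

    key[9] = static_cast<char>(0x91 - 0x21);                     // add, eq
    key[8] = static_cast<char>(0x54 ^ 0x18);                     // xor, eq
    key[7] = static_cast<char>(0x5d + 0x12);                     // sub, eq
    key[6] = static_cast<char>((0x58 + 0x5b) ^ 0xf9);            // xor, eq
    key[5] = static_cast<char>((~0xfe77 & 0xffff) / 4);          // loop, not, eq
    key[4] = static_cast<char>(RotateLeft16(0x6094 - 0x8c, 3));  // undoes trm1
    key[3] = static_cast<char>(0x74 ^ 0x0c);                     // xor, eq
    key[2] = static_cast<char>(RotateRight16(0x1dc0, 6));        // undoes trm2
    key[1] = static_cast<char>(0x61 + 0x0e - 0x16);              // add, sub, eq
    key[0] = static_cast<char>((0x212c - (0x1ff5 ^ 0x01ca)) / 7); // loop, xor, eq

    return key;
}
```

RecoverKey() returns kYwxCbJoLp, the same string obtained by following the disassembly by hand.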

Writeup for the Transformer challenge from VolgaCTF 2016 Quals

This is another challenge from the VolgaCTF 2016 Quals, involving an x64 ELF executable that encodes files. Our objective is to recover the cleartext data from the encrypted file.

Here’s the description for this challenge:

This binary does something with the data. The transformation must be reversible, but the details are unknown. It shouldn’t be too difficult to reverse that transformation and obtain the flag, should it?

You can download the executable from the CTF Writeup repository on GitHub.

What I’m going to do in this writeup is simple:

  1. Reverse the algorithm. I will use IDA Pro, but you can also use radare2.
  2. Write a compatible encoder in C++.
  3. Invert the transformation and implement decoding functionality.

For starters, you may have noticed that this executable does not like to be run inside a debugger; you can easily pinpoint where the anti-debugger check is performed by setting a breakpoint on the exit() function.

00401070 Protection proc near 
00401070     sub     rsp, 8
00401070
00401070     ; terminate the process if the LD_PRELOAD variable
00401070     ; is set
00401070
00401074     mov     edi, offset name ; "LD_PRELOAD"
00401079     call    _getenv
00401079
0040107e     test    rax, rax
00401081     jnz     short _terminate
00401081
00401081     ; terminate if the PTRACE_TRACEME ptrace request
00401081     ; returns -1 (meaning that a debugger is already
00401081     ; attached)
00401081
00401083     xor     ecx, ecx
00401085     xor     edx, edx
00401087     xor     esi, esi
00401089     xor     edi, edi        ; request (PTRACE_TRACEME)
0040108b     xor     eax, eax
0040108d     call    _ptrace
0040108d
00401092     test    rax, rax
00401095     js      short _terminate
00401095
00401097     add     rsp, 8
0040109b     retn
0040109c
0040109c _terminate:
0040109c     xor     edi, edi        ; status (0)
0040109e     call    _exit
0040109e Protection endp
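
Expressed as C++, the check looks roughly like the sketch below. The split into ShouldTerminate() and Protection() is mine (it keeps the decision logic separate from the actual ptrace() call), and both names are hypothetical; the code assumes Linux with glibc.

```cpp
#include <cstdlib>
#include <sys/ptrace.h>

// Decision logic only: terminate when LD_PRELOAD is set or when the
// PTRACE_TRACEME request reported a negative result.
bool ShouldTerminate(const char *ld_preload, long ptrace_result)
{
    return ld_preload != nullptr || ptrace_result < 0;
}

// Sketch of the Protection function from the listing above.
void Protection()
{
    const char *ld_preload = std::getenv("LD_PRELOAD");

    // like the original, skip the ptrace() call when LD_PRELOAD is set
    const long ptrace_result = (ld_preload != nullptr)
        ? 0
        : ptrace(PTRACE_TRACEME, 0, nullptr, nullptr);

    if (ShouldTerminate(ld_preload, ptrace_result))
        std::exit(0); // the binary also exits with status 0
}
```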

We could have chosen to hook the ptrace() function with the LD_PRELOAD environment variable to disable this check, but the code at address 401074 rules that out. What we really want is to remove the debugger protection, and to be honest it’s easier to just patch the program. Patching is not required if you don’t plan on stepping through the code, or if you don’t mind skipping the check manually with a breakpoint each time you start the program.

Now that we have gotten rid of the protection, let’s take a look at the caller.

004019C0 init proc near
004019C0     push    r15
004019C2     mov     r15d, edi
004019C5     push    r14
004019C7     mov     r14, rsi
004019CA     push    r13
004019CC     mov     r13, rdx
004019CF     push    r12
004019CF
004019CF     ; array contents:
004019CF     ; Initialization01 (004013D0)
004019CF     ; Protection (00401070)
004019CF     ; CppIoStreamConstructor (004012F0)
004019CF     
004019D1     lea     r12, InitializationFunctionsList1
004019D8     push    rbp
004019D8
004019D8     ; array contents: Initialization02 (004013B0)
004019D8
004019D9     lea     rbp, InitializationFunctionsList2
004019E0     push    rbx
004019E1     sub     rbp, r12
004019E4     xor     ebx, ebx
004019E6     sar     rbp, 3
004019EA     sub     rsp, 8
004019EE     call    _init_proc
004019F3     test    rbp, rbp
004019F6     jz      short loc_401A16
004019F6
004019F6     ; it's pretty obvious that someone has messed up
004019F6     ; this opcode. replace it with a function call
004019F6     ; to address 004013B0
004019F6
004019F8     nop     dword ptr [rax+rax+00000000h]
00401A00
00401A00 loc_401A00:
00401A00
00401A00     mov     rdx, r13
00401A03     mov     rsi, r14
00401A06     mov     edi, r15d
00401A06
00401A06     ; functions called:
00401A06     ; 1. Initialization01: program initialization
00401A06     ; 2. Protection: LD_PRELOAD/debugger check
00401A06     ; 3. CppIoStreamConstructor: constructor and destructor
00401A06     ;    (using atexit) for the std::ios_base c++ object
00401A06
00401A09     call    qword ptr [r12+rbx*8]
00401A0D     add     rbx, 1
00401A11     cmp     rbx, rbp
00401A14     jnz     short loc_401A00
00401A16
00401A16 loc_401A16:
00401A16
00401A16     add     rsp, 8
00401A1A     pop     rbx
00401A1B     pop     rbp
00401A1C     pop     r12
00401A1E     pop     r13
00401A20     pop     r14
00401A22     pop     r15
00401A24     retn
00401A24 init endp

If you played the previous level (named Broken) you will remember that the init() function had been patched to remove a function call that was required for the algorithm to work as intended; this challenge is no exception and you will have to fix the instruction located at address 004019F8. Again, the function we need to call can be found inside the second array.
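
As a reminder of what the loop at loc_401A00 does, here is a minimal C++ sketch. The names are hypothetical, and treating the three forwarded values (edi, rsi, rdx) as argc, argv and envp is an assumption based on how a standard startup routine behaves.

```cpp
#include <cstddef>

// Each entry receives the three values saved in r15d, r14 and r13.
using InitFunction = void (*)(int, char **, char **);

void RunInitFunctions(InitFunction *list_begin, InitFunction *list_end,
                      int argc, char **argv, char **envp)
{
    // sub rbp, r12 / sar rbp, 3: the byte distance between the two arrays
    // divided by the pointer size gives the number of entries
    const std::ptrdiff_t count = list_end - list_begin;

    // call qword ptr [r12+rbx*8], with rbx going from 0 to count - 1
    for (std::ptrdiff_t index = 0; index < count; index++)
        list_begin[index](argc, argv, envp);
}
```

Since the second array only marks where the first one ends, the function stored inside it is never reached by this loop, which is why the missing call has to be patched back in.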

There’s not much else to say about the rest of the functions referenced here; they’re just used for initialization and we don’t really care to analyze them, provided that we take a memory snapshot once it’s done. Let’s see how the rest of the program works; it’s pretty long, so I will use annotated pseudo-code to illustrate its internals.

// 004010B0
int main(int argc, char *argv[])
{
    // 004010d0
    if (argc != 3)
        return 0;

    const char *input_file_path = argv[1];
    const char *output_file_path = argv[2];

    // 0040111a
    std::fstream input_file(input_file_path);

    // 00401135
    std::fstream output_file(output_file_path);

    // 004011c5
    input_file.seekg(0, std::ios_base::end);

    // 004011d2
    std::streamsize file_size = input_file.tellg();

    // 004011ee
    input_file.seekg(0, std::ios_base::beg);

    // 004011fa
    std::uint8_t *buffer = new std::uint8_t[file_size];

    // 0040121D
    input_file.read(buffer, file_size);

    //
    // we don't care much about the following function, as it doesn't touch
    // our buffer in any way. Take a memory snapshot after this call
    //

    // 0040122C
    call sub_401932;

    //
    // this is what we need to analyze
    //

    // 0040123A
    call sub_401836;

    // 00401248
    output_file.write(buffer, file_size);

    return 0;
}

// 00401836
void sub_401836(uint8_t *input_buffer, uint8_t *output_buffer, uint32_t buffer_size)
{
    // rdi = input_buffer
    // rsi = output_buffer
    // rdx = buffer_size

    // an xmmword is 16 bytes long; this function encodes one block
    // at a time by calling sub_40179e

    // 00401860
    rdx = shr rdx, 0x04;

    // 00401870
    if (rdx == 0)
        return;

    do
    {
        // 00401877
        xmm0 = *rdi;

        // 0040187b
        call sub_40179e;

        // 00401880
        *rsi = xmm0;

        // 00401884
        rdi += 0x10;

        // 00401888
        rsi += 0x10;

        // 0040188c
        rdx--;
    } while (rdx != 0); // 00401864

    // 0040189f
    return;
}

// 0040179e
void sub_40179e()
{
    // this is the real encoding function; it's nothing too fancy
    // and you can easily invert each transformation by performing
    // the same operations in reverse order

    // 004017ac
    xmm7 = xmm0;

    // 004017c8
    xmm0 = xmmword_602128;

    // 004017d1
    pxor xmm7, xmm0;

    // 004017d6
    r15 = 0x10;

    do
    {
        // 004017e1
        call sub_4015d0;

        // 004017e6
        call sub_40167d;

        // 004017eb
        call sub_40169f;

        // 004017f0
        xmm0 = xmmword_602128[r15];

        // 004017f9
        pxor xmm7, xmm0;

        // 004017fe
        r15 += 0x10;
    } while (r15 < 0xA0); // 00401802

    // 0040180b
    call sub_4015d0;

    // 00401810
    call sub_40167d;

    // 00401815
    xmm0 = xmmword_602128[r15];

    // 0040181e
    pxor xmm7, xmm0;

    // 00401823
    xmm0 = xmm7;

    // 00401835
    return;
}

// 004015d0
void sub_4015d0()
{
    // 004015d9
    [rsp] = xmm7;

    // 004015de
    for (offset = 0; offset <= 0x0c; offset += 0x04)
    {
        ebx = [rsp + offset];
        call sub_40161f;
        [rsp + offset] = eax;
    }

    // 00401615
    xmm7 = [rsp];

    // 0040161e
    return;
}

// 0040167d
void sub_40167d()
{
    // 00401694
    xmm0 = xmmword_401683;

    // 00401698
    pshufb xmm7, xmm0;

    // 0040169e
    return;
}

// 0040169f
void __usercall sub_40169f()
{
    // 004016a8
    [rsp] = xmm7;

    // 004016ad
    for (offset = 0; offset <= 0x0c; offset += 0x04)
    {
        edi = [rsp + offset];
        call sub_4016e9;
        [rsp + offset] = eax;
    }

    // 004016df
    xmm7 = [rsp];

    // 004016e8
    return;
}

// 0040161f
void __usercall sub_40161f()
{
    // input: ebx
    // output: eax

    movzx   ecx, bl
    mov     al, byte ptr qword_602268[ecx]
    movzx   ecx, bh
    mov     ah, byte ptr qword_602268[ecx]
    shl     eax, 10h
    shr     ebx, 10h
    movzx   ecx, bl
    mov     al, byte ptr qword_602268[ecx]
    movzx   ecx, bh
    mov     ah, byte ptr qword_602268[ecx]
    rol     eax, 10h
    add     rsp, 8
    nop
    retn
}
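
The byte shuffle performed by sub_40167d is the easiest step to undo: if the mask is a permutation, shuffling again with the inverse mask restores the block. Here is a minimal sketch of the idea; the function names are mine, and the mask has to be read from xmmword_401683 (any values used to try this out are placeholders).

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Block = std::array<std::uint8_t, 16>;

// pshufb with a mask whose bytes are all below 0x10: out[i] = in[mask[i]]
Block Shuffle(const Block &input, const Block &mask)
{
    Block output{};

    for (std::size_t i = 0; i < output.size(); i++)
        output[i] = input[mask[i] & 0x0f];

    return output;
}

// Builds the mask that undoes Shuffle(); only valid if the mask is a
// permutation of the values 0..15
Block InvertMask(const Block &mask)
{
    Block inverse{};

    for (std::size_t i = 0; i < mask.size(); i++)
        inverse[mask[i] & 0x0f] = static_cast<std::uint8_t>(i);

    return inverse;
}
```

A decoder can then call Shuffle(block, InvertMask(mask)) wherever the encoder calls sub_40167d.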

There are two more functions we need to analyze: sub_40161f and sub_4016e9; I’ve added some spacing to make the listings easier to follow and understand.

0040161F sub_40161f proc near
0040161F
0040161F     ; input: ebx
0040161F     ; output: eax
0040161F
0040161F     push    rcx
00401620     jmp     short loc_40162B
00401620
00401622     db 0x00
00401623     db 0x00
00401624     db 0x00
00401625     db 0xE9
00401626     db 0xA6
00401627     db 0x01
00401628     db 0x40
00401629     db 0xE9
0040162A     db 0x04
0040162B
0040162B loc_40162B:
0040162B
0040162B     xor     rax, rax
0040162E     jmp     short loc_401646

........

00401646 loc_401646:
00401646
00401646     movzx   ecx, bl
00401649     mov     al, byte ptr qword_602268[ecx]
00401649
00401650     movzx   ecx, bh
00401653     mov     ah, byte ptr qword_602268[ecx]
0040165A
0040165A     shl     eax, 10h
0040165D     shr     ebx, 10h
00401660
00401660     movzx   ecx, bl
00401663     mov     al, byte ptr qword_602268[ecx]
0040166A
0040166A     movzx   ecx, bh
0040166D     mov     ah, byte ptr qword_602268[ecx]
00401674
00401674     rol     eax, 10h
00401677
00401677     add     rsp, 8
0040167B     nop
0040167C     retn
0040167C sub_40161f endp
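
Read as C++, the listing above sends each of the four bytes of ebx through the same 256-entry table and keeps them in place. A minimal sketch follows, assuming that the table at qword_602268 is a permutation of the values 0..255 (otherwise it could not be reversed); the function names are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Table = std::array<std::uint8_t, 256>;

// Equivalent of sub_40161f: every byte of the word goes through the table
std::uint32_t SubstituteWord(std::uint32_t word, const Table &table)
{
    std::uint32_t result = 0;

    for (int shift = 0; shift < 32; shift += 8)
    {
        const std::uint8_t byte = static_cast<std::uint8_t>(word >> shift);
        result |= static_cast<std::uint32_t>(table[byte]) << shift;
    }

    return result;
}

// Inverse lookup table; only meaningful if the table is a permutation
Table BuildInverseTable(const Table &table)
{
    Table inverse{};

    for (std::size_t i = 0; i < table.size(); i++)
        inverse[table[i]] = static_cast<std::uint8_t>(i);

    return inverse;
}
```

Calling SubstituteWord() with the table returned by BuildInverseTable() gives back the original word.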

This function is pretty easy to invert: every output byte is just the table entry selected by the corresponding input byte, so running the same steps with the inverse of the lookup table gives back the input value. Let’s take a look at the second procedure:

004016E9 sub_4016e9 proc near
004016E9
004016E9     ; input: edi
004016E9     ; output: eax
004016E9
004016E9     xor     eax, eax
004016EB     jz      short loc_4016EE
004016EB
004016ED     db 0xE8
004016EE
004016EE loc_4016EE:
004016EE     movzx   r8, dil
004016F2     shr     edi, 8
004016F2
004016F5     movzx   r9, dil
004016F9     shr     edi, 8
004016F9
004016FC     movzx   r10, dil
00401700     shr     edi, 8
00401700
00401703     movzx   r11, dil
00401707     xor     eax, eax
00401707
00401709     mov     r12b, byte_602394[r11]
00401710     mov     dil, byte_602494[r8]
00401717     xor     edi, r9d
0040171A     xor     edi, r10d
0040171D     xor     edi, r12d
00401720     and     edi, 0FFh
00401726     or      eax, edi
00401728     shl     eax, 8
00401728
0040172B     mov     r12b, byte_602494[r11]
00401732     mov     dil, byte_602394[r10]
00401739     xor     edi, r8d
0040173C     xor     edi, r9d
0040173F     xor     edi, r12d
00401742     and     edi, 0FFh
00401748     or      eax, edi
0040174A     shl     eax, 8
0040174D     jb      short loc_401752
0040174F     jnb     short loc_401752
0040174F
00401751     db 0E9h
00401752
00401752 loc_401752:
00401752
00401752     mov     r12b, byte_602494[r10]
00401759     mov     dil, byte_602394[r9]
00401760     xor     edi, r8d
00401763     xor     edi, r12d
00401766     xor     edi, r11d
00401769     and     edi, 0FFh
0040176F     or      eax, edi
0040176F
00401771     shl     eax, 8
00401774     mov     r12b, byte_602494[r9]
0040177B     mov     dil, byte_602394[r8]
00401782     xor     edi, r12d
00401785     xor     edi, r10d
00401788     xor     edi, r11d
0040178B     and     edi, 0FFh
00401791     or      eax, edi
00401791
00401793     retn
00401793 sub_4016e9 endp

This is slightly harder, and will require some work; first of all, write down how the sub_4016e9 function encodes the input value:

output[3] = byte_602494[input[0]] ^ input[1] ^ input[2] ^ byte_602394[input[3]];
output[2] = byte_602394[input[2]] ^ input[0] ^ input[1] ^ byte_602494[input[3]];
output[1] = byte_602394[input[1]] ^ input[0] ^ input[3] ^ byte_602494[input[2]];
output[0] = byte_602394[input[0]] ^ input[2] ^ input[3] ^ byte_602494[input[1]];

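Then, rearrange each equation so that one input byte is on the left; since XOR is its own inverse, this just means swapping the output byte with one of the plain input terms: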

input[1] = output[3] ^ byte_602494[input[0]] ^ input[2] ^ byte_602394[input[3]]
input[0] = output[2] ^ byte_602394[input[2]] ^ input[1] ^ byte_602494[input[3]]
input[3] = output[1] ^ byte_602394[input[1]] ^ input[0] ^ byte_602494[input[2]]
input[2] = output[0] ^ byte_602394[input[0]] ^ input[3] ^ byte_602494[input[1]]

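Each unknown also appears on the right-hand side, partly through a table lookup, so we can’t simply solve for one byte at a time. The search space is small, though: we can brute force input[1], input[2] and input[3] (2^24 combinations), derive input[0] from the second equation, and keep the candidate that satisfies the other three: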

// ...

for (std::uint16_t input1 = 0x00; !found && input1 <= 0xFF; input1++)
{
    input[1] = static_cast<std::uint8_t>(input1);

    for (std::uint16_t input2 = 0x00; !found && input2 <= 0xFF; input2++)
    {
        input[2] = static_cast<std::uint8_t>(input2);

        for (std::uint16_t input3 = 0x00; !found && input3 <= 0xFF; input3++)
        {
            input[3] = static_cast<std::uint8_t>(input3);

            input[0] = output[2] ^ byte_602394[input[2]] ^ input[1] ^ byte_602494[input[3]];

            if (input[1] != (output[3] ^ byte_602494[input[0]] ^ input[2] ^ byte_602394[input[3]]))
                continue;

            if (input[2] != (output[0] ^ byte_602394[input[0]] ^ input[3] ^ byte_602494[input[1]]))
                continue;

            if (input[3] != (output[1] ^ byte_602394[input[1]] ^ input[0] ^ byte_602494[input[2]]))
                continue;

            found = true;
        }
    }
}

// ...
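
For completeness, here is a self-contained version of the same search that can be compiled and tested on its own. The two tables are stand-ins (multiplication by 2 and by 3 in GF(2^8), chosen only so that the example has exactly one solution); the real contents of byte_602394 and byte_602494 must be dumped from the binary. All function names are hypothetical.

```cpp
#include <array>
#include <cstdint>

using Table = std::array<std::uint8_t, 256>;
using Word = std::array<std::uint8_t, 4>;

// Stand-in for byte_602394 (assumption: multiplication by 2 in GF(2^8))
Table MakeTimes2Table()
{
    Table table{};
    for (int i = 0; i < 256; i++)
        table[i] = static_cast<std::uint8_t>((i << 1) ^ ((i & 0x80) ? 0x1b : 0x00));
    return table;
}

// Stand-in for byte_602494 (assumption: multiplication by 3 in GF(2^8))
Table MakeTimes3Table(const Table &times2)
{
    Table table{};
    for (int i = 0; i < 256; i++)
        table[i] = static_cast<std::uint8_t>(times2[i] ^ i);
    return table;
}

// The four equations of sub_4016e9
Word Encode(const Word &in, const Table &t394, const Table &t494)
{
    Word out{};
    out[3] = static_cast<std::uint8_t>(t494[in[0]] ^ in[1] ^ in[2] ^ t394[in[3]]);
    out[2] = static_cast<std::uint8_t>(t394[in[2]] ^ in[0] ^ in[1] ^ t494[in[3]]);
    out[1] = static_cast<std::uint8_t>(t394[in[1]] ^ in[0] ^ in[3] ^ t494[in[2]]);
    out[0] = static_cast<std::uint8_t>(t394[in[0]] ^ in[2] ^ in[3] ^ t494[in[1]]);
    return out;
}

// Brute forces three bytes, derives the fourth and checks the other equations
bool Decode(const Word &out, const Table &t394, const Table &t494, Word &in)
{
    for (int i1 = 0; i1 < 256; i1++)
    for (int i2 = 0; i2 < 256; i2++)
    for (int i3 = 0; i3 < 256; i3++)
    {
        const auto b1 = static_cast<std::uint8_t>(i1);
        const auto b2 = static_cast<std::uint8_t>(i2);
        const auto b3 = static_cast<std::uint8_t>(i3);
        const auto b0 = static_cast<std::uint8_t>(out[2] ^ t394[b2] ^ b1 ^ t494[b3]);

        if (b1 != (out[3] ^ t494[b0] ^ b2 ^ t394[b3]))
            continue;
        if (b2 != (out[0] ^ t394[b0] ^ b3 ^ t494[b1]))
            continue;
        if (b3 != (out[1] ^ t394[b1] ^ b0 ^ t494[b2]))
            continue;

        in = Word{{b0, b1, b2, b3}};
        return true;
    }

    return false;
}
```

With arbitrary tables the search could return a different preimage or none at all, so the uniqueness of the result depends on the tables used.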

We have pretty much analyzed everything we needed to invert the transformation algorithm, and we can now decode the encrypted file:

alessandro at tachikoma in ~/Projects/untransformer (master)
$ untransformer decode flag.transformed flag.decoded && cat flag.decoded
VolgaCTF{transf0rming_dat@_with_hardc0ded_key_is_not_saf33}

I have published the whole source code on my GitHub page in case you want to take a look at the decoder.

If you managed to get this far, thanks for reading! I hope you found this writeup interesting; I know I at least had fun writing it.