I wrote this little hook-engine for a much bigger article. Sometimes it seems
such a vaste to write valuable code for large articles whose topic isn't
directly related to the code. This often leads to the problem that the code
won't be found by the people who are looking for it.
Personally, I would've used Microsoft's Detour hook engine, but the free license
only applies to x86 applications, and that seemed a little bit too restrictive
to me. So, I decided to write my own engine in order to support x64 as well.
I've never downloaded Detour nor have I ever seen its APIs, but from the general
overview given by Microsoft it's easy to guess how it works.
As I said, this is only a part of something bigger. It's not perfect, but it can
easily become such. Since this is not a beginner's guide about hooking, I assume
that the reader already possesses the necessary knowledge to understand the
material. If you never heard about this subject, you'd better start with another
article. There's plenty of guides out there, no need to repeat the same things
here.
As everybody knows there's only one easy and secure way to hook a Win32 API: to
put an inconditional jump at the beginning of the code to redirect it to the
hooked function. And by secure I just mean that our hook can't be bypassed. Of
course, there are some other ways, but they're either complicated or insane or
both. A proxy dll, for instance, might work in some cases, but it's rather
insane for system dlls. Overwriting the IAT is unsecure for two reasons:
a) The program might use GetProcAddress to retrieve the address of an API (and
in that case we should handle this API as well).
b) It's not always possible, there are many cases as for packed programs where
the IAT gets built by the protection code and not by the Windows loader.
Ok, I guess you're convinced. Let's just say that there's a reason why Microsoft
uses the method presented in this article.
A common technique used in combination with the unconditional jump is:
This approach may seem unsafe in a multi-threading environment and it is. It might work, but our technique is much more powerful. Well, nothing new, we just put our unconditional jump at the beginning of the code we want to hook and we put the original instructions of the API elsewhere in memory. When the hooked function jumps to our code we can call the bridge we created, which, after the first instructions, will jump to the API code which follows our unconditional jump:
Let's make a real world example. If the first instructions of the function/API we want to hook are:
mov edi, edi
push ebp
mov ebp, esp
xor ecx, ecx
They will be replaced by our:00400000 jmp our_code
00400005 xor ecx, ecx
Our bridge will look like this:mov edi, edi
push ebp
mov ebp, esp
jmp 00400005
Of course, to know the size of the instructions we're going to
replace we need a disassembler both for x86 and x64. I searched on google for an
x64 disassembler and found the
diStorm64 disassembler. I quote from its homepage:
diStorm64 is a professional quality open source disassembler
library for AMD64, licensed under the BSD license. diStorm is a binary stream disassembler. It's capable of disassembling 80x86 instructions in 64 bits (AMD64, X86-64) and both in 16 and 32 bits. In addition, it disassembles FPU, MMX, SSE, SSE2, SSE3, SSSE3, SSE4, 3DNow! (w/ extensions), new x86-64 instruction sets, VMX, and AMD's SVM! diStorm was written to decode quickly every instruction as accurately as possible. Robust decoding, while taking special care for valid or unused prefixes, is what makes this disassembler powerful, especially for research. Another benefit that might come in handy is that the module was written as multi-threaded, which means you could disassemble several streams or more simultaneously. For rapidly use, diStorm is compiled for Python and is easily used in C as well. diStorm was originally written under Windows and ported later to Linux and Mac. The source code is portable and platform independent (supports both little and big endianity). It also can be used as a ring0 disassembler (tested as a kernel driver using the DDK under Windows)! |
This sounded pretty good to me. Now that we have our disassembler we can start!
The first thing I wanted to know was if it was possible to create bridges without having to relocate jumps. As the reader knows jumps, most of the time, have a relative address as operand and not an absolute one. This leads to the problem that I can't relocate a jump without having to recalculate its relative address. Also, I wanted to test if this disassembler really worked fine. So, I wrote a little program which creates a log file of all the instructions of all exported functions in a dll which are going to be overwritten by an unconditional jump. Here's the code:
#include
"stdafx.h"
#include
"distorm.h"
#include
<stdlib.h>
#include
<stdlib.h>
#include
<Windows.h>
DWORD
RvaToOffset(IMAGE_NT_HEADERS *NT, DWORD
Rva);
VOID
AddFunctionToLog(FILE *Log, BYTE *FileBuf, DWORD
FuncRVA);
VOID GetInstructionString(char *Str,
_DecodedInst *Instr);
int
_tmain(int
argc, _TCHAR*
argv[])
{
if (argc < 2) return 0;
//
// Open log file
//
FILE *Log =
NULL;
if (_tfopen_s(&Log,
argv[2],
_T("w")) != 0)
return 0;
//
// Open PE file
//
HANDLE
hFile = CreateFile(argv[1],
GENERIC_READ, FILE_SHARE_READ,
NULL,
OPEN_EXISTING, 0,
NULL);
if (hFile ==
INVALID_HANDLE_VALUE)
{
fclose(Log);
return 0;
}
DWORD
FileSize = GetFileSize(hFile,
NULL);
BYTE *FileBuf = new BYTE [FileSize];
DWORD
BRW;
if (FileBuf)
ReadFile(hFile,
FileBuf, FileSize, &BRW,
NULL);
CloseHandle(hFile);
IMAGE_DOS_HEADER *pDosHeader = (IMAGE_DOS_HEADER *)
FileBuf;
IMAGE_NT_HEADERS *pNtHeaders = (IMAGE_NT_HEADERS *) ((FileBuf !=
NULL ?
pDosHeader->e_lfanew : 0) + (ULONG_PTR)
FileBuf);
if (!FileBuf ||
pDosHeader->e_magic !=
IMAGE_DOS_SIGNATURE ||
pNtHeaders->Signature !=
IMAGE_NT_SIGNATURE ||
pNtHeaders->OptionalHeader.DataDirectory
[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress == 0)
{
fclose(Log);
if (FileBuf)
delete
FileBuf;
return 0;
}
//
// Walk through export dir's functions
//
DWORD
ET_RVA = pNtHeaders->OptionalHeader.DataDirectory
[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
IMAGE_EXPORT_DIRECTORY *pExportDir = (IMAGE_EXPORT_DIRECTORY *)
(RvaToOffset(pNtHeaders,
ET_RVA) + (ULONG_PTR)
FileBuf);
DWORD *pFunctions = (DWORD *) (RvaToOffset(pNtHeaders,
pExportDir->AddressOfFunctions) + (ULONG_PTR)
FileBuf);
for (DWORD
x = 0; x <
pExportDir->NumberOfFunctions;
x++)
{
if (pFunctions[x] == 0) continue;
AddFunctionToLog(Log,
FileBuf, pFunctions[x]);
}
fclose(Log);
delete
FileBuf;
return 0;
}
//
// This function adds to the log the instructions
// at the beginning of each function which are going
// to be overwritten by the hook jump
//
VOID AddFunctionToLog(FILE *Log, BYTE *FileBuf, DWORD
FuncRVA)
{
#define
MAX_INSTRUCTIONS 100
IMAGE_NT_HEADERS *pNtHeaders = (IMAGE_NT_HEADERS *)
((*(IMAGE_DOS_HEADER *)
FileBuf).e_lfanew + (ULONG_PTR)
FileBuf);
_DecodeResult
res;
_DecodedInst
decodedInstructions[MAX_INSTRUCTIONS];
unsigned
int decodedInstructionsCount = 0;
#ifdef
_M_IX86
_DecodeType
dt = Decode32Bits;
#define
JUMP_SIZE 10 // worst case scenario
#else
ifdef _M_AMD64
_DecodeType
dt = Decode64Bits;
#define
JUMP_SIZE 14 // worst case scenario
#endif
_OffsetType
offset = 0;
res =
distorm_decode(offset, // offset for buffer, e.g. 0x00400000
(const BYTE *) &FileBuf[RvaToOffset(pNtHeaders,
FuncRVA)],
50, // function size (code size to disasm)
dt, // x86 or x64?
decodedInstructions, // decoded instr
MAX_INSTRUCTIONS, // array size
&decodedInstructionsCount // how many instr were disassembled?
);
if (res ==
DECRES_INPUTERR)
return;
DWORD
InstrSize = 0;
for (UINT
x = 0; x <
decodedInstructionsCount; x++)
{
if (InstrSize >=
JUMP_SIZE)
break;
InstrSize +=
decodedInstructions[x].size;
char
Instr[100];
GetInstructionString(Instr, &decodedInstructions[x]);
fprintf(Log,
"%s\n", Instr);
}
fprintf(Log,
"\n\n\n");
}
VOID GetInstructionString(char *Str,
_DecodedInst *Instr)
{
wsprintfA(Str,
"%s %s", Instr->mnemonic.p,
Instr->operands.p);
_strlwr_s(Str, 100);
}
DWORD
RvaToOffset(IMAGE_NT_HEADERS *NT, DWORD
Rva)
{
DWORD
Offset = Rva,
Limit;
IMAGE_SECTION_HEADER *Img;
WORD
i;
Img =
IMAGE_FIRST_SECTION(NT);
if (Rva <
Img->PointerToRawData)
return
Rva;
for (i = 0;
i < NT->FileHeader.NumberOfSections;
i++)
{
if (Img[i].SizeOfRawData)
Limit =
Img[i].SizeOfRawData;
else
Limit =
Img[i].Misc.VirtualSize;
if (Rva >=
Img[i].VirtualAddress &&
Rva < (Img[i].VirtualAddress +
Limit))
{
if (Img[i].PointerToRawData != 0)
{
Offset -=
Img[i].VirtualAddress;
Offset +=
Img[i].PointerToRawData;
}
return
Offset;
}
}
return
NULL;
}
jmp [xxxxx]
xxxxx: absolute address (DWORD on x86 and QWORD on x64)
This means we'd
have a worst case scenario of 10 bytes on x86 and of 14 bytes on x64. In this
hook engine I'm using only worst case scenarios (no 5 byte relative addresses),
simply because if the space between the original function and the hooked one is
> 2GB or the space between the original function and the bridge is > 2GB, then I
would have to recreate the bridge from scratch every time I hook/unhook the
function. A professional engine should do this (and it's not much work), but
I'll keep it simple (for me) and use only absolute jumps. As for the results of the little program above, I
created logs for the ntdll.dll and advapi32.dll both for x86 and x64. Here, for
instance, is a small part of the ntdll.dll x86 log:mov eax, 0x44
mov edx, 0x7ffe0300
mov eax, 0x45
mov edx, 0x7ffe0300
mov eax, 0x46
mov edx, 0x7ffe0300
mov eax, 0x47
mov edx, 0x7ffe0300
mov eax, 0x48
mov edx, 0x7ffe0300
mov eax, 0x49
mov edx, 0x7ffe0300
mov eax, 0x4a
mov edx, 0x7ffe0300
mov eax, 0x4b
mov edx, 0x7ffe0300
mov eax, 0x4c
mov edx, 0x7ffe0300
This is of course pretty encouraging, but let's see the results for the x64
platform.sub rsp, 0x48
mov rax, [rsp+0x78]
mov byte [rsp+0x30], 0x0
mov [rsp+0x10], rbx
mov [rsp+0x18], rbp
mov [rsp+0x20], rsi
push rsi
push r14
push r15
sub rsp, 0x480
mov rax, rsp
mov [rax+0x8], rbx
mov [rax+0x10], rsi
mov [rax+0x18], r12
sub rsp, 0x38
mov [rsp+0x20], r8
mov r9d, edx
mov r8, rcx
mov rax, rsp
mov [rax+0x8], rsi
mov [rax+0x10], rdi
mov [rax+0x18], r12
mov [rsp+0x10], rbx
mov [rsp+0x18], rsi
push rdi
push r12
sub rsp, 0x68
mov rax, r9
mov r9d, [rsp+0xb0]
But what about the functions which just call a syscall after moving a number
into a register like NtCreateProcess, NtOpenKey etc.? These functions have very
few instructions and our 14 bytes jump will overwrite more code than the one of
the function itself. But that doesn't seem to be a problem, since as we can see
from the disassembler these functions have a 16-bytes alignment. So, we won't
overwrite other functions code anyway.
Here's the main code of the hook engine (all the code is about 300 lines of
code):
//
// This function creates a bridge of the original function
//
VOID *CreateBridge(ULONG_PTR
Function, const UINT
JumpSize =
JUMP_SIZE)
{
if (pBridgeBuffer ==
NULL) return
NULL;
#define
MAX_INSTRUCTIONS 100
_DecodeResult
res;
_DecodedInst
decodedInstructions[MAX_INSTRUCTIONS];
unsigned
int decodedInstructionsCount = 0;
#ifdef
_M_IX86
_DecodeType
dt =
Decode32Bits;
#else
ifdef _M_AMD64
_DecodeType
dt =
Decode64Bits;
#endif
_OffsetType
offset = 0;
res =
distorm_decode(offset, // offset for buffer
(const BYTE *)
Function, // buffer to disassemble
50, // function size (code size to disasm)
// 50 instr should be _quite_ enough
dt, // x86 or x64?
decodedInstructions, // decoded instr
MAX_INSTRUCTIONS, // array size
&decodedInstructionsCount // how many instr were disassembled?
);
if (res ==
DECRES_INPUTERR)
return
NULL;
DWORD
InstrSize = 0;
VOID *pBridge = (VOID *) &pBridgeBuffer[CurrentBridgeBufferSize];
for (UINT
x = 0; x <
decodedInstructionsCount; x++)
{
if (InstrSize >=
JumpSize)
break;
BYTE *pCurInstr = (BYTE *) (InstrSize + (ULONG_PTR)
Function);
//
// This is an sample attempt of handling a jump
// It works, but it converts the jz to jmp
// since I didn't write the code for writing
// conditional jumps
//
/* if (*pCurInstr == 0x74) // jz near
{
ULONG_PTR Dest = (InstrSize + (ULONG_PTR) Function)
+ (char) pCurInstr[1];
WriteJump(&pBridgeBuffer[CurrentBridgeBufferSize], Dest);
CurrentBridgeBufferSize += JumpSize;
}
else
{
*/
memcpy(&pBridgeBuffer[CurrentBridgeBufferSize],
(VOID *)
pCurInstr, decodedInstructions[x].size);
CurrentBridgeBufferSize +=
decodedInstructions[x].size;
//}
InstrSize +=
decodedInstructions[x].size;
}
WriteJump(&pBridgeBuffer[CurrentBridgeBufferSize],
Function + InstrSize);
CurrentBridgeBufferSize +=
GetJumpSize((ULONG_PTR)
&pBridgeBuffer[CurrentBridgeBufferSize],
Function + InstrSize);
return
pBridge;
}
//
// Hooks a function
//
extern
"C" __declspec(dllexport)
BOOL __cdecl
HookFunction(ULONG_PTR
OriginalFunction, ULONG_PTR
NewFunction)
{
//
// Check if the function has already been hooked
// If so, no disassembling is necessary since we already
// have our bridge
//
HOOK_INFO *hinfo =
GetHookInfoFromFunction(OriginalFunction);
if (hinfo)
{
WriteJump((VOID *)
OriginalFunction, NewFunction);
}
else
{
if (NumberOfHooks == (MAX_HOOKS - 1))
return
FALSE;
VOID *pBridge =
CreateBridge(OriginalFunction,
GetJumpSize(OriginalFunction,
NewFunction));
if (pBridge ==
NULL)
return
FALSE;
HookInfo[NumberOfHooks].Function =
OriginalFunction;
HookInfo[NumberOfHooks].Bridge = (ULONG_PTR)
pBridge;
HookInfo[NumberOfHooks].Hook =
NewFunction;
NumberOfHooks++;
WriteJump((VOID *)
OriginalFunction, NewFunction);
}
return
TRUE;
}
//
// Unhooks a function
//
extern
"C" __declspec(dllexport)
VOID __cdecl
UnhookFunction(ULONG_PTR
Function)
{
//
// Check if the function has already been hooked
// If not, I can't unhook it
//
HOOK_INFO *hinfo =
GetHookInfoFromFunction(Function);
if (hinfo)
{
//
// Replaces the hook jump with a jump to the bridge
// I'm not completely unhooking since I'm not
// restoring the original bytes
//
WriteJump((VOID *)
hinfo->Function,
hinfo->Bridge);
}
}
//
// Get the bridge to call instead of the original function from hook
//
extern
"C" __declspec(dllexport)
ULONG_PTR __cdecl
GetOriginalFunction(ULONG_PTR
Hook)
{
if (NumberOfHooks == 0)
return
NULL;
for (UINT
x = 0; x <
NumberOfHooks; x++)
{
if (HookInfo[x].Hook ==
Hook)
return
HookInfo[x].Bridge;
}
return
NULL;
}
Using the code is very simple. Basically, the dll only exports 3 functions: one to hook, another to unhook and the last to get the address of the bridge of the hooked function. Of course, we need to retrieve the address of the bridge, otherwise we can't call the original code of the hooked function.
Let's see a little code sample which works both on x86 and x64:
#include
"stdafx.h"
#include
"NtHookEngine_Test.h"
BOOL (__cdecl *HookFunction)(ULONG_PTR
OriginalFunction, ULONG_PTR
NewFunction);
VOID (__cdecl *UnhookFunction)(ULONG_PTR
Function);
ULONG_PTR (__cdecl *GetOriginalFunction)(ULONG_PTR
Hook);
int
WINAPI MyMessageBoxW(HWND
hWnd, LPCWSTR
lpText, LPCWSTR
lpCaption,
UINT
uType, WORD
wLanguageId, DWORD
dwMilliseconds);
int
APIENTRY _tWinMain(HINSTANCE
hInstance, HINSTANCE
hPrevInstance,
LPTSTR lpCmdLine,
int nCmdShow)
{
//
// Retrive hook functions
//
HMODULE
hHookEngineDll = LoadLibrary(_T("NtHookEngine.dll"));
HookFunction = (BOOL (__cdecl *)(ULONG_PTR, ULONG_PTR))
GetProcAddress(hHookEngineDll,
"HookFunction");
UnhookFunction = (VOID (__cdecl *)(ULONG_PTR))
GetProcAddress(hHookEngineDll,
"UnhookFunction");
GetOriginalFunction = (ULONG_PTR (__cdecl *)(ULONG_PTR))
GetProcAddress(hHookEngineDll,
"GetOriginalFunction");
if (HookFunction ==
NULL || UnhookFunction ==
NULL ||
GetOriginalFunction ==
NULL)
return 0;
//
// Hook MessageBoxTimeoutW
//
HookFunction((ULONG_PTR)
GetProcAddress(LoadLibrary(_T("User32.dll")),
"MessageBoxTimeoutW"),
(ULONG_PTR) &MyMessageBoxW);
MessageBox(0,
_T("Hi, this is a message box!"),
_T("This is the title."),
MB_ICONINFORMATION);
//
// Unhook MessageBoxTimeoutW
//
UnhookFunction((ULONG_PTR)
GetProcAddress(LoadLibrary(_T("User32.dll")),
"MessageBoxTimeoutW"));
MessageBox(0,
_T("Hi, this is a message box!"), _T("This is the title."),
MB_ICONINFORMATION);
return 0;
}
int
WINAPI MyMessageBoxW(HWND
hWnd, LPCWSTR
lpText, LPCWSTR
lpCaption, UINT
uType,
WORD
wLanguageId, DWORD
dwMilliseconds)
{
int (WINAPI *pMessageBoxW)(HWND
hWnd, LPCWSTR
lpText,
LPCWSTR
lpCaption, UINT
uType, WORD
wLanguageId,
DWORD
dwMilliseconds);
pMessageBoxW = (int (WINAPI *)(HWND,
LPCWSTR, LPCWSTR, UINT, WORD, DWORD))
GetOriginalFunction((ULONG_PTR)
MyMessageBoxW);
return
pMessageBoxW(hWnd,
lpText, L"Hooked MessageBox",
uType,
wLanguageId, dwMilliseconds);
}
In this sample I'm hooking the API "MessageBoxTimeoutW". I tried to hook MessageBoxW and that worked fine on x86, then I tried on x64 and the code generated an excpetion. So, I disassembled the MessageBoxW function on x64:
Unfortunately, as you can notice, the first instructions of this API include a jz which is going to be overwritten by our unconditional jump. And since we don't relocate jumps in our bridge, we can't hook this function. So, I had to hook the function MessageBoxTimeoutW, which is called inside MessageBoxW and has no jumps at the beginning.
In the code example I first hook the function and call it, then I unhook it and call it again. So, the output will be:
That's all. Of course, this code works only if MessageBoxTimeoutW is available. I'm not completely sure about when it was first introduced, since it's an undocumented API. I guess it has been introduced with XP, so chances are that this particular hook won't work on Windows 2000.
As it's possible to see from the previous example, the hook engine isn't perfect, but it can easily be improved. I don't develop it further because I don't need a more powerful one (right now, I mean). I just needed an x86/x64 hook engine with no license restrictions. I wrote this engine and the article in just one day, it really wasn't much work. Most of the work in such a hook engine is writing the disassembler, which I didn't do. So, in my opinion, it doesn't make much sense paying for a hook engine. The only thing which I really can't provide in this engine is support for Itanium. That's because I don't have a disassembler for this platform. But I would rather write one myself than buying a hook engine. I might actually add an Itanium disassembler in the future, who knows...
I hope you can find this code useful.
Daniel Pistelli