ClamAV Bytecode Compiler
User Manual
ClamAV Bytecode Compiler - Internals Manual,
© 2009 Sourcefire, Inc.
Authors: Török Edvin
This document is distributed under the terms of the GNU General Public License v2.
Clam AntiVirus is free software; you can redistribute it and/or modify it under the terms of
the GNU General Public License as published by the Free Software Foundation; version 2 of
the License.
This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR
A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program;
if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
02110-1301, USA.
ClamAV and Clam AntiVirus are trademarks of Sourcefire, Inc.
The ClamAV Bytecode Compiler uses the LLVM compiler framework, and thus requires an operating system where building LLVM is supported:
The following packages are required to compile the ClamAV Bytecode Compiler:
The following packages are optional, but highly recommended:
You can obtain the source code in one of the following ways:
git clone git://git.clamav.net/git/clamav-bytecode-compiler
git clone http://git.clamav.net/git/clamav-bytecode-compiler.git
You can keep the source code updated using:
A minimalistic release build requires 100M of disk space.
Testing the compiler requires a full build, 320M of disk space. A debug build requires significantly more disk space (1.4G for a minimalistic debug build).
Note that this space is only needed during the build process; once installed, only 12M is needed.
Building requires a separate object directory; building in the source directory is not supported. Create a build directory:
$ cd clamav-bytecode-compiler && mkdir obj
Run configure (you can use any prefix you want, this example uses /usr/local/clamav):
Run the build under ulimit:
If make check reports errors, check that your compiler is NOT on this list: http://llvm.org/docs/GettingStarted.html#brokengcc.
If it is, then your compiler is buggy, and you need to do one of the following: upgrade your compiler to a non-buggy version, upgrade the OS to one that has a non-buggy compiler, or compile with export OPTIMIZE_OPTION=-O2 or export OPTIMIZE_OPTION=-O1.
If it is not, you have probably found a bug; report it at http://bugs.clamav.net.
Install it:
Logical signatures can be used as triggers for executing bytecode. However, instead of describing a logical signature as a .ldb pattern, you use (simple) C code which is later translated to a .ldb-style logical signature by the ClamAV Bytecode Compiler.
A bytecode triggered by a logical signature is much more powerful than a logical signature itself: you can write complex algorithmic detections, and use the logical signature as a filter (to speed up matching). Thus another name for “logical signature bytecodes” is “algorithmic detection bytecodes”. The detection you write in bytecode has read-only access to the file being scanned and its metadata (PE sections, EP, etc.).
Algorithmic detection bytecodes are triggered when a logical signature matches. They can execute an algorithm that determines whether the file is infected and with which virus.
A bytecode can be either algorithmic or an unpacker (or other hook), but not both.
It consists of:
The syntax for defining logical signatures, and an example is described in Section 2.2.4.
The function entrypoint must report the detected virus by calling foundVirus and returning 0. It is recommended that you always return 0, otherwise a warning is shown and the file is considered clean. If foundVirus is not called, then ClamAV also assumes the file is clean.
Each logical signature bytecode must have a virusname prefix, and one or more virusnames. The virusname prefix is used by the SI to ensure unique virusnames (a unique number is appended for duplicate prefixes).
In Program 1, three predefined macros are used:
In this example, the bytecode could generate one of these virusnames: Trojan.Foo.A, or Trojan.Foo.B, by calling foundVirus("A") or foundVirus("B") respectively (notice that the prefix is not part of these calls).
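The naming rule above can be illustrated with a small host-side sketch. This is ordinary C, not bytecode: `PREFIX`, `found_virus_mock`, `reported_name` and `entrypoint_sketch` are hypothetical stand-ins for the VIRUSNAME_PREFIX macro, the foundVirus call and the entrypoint function, and the sketch assumes the full name is formed as the prefix, a dot, and the suffix passed to foundVirus.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for VIRUSNAME_PREFIX("Trojan.Foo"). */
static const char *const PREFIX = "Trojan.Foo";

/* Records the last full virus name reported through the mock. */
char reported_name[64];

/* Mock of foundVirus(): joins prefix and suffix as "prefix.suffix". */
void found_virus_mock(const char *suffix)
{
    assert(suffix != NULL);
    snprintf(reported_name, sizeof reported_name, "%s.%s", PREFIX, suffix);
}

/* Mirrors the shape of an entrypoint: report a name, then return 0. */
int entrypoint_sketch(int variant_b)
{
    found_virus_mock(variant_b ? "B" : "A");
    return 0;
}
```

Note that the suffix alone is passed to the reporting call; the prefix is added elsewhere, which is why it does not appear in the calls.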
Logical signatures use .ndb-style patterns; an example of how to define these is shown in Program 2.
Each pattern has a name (like a variable), and a string that is the hex pattern itself. The declarations are delimited by the macros SIGNATURES_DECL_BEGIN and SIGNATURES_DECL_END. The definitions are delimited by the macros SIGNATURES_DEF_BEGIN and SIGNATURES_END. Declarations must always come before definitions, and you can have only one declaration section and one definition section (think of declarations as variable declarations, and of definitions as variable assignments, since that is what they are under the hood). The order in which you declare the signatures is the order in which they appear in the generated logical signature.
You can use any name for the patterns that is a valid record field name in C, and doesn’t conflict with anything else declared.
After using the above macros, the global variable Signatures will have two new fields: magic, and zero. These can be used as arguments to the functions count_match(), and matches() anywhere in the program as shown in Program 3:
The condition in the if can be interpreted as: if the match signature has matched at least once, and the number of times the zero signature matched is higher than the number of times the check signature matched, then we have found a virus A, otherwise the file is clean.
The simplest logical signature is like a .ndb signature: a virus name, a signature target, 0 as the logical expression, and an .ndb-style pattern.
The code for this is shown in Program 4.
The logical signature (created by the compiler) looks like this: Trojan.Foo.{A};Target:2;0;aabb
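The layout of that line can be made explicit with a small formatter. This is only a sketch of the field order visible in the example (name prefix and names, target, logical expression, pattern); `format_ldb` is a hypothetical helper and not part of the compiler.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes "prefix.{names};Target:N;expr;pattern" into out.
 * Returns the length the full string needs, as snprintf does. */
int format_ldb(char *out, size_t outlen, const char *prefix,
               const char *names, unsigned target,
               const char *expr, const char *pattern)
{
    assert(out != NULL && outlen > 0);
    return snprintf(out, outlen, "%s.{%s};Target:%u;%s;%s",
                    prefix, names, target, expr, pattern);
}
```

Calling it with the prefix Trojan.Foo, the name A, target 2, expression 0 and pattern aabb reproduces the line shown above.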
Of course, when all that entrypoint does is set a virusname and return, you should use a plain .ldb signature instead. However, once the bytecode has been triggered by logical_trigger, you can do more complex checks in entrypoint.
In the example in Program 4 the pattern was used without an anchor; such a pattern matches at any offset. You can use offsets though, the same way as in .ndb signatures, see Program 5 for an example.
An example of this is shown in Program 5. (In case of a duplicate virusname, the SI appends a unique number to the prefix.) Here you see the following new features used:
The logical signature looks like this:
Trojan.Foo.{A,B};Target:2;(((0|1|2)=42,2)|(3=10));EP+0:aabb;ffff;aaccee;f00d;dead
Notice how the subsignature that is not used in the logical expression (number 4, dead) is used in entrypoint to decide the virus name. This works because ClamAV does collect the match counts for all subsignatures (regardless if they are used or not in a signature). The count_match(Signatures.check2) call is thus a simple memory read of the count already determined by ClamAV.
Also notice that comments can be used freely: they are ignored by the compiler. You can use either C-style multiline comments (start comment with /*, end with */), or C++-style single-line comments (start comment with //, automatically ended by newline).
ClamAV only supports a limited form of regular expressions in .ndb format: wildcards. The bytecode compiler allows you to compile fully generic regular expressions to bytecode directly. When libclamav loads the bytecode, it will compile it to native code (if using the JIT), so it should offer quite good performance.
The compiler currently uses re2c to compile regular expressions to C code, and then compile that to bytecode. The internal workings are all transparent to the user: the compiler automatically uses re2c when needed, and re2c is embedded in the compiler, so you don’t need to install it.
The syntax of regular expressions is similar to that of POSIX regular expressions, except that you have to quote literals, since unquoted they are interpreted as regular expression names.
Let’s start with a simple example, to match this POSIX regular expression: eval([a-zA-Z_][a-zA-Z0-9_]*\.unescape.
See Program 6 (which omits the virusname and logical signature declarations).
Several new features are introduced here; a step-by-step breakdown follows:
You may have multiple regular expressions, or declare multiple regular expressions with a name, and use those names to build more complex regular expressions.
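For intuition about what a compiled regular expression amounts to, the example pattern above (the literal eval(, an identifier, then the literal .unescape) can be written as a hand-coded matcher in plain C. This is not the code re2c generates and not the bytecode syntax; all function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* [a-zA-Z_] */
static int is_ident_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

/* [a-zA-Z0-9_] */
static int is_ident_char(unsigned char c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9');
}

/* Tries to match the pattern at the very start of s[0..len). */
static int match_here(const unsigned char *s, size_t len)
{
    static const char head[] = "eval(";
    static const char tail[] = ".unescape";
    size_t hl = sizeof head - 1, tl = sizeof tail - 1, i;

    if (len < hl || memcmp(s, head, hl) != 0)
        return 0;
    i = hl;
    if (i >= len || !is_ident_start(s[i]))
        return 0;
    /* Greedy scan is safe: '.' is never an identifier character. */
    while (i < len && is_ident_char(s[i]))
        i++;
    return len - i >= tl && memcmp(s + i, tail, tl) == 0;
}

/* Returns 1 if the pattern occurs anywhere in buf[0..len). */
int contains_eval_unescape(const unsigned char *buf, size_t len)
{
    size_t pos;

    assert(buf != NULL);
    for (pos = 0; pos < len; pos++)
        if (match_here(buf + pos, len - pos))
            return 1;
    return 0;
}
```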
When writing an unpacker, the bytecode should consist of:
Compiling is similar to gcc:
This will compile the file foo.c into a file called foo.cbc, that can be loaded by ClamAV, and packed inside a .cvd file.
The compiler by default has all warnings turned on.
Supported optimization levels: -O0, -O1, -O2, -O3. It is recommended that you always compile with at least -O1.
Warning options: -Werror (transforms all warnings into errors).
Preprocessor flags:
The compiler also supports some other commandline options (see clambc-compiler --help for a full list), however some of them have no effect when using the ClamAV bytecode backend (such as the X86 backend options). You shouldn’t need to use any flags not documented above.
Filenames with a .cpp extension are compiled as C++ files, however clang++ is not yet ready for production use, so this is EXPERIMENTAL currently. For now write bytecodes in C.
After compiling a C source file to bytecode, you can load it in ClamAV:
ClamBC is a tool you can use to test whether the bytecode loads, compiles, and can execute its entrypoint successfully. Usage:
For example loading a simple bytecode with 2 functions is done like this:
You can tell clamscan to load the bytecode as a database directly:
Or you can instruct it to load all databases from a directory, then clamscan will load all supported formats, including files with bytecode, which have the .cbc extension.
You can also put the bytecode files into the default database directory of ClamAV (usually /usr/local/share/clamav) to have it loaded automatically from there. Of course, the bytecode can be stored inside CVD files, too.
Printf and printf-like format specifiers are not supported in the bytecode. You can use these functions instead of printf to print strings and integers to clamscan’s --debug output:
debug_print_str, debug_print_uint, debug_print_str_start, debug_print_str_nonl.
You can also use the debug convenience wrapper, which automatically prints its argument as a string or an integer depending on the parameter type: debug.
See Program 7 for an example.
If you have GDB 7.0 (or newer), you can single-step during the execution of the bytecode (this is not yet implemented in libclamav, and assumes you have JIT support).
You can single-step through the execution of the bytecode, however you can’t (yet) print values of individual variables, you’ll need to add debug statements in the bytecode to print interesting values.
However, currently the only supported language from which such bytecode can be generated is a simplified form of C.
The language supported by the ClamAV bytecode compiler is a restricted set of C99 with some GNU extensions.
These restrictions are enforced at compile time:
They are meant to ensure the following:
These restrictions are checked at runtime (checks are inserted at compile time):
The ClamAV API header has further restrictions; see the Internals manual.
Although the bytecode undergoes a series of automated tests (see the Publishing chapter in the Internals manual), the above restrictions don’t guarantee that the resulting bytecode will execute correctly! You must still test the code yourself; these restrictions only avoid the most common errors. Although the compiler and verifier aim to accept only code that won’t crash ClamAV, no code is 100% perfect, and a bug in the verifier could allow unsafe code to be executed by ClamAV.
The bytecode format has the following limitations:
Logical signatures can be used as triggers for executing a bytecode. Instead of describing a logical signature as a .ldb pattern, you use C code which is then translated to a .ldb-style logical signature.
Logical signatures in ClamAV support the following operations:
Of the above operations, the ClamAV Bytecode Compiler doesn’t support computing sums of nested subexpressions (it does support nesting, though).
The C code that can be converted into a logical signature must obey these restrictions:
The compiler does the following transformations (not necessarily in this order):
If after this transformation the program meets the requirements outlined above, then it is converted to a logical signature. The resulting logical signature is simplified using basic properties of boolean operations, such as associativity, distributivity, De Morgan’s law.
The final logical signature is not unique (there might be another logical signature with identical behavior), however the boolean part is in a canonical form: it is in disjunctive normal form, with operands sorted in ascending order.
For best results the C code should consist of:
You can use || in the if condition too, but be careful that after expanding to disjunctive normal form, the number of subexpressions doesn’t exceed 64.
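The growth caused by expanding to disjunctive normal form is easy to underestimate, so here is a rough estimator. It assumes a condition shaped as an AND of OR-groups and assumes the limit of 64 applies to the number of AND-terms after expansion; `dnf_term_count` and `fits_limit` are hypothetical helpers, not compiler functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SUBEXPRESSIONS 64u

/* For (g0) && (g1) && ..., where group i is an OR of group_sizes[i]
 * operands, the DNF has one AND-term per way of picking one operand
 * from each group, i.e. the product of the group sizes.
 * Overflow is not handled; this is only meant for small inputs. */
uint64_t dnf_term_count(const unsigned *group_sizes, size_t ngroups)
{
    uint64_t terms = 1;
    size_t i;

    assert(group_sizes != NULL || ngroups == 0);
    for (i = 0; i < ngroups; i++)
        terms *= group_sizes[i];
    return terms;
}

/* Returns 1 if the expanded form stays within the assumed limit. */
int fits_limit(const unsigned *group_sizes, size_t ngroups)
{
    return dnf_term_count(group_sizes, ngroups) <= MAX_SUBEXPRESSIONS;
}
```

For example, (a || b) && (c || d || e) expands to 6 terms, while four groups of sizes 4, 4, 4 and 2 already expand to 128.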
Note that you do not have to use all the subsignatures you declared in logical_trigger, you can do more complicated checks (that wouldn’t obey the above restrictions) in the bytecode itself at runtime. The logical_trigger function is fully compiled into a logical signature, it won’t be a runtime executed function (hence the restrictions).
When compiling a bytecode program, bytecode.h is automatically included, so you don’t need to include it explicitly. The headers (and the compiler itself) predefine certain macros; see the Appendix for a full list. In addition the following types are defined:
As described in Section 4.1, the widths of integer types are fixed; the above typedefs show that.
A bytecode’s entrypoint is the function entrypoint and it’s required by ClamAV to load the bytecode.
Bytecode that is triggered by a logical signature must have a list of virusnames and patterns defined. Bytecodes triggered via hooks can optionally have them, but for example a PE unpacker doesn’t need virus names as it only processes the data.
Global COPYRIGHT(c) This will also prevent the sourcecode from being embedded into the bytecode
Global DECLARE_SIGNATURE(name)
Global DEFINE_SIGNATURE(name, hex)
Global FUNCTIONALITY_LEVEL_MAX(m)
Global FUNCTIONALITY_LEVEL_MIN(m)
Global ICONGROUP1(group)
Global ICONGROUP2(group)
Global PDF_HOOK_DECLARE This hook is called several times, use pdf_get_phase() to find out in which phase you got called.
Global PE_HOOK_DECLARE
Global PE_UNPACKER_DECLARE
Global SIGNATURES_DECL_BEGIN
Global SIGNATURES_DECL_END
Global SIGNATURES_DEF_BEGIN
Global SIGNATURES_END
Global TARGET(tgt)
Global VIRUSNAME_PREFIX(name)
Global VIRUSNAMES(...)
Global buffer_pipe_done(int32_t id) After this all attempts to use this buffer will result in error. All buffer_pipes are automatically deallocated when bytecode finishes execution.
Global buffer_pipe_new(uint32_t size)
Global buffer_pipe_new_fromfile(uint32_t pos)
Global buffer_pipe_read_avail(int32_t id)
Global buffer_pipe_read_get(int32_t id, uint32_t amount) The ’amount’ parameter should be obtained by a call to buffer_pipe_read_avail().
Global buffer_pipe_read_stopped(int32_t id, uint32_t amount) Updates read cursor in buffer_pipe.
Global buffer_pipe_write_avail(int32_t id)
Global buffer_pipe_write_get(int32_t id, uint32_t size) Returns pointer to writable buffer. The ’amount’ parameter should be obtained by a call to buffer_pipe_write_avail().
Global buffer_pipe_write_stopped(int32_t id, uint32_t amount)
Global cli_readint16(const void *buff)
Global cli_readint32(const void *buff)
Global cli_writeint32(void *offset, uint32_t v)
Global hashset_add(int32_t hs, uint32_t key)
Global hashset_contains(int32_t hs, uint32_t key)
Global hashset_done(int32_t id) Trying to use the hashset after this will result in an error. All hashsets are automatically deallocated when bytecode finishes execution.
Global hashset_empty(int32_t id)
Global hashset_new(void)
Global hashset_remove(int32_t hs, uint32_t key)
Global inflate_done(int32_t id)
Global inflate_init(int32_t from_buffer, int32_t to_buffer, int32_t windowBits) ’from_buffer’ and writing uncompressed data ’to_buffer’.
Global inflate_process(int32_t id)
Global le16_to_host(uint16_t v)
Global le32_to_host(uint32_t v)
Global le64_to_host(uint64_t v)
Global malloc(uint32_t size)
Global map_addkey(const uint8_t *key, int32_t ksize, int32_t id)
Global map_done(int32_t id)
Global map_find(const uint8_t *key, int32_t ksize, int32_t id)
Global map_getvalue(int32_t id, int32_t size)
Global map_getvaluesize(int32_t id)
Global map_new(int32_t keysize, int32_t valuesize)
Global map_remove(const uint8_t *key, int32_t ksize, int32_t id)
Global map_setvalue(const uint8_t *value, int32_t vsize, int32_t id)
Class DIS_arg
Class DIS_fixed
Class DIS_mem_arg
Global disasm_x86(struct DISASM_RESULT *result, uint32_t len)
Global DisassembleAt(struct DIS_fixed *result, uint32_t offset, uint32_t len)
Global count_match(__Signature sig)
Global engine_db_options(void)
Global engine_dconf_level(void)
Global engine_functionality_level(void)
Global engine_scan_options(void)
Global match_location(__Signature sig, uint32_t goback)
Global match_location_check(__Signature sig, uint32_t goback, const char *static_start, uint32_t static_len) It is recommended to use this for safety and compatibility with 0.96.1.
Global matches(__Signature sig)
Global __is_bigendian(void) __attribute__((const )) __attribute__((nothrow))
Global check_platform(uint32_t a, uint32_t b, uint32_t c)
Global disable_bytecode_if(const int8_t *reason, uint32_t len, uint32_t cond)
Global disable_jit_if(const int8_t *reason, uint32_t len, uint32_t cond)
Global get_environment(struct cli_environment *env, uint32_t len)
Global version_compare(const uint8_t *lhs, uint32_t lhs_len, const uint8_t *rhs, uint32_t rhs_len)
Global buffer_pipe_new_fromfile(uint32_t pos) to the current file, at the specified position.
Global file_byteat(uint32_t offset)
Global file_find(const uint8_t *data, uint32_t len)
Global file_find_limit(const uint8_t *data, uint32_t len, int32_t maxpos)
Global fill_buffer(uint8_t *buffer, uint32_t len, uint32_t filled, uint32_t cursor, uint32_t fill)
Global getFilesize(void)
Global read(uint8_t *data, int32_t size)
Global read_number(uint32_t radix) Non-numeric characters are ignored.
Global seek(int32_t pos, uint32_t whence)
Global write(uint8_t *data, int32_t size)
Global __clambc_filesize[1]
Global __clambc_kind
Global __clambc_match_counts[64]
Global __clambc_match_offsets[64]
Global __clambc_pedata
Global matchicon(const uint8_t *group1, int32_t group1_len, const uint8_t *group2, int32_t group2_len)
Global jsnorm_done(int32_t id)
Global jsnorm_init(int32_t from_buffer)
Global jsnorm_process(int32_t id)
Global icos(int32_t a, int32_t b, int32_t c)
Global iexp(int32_t a, int32_t b, int32_t c)
Global ilog2(uint32_t a, uint32_t b)
Global ipow(int32_t a, int32_t b, int32_t c)
Global isin(int32_t a, int32_t b, int32_t c)
Global pdf_get_dumpedobjid(void) Valid only in PDF_PHASE_POSTDUMP.
Global pdf_get_flags(void)
Global pdf_get_obj_num(void)
Global pdf_get_phase(void) Identifies at which phase this bytecode was called.
Global pdf_getobj(int32_t objidx, uint32_t amount) Meant only for reading, write modifies the fmap buffer, so avoid!
Global pdf_getobjsize(int32_t objidx)
Global pdf_lookupobj(uint32_t id)
Global pdf_set_flags(int32_t flags)
Class cli_exe_info
Class cli_exe_section
Class cli_pe_hook_data
Global get_pe_section(struct cli_exe_section *section, uint32_t num)
Global getEntryPoint(void)
Global getExeOffset(void)
Global getImageBase(void)
Global getNumberOfSections(void)
Global getPEBaseOfCode(void)
Global getPEBaseOfData(void)
Global getPECharacteristics()
Global getPECheckSum(void)
Global getPEDataDirRVA(unsigned n)
Global getPEDataDirSize(unsigned n)
Global getPEDllCharacteristics(void)
Global getPEFileAlignment(void)
Global getPEImageBase(void)
Global getPEisDLL()
Global getPELFANew(void)
Global getPELoaderFlags(void)
Global getPEMachine()
Global getPEMajorImageVersion(void)
Global getPEMajorLinkerVersion(void)
Global getPEMajorOperatingSystemVersion(void)
Global getPEMajorSubsystemVersion(void)
Global getPEMinorImageVersion(void)
Global getPEMinorLinkerVersion(void)
Global getPEMinorOperatingSystemVersion(void)
Global getPEMinorSubsystemVersion(void)
Global getPENumberOfSymbols()
Global getPEPointerToSymbolTable()
Global getPESectionAlignment(void)
Global getPESizeOfCode(void)
Global getPESizeOfHeaders(void)
Global getPESizeOfHeapCommit(void)
Global getPESizeOfHeapReserve(void)
Global getPESizeOfImage(void)
Global getPESizeOfInitializedData(void)
Global getPESizeOfOptionalHeader()
Global getPESizeOfStackCommit(void)
Global getPESizeOfStackReserve(void)
Global getPESizeOfUninitializedData(void)
Global getPESubsystem(void)
Global getPETimeDateStamp()
Global getPEWin32VersionValue(void)
Global getSectionRVA(unsigned i)
Global getSectionVirtualSize(unsigned i)
Global getVirtualEntryPoint(void)
Global hasExeInfo(void)
Global hasPEInfo(void)
Global isPE64(void)
Class pe_image_data_dir
Class pe_image_file_hdr
Class pe_image_optional_hdr32
Class pe_image_optional_hdr64
Class pe_image_section_hdr
Global pe_rawaddr(uint32_t rva)
Global readPESectionName(unsigned char name[8], unsigned n)
Global readRVA(uint32_t rva, void *buf, size_t bufsize)
Global bytecode_rt_error(int32_t locationid)
Global extract_new(int32_t id)
Global extract_set_container(uint32_t container)
Global foundVirus(const char *virusname)
Global input_switch(int32_t extracted_file)
Global setvirusname(const uint8_t *name, uint32_t len)
Global atoi(const uint8_t *str, int32_t size)
Global debug_print_str(const uint8_t *str, uint32_t len)
Global debug_print_str_nonl(const uint8_t *str, uint32_t len)
Global debug_print_str_start(const uint8_t *str, uint32_t len)
Global debug_print_uint(uint32_t a)
Global entropy_buffer(uint8_t *buffer, int32_t size)
Global hex2ui(uint32_t hex1, uint32_t hex2)
Global memchr(const void *s, int c, size_t n)
Global memcmp(const void *s1, const void *s2, uint32_t n) __attribute__((__nothrow__)) __attribute__((__pure__)) __attribute__((__nonnull__(1
Global memcpy(void *restrict dst, const void *restrict src, uintptr_t n) __attribute__((__nothrow__)) __attribute__((__nonnull__(1
Global memmove(void *dst, const void *src, uintptr_t n) __attribute__((__nothrow__)) __attribute__((__nonnull__(1
Global memset(void *src, int c, uintptr_t n) __attribute__((nothrow)) __attribute__((__nonnull__((1))))
Global memstr(const uint8_t *haystack, int32_t haysize, const uint8_t *needle, int32_t needlesize)
Executable file information
Entrypoint of executable; address size (PE only); number of sections; offset where this executable starts in the file (nonzero if embedded); resources RVA (PE only); information about all the sections of this file (this array has nsection elements).
Section of executable file.
Section characteristics; raw offset (in file); raw size (in file); relative VirtualAddress; PE - unaligned PointerToRawData; PE - unaligned SizeOfRawData; PE - unaligned VirtualAddress; PE - unaligned VirtualSize; VirtualSize.
Data for the bytecode PE hook
PE data directory header; address of new exe header; EntryPoint as file offset; header for this PE file; internally needed by rawaddr; number of sections; 32-bit PE optional header; 64-bit PE optional header; number of overlays; size of overlays.
Disassembled operand.
Size of access; type of access; memory operand; other operand; register operand.
Disassembled instruction.
Size of address; size of operation; segment; opcode of X86 instruction.
Disassembled memory operand: scale_reg*scale + add_reg + displacement.
Size of access; register used as displacement; displacement as immediate number; scale as immediate number; register used as scale.
disassembly result, 64-byte, matched by type-8 signatures
PE data directory header
Header for this PE file
CPU this executable runs on, see libclamav/pe.c for possible values; PE magic header: PE\0\0; number of sections in this executable; debug; debug; == 224; unreliable.
32-bit PE optional header
NT drivers only; usually 32 or 512; multiple of 64 KB; unreliable; unreliable; not used; unreliable; unreliable; not used; unreliable; usually 32 or 4096; unreliable; unreliable; unreliable.
PE 64-bit optional header
NT drivers only; usually 32 or 512; multiple of 64 KB; unreliable; unreliable; not used; unreliable; unreliable; not used; unreliable; usually 32 or 4096; unreliable; unreliable; unreliable.
PE section header
May not end with NULL; object files only; object files only; object files only; offset to the section’s data; object files only; multiple of FileAlignment.
Reads specified amount of bytes from the current file into a buffer. Also moves current position in the file.
Writes the specified amount of bytes from a buffer to the current temporary file.
Changes the current file position to the specified one.
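The interaction between reading and the current file position described above can be illustrated with an in-memory mock. Everything here is hypothetical: `mock_file`, `mock_read` and `mock_seek` are stand-ins, the return values (bytes read, new position, -1 on error) are assumptions, and only absolute seeking is modelled, so this should not be taken as the behavior of the real API.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* In-memory stand-in for "the current file" and its position. */
struct mock_file {
    const uint8_t *data;
    int32_t size;
    int32_t pos;
};

/* Copies up to `size` bytes from the current position and advances it.
 * Returns the number of bytes copied, or -1 on invalid arguments. */
int32_t mock_read(struct mock_file *f, uint8_t *out, int32_t size)
{
    int32_t avail;

    assert(f != NULL);
    if (out == NULL || size < 0)
        return -1;
    avail = f->size - f->pos;
    if (size > avail)
        size = avail; /* short read at end of file */
    memcpy(out, f->data + f->pos, (size_t)size);
    f->pos += size;
    return size;
}

/* Absolute seek only; returns the new position or -1 if out of range. */
int32_t mock_seek(struct mock_file *f, int32_t pos)
{
    assert(f != NULL);
    if (pos < 0 || pos > f->size)
        return -1;
    f->pos = pos;
    return pos;
}
```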
Logical signature match counts.
Logical signature match offsets. This is a low-level variable; use the macros in bytecode_local.h to access it instead.