
____________________________________________________________________
ClamAV Bytecode Compiler
User Manual


ClamAV Bytecode Compiler - Internals Manual,
© 2009 Sourcefire, Inc.
Authors: Török Edvin
This document is distributed under the terms of the GNU General Public License v2.
Clam AntiVirus is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.

ClamAV and Clam AntiVirus are trademarks of Sourcefire, Inc.

Chapter 1
Installation

1.1. Requirements

The ClamAV Bytecode Compiler uses the LLVM compiler framework, thus requires an Operating System where building LLVM is supported:

The following packages are required to compile the ClamAV Bytecode Compiler:

The following packages are optional, but highly recommended:

1.2. Obtaining the ClamAV Bytecode Compiler

You can obtain the source code in one of the following ways:

You can keep the source code updated using:

git pull

1.3. Building

1.3.1. Disk space

A minimalistic release build requires about 100M of disk space.

Testing the compiler requires a full build, which needs about 320M of disk space. A debug build requires significantly more disk space (1.4G for a minimalistic debug build).

Note that this space is only needed during the build process; once installed, only about 12M is needed.

1.3.2. Create build directory

Building requires a separate object directory; building in the source directory is not supported. Create a build directory:

$ cd clamav-bytecode-compiler && mkdir obj

Run configure (you can use any prefix you want, this example uses /usr/local/clamav):

$ cd obj && ../llvm/configure --enable-optimized \  
 --enable-targets=host-only --disable-bindings \  
--prefix=/usr/local/clamav

Run the build under ulimit:

$ (ulimit -t 3600 -v 512000 && make clambc-only -j4)

1.4. Testing

$ (ulimit -t 3600 -v 512000 && make -j4)
$ make check-all

If make check-all reports errors, check that your compiler is NOT on this list: http://llvm.org/docs/GettingStarted.html#brokengcc.

If it is, then your compiler is buggy, and you need to do one of the following: upgrade your compiler to a non-buggy version, upgrade the OS to one that has a non-buggy compiler, or compile with export OPTIMIZE_OPTION=-O2 or export OPTIMIZE_OPTION=-O1.

If it is not, you have probably found a bug; report it at http://bugs.clamav.net.

1.5. Installing

Install it:

$ make install-clambc -j8

1.5.1. Structure of installed files

  1. The ClamAV Bytecode compiler driver: $PREFIX/bin/clambc-compiler
  2. ClamAV bytecode header files:
    $PREFIX/lib/clang/1.1/include:  
    bcfeatures.h  
    bytecode_{api_decl.c,api,disasm,execs,features}.h  
    bytecode.h  
    bytecode_{local,pe,types}.h

  3. clang compiler (with ClamAV bytecode backend) include files:
    $PREFIX/lib/clang/1.1/include:  
    emmintrin.h  
    float.h  
    iso646.h  
    limits.h  
    {,p,t,x}mmintrin.h  
    mm_malloc.h  
    std{arg,bool,def,int}.h  
    tgmath.h

  4. User manual
    $PREFIX/docs/clamav/clambc-user.pdf

Chapter 2
Tutorial

2.1. Short introduction to the bytecode language

2.1.1. Types, variables and constants

2.1.2. Arrays and pointers

2.1.3. Arithmetics

2.1.4. Functions

2.1.5. Control flow

2.1.6. Common functions

2.2. Writing logical signature bytecodes

Logical signatures can be used as triggers for executing bytecode. However, instead of describing a logical signature as a .ldb pattern, you use (simple) C code which is later translated to a .ldb-style logical signature by the ClamAV Bytecode Compiler.

A bytecode triggered by a logical signature is much more powerful than a logical signature itself: you can write complex algorithmic detections, and use the logical signature as a filter (to speed up matching). Thus another name for “logical signature bytecodes” is “algorithmic detection bytecodes”. The detection you write in bytecode has read-only access to the file being scanned and its metadata (PE sections, EP, etc.).

2.2.1. Structure of a bytecode for algorithmic detection

Algorithmic detection bytecodes are triggered when a logical signature matches. They can execute an algorithm that determines whether the file is infected and with which virus.

A bytecode can be either algorithmic or an unpacker (or other hook), but not both.

It consists of:

The syntax for defining logical signatures is described, together with an example, in Section 2.2.4.

The function entrypoint must report the detected virus by calling foundVirus and returning 0. It is recommended that you always return 0, otherwise a warning is shown and the file is considered clean. If foundVirus is not called, then ClamAV also assumes the file is clean.
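The rules in the paragraph above can be summarized as a small decision model. The following plain-C sketch is not ClamAV code: the enum and the function interpret_result are hypothetical names introduced only for illustration, and found_virus stands for the name passed to foundVirus (NULL if it was never called).

```c
#include <stddef.h>

/* Hypothetical model of how the result of entrypoint is interpreted,
 * following the rules stated in the text. Not part of the ClamAV API. */
enum verdict {
    VERDICT_CLEAN,              /* foundVirus was not called            */
    VERDICT_CLEAN_WITH_WARNING, /* entrypoint returned a nonzero value  */
    VERDICT_INFECTED            /* foundVirus called and 0 returned     */
};

static enum verdict interpret_result(int return_value, const char *found_virus)
{
    if (return_value != 0)
        return VERDICT_CLEAN_WITH_WARNING;
    if (found_virus == NULL)
        return VERDICT_CLEAN;
    return VERDICT_INFECTED;
}
```

Under this model, a detection is only reported when both conditions hold, which is why always returning 0 is the recommended habit.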

2.2.2. Virusnames

Each logical signature bytecode must have a virusname prefix, and one or more virusnames. The virusname prefix is used by the SI to ensure unique virusnames (a unique number is appended for duplicate prefixes).


/* Prefix, used for duplicate detection and fixing */
VIRUSNAME_PREFIX("Trojan.Foo")
/* You are only allowed to set these virusnames as found */
VIRUSNAMES("A", "B")
/* File type */
TARGET(2)

Program 1: Declaring virusnames

In Program 1, three predefined macros are used:

In this example, the bytecode could generate one of these virusnames: Trojan.Foo.A, or Trojan.Foo.B, by calling foundVirus("A") or foundVirus("B") respectively (notice that the prefix is not part of these calls).
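To make the naming rule concrete, here is a minimal sketch in ordinary C of how a reported name is composed from the prefix and the suffix. The helper compose_virusname is hypothetical and only mirrors the description above; it is not how ClamAV itself builds the name.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of the ClamAV API): joins a prefix and a
 * suffix with a dot, the way the text describes the reported name.
 * Returns 0 on success, -1 if the result does not fit in out. */
static int compose_virusname(const char *prefix, const char *suffix,
                             char *out, size_t outlen)
{
    int n = snprintf(out, outlen, "%s.%s", prefix, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the prefix "Trojan.Foo" and the suffix "A", the helper produces the name Trojan.Foo.A listed above.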

2.2.3. Patterns

Logical signatures use .ndb-style patterns. An example of how to define these is shown in Program 2.


SIGNATURES_DECL_BEGIN
DECLARE_SIGNATURE(magic)
DECLARE_SIGNATURE(check)
DECLARE_SIGNATURE(zero)
SIGNATURES_DECL_END

SIGNATURES_DEF_BEGIN
DEFINE_SIGNATURE(magic, "EP+0:aabb")
DEFINE_SIGNATURE(check, "f00d")
DEFINE_SIGNATURE(zero, "ffff")
SIGNATURES_END

Program 2: Declaring patterns

Each pattern has a name (like a variable) and a string that is the hex pattern itself. The declarations are delimited by the macros SIGNATURES_DECL_BEGIN and SIGNATURES_DECL_END. The definitions are delimited by the macros SIGNATURES_DEF_BEGIN and SIGNATURES_END. Declarations must always come before definitions, and you can have only one declaration section and one definition section. (Think of declarations as variable declarations and of definitions as variable assignments, since that is what they are under the hood.) The order in which you declare the signatures is the order in which they appear in the generated logical signature.

You can use any name for the patterns that is a valid record field name in C, and doesn’t conflict with anything else declared.

After using the above macros, the global variable Signatures will have three new fields: magic, check, and zero. These can be used as arguments to the functions count_match() and matches() anywhere in the program, as shown in Program 3.

The condition in the if can be interpreted as: if the magic signature has matched at least once, and the number of times the zero signature matched is higher than the number of times the check signature matched, then we have found virus A; otherwise the file is clean.


int entrypoint(void)
{
  if (matches(Signatures.magic) && count_match(Signatures.zero) > count_match(Signatures.check))
    foundVirus("A");
  return 0;
}

Program 3: Using patterns
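The condition in Program 3 can be tried out with an ordinary C compiler using the following sketch. It is only a model: the three parameters stand in for the match counts of the patterns declared as magic, zero, and check in Program 2, and "matched at least once" is modelled as a count greater than zero. The function name program3_detects_a is hypothetical.

```c
#include <stdbool.h>

/* Plain-C model of the Program 3 condition. The three counts stand in for
 * the match counts of the magic, zero and check patterns; "has matched at
 * least once" is modelled as count > 0. */
static bool program3_detects_a(unsigned magic, unsigned zero, unsigned check)
{
    bool magic_matched = magic > 0;
    return magic_matched && zero > check;
}
```

For example, counts of 1, 3, and 2 report virus A, while equal zero and check counts leave the file clean.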

2.2.4. Single subsignature

The simplest logical signature is like a .ndb signature: a virus name, a signature target, 0 as the logical expression, and an ndb-style pattern.

The code for this is shown in Program 4.


 
/* Declare the prefix of the virusname */
VIRUSNAME_PREFIX("Trojan.Foo")
/* Declare the suffix of the virusname */
VIRUSNAMES("A")
/* Declare the signature target type (1 = PE) */
TARGET(1)

/* Declare the name of all subsignatures used */
SIGNATURES_DECL_BEGIN
DECLARE_SIGNATURE(magic)
SIGNATURES_DECL_END

/* Define the pattern for each subsignature */
SIGNATURES_DEF_BEGIN
DEFINE_SIGNATURE(magic, "aabb")
SIGNATURES_END

/* All bytecode triggered by logical signatures must have this
   function */
bool logical_trigger(void)
{
  /* return true if the magic subsignature matched,
     its pattern is defined above to "aabb" */
  return matches(Signatures.magic);
}

/* This is the bytecode function that is actually executed when the logical
   signature matched */
int entrypoint(void)
{
  /* call this function to set the suffix of the virus found */
  foundVirus("A");
  /* success, return 0 */
  return 0;
}

Program 4: Single subsignature example

The logical signature (created by the compiler) looks like this: Trojan.Foo.{A};Target:1;0;aabb

Of course you should use a plain .ldb signature when all the processing in entrypoint is only setting a virusname and returning. However, you can do more complex checks in entrypoint once the bytecode has been triggered by logical_trigger.

In the example in Program 4 the pattern was used without an anchor; such a pattern matches at any offset. You can use offsets though, the same way as in .ndb signatures, see Program 5 for an example.

2.2.5. Multiple subsignatures

An example for this is shown in Program 5. Here you see the following new features used:

The logical signature looks like this:

Test.{A,B};Target:1;(((0|1|2)=42,2)|(3=10));EP+0:aabb;ffff;aaccee;f00d;dead

Notice how the subsignature that is not used in the logical expression (number 4, dead) is used in entrypoint to decide the virus name. This works because ClamAV does collect the match counts for all subsignatures (regardless if they are used or not in a signature). The count_match(Signatures.check2) call is thus a simple memory read of the count already determined by ClamAV.
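The split between the trigger expression and the entrypoint decision can be modelled in ordinary C. The sketch below assumes the subsignature indices follow the printed signature (0 = magic, 1 = zero, 2 = fivetoten, 3 = check, 4 = check2) and represents the match counts as a plain array; the names trigger_fires and pick_suffix are hypothetical and not part of the ClamAV API.

```c
#include <stdbool.h>
#include <stddef.h>

#define NUM_SUBSIGS 5

/* Plain-C model of the expression ((0|1|2)=42,2)|(3=10).
 * counts[i] is the number of times subsignature i matched. */
static bool trigger_fires(const unsigned counts[NUM_SUBSIGS])
{
    unsigned total = counts[0] + counts[1] + counts[2];
    unsigned distinct = (counts[0] > 0) + (counts[1] > 0) + (counts[2] > 0);

    if (total == 42 && distinct == 2) /* mirrors the == 2 test in Program 5 */
        return true;
    return counts[3] == 10;
}

/* Suffix chosen by entrypoint from subsignature 4; NULL means "clean". */
static const char *pick_suffix(const unsigned counts[NUM_SUBSIGS])
{
    if (counts[4] < 2)
        return NULL;
    return counts[4] == 2 ? "A" : "B";
}
```

Note that pick_suffix reads only counts[4], which never appears in trigger_fires: the count is available to entrypoint whether or not the trigger expression uses it.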

Also notice that comments can be used freely: they are ignored by the compiler. You can use either C-style multiline comments (start comment with /*, end with */), or C++-style single-line comments (start comment with //, automatically ended by newline).


 
/* You are only allowed to set these virusnames as found */
VIRUSNAME_PREFIX("Test")
VIRUSNAMES("A", "B")
TARGET(1)

SIGNATURES_DECL_BEGIN
DECLARE_SIGNATURE(magic)
DECLARE_SIGNATURE(zero)
DECLARE_SIGNATURE(fivetoten)
DECLARE_SIGNATURE(check)
DECLARE_SIGNATURE(check2)
SIGNATURES_DECL_END

SIGNATURES_DEF_BEGIN
DEFINE_SIGNATURE(magic, "EP+0:aabb")
DEFINE_SIGNATURE(zero, "ffff")
DEFINE_SIGNATURE(fivetoten, "aaccee")
DEFINE_SIGNATURE(check, "f00d")
DEFINE_SIGNATURE(check2, "dead")
SIGNATURES_END

bool logical_trigger(void)
{
    unsigned sum_matches = count_match(Signatures.magic)+
        count_match(Signatures.zero) + count_match(Signatures.fivetoten);
    unsigned unique_matches = matches(Signatures.magic)+
            matches(Signatures.zero)+ matches(Signatures.fivetoten);
    if (sum_matches == 42 && unique_matches == 2) {
        // The above 3 signatures have matched a total of 42 times, and at least
        // 2 of them have matched
        return true;
    }
    // If the check signature matches 10 times we still have a match
    if (count_match(Signatures.check) == 10)
        return true;
    // No match
    return false;
}

int entrypoint(void)
{
    unsigned count = count_match(Signatures.check2);
    if (count >= 2) {
        // equivalent to: foundVirus(count == 2 ? "A" : "B");
        if (count == 2)
            foundVirus("A");
        else
            foundVirus("B");
    }
    return 0;
}

Program 5: Multiple subsignatures

2.2.6. W32.Polipos.A detector rewritten as bytecode

2.2.7. Virut detector in bytecode

2.3. Writing regular expressions in bytecode

ClamAV only supports a limited set of regular expressions in .ndb format: wildcards. The bytecode compiler allows you to compile fully generic regular expressions to bytecode directly. When libclamav loads the bytecode, it will compile it to native code (if using the JIT), so it should offer quite good performance.

The compiler currently uses re2c to compile regular expressions to C code, and then compile that to bytecode. The internal workings are all transparent to the user: the compiler automatically uses re2c when needed, and re2c is embedded in the compiler, so you don’t need to install it.

The syntax of regular expressions is similar to the one used by POSIX regular expressions, except that you have to quote literals, since unquoted they are interpreted as regular expression names.

2.3.1. A very simple regular expression

Let’s start with a simple example, to match this POSIX regular expression: eval([a-zA-Z_][a-zA-Z0-9_]*\.unescape.

See Program 6, which omits the virusname and logical signature declarations.


 
int entrypoint(void)
{
    REGEX_SCANNER;
    seek(0, SEEK_SET);
    for (;;) {
        REGEX_LOOP_BEGIN

        /*!re2c
           ANY = [^];

           "eval("[a-zA-Z_][a-zA-Z_0-9]*".unescape" {
              long pos = REGEX_POS;
              if (pos < 0)
                continue;
              debug("unescape found at:");
              debug(pos);
           }
           ANY  { continue; }
        */
    }
    return 0;
}

Program 6: Simple regular expression example

Several new features are introduced here. Here is a step-by-step breakdown:

REGEX_SCANNER this declares the data structures needed by the regular expression matcher
seek(0, SEEK_SET) this sets the current file offset to position 0, matching will start at this position. For offset 0 it is not strictly necessary to do this, but it serves as a reminder that you might want to start matching somewhere, that is not necessarily 0.
for(;;) { REGEX_LOOP_BEGIN this creates the regular expression matcher main loop. It takes the current file byte-by-byte and tries to match one of the regular expressions. (It is not really reading byte-by-byte; it uses a buffer to speed things up.)
/*!re2c This marks the beginning of the regular expression description. The entire regular expression block is a C comment, starting with !re2c
ANY = [^]; This declares a regular expression named ANY that matches any byte.
"eval("[a-zA-Z_][a-zA-Z_0-9]*".unescape" { This is the actual regular expression.
"eval(" This matches the literal string eval(. Literals have to be placed in double quotes " here, unlike in POSIX regular expressions or PCRE. If you want case-insensitive matching, you can use .
[a-zA-Z_] This is a character class, it matches any lowercase, uppercase or _ characters.
[a-zA-Z_0-9]* Same as before, but with repetition. * means match zero or more times, + means match one or more times, just like in POSIX regular expressions.
".unescape" A literal string again
{ start of the action block for this regular expression. Whenever the regular expression matches, the attached C code is executed.
long pos = REGEX_POS; this determines the absolute file offset where the regular expression has matched. Note that because the regular expression matcher uses a buffer, using just seek(0, SEEK_CUR) would give the current position of the end of that buffer, and not the current position during regular expression matching. You have to use the REGEX_POS macro to get the correct position.
debug(...) Shows a debug message about what was found and where. This is extremely helpful when you start writing regular expressions, and nothing works: you can determine whether your regular expression matched at all, and if it matched where you thought it would. There is also a DEBUG_PRINT_MATCH that prints the entire matched string to the debug output. Of course before publishing the bytecode you might want to turn off these debug messages.
} closes the action block for this regular expression
ANY { continue; } If none of the regular expressions matched so far, just keep running the matcher, at the next byte
*/ closes the regular expression description block
} closes the for() loop

You may have multiple regular expressions, or declare multiple regular expressions with a name, and use those names to build more complex regular expressions.
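To see what the scanner in Program 6 has to recognize, the following hand-written matcher in ordinary C looks for the same pattern in a memory buffer. It is a sketch for illustration only: it does not use re2c or the REGEX_* macros, the function names are hypothetical, and it assumes the reported position is the offset where the match starts (the manual does not say which end of the match REGEX_POS refers to).

```c
#include <stddef.h>
#include <string.h>

static int is_ident_start(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static int is_ident_char(unsigned char c)
{
    return is_ident_start(c) || (c >= '0' && c <= '9');
}

/* Matches "eval(" [a-zA-Z_] [a-zA-Z_0-9]* ".unescape".
 * Returns the offset of the first match in buf[0..len), or -1 if none. */
static long find_eval_unescape(const unsigned char *buf, size_t len)
{
    static const char head[] = "eval(";
    static const char tail[] = ".unescape";
    const size_t hlen = sizeof(head) - 1, tlen = sizeof(tail) - 1;

    for (size_t start = 0; start + hlen <= len; start++) {
        if (memcmp(buf + start, head, hlen) != 0)
            continue;                 /* plays the role of ANY { continue; } */
        size_t i = start + hlen;
        if (i >= len || !is_ident_start(buf[i]))
            continue;
        i++;
        while (i < len && is_ident_char(buf[i]))
            i++;                      /* the [a-zA-Z_0-9]* repetition */
        if (i + tlen <= len && memcmp(buf + i, tail, tlen) == 0)
            return (long)start;
    }
    return -1;
}
```

Consuming the identifier greedily is safe here because the character that follows it, a dot, can never be part of the identifier, so no backtracking is needed.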

2.3.2. Named regular expressions

2.4. Writing unpackers

2.4.1. Structure of a bytecode for unpacking (and other hooks)

When writing an unpacker, the bytecode should consist of:

2.4.2. Detecting clam.exe via bytecode

Example provided by aCaB:

2.4.3. Detecting clam.exe via bytecode (disasm)

Example provided by aCaB:

2.4.4. A simple unpacker

2.4.5. Matching PDF javascript

2.4.6. YC unpacker rewritten as bytecode

Chapter 3
Usage

3.1. Invoking the compiler

Compiling is similar to gcc:

$ /usr/local/clamav/bin/clambc-compiler foo.c -o foo.cbc -O2

This will compile the file foo.c into a file called foo.cbc, that can be loaded by ClamAV, and packed inside a .cvd file.

The compiler by default has all warnings turned on.

Supported optimization levels: -O0, -O1, -O2, -O3. It is recommended that you always compile with at least -O1.

Warning options: -Werror (transforms all warnings into errors).

Preprocessor flags:

-I <directory>
Searches in the given directory when it encounters a #include "headerfile" directive in the source code, in addition to the system defined header search directories.
-D <MACRONAME>=<VALUE>
Predefine given <MACRONAME> to be equal to <VALUE>.
-U <MACRONAME>
Undefine a predefined macro

The compiler also supports some other commandline options (see clambc-compiler --help for a full list), however some of them have no effect when using the ClamAV bytecode backend (such as the X86 backend options). You shouldn’t need to use any flags not documented above.

3.1.1. Compiling C++ files

Filenames with a .cpp extension are compiled as C++ files, however clang++ is not yet ready for production use, so this is EXPERIMENTAL currently. For now write bytecodes in C.

3.2. Running compiled bytecode

After compiling a C source file to bytecode, you can load it in ClamAV:

3.2.1. ClamBC

ClamBC is a tool you can use to test whether the bytecode loads, compiles, and can execute its entrypoint successfully. Usage:

 clambc <file> [function] [param1 ...]

For example loading a simple bytecode with 2 functions is done like this:

$ clambc foo.cbc  
LibClamAV debug: searching for unrar, user-searchpath: /usr/local/lib  
LibClamAV debug: unrar support loaded from libclamunrar_iface.so.6.0.4 libclamunrar_iface_so_6_0  
LibClamAV debug: bytecode: Parsed 0 APIcalls, maxapi 0  
LibClamAV debug: Parsed 1 BBs, 2 instructions  
LibClamAV debug: Parsed 1 BBs, 2 instructions  
LibClamAV debug: Parsed 2 functions  
Bytecode loaded  
Running bytecode function :0  
Bytecode run finished  
Bytecode returned: 0x8  
Exiting

3.2.2. clamscan, clamd

You can tell clamscan to load the bytecode as a database directly:

$ clamscan -dfoo.cbc

Or you can instruct it to load all databases from a directory, then clamscan will load all supported formats, including files with bytecode, which have the .cbc extension.

$ clamscan -ddirectory

You can also put the bytecode files into the default database directory of ClamAV (usually /usr/local/share/clamav) to have it loaded automatically from there. Of course, the bytecode can be stored inside CVD files, too.

3.3. Debugging bytecode

3.3.1. “printf” style debugging

Printf and printf-like format specifiers are not supported in the bytecode. You can use these functions instead of printf to print strings and integers to clamscan’s --debug output:

debug_print_str, debug_print_uint, debug_print_str_start, debug_print_str_nonl.

You can also use the debug convenience wrapper, which automatically prints its argument as a string or as an integer depending on the parameter type.

See Program 7 for an example.


 
/* test debug APIs */
int entrypoint(void)
{
  /* print a debug message, followed by newline */
  debug_print_str("bytecode started", 16);

  /* start a new debug message, dont end with newline yet */
  debug_print_str_start("Engine functionality level: ", 28);
  /* print an integer, no newline */
  debug_print_uint(engine_functionality_level());
  /* print a string without starting a new debug message, and without
     terminating with newline */
  debug_print_str_nonl(", dconf functionality level: ", 29);
  debug_print_uint(engine_dconf_level());
  debug_print_str_nonl("\n", 1);
  debug_print_str_start("Engine scan options: ", 21);
  debug_print_uint(engine_scan_options());
  debug_print_str_nonl(", db options: ", 14);
  debug_print_uint(engine_db_options());
  debug_print_str_nonl("\n", 1);

  /* convenience wrapper to just print a string */
  debug("just print a string");
  /* convenience wrapper to just print an integer */
  debug(4);
  return 0xf00d;
}

Program 7: Example of using debug APIs
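Because the string functions take an explicit length, it is easy to miscount by one when a literal ends in a space. One way to avoid counting by hand is to let the compiler compute the length. The sketch below shows the idea in ordinary C: debug_print_str_nonl here is a stand-in that captures its output in a buffer (it is not the ClamAV implementation), and the macro DEBUG_LITERAL_NONL is a hypothetical helper, not something the bytecode headers are known to provide.

```c
#include <stdint.h>
#include <string.h>

/* Stand-in for the ClamAV debug API so the sketch builds with a normal C
 * compiler: it appends len bytes to a capture buffer instead of printing. */
static char captured[128];
static size_t captured_len;

static uint32_t debug_print_str_nonl(const uint8_t *str, uint32_t len)
{
    if (captured_len + len >= sizeof(captured))
        return 0;
    memcpy(captured + captured_len, str, len);
    captured_len += len;
    captured[captured_len] = '\0';
    return len;
}

/* Hypothetical convenience macro: sizeof on a string literal includes the
 * terminating NUL, so subtract 1 to get the number of characters. */
#define DEBUG_LITERAL_NONL(s) \
    debug_print_str_nonl((const uint8_t *)(s), (uint32_t)(sizeof(s) - 1))
```

The macro only works with string literals and arrays; passing a pointer would make sizeof return the pointer size instead of the string length.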

3.3.2. Single-stepping

If you have GDB 7.0 (or newer) you can single-step during the execution of the bytecode, assuming you have JIT support. (This is not yet implemented in libclamav.)

Chapter 4
ClamAV bytecode language

The bytecode that ClamAV loads is a simplified form of the LLVM Intermediate Representation, and as such it is language-independent.

However, currently the only supported language from which such bytecode can be generated is a simplified form of C.

The language supported by the ClamAV bytecode compiler is a restricted set of C99 with some GNU extensions.

4.1. Differences from C99 and GNU C

These restrictions are enforced at compile time:

They are meant to ensure the following:

These restrictions are checked at runtime (checks are inserted at compile time):

The ClamAV API header has further restrictions; see the Internals manual.

Although the bytecode undergoes a series of automated tests (see the Publishing chapter in the Internals manual), the above restrictions don’t guarantee that the resulting bytecode will execute correctly! You must still test the code yourself; these restrictions only avoid the most common errors. Although the compiler and verifier aim to accept only code that won’t crash ClamAV, no code is 100% perfect, and a bug in the verifier could allow unsafe code to be executed by ClamAV.

4.2. Limitations

The bytecode format has the following limitations:

4.3. Logical signatures

Logical signatures can be used as triggers for executing a bytecode. Instead of describing a logical signature as a .ldb pattern, you use C code which is then translated to a .ldb-style logical signature.

Logical signatures in ClamAV support the following operations:

Of the above operations, the ClamAV Bytecode Compiler doesn’t support computing sums of nested subexpressions (it does support nesting, though).

The C code that can be converted into a logical signature must obey these restrictions:

The compiler does the following transformations (not necessarily in this order):

If after this transformation the program meets the requirements outlined above, then it is converted to a logical signature. The resulting logical signature is simplified using basic properties of boolean operations, such as associativity, distributivity, De Morgan’s law.

The final logical signature is not unique (there might be another logical signature with identical behavior), however the boolean part is in a canonical form: it is in disjunctive normal form, with operands sorted in ascending order.

For best results the C code should consist of:

You can use || in the if condition too, but be careful that after expanding to disjunctive normal form, the number of subexpressions doesn’t exceed 64.
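To get a feel for how quickly the expansion grows, the following sketch in ordinary C counts the AND-terms produced when a condition of the form (a1 || a2 || ...) && (b1 || b2 || ...) && ... is expanded to disjunctive normal form. It is a back-of-the-envelope model only: dnf_terms is a hypothetical helper, and it assumes that the limit of 64 applies to the number of such terms, which the manual does not spell out.

```c
#include <stddef.h>

/* Number of AND-terms produced when a conjunction of OR-clauses is expanded
 * to disjunctive normal form: the product of the clause sizes. Saturates at
 * limit + 1 so the multiplication cannot overflow. */
static unsigned dnf_terms(const unsigned *clause_sizes, size_t n, unsigned limit)
{
    unsigned terms = 1;
    for (size_t i = 0; i < n; i++) {
        if (clause_sizes[i] == 0)
            return 0;                        /* an empty clause is never true */
        if (terms > limit / clause_sizes[i])
            return limit + 1;                /* product would exceed limit */
        terms *= clause_sizes[i];
    }
    return terms;
}
```

Six two-way || clauses joined by && already expand to exactly 64 terms under this model, and a seventh pushes the count past the limit.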

Note that you do not have to use all the subsignatures you declared in logical_trigger, you can do more complicated checks (that wouldn’t obey the above restrictions) in the bytecode itself at runtime. The logical_trigger function is fully compiled into a logical signature, it won’t be a runtime executed function (hence the restrictions).

4.4. Headers and runtime environment

When compiling a bytecode program, bytecode.h is automatically included, so you don’t need to explicitly include it. These headers (and the compiler itself) predefine certain macros; see the appendix for a full list. In addition, the following types are defined:

typedef unsigned char uint8_t;
typedef char int8_t;
typedef unsigned short uint16_t;
typedef short int16_t;
typedef unsigned int uint32_t;
typedef int int32_t;
typedef unsigned long uint64_t;
typedef long int64_t;
typedef unsigned int size_t;
typedef int off_t;
typedef struct signature { unsigned id; } __Signature;

As described in Section 4.1, the width of each integer type is fixed; the above typedefs show that.

A bytecode’s entrypoint is the function entrypoint and it’s required by ClamAV to load the bytecode.

Bytecode that is triggered by a logical signature must have a list of virusnames and patterns defined. Bytecodes triggered via hooks can optionally have them, but for example a PE unpacker doesn’t need virus names as it only processes the data.

Chapter 5
Bytecode security & portability

Chapter 6
Reporting bugs

Chapter 7
Bytecode API

7.1. API groups

7.1.1. Bytecode configuration

Global COPYRIGHT(c) This will also prevent the sourcecode from being embedded into the bytecode

Global DECLARE_SIGNATURE(name)

Global DEFINE_SIGNATURE(name, hex)

Global FUNCTIONALITY_LEVEL_MAX(m)

Global FUNCTIONALITY_LEVEL_MIN(m)

Global ICONGROUP1(group)

Global ICONGROUP2(group)

Global PDF_HOOK_DECLARE This hook is called several times, use pdf_get_phase() to find out in which phase you got called.

Global PE_HOOK_DECLARE

Global PE_UNPACKER_DECLARE

Global SIGNATURES_DECL_BEGIN

Global SIGNATURES_DECL_END

Global SIGNATURES_DEF_BEGIN

Global SIGNATURES_END

Global TARGET(tgt)

Global VIRUSNAME_PREFIX(name)

Global VIRUSNAMES(...)

7.1.2. Data structure handling functions

Global buffer_pipe_done(int32_t id) After this all attempts to use this buffer will result in error. All buffer_pipes are automatically deallocated when bytecode finishes execution.

Global buffer_pipe_new(uint32_t size)

Global buffer_pipe_new_fromfile(uint32_t pos)

Global buffer_pipe_read_avail(int32_t id)

Global buffer_pipe_read_get(int32_t id, uint32_t amount) The ’amount’ parameter should be obtained by a call to buffer_pipe_read_avail().

Global buffer_pipe_read_stopped(int32_t id, uint32_t amount) Updates read cursor in buffer_pipe.

Global buffer_pipe_write_avail(int32_t id)

Global buffer_pipe_write_get(int32_t id, uint32_t size) Returns pointer to writable buffer. The ’size’ parameter should be obtained by a call to buffer_pipe_write_avail().

Global buffer_pipe_write_stopped(int32_t id, uint32_t amount)

Global cli_readint16(const void *buff)

Global cli_readint32(const void *buff)

Global cli_writeint32(void *offset, uint32_t v)

Global hashset_add(int32_t hs, uint32_t key)

Global hashset_contains(int32_t hs, uint32_t key)

Global hashset_done(int32_t id) Trying to use the hashset after this will result in an error. All hashsets are automatically deallocated when bytecode finishes execution.

Global hashset_empty(int32_t id)

Global hashset_new(void)

Global hashset_remove(int32_t hs, uint32_t key)

Global inflate_done(int32_t id)

Global inflate_init(int32_t from_buffer, int32_t to_buffer, int32_t windowBits) Decompresses data from ’from_buffer’, writing uncompressed data to ’to_buffer’.

Global inflate_process(int32_t id)

Global le16_to_host(uint16_t v)

Global le32_to_host(uint32_t v)

Global le64_to_host(uint64_t v)

Global malloc(uint32_t size)

Global map_addkey(const uint8_t *key, int32_t ksize, int32_t id)

Global map_done(int32_t id)

Global map_find(const uint8_t *key, int32_t ksize, int32_t id)

Global map_getvalue(int32_t id, int32_t size)

Global map_getvaluesize(int32_t id)

Global map_new(int32_t keysize, int32_t valuesize)

Global map_remove(const uint8_t *key, int32_t ksize, int32_t id)

Global map_setvalue(const uint8_t *value, int32_t vsize, int32_t id)

7.1.3. Disassemble APIs

Class DIS_arg

Class DIS_fixed

Class DIS_mem_arg

Global disasm_x86(struct DISASM_RESULT *result, uint32_t len)

Global DisassembleAt(struct DIS_fixed *result, uint32_t offset, uint32_t len)

7.1.4. Engine queries

Global count_match(__Signature sig)

Global engine_db_options(void)

Global engine_dconf_level(void)

Global engine_functionality_level(void)

Global engine_scan_options(void)

Global match_location(__Signature sig, uint32_t goback)

Global match_location_check(__Signature sig, uint32_t goback, const char *static_start, uint32_t static_len) It is recommended to use this for safety and compatibility with 0.96.1

Global matches(__Signature sig)

7.1.5. Environment detection functions

Global __is_bigendian(void) __attribute__((const )) __attribute__((nothrow))

Global check_platform(uint32_t a, uint32_t b, uint32_t c)

Global disable_bytecode_if(const int8_t *reason, uint32_t len, uint32_t cond)

Global disable_jit_if(const int8_t *reason, uint32_t len, uint32_t cond)

Global get_environment(struct cli_environment *env, uint32_t len)

Global version_compare(const uint8_t *lhs, uint32_t lhs_len, const uint8_t *rhs, uint32_t rhs_len)

7.1.6. File operations

Global buffer_pipe_new_fromfile(uint32_t pos) to the current file, at the specified position.

Global file_byteat(uint32_t offset)

Global file_find(const uint8_t *data, uint32_t len)

Global file_find_limit(const uint8_t *data, uint32_t len, int32_t maxpos)

Global fill_buffer(uint8_t *buffer, uint32_t len, uint32_t *filled, uint32_t *cursor, uint32_t fill)

Global getFilesize(void)

Global read(uint8_t *data, int32_t size)

Global read_number(uint32_t radix) Non-numeric characters are ignored.

Global seek(int32_t pos, uint32_t whence)

Global write(uint8_t *data, int32_t size)

7.1.7. Global variables

Global __clambc_filesize[1]

Global __clambc_kind

Global __clambc_match_counts[64]

Global __clambc_match_offsets[64]

Global __clambc_pedata

7.1.8. Icon matcher APIs

Global matchicon(const uint8_t *group1, int32_t group1_len, const uint8_t *group2, int32_t group2_len)

7.1.9. JS normalize API

Global jsnorm_done(int32_t id)

Global jsnorm_init(int32_t from_buffer)

Global jsnorm_process(int32_t id)

7.1.10. Math functions

Global icos(int32_t a, int32_t b, int32_t c)

Global iexp(int32_t a, int32_t b, int32_t c)

Global ilog2(uint32_t a, uint32_t b)

Global ipow(int32_t a, int32_t b, int32_t c)

Global isin(int32_t a, int32_t b, int32_t c)

7.1.11. PDF handling functions

Global pdf_get_dumpedobjid(void) Valid only in PDF_PHASE_POSTDUMP.

Global pdf_get_flags(void)

Global pdf_get_obj_num(void)

Global pdf_get_phase(void) Identifies at which phase this bytecode was called.

Global pdf_getobj(int32_t objidx, uint32_t amount) Meant only for reading, write modifies the fmap buffer, so avoid!

Global pdf_getobjsize(int32_t objidx)

Global pdf_lookupobj(uint32_t id)

Global pdf_set_flags(int32_t flags)

7.1.12. PE functions

Class cli_exe_info

Class cli_exe_section

Class cli_pe_hook_data

Global get_pe_section(struct cli_exe_section *section, uint32_t num)

Global getEntryPoint(void)

Global getExeOffset(void)

Global getImageBase(void)

Global getNumberOfSections(void)

Global getPEBaseOfCode(void)

Global getPEBaseOfData(void)

Global getPECharacteristics()

Global getPECheckSum(void)

Global getPEDataDirRVA(unsigned n)

Global getPEDataDirSize(unsigned n)

Global getPEDllCharacteristics(void)

Global getPEFileAlignment(void)

Global getPEImageBase(void)

Global getPEisDLL()

Global getPELFANew(void)

Global getPELoaderFlags(void)

Global getPEMachine()

Global getPEMajorImageVersion(void)

Global getPEMajorLinkerVersion(void)

Global getPEMajorOperatingSystemVersion(void)

Global getPEMajorSubsystemVersion(void)

Global getPEMinorImageVersion(void)

Global getPEMinorLinkerVersion(void)

Global getPEMinorOperatingSystemVersion(void)

Global getPEMinorSubsystemVersion(void)

Global getPENumberOfSymbols()

Global getPEPointerToSymbolTable()

Global getPESectionAlignment(void)

Global getPESizeOfCode(void)

Global getPESizeOfHeaders(void)

Global getPESizeOfHeapCommit(void)

Global getPESizeOfHeapReserve(void)

Global getPESizeOfImage(void)

Global getPESizeOfInitializedData(void)

Global getPESizeOfOptionalHeader()

Global getPESizeOfStackCommit(void)

Global getPESizeOfStackReserve(void)

Global getPESizeOfUninitializedData(void)

Global getPESubsystem(void)

Global getPETimeDateStamp()

Global getPEWin32VersionValue(void)

Global getSectionRVA(unsigned i)

Global getSectionVirtualSize(unsigned i)

Global getVirtualEntryPoint(void)

Global hasExeInfo(void)

Global hasPEInfo(void)

Global isPE64(void)

Class pe_image_data_dir

Class pe_image_file_hdr

Class pe_image_optional_hdr32

Class pe_image_optional_hdr64

Class pe_image_section_hdr

Global pe_rawaddr(uint32_t rva)

Global readPESectionName(unsigned char name[8], unsigned n)

Global readRVA(uint32_t rva, void *buf, size_t bufsize)

7.1.13. Scan control functions

Global bytecode_rt_error(int32_t locationid)

Global extract_new(int32_t id)

Global extract_set_container(uint32_t container)

Global foundVirus(const char *virusname)

Global input_switch(int32_t extracted_file)

Global setvirusname(const uint8_t *name, uint32_t len)

7.1.14. String operations

Global atoi(const uint8_t *str, int32_t size)

Global debug_print_str(const uint8_t *str, uint32_t len)

Global debug_print_str_nonl(const uint8_t *str, uint32_t len)

Global debug_print_str_start(const uint8_t *str, uint32_t len)

Global debug_print_uint(uint32_t a)

Global entropy_buffer(uint8_t *buffer, int32_t size)

Global hex2ui(uint32_t hex1, uint32_t hex2)

Global memchr(const void *s, int c, size_t n)

Global memcmp(const void *s1, const void *s2, uint32_t n)

Global memcpy(void *restrict dst, const void *restrict src, uintptr_t n)

Global memmove(void *dst, const void *src, uintptr_t n)

Global memset(void *src, int c, uintptr_t n)

Global memstr(const uint8_t *haystack, int32_t haysize, const uint8_t *needle, int32_t needlesize)
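
The parameter list of memstr suggests a length-bounded substring search over raw bytes. The following is a minimal standalone sketch of such a search, not the ClamAV implementation. The name memstr_sketch is hypothetical, and the return convention (offset of the first match, or -1 if there is none) is an assumption that this listing does not confirm.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: find the first occurrence of needle in haystack.
 * Both buffers are raw bytes with explicit sizes, so embedded NUL bytes
 * are handled like any other value.
 * Assumed return convention: offset of the match, or -1 if not found. */
int32_t memstr_sketch(const uint8_t *haystack, int32_t haysize,
                      const uint8_t *needle, int32_t needlesize)
{
    if (!haystack || !needle || haysize < 0 || needlesize <= 0 ||
        needlesize > haysize)
        return -1;

    /* Try every start position at which the needle still fits. */
    for (int32_t i = 0; i <= haysize - needlesize; i++) {
        if (memcmp(haystack + i, needle, (size_t)needlesize) == 0)
            return i;
    }
    return -1;
}
```

Passing the sizes explicitly matters here because scanned data is binary and cannot be treated as a NUL-terminated string.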

7.2. Structure types

7.2.1. cli_exe_info Struct Reference

Data Fields
7.2.1.1. Detailed Description

Executable file information

PE

7.2.1.2. Field Documentation

uint32_t ep Entrypoint of executable

uint32_t hdr_size Address size - PE ONLY

uint16_t nsections Number of sections

uint32_t offset Offset where this executable starts in the file (nonzero if embedded)

uint32_t res_addr Resources RVA - PE ONLY

struct cli_exe_section *section Information about all the sections of this file. This array has nsections elements

7.2.2. cli_exe_section Struct Reference

Data Fields
7.2.2.1. Detailed Description

Section of executable file.

PE

7.2.2.2. Field Documentation

uint32_t chr Section characteristics

uint32_t raw Raw offset (in file)

uint32_t rsz Raw size (in file)

uint32_t rva Relative VirtualAddress

uint32_t uraw PE - unaligned PointerToRawData

uint32_t ursz PE - unaligned SizeOfRawData

uint32_t urva PE - unaligned VirtualAddress

uint32_t uvsz PE - unaligned VirtualSize

uint32_t vsz VirtualSize
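
The rva, vsz, raw and rsz fields are enough to translate an RVA into a file offset, which is presumably the kind of lookup that pe_rawaddr performs. The sketch below is standalone C under stated assumptions: struct section_sketch is a local mirror of four cli_exe_section fields, rva_to_raw is a hypothetical helper, and SKETCH_INVALID_RVA is a made-up sentinel standing in for PE_INVALID_RVA, whose value is not given in this manual.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "no section contains this RVA". */
#define SKETCH_INVALID_RVA 0xFFFFFFFFu

/* Local mirror of the cli_exe_section fields used by this sketch. */
struct section_sketch {
    uint32_t rva; /* Relative VirtualAddress */
    uint32_t vsz; /* VirtualSize */
    uint32_t raw; /* Raw offset (in file) */
    uint32_t rsz; /* Raw size (in file) */
};

/* Translate an RVA to a file offset by searching the section table. */
uint32_t rva_to_raw(const struct section_sketch *s, size_t n, uint32_t rva)
{
    for (size_t i = 0; i < n; i++) {
        if (rva < s[i].rva)
            continue;
        uint32_t off = rva - s[i].rva;
        /* The RVA must lie inside the section and inside its on-disk data. */
        if (off < s[i].vsz && off < s[i].rsz)
            return s[i].raw + off;
    }
    return SKETCH_INVALID_RVA;
}
```

An RVA that falls in the part of a section with no on-disk data (offset at or beyond rsz) is reported as invalid in this sketch, since there are no file bytes to read there.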

7.2.3. cli_pe_hook_data Struct Reference

Data Fields
7.2.3.1. Detailed Description

Data for the bytecode PE hook

PE

7.2.3.2. Field Documentation

struct pe_image_data_dir dirs[16] PE data directory header

uint32_t e_lfanew address of new exe header

uint32_t ep EntryPoint as file offset

struct pe_image_file_hdr file_hdr Header for this PE file

uint32_t hdr_size internally needed by rawaddr

uint16_t nsections Number of sections

struct pe_image_optional_hdr32 opt32 32-bit PE optional header

struct pe_image_optional_hdr64 opt64 64-bit PE optional header

uint32_t overlays number of overlays

int32_t overlays_sz size of overlays

7.2.4. DIS_arg Struct Reference

Data Fields
7.2.4.1. Detailed Description

disassembled operand

Disassemble

7.2.4.2. Field Documentation

enum DIS_SIZE access_size size of access

enum DIS_ACCESS access_type type of access

struct DIS_mem_arg mem memory operand

uint64_t other other operand

enum X86REGS reg register operand

7.2.5. DIS_fixed Struct Reference

Data Fields
7.2.5.1. Detailed Description

disassembled instruction.

Disassemble

7.2.5.2. Field Documentation

enum DIS_SIZE address_size size of address

enum DIS_SIZE operation_size size of operation

uint8_t segment segment

enum X86OPS x86_opcode opcode of X86 instruction

7.2.6. DIS_mem_arg Struct Reference

Data Fields
7.2.6.1. Detailed Description

disassembled memory operand: scale_reg*scale + add_reg + displacement

Disassemble

7.2.6.2. Field Documentation

enum DIS_SIZE access_size size of access

enum X86REGS add_reg register used as displacement

int32_t displacement displacement as immediate number

uint8_t scale scale as immediate number

enum X86REGS scale_reg register used as scale
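
The operand formula scale_reg*scale + add_reg + displacement can be evaluated once the register contents are known. The following standalone sketch illustrates the arithmetic only. effective_address is a hypothetical helper, and the register values are passed in as plain integers because DIS_mem_arg stores register identifiers rather than contents.

```c
#include <stdint.h>

/* Hypothetical sketch: effective address of a memory operand,
 * scale_reg*scale + add_reg + displacement.
 * scale_reg_value and add_reg_value are the assumed contents of the
 * registers named by scale_reg and add_reg.
 * Unsigned arithmetic wraps modulo 2^32, like 32-bit x86 addressing. */
uint32_t effective_address(uint32_t scale_reg_value, uint8_t scale,
                           uint32_t add_reg_value, int32_t displacement)
{
    return scale_reg_value * scale + add_reg_value + (uint32_t)displacement;
}
```

For example, with the scaled register holding 4, a scale of 8, the added register holding 0x1000 and a displacement of -16, the result is 0x1010.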

7.2.7. DISASM_RESULT Struct Reference

7.2.7.1. Detailed Description

disassembly result, 64-byte, matched by type-8 signatures

7.2.8. pe_image_data_dir Struct Reference

7.2.8.1. Detailed Description

PE data directory header

PE

7.2.9. pe_image_file_hdr Struct Reference

Data Fields
7.2.9.1. Detailed Description

Header for this PE file

PE

7.2.9.2. Field Documentation

uint16_t Machine CPU this executable runs on, see libclamav/pe.c for possible values

uint32_t Magic PE magic header: PE\0\0

uint16_t NumberOfSections Number of sections in this executable

uint32_t NumberOfSymbols debug

uint32_t PointerToSymbolTable debug

uint16_t SizeOfOptionalHeader == 224

uint32_t TimeDateStamp Unreliable
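
The Magic field holds the four bytes PE\0\0. A byte-wise comparison, sketched below in standalone C, avoids any dependence on host byte order. has_pe_magic and its parameters are hypothetical and not part of the bytecode API. The offset argument would typically be the e_lfanew value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: return 1 if the four bytes at `offset` in `buf`
 * are 'P', 'E', 0, 0, and 0 otherwise (including when fewer than four
 * bytes remain after `offset`). */
int has_pe_magic(const uint8_t *buf, size_t len, size_t offset)
{
    static const uint8_t magic[4] = { 'P', 'E', 0, 0 };

    if (offset > len || len - offset < sizeof(magic))
        return 0;
    return memcmp(buf + offset, magic, sizeof(magic)) == 0;
}
```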

7.2.10. pe_image_optional_hdr32 Struct Reference

Data Fields
7.2.10.1. Detailed Description

32-bit PE optional header

PE

7.2.10.2. Field Documentation

uint32_t CheckSum NT drivers only

uint32_t FileAlignment usually 32 or 512

uint32_t ImageBase multiple of 64 KB

uint16_t MajorImageVersion unreliable

uint8_t MajorLinkerVersion unreliable

uint16_t MajorOperatingSystemVersion not used

uint16_t MinorImageVersion unreliable

uint8_t MinorLinkerVersion unreliable

uint16_t MinorOperatingSystemVersion not used

uint32_t NumberOfRvaAndSizes unreliable

uint32_t SectionAlignment usually 32 or 4096

uint32_t SizeOfCode unreliable

uint32_t SizeOfInitializedData unreliable

uint32_t SizeOfUninitializedData unreliable
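
The notes on ImageBase, FileAlignment and SectionAlignment describe typical values, so they can serve as a heuristic for spotting unusual headers. The sketch below is standalone C with a hypothetical helper name. A header that fails the check is only atypical, not necessarily invalid or malicious.

```c
#include <stdint.h>

/* Hypothetical heuristic: return 1 if the three values match the
 * "usual" ranges given in the field documentation, 0 otherwise. */
int optional_header_looks_typical(uint32_t image_base,
                                  uint32_t file_alignment,
                                  uint32_t section_alignment)
{
    /* ImageBase: multiple of 64 KB. */
    if (image_base % 0x10000u != 0)
        return 0;
    /* FileAlignment: usually 32 or 512. */
    if (file_alignment != 32 && file_alignment != 512)
        return 0;
    /* SectionAlignment: usually 32 or 4096. */
    if (section_alignment != 32 && section_alignment != 4096)
        return 0;
    return 1;
}
```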

7.2.11. pe_image_optional_hdr64 Struct Reference

Data Fields
7.2.11.1. Detailed Description

PE 64-bit optional header

PE

7.2.11.2. Field Documentation

uint32_t CheckSum NT drivers only

uint32_t FileAlignment usually 32 or 512

uint64_t ImageBase multiple of 64 KB

uint16_t MajorImageVersion unreliable

uint8_t MajorLinkerVersion unreliable

uint16_t MajorOperatingSystemVersion not used

uint16_t MinorImageVersion unreliable

uint8_t MinorLinkerVersion unreliable

uint16_t MinorOperatingSystemVersion not used

uint32_t NumberOfRvaAndSizes unreliable

uint32_t SectionAlignment usually 32 or 4096

uint32_t SizeOfCode unreliable

uint32_t SizeOfInitializedData unreliable

uint32_t SizeOfUninitializedData unreliable

7.2.12. pe_image_section_hdr Struct Reference

Data Fields
7.2.12.1. Detailed Description

PE section header

PE

7.2.12.2. Field Documentation

uint8_t Name[8] may not end with NULL

uint16_t NumberOfLinenumbers object files only

uint16_t NumberOfRelocations object files only

uint32_t PointerToLinenumbers object files only

uint32_t PointerToRawData offset to the section’s data

uint32_t PointerToRelocations object files only

uint32_t SizeOfRawData multiple of FileAlignment
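
Because Name may use all 8 bytes without a terminating NUL, it should be copied into a larger buffer before being handled as a C string. A minimal standalone sketch follows. section_name_to_cstr is a hypothetical helper, not part of the bytecode API.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: copy an 8-byte section name into a 9-byte buffer
 * and terminate it, so that names using all 8 bytes stay bounded. */
void section_name_to_cstr(const uint8_t name[8], char out[9])
{
    memcpy(out, name, 8);
    out[8] = '\0';
}
```

Shorter names are already padded with NUL bytes in the header, so the resulting string ends at the first of them.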

7.3. Low level API

7.3.1. bytecode_api.h File Reference

Enumerations
Functions
Variables
7.3.1.1. Detailed Description
7.3.1.2. Enumeration Type Documentation

anonymous enum

Enumerator:

PE_INVALID_RVA
Invalid RVA specified

anonymous enum

Enumerator:

SEEK_SET
set file position to specified absolute position
SEEK_CUR
set file position relative to current position
SEEK_END
set file position relative to file end
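
The three whence modes differ only in the base that the offset is added to. The following standalone sketch shows how a target position could be resolved. It is not the ClamAV implementation: resolve_seek, the WHENCE_* names, their numeric values, the bounds check and the -1 error return are all assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-ins for SEEK_SET, SEEK_CUR and SEEK_END.
 * The numeric values are assumed, not taken from this manual. */
enum whence_sketch { WHENCE_SET = 0, WHENCE_CUR = 1, WHENCE_END = 2 };

/* Resolve a seek request for a file of `size` bytes whose current
 * position is `cur`. Returns the new absolute position, or -1 if the
 * mode is unknown or the target falls outside [0, size]. */
int64_t resolve_seek(int32_t pos, uint32_t whence, int64_t cur, int64_t size)
{
    int64_t target;

    switch (whence) {
    case WHENCE_SET: target = pos;        break; /* absolute position */
    case WHENCE_CUR: target = cur + pos;  break; /* relative to current */
    case WHENCE_END: target = size + pos; break; /* relative to file end */
    default:         return -1;
    }
    if (target < 0 || target > size)
        return -1;
    return target;
}
```

With a 100-byte file, an offset of -4 relative to the file end resolves to position 96 in this sketch.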

enum BytecodeKind Bytecode trigger kind

Enumerator:

BC_GENERIC
generic bytecode, not tied to a specific hook