Blogs – Shilei Tian

How the Device Images of LLVM OpenMP are Organized?

Starting from this post, I’ll introduce all implementation details that I know about LLVM OpenMP target offloading support, which is what libomptarget does. However, I don’t have a clear plan yet about what will be covered and what the order will be. One thing is clear: since I’m not very familiar with the front end, though I’ve also contributed some patches to clang, I’ll not talk about front end support. Despite this, some high level ideas will also be discussed if necessary.

This post is about how the device images are organized. Without further ado, let’s get started.

The entry point of libomptarget is __tgt_register_lib, whose only argument is of type __tgt_bin_desc *. We will talk about __tgt_bin_desc soon. __tgt_register_lib is defined in libomptarget, but you would not find any explicit function call to it. The call is actually inserted by the tool clang-offload-wrapper provided by clang, which is responsible to wrap the device image into the host object file. As you can probably tell now, the address of device image is not available until we wrap it to the host object file. That’s the reason that we can’t call the function __tgt_register_lib beforehand. We will talk about more details about the compilation flow in the future post.

__tgt_bin_desc stands for “target binary descriptor”, which is defined as:

struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *HostEntriesBegin;
 __tgt_offload_entry *HostEntriesEnd;
};

It’s a container of all device images and host entries. We will talk about host entry later. Let’s first discuss device image here. The binary descriptor does support multiple device images, which means you can have one executable for all potential target devices (remember the compilation flag for OpenMP target offloading is -fopenmp-targets and it is “targets“). However, design and implementation sometimes are different. At the time of this writing, I kind of doubt our toolchain can actually support to embed images for different architecture into one executable.

__tgt_device_image describes a device image.

struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};

The first and second data members point to where the image is stored, which will be loaded to a target device later. EntriesBegin and EntriesEnd point to the offload entry table. In most cases, the EntriesBegin and EntriesEnd are same as HostEntriesBegin and HostEntriesEnd in __tgt_bin_desc respectively, but they can also be a subset. For example, even with the same user code, a target device might need extra code, usually inserted by the compiler, to run properly, like initialization. That’s one of the reasons I doubt we can have multiple images for different architectures, because the offload entry table has to be continuous. If more than one architecture require extra entries, how the table will be organized. It can’t be continuous for every architecture.

Now let’s talk about the offload entry. As its name suggests, it is the entry point for offloading, so it can be a kernel function that can be launched on the host. In addition to that, a global variable (on the device) is also an offload entry. The reason is, most of target devices don’t support global variable initialization. As a result, you cannot just write int a = 1; and hope a is initialized to 1 when the image is loaded to a device. Global variables have to be initialized explicitly by transferring data from host to device. Therefore, we need to know at runtime what global variables are on the device, and what’s their values are. Please note that, like host global variables, they are only initialized once, right after the image is loaded to the device, before the execution of host user code (well, technically, this could be inaccurate. I’m currently working on the JIT support for OpenMP, and we propose a new feature to generate device image at kernel launch time, not global initialization time). Another reason to have global variables as offload entry is for data mapping. Data mapping is a map from host address to device address because we need to pass device address to the corresponding kernel function when we launch it. It’s very complicated and worth a quite long post to discuss the implementation details, but here let me only say a few words about its relation to global variables. Some data mapping could use the information of global variables, so we also need to maintain the mapping for each global variable as well. It can only be done in the following way. After the image is loaded to a device, collect the addresses of all global variables and store that information in the mapping table. So we need the information to tell us what global variables we have.

Another entry points are global constructors (c’tors) and destructors (d’tors). That is for C++ global objects. On the host, the compiler inserts function calls to c’tors of those global variables during the global construction, and function calls to d’tors as well during the global destruction. Same thing happens to device code as well. However, since all current target devices feature a host-centric execution model, which means a device can only execute code if the host “asks” it to, basically to launch it. Those global c’tors and d’tors will not be executed by themselves if we don’t launch them. As a result, we need to know all those functions and their corresponding device handles, and launch them explicitly at the right time.

Now we know what inside the binary descriptor and what they are used for. In next post, I’ll introduce how libomptarget is initialized.

COMDAT in LLVM

I have been working on LLVM heterogenous IR module (see this video for more details) for several days. The first thing I need is to modify the class llvm::Module to make it accommodate multiple modules. Frankly speaking, it was my first time to hear COMDAT when I went through every member data in the class. So I did some research and dig into LLVM source code to learn what it is and how it is used. I’m gonna share my thoughts in this post. Feel free to point it out if something is wrong.

What is COMDAT?

If you search this concept in Google, the first result jumping out should be from StackOverflow (at least in my case).

The purpose of a COMDAT section is to allow “duplicate” sections to be defined in multiple object files. Normally, if the same symbol is defined in multiple object files, the linker will report errors. This can cause problems for some C++ language features, like templates, that may instantiate the same symbols in different cpp files.

COMDAT is actually a section in an object file. It contains symbols that can be potentially with same name in different objects. This could happen when different C++ source files instantiate same template functions. Consider the following example:

// header.hpp
template <typename T>
T add(T a, T b) { return a + b; }
// foo.cpp
#include "header.hpp"
int foo(int a, int b) { return add(a, b); }
// bar.cpp
#include "header.hpp"
int bar(int a, int b) { return add(a, b); }

After we compile foo.cpp and bar.cpp and get foo.o and bar.o, both the two objects contain a symbol named __Z3addIiET_S0_S0_, which is a mangled name. It stands for int add<int>(int, int), which is exactly the instance function of the template function in header.hpp. When the linker links the two objects, if we don’t do something, there will be a linker error because there are two symbols with the same name.

COMDAT is to solve this problem. The symbol __Z3addIiET_S0_S0_ is put into a special section (COMDAT section). Since symbols in an object must have different names, when multiple objects are linked together, only one of those symbols with different names from different COMDAT sections in different objects can be kept. The linker must determine which one to stay. There are a couple of strategies that will be covered in next section.

But wait, why? Shouldn’t they be the same, like the above case, such that we can choose whatever we want? They should work fine because they’re same. Well, that’s true. However, things are not always like that. Consider the following code:

// foo.cpp
template <typename T>
T add(T a, T b) { return a + b; }
int foo(int a, int b) { return add(a, b); }
// bar.cpp
template <typename T>
T add(T a, T b) { return a + b + 2; }
int bar(int a, int b) { return add(a, b); }

Every file has its own template function add, and they work differently. However, they’re all called __Z3addIiET_S0_S0_ in their own objects, without any difference! During the linkage, the linker knows they’re different, but which one to choose? Here comes the strategy. You might be thinking, does it mean either foo or bar will not work correctly after the linkage?! Unfortunately, that’s true. That is how C++ works! That’s why we have millions of articles titled “Best Practice in C++” or “Ten things you should never do in C++”, etc. 🙂

How COMDAT works in LLVM?

Let’s first take a look how llvm::Comdat is defined:

class Comdat {
public:
  enum SelectionKind {
    Any,
    ExactMatch,
    Largest,
    NoDuplicates,
    SameSize,
  };
  Comdat(const Comdat &) = delete;
  Comdat(Comdat &&C);
  SelectionKind getSelectionKind() const;
  void setSelectionKind(SelectionKind Val);
  StringRef getName() const;
private:
  friend class Module;
  Comdat();
  StringMapEntry<Comdat> *Name = nullptr;
  SelectionKind SK = Any;
};

Form simplicity, I only kept meaningful parts. The class is very simple. It only contains two data members, a pointer to StringMapEntry and a SelectionKind. The latter one is pretty straightforward, which defines how to deal with the corresponding symbol. It has five kinds (strategies):

Any: The linker can choose whichever it wants when it has multiple symbols with the same name from different objects.
ExactMatch: The linker needs to check every instance from different objects whether they’re exact matched. If so, it can choose any of them (obviously). Others will be dropped. If any of them is different from others, a linkage error will be emitted. As for what is exact match, it just means different instances must have same size, same functionalities, etc.
Largest: The linker should choose the largest one if multiple instances are of different sizes.
NoDuplicates: This symbol should NOT be defined in another object, which means it can only exist in one object. Neither example in the previous section can pass if the COMDAT is this kind.
SameSize: The linker needs to check whether the corresponding symbols from different objects are of same size. It is different from ExactMatch because it only requires the same size. It is possible that different symbols can have the same size but different functionalities.

From the class definition, llvm::Comdat is like a property of a symbol. Therefore, each llvm::GlobalObject holds a pointer to a llvm::Comdat. The llvm::Module contains a mapping from a symbol name to its corresponding llvm::Comdat, and llvm::Comdats are actually stored in the map. For efficient look-up, llvm::Comdat contains a pointer to its corresponding entry in the map. It has three advantages:

The owner of the llvm::Comdat is the map (part of the llvm::Module) not a symbol. In this way, all COMDATs for a single module are in a same place. We can easily traverse all COMDATs if necessary.
A symbol only needs to hold a pointer to its llvm::Comdat without taking care of its lifetime.
The symbol name can be easily got via the pointer to the entry.

Every time we need to check whether a symbol is in the COMDAT section, we can either use function llvm::GlobalObject::hasComdat()or check whether the return value of llvm::GlobalObject::getComdat() is nullptr.

OpenMP Learning Notes

In C/C++, OpenMP directives are speciﬁed by using the #pragmamechanism provided by the C and C++ standards.
OpenMP directives for C/C++ are speciﬁed with #pragma directives. The syntax of an OpenMP directive is as follows:#pragma omp directive-name [clause[ [,] clause] ... ] new-line Each directive starts with #pragma omp. The remainder of the directive follows the conventions of the C and C++ standards for compiler directives. In particular, white space can be used before and after the #, and sometimes white space must be used to separate the words in a directive. Some OpenMP directives may be composed of consecutive #pragmadirectives if speciﬁed in their syntax.
Preprocessing tokens following #pragma omp are subject to macro replacement.
Directives are case-sensitive. Each of the expressions used in the OpenMP syntax inside of the clauses must be a valid assignment-expression of the base language unless otherwise speciﬁed.
Directives may not appear in constexpr functions or in constant expressions. Variadic parameter packs cannot be expanded into a directive or its clauses except as part of an expression argument to be evaluated by the base language, such as into a function call inside an if clause.
Only one directive-name can be speciﬁed per directive (note that this includes combined directives). The order in which clauses appear on directives is not signiﬁcant. Clauses on directives may be repeated as needed, subject to the restrictions listed in the description of each clause.
Some clauses accept a list, an extended-list, or a locator-list.
- A list consists of a comma-separated collection of one or more list items. A list item is a variable or an array section.
- An extended-list consists of a comma-separated collection of one or more extended list items. An extended list item is a list item or a function name.
- A locator-list consists of a comma-separated collection of one or more locator list items. A locator list item is any lvalue expression, including variables, or an array section.
Some executable directives include a structured block. A structured block:
- may contain inﬁnite loops where the point of exit is never reached;
- may halt due to an IEEE exception;
- may contain calls to exit(), _Exit(), quick_exit(), abort() or functions with a _Noreturn speciﬁer (in C) or a noreturn attribute (in C/C++);
- may be an expression statement, iteration statement, selection statement, or try block, provided that the corresponding compound statement obtained by enclosing it in { and } would be a structured block.
Stand-alone directives do not have any associated executable user code. Instead, they represent executable statements that typically do not have succinct equivalent statements in the base language. There are some restrictions on the placement of a stand-alone directive within a program. A stand-alone directive may be placed only at a point where a base language executable statement is allowed. A stand-alone directive may not be used in place of the statement following an if, while, do, switch, or label.
In implementations that support a preprocessor, the _OPENMP macro name is deﬁned to have the decimal value yyyymm where yyyy and mm are the year and month designations of the version of the OpenMP API that the implementation supports. If a #define or a #undef preprocessing directive in user code deﬁnes or undeﬁnes the _OPENMP macro name, the behavior is unspeciﬁed.

Compiler Techniques/Terminologies

Tail call

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If a tail call might lead to the same subroutine being called again later in the call chain, the subroutine is said to be tail-recursive, which is a special case of recursion.

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is no longer needed, and can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls). The program can then jump to the called subroutine. Producing such code instead of a standard call sequence is called tail call elimination. Tail call elimination allows procedure calls in tail position to be implemented as efficiently as goto statements, thus allowing efficient structured programming. In the words of Guy L. Steele, “in general, procedure calls may be usefully thought of as GOTO statements which also pass parameters, and can be uniformly coded as [machine code] JUMP instructions.”

Call site

In programming, a call site of a function or subroutine is the location (line of code) where the function is called (or may be called, through dynamic dispatch). A call site is where zero or more arguments are passed to the function, and zero or more return values are received.

// this is a function definition
function sqr(x) {
  return x * x;
}

function foo() {
  // these are two call sites of function sqr in this function
  a = sqr(b);
  c = sqr(b);
}

Constant folding

Constant folding is the process of recognizing and evaluating constant expressions at compile time rather than computing them at runtime. Terms in constant expressions are typically simple literals, such as the integer literal, but they may also be variables whose values are known at compile time. Consider the statement:

int i = 320 * 200 * 32;

Most compilers would not actually generate two multiply instructions and a store for this statement. Instead, they identify constructs such as these and substitute the computed values at compile time (in this case, 2,048,000).

Constant folding can make use of arithmetic identities. If x is numeric, the value of 0 * x is zero even if the compiler does not know the value of x (note that this is not valid for IEEE floats since x could be Infinity or NotANumber. Still, some languages favoring performance like GLSL allows this for constants, which can occasionally cause bugs).

Constant folding may apply to more than just numbers. Concatenation of string literals and constant strings can be constant folded. Code such as "abc" + "def" may be replaced with "abcdef".

Constant propagation

Constant propagation is the process of substituting the values of known constants in expressions at compile time. Such constants include those defined above, as well as intrinsic functions applied to constant values. Consider the following pseudocode:

int x = 14;
int y = 7 - x / 2;
return y * (28 / x + 2);

Propagating x yields:

int x = 14;
int y = 7 - 14 / 2;
return y * (28 / 14 + 2);

Continuing to propagate yields the following (which would likely be further optimized by dead code elimination of both x and y.)

int x = 14;
int y = 0;
return 0;

Constant propagation is implemented in compilers using reaching definitionanalysis results. If all a variable’s reaching definitions are the same assignment which assigns a same constant to the variable, then the variable has a constant value and can be replaced with the constant.

Constant propagation can also cause conditional branches to simplify to one or more unconditional statements, when the conditional expression can be evaluated to true or false at compile time to determine the only possible outcome.

Reaching definition

In compiler theory, a reaching definition for a given instruction is an earlier instruction whose target variable can reach (be assigned to) the given one without an intervening assignment. For example, in the following code:

d1 : y := 3
d2 : x := y

d1 is a reaching definition for d2. In the following, example, however:

d1 : y := 3
d2 : y := 4
d3 : x := y

d1 is no longer a reaching definition for d3, because d2 kills its reach: the value defined in d1 is no longer available and cannot reach d3.

The similarly named reaching definitions is a data-flow analysis which statically determines which definitions may reach a given point in the code. Because of its simplicity, it is often used as the canonical example of a data-flow analysis in textbooks. The data-flow confluence operator used is set union, and the analysis is forward flow. Reaching definitions are used to compute use-def chains.

Peephole optimization

Peephole optimization is an optimization technique performed on a small set of instructions in a segment of assembly-language code, known as the peephole or window. Peephole optimization involves changes to individual assembly-language instructions, such as eliminating redundant code, replacing slower instructions with faster ones, optimizing flow control, and performing algebraic simplification.

Common techniques applied in peephole optimization:

Null sequences – Delete useless operations.
Combine operations – Replace several operations with one equivalent.
Algebraic laws – Use algebraic laws to simplify or reorder instructions.
Special case instructions – Use instructions designed for special operand cases.
Address mode operations – Use address modes to simplify code. There can be other types of peephole optimizations.

Value numbering

Value numbering is a technique of determining when two computations in a program are equivalent and eliminating one of them with a semantics preserving optimization.

Global value numbering

Global value numbering (GVN) is a compiler optimization based on the static single assignment form (SSA) intermediate representation. It sometimes helps eliminate redundant code that common subexpression elimination (CSE) does not. At the same time, however, CSE may eliminate code that GVN does not, so both are often found in modern compilers. Global value numbering is distinct from local value numbering in that the value-number mappings hold across basic block boundaries as well, and different algorithms are used to compute the mappings.

Global value numbering works by assigning a value number to variables and expressions. The same value number is assigned to those variables and expressions which are provably equivalent. For instance, in the following code:

Common subexpression elimination

In compiler theory, common subexpression elimination (CSE) is a compiler optimization that searches for instances of identical expressions (i.e., they all evaluate to the same value), and analyzes whether it is worthwhile replacing them with a single variable holding the computed value.

Example

In the following code:

a = b * c + g;
d = b * c * e;

It may be worth transforming the code to:

tmp = b * c;
a = tmp + g;
d = tmp * e;

If the cost of storing and retrieving tmp is less than the cost of calculating b * can extra time.

Principle

The possibility to perform CSE is based on available expression analysis (a data flow analysis). An expression b * c is available at a point p in a program if:

every path from the initial node to p evaluates b * c before reaching p,
and there are no assignments to b or c after the evaluation but before p.

The cost/benefit analysis performed by an optimizer will calculate whether the cost of the store to tmp is less than the cost of the multiplication; in practice other factors such as which values are held in which registers are also significant.

Compiler writers distinguish two kinds of CSE:

local common subexpression elimination works within a single basic block
global common subexpression elimination works on an entire procedure

Both kinds rely on data flow analysis of which expressions are available at which points in a program.