openmp – Shilei Tian

Starting with this post, I’ll walk through the implementation details I know about LLVM’s OpenMP target offloading support, which is what libomptarget provides. I don’t yet have a clear plan for what will be covered or in what order. One thing is clear: although I’ve contributed some patches to clang, I’m not very familiar with the front end, so I won’t cover front-end support. That said, I’ll still discuss some high-level ideas where necessary.

This post is about how the device images are organized. Without further ado, let’s get started.

The entry point of libomptarget is __tgt_register_lib, whose only argument is of type __tgt_bin_desc *. We will talk about __tgt_bin_desc soon. __tgt_register_lib is defined in libomptarget, but you will not find any explicit call to it. The call is actually inserted by clang-offload-wrapper, a tool provided by clang that is responsible for wrapping the device image into the host object file. As you can probably tell, the address of the device image is not available until the image has been wrapped into the host object file, which is why the call to __tgt_register_lib cannot be emitted any earlier. I’ll cover the compilation flow in more detail in a future post.

__tgt_bin_desc stands for “target binary descriptor”, which is defined as:

struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *HostEntriesBegin;
  __tgt_offload_entry *HostEntriesEnd;
};

It’s a container for all device images and host entries. We will talk about host entries later; let’s first discuss device images. The binary descriptor does support multiple device images, which means you can have one executable for all potential target devices (remember that the compilation flag for OpenMP target offloading is -fopenmp-targets, and it is the plural “targets”). However, design and implementation sometimes differ. At the time of this writing, I doubt our toolchain can actually embed images for different architectures into one executable.

__tgt_device_image describes a device image.

struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};

The first two data members, ImageStart and ImageEnd, delimit where the image is stored; the image will be loaded onto a target device later. EntriesBegin and EntriesEnd point to the offload entry table. In most cases, EntriesBegin and EntriesEnd are the same as HostEntriesBegin and HostEntriesEnd in __tgt_bin_desc, respectively, but they can also delimit a subset. For example, even with the same user code, a target device might need extra code to run properly, such as initialization code, which is usually inserted by the compiler. That’s one of the reasons I doubt we can have multiple images for different architectures: the offload entry table has to be contiguous. If more than one architecture requires extra entries, how would the table be organized? It can’t be contiguous for every architecture.
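To make the layout concrete, here is a minimal sketch that builds a descriptor by hand and walks it. The field layout of __tgt_offload_entry is not shown in this post, so the one below is an assumption for illustration, and imageSize, entriesAreSlices, and demoDescriptor are hypothetical helpers, not part of libomptarget.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed layout: this post does not show __tgt_offload_entry, so the fields
// below are a stand-in for illustration only.
struct __tgt_offload_entry {
  void *addr;    // host address of the kernel stub or global variable
  char *name;    // symbol name shared by host and device
  size_t size;   // size in bytes (assumed to be 0 for kernels)
  int32_t flags;
  int32_t reserved;
};

struct __tgt_device_image {
  void *ImageStart;
  void *ImageEnd;
  __tgt_offload_entry *EntriesBegin;
  __tgt_offload_entry *EntriesEnd;
};

struct __tgt_bin_desc {
  int32_t NumDeviceImages;
  __tgt_device_image *DeviceImages;
  __tgt_offload_entry *HostEntriesBegin;
  __tgt_offload_entry *HostEntriesEnd;
};

// Hypothetical helper: size in bytes of one embedded image.
size_t imageSize(const __tgt_device_image &Img) {
  return static_cast<size_t>(static_cast<const char *>(Img.ImageEnd) -
                             static_cast<const char *>(Img.ImageStart));
}

// Hypothetical helper: true if every image's entry range is a slice of the
// host entry table [HostEntriesBegin, HostEntriesEnd).
bool entriesAreSlices(const __tgt_bin_desc &Desc) {
  for (int32_t I = 0; I < Desc.NumDeviceImages; ++I) {
    const __tgt_device_image &Img = Desc.DeviceImages[I];
    if (Img.EntriesBegin < Desc.HostEntriesBegin ||
        Img.EntriesEnd > Desc.HostEntriesEnd ||
        Img.EntriesBegin > Img.EntriesEnd)
      return false;
  }
  return true;
}

// Builds a descriptor by hand, standing in for what the wrapper would emit.
size_t demoDescriptor() {
  static char FakeImage[64];
  static char KernelName[] = "kernel_a";
  static char GlobalName[] = "global_b";
  static int GlobalB = 1;
  static __tgt_offload_entry Entries[] = {
      {nullptr, KernelName, 0, 0, 0},
      {&GlobalB, GlobalName, sizeof(GlobalB), 0, 0},
  };
  __tgt_device_image Image = {FakeImage, FakeImage + sizeof(FakeImage),
                              Entries, Entries + 2};
  __tgt_bin_desc Desc = {1, &Image, Entries, Entries + 2};
  assert(entriesAreSlices(Desc));
  return imageSize(Desc.DeviceImages[0]);
}
```

Because an image can only name its entries with a begin/end pair, it can only ever describe one contiguous slice of the host table, which is where the concern about multiple architectures comes from.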

Now let’s talk about the offload entry. As its name suggests, it is an entry point for offloading, so it can be a kernel function that is launched from the host. In addition, a global variable on the device is also an offload entry. The reason is that most target devices don’t support global variable initialization. As a result, you cannot just write int a = 1; and expect a to be initialized to 1 when the image is loaded onto a device. Global variables have to be initialized explicitly by transferring data from the host to the device. Therefore, we need to know at runtime which global variables are on the device and what their values are. Note that, like host global variables, they are initialized only once, right after the image is loaded onto the device and before any host user code executes. (Technically, this could be inaccurate: I’m currently working on JIT support for OpenMP, and we propose a new feature to generate the device image at kernel launch time rather than at global initialization time.)

Another reason to have global variables as offload entries is data mapping. Data mapping is a map from host addresses to device addresses, which we need because we have to pass device addresses to the corresponding kernel function when we launch it. It’s very complicated and deserves a long post of its own, so here I’ll only say a few words about its relation to global variables. Some data mappings use information about global variables, so we need to maintain a mapping for each global variable as well. This can only be done in the following way: after the image is loaded onto a device, collect the addresses of all global variables and store that information in the mapping table. That is why we need information telling us which global variables we have.
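The two roles of a global-variable entry can be sketched as follows. This is a simulation under stated assumptions, not libomptarget’s code: the “device” is ordinary host memory, the symbol table and mapping table are plain std::map objects, the entry layout is assumed, and treating size == 0 as “this entry is a kernel” is also an assumption. registerGlobals and demoGlobals are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <string>

// Assumed layout; the post does not show this struct.
struct __tgt_offload_entry {
  void *addr;    // host address of the kernel stub or global variable
  char *name;    // symbol name shared by host and device
  size_t size;   // assumed: 0 for kernels, byte size for globals
  int32_t flags;
  int32_t reserved;
};

// Simulated device symbol table: name -> "device" address.
using DeviceSymbols = std::map<std::string, void *>;
// Mapping table: host address -> device address.
using HostToDevice = std::map<void *, void *>;

// For every global-variable entry: copy the host value to the device copy
// (standing in for a host-to-device transfer) and record the address pair.
size_t registerGlobals(const __tgt_offload_entry *Begin,
                       const __tgt_offload_entry *End,
                       const DeviceSymbols &Symbols, HostToDevice &Mapping) {
  size_t Count = 0;
  for (const __tgt_offload_entry *E = Begin; E != End; ++E) {
    if (E->size == 0)
      continue; // assumed to be a kernel, nothing to initialize
    auto It = Symbols.find(E->name);
    if (It == Symbols.end())
      continue; // symbol not present in this image
    std::memcpy(It->second, E->addr, E->size);
    Mapping[E->addr] = It->second;
    ++Count;
  }
  return Count;
}

int demoGlobals() {
  static int HostA = 1; // plays the role of "int a = 1;" on the host
  int DeviceA = 0;      // uninitialized copy inside the loaded image
  static char KernelName[] = "kernel";
  static char AName[] = "a";
  __tgt_offload_entry Entries[] = {
      {nullptr, KernelName, 0, 0, 0},
      {&HostA, AName, sizeof(HostA), 0, 0},
  };
  DeviceSymbols Symbols = {{"a", &DeviceA}};
  HostToDevice Mapping;
  size_t N = registerGlobals(Entries, Entries + 2, Symbols, Mapping);
  assert(N == 1 && Mapping.at(&HostA) == &DeviceA);
  return DeviceA; // now holds the host's initial value
}
```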

Other entry points are global constructors (c’tors) and destructors (d’tors), which exist for C++ global objects. On the host, the compiler inserts calls to the c’tors of those global objects during global construction, and calls to their d’tors during global destruction. The same thing happens in device code. However, all current target devices feature a host-centric execution model, which means a device can only execute code if the host “asks” it to, that is, launches it. Those global c’tors and d’tors will not run by themselves if we don’t launch them. As a result, we need to know all those functions and their corresponding device handles, and launch them explicitly at the right time.
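A rough sketch of “launch them explicitly at the right time”, again as a simulation. The Entry struct is a trimmed, hypothetical stand-in for an offload entry, the flag values CtorFlag and DtorFlag are assumptions chosen for the example, and the device handles are std::function objects rather than real kernel handles.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Assumed flag values marking an entry as a global c'tor or d'tor.
enum : int32_t { CtorFlag = 0x02, DtorFlag = 0x04 };

// Trimmed, hypothetical stand-in for an offload entry.
struct Entry {
  std::string Name;
  int32_t Flags;
};

// Simulated device handles: entry name -> something the host can launch.
using DeviceHandles = std::map<std::string, std::function<void()>>;

// The host launches every entry carrying the given flag, in table order.
void launchMarked(const std::vector<Entry> &Entries,
                  const DeviceHandles &Handles, int32_t Flag) {
  for (const Entry &E : Entries)
    if (E.Flags & Flag)
      Handles.at(E.Name)();
}

std::string demoCtorsDtors() {
  std::string Log;
  DeviceHandles Handles = {
      {"obj_ctor", [&Log] { Log += "ctor;"; }},
      {"kernel", [&Log] { Log += "kernel;"; }},
      {"obj_dtor", [&Log] { Log += "dtor;"; }},
  };
  std::vector<Entry> Entries = {
      {"obj_ctor", CtorFlag}, {"kernel", 0}, {"obj_dtor", DtorFlag}};

  launchMarked(Entries, Handles, CtorFlag); // right after the image is loaded
  Handles.at("kernel")();                   // an ordinary kernel launch
  launchMarked(Entries, Handles, DtorFlag); // before the image is unloaded
  return Log;
}
```

Nothing on the simulated device runs unless the host calls it, which mirrors the host-centric model described above.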

Now we know what is inside the binary descriptor and what each part is used for. In the next post, I’ll introduce how libomptarget is initialized.

In C/C++, OpenMP directives are specified by using the #pragma mechanism provided by the C and C++ standards.
OpenMP directives for C/C++ are specified with #pragma directives. The syntax of an OpenMP directive is as follows:
#pragma omp directive-name [clause[ [,] clause] ... ] new-line
Each directive starts with #pragma omp. The remainder of the directive follows the conventions of the C and C++ standards for compiler directives. In particular, white space can be used before and after the #, and sometimes white space must be used to separate the words in a directive. Some OpenMP directives may be composed of consecutive #pragma directives if specified in their syntax.
Preprocessing tokens following #pragma omp are subject to macro replacement.
Directives are case-sensitive. Each of the expressions used in the OpenMP syntax inside of the clauses must be a valid assignment-expression of the base language unless otherwise specified.
Directives may not appear in constexpr functions or in constant expressions. Variadic parameter packs cannot be expanded into a directive or its clauses except as part of an expression argument to be evaluated by the base language, such as into a function call inside an if clause.
Only one directive-name can be specified per directive (note that this includes combined directives). The order in which clauses appear on directives is not significant. Clauses on directives may be repeated as needed, subject to the restrictions listed in the description of each clause.
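As a small illustration of the syntax rules above, the sketch below uses one combined directive-name (parallel for) followed by two clauses. The function name sumTo is made up for the example. The code gives the same result whether or not the compiler has OpenMP enabled, since an unrecognized pragma is simply ignored.

```cpp
#include <cassert>

// directive-name: "parallel for" (a combined directive, still one name)
// clauses: reduction(+ : Sum) and schedule(static); their order is not
// significant, so swapping them would mean the same thing.
long sumTo(int N) {
  long Sum = 0;
#pragma omp parallel for reduction(+ : Sum) schedule(static)
  for (int I = 1; I <= N; ++I)
    Sum += I;
  return Sum;
}
```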
Some clauses accept a list, an extended-list, or a locator-list.
- A list consists of a comma-separated collection of one or more list items. A list item is a variable or an array section.
- An extended-list consists of a comma-separated collection of one or more extended list items. An extended list item is a list item or a function name.
- A locator-list consists of a comma-separated collection of one or more locator list items. A locator list item is any lvalue expression, including variables, or an array section.
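A short sketch of a list whose items are array sections, using map clauses on a target construct. addArrays and demoAdd are names invented for this example. The first map clause takes a list with two items, A[0 : N] and B[0 : N]; the second takes a list with one item. Without OpenMP enabled, the pragma is ignored and the loop runs on the host.

```cpp
#include <cassert>

// map(to : ...) receives a list of two array sections;
// map(from : ...) receives a list with a single array section.
void addArrays(const double *A, const double *B, double *C, int N) {
#pragma omp target teams distribute parallel for \
    map(to : A[0 : N], B[0 : N]) map(from : C[0 : N])
  for (int I = 0; I < N; ++I)
    C[I] = A[I] + B[I];
}

double demoAdd() {
  double A[3] = {1.0, 2.0, 3.0};
  double B[3] = {10.0, 20.0, 30.0};
  double C[3] = {0.0, 0.0, 0.0};
  addArrays(A, B, C, 3);
  return C[0] + C[1] + C[2];
}
```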
Some executable directives include a structured block. A structured block:
- may contain infinite loops where the point of exit is never reached;
- may halt due to an IEEE exception;
- may contain calls to exit(), _Exit(), quick_exit(), abort() or functions with a _Noreturn specifier (in C) or a noreturn attribute (in C/C++);
- may be an expression statement, iteration statement, selection statement, or try block, provided that the corresponding compound statement obtained by enclosing it in { and } would be a structured block.
Stand-alone directives do not have any associated executable user code. Instead, they represent executable statements that typically do not have succinct equivalent statements in the base language. There are some restrictions on the placement of a stand-alone directive within a program. A stand-alone directive may be placed only at a point where a base language executable statement is allowed. A stand-alone directive may not be used in place of the statement following an if, while, do, switch, or label.
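The placement rule can be sketched with barrier, a stand-alone directive. It may not be the statement that directly follows an if, so it is wrapped in a compound statement instead. runPhases is a name invented for this example, and the condition is the same for every thread so that all of them reach the barrier.

```cpp
#include <cassert>

int runPhases(bool Sync) {
  int Phases = 0;
#pragma omp parallel num_threads(2) reduction(+ : Phases)
  {
    Phases += 1; // phase one

    // Not allowed: "if (Sync)" followed directly by "#pragma omp barrier".
    // Allowed: the directive sits inside a compound statement.
    if (Sync) {
#pragma omp barrier
    }

    Phases += 1; // phase two
  }
  return Phases; // two phases per participating thread
}
```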
In implementations that support a preprocessor, the _OPENMP macro name is defined to have the decimal value yyyymm where yyyy and mm are the year and month designations of the version of the OpenMP API that the implementation supports. If a #define or a #undef preprocessing directive in user code defines or undefines the _OPENMP macro name, the behavior is unspecified.
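A minimal sketch of reading the macro. The helper names are made up, and returning 0 when OpenMP is not enabled is a convention chosen for this example only.

```cpp
#include <cassert>

// Returns the yyyymm value of _OPENMP, or 0 when the compiler was not
// invoked with OpenMP support (a convention of this sketch only).
long openmpVersion() {
#ifdef _OPENMP
  return _OPENMP;
#else
  return 0;
#endif
}

// Splits the yyyymm value into its year and month parts.
long openmpYear() { return openmpVersion() / 100; }
long openmpMonth() { return openmpVersion() % 100; }
```

User code should only test the macro, never define or undefine it.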
