COMDAT in LLVM

I have been working on an LLVM heterogeneous IR module (see this video for more details) for several days. The first thing I need to do is modify the class llvm::Module so that it can accommodate multiple modules. Frankly speaking, I had never heard of COMDAT until I went through every data member of the class. So I did some research and dug into the LLVM source code to learn what it is and how it is used. I’m gonna share my thoughts in this post. Feel free to point out anything that is wrong.

What is COMDAT?

If you search for this concept on Google, the first result that pops up is likely to be from StackOverflow (at least it was in my case):

The purpose of a COMDAT section is to allow “duplicate” sections to be defined in multiple object files. Normally, if the same symbol is defined in multiple object files, the linker will report errors. This can cause problems for some C++ language features, like templates, that may instantiate the same symbols in different cpp files.

A COMDAT is actually a section in an object file. It contains symbols that may have the same name in different objects. This can happen when different C++ source files instantiate the same function template. Consider the following example:

// header.hpp
template <typename T>
T add(T a, T b) { return a + b; }
// foo.cpp
#include "header.hpp"
int foo(int a, int b) { return add(a, b); }
// bar.cpp
#include "header.hpp"
int bar(int a, int b) { return add(a, b); }

After we compile foo.cpp and bar.cpp into foo.o and bar.o, both objects contain a symbol named __Z3addIiET_S0_S0_, which is a mangled name (the extra leading underscore is what macOS toolchains prepend; on Linux the symbol is _Z3addIiET_S0_S0_). It stands for int add<int>(int, int), which is exactly the instantiation of the function template in header.hpp. When the linker links the two objects, without special handling there would be a linker error because two symbols have the same name.

COMDAT solves this problem. The symbol __Z3addIiET_S0_S0_ is put into a special section (a COMDAT section). Since symbols in the linked output must have unique names, when multiple objects are linked together, only one of the same-named symbols from the COMDAT sections of the different objects can be kept. The linker must decide which one stays. There are several strategies, which are covered in the next section.

But wait, why does the choice matter? Shouldn’t the copies all be the same, as in the case above, so that we can pick whichever we want? Well, that’s true here. However, things are not always like that. Consider the following code:

// foo.cpp
template <typename T>
T add(T a, T b) { return a + b; }
int foo(int a, int b) { return add(a, b); }
// bar.cpp
template <typename T>
T add(T a, T b) { return a + b + 2; }
int bar(int a, int b) { return add(a, b); }

Each file has its own function template add, and the two behave differently. However, both instantiations are called __Z3addIiET_S0_S0_ in their own objects, with nothing to tell them apart! During linking, the linker finds two different definitions under one name, so which one should it choose? This is where the strategy comes in. You might be thinking: does that mean either foo or bar will not work correctly after linking?! Unfortunately, that may well be the case. In C++ terms this is a violation of the One Definition Rule, and the behavior of the program is undefined. That is how C++ works! That’s why we have millions of articles titled “Best Practice in C++” or “Ten things you should never do in C++”, etc. 🙂

How does COMDAT work in LLVM?

Let’s first take a look at how llvm::Comdat is defined:

class Comdat {
public:
  enum SelectionKind {
    Any,
    ExactMatch,
    Largest,
    NoDuplicates,
    SameSize,
  };
  Comdat(const Comdat &) = delete;
  Comdat(Comdat &&C);
  SelectionKind getSelectionKind() const;
  void setSelectionKind(SelectionKind Val);
  StringRef getName() const;
private:
  friend class Module;
  Comdat();
  StringMapEntry<Comdat> *Name = nullptr;
  SelectionKind SK = Any;
};

For simplicity, I only kept the meaningful parts. The class is very simple. It contains only two data members: a pointer to a StringMapEntry and a SelectionKind. The latter is pretty straightforward: it defines how the linker should deal with the corresponding symbol. There are five kinds (strategies):

  • Any: The linker can choose whichever it wants when it has multiple symbols with the same name from different objects.
  • ExactMatch: The linker needs to check whether the instances from different objects match exactly. If they do, it can choose any of them (obviously) and drop the others. If any of them differs from the others, a link error is emitted. An exact match means the instances must have the same size and the same contents.
  • Largest: The linker should choose the largest one if multiple instances are of different sizes.
  • NoDuplicates: This symbol must NOT be defined in any other object, which means it can exist in only one object. Neither example in the previous section would link if the COMDAT were of this kind. (Recent LLVM releases spell this kind NoDeduplicate.)
  • SameSize: The linker needs to check whether the corresponding symbols from different objects are the same size. It differs from ExactMatch because it only requires the same size. Different symbols can have the same size but different contents.

From the class definition, llvm::Comdat acts like a property of a symbol. Therefore, each llvm::GlobalObject holds a pointer to an llvm::Comdat. The llvm::Module contains a map from a COMDAT name (typically the name of the symbol) to its corresponding llvm::Comdat, and the llvm::Comdat objects are stored in the map itself. For efficient look-up, each llvm::Comdat contains a pointer to its own entry in the map. This design has three advantages:

  • The owner of the llvm::Comdat is the map (part of the llvm::Module), not a symbol. In this way, all COMDATs of a single module are in the same place. We can easily traverse all COMDATs if necessary.
  • A symbol only needs to hold a pointer to its llvm::Comdat without taking care of its lifetime.
  • The symbol name can be easily obtained via the pointer to the entry.

Whenever we need to check whether a symbol is in a COMDAT section, we can either call llvm::GlobalObject::hasComdat() or check whether the return value of llvm::GlobalObject::getComdat() is nullptr.