Oops!… I violated ODR again

ODR_Again

Often when you’re trying to debug a piece of code, the debugger steps into a block you didn’t expect it to. Most of the time this is because your code has a logic error, and it wasn’t doing what your mental model thought it would.

Other times it’s because you accidentally compiled with optimizations on, and the compiler did some magic to make the outcome the same even if the code was different.

This post talks about a scenario where it’s neither of those things. Instead it’s something much more dangerous and difficult to detect — a violation of the One Definition Rule (ODR).

In this post I’ll talk a little about C++’s One Definition Rule, and then discuss how one manifested itself in a project I was working on. I’ll talk about how to detect them, and how I resolved mine.
(I talk a lot about violating ODR in the context of a tool I use (SWIG), but the manner in which I violated ODR is applicable to any C++ library that links to another.)


SWIG  is a handy little tool I use to generate Python bindings for my C++ code. If I have the following code in a header file:

 

template<class T>
class AndyGsFancyContainer
{
public:
  void insert(int key, T value){/*...*/}
  T get(int key){/*...*/}
  //...
};

I can write a nifty SWIG interface file like so:

// swig.i
%module fancy_container
%{
  #include "AndyGsFancyContainer.h"
%}
%include "AndyGsFancyContainer.h"
%template(AndyGsFancyCharContainer) AndyGsFancyContainer<char>;

When I run SWIG with swig.i as input, it spits out a bunch of Python C API code that I can compile into a .pyd library (really just a renamed .dll). It also spits out a Python module named fancy_container.py that I can import and call through to my C++.

When a Python script calls into my fancy C++ container, the behind the scenes control flow looks something like this:

SWIG layer diagram

Be careful what you SWIG for

You don’t always want SWIG to wrap every single nook and cranny of your header file, so there are ways to tell SWIG to ignore those bits:

#ifndef SWIG
void RazzleDazzle(std::unique_ptr<std::vector<T>> vals); // make sparks come out of the nearest outlet!
#endif

Our RazzleDazzle function is awe-inspiring, but perhaps not something we want Python users to be able to call. Mostly because we don’t want to jump through all the hoops of telling SWIG how to transform unique_ptr and std::vector into types that Python can understand.

Using this preprocessor symbol is perfectly fine for our code, perhaps to some readers’ surprise. SWIG has its own preprocessor that it runs on your C++ code before spitting out what it wants to wrap, and it behaves as if the "SWIG" symbol is always defined at this time. The "SWIG" symbol never actually gets defined either in the code or on the compiler command line.

We run afoul of the C++ Standard when we decide that we want to hide code from Python specifically (versus, say, C# bindings)

To hide code from a specific target language, SWIG utilizes alternative preprocessor symbols:

template<class T>
class AndyGsFancyContainer
{
  public:
  #ifndef SWIGPYTHON // too dangerous in the hands of a Python developer
  virtual void RazzleDazzle(std::unique_ptr<std::vector<T>> vals) // make sparks come out of the nearest outlet!
  {/*...*}; 
  #endif

  void insert(int key, T value){/*...*/}
  T get(int key){/*...*/}

  //...
};

Why is this a problem? Because, while the C++ that SWIG spits out doesn’t define the "SWIG" symbol, it does define the "SWIGPYTHON" symbol. And we’ve silently violated the almighty One Definition Rule.

What is ODR?

A silent violation of the One Definition rule is something you’ll rarely encounter in day-to-day development. Most of the time the compiler or linker will scream and yell at you about multiply-defined symbols until you fix it.

error LNK2005: "int variable_defined_in_header" (?variable_defined_in_header@@3HA) already defined in Source.obj 

The silent (but deadly) ones are often a byproduct of a a mistake in your build process. You may run into it when attempting to integrate third party libraries that were compiled differently or on a different version of the same compiler. You may run into it in your own code when you apply different optimization switches to different parts of the code, or if you inconsistently define preprocessor symbols that affect which code gets compiled at all.

The code above runs afoul of the last one. But let’s not get ahead of ourselves; what is the One Definition rule in the first place? It says that you can only have one definition for any type, variable, function, etc. Let’s elaborate.

A .cpp file in your project is its own little universe. It has no care at all for what other .cpp files are doing. It does, however, care about types and functions declared in its dependent header (.h) files. Some of these types and functions it will create definitions for. This is why you end up with .h/.cpp pairs like Foo.h and Foo.cpp; Foo.cpp creates definitions for the declarations in the header.

For the types and functions that the .cpp file does NOT define, though, it’s generally good enough for the .cpp file to assume that someone else is going to create the definitions for them.

When compilation time rolls around, the compiler takes each .cpp file independently and compiles it into its own binary format (.obj in MSVC, .o in gcc). What does it do about the functions that weren’t defined, though?

It marks those symbols in some way to imply nothing is known about them yet. That’s where the linker comes in.

After compilation, the linker gathers up all your object files with the goal of creating your library (.dll or .so). Its job is to aggregate all definitions from all the .cpp files, and make sure that all the "nothing known about yet" areas get resolved.

If there’s a spot where a definition doesn’t get resolved, you get a linker error. The linker error will often say something along the lines of "undefined external symbol 'blah'" which, following my explanation above, is perfectly clear; you have a .cpp file that relies on a type or function whose definition couldn’t be found by the linker.

Often it’s just because you have a typo, like you declared some function Blarg() in a header, but then typed it as Blargh() in the .cpp file (notice the added ‘h’).

So how does that relate to the One Definition Rule?

Like I said above, the linker takes all the definitions and aggregates them into a single .dll or .so file. What happens if you have two definitions of the same thing?

According to the C++ standard, a single .cpp (more formally, translation unit) shall not contain more than a single definition of anything. From a practical standpoint, this is something the compiler can detect while it’s compiling that .cpp file because it has both definitions readily available.

In this situation, the compiler will be nice to enough to give you an error:

// Foo.h
struct Foo
{
  // declaration
  void DoAThing();
};

// Foo.cpp
#include "Header.h"

// first definition
void Foo::DoAThing()
{
  std::cout << "Doing a thing!\n";
}
    
// second definition, compiler will give you an error
void Foo::DoAThing()
{
  std::cout << "Doing a thing!!!\n";
}

That’s great and all, but what if you put the two definitions into different .cpp files?

// Foo.h
struct Foo
{
  // declaration
  void DoAThing();
};

// Foo_one.cpp
#include "Header.h"

// first definition
void Foo::DoAThing()
{
  std::cout << "Doing a thing!\n";
}
    
// Foo_two.cpp
#include "Header.h"

//second definition
void Foo::DoAThing()
{
  std::cout << "Second Doing a thing!\n";
}

What will happen? The compiler only knows about a single .cpp file at a time, so it goes on its merry way creating Foo_one.obj and Foo_two.obj. Then the linker comes in and sees that there are two definitions for Foo::DoAThing(). How does the linker know which one is right?

This, of course, is a violation of the One Definition Rule. We are required to provide only a single definition of Foo::DoAThing(), and yet we provided two.

However, we’re still in pretty safe* territory because your linker will give you an error along the lines of "One or more multiply defined symbols found".

The real nastiness happens when we start introducing templates (or inline functions, but I won’t talk about those so much)

Templates and ODR

When you use a template in C++, the compiler is going to instantiate the definition of that template right then and there. You don’t think about it, but you’re using templates from the standard template library all the time.

// SourceFile1.cpp
std::vector<int> some_data;

// SourceFile2.cpp
std::vector<int> more_data;

Whoops, now we have a definition of an integer std::vector in two different translation units. Isn’t this an error?

It’s not, because the One Definition Rule makes special exceptions for templates (for obvious reasons). You’re allowed to have more than one definition of a template in your program so long as they’re all the same.

How similar qualifies as "the same"? The standard is pretty strict about this. All definitions must consist of "the same sequence of tokens". On top of that, each name used by the template must resolve to the exact same entity. So if your template refers to "Bob", it cannot refer to namespace_a::Bob in one translation unit, and namespace_b::Bob in another. There are other requirements, but you get the gist of it.

Okay, the rules are pretty clear, so why is this a problem?

Think about trying to enforce these requirements from the compiler and linker’s standpoint. The linker would have to pretty deeply examine every single template instantiation to enforce these rules.

At scale, you’re probably using many templates over and over. To save on time, the compiler may cache template instantiations for reuse. The linker, upon seeing the same definition twice, may simply discard the newer one.

So, for performance reasons, the C++ Standard dictates that if you violate the One Definition Rule with respect to templates, the behavior is simply undefined. Since the linker doesn’t care, violations of this rule are silent.

By hand-waving violations of ODR away as undefined, compilers and linkers have an easier job to do, but it makes our job as C++ programmers a little harder (as if C++ development wasn’t hard enough!).

How might I accidentally violate ODR?

There are many ways you may silently run afoul of ODR in C++, but it almost always boils down to this:

You compile different translation units with different compiler options, and then you use them together.

That’s pretty much it! To avoid ODR violations, you need to be very sure that the same type appears the same and is compiled the same way in every place that type is used.

Let’s take a look at some concrete examples of how one might accidentally run afoul of this (seemingly) simple rule.

1. Different alignment

The size of a class in C++ is often thought of as the sum of the sizes of its constituent members, but this is not always the case.

The compiler has free reign to (and usually does) insert padding between class members so that they align to certain byte boundaries in order to make fetching them more efficient.

Take this contrived class for example:

struct RedditPost
{
  int user_id; // who made the post
  double post_time; // seconds since epoch
  int upvotes; // negative implies downvotes
};

If I tell you that sizeof(int) == 4, and sizeof(double) == 8, then what is sizeof(RedditPost)? You might justifiably say it’s 4 + 8 + 4 = 16 bytes, but in reality most compilers will tell you that sizeof(RedditPost) == 24. Why?

Alignment rules are complicated, and I’m not going to go into great detail about them, but I’ll say this: the compiler wants to align individual members of a class for easy access, as well as instances of the class as a whole in order to optimize access in the context of an array of that class type.

(One of the best articles I ever read about alignment was by Eric S. Raymond in his article “The Lost Art of Structure Packing“)

What this boils down to saying is that the compiler treats our RedditPost class like so:

struct RedditPost
{
  int user_id; // 4 bytes (total so far: 4)
  unsigned char padding1[4]; // post_time needs to start an address divisible by 8, so add 4 bytes (total so far: 8)
  double post_time; // 8 bytes (total so far: 16)
  int upvotes; // 4 bytes. no padding because 16 is divisible by 4 (total so far: 20)
  unsigned char padding2[4]; // add 4 bytes to get to 24, which is the nearest integer evenly divisble by 8 (total: 24 bytes)
};

(See the comments within the code for why).

For the sake of completeness, note that we could "fix" things by grouping like types:

struct RedditPost
{
  int user_id; // 4 bytes (total: 4)
  int upvotes; // 4 bytes (total: 8)
  double post_time; // 8 bytes (total 16), no need for padding because we're already on an 8 byte boundary
};

But let’s ignore that for a minute.

The compiler doesn’t have to insert this padding. In fact, you sometimes want to tell the compiler NOT to add any padding so that your class is exactly the size of its constituent members. You might need this if you are modeling a network packet following the standardized description in an official RFC.

For this reason, all compilers have some option for you to tell the compiler not to add any padding. (In MSVC you can do this via the compiler switch /Zp.)

So what happens if we have two interdependent libraries, library1.cpp and library2.cpp, that both use our RedditPost class, and we compile like so:

library1.cpp -o library1.dll /Zp1
library2.cpp -o library2.dll

(Note that MSVC, gcc, and clang all have ways to specify padding directly in your source files. In MSVC this is via #pragma pack(N), and in gcc/clang it’s __attribute__((packed, aligned(1))))

Library1 thinks that RedditPost is 16 bytes and Library2 thinks its 24 bytes. Why is this a problem? If one DLL creates a RedditPost instance that the other is supposed to access, that access will not be correct. The resulting data that is accessed is undefined, and if you’re lucky then you’ll get a segfault instead of silent garbage.

2. Hiding members (and sometimes virtual functions) behind preprocessor symbols

Perhaps some well-intending developer added diagnostic information to a class like so:

struct Person
{
  std::string name;
    
  #if defined(DEBUG)
   int callCount = 0;
  #endif
  void Print()
  {
    std::cout << "name is " << name << std::endl;
    #if defined(DEBUG)
     std::cout << "This function has been called " << callCount << " times\n";
    #endif
  }
};

What happens in this situation?

library1.cpp -o library1.dll
library2.cpp -o library2.dll -DDEBUG

Since we’ve mixed a debug and non-debug compiled library, the sizeof(Person) is different because in library2 it has an extra integer member.

Hiding virtual member functions behind preprocessor flags can be similarly problematic; the compiler has to build a virtual table for your classes that have virtual members, which changes the class’ size and layout. I’ll talk a little more about this in a second.

Recall that the definitions of your classes must be symbol-for-symbol identical irrespective of translation unit. (Mark Nelson wrote about an interesting experience he had with violating this rule in a less malignant way in 2014.)

3. Or you could accidentally compile two translation units with ABI incompatible versions of the same compiler.

4. Or you could accidentally use two different compilers.

5. Or you could accidentally target one translation unit to a different architecture than the other.

6. Or you could compile one where integers are 64 bits and only 32 bits in the other.

7. …and beyond.

The list goes on and on, with the main theme being that the code itself is fine but the build process is broken.

How can I recognize the manifestation of an ODR violation?

Since ODR violations are undefined behavior, there’s no set way that they will surface. Oftentimes it may be a segfault that is indistinguishable from e.g., accessing one past the end of an array.

Like in Mark Nelson’s case, it may just be that you are expecting output A from some function, but instead see output B, and it has nothing to do with branching.

When I hid my virtual function RazzleDazzle from SWIG’s C++ translation unit, it affected how SWIG built its virtual table. The way this manifested was via a call from a base class pointer to a derived class’ overrided method. Instead of entering the RazzleDazzle function, control entered a different virtual function altogether! On top of that, these functions had compatible signatures. The result was that completely unexpected code was being executed, which eventually put an object into a bad state where the runtime finally, thankfully, segfaulted.

What can my tools do for me?

Your compilers have tools to help you out here. Remember earlier how I said that linkers discard multiple definitions as an optimization technique?

In GCC you can basically turn that off with the flto-odr-type-merging switch. And when you turn that switch on, since GCC 5.1 there is the (poorly documented at this time of writing) wodr switch that gets turned on with it to give you the actual ODR violation warnings.

Clang’s wodr switch attempts to do the same thing as GCC’s flag.

While I haven’t used it, Google’s ASAN appears to be the perfect tool for the job, performing granular testing of multiply defined symbols to ensure they’re the same. In MSVC, there is the officially unofficially supported /d1reportSingleClassLayout that prints out a handy ASCII representation of your class layout.

1>class RedditPost size(16):
1> +---
1> 0 | user_id
1> 4 | upvotes
1> 8 | post_time
1> +---

I used this tool successfully to prove my SWIG-related ODR violation, but it’s only really useful after the ODR violation has manifested itself. To detect ODR violations, Microsoft recommends #pragma detect_mismatch (Stephan T Lavavej gives a better description of this pragma around the 29:00 mark in this video). In my opinion, it’s pretty intrusive.

Closing thoughts

The One Definition Rule permeates all the C++ code you write, subconsciously or not, and it’s just so easy to violate it. Fortunately, the compiler can catch a lot of the simple mistakes, but for a staggering amount of code the compiler and linker tell you that you’re on your own. What’s worse is the embarrassing lack of built-in diagnostic support your linkers provide in the current landscape.

Nailing down your build process should make things an order of magnitude safer. If you don’t already have an automated build, get one. I cannot stress enough how important it is to give your build some TLC. If your management balks at hiring at least one FTE to manage your build, remind them that it’s the only reason you have a product, no build = no product.

(*for inlined functions maybe not so safe)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s