It does mean a thing

Some notes on software and beyond

Opaque Pointers Revisited

The opaque pointer (aka d-pointer or pimpl) is a great C++ design pattern useful for long-lived binary interface compatibility, a properly hidden implementation and faster compilation. However, it has an inherent performance drawback, which can become critical if you care about efficiency. In this post I propose an approach that makes d-pointers less binary compatible but sweeps away their inefficiency.

Most of the time in C++ you don't really hide the members of your class, as they are visible (even though not accessible) to all users of that class. This means that any interface change, like a new private member function or a renamed member variable, propagates to every unit that depends on this interface. In practice, this breaks backward compatibility and leads to awful (re)compilation times.
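
For illustration, here is what a conventional header looks like before applying the pattern — a minimal sketch with made-up member names (filterWater, pearlCount):

class Clam
{
public:
  Clam();
private:
  void filterWater();   // renaming this private function forces all dependents to recompile
  int pearlCount;       // so does adding, removing or changing a field
};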

This issue can be gracefully resolved with d-pointers. Examples are always better, so let's consider a class named Clam that doesn't want to expose its data and internal functions. The declaration (.hpp) of Clam is:

class Clam
{
public:
  Clam();
  ~Clam();
private:
  class Self;
  Self* self;
};

Here we declared that Clam has a member that is a pointer to some incomplete class Self — from now on, any members placed in Self are really private and won't be visible anywhere but in Clam's implementation scope. Note that we declare the constructor and destructor ourselves and define them in the source file: they create and delete the Self instance, which requires the complete definition of Self — something we don't want in the header. The source file (.cpp) would be:

class Clam::Self
{
  // private things (pearls?) here
};

Clam::Clam() : self(new Clam::Self())
{
}

Clam::~Clam()
{
  delete self;
}

As I noted before, this allows you to alter the Clam class with no changes to the header — we just modify Self and its handling inside Clam's source file (see the sketch after the list below). Such an approach does a lot of good:

  • This speeds up compilation: a) units that depend on Clam don't need to be recompiled when we change its implementation; b) members of Clam are now totally hidden and not propagated to its users. This is important for large projects with a great number of classes (like Qt, which uses this pattern extensively).

  • This introduces binary compatibility: we can alter the implementation of Clam with no changes to its binary interface, so we can expose the class and keep the implementation interchangeable (e.g. different versions remain binary compatible).
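
For example, extending Clam's internals becomes a .cpp-only change — a sketch with made-up names (growPearl, pearls):

// clam.cpp — new state and logic go into Self; clam.hpp stays untouched
class Clam::Self
{
public:
  void growPearl() { pearls++; }   // new internal functionality, invisible to users
private:
  int pearls = 0;                  // new state; dependents don't recompile
};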

The aforementioned drawback is pretty clear — the extra level of indirection makes every member access or member function call go through a pointer. Furthermore, the Self instance is allocated somewhere on the heap, which leads to unnecessary cache misses.

I think I have a (more or less) sane solution that removes this overhead (I haven't found anything like it except Fast Pimpl, which is a bit different). Let me call it EmbeddedDPtr. The name comes from the idea of embedding the object into a fixed-size buffer inside the class. This way we avoid any heap allocation and make the best use of the cache. But how do we know the size if we want to keep the declaration of Self incomplete? We can't know its sizeof. What we can do, though, is provide enough space and check with static_assert that the object really fits into the buffer. The implementation of this idea is the following code:

#include <cstddef>  // std::max_align_t
#include <new>      // placement new

template <typename T, int size>
class EmbeddedDPtr
{
public:
  EmbeddedDPtr() : storage()
  {
      // Checked here rather than at class scope so that T may stay
      // incomplete in the header; the assert fires where T is constructed.
      static_assert(sizeof(T) <= size, "Can't embed object, increase size");
      new (reinterpret_cast<void*>(storage)) T();
  }
  EmbeddedDPtr(const EmbeddedDPtr<T, size>& other) : storage()
  {
      new (reinterpret_cast<void*>(storage)) T(reinterpret_cast<const T&>(other.storage));
  }
  EmbeddedDPtr& operator=(const EmbeddedDPtr<T, size>& other) = delete;
  ~EmbeddedDPtr()
  {
      reinterpret_cast<T*>(storage)->~T();
  }
  T* operator->() const
  {
      return reinterpret_cast<T*>(const_cast<char*>(storage));
  }
private:
  // max_align_t alignment keeps the buffer suitable for any embedded T.
  alignas(std::max_align_t) char storage[size];
};
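
To apply it to the example above, the Clam header might look like this — a sketch, where the size 16 is just a guess that the static_assert verifies once Self is complete in the source file:

#include "embedded_dptr.hpp"   // hypothetical header where EmbeddedDPtr lives

class Clam
{
public:
  Clam();    // still declared here and defined in the .cpp,
  ~Clam();   // where the complete Self is available
private:
  class Self;
  EmbeddedDPtr<Self, 16> self;   // storage lives inside Clam itself, no heap allocation
};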

Unfortunately, this approach restricts binary compatibility, as an altered Self must still fit into the buffer. Nevertheless, the compilation speedup and cleaner headers are still here. Having to maintain the storage size by hand is indeed boring, but the static_assert keeps you safe from memory errors. So, as usual, it's a trade-off.

The other (good) side of this trade-off is performance. To check whether EmbeddedDPtr is faster than a usual d-pointer, I implemented a simple benchmark. It consists of two classes:

class WithEmbeddedDPtr
{
public:
  // ctor, dtor
  void doStuff();
private:
  class Self;
  EmbeddedDPtr<Self, 4> self;
};
#include <memory>

class WithUniqueDPtr
{
public:
  // ctor, dtor
  void doStuff();
private:
  class Self;
  std::unique_ptr<Self> self;
};

Both the WithEmbeddedDPtr::Self and WithUniqueDPtr::Self private classes do the same simple thing — they keep a counter that is incremented each time doStuff is called. This is probably the simplest operation that could still give us a good estimate of the overhead. WithEmbeddedDPtr::Self looks like:

class WithEmbeddedDPtr::Self
{
public:
  void doStuff()
  {
      i++;
  }
private:
  int i = 0;
};
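
The forwarding boilerplate isn't shown above; here is a sketch of how it might look (WithUniqueDPtr::Self is identical to the class above, and all definitions live in the .cpp, where Self is complete):

WithEmbeddedDPtr::WithEmbeddedDPtr() = default;   // EmbeddedDPtr constructs Self in place
WithEmbeddedDPtr::~WithEmbeddedDPtr() = default;

void WithEmbeddedDPtr::doStuff()
{
  self->doStuff();   // one indirection into in-place storage
}

WithUniqueDPtr::WithUniqueDPtr() : self(new Self()) {}
WithUniqueDPtr::~WithUniqueDPtr() = default;

void WithUniqueDPtr::doStuff()
{
  self->doStuff();   // one indirection into a heap-allocated Self
}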

Finally, everything is ready for the benchmark cases. They are pretty simple:

const int REPEATS = 10000;
const int RUNS = 100;

BENCHMARK(DPtr, Embedded, RUNS, REPEATS)
{
  WithEmbeddedDPtr object;
  object.doStuff();
}

BENCHMARK(DPtr, Unique, RUNS, REPEATS)
{
  WithUniqueDPtr object;
  object.doStuff();
}

I ran the benchmarks on my machine and got the following output:

[ RUN       ] DPtr.Embedded (100 runs, 10000 iterations per run)
[     DONE  ] DPtr.Embedded (14.026703 ms)
[   RUNS    ]        Average time: 140.267 us
...

[ RUN       ] DPtr.Unique (100 runs, 10000 iterations per run)
[     DONE  ] DPtr.Unique (49.996157 ms)
[   RUNS    ]        Average time: 499.962 us
...

This ~3.5x speedup could be pretty significant in some cases. It is quite easy to explain with just two facts: first, the embedded d-pointer avoids heap allocations; second, such code is much more cache-friendly, with the implementation located right inside the main object.

I am still thinking about possible approaches to ease the pain of providing the exact buffer size. It's pretty clear, though, that we can't make it dynamic enough to be really easy to use.
