Qt Signal Slot Thread Performance

For any C++ developer who's used Qt, we've grown to love the Signals/Slots idiom it presents for creating clean Observer code. However, it relied on the Qt Moc pre-compiler tool, which meant any project that wanted to use this feature had to follow along with the Qt idiom, which made Qt applications look somewhat foreign despite being written in C++. In addition, the implementation wasn't type-safe (i.e. no compile-time type checking) and didn't generalize to any callable target (you have to extend QObject and declare a slot using Qt's syntax, and slots can only return void).

Since then, there have been multiple implementations of Signals/Slots. Some notable ones are listed below:

  • Boost Signals. Not thread safe, performance wasn't great, now deprecated in favor of Boost Signals2. Licensed under the Boost License.
  • Boost Signals2. Thread-safe upgrade of Boost Signals. Others have complained about its performance, but my tests seem to show it's at least decent. Licensed under the Boost License.
  • Libsigc++. Supposedly decently quick, but not thread safe. I think this is also a near-fully featured implementation like Boost Signals/Signals2. Licensed under the LGPL, making use somewhat restricted.
  • libtscb. A thread-safe, fairly quick implementation. However, it skimps on features (I think it offers similar features to Qt's implementation). Also licensed under the LGPL.
  • C++11-based Implementation. This is actually another blog which sought to implement Signals/Slots using new features brought by C++11, namely variadic templates, std::function, and possibly more. This is CC0 licensed (public domain). Probably one of the fastest implementations I have seen, but it is not thread safe.

I was wondering if I could write a full-featured implementation like Boost Signals2, but with better performance like libtscb. I'll be using some C++11 features like the last implementation, notably atomics, std::function, and variadic templates. I'm also using Boost shared_ptr/weak_ptr, Boost mutex, and Boost shared_ptr atomics. These libraries are being used because the compiler I'm currently using doesn't have the standard variants implemented (Mingw w64 gcc-4.8.1).

For the most part, I was able to implement (or at least can see a straightforward implementation of) nearly all of the features provided by Boost Signals2. There are some semantic changes (notably the interface for combiners is different), and I was unable to capture the O(log(n)) performance of inserting group-ordered slots. My implementation will likely have O(n) group insertion.

Some basic observations I've noticed about how Signals/Slots are usually used:

  • Connecting/disconnecting slots is usually not that time critical.
  • Most signals either have no slots, or very few slots connected.
  • Signals may be invoked from multiple threads, and usually can be evaluated asynchronously. It seems like slots usually can be evaluated asynchronously, but they could be strongly ordered, too. In any case, only forward iteration is required for emitting a signal.
  • Emitting should be as fast as possible.

Thinking about how to best reach these goals, I decided to use an implementation which guarantees a wait-free singly linked list for read operations. The back-end implementation is still a doubly linked list, and only one writer is allowed at a time.

Memory management is taken care of using shared_ptr/weak_ptr. I wanted to see if I could implement this using only standard C++11, and theoretically I could have, but unfortunately I don't have access to a compiler with all the necessary features. Fortunately, I'm only using the Boost equivalents in a 'standards-compliant' manner, so as time goes on, changing these to a pure C++11 implementation is a find/replace operation.

Basic Slot Class Structure

I'm utilizing a trick for transforming multiple inheritance into single inheritance. There's not really much good reason for doing so other than code bloat issues with Visual Studio. Currently my implementation uses standard C++11 features not supported by any release of Visual Studio (2012 or older), but in the future I want to work towards C++03 compatibility using various Boost libraries (since I'm already using a few for this purpose).

This is not the full code, but a near-bare bones implementation which at least shows the usage and some implementation details.

Because I am limiting myself to single inheritance chains only, I have to carefully consider what order classes get inherited (especially for classes which could be instantiated). Eventually, I ended up with the following:

  • slot_base -> slot_wrapper
  • slot -> callable -> slot_base -> slot_wrapper
  • grouped_slot -> slot -> groupable -> callable -> slot_base -> slot_wrapper
  • extended_slot -> callable -> slot_base -> slot_wrapper
  • grouped_extended_slot -> extended_slot -> groupable -> callable -> slot_base -> slot_wrapper
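
A minimal sketch of how two of these chains could be linearized, assuming each layer takes the class it extends as a template parameter. The class names come from the list above; the members (connected, group, fn) and the default template argument are hypothetical and only there to make the example compile.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

// Root of every chain. The contents are assumed for this sketch.
struct slot_wrapper {
    virtual ~slot_wrapper() {}
};

// Each layer inherits from its template parameter, so stacking layers
// produces one linear chain instead of multiple inheritance.
template <class Base>
struct slot_base : Base {
    bool connected = true;   // hypothetical member
};

template <class Signature, class Base>
struct callable;             // primary template, intentionally undefined

template <class R, class... Args, class Base>
struct callable<R(Args...), Base> : Base {
    std::function<R(Args...)> fn;   // the stored target

    R operator()(Args... args) const {
        assert(fn && "slot has no target");
        return fn(args...);
    }
};

template <class Base>
struct groupable : Base {
    int group = 0;           // hypothetical ordering key
};

// slot -> callable -> slot_base -> slot_wrapper
template <class Signature,
          class Base = callable<Signature, slot_base<slot_wrapper>>>
struct slot : Base {};

// grouped_slot -> slot -> groupable -> callable -> slot_base -> slot_wrapper
template <class Signature>
struct grouped_slot
    : slot<Signature, groupable<callable<Signature, slot_base<slot_wrapper>>>> {};

static_assert(std::is_base_of<slot_wrapper, grouped_slot<void()>>::value,
              "every chain ends in slot_wrapper");
```

The order of the layers is fixed by how the templates are nested, which is why the order has to be chosen carefully for each instantiable class.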

Basic Signal Structure

Update: While I was implementing this I had assumed that atomic shared_ptr was lock-free. After looking into the current implementations, they currently use spinlocks on each shared_ptr. In other words, not lock-free. However, in theory my implementation could take advantage of true lock-free atomic shared_ptr's in the future.

The basic structure of the Signal class is a doubly linked list, but internally the linked list ensures that, atomically, there is always a valid singly linked list from any node to the tail. I am unaware of any generic lock-free doubly linked list implementation which has all of the features I need, so the trade-off I'm making is that only one writer is allowed at a time. Memory management is handled entirely by shared_ptr's internal reference counting mechanisms. There might be some more efficient atomic reference counting method I could implement, but for now shared_ptr is fast enough.

In order to make signal emission lock-free, I decided to use head and tail nodes. I also place a shared_ptr to a given node which is where grouped slot insertion can begin searching from, but this only points to an existing node. I don't know if I can implement some sort of interleaved binary tree/singly linked list to allow better grouped slot insertion, but for now I'm going to assume that the somewhat crummy group insertion performance is acceptable since signal emission is the primary optimization goal.

Basic Algorithm for List Write Operations

  1. Do non-list dependent operations (mainly allocating memory for nodes and building the connection object).
  2. Acquire unique write lock.
  3. Perform all necessary list modification operations. Operations which change the implicit singly linked list must be atomic; the others need not be.
  4. Release lock.

Here's an example implementation for push_front (connect new slot at front of list):
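
A minimal sketch of such a push_front, not the full implementation. It assumes std::shared_ptr with the C++11 atomic free functions (std::atomic_load, std::atomic_store) in place of the Boost equivalents, slots of type void(), and hypothetical names node, slot_list, write_mutex_ and size.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

struct node {
    std::function<void()> fn;     // slot target; left empty for the sentinels
    std::shared_ptr<node> next;   // forward link, read concurrently by emitters
    std::weak_ptr<node> prev;     // back link, only ever touched by writers
};

class slot_list {
public:
    slot_list()
        : head_(std::make_shared<node>()), tail_(std::make_shared<node>()) {
        head_->next = tail_;
        tail_->prev = head_;
    }

    std::shared_ptr<node> push_front(std::function<void()> fn) {
        // 1. List-independent work: allocate and fill the node.
        std::shared_ptr<node> n = std::make_shared<node>();
        n->fn = std::move(fn);

        // 2. Acquire the unique write lock (released when `lock` is destroyed).
        std::lock_guard<std::mutex> lock(write_mutex_);

        // 3. Readers cannot reach `n` yet and never follow `prev`,
        //    so these writes do not need to be atomic.
        std::shared_ptr<node> first = std::atomic_load(&head_->next);
        assert(first && "the tail sentinel guarantees a successor");
        n->next = first;
        n->prev = head_;
        first->prev = n;

        //    This store changes the forward list that readers walk: atomic.
        std::atomic_store(&head_->next, n);
        return n;
    }

    // Number of connected slots, counted the way a reader walks the list.
    std::size_t size() const {
        std::size_t count = 0;
        for (std::shared_ptr<node> cur = std::atomic_load(&head_->next);
             cur != tail_; cur = std::atomic_load(&cur->next)) {
            ++count;
        }
        return count;
    }

private:
    std::shared_ptr<node> head_;
    std::shared_ptr<node> tail_;
    std::mutex write_mutex_;
};
```

A reader that loads head_->next before the final store sees the old list, and one that loads it afterwards sees the new node already linked to its successor, so the forward list is valid at every instant.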

Modified Iterator

One problem I encountered while working on the combiner call implementation was how to efficiently pass iterators to the combiner. The solution I came up with means the iterator object encapsulates all state internally, in particular whether it is at the end node. This means that no separate end iterator is passed. I think the problem could be resolved using a single inheritance chain similar to the one I used for the slot classes, though I haven't tried this out yet.

I am using a std::tuple to store parameters for later execution, and unpacking them using the technique Johannes Schaub presented here.
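
For reference, a minimal C++11 sketch of the widely used index-sequence approach to this kind of unpacking. The helper names seq, gens, call_impl and call_with_tuple are chosen here for illustration, and the sketch assumes the target is held in a std::function.

```cpp
#include <cstddef>
#include <functional>
#include <tuple>

// Compile-time list of indices 0..N-1.
template <std::size_t... Is>
struct seq {};

// gens<N>::type is seq<0, 1, ..., N-1>, built by counting down from N.
template <std::size_t N, std::size_t... Is>
struct gens : gens<N - 1, N - 1, Is...> {};

template <std::size_t... Is>
struct gens<0, Is...> {
    typedef seq<Is...> type;
};

// Expands std::get<Is>(args)... into one argument per tuple element.
template <class R, class... Args, std::size_t... Is>
R call_impl(const std::function<R(Args...)>& fn,
            const std::tuple<Args...>& args, seq<Is...>) {
    return fn(std::get<Is>(args)...);
}

// Invoke fn with the arguments stored earlier in the tuple.
template <class R, class... Args>
R call_with_tuple(const std::function<R(Args...)>& fn,
                  const std::tuple<Args...>& args) {
    return call_impl(fn, args, typename gens<sizeof...(Args)>::type());
}
```

The tuple is filled when the signal is emitted, and call_with_tuple is used later, once per slot, to replay the same arguments.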

Signal Emit Algorithm

Implementing the signal emit operation is fairly simple. The only thing to keep in mind is that operations which try to move to the next node should be atomic. I've omitted a sample Combiner implementation because there's not much different about it (other than keeping in mind that an iterator currently encapsulates the end state).
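
A minimal sketch of the emit walk, without a combiner. It assumes std::shared_ptr with std::atomic_load in place of the Boost equivalents and a list bracketed by head and tail sentinel nodes; slot_node and emit_signal are hypothetical names.

```cpp
#include <functional>
#include <memory>

template <class... Args>
struct slot_node {
    std::function<void(Args...)> fn;   // empty for the sentinels
    std::shared_ptr<slot_node> next;   // writers publish changes atomically
};

// Walk forward from head to tail, calling every slot on the way.
template <class... Args>
void emit_signal(const std::shared_ptr<slot_node<Args...>>& head,
                 const std::shared_ptr<slot_node<Args...>>& tail,
                 Args... args) {
    // Holding a shared_ptr keeps the current node alive even if a writer
    // unlinks it while the walk is in progress.
    std::shared_ptr<slot_node<Args...>> cur = std::atomic_load(&head->next);
    while (cur != tail) {
        if (cur->fn) {
            cur->fn(args...);
        }
        // Moving to the next node is the step that has to be atomic.
        cur = std::atomic_load(&cur->next);
    }
}
```

No lock is taken on this path; the only synchronization is the atomic load of each next pointer.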

For my testing, I'm going to benchmark Qt Signals/Slots, Boost Signals2, and this implementation. The slot being benchmarked:

The hardware running the test is an Intel i5-m430 laptop running Windows 7 x64. For the Qt test I'm using VS2012 x64, and for the other tests I'm using Mingw-w64 gcc 4.8.1. Basically, I didn't want to re-compile Qt with mingw since I already had it built for VS2012. For the record, I'm testing with Qt5* and Boost 1.54.0*.

Note*: well, actually I'm using trunk repository builds, but I'm pretty sure there's no difference between the two (it would be silly for Qt to significantly muck around with the Signals/Slots implementation, and I haven't observed any changes to Boost Signals2).

I'm going to time how long it takes to emit signals with various numbers of slots (yes, I am averaging over many signal emits). The std::function calls row is repeated calls to a std::function wrapping this handler, more or less used to determine how much time is spent doing actual slot work.
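
A minimal sketch of this kind of averaged timing, assuming std::chrono::steady_clock as the time source. The helper name average_ns is hypothetical, and this is not the harness that produced the numbers below.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Average wall-clock cost of one call to `op`, in nanoseconds.
template <class Op>
double average_ns(Op op, std::size_t iterations) {
    assert(iterations > 0);
    typedef std::chrono::steady_clock clock;   // monotonic, suited to intervals
    const clock::time_point start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        op();
    }
    const clock::time_point stop = clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Here op would be a lambda that emits the signal once. Timing the whole loop and dividing keeps the clock overhead from dominating calls that take only a few nanoseconds.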

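| Test Case | No Slots (ns) | 1 Slot (ns) | 2 Slots (ns) | 4 Slots (ns) | 8 Slots (ns) | 16 Slots (ns) | 32 Slots (ns) | 64 Slots (ns) | Static overhead (ns) | Dynamic overhead (ns/slot) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| std::function calls | 0 | 3.5 | 7.5 | 15 | 29 | 60 | 116 | 232 | 0 | 3.6 |
| Qt Signals/Slots | 14 | 75 | 115 | 177 | 312 | 581 | 1123 | 2202 | 40 | 33.8 |
| Boost Signals2 | 75 | 137 | 183 | 253 | 398 | 705 | 1280 | 2500 | 75 | 37 |
| My Implementation | 27 | 57 | 88 | 150 | 272 | 525 | 1010 | 1980 | 27 | 30 |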
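
One way to read the last two columns, assuming they come from a straight-line fit of emit time against the number of connected slots n (the fitting method itself is not stated):

```latex
\begin{align*}
t(n) &\approx t_{\text{static}} + n \cdot t_{\text{dynamic}} \\
\text{My Implementation, } n = 64:\quad t &\approx 27 + 64 \cdot 30 = 1947\ \text{ns} \quad (\text{measured: } 1980\ \text{ns}) \\
\text{Qt Signals/Slots, } n = 64:\quad t &\approx 40 + 64 \cdot 33.8 = 2203.2\ \text{ns} \quad (\text{measured: } 2202\ \text{ns})
\end{align*}
```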

So what can we tell from these results? Well, unfortunately it's difficult to know exactly how Qt compares to the others because of the different compilers. However, it is probably safe to say Qt has some short-circuit mechanism for handling empty signals. I would be curious to see how well these implementations would fare under high contention/parallel loads. I suspect my implementation would handle multiple emits with a single writer quite easily because this is what the implementation targets. I don't know how Qt or Boost Signals2 are implemented under the hood, but I suspect both of them use some sort of locking mechanism on emits.

I don't know exactly how badly Boost Signals2 performed in the past, but as far as I can tell it is probably 'acceptable' most of the time. However, it does perform poorly when there are no slots connected. I'm going to keep working on this implementation and will eventually release it when it's done (right now it basically works, but many features are unimplemented). The performance is not as good as is possible with non-thread-safe implementations, but there are nice gains. I may even try submitting it for inclusion in Boost (either as signals3, or using parts to improve signals2).