Freitag, 6. Januar 2017

The fastest MPL in the west

Just like the revolver swinging cowboys the title is eluding to I am uneducated, I did not shower today and I am making big claims which may more may not be shot down at a moments notice.

Jokes aside readers of my blog have no doubt noticed I have been putting a lot of thought into how to make TMP faster and I think I have reached a point where its worth dumping my brain again. My previous posts on meta monads are required reading in order to understand this one. This time I have been thinking about how best to compose operations and how best to store data.

It seems that the cost creating and passing types around interestingly depends on whether the type is empty or not:

template<typename T, typename U>
struct thing{}; //fast

template<typename T, typename U>
struct thing{
    using ding = T;  //slow even if
    using bat = U;   //never accessed

I may be wrong as I am not a compiler guy but it seems to me the compiler does not need to do as much book keeping regarding point of instantiation if there are no effects of the point of instantiation. Essentially what does the point of instantiation really mean for an empty struct and does it really ever need to be instantiated? On the other hand nested values save unpacking. Its important however to remember that the cost of instantiating a type is dependent on the complexity of that type.

As far as my vague understanding of compiler internals goes the instantiation of a type requires both allocation of the new type and indexing to the types of its parameters. Unpacking types wrapped in other types always creates a new type and this process is costly (like resolving 100 aliases or something).

Therefore it is advantageous to unpack as little as possible and to unpack types which are as simple (as in few parameters) as possible. Bare in mind I'm oversimplifying a bit.

So if we are unpacking as seldom as possible and not making any new types how are we getting any work done? With aliases of course ;) In my previous posts I have shown how we can implement algorithms which work on data using aliases, what about composition of algorithms though?

We already compose many algorithms in the implementation of brigand, for example a remove_if is a transform and then a join, partition is essentially two remove_if's. Users may also want to compose a transform and then a fold or a drop and then find or any number of other combinations.

If we think about the last step of a drop for example, all we are really doing is wrapping the rest of the types which we have in a list, that listification could just as well be a template template parameter which defaults to a list but could also be a something else. We can also compose the control structure from parts at essentially no cost (some benchmarks were actually faster with the composition):

struct makeControl:Ts...{};

struct stepIsT{
    template<typename T, typename...Ts>
    using step = T;
template<template<typename...> class F>
struct doneIsT{
    using done = F<Ts...>;
struct doneIsFirst{
    template<typename T, typename...Ts>
    using done = T;

struct selectIsCielLog2{
    static constexpr int select(const int i) {
        return ciel_log2(i);

//rather than supplying drop_control we can compose our control structure
template<typename L, int I>
using drop = typename wrap_generator<L, makeControl<stepIsT, doneIsT<list>, selectIsCielLog2>>::type; 

Now if we want to make at using composition all we need to do is swap out the done policy.

And now for the kicker, since done receives an unpacked list and we can easily swap out the done policy and thus allow the user to specify what algorithm she wants called at that point. This allows us to chain algorithms without the step of packing and unpacking in between. There does need to be a sexier calling syntax including the ability to bind args for the next call but its a good start.

The one major limiting factor is still instantiation depth as the packing and unpacking we did in the past also serves as a trampoline. However chaining a few logarithmic algorithms should still stay well below the 256 boundary even for relatively large lists.

One could even consider keeping a running count of nesting depth and allow the algorithms to return "partial work" upon nearing 256, return to some generic trampoline and then continue without the user needing to worry about it. I'm not sure if that would complicate things too much though from a library implementation standpoint.