CINQ: C++ Integrated Query

A library for querying and filtering containers

CINQ (pronounced “sink”) is a clone of C# LINQ written in C++. It uses Concepts to constrain what types can be put into template arguments, improving compile-time errors & ensuring correctness.

I worked on this project with a group of friends for our Language Library Design class project. I was responsible for the overall architecture of the library (for example, figuring out how we would implement the syntax we wanted), and also had a hand in implementing most of the features.

I’ve pasted the design document / project report below. You can also:

View the source on GitHub
Read my tutorial on how to compile GCC with concepts support

CINQ: Design Document

Kevin Chen kxc2103
Jonathan Barrios jeb2239
Jonathan Wong jmw2245

Summary

C++ Integrated Query (CINQ, pronounced “sink”) is a C++ implementation of Microsoft’s Language Integrated Query (LINQ). We implemented most of LINQ’s method syntax, using concepts to ensure that the user passes the correct types. Additionally, our syntax is much cleaner than other C++ LINQ implementations, since we use the . operator instead of >> to chain methods.

We chose this project because it would allow us to use templates and concepts. These C++ features were introduced in class, but having never used them, we felt that we did not have enough familiarity. Through this project, we have become scarily good at weird templating tricks.

Feature set
Interface design
- Combining order_by and then_by to reduce confusion
- Using . instead of >> for method chaining
Error handling
Resource management
Optimizations
Features for the convenience of users
- Overrides that exist for preventing unsigned to size_t conversion errors
- Default versions of methods
Behind the scenes: templating tricks
- Building the order_by comparison function
- Access private members of other template instantiations
Ideas for release 1.2

Feature set

LINQ methods implemented

Select. Creates a new sequence by mapping the source sequence with a user-supplied mapping lambda.
Where. Filters the sequence to remove items not matched by a user-supplied predicate.
Single. Filters the sequence. If there is only one element left, return the item. Otherwise, throw an exception.
Any, All. Checks whether any or all of the elements match a user-supplied predicate.
Min, Max, Sum, Average. If the source sequence is a Number, these methods will compute the min, max, sum, or average. Otherwise, it can compute those values on the results of a user-provided mapping lambda.
Take. Includes only the first N elements. (Convenience method.)
Skip. Includes everything after and including the Nth element. (Convenience method.)
ElementAt, First, Last. Get the element at the specified index.
Concat. Concatenates two sequences.
OrderBy. Sorts the sequence. If a mapping lambda is provided, the sequences will be sorted based on the return value of the lambda. If multiple lambdas are provided, the other lambdas will be used to specify subsequent ordering for the sort.
Reverse. Reverses the order of the sequence.

Concepts improve template error messages

The nature of our project requires a lot of templating, especially since C++ lambdas each have a unique type. Prior to C++ concepts, this would have meant incomprehensible error messages with type names long enough to fill an entire terminal window.

With concepts, the compiler also tells the user which constraint could not be satisfied, making it easy to see which template argument had the wrong type. In our testing, we found that most of these errors are simple mistakes — returning an int from a lambda instead of a bool, for example. Concepts allows the user to spend less time tracking down template problems and more time writing their application.

Prior to this project, it was not easy to get up and running with a C++ concepts toolchain. GCC, Clang, and Microsoft do not have concepts on their feature roadmaps. There are many experimental compiler and standard library implementations, each one implementing a different draft TS. Some compilers did not come with a standard library, and vice versa. We have documented the steps we took to set up a C++ concepts toolchain to save others some trouble.

(Using concepts in our library has also been a great learning experience for us. We were able to apply our knowledge from class, as well as try out a C++ feature that almost nobody else is using — currently, there are only 55 questions tagged c++-concepts on Stack Overflow.)

Lots of tests

We have over 60 unit tests covering the correctness and performance of all methods in the library. This allows us to verify that changes to the code continue to return the correct results, and gives us a way to verify that performance fixes actually improved the library’s performance.

Interface design

One might imagine that implementing an existing library in a new programming language would remove the need to think about design. That is not true. The existing library provides a feature list of things that should be implemented, but the interface still needs to be adapted to match the target language’s conventions.

Combining `order_by` and `then_by` to reduce confusion

The C# version of LINQ allows specifying a primary sort order with OrderBy() and subsequent orderings with ThenBy(). For example:

myList.OrderBy(Person p => p.Age)
      .ThenBy(p => p.LastName)
      .ThenBy(p => p.FirstName);

However, this syntax makes it possible to write nonsensical queries, such as:

myList.OrderBy(Person p => p.Age)
      .ThenBy(p => p.LastName)
      .Where(p => p.Age >= 30)
      .ThenBy(p => p.FirstName);

The C# compiler catches these, but unfortunately, outputs a difficult-to-understand error:

error CS0411: The type arguments for method System.Linq.Enumerable.ThenBy<TSource,TKey>(this System.Linq.IOrderedEnumerable<TSource>, System.Func<TSource,TKey>) cannot be inferred from the usage. Try specifying the type arguments explicitly

To eliminate the possibility of the user making this error, we have combined OrderBy and ThenBy into a single order_by() method that takes multiple mapping lambdas for specifying the orderings:

cinq::from(my_vector)
     .order_by([](person p) { return p.age; },
               [](person p) { return p.last_name; },
               [](person p) { return p.last_name; });

Behind the scenes, we recursively construct a comparison lambda that we pass into std::sort.

Using `.` instead of `>>` for method chaining

The “C++ way” of passing the result of one method to another would be overloading operator>>. istream and ostream are implemented this way. However, we know from using cout that the angle brackets are annoying to type. They are also confusing to C++ novices who think they are bit shift operators — especially bad for a data processing library.

Error handling

In general, returning a magic number to indicate an error is a bad idea. For a data processing library, it makes even less sense — there is a high risk that the error number is misinterpreted as the result of a computation.

Each CINQ method checks its preconditions to catch possible runtime errors by the user, and throws an exception with a helpful message if the precondition is not met. For example, it does not make sense to average() an empty sequence, so the method will throw a length_error. Should the user catch the exception, the enumerable object’s state is still valid and other calls can be made.

Resource management

Based on what we learned in class, we decided to avoid bare pointers and explicit heap memory allocation (with new) wherever possible. The result is that CINQ only allocates heap memory through STL classes such as std::vector, and does not use pointers. We were surprised at how easy it was to write error-free C++ code after adopting these rules.

Consider this CINQ call:

vector<int> result = cinq::from(my_vector)
                          .where([](const int& x) { return x > 0; })
                          .to_vector();

Here is what happens to memory during that call:

from() constructs the enumerable object, which holds information related to the sequence being queried.
The enumerable object retrieves the iterators of my_vector. (This assumes that my_vector is not destroyed or modified during the CINQ call.)
from() returns the enumerable instance by value — no pointers involved. This will use the move constructor on compilers that support it.
where() creates a new vector to store the result of the filtering. The new vector is assigned to the data member of the enumerable, which uses the copy constructor.
where() returns the enumerable instance by value.
to_vector() returns a copy of the data vector.
The enumerable object goes out of scope and its destructor is automatically called.

Our testing strategy includes running the program in Valgrind to ensure we didn’t miss anything.

CINQ does not open files or listen on the network.

Optimizations

Performance comparisons

We performed performance testing on a computer with Intel i7-4850HQ. Equivalent queries were performed with the following languages and libraries:

CINQ on Linux 3.16.0-34-generic
C++ “by hand” on Linux 3.16.0-34-generic (rewriting CINQ queries manually, using the standard library)
Microsoft C# VM on Windows 8.1

We ran the tests located in test_performance.cpp. Here is a summary of what they test:

where(): Find days hotter than 90 degrees F
select(): Mapping to cloud cover
where().average(): Average daily cloud cover
max(): finding max temperature
where().select(): coldest snowy days
where().select().order_by().take(): 5 coldest rainy days

Results (all times are listed in milliseconds)

Test Number	1	2	3	4	5	6
CINQ	142	178	207	212	301	254
C++ by hand	131	160	219	209	248	201
MS C#	603	1085	935	393	530	249

CINQ’s overhead per test (CINQ time / by hand time - 1):

Test Number	1	2	3	4	5	6
Overhead	8%	11%	60%	1%	21%	26%

Template specializations improve performance in specific cases

A wide variety of containers and iterators can be used as inputs to CINQ. This could be a problem when implementing methods like count():

Incrementing the iterator until we hit the end works for all containers, but is inefficient for random-access containers like vector.
On the other hand, not all iterators can get the distance as an O(1) operation.

We overloaded count() using concepts. One requires Random_access_iterator and the other requires only Forward_iterator. If the constraints on the more specific and more efficient specialization cannot be satisfied, the compiler will fall back to the other one.

GCC lambda inlining reduces mapper & predicate overhead

Many of our library’s functions involve calling small, one-line lambdas, whether they are predicates or mappers. At first, this seems extremely inefficient. However, lambdas are more easily optimized by the compiler than function pointers. When compiling with the -S flag to output assembly, we have never observed a situation where the lambdas are not inlined.

What happened to caching?

In our project proposal, we wrote:

A good compromise is to use both solutions: use the original vector until a function involving deletion is called. Then, we can silently copy the vector’s contents into our lazy deletion vector before continuing to execute the query. We will need to measure this to know whether it is faster.

We implemented something like this in CINQ before mostly throwing it away — you can still see the remnants in the ensure_data() method. The idea was to keep a cache of the data in a vector to avoid accessing the original container, which might be a linked list.

As it turns out, this approach was too heavy-handed. For example, in where(), it would cause the method to:

Copy all the elements into the cache.
Read the items out of the cache one by one.
Pass them to the predicate lambda. If it returns true, copy it to the result vector.

In the case of where(), step 1 was unnecessary because the items could just be read out of the source vector. Writing the method without ensure_data() decreased the average running time from 212 ms to 11 ms (95 percent faster) and putting us on par with writing the same code by hand. This experience showed that knowing the exact needs of each method allows us to optimize better.

Features for the convenience of users

Overrides that exist for preventing unsigned to `size_t` conversion errors

The convention for C++ indices is to store them in size_t, which is a typedef for an unsigned integer. However, this leads to some nasty problems where the user passes in a signed value that happens to be negative — the signed value is interpreted as a very large unsigned value:

cinq::from(my_vector).take(-1);

It will most likely cause the program to crash immediately, but depending on the value, it can also create more subtle errors: an invalid result from CINQ or even memory corruption that causes problems later.

Critically, there will be no compile-time warnings. Neither Clang nor GCC gives warnings for the code above — and even if they wanted to, they could not prevent all cases: an end user may enter “-1” into the application at run time.

To solve this problem, any CINQ method that takes a size_t argument also has an int version that throws an exception if the argument is negative, exposing the problem at the source:

enumerable<TSource> take(int count)
{
    if (count >= 0)
        return take((size_t)count);
    else
        throw invalid_argument("cinq: take() was called with negative count");
}

Default versions of methods

Many CINQ methods take mapper lambdas — for example, average() will map the sequence using user-provided lambda, and average the result. But if the user already has a sequence of numbers, it would be very annoying to write:

cinq::from(class_grades).average([](float grade) { return grade; });

So we provide a default version that does the same thing as the code above:

cinq::from(class_grades).average();

Thanks to concepts, we can make the no-lambda overload of average() available only when used with a sequence of numeric type. This is also a much cleaner and less error-prone than the C# implementation of average, which has a manually templated copy of the method for each numeric primitive and object wrapper.

Behind the scenes: templating tricks

Here, we will explain some of our C++ templates that we thought were cool.

Building the `order_by` comparison function

std::sort() expects a comparison function that tells it whether the first element is less than the second. We wanted the user to have the option of providing multiple lambdas, but we also wanted to generate the comparison function at compile time so the compiler would have the opportunity to do some inlining.

Here is the public order_by() method that users of our library call. TFunc... rest is one or multiple lambdas. Each lambda’s return value is what we will use to order the sequence.

template<typename ... TFunc>
enumerable<TSource> order_by(TFunc... rest)
{
    ensure_data();
    std::stable_sort(data.begin(), data.end(), multicmp(rest...));

    return *this;
}

The code calls out to multicmp() to do the heavy lifting and return a lambda that can be passed into std::sort():

template<typename ... TFunc,
         typename TFirst = typename std::tuple_element<0, std::tuple<TFunc...>>::type,
         typename TReturn = typename result_of<TFirst(TElement)>::type>
requires Invokable<TFirst, TElement>() && Totally_ordered<TReturn>()
auto multicmp(TFirst first, TFunc... rest)
{
    return [=](const TElement& a, const TElement& b) -> bool
    {
        auto a_map = first(a);
        auto b_map = first(b);
        if (a_map == b_map) return multicmp(rest...)(a, b);
        else return a_map < b_map;
    };
}

TFirst is the type of the first lambda, TReturn is its return type, and TFunc... is the type of the rest. We require that TFirst can be called on (TFunc... will be checked recursively). And we make sure that TReturn is totally ordered: that is, we can call operator==() and operator<() on it.

In the comparison function we return:

If this mapper causes the items to be equal, recursively ask the next mapper what the answer should be.
Otherwise, return whether the first item is less than the second.

Finally, we have a specialization to handle the base case:

template<typename TFirst, typename TReturn = typename result_of<TFirst(TElement)>::type>
requires Invokable<TFirst, TElement>() && Totally_ordered<TReturn>()
auto multicmp(TFirst first)
{
    return [=](const TElement& a, const TElement& b) -> bool
    {
        return first(a) < first(b);
    };
}

The compiler generates all the necessary templates and lambdas, and has the opportunity to inline the lambdas. An informal test showed that the performance overhead of our sort is less than 10 percent, which indicates that most or all the lambdas were inlined. So we did not look for further optimizations in this method.

Access private members of other template instantiations

Normally, an object of type foo can access private members of other foo objects. However, this is not true for templated classes on some compilers: each template instantiation is treated as its own class.

Templates created this problem, but fortunately, templates can solve it:

template <typename TSourceFriend, typename TElementFriend, typename TIterFriend>
friend class enumerable;

Any time we try to access the private members of another enumerable, the template will friend that instantiation for us. Neat!

Ideas for release 1.2

Parallel queries

No other C++ LINQ implementation allows the user to parallelize a query automatically. For example, the following code would detect the number of CPUs on your computer (using std::thread::hardware_concurrency()) and distribute the filtering so that each core is doing something:

cinq::from(my_vector)
     .as_parallel()
     .where([](person p) { return some_very_expensive_call(p); });

By making it so easy to parallelize queries, CINQ in practice might even be faster than writing the query out by hand. More programmers will be inclined to parallelize if it’s as easy as writing “as_parallel.”

Full implementation of LINQ method syntax

We did not get around to implementing some of the LINQ method syntax, mostly the set operations. It would be good to have a complete implementation, so that we can meet the expectations of programmers already familiar with LINQ.

select_many
join
group_join
distinct
except
intersect
union
group_by
aggregate

Query Syntax

In our opinion, query syntax is harder to use and less flexible than method syntax. So it’s not a high priority. However, it looks cooler — especially if implemented in a language that is not supposed to have an SQL-like syntax:

auto result = ( <from> my_contacts
                <where> [](person &p) { return p.age >= 21 }
                <order_by> ascending ).to_vector();