In my previous post, I told you about what it meant for an index in SQL Server to be ‘clustered’. Usually, when we talk about SQL Server indexes, we are talking about indexes organized as B-trees (or technically, B+-trees, but we won’t go into the difference here.) For these indexes there are two basic properties that set clustered indexes apart from nonclustered.
First, the index leaf level IS the data. It is not a copy of the data or pointers to the data. When you create a clustered index, your data is reorganized and the rows and pages become the leaf level of the index.
Second, the data is sorted. As I discussed last time, this doesn’t mean it is necessarily physically sorted in consecutive slots on the pages, in contiguous pages in the file. But by following the page pointers from the first page to the last page, SQL Server can retrieve the data in order without having to perform a sort operation.
Here’s a graphic from the SQL Server documentation showing you the general structure of clustered B-tree indexes.
But not all clustered indexes have these properties. Let me tell you a bit about columnstore indexes. I won’t be able to tell you EVERYTHING about columnstore indexes, but I point you to another awesome resource for that. I’ll just tell you the basics.
SQL Server introduced a completely different way of organizing your data in SQL Server 2012 with nonclustered columnstore indexes. The name comes from the fact that the data is organized by column, not by row, as our ‘regular’ (B-tree) indexes are. Now that we have columnstore indexes, our original B-tree indexes are sometimes referred to as rowstore indexes.
Rows from a partition of the indexed table are divided into rowgroups of about one million rows each. The actual maximum is 220 or 1,048,576 rows per group. In other words, columnstore indexes were designed to help on big tables.
Within each rowgroup, the 2^20 rows for each index column are stored separately, in a structure called a segment. Each segment is stored as a Large Object (LOB) value, within the LOB allocation unit for the partition. Each value in the segment is encoded, and then the segment is compressed before being stored as a LOB. Segments are the basic unit of manipulation and can be read and written separately.
SQL Server will attempt to put the full 2^20 values in each rowgroup, and place however many rows are leftover in a final rowgroup. For example, if there are exactly 10 million rows in a table, there would be 9 rowgroups of 1,048,576 values and one of 562,816 values. However, because the index is usually built in parallel, with each thread processing its own subset of rows, there may be multiple rowgroups with fewer than the full 1,048,576 values.
In SQL Server 2012, the only columnstore indexes were nonclustered. These were stored separately from your data, and contained some or all of the columns from the table. The CREATE INDEX statement listed which columns you wanted to become part of the rowgroups and segments.
SQL Server 2014 introduced CLUSTERED columnstore indexes. When I first read about them, I actually didn’t like that they were called CLUSTERED because they didn’t adhere to both of the properties mentioned above. The meet only the first criteria: the clustered columnstore structure IS the table. Once you create the clustered columnstore index on a table, the table’s data is completely re-organized into the rowgroup and segment organization. There is no other copy of the data.
Here is a graphic of what happens when a clustered columnstore index is built:
This graphic assumes there are four columns in the table: A, B, C and D. The table is divided into rowgroups, and in this picture, we see three rowgroups. Then each column in each rowgroup is encoded and compressed and becomes a segment.
There is no other copy of this table’s data beyond the compressed rowgroups.
Because all the columns are always part of the clustered columnstore index, the syntax for creating the index is incredibly simple:
CREATE CLUSTERED COLUMNSTORE INDEX ON
That’s all you need to do. You don’t specify a column list because all the columns are included. But remember what else a column list specifies in a rowstore index. It specifies the key column or columns, which are the columns that are sorted. There is no sorting at all in a columnstore index, neither logical nor physical sorting. Columnstore indexes are intended to be used when you want to run reports or return summary data from huge numbers of rows. We don’t need sorted data to do that usually.
So what does clustered mean? It does not mean sorted. There is no sorting for columnstore indexes at all. A clustered columnstore index is just a different way of organizing the data in your table: column by column instead of row by row. Should it be called a ‘clustered’ index? That’s not a question we’re going to answer here. It’s just what it is.
If you want more information about all the details of columnstore indexes, including the internals of the encoding and compression, plus what’s new in SQL Server 2016, I strongly recommend Niko Neugebauer’s excellent blog series. He’s written 107 posts (so far) on columnstore indexes! http://www.nikoport.com/columnstore/