This is a tricky topic because indexing an in-memory table isn’t the same as indexing a disk-based table. With disk-based tables we have clustered and nonclustered indexes to work with, and these are the indexes we all know and love. They simply order the data we’re interested in, so it’s easy to understand how SQL Server uses them to seek and scan. However, this is very much not the case with in-memory indexes.
How data is held

This is key to understanding indexes on in-memory tables. I’m not going to go into complete detail here, just give a simplified version of how the data is held so that indexing is easier to understand (this is a complex topic, especially once you consider the implications of row versioning).
Unlike disk-based tables, in-memory data does NOT follow the 8KB page structure; instead it uses a linked-list approach. The best way to think of this is to imagine a ribbon… each record written out in full, back to back, along the ribbon. The table is basically one long string. (As mentioned, it’s actually a linked list, so it holds individual records as strings, each with a pointer to the previous and next record… hence the ribbon works quite well as an analogy, at least in my head.)
Two Types of Index

The two types of index available to an in-memory table are a RANGE index and a HASH index. A RANGE index is actually the default syntax-wise (ie. if you simply write “INDEX ix_myIndex (myColumn)” then you’ll get a RANGE index). It’s also the easiest to understand, so we’ll start with that.
RANGE Index

This is the closest to a traditional index in that it’s effectively a sorting process. When you create a range index you are creating a new linked list of pointers which access the data in the order of the index key. (Under the covers it’s actually a Bw-tree structure, but for our purposes the “sorted ribbon of pointers” picture is close enough.)
The key thing to note is that, because the effect is like a ribbon, this index is best for flowing seeks… by which I mean range scans. If you want to pick a spot and then read through the data in index order, then a range index is for you (for example data BETWEEN x AND y, or data > or < z).
This kind of access pattern is not conducive to a HASH index and, likewise, many individual key lookups are not suited to a RANGE index.
As mentioned, creating one is easy as it’s the default syntax:
create table testMem
(
    id int identity primary key nonclustered, -- indexes on in-memory tables are always nonclustered
    miscData varchar(100),
    index ix_testMem (miscData)               -- no HASH keyword, so this is a RANGE index
)
with (memory_optimized = on, durability = schema_only) -- schema_only: data is not persisted across restarts
go
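Once the table exists, queries like these are what the range index is made for (purely illustrative… this assumes you’ve thrown some sample rows into testMem first):

-- range scan: pick a spot on the ribbon and read along it in key order
select id, miscData
from testMem
where miscData between 'A' and 'M';

-- open-ended ranges work the same way
select id, miscData
from testMem
where miscData > 'X';

By contrast, an exact-match lookup like WHERE miscData = 'some value' is the pattern the HASH index below is built for.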
HASH Index

This is the more complex of the indexes on offer but, once explained, it’s pretty simple at heart. Again, I’ll not go into detail, just explain the concept so that you can get going on implementation.
I think the best way to think of a HASH index is as a grouping index. Let’s take a phonebook as an example… imagine we have 26 million records in a phonebook, evenly spread out, no duplicates (pretty sure you can spot the simple maths I’m going for here)… we want to index the phonebook to search on last name… we can add a HASH index as follows:
create table testPhoneBook
(
    id int identity primary key nonclustered,
    firstName varchar(100),
    lastName varchar(100),
    phoneNo varchar(20),
    index ix_testLastName nonclustered hash (lastName) with (bucket_count = 26) -- one bucket per letter in our toy example
)
with (memory_optimized = on, durability = schema_only)
go
I’m really generalizing here, but we’re going for concept, not exact internals… Imagine that SQL Server runs a hashing algorithm against every value in the test phonebook (like CHECKSUM, but much more advanced) and, conveniently, everything starting with A hashes to a value starting 0xA… and all records starting with B hash to 0xB… (yes, this is absurd, but it’s the principle). Internally SQL Server creates 26 “buckets” (think groups), so all records hashing to 0xA… go into bucket 1, 0xB… into bucket 2, and so on…
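If you want to get a feel for the idea, you can fake a bucket assignment yourself with CHECKSUM and a modulo (to be clear, this is a toy… CHECKSUM is not the hash function the engine actually uses, and the real bucket mapping is internal):

-- toy illustration only: map each surname to one of 26 pretend buckets
select lastName,
       abs(checksum(lastName)) % 26 as pretend_bucket
from testPhoneBook
group by lastName;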
Now… we have a query that says “I want all records where lastName = ‘Smith’”… SQL Server simply hashes ‘Smith’, goes straight to the 0xS bucket… reads through the records in that bucket (checking each one really does match, since different values can share a bucket) and returns our data.
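In T-SQL that’s nothing more exotic than an ordinary equality predicate… the hash index does the rest:

-- point lookup: SQL Server hashes 'Smith', jumps to that bucket, and walks its chain
select firstName, lastName, phoneNo
from testPhoneBook
where lastName = 'Smith';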
Now, that’s not particularly efficient, because I picked only 26 buckets and each one holds a million records… but now consider this:
Our phonebook has just 1 million records and actually only has 1000 different surnames.
index ix_testLastName nonclustered hash (lastName) with (bucket_count = 1000)
Now, despite our 1 million records, every single lastName value can have its own bucket. This means that a value lookup is suddenly incredibly efficient… SQL Server simply hashes the input, goes straight to the bucket, and returns the data. Next to no wasted reads, just results. This makes a HASH index incredibly fast for data point lookups.
Therefore the rule of thumb is to create your HASH index with a bucket count of around one to two times the number of distinct values in your dataset in order to get the best performance (SQL Server rounds whatever you ask for up to the next power of two, so our 1000 actually becomes 1024). Also bear in mind that on SQL Server 2014 you cannot change the index (and therefore the bucket count) without dropping and re-creating the table, so you might want to project for future growth if the number of distinct values isn’t static. From SQL Server 2016 onwards you can rebuild the index with a new bucket count in place.
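Two things worth knowing here (table and index names below are just our running example). First, you can check how healthy your buckets are via the sys.dm_db_xtp_hash_index_stats DMV: lots of empty buckets suggests you over-provisioned, while long average chains suggest too few buckets (or lots of duplicate key values). Second, from SQL Server 2016 you can change the bucket count in place:

-- how full are my hash buckets?
select object_name(hs.object_id) as table_name,
       i.name                    as index_name,
       hs.total_bucket_count,
       hs.empty_bucket_count,
       hs.avg_chain_length,
       hs.max_chain_length
from sys.dm_db_xtp_hash_index_stats hs
join sys.indexes i
    on i.object_id = hs.object_id
    and i.index_id = hs.index_id;

-- SQL Server 2016+: rebuild with a new bucket count, no drop/re-create required
-- (2000 will be rounded up to 2048 internally)
alter table dbo.testPhoneBook
    alter index ix_testLastName
    rebuild with (bucket_count = 2000);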