Worth Corner, Crawley, RH10 7SL

Support Hours - Mon-Sat, 8.00-18.00

Welcome to our SQL Server blog

This blog was created in order to pass on some of our knowledge and advice in the hope that the community might find it useful.

Obviously Microsoft SQL Server is our area of expertise and therefore this is our most thorough blog with everything from the basic through to the the incredibly advanced and, as such, we should have something for everyone within our pages.


Indexing In-Memory Data
Published: Dec 02, 2020
This is a tricky topic because indexing an in-memory table isn’t the same as indexing a disk-based table. With disk based tables we have clustered or nonclustered indexes to work with and these are the indexes we all know and love. They simply order the data we’re interested in and therefore we can easily understand how SQL uses them to seek and scan. However, this is very much not the case with in-memory indexes.

How data is held

This is a key part to understanding indexes on in-memory tables. I’m not going to go into complete detail here, but give a simple version of how the data is held so that indexing is easier to understand (as this is a complex topic, especially when considering implications of versioning).

Unlike disk-based tables we do NOT follow the 8k page structure with in-memory data, instead opting for a linked list approach. The best way to think of this is to imagine a ribbon… each record being written out in full back to back along the ribbon. Therefore the table is basically one long string. (As mentioned, it’s actually a linked list therefore it holds individual records as strings, each with a pointer to the previous and next record… hence why a ribbon works quite well as an analogy, at least in my head)


Two Types of Index

The two types of index available to an in-memory table are a RANGE index and a HASH index. A RANGE index is actually the default syntax-wise (ie. If you simply write “INDEX ix_myIndex” then you’ll get a RANGE index), it’s also the easiest to understand so we’ll start with that.

RANGE Index

This is the closest to a traditional index in that it’s effectively a sorting process. When you create a range index you are creating a new linked list of pointers which access the data in the order of the index key.

The key part to note with this is that because the effect is like a ribbon, this is best for flowing seeks… what I mean by that is range scans. If you want to pick a spot and then read through the data in index order then a range index is for you (for example data BETWEEN x and y, or data > or < z).

This kind of access pattern is not conducive to a HASH index and, likewise, many individual key lookups are not suited to a RANGE index.

As mentioned, creating one is easy as it’s the default syntax:

create table testMem
(
      
id int identity primary key nonclustered,
      
miscData varchar(100),
      
index ix_testMem (miscData)
)
with (memory_optimized = on, durability=schema_only)
go


HASH Index

This is the more complex of the indexes on offer but, once explained, it’s pretty simple at heart. Again, I’ll not go into detail, just explain the concept so that you can get going on implementation.

I think the best way to think of a HASH index is as a grouping index. Let’s take a phonebook as an example… imagine we have 26 million records in a phonebook, evenly spread out, no duplicates (pretty sure you can spot the simple maths I’m going for here)… we want to index the phonebook to search on last name… we can add a HASH index as follows:

create table testPhoneBook
(
      
id int identity primary key nonclustered,
      
firstName varchar(100),
      
lastName varchar(100),
      
phoneNo varchar(20),
      
index ix_testLastName nonclustered hash (lastName) with (bucket_count=26)
)
with (memory_optimized = on, durability=schema_only)
go


I’m really generalizing here but we’re going for concept not exact internals… Imagine that SQL Server was to run a hashing algorithm against every value in the test phonebook (like checksum, but much more advanced) and, conveniently, everything starting with A hashed to a value starting 0xA… and all records starting with B hashed to 0xB… (yes, this is absurd but it’s the principle)… internally SQL Server will create 26 “buckets” (think groups) and therefore all records hashing to 0xA… would go into bucket 1, 0xB… into bucket 2, and so on…

Now… we have a query that says “I want all records where lastName = ‘Smith’”… SQL simply hashes Smith, goes straight to the 0xS bucket… reads all the records in the bucket and returns our data.

Now, that’s not particularly effective because I picked only 26 buckets… but now consider this:

Our phonebook has just 1 million records and actually only has 1000 different surnames.

index ix_testLastName nonclustered hash (lastName) with (bucket_count=1000)


Now, despite our 1 million records, every single lastname value has its own bucket. This means that suddenly doing a value lookup is incredibly efficient… SQL will simply hash the input, go straight to the bucket, and return all data. No ambiguity, no wasted reads, just results. This makes a HASH index incredibly fast for data point lookups.

Therefore the rule of thumb is to create your HASH index with the same bucket count as distinct values in your dataset in order to get the best performance. Although you need to remember that you cannot change the index (and therefore bucket count) without dropping and re-creating the table, therefore you might want to project for future growth if you’re using a non-static valued table.

Leave a Comment
Your email address will not be published. All fields are mandatory.
NB: Comments will only appear once they have been moderated.

SQL  World  CEO
Kevin  Urquhart

iPhone Selfie

I am a SQL Server DBA, Architect, Developer, Trainer, and CEO of SQL World. This is my blog in which I’m simply trying to share my SQL knowledge and experiences with the world.

Categories


© Copyright 2020 SQLTraining Ltd.