I had morning coffee with a well-respected friend of mine recently. Aside from the usual chat about wives and family, we touched on the subject of clustered indices and SQL Server performance.
A common misconception in the software industry is that a clustered index will make your database queries faster. In many cases, the opposite turns out to be true. The misconception comes from a misunderstanding of how a clustered index actually works, in any database server.
A clustered index physically groups together the records that share a common index value. When you decide on an index strategy for your data, you must consider the range of data to be indexed. Think back to your data structures classes and what you were taught about hashtable optimization.
A hashtable, which is a rough analogy for a database index, is just a structure of N buckets that organizes a set of M records into quickly accessible lists of average length L, where L is significantly smaller than M. If your index/hashtable is performing poorly, you have a problem with your L and N: your buckets have effectively collapsed and your L is creeping up toward M.
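To make those letters concrete with some made-up numbers: say you have M = 1,000,000 records and an index that divides them into N = 10,000 buckets. On average a lookup lands in a list of about L = M / N = 100 records, which is a huge win over scanning a million rows. If the structure degrades so that most records pile into a handful of buckets, L creeps toward M and every lookup starts to look like a full scan.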
A clustered index is something special because it is the only index that dictates how the data is PHYSICALLY stored on disk. While a non-clustered index is a separate lookup structure that points back at the data rows, the clustered index is a physical clustering of the data records themselves (the order-L lists from the prior paragraph) written to storage. This is the killing joke of a clustered index: the worst-case scenario is a key range that constantly changes, which means you're constantly reorganizing your clusters on disk. That, my friends, is why your database gets slower as it fills with records.
Your solution is to never use a clustered index unless you have a finite, pre-determined range of index values. A clustered index performs best only when the key range is static, or at least rarely changing.
A common mistake is to make the record ID in your table the primary key, which by default is also clustered. Oops. That means the key space for that clustered primary key changes with every insert. Crap, right? I bet you've got a few of those in your database right now. You'd better run off and fix that.
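If you want to see that default in action, here's a minimal T-SQL sketch (the table, columns, and constraint name are made up for illustration). Declaring a PRIMARY KEY without saying otherwise gets you a clustered index on it, which you can confirm by peeking at sys.indexes:

    -- Hypothetical table: an IDENTITY primary key, no index options specified.
    CREATE TABLE dbo.Orders (
        OrderID    int IDENTITY(1,1) NOT NULL,
        CustomerID int NOT NULL,
        OrderDate  datetime NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY (OrderID)  -- defaults to CLUSTERED
    );

    -- Check what SQL Server actually built: type_desc comes back as CLUSTERED.
    SELECT name, type_desc
    FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.Orders');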
You might wonder why this is such a big deal, right? At what point does the clustered index start to fail? Thousands of records, hundreds of thousands, millions? Those are good questions.
Every standard database index is ordered; under the hood it's a B-tree, which is what lets you search the index space in O(log n) time. This is also why the physical layout of a clustered index is so problematic. Each time a new key lands in the middle of the existing range, pages have to split and records have to be physically moved and rewritten. A write to a hard disk is orders of magnitude slower than one in memory, so the cost of all that physical reshuffling becomes prohibitive as the indexed table grows. Sound familiar? Even with a fill factor of 90%, leaving 10% of every page free, you will feel the bite of reorganizing the clustered index if your key range changes rapidly.
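For what it's worth, the fill factor is something you control when you build or rebuild an index. A quick sketch, reusing the hypothetical dbo.Orders table from above, that leaves 10% of each page free to absorb mid-range inserts before a split:

    -- Rebuild the clustered index behind the primary key,
    -- leaving 10% of each page empty.
    ALTER INDEX PK_Orders ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 90);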
I've found that the clustered index on a bad key range starts to fail around a hundred thousand records, but that depends on the size of your records and the speed of your hard drives. If you have high speed (AV-rated) SCSI drives, then you might be able to get away with more records. Once you hit a million records, though, you're sunk, no matter how fast the hard drive.
Your remedy is to always use non-clustered indices for your primary keys. The very nature of a primary key is that it must be unique, which is orthogonal to whether it is clustered. In SQL Server, the primary key is enforced by an index that is clustered by default, so you have to ALTER the table, DROP the primary key constraint, and then re-create it as a non-clustered one.
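In T-SQL that looks roughly like the following (table and constraint names are again hypothetical; keep in mind that dropping and re-adding a primary key on a big table is itself an expensive, locking operation, so plan a maintenance window):

    -- Drop the existing clustered primary key...
    ALTER TABLE dbo.Orders DROP CONSTRAINT PK_Orders;

    -- ...and put it back as a non-clustered one.
    ALTER TABLE dbo.Orders
    ADD CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID);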
Then, when you decide to add other indices to your tables, only do it on columns that are actually used in your queries' filters and joins. And take advantage of multi-column indices, because your searches often filter on more than one column.
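As a sketch, again on the hypothetical dbo.Orders table: if your queries routinely filter by customer and date together, one composite index serves that filter better than two single-column ones. Column order matters here; put the column you filter on most often first.

    -- One non-clustered index that matches a common two-column filter.
    CREATE NONCLUSTERED INDEX IX_Orders_CustomerID_OrderDate
    ON dbo.Orders (CustomerID, OrderDate);

    -- A query this index serves well:
    -- SELECT OrderID FROM dbo.Orders
    -- WHERE CustomerID = 42 AND OrderDate >= '2007-01-01';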
If you absolutely need to use a clustered index to make your row sorting faster, then do it on a static key space, or one that is almost never changing.
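As an example of a key space that genuinely never moves, think of a small lookup table. A hypothetical sketch:

    -- A finite, pre-determined key range: the set of two-letter
    -- state codes almost never changes, so the physical clustering
    -- almost never has to be reorganized.
    CREATE TABLE dbo.StateCodes (
        StateCode char(2)     NOT NULL,
        StateName varchar(50) NOT NULL,
        CONSTRAINT PK_StateCodes PRIMARY KEY CLUSTERED (StateCode)
    );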