Learning To Scale SQL Server

2/20/2015

I'm currently employed at AudiencePoint Inc., the industry leader in Email Send Time Optimization. AudiencePoint Send Time Optimization allows clients to determine the optimal time to send a piece of correspondence to each individual on their subscriber list. To do this we need to crunch through A LOT of data. This massive amount of data is stored today in a SQL Server 2012 database running on powerful dedicated hardware. As the data and the volume of operations on that data have continued to grow, we have learned some lessons on how to scale SQL Server to keep up with the heavy demand.

One thing to note is that database operations at AudiencePoint (AP) look different from DB operations at web-based companies like StackOverflow or Amazon.com. Operations at those companies are typically transactional (OLTP) in nature; e.g. every page view requires a certain number of DB reads, new records are inserted into the DB when a question or answer is posted, etc. At AP our DB operations typically run in batches; e.g. 10 million records need to be inserted, queried, sorted, and updated in batch fashion. The question becomes: how do you speed up and/or scale these operations, especially when the traditional scaling techniques used by OLTP DB applications, such as caching and partitioning, aren't available to you? Here are some of the things we learned:

Joins are bad
And by bad, I mean slow. Especially when we are talking about large tables and more than two tables are involved. Joins combine scattered data fragments to form a single result set, so they are sensitive to disk seek latencies. They can also generate intermediate data that doesn't fit into memory and spills to disk. Additionally, a join operation can only involve two tables at a time, so when more than two tables are involved, the engine must first join the first two tables to create an intermediate dataset and then join that dataset to the next table. This is effectively a denormalization process. Therefore it's worth examining your tables and queries to see if you are over-normalized, and then denormalize your tables or adjust your queries like we had to do. Indexes can help improve join performance on SELECT statements but come at a cost to INSERT operations. More on that later.
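Here's a small sketch of what that denormalization looks like, using SQLite so it's self-contained and runnable; the table and column names are invented for illustration, not AP's actual schema:

```python
import sqlite3

# Hypothetical send-log schema, illustrated with SQLite for a
# self-contained sketch; names are invented, not AP's real tables.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subscribers (sub_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE campaigns   (camp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sends       (sub_id INTEGER, camp_id INTEGER, sent_at TEXT);

    INSERT INTO subscribers VALUES (1, 'a@example.com'), (2, 'b@example.com');
    INSERT INTO campaigns   VALUES (10, 'Spring Promo');
    INSERT INTO sends       VALUES (1, 10, '2015-02-01'), (2, 10, '2015-02-02');
""")

# Fully normalized: every read of the send log pays for two joins.
three_way = con.execute("""
    SELECT s.email, c.name, d.sent_at
    FROM sends d
    JOIN subscribers s ON s.sub_id  = d.sub_id
    JOIN campaigns   c ON c.camp_id = d.camp_id
""").fetchall()

# Denormalized: copy the columns the batch jobs actually read into one
# wide table, so the hot query touches a single table and no joins.
con.executescript("""
    CREATE TABLE sends_wide AS
    SELECT s.email, c.name, d.sent_at
    FROM sends d
    JOIN subscribers s ON s.sub_id  = d.sub_id
    JOIN campaigns   c ON c.camp_id = d.camp_id;
""")
flat = con.execute("SELECT email, name, sent_at FROM sends_wide").fetchall()

assert sorted(three_way) == sorted(flat)  # same rows, one table instead of three
```

The trade-off is the usual one: the wide table duplicates data and must be kept in sync, but the batch jobs that read it no longer pay the join cost on every run.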


Don't put it in the DB
The fastest database operation possible is the one you don't do. No matter how fast your query is, not having to run it at all is faster. The same goes for inserts, updates, and deletes. Based on the information we had at the time, we used to stuff a lot of unnecessary data into the DB and then have to query those millions of rows of unnecessary data back out. This is obviously not good for performance. In the OLTP world, unnecessary data operations are avoided by various layers of caching. If that's not available to you, then you need to think critically and work hard to avoid unnecessary DB operations. Does it need to be in the database? Why?

SQL Server performs better on subqueries than joins
Questions about the performance of joins versus the IN clause have been asked several times on StackOverflow. Two such instances can be found here: SQL Join vs. In clause and SQL Join vs. Sub-query (the first link has an example). We found that SQL Server performs better with subqueries / the IN clause. This is because the IN clause removes the need for a join, and as we now know: joins are bad!
On StackOverflow the common answer to the joins vs. subqueries debate is usually "it depends on your DB engine". In our experience, SQL Server (and apparently Oracle, according to an SO poster) has picked a side.
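For concreteness, here is the same filter written both ways, again in SQLite so the snippet runs standalone. The schema is invented, and note that SQLite only demonstrates the two query shapes are equivalent; the performance difference described above is a property of SQL Server's planner and isn't visible here:

```python
import sqlite3

# Invented schema: find subscribers whose addresses have bounced.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE subscribers (sub_id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE bounces (sub_id INTEGER);
    INSERT INTO subscribers VALUES (1,'a@x.com'),(2,'b@x.com'),(3,'c@x.com');
    INSERT INTO bounces VALUES (1),(3);
""")

# Join form: combines the two tables, then projects the emails.
joined = con.execute("""
    SELECT DISTINCT s.email
    FROM subscribers s
    JOIN bounces b ON b.sub_id = s.sub_id
""").fetchall()

# IN / subquery form: no join -- the engine can evaluate the subquery
# once and probe it, which is the shape we found SQL Server prefers.
subq = con.execute("""
    SELECT email FROM subscribers
    WHERE sub_id IN (SELECT sub_id FROM bounces)
""").fetchall()

assert sorted(joined) == sorted(subq) == [('a@x.com',), ('c@x.com',)]
```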

Multithread your DB operations
Horizontal scaling is the holy grail of DB scaling. This is why data partitioning is so popular. In batch processing though, the tendency is to have a single process repeat the same operations over and over again (hence batch). The problem is that if you have a SQL Server database sitting on a quad-core or octa-core server and you fire up the profiler to look at what's going on, you'll find that only one core on your SQL Server database is active while the others are just hanging around working on their tans. When we divvied up our work into smaller batches and had multiple threads doing our DB operations, we saw all the other cores come to life and our performance increase by orders of magnitude. We had just achieved horizontal scaling, but within one box. Snow cones for everyone!
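The pattern is simple enough to sketch with the standard library. `process_batch` below is a hypothetical stand-in for whatever INSERT/UPDATE work your job does per chunk; here it just counts rows so the example is self-contained:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the batching pattern: split one big workload into chunks
# and run the per-chunk DB work on a thread pool.
ids = list(range(10_000_000, 10_000_100))  # pretend these are row keys
BATCH = 25

def chunks(seq, size):
    """Yield successive fixed-size slices of seq."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def process_batch(batch):
    # Real code would open its own connection and run the batch here;
    # one connection per worker thread keeps the DB sessions independent.
    return len(batch)

with ThreadPoolExecutor(max_workers=4) as pool:
    done = sum(pool.map(process_batch, chunks(ids, BATCH)))

assert done == len(ids)  # every row handled exactly once
```

The key design choices are making the batches disjoint (so two threads never fight over the same rows) and giving each thread its own connection rather than sharing one.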

Watch your indexes
Markus Winand explains it better: while indexes really speed up SELECT queries, they have the opposite effect on INSERT (as well as UPDATE and DELETE) statements. For an insert operation on a table:

"adding a single index is enough to increase the execute time by a factor of a hundred. Each additional index slows the execution down further"
That's why indexes should be used deliberately, sparingly and with database usage patterns in mind. At AP we had (and probably still have) some tables where the index size is greater than the actual table size.
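You can see the effect in miniature with SQLite: the same bulk insert into a bare table and into one carrying three indexes. The absolute numbers vary by machine and engine (and SQL Server's ratios will differ), but every index adds write work per row:

```python
import sqlite3
import time

# Same bulk insert into a table with no indexes vs. three indexes.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE bare    (a INTEGER, b TEXT, c TEXT);
    CREATE TABLE indexed (a INTEGER, b TEXT, c TEXT);
    CREATE INDEX ix_a ON indexed(a);
    CREATE INDEX ix_b ON indexed(b);
    CREATE INDEX ix_c ON indexed(c);
""")
rows = [(i, f"b{i}", f"c{i}") for i in range(200_000)]

t0 = time.perf_counter()
con.executemany("INSERT INTO bare VALUES (?,?,?)", rows)
t_bare = time.perf_counter() - t0

t0 = time.perf_counter()
con.executemany("INSERT INTO indexed VALUES (?,?,?)", rows)
t_indexed = time.perf_counter() - t0

print(f"no indexes: {t_bare:.2f}s, three indexes: {t_indexed:.2f}s")
assert t_indexed > t_bare  # the indexed table pays for maintaining three b-trees
```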

and finally...

Don't use SQL Server
... for things it's not good at. Gotcha there?! The tendency is to try to shoehorn one solution onto every problem to be solved. We've been there, done that, and have learned not to do that anymore. For areas where SQL Server isn't a good fit, evaluate other solutions to pair with it. NoSQL stores, graph databases, time-series databases, etc. all have their strong points and will perform better at those functions than an RDBMS. As the popular saying goes, "use the right tool for the job".
