Introduction to PDW (Parallel Data Warehouse)
The PDW has been a rather under-covered topic in the public eye, but with the release of version 2 we are off the leash, and I can share my experiences on the PDW with you.
This is the first of a series of posts I’ll be doing on the Parallel Data Warehouse. We’ll move on to advanced topics in later posts; this post focuses on introducing you to the PDW. You’ll read about what MPP is (as opposed to SMP and NUMA), and get a quick intro to the Data Movement Service that is at the heart of the PDW. More importantly for most of you, I’ll dive into what is different from a SQL SMP box.
My future posts:
Distribution theory and the principles of hashing
Hardware in a PDW
Getting data into a PDW – SSIS and DWLoader
Distributed Query Plans – data movement
Optimising a PDW – introduction
V2: Updateable Columnstore
V2: Polybase. Where would we use it?
MPP vs SMP
As you start working in the PDW world, you’ll hear the term MPP – Massively Parallel Processing often. An MPP system is one built from multiple nodes in a cluster, with a balanced hardware architecture where the IO capacity of dedicated storage is mated to a defined area of RAM used by a single core of a multi-core, multi-processor machine. To understand the reasons for this, a brief understanding of the history of processor architecture is required. For simplicity’s sake I am not going to go into instruction sets or hyper-threading.
In the days of yore, a single processor (think Pentium and before) could run a pipeline of instructions one after the other – this was a simple thing to do, and making a CPU faster was primarily a matter of increasing the clock speed (and playing with memory timings and other things to match this increased speed). As the clock speeds increased, a hard limit around the 3GHz mark arose – while it could be exceeded, the cooling requirements and other manufacturing limits made this no longer viable, and the concept of adding multiple cores that can execute instructions in parallel arose. As it turns out, making operations run in parallel is a non-trivial programming task and thus having two cores does not automatically double your speed. It does however let you run more tasks at once – this is the world of SMP, Symmetric Multi-Processing.
SQL Server specifically does a (relatively) good job of working in parallel, and using a tool such as SSIS, where you manually create multiple single-threaded operations that run in parallel, is also a good way of taking advantage of the multiple cores. The disadvantage is that storage speed has not kept pace with CPU speed: if data cannot be supplied fast enough, the cores sit idle. Storage improvements (adding more disks, increasing IO, and optimising for sequential access) are thus needed to improve the rate at which data is fed to the CPUs.
Another piece of the puzzle is something called Non-Uniform Memory Access (NUMA). If you look at a traditional multi-processor system, you have a single memory bus for all the processors, as in Figure 1 below.
Early on, this wasn’t a problem, but as cores increased and processors outstripped memory speeds, the memory bus became a bottleneck. Allocating memory bandwidth and splitting it for the separate processors thus became a performance enhancement technique:
Each NUMA node has its own memory controller and thus the bottleneck is alleviated. Well, theoretically. The challenge is that the workload still needs to be balanced across the pieces of hardware: and this comes down to application design in a scenario like this.
(For an actual read on how this works, try Jonathan Kehayias here)
In an ideally balanced application, each single core in the machine is working on an item of data being loaded from a single storage unit, into a piece of memory, and supplies it fast enough to keep the CPU running at capacity.
This is the approach taken with a Fast Track architecture and others. In a Fast Track architecture, the threading is built by the person creating the SSIS packages, which is time-consuming and error prone. Try here for a reference guide. What is done on a SQL level is to partition the tables, and assign each partition to a piece of storage, then work with multiple partitions in parallel.
That’s SMP, then, and a refinement of SMP called NUMA.
MPP draws upon many of the principles of what was done with a NUMA architecture (indeed uses NUMA), but goes a whole step further.
Starting with multiple IO-balanced machines, each one with dedicated storage and RAM per processor, add a way of distributing a dataset across the different sections. So, for example, if you have 8 slices of data on each machine, and 10 machines, you would slice your data up into 80 different sections and distribute them across the machines. The more evenly this is done, the faster it will work: if your dataset is being operated on by 80 processor cores at once, with no bottlenecks in memory or IO, whatever you are doing can be up to 80 times faster. I will leave the topic of how the data is distributed for my next post: you may want to read up on hashing to prep :)
The PDW, Parallel Data Warehouse, is an MPP system: and all the pieces around how the data is made to work in a parallel way are the PDW magic.
What is under the hood is a SQL instance on each machine, with the partitioning done on each instance automatically. Careful with the terminology though: in PDW-land, this split of the data is called Distribution, and is by a distribution key. You have no control over this other than the key: the data will be split across the machines and within each machine as evenly as the PDW can. Partitioning that we do control is something for a later post.
The PDW is something called a “Shared Nothing architecture”. In pure essence, what this means is that any query can be executed independently on each node, and then combined.
In practice, this becomes a lot more difficult.
Based on the above, assume we had 3 tables: a sales table, a customer table, and a product table, and we distributed the customer table across the appliance by customer key, the products table by product key, and the sales table by sales key. We’d end up with some sales transactions located on one node, with the customer record on another node, and the product perhaps on a third. SQL Server on each instance would then not be able to do the join, and we’d get nonsense results.
So the first solution to this is to have two types of tables: Distributed, as we’ve discussed, and Replicated. On a replicated table, we keep a copy of the entire table on each node. In our scenario above, if we replicated customer and product, and we asked for the SUM of sales for a set of products, each SQL instance would be able to calculate its own answer, and then the results can be pushed back up to a central machine (called the control node. More on this in a later post), and then returned.
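To make this concrete, here is a sketch of the two table types in PDW T-SQL. The table and column names are mine, not from any real schema:

```sql
-- Distributed: rows are hashed on CustomerKey and spread across the nodes
CREATE TABLE dbo.FactSales
(
    SalesKey     int   NOT NULL,
    CustomerKey  int   NOT NULL,
    ProductKey   int   NOT NULL,
    Amount       money NOT NULL
)
WITH ( DISTRIBUTION = HASH (CustomerKey) );

-- Replicated: a full copy of the table is kept on every node
CREATE TABLE dbo.DimProduct
(
    ProductKey   int          NOT NULL,
    ProductName  varchar(100) NOT NULL
)
WITH ( DISTRIBUTION = REPLICATE );
```

With the dimension replicated, a SUM of sales by product can be computed entirely locally on each node, and the control node only has to combine the partial answers.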
Alternate methods may have both customer and sales distributed on customer, and product replicated. Oftentimes, distributing facts and replicating dimensions will be a good place to start, but it will need tweaking.
The piece of the PDW that moves data around, both in terms of pushing it to the different nodes, and in terms of bringing back the answers is the Data Movement Service (DMS). It has a secondary function: if you do have a query that requires a join between tables that are distributed differently (for instance, if you have been so unwise as to distribute sales on customers, and sales items on product), it will move the data around to first satisfy the join. PDW handles all this automatically, all you will notice is that your query is much (multiples of times!) slower than it should be.
SQL vs PDW
Firstly, the PDW is a preconfigured appliance. So you can’t just install it on a VM to play with.
So, you are probably thinking, if a PDW is just a surface on top of SQL, developing on it must be pretty similar?
It turns out that certain things are really hard to distribute.
User-defined functions are not supported. CLR functions are definitely not supported!
Foreign keys don’t exist. And this is inherently complex: if you have a customer on one node and sales for that customer on another, satisfying a foreign key is impossible on the SQL node, yet this is a valid PDW distribution. So they just don’t exist.
The list of datatypes is a subset: no off-page data types, i.e. no varchar/nvarchar(max), no BLOBs, no CLR types (e.g. spatial). No numeric, but that can simply be replaced with decimal.
Some other restrictions that hurt: no joins in an UPDATE! (Or a DELETE.) Only a single table to update from: i.e. if you want to update for a subset of customers, you’ll need to put the data from the tables you need to update from into a new (single) table first.
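As a sketch of that workaround (all names hypothetical): first materialise the join into a single work table, then update from that one table:

```sql
-- 1. Materialise the join into a single work table
CREATE TABLE dbo.SalesRegionFix
WITH ( DISTRIBUTION = HASH (SalesKey) )
AS
SELECT s.SalesKey, c.Region
FROM   dbo.FactSales   s
JOIN   dbo.DimCustomer c ON c.CustomerKey = s.CustomerKey
WHERE  c.Country = 'UK';

-- 2. The UPDATE now references only a single other table
UPDATE dbo.FactSales
SET    Region = f.Region
FROM   dbo.SalesRegionFix f
WHERE  dbo.FactSales.SalesKey = f.SalesKey;

DROP TABLE dbo.SalesRegionFix;
```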
No auto-update of statistics. More on this shortly, but knowing when to create and update statistics is one of the most important performance tuning skills you’ll need.
This leads us to the really important part of the PDW. So far, I might have sounded negative: I’m not! Some of the fundamentals of how you think about a PDW are different. Firstly, we are distributing data across IO subsystems, and dividing the data set up is the key to the performance, and thus we have a concept called index-lite. Apart from the clustered index (little tip: you’ll often want to put your clustered index on a different key to your distribution key), you’ll very often not have other indexes. What you will often want to do, however, is create statistics, especially multi-column statistics, where you might have created indexes before. I will talk a bit more about statistics in a later post, but you are creating the PDW statistics, and the PDW is handling the SQL statistics down on the individual instances. Having these stats allows the PDW to create the most efficient Distributed Query Plans, i.e. decide how it should move the data around (or not).
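Creating and maintaining those statistics is plain T-SQL; a sketch, with hypothetical table and column names:

```sql
-- Multi-column statistics where you might once have created an index
CREATE STATISTICS stat_Sales_Customer_Product
ON dbo.FactSales (CustomerKey, ProductKey);

-- No auto-update, so refresh after significant loads
UPDATE STATISTICS dbo.FactSales;
```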
And don’t misunderstand: moving data around has a cost, but part of what makes the PDW so very good at what it does is how good it is at doing this. This is the reason for the masses of IO available and the dual InfiniBand network. Even so, minimising data movement (moving the small dimension around instead of the big fact table to satisfy a join) remains key to performance.
There are also some new constructs: specifying distributed or replicated when creating a table is one of the design fundamentals.
Another new construct is what is called CTAS: Create Table As Select. In some ways, this is like a SELECT INTO, but with the ability to specify the distribution you want. In the PDW, copying data using a CTAS (especially into a distributed table) works really quickly, and so you will often default to using CTAS: copying the data into work tables (or temp tables) and then joining to the temp table to do an UPDATE, or renaming the work table to replace the original table. While this is something you might do for a SQL table (albeit without a CTAS), it’s nowhere near as common as it will be in the PDW.
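A sketch of the CTAS-and-rename pattern (PDW provides RENAME OBJECT; the names here are hypothetical):

```sql
-- Rebuild the table, here with a different distribution, via CTAS
CREATE TABLE dbo.FactSales_New
WITH ( DISTRIBUTION = HASH (CustomerKey) )
AS
SELECT SalesKey, CustomerKey, ProductKey, Amount
FROM   dbo.FactSales;

-- Swap the new table in for the old one
RENAME OBJECT dbo.FactSales     TO FactSales_Old;
RENAME OBJECT dbo.FactSales_New TO FactSales;
DROP TABLE dbo.FactSales_Old;
```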
And that brings me to the final thinking point you need to consider when using the PDW. In the warehouse world, we work with ETL (Extract, Transform, Load) and in the Transform space we try to do as much as possible in the ETL tool, for instance doing our transformations in SSIS in-memory using cached Lookups (as I cover in other blog posts).
On a PDW (or any MPP system) we want to do ELT: Extract, Load, Transform. Basically, get the data onto the appliance, then use this mass of hardware to do the transformations, rather than some other machine. It is a change in thinking, unless you already do most of your ETL in stored procs. I have found this concept the hardest to come to grips with, as I come from a strong ETL background.
This has been a very introductory session on the PDW: in my future posts, I will be diving into a great deal more detail.