Saturday, December 12, 2015

SMP versus MPP architecture in the context of Azure SQL Data Warehouse

A Symmetric Multi-Processing (SMP) system is a computing architecture in which all the processors share the same operating system, memory, and disk storage and are connected via a system bus. 

As shown in the image above, SMP is a single-bus system in which every processor reaches shared resources (storage, for example) through the same path. This shared path is one of the main architectural bottlenecks that prevents an SMP system from scaling. The traditional SQL Server products, including SQL Server 2005, 2008, and 2012, use an SMP architecture.

A Massively Parallel Processing (MPP) system, on the other hand, uses a "shared nothing" approach. In this architecture, each processor has its own set of resources (memory, disk storage, operating system) and each processor is fully independent and isolated from other processors. As you can guess, there is no single point of contention in this architecture and hence, it is able to scale massively. 

All the nodes in the system communicate with each other via a high-speed communication system. An MPP system can be thought of as a set of SMPs lined up and programmed to perform a single task in a parallel and coordinated fashion.
Examples of SQL Server products that use an MPP architecture: Microsoft Analytics Platform System (formerly SQL Server Parallel Data Warehouse) and Azure SQL Data Warehouse.

Now let’s try to put this concept in the context of Azure SQL Data Warehouse:

In an SMP architecture
- There is a single instance of SQL Server, and all the resources (CPU, memory, disk storage) are shared within it
- Though multiple CPUs work together to execute individual tasks concurrently (using application threading), the main bottleneck is that memory, disk storage, etc. are shared by all the CPUs

In an MPP architecture (which Azure SQL Data Warehouse is built on)
- Each node runs its own instance of SQL Server and processes only the rows on its own disks. For example, in a 4-node MPP system, there will be 4 instances of SQL Server processing queries in parallel. If a table has 1 million rows and they are spread evenly, each of the 4 nodes will store 250,000 rows. This means that, in theory, a query that takes 1 minute on an SMP system could take about 15 seconds on an MPP system with 4 nodes
- When a query is executed (say a simple select statement), all the nodes in the system work in parallel to retrieve data from their respective disks and coordinate with each other to satisfy the request
In other words, the system distributes data and query processing across the nodes, thereby increasing parallelism and eliminating bottlenecks that are inherent in an SMP architecture.
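
To make the 4-node example concrete, here is a minimal Python sketch of the idea. It is only an illustration, not how Azure SQL Data Warehouse is implemented: the names `distribute`, `node_sum`, and `run_query` are hypothetical, the round-robin placement of rows is an assumption, and Python threads merely stand in for independent nodes (they do not give real parallel speedup for this kind of work).

```python
from concurrent.futures import ThreadPoolExecutor

NUM_NODES = 4          # cluster size from the 4-node example above
NUM_ROWS = 1_000_000   # table size from the example above


def distribute(rows, num_nodes):
    """Place rows round-robin so each node owns a disjoint slice (assumed scheme)."""
    return [rows[i::num_nodes] for i in range(num_nodes)]


def node_sum(local_rows):
    """Work done by one node: it scans only the rows it owns."""
    return sum(local_rows)


def run_query(partitions):
    """Send the query to every node, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(node_sum, partitions))
    return sum(partials)


rows = range(NUM_ROWS)                      # stand-in for a 1-million-row table
partitions = distribute(rows, NUM_NODES)
rows_per_node = [len(p) for p in partitions]
total = run_query(partitions)

# Ideal case only: a 60-second scan split evenly over the nodes.
ideal_seconds = 60 / NUM_NODES

print(rows_per_node, total, ideal_seconds)
```

Each simulated node sees only its own quarter of the rows, and the final answer comes from combining the four partial sums. The 60 / 4 estimate is the best case; in practice, uneven data distribution and the cost of coordinating the nodes eat into that gain.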
