Microsoft announced the Cloud Platform System (CPS), a joint venture with Dell, at the TechEd keynote in Barcelona. At first I thought of something similar to the Nutanix boxes or VMware EVO:RAIL, but after visiting the Microsoft booth at the TechExpo I knew I was wrong.
Microsoft CPS is a ready-to-run, Azure-consistent cloud in your datacenter. It is a Microsoft-validated design and was developed by Microsoft with standard components available from Dell. It comes pre-integrated and pre-deployed, based on Windows Server 2012 R2, System Center 2012 R2 and Microsoft Azure Pack. Its single point of support is Microsoft, which opens a Dell call for hardware-related issues and takes the necessary steps itself when software-related issues occur.
Later yesterday I attended a session with Vijay Tewari, who is the Group Program Manager for CPS at Microsoft.
I am really hard to impress, but the specs and the thought and experience that seem to have gone into this solution are just awesome. Even if you don't have the necessary money (a thing I will come to later in this post), you can take the specs and the concepts behind it as a blueprint for your own solution.
Microsoft has been working on CPS for the past 18 months, and some really big customers like Capgemini are already using it in production.
The experience gained in Azure is meant to help in building a robust and effective solution, so only proven ideas and concepts were used.
All management functions are virtualized. Efficient packing of storage and VMs is combined with state-of-the-art network offloads. Dynamic scaling is a key architectural component.
The whole appliance is built and delivered within days, because it only makes sense if it allows agile scaling.
Let's spend a few words on the specification of each rack and the hardware that is used. Again, all of the hardware components are standard components anybody can order from Dell. You have to start with one rack.
Each rack has:
It weighs/needs/consumes:
The hardware within each rack consists of the following components:
Networking
Compute Scale Unit (32x Hyper-V hosts)
Storage Scale Unit (4x File servers, 4x JBODs)
The maximum scale of four racks consists of:
A single rack can support up to 2000 VMs (2 vCPU, 1.75 GB RAM, and 50 GB disk each). You can scale up to 8000 VMs using the maximum of four of these racks. Of course these numbers vary with different machine sizes. All hardware components in the rack are hot-pluggable.
Those are impressive numbers.
Let's dig a little deeper into the clusters that are built in this solution:
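To get a feeling for how these numbers move with other machine sizes, here is a small back-of-the-envelope sketch in Python. It is my own arithmetic, not something from the session: it assumes that the quoted 2000 reference VMs per rack can be read as a fixed budget of vCPUs, RAM and disk, and that the tightest resource decides how many VMs of another size fit. The names `VmSize`, `rack_budget` and `estimate_vms` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSize:
    vcpus: int
    ram_gb: float
    disk_gb: int


# Figures quoted in the post: reference VM size, VMs per rack, maximum racks.
REFERENCE_VM = VmSize(vcpus=2, ram_gb=1.75, disk_gb=50)
VMS_PER_RACK = 2000
MAX_RACKS = 4


def rack_budget(racks: int = 1) -> dict:
    """Total resources implied by the reference VM count (assumption: linear scaling)."""
    if not 1 <= racks <= MAX_RACKS:
        raise ValueError(f"CPS scales from 1 to {MAX_RACKS} racks")
    vms = VMS_PER_RACK * racks
    return {
        "vms": vms,
        "vcpus": vms * REFERENCE_VM.vcpus,
        "ram_gb": vms * REFERENCE_VM.ram_gb,
        "disk_gb": vms * REFERENCE_VM.disk_gb,
    }


def estimate_vms(size: VmSize, racks: int = 1) -> int:
    """Rough estimate: the scarcest resource limits the number of VMs."""
    budget = rack_budget(racks)
    return int(min(
        budget["vcpus"] // size.vcpus,
        budget["ram_gb"] // size.ram_gb,
        budget["disk_gb"] // size.disk_gb,
    ))


if __name__ == "__main__":
    print(rack_budget(4))
    # A hypothetical larger VM: 4 vCPU, 7 GB RAM, 100 GB disk.
    print(estimate_vms(VmSize(vcpus=4, ram_gb=7, disk_gb=100)))
```

Under these assumptions RAM becomes the limit for the larger VM, so a rack would hold far fewer of them. Real capacity depends on things this sketch ignores, such as overcommitment and the space reserved for the management cluster.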
Networking cluster
Networking performance of VM-to-VM connections can reach up to 18 Gbps, Offtrack Forwarding up to 10 Gbps, Offtrack NAT up to 8 Gbps and Offtrack 25 up to 1.8 Gbps. All these numbers are achieved through the use of LACP teaming, RSS enabled on host and guest, VMQ and NVGRE offload.
Management cluster
This cluster is the heart of the rack. It consists of 30 VMs on a 6-node Hyper-V failover cluster. It runs:
If you are asking yourself where the file servers are: they are the only components that are not virtualized.
Management Cluster Services
The management services are roughly divided into the following services:
The Storage cluster
Space available
4 JBODs
Storage Spaces configuration
4 File servers
The Management host group, Edge host group, Compute host group and Storage host group are spread over all available racks. However, there is only one instance of the Management host group, and it resides on a single rack.
The biggest operational cost in cloud environments is patching. Microsoft has therefore taken a great step forward in patching and updating infrastructures like this. Hopefully these experiences will flow into future versions of WSUS and/or SCCM.
The process is as follows:
Patches are validated on Microsoft-internal stamps > then deployed on the internal DEV and TEST cloud > then the customer starts patching > runs an inventory on its racks first > then updates what is needed on the firmware, driver and BIOS side > then updates and validates the Windows Server systems > then updates and validates System Center > and in the end updates and validates Windows Azure Pack.
Microsoft itself has an infrastructure for testing purposes where about 20,000 VMs are deployed every day. The tests are done by the Windows and System Center teams.
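The customer-side part of this chain can be sketched in a few lines of Python. This is only my illustration of the ordering, not Microsoft's tooling: the layer names are my own labels, the `inventory`, `update` and `validate` callbacks are hypothetical placeholders, and I assume that the inventory decides for every layer whether it needs patching and that a failed validation stops the run before the next layer is touched.

```python
from typing import Callable, List, Optional, Set, Tuple

# Order taken from the post: hardware layer first, Windows Azure Pack last.
LAYERS = [
    "firmware, drivers and BIOS",
    "Windows Server",
    "System Center",
    "Windows Azure Pack",
]


def run_patch_cycle(
    inventory: Callable[[], Set[str]],
    update: Callable[[str], None],
    validate: Callable[[str], bool],
) -> Tuple[List[str], Optional[str]]:
    """Patch layer by layer; return (completed layers, layer that failed or None)."""
    needed = inventory()  # step 1: find out what actually needs patching
    completed: List[str] = []
    for layer in LAYERS:
        if layer not in needed:
            continue  # nothing to do for this layer
        update(layer)
        if not validate(layer):
            return completed, layer  # stop before touching the next layer
        completed.append(layer)
    return completed, None


if __name__ == "__main__":
    done, failed = run_patch_cycle(
        inventory=lambda: set(LAYERS),
        update=lambda layer: print(f"updating {layer}"),
        validate=lambda layer: True,
    )
    print("completed:", done, "failed:", failed)
```

The point of the strict order is that each layer is only patched once the layer below it has been updated and validated.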
The primary focus is a wide range of use cases. I asked one of the guys at the booth whether they would also support VDI workloads. At that time they could not make any announcements. Whether you could use a CPS rack as an Azure Backup location was not confirmed either. But Microsoft said that every use case that is tested successfully internally will later be officially supported.
The CPS systems can be ordered in November 2014 in the US, Canada, Europe and South Africa. Additional countries will follow in 2015.
Server Core installations are used wherever possible to maximize performance. GUI versions are used wherever they are needed.
The whole solution comes predeployed to the customer. Putting racks together isn't that interesting, but what is interesting is the way the software is deployed.
VMM is deployed before all other systems because the infrastructure is built through images. PowerShell is used to automate tasks.
During the development of the solution, a private cloud simulator was developed to automate the simulation of certain failures in the infrastructure. It would be great if Microsoft shared information about it or released it to the community someday.
The folks at TechEd didn't want to talk about prices. But I found a whitepaper on the Microsoft CPS site with some examples in US dollars. https://www.microsoft.com/en-us/server-cloud/products/cloud-platform-system/
Rack costs
The price for one rack is about $1,635,600 based on list prices from Dell.
VM costs
The whitepaper comes to the conclusion that an average VM costs about 4,300 dollars.
This is exactly what I have been praying for for years. These solutions can only work with a huge amount of automation, so that administrators and operators can focus on the things they really need to do every day.
The CPS solution, for example, runs automatic consistency checks after restoring data. The reset and rotation of passwords can run in fully automated mode or with an alert through SCOM.
Isn't that really cool?
Microsoft themselves say that they hide all events arising in the environment that are not needed for normal operation, because they know what the infrastructure does. They show only important messages related to the infrastructure.
That is exactly the point we need to reach: proactive systems management with full automation instead of manual, reactive systems management.