Your application, hardware failure and you

Originally posted on IBM Developer blog “Exploring PureApplication System, Software Service and more” by John Hawkins on 9 July 2013 (4484 visits)

This post was written by John Hawkins, a Principal Consultant at Icon Solutions.

Although PureApplication System has some great local hot-swap and off-site Disaster Recovery technology, you still need to understand what it means for you and your application. I haven’t found any concise documentation for this, so I’m sharing what I’ve found out here.

Pure has the ability to shut down the workloads on a compute node and move them to another compute node. This is called “evacuation”. Evacuations can be planned or unplanned (you can guess the difference!).

“Why would I do a planned evacuation?” you may be asking yourself. Well, you might need to upgrade your compute nodes, for instance. Or maybe a node has an intermittent fault that needs looking at (it happens – even with IBM hardware).

Planned Evacuation

With PureApplication System you can plan to close down a node. If a planned evacuation takes place then Pure will migrate your running VMs to other nodes. In this case the Pure “magic” makes sure that your workload never knows what happened: it gets the same IP address, same disk etc. Now, that’s good, but the real sauce is that it mirrors the exact state of the VM at the moment it was moved – that’s pretty cool! To be clear: your complete memory, CPU, disk and network resources are migrated, and even your TCP adapter/packet state is preserved with no data loss or discrepancy. This means that your application will resume on a new node and continue right where it left off – even if it was in the middle of a transaction!

One minor downside is that you may notice a slight decrease in performance for one or two minutes but, if that’s the price you pay for not having to worry about a node being brought down, then that’s cheap in my book.

Unplanned Evacuation

For an unplanned evacuation (failure) of a node, Pure will still move the workload VMs to another node in the rack. However, in this case, the move won’t be so smooth. To the application, it will look like the VM has crashed and been restarted (naturally enough, because that’s what happened). There’s no attempt to reproduce active memory state or anything else. In this case, standard transaction semantics apply and, if you care about not losing data, then you need to code around it. There are plenty of discussions on how to code transactions so I won’t go into that here.
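As a rough illustration of what “coding around it” can mean, here is a minimal Python sketch that applies an update atomically and at most once, so that a client can safely resend a request after the VM has been restarted. Everything in it is an assumption for illustration only: the table names, the `apply_payment` and `open_db` helpers and the use of SQLite are hypothetical stand-ins, not anything specific to PureApplication System or to your own database.

```python
import sqlite3


def open_db(path=":memory:"):
    """Open the database and create the (hypothetical) example tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_requests "
        "(request_id TEXT PRIMARY KEY)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    conn.execute(
        "INSERT OR IGNORE INTO accounts (name, balance) VALUES ('alice', 0)"
    )
    conn.commit()
    return conn


def apply_payment(conn, request_id, account, amount):
    """Apply a payment at most once, atomically.

    Returns True if the payment was applied, or False if request_id had
    already been processed (for example, a client retry after a restart).
    """
    try:
        # 'with conn' commits if the block succeeds and rolls back if it
        # raises, so both statements take effect together or not at all.
        with conn:
            conn.execute(
                "INSERT INTO processed_requests (request_id) VALUES (?)",
                (request_id,),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, account),
            )
        return True
    except sqlite3.IntegrityError:
        # Primary-key clash: this request was committed on an earlier attempt.
        return False
```

The design choice here is to record the request identifier in the same transaction as the business update. If the node fails before the commit, neither change survives and the retry goes through; if it fails after the commit, the retry is recognised as a duplicate and ignored.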

Disaster Recovery

Not only can Pure do hot-swap and VM evacuation within a single Pure box but it can also do Disaster-Recovery across two racks. Disaster-Recovery is usually between two distant data-centers which you don’t think will be affected by the same physical problems at the same time. Take a look at this video to see just how easily that can be set up: http://www.youtube.com/watch?v=YOJS6z18p7E . Disaster recovery is achieved using hardware-based disk replication. When the primary rack fails, all the workloads get moved to the new, standby, rack. Again, it looks to the application like the VM failed – but that’s acceptable – your rack’s just crashed! The application does keep some of the same context though – it still has the same disks and IP addresses on the new, remote, rack. But you still need to code around any transactions you were in.
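Because the workload comes back with the same IP addresses, one simple thing a client can do is retry against the same address until the VM is available again. The following is a minimal Python sketch of that idea using exponential backoff. The `call_with_retry` helper and its parameters are hypothetical names chosen for this example, and retrying like this is only safe if the operation being retried is idempotent.

```python
import time


def call_with_retry(operation, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run operation(), retrying on connection errors with exponential backoff.

    operation  -- a zero-argument callable that talks to the application
    attempts   -- total number of tries before giving up
    base_delay -- delay in seconds before the first retry; doubles each time
    sleep      -- injectable so that the waiting can be replaced in tests
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                # Out of retries: let the caller see the last error.
                raise
            sleep(base_delay * (2 ** attempt))
```

A caller would wrap whatever function makes the remote call, for example `call_with_retry(lambda: send_order(order))`, where `send_order` stands for your own (hypothetical) client code. How long you keep retrying depends on how long a restart or a rack failover takes in your environment, which you would need to measure.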

Middleware High-Availability

The functions just described are at the infrastructure level. Pure can also help you at the software level. For instance, the in-built WebSphere Application Server application pattern will deploy two WAS nodes on two different compute nodes. This means that if the first compute node fails then the second one takes over while the first recovers (remember, this may mean that the first instance gets moved to a different node – but all that’s taken care of for you). Be careful though: this does require you to set your scalability policy so that there is a minimum of two WAS nodes in the cluster.

Summary

So, to summarise: Pure is IBM’s latest system of expertise. IBM has not only given it expertise in terms of the software patterns but also helps you manage the solutions at runtime.

For a planned shutdown of a node you have no fears. Your solution is in very safe hands – the application will not know that it ever got moved nor will the clients that connected to it.

For a node failure the application will get rebooted, as if it had just started up, so you’re going to need to code around any data loss issues you may have. This applies whether the failure was a local intra-rack failure or a planned or unplanned Disaster-Recovery (cross-rack) scenario.

John Hawkins is a Principal Consultant at Icon Solutions. He has seventeen years of IT industry experience and worked for IBM at their Hursley laboratory on MQ, Message Broker and cloud technologies. John is highly skilled in messaging, PureApplication patterns and implementing proofs of concept with bleeding-edge technologies. He prototyped MQ and WESB on PureApplication System as well as inventing a new type of “business driven” scalability policy.
