Today has been an interesting day, starting with the keynote I blogged about earlier.
After that I gave myself some time to wander around and get some work stuff done, investigating a few more vendors at the Solutions Exchange.
Then my next session of the day was 5 Functions of Software Defined Availability with Duncan Epping and Frank Denneman. I’ve been reading their vSphere Clustering Deep Dive book recently and it is superb. This session was a great complement to my reading and also included some new features in vSphere 6.0 that aren’t in the book (yet!?).
This was a great session and I’ll post my notes for it below, as I did yesterday for Day 2. The guys are great speakers and really know their stuff. At the end I decided to go up and say hello. I spoke with Duncan at the signing for my Essential VSAN book yesterday and managed to get him to sign my Clustering Deep Dive book, so this time I went and had a chat with Frank. He was a top guy, was happy to sign the Clustering Deep Dive book too, and alluded to a new edition possibly arriving in the future!
I then decided to go and watch Chad Sakac speak at the EMC vendor booth, not to win any swag but because he is a well-known, charismatic and enthusiastic speaker; the rumours were definitely true.
I then attended a great session with GS-Khalsa and Smruti Patel on Architecting Site Recovery Manager. My notes for that will follow at the bottom of this post.
I then decided to head back to my hotel to get ready for the vExpert/VCDX party at the Julia Morgan Ballroom. This was a fantastic opportunity to meet guys in the community and also some very smart VCDXs. It was a brilliant experience talking to people I follow on Twitter. Pat Gelsinger turned up and gave a great speech about community. I felt it was an excellent touch, having someone of his importance turn up to meet the people at the core of technical evangelism outside of VMware. He even took a selfie with us all, which I thought was hilarious and really showed that he is a down-to-earth guy.
I also met Eric Neilson from the vCommunity Podcast, he’s a very friendly and funny guy. He even offered to take me on a tour of the VMware offices in Palo Alto this coming Saturday. I’ll post about that if we do get the chance!
I then went home very tired but had an absolute blast! Tomorrow is the VMworld big party at AT&T Park, which I look forward to covering!
5 Functions of Software Defined availability – Duncan Epping and Frank Denneman
There are many parts to a modern DC – VM, server, management, storage, datacenter, network – but the reality is that business owners care about the application!
HA
The most used availability feature; most important is the restart of VMs (or applications) after a failure.
Heartbeats are important to determine what has happened to hosts.
Admission control allows you to reserve resources. It ensures VMs will restart after a host failure.
Policies: host failures to tolerate, percentage-based capacity, or a designated failover host.
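As an illustration, percentage-based admission control can be thought of as reserving a slice of cluster capacity and refusing to power on a VM whose reservation would eat into it. This is a toy model, not VMware’s implementation; all names and numbers are hypothetical:

```python
# Toy model of percentage-based HA admission control (hypothetical, not vSphere code).

def can_power_on(cluster_cpu_mhz, reserved_cpu_mhz, failover_pct, vm_reservation_mhz):
    """Return True if powering on the VM keeps the configured failover capacity free."""
    usable = cluster_cpu_mhz * (1 - failover_pct / 100)  # capacity left after the failover slice
    return reserved_cpu_mhz + vm_reservation_mhz <= usable

# A 100,000 MHz cluster reserving 25% for failover:
print(can_power_on(100_000, 70_000, 25, 4_000))  # fits: 74,000 <= 75,000
print(can_power_on(100_000, 70_000, 25, 6_000))  # rejected: 76,000 > 75,000
```

The same shape of check applies to memory reservations; the real feature evaluates both.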
In version 6, VM Component Protection (VMCP) comes in to assist with APD/PDL scenarios.
Recommendations in 6.0 for maintenance:
– Disable host monitoring
– Make sure to have a redundant mgmt. network.
o Fewest hops to isolation address
o MTU size end to end the same
o Enable portfast on your switch
o Route based on Originating Port ID (active/standby)
o Failback set to No.
o Pingable gateway address
– Use admission control
DRS
– Load balancing and initial placement.
– Dependent on vCenter.
– Brokers resources between producers and consumers.
– Goal is to satisfy VM resource demands.
– Resource controls allow for resource allocation based on business drivers.
– Provides cluster management (maintenance mode and affinity/anti-affinity rules).
When using resource pools, assign reservations at the pool level and then shares to the VMs within the pool.
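To make the shares idea concrete: under contention, a pool’s capacity is divided among its VMs in proportion to their shares. A minimal sketch (hypothetical numbers; the 4:2:1 ratio mirrors vSphere’s High/Normal/Low presets):

```python
# Toy model of proportional share-based allocation inside a resource pool.

def allocate(pool_mhz, vm_shares):
    """Divide a contended pool's capacity among VMs in proportion to their shares."""
    total = sum(vm_shares.values())
    return {vm: pool_mhz * s / total for vm, s in vm_shares.items()}

# 7,000 MHz pool; High/Normal/Low style share values:
print(allocate(7000, {"db": 4000, "app": 2000, "web": 1000}))
# {'db': 4000.0, 'app': 2000.0, 'web': 1000.0}
```

Note that shares only matter when demand exceeds supply; uncontended VMs simply get what they ask for.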
Storage IO Control
A quick-fix method for detecting and handling congestion in the short term.
– Controls congestion on a shared datastore
– Focused on solving the short-term problem
– Enabled at the datastore level
– Detects congestion by monitoring avg. IO latency for the datastore
– Latency above the threshold indicates congestion
– SIOC throttles once congestion is detected
o Controls IO issued per host
– Based on VM shares, reservations and limits
– Throttling adjusted dynamically based on workload
o Bursty behaviour
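The throttling logic above can be sketched as follows: leave hosts alone while average latency is under the threshold, and shrink each host’s device queue in proportion to the shares of the VMs it runs once congestion appears. A hypothetical model, not VMware code:

```python
# Toy model of SIOC-style per-host throttling (all values hypothetical).

def host_queue_depths(avg_latency_ms, threshold_ms, total_slots, host_shares):
    """Return per-host device queue depths; throttle only when the datastore is congested."""
    if avg_latency_ms <= threshold_ms:
        return {h: total_slots for h in host_shares}  # no congestion: full queue depth
    total = sum(host_shares.values())
    # Under congestion, split the slots by aggregate shares (minimum of 1 per host).
    return {h: max(1, total_slots * s // total) for h, s in host_shares.items()}

print(host_queue_depths(20, 30, 64, {"esx1": 2000, "esx2": 1000}))  # healthy: no throttling
print(host_queue_depths(45, 30, 64, {"esx1": 2000, "esx2": 1000}))  # congested: ~2:1 split
```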
Storage DRS
More of a long-term fix, operating across a cluster of datastores.
– Controls congestion on a datastore cluster
– Detects congestion
o SIOC monitors average datastore latency
– Storage DRS migrates once congestion is detected
o Capacity threshold per datastore
o I/O metric threshold per datastore
– Affinity Rules
o Default affinity keeps a VM’s VMDKs together (VM on one datastore)
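The two migration triggers above reduce to a simple decision: recommend a move when a datastore breaches either its capacity threshold or its IO latency threshold. A hypothetical sketch:

```python
# Toy model of the two Storage DRS thresholds (all numbers hypothetical).

def sdrs_should_migrate(used_pct, capacity_threshold_pct, avg_latency_ms, io_threshold_ms):
    """True if either the space or the IO-latency threshold is breached."""
    return used_pct > capacity_threshold_pct or avg_latency_ms > io_threshold_ms

print(sdrs_should_migrate(85, 80, 5, 15))  # capacity threshold breached -> migrate
print(sdrs_should_migrate(60, 80, 5, 15))  # both metrics healthy -> stay put
```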
Storage DRS is now aware of storage capabilities through VASA 2.0
– Array-based thin provisioning
– Array-based deduplication
– Array-based auto-tiering
– Array-based snapshots
– Storage DRS integration with SRM
It is possible to set IOPS reservations on VMs, through the API only.
vMotion
vMotion has progressed a lot since its birth in 2003, from SDPS (Stun During Page Send) technology to long-distance, aka cross-cloud, vMotion announced in 2015.
– Supports geographical distances (up to 150 ms RTT)
o No WAN acceleration needed
– Maintain standard vMotion guarantees
– Various optimizations
o Batched RPCs
o BDP socket buffer sizing
o Congestion window/slow start handoff
– Disk lock handoff changes
o Restricted lock handoff
o Minimised disk IO
What happens if the switchover process (stun/un-stun) takes too long?
Consider using the advanced setting:
VMX Option = extension.convertonnew = “FALSE”
vMotion anywhere, across vCenter Server Boundaries
– vMotion across hosts without shared storage
– Easily move VMs across vDS and standard switches, folders and datacenters.
o Simplifies vCenter Migration and consolidation
o Aligns vMotion capabilities with larger DC environments
Network IO Control
– QoS on the vDS layer.
– Allows you to partition physical network bandwidth.
– Applies to vNIC and vDS port group
It uses resource pools, which enable shares, reservations and limits to ensure availability of resources.
In v6, NIOC v3 allows configuration of bandwidth requirements for individual VMs.
DRS is aware of NIOC; on initial placement it takes network resources into account and can place VMs based on this information.
Tip: use NIOC reservations at the VM level sparingly, e.g. for Tier 1 applications only.
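The partitioning idea can be illustrated with a basic budget check: per-VM reservations must fit within the slice of uplink bandwidth set aside for VM traffic. A hypothetical sketch (not the NIOC API; all names and figures are made up):

```python
# Toy model of NIOC v3-style bandwidth partitioning on a vDS uplink.

def reservations_fit(uplink_mbps, vm_traffic_pct, vm_reservations_mbps):
    """Check that summed per-VM reservations fit the VM-traffic slice of the uplink."""
    budget = uplink_mbps * vm_traffic_pct / 100
    return sum(vm_reservations_mbps.values()) <= budget

# 10 GbE uplink with 50% carved out for VM traffic:
print(reservations_fit(10_000, 50, {"tier1-db": 2_000, "tier1-app": 1_500}))  # fits
print(reservations_fit(10_000, 50, {"tier1-db": 4_000, "tier1-app": 2_000}))  # over budget
```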
Architecting Site Recovery Manager to meet your recovery goals – GS Khalsa & Smruti Patel
Protection Group considerations
Protection groups can have a many-to-many relationship with recovery plans.
1) vSphere Replication-based protection groups. Simpler than array-based replication and not tied to the underlying storage technology – therefore the storage doesn’t need to be identical at both sites, unlike array-based.
2) Array-based protection groups use consistency groups with array-based replication (e.g. VMAX to VMAX). VMs map directly to the specific datastore.
3) In SRM 6.1 there is a new storage-policy-based protection group type, leveraging storage profiles.
– High level of automation compared to traditional protection groups.
– A policy-based approach reduces OPEX
– Similarly integrates VM provisioning and decommissioning.
You assign the VM to a policy; the policy defines the storage, and SRM applies the protection automatically.
– More PGs = more granular testing/failover
o DR testing is easier
o Failover only what is needed
o Added complexity
– Fewer PGs = less complex and lower RTO
o Fewer LUNS, PG’s and recovery plans
o Less flexibility
Active/Passive Failover – Dedicated resources for recovery
Active-Active failover – Run low-priority apps on recovery infrastructure
Bi-Directional Failover – Production applications at both sites. Each site acts as the recovery site for the other.
Multi-Site – Many to one failover. Useful for remote office/branch office.
Stretched storage & Orchestrated vMotion
– The best of both stretched storage and SRM
o Support stretched storage solutions with SRM
o Orchestrate cross-vCenter vMotion
– Unified plan for disaster avoidance, disaster recovery and mobility
– Zero-downtime migrations for planned maintenance and disaster avoidance
– Ability to non-disruptively test recovery plans
– Enhanced reliability with active-active datacenters and dual vCenters
– Lower RTO in the event of unplanned failures
SRM is a paired topology; it always needs a paired server back to a central site, with a maximum of 10 SRM servers.
SRM enforces a rule that each VM can only be replicated and protected once. In a triangle replication topology, you can have A to B, B to C and C to A.
– Keep it simple
– Each VM can only be protected once
– Each VM only replicated once
– Utilize enhanced Linked mode.
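The “each VM protected once” rule is easy to validate mechanically. A hypothetical sketch that checks a replication topology such as the A→B, B→C, C→A triangle described above:

```python
# Toy validator for SRM's one-replication-per-VM rule (hypothetical data model).
from collections import Counter

def violations(replications):
    """replications: list of (vm, source_site, target_site). Return over-protected VMs."""
    counts = Counter(vm for vm, _, _ in replications)
    return sorted(vm for vm, n in counts.items() if n > 1)

triangle = [("vm1", "A", "B"), ("vm2", "B", "C"), ("vm3", "C", "A")]
print(violations(triangle))                        # [] - each VM replicated exactly once
print(violations(triangle + [("vm1", "A", "C")]))  # ['vm1'] - replicated twice, invalid
```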
Impacts to RTO
RTO is one of the most important metrics when designing a recovery plan.
How long does it take to decide to failover?
Disaster strikes – how long do you leave it before invoking failover?
– Workflow without IP customization
o Power on VM and wait for heartbeats
– Workflow with IP customization
o Power on VM with the network disconnected
o Customize IP using VMware Tools
o Power off VM
o Power on VM and wait for VMware Tools
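The reason IP customization lengthens RTO is visible if you total the steps: it adds a customization pass and an extra power cycle per VM. The durations below are entirely hypothetical; the point is the arithmetic, not the numbers:

```python
# Toy RTO arithmetic for the two recovery workflows (all durations hypothetical).

def rto_seconds(steps):
    """Sum the per-step durations of a sequential recovery workflow."""
    return sum(steps.values())

without_ip = {"power_on": 30, "wait_heartbeat": 60}
with_ip = {"power_on_disconnected": 30, "customize_ip": 45,
           "power_off": 15, "power_on": 30, "wait_tools": 60}

print(rto_seconds(without_ip))  # 90
print(rto_seconds(with_ip))     # 180
```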
Considerations for lower RTO
– Fewer, larger NFS datastores/LUNs are better.
– Fewer PGs
– Don’t replicate VM swap files
– Fewer recovery plans
– Install VMware Tools on VMs
– Suspend VMs on recovery
– Power off VMs
– vCenter Sizing – it works harder than you think
– Number of hosts (more is better)
– Enable DRS for cluster load balancing on recovery operation.
– Different recovery plans target different clusters
– Be clear with the business
o What is the RPO?
o Cost of downtime
o Application priorities
o Units of failover
– Do you have executive buy in?
– Do you have documented SLAs?
– Do your SLAs clearly communicate the RPO, RTO and availability of service tiers?
– Are your SLA documents readily available to everyone in the company?
– Use service tiers
– Minimal requirements/decisions
vSphere Infrastructure Navigator helps you figure out how your VMs are interconnected, which is useful.
– Use VLAN or isolated network to create a test network environment
– Different port groups can be specified for SRM tests versus actual failover.
– Work with network team to try and replicate some form of the production network
Test the DR Plan Frequently