Thursday, October 8, 2009

Update Manager

Here is an extremely useful tip if you are attempting to update ESX hosts in a remote datacenter.
I found it nearly impossible to update the hosts, even in a well connected datacenter. I tried applying one update at a time but still got failures.

The failures occur at around fifteen minutes, which is the default timeout for tasks in Virtual Center.

Increasing this timeout gives the patches time to copy over the WAN and apply. The value below, 10800 seconds, is three hours. It has to be changed in both the vpxd.cfg of Virtual Center and the vpxa.cfg of each ESX host.

<task>
  <timeout>10800</timeout>
</task>
<vmomi>
  <soapStubAdapter>
    <blockingTimeoutSeconds>10800</blockingTimeoutSeconds>
  </soapStubAdapter>
</vmomi>

Fix:
http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009670
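Since the same two values have to be set in vpxd.cfg and in vpxa.cfg on every host, the edit is a candidate for scripting. The following is only a minimal sketch of that idea, not a tested tool: the function name is hypothetical, it assumes both files use a `<config>` root element, and the camelCase element names are an assumption to check against the KB article and your own files, since XML is case-sensitive. It works on a string, so nothing is written to disk.

```python
import xml.etree.ElementTree as ET

# Element paths taken from the snippet in this post. The <config> root and
# the exact casing are assumptions; verify them against your own files.
TIMEOUT_PATHS = (
    ("task", "timeout"),
    ("vmomi", "soapStubAdapter", "blockingTimeoutSeconds"),
)


def set_task_timeouts(cfg_xml: str, seconds: int) -> str:
    """Return cfg_xml with both timeout elements set to `seconds`.

    Missing elements are created; existing ones are overwritten.
    """
    root = ET.fromstring(cfg_xml)
    for path in TIMEOUT_PATHS:
        node = root
        for tag in path:
            child = node.find(tag)
            if child is None:
                child = ET.SubElement(node, tag)
            node = child
        node.text = str(seconds)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical file content with a fifteen-minute task timeout.
    sample = "<config><task><timeout>900</timeout></task></config>"
    print(set_task_timeouts(sample, 3 * 60 * 60))
```

Back up the original file before replacing it with the output. The affected service normally has to be restarted before a configuration change like this takes effect.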

I found some references to editing the recvTimeout and UpdateDownloadRetries settings in vci-integrity.xml. However, this is incorrect: these settings control the download of the updates from the internet, not the transfer of the updates to the ESX host.

Tuesday, September 1, 2009

VMworld 2009

Out in San Francisco for VMworld 2009. Great conference so far. Random observations:
  • VDI classes are easier to get into. It seems like people are using VDI now, so why the low interest?
  • BYOI: The WiFi is slow. Bring your own aircard.
  • Time to get deep into vSphere.
  • Not enough time in the day or night for the sessions, dinners, and parties.
  • Sessions that are 100% statistics are boring.
  • Way too much advertising on #vmware on Twitter.
  • Everyone loves a flying monkey.

Friday, August 8, 2008

Spanning Tree and BPDU Guard

One of our engineers accidentally "bridged" the interfaces in Windows. This is done by selecting multiple interfaces and picking "Bridge". When the switch saw BPDU traffic on the port, it triggered the PortFast Bridge Protocol Data Unit (BPDU) guard feature, which disabled all of the ports (backup and production) on the host to isolate and protect the greater network infrastructure.

http://www.cisco.com/en/US/tech/tk389/tk621/technologies_tech_note09186a008009482f.shtml

This is a good safety mechanism and isolation response, but it may not be obvious at first if you encounter this issue.

We were able to correct this by shutting down and disconnecting the network ports of the offending VM through the service console and bringing the ports back up on the Cisco switch. We still had link connectivity issues and ended up having to reboot the ESX host. At that point we got all interfaces back and were able to bring up the offending VM disconnected from any network, delete the bridge, and reconnect it to the network with no further issues.

Hope this helps if anyone else runs into it.

Thursday, July 31, 2008

DRS Performance and Best Practices

VMware seems to have seriously ramped up their documentation machine in a wonderful way. I'm seeing more and more valuable papers posted.

http://www.vmware.com/files/pdf/drs_performance_best_practices_wp.pdf

There are some interesting conclusions in the document. The recommendations below are from the paper, followed by my comments.
  • When deciding which hosts to group into a DRS cluster, try to choose hosts that are as homogeneous as possible in CPU and memory. This seems to be a bit of a no brainer in any kind of clustered environment, but even more important in DRS where all nodes of the cluster are active partners, even though DRS will account for CPU/Memory size differences.
  • When more ESX hosts in a DRS cluster are VMotion compatible, DRS has more choices to better balance workloads across the cluster. We recommend clusters of up to 32 hosts. I'm not sure I buy 32 hosts yet. There is overhead involved in the DRS calculations, even though they only take place every 5 minutes by default, and the cluster limit was only recently raised from 16 to 32. There is a definite advantage to larger cluster sizes though, as the chart in the paper shows. One may also wonder why they only tested up to 16 nodes rather than 32. Of course, that could just be a lab limitation, as the cap was only lifted in U1 or U2 of 3.5.
  • The default migration threshold (moderate) works for most configurations. You can set the migration threshold to more aggressive levels when all of the following conditions are satisfied:
    • The hosts in the cluster are relatively homogeneous.
    • The virtual machines’ resource utilization remains fairly constant.
    • The cluster has relatively few constraints on where a virtual machine can be placed.
    You should set the migration threshold to more conservative levels when the converse is true.
  • The default DRS frequency is once every five minutes, but you can set it to any period between one and 60 minutes. You should avoid changing the default value. In my view this should never be less than 5 minutes, and you should adjust the aggressiveness of the migration threshold before even considering this setting.
  • In general, do not specify affinity rules unless you have a specific need to do so. In some cases, however, specifying affinity rules can improve performance.
    • Keeping virtual machines together can improve performance if the virtual machines need to communicate with each other, because network communication between virtual machines on the same host enjoys lower latencies.
    • Separating virtual machines maintains maximal availability of the virtual machines.
    • Virtual machines that might need to be separated include those with I/O‐intensive workloads. If they share a single host, they might saturate the host’s I/O capacity, leading to performance degradation. DRS does not make virtual machine placement decisions based on their usage of I/O resources.
    Very good point on the I/O placement. When will I/O be taken into consideration by DRS? VMware is addressing storage I/O well when it comes to throughput, but balancing that throughput is still nonexistent.
  • Assign resource allocations to virtual machines and resource pools carefully. Be mindful of the impact of limits, reservations and virtual machine memory overhead. Make sure you understand "slices" if you are working with DRS clusters and how they impact admission control. This was made more conservative between 3.0.x and 3.5.x. Always consider expandable reservations if you need to use reservations.
  • Virtual machines with smaller memory sizes or fewer virtual CPUs provide more opportunities for DRS to migrate them in order to improve balance across the cluster. Virtual machines with larger memory size or more virtual CPUs add more constraints in migrating the virtual machines. Hence you should configure only as many virtual CPUs and as much memory for a virtual machine as needed. This is one of the golden rules in VMware whether or not you use DRS.
  • You can specify DRS modes of automatic, manual, or partially automated at the cluster level as well as the virtual machine level. We recommend that you keep the cluster in automatic mode. DRS is now a mature and stable process. I've yet to see anyone that starts in "manual" or "partially automated" stay in that mode.
Kudos to VMware for another quality white paper.

Tuesday, July 15, 2008

PlateSpin PowerRecon: Long Term Impressions

I've been using PowerRecon for about six months now and thought I'd share my long term impressions. Technically, this is a wonderful product and really does do everything it's advertised to. I would hands down recommend this product over the VRA offered by VMware and their VARs. However, the support from PlateSpin is VERY lacking. Their support isn't advertised as callback support, but that's really what it is. Try calling the 800 number and talking to someone live: 9 times out of 10 you will just end up in a voice mail box somewhere. I have also used their support several times through email and really didn't make any headway until I asked to be contacted over the phone. Once on the phone, their support engineers are very capable and engaged. One would have hoped the Novell acquisition would have brought more money to build out a support organization, but at this point it is severely lacking. Bottom line: Wonderful product, woeful support.

Tuesday, April 29, 2008

ESX Storage Architecture

Something else interesting from yesterday: it sounds like big changes are coming to the ESX storage architecture on the back end. It's still a ways off, but it could finally mean PowerPath coming to ESX, as well as a lot of other storage tools being written to work with Virtual Infrastructure.

PowerPath has been a major pain point for us. Our storage team requires it on all SAN connected hosts. We worked around this with some rather expensive script writing by EMC. This really seems like a script that EMC should provide to customers to work around PowerPath not being available on ESX.

Issue: When a Storage Processor event occurs, such as a FLARE code upgrade or an SP failure/replacement, all the load is pushed to the remaining SP. In a PowerPath connected world, PowerPath manages the trespass and moves the LUN back to the preferred SP. ESX has no PowerPath to do this, so at this point all ESX hosts are forced to one SP and stay there.

Solution: We engaged EMC to write a Perl script that works off the Navi management server. It enumerates all the VMware LUNs and checks the "default SP". If the current owner is not the default SP, the script issues a trespass and moves the LUN to the correct SP. Seems simple enough, but by the time we included all the features we wanted, the functional spec was two pages long. Some examples are exit codes, configuration files, interactive/silent mode, etc.
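The core check is small. Here is a minimal Python sketch of just the decision logic, assuming the default and current owner of each LUN have already been collected from the array. It is not the EMC script (which was written in Perl), the `Lun` type and function name are hypothetical, and it deliberately issues no array commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lun:
    lun_id: int
    default_sp: str  # SP the LUN is supposed to live on, e.g. "A"
    current_sp: str  # SP that owns the LUN right now


def luns_to_trespass(luns):
    """Return the IDs of LUNs whose current owner is not their default SP."""
    return [lun.lun_id for lun in luns if lun.current_sp != lun.default_sp]


if __name__ == "__main__":
    # Hypothetical state after an SP B outage: everything has landed on SP A.
    inventory = [Lun(10, "A", "A"), Lun(11, "B", "A"), Lun(12, "B", "A")]
    for lun_id in luns_to_trespass(inventory):
        print(f"would trespass LUN {lun_id} back to its default SP")
```

Keeping the decision separate from the commands that act on it makes a dry run trivial, which is one way to get the interactive/silent split mentioned above.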

Monday, April 28, 2008

SAN Pathing on an Active/Passive Array Part II

In a previous post I spoke of the issues around SAN pathing on an active/passive array. Today I met with VMware, EMC, and our internal storage team to discuss this. I was slightly incorrect on how ESX decides where to assign a path. ESX issues a SCSI inq command and the first path to respond is used. So, because the first card to come up issues the command first, it usually takes the path, but depending on the load on the array and fabric, the other card could be used. In talking the issue through with everyone, we came to a couple of conclusions.
  1. This can be done.
    1. Build a table of the WWNs from both SPs.
    2. Determine which SP a LUN is on from the WWN of the SP.
    3. Determine which path on the second HBA is on the same SP.
    4. Check whether the active path is already on the preferred HBA (every other LUN alternates HBA).
    5. If it is not on the preferred HBA, disable every path except the active path and the path you want to move to. It is not enough for the target path to be ON rather than ON ACTIVE.
    6. Disable the ON ACTIVE path to force the move to the other HBA.
    7. Re-enable all paths.
    8. Continue through all LUNs.
  2. In most environments the effort is not worth the benefit though.
So, we are going to keep an eye on HBA utilization and see if a bottleneck appears. At that point we may pursue the script above.
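For reference, the steps in the list can be sketched as a planning function. This is only an illustration under assumptions, not the script we may eventually write: the `Path` type, the function name, and the path names are hypothetical, the path inventory is assumed to have been collected already, and the function only returns the ordered actions rather than running any commands.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    name: str  # illustrative path name, e.g. "vmhba1:0:12"
    hba: str
    sp: str
    active: bool = False


def plan_moves(lun_paths, hbas):
    """Map each LUN to the ordered actions that move it to its preferred HBA.

    lun_paths: dict of LUN id -> list of Path (exactly one marked active).
    hbas: HBAs to alternate between; LUNs are visited in sorted order.
    """
    plan = {}
    for index, lun in enumerate(sorted(lun_paths)):
        paths = lun_paths[lun]
        preferred = hbas[index % len(hbas)]  # every other LUN alternates HBA
        active = next(p for p in paths if p.active)
        if active.hba == preferred:
            plan[lun] = []  # already where we want it
            continue
        # Target: the path on the preferred HBA that reaches the same SP.
        target = next(
            (p for p in paths if p.hba == preferred and p.sp == active.sp), None
        )
        if target is None:
            raise ValueError(f"LUN {lun}: no path on {preferred} to SP {active.sp}")
        others = [p for p in paths if p not in (active, target)]
        plan[lun] = (
            [("disable", p.name) for p in others]  # leave only active + target
            + [("disable", active.name)]           # force the move to target
            + [("enable", p.name) for p in others + [active]]  # restore all
        )
    return plan


if __name__ == "__main__":
    def four_paths(lun):
        return [
            Path(f"vmhba1:0:{lun}", "vmhba1", "A", active=True),
            Path(f"vmhba1:1:{lun}", "vmhba1", "B"),
            Path(f"vmhba2:0:{lun}", "vmhba2", "A"),
            Path(f"vmhba2:1:{lun}", "vmhba2", "B"),
        ]

    for lun, actions in plan_moves({1: four_paths(1), 2: four_paths(2)},
                                   ["vmhba1", "vmhba2"]).items():
        print(lun, actions)
```

In the example both LUNs start on vmhba1, so LUN 1 needs no action and LUN 2 gets a six-step plan that lands it on vmhba2 without changing SP.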