Showing posts with label Virtualization. Show all posts

Friday, May 18, 2012

The Data Center Network: to Fabric or not to Fabric?


- Cliff Grossner, Ph.D., director of strategic marketing, Alcatel-Lucent (http://www.alcatel-lucent.com), says:

Many Choices, Lots of Risk
2012 has been labeled the year of the data center fabric. However, it’s still early in the standardization and deployment of data center fabrics, and enterprises are faced with a large number of choices that can spell the difference between failure and success. Hype aside, only early adopters have deployed real fabrics to date.

A properly architected data center fabric has the potential to bring to the data center the performance, scalability and elasticity required by today’s virtualized applications and by connections to Cloud services. Choosing wisely when deploying a real data center fabric can unleash the as-yet untapped potential of existing investments in server virtualization platforms.

The marketplace abounds with solutions offering vastly different and sometimes confusing alternatives. Choices in selecting a fabric include the technology used to virtualize the network and enable any-to-any server connectivity, technology for automating virtual machine (VM) mobility, technology for implementing virtual switching (vSwitch), and technology for convergence of storage traffic onto the IP network.

Help With Some Tough Questions
There are many competing technology options to be investigated for network virtualization, vSwitch and storage convergence. Many new standards are emerging to help. Some suggestions for consideration are as follows:
  • Network Virtualization with Shortest Path Bridging, IEEE 802.1aq (SPB), a newly ratified standard that also underwent a multi-vendor interoperability test in Q4 of 2011. SPB has the distinct advantages of scaling from the very small to the very large, being compatible with protocols already in use in the service provider Cloud, and making it easy to shuffle resources within a single data center site or between data center sites to optimize resource utilization and ensure quality application delivery. In effect, using SPB enables creating a cloud-like elastic environment for the enterprise.
  • Virtual Switching delivered by the top-of-rack switch rather than on the server, leveraging the Virtual Ethernet Port Aggregator (VEPA) standard, IEEE 802.1Qbg. This provides a single point of control, management and security, significantly reducing management complexity. The approach is also hypervisor agnostic, allowing freedom to choose which hypervisor to use, or even to run more than one hypervisor in the same data center.
  • Storage Convergence enabled with the Data Center Bridging (DCB) standards IEEE 802.1Qbb, IEEE 802.1Qau and IEEE 802.1Qaz. DCB gives the customer a choice of which storage convergence technology to use (iSCSI, FCoE, or FC) and a choice of “if and when” to push forward with a single network for data and storage.

Alcatel-Lucent’s Award Winning Mesh: Provide Choice, Reduce Risk
Alcatel-Lucent’s strategy with its data center switching solution, or fabric, is to take a very practical approach. This means that the customer has the choice to set their strategy in the data center and not be locked in by technology choices, such as in choosing when and if to converge their IP and storage network. In addition, Alcatel-Lucent’s solution provides the scalability and associated linear cost model to allow a pay as you grow approach for the customer and also avoids the need for a high risk forklift approach.

Alcatel-Lucent’s vision for the data center is application fluency where the network infrastructure is capable of optimizing resources to ensure the best possible user experience and reduce complexity for the IT team.  To improve end-user productivity, an application fluent network also features automatic controls for adjusting application delivery based upon profiles, policies and context. Application Fluent Networks also deliver streamlined operations through automated provisioning and low power consumption.

An application fluent data center network essentially transforms an enterprise data center into a multi-site private cloud: a single, seamless, highly elastic cloud-type environment. This gives the flexibility to reconfigure data center resources quickly and easily. Alcatel-Lucent’s application fluent data center network solution can be easily integrated with Alcatel-Lucent’s CloudBand carrier cloud solution, creating a hybrid cloud model where employees can access a wide array of data and applications anywhere and on any device.

The Alcatel-Lucent Mesh is based upon standardized technology as follows:
  • Network virtualization with SPB
  • Virtual switching with VEPA
  • Storage convergence with DCB

A Prudent Approach
Given the current state of the technology and the market when it comes to selecting and deploying a fabric in the data center, it is best to be risk averse. Choosing a fabric that can scale from the small to the large, both in architecture and cost model, will allow your organization to become familiar with the technology and transform the data center in a controlled fashion. This can potentially be done by rolling out a fabric with the migration to 10GigE servers without requiring a forklift upgrade of existing infrastructure.


Friday, May 11, 2012

Back to Data Center Basics: Load balancing Virtualized Applications


- Lori MacVittie, senior technical marketing manager at F5 Networks (www.f5.com), says: 

The introduction of virtualization and cloud computing to data centers has been heralded as “transformational” and “disruptive” and “game changing.” From an operational IT perspective, that’s absolutely true.

But like transformational innovation in other industries, such disruption is often not in how the core solution is leveraged or used, but how it impacts operations and the broader ecosystem, rather than the individual tasked with using the solution. The transformation of the auto-industry, for example, toward alternative fuel-sourced vehicles is disruptive and changes much about the industry. But it doesn’t change the way you drive a car; it still works on the same principles and the skills you’ve learned driving gas powered cars are still applicable to alternative fuel-source cars.

What changes for the operator, just as within IT, is that there may be new concerns with which you must contend.

Load balancing virtualized applications is in this category. While the core principles you’ve always applied to load balancing applications still apply, there are a few additional concerns arising from the use of virtualization that you’re going to have to take into consideration.

LOAD BALANCING 101 REFRESH
Let’s remember quickly how load balancing traditional applications works, shall we?
The load balancing service presents to the end-user a single endpoint, i.e. “the application”. Users communicate exclusively with that endpoint. The load balancing service communicates with a pool of resources comprised of one or more application instances. It is by adding instances to the pool that an application is able to scale horizontally to meet demand.

In the most common traditional load balancing environment, each application instance is hosted on a single, physical server. The availability of the “application” is maintained by ensuring there are always enough instances (nodes) available to compensate for any failures that might occur at the physical server, operating system, platform, or application layers.

Load balancing services also allow for the designation of “back up” nodes. Each node in a pool may have a back up node that is only activated in the event of a failure. This is used primarily for high-availability purposes to ensure continuous application availability rather than for scaling purposes.
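To make the refresher concrete, here is a minimal sketch of a pool with optional backup nodes. It is not how any particular load balancing product works: the `Node` and `Pool` names, the round-robin selection, and the addresses used with it are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import count
from typing import Optional


@dataclass
class Node:
    """A pool member: the load balancer sees only an address and a health flag."""
    address: str                      # "ip:port" (hypothetical values)
    healthy: bool = True
    backup: Optional["Node"] = None   # used only if this node has failed


class Pool:
    """Round-robin pool that falls back to a node's backup on failure."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._counter = count()

    def pick(self) -> Optional[Node]:
        # Try each primary once, starting from the next round-robin position.
        start = next(self._counter)
        for offset in range(len(self.nodes)):
            node = self.nodes[(start + offset) % len(self.nodes)]
            if node.healthy:
                return node
            if node.backup is not None and node.backup.healthy:
                return node.backup
        return None  # nothing left to serve traffic: "the application" is down
```

Adding a `Node` to `pool.nodes` is the horizontal scaling step; the `backup` field exists purely for availability and never receives traffic while its primary is healthy.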

Now, when we replace the physical servers with virtual servers, we have pretty much the same system. There still exists a pool of resources that comprise “the application”, the load balancing service still mediates for the end-user, and there are still enough application instances in the pool to compensate for failure, thus ensuring availability of “the application.”

However, there are some new potential sources of failure that must be addressed that impact the topology – the physical placement – of the application instances in the pool.

TWO RULES for LOAD BALANCING VIRTUALIZED APPLICATIONS
One of the most important changes coming from virtualization that must be addressed is fault isolation. Assume for a moment that we took all four physical nodes and consolidated them on a single, physical virtualized platform.

In theory, nothing changes. The load balancing service views a “node” as a unique combination of IP address and TCP port, and whether that’s hosted on a virtual platform or a physical server is irrelevant to the load balancing service. The load balancing algorithms still work the same way, nodes are selected as directed by configured policies, backup nodes are still used to ensure continuous availability, and nothing about the way in which load balancing works changes. 

But it’s very relevant to operations because this type of server-consolidated deployment model introduces a higher risk of unrecoverable failure, and it will directly impact the performance (in a bad way) of “the application.”

There are a couple of operational axioms at work here:

1. Shared infrastructure (network, compute, storage) means shared risk.
2. As load increases, performance decreases.

Let’s say “Node 1” fails. In both the physical and virtual deployments, the load is simply shifted to the remaining active nodes. No problem.

But what if the network connectivity between the load balancing service and “Node 1” fails? In a physical deployment, no problem – each node has its own physical connection and is unlikely to impact the other nodes. But what about the virtual deployment? Each node has its own virtual network connection, certainly, but does it have its own physical network connection or is it shared? If it’s a shared physical connection and it fails, then all nodes will fail – leaving “the application” unavailable.

Load Balancing Virtualized Applications Rule #1: Team and Trunk.

Physical network redundancy is a must. Modern server platforms generally come with at least two, if not four, GbE connections. Use them.

So now you’ve got your network topology designed to ensure that a physical failure will not take out every application instance on the server. Next you need to consider how the application instances are isolated and deployed to ensure that a failure at the hypervisor layer does not disrupt all application instances.

Consider that there are two possible reasons you are implementing load balancing: scalability and availability. In the former, you’re trying to ensure supply meets demand. In the latter, you’re trying to mitigate potential failure in a way that ensures “the application” is always available, regardless of failure. If there is a failure at the hypervisor layer, all instances relying on that hypervisor will be impacted (and not in a good way). Regardless of why you’re implementing load balancing, the result of such a failure is the same: instances are unavailable. Similarly, if the physical device on which virtualized applications are deployed fails, every instance on that device will be down.

In both cases, if all your virtual eggs are in one basket and there’s a failure at the hypervisor layer, you’re in trouble.

Load Balancing Virtualized Applications Rule #2: Divide and Conquer.

Application instance redundancy is a must. Never put all your application instances on a single virtualized or physical platform. Spread them across at least two, to isolate potential failures in the virtualization layer or at the physical server layer.

Node backups should always be located on physically separate devices. Load balancing services are adept at discerning failure but they are not necessarily able to determine the source. A failure to communicate with an application instance could be caused by a bad cable, a failed port, an unresponsive network stack, or an application error. The load balancing service knows the application instance is down, but not necessarily why it’s down. If it’s a crashed instance, then failing over to a back up instance on the same server is probably going to work out fine. But if the root cause is a failed port or bad cable, failing over to a backup instance on the same server isn’t going to help – because it is down too.

To ensure availability, it is imperative that there are always at least two of everything, and that means physical devices as well. Never put all your eggs in one basket, at any layer.
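Rule #2 lends itself to a simple automated check. The sketch below assumes you can export, from whatever inventory you keep, a list of instances with the physical host each one runs on; the dictionary keys (`name`, `host`, `backup_of`) and the function name are hypothetical, not part of any product.

```python
def placement_violations(instances, min_hosts=2):
    """Return human-readable problems with where pool members are deployed.

    `instances` is a list of dicts with hypothetical keys:
      name      - instance name
      host      - physical server the instance runs on
      backup_of - (optional) name of the primary this instance backs up
    """
    problems = []
    host_of = {inst["name"]: inst["host"] for inst in instances}

    # Rule #2: never put every instance on a single physical platform.
    hosts = set(host_of.values())
    if len(hosts) < min_hosts:
        problems.append(
            f"instances span {len(hosts)} host(s); need at least {min_hosts}")

    # Backups must sit on a physically separate device from their primary.
    for inst in instances:
        primary = inst.get("backup_of")
        if primary and host_of.get(primary) == inst["host"]:
            problems.append(
                f"{inst['name']} backs up {primary} on the same host")
    return problems
```

Running a check like this after every deployment or migration is one way to catch a backup that has drifted onto the same server as its primary.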

THE PERFORMANCE IMPACT
Aside from general availability issues, there is also the very real possibility that where you deploy virtualized application instances will impact performance of “the application.” Remember that even though you can designate CPU and memory on a per-instance basis, the instances still ultimately share I/O, both storage and network. That means even if you use rate limiting technologies to try to manage bandwidth consumption as a means to reduce congestion or latency, ultimately you’re impacting performance. If you don’t use rate limiting or other bandwidth-focused solutions to manage the shared network resource, you run the risk of congestion and increasing latency on the wire.

Similarly, shared storage is even more problematic because when you trace I/O down through the system, you end up at a single, shared I/O controller that is going to have some serious limitations on it. I/O intense application instances deployed on the same physical device are going to cause contention in the underlying system, which is going to negatively impact performance.

Again, divide and conquer. Disperse such instances across two (or more) physical servers. The number of servers will depend on the overall scale of the application and the resource consumption rate. Load balancing will be able to assist in maintaining performance across instances if you take advantage of a response-time aware algorithm such as fastest response time (the assumption is that response time correlates directly to load and in most cases, this is true). This keeps any given instance from becoming overwhelmed.

Ultimately, what this means is that you have to be a little more aware of physical deployment location for application instances than you did with pure physical deployments. Consolidation is a great way to reduce operational and capital expenditures, but it also means consolidating risk.
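As a rough illustration of a response-time aware algorithm, the sketch below always sends the next request to the instance with the lowest smoothed response time. The class name, the exponentially weighted average, and the `alpha` value are assumptions made for this example; commercial load balancers implement their own variants.

```python
class ResponseTimeBalancer:
    """Pick the instance with the lowest smoothed response time."""

    def __init__(self, instances, alpha=0.3):
        self.alpha = alpha  # smoothing factor (assumed value)
        # Unmeasured instances start at 0.0 and are therefore tried first.
        self.avg_ms = {name: 0.0 for name in instances}

    def record(self, name, elapsed_ms):
        # Exponentially weighted moving average of observed response times.
        prev = self.avg_ms[name]
        if prev == 0.0:
            self.avg_ms[name] = float(elapsed_ms)
        else:
            self.avg_ms[name] = self.alpha * elapsed_ms + (1 - self.alpha) * prev

    def pick(self):
        # Lowest average wins; a slow (presumably loaded) instance is avoided.
        return min(self.avg_ms, key=self.avg_ms.get)
```

Because a contended host tends to respond more slowly, instances stuck behind a busy shared I/O controller naturally receive less traffic until they recover.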

LOCATION MATTERS
This is a particularly tough nut to crack especially when combined with the desire to implement auto-scaling operations in a more cloud-like environment. The idea that you can leverage “whatever idle resources” you can find to scale out applications on-demand is powerful, but it’s also potentially fraught with risk if you’re unable to control placement at all. While the possibility that every instance would end up deployed on a single server or even a select handful of servers is minimal, there is the possibility that multiple instances could be deployed in a way that means a single server failure could eliminate a sizeable number of application instances, resulting in an unacceptable degradation of performance or even downtime for some percentage of users.

In the end, location really does matter when it comes to load balancing virtualized applications. Where they are deployed and in what groupings becomes a critical factor for maintaining performance and availability. The tendency to increase VM density is high, but that tendency can lead to highly disruptive situations in the event of a failed component. Be aware that cost savings from mass-consolidation and “high efficiency” through increasing VM density metrics may look good now, but may not look so good through the lens of hindsight. 

Tuesday, May 1, 2012

Secure and Resilient Virtual Computing Platform for Sensitive Cloud Deployments




Frank Huerta, CEO of TransLattice (www.translattice.com) says:

LynuxWorks, Inc., TransLattice and Fritz Technologies Corporation have combined their technologies and expertise to create a new platform for building cloud deployments in sensitive environments. The resulting S.E.C.U.R.E. (Secure, Enterprise, Cross-Domain, Unified, Resilient Environment) platform solution provides an ideal environment for situations requiring secure hosting of applications, geographic redundancy of applications and data, and secure cross-domain transfer of information.

The S.E.C.U.R.E. solution consolidates multiple applications onto virtual machines (VMs) on a single server platform while:

  • Maintaining complete domain separation in the virtualization solution
  • Reducing cost through hardware consolidation
  • Deploying market-standard operating systems including Windows and Red Hat Linux
  • Achieving high security standards for military deployments
  • Improving resilience with a lattice approach to application and data distribution

The TransLattice technology, TransLattice Application Platform (TAP), fully distributes both the application and its data across multiple VMs and system platforms within a secure domain, ensuring non-stop operations, even in the case of the loss of one or more components of the computing “lattice.” This means that mission-critical applications and data are fully distributed and the loss of an individual system (node) or multiple nodes does not endanger the continued efficient operation of the application.

“Customers today are looking to move their applications and data into the cloud, while maintaining the security of sensitive data,” said Frank Huerta, CEO and co-founder at TransLattice. “We are very confident that the S.E.C.U.R.E. platform will help meet the needs of our customers. By working closely with LynuxWorks and Fritz Technologies, we have been able to leverage both the technologies and expertise that provide a great environment for our TAP solution.”

Tuesday, April 10, 2012

DCIM Solution to Control Data Center Sprawl

- Maurice Donegan, director of product management for Emerson Network Power’s Avocent business (www.emersonnetworkpower.com), says:

Virtual sprawl results when different applications are added to handle various business processes, causing the automated solutions (put in place to manage demand in virtual environments) to populate virtual instances in an undisciplined way. Consequences include unbalanced power consumption and resource utilization.

Emerson Network Power has introduced version 650 of its Aperture™ Suite with advanced software for data center planning, management and performance optimization. Key features respond to customer needs to manage and track the impact of virtual sprawl on the data center, gain a dashboard analysis view of data center efficiency and manage integrated monitoring of data center devices.

While earlier versions of the Aperture Suite helped managers better understand the amount of resource consumption and where it occurs across the data center floor, the new version takes the next step of giving them insight into what is driving that consumption. By integrating with popular virtual management systems (VMware and Microsoft Virtual Machine Manager), now virtual processes can be mapped to their physical hosts to prevent infrastructure overloading and to identify underutilized resources. Armed with this knowledge, the data center manager can advise staff where to shift resources across the data center to more safely provide the infrastructure support they require.

In addition to helping control virtual sprawl, the Aperture Suite now offers high-level dashboards that deliver more actionable data center performance metrics. With the new dashboards, data center personnel can quickly see if efficiency numbers are being met, whether assets are performing correctly, which assets might be aging out and more – and can easily report this information to management.

In short, the latest Aperture Suite release enables data center managers to see more discretely across the data center, evaluate all options and optimize resources to safely and efficiently support demand. It also provides a wealth of performance data in a form that can readily be used to support management decisions.

Tuesday, March 20, 2012

Intelligent Workload Management in Today's Data Center

- Q&A with Derek Slayton, vice president of marketing at VMTurbo (www.vmturbo.com):




Chris MacKinnon: Why is intelligent workload management useful in today's enterprise data centers? Why should data center and IT managers care about it? How can they benefit from it?


Slayton: Virtualization has become commonplace and almost expected in any enterprise data center. Reasons to adopt vary from cost savings to a decrease in management requirements. However, if a virtualized workload is mismanaged, the cost is enormous.


Virtualized data centers, which share hardware resources to dynamically meet workload demands, must perform a balancing act between assuring mission-critical application performance and utilizing the virtualized infrastructure as efficiently as possible. To gain the optimum ROI of a virtualized data center, the system needs to know which workload to run where, and when. With an over-provisioned workload, the data center faces increased capital and operating costs, and the loss of resources that may be better used elsewhere. Under-provisioning may increase utilization, but quality of service, and consequently revenues, will suffer.

Therefore, intelligent workload management is a must that results in hardware efficiency gains, increased productivity, and lower IT costs, but the approach is not without its own unique management challenges.

MacKinnon: Where should intelligent workload management rank in terms of overall priority in the data center?

Slayton: Intelligent workload management should be a high priority for any data center seeking the best return on their virtual infrastructure. Real-time optimal workload placement and resource allocation can provide savings in:

1) Server and storage cost reduction due to improved resource utilization;

2) Software license cost reduction due to elimination of physical and virtual infrastructure sprawl;

3) Infrastructure cost savings due to lower electricity, A/C, floor space, and rack space costs; and,

4) Lower support/personnel costs through automation of tasks associated with incident/problem management, capacity planning, optimization and stakeholder reporting.

MacKinnon: What are the biggest challenges for data center and IT managers when it comes to workload management for virtualization?

Slayton: Virtualization is redefining the data center with benefits such as lower costs of operation, hardware consolidation, better utilization of existing resources, and lower initial investment – but the technology also redefines data center management needs. VM resource utilization and performance behaviors may be dramatically different from those of physical servers. In contrast to physical servers, VMs see their resources fluctuate dynamically and may experience bottlenecks from other VMs. The increased utilization of physical resources can also drive applications beyond the boundaries of safe operations if not managed properly.

MacKinnon: How can data center and IT managers overcome those challenges?

Slayton: Virtualization requires new resource and performance management technologies to handle these new factors of complexity. Data centers need to replace manual partitioned management with dynamic, scalable, automated, and unified resource and performance management to maximize ROI.

MacKinnon: What advice can you give to IT and data center managers that have a plethora of similar solutions to choose from?

Slayton: It’s no secret the market is stuffed with vendors offering products to monitor and report on performance of virtual infrastructure. Many can even collect and report on a broad range of performance metrics. I think it is extremely important for data center managers to educate themselves on the solutions available and look for the one that best suits their needs.

Monday, March 12, 2012

vCPU Sizing Considerations






Challenges and considerations in determining how many vCPU’s to allocate to virtual machines.
 
By VKernel's David Davis and Alex Rosemblat (http://www.vkernel.com/):

When sizing virtual machines, a virtualization administrator must select the number of vCPU’s, size of the virtual disk, number of vNICs, and the amount of memory. Out of all those potential variables, the two most difficult to determine are always the number of vCPU’s and the amount of memory to allocate. This is because CPU and memory are the most finite resources that a server has and these resources are also the most dynamically demanded resources by the guest OS in each VM. We covered virtual machine memory sizing in our VKernel whitepaper (see VM Memory (vRAM) Sizing Considerations). Now let’s cover proper vCPU sizing in your virtual machines.

Virtualization administrators should avoid over-allocating vCPU’s because doing so wastes expensive server resources and reduces ROI on that infrastructure. In fact, over-allocation of vCPU’s in some VMs will actually cause VM performance problems for that VM and other VMs. On the other hand, the business critical applications running in virtual machines also need to maintain high performance and they need processing power to do so. The last thing a virtualization administrator wants is to have performance complaints from end users. VM administrators face difficulty in balancing the need to maximize ROI of the server hardware and the requirement for applications to perform optimally.

Fortunately, with the right tools in place and the guidance from this whitepaper, virtualization administrators will gain a deeper understanding and be able to easily make the right decisions related to vCPU sizing.

By reading this whitepaper, VM administrators will learn:

• How to correctly size VMs in an environment
• How to approach appropriate sizing of new VMs that need to be deployed
• How to institute a regular CPU sizing process for the data center
• How CPU usage works at different levels of the virtualization “stack”
• How to screen for CPU-based VM performance issues before attempting to optimize an environment
• What VM performance metrics can provide insights on CPU usage and how to access them

CPU Usage through Different Levels of the Virtualization Stack
With the traditional physical server (or desktop PC), you have an entire CPU or multiple CPUs (each with multiple cores) dedicated to the OS and applications running on it. The virtualization hypervisor adds an additional layer between OS and the physical CPU, allowing multiple virtual machines to share the hardware. Instead of the CPU requests from applications going to the OS and then the OS scheduling them on the physical CPU, the OS in the VMs talks to virtual CPUs (which it thinks are real physical CPUs). Requests from the multiple virtual CPUs are scheduled, by the hypervisor, across the multiple physical CPU cores. Just like with memory sharing in virtualization, with CPU sharing in virtualization, there is the traditional OS CPU scheduler and the hypervisor CPU scheduler.

All of this enables greater utilization and massive sharing of the server’s physical CPUs.

Differences in CPU Usage in Physical and Virtual Servers
To summarize, the difference in CPU usage with physical and virtual servers is this:

• In the physical world, applications are scheduled by the OS onto the physical CPUs
• In the virtual world, applications running on each OS, in each VM, make requests of virtual CPUs that are then scheduled by the hypervisor

Single and Multi-Threading in CPUs
Not only do today’s servers have multiple CPU cores in each CPU, but they are also multi-threading (aka “Hyper-threading” if you are using an Intel CPU). This means that each CPU core can execute “threads” in parallel. You can think of a thread as a process, so envision a CPU core being able to run multiple processes all at the same time (in parallel). However, having more than one CPU will only benefit applications that are multi-threaded. In other words, servers that are dedicated task-based servers (database servers, web servers, etc.) but run a single-threaded application will see no performance benefit from adding more CPUs.

To know if an application is single-threaded or multi-threaded, you have two options:

1. Ask the software manufacturer if it is multi-threaded and supports SMP (symmetric multi-processing)
2. See if, when you have more than one processor or core on a server, the program’s multiple processes are using more than one processor or core at a time. In other words, say that you have a quad-core CPU; you would watch the application as the CPU demands increase. If the application can use only 25% of total CPU capacity (1 of the 4 cores), then it is a single-threaded application that can’t use more than one core. On the other hand, if the application is using 50, 75, or 100% of the total CPU capacity (2, 3, or 4 of the 4 cores), then it is multi-threaded.

Whether you are running an application on a physical server or a virtual machine, you only want to have multiple CPUs available if the application running on that host can take advantage of those CPUs with multiple threads.
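The second option boils down to simple arithmetic, sketched here. The function names and the 10% tolerance are assumptions for illustration; the input is the peak whole-machine CPU percentage you observed while the application was under load.

```python
def busy_cores(total_cpu_percent, core_count):
    """Translate a whole-machine CPU percentage into 'cores kept busy'."""
    return total_cpu_percent / 100.0 * core_count


def looks_single_threaded(peak_total_cpu_percent, core_count, tolerance=0.1):
    """True if the app never kept more than about one core busy under load."""
    return busy_cores(peak_total_cpu_percent, core_count) <= 1.0 + tolerance
```

On a quad-core server, a peak of 25% maps to one busy core, so `looks_single_threaded(25, 4)` is true, while a peak of 50% maps to two busy cores and is not. Treat the result as a hint only: an application that was never pushed hard enough during the observation window can look single-threaded when it is not.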

Performance Impacts of CPU Command Processing With One or Multiple vCPU’s
On a physical server if you have multiple CPUs available for the server’s primary application but that app isn’t multi-threaded, you are wasting the cost of those CPUs. However, in a virtual infrastructure, if you over-allocate multiple vCPU’s to a VM and the primary app on that VM isn’t multi-threaded, you could actually cause performance issues for that VM and others. The reason for this is that, with multiple vCPU’s, the hypervisor’s CPU scheduler must wait for multiple physical CPU time slots to become available before it can process requests from the multi-vCPU VM.

In other words:

• With one vCPU, CPU requests are quickly processed (or they are waiting on pCPU if no pCPU is available)
• With multiple vCPU’s, the hypervisor CPU scheduler must wait for multiple pCPUs to be available
• Having multiple vCPU’s when not needed will slow down VMs
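A toy model can show why the extra vCPU’s hurt. The sketch below assumes the strict behavior described above, where a VM runs only when as many pCPUs as it has vCPU’s are free in the same time slot. The function name and the sample data are made up for illustration, and real hypervisor schedulers are considerably more sophisticated than this.

```python
def first_runnable_slot(free_pcpus_per_slot, vcpus):
    """Index of the first time slot with enough idle pCPUs for the whole VM.

    Toy model only: the VM is scheduled when `vcpus` physical CPUs are free
    at the same time. Returns None if that never happens in the series.
    """
    for slot, free in enumerate(free_pcpus_per_slot):
        if free >= vcpus:
            return slot
    return None


# Hypothetical count of idle pCPUs in five consecutive scheduling slots.
free = [1, 2, 1, 1, 4]
waits = {n: first_runnable_slot(free, n) for n in (1, 2, 4)}
```

With this sample data the 1-vCPU VM runs in slot 0, the 2-vCPU VM waits until slot 1, and the 4-vCPU VM waits until slot 4, even if its application could only ever keep one of those vCPU’s busy.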


Detecting Virtual Host and VM Performance Problems before Analyzing vCPU Allocations
Deciding the proper number of vCPU’s for a VM should be a long-term performance planning exercise which is periodically performed. Prior to starting that vCPU planning exercise, you should first determine if there is already CPU over-utilization within your virtual machines or on your virtual host.

Typically, this is done by checking the following:

• CPU Ready (cpu.ready.summation) in milliseconds (ms)
This per virtual machine and per vCPU statistic is the number of milliseconds that the VM was ready to execute requests on the virtual host’s CPU but there wasn’t pCPU available to do so. An increasing CPU Ready value for a VM indicates that there is an ongoing lack of CPU cores on the virtual host.

• CPU Usage (cpu.usage.average) in %
This CPU usage statistic is available per host, per VM, or per resource pool. It shows us what % of the time, over the time range selected, that the CPU was utilized. Note that the total CPU includes all vCPU’s on a VM or all pCPUs on a host. If you already have > 80% CPU utilization on a VM or host, over a 1 hour time period, you should find ways to solve that problem. Those solutions could include reducing the number of vCPU’s on your VMs, migrating active VMs to another host, adding more pCPU to a host, or dealing with misbehaving applications.
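Because CPU Ready is reported in milliseconds, it is easier to reason about once converted to a percentage of the sample interval. The sketch below assumes the summation value covers exactly one sample interval and uses 20 seconds as the default, which is the interval commonly associated with real-time charts; pass your own interval for other chart types. The function names are hypothetical, and the 80% test simply encodes the rule of thumb given above.

```python
def cpu_ready_percent(ready_ms, interval_s=20.0):
    """Share of the sample interval a vCPU spent ready but not running.

    Assumes `ready_ms` is the CPU Ready summation for one sample interval
    of `interval_s` seconds (20 s is an assumed default).
    """
    return ready_ms / (interval_s * 1000.0) * 100.0


def over_utilized(hourly_usage_percent, threshold=80.0):
    """Rule of thumb from the text: more than 80% over a 1-hour period."""
    return hourly_usage_percent > threshold
```

For example, 1,000 ms of CPU Ready in a 20-second sample means the vCPU spent about 5% of that interval waiting for a physical CPU.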

If your virtual hosts don’t have the CPU capacity needed for today’s application demands you should solve that first. It should be noted that one of the potential solutions for solving high virtual host CPU utilization is to reduce the number of vCPU’s allocated to VMs if they were dramatically over allocated.

VMware Performance Metrics to use for Virtual Host pCPU and VM vCPU Usage Analysis
At this point, you’ve determined that your physical server isn’t over-utilized and it’s time to move on to a structured analysis of vCPU allocation on one or across all virtual machines. We’ll provide a detailed process below but first, let’s cover what you need to know before you start that process:

• How to Measure CPU Usage Maximum (cpu.usage.maximum) in percent
This shows the maximum amount of CPU that was used at any one time, as a percentage. Keep in mind that if you have 2 vCPU’s, a 50% utilization value could be 100% of one vCPU and 0% of another, or 50% of each. This value will be used to determine if we hit the maximum amount of CPU already allocated to that VM during the time range.

• How to Measure CPU Usage Average (cpu.usage.average) in percent
This value shows the average CPU utilization over the time range. Note that this is based on the total number of vCPU’s such that if you have 2 vCPU’s, a 100% utilization over the time range is 100% of both vCPU’s.

• Understand Time Ranges and Sample Periods
To be able to accurately understand statistics that you are viewing, you must fully understand that these stats are measured over the default sample periods or whatever sample period you specify. For example, if you measure real-time or one minute CPU usage, it is very likely that you will see periodic CPU utilization peaks of 100%. Short periodic CPU usage at 100% isn’t cause for alarm or changes to CPU configuration. Instead, I recommend that you look at CPU statistics over a slightly longer sample time to ensure that you are seeing trends instead of instantaneous usage.

Some number of those statistical samples are then pulled out and stored in a historical database and used when you run a performance report over a specific time period. Understanding both the sample time and the time range are important when it comes to interpreting performance graphs.

• Know How to View and Modify vCPU’s for VMs
Viewing and modifying vCPU configurations on your VMs isn’t something VMware Admins need to do every day, however you do need to know where to do it and when you can make changes. You can view VM vCPU configuration by clicking on a VM and looking at the Summary tab. You can edit VM vCPU’s by going to a VM’s properties and then clicking on the CPU in the hardware tab. You can also view vCPU configurations across all VMs by going to a higher level in the virtual infrastructure (from a VM, go up to the ESXi server or up to the resource pool or cluster level) and then clicking on the Virtual Machines tab.
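
To make the points above about usage percentages and sample periods concrete, here is a small sketch. Both function names and the sample readings are hypothetical. It shows that a usage percentage is relative to all vCPU’s on the VM, and that averaging over a longer window hides a short peak that is still visible in the maximum.

```python
def vcpus_in_use(usage_percent, vcpus):
    """Translate a VM-wide usage percent into 'vCPU's worth' of work."""
    return usage_percent / 100 * vcpus


def rollup(samples, window):
    """Average consecutive groups of `window` samples (a simple rollup)."""
    groups = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(group) / len(group) for group in groups]


# Hypothetical one-minute usage readings (%) for a VM
samples = [20, 25, 100, 30, 20, 25]

print(vcpus_in_use(50, 2))   # 1.0 -> 50% of a 2 vCPU VM is one vCPU's worth
print(max(samples))          # 100 -> the short peak shows up in the maximum
print(rollup(samples, 3))    # the averaged values are far below that peak
```

This is why both the maximum and the average are needed when judging whether a VM is really short on CPU.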

Additional Factors to Keep in Mind when Allocating VM CPUs
Limits

By configuring a limit on a virtual machine, an artificial cap is being placed on the maximum amount of CPU that the VM can use. Without a limit, the maximum CPU that can be used is the MHz of the pCPUs on the server, per vCPU allocated to a VM. Keep in mind that while a VM has full access to the total MHz of a pCPU (per vCPU allocated), those pCPUs are still shared between that VM and all others on that virtual host. Note that limits configured in the hypervisor aren’t visible to the guest OS, which can further cause unexpected application performance issues.
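
As a rough illustration of that arithmetic, the sketch below computes the most CPU a VM could use with and without a limit. The function name and the 2,600 MHz core speed are made up for the example, and the result is a ceiling only, since the pCPUs are shared with other VMs.

```python
def cpu_ceiling_mhz(pcpu_mhz, vcpus, limit_mhz=None):
    """Upper bound on the CPU (MHz) a VM can consume.

    Without a limit: the pCPU clock speed times the number of vCPU's.
    With a limit: whichever is lower, the limit or the unlimited ceiling.
    """
    unlimited = pcpu_mhz * vcpus
    if limit_mhz is None:
        return unlimited
    return min(unlimited, limit_mhz)


print(cpu_ceiling_mhz(2600, 2))                  # 5200
print(cpu_ceiling_mhz(2600, 2, limit_mhz=2000))  # 2000
```

In the second case the guest OS still sees two vCPU’s, which is why a limit can produce performance problems that are hard to explain from inside the VM.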

While functionality exists to configure a MHz limit on a virtual machine's vCPU, it rarely makes sense. Instead, it makes more sense to set a CPU MHz limit across all VMs in a particular resource pool.

Many times, VM CPU limits are put in place by an administrator who did not fully understand the poor performance and wasted resources that limits can create. When looking at CPU utilization, CPU limits skew performance metrics and can cause confusion while troubleshooting. In other cases, VM CPU limits have also been put in place by the VMware Admin to limit physical resource consumption by applications (unbeknown to the application owner).

Reservations

CPU reservations artificially set the minimum amount of CPU that a virtual machine (or resource pool) has access to. Even though those CPU resources may not be needed by the VM now, a reservation pulls those CPU resources away from other virtual machines that may need them. Like limits, reservations are better set on resource pools instead of individual virtual machines as they can hurt the performance of other virtual machines and skew CPU metrics due to the artificial requirements being put in place.

Knowing an Application’s Needs

Taking the time to look inside the virtual machine and analyze the CPU resources that an application uses can yield a great deal of information as to that VM’s CPU needs. When evaluating a Windows operating system, Resource Monitor and Performance Monitor can be run to expose which processes use the most CPU and how it varies.

Also, speaking to the business owner for an application helps determine when this application is used, who uses it, what would happen if it were unavailable, and how the use of the application is growing.

By taking time to understand these factors, VM administrators can draw further insights into properly sizing vCPU for the VM, configuring resource controls (if needed), and understanding the priority of the application as compared to other apps.

Special Considerations When Changing vCPU Configurations

When you determine that the number of vCPU’s that a VM has configured needs to be changed, there are a few considerations:

• When going from one vCPU to many OR from many to one vCPU, kernel changes may be required in the VM guest operating system. For example, with Windows Server 2003 you need to make a change to the HAL (see VMware KB article 1003978 for more information). However, with Windows Server 2008 you can switch between single and multiple CPUs without making any HAL changes.
• Virtual machines need to be shut down to remove vCPU, and most VMs need to be shut down to add vCPU unless the vSphere “hot plug CPU” feature was enabled before boot AND the VM meets the operating system requirements to use that feature.

Instituting a VM CPU Sizing Process
Not only is a virtual environment dynamic, but also, the usage of the applications in the VMs will be in constant flux. As a VMware Admin working on a critical production virtual infrastructure with hundreds or thousands of constantly changing virtual machines, there must be a formal process for proper vCPU sizing, other than just a "rule of thumb".

This process, ideally, involves the application admins and will have to be undertaken BOTH when a new VM is being created for an application as well as on a periodic basis.

The workflow diagram below introduces a best practice for how to execute this process. While this process may need slight modification for certain companies, it will work as-is for most VMware Administrators.

Below, this vCPU sizing process is presented step by step:

Determining Time Frame for Analysis
As I mentioned earlier in this guide, when performing your analysis, you need to take the perspective of the time frame into consideration. Statistics can easily be interpreted incorrectly and improper vCPU configurations can be made just by looking at a poorly selected time range.

While it is difficult to pick one timeframe that is perfect for all situations, I generally recommend a timeframe that is no less than 1 week. However, in some cases, that time frame may be as much as a year (such as a retail company that has high utilization around the holiday season or a university that has high utilization around registration). On the other hand, if you have a common application that you know is relatively predictable, you can safely look at time frames between 1 week and 1 month.

Finding the Peak Value
So what peak value do you choose? “Maximum” and “Peak” values are interesting because they show if the vCPU’s were ever 100% utilized during your time period. However, you need to 1) investigate WHEN that was (was it during backups?) and 2) weigh that against the average utilization. This is the case because an instantaneous value of 100% CPU utilization for a single vCPU VM over a week means nothing if it only happened once during the backup window and the average utilization is just 25%.

In other words, don't just look at the current value and use that. Use the tips above in determining a time frame to make sure that you identify the true peak.

Comparing the Peak Usage Value to Allocated CPU
From there, the true peak value observed, if deemed relevant in context, can be compared to the number of vCPU’s configured for the virtual machine. The process should be:

• Look at the average and peak (maximum) percent of vCPU utilized during your sample
• Compare that to the number of vCPU’s that are allocated to that VM
• If the average is less than 38% and the peak is less than 45% then you should consider downsizing vCPU’s
• If the average is greater than 75%, and the peak is above 90%, then you should consider adding vCPU’s.
For example, if the VM is configured with Qty 2 vCPU’s and the maximum utilization is 100%, you are maxing out both vCPU’s, at some point. If the average utilization is consistently > 80% then consider adding vCPU’s.

On the other side of this, if after the comparison we find that the vCPU’s never hit their maximum, average utilization is low, and the number of vCPU’s is > 1, then consider migrating the VM down to 1 vCPU (or reducing the VM’s vCPU by one).
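
The comparison above can be sketched as a small helper. The thresholds are the ones listed in this section; the function name and the returned strings are hypothetical, and the result is only a prompt for further review, not a decision.

```python
def vcpu_recommendation(avg_pct, peak_pct, vcpus):
    """Apply the average/peak thresholds from this section to one VM."""
    if avg_pct > 75 and peak_pct > 90:
        return "consider adding vCPU"
    # A 1 vCPU VM cannot be downsized any further
    if vcpus > 1 and avg_pct < 38 and peak_pct < 45:
        return "consider downsizing vCPU"
    return "no change"


print(vcpu_recommendation(avg_pct=82, peak_pct=100, vcpus=2))  # consider adding vCPU
print(vcpu_recommendation(avg_pct=20, peak_pct=40, vcpus=4))   # consider downsizing vCPU
print(vcpu_recommendation(avg_pct=25, peak_pct=100, vcpus=1))  # no change
```

The last call mirrors the backup-window example: a single 100% peak with a low average does not, by itself, justify more vCPU’s.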

Forecasting Usage Changes until Next VM CPU Allocation Review
Before deciding to downsize a VM's vCPU configuration, it should be noted that the vCPU demands of the VM (and, more specifically, its applications) could increase between now and the next time that this vCPU sizing process is conducted.

Has the application owner been contacted? Was an increasing trend in vCPU demand noted based on historical analysis? It is important to ensure that sufficient vCPU for both current usage and forecasted future usage growth is provisioned or a VM will face vCPU-related performance issues.

Factor in a Buffer to determine the Right Allocation
Besides factoring in an expected growth rate, adding a buffer is important to ensure that the vCPU allocation is accurate. While historical peaks have likely been taken into account, it is still good to factor in a buffer to ensure that the virtual machine's vCPU has headroom to grow into and does not max out. For less critical applications, perhaps this is not necessary based on the application’s needs, and can be avoided to retain more CPU capacity for other VMs. However, for multi-threaded critical applications whose CPU utilization fluctuates, adding an additional vCPU is recommended. This way, in the event that an unexpected, business critical demand on CPU happens, end-user application performance won’t suffer.
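
One way to combine the observed peak, the forecast and the buffer is sketched below. This is an illustration only: the function name is hypothetical, and expressing growth and buffer as fractions is an assumption of this sketch rather than something the process above prescribes. The example figures are invented.

```python
import math


def vcpus_needed(peak_pct, vcpus, growth_rate=0.0, buffer=0.0):
    """Estimate a vCPU count from the true peak, expected growth and a buffer.

    peak_pct    -- true peak usage (%) across all currently allocated vCPU's
    growth_rate -- expected growth until the next review (0.10 = 10%), assumed
    buffer      -- extra headroom on top of the forecast (0.20 = 20%), assumed
    """
    demand = peak_pct / 100 * vcpus            # vCPU's worth of work at peak
    demand *= (1 + growth_rate) * (1 + buffer)
    return max(1, math.ceil(demand))           # never go below 1 vCPU


print(vcpus_needed(peak_pct=90, vcpus=2, growth_rate=0.10, buffer=0.20))  # 3
print(vcpus_needed(peak_pct=30, vcpus=2, growth_rate=0.10, buffer=0.20))  # 1
```

For a less critical application the buffer could simply be left at zero, which keeps more CPU capacity available for other VMs.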

Change the VM Sizing Allocation, Kernel-Level Changes and Document Change
Now that the timeframe for analysis has been determined, peak utilization found and compared to the allocated amount, and further forecasting for VM CPU usage in the future has been undertaken, it is time to take action: the virtual machine's number of vCPU’s can be changed.

Unless the correct guest OS is installed and CPU hot plug is enabled (assuming you need to add additional vCPU), the guest OS must be shut down to add or remove vCPU from a virtual machine.

Once the guest OS is shut down, a virtual machine's vCPU can be resized by right-clicking on the VM (in the hosts and clusters inventory) and clicking on Properties. At that point, you'll see the window below where you can resize the VM’s CPUs on the Hardware tab.

Once the VM's vCPU has been resized, you need to take Guest OS Kernel / HAL changes into account.

Typically, the servers that need HAL / Kernel changes done when going from one to multi-CPU or from multi to one are the older Windows Server operating systems like Windows Server 2003. Without the proper changes, Windows Server 2003 systems will perform slowly or not at all. Windows Server 2008 and Linux-based systems don’t require any HAL/Kernel changes when their quantity of virtual CPUs changes.

With Kernel / HAL changes done, you can document the change to the VM’s vCPU and power back on the VM.
Setting Up a Regular CPU Sizing Review Based on an Appropriate Time Frame
Because application usage changes, vCPU must be continuously evaluated to ensure that performance will not be impacted due to additional changes in the dynamic virtualized infrastructure.

A regular vCPU review process should be set at appropriate intervals. It is this paper’s recommendation that this process be performed once per month. However, some organizations that are growing quickly may want to do this more frequently. Once a VM sizing process has been conducted a few times on the virtual machines in an environment, the steps will become more familiar and administrators will be able to build on this experience to streamline the process.

Provisioning a New VM for Appropriate vCPU Allocations
Now, let's look at the vCPU sizing workflow from the beginning and take the alternate path: let's say that a new virtual machine is being provisioned. There won't be any historical performance data for this new virtual machine. Because of this, the vCPU configuration made for this VM could be much less accurate and the VM will need to be monitored more closely.

Still, following the sizing procedure below with frequent monitoring during the time shortly after the VM is created will provide the necessary insights into the number of vCPU’s that VM should have.

Initially Allocate the Minimum vCPU’s Possible
In many cases, there will be some idea of how many vCPU’s a new VM should have based on the same application (or similar app) running on another VM. Even if this information is available, the process of monitoring and reviewing described below should still be followed.

On the other hand, if a brand new VM for an application or usage case that has not been previously deployed is being created, and there is no indication how many vCPU’s should be allocated for that app and its users, you could just go with the recommendations as if this app were being installed on a physical server. Or, you could start with just 1 vCPU and see how it goes from there.

While it may be tempting to go with a "gut feel" or use the same number of vCPU’s that a physical server is configured with (e.g., just give all Windows servers 2 x vCPU’s OR “the SharePoint admin said he’d like his VM to have 6 vCPU’s”), it's likely that the allocation will grossly overshoot the number of vCPU’s needed or, even worse, underestimate the number of vCPU’s required by the application and its users.

Monitor CPU Metrics Intensely
Once a new VM has been deployed with a set number of vCPU’s, CPU usage should be monitored intensely. Just how frequent and in-depth this monitoring is depends on the criticality of that VM. Administrators may also be able to rely on vSphere alarms to alert if and when the new VM is hitting high CPU utilization.

When monitoring, look both at the vSphere VM CPU usage level and inside the guest OS to see how much CPU is in use and by what application. Is the VM maxing out the base quantity of vCPU’s you configured? If so, then more vCPU’s are likely needed.

Add More vCPU’s One at a Time
If the new VM is running low on CPU capacity, add vCPU’s one at a time and monitor very carefully after each change. Remember, you want to find the right number of vCPU’s for the VM, not just throw a bunch of resources at it and potentially cause performance problems later.

Add the New VM to the Regular CPU Sizing Review Process
This new VM should be included in the periodic vCPU review process discussed above once its vCPU usage has stabilized and it has an appropriately sized quantity of vCPU.

Additional considerations are…
Assessing SLA Requirements
As administrators, we strive to give applications what they need for peak performance. However, in the real world, due to service level agreements or cost models, this isn’t always possible. In these cases, we have to intentionally give fewer vCPU’s than would be ideal or use resource controls like limits to only give a VM (and its underlying CPU-hungry application) what should be allocated for business reasons.

Thus, keep service level agreements in mind when analyzing vCPU allocations and realize that there are cases where we have to allocate a vCPU value other than what is optimum.

Conclusion
Besides memory, CPU is the most finite computing resource that a virtual infrastructure has. By leveraging the information presented in this whitepaper and implementing the workflow detailed above, virtualization administrators can more efficiently use their computing resources. This procedure will allow data centers to reclaim CPU resources for other VMs, which results in increased VM density and will help defer purchases of hardware. Additionally, regular monitoring of CPU usage will allow VM administrators to proactively spot problem areas in VMs that are underprovisioned and may experience performance issues. Ultimately, accurately sizing vCPU in virtual machines can result in a better return on investment for a virtualization initiative, application owners that are confident in the performance that a virtualization endeavor provides their applications, and end users that are content with their usage of a firm’s IT resources.

About the Author

David Davis is the author of the best-selling VMware vSphere video training library from Train Signal. He has written hundreds of virtualization articles on the Web, is a vExpert, VCP, VCAP-DCA, and CCIE #9369 with more than 18 years of enterprise IT experience. His personal Website is VMwareVideos.com.

Monday, February 27, 2012

Breaking Down Cost, Complexity and Operational Barriers to Cloud-based Storage Solutions.

- Ash Ashutosh, CEO of Actifio (http://www.actifio.com/), says:

Actifio™, the Protection and Availability Storage (PAS) Platform Company, recently announced a collaboration with IBM to deliver new virtualized storage offerings that help service providers get their clients into the cloud.

To date, service providers and Value Added Resellers (VARs) have been unable to deliver on the promise of the economics of the cloud due to their use of traditional technologies and IT architectures that create silos of physical infrastructure and applications. By combining server, storage and data management virtualization technologies, with a purpose-built Service Level Agreement (SLA)-driven operational model, Actifio and IBM will be able to deliver solutions that break down the cost, complexity and operational barriers to cloud-based storage solutions.

As a result, service providers can offer a wide range of cloud storage services cost-effectively to end-users while realizing strong margins and improved SLAs associated with cloud applications.

“The combination of Actifio and IBM enables us to deliver a more robust set of cloud services to our customers,” said Actifio customer and NaviSite President R. Brooks Borcherding. “It allows us to deploy a more cost effective network, better manage our data, and deliver new self service enhancements. As a result, our customers gain increased visibility and management control over their cloud storage environments from NaviSite.”

“2012 is the year when the critical mass of technologies and market momentum comes together to deliver on the promise of the cloud. End-to-end virtualization of the IT stack delivered by the Actifio and IBM solution is key to transforming IT into a cloud service, private or public, delivered at an unprecedented low cost point,” said Ash Ashutosh, Founder and CEO of Actifio. “With technology crossing the tipping point, VARs can become service providers and, along with existing service providers, deliver differentiated data management services quickly and cost-effectively, supported by IBM’s global delivery capabilities.”

Thursday, February 16, 2012

Requirements for Capacity Management in Data Center Virtual Environments

- Bryan Semple, CMO at VKernel (http://www.vkernel.com/), says:

Any data center with a virtualized environment has a real need for effective capacity management. This white paper discusses the reasons why capacity management is critical to achieving the benefits of server virtualization and outlines the three key requirements to consider when evaluating capacity management systems.

Why Capacity Management in Virtualized Environments
A major advantage of virtualized environments is their ability to improve resource utilization by running multiple virtual machines (VMs) on the physical servers in a shared infrastructure. With such an architecture, utilization can increase from as low as 10% for dedicated servers to 60% or more for virtualized servers. The enhanced resource efficiencies make it possible to more fully utilize ever-increasing server power and provide significant savings in capital expenditures, power consumption, rack space and cooling.

This concept of greater efficiencies through resource-sharing is not new. Mainframe systems have long employed time-slicing to enable multiple applications to run concurrently. With mainframe systems, the dedicated and quite sophisticated “capacity planning” is performed by the operating system, which ensures that no application can cause any others to suffer from resource contention issues. The high cost of mainframes created a strong incentive for IT departments to maximize mainframe resource utilization by running as many concurrent applications as physically possible.

Today’s server virtualization solutions operate in a similar manner. Hypervisors enable multiple virtual applications to run on the same physical x86 server, with all sharing the common CPU, memory, storage and networking resources. Through the magic of the hypervisor, each application operates as if running alone on a dedicated server. As with the mainframe, however, each virtual machine is actually sharing resources with other virtual machines. And as with the mainframe, the multiple applications sometimes contend for shared resources causing performance to degrade, especially during peak periods.

The goal with both mainframes and virtualized servers is the same: optimize resource utilization without degrading performance to maximize cost-saving efficiencies. Organizations undertaking server consolidation projects invariably experience such savings, at least initially. Where their data centers had been filled with row after row of underutilized servers, each running a single application, the post-consolidation data center may have seemed almost deserted with the reduction in the number of racks required. A very successful consolidation effort, for example, might be able to run as many as 10 or 20 different applications on each server, thereby requiring only 1/10th the number of servers. Fewer servers consuming less space and power and requiring less cooling led to significant savings.

This dream of dramatic savings through consolidation and virtualization has the potential to become a real performance nightmare, however, without good capacity planning and management. The key to successful capacity management, therefore, is to ensure satisfactory application performance (prevent the nightmare) while maximizing efficiencies and savings (preserve the dream).

At a high level, managing virtualized server capacity is not that much different from managing mainframes, which also have shared CPU, memory and storage resources. But looking deeper at the details reveals some dramatic differences that might make the mainframe’s systems engineer (read: capacity manager) feel completely unqualified to deal with the complexity inherent in open systems capacity management in virtualized environments.

Clearly, for an organization to benefit the most from its virtualized infrastructure, robust capacity management must be an integral component of that infrastructure. Leading analyst firms Gartner, Forrester and others all concur on this need. In Jean-Pierre Garbani’s report titled I&O’s New Capacity Planning Organization, for example, the Forrester analyst states emphatically: “Capacity management and planning are the keys to virtualization.”

Distributed Resource Schedulers Are Not Capacity Managers
An obvious question to ask here is: “Don’t applications like distributed resource schedulers solve the capacity management problem?” And the answer is an emphatic no. DRS applications are intended to balance the load virtual machines place on a hardware cluster. So just as a hypervisor provisions and balances the resources a virtual machine is able to consume on a single host, distributed resource schedulers perform the same provisioning and balancing across a cluster of hosts.

Balancing load across resources in an environment is important. But a balanced environment can still lack sufficient capacity or have too much capacity. In addition, virtual machines in a balanced environment can still be impacted by performance problems caused by noisy neighbors or by the underlying resource availability of the hosts they are running on. So while operating distributed resource schedulers is good practice, system administrators need more management capabilities at the host, cluster and data center levels to fully and effectively plan and manage capacity.

Capacity Management Challenges will Only Increase
While daunting already for some today, the capacity management challenges faced by most IT organizations are certain to increase. The following trends are driving the need for more sophisticated capacity management solutions:

• Environment scale – Relentless growth in applications will cause virtualized environments to become increasingly larger and denser, making capacity management more complex.
• Mission-critical applications – The growing number of critical applications will all require enhanced performance monitoring.
• Multi-hypervisor deployments – The use of multiple hypervisors will require an agnostic approach to capacity and performance management.
• Cost optimization – With the “low-hanging fruit” savings from initial server consolidation projects now in the past for most organizations, future savings will need to come from cost optimization initiatives. And while chargeback is not as effective as initially thought at curbing waste, CFOs will continue to demand annual improvements.
• VM mobility – Mobility among private, public and hybrid clouds and even among development, production and DR environments, will add complexity to the decision-making process for determining the optimal allocation of VM capacity and workloads.

All of these trends combine to make maintaining the control over and the predictability of virtualized capacity increasingly challenging.

The Cost of Delay
Many organizations do not yet have a purpose-built capacity management system for their virtual environments, relying instead on other tools to somehow perform this essential function. Without dedicated and sophisticated capacity management, however, one of two scenarios inevitably unfolds: either the environment is so over-provisioned that there are no performance issues (and no one has yet caught on to the tremendous waste!); or administrators are using spreadsheets and other manual procedures in a daily struggle to maintain service levels by constantly reallocating an increasingly complex array of resources (often by trial-and-error!).

Delaying the inevitable need to implement fully-effective capacity management has real costs to an organization that often manifest as:

• Application performance problems as VMs contend with each other for resources
• Hours spent firefighting either perceived or actual problems throughout the virtual environment
• Loss of confidence in virtual infrastructure performance
• Wasteful resource allocations that undermine the cost-saving advantages of virtualization
• Over-purchasing of server hardware, memory or storage on a routine basis
• Hours of staff time spent maintaining spreadsheets for management reporting (time spent not being able to work on more productive projects)
• Incorrect sizing of new servers during a hardware refresh by paying a premium for:
     o Expensive scale-up systems when scale-out systems are more efficient
     o Excessive support for scale-out systems where scale-up systems are more appropriate
     o Purchasing the latest CPUs for maximum clock speed when slower, earlier generation (and far less expensive) CPUs will suffice
     o Purchasing the latest, highest-density memory when far more economical lower density memory is sufficient for the actual VM load.

But perhaps the greatest cost of delay is not getting started aligning IT services with costs. Public clouds now provide alternatives for internal IT consumers to shop for services. These data points create the perception that public cloud services are “cheaper” and these beliefs are difficult to counter when IT has yet to develop a workable cost model for its services.

Even ignoring the public cloud “competition”, most IT executives are currently focused on maximizing resource utilization to drive down capital and operational expenditures. Virtualization provides the ability to begin aligning IT costs to the services provided. But it is critical to begin this journey with a full understanding of the linkages among capacity, performance and cost. And this is perhaps the biggest reason not to delay implementing a genuine and capable capacity management system. Senior IT management is focusing on the problem throughout IT. Solving the problem sooner rather than later in virtualized infrastructure just makes sense.

Requirements for Capacity Management Solutions
What are the requirements for capacity management in a virtualized environment? At a high level, a capable capacity management solution must:

• Offer enterprise-wide visibility into performance, capacity, cost and resource efficiency of the entire virtualized infrastructure
• Provide actionable intelligence from this information
• Be simple to deploy, operate and maintain

Enterprise-Wide Visibility
Performance, capacity, cost and resource efficiency are all intertwined in a virtual environment. Without sufficient capacity, performance suffers. With too much capacity, infrastructure costs soar. Even with the right amount of capacity, efficiency can still suffer if virtual machines are consuming more expensive resources than are required.

Therefore, performance, capacity, cost and resource efficiency must be viewed across the enterprise in a holistic fashion to provide visibility for the administrator, as well as to provide information that is both sufficient and accurate enough to facilitate fully-informed decision making. Such visibility requires roll-ups of information across:

• Data centers
• Different types of hypervisors
• Different resource pools, such as CPU, memory, storage and networking

Simply being able to roll up information is not enough, however. As environments scale, functionality must be added to view all of this information in a meaningful way.

Cost
Understanding the virtual environment cost structure is critical as enterprises move toward the cloud. Because the cloud enables self-service portals, end users can quickly drive up operating costs in the absence of a thorough understanding of the underlying costs. Indeed, the sheer ease with which virtual environments enable the deployment of virtual machines has led to virtual machine sprawl. Understanding the cost component of a virtual environment is, therefore, essential to good capacity planning. Support for cost visibility requires:

• Chargeback (or at least “showback”) capabilities by customer for either allocated or utilized resources

• Robust reporting and potential integration with financial management systems
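
As a rough illustration of the difference between billing on allocated and utilized resources, the following Python sketch totals a monthly showback figure per customer. The unit rates, field names and record layout are assumptions made for this example only; real rates would have to come from the organization's own cost data.

```python
from collections import defaultdict

# Hypothetical monthly unit rates; these numbers are illustrative only.
RATES = {"vcpu": 20.0, "mem_gb": 5.0, "storage_gb": 0.10}


def showback(vms, basis="allocated"):
    """Total the monthly charge per customer.

    basis selects which resource figures are priced:
    "allocated" (what was provisioned) or "utilized" (what was consumed).
    """
    if basis not in ("allocated", "utilized"):
        raise ValueError("basis must be 'allocated' or 'utilized'")
    totals = defaultdict(float)
    for vm in vms:
        resources = vm[basis]
        totals[vm["customer"]] += sum(RATES[r] * resources[r] for r in RATES)
    return dict(totals)


# Example inventory with one VM belonging to a hypothetical customer.
inventory = [
    {
        "customer": "finance",
        "allocated": {"vcpu": 4, "mem_gb": 16, "storage_gb": 100},
        "utilized": {"vcpu": 1, "mem_gb": 4, "storage_gb": 50},
    },
]

by_allocation = showback(inventory, "allocated")
by_utilization = showback(inventory, "utilized")
```

The gap between the two totals for the same customer is one simple indicator of how much provisioned capacity is going unused.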

Chargeback/showback may encounter some significant organizational and computational limitations, however. These include:

• Financial systems that lack the ability to integrate chargeback information

• Generally accepted accounting principles that make chargeback difficult

• Budgeting cycles that are based on assumptions of fixed costs, not variable consumption models. IT customers faced with consumption-based chargeback models must then confront the challenge of estimating uncertain computational demand.

• IT charging back for services may not be “politically palatable” for an organization

• Determining the chargeback amounts is also a non-trivial exercise if the intention is to get an accurate model of pricing

• Finally, chargeback is a measure of the price IT is charging for services and is not necessarily a measure of the actual cost to deliver that service. At a high level, the average actual cost to deliver a service is the total cost to own and operate all IT infrastructure divided by the number of virtual machines on that infrastructure. IT needs to focus on its actual costs, not the costs charged. This makes chargeback, without cost awareness, less beneficial as a management tool.

The barriers to chargeback are many. Nevertheless, this should not prevent IT from being on the path to understand and manage its cost structure. A key element to this is to implement a cost index that reflects the cost to IT to deploy a VM. Cost indices are a fairly new and advanced tool for IT. Using a cost index, the systems administrators can identify their most expensive virtual machines based on resource consumption, cost of the underlying hardware and density of deployment relative to other virtual machines. By identifying the most expensive virtual machines, actions can be taken to reduce costs or at least understand the impact on overall efficiency. Combining cost indices with cost visibility provides a solid foundation to lower IT costs over time and to understand the main cost drivers throughout the IT infrastructure.
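
To make the cost index idea more concrete, here is a minimal Python sketch that ranks virtual machines by cost relative to the fleet average. The article does not define a formula, so the cost model below (an even blend of resource consumption and deployment density, applied to the host's monthly cost) and all names and sample figures are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cpu_share: float          # fraction of host CPU consumed (0..1)
    mem_share: float          # fraction of host memory consumed (0..1)
    host_monthly_cost: float  # cost to own and operate the underlying host
    vms_on_host: int          # deployment density of that host


def monthly_cost(vm: VM) -> float:
    """Hypothetical model: half by consumption, half by an even split of the host."""
    consumption = (vm.cpu_share + vm.mem_share) / 2
    density = 1 / vm.vms_on_host
    return vm.host_monthly_cost * (0.5 * consumption + 0.5 * density)


def cost_index(vms: list[VM]) -> dict[str, float]:
    """Cost of each VM relative to the fleet average (1.0 means average)."""
    costs = {vm.name: monthly_cost(vm) for vm in vms}
    average = sum(costs.values()) / len(costs)
    return {name: cost / average for name, cost in costs.items()}


fleet = [
    VM("web-01", cpu_share=0.1, mem_share=0.1, host_monthly_cost=1000, vms_on_host=10),
    VM("db-01", cpu_share=0.4, mem_share=0.2, host_monthly_cost=2000, vms_on_host=4),
]
index = cost_index(fleet)
# Most expensive VMs first.
ranked = sorted(index, key=index.get, reverse=True)
```

In this sketch a VM on a costly, lightly packed host scores high even if its own consumption is modest, which reflects the point that the underlying system, rather than the VM, can be what drives up the cost.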

Resource Efficiency
Understanding efficiency in virtualized environments is critical because wasted or under-utilized resources are what drive up capital and operational expenditures. More importantly, since one of the original goals for virtualization was server consolidation and efficiency improvements, poor efficiency of the virtualized environment undermines this fundamental and worthy goal. Of course, IT can perform chargeback or showback, yet still have tremendous inefficiencies throughout the environment. Chargeback can, however, be a tool to help reveal such inefficiencies.

How is virtualized resource efficiency monitored and analyzed?

• The cost index, introduced above, is a way for IT teams to understand the relative costs of operating a virtual machine. While one VM could be expensive to operate relative to others, it could be operating efficiently with the underlying system being the actual culprit driving up the costs. A capacity management system must, therefore, be able to rank the indexed virtual machine costs accordingly.

• Over-allocating VM resources is a major source of inefficiency. Over-allocation occurs when applications have more CPU, memory or storage than needed to perform adequately. The capacity management system must be able to identify over-allocations continuously, preferably by monitoring for peak and average values of resource utilization across CPU, memory and storage. As some hypervisor vendors shift to consumption-based models for licensing, removing over-allocated memory will become an increasingly important aspect of cost control.

• Wasted resources occur in virtualized environments from normal operations. These wasted resources include zombie VMs, abandoned VMs, unused templates and unused snapshots. Capacity management systems must effectively distinguish these wasteful resources from similar resources that are actually in production use.
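
The peak-and-average check for over-allocation described above could be sketched as follows in Python. The thresholds and the sample data are assumptions chosen for illustration; a production system would tune them per resource and per workload.

```python
from statistics import mean


def find_over_allocated(samples, peak_threshold=0.5, avg_threshold=0.2):
    """Return the names of VMs whose utilization never comes close to their allocation.

    samples maps a VM name to a list of utilization readings, each expressed
    as a fraction of the allocated resource (0.0 to 1.0).
    Both thresholds are hypothetical defaults.
    """
    flagged = []
    for name, readings in samples.items():
        if not readings:
            continue  # no data collected yet, so no judgment is made
        if max(readings) < peak_threshold and mean(readings) < avg_threshold:
            flagged.append(name)
    return flagged


# Example memory utilization readings for three hypothetical VMs.
memory_usage = {
    "web-01": [0.10, 0.15, 0.30],
    "db-01": [0.60, 0.90, 0.70],
    "new-vm": [],
}
candidates = find_over_allocated(memory_usage)
```

Requiring both a low peak and a low average avoids flagging a VM that is idle most of the time but genuinely needs its full allocation during bursts.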

Actionable Intelligence
While enterprise-wide visibility is the first major requirement for a capacity management solution, generating actionable intelligence from the data collected is just as important. Capacity management is essentially an analysis problem. Correctly performing capacity and performance management requires the analysis of about 20 different metrics per virtual machine at the VM, host, cluster and data center levels, taken at intervals of five minutes or less. For a simple 100 virtual machine environment, for example, this requires the analysis of 100 VMs x 20 metrics x 12 samples/hour x 24 hours x 4 levels of analysis, yielding about 2.3 million data points per day. Given the sheer volume of data, it is not difficult to see why manual processes simply fail to scale. The better capacity management solutions are able to perform this multi-variable analysis on a massive scale.
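
The data volume estimate above is easy to reproduce and to rescale for other environments. The short Python sketch below uses the figures from the text as defaults; the function and parameter names are invented for this example.

```python
def data_points_per_day(vms, metrics_per_vm=20, interval_minutes=5, levels=4):
    """Estimate the daily number of data points a capacity tool must analyze."""
    samples_per_hour = 60 // interval_minutes
    return vms * metrics_per_vm * samples_per_hour * 24 * levels


small_environment = data_points_per_day(100)
print(f"{small_environment:,}")  # prints 2,304,000
```

Because the total grows linearly with the number of virtual machines, an environment ten times larger produces roughly 23 million data points per day under the same assumptions.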

The visibility requirement of capacity management solutions demands a significant amount of computational horsepower simply to make sense of the wealth of data. The need for creating actionable intelligence requires even more computations to enable system administrators to move beyond basic visibility into various problems and efficiency issues to being empowered to take action to address the underlying cause(s), either in a manual or automated fashion.

For performance issues, actionable intelligence involves:

• Root cause analysis of the problem with specific recommendations that do not require any additional analysis for how to clear the problem

• Impact analysis to point out any related virtual objects that might be affected by an issue

• Automated actions to clear performance issues, such as moving a virtual machine to a different cluster, or working with native distributed resource schedulers to accomplish the task

• Automated resizing of a virtual machine within the limitations imposed by the operating system(s) or corporate policies, with or without a restart

For cost accounting and efficiency, actionable intelligence involves:

• Specific recommendations for ways to improve efficiency and lower costs

• Automated zombie destruction, template cleansing and abandoned VM clean up, especially for QA environments that potentially contain thousands of virtual machines

• Automated downsizing of virtual machines within the limitations imposed by the operating system(s)

• Automated reporting of cost and efficiency numbers for key stakeholders in a variety of formats

Visibility without actionable intelligence, while a step in the right direction, leaves the administrators, especially in larger environments, with a significant labor burden to maintain efficiency and performance of their virtualized infrastructures.

Starts to Work Out of the Box
Why does a capacity management system have to be hard to deploy? It doesn’t. Here are some reasons ease of deployment and use are important requirements:

• Lack of dedicated capacity planners – Most virtualization teams, even very large ones, do not have dedicated capacity planners. Any application to manage capacity and performance must, therefore, be useable by all team members without significant training.

• Lack of capacity management skills in the existing team – Capacity management is an analytics problem. It is certainly possible to develop this skill set on a virtualization team without a dedicated capacity planner. But there currently exists no formal certification authority similar to vExpert for managing capacity. Even with sufficient training, the scale of the analytics problem would still require a sizable investment in software (whether bought or built) to mine the raw metrics data generated by the hypervisors.

• Return on investment – Long deployment times and high training costs both diminish the return on the investment in a capacity management system. The better systems are able to recover their purchase price in just a few months.

• Pace of expansion – Virtualized environments are in a constant state of change, which normally involves both reconfiguring and expanding resources. If the capacity management solution cannot keep pace with the expansion, or requires constant configuration changes to do so, it will eventually need to be replaced.

• “Expert in the Box” – Similar to having a vExpert on staff to address technical issues, the capacity management solution needs to function like an expert itself from day one and it should not require an expert operator to get great results—ever.

The specific requirements for a capacity management system to work “out of the box” are:

• Little to no configuration or maintenance required for operation

• A minimal learning curve for basic operation with intuitive interfaces to facilitate usage by part-time capacity planners

• Automation to support repetitive tasks

• Automatic creation of user views to eliminate the need for manual customization

• Pre-configured and easily customizable reporting for different audiences

Conclusion
Capacity management for virtualized environments is an absolute necessity in any infrastructure of reasonable scale. Virtualization has reintroduced the mainframe model of computing, but with significantly more complexity for sharing CPU, memory, storage and networking resources. While implementing a capacity management solution could be postponed, most environments will incur some very real and potentially substantial costs by doing so. Robust capacity management systems—those that meet the requirements outlined here for enterprise-wide visibility, actionable intelligence and “out of the box” productivity—pay for themselves almost immediately, with the cost savings continuing to accumulate year after year. It is perhaps the best investment an organization will ever make to get the most from its virtualized infrastructure.






Wednesday, February 8, 2012

Rethinking Virtualization Strategies

Q&A with Kent Christensen, practice manager with Datalink (http://www.datalink.com/):

Chris MacKinnon (DCP): Why are unified virtual infrastructures useful in today's enterprise data centers?

Christensen: A dramatic transformation in the way Information Technology (IT) departments operate has put their directors and managers at a crossroads. On one side, IT administrators are under pressure to deliver higher levels of service and be more responsive to enabling competitive business objectives. On the other side, IT departments are equally pressured to limit budgets, “do more with less,” and show positive ROI from optimization initiatives.

Savvy IT leaders are beginning to resolve both sides of this conflict by rethinking their virtualization strategies. Virtualization was originally a way to improve utilization of physical servers. Now it’s being expanded to turn entire data centers into dynamic, agile, services-oriented architectures — ones that accelerate business objectives and competitiveness.

Data center virtualization is a rare opportunity for IT. The potential cost savings are tremendous. The efficient sharing of physical server, storage, and network resources translates into far lower capital purchases and operating expenses. Wasteful application “silos” are eliminated. Data centers can support more applications, implement them faster, and maintain higher service levels. Data center virtualization also gives IT managers and admins powerful new tools for resource scheduling, data protection, and disaster recovery. And while the prospect of low-cost, no-fuss cloud computing from outside vendors is tempting, it’s not ready for prime time due to serious performance and security issues.

Instead, an IT department can use data center virtualization to build its own private cloud, delivering the same economies and efficiencies to the organization. Then, once the public cloud matures, IT can buy resources from third parties as needed to meet unexpected demands or offload resource-intensive tasks.

MacKinnon: Why should data center and IT managers care about unified virtual infrastructures? How can they benefit from them?


Christensen: Virtualization across the data center can provide notable savings on floor space, power, and cooling costs, as well as utilization of existing assets across servers, storage, and networks. While the financial benefits alone are compelling, the largest gains can be obtained by reducing complexity and streamlining the speed at which IT accelerates the business.

Instead of building separate infrastructures according to the needs of individual applications, data center virtualization lets you build a dynamic platform of infrastructure that supports all applications. Abstracting applications from physical resources gives you managed capabilities that you can’t get from physical hardware. These include:

- The ability to migrate live applications from one physical server to another without disruption
- Increased availability for applications during hardware failure
- Resource scheduling and load balancing across existing infrastructure
- Improved backup and disaster recovery
- Increased performance, scale, and security
- Integration with storage and network infrastructures

The result is a platform that will support many—if not most—IT applications. The availability, performance, and security are provided by the platform, which reduces the need to build those services into each individual application. The resulting common infrastructure is much more flexible and agile. This is also the framework for expanding to an internal private cloud infrastructure.

MacKinnon: Where should unified virtual data center infrastructures rank in terms of overall priority in the data center?

Christensen: A recent Gartner survey (Source: Gartner Executive Programs - January 2012) found that cloud computing ranks #3 on a list of top ten priorities for CIOs in 2012.

MacKinnon: What are the biggest challenges for data center and IT managers when it comes to unifying and virtualizing their data centers?

Christensen: There can be a lot of obstacles to building a virtual data center. Virtualization is still new in many ways and not fully understood outside of the core IT group. There can be disagreements due to the number and complexity of solutions, and the fact that they cross multiple disciplines. As you map out your virtualization strategy, consider the barriers to adoption, both inside and outside your organization.

Internal barriers fall into two groups: politics and culture, and new ways to think about IT. Most organizations use a variety of applications running on different platforms. Each has its own requirements for networking and storage resources and may have different requirements for access and availability. Multiple applications and technologies can lead to isolated islands of data and potential interoperability issues. In addition, the stakeholders who helped build those applications likely have entrenched policies and attitudes that are not easily changed. As a result, many organizations have a number of different virtualization initiatives directed by different groups within the company. Server teams may not be in sync with application administrators, and storage or networking teams may take a completely different and uncoordinated approach. A unified approach may disrupt the “corporate culture” and can create some internal conflict where decisions could potentially be based on relationships and alliances rather than sound business principles.

Virtualization also requires new skills. Many people need time to think it through. But thinking is good because building an internal cloud requires a lot of planning based on an understanding of exactly what the business needs. It’s an incremental process, taking the time to think through where you want to go and how you will accomplish it.

External barriers largely come from disagreement within the industry on how to proceed. No two storage or network virtualization vendors agree on how to design and deploy a virtualization strategy. Reliable interoperability standards have not yet emerged. That’s why it’s prudent to work with a vendor-agnostic consultant such as Datalink. Whereas many manufacturers can only push their products and services, we look at a plethora of options, making our customers’ success our first priority.

MacKinnon: How can data center and IT managers overcome those challenges?

Christensen: The first and most important step is to create a vision and lead to that vision. As a board or CEO considers outside service providers (or cloud providers) as experts at delivering IT services, internal IT organizations are challenged with creating a competitive operation. The opportunity for IT leadership is to think and act like a service provider to the organization. What services does the organization need, not only to maintain existing operations but also to gain a competitive advantage? And how can IT deliver those services to the organization reliably and efficiently?

If you look at a cloud service provider as an example, many IT organizations come to the logical conclusion that they can provide services more reliably and at a reduced cost by building a highly efficient internal or private cloud that is designed to support the organization.

Armed with a goal to create a highly competitive and efficient operation, IT leaders need to provide leadership to break down existing silos of thought, design and even procurement, and raise the bar of IT to what is required to holistically deliver the services the organization requires. This is where a unified data center architecture can accelerate the mission to create unified, orchestrated data centers that are both highly efficient and agile enough to drive business needs.

MacKinnon: What advice can you give to IT and data center managers that have a plethora of similar solutions to choose from?


Christensen: The challenges with building an internal private cloud are threefold. One is that no single vendor or solution delivers a complete unified private cloud architecture. As a result, organizations either need a partner that can assemble a complete solution or have to continue sorting out the solutions themselves. An integrator like Datalink has experience in delivering complete unified architectures and helps align best-of-breed solutions against the organization’s requirements.

The second is that a pre-defined private cloud architecture in most cases will not fit a particular organization’s objectives. Many times, IT will determine that it does need agile, unified resources that are elastic and measured but choose, for example, to limit the use of self-service or chargeback. So working with an integrator that can align objectives is important.

Finally, it’s important to recognize that most organizations cannot simply stop existing operations and transform overnight. A flexible solution should be able to both leverage existing infrastructure and grow as the business objectives dictate. Working with a solution that can migrate you toward an IT-as-a-Service private cloud, rather than one that sells you a complete solution all at once, is the most common approach.