Showing posts with label Data Center Monitoring. Show all posts

Monday, April 30, 2012

Enhancing SCOM Exchange Monitoring

Jean-Francois Piot, vice-president of Microsoft Global Market at GSX Solutions (www.gsx.com), says:


The need for enhanced SCOM Exchange monitoring

While SCOM for Exchange is seen as a robust monitoring tool that provides a wide range of server level statistics, many SCOM environments could benefit from more advanced user and service level monitoring and reports.

GSX Monitor and Analyzer aims to fill that gap with a powerful application performance monitoring solution that enables IT administrators and managers to proactively maintain their enterprise collaboration environment from a single user interface. Its robust executive dashboard and reporting tool highlight key trends and performance metrics, enabling you to prioritize and act on emerging issues before they impact users.

How GSX complements SCOM
GSX complements SCOM for Exchange in terms of simplicity, relevancy, end user performance, and reporting.

Simplicity
GSX Monitor and Analyzer can be installed and configured in less than an hour. It requires no agent on your servers, and thresholds can be configured in a matter of minutes. In-place upgrades are made via the GSX Monitor station.

Relevancy
GSX Monitor only sends alerts on issues that impact your users or service. This helps you to keep a step ahead of any issues since you don’t need to sift through hundreds of nuisance alerts to find what’s important. You can then correlate these alerts with any SCOM notifications to help quickly resolve the issues.

End user performance
GSX Monitor and Analyzer ensure performance from a user perspective. It is one thing for a server to be up and running; it is another matter entirely for the service to be performing as the user expects. GSX Monitor alerts you to user impacts, and GSX Analyzer helps you with capacity planning and measuring performance against any set SLA. The focus is always on what’s most important: your users’ experience and the services you provide.

Reporting
GSX Monitor gives you access to all current statistics and graphs of recent performance. GSX Analyzer lets you view historical statistics, trend your growth, and even forecast the future. This allows you to pinpoint any negative trends, stay a step ahead of performance issues, and plan for capacity growth. What’s more, you can view a dynamic snapshot of SLA performance, set a performance indicator’s SLA, and then select the number of KPIs that are critical to achieving your SLA. If performance isn’t meeting the SLA, you can quickly determine which server is impacting it and how. For example, if server availability is supposed to be 99.9% but you see that you’re only at 98%, you can quickly locate the offending server with only 80% availability and continuous RAM utilization of 99%. To meet the SLA, you can then add more RAM or remove the server from your environment.
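
The SLA example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how GSX Analyzer works internally: the server names are hypothetical, and treating overall availability as a simple unweighted mean across ten servers is an assumption chosen so the figures match the 98% and 80% in the text.

```python
from statistics import mean

SLA_TARGET = 99.9  # percent, the target quoted in the example

# Hypothetical per-server availability (percent): nine healthy servers
# and one server at 80%, as in the example.
availability = {f"EXCH{i:02d}": 100.0 for i in range(1, 10)}
availability["EXCH10"] = 80.0


def overall_availability(per_server):
    """Unweighted mean across servers (an assumption, not a GSX formula)."""
    return mean(per_server.values())


def offenders(per_server, target):
    """Return (server, availability) pairs below the target, worst first."""
    below = [(name, value) for name, value in per_server.items() if value < target]
    return sorted(below, key=lambda item: item[1])


overall = overall_availability(availability)
print(f"overall: {overall:.1f}% (target {SLA_TARGET}%)")
for name, value in offenders(availability, SLA_TARGET):
    print(f"below target: {name} at {value:.1f}%")
```

With these assumed figures, a single server at 80% is enough to pull a ten-server environment down to 98%.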

A comprehensive solution
To conclude, GSX Monitor and Analyzer provides the comprehensive reporting and analysis you need to help fill the gaps in your SCOM Exchange environment and deliver the performance that your users expect. Together, SCOM, GSX Monitor, and GSX Analyzer provide a comprehensive solution to your Exchange monitoring, analysis and troubleshooting needs.

Tuesday, March 13, 2012

You Can’t Manage What You Can’t Measure in Enterprise Data Centers

- John Consoli, vice president of sales and marketing at FieldView Solutions (www.fieldviewsolutions.com) says:

Accessing real-time energy data is vital for the successful operation of today’s enterprise data centers. The adage “you can’t manage what you can’t measure” has never been more relevant to what we do. The data center’s critical resources are space, power, cooling and connectivity. Most enterprise data center managers can tell you exactly how many data ports and circuits they have available but, at best, have “guesstimates” when it comes to usable space, power and cooling. The vast majority implement safety buffers and overprovision to ward off catastrophic failure. The tradeoff is catastrophic WASTE that goes right to the bottom line.

Managing today’s complex, high-density data centers without real-time monitoring is like trying to fly a jumbo airliner without an instrument panel. The value proposition for implementing a real-time energy management tool literally writes itself. The ability to increase utilization of available resources by 20, 30 or 40 percent has a HUGE impact on both operating expenses and capital expenditures.

How much energy can be saved by raising data center temperatures 2-3 degrees Fahrenheit? What if you could delay or totally avoid new construction because you find that there is 30% more available power than you thought? Real-time energy monitoring can deliver both of these things, and more! We see it happen for our clients every month!!

As mentioned before, how important is the instrument panel to a jumbo jetliner? The FAA reports that the most advanced commercial jet monitors up to 800 data points per minute. A 5,000-square-foot, Tier 3 data center with a load of 5 kW per rack has potentially 100 times more than that!

The biggest challenge with regard to implementing real-time energy management is twofold:

1) People are resistant to change
2) Organizations have not developed a cohesive action plan for the data once it is collected

Data center and IT managers need to have a plan! Failure to plan is planning to fail!

The plan should be detailed and documented and involve stakeholders from IT, facilities, senior management and lines of business. It is often worth the investment to:
• Bring in a professional third-party consulting firm to manage the project.
• Implement a plan for selecting the right tool, based upon the needs and goals of the organization.
• Consider vendor evaluation as a critical success factor.
• Create a documented plan for how data will be used and what the positive impact will be on corporate goals.

Implementation of real-time energy management should be important enough to be included in the company’s annual report and should be on everyone’s radar screen, up to and including the CEO.

Insist your vendors “put up or shut up.” Don’t be the first one on your block to implement their solution. Do not settle for smoke and mirror presentations. Your selection process MUST include a Proof of Concept (POC) installation at one of your sites!! The POC should have a clear, written scope that includes documentation of all acceptance criteria.

Don’t be afraid to invest resources (money, time, personnel) in the POC effort. Make this part of your project budget from day one.

John Consoli is VP of Sales and Marketing at FieldView Solutions, the industry-leading Data Center Infrastructure Management (DCIM) provider, which recently unveiled FieldView 5.0 IT power management tailored with Microsoft® Business Intelligence (BI) tool sets, providing access to structured data through an Open Database Connection (ODBC), with colocation support enabling application assets to be provisioned and easily maintained.

Tuesday, December 20, 2011

Catering to the Underserved Small Enterprise Market

- Alex Bewley, CTO at uptime software (www.uptimesoftware.com), says:

The small enterprise market has traditionally been left out when it comes to powerful IT systems management suites. Instead this market has had to rely on low-end tools, time consuming freeware or suites that lack the powerful IT management they need.

uptime software (www.uptimesoftware.com) has been successfully delivering IT systems monitoring and management software to mid-market and enterprise companies for years. The company recently announced the shipping of its flagship product, up.time, in a new version called up.time Small Business Edition (SBE), a deep, easy-to-use and powerful IT systems monitoring and management suite designed for small and medium businesses (SMB). up.time SBE is uptime software’s first foray into the lucrative SMB market and delivers full server and application monitoring and reporting across a myriad of platforms, melding deep and powerful technology with ease-of-use for a perfect SMB solution.

Designed specifically for companies requiring in-depth and easy-to-use availability, capacity and performance monitoring tools without the cost and complexity of legacy enterprise-grade software solutions, up.time SBE delivers powerful IT systems management that’s packaged and priced to meet the unique needs of small enterprises.

up.time SBE is the most recent product innovation released by uptime software and includes the exciting VMware environment monitoring capabilities introduced in October with the launch of up.time 6. In addition, SBE is quick to deploy, easy to use and maintain, and is delivered in a simple pricing structure. SBE is designed to monitor the entire IT stack of servers, applications and services across virtual, physical and cloud environments, as well as multiple platforms. All of this is easily manageable through a single dashboard.

Features specific to up.time SBE include:

  • Easy-to-use and Powerful IT Systems Management. Deep metrics for IT system monitoring across servers, services, applications and system resource levels give you the best performance, availability and capacity management on the market, all without the complexity of traditional enterprise tools.
  • Lightning-Fast Root-Cause Analysis. SBE delivers deep IT monitoring and reporting not found in the low-end and freeware tools the small enterprise market has traditionally been stuck using. up.time offers easy-to-use monitoring and reporting for fast root-cause analysis of servers and applications.
  • More Proactive Problem Solving. up.time comes complete with detailed capacity planning to help create a more proactive IT department.
  • A Single Tool That Monitors The Entire IT Stack. Monitor hybrid environments from a single dashboard, including virtual, physical and cloud servers and applications across many platforms (Windows, Solaris, AIX, HP-UX, Linux, Novell, VMware).
  • Fast, Painless Deployment. The trial installs in minutes, and the entire up.time suite deploys within hours.
  • Affordable and Simple SMB Licensing. Complete IT systems management for each licensed server.

Alex Bewley, Chief Technology Officer at uptime software
Alex is passionate about bridging the gap between business value and cutting edge technology. A technologist by trade, Alex’s 17-year long tour of duty included time with Sun Microsystems in its heyday. However, Alex craved a more entrepreneurial muse, and founded uptime software (with co-founder Phil Didaskalou) in 2001. When not behind his desk, Alex is either knee deep with the Toronto tech startup community or is on his iPad dreaming of new ways to simplify IT systems management across virtual, physical, and cloud environments.

Tuesday, November 15, 2011

Cutting Down on Workplace Productivity Loss

- Mark Ackerman, senior engineer with SuperLumin Networks (www.superlumin.com), says:

Want to know when your security measures aren’t functioning properly? Want to know what your employees are doing on the Internet? Want to know how to improve the performance of your network?

Whether you’re running a large or small enterprise, in today’s Web-centric workplace more and more organizations want to monitor access to Internet resources to accurately manage usage policies and keep employees productive—and for viable reasons too. According to IT research firm Gartner, non-work related Internet surfing results in an estimated 40% productivity loss each year for American businesses.

In response to the demand for Internet monitoring and reporting, SuperLumin Networks announced its partnership with WebSpy Vantage to provide reporting support for its Nemesis 2.3 software.

This new partnership allows enterprises to assess bandwidth consumption and identify excessive downloading from particular websites, of specific files and by which employee. Two popular reports for organizations include employee productivity and specific user browsing activity.

WebSpy Vantage support through Nemesis generates reports in HTML, Microsoft® Word, CSV or plain text and is able to distribute them securely via a WebModule.

Offering an interactive reporting interface, WebSpy Vantage also allows users to import log data from SuperLumin Nemesis, directly from servers or workstations on an enterprise network. Users can interactively drill-down into any log data and interrogate specific events, and generate a wide variety of default reports or create customized report templates.

“Webspy Vantage reporting, supported by SuperLumin, is designed to efficiently analyze gigabytes of data with a simple-to-use interface—providing an enterprise-scale monitoring solution and offering a completely automated reporting function,” said Mark Ackerman, Senior Engineer at SuperLumin Networks. “It’s the needed solution to help protect Internet usage, increase employee productivity and maximize cost savings—ultimately enhancing the bottom line.”

Wednesday, September 21, 2011

Benefits of Centralized Monitoring, Alarm, & Notification

Dane Overfield, product development lead at Exele Information Systems (www.exele.com), says:

Data Unification
Ensuring the reliability and efficiency of a data center involves the monitoring of many disparate types of data across multiple vendors and protocols. Real-time data such as hardware and network performance, building power management, and environmental conditions need immediate attention if behavior deviates from the desired or normal operating ranges.

Commonly, this division results in splitting the responsibility between different internal groups and the implementation of different software with varying capabilities and features. Some may implement vendor-based software solutions, while others may be able to seek solutions based on common protocols among multiple devices and equipment.

Luckily, today’s data centers can benefit from unification steps made in the automation and process monitoring field since the mid-1990s. That field faced the same dilemma of multiple vendors and protocols, and the need to unify communication has produced a clear winner: OPC (www.opcfoundation.org). OPC provides a single communication translation between those needing the data (the monitoring tools) and the underlying protocols needed to access this data. The result is an abundance of OPC-based tools like Exele TopView that can be used across industries and vendors to solve common needs.

Yet, this unification is only beneficial if the translation layer (the OPC Server) exists for the required data and protocols. Again, data centers can benefit from established vendors and third-party companies that are providing the required OPC Servers for vendor-specific data communication and open communication protocols such as SNMP, BACnet, and Modbus.

Detect… and Notify
Once the data is centralized, a single solution such as Exele TopView can monitor the current values and statuses of the disparate measurement data in an attempt to identify abnormal operating conditions within the data center.

For some data, the identification of abnormal conditions is straightforward (e.g. power relay tripped), but for other data it may involve more complex logic such as multiple variables, aggregates, time delays, rates of change, and deadbands. The solution must allow the user to easily specify both simple and complex conditions that indicate abnormal events in need of attention.
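
To make "time delays" and "deadbands" concrete, here is a minimal Python sketch of a high-limit alarm that must persist for a delay before it is raised and must drop below the limit by a deadband before it clears. The class name, parameters and readings are hypothetical and are not taken from TopView or any OPC product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HighAlarm:
    """High-limit alarm with a deadband and an on-delay (illustrative only)."""
    limit: float      # alarm when the value rises above this
    deadband: float   # value must fall below limit - deadband to clear
    delay_s: float    # condition must persist this long before alarming
    active: bool = False
    _pending_since: Optional[float] = None

    def update(self, value: float, now_s: float) -> bool:
        if self.active:
            # Deadband: stay in alarm until the value is clearly back to normal.
            if value < self.limit - self.deadband:
                self.active = False
                self._pending_since = None
        elif value > self.limit:
            # Time delay: remember when the excursion started.
            if self._pending_since is None:
                self._pending_since = now_s
            if now_s - self._pending_since >= self.delay_s:
                self.active = True
        else:
            self._pending_since = None  # excursion ended before the delay elapsed
        return self.active


# Hypothetical readings: (time in seconds, temperature).
alarm = HighAlarm(limit=30.0, deadband=2.0, delay_s=60.0)
readings = [(0, 31.0), (30, 29.0), (60, 31.0), (120, 32.0), (150, 29.0), (180, 27.5)]
states = [alarm.update(value, t) for t, value in readings]
print(states)
```

The brief spike at t=0 never alarms because it does not last 60 seconds, and the reading of 29.0 at t=150 does not clear the alarm because it is still inside the deadband.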

Immediate action requires immediate notification. The notification solution should support multiple notification channels (email, text/SMS, voice callout, audible) and notification escalation to ensure delivery of the alarm condition to those responsible for handling and correcting the abnormality. Through flexible messaging content, the recipient can learn about the alarm as well as related details and conditions of the monitored data.

Bird’s-eye View of Alarms
In the process and automation world, operators expect real-time displays of current measurement values and alarms. Within the data center, similar displays provide a bird’s-eye view of the current values and the state of alarms (how many and in what areas), and allow individual alarm acknowledgement and annotation. These actions can influence alarm notification (e.g. only notify if the alarm is unacknowledged for 2 minutes) and should be stored along with the alarm history for later reporting and analysis.
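
The "only notify if the alarm is unacknowledged for 2 minutes" rule can be expressed as a small check. This is a sketch under assumed names (`Alarm`, `should_notify`), not a description of any product's API.

```python
from dataclasses import dataclass
from typing import Optional

ACK_GRACE_S = 120  # the "2 minutes" from the example above


@dataclass
class Alarm:
    name: str
    raised_at_s: float
    acked_at_s: Optional[float] = None  # None means not yet acknowledged


def should_notify(alarm: Alarm, now_s: float, grace_s: float = ACK_GRACE_S) -> bool:
    """True once the alarm has gone unacknowledged for the grace period."""
    if alarm.acked_at_s is not None:
        return False
    return now_s - alarm.raised_at_s >= grace_s


alarm = Alarm(name="Row 3 high temperature", raised_at_s=0)
print(should_notify(alarm, now_s=60))    # still inside the grace period
print(should_notify(alarm, now_s=120))   # unacknowledged for 2 minutes
```

An operator who acknowledges the alarm on the display within the grace period suppresses the notification entirely.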

Learn Through History
While the real-time displays and immediate alarm detection and notification are critical to the data center operation, additional value is gained through the storage and analysis of the abnormal and alarm event activity. The personnel responsible for overall health of the data center may not need to receive individual alarm notifications, but instead may gain insight through scheduled and ad-hoc reports of global or grouped alarm activity.
For ad-hoc alarm analysis, TopView provides the tools to query and report “bad actors”, times of heavy alarm activity (flooding), and periods of high active alarm counts. Scheduled reports can deliver hourly, daily, and weekly summaries of the alarms.
Alarm reports and analyses will enable users to identify failing equipment, time-of-day related failures (e.g. power load or network), and incorrectly configured alarms.
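
As a rough illustration of what "bad actor" and "flooding" queries compute, the Python sketch below counts alarms per tag and per fixed time window. The alarm history, the 10-minute window and the threshold of four alarms are all hypothetical; this is not TopView's query interface.

```python
from collections import Counter

# Hypothetical alarm history: (timestamp in seconds, alarm tag).
history = [
    (5, "CRAC-2 high temp"), (40, "PDU-7 overload"), (65, "CRAC-2 high temp"),
    (70, "CRAC-2 high temp"), (610, "UPS-1 on battery"), (615, "CRAC-2 high temp"),
]


def bad_actors(events, top_n=3):
    """Most frequently alarming tags, most frequent first."""
    return Counter(tag for _, tag in events).most_common(top_n)


def flood_windows(events, window_s=600, threshold=4):
    """Start times of fixed windows containing at least `threshold` alarms."""
    per_window = Counter(ts // window_s for ts, _ in events)
    return sorted(w * window_s for w, n in per_window.items() if n >= threshold)


print(bad_actors(history, top_n=1))
print(flood_windows(history))
```

A tag that dominates the first list is a candidate for failing equipment or a badly configured limit, which is the kind of finding the reports above are meant to surface.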

Embrace Unification, Reap the Rewards
The required data unification tools exist today, and Exele TopView can provide centralized data monitoring, alarm detection, and notification across your data center to allow immediate response to disruptions. In addition, you gain the tools necessary to identify long-term trends in order to detect problem areas and failing equipment, optimize performance and avoid more critical failures.

Wednesday, August 3, 2011

Are Your Data Center Monitoring Practices Putting Critical Operations At Risk?

- Kurt Crisman, marketing manager with Network Technologies, Inc.(www.networktechinc.com), says:

Temperature, humidity, and other factors can impact data centers, telecom switching sites, and other POP sites. In many businesses, three groups monitor environmental threats to data center and switching site equipment: network administrators or operations managers, security personnel, and maintenance employees. Often, particularly in a small or mid-sized business, monitoring of equipment may be performed by staff onsite or visiting equipment in remote locations. However, these monitoring practices may be putting critical business operations at risk.
  • Damage caused by the environment can be subtle, unseen, or attributed to other causes. Condensation, rust, and heat damage are usually hidden inside machines, out of human sight.
  • The frequency and quality of a site check may vary from person to person. Even if procedures and schedules are in place, adherence to those procedures and schedules may vary.
  • Environment threats occur 24 hours a day, seven days a week. But staff is not always on site. Depending on staffing levels and schedules, environments can be unmonitored up to seventy percent of the time during an average week.
  • Without a log of changing conditions (temperature and humidity levels constantly increase and decrease), administrators and managers cannot identify problems caused by these changes. These problems can continue for days or months, while time and money are wasted investigating false causes and solutions.
  • As soon as you have people checking on equipment or performing maintenance, you can actually create problems where they hadn’t existed before. For example, boxes set in front of vents “temporarily” are not moved.
An effective server environment monitoring system addresses the weaknesses in the current practice of having personnel monitor the environment.
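
The "seventy percent" figure in the list above follows from simple arithmetic. The sketch below assumes, purely for illustration, that staff are on site 10 hours a day, 5 days a week; the source does not state the schedule behind its number.

```python
HOURS_PER_WEEK = 24 * 7  # 168


def unmonitored_fraction(staffed_hours_per_week: float) -> float:
    """Share of the week with nobody on site to check the environment."""
    return 1 - staffed_hours_per_week / HOURS_PER_WEEK


# Assumed schedule: 10 hours a day, 5 days a week.
print(f"{unmonitored_fraction(10 * 5):.0%}")  # prints 70%
```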

Network Technologies, Inc. offers a range of server environment monitoring solutions that monitor critical environmental conditions that can destroy network components in a server room or POP site. When a sensor exceeds a configurable threshold, the system will notify the selected administrators/staff via email, SNMP traps, Web-page alerts and a visual indicator (LED). The systems connect to your IP network, so they can be configured and monitored from any workstation with a Web browser. Event-triggered snapshots from an IP Camera can be sent by email.


Our products provide the following benefits:
  • Control costs - In a stable environment, equipment lasts longer, and less equipment is damaged and needs to be replaced. Typically, the savings from not having to replace equipment can pay for the cost of the monitoring system.
  • Increase lead-time to fix a problem - The earlier the warning alarm sounds, the sooner personnel can solve the problem before it becomes a disaster.
  • Reduce downtime - Hardware housed at the recommended environmental conditions operates more efficiently, while also shutting down less frequently. Consequently, employees stay productive, and e-commerce sites continue to generate revenue.
  • Log environmental data for greater insight - In order to maintain stable conditions in the server room, administrators must have accurate records of what has happened. Logging is also critical for investigating problems.

Friday, July 29, 2011

Lowering Cooling Expenses without Risking Downtime


- Jonathan Burk, vice president at Burk Technology (http://www.burk.com/), says:

Temperature monitoring throughout the data center facilitates efficient, cost effective cooling without risking hot spots and downtime.

Cooling costs in the data center can comprise a substantial portion of an IT department’s operating expenses. While guidelines for temperatures are slowly increasing and cooling solutions become more efficient, inadequate cooling still poses a serious threat to uptime and reliability. Simply overcompensating by lowering overall temperature is a costly workaround. The only way to be certain that initiatives to lower cooling costs will not adversely impact equipment performance is to monitor temperature in multiple locations throughout the data center.

Airflow problems and improper cool air distribution can cause significant disparities in temperature in data centers, as well as in individual racks. While some servers will perform normally, others in the same location may be degrading or outright failing due to heat related problems. When attempting to run a data center efficiently, a difference of only a few degrees in one area can be the difference between reliability and costly downtime.

More than ever, it is necessary to carefully monitor environmental conditions throughout the data center. Simply monitoring ambient temperature is inadequate, as rack density, ventilation and server load have a significant impact on server temperatures. Monitoring onboard server diagnostics, while necessary, will not differentiate between problems with individual hardware and an overall system or design. Depending on the design of the data center, temperature monitoring should be implemented in each rack or at least in each row. To ensure adequate airflow throughout each rack, temperature sensors can be placed at the top and bottom of each rack.
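
One way to use top and bottom sensors is to compare the two readings per rack and flag large differences. The following Python sketch shows the idea; the rack names, readings and both thresholds are hypothetical values chosen for illustration, not Climate Guard settings or published guidelines.

```python
# Hypothetical readings in degrees Fahrenheit: rack -> (bottom sensor, top sensor).
readings = {"A01": (68.0, 72.5), "A02": (68.5, 84.0), "B01": (67.0, 70.0)}

MAX_DELTA_F = 10.0  # assumed tolerance between bottom and top of a rack
MAX_TEMP_F = 80.0   # assumed absolute limit for any single sensor


def rack_issues(readings, max_delta=MAX_DELTA_F, max_temp=MAX_TEMP_F):
    """Return {rack: [problems]} for racks that breach either threshold."""
    issues = {}
    for rack, (bottom, top) in readings.items():
        problems = []
        if top - bottom > max_delta:
            problems.append("top/bottom delta")
        if max(bottom, top) > max_temp:
            problems.append("over temperature")
        if problems:
            issues[rack] = problems
    return issues


print(rack_issues(readings))
```

In this made-up data the room average looks fine, yet rack A02 is flagged on both counts, which is the kind of localized problem that ambient monitoring alone would miss.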

With over 25 years of facilities monitoring experience, Burk Technology developed Climate Guard to serve the environmental monitoring needs of data centers and server rooms of all sizes. Climate Guard monitors temperature, humidity, flood/leak and many other conditions that can adversely impact uptime and reliability. Climate Guard’s built-in logging allows IT and facilities personnel to spot trends and eliminate problems before they become disasters. The system alerts staff to out-of-tolerance conditions via email, SMS and SNMP traps.

For more information on Climate Guard and to see a live demo, visit climateguard.burk.com.

Tuesday, July 5, 2011

IPv6 and IPv4 Monitoring: Will You be Ready?

- Vikas Aggarwal, CEO of Zyrion (www.zyrion.com), says:

In the 30+ year history of the Internet, the move to IPv6 will be the largest single upgrade. The clock is ticking on the availability of IPv4 addresses, and experts say IPv4 addresses will begin running out as early as December 2011. Many organizations are putting concrete plans in place to complete the migration over the next few years, and, through the first half of 2011, awareness and activity on the IPv6 front have increased significantly. On June 8th, hundreds of governmental organizations, enterprises and service providers participated in a 24-hour, large-scale “test flight” of IPv6 technology. The event was dubbed World IPv6 Day, and was organized by the Internet Society. The purpose of the event was to energize, educate and motivate organizations across the IT and communications industry to prepare their services for IPv6 to enable a successful migration as IPv4 addresses begin running out.

While much of the current focus on the migration to IPv6 is around the intricacies of making external facing websites and services (e.g. DNS) work cleanly in a hybrid world, as well as the use of IP addresses to interconnect distributed server, storage and network elements, organizations need to also be thinking about internal controls, management systems and frameworks as part of the transition.

A key aspect of transitioning to IPv6 technology involves ensuring that the right IT, cloud and network monitoring software systems are in place to assure the performance of complex networks, data centers and cloud infrastructures. For distributed organizations, where services may be tied to partner or remote IT infrastructure, the preparation to deal with a hybrid IPv4 and IPv6 world has to be done much more proactively. In some cases, the IT services being managed by an IT group in one department may link to data center components and applications of other departments, which could be using different IP versions. If the IT organization is on the hook to deliver against agreed-to SLAs or performance levels to users and business constituents, then it needs to have visibility into the health and performance of the broader IT infrastructure that is part of its scope of coverage.

It is time to start taking steps to trial and implement network and IT monitoring software systems that can seamlessly monitor IPv6 and IPv4 applications, servers and network devices in a hybrid environment (see http://www.zyrion.com). Given that hybrid environments will coexist for a while, these monitoring solutions will enable organizations to uniformly discover and provision IPv6 devices, and collect and analyze performance data, all within one integrated system that supports IPv4 devices as well. Users can ignore the intricacies of managing different types of devices, and are able to benefit from a unified management and operational view of their entire IT infrastructure. Being able to capture performance metrics from the full IT and cloud infrastructure, and then correlating the data and linking this to supported business services is critical to ensure the effective delivery of services and assure business operations in the new dynamic environment. These systems address this need by providing a service-oriented, end-to-end, performance view, whether IPv6 based or otherwise.
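
As a small illustration of treating IPv4 and IPv6 devices uniformly, the sketch below uses Python's standard `ipaddress` module to parse both address families through one code path and group a device inventory by IP version. The device names and addresses are hypothetical (the IPv6 address comes from the 2001:db8::/32 documentation range), and this is not how any particular monitoring product is implemented.

```python
import ipaddress
from collections import defaultdict

# Hypothetical inventory: device name -> management address.
devices = {
    "core-sw1": "10.0.0.1",
    "web01": "2001:db8::10",
    "db01": "192.168.10.5",
}


def group_by_ip_version(devices):
    """Parse each address once and group device names by IP version (4 or 6)."""
    groups = defaultdict(list)
    for name, addr in devices.items():
        groups[ipaddress.ip_address(addr).version].append(name)
    return dict(groups)


print(group_by_ip_version(devices))
```

Because `ipaddress.ip_address` accepts either family, the rest of a discovery or polling loop does not need separate IPv4 and IPv6 branches.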

Although your organization may be taking preliminary steps towards implementing IPv6 compatible infrastructure, being prepared in advance by having the management tools in place will ease the process as you make the transition from an all IPv4 to a hybrid to a fully converted environment.

Wednesday, April 13, 2011

Are You Able to Seamlessly Monitor Your Remote Sites?

- Vikas Aggarwal, CEO of Zyrion (www.zyrion.com), says:

Given the dispersed nature of today’s organizations, with mobile workers and regional offices, the data center and IT infrastructure in reality extends beyond the boundaries of one or more centralized physical locations. What this means is that the operations team will be required to monitor, from a central NOC location, the performance of core IT infrastructure at remote sites and offices.

The IT infrastructure, devices and applications being monitored will in many situations be behind firewalls, and in most cases, behind NAT-enabled routers. Examples of remote monitoring may include the NOC being responsible for monitoring the execution of daily automated server back-up jobs, amongst other scheduled jobs at the site. The monitoring software will need to generate an alert in the event the back-up job did not execute properly. Additionally, it may be necessary to monitor site-specific applications and servers, such as a local dispatch application, through querying core performance metrics or executing ‘synthetic’ user transactions and monitoring their responses.

In order to address these requirements, there are some key remote site capabilities that need to be available in the network monitoring software (learn more about distributed infrastructure monitoring at http://tiny.cc/rnz53). The ability to gather metrics securely from behind a firewall is critical. What this means is that the monitoring solution has to include easily deployable and low-cost remote data-gathering components that are able to process traps/syslogs/eventlogs and execute scripts locally against monitored devices and applications within the secure remote network. The remote module has to be capable of pushing the data to an upstream event management system via SSL, and not require inbound requests.
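
The push model described above, where the remote module initiates an outbound encrypted connection and nothing has to be opened inbound on the site firewall, can be sketched with Python's standard library. The upstream URL, the JSON payload shape and the function names are assumptions made for illustration; they do not describe Zyrion's actual protocol.

```python
import json
import ssl
import urllib.request

UPSTREAM_URL = "https://nms.example.com/events"  # hypothetical upstream endpoint


def build_push_request(events, url=UPSTREAM_URL):
    """Package locally collected events as an outbound HTTPS POST."""
    if not url.startswith("https://"):
        raise ValueError("upstream URL must use HTTPS")
    body = json.dumps({"events": events}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def push(events, url=UPSTREAM_URL, timeout_s=10):
    """Send a batch; the connection is always initiated from the remote site."""
    context = ssl.create_default_context()  # verifies the upstream certificate
    request = build_push_request(events, url)
    with urllib.request.urlopen(request, timeout=timeout_s, context=context) as resp:
        return resp.status


# Building the request needs no network access; push() would actually send it.
batch = [{"site": "branch-12", "type": "syslog", "msg": "backup job failed"}]
request = build_push_request(batch)
print(request.get_method(), request.full_url)
```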

Another challenge that the network monitoring software will have to deal with is that of the remote sites and office networks having overlapping or duplicated IP ranges. It’s extremely likely that many of the remote sites are using some parts of the 192.168.x.x network. The monitoring solution has to account for this scenario, and uniquely identify site-specific devices without requiring the re-addressing of the networks just so that they'll be easier for the IT operations team to monitor and manage.
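
One common way to handle overlapping address ranges is to identify a device by the pair (site, IP address) rather than by IP address alone. The sketch below illustrates that idea with hypothetical site and device names; it is an assumption about how such a scheme could look, not a description of a specific product.

```python
from typing import Dict, NamedTuple


class DeviceKey(NamedTuple):
    """A device is unique per site, not per IP address."""
    site: str
    ip: str


def register(inventory: Dict[DeviceKey, str], site: str, ip: str, name: str) -> DeviceKey:
    """Add a device, rejecting duplicates only within the same site."""
    key = DeviceKey(site, ip)
    if key in inventory:
        raise ValueError(f"{ip} is already registered at site {site}")
    inventory[key] = name
    return key


inventory: Dict[DeviceKey, str] = {}
register(inventory, "branch-east", "192.168.1.10", "backup-server")
register(inventory, "branch-west", "192.168.1.10", "dispatch-app")  # same IP, different site
print(len(inventory))  # prints 2
```

Neither branch has to renumber its network, and alerts can still name the exact device by site.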

The ability to monitor and manage secure remote sites is becoming a key requirement for distributed organizations. Make sure your network monitoring software supports remote site monitoring behind firewalls and with overlapping IP addresses, and includes coverage for a wide range of network infrastructure (see examples at http://tiny.cc/y5rat) to ensure the smooth running of your business operations.

Sunday, April 10, 2011

Guide to Selecting a Data Center Monitoring System

- Steve Francis, Founder and CEO of LogicMonitor (www.logicmonitor.com), says:

While the process of selecting a monitoring system is necessarily unique to every enterprise, this post provides some guidance on issues to consider when making that decision. Selecting the best monitoring system for your enterprise boils down to a single selection criterion: pick the monitoring system that adds the most value to your business.

A monitoring system adds value if the benefits of the system are greater than the acquisition, implementation and operational costs.

Generally, the benefits an enterprise will obtain from a monitoring system fall into the following categories:
  • reducing the cost of outages and service degrading events
  • reducing staff cost (time) of investigations into performance and availability issues
  • improved information efficiency
Note that the focus of assessing a monitoring system’s positives should always be on the business benefits, not the features.

Balanced against these benefits will be the costs of the monitoring system:
  • acquisition cost
  • implementation costs
  • operational costs
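
The selection criterion can be written as a one-line comparison of benefits against costs. In the sketch below every dollar figure is invented purely to show the calculation; only the three benefit categories and three cost categories come from the lists above.

```python
def net_value(benefits, costs):
    """Net annual value: total benefits minus total costs."""
    return sum(benefits.values()) - sum(costs.values())


# Hypothetical annual figures in dollars for one candidate system.
benefits = {
    "avoided outage cost": 120_000,
    "staff time saved": 40_000,
    "information efficiency": 15_000,
}
costs = {
    "acquisition": 30_000,
    "implementation": 20_000,
    "operational": 25_000,
}

print(net_value(benefits, costs))  # prints 100000
```

Running the same calculation for each candidate system gives a like-for-like basis for comparison, provided the benefit estimates are made with the same assumptions.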

Assessing the Benefits of Monitoring
A monitoring system is an efficiency tool - it allows enterprises to avoid and minimize expenses and revenue loss, rather than contributing directly to increased revenue. (Managed Service Providers that sell monitoring and value-added response services are an obvious exception.) Thus in order to assess the business value of a monitoring system, and to compare possible systems, one must have an idea of the possible expenses the tools will mitigate.

Minimizing the Cost of Outages and Service degrading events

Quantifying Outage Costs
Avoiding outage costs is a common justification of monitoring, but is often hard to quantify, and is different for every enterprise. For some enterprises (although increasingly few), downtime may matter very little, and only the simplest of monitoring is justified.

Each enterprise should consider both the immediate impacts of outages and the brand impacts, but both cases will require thought and discussion specific to the enterprise.

Consider the case of online retailers with directly measurable dollar/minute metrics attributable to web site sales. Does an outage mean that revenue for the duration of the outage is lost? Perhaps customers will simply purchase later, when the site is online. Perhaps the outage means customers lose trust in the brand, and not only make their immediate purchases at a competitor, but also make all future purchases at the competitor. In this case, the outage cost for a small but growing site could be much greater than at an established brand, despite a much lower sales volume. The established brand may see $1 million in sales affected during an hour-long outage - but those sales will likely be made up later. A similar outage on a smaller, growing site may only directly impact $2,000 in sales - but the sales are likely to be permanently lost, and worse, the loss of goodwill by early evangelists of the site can significantly affect growth.

An outage on a site that provides a subscription service may have less impact on longer term customers, but customers are more likely to churn if they experience an outage before they have internalized the value of the service - new customers, or those in trial. In this case, the outage costs not the customers' subscription fees for a month, but the lifetime customer value of those that did not convert.

An outage of an internal IT virtualization infrastructure that idles the workstations of 150 engineers (at $150 an hour fully loaded salary) is superficially an obvious direct cost - but as exempt employees, the engineers may complete their work anyway, perhaps by staying late. Then the cost becomes one of employee satisfaction - and if it results in employee turnover, the cost becomes much higher. If an outage of IT systems affects salespeople at the end of the quarter, preventing them from accessing their CRM, or perhaps their phone systems, there can be a very large cost - in sales staff dissatisfaction, revenue for the quarter, and even corporate stock price.
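
As a rough illustration of the arithmetic in the engineering example, the sketch below computes the superficial direct cost of idled staff. The function name and the two-hour outage duration are assumptions made for illustration; only the 150 engineers and the $150 hourly rate come from the example above.

```python
def idle_staff_cost(headcount: int, hourly_rate: float, outage_hours: float) -> float:
    """Direct cost of staff idled by an outage, before any indirect effects."""
    return headcount * hourly_rate * outage_hours


# 150 engineers at $150/hour fully loaded, idle for a hypothetical 2-hour outage.
direct_cost = idle_staff_cost(150, 150.0, 2.0)
print(f"Superficial direct cost: ${direct_cost:,.0f}")
```

This figure is best read as an upper bound on the direct loss: as noted above, exempt staff may make up the work, and the number says nothing about satisfaction, turnover, or lost sales, which have to be estimated separately.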

There are non-market driven costs too - downtime in a business unit may be valued disproportionately to its revenue contribution due to political clout of its executives. Thus determining the cost of an outage is not a simple matter of entering data into a formula, but requires knowledge of the revenue models of the enterprise.

Quantifying Service Degradation Costs
Service degradation issues can often cost more than outages. With an outage, there is a clear, identifiable situation - a service is down. With a degradation, there is often a lag before the issue is reported, another before it is acknowledged, and further complications with identifying the systems and personnel responsible (networking staff, server staff, and storage staff may each insist their respective systems are working correctly). This longer duration of the issue can result in larger costs. The costs may be lower sales revenue on an ecommerce site (slower site performance directly correlates with fewer conversions.1) For internal systems, costs may be inefficient use of engineers' time as they wait for compilations or other resources; or less effective sales staff if their CRM system is slow. Given the high fully loaded cost of personnel, any system impact that detracts from productivity can quickly become a large drain.

Analysis of past Outages
Each organization will have to rely on its own experience to assess the historical frequency of outages, whether the outage would have been averted given ideal monitoring, the direct costs of the outage and the indirect, brand costs of the outage.

Some questions to discuss that can help guide this assessment:
  • Why do you want a monitoring system?
  • What do you want the monitoring system to do? What benefits do you anticipate getting from it?
  • How many outages or adverse performance events occurred over the last month? 6 months?
For each historical incident, as best can be determined:
  • What were the direct costs of this outage or performance issue?
  • What were the ‘brand’ costs of this event?
  • How many hours of staff time were involved in determining the cause of the outage?
  • What is the fully loaded cost of staff time for the staff involved?
  • What capabilities would a monitoring system have required in order to alert on the issue and identify the cause during the event?
  • What capabilities would a monitoring system have required in order to alert on the impending issue before the event?
A question that is always useful to ask is “So what?” If some devices went down, and there was no monitoring - so what? Why does it matter? This is a good way to flush out who cares about the issue.

Reduction of staff cost for investigations into performance and availability issues


With increased complexity of applications and infrastructure, the time spent to determine the root cause of performance or availability issues can be a substantial expense that good monitoring can significantly reduce.

Consider the example of a performance issue on an e-commerce web site. Troubleshooting the issue could involve bringing in staff resources to look at the network, the web server operating systems, the front end application, the load balancers, the back end database, the virtualization platform that runs the database virtual machine, fiber channel systems that connect the virtualization platform to the storage, and the storage system. Any one of these areas could reasonably be the cause of the issue. Further, silos of information can exacerbate the time required to determine a system is not contributing to the poor performance. For example, the database server operating system may be observed to be running slowly, leading to troubleshooting efforts to focus on OS level tuning and issues - but the issue may be the underlying virtualization platform being memory starved, and transparently swapping out memory from the virtualized OS. In such a case, if the monitoring system alerted that the virtualization layer was low on memory and that swapping of virtual machines was occurring, and this information was available to all team members, troubleshooting would be much quicker, involve fewer resources, and the issue would be resolved sooner.

Of course, not every situation is going to be alerted on by monitoring, but even in such cases monitoring can still greatly reduce the time to resolution of the issue. This will only be true if the monitoring is collecting a wide variety of information, from a wide variety of systems, and making this information visible in chart form, so that trends and changes can be spotted by human intelligence, and the issue correlated with these changes. A simple example: after a software release, the performance of an application is worse. A quick examination of charts can show if there are differences in request load. If this is the same as recent historical levels, the monitoring can show if the database is performing significantly more table scans after the release, perhaps because a needed index was not created. Charts will also show that the increase in sequential scans was attributable to the release, and not a gradual increase over time with load; and also show how much extra Disk IO is being put on the storage system as a result, and how this is affecting request latency. Without historical charts, resolution of such an issue would take much longer - translating to a significant expense.

Improved information efficiency

By providing accurate data as to where resource bottlenecks are, and by aggregating data from multiple systems, monitoring systems can provide actionable data about costs and performance that improve enterprise efficiency. A simple example is that in the face of performance issues and inadequate monitoring and analysis, it is not uncommon for organizations to purchase new capital infrastructure that does not address the root issue. (For example, upgrading front end CPU capacity when the issue is the storage system IO operations per second capacity.)

Another example where monitoring can optimize capital expenditures is to ensure equipment purchases meet current and future needs, but avoid overspending on overcapacity. (“Buying out of fear”, as one customer calls it - spending $80,000 on storage, in case the $50,000 storage is not performant - without knowing exactly what the requirements are.) It also allows purchases to be planned - trends can clearly show when circuit or equipment upgrades will be required, giving months of warning with commensurate negotiation power, rather than requiring immediate outlays to maintain service levels.
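
The planning point above can be sketched as a simple trend projection. The following is a minimal illustration, not a description of any particular product: it fits a least-squares line to monthly utilization samples and estimates how many months remain before the trend reaches a capacity limit. The function name, the sample figures, and the assumption of linear growth are all hypothetical.

```python
from typing import Optional, Sequence


def months_until_limit(samples: Sequence[float], limit: float) -> Optional[float]:
    """Fit a least-squares line to monthly samples and estimate how many
    months after the last sample the trend reaches `limit`.

    Returns None if usage is flat or falling, 0.0 if the trend is already
    at or past the limit.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    slope_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    slope_den = sum((x - mean_x) ** 2 for x in xs)
    slope = slope_num / slope_den
    intercept = mean_y - slope * mean_x
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope  # month index where the trend hits the limit
    return max(0.0, crossing - (n - 1))


# Hypothetical peak usage of a circuit in Mbps over four months, 160 Mbps capacity.
print(months_until_limit([100, 110, 120, 130], 160))
```

Real growth is rarely this linear, so a projection like this is a prompt to start planning and negotiating early, not a precise date.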

Monitoring systems collect a lot of information about a lot of systems, and this data can, if presented efficiently, allow new insights into the enterprise’s operations, that can realize better planning and expense control. Aggregating all the ISP bandwidth used per ISP, or per datacenter, can reveal opportunities for contract negotiation savings. Being able to track storage usage by business unit across all storage assets in an enterprise may not fall under the traditional rubric of monitoring, but given that monitoring systems collect the data underlying this information (storage capacity of every volume on every storage system), it is a reasonable item to extract from them. Being able to track real time and historical trends of a variety of performance and utilization metrics can provide unanticipated benefits to enterprises.

Translating business requirements to features.

Features required for Proactive Warning of Outages
Certainly one of the business goals is to proactively warn about, and hopefully prevent, impending outages. This is one of the easier business drivers to convert to a feature list, as it is driven largely by technical requirements. While any monitoring system should be able to alert on an outage of a system, and thus speed time to resolution, being able to proactively provide warnings of impending failures and performance issues requires different capabilities. It may require a monitoring system that can alert when a load balancer detects that a Virtual IP has less than the desired level of server redundancy; or when request latency is increasing on a storage array, or when database replication is lagging more than the desired time offset, or when the number of server threads on a Java application is approaching a limit. Being able to prevent outages requires a much more capable monitoring system - but the capabilities must match the infrastructure deployed.

Converting other business requirements to features.
As noted above, the process for selecting a monitoring system should care less about features and more about evaluating how the system will impact business, positively or negatively. To align features with business value, an enterprise should detail the way their organization works (or how they want it to work), and translate that into capabilities that help meet their business goals. The important issue to remember is that except for specific technical goals as mentioned in the above section, the feature list should detail business goals and capabilities, not specific ways of achieving the goals.

For example, an organization may operate with the following operational constraints: they run east and west coast datacenters, with staff at both locations, and applications run at both. They have infrastructure from 3 business units at each location, and some infrastructure is shared. They employ virtualization technology, and have little staff time to devote to their monitoring. Their custom applications are a mix of Java and Windows .NET, and they also use Tomcat, IIS, MySQL and SQL Server. They want alerts to be routed to the appropriate teams, differentiating between roles even within the same host (e.g. Storage and DB groups may both be paged for different reasons for the same host), and escalated to people to ensure coverage. They want morning alerts handled by their east coast staff, and later switch to the west coast staff. There is frequent change in their datacenter in terms of reconfiguring or adding devices or applications, but not all the devices are production, warranting production alerting. They plan to grow some infrastructure into Amazon’s EC2 cloud in the future.
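
To make the routing requirement in this example concrete, here is a minimal sketch of how role-based, time-of-day alert routing might be expressed. Everything in it is hypothetical: the `Alert` fields, the category names, the queue naming scheme, and the 14:00 Eastern handoff are assumptions made for illustration, and escalation is deliberately left out.

```python
from dataclasses import dataclass
from datetime import time
from typing import Optional


@dataclass(frozen=True)
class Alert:
    host: str
    category: str     # hypothetical role label, e.g. "storage" or "database"
    production: bool  # whether the device warrants production alerting


# Hypothetical time (US Eastern) at which paging moves from east to west coast staff.
HANDOFF = time(14, 0)


def route(alert: Alert, now_eastern: time) -> Optional[str]:
    """Return the on-call queue to page, or None if no page is warranted."""
    if not alert.production:
        return None  # non-production devices do not page anyone
    coast = "east" if now_eastern < HANDOFF else "west"
    return f"{alert.category}-{coast}"


# Two alerts on the same host reach different teams, depending on role and time.
print(route(Alert("db01", "storage", True), time(9, 0)))
print(route(Alert("db01", "database", True), time(16, 30)))
```

The point of the sketch is that the requirement is stated as behavior (who gets paged, when, and for what), which is the level at which a candidate system should be evaluated.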

Their business goals are to allow the growth of service revenue, which will require additional infrastructure to handle the load. They wish to target their capital expenditures for this growth correctly; avoid headcount growth; minimize downtime and its impact on revenue and get better information for cost allocation among business units.

Each feature should be prioritized in terms of how much value each feature brings to the enterprise. This value will vary by enterprise - an organization with a fairly static infrastructure may decide that relying on manual workflow is sufficient for ensuring changes to infrastructure are reflected in monitoring (although I would suggest that processes done rarely are also rarely done when needed!) One enterprise may initially desire role based access control, but on reflection find that it adds no business value. Another may determine it is essential, as it allows them to unify monitoring while meeting contractual requirements of confidentiality for their customers.

Having determined the list of features and their relative value to the enterprise, an organization can then narrow down a list of proposed solutions that meet the most important of these features, in order to accurately assess the value to the enterprise.

Evaluating Candidate Software

Each candidate solution should be evaluated for the prioritized list of features - as they relate to business value - weighted as appropriate for the typical actions of the enterprise.

With a trial deployment, the realistic costs and benefits of a system can be assessed, always keeping a focus on business value comparison, not feature comparison. There will likely be multiple ways to deliver the same business value, that may not fall into the same “feature” check box.

A simple example is system security. The business goal is to prevent the disclosure of information that may be embarrassing to the enterprise or provide intelligence to competitors or vendors. Yet this goal may be translated to a feature checklist as “all data stored locally in corporate datacenter.” This is one way of achieving the goal (although it makes many assumptions about the deployment.) But the goal may be better achieved through a SaaS model, even though it would not meet the checklist requirement. A SaaS system is likely to be delivered from audited, tested datacenters with 24 hour manned guards, biometrics, cameras, external penetration tests, and from a system designed explicitly with security in mind and encryption used at many levels (transmission and storage of data, etc). A premise based system, even if operated behind the corporate firewall, is likely to be deficient in many of these areas - so while it would meet the checkbox, it would not deliver the business value as efficiently. This illustrates why it is important to detail the business drivers for each feature (“maintain security of data”) rather than just the feature as the end users expect it to be delivered (“all data stored locally in corporate datacenter”) - no one will be able to predict the ways in which all the business drivers can be delivered, so listing the driver makes the assessment far more likely to be based on the business driver, rather than the anticipated way of delivery.

Conclusion

We hope this post illustrates some of the issues involved in selecting a data center monitoring system. Selection of such a system will always require a good knowledge of the enterprise to be monitored, so that business value can be accurately aligned with the benefits of the systems. Selection lists should be driven by business values, except for specific technical requirements such as the ability to monitor a specific protocol. Some of the questions above should help bring out the expected benefits and costs of a monitoring system. After all the discussions and dialog have occurred, the selection of a monitoring system comes down to the simple statement made at the beginning of this post:

Forget about features. Pick the monitoring system that adds the most value to your business.

Wednesday, March 30, 2011

Visibility Into Your Entire IT Infrastructure







- Josh Duncan, Product Evangelist at Zenoss (www.zenoss.com), says:

Enterprise operations teams need a way to manage and monitor all of their physical servers, networks, storage devices, and an increasing amount of virtual resources. It is no longer a simple exercise to determine where your services are physically running, and what the impact is if a device has to go offline. Zenoss provides visibility into the entire IT infrastructure, so you can address the growing challenge of efficiently managing your physical, virtual, and Cloud resources.

Datacenter managers are being asked to manage dynamic environments and are rapidly finding out that management solutions built around a static CMDB paradigm don’t scale. Having a monitoring solution that provides real-time visibility will guarantee that operations is always aware of the “as-is” state. This enables IT to react faster and successfully manage a constantly changing environment at a much higher level of service.

Operational visibility is a foundational requirement for any IT environment. However, for highly virtualized and Cloud environments, it is critical that IT is able to guarantee service delivery. Service assurance in the Cloud requires that operations can see the linkage between the devices they are monitoring and the services they are supporting. When a service starts to become degraded, operations must be able to proactively react.

Biggest Challenges
Mean time to resolution - In highly virtualized environments, finding the root cause of a service impact is a challenge. When a resource fails, the wave of events generated from the dependent resources can make it nearly impossible to rapidly address the problem. Swivel-chair management between different monitoring solutions and organizational finger pointing are not the path to highly effective IT operations. Having a single view into the entire infrastructure is the first step to aligning to an IT-as-a-service organization.

Overcoming the Challenge
Being aware of the health and performance of all the devices in the infrastructure is a start, but the real trick here is understanding how these devices are actually playing a role in the services being delivered. Users care that their email service is working; not what’s happening in the background. Datacenter managers need to be able to address service-level issues, not just device uptime.

Advice for Data Center Managers
As always, it is important to ask the right questions and to not only look at what the solution is delivering today, but its track record for supporting new technology and features. The only way to effectively manage and monitor cloud-based infrastructures is with a model-based solution that keeps track of all of the dynamic changes in the environment, and can meet the requirements of carrier-grade scale.

From working with some of the leading hosting and Cloud providers in the industry, we put together an eBook, “Can You Really Manage the Cloud?” listing questions that should be addressed when looking at management and monitoring solutions. The list is meant to be a starting point when it comes to figuring out what’s important for your business and IT strategy.

Friday, March 18, 2011

Meeting Monitoring Challenges in Today’s Complex Data Center and Cloud Service Environments

- Imin Lee, CEO of AccelOps (www.accelops.com), says:

Expectations are high for enterprise and service provider data center organizations to adapt their complex, dynamic infrastructures to the changing needs of business. But between virtualization and public/private cloud hybrid environments, the old ways of doing things won’t cut it when it comes to scaling dynamically to provide elastic monitoring and performance analysis, finding the root cause of problems and analyzing service levels across the entire infrastructure.

Our company met that challenge head-on when we launched AccelOps, the industry’s only integrated monitoring solution designed from the ground up for cloud-generation data centers and managed service providers. Now we have announced the general availability of version 3.1 of the AccelOps solution. This latest release of our award-winning platform delivers powerful role-based access control (RBAC) capabilities, the ability to intelligently suppress alerts, and deeper and broader insight into the dynamic virtual environment, all of which increase operational visibility and control, improve resource utilization, and facilitate a more service-oriented approach to data center and cloud management.

Customized Role-based Views

With support for RBAC, a widely accepted best practice for managing user privileges, v3.1 gives enterprises and cloud service providers the flexibility to tailor the AccelOps user interface to the role each user has within their organization. For example:
  • A super administrator could specify that server administrators see a customized AccelOps user interface showing only servers and server-related incidents, and have permission to perform server-related analytics only.
  • The super administrator could give C-level executives access to view only specific AccelOps dashboards that display business service level performance and availability data.
  • Cloud service providers now have the flexibility to assign an IT security expert visibility and control over security devices and security-related incidents only, across all or a subset of its customers’ networks.
By extending visibility and control of specific areas of IT operations, this new functionality aligns well with how larger organizations and service providers are structured, that is, by functional specialty.

“When I chose AccelOps, I knew I’d found a truly top-notch solution for all our monitoring needs. But the AccelOps solution is also helping streamline communication within our organization. The introduction of RBAC support has allowed me to give broader access to the AccelOps GUI and the intelligence it makes visible, significantly reducing the amount of information I have to manually disseminate to my help desk and senior management teams,” said Geoff Christy, senior network administrator at Austin Radiological Association (ARA).

Reduction of Alert Noise

The latest AccelOps release also features built-in intelligence for suppressing alerts based on user-defined logic, topological relationship, patch information, and job calendar information. For example, the AccelOps solution can suppress “device down” alerts coming from devices that are downstream from the device that is the true cause of the problem; it can also suppress alerts when maintenance is being performed. The reduction of these extraneous alerts, also known as noise, drives operational efficiencies within data center and cloud service environments by reducing false positives, facilitating true problem identification, and reducing mean time to resolution (MTTR).

This new capability is possible because the AccelOps solution understands all the relationships between the various elements of a network and cross-correlates raw data – logs, events, metrics, alerts, etc. – with context and interdependency knowledge and pre-defined rules. The AccelOps solution also discovers and updates this relationship data automatically through its auto-discovery feature, rather than via manual configuration, saving IT administrator time and reducing the risk of human error.
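
The idea of suppressing downstream “device down” alerts can be sketched generically as follows. This is not AccelOps code and makes no claim about how the product implements suppression; it is a minimal illustration that assumes a simplified topology in which each device has at most one upstream parent, with hypothetical device names.

```python
from typing import Dict, Optional, Set


def root_cause_alerts(down: Set[str], upstream: Dict[str, Optional[str]]) -> Set[str]:
    """Keep only the "device down" alerts that have no down device upstream.

    `upstream` maps each device to its single upstream parent (None at the root).
    """
    keep = set()
    for device in down:
        parent = upstream.get(device)
        seen = {device}  # guards against accidental cycles in the topology data
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True  # an upstream device is the likelier true cause
                break
            seen.add(parent)
            parent = upstream.get(parent)
        if not suppressed:
            keep.add(device)
    return keep


# Hypothetical topology: two switches behind one core router.
topology = {
    "core": None,
    "switch-a": "core",
    "web-1": "switch-a",
    "web-2": "switch-a",
    "switch-b": "core",
    "db-1": "switch-b",
}
print(sorted(root_cause_alerts({"switch-a", "web-1", "web-2", "db-1"}, topology)))
```

In this example the alerts for web-1 and web-2 are suppressed because switch-a, which sits upstream of both, is also down, while db-1 is kept because nothing upstream of it has failed.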

More Support of Virtualized Environments

The new version delivers deeper and broader support of VMware clusters and resource pools within the data center. Specifically, AccelOps now identifies resources in clusters and their utilization levels and supports the creation of rules and utilization dashboards, making it easier for IT administrators to do capacity planning for their virtual environment. AccelOps also supports alerts for VMware cluster utilization levels, essential for preventing capacity problems, and supports both multiple vCenter management consoles and mixed environments in which some virtual machines are monitored by vCenter and others are not.

AccelOps v3.1 is available now as a virtual appliance or software-as-a-service (SaaS) direct from AccelOps or through authorized partners. Organizations can purchase the SIEM (Security Information Event Management) module, the PAM (Performance and Availability Monitoring) module, or both as an “all-in-one” solution.


Monday, March 14, 2011

Environmental Protection for Large Data Centers

- Mo Sheikh, spokesperson for ITWatchDogs (www.itwatchdogs.com), says:

The dynamic nature of today’s virtualized data centers presents new environmental monitoring challenges


The combination of today’s powerful servers with the wide-scale adoption of virtualization is radically changing the way companies must monitor their data centers.

The consequences when problems arise are very high. While the loss of a single system in a smaller server room is cause for concern, a single data-center blade server failure in a large organization might take down multiple applications running as virtual instances on that piece of hardware.

Besides the higher stakes in the case of a problem, monitoring conditions in a large data center are much more complex than they have ever been before.

In a traditional server room, workloads are fairly predictable, and thus heat generation patterns, while variable throughout a rack, are often understandable. For example, increased workloads in the early morning — when everyone is launching applications — may routinely increase server heat output.

However, in today’s large data centers, workloads are much more dynamic. With virtualization, IT managers can easily move instances of an application from one physical server to another. So a server that has been relatively idle for hours might suddenly be running at 90 percent CPU utilization in just a couple of minutes.

Many of today’s computing environments virtualize the entire server fleet in the data center, allowing an IT manager to shift workloads not just from server to server in a rack, but also from row to row.

A Problem’s Impact Multiplies

As a result of these newer approaches to data-center management, conditions throughout the facility are highly variable. And more is at risk.

Unlike the traditional server room, where the loss of a single server might only take one application offline, equipment failure in a large data center due to environmental problems such as excess heat, water, humidity, smoke, fire, or a power failure can bring business to a halt for an entire department or the whole organization.

What’s needed is a sophisticated monitoring and surveillance solution to track environmental changes, identify anomalies, and send alerts when thresholds have been exceeded. The solution must give managers information about their data centers at a very granular level, so a spike in environmental conditions in a single rack can be identified quickly to head off a potential problem.

The bottom line is that in today’s data centers, rather than merely reacting to shutdowns when they occur, you need to be proactive in trying to anticipate them before they happen. This is an area where ITWatchDogs environmental monitoring solutions can help.

Monitoring Data Center Environmentals

Several data center environmental factors can contribute to or increase downtime and service disruptions. And as noted earlier, it is becoming harder to monitor these conditions due to the very dynamic nature of today’s data centers.

A system that has run cool for months can instantly show a spike in output heat if a manager shifts multiple virtualized workloads onto it all at once. An increase of 18° Fahrenheit (10° Celsius) can double equipment failure rates over time. Since the workload shifts can be very precise (from, say, shelf 1 in rack 2 of row 3 to shelf 5 in rack 4 of row 8), you need separate temperature probes on individual racks and critical devices. That way, problems with a broken fan or air-conditioning failure will show up quickly. Similarly, you might be able to identify a server that is overheating due to its increased workload.

ITWatchDogs takes into account the space limitations in today’s densely packed equipment racks. Its environmental monitors are small, ranging in size from compact models only 4 inches long and about the size of a large candy bar, up to full-sized 19” rack-mountable models which, despite all of the features packed into them, only take up a single 1U space in a rack. The devices can run off of existing electrical power outlets, and several models also support Power over Ethernet (POE).

When it comes to other environmental conditions, in addition to temperature, ITWatchDogs’ monitors come equipped with various on-board sensors along with digital and analog inputs for external sensors, including humidity, water, smoke, and fire. The environmental monitors provide a way to remotely monitor data-center conditions, view historical data to spot trends, and receive alerts when conditions exceed pre-defined thresholds.
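
The threshold-based alerting described above can be illustrated with a short, generic sketch. It does not represent ITWatchDogs firmware or any vendor API; the measurement names, the threshold values, and the probe label are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical alarm thresholds: (low, high) per measurement.
THRESHOLDS: Dict[str, Tuple[float, float]] = {
    "temperature_f": (50.0, 86.0),
    "humidity_pct": (20.0, 80.0),
}


def check_readings(probe: str, readings: Dict[str, float]) -> List[str]:
    """Return one alert message per reading that falls outside its thresholds."""
    alerts = []
    for name, value in sorted(readings.items()):
        limits = THRESHOLDS.get(name)
        if limits is None:
            continue  # no threshold configured for this measurement
        low, high = limits
        if value < low or value > high:
            alerts.append(f"{probe}: {name}={value} outside {low}-{high}")
    return alerts


# A single rack-level probe reporting a temperature spike after a workload shift.
print(check_readings("row3-rack2", {"temperature_f": 91.5, "humidity_pct": 45.0}))
```

Keeping thresholds per measurement and evaluating them per probe is what allows a spike in one rack to be flagged without waiting for the room average to move.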

With comprehensive monitoring in place, a spike in operating temperature due to a shifted workload, a buildup of condensation in a single rack, or excessive humidity along a row of equipment will be noticed quickly. To make that information available to the appropriate data center staff, ITWatchDogs probes can be monitored via any standard Web browser, without requiring you to install any proprietary software on a host PC to access the monitoring units.

Additionally, ITWatchDogs offers power monitoring. This is accomplished using the Remote Power Manager X2 (RPM X2). The RPM X2 adds remote power monitoring and switching capabilities to any ITWatchDogs environmental monitor that has one or more digital sensor ports available. The device enables users to set alarm thresholds for these measurements, and also allows the user to remotely power-cycle a locked-up system to reboot it or to turn equipment off and on via the secure user interface.

Your Technology Partner

To address the highly dynamic nature of today’s data centers, organizations need to take a proactive approach to monitoring the environmental conditions that contribute to downtime and disruptions.

ITWatchDogs offers a wide range of environmental monitors providing cost effective ways for data center staff to proactively monitor their IT infrastructure and maintain system uptime. The products provide a quick and easy way to keep an eye on remote conditions from a secure web interface and receive SNMP, E-mail, or text-message alerts when specified alarm thresholds are exceeded. The interface displays live video feeds and environmental measurements including temperature, humidity, air flow, light, sound, power, water detection, and much more. The measurements are logged and graphed for viewing trend patterns. External processes or applications can be automated on trigger of an alarm or remotely through the web interface with units supporting output relay control or with the Remote Power Manager X2.3

* ITWatchDogs is a regular contributor on Data Center POST

Friday, March 11, 2011

How to Protect Your Data Center from Environmental Threats

- Mo Sheikh, spokesperson for ITWatchDogs (www.itwatchdogs.com), says:

Introduction: Physical Dangers Just as Important as Cyber-Threats


Viruses, spyware, and network threats get most of the attention, but environmental factors like heat, humidity, airflow, smoke, and electricity can be equally devastating to server room equipment, and thus to a company’s IT operations.

To get a sense of the danger, let’s take overheating as an example. Servers generate high levels of heat, and the facility must be kept cool to ensure optimal performance. The warmer it gets, the more likely equipment will overheat and malfunction. In fact, an increase from 68°F (20°C) to 86°F (30°C) can reduce the long-term reliability of electronic equipment by as much as 50 percent. And when air conditioning fails, temperature can skyrocket in a matter of minutes. In February 2009, Duke University Professor of Physics Robert G. Brown explained that heat weakens electronic components like power supplies, motherboards, and memory chips, so even if they don’t fail immediately, they become more susceptible to failure over time.

“The one time our server room overheated drastically, reaching 85° to 95°F (30-35°C) for an extended period of time…we had node crashes galore, and a string (literally) of hardware failures over the next three months— some immediate and obviously due to immediate overheating, some a week later, two weeks later, four weeks later,” Brown writes.

In this post, we’ll discuss the danger that environmental threats pose to server room equipment, outline a comprehensive environmental monitoring strategy, and explain how environmental monitoring products from ITWatchDogs deliver an end-to-end solution for prevention and early detection of environmental issues.

No company is immune

Depending on the size of a company and its industry, downtime can cost tens of thousands of dollars per hour. For example, if your Web site is down and visitors choose a competitor, you’ve lost both the immediate transaction and the opportunity for their repeat business. If the outage causes your company to break a service-level agreement with a customer, the associated fees and potential lost business add up quickly.

Every server room and data center—even those of household-name companies and sites—is vulnerable to environmental damage. In March 2010, Wikipedia suffered a two-hour outage when one of its server clusters—located in a European data center—overheated. The company was able to reroute traffic to a North American data center, but a glitch in its DNS server tools caused Wikipedia address resolutions to fail globally. Think about how many users were frustrated by this outage. According to 2008 statistics, Wikipedia receives between 25,000 and 60,000 page requests per second. Multiplied by 2 hours, that’s at least 180 million failed requests due to overheated servers.
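The failed-request estimate above is simple multiplication, and it can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes the request rate stayed within the quoted 2008 range for the full two hours, and the function name is made up for illustration.

```python
# Rough estimate of requests that failed during the two-hour outage.
REQUESTS_PER_SECOND_LOW = 25_000    # low end of the 2008 figure quoted above
REQUESTS_PER_SECOND_HIGH = 60_000   # high end of the same figure
OUTAGE_SECONDS = 2 * 60 * 60        # two hours expressed in seconds


def failed_requests(rate_per_second: int, seconds: int) -> int:
    """Requests lost if every request during the outage failed."""
    return rate_per_second * seconds


low = failed_requests(REQUESTS_PER_SECOND_LOW, OUTAGE_SECONDS)
high = failed_requests(REQUESTS_PER_SECOND_HIGH, OUTAGE_SECONDS)
print(f"{low:,} to {high:,} failed requests")
```

The low end works out to the 180 million quoted above; at the high end of the range the same outage would account for 432 million failed requests.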

Lost business aside, you must also consider the cost of replacing expensive servers. In September of 2007, an overheating condition at St James Hospital in Leeds destroyed 1 million pounds’ worth of server equipment. The negative publicity surrounding the incident also impacted the facility’s credibility and public image.

Can your operation afford a large-scale server failure?

What’s clear is that companies of every size must protect their IT investments from environmental threats like overheating, power outages, and excessive moisture—all of which may result from flooding, condensation, leaks, or malfunctioning/poorly-configured air-conditioners.
Smoke conditions can also lead to serious equipment damage, especially if alarms are triggered during off hours when personnel aren’t available to respond quickly. If a smoke alarm triggers an ‘emergency power off’ (EPO) device, for example, cooling systems could go offline and leave servers susceptible to overheating.


Environmental Monitoring Is the Key

In a typical server room, a wall-mounted thermostat measures room temperature and controls the air conditioning. Individual servers now come with built-in temperature sensors that issue alerts if the level of heat surrounding the individual unit rises above a certain threshold, or if an internal fan breaks down. Isn’t that enough to ensure safe operating temperatures?

The short answer is, no. Data center temperatures vary widely from one zone to another. Even if the overall room temperature is 68°F (20°C), the area near the output vents may be 5 degrees cooler, and the area behind server nodes may be 5-10 degrees warmer. Airflow problems could create higher-temperature pockets of still air in some aisles, creating hot spots that can damage sensitive components.

A better approach involves temperature/humidity/airflow sensors installed on or near individual racks and critical devices. Logging and graphing these measurements over time can help administrators spot trends, such as temperature spikes during peak operating hours or fluctuations when the building’s HVAC systems are throttled back on weekends.
With comprehensive monitoring in place, if an internal fan breaks or an air conditioning unit fails, the spike in operating temperature will be noticed quickly. Probes with internal microprocessors are easy to configure and highly reliable. Similar sensors can track humidity and moisture in the air and the floor, and measure the temperature and rate of air flowing along different paths in the server aisles.
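To make the idea of spotting spikes in logged measurements concrete, here is a minimal Python sketch. It is not how any particular monitoring product works: the `Reading` record, the 80°F threshold, and the 5-degree rise limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Reading:
    sensor: str          # hypothetical sensor label, e.g. "rack1-top"
    taken_at: datetime
    temp_f: float


def find_alerts(readings, threshold_f=80.0, max_rise_f=5.0):
    """Return (reading, reason) pairs worth alerting on.

    A reading is flagged when it is above threshold_f, or when it rose by
    more than max_rise_f since the previous reading from the same sensor.
    Both limits are illustrative defaults, not recommended values.
    """
    alerts = []
    last_by_sensor = {}
    for reading in sorted(readings, key=lambda r: r.taken_at):
        previous = last_by_sensor.get(reading.sensor)
        if reading.temp_f > threshold_f:
            alerts.append((reading, "over threshold"))
        elif previous is not None and reading.temp_f - previous.temp_f > max_rise_f:
            alerts.append((reading, "rapid rise"))
        last_by_sensor[reading.sensor] = reading
    return alerts
```

Tracking the previous reading per sensor means a sudden jump on one rack is caught even while the room as a whole is still under the threshold.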

Even sound sensors can help in the early detection and remediation of component failures. For example, a fan that is wearing out may get louder over time, which could be spotted at an early stage on a device that graphs relative measurements. A properly calibrated sensor would send out alerts for either condition and help IT staff resolve the issue rapidly.

The benefit of microprocessor-based sensors is that they can be monitored via Web browser, without requiring proprietary software installations. With a Web-enabled monitoring system, you can measure temperature, humidity, airflow, water leaks, power, door/cabinet position and more, setting alert thresholds and escalation schemes in case an anomaly is detected.

Optimal sensor equipment can send alerts in numerous formats, including SNMP

Best Practices for Optimal Monitoring

Heat: An optimal environmental monitoring strategy includes multiple temperature sensors. These should be placed on top, middle, and bottom of individual racks to measure the heat being generated by equipment, and at the air conditioning system’s intake and discharge vents, to measure efficiency. Probes should also be placed around critical devices, because the temperature inside a rack-mounted device could be as much as 20 degrees higher than the surrounding area. A probe near the room’s thermostat can help monitor what the thermostat is ‘seeing’ as it controls the air conditioner.

You can also use a hand-held thermometer to determine where the hottest spots are in the server room, and then set up sensors in those areas to get an ‘early warning’ when temperatures rise.

Once these sensors are in place and being monitored centrally from a browser, emergency alert policies should be set up to ensure that the right personnel are informed of potential problems. Remediation procedures should also be mapped out ahead of time. Service contracts with an air-conditioning repair company ensure rapid response, and you should make sure the company offers 24-hour service.

The logs that track temperature over time are also helpful, in that IT managers can review them over a weekly or monthly span and analyze them for spikes that occur during off hours. In addition, testing the sensors every month is an important step to making sure the system will function properly when an event does occur.

Water: Moisture and humidity sensors should monitor for leaks inside cooling equipment, potential leaks that come from nearby pipes, or water caused by a flood or disaster. Water sensors should be placed at the lowest point (wherever water would tend to puddle) on the floor, and underneath any pipe junctions. Air-conditioning condensation trays should also be equipped with sensors to detect overflow.

Power: Electrical failures can cause air-conditioning equipment to shut down even while an uninterruptible power supply (UPS) ensures that servers stay up and running – a sure recipe for overheating a server room in short order. The best approach is to monitor current coming into the data center, and arrange for an orderly shutdown of IT equipment in case power is knocked out. The hour or two of downtime is far preferable to the widespread device failures that would result from an overheating condition.

Smoke: Smoke alarms can trigger power shutdowns. Also, they’re usually not tied to an alerting system that contacts IT personnel. Alarms may be noticed by facilities managers—or the local fire department—but the maintenance of sensitive server equipment is not their top priority. Here, the best approach is to wire the smoke alarms directly into the climate monitoring and alerting system, essentially extending the functionality of the climate sensors to the smoke alarm.

Doors: A final concern for data center monitoring is unauthorized entry. Dry-contact sensors that detect the opening and closing of a door should be installed at the room entry points and on the doors of server and UPS cabinets. On a busy day, these sensors can send alerts numerous times and present a time-consuming irritation, but managers can configure alerts to account for weekday vs. weekend operations, work hours vs. overnights, and other factors to help reduce the number of alerts sent and pinpoint unusual activities.
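As a rough illustration of configuring door alerts around working hours, the Python sketch below only raises an alert when a door opens on a weekend or outside a weekday work window. The 8:00 to 18:00 window and the function name are assumptions; a real policy would reflect your own staffing schedule.

```python
from datetime import datetime, time

# Assumed weekday working hours; adjust to the site's actual schedule.
WORK_START = time(8, 0)
WORK_END = time(18, 0)


def should_alert_on_door_open(opened_at: datetime) -> bool:
    """Alert only when a door opens on a weekend or outside work hours."""
    is_weekend = opened_at.weekday() >= 5   # Monday is 0, so 5 and 6 are Sat/Sun
    in_work_hours = WORK_START <= opened_at.time() < WORK_END
    return is_weekend or not in_work_hours
```

With this rule, routine daytime access during the week is logged but stays quiet, while a cabinet opened late at night or on a Saturday still produces an alert.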

IP cameras are another fairly easy component to add to a monitoring solution. They provide real-time surveillance of sensitive areas in the data center and tie into the Web-based console, so administrators can get a first-hand look at the environment wherever they may be.

What to Look For in an Environmental Monitoring Solution

A solid environmental protection solution should include sensors that are easily deployed throughout the data center, connected to a monitor with a built-in Web server for easy access and communication. It should also deliver:

  • Secure, browser-based access
  • Comprehensive logs and graphical analyses of environmental factors over time
  • Multiple account levels, to ensure that IT staffers or clients see only what they’re authorized to see
  • Multi-level alarm policies with escalation, so admins can set alert thresholds and contact lists for prompt response
  • Multiple notification media, including e-mail, SMS text message, SNMP alerts, and telephone auto-dialer
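One way to picture multi-level alarm policies is as a small table of levels, each with its own threshold, contact list, and notification media. The Python sketch below is illustrative only: the level names, temperatures, contacts, and media are invented, and it does not represent any vendor's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlarmLevel:
    name: str
    threshold_f: float   # level applies when temperature exceeds this value
    contacts: tuple      # who is notified at this level
    media: tuple         # how they are notified


# Hypothetical two-level policy for a temperature sensor.
POLICY = (
    AlarmLevel("warning", 80.0, ("oncall-admin",), ("email",)),
    AlarmLevel("critical", 90.0, ("oncall-admin", "it-manager"), ("email", "sms", "snmp")),
)


def level_for(temp_f, policy=POLICY):
    """Return the highest alarm level whose threshold is exceeded, or None."""
    matched = None
    for level in sorted(policy, key=lambda lvl: lvl.threshold_f):
        if temp_f > level.threshold_f:
            matched = level
    return matched
```

For example, `level_for(85.0)` selects the warning level and its email-only contact list, while `level_for(95.0)` selects the critical level, which adds a second contact and more notification media.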

Requirements aside, the solution should not charge subscription fees for tech support and software updates. A long-term data center management and monitoring solution is critical to preserving your IT investment, but it should not generate recurring expenses that degrade ROI.

The ITWatchDogs Solution

The ITWatchDogs family of monitoring devices provides remote monitoring of environmental parameters in data centers and server rooms. They track temperature, humidity, leaks, power supplies, door position and more. ITWatchDogs’ wide variety of models and options fit different requirements and room sizes, but all are based on standard hardware and software and monitored via a Web browser.

The environmental units are designed to take up very little space; the largest models are 1U high rack-mount units, the smallest is only 4 inches long by 1.5 inches wide and deep. Models with built-in Power over Ethernet (POE) capability are available.

All the products have a wide range of on-board sensors; most models allow 16 or more remote sensors to be connected as well.

All ITWatchDogs’ climate monitors have a built-in Web server that automatically generates sensor data logs and graphs, without any need for external software. All management and monitoring tools are accessible securely via Ethernet or the Internet. The monitors have SNMP agent software to integrate with popular networking management tools, and they support SNMP v1, v2c, and v3. Some models include low-voltage relay outputs that can be used to activate a strobe light, an alarm, a backup air conditioning unit, or an auto-dialer. ITWatchDogs offers highly reliable auto-dialer devices for both GSM and analog phone systems, with their own independent backup-power batteries which allow them to make phone calls to your IT and service personnel even in the event of a power failure.

Lastly, ITWatchDogs stands behind its products, with firmware updates available free on its Web site and technical support available free for life. Support is provided by the same engineers that designed and engineered the devices themselves, so questions and problems are resolved quickly and authoritatively.

Conclusion

Data center equipment is very sensitive and susceptible to environmental damage from excessive heat, moisture, and unauthorized access. Power outages that knock out cooling systems can lead to overheated servers in a matter of minutes.

Simple thermostats and server-based temperature sensors aren’t enough to ensure comprehensive protection. IT organizations need temperature and water sensors throughout the data center and at specific strategic locations near potential trouble spots. They also need door sensors and IP cameras to alert administrators in case of unauthorized entry and provide real-time views of the space. They also need comprehensive management tools to tie the data from these sensors together into a cohesive display, and to set alarm parameters in case a threshold is exceeded.

ITWatchDogs provides a full line of environmental sensors that deliver exceptional protection and alerting functions without requiring any proprietary software installations or update subscriptions. Regardless of your data center’s size or complexity, ITWatchDogs has a cost-effective monitor and sensor solution that will reduce risk and enable smoother IT operations for your company.

To learn more about ITWatchDogs and its line of monitors and sensors, visit www.itwatchdogs.com

* ITWatchDogs is a regular contributor on Data Center POST

Thursday, February 24, 2011

What Makes a Good Monitoring System?

- Steve Francis, Founder and CEO of LogicMonitor (http://www.logicmonitor.com/), says:

Even with the best datacenter monitoring system in place, whether an implementation succeeds depends largely on the processes adopted around monitoring.

The ideal is a monitoring system that is comprehensive (alerting on all conditions that need attention) and noise free (NOT alerting to any conditions that do not need attention.) A noisy alert system is almost as bad as no monitoring – it will train people to ignore their alerts.

These goals are diametrically opposed, and further complicated by the fact that different users of the monitoring have different criteria for what needs attention or is noise. However, with appropriate alert escalations and routing, and some good processes in place, you can approach this state.

What processes can help?

Use Scheduled Down Time
This is probably the most important process to enforce. If someone is going to be working on a system, schedule downtime! Prevent the alerts going out in the first place. If you have regular maintenance windows for sets of hosts, set up the scheduled downtime to recur automatically. If there are processes that will trigger alerts periodically (such as CPU alerts triggered by disk scrubs on NetApp filers), schedule recurring downtime for just that alert.
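A recurring downtime window can be thought of as a rule that says "suppress this alert on this weekday between these times." The Python sketch below shows that idea under simplifying assumptions: windows do not cross midnight, and the class, field, and alert names are hypothetical rather than taken from any monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class RecurringDowntime:
    weekday: int      # 0 = Monday ... 6 = Sunday
    start: time
    end: time         # same-day window; assumed not to cross midnight
    alert_name: str   # a single alert to suppress, or "*" for all alerts

    def covers(self, alert_name: str, when: datetime) -> bool:
        """True if this window suppresses alert_name at the given time."""
        return (
            when.weekday() == self.weekday
            and self.start <= when.time() < self.end
            and self.alert_name in ("*", alert_name)
        )


def is_suppressed(alert_name, when, windows):
    """True if any scheduled downtime window covers this alert right now."""
    return any(window.covers(alert_name, when) for window in windows)
```

A weekly disk scrub, for instance, could be represented as `RecurringDowntime(6, time(2, 0), time(4, 0), "cpu_busy")`, which silences only the CPU alert on Sunday mornings and leaves every other alert active.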

Get Rid of unneeded alerts
For every alert received during the initial deployment of the monitoring, an assessment should be made: was the alert needed? If so, acknowledge the alert and go fix the problem. If not, decide how narrowly the alert can be tuned or removed. If the alert is regarding buffer discards on a switch, but only on the port where a 1Gbps network links to a 100Mbps uplink, discards are normal and expected, so the threshold should be adjusted just for this one specific port. If the alert is regarding swap space used on a QA system, and QA regularly engages in stress tests that would trigger this, you should disable the alert or adjust the threshold on the QA group. (For resources shared on systems, such as storage array volumes, you should add a filter to disable the QA alerts, or set different thresholds.)
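The tuning described above amounts to resolving a threshold from the most specific setting available: a single port first, then a group, then the global default. Here is a minimal Python sketch of that lookup. The metric names, host names, and numbers are invented for illustration, and `None` is used here to mean "alert disabled."

```python
# Global defaults; all values here are illustrative, not recommendations.
DEFAULT_THRESHOLDS = {"buffer_discards_per_sec": 10, "swap_used_pct": 50}

# Override for one specific port where discards are normal and expected.
PORT_OVERRIDES = {("core-sw1", "port24", "buffer_discards_per_sec"): 500}

# Group-wide override; None disables the alert for that group.
GROUP_OVERRIDES = {("QA", "swap_used_pct"): None}


def effective_threshold(metric, host, instance=None, group=None):
    """Most specific setting wins: port override, then group, then default."""
    if (host, instance, metric) in PORT_OVERRIDES:
        return PORT_OVERRIDES[(host, instance, metric)]
    if (group, metric) in GROUP_OVERRIDES:
        return GROUP_OVERRIDES[(group, metric)]
    return DEFAULT_THRESHOLDS[metric]


def should_alert(value, metric, host, instance=None, group=None):
    """True when the value exceeds the effective threshold, if one applies."""
    threshold = effective_threshold(metric, host, instance, group)
    return threshold is not None and value > threshold
```

Keeping the override as narrow as possible means the uplink port stops generating noise while every other port on the same switch is still held to the default.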

Send alerts to the right place
For every valid alert received, make sure it’s going to all the right people, and only the right people, by the right methods, and is escalating at appropriate intervals. (An example of inappropriate escalation would be sending warning alerts on production systems via email instead of pager, but escalating them every 5 minutes. If three warnings occur at 1:00 am and no one checks email until 8:00 am, everyone will have about 250 email alerts cluttering their inbox.) The set of people who want to know about network retransmissions is probably different from the set that wants to know about a commerce site being completely down.
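The figure of roughly 250 emails in the example above can be reproduced with a short calculation. The Python sketch below counts only the repeated escalation emails (one per alert every 5 minutes over the seven hours) and assumes all three warnings stay active the whole time; the function name is hypothetical.

```python
from datetime import datetime


def emails_sent(active_alerts, first_sent, checked_at, repeat_minutes):
    """Escalation emails piled up between the first alert and the inbox check."""
    elapsed_minutes = int((checked_at - first_sent).total_seconds() // 60)
    repeats_per_alert = elapsed_minutes // repeat_minutes
    return active_alerts * repeats_per_alert


total = emails_sent(
    active_alerts=3,
    first_sent=datetime(2011, 2, 24, 1, 0),
    checked_at=datetime(2011, 2, 24, 8, 0),
    repeat_minutes=5,
)
print(total)
```

Seven hours is 420 minutes, or 84 repeats per alert, so three alerts produce 252 emails, which matches the approximate figure in the example.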

Review
We recommend a weekly review, especially during the initial roll out of a monitoring system, to ensure the above points:

  • every alert was valid. If not, consensus is arrived at as to how and at what level the alert needs to be tuned. If alerts were the result of scheduled staff actions, but no one told the monitoring about the scheduled down time, liberal use of the clue stick (or stronger action) is recommended.
  • every alert was delivered to all, and only, the correct recipients.
  • the escalations for each alert were valid and appropriate.
  • don’t close any incident that was not alerted on, until the alerts to detect it have been created.
Putting processes in place will likely be the key to your monitoring deployment succeeding or failing, and is another area where a SaaS monitoring service is likely to be superior to a premises-based system. A SaaS service is invested in your continual use of, and satisfaction with, monitoring – they don’t get the money up front. At LogicMonitor, at least, we help our customers with the whole implementation, including processes.

Tuesday, January 11, 2011

Data Center Energy Efficiency: The OptoEMU Sensor

David Crump, Marketing Communications Manager at Opto 22 (www.opto22.com), says:

Data centers are large facilities that include a huge assortment of servers and other computer hardware. Keeping this equipment operating and the facility up and running 24/7 is certainly of the utmost importance, but meeting the power requirements to accomplish this can be incredibly expensive.

Opto 22's OptoEMU Sensor is designed for maintenance engineers, facilities managers, business owners, energy consultants, and others looking for ways to better understand and reduce energy consumption. The Sensor can be implemented in these data centers to offer far more than a simple snapshot of the data center's total power usage presented on a local PC or operator interface terminal. Instead, the Sensor is able to identify the individual power draw of specific pieces of equipment, then communicate this data over standard IP-based networks, where it can be easily accessed and viewed in both tabular and graphical formats either locally or over the web via third-party applications like Google PowerMeter and Pulse Energy's Pulse.

After studying their energy data, data center managers will have the information they need to enact procedural and operational changes that lower utility bills. They also have the option to easily expand the OptoEMU Sensor's capabilities to add equipment management and control functions.

Nutshell.
Opto 22's OptoEMU Sensor is an energy monitoring and data acquisition hardware appliance that lets commercial and industrial customers acquire power consumption data from facility systems, machines, equipment and metering devices in real time and with minimal configuration. The Sensor also provides the interfaces needed to quickly and easily view and share this data over standard wired and wireless networks and the internet so it can be archived, presented, analyzed, and used to develop effective energy management strategies that reduce costs.

Unique.
Opto 22's philosophy of "open-ness" makes the company unique. All Opto 22 products (including the OptoEMU Sensor) are built on open, standard, ubiquitous, and well-understood information and communications technologies and protocols, such as IP; analog, digital, and serial signal processing; and serial, Ethernet, wireless LAN, and cellular communication. This standards-based approach allows Opto 22 products to exist in a wide variety of industrial and business architectures and perform with the power and reliability Opto 22 components are known for industry-wide. Opto is perhaps best known for the incredible reliability and durability of its products, which are used by automation end-users, OEMs, and information technology and operations personnel in over 10,000 installations worldwide.