- Brad Bonn, senior systems engineer at VKernel (www.vkernel.com), says:
Recently, and over the last couple years or so specifically, I've seen IOPS become a buzzword everywhere. Phrases such as “We have a one million IOPS-capable SAN so storage shouldn't be having a problem,” or “I really need IOPS visibility” are popping up in conversations and appearing in online communities like the weeds in my garden. It makes sense that we're seeing the topic appear, because there is a decided lack of measurement standards for the overall performance of an end-to-end storage solution.
Earlier in my years in IT, disk I/O capacity was classically “measured” in terms of the number of acronyms you could rattle off. “Yeah, I've got a 12-spindle RAID 0+1 of 15K 146GB SAS drives in the backplane connected to my HBA with FC.” That's all well and good, but what does it translate to in terms of actual usability? How many databases can it support, and at what level of transactional utilization? How many users on an Exchange system could it handle? What kind of file server load could it deliver? The universal answer is “it depends.” Application and file system diversity, the sharing of storage hardware through SAN/NAS devices, and the additional levels of sharing added through virtualization make unexpected results...well, expected.
The market abhors a vacuum, and when there is a clear need, vendors and integrators alike will move to try and fill it. The need in this case is the simplification of storage utilization both in terms of need from the application point of view, and in terms of delivery from the storage vendor point of view. Voila, IOPS.
It's a very simple concept, which is part of the reason it's become so widely used. “I/O Operations Per Second” is an easily understood and communicated unit of measurement. Unfortunately, it's also very easy to over-simplify. IOPS (or IOps or IOPs depending on the phrase you’re actually abbreviating) only describes the number of times an application, OS, or VM is reading and/or writing to storage each second. This sounds like a useful metric, because it is! More IOPS means more disk I/O, and if all IOPS are created equal we can measure disk activity with it alone. But the problem is, they aren't.
This topic is hotly debated on all sides, and having spoken with storage vendors, IT admins, and SMEs, I’ve come to the conclusion that as important as IOPS are, they aren’t the only metric you need to examine when you measure storage performance.
The goal of this document is primarily to outline the strengths and weaknesses of the use of IOPS in measuring storage capabilities; specifically from the perspective of shared storage. Along the way, we will cover some of the basic concepts surrounding shared storage itself and the implications that design choices in building a solution can have upon the performance and price of your infrastructure.
Shared Storage Fundamentals
If you’re already familiar with shared storage technology in general, feel free to skip ahead to the next section, but it’s helpful to review the components of a SAN to best understand the impact IOPS can have.
Where we keep the bits and bytes of data in our server farms has come a long way. Just spend some time on Wikipedia looking up things like “core memory” and punch cards to get a reminder of the evolution that storage has undergone over the years. Plus, not only has the medium by which we store data changed, but also the method by which we get information in and out of those sources has become just as diverse.
In the mainframe days, “shared storage” was a redundant title. All the computing resources for the building or company, including storage, were centrally located and therefore shared. Whatever tapes, disks or memory housed the data was all connected to the same core system, or systems. The resulting hierarchy was then logically very star-shaped with terminals connecting centrally in order to utilize the mainframe.
Modern computing is much more dense, and likewise, much more distributed. The giant mainframes of the past with their singular presences have been replaced by sprawling, interconnected datacenters consisting of dozens, hundreds, or even thousands of individual servers that all communicate internally and externally with other servers or personal computers. With each node being able to house its own storage devices, where data is located becomes equally as distributed.
These days, unless your “server room” consists of a few PCs and a SOHO router, you’re probably using a SAN or NAS in your infrastructure to share data between servers. Storage solutions like these make the allocation and migration of logical disks more manageable and fluid, to the point where directly-attached storage, or DAS (local disks contained inside each individual server) is practically never used in the datacenter any longer. Small infrastructures can still benefit from the lower initial investment of DAS, but in a rapidly-growing environment shared storage is critical to enabling rapid expansion, and vastly improving cost density. It seems the concept of where data is housed has come full circle and become centralized once more (failover sites notwithstanding.)
Whether you are using a storage area network or network-attached storage system, the end goal is the same: having a one-to-many relationship between a storage device and the computers that access it. Making that relationship happen involves various pieces of physical equipment and layers of logical abstraction, and while this whitepaper makes no claim to be a definitive guide on the broad topic of shared storage, we need to discuss some of the complexities involved in order to get a clear idea of what IOPS really means for such a system.
Links in the chain
Shared storage devices, regardless of their make, model, size or configuration all consist of the same general components. Any single read or write command has to traverse at least part of this chain in order to reach its destination, whether it is the active memory of a server, or the bare metal of a spinning disk. Every single link in this chain will affect the speed at which the data reaches where it’s going, as well as how many of those operations can be executed within a span of time.
The “weakest link” in this chain will determine the maximum number of IOPS and the total I/O bandwidth of the system. Every step of the way, additional connections are possible. In a block-level shared storage system such as Fibre Channel or iSCSI (generally what would be considered a SAN), a LUN (logical unit number) can be accessed by multiple hosts (the effective one-to-many use case itself), but then those hosts can talk to multiple storage devices containing multiple disks via multiple paths.
In a file-based shared storage system such as an NFS filer (considered a type of NAS), a LUN isn't used. Instead, filers will present network share locations, which can be accessed as logical disks by hosts and VMs. This option tends to be much more affordable since it can leverage existing networks and does not require as much specialized hardware. However, this comes with a performance cost which may not always suit the needs of high-demand tier-1 applications.
Starting at the “bottom,” we find the physical disks themselves. The faster the drive and the more data it can hold, the more expensive it becomes. Choosing the underlying disk technology will heavily impact the SAN or NAS’ total I/O capability and storage capacity. Lots can be done to get the most out of the spindles, but the buck stops here in the end.
A set of inexpensive magnetic SATA drives spinning at only 7200RPM brings a great cost-to-storage-density ratio, but at the expense of limited speed. I/O-light applications or data archives are always well-suited to these. On the other hand, a set of 15,000RPM serial-attached SCSI (SAS) drives with hefty memory caches on-board, or solid-state drives (SSD) with no moving parts built entirely from flash chips, will bring astonishing speed to the table for hefty database operations or disk-intensive apps at the cost of a serious hit to your budget. Larger SANs will contain a mixture of these kinds of drive technologies, allowing for the intelligent balancing of disk loads to match performance tiers.
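To put rough numbers on the difference between drive classes, here is a minimal sketch of the usual back-of-the-envelope estimate for a single spinning disk: one random I/O costs roughly an average seek plus half a rotation. The function name `spindle_iops` and the seek times are illustrative assumptions, not vendor figures, and the estimate ignores on-board caches and transfer time.

```python
def spindle_iops(rpm: float, avg_seek_ms: float) -> float:
    """Estimate random IOPS for one rotating disk."""
    # On average the platter must turn half a revolution before the
    # requested sector passes under the head.
    rotational_latency_ms = 0.5 * 60_000.0 / rpm
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000.0 / service_time_ms


# Seek times below are assumed for illustration only.
sata_7200 = spindle_iops(rpm=7200, avg_seek_ms=8.5)
sas_15k = spindle_iops(rpm=15000, avg_seek_ms=3.5)

print(f"7200RPM SATA:  ~{sata_7200:.0f} IOPS per spindle")
print(f"15,000RPM SAS: ~{sas_15k:.0f} IOPS per spindle")
```

With these assumed seek times the estimate comes out to roughly 79 and 182 IOPS per spindle, which is why spindle count and drive class dominate the raw capability of an array.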
On the next link in the chain, these physical disks are bundled into logical groups, or arrays, often known as a RAID (Redundant Array of Independent Disks) which ensure the protection of the data against disk failure, and can potentially speed up bare disk operations through parallel reads and writes. However, depending on the type of RAID configuration in place, the total performance of the disks can be negatively impacted as well. For a detailed description of the types of RAID that are out there and their effects on performance, this article online can help: http://www.accs.com/p_and_p/RAID/BasicRAID.html
Many modern SANs take array configuration completely out of the hands of the storage administrator. This can greatly simplify the bare disk component of the storage and enforce best practices across tiers.
In order for the disks in an array to function as a logical unit, there must be a device or software to organize them into such a structure. In a shared storage configuration, the storage processor (or storage controller) handles this. Usually housed inside the chassis of the SAN, the storage processor is a self-contained computer system that handles all I/O for the device. It manages each I/O operation written to and read from the disks, as well as manages the communication to the hosts utilizing the shared storage via the various mediums available.
A storage system can have multiple redundant or load-balanced storage processors, and can be made up of multiple physical chassis containing anywhere from a few dozen to a few hundred physical disks.
With both SAN and NAS shared storage models involving a many-to-many connection scheme between the physical drives and logical disks, the full map of a complex SAN configuration can become quite the atlas!
Each of these links in the chain will affect the end-to-end performance of a SAN or NAS and the diverse possibilities that lie in each possible portion mean that shared storage configurations can vary to an astonishing degree.
Troubleshooting poor performance in a shared storage system can often feel like trying to search the inside of an oil tanker with a penlight. And while the specifics of what is causing slowness can be innumerably varied, the source or sources of storage limitations typically break down to a few major areas. These are where storage admins generally look:
• Hardware or system failures
• Communication “traffic jams” to the storage device
• Surges in storage I/O load demand
• Inefficient configurations for required throughput
• Application or OS-level inefficiencies
Failures in the chain of storage will cause at worst a complete outage, or at the very least a disruption in the performance of the complete system. For example, disk failures will place a RAID array into a “degraded” state where the system attempts to work around the missing disk or disks. Degraded arrays will almost always be significantly slower in servicing I/O. They also no longer have their full redundancy and could suffer data loss if further failures occur, so be sure to replace the failed disks ASAP. In addition to the slowdown caused by the degraded array, once a replacement disk is inserted, the array must be rebuilt. This gives the storage processor and all disks in that array an additional task to perform on top of whatever normal I/O load is being placed on them, which will slow down the responsiveness of the system even further.
There can also be failures in the communication between the hosts and the storage device. Loss of connectivity on redundant links will cause a reduction in the maximum amount of bandwidth to the SAN/NAS, or even sever connections entirely, causing outages or failover.
The storage devices themselves can also experience outages and failures, whether in the form of a storage controller failure, an issue on the backplane circuitry, or any of a host of other potential breakdowns.
Generally, any good storage system has redundancy throughout, allowing for various components to fail without bringing production operations to a standstill. For this reason, and because failures are rare, this tends to be the least likely cause of performance problems in a shared storage configuration.
Outside of system failures, excessive I/O loads on a perfectly functional storage system can create higher latencies and therefore cause poor application performance. While this is the most obvious potential issue, it is also one of the most difficult to diagnose. Closely monitoring I/O load metrics such as IOPS and MB/s is critical for determining where heavy loads are coming from and how to most appropriately respond to those loads.
Visibility into the various “links in the chain” to determine the source of these metrics is also critical in narrowing down which portion of the storage system is taxed most heavily. Is the disk array overloaded, or are the iSCSI links saturated? Is the storage processor’s CPU unable to keep up, or are the host bus adapters inside the servers being pushed to their limits? If iSCSI or another Ethernet-based connection is employed, are jumbo frames turned on at every step of the way? Very frequently, VM Admins don’t realize that even in vSphere 4.1, setting up multipath iSCSI with jumbo frames involves a lot of legwork in the ESX console before it works fully! This is being improved upon in 5.0 to be more GUI-centric, but double-check your work using the guide here to make sure:
http://www.vmware.com/pdf/vsphere4/r41/vsp_41_iscsi_san_cfg.pdf
Occasionally, the bottleneck might be within the application or OS, and not in the storage environment at all. For example, poor I/O performance on a database server might be due to tiny growth increments, which cause fragmentation. Another common (but decreasingly so) example is disk alignment. If the clusters of a partition are not aligned with the blocks of the disk (either physical or virtual) then reading one cluster may end up requiring access to up to three chunks of the LUN. Under extreme conditions, a misaligned disk can cause up to a 40% degradation in I/O performance! Most modern operating systems don’t exhibit this problem, and typically it only occurs in legacy deployments.
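The alignment problem is easy to see with a little arithmetic. The sketch below counts how many chunks of the LUN a single cluster read has to touch for a given partition start offset. The 4KB chunk and cluster sizes are assumptions chosen for illustration, and the 63-sector start is the classic legacy partition offset, so treat the whole thing as an example rather than a description of any particular array.

```python
def chunks_touched(offset_bytes: int, length_bytes: int, chunk_bytes: int) -> int:
    """Count how many storage chunks a single read or write spans."""
    first = offset_bytes // chunk_bytes
    last = (offset_bytes + length_bytes - 1) // chunk_bytes
    return last - first + 1


SECTOR = 512
CHUNK = 4096    # assumed chunk size on the LUN
CLUSTER = 4096  # assumed file system cluster size

legacy_start = 63 * SECTOR     # 32,256 bytes: not a multiple of 4096
aligned_start = 2048 * SECTOR  # 1 MiB: a multiple of 4096

for name, start in [("legacy", legacy_start), ("aligned", aligned_start)]:
    aligned = start % CHUNK == 0
    touched = chunks_touched(start, CLUSTER, CHUNK)
    print(f"{name}: aligned={aligned}, chunks per cluster read={touched}")
```

In this example the misaligned partition turns every one-chunk read into a two-chunk read. If the cluster is larger than the chunk (say an 8KB cluster over 4KB chunks), the same misaligned start touches three chunks instead of two.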
In the end, working out bottlenecks in the chain involves figuring out which component is waiting for the other to finish its job and send the information along. Any time the average total latencies of I/O operations exceed 20ms, the applications and the users on them will begin to notice degradation of performance. As this latency increases, performance will only get worse, potentially reaching the point where the delay in reads and writes will cause the applications or OS’s hosting them to time out and give up on their attempts to access the disk. In virtual and non-virtual systems, these situations are often logged as “aborted I/O commands,” and they can cause serious high-level errors if proper handling for I/O is not implemented within the applications.
Measuring the I/O latency is most often done from within the SAN. Tools from storage manufacturers or third-parties will grant visibility into the wait times for IOPS and can help pinpoint the source of the slowdown from within the storage system. However, these tools often overlook the guests themselves and their point-of-view. Measuring storage latency at the server (or VDI) is just as important, since latency inside the SAN doesn’t always account for the links and protocols which connect that SAN to all of the systems communicating with it. Looking deep into the storage device will help show which RAID array is having the most trouble, for example, but it will not help track down the fact that one of the network connections is saturated, causing the traffic to the SAN to bottleneck. So make sure that whatever methodology you are using to monitor disk performance incorporates a full view of latency end-to-end with the goal being to keep those numbers as small as possible.
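As a small illustration of why the vantage point matters, the sketch below averages latency samples taken at two places for the same workload and compares each against the 20ms figure mentioned above. The sample values and the names `latency_report` and `LATENCY_THRESHOLD_MS` are hypothetical; a real monitoring tool would supply the measurements.

```python
from statistics import mean

LATENCY_THRESHOLD_MS = 20.0  # the average-latency figure quoted in the text


def latency_report(samples_ms, threshold_ms=LATENCY_THRESHOLD_MS):
    """Return {vantage point: (average latency in ms, exceeds threshold)}."""
    report = {}
    for name, values in samples_ms.items():
        avg = mean(values)
        report[name] = (avg, avg > threshold_ms)
    return report


# Hypothetical samples for the same workload, in milliseconds.
samples = {
    "inside the SAN": [4.0, 6.0, 5.0, 7.0],
    "at the guest": [22.0, 31.0, 27.0, 24.0],
}

for name, (avg, slow) in latency_report(samples).items():
    print(f"{name}: avg {avg:.1f} ms, over threshold: {slow}")
```

In this made-up case the array looks healthy from the inside while the guest is already past the threshold, which points at the links and protocols in between rather than the disks.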
What does all this mean? It means storage is complicated! Despite the simplicity of the concept (store a byte in one location and access it from many places), the implementation is a mechanism of underappreciated complexity.
IOPS as a measurement of disk I/O
So, back to IOPS. Where does this measurement metric come into play, and how does it affect the overall picture of disk I/O? Can your storage system’s performance be measured in IOPS effectively? Well, yes, but only in part. A better question to ask is, “What do you define as performance?” Are we talking about the maximum I/O potential of a storage system, or the responsiveness of the storage to the demand being placed upon it?
IOPS are an effect of a storage system’s performance. Better performance on a more expensive SAN means more potential IOPS. Easy, right? Well here’s where things become tricky:
An idle storage system has zero IOPS. As load increases on a storage system, the IOPS against it go up. As the load continues to increase, bottlenecks within the system will cause latency to rise and with more time between each I/O, the number of IOPS will eventually plateau. IOPS can therefore be an indicator of load under ideal circumstances, but one cannot simply say that because a SAN is showing a certain number of IOPS that it is performing well. It’s not until the latency for those IOPS is examined that we know the SAN has reached its saturation point and our applications have begun to suffer.
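The plateau described above can be sketched with Little's Law, which ties together the number of outstanding I/Os, IOPS, and latency (outstanding = IOPS × latency). The model below is deliberately simplified: it assumes a fixed per-I/O service time and a hard IOPS ceiling, both of which are made-up parameters, whereas a real system degrades more gradually.

```python
def observe(outstanding: int, base_latency_ms: float, max_iops: float):
    """Return (IOPS, latency in ms) for a given number of outstanding I/Os."""
    # Below saturation, every I/O completes in the base service time.
    unconstrained = outstanding / (base_latency_ms / 1000.0)
    iops = min(unconstrained, max_iops)
    # Little's Law: outstanding = IOPS * latency, so latency = outstanding / IOPS.
    latency_ms = outstanding / iops * 1000.0
    return iops, latency_ms


for qd in (1, 4, 16, 64, 256):
    iops, lat = observe(qd, base_latency_ms=2.0, max_iops=10_000)
    print(f"outstanding={qd:>3}  IOPS={iops:>7.0f}  latency={lat:6.1f} ms")
```

Once the ceiling is hit, the IOPS column stops moving while latency keeps climbing with load. The IOPS figure alone would look identical at 64 and 256 outstanding I/Os, yet the second case is four times slower for every application waiting on it.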
Storage performance affecting IOPS
There are many, many ways to build a storage system, and nearly all of these permutations will have an effect upon the number of simultaneous inputs and outputs that can be executed.
For example, let’s say a single spindle can perform X number of IOPS. If that disk is now striped with a second disk inside an array, this increases the amount of possible I/O by giving the potential to read and write to both devices simultaneously, thereby giving us X * 2 total IOPS available. Correct? Well, sort of. Even though the I/O is going to two disks at once, thereby increasing the total amount of data that can be written simultaneously, the application layer does not see this. The storage processor is transparently handling the transfer to both disks, and with the increase in bandwidth, can potentially handle more IOPS, but it isn’t as simple as pure multiplication.
The process of encapsulating I/O also involves overhead on the part of the storage processor itself, so while adding more spindles to an array will bring about more available I/O capacity to a point, eventually the overhead becomes so great that all benefit of the parallelization is lost. This concept holds true throughout computing, so I won’t address it in-depth in this article. For additional reading on the concept, point your browsers here:
http://en.wikipedia.org/wiki/Parallel_computing
The potential bandwidth of a storage system also affects IOPS, regardless of an I/O’s size or nature. The more bits that can be sent per second, the more IOPS, so it’s clear that a SAN’s performance affects IOPS.
IOPS affecting storage performance
Now let’s look at the situation from the other side. Reading or writing to a storage device means that those IOPS are placing load upon the network links, taking up CPU cycles on the storage processors and HBAs, and making the heads of the spindles move to various locations around the disks. This means that if another system wants to utilize that same shared storage device, it will need to work within the boundaries of what remains. The disk heads will probably now have further to travel in order to read or write the next time around and the network links only have so much bandwidth remaining at that moment, etc. This means that the more IOPS that are being pushed through the storage, the slower the response time will be for other systems trying to access it. Even if one system has a relatively low demand on the disk, another with high demand will cause that low-demand one to have slower performance waiting with its foot tapping for the I/O request it sent to come back from the SAN.
This means that the number of IOPS is both an indicator of the SAN’s load and an affecter of its “performance.” It really depends upon the perspective we are examining in the situation.
Not all IOPS are created equal until they enter the storage system
The metric of I/O’s per second is one that involves several caveats alluded to earlier in this paper. What size are the I/Os themselves? What percentage of them consists of read operations vs. write operations? Are they seeking data that’s likely to be read from sequential areas of the storage or are they utterly random?
This is a screen capture from just one environment where the disconnect between IOPS and total I/O load is markedly visible.
Notice the fact that the first VM is showing only 22% more IOPS than the second VM, yet it’s showing 800% more disk throughput! Looking at throughput shows that the first VM is vastly heavier in disk I/O, yet if we were only considering IOPS, these two VMs on the same datastore would be almost indistinguishable between each other in their I/O needs. If these VMs have high disk latency (waiting long periods for their I/Os to return) how would you then determine which VM should be granted a more dedicated or higher-performance LUN for its operations in order to reduce that latency? Purely based on IOPS, it would practically be a coin toss. But since one VM is writing 64KB blocks and the other appears to be writing 12KB blocks, the difference is much more drastic.
Just looking at block sizes alone, a storage system’s I/O capacity will vary greatly even with the same number of IOPS. In short, the bigger each I/O, the fewer of them can go through the entirety of a storage system.
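The relationship at work here is simply throughput = IOPS × I/O size, which also lets you back out an average I/O size when a monitoring tool reports only IOPS and throughput. The sketch below uses two hypothetical VMs with identical IOPS; the names and numbers are made up for illustration and are not taken from the screen capture.

```python
def throughput_mb_s(iops: float, block_kb: float) -> float:
    """Throughput in MB/s for a given IOPS rate and I/O size in KB."""
    return iops * block_kb / 1024.0


def avg_block_kb(iops: float, mb_s: float) -> float:
    """Average I/O size in KB implied by an IOPS rate and a throughput."""
    return mb_s * 1024.0 / iops


# Hypothetical workloads: (IOPS, I/O size in KB).
workloads = {"VM A": (1000, 64), "VM B": (1000, 4)}

for name, (iops, block_kb) in workloads.items():
    mb_s = throughput_mb_s(iops, block_kb)
    print(f"{name}: {iops} IOPS at {block_kb}KB = {mb_s:.1f} MB/s")
```

Both VMs would look identical on an IOPS-only dashboard, yet one is moving sixteen times as much data as the other.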
Conversely, the throughput potential of the entire system actually increases with larger block sizes. Much like enabling jumbo frames on an Ethernet connection can decrease overhead and improve maximum throughput for larger data transfers, the same can be true for disk I/O in certain circumstances.
Note that the total amount of I/O data throughput (essentially, the storage bandwidth) of a storage system plateaus at a certain block size and does not continue to improve. This is a perfect example of how the principal limitation of a storage system can be the total number of bytes that can be written to or read from it each second and not so much the number of times a read or write operation can be executed on that system per second.
It’s important to note that this differentiation ends once an I/O enters the physical storage device itself. Once inside the storage processor, the block sizes are consistent. However, when examining the end-to-end performance of a storage solution, this differentiation is vital.
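One simple way to picture the plateau is to treat the system as having two ceilings, a maximum number of operations per second and a maximum number of bytes per second, with whichever is hit first setting the result. This is a simplified model with assumed limits (`max_iops` and `max_mb_s` are hypothetical parameters); a real array bends toward its limits more gradually.

```python
def achievable(block_kb: float, max_iops: float, max_mb_s: float):
    """Return (IOPS, MB/s) achievable at a given I/O size under two ceilings."""
    iops_limited_mb_s = max_iops * block_kb / 1024.0
    mb_s = min(iops_limited_mb_s, max_mb_s)
    iops = mb_s * 1024.0 / block_kb
    return iops, mb_s


# Assumed limits for illustration only.
MAX_IOPS = 20_000
MAX_MB_S = 400

for block_kb in (4, 16, 64, 256):
    iops, mb_s = achievable(block_kb, MAX_IOPS, MAX_MB_S)
    print(f"{block_kb:>3}KB: {iops:>7.0f} IOPS, {mb_s:6.1f} MB/s")
```

With these assumed limits, small I/Os are capped by the operation ceiling and throughput grows with block size. Past about 20KB the bandwidth ceiling takes over, throughput flattens at 400 MB/s, and the IOPS figure falls as the blocks get bigger.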
So when trying to design and configure storage, do you focus on maximum potential I/O’s or maximum throughput? The answer will depend on the following factors:
1) The class of operation
a. Higher performance expectations and SLA’s generally will demand low latency numbers, so storage must be not only optimal, but capable of sustained operation at high loads.
2) The type of data being accessed
a. Large quantities of tiny files or granular database operations will benefit from systems that perform better with small block sizes, yielding greater IOPS for when throughput is secondary.
b. Larger files or data access taking place in bigger chunks won’t take advantage of smaller block sizes, and therefore will not benefit from an IOPS-oriented design, instead needing top performance for sustained throughput.
Let's take a VDI infrastructure for example:
In most virtual desktop deployments, disk I/O tends to be very heavy during initial boot-storms with close to 99% of the IOPS being read from the boot image(s) that are very similar to one another. (Powering on a bunch of virtual desktops that all run the same operating system at the beginning of a work day.) Then, during normal operation the majority of IOPS are frequent, small, but very random writes alongside random reads. (Periodic saving of work files, writes to web browsing cache folders, email client activity when receiving messages, etc.) This kind of activity benefits heavily from caching whether on the spindles themselves, in the storage controller, or the HBA. During the boot periods all of the reads are from very similar or identical images, so caching means only one read IO from the spindles is needed for each sector, leaving the bottleneck to be the maximum throughput speed of the fabric to the SAN, provided the total caching size is large enough to store the full “golden” boot image within memory.
Once booted up, the virtual desktops issuing write commands will also benefit from write caching, allowing extra time for the spindles to keep up with receiving the I/Os. A 4GB write-back cache could provide space for over one million 4KB-sized I/O blocks (a common block size,) giving ample time for them to be written to disk in between bursts of disk activity. Once again, the bottleneck would usually come down to the media fabric between the storage and the hosts. So in the case of VDI (often touted as a very IOPS-intensive function,) a SAN with ample caching would support a great deal of VDI-oriented IOPS before suffering any kind of slowdown either from the cache filling up, or from the HBAs being unable to deliver data to the SAN fast enough.
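The cache arithmetic above is easy to check, and extending it slightly shows how long a write-back cache can absorb a burst before it fills. The incoming and drain rates below are hypothetical numbers picked for illustration, and the sketch assumes the whole cache is available for writes.

```python
GIB = 1024 ** 3
KIB = 1024


def blocks_in_cache(cache_bytes: int, block_bytes: int) -> int:
    """Number of whole I/O blocks that fit in the cache."""
    return cache_bytes // block_bytes


def seconds_until_full(cache_bytes: int, block_bytes: int,
                       incoming_iops: float, drain_iops: float) -> float:
    """Time a write burst can be absorbed before the cache fills."""
    if incoming_iops <= drain_iops:
        return float("inf")  # the spindles keep up; the cache never fills
    capacity = blocks_in_cache(cache_bytes, block_bytes)
    return capacity / (incoming_iops - drain_iops)


capacity = blocks_in_cache(4 * GIB, 4 * KIB)
print(f"4GB cache holds {capacity:,} blocks of 4KB")

# Assumed burst: 20,000 write IOPS arriving, spindles draining 5,000 IOPS.
burst = seconds_until_full(4 * GIB, 4 * KIB, incoming_iops=20_000, drain_iops=5_000)
print(f"Burst absorbed for about {burst:.0f} seconds")
```

The cache holds 1,048,576 blocks, so in this made-up burst it buys a little over a minute of headroom. That is plenty for short spikes of desktop activity, but it is no substitute for spindles that can keep up with sustained load.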
Storage architects will quickly (and correctly) note that any sort of caching will only improve I/O performance at burst speeds, and primarily for write operations. Read operations only occasionally benefit from caching because they depend on the data already existing in the cache’s memory, either from being read previously or from an algorithmic selection of data that is likely to be read in the near future. (The percentage of success in utilizing a cache resource is known as its “hit rate”.) So while this might work ideally in the case of VDI, the same configuration may not do nearly as much good for a large-scale database deployment.
Making the Most of IOPS
With all of the complexities involved with storage, the divergent nature of throughput against IOPS, and the fact that no two IOPS are the same, does this mean that IOPS is a worthless metric? Far from it! It’s merely one of many pieces of the puzzle of disk I/O. It also means that consistency is key, and that all the factors involved need to be accounted for in order to determine the performance requirements of an infrastructure, as well as the capabilities of a storage system to handle those requirements.
Consistency needs to come from the perspective of the storage vendor, so that when a system architect is looking to choose a storage solution, they have an even playing field. If a vendor promises a system capable of one million IOPS, and all of those IOPS are sequential, read-only with 100% cache success, and bursting for no longer than 10 seconds, then that information is at best, not helpful, and at worst, false advertising. An industry standard for overall I/O measurement is something that I believe the community should call for, but in the meantime, following these guidelines can help keep things in perspective when choosing a shared storage solution:
1) Assume the worst-case scenario and avoid over-simplification
Make sure that the storage vendor provides detailed information about the number, size, and type of IOPS the system can handle. Storage manufacturers tend to optimize their hardware and software for 512-byte transfers, to maximize their advertised IOPS rate. But if IOPS aren’t as important as throughput for your application, this won’t be optimal. Plus, 4KB or 8KB transfers are far more realistic to encounter in real-world applications.
Ensure that the numbers assume a 0% cache hit rate, eliminating potential false performance improvements due to rare “ideal” circumstances.
Arm yourself with information from application vendors ahead of time to work out what kind of I/O requirements you’re going to be expected to support within the infrastructure. Lots of tiny I/Os vs. a small number of very large transfers will place very different requirements on your storage.
2) Hold each solution to the same standard
Even if you’re unsure of the exact I/O needs of your application, use the same measuring stick between vendors and models.
Keep block sizes, stripe sizes, drive technologies and spindle counts consistent to narrow down the focus to the storage processor(s) and interfaces. Very frequently, Vendor A will use the exact same disks inside the SAN as Vendor B. Your focus, therefore, should be on working out how well their system can handle those disks.
3) Diversify, Diversify, Diversify
Don’t put all your eggs in one basket. If you have a complex environment with variable storage performance needs, you should not be held to just one storage solution or model. Even if you keep a consistent vendor, focus on application-level delivery when determining the storage configuration that will work best for you.
And even if you want to keep a consistent vendor AND model, diversify the SAN itself! Nearly all storage systems have provisions for more than one type of disk, storage processor, and communication medium.
Once a storage system is in place, following best practices in allocating space within it is critical to getting the most out of the deployment. For instance, when configuring LUNs it is important to divide and conquer. Do not put all of your disks into one logical array and expect great performance! Depending on the number of disks, the parity calculations alone could push your storage processors to 100% even with the lightest of I/O loads.
It is better to split things up. Put your I/O-light apps on inexpensive disks in parity configurations, and dedicate high-performance drives in their own LUNs for databases or other I/O–intensive operations. This can almost always be accomplished inside the same physical SAN.
Storage vendors will have their own technologies and methods for assigning the right amount of disks in the right configuration for best performance and most efficient distribution of available space. Make sure to work closely with the manufacturer whenever possible, since even someone with a great deal of storage experience can be taken by surprise.
4) Check your (and their) work
Verify that the promised amount of I/O capacity lines up with what should be theoretically possible given the hardware configuration proposed. Calculators like the one here by Marek Wołynko can help: http://www.wmarow.com/storage/strcalc.html
Utilize tools such as Iometer (http://www.iometer.org/) to push storage systems to their limits and see how well they perform under high-stress loads before they end up being put into production.
Once in production, closely monitor I/O responsiveness from the perspective of the systems utilizing the storage. Keep an eye on your latency values. No I/O operation should take longer than 50ms to round-trip within the storage environment in order to maintain the best performance, and even tighter tolerances will be needed for higher-tier operations.
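For the “theoretically possible” check, a common rule of thumb is to multiply spindle count by per-spindle IOPS and then discount writes by a RAID write penalty. The sketch below applies that rule to a 12-spindle array; the per-disk figure, the 70/30 read/write mix, and the penalty table are the usual planning assumptions rather than measurements, and the estimate ignores caching and storage processor limits.

```python
# Rule-of-thumb back-end I/Os generated per front-end write.
RAID_WRITE_PENALTY = {"RAID 0": 1, "RAID 10": 2, "RAID 5": 4, "RAID 6": 6}


def array_iops(disks: int, iops_per_disk: float,
               read_fraction: float, raid_level: str) -> float:
    """Estimate front-end IOPS an array can sustain for a read/write mix."""
    raw = disks * iops_per_disk
    penalty = RAID_WRITE_PENALTY[raid_level]
    write_fraction = 1.0 - read_fraction
    # Each read costs one back-end I/O; each write costs `penalty` of them.
    return raw / (read_fraction + write_fraction * penalty)


# Assumed: 12 spindles at ~180 IOPS each, 70% reads and 30% writes.
for level in RAID_WRITE_PENALTY:
    estimate = array_iops(12, 180, read_fraction=0.7, raid_level=level)
    print(f"{level}: ~{estimate:.0f} IOPS")
```

If a proposal promises several times what an estimate like this suggests for the same disks, the difference is presumably coming from cache hits or a friendlier workload than yours, and that is worth asking the vendor about.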
Conclusion
It is clear that IOPS as a term for storage performance is here to stay, regardless of any limitations that may exist with its use. The best way to keep on top of what it means to you and to your infrastructure is to remain informed. This whitepaper began as a blog post just about IOPS, but this topic is so enormous that I am going to keep this as a “work in progress” that I will keep updating with time.
Monday, March 19, 2012
Friday, March 9, 2012
Data Management Tips for 2012
- Stephen Chan, VP of Business Development and Co-Founder of ZL Technologies (www.zlti.com), says:
In a recent study, Gartner found that enterprise data is expected to grow by more than 650% over the next five years. As big data continues to proliferate at a staggering rate, the challenges associated with managing it are expected to grow in step. Traditional methods will prove futile or too costly. The market is shifting, and organizations are learning that the best approach to tackling the exponential influx of data is a unified solution that utilizes single-silo storage.
Here are some other trends to watch out for in the coming year:
Unstructured data is expanding. As we live in an increasingly digitalized world, more and more types of unstructured data are being introduced each day. In addition to emails and files, social media like Twitter and Facebook, SMS, BlackBerry messages, and other data types are rapidly entering the mix. But even as the types of data increase, maintaining all types in a unified silo becomes more and more important.
Email: The tail that wags the dog. While 80% of corporate data stored is unstructured, 80% of this unstructured data is comprised solely of email. It is the single largest data type within the organization. As a result, email archiving will become even more critical, ensuring that non-record emails are retained until properly classified or declared as a record. Unified control of emails at every stage is crucial to litigation, governance, records management, and business intelligence efforts because it provides the single source of truth for the business.
Stop keeping it forever. With the increase in data generated, there will also be increased attention on data disposition and lifecycle management. Proper classification and disposition policies must be established so that data is not stored forever. Unified control of lifecycle policies is the only way to ensure their effectiveness.
Tuesday, February 7, 2012
SSDs: An Essential Component Of An Efficient, Innovative Data Center
Q&A with Robert Jenkins, CTO of CloudSigma (www.cloudsigma.com):
Chris MacKinnon (DCP): Why is SSD storage useful in today's data centers?
Jenkins: A typical enterprise is managing more and more data all the time: according to Gartner, data worldwide is currently growing at a minimum rate of 59 percent annually. Data is so essential to most enterprises’ daily operations that their infrastructures must keep up with heavier workloads or else risk jeopardizing companies’ bottom lines. Enterprise data centers need a solution that satisfies their needs for increasing storage capacities while maintaining optimal performance. Solid State Drive (SSD) storage is part of that solution, helping eliminate storage bottlenecks.
For enterprise data centers, a major obstacle to efficient workload management is the inability of typical magnetic disk storage to handle flurries of Input/Output (I/O) operations caused by spikes in server activity. A highly viable solution to enterprise server strain is the strategic implementation of SSDs to house high-priority data, expand caching, etc. SSDs, which have no moving parts, are a perfect solution for I/O traffic spikes and other strains on server productivity because they help eliminate the potential for traffic bottlenecks. They also excel at random I/O operations that magnetic storage struggles with, making them ideal for databases, for example. SSDs thus create a healthier infrastructure with less risk of downtime, more accessible data, better performance, less CPU wait time and more predictable system performance.
MacKinnon: Why should data center and IT managers care about SSD? How can they benefit from it?
Jenkins: Data center and IT managers need to look into SSD storage as a viable solution to server inefficiency and sluggishness. By investing in SSD, they can expect to achieve a more performance-driven infrastructure that’s better suited to the needs of the modern enterprise. These same benefits hold true for operations in a cloud environment, and Infrastructure as a Service (IaaS) providers are realizing the same advantages of investing in SSD as part of a competitive offering for their customers. By placing the most critical data on SSDs, data center managers and public cloud IaaS providers effectively eliminate storage bottlenecks and reduce variable performance in server environments, producing infrastructures far more competent at handling large volumes of information flow.
MacKinnon: Where should SSD storage rank in terms of overall priority in the data center?
Jenkins: The importance of SSD storage is arguably subject to the individual needs of the enterprise. The more data processing needs an organization has, the more it needs a solution like SSD storage. Any organization utilizing databases and other operations requiring low-latency random I/O activity will see significant benefits from incorporating SSD storage into its infrastructure architecture. That being said, SSDs are an essential component of an efficient, innovative data center that can keep up with the demands of any enterprise. Nearly all enterprises will benefit from using SSD storage; what varies is the correct trade-off point between SSD and magnetic storage. Similarly, access to SSD capabilities is paramount for an IaaS provider that hopes to provide its customers with the kind of robust public cloud infrastructure that companies demand today.
If enterprise data centers and IaaS providers want to remain relevant in an age of big data, they need to seriously consider incorporating SSDs. With digital data expected to grow 48 percent this year from 2011 according to IDC, premium storage capabilities are all the more important. SSD storage is simply the best way to keep scaling storage solutions while maintaining or enhancing performance levels. SSD is therefore a significant contributor to an enterprise’s competitive edge.
MacKinnon: What are the biggest challenges for data center and IT managers when it comes to SSD storage? How can data center and IT managers overcome those challenges?
Jenkins: The biggest challenge involved with implementing SSD storage is overcoming the cost barrier. In most cases, the high cost of SSDs is mitigated by the increase in performance yielded and the higher levels of competitiveness achieved for the provider. SSD storage is best viewed as an investment. The short-term cost-benefit ratio might not be as obvious as the long-term one. Regardless, data center managers will immediately notice the performance benefits of SSD, which are then passed on to the enterprise. Without time-wasting bottlenecks and performance lag, companies can complete tasks more efficiently and avoid incurring latency, saving on other resources like CPU and offsetting the cost. Saved CPU time alone can recover a significant proportion of the higher SSD storage cost. Additionally, the lower power footprint and lower heat emissions of SSD storage reap further savings for companies managing their own data centers. Ultimately, if an enterprise can better execute its core business operations, then SSD storage is worth the extra expense.
A further consideration is SSD storage lifetime, which previously had been limited. Falling costs and improved management systems on drives mean that the cost per GB of write activity on SSD is falling rapidly and is already a fraction of what it was only one or two years ago.
One effective way for enterprise IT managers to overcome the cost barrier is to seek out IaaS providers who offer SSD capability, rather than trying to maintain in-house hardware with expensive SSDs. IaaS offerings provide services equivalent to or better than in-house hardware, at a more manageable price point. Turning CAPEX into OPEX is particularly appealing when considering SSD storage options.
Some IT managers are wary about IaaS because they view it as incapable of meeting their hardware demands. However, today’s most cutting edge IaaS providers offer optimal scalability, flexibility and performance, in part thanks to SSD capability.
MacKinnon: What advice can you give to IT and data center managers that have a plethora of similar solutions to choose from?
Jenkins: Some alternative options could include extra standard magnetic disk drives and/or hybrid drives. Standard magnetic disk drives are adequate for lower-priority data storage, and are best implemented in conjunction with SSDs, which are suited for higher-priority data, where instant access is necessary to optimize performance levels. Enterprises who try to manage more data by adding more magnetic drives will face storage sprawl, under- and over-provisioning of resources and a poor cost-benefit ratio. A server system entirely outfitted with magnetic disk drives is simply not prepared to handle the large amounts of data flow that today’s enterprise must manage.
Hybrid drives, which incorporate both a hard drive and flash memory (or in some cases SSD), have not really been implemented on a large scale for enterprise data centers, and with fairly good reason. Primarily, hybrid drives are not especially suited for enterprise-class servers because data retrieval is not optimal; there is a slower data search time with a hybrid drive because its data storage is dynamic, i.e. the data moves around frequently. Hybrid drives are slowly entering the enterprise market, but it remains to be seen whether this technology could be an adequate competitor to pure SSD.
There is currently no better solution for enterprise data centers than to strategically implement both traditional magnetic drives as well as SSDs. This ensures that priority data is accessed quickly and efficiently, thereby creating a performance-driven infrastructure with the highest cost-benefit ratio.
When evaluating any storage solution, a true total cost of ownership (TCO) calculation should be employed. This is particularly true when considering SSD storage where power and cooling savings can be significant. Each enterprise then needs to calculate the commercial benefit of increased performance and the business opportunities that SSD can help them deliver on. Combining the two processes will result in the right mix of SSD and magnetic storage solutions.
When working in a cloud environment, the most effective way to deliver the benefits of SSD storage to an enterprise is via an IaaS provider. This relationship maximizes the benefits for the data center manager, the IaaS provider and ultimately the enterprise IT manager. For data center managers, IaaS delivery ensures that their servers will provide an efficient and effective service for enterprises at a fraction of the cost of in-house hardware. IaaS providers who offer SSD capability therefore have a distinguishing competitive edge in the market. The enterprise IT manager sees both the cost and performance benefit; they no longer have to manage their own expensive in-house hardware, which can be very distracting from the business’ core competencies, and the public cloud-delivered infrastructure service carries with it the benefits achieved through data center SSD implementation, plus the added bonus of high cost-effectiveness. SSD storage is the key solution for IaaS providers who hope to provide the greatest storage and performance capacity on the market today.
Labels:
Storage
Friday, February 3, 2012
Removing Data Waste from Virtualized Storage
Brad Bonn and Alex Rosemblat from VKernel (www.vkernel.com):
Forgotten data objects from virtual machines can clog up virtualized storage. Cleaning up this waste will reclaim storage that can then be reused.
Optimizing resource usage is significantly more challenging in a virtual environment than in a physical one. The ability to create virtual machines rapidly, while a key driver for enhanced agility and return on investment, is also the main cause of this challenge. To further intensify the problem, related data objects such as snapshots or additional VM images are sometimes created for each VM, increasing the amount of storage consumed by virtualization. Unused snapshots, templates, abandoned VM images and Zombie VMs all contribute to wasted CPU, memory, throughput and, most importantly, storage resources. Yet locating and reclaiming these resources is not always a simple task. This whitepaper walks through each of the data objects that can create waste and describes how to clean up unused data objects so as to increase data center ROI.
Types of Waste that Occur in Virtual Infrastructure
Abandoned VM Images
When a VM is deleted from VMware vCenter, Microsoft Systems Center, Red Hat Enterprise Management (RHEM) or another VM management console, it must also be deleted from the disk. Otherwise, the VM listing is no longer present in the management console, but the accompanying VM image still exists in storage. If this scenario occurs, the result is an abandoned VM image which resides in storage and consumes space but is no longer in use. Theoretically, proper operating procedures should ensure that each time a system administrator deletes a virtual machine from the management console, that administrator also repeats the same action with the storage array. But this is not always the case.
Due to a variety of reasons, a VM image may not get deleted from storage although it has been deleted from the management console. This can occur when:
- A VMware Storage vMotion fails and a file is not completely moved to another datastore. In this case, it is possible that either the old VMDK file will be left where it was, or that the new, partially copied VMDK file will be left on the new datastore. In either case, vCenter will not know about the file. Storage vMotion can fail if the new host doesn’t have the same configuration as the old one did, or if there wasn’t sufficient disk space. Additionally, some users configure vMotion to be completely automated, so unless an error log is checked, it would be difficult to know that a vMotion had failed.
- System administrators manually copy and paste VM images to move them and forget to delete the old file.
- VM images are copied in lieu of using templates and a VM image that isn’t necessary is not deleted.
- A third party backup or storage snapshot taking tool is duplicating VM image files. The management console will not know about additional VM image files created in this way.
Detecting abandoned VM images is accomplished through a reconciliation of the VMs listed in the management console against the VM images reported within storage. These files are not easy to identify manually because the VM image name may not always match the VM name. Abandoned VM image files are often called orphaned files or orphaned VMs.
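The reconciliation itself amounts to a set difference. The sketch below assumes the two inputs have already been exported: a mapping from each registered VM to the image paths in its configuration, and a listing of image files found on the datastores. The function name, the inventory format, and the example paths are all hypothetical; a real tool would pull both lists from the management console and the storage layer.

```python
# Reconcile the VMs a management console knows about against images in storage.
# Inventory format and paths are hypothetical, for illustration only.


def find_orphaned_images(registered: dict[str, list[str]], images_on_disk: list[str]) -> list[str]:
    """Return image paths on disk that no registered VM references.

    Matching is done on image paths rather than names, because an image
    file name may not match the name of the VM that owns it.
    """
    referenced = {path for paths in registered.values() for path in paths}
    return sorted(path for path in set(images_on_disk) if path not in referenced)


if __name__ == "__main__":
    console_inventory = {
        "web01": ["[ds1] web01/web01.vmdk"],
        "db01": ["[ds2] sql-prod/disk0.vmdk"],  # image name differs from VM name
    }
    datastore_listing = [
        "[ds1] web01/web01.vmdk",
        "[ds2] sql-prod/disk0.vmdk",
        "[ds2] old-test/old-test.vmdk",  # no VM points here: candidate orphan
    ]
    for path in find_orphaned_images(console_inventory, datastore_listing):
        print("candidate orphan:", path)
```

The output is a list of candidates only; images created by backup or SAN snapshot tools would also appear here and need to be ruled out before anything is deleted.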
Sources of Savings:
Finding and deleting orphaned VMs is important to reclaim and free up storage. Also, deleting these files will liberate software licenses which can be reused.
Powered-Off VMs
Powered-off VMs are just what the name implies: VMs that are powered off. There is nothing wrong with a powered-off VM unless it is an indication of a VM that is no longer required. The longer a VM has been powered down and not used, the more likely it is that it is no longer needed. It is also possible that a zombie (also known as an idle) VM is identified and, rather than being deleted, is powered off to be dealt with later. As it is no longer in use, it then becomes a powered-off VM and still takes up storage resources which could be reclaimed.
Finding this Waste:
Reporting on the number of days a VM has been powered-off is the starting point to determining whether a VM is still required. Once the candidate list of powered-off VMs that are no longer needed is identified, some amount of detective work and cross referencing the VM with an inventory report will be required prior to deleting the file.
The key to detecting powered-off virtual machines that are no longer used is the ability to exclude a VM from the analysis going forward if a VM administrator determines that the powered-off image is needed. Otherwise, any reporting mechanism on powered-off VMs will eventually become cluttered with long unused powered-off VMs obscuring the next level of detail and hiding powered-off VMs that need to be deleted.
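A minimal sketch of such a report, assuming the date each VM was powered off is available from the management console (the data shape, the function name, and the 30-day threshold are assumptions for illustration):

```python
# Report VMs that have been powered off for a long time, honoring an exclusion list.
from datetime import date
from typing import Optional


def stale_powered_off(
    power_off_dates: dict[str, Optional[date]],
    today: date,
    min_days: int,
    excluded: set[str],
) -> list[tuple[str, int]]:
    """Return (vm_name, days_off) for VMs off at least `min_days`, longest first."""
    candidates = []
    for name, off_since in power_off_dates.items():
        if off_since is None or name in excluded:
            continue  # still running, or an administrator marked it as needed
        days_off = (today - off_since).days
        if days_off >= min_days:
            candidates.append((name, days_off))
    return sorted(candidates, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    inventory = {
        "build-agent-7": date(2011, 9, 14),
        "dr-standby": date(2011, 1, 1),   # kept off on purpose
        "web01": None,                    # powered on
    }
    report = stale_powered_off(inventory, date(2012, 2, 3), 30, excluded={"dr-standby"})
    for name, days in report:
        print(f"{name}: powered off for {days} days")
```

Keeping the exclusion list as an explicit input is what stops intentionally dormant VMs from cluttering every later run of the report.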
Sources of Savings:
Deleting powered-off VMs is important to not only free up storage space, but also to release any software licenses that the VM is taking up.
Unused Snapshots
A snapshot captures the state of a virtual machine at a particular time and is used for backup and recovery purposes. A snapshot is similar to a desktop restore point. VM administrators will typically take a snapshot of a VM image prior to updating or changing a particular VM, to provide for rollback should the upgrade not go according to plan.
Once the patch has been successfully applied, system administrators are supposed to incorporate the snapshot back into the VM configuration which will remove the snapshot and make the changes permanent.
Theoretically, this sounds easy. But there are a few key issues with snapshots that can make managing snapshots particularly troublesome for administrators:
- Snapshots can take up all the available storage without an administrator knowing it.
- Some environments have multiple snapshots for a particular VM.
- Finding what snapshots are in storage, when they were made, and if they are still being used requires additional work and tools to report on.
- Snapshots for a VM that has been deleted may remain, and will be difficult to cross-reference to the VM that no longer exists.
- The SAN can take its own snapshots with its software.
- System administrators simply forget about snapshots that were taken, or the snapshot was taken by another user and not reported.
Finding this Waste:
The key to finding unused snapshots is to look at the age of the snapshot and then remove the older snapshots for a particular VM when it has been deemed that that snapshot is out of date.
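One way to picture an age-based review is the sketch below, which groups snapshots older than a cutoff by VM and totals the space they hold. The `Snapshot` record, its fields, and the 30-day cutoff are hypothetical; the actual metadata available depends on the hypervisor and storage tooling.

```python
# Group snapshots older than a cutoff by VM so they can be reviewed for removal.
from collections import defaultdict
from datetime import datetime, timedelta
from typing import NamedTuple


class Snapshot(NamedTuple):
    vm: str
    name: str
    created: datetime
    size_gb: float


def old_snapshots_by_vm(
    snapshots: list[Snapshot], now: datetime, max_age: timedelta
) -> dict[str, list[Snapshot]]:
    """Return snapshots older than `max_age`, keyed by VM, oldest first."""
    grouped = defaultdict(list)
    for snap in sorted(snapshots, key=lambda s: s.created):
        if now - snap.created > max_age:
            grouped[snap.vm].append(snap)
    return dict(grouped)


if __name__ == "__main__":
    snaps = [
        Snapshot("web01", "pre-patch", datetime(2011, 11, 1), 4.0),
        Snapshot("web01", "yesterday", datetime(2012, 2, 2), 1.0),
        Snapshot("db01", "before-upgrade", datetime(2011, 6, 15), 22.5),
    ]
    old = old_snapshots_by_vm(snaps, datetime(2012, 2, 3), timedelta(days=30))
    for vm, items in old.items():
        total = sum(s.size_gb for s in items)
        print(f"{vm}: {len(items)} old snapshot(s), {total:.1f} GB to review")
```

Age alone only produces a review list; whether a given snapshot is actually out of date is still a judgment for the VM's owner.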
Sources of Savings:
Deleting unused snapshots will free up storage space.
Unused Template Images
A template is a base image that is used to quickly create virtual machines that are identical. Templates drive compliance for operating system images, patch levels and installed software. Each time the operating system or application changes, a new template needs to be created. If a template that is out of date is not deleted, it will consume storage resources unnecessarily. Unused template images can become a significant source of wasted storage.
Finding this Waste:
The key to finding unused templates is to look at the age of the template and then check to see that the template is still valid.
Sources of Savings:
Deleting unused templates will free up storage space.
Zombie (also known as Idle) VMs
Zombie VMs are virtual machines that are still running but have reached the end of their production lifecycle. This is most likely to happen in volume in environments where:
- VMs are created by end users themselves.
- A communication loop has not been closed between an end user and a system administrator, and the end user has not informed the system administrator that a particular VM is no longer being used.
- QA or development teams can spin up VMs on their own without central administration oversight.
- End customers can deploy VMs automatically in cloud initiatives.
- Applications running on a VM are offline and the application owner has not yet noted that the applications are down.
Finding this Waste:
Zombie VMs are tough to detect. They are powered on and appear to have a load on them. However, as a Zombie, the load may be low. But even low load is not a reliable way to close in on a potential Zombie VM. The deviation of the load over time is the best way to separate a Zombie, for example, from a DNS server that simply rarely has a load on it.
Once the list of potential Zombie VMs is identified, cross referencing the VM information with inventory information is the next step to determine if a VM is truly a Zombie or is doing useful work. VMs that appear to be Zombies but are not should then be tagged as such to prevent them from being identified over and over again as potential Zombies. Importantly, if the Zombie VM is simply powered-off instead of being deleted, it will still take up storage resources, and will be noted as a powered-off VM.
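The idea of looking at the deviation of load, and of tagging false positives so they stop reappearing, can be sketched as follows. The CPU samples, both thresholds, and the function name are assumptions chosen for illustration, not values from any VKernel product.

```python
# Flag VMs whose load is both low and nearly flat over time.
from statistics import mean, pstdev


def zombie_candidates(
    cpu_samples: dict[str, list[float]],
    max_mean_pct: float = 5.0,    # assumed ceiling for "low load"
    max_stdev_pct: float = 1.0,   # assumed ceiling for "load never varies"
    known_good: frozenset = frozenset(),
) -> list[str]:
    """Return names of VMs with a low average load and very little deviation."""
    flagged = []
    for name, samples in cpu_samples.items():
        if name in known_good or len(samples) < 2:
            continue  # tagged as legitimate, or not enough history to judge
        if mean(samples) <= max_mean_pct and pstdev(samples) <= max_stdev_pct:
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    history = {
        "forgotten-test": [2.0, 2.1, 1.9, 2.0],         # flat: likely Zombie
        "dns01": [1.0, 1.0, 12.0, 1.0, 1.0, 9.0],       # low but bursty: real work
        "sql-prod": [60.0, 70.0, 65.0],                 # clearly busy
    }
    print(zombie_candidates(history))
```

In this example the DNS server has a low average but is not flagged, because its occasional bursts give it a much larger deviation than the flat-lined test VM.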
Sources of Savings:
Cleaning up Zombies frees up CPU, memory, throughput, and storage for reuse. In addition, each operating Zombie VM consumes licenses for operating systems and other software that could be used elsewhere.
Labels:
Storage,
Virtualization
Wednesday, January 4, 2012
The Magic of Mobile Cloud
- Lori MacVittie, senior technical marketing manager at F5 Networks (www.f5.com), says:
Mark my words, the term “mobile” is the noun (or is it a verb? Depends on the context, doesn’t it?) that will replace “cloud” as the most used and abused and misapplied term in technology in the coming year.
If I were to find a pitch in my inbox that did not in some way invoke the term “mobile” I’d be surprised. The latest one to catch my eye was pitching a survey on the “mobile cloud”. The idea, apparently, around this pitch involving “mobile cloud” is the miraculous capability bestowed upon cloud deployed services to automagically perform synchronization and storage tasks.
The proliferation of mobile devices has created demand for services that allow users to access personal data and content from any device at any time. Mobile cloud services are emerging that synchronise data across multiple mobile devices with centralised storage in the cloud.
While the statement regarding demand is true, the follow-on assertion is at best inaccurate, at worst it is false. There are no services, in the cloud or anywhere else, that can synchronize data across multiple devices. Oh, services may be emerging that claim to do so, but they can’t and don’t. Without fail, services “in the cloud” are invoked from the client – each individual client, mind you – and without that initiating event a cloud service would no more be able to synchronize data than previous incarnations of mobile services when we called them hosted applications.
SERVICE-SIDE PUSH
This is because the underlying technology used to access these services is still, regardless of the interface presented, the web. It’s an API. It’s HTTP. It’s a client-server paradigm that hasn’t changed very much since it rose to ascendancy as the preferred application architectural model back in the last century. The reason SPDY has started to gain attention and mindshare is not necessarily because it’s faster (that’s a plus, mind you, but it’s not the whole enchilada) but because of its bidirectional communication capabilities. SPDY can push to clients in a way that HTTP has never really been able to do, though many have tried. They’ve come close with approximations and solutions that to the untrained user appear to be a “push” but in reality they are little more than “dragging out a pull response.” And yet SPDY is still constrained in the same way as traditional HTTP: the client must initiate the connection.
The capability to push from the service-side does not and will not imbue “cloud services” of any kind with the ability to initiate actions, because the “cloud” cannot push to a client unless a connection is already established. And who initiates connections? That’s right, clients.
The only entity that could make a claim that it could initiate anything on a mobile device would be a service provider. That’s because they are the only ones who can actually find and connect to a device on-demand – and then it’s only their devices on their mobile networks. And then they’d best only do that if it’s (1) part of their terms of service or (2) the user specifically checked a box allowing them (or their service) to do so.
But consider the impracticality of “service-side push” to clients to synchronize data. Client devices are, well, mobile. That means their connectivity is not assured. “Always on” is a misnomer. Yes, the device is always on in a way the PC has never been, but it’s also in stand-by mode, which often means the radio – its means of communication – is off. This little fact is a problem for performance-focused IT, and it’s even more troublesome to those who’d like to create a service-side “push”. So Bob uploads a photo to a “cloud storage” service and the service wants to synchronize it with Bob’s other (configured by Bob, of course) devices. So the service starts sending out messages to try to connect to Bob’s other devices.
Right. One is turned off, the other is in flight mode to prevent his three-year-old from purchasing God only knows what apps through the Android market, and the third? It’s in standby, the radio is off.
That’s not the way it works today and it certainly shouldn’t be the way it works in the future. It’s a waste of processing power, of bandwidth, of resources in general. The client will eventually be online and will open a session with the “cloud service” and ask it for updates.
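A toy model of that client-initiated pattern, with no real networking: the service only ever answers requests, and each device pulls whatever it missed once it is online. The class names and the cursor mechanism are hypothetical, chosen only to illustrate the flow.

```python
# Toy model of client-initiated synchronization: the service never calls out.


class SyncService:
    """Stand-in for a cloud storage service that only responds to requests."""

    def __init__(self):
        self._changes = []  # append-only log of uploaded item names

    def upload(self, item):
        self._changes.append(item)

    def changes_since(self, cursor):
        """Return items added after `cursor`, plus the new cursor position."""
        return self._changes[cursor:], len(self._changes)


class Device:
    """A client that synchronizes only when it is online and chooses to ask."""

    def __init__(self, name):
        self.name = name
        self.online = False
        self.items = []
        self._cursor = 0

    def sync(self, service):
        if not self.online:
            return 0  # radio off: nothing can reach this device
        new_items, self._cursor = service.changes_since(self._cursor)
        self.items.extend(new_items)
        return len(new_items)


if __name__ == "__main__":
    service = SyncService()
    tablet = Device("tablet")
    service.upload("photo.jpg")              # Bob uploads from another device
    print(tablet.sync(service))              # tablet is in standby, gets nothing
    tablet.online = True
    print(tablet.sync(service), tablet.items)  # tablet asks and receives the photo
```

Nothing in `SyncService` holds a reference to any device, which is the point: synchronization happens only because a client opened the conversation.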
MOBILE CLOUD
Whether applications use web technologies because of the reality that clients are not “always on” or because it’s the model (client initiated and more importantly to them, controlled) most familiar and acceptable to consumers, reality is that mobile devices and clients leverage web technologies to store, share, and synchronize data across services. The “mobile cloud” and its alleged ability to “synchronize data across devices” is little more than cloud washing, as is the term “mobile cloud” itself which some have tried to claim is defined by the way in which a device accesses its services. From differentiation between network type (wired versus wireless) to the client-model (thin client browser versus thick client application), some continue to try to make the case that there exists some “mobile cloud” that is completely different than that of the “regular old cloud.”
There is not. The web is the web, the presentation layer of an application (thick or thin) does not define its server-side technological model, and service-side push (and control) remains yet another marketing phrase that describes capabilities in a way that is not technically accurate and which ultimately sets unrealistic expectations for consumers – and in the enterprise, IT.
The notion that you’d build a “mobile cloud” that is somehow separate from the “regular cloud” is preposterous precisely because it contradicts the purported purpose for building it: synchronization and “access from anywhere.” It’s that “anywhere” requirement that makes a mobile cloud as realistic as unicorns. If I upload a photo to I should be able to access – and thus synchronize – from any device, and that includes my laptop or desktop PC, the latter of which is certainly not “mobile”.
These assertions that a mobile cloud exists only serve to reinforce the heretofore unknown Clarke’s Third (and a half) Law: Any sufficiently advanced web technology is indistinguishable from cloud in the eyes of the marketing department.
Mark my words, the term “mobile” is the noun (or is it a verb? Depends on the context, doesn’t it?) that will replace “cloud” as the most used and abused and misapplied term in technology in the coming year.
If I was to find a pitch in my inbox that did not someway invoke the term “mobile” I’d be surprised. The latest one to catch my eye was pitching a survey on the “mobile cloud”. The idea, apparently, around this pitch involving “mobile cloud” is the miraculous capability bestowed upon cloud deployed services to automagically perform synchronization and storage tasks.
The proliferation of mobile devices has created demand for services that allow users to access personal data and content from any device at any time. Mobile cloud services are emerging that synchronise data across multiple mobile devices with centralised storage in the cloud.
While the statement regarding demand is true, the follow-on assertion is at best inaccurate, at worst it is false. There are no services, in the cloud or anywhere else, that can synchronize data across multiple devices. Oh, services may be emerging that claim to do so, but they can’t and don’t. Without fail, services “in the cloud” are invoked from the client – each individual client, mind you – and without that initiating event a cloud service would no more be able to synchronize data than previous incarnations of mobile services when we called them hosted applications.
SERVICE-SIDE PUSH
This is because the underlying technology used to access these services is still, regardless of the interface presented, the web. It’s an API. It’s HTTP. It’s a client-server paradigm that hasn’t changed very much since it rose to ascendancy as the preferred application architectural model back in the last century. The reason SPDY has started to gain attention and mindshare is not necessarily because it’s faster (that’s a plus, mind you, but it’s not the whole enchilada) but because of its bidirectional communication capabilities. SPDY can push to clients in a way that HTTP has never really been able to do, though many have tried. They’ve come close with approximations and solutions that to the untrained user appear to be a “push” but in reality they are little more than “dragging out a pull response.” And yet SPDY is still constrained in the same way as traditional HTTP: the client must initiate the connection.
The capability to push from the service-side does not and will not imbue “cloud services” of any kind with the ability to initiate actions, because the “cloud” cannot push to a client unless a connection is already established. And who initiates connections? That’s right, clients.
The only entity that could make a claim that it could initiate anything on a mobile device would be a service provider. That’s because they are the only ones who can actually find and connect to a device on-demand – and then it’s only their devices on their mobile networks. And then they’d best only do that if it’s (1) part of their terms of service or (2) the user specifically checked a box allowing them (or their service) to do so.
But consider the impracticality of “service-side push” to clients to synchronize data. Client devices are, well, mobile. That means their connectivity is not assured. “Always on” is a misnomer. Yes, the device is always on in a way the PC has never been, but it’s also in stand-by mode, which often means the radio – its means of communication – is off. This little fact is a problem for performance-focused IT, and it’s even more troublesome to those who’d like to create a service-side “push”. So Bob uploads a photo to a “cloud storage” service and the service wants to synchronize it with Bob’s other (configured by Bob, of course) devices. So the service starts sending out messages to try to connect to Bob’s other devices.
Right. One is turned off, another is in flight mode to prevent his three-year-old from purchasing God only knows what apps through the Android market, and the third? It’s in standby, the radio is off.
That’s not the way it works today and it certainly shouldn’t be the way it works in the future. It’s a waste of processing power, of bandwidth, of resources in general. The client will eventually be online and will open a session with the “cloud service” and ask it for updates.
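The client-initiated model described here can be sketched in a few lines. This is a minimal illustration, not any real service's API: `CloudService`, `Device`, `changes_since` and the cursor scheme are all hypothetical names invented for the example. The point it shows is that the service only ever answers requests; a device that is offline simply catches up the next time it asks.

```python
from dataclasses import dataclass, field


@dataclass
class CloudService:
    """Hypothetical service: it stores changes but never contacts a device."""
    changes: list = field(default_factory=list)  # append-only change log

    def upload(self, item: str) -> None:
        self.changes.append(item)

    def changes_since(self, cursor: int):
        """Answer a client's request with everything after its cursor."""
        return self.changes[cursor:], len(self.changes)


@dataclass
class Device:
    """Hypothetical client device that pulls updates when it is online."""
    name: str
    online: bool = True
    cursor: int = 0                      # how much of the log this device has seen
    local: list = field(default_factory=list)

    def sync(self, service: CloudService) -> int:
        """Client-initiated pull; returns the number of new items fetched."""
        if not self.online:              # radio off, flight mode, powered down
            return 0
        new_items, self.cursor = service.changes_since(self.cursor)
        self.local.extend(new_items)
        return len(new_items)


service = CloudService()
phone = Device("phone")
tablet = Device("tablet", online=False)  # in standby, radio off

service.upload("photo.jpg")              # Bob uploads from some other device
phone.sync(service)                      # phone asks for updates and gets the photo
tablet.sync(service)                     # tablet is unreachable; nothing happens

tablet.online = True                     # later, the tablet wakes up...
tablet.sync(service)                     # ...and pulls the photo itself
```

Nothing in the sketch requires the service to know where a device is or whether it is reachable, which is why the pull model survives radios being switched off.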
MOBILE CLOUD
Whether applications use web technologies because of the reality that clients are not “always on” or because it’s the model (client initiated and, more importantly to them, controlled) most familiar and acceptable to consumers, the reality is that mobile devices and clients leverage web technologies to store, share, and synchronize data across services. The “mobile cloud” and its alleged ability to “synchronize data across devices” is little more than cloud washing, as is the term “mobile cloud” itself, which some have tried to claim is defined by the way in which a device accesses its services. From differentiation between network type (wired versus wireless) to the client model (thin client browser versus thick client application), some continue to try to make the case that there exists some “mobile cloud” that is completely different from the “regular old cloud.”
There is not. The web is the web; the presentation layer of an application (thick or thin) does not define its server-side technological model; and service-side push (and control) remains yet another marketing phrase that describes capabilities in a way that is not technically accurate and that ultimately sets unrealistic expectations for consumers – and in the enterprise, IT.
The notion that you’d build a “mobile cloud” that is somehow separate from the “regular cloud” is preposterous precisely because it contradicts the purported purpose for building it: synchronization and “access from anywhere.” It’s that “anywhere” requirement that makes a mobile cloud as realistic as unicorns. If I upload a photo to
These assertions that a mobile cloud exists only serve to reinforce the heretofore unknown Clarke’s Third (and a half) Law: Any sufficiently advanced web technology is indistinguishable from cloud in the eyes of the marketing department.
Labels:
cloud computing,
Storage
Wednesday, December 14, 2011
Tackling Big Data Through a New Approach to Distributed, Unified Storage Systems
- Bryan Bogensberger, GM Ceph Distributed Storage and vice president of Business Strategy at DreamHost (http://dreamhost.com/), says:
In this era of Big Data, enterprise storage needs are increasing at an exponential rate. With the advent of the cloud, IT professionals are expected to provide instant access to data, as well as auto-scaling to handle ever-growing data stores. However, incumbent storage systems are often proprietary and expensive to maintain. As the deluge of Big Data in the enterprise continues to grow, these storage systems are being left further behind, simply unable to scale to meet today’s data needs.
To meet the growing challenge of Big Data, IT professionals are looking for a new distributed, unified storage system capable of massive scaling. Going beyond just a file system, it is essential to also include object storage and block storage to ensure optimum flexibility and performance. In an object-based file system, file metadata is separated from file data, and the data is then split into flexible-sized data containers called objects. One advantage to object-based architecture is the elimination of metadata bottlenecks: because the metadata is stored separately from the data, metadata servers are contacted only once when the file is accessed. Block storage provides a reliable, scalable disk interface for compute within cloud and legacy environments.
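To make the separation of metadata from data concrete, here is a small sketch. It is an illustration only and not how any particular product lays out data: `store_file`, `read_file`, the object naming scheme and the tiny fixed `OBJECT_SIZE` are all assumptions made for readability (the text above describes flexible-sized objects; a fixed size keeps the example short).

```python
OBJECT_SIZE = 4  # bytes per object; assumed, and tiny so the example is readable


def store_file(name: str, data: bytes, object_store: dict, metadata: dict) -> None:
    """Split data into objects and record only the layout in the metadata store."""
    object_ids = []
    for offset in range(0, len(data), OBJECT_SIZE):
        chunk = data[offset:offset + OBJECT_SIZE]
        object_id = f"{name}.{offset // OBJECT_SIZE:08d}"  # hypothetical naming
        object_store[object_id] = chunk
        object_ids.append(object_id)
    metadata[name] = {"size": len(data), "objects": object_ids}


def read_file(name: str, object_store: dict, metadata: dict) -> bytes:
    """One metadata lookup, then the data comes straight from the object store."""
    layout = metadata[name]
    return b"".join(object_store[oid] for oid in layout["objects"])


objects, meta = {}, {}
store_file("report.txt", b"hello world", objects, meta)
contents = read_file("report.txt", objects, meta)
```

The metadata store is consulted once per read; everything after that touches only the objects, which is the bottleneck-avoidance property the paragraph describes.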
Ceph is an emerging technology designed to meet rapidly expanding storage demands, including those of Big Data: a distributed storage system designed to be massively scalable and self-managing, with no single point of failure. Ceph is unique because it offers block storage, object storage, and a POSIX-compliant file system all in one. Able to seamlessly scale from gigabytes to exabytes and beyond, Ceph is designed to handle extreme workloads (for example, tens of thousands of clients simultaneously accessing the same file), a usage scenario that brings typical incumbent storage systems to their knees. Ceph uses the CRUSH algorithm for efficient, scalable and highly specific data placement. Ceph runs on commodity hardware, with intelligent storage nodes and no single point of failure. Because it is open source, Ceph is free to use and can be easily integrated into existing architectures.
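The idea of computing data placement rather than looking it up in a central table can be illustrated with a much simpler technique. The sketch below is not CRUSH; it uses rendezvous (highest-random-weight) hashing as a stand-in, and the `place` function and node names are hypothetical. What it shares with the approach described above is that any client can work out where an object lives from the object's name and the node list alone.

```python
import hashlib


def place(object_id: str, nodes: list, replicas: int = 2) -> list:
    """Pick `replicas` nodes for an object by ranking nodes on a per-object hash."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{node}:{object_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    # Highest scores win; every client computes the same ranking independently.
    return sorted(nodes, key=score, reverse=True)[:replicas]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
placement = place("report.txt.00000000", nodes)
```

A useful property of this stand-in is that removing a node which does not hold a given object leaves that object's placement unchanged, so only the data on the failed node has to move.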
Originally created as an open-source project by DreamHost co-founder Sage Weil, Ceph is making waves in the industry, and its team is working closely with Dell and contributing to the OpenStack project. Ceph was included in the Linux kernel in 2009. The Ceph team continues to work on improving the entire storage system, and they recently announced a project roadmap explaining their vision for the future of Ceph, including Hadoop integration and full RADOS Block Device support within OpenStack.
As the amount of data being generated and stored in the enterprise continues to grow, IT professionals will be forced to abandon their crumbling, legacy storage systems that simply cannot handle today’s data needs. The next generation of distributed, unified storage systems is the answer for the enterprise’s Big Data needs. When evaluating storage systems, companies should look for massive scalability (to the exabyte level and beyond), self management for lower management costs, and maximum reliability that will ensure that the organization will have ready access to all of its expanding pool of data.
Labels:
Storage
Monday, December 5, 2011
New Software Helps Companies Manage “Big Data”
- Ken Cheney, vice president for business development and marketing, Likewise (http://www.likewise.com/), says:
While digital data in all forms is more than doubling every two years, IDC predicted (in 2008) that the annual growth rate for unstructured data in data centers would exceed 60 percent through 2012. More recent estimates indicate that IDC’s prediction was somewhat conservative. One estimate indicates that by 2012 unstructured data will consume 80 percent of data center storage. (Unstructured data includes financial files, medical records, office documents, media and big data files.)
According to Ken Cheney, vice president of business development and marketing for Likewise, “some 40 percent of unstructured data is classified as sensitive and only 14 percent of organizations have a plan for managing that data.” To address this growing challenge, Likewise announced Likewise Data Analytics and Governance software, now available in a public beta, which gives organizations greater visibility into their unstructured data for improved security, auditing and compliance.
Industry analysts, Storage Strategies Now, wrote in a report: Most enterprise organizations have little understanding of their unstructured data. The risks and costs due to this lack of understanding include losing valuable data, not effectively exploiting assets, security risks and the inability to meet compliance, legal, and regulatory requirements.
Likewise Data Analytics and Governance enables organizations to implement a set of automated best practices to secure and manage unstructured data. The application uses analytics to contextualize data with user identity, sensitivity, and other information to mitigate risks, reduce costs and create value.
The software can help organizations understand performance and usage across storage pools, categorize unstructured data to create new applications or lines of business, and exploit data to maximize revenue. Companies can consolidate reporting across data silos, enforce consistent access policies, and manage entitlements from a single web console. The result is a global hierarchical view of an organization’s unstructured data that can identify and remediate root causes of security, performance and access issues.
"The problem with unstructured data has grown exponentially over time. It can seem insurmountable, but companies must get their arms around the sensitive data contained in these files,” said Ginny Roth, analyst, Enterprise Strategy Group. “Without the ability to have some glimpse into this data in the wild, companies will be increasingly vulnerable to high profile breaches."
The new Likewise application integrates with the Likewise Storage Services platform used by OEM network attached storage (NAS) vendors such as HP and EMC-Isilon, and has adapters that support NetApp, EMC-Celerra and other NAS filers. The beta version is available for qualifying customers with pricing that starts at $18,000.
The Likewise Storage Services platform, used by such OEM storage vendors as HP and EMC Isilon, offers a consistent security model for file-based access and cross-platform, unified storage across physical, virtual and cloud environments. Likewise Storage Services provides integrated identity and access management, as well as secure access to data from Windows, Unix and Linux systems. Supported protocols include SMB/CIFS 1.0, 2.0, 2.1, NFS 3.0, and a RESTful API. Likewise Storage Services is available with a commercial license from Likewise Software.
Labels:
Data Protection,
Storage
Wednesday, November 16, 2011
Manage Data Growth and Strategically Adopt Virtualization
- Yogesh Agrawal, Vice President and General Manager, FileStore Product Group, Symantec (www.symantec.com), says:
Maintaining performance and scalability of storage systems while controlling associated costs are key challenges businesses face due to explosive growth of unstructured data especially in virtualization, cloud and archiving infrastructures.
Symantec FileStore N8300™, the latest version of Symantec’s clustered Network Attached Storage appliance, is designed to help customers address business challenges associated with building virtual environments and cloud storage, and managing large volumes of data while controlling associated storage costs. With linear performance and scalability, FileStore N8300 is positioned to address even the most demanding file serving and emerging web-based services. It can be configured to survive multiple node failures, ensuring the high availability of data and continued uptime of data operations with redistribution of workloads across clusters. FileStore N8300 is modularly scalable and can scale up to 16 active-active nodes, up to 256 TB of file system capacity, and supports 1.4 PB of total storage on the backend. The fully redundant connections between clustered nodes and back-end storage arrays help to avoid performance bottlenecks. The system may grow, shrink, or be reconfigured, all online and without downtime or interruption to business operations, and built-in replication ensures non-disruptive operations in case of a disaster.
FileStore N8300 is VMware-certified and enables organizations to fully benefit from their virtualization investments with independent server and storage scaling, efficient provisioning of virtual machines and advanced storage optimization capabilities. Additionally, FileStore N8300 offers organizations simplified manageability of their storage infrastructure from within VMware vCenter or with a View plug-in.
FileStore N8300 also natively integrates with other Symantec products to provide a holistic solution for effective storage and management of data. FileStore N8300 provides automatic storage tiering with Veritas Storage Foundation SmartTier feature, malware protection with Symantec Anti-Virus and faster backup through integration with Symantec NetBackup. In addition, FileStore N8300 and Symantec Enterprise Vault provide an ideal storage infrastructure for end-to-end archiving.
To learn more about Symantec FileStore N8300, visit the FileStore N8300 Product Page at http://go.symantec.com/filestore.
Labels:
Enterprise Network,
Storage
Thursday, November 10, 2011
Tom Buiocchi’s Storage Predictions for 2012: Infiltration of ‘Small Data’ and a New Kind of Cloud
- Tom Buiocchi, CEO of Drobo (www.drobo.com), says:
The pace of change in the storage industry is going to accelerate in 2012. Cloud strategies are evolving rapidly, solid-state media will have its day, and Big Data technologies will find their way to “Small Data” customers. Any vendor with an old school product line is going to learn some new lessons the hard way in 2012. Among Drobo’s 2012 predictions are:
- It’s the end of cloud storage as we know it today. Pure cloud adoption will become less common than a hybrid approach that tightly integrates public and private cloud architectures with modern on-premise storage systems. This trend will hold true for both home users and small-medium businesses (SMBs). According to recent cloud usage research conducted by Drobo, 96 percent of SMBs (up to 500 employees) report they will store at least 50 percent of their data on-site for a minimum of the next three years. Factors cited included cloud performance, security and reliability concerns. Both businesses and individuals did state that they wanted tighter and more automated integration between their on-site data and their cloud provider. The cloud is going to have one foot on the ground for some time to come.
- “Small Data” eclipses Big Data in importance. Today there is big buzz around Big Data, but the fact of the matter is Big Data is relevant to only the largest of companies and data hoarders—similar to the perspective that only one percent of the population owns 99 percent of the nation’s wealth. It’s the one person, family or business having to navigate the protection and management of their own data that affects the largest group of people—100 million individuals and small businesses nationwide alone. This is the more pervasive problem (when compared to Big Data), and it highlights a persistent oversight of the entrenched, legacy storage system vendors that focus on the one percent while under-serving the “little guy.” The numbers are too big to ignore—while Big Data will continue as a top issue in 2012, it’s the “Small Data” opportunity that will explode.
- Consumerization of IT continues as enterprise storage features hit the SMB and home user market. It happened with PCs years ago and now it’s happening with tablets. In 2012 it will happen with personal and small business storage. Automated data protection, advanced thin provisioning, and powerful data-tiering with solid-state drives (SSD) are among the innovative technologies that entered the enterprise market first, but in 2012 they will further permeate home and small business offices. Will most new home or small office users know how to describe these cool, geeky storage features? Probably not, but they will know that storage has never been so easy to use, reliable and fast. 2012 will be the year that the idea of storage for the rest of us takes on a larger role in our lives, better protecting our rapidly growing digital universe.
As for Drobo, the coming year will see the “Drobo Invasion” continue—bringing advanced technologies and unprecedented ease-of-use to more and more businesses. Its most recent Drobo for business solution, the Drobo B1200i, is shipping now and features technological breakthroughs and an unrivaled combination of automation, advanced features and affordability.
Labels:
cloud computing,
Storage
Thursday, October 27, 2011
Storage Knows No Boundaries
- Ken Cheney, vice president of marketing and business development at Likewise Software (www.likewise.com), says:
While many segments of the economy struggle, virtualization, cloud computing and an explosion of unstructured data in the enterprise are shifting the landscape for storage providers, creating more market opportunities than ever before.
We are experiencing this first hand at Likewise Software, where we recently announced that Likewise will add 25 percent to our workforce in sales, marketing and software engineering. Back in early June, we announced Likewise Storage Services, an integrated software platform for identity, security and storage. Built for OEM consumption, Likewise Storage Services supports both CIFS and NFS, as well as “mixed-mode” multi-protocol deployments with consistent identity and security across both CIFS and NFS for file objects. A few weeks later, we announced our decision to sell our Active Directory Bridge business unit to BeyondTrust. Now with our focus 100 percent on storage, we are poised to go after this market with gusto.
And we’re not alone at the storage party. EMC, HP, Microsoft and VMware are among the big industry players turning up the heat in storage. Recently, Microsoft has been making a lot of noise around its Server Message Block (SMB) 2.2 for Windows Server 8 – its big, bold platform play for virtualization and the cloud. SMB 2.2 has huge implications for the storage industry and BILLIONS are at stake for those who miss out.
The limitations of previous versions of CIFS / SMB led to the widespread adoption of block-based storage for applications and virtualization. With SMB 2.2, file-based storage becomes not only a credible option, but the recommended option for provisioning Microsoft workloads. This will result in a sea change in the storage industry.
For months Likewise has been quietly working to build support for SMB 2.2 for our OEM customers. We just announced a licensing agreement with Microsoft to add support for SMB 2.2 in Likewise Storage Services. This will allow our OEM customers to better manage the growth of unstructured data by providing consistent access control for files across physical, virtual and cloud environments. Despite the momentum we feel at Likewise, we are keenly aware that we’ve only scratched the surface of the huge and growing opportunity in storage software.
Labels:
Storage
Tuesday, October 25, 2011
Retaining Data At The Lowest Possible Cost And Efficiency Scale
- Deirdre Mahon, vice president of marketing at RainStor (www.rainstor.com), says:
Probably the single most challenging part of proactively managing the data center is the strategy and planning around IT infrastructure: how much capacity is required to retain existing enterprise data, plus future storage capacity requirements. Most organizations today retain enterprise data for many years, and in fact many never actually delete the data – once transacted, it is retained from “now on.” This places a burden on IT, which must keep data online and available for continuous query and analysis while also providing fast access to the external regulators that govern how long data must be retained.
Typically, IT keeps the data in the system in which it was originally transacted until that system is no longer used and becomes legacy, even though the data still needs to be retained and accessed. Increasing demands from the business to query this data force IT to keep it in expensive systems that require costly DBA resources to maintain over time. What is required instead is more diligent information life-cycle management, with enforced policies on how long data is retained in enterprise production environments; this will ultimately make IT much more efficient and satisfy both the needs of the business and the IT budget. Offloading large volumes of transactional data from production to a dedicated online archive is key to retaining Big Data at the lowest possible cost and at efficient scale.
IT needs to be more rigorous with data management and infrastructure technology choices and the resultant expenditures. Gone are the days where traditional relational or analytical environments are the only option to keep data secure, available and online for business query. There is no longer a one-size fits all approach to managing enterprise data. In the last decade, there has been tremendous innovation in the world of data management and we have witnessed rapid adoption of NoSQL, In-memory, Columnar and Hadoop/MapReduce as ways to corral the ever-growing volume of multi-structured enterprise data. Whilst IT is struggling to transform this data into actionable information for the business, it is very important to not lose sight of the overall cost of storing and retaining this data, which will become even more pronounced as volumes continue to escalate.
A right-tiering approach to how data is managed and stored is required and deploying best-of-breed purpose-built technologies to satisfy the specific business need is what IT needs to focus on.
Analysts continue to report that Big Data is on the rise. IDC says the amount of data will grow 44 times by 2020, and the amount of digital information created and replicated rose by 62 percent in 2010 to nearly 800,000 petabytes, which would fill a stack of DVDs reaching from the earth to the moon and back. By 2020, that pile of DVDs would stretch halfway to Mars.
In terms of RainStor’s rank in overall data center priorities, it’s high, given the speed of enterprise data growth. As our world continues to become more digital, the Big Data deluge will drive an increasing need for additional data center storage across all industries, including communications, healthcare, financial services, SmartGrid utilities, security, etc. This will place new levels of stress on our data centers, systems and infrastructures.
Central to RainStor’s unique product capabilities is the ability to compress and de-duplicate large data sets, enabling reduction ratios that are typically 40:1, rising to 100:1 with some data, through the use of four distinct, yet complementary, techniques. With RainStor’s data reduction capabilities, organizations can significantly reduce overall storage costs and enable a data center to run much more efficiently.
The four techniques include field level de-duplication, pattern level de-duplication, algorithmic and byte level compression. These don’t result in any loss of detail; instead, RainStor stores each record as a series of pointers to the location of a single instance of data value or pattern of data values.
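The pointer idea can be shown with a toy value-level de-duplication store. This is an illustrative sketch only and makes no claim about RainStor's actual implementation: the `DedupStore` class, its fields and the sample records are invented for the example, and it covers just one of the four techniques (storing each distinct field value once).

```python
class DedupStore:
    """Toy value-level de-duplication: each distinct value is stored once."""

    def __init__(self):
        self.values = []        # single stored instance of every distinct value
        self.index = {}         # value -> position in self.values
        self.records = []       # each record is a tuple of pointers (positions)

    def add(self, record: tuple) -> None:
        pointers = []
        for value in record:
            if value not in self.index:
                self.index[value] = len(self.values)
                self.values.append(value)
            pointers.append(self.index[value])
        self.records.append(tuple(pointers))

    def get(self, row: int) -> tuple:
        """Rebuild the original record by following its pointers (lossless)."""
        return tuple(self.values[p] for p in self.records[row])


store = DedupStore()
store.add(("2011-10-25", "NYC", "completed"))
store.add(("2011-10-25", "NYC", "dropped"))
store.add(("2011-10-25", "BOS", "completed"))
```

Three records with nine fields end up as five stored values plus small integer pointers, and `store.get(1)` still returns the second record exactly as it was loaded. The more often values repeat, the larger the saving.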
RainStor offers a new class of Big Data repository, focused on long-term Big Data retention with continuous query access. With RainStor, data centers can go on a “Big Data Diet” or in other words, reduce the storage capacity and cost to keep large volumes of data online. For example, you can offload 180-day-old+ data from production to RainStor for your online archive, and retain query and analysis capabilities via standard SQL and various BI tools. RainStor achieves this at a much lower cost per terabyte stored. By having virtually unlimited amounts of data online and available, you eliminate the need for tape archive and therefore the time delay and manual effort to retrieve data from tape, which is risky especially if data sets are large and schemas have changed since the time the data was offloaded.
Data center and IT managers should carefully consider a tiered infrastructure and data management strategy to retain and store critical enterprise data for both business and external regulatory requirements. RainStor’s patented technology is primarily focused on reducing the amount of data stored, which also significantly reduces overall storage costs, and you can run on low-cost commodity hardware enabling you to lower overall total cost of retained data. Let’s look at the key benefits to RainStor’s unique capabilities.
RainStor benefits enterprise data centers in the following ways:
- Dramatically reduces the cost and complexity of storing large volumes of historical structured and semi-structured data compared to traditional databases
- Provides continuous access to historical data, which enables organizations to meet compliance regulations and to give business users access to broader data sets for ongoing analytics and BI
- Allows organizations to retain historical data, on-premise, via public or private cloud and hybrid storage
- Enables you to better control your data assets by auto-deleting records based on compliance retention rules.
Most large organizations today retain data for many years, and a 2011 DBTA survey reveals that many retain it forever. They will benefit from the following capabilities:
- Specific use-cases would include compliance data retention, query and reporting and situations where you need to archive legacy application data on systems you are retiring due to consolidation or modernization efforts.
- Continuous online access to larger and broader data sets that are query-able through standard SQL or BI tools whereby you can re-instate older data into production analytics environments for better results
- Ability to compress or reduce data sets to a smaller, manageable footprint (~40 to 1 or greater) in order to reduce overall storage costs and scale as data volumes inevitably grow
- Ability to retain specific data sets by pre-configured business rules, which allow organizations to easily purge data at exactly the right time. (Keeping data longer than required makes little sense and can in some cases be risky so automating this keeps data retention costs down.)
- Ability to run on a broad range of hardware and operating systems, which ensures future flexibility
- Compressing and reducing data by 95 percent or more means a smaller storage footprint, which provides significant savings for on-premise data center deployments and is even more economically attractive with cloud deployments.
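Reduction ratios and percentage savings are two ways of stating the same thing, and it is easy to mix them up. The short sketch below shows the arithmetic; the function names are chosen for this example only.

```python
def percent_saved(ratio: float) -> float:
    """Space saved, in percent, for a reduction ratio of `ratio`:1."""
    return (1 - 1 / ratio) * 100


def ratio_for(percent: float) -> float:
    """Reduction ratio (the N in N:1) needed to save `percent` of the space."""
    return 1 / (1 - percent / 100)


for ratio in (20, 40, 100):
    print(f"{ratio}:1 saves about {percent_saved(ratio):.1f} percent")
```

A 40:1 ratio corresponds to a saving of roughly 97.5 percent, 100:1 to roughly 99 percent, and a 95 percent saving corresponds to a ratio of about 20:1.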
Big Data presents a challenge for IT and is particularly pronounced in key industries including communications, financial services, utilities and healthcare because they are governed by external regulatory requirements for retaining and providing quick access to historical data for audits, reports and business analysis. IT must select the best technology solutions available to keep data for extended periods of time and more importantly, in the most cost effective way. For large global organizations, keeping and storing large volumes of data is a sunk cost, and doing so in the most efficient way is critical to staying ahead of the competition.
Investing in technology that compresses data at a high rate, satisfies stringent compliance and government regulations, scales easily and keeps data queryable is critical for these organizations. RainStor solves this problem by delivering a unique technology capability that ultimately reduces the data footprint and makes the problem 10x less costly when compared to a traditional database approach. Operational systems often become bloated over time with historical data sets, which can be offloaded to a RainStor archive for continuous data access. This is an alternative to putting data on tape, which is risky because re-instating the data to the original system can be challenging, especially if it is voluminous. Large data sets can also be offloaded from data warehouse repositories to RainStor, where that historical data can later be pulled back into the core BI system if deeper analysis is required in the future.
RainStor’s IP lies in its unique compression capabilities: it uses a tree-based structure, or “binary tree,” to store data, linking the various instances of patterns together to establish data records. This means that the original records can be reconstituted at any time. This de-duplication process also means that the bigger the data set, the higher the probability that values and patterns will be repeated, and the greater the level of compression that can be achieved when loaded.
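The general idea that repeated values make de-duplication more effective can be illustrated with a toy calculation. The sketch below is not RainStor's algorithm and does not build a tree. It only counts how many field values a set of records contains versus how many distinct values would need to be stored once, and the function name is a hypothetical chosen for illustration.

```python
def dedup_ratio(records):
    """Ratio of total field values to distinct field values.

    `records` is a list of tuples. A higher ratio means more repetition,
    and so more to gain from storing each distinct value only once.
    """
    total = sum(len(record) for record in records)
    distinct = len({value for record in records for value in record})
    if distinct == 0:
        raise ValueError("records must contain at least one value")
    return total / distinct


# Three illustrative records: "alice" and "NY" each appear twice.
sample = [("alice", "NY"), ("bob", "NY"), ("alice", "LA")]
ratio = dedup_ratio(sample)  # 6 values, 4 distinct
```

As more records that reuse the same names and locations are added, the total grows faster than the set of distinct values, so the ratio rises.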
Take a look at this video by RainStor’s Chief Architect, which explains how extreme data compression is achieved to deliver significant reduction in storage footprint for cost-efficient Big Data retention:
Thursday, September 22, 2011
Meeting the Challenge of Information Overload
- Balaji Srinivasan, director of Microsoft Exchange products at Sherpa Software (www.sherpasoftware.com), says:
The amount of electronic data flowing through organizations is growing at an incredible rate. Much of this information is collected and stored. According to a whitepaper published by Osterman Research, 75 percent of the information end users need to do their jobs is stored in email. The consequences of this are numerous, and include typical data management issues such as the cost of storage and difficulties with backup and recovery. In today’s heavily regulated environment, there are more significant challenges associated with ensuring all corporate data meets relevant organization and industry requirements and is accessible for legal and eDiscovery purposes.
Where is all this data coming from? Email has been an ongoing culprit. Despite the rise of other methods of communication, email remains the primary means of corporate communication and continues to grow and generate the vast amount of data being retained and managed by IT departments. In a recent report, Osterman Research found that the average email system message store size had increased by more than 25 percent during the past 12 months for nearly half of organizations. The firm further estimated that storage-related issues such as increasing message size, increasing backup and restore times, and lack of messaging-related disk space constitute three out of the five leading problems in managing messaging systems.
These issues, in particular the “slowness” of email, have created a need for a more immediate means of communication, resulting in the rise in use of instant messaging and social media. However, corporate information shared over instant messages and social networks is subject to the same regulatory and compliance requirements as email and other corporate data. As organizations grapple with the right corporate social media strategy, the fact remains that social media is turning into another channel through which information is distributed, and it warrants monitoring.
The drop in the cost of storage devices has led to another trend. Rather than taking the time to clean up their environment, individuals and organizations seemingly retain more and more, potentially unneeded, data. With the increased adoption of document management applications such as Microsoft SharePoint, duplication of data across multiple repositories, such as files stored both in network shares and also in SharePoint, is steadily on the rise.
Security is another obvious concern when looking at managing large volumes of data across multiple repositories. Although data leakage has received a lot of attention recently through the activities of WikiLeaks, it is an age-old problem. With all the different avenues for easily extracting and sharing information, from physical media such as thumb drives to technologies such as email, instant messaging and social media outlets, there are an increasing number of ways for information to leave an organization.
While organizations have been dealing with many of these challenges for a number of years, the sheer volume of data involved makes managing them a daunting task. Recent regulatory changes, such as updates to the Federal Rules of Civil Procedure (FRCP), along with more industry-specific requirements such as HIPAA covering healthcare, the Sarbanes-Oxley Act (SOX) covering publicly traded companies, and the US Securities and Exchange Commission rules covering the financial industry, have eliminated unpreparedness as an excuse for not meeting data collection and retention requirements. The consequences for failing to produce information can be crushing.
The remainder of this article will provide strategies IT administrators can use to alleviate some of this burden and better prepare their organizations to proactively meet these challenges. The first step is defining corporate policies around information management. This task certainly falls under the cliché of easier said than done, but it cannot be emphasized enough that this is absolutely necessary. A policy provides the framework for an information management strategy. It also justifies the actions IT will need to take to control corporate data.
There is plenty of information available on creating corporate policies. Depending on the size of the organization, this process should involve members from several departments. A clear and thorough policy definition makes for easier compliance and enforcement, so spend the necessary amount of time in this phase.
Backed by the threat of repercussions that can be as severe as termination, corporate policies can be used to compel employee adherence. For instance, a policy could require that no company information be shared on social media sites. Although such a rule is difficult to monitor, the policy provides the cover to take appropriate action when a failure to comply is detected.
Given today’s eDiscovery and regulatory requirements, relying on users to comply with corporate policies is often not good enough. IT administrators need to have systems and processes in place to proactively manage corporate data. As has been well documented, the exponential growth of email isn’t showing signs of lessening. Although hosted email solutions seem to be gaining some steam, a majority of organizations still host and manage their own email infrastructure. The burden of management rests with the internal IT department.
When investigating an email policy enforcement system, exploring native tools is a good place to start. Most email platforms include some basic management capabilities. Microsoft Exchange, for instance, going as far back as Microsoft Exchange Server 5.5, has included a utility called Mailbox Manager to help enforce elementary retention policies. With each new version of Exchange, there have been improved management capabilities, with Exchange 2010 incorporating some of the most advanced built-in capabilities to date. In situations where the built-in capabilities are not sufficient or are unable to meet an organization’s management needs, there are products available from third-party vendors that specialize in email management and can be used to augment or fulfill these needs.
On the topic of managing email, especially in Microsoft Exchange environments, one cannot ignore PST files. PST files are local archives created by end users of server-based email using the Microsoft Outlook email client. Since these are created locally, access to them, and quite often even their very existence, is beyond the purview of the IT administrator. Locating, managing and investigating the content within PST files can be a monumental task.
If your organization currently uses or has used PST files, it is wise to consider the use of a third-party email management product to assist with the task of identifying and locating all PST files across the company’s network storage and users’ desktops. Once they are located, you can use native or third-party tools to enforce your company’s email policy. If your policy calls for eliminating PST files from your environment, there are group policy options available to prevent the creation of PST files and to prevent the addition of email data to existing PST files. Third-party solutions can also assist in other areas such as ensuring compliance with corporate instant messaging policies.
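To get a rough first inventory before evaluating any product, the locating step can be approximated with a short script. The sketch below is a minimal example using only the Python standard library. The function name is hypothetical, it assumes the share or drive is already mounted and readable by the account running it, and it identifies PST files by file extension only, so renamed files will be missed.

```python
import os


def find_pst_files(root):
    """Yield (path, size_in_bytes) for each file ending in .pst under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Extension match is case-insensitive: Archive.PST counts too.
            if name.lower().endswith(".pst"):
                path = os.path.join(dirpath, name)
                try:
                    size = os.path.getsize(path)
                except OSError:
                    # Skip files that vanish or cannot be read mid-scan.
                    continue
                yield path, size
```

Pointing `find_pst_files` at a mounted share or a user profile directory and sorting the results by size gives a quick sense of where the largest local archives live. It does not open or inspect the content of the files.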
If the use of instant messaging is approved and necessary within your organization, it is advisable to deploy an approved corporate-wide instant messaging solution such as Microsoft Lync (formerly called Office Communicator) or IBM Lotus Sametime and disallow the use of other instant messaging options. This can be fairly easily enforced at the corporate network level to ensure compliance. Most corporate instant messaging solutions offer options to archive and store communication transcript history, providing a mechanism to capture and retain that information to comply with your organization’s messaging policy requirements.
If disabling the use of public instant messaging is not an option, capturing information transmitted across these channels will be a challenge. Here is where third-party solutions can assist. They are typically deployed as an appliance at the perimeter of the corporate network, which collects all instant message traffic and provides the data in several formats that can then be ingested into data repositories.
Network file shares and document management systems such as SharePoint are another place where hoarded data can accumulate and hide. There are a number of network storage devices that include advanced capabilities such as deduplication and enforcing user quota limits. Although these technologies assist in limiting data overload, mining the content in these collections is another challenge entirely. There are a number of search and indexing solutions available, including some built into the native platforms, that could alleviate the burden of managing this data. However, an obstacle encountered by a number of administrators is the need to perform a consistent search across all sources of data in an organization. Third-party archiving and eDiscovery solutions are well suited to these types of business challenges.
IT administrators are well aware of many of these challenges. To address information overload, the best place to start is by creating a comprehensive and clear corporate policy regarding data storage and retention. It is important to investigate the native capabilities of the data platforms. Where these fail to meet the requirements of organizational policy guidelines, consider third-party solutions to augment these capabilities. Good luck!
For more information, visit www.sherpasoftware.com.