PUE lives on with Revised Metric

June 21st, 2010

When folks from the data center industry got together about 3-5 years ago to create a data center efficiency metric, we knew that we should tie it to the actual work being done within the data center (i.e. transactions per watt, IOPS/watt, FLOPS/watt…). However, every data center, and even more so every computer, does different work and would therefore need a different metric. For example, a science research computer might complete one transaction per month, with a lot of network and storage traffic for that one big “transaction”, while an eBay data center might run thousands of transactions per second on one computer system.

So, we came up with a compromise, knowing that all data centers and their workloads were different, yet needing something to push us as an industry toward higher efficiency. Well, the holy grail of data center metrics got released: P U E. Yes, Power Usage Effectiveness. It was only a start and a best compromise, and we knew we needed to improve upon it or come up with something better, yet it has had perhaps more influence on the energy efficiency of our data centers than any other metric or industry movement.

PUE is total facility power divided by the power drawn by the computer hardware. The hardware load always counts as 1.0 and everything else (the infrastructure) sits on top of that, so the more power the infrastructure uses, the higher the PUE. Improving PUE therefore only affects the infrastructure side of the data center, not the hardware or software. Our data centers have been averaging above 2.0. (At 2.0 the infrastructure power load equals the server load; above 2.0 the infrastructure uses more power than the servers.) A recent EPA report covering about 200 data centers across the US last year shows that we are averaging north of 2.0. Other studies show that we had been averaging around 3.0 worldwide, so we have improved greatly but can still improve so much more. While 2.0 is much better than 3.0, using 50% less power for the infrastructure, we know we should be able to achieve PUEs of 1.5 or lower anywhere in the world, at any Tier level. At 2.0 we are using at least double the power we need for the non-hardware loads.
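
To make the arithmetic concrete, here is a minimal sketch of the ratio behind PUE. The function names and the 1 MW hardware load are illustrative assumptions, not figures from the EPA report or any other study mentioned above.

```python
def pue(it_kw: float, infrastructure_kw: float) -> float:
    """PUE = total facility power / IT (hardware) power."""
    if it_kw <= 0:
        raise ValueError("IT power must be positive")
    return (it_kw + infrastructure_kw) / it_kw


def overhead_kw(it_kw: float, pue_value: float) -> float:
    """Infrastructure (non-IT) power implied by a given PUE."""
    return it_kw * (pue_value - 1.0)


IT_KW = 1000.0  # hypothetical 1 MW hardware load

for value in (3.0, 2.0, 1.5):
    print(f"PUE {value}: {overhead_kw(IT_KW, value):.0f} kW of infrastructure load")
```

Going from 3.0 to 2.0 halves the infrastructure load, and going from 2.0 to 1.5 halves it again, which is where the "50% less" and "double" comparisons come from.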

One problem with the PUE metric is that it is instantaneous: it measures power, not energy. All data center power usage fluctuates with weather and usage, and what we really care about to reduce costs is total energy use over a period of time. Energy is power use over time, while power is instantaneous. Otherwise, PUE could be measured on the coldest day of the year, when all systems are running most efficiently, but that is not a good gauge of annual energy use and thus costs. In all of the projects I get involved with, and all of the PUEs I quote, I use total annual energy instead of a one-time power measurement, with the hardware load measured at the rack so that UPS, PDU and other electrical distribution losses are counted within PUE. So, PUE should be an annual average, and that is exactly what member representatives of Green Grid, SVLG, 7×24, EPA, DOE, USGBC, ASHRAE and UpTime recommended in December of this year at a meeting in DC. I provided recommendations from the SVLG along with Chris Page, Scott Noteboom and Tim Crawford, who represented the SVLG at this meeting, with input from Olivie Sanche of Apple and many others.
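
As a rough illustration of why the measurement period matters, the sketch below compares single-month PUEs with an energy-based annual PUE. The monthly kWh readings are made-up numbers for a hypothetical site, chosen only to show a seasonal swing.

```python
def annual_pue(total_kwh, it_kwh):
    """Energy-based PUE: total facility kWh over IT kWh for the whole period."""
    return sum(total_kwh) / sum(it_kwh)


# Hypothetical monthly meter readings (kWh), January through December.
IT_KWH = [720_000] * 12
TOTAL_KWH = [900_000, 900_000, 950_000, 1_000_000, 1_100_000, 1_250_000,
             1_300_000, 1_300_000, 1_150_000, 1_000_000, 950_000, 900_000]

monthly = [total / it for total, it in zip(TOTAL_KWH, IT_KWH)]

print(f"best month:  {min(monthly):.2f}")
print(f"worst month: {max(monthly):.2f}")
print(f"annual PUE:  {annual_pue(TOTAL_KWH, IT_KWH):.2f}")
```

Quoting the best winter month would flatter this site; the annual figure is the one that tracks the yearly energy bill.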

Essentially the outcome was a revised PUE metric that now measures annual energy use of infrastructure and IT load, which is fantastic! It also brings a little more clarity on how PUE should be measured and what should and should not be included. (For example, on-site power generation should never reduce one’s PUE, as energy in is energy in, regardless of source.) We’ll soon see PUE and PUE with subscripts 1, 2 and 3, which clarify where the server load was measured (UPS output, PDU or rack). Ideally, we’d all be measuring at the rack input, but many folks do not have this metering and monitoring capability, so the compromise was to accept any of these points of measurement.
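
Here is a small sketch of how the point of measurement changes the reported number for the same facility. The loss fractions for the PDU and the rack cabling are assumptions picked for illustration, and the labels describe the metering location rather than any official subscript.

```python
def pue_by_measurement_point(total_kwh, rack_kwh, pdu_loss=0.03, cable_loss=0.01):
    """Return PUE computed with IT energy metered at three different points.

    pdu_loss and cable_loss are assumed fractions of the energy entering the
    PDU and the rack cabling, respectively, that never reach the hardware.
    """
    pdu_output_kwh = rack_kwh / (1.0 - cable_loss)
    ups_output_kwh = pdu_output_kwh / (1.0 - pdu_loss)
    return {
        "UPS output": total_kwh / ups_output_kwh,
        "PDU output": total_kwh / pdu_output_kwh,
        "rack input": total_kwh / rack_kwh,
    }


# Hypothetical annual totals for one data center.
results = pue_by_measurement_point(total_kwh=12_000_000, rack_kwh=8_000_000)
for point, value in results.items():
    print(f"IT load metered at {point}: PUE {value:.3f}")
```

Metering further upstream counts distribution losses as IT load, so the PUE looks better than it does when metered at the rack.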

Even though the location of measurement will affect the measured PUE (different measuring locations will result in different PUEs for the same data center), at least it’s an improvement, and it will hopefully drive folks to measure at the rack, the most accurate location. It will also drive us to think about annual usage and costs rather than one-time or instantaneous readings, another big improvement in how we think about buying and operational decisions. These are the keys to improving efficiency and reducing costs: long-term measurement, long-term constant improvement, and buying decisions based on long-term economic analysis.

Our new PUE metric should re-invigorate PUE discussions, comparisons, and improvements, perhaps driving us all to lower PUEs regardless of the actual resulting PUE and type of data center. After all, we all gain when we each improve.

Goat Power, Forward prices of electricity, actual needs and the Green Data Center Conference

June 17th, 2010

For years I’ve spoken about the forward cost of energy: energy costs are increasing, and climate legislation will affect the costs of some forms of energy generation. One of the key things I look for when I complete site selections for clients is the “forward” cost of electricity. In net-present-value terms, a site can turn out much more expensive than a comparable site even though its price is lower today. This is because predicted electricity price increases vary by market, depending upon legislation, regulation, emissions requirements, and fuel prices. Since every utility has a different mix of fuel sources, and each state has a different utility regulator, as well as different debt obligations, cost recovery and other factors, future utility prices will vary quite a bit. I believe that utilities with high carbon intensity or other emissions from their power generation will see higher price increases than utilities with lower carbon intensity per kWh. We’ve seen this in Northern Virginia, where electricity prices have increased significantly over the last several years. I think we’ll see the same in other high-carbon states, specifically those with high coal production, such as North Carolina, Colorado, Texas, etc. Factor the forward price of electricity, not just the current price, into your site selection analysis.
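
A minimal sketch of that comparison follows. Every number in it (prices, escalation rates, discount rate, load and time horizon) is a hypothetical input, not data for any real utility or site.

```python
def npv_electricity_cost(annual_kwh, price_today, escalation, discount_rate, years):
    """Net present value of electricity purchases over a number of years.

    The first year is billed at price_today; each later year the price rises
    by `escalation`. Costs are discounted at the end of each year.
    """
    total = 0.0
    for year in range(1, years + 1):
        price = price_today * (1.0 + escalation) ** (year - 1)
        total += annual_kwh * price / (1.0 + discount_rate) ** year
    return total


ANNUAL_KWH = 10_000 * 8760  # hypothetical 10 MW average load

# Site A is cheaper today but assumed to escalate faster than site B.
site_a = npv_electricity_cost(ANNUAL_KWH, 0.05, 0.08, 0.07, 15)
site_b = npv_electricity_cost(ANNUAL_KWH, 0.06, 0.02, 0.07, 15)

print(f"site A: ${site_a:,.0f}")
print(f"site B: ${site_b:,.0f}")
```

Under these assumptions the site with the lower price today ends up with the higher 15-year cost, which is the effect to check for before committing to a location.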

And speaking of low-carbon intensity, Yahoo’s Quincy data center, where I led the first phase of construction to completion before starting MegaWatt Consulting, recently released a new low-carbon option for managing our data centers: Goat Power. Enjoy this short video with my Yahoo friends Chris, Lisa and Ty and some new goat friends as well. Looks like one of the goats was particularly fond of Chris, or at least her shirt.

I spoke at the Green Data Center Conference in San Diego over the last three days. I taught a three-hour workshop on energy-efficient data centers, gave a one-hour session about energy options and efficiency ideas for data centers, and joined three panels: energy sources, moderated by John Diamond; Organizations and Associations, moderated by Bruce Myatt, where I talked about the SVLG Data Center Efficiency program I co-chair; and a case study of the McGill University high-performance co-location data center project, on which I helped with site selection and design ideas alongside Rumsey Engineers. Eric Soladay of Rumsey Engineers did a great job presenting the efficient data center design, with a designed annual PUE of 1.06 and a very interesting snow-field concept for cooling this high-density data center without chillers or any other compressor-based cooling through 90% humidity and 90F summertime weather.

After 10+ years of talking about the importance of data center energy efficiency and concepts to improve it, as well as sharing my own experiences, I am so glad to hear that these ideas, and the importance of energy efficiency, are sticking. I was even more proud to hear that many of the ideas I have been pushing for the last several years, as well as terms that I believe I coined nearly 10 years ago, are being used in the regular vernacular of the industry: Holistic (designing and operating data centers as an entire system of hardware, software and infrastructure to achieve lowest total cost and highest availability for the intended purpose) and server hugger (the “need” (aka want) to have one’s data center and/or servers located nearby, often an emotional response and not a technical or rational need).

Remember to look at your specific needs and also be creative with carbon reductions, like how you cut your data center grass!

The Data Center Vibration Penalty to Storage Performance

June 10th, 2010

Every now and then a really great way to reduce energy use comes along that is so simple we all whack our heads wondering, “why didn’t I think of that!” My principles of achieving ultra-efficient data centers (PUEs between 1.03-1.08; I call anything less than 1.10 ultra-efficient) are based upon simplicity and a holistic approach while meeting the need, not the want or convention. Generally the simpler the better, as simple is always lower cost up front and ongoing, as well as easier to maintain, more reliable and more efficient.

So, here is one that will not catch you by surprise: a rack that saves energy. We’ve all heard of passive and active cooling racks: those with fans, heat exchangers or direct cooling systems. I explored some of the front & rear door heat exchanger racks back in 2003, which work really well for high-density applications but can be very expensive compared to better-designed data center cooling systems. But how about a rack that not only reduces energy costs but also improves hardware performance?

I’ve had the pleasure of exploring with Green Platform’s CEO, Gus Malek-Madani, their anti-vibration rack (“AVR”), a carbon-fiber composite rack designed specifically to remove vibration. Why remove vibration? Green Platform claims that a typical datacenter experiences vibration levels of around 0.2 G root mean square (GRMS), which it claims can degrade a disk drive’s performance (both I/O and throughput) by up to 66%, a claim that was borne out during a ‘rigorous’ testing exercise it did in conjunction with Sun Microsystems.

As hard drives get “larger” in capacity, bits get crammed into a smaller space. Together with physically smaller drives, this forces tighter tolerances between the rotating platters and the mechanical actuator arms within the drives. Vibration therefore causes drives to slow down or suffer more mis-reads and mis-writes, reducing I/O performance. “As a result of this ‘vibration penalty,’ the company believes that up to a third of all US datacenter spending – on both hardware and power – is wasted on vibration, amounting to some $32bn of wastage. The company also says there’s evidence that reducing the impact of vibration will serve to improve the reliability of drives (and improving mean time between failure.)”

In order to back up this figure, early tests with Sun Microsystems (pre-Oracle) and Q Associates (“Effects of Data Center Vibration on Compute System Performance” by Julian Turner) showed IOPS improvement of up to 247% in random I/O. The following chart shows this storage performance degradation:



You can also watch the following video that clearly shows that just yelling into the face of storage hardware causes a very visible degradation of storage performance: http://www.youtube.com/watch?v=tDacjrSCeq4

If the vibration from yelling into a rack causes performance degradation, think about the vibration effects from HVAC systems, thousands of server fans, and even walking through your data center.

The company says its carbon-composite design massively reduces the vibrations that can cripple hard disk drive performance, boosting performance, efficiency and even reliability. Based on the test results above, they expect that most folks should see a 100% improvement in storage throughput, 50% shorter job times and, consequently, 50% less power consumed per job. In testing with Sun Microsystems, the AVR reduced vibration by a factor of 10 to 1,000. In further testing with systems integrator Q Associates, which pitted the AVR against a regular steel rack, random read IOPS increased by between 56% and 246%, and random write IOPS showed a 34% to 88% improvement with the AVR.
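
The chain from throughput to energy per job can be sketched as below. It assumes the job is entirely storage-bound and that the hardware draws the same power while it runs, neither of which is guaranteed in practice; the 10-hour job and 50 kW draw are hypothetical.

```python
def job_hours_after_gain(baseline_hours, throughput_gain):
    """Job duration after a fractional throughput gain (1.0 means +100%)."""
    return baseline_hours / (1.0 + throughput_gain)


def energy_per_job_kwh(power_kw, job_hours):
    """Energy used by one job at a constant power draw."""
    return power_kw * job_hours


POWER_KW = 50.0        # hypothetical draw of the storage and servers involved
BASELINE_HOURS = 10.0  # hypothetical job duration in a steel rack

faster_hours = job_hours_after_gain(BASELINE_HOURS, 1.0)
before = energy_per_job_kwh(POWER_KW, BASELINE_HOURS)
after = energy_per_job_kwh(POWER_KW, faster_hours)

print(f"job time: {BASELINE_HOURS} h -> {faster_hours} h")
print(f"energy per job: {before} kWh -> {after} kWh")
```

A job that is only partly limited by storage would see a smaller saving than this best case.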

“The throughput and I/O rate of storage remains a significant performance bottleneck within the datacenter; though hard disk drive (HDD) capacities have increased by several orders of magnitude in the last 30 years, drive performance has improved by a much smaller factor. This issue is exacerbated by the fact that server performance, driven by Moore’s law, has increased massively, to the extent that there’s now a server-storage performance gap. The way most datacenters engineer around this problem is inefficient; typically workloads are striped across many disks in parallel, and disks are ‘short stroked’ — i.e. data is only written to the edge of platters – in order to minimize latency. Although this does address performance, the trade-off is that disk capacity is massively underutilized, wasting datacenter space and energy, not to mention the cost of reliably maintaining an unnecessarily large disk estate.”

In the many data centers I have had the pleasure of working in lately, storage is growing faster than server capacity and the greatest performance limitation is storage throughput. This product works for the high-end video/audio and scientific markets, a niche space where another of Malek-Madani’s companies, Composite Products, LLC, is focused. The test results clearly show storage throughput dramatically improved by reducing vibration at the rack of storage hardware. With some 3 million storage racks currently in use inside datacenters worldwide, a number growing by the second and likely to eventually exceed server racks, this is a very large opportunity to improve performance while reducing energy use, always one of my main mantras. Green Platform expects storage vendors such as NetApp, EMC, and others to offer its racks as an option, so that as you purchase and provision new storage systems, you pay a small incremental increase in the price of the storage system for a very large improvement in performance and energy reduction. Think of all of those servers waiting so much less for data, and how much that can improve the utilization of those systems. Think about it.

I’m looking to conduct an end-user test with their rack; contact me if you’re interested so we can determine results for your organization.

Can we replace UPSs in our data centers?

April 27th, 2010

Ever since I entered the data center realm 15 years ago, it has been common for a data center to have Uninterruptible Power Supplies (UPSs) feeding all computer equipment or other critical loads. The UPS did two things: 1) kept the power flowing from batteries in the UPS for a short duration until generators came on, utility power was restored, or computer equipment could be shut down; and 2) kept voltage and frequency stable for the computer load while the utility (or generator) power fluctuated, known as sags or surges. However, UPSs consume about 5-15% of the power entering them as losses in the units (a.k.a. inefficiency). So if IT load equals 1 MW, UPS power will be about 1.1 MW, with the additional 100 kW lost as heat, which then requires additional cooling to hold the roughly 75F temperature at which batteries and UPSs run best. Here is a photo of some UPS systems:

Now, enter 2010. UPSs are still assumed by nearly every data center engineer and operator to be needed or required, yet the power electronics within computer equipment can ride through just about any voltage sag or surge a utility would pass on through its protective equipment. Computer equipment power supplies have been rated for 100-240VAC and 50-60 Hertz for about 10 years now, a far greater range than a utility will likely ever pass on. Furthermore, due to capacitors in the power supplies, these devices can ride through complete outages of about 15+ cycles, which is roughly 1/4 second. So the UPS’s job is really now only to provide ride-through for outages longer than 1/4 second, until a generator comes on or for as long as the operation needs.
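
The cycles-to-seconds conversion is simple enough to check directly. This sketch only does the unit conversion; the 15-cycle figure is the one quoted above, and actual hold-up time depends on the power supply and its load.

```python
def ride_through_seconds(cycles, frequency_hz):
    """Duration of a given number of AC cycles at a given line frequency."""
    return cycles / frequency_hz


for hz in (60, 50):
    seconds = ride_through_seconds(15, hz)
    print(f"15 cycles at {hz} Hz = {seconds:.2f} s")
```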

In many of the data center design charrettes that I have been part of over the last few years, we ask the users what really needs to be on UPS, avoiding the assumption that all computer load must be on UPS. Once we dive into the operations, we always come back with an answer from the data center operators that only a portion of the computer load needs to be on UPS and the rest can go down during an infrequent utility outage. The reason is that these computers can stop operating for a few hours without affecting the business. Examples might be HR functions, crawlers, backup/long-term data storage, research computers, etc. Computers that might need to be on UPS include sales tools, accounting applications, short-term storage, email, etc., but not every application and function. Think about your own data center operations and what can go down every now and then during a utility outage (usually about once per year for a few hours), and see if you can reduce the total amount of UPS power you require and reserve that expensive UPS capacity and energy loss for the critical functions.
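
To get a feel for what is at stake, here is a rough sketch of annual UPS losses with all load protected versus only the critical share. The 10% loss fraction sits inside the 5-15% range quoted earlier, the 40% critical share is purely hypothetical, and the loss fraction is assumed not to change with load, which real UPSs do not guarantee.

```python
HOURS_PER_YEAR = 8760


def ups_loss_kw(protected_load_kw, loss_fraction):
    """Power lost in the UPS when losses are a fraction of UPS input power."""
    input_kw = protected_load_kw / (1.0 - loss_fraction)
    return input_kw - protected_load_kw


def annual_loss_kwh(protected_load_kw, loss_fraction):
    """Energy lost in the UPS over a year of continuous operation."""
    return ups_loss_kw(protected_load_kw, loss_fraction) * HOURS_PER_YEAR


IT_LOAD_KW = 1000.0    # hypothetical 1 MW of computer load
LOSS_FRACTION = 0.10   # assumed UPS loss as a fraction of input power
CRITICAL_SHARE = 0.40  # assumed share of the load that truly needs UPS

everything = annual_loss_kwh(IT_LOAD_KW, LOSS_FRACTION)
critical_only = annual_loss_kwh(IT_LOAD_KW * CRITICAL_SHARE, LOSS_FRACTION)

print(f"all load on UPS:      {everything:,.0f} kWh lost per year")
print(f"critical load on UPS: {critical_only:,.0f} kWh lost per year")
```

Because the loss is taken as a fraction of input power, the 10% case works out to about 111 kW on a 1 MW load, a little above the rounded 100 kW figure. The cooling needed to remove that heat is extra and is not counted here.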

Some data centers avoid UPSs entirely by putting a small battery on the computer itself; in Google’s widely publicized case, an inexpensive and readily available 9V battery. While this is an excellent idea for those that have custom computer hardware, it is not as easy to implement for most folks buying commodity servers today. Perhaps a better idea for the masses is to locate a capacitor on the computer board or within the server that can ride through ~20+ seconds until generator(s) can supply the load during a utility outage. Today’s capacitor technology should make this fairly easy to implement, and it could be a standard feature on all computer equipment at minimal added cost, much as international power supplies became 10+ years ago and higher-efficiency power supplies (90%+) are becoming today. A great new technology that could make this easy to build on the computer board can be seen here:
http://newscenter.lbl.gov/feature-stories/2010/04/23/micro-supercapacitor/
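
For a sense of scale, the sketch below estimates the capacitance needed to carry a server for about 20 seconds, using the standard stored-energy relation E = ½CV². The 300 W server, the 12 V bus and the 6 V cutoff are hypothetical values, and conversion losses are ignored.

```python
def required_energy_j(load_w, seconds):
    """Energy in joules needed to carry a load for a given time."""
    return load_w * seconds


def required_capacitance_f(load_w, seconds, v_start, v_min):
    """Capacitance needed if the capacitor discharges from v_start to v_min.

    Usable energy is 0.5 * C * (v_start**2 - v_min**2); conversion losses
    are ignored.
    """
    if v_start <= v_min:
        raise ValueError("v_start must be greater than v_min")
    energy = required_energy_j(load_w, seconds)
    return 2.0 * energy / (v_start ** 2 - v_min ** 2)


# Hypothetical server: 300 W for 20 s, capacitor bank on a 12 V bus,
# usable down to 6 V.
energy = required_energy_j(300.0, 20.0)
farads = required_capacitance_f(300.0, 20.0, 12.0, 6.0)

print(f"energy needed: {energy:.0f} J")
print(f"capacitance needed: {farads:.1f} F")
```

Tens to hundreds of farads at these voltages is supercapacitor territory, so the practicality depends on the kind of technology linked above.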

Using a technology like this, we could avoid UPSs entirely in our data centers by building enough ride-through onto the computer boards, into the hardware. That would save very expensive UPS power capacity, operating and maintenance expenses, and space within our data centers for more important functions: compute and storage capacity. My thought for the day. Think about it and you might save some money and energy.

Economization and methods to data center efficiency, ASHRAE 90.1 limits our options and efficiency

April 15th, 2010

For years, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) has had a standard for “recommended” and “allowable” humidity and temperature ranges within a data center. These ranges were far too narrow for the past many generations of computer hardware operating within our data centers. It is rumored by many folks that these standards were set by IBM back in the 1950’s to accommodate computer punch cards and keep them from getting “soft” with humidity, to prevent arc flashing between exposed copper lines on computer boards, and to keep vacuum tubes cool so they would not burn out, literally. However, computer hardware has advanced much over the last 50 years, especially the last decade or two, and many called the old ASHRAE standards very out of date.

Finally, ASHRAE advanced the standard, which so many data center designers, operators and engineers relied on for the operating specs of their data centers. Even though the new standard expanded the humidity and temperature ranges considered “allowable” within our data centers, the hardware specifications allow for much broader ranges still. While ASHRAE TC9.9 allows for up to about 80 degrees F (depending upon the humidity), hardware manufacturers spec their equipment to run with up to 95F inlet air and humidity generally between 5-95%. The computer hardware of today lacks vacuum tubes and paper punch cards; circuit boards are coated to prevent arc flashing; everything is solid state except two moving parts: the hard drives, which are hermetically sealed, and the cooling fans that draw air through the server. (And each of these has a limited life: hard drives already have solid-state replacements, and fans can be removed completely.)

Many studies have been completed to prove that the hardware no longer needs tight temperature and humidity ranges, nor even filtration to prevent dust accumulation. Microsoft proved this point with what I call their data center in a tent experiment, in which they located a couple of racks under a tent canopy in Redmond, WA for many months. The servers sucked in leaves, ran when it rained, snowed, blew, was hot and cold, and even had water leak onto them, without any hardware failures, proving the point that servers really don’t need any environmental controls. A link to this study is here: http://blogs.msdn.com/the_power_of_software/archive/2008/09/19/intense-computing-or-in-tents-computing.aspx

I myself completed very low humidity (10-20%) and high inlet temperature (75-85F) testing, and also had unintentional water leaks onto servers, back in 2002-2003, all without any failures whatsoever.

Another bold move was by chip-maker Intel, which completed what I call their data center in a box test. Unlike most container deployments, they put two containers, each loaded identically with 900 servers, in Santa Fe for a year. One had air conditioning, with its inherent humidity control as well as filtration; the other had a fan to draw outside air in, with minimal filtration and no air conditioning. What a great test! Santa Fe, being in the high desert, can get hot (92F) and cold (24F). It is normally very dry, but sudden thunderstorms on hot summer afternoons made humidity range from 4-90%. Intel found that the air-economized container had a very visible layer of dust on the servers, and that even though temperature and humidity ranged dramatically during this year-long study, the server failure rate was 4.46% versus 3.83% in their main air-conditioned data center. A failure rate this low is trivial when it is economic today to refresh (replace) hardware on 18-24 month cycles, and the failures seemed to coincide with dramatic humidity changes (10 to 90% in one hour). This study shows that the hardware is very robust and needs little to no humidity, temperature and filtration control. Intel’s own engineers say their chips are good up to 135 degrees Centigrade (275F)! This study is available from Intel, titled “Reducing Data Center Cost with an Air Economizer”, August 2008.

I very much commend the good folks at Microsoft and Intel for completing and publishing these studies. So, why does ASHRAE still limit temperature and humidity so much more tightly than necessary? To add to this, ASHRAE recently released 90.1, a data center energy efficiency standard. Well, I’ve been touting and pushing for data center energy efficiency since the late 90’s, and I have made it a keystone of my career for about a decade, so of all people, I fully support further improving data center energy efficiency. This is why I chair the SVLG data center demonstration efficiency program (http://dce.svlg.org). But what ASHRAE has done via this standard is require air-economization, which is only one way of improving the mechanical efficiency of our data centers. The standard essentially picks one technology. It would be akin to saying that all homes must have fiberglass insulation instead of saying that walls and ceilings must meet a specific R-value, which is really what we want. Let technology and the commercial market compete for the best solutions, rather than writing standards that require one technology.

My concern with this standard is not that I don’t want data centers to air-economize to reduce energy use. My qualm is that air-economization is only one method of improving energy efficiency and often not the best. For example, I have been involved with the design of several data centers over the last two years, many with the fine folks at Rumsey Engineers (www.rumseyengineers.com), and all have achieved annual PUEs of 1.04-1.08, significantly better than most data centers. In all of these low-PUE data centers, we studied air-economization, heat wheels, direct and indirect evaporative cooling, chilled-water plants, geothermal exchange, and many other methods to cool the data center. In all cases, hot or cold climates, dry or humid, high elevation or low elevation, we found that air-economization was about twice as energy intensive as other cooling methods, more expensive to build, and often less reliable. So ASHRAE 90.1 could actually INCREASE energy use in our data centers (and first-time cost as well) instead of having the intended effect of reducing it. Why would a standards group specify a technology to achieve efficiency? Why are the temperature and humidity standards a decade or more out of date with technology? We need impetus to be more efficient, but also the ability to innovate and challenge each other to be as efficient as we can using a variety of technologies, old and new. Limitations to one technology are out of date the minute they are published in the data center world. With amazing new technologies being released daily, and a generation of servers lasting less than two years, a standards body that takes years to update should not be prescribing any specific technology. It should instead allow greater flexibility in operations and design, enabling us operators, designers and manufacturers to adapt to the most efficient resource.

I congratulate those who have come out before me about this new standard and ask with them, that ASHRAE revise their thinking once again and provide more flexible standards, not limiting ones, which only hurt our energy efficiency achievements. Thank you for reading.