Archive for the ‘Energy’ Category

Smooth-Stone & ARM-based servers

Sunday, August 15th, 2010

In the latest data centers we have helped design, we have achieved design PUEs of 1.04-1.09, which means that the electrical and cooling systems will use, averaged over a year, only 4-9% as much energy as the hardware load (a.k.a. IT: servers, storage and network). This is a huge accomplishment, and it doesn’t come without a lot of experience, knowledge and constant effort to make those electrical and cooling systems ultra-efficient. Compare that to average industry PUEs of over 2.0, meaning that the cooling and electrical systems use more energy than the IT load itself. However, it also means that the data centers we are part of designing are now so efficient that we have to focus on the IT load to have a material effect on energy use, as there is very little left to save on the infrastructure side of the energy demand.
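To make the PUE arithmetic concrete, here is a minimal sketch. It assumes the usual definition of PUE as total facility energy divided by IT energy, so the infrastructure overhead relative to the IT load is simply PUE minus 1. The function name is mine, chosen for illustration only.

```python
def overhead_fraction(pue: float) -> float:
    """Infrastructure (electrical + cooling) energy as a fraction of IT energy.

    Assumes PUE = total facility energy / IT energy, so overhead = PUE - 1.
    """
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return pue - 1.0


# PUE values quoted in the post: the design range and the industry average.
for pue in (1.04, 1.09, 2.0):
    print(f"PUE {pue:.2f}: overhead is {overhead_fraction(pue):.0%} of the IT load")
```

At a PUE of 2.0 the overhead equals the IT load; anything above 2.0 means the infrastructure uses more energy than the hardware it supports.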

We work with clients to choose the most efficient servers and software solutions, but an entirely game-changing technology is now coming along: chips that use dramatically less power, about 1/20th that of existing technologies. More importantly, they use even less energy overall, because they can turn off when not in use and immediately turn back on when processing demand increases. Enter ARM-based processors for servers, the same technology used in our mobile devices today, which use much less power and turn on and off much more quickly than today’s server processors.

Microsoft has stated that it has been working with ARM-based chips since 1997 and is now working with ARM under a new licensing agreement “to enhance our research and development activities for ARM-based products”. That is quite a change from the strong partnership Microsoft has had with x86 chips from Intel and Advanced Micro Devices. Even Apple has been acquiring companies and hiring ARM chip engineers and specialists, likely for use in its growing line of smaller devices: the iPad, iPhone, iPod, etc.

Startup SeaMicro unveiled a server running on Intel Atom chips with a fabric that puts CPUs to sleep, allowing for a lower-energy rack. They claim they can fit as many as 2,048 CPUs into a 40U rack and use 8 kW of power. These calculations seem to assume that half of the processors in each rack are off. That is a good improvement over standard racks, but can we do even better?
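A quick back-of-the-envelope sketch of what those rack figures imply per CPU. It uses only the 2,048-CPU and 8 kW numbers quoted above and, as a simplifying assumption, charges the entire rack power to the CPUs (in reality the fabric, storage and fans draw some of it). The function name and the `active_fraction` parameter are illustrative.

```python
def watts_per_cpu(rack_kw: float, cpu_count: int, active_fraction: float = 1.0) -> float:
    """Average rack power per active CPU, in watts.

    Simplification: all rack power is attributed to the active CPUs.
    """
    active_cpus = cpu_count * active_fraction
    if active_cpus <= 0:
        raise ValueError("at least one CPU must be active")
    return rack_kw * 1000.0 / active_cpus


# Figures quoted in the post: 2,048 CPUs in a rack drawing 8 kW.
print(watts_per_cpu(8, 2048))        # averaged over every CPU in the rack
print(watts_per_cpu(8, 2048, 0.5))   # if only half the CPUs are powered on
```

Either way the result is in the single digits of watts per CPU, which is the range the ARM-based designs discussed next are aiming to beat.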

Enter Smooth-Stone, with an even more advanced approach: ARM-based processors designed for servers. The company is working to build a rack with a similar number of processors, but because they are ARM-based, these processors can be shut off and turned back on much more rapidly. They should also be able to withstand much higher temperatures, vibration and other stresses. After all, they come from mobile applications where these are requirements, unlike the kid-glove treatment hardware gets in most data centers. Smooth-Stone’s ARM-based servers should provide significantly more onboard memory, compute performance and network bandwidth, lower costs and many other advantages in a chip that uses only single-digit watts, giving a significant performance-per-watt advantage. In addition, when the rack is not needed, every processor except one can be quickly turned off, with that one running at 1/10 to 1/20 (or even less) of the power of a traditional processor, and the rack’s processors can be brought back up individually within microseconds. That is a distinct advantage in energy savings, and energy is the largest operating expense and a driver of capital expense in data centers.

With Smooth-Stone’s SoC, software and experienced team designing the lowest-power servers, this may well be the game-changing technology in IT and data centers.

The cute little button that makes you money

Wednesday, August 4th, 2010

Greening Greater Toronto study finds that data center servers operate at only 4% average utilization: “The statement is the result of a recent “Green Exchange” meeting on greening IT practices hosted by Greening Greater Toronto in partnership with the Ontario Institute of the Purchasing Management Association of Canada.”

“One of the other lessons learned from the meeting is that central-control systems are more effective at reducing energy consumption than relying on employee practices. Purchasers who implemented employee training programs to have people turn off their machines at the end of the day reported maximum penetration rates of 65 percent, declining rapidly over time.”

“In contrast, most organizations have focused on control solutions, where IT staff program computers to turn off on a timed cycle. This is often matched with settings to turn off monitors or put computers into sleep-modes after a certain period of inactivity. Purchasers report almost no user resistance to these solutions and consider it part of a larger trend of centralizing control of individual computers over a network. Most purchasers have solved common concerns about timed off-cycles with a software-based solution like the NightWatchman or Surveyor Windows server monitoring software.”

When I was at Sun Microsystems (late 90’s-2000), I found similar results. When I asked the 40,000+ Sun community to turn off desktop monitors and computers at night, they rarely did, even though a study I commissioned showed the savings to be well into the millions of dollars per year (most were left on during nights and weekends and, on average, employees were only at their desks about 4 hours per work day). But when I had a third-party switching device (MonitorMiser) added to all desktops that automatically turned off monitors when desktops were inactive (no mouse or keyboard input for 15 minutes), only three people out of over 40,000 complained. Savings = over $3,000,000 a year in the US. (Most monitors were 17-24” CRTs, and each employee averaged over two, as many had several.)

I then took this a step further and asked that the Operating System turn desktops off when inactive. This was a bit more challenging, as what was inactive to user input might be actively running code all night. So I had the software engineers put in some more code to look at processor state, network activity, and keep it user selectable. This was a very crude “sleep” mode for the OS and a beginning of those for the industry. The industry followed what we did I think not for energy savings, but because Sun sales engineers started selling computers on TCO including energy use and winning deals. The sales team was realizing by the late 90’s that lower total cost, and consequently lower energy use of the equipment, helped to make sales. These and other changes led to over $10,000,000 in annual energy savings I implemented and led to earning my second EPA EnergyStar Partner of the Year Award.
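The kind of check described above can be sketched roughly as follows. This is not the code Sun shipped; it is a hypothetical illustration of combining user inactivity with processor and network activity behind a user-selectable switch. Every name and threshold here is an assumption, except the 15-minute idle period, which is borrowed from the monitor example earlier in the post.

```python
from dataclasses import dataclass


@dataclass
class IdlePolicy:
    """Hypothetical, user-adjustable sleep policy (all defaults are assumptions)."""
    enabled: bool = True                      # user-selectable on/off switch
    idle_minutes: float = 15.0                # borrowed from the monitor example
    max_cpu_utilization: float = 0.05         # assumed threshold (5%)
    max_network_bytes_per_s: float = 1024.0   # assumed threshold


def should_sleep(policy: IdlePolicy, minutes_since_input: float,
                 cpu_utilization: float, network_bytes_per_s: float) -> bool:
    """Sleep only if the user is away AND the machine itself is doing nothing."""
    if not policy.enabled:
        return False
    if minutes_since_input < policy.idle_minutes:
        return False  # user was active recently
    if cpu_utilization > policy.max_cpu_utilization:
        return False  # code is still running, e.g. an overnight job
    if network_bytes_per_s > policy.max_network_bytes_per_s:
        return False  # the machine is still serving or transferring data
    return True


policy = IdlePolicy()
print(should_sleep(policy, minutes_since_input=30, cpu_utilization=0.01, network_bytes_per_s=10))
print(should_sleep(policy, minutes_since_input=30, cpu_utilization=0.80, network_bytes_per_s=10))
```

The point of the extra checks is the one made above: a desktop with no keyboard or mouse input may still be running code all night, so user inactivity alone is not enough.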

It’s been great to see this early and rather crude OS function evolve to automatically put monitors and desktops into sleep and/or off states. Wow! Now look at what our desktops, servers, and even networking and storage equipment can do to reduce energy use when underutilized.

Take this one more evolutionary step forward, and you have what I call server power management software (1E, PowerAssure, Surveyor, and many others) that automatically determines hardware utilization and state, and either puts the hardware to sleep or turns it off, and then automatically turns it back on when needed. How far this has come from our early and rather crude versions with desktops at Sun.

Think about it: as a company you want to utilize all of the assets you have to perform work that maximizes revenue (or profit). But you also need assets for peak periods that are underutilized during low-demand periods. Think New York City taxis. They are busy as heck on a Friday night or during rush hour, yet rather idle at 4 AM on a Sunday. You wouldn’t want every one of those cabs with a paid driver idling their engine, burning dollars out the tailpipe, now would you? So the cabs are parked, the drivers are home asleep, and they work when demand warrants it. So why do we leave our servers (and storage and network ‘taxis’) idling 8,760 hours a year when average peak times are well less than 1,000 hours per year (often less than 100)? We do this and wonder why we have average processor utilizations of less than 10%. (Processor capacity is rarely the limiting factor in most applications these days, but that is a topic for another blog.) And yet those servers consume about 2/3 of peak power when at 0% processor utilization, so why leave them running, burning precious company dollars out through the power meter? Is it charity to our utility companies? I doubt it. So power down those servers when not needed and save precious dollars for more important tasks than burning power unproductively. After all, those servers do have an on/off button. And call the experts at MegaWatt Consulting for these and more solutions to increase your dollars. Power on…productively.
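To put a rough number on the idling taxi, here is a minimal sketch. The 8,760 hours per year and the 2/3-of-peak idle draw come from the paragraph above; the 300 W peak power, the 1,000 busy hours and the $0.10/kWh price in the example call are placeholder assumptions, not measurements. It also treats every non-busy hour as fully idle, so read the result as an upper bound.

```python
HOURS_PER_YEAR = 8760          # 365 days x 24 hours
IDLE_POWER_FRACTION = 2 / 3    # idle draw as a share of peak power, per the post


def idle_energy_kwh(peak_watts: float, busy_hours_per_year: float) -> float:
    """Energy burned per year while a server sits idle but powered on."""
    if not 0 <= busy_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("busy hours must be between 0 and 8,760")
    idle_hours = HOURS_PER_YEAR - busy_hours_per_year
    return peak_watts * IDLE_POWER_FRACTION * idle_hours / 1000.0


def idle_cost_dollars(peak_watts: float, busy_hours_per_year: float,
                      dollars_per_kwh: float) -> float:
    """Annual cost of that idle energy at a flat electricity price."""
    return idle_energy_kwh(peak_watts, busy_hours_per_year) * dollars_per_kwh


# Placeholder inputs: a 300 W server, busy 1,000 hours a year, $0.10 per kWh.
kwh = idle_energy_kwh(300, 1000)
cost = idle_cost_dollars(300, 1000, 0.10)
print(f"Idle energy: {kwh:.0f} kWh/year, idle cost: ${cost:.2f}/year per server")
```

Multiply by the number of servers, and then by the facility's PUE to include the cooling and electrical overhead, to estimate the site-wide figure.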

The Data Center Vibration Penalty to Storage Performance

Thursday, June 10th, 2010

Every now and then a really great way to reduce energy use comes along that is so simple we all whack our heads wondering, “why didn’t I think of that!” My principles of achieving ultra-efficient data centers (PUEs between 1.03 and 1.08; I call anything less than 1.10 ultra-efficient) are based upon simplicity and a holistic approach while meeting the need, not the want or convention. Generally the simpler the better, as simple is always lower cost up front and ongoing, as well as easier to maintain, more reliable and more efficient.

So, here is one that will not catch you by surprise: a rack that saves energy. We’ve all heard of passive and active cooling racks: those with fans, heat exchangers or direct cooling systems. I explored some of the front & rear door heat exchanger racks back in 2003, which work really well for high-density applications but can be very expensive compared to better-designed data center cooling systems. But how about a rack that not only reduces energy costs but also improves hardware performance?

I’ve had the pleasure of exploring with Green Platform’s CEO, Gus Malek-Madani, their anti-vibration rack (“AVR”), a carbon-fiber composite rack actually designed to remove vibration. Why remove vibration? Green Platform claims that a typical datacenter experiences vibration levels of around 0.2 G root mean square (GRMS), which, it claims, can degrade a disk drive’s performance (both I/O and throughput) by up to 66%, a fact that was borne out during a ‘rigorous’ testing exercise it did in conjunction with Sun Microsystems.

As hard drives get “larger” in capacity, bits get crammed into a smaller space. This, along with smaller drives, forces tighter tolerances between the rotating platters and the mechanical actuator arms within the drives, and thus vibration causes drives to slow down or to have more misreads and miswrites, slowing I/O performance. “As a result of this ‘vibration penalty,’ the company believes that up to a third of all US datacenter spending – on both hardware and power – is wasted on vibration, amounting to some $32bn of wastage. The company also says there’s evidence that reducing the impact of vibration will serve to improve the reliability of drives (and improving mean time between failure.)”

In order to back up this figure, early tests with Sun Microsystems (pre-Oracle) and Q Associates (“Effects of Data Center Vibration on Compute System Performance” by Julian Turner) showed IOPS improvement of up to 247% in random I/O. The following chart shows this storage performance degradation:



You can also watch the following video that clearly shows that just yelling into the face of storage hardware causes a very visible degradation of storage performance: http://www.youtube.com/watch?v=tDacjrSCeq4

If the vibration from yelling into a rack causes performance degradation, think about the vibration effects from HVAC systems, thousands of server fans, and even walking through your data center.

The company says its carbon-composite design massively reduces the vibrations that can cripple hard disk drive performance, boosting performance, efficiency and even reliability. From the results of the tests stated above, they assume that most folks should see a 100% improvement in storage throughput, 50% shorter job times and, consequently, 50% less power consumed per job. In testing with Sun Microsystems, the AVR dissipated vibration by a factor of 10x to 1000x. Further testing with systems integrator Q Associates, which pitted the AVR against a regular steel rack, found that random read IOPS increased by between 56% and 246%, with random write IOPS showing a 34% to 88% improvement with the AVR.
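The step from "100% more throughput" to "50% shorter jobs and 50% less energy per job" is worth spelling out. The sketch below assumes the job is entirely storage-bound and that the hardware draws the same power throughout, which are my simplifying assumptions, not something the tests established. Under them, job time and energy per job both scale as 1 / (1 + improvement).

```python
def job_time_reduction(throughput_improvement: float) -> float:
    """Fractional reduction in job time for a given throughput gain.

    throughput_improvement is a fraction, e.g. 1.0 for a 100% improvement.
    Assumes a fully storage-bound job. At constant power draw, the same
    fraction applies to energy consumed per job.
    """
    if throughput_improvement <= -1.0:
        raise ValueError("improvement must be greater than -100%")
    return 1.0 - 1.0 / (1.0 + throughput_improvement)


# 100% assumed gain, plus the random-read and random-write IOPS ranges quoted above.
for gain in (1.00, 0.56, 2.46, 0.34, 0.88):
    print(f"{gain:.0%} more throughput -> {job_time_reduction(gain):.1%} shorter job")
```

Note the diminishing returns: doubling throughput halves the job time, but more than tripling it (a 246% gain) cuts job time by roughly 71%, not 246%.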

“The throughput and I/O rate of storage remains a significant performance bottleneck within the datacenter; though hard disk drive (HDD) capacities have increased by several orders of magnitude in the last 30 years, drive performance has improved by a much smaller factor. This issue is exacerbated by the fact that server performance, driven by Moore’s law, has increased massively, to the extent that there’s now a server-storage performance gap. The way most datacenters engineer around this problem is inefficient; typically workloads are striped across many disks in parallel, and disks are ‘short stroked’ — i.e. data is only written to the edge of platters – in order to minimize latency. Although this does address performance, the trade-off is that disk capacity is massively underutilized, wasting datacenter space and energy, not to mention the cost of reliably maintaining an unnecessarily large disk estate.”

In the many data centers I have had the pleasure of working in lately, storage is growing faster than server capacity and the greatest performance limitation is storage throughput. This product works for the high-end video/audio and scientific markets, a niche space where another of Malek-Madani’s companies, Composite Products, LLC, is focused. The test results clearly show storage throughput dramatically improved by reducing vibration at the rack of storage hardware. With some 3 million storage racks currently in use inside datacenters worldwide, and growing by the second to probably eventually exceed server racks, this is a very large opportunity to improve performance while reducing energy use, always one of my main mantras. Green Platform expects to have their racks offered as an option by storage vendors such as NetApp, EMC, and others, so that as you purchase and provision new storage systems, you pay a small incremental increase in the price of the storage system for a very large improvement in performance and energy reduction. Think of all of those servers waiting so much less for data, and how much that can improve the utilization of those systems. Think about it.

I’m looking to conduct an end-user test with their rack; contact me if you’re interested so we can determine results for your organization.

Economization and methods to data center efficiency, ASHRAE 90.1 limits our options and efficiency

Thursday, April 15th, 2010

For years, the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) has had a standard for “recommended” and “allowable” humidity and temperature ranges within a data center. These ranges were far too narrow for the past many generations of computer hardware that operate within our data centers. It is rumored by many folks that these standards were set by IBM back in the 1950’s to accommodate computer punch cards and keep them from getting “soft” with humidity, to prevent arc flashing between exposed copper lines on computer boards, and to keep vacuum tubes cool to prevent them from burning out, literally. However, computer hardware has advanced much over the last 50 years, especially the last decade or two, and many called the old ASHRAE standards very out of date.

Finally, ASHRAE advanced the standard, which so many data center designers, operators and engineers relied on for the operating specs of their data centers. Even though the new standards expanded the humidity and temperature ranges considered “allowable” within our data centers, the hardware specifications allow for much broader ranges still. While ASHRAE TC9.9 allows for up to about 80 degrees F (depending upon the humidity), hardware manufacturers spec their equipment to run with inlet air up to 95F and humidity generally between 5-95%. The computer hardware of today lacks vacuum tubes and paper punch cards; circuit boards are coated to prevent arc flashing; and everything is solid state except two moving parts: the hard drives, which are hermetically sealed, and the cooling fans that draw air through the server. (And each of these has a limited life, with solid-state drives already an option and fans able to be removed completely.)

Many studies have been completed to prove that the hardware no longer needs tight temperature and humidity ranges, nor even filtration to prevent dust accumulation. Microsoft proved this point with what I call their data center in a tent experiment, in which they located a couple of racks under a tent canopy in Redmond, WA for many months. The servers sucked in leaves, ran when it rained, snowed, blew, was hot and cold, and even had water leak onto them, without any hardware failures, proving the point that servers really don’t need any environmental controls. A link to this study is here: http://blogs.msdn.com/the_power_of_software/archive/2008/09/19/intense-computing-or-in-tents-computing.aspx

I myself completed very low humidity (10-20%) and high inlet temperature (75-85F) testing, as well as experiencing unintentional water leaks onto servers, back in 2002-2003, all without any failures whatsoever.

Another bold move was by chip-maker Intel, which completed what I call their data center in a box test. Unlike most container deployments, they put two containers, each loaded the same with 900 servers, in Santa Fe for a year. One had air conditioning and its inherent humidity control as well as filtration; the other had a fan to draw outside air in, with minimal filtration and no air conditioning. What a great test! Santa Fe can get hot (92F) and cold (24F), being in the high desert, and is normally very dry, but sudden thunderstorms on hot summer afternoons made humidity range from 4-90%. Intel found that the air-economized container had a very visible layer of dust on the servers, and that even though temperature and humidity ranged dramatically during this year-long study, the server failure rate was 4.46% versus 3.83% in their main air-conditioned data center. A failure rate this low is negligible when it is economic today to refresh (replace) hardware on 18-24 month cycles, and the failures seemed to coincide with dramatic humidity changes (10 to 90% in one hour). This study shows that the hardware is very robust and needs little to no humidity, temperature and filtration control. Intel’s own engineers say their chips are good up to 135 degrees Centigrade (275F)! This study is available from Intel, titled “Reducing Data Center Cost with an Air Economizer”, August 2008.
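For readers who want to check the numbers in that paragraph, here is a small sketch that compares the two failure rates and verifies the Centigrade-to-Fahrenheit conversion. It uses only the figures quoted above; the function names are illustrative.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard conversion: F = C * 9/5 + 32."""
    return celsius * 9.0 / 5.0 + 32.0


def failure_rate_gap(economized_pct: float, baseline_pct: float) -> tuple[float, float]:
    """Return (gap in percentage points, relative increase as a fraction)."""
    if baseline_pct <= 0:
        raise ValueError("baseline failure rate must be positive")
    gap_points = economized_pct - baseline_pct
    relative = economized_pct / baseline_pct - 1.0
    return gap_points, relative


# Failure rates quoted in the post: 4.46% economized vs 3.83% air-conditioned.
gap_points, relative = failure_rate_gap(4.46, 3.83)
print(f"Gap: {gap_points:.2f} percentage points ({relative:.1%} relative increase)")
print(f"135 C = {celsius_to_fahrenheit(135):.0f} F")
```

The gap is well under one percentage point, which is the basis for calling it negligible against an 18-24 month refresh cycle.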

I very much commend the good folks at Microsoft and Intel for completing and publishing these studies. So, why does ASHRAE still limit temperature and humidity so much more tightly than necessary? To add to this, ASHRAE recently released 90.1, a data center energy efficiency standard. Well, I’ve been touting and pushing for data center energy efficiency since the late 90’s, and I have made it a keystone of my career for about a decade, so of all people, I fully support further improving data center energy efficiency. This is why I chair the SVLG data center demonstration efficiency program (http://dce.svlg.org). But what ASHRAE has done via this standard is require air-economization, which is only one technology for improving the mechanical efficiency of our data centers, essentially picking a winner. It would be akin to saying that all homes must have fiberglass insulation instead of saying that walls and ceilings must meet a specific R-value, which is really what we want. Let technology and the commercial market compete for the best solutions, not standards that require one technology.

My concern with this standard is not that I don’t want data centers to air-economize to reduce energy use; my qualm is that air-economization is only one method of improving energy efficiency, and often not the best. For example, I have been involved with the design of several data centers over the last two years, many with the fine folks at Rumsey Engineers (ww.rumseyengineers.com), and all have achieved annual PUEs of 1.04-1.08, significantly better than most data centers. In all of these low-PUE data centers, we studied air-economization, heat wheels, direct and indirect evaporative cooling, chilled-water plants, geothermal exchange, and many other methods to cool the data center. In all cases, hot or cold climates, dry or humid, high elevation or low, we found that air-economization was about twice as energy intensive as other cooling methods, more expensive to build and often less reliable. So ASHRAE 90.1 could actually INCREASE energy use in our data centers (and first-time cost as well) instead of the intended effect of reducing it. Why would a standards group specify a technology to achieve efficiency? Why are the temperature and humidity standards a decade or more out of date with technology? We need impetus to be more efficient, but also the ability to innovate and challenge each other to be as efficient as we can using a variety of technologies, old and new. Limitations to one technology are out of date the minute they are published in the data center world. With amazing new technologies being released daily, and a generation of servers lasting less than two years, a standards body that takes years to update should not be prescribing any specific technology. It should instead allow greater flexibility in operations and design, enabling us operators, designers and manufacturers to adapt to the most efficient resource.

I congratulate those who have come out before me about this new standard and ask, with them, that ASHRAE revise their thinking once again and provide more flexible standards, not limiting ones, which only hurt our energy efficiency achievements. Thank you for reading.

Is it possible, a data center PUE of 1.04, today?

Saturday, August 22nd, 2009

I’ve been involved in the design and development of over $6 billion of data centers, maybe about $10 billion now (I lost count after $5 billion a few years ago), so I’ve seen a few things. One thing I do see in the data center industry is, more or less, the same design over and over again. Yes, we push the envelope as an industry, and yes, we do design some pretty cool stuff, but rarely do we sit down with our client, the end-user, and ask them what they really need. They often tell us a certain Tier level or availability they want, and the MW of IT load to support, but what do they really need? Often everyone in the design charrette assumes what a data center should look like without really diving deep into what is important.

When we do that, we can get some very interesting results. For example, I’ve been fortunate to have been involved with the design of three data centers this year, and in all three we were able to push the envelope of design and ask some of these difficult questions. Rarely did I get the answers from the end-users I wanted to hear, where they really questioned the traditional thinking and what a data center should be and why, but we did get to some unconventional conclusions about what they needed instead of automatically assuming what they needed or wanted. As a consequence, we designed three data centers with low PUEs, or even what I like to call “ultra-low PUEs”, those below 1.10. The first was at 1.08, the next at 1.06 and now we have a 1.046. OK, let’s call it 1.05, since the other two are rounded up as well. (We know we can get that one down to about 1.04 with a few more tweaks to that “what is really needed” question.)

Now, I figured that a PUE of 1.05 was going to take a few years to get to because the hardware needed to improve, i.e. chillers, UPS, transformers, etc. But what I didn’t take into account was that when we really look at what the client needs, not wants, and what we can do to design for efficiency without jumping to the same old way of designing a data center, we can reach some great results. I assume that this principle can apply to almost anything in life.

Now, you ask, how did we get to a PUE of 1.05? Let me hopefully answer a few of your questions: 1) yes, based on annual hourly site weather data; 2) all three have densities of 400-500 watts/sf; 3) all three are roughly Tier III to Tier III+, so all have roughly N+1 (I explain a little more below); 4) all three are in climates that exceed 90F in summer; 5) none use a body of water to transfer heat (i.e. lake, river, etc); 6) all are roughly 10 MW of IT load, so pretty normal size; 7) all operate within TC9.9 recommended ranges except for a few hours a year within the allowable range; and most importantly, 8) all have construction budgets equal to or LESS than standard data center construction. Oh, and one more thing: even though each of these sites has some renewable energy generation, this is not counted in the PUE to reduce it; I don’t believe that is in the spirit of the metric.
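Point 1 deserves a short illustration, because an annual PUE is an energy-weighted figure over all 8,760 hours of the year, not a snapshot on a cool day. The sketch below shows the calculation on synthetic numbers: the constant 10 MW IT load matches the sizes mentioned above, but the hourly overhead values are invented purely for illustration and are not data from these designs.

```python
def annual_pue(it_kwh: list[float], overhead_kwh: list[float]) -> float:
    """Energy-weighted PUE over a set of hours.

    it_kwh[i] and overhead_kwh[i] are the IT and infrastructure energy in hour i.
    PUE = (total IT energy + total overhead energy) / total IT energy.
    """
    if len(it_kwh) != len(overhead_kwh):
        raise ValueError("both series must cover the same hours")
    total_it = sum(it_kwh)
    if total_it <= 0:
        raise ValueError("IT energy must be positive")
    return (total_it + sum(overhead_kwh)) / total_it


# Synthetic year: constant 10 MW IT load (10,000 kWh each hour).
# Invented overhead: 3% of IT load for most hours, 12% in the 500 hottest hours.
it = [10_000.0] * 8760
overhead = [300.0] * 8260 + [1200.0] * 500
print(f"Annual PUE: {annual_pue(it, overhead):.3f}")
```

The design point is that a handful of hot hours with much worse efficiency barely move the annual figure, which is why it pays to ask what is really needed during those few peak hours.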

Now, for some of the juicy details (email or call me for more or read future blog posts). We questioned what they thought a data center should be: how much redundancy did they really need? Could we exceed ASHRAE TC9.9 recommended or even allowable ranges? Did all the IT load really NEED to be on UPS? Was N+1 really needed during the few peak hours a year, or could we get by with just N during those few peak hours each year and N+1 the rest of the year? And so on. The main point of this blog post is to say that low PUEs, like that of 1.05, can be achieved (yes, been there and done that now) for the same cost or LESS than a standard design, and done TODAY, saving millions of dollars per year in energy, millions of tons of CO2, millions of dollars of capital cost up front, maintenance, etc. We just need to really dive deep as to what we need, not what we want or think we need, and we’ll be better at achieving great things. Now, I need to apply this concept to other parts of my life; how about you?
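For a sense of scale on "millions of dollars per year in energy", here is a rough sketch. It takes the roughly 10 MW IT load and the 1.05 PUE from this post. The baseline PUE of 2.0 and the $0.10/kWh electricity price are my assumptions for illustration; real baselines and tariffs vary widely.

```python
HOURS_PER_YEAR = 8760


def annual_overhead_kwh(it_load_kw: float, pue: float) -> float:
    """Annual infrastructure energy for a constant IT load at a given PUE."""
    if pue < 1.0:
        raise ValueError("PUE cannot be below 1.0")
    return it_load_kw * (pue - 1.0) * HOURS_PER_YEAR


def annual_savings_kwh(it_load_kw: float, baseline_pue: float, efficient_pue: float) -> float:
    """Infrastructure energy avoided by moving from the baseline to the efficient PUE."""
    return annual_overhead_kwh(it_load_kw, baseline_pue) - annual_overhead_kwh(it_load_kw, efficient_pue)


# 10 MW IT load; assumed baseline PUE of 2.0 vs the 1.05 described in the post.
saved_kwh = annual_savings_kwh(10_000, 2.0, 1.05)
assumed_price = 0.10  # dollars per kWh, an illustrative assumption
print(f"Energy avoided: {saved_kwh / 1e6:.1f} GWh/year")
print(f"At ${assumed_price:.2f}/kWh: ${saved_kwh * assumed_price / 1e6:.1f} million/year")
```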