Recently I spoke to David Emery, a friend and colleague from my time at Coopers & Lybrand. He is now working on a major social media initiative for a global mobile telco. I was interested in David’s perspective because he has been working on a set of solutions to process log files at enormous scale. You might think this is a somewhat trivial use case, but many modern business processes at scale generate impossibly large quantities of data that need to be turned into information.
David and his colleagues have been using a number of open source components to tackle the problem that scale-up won’t scale far enough, leveraging cheap compute and storage plus smart software and algorithms to deliver a solution. I think David makes a number of important points that vendors would be well advised to heed:
- Massive Internet scale problems are now solvable and enterprises want to mine the data to generate business information
- The value of the whole solution is enormous but the sheer scale can make it unaffordable
- Open source software and scale out commodity hardware are one possible solution to scale and affordability
- Smart techniques like Hadoop and MapReduce are now becoming commonly used tools
Here is David’s story:
“Demand for storage capacity continues unabated, rising along an exponential growth curve (Kryder’s Law) that has challenged vendors to squeeze more bang per buck into SAN, NAS and a whole array (pun intended) of predominantly vertically scaled, enterprise-class storage solutions.
Improvements and innovations over the years, in the form of cramming more ‘bits’ per inch onto a hard disk (greater magnetic bit density), RAID configurations and fibre optic connectivity, have given us ever faster, larger and more resilient storage solutions that we quickly fill and consume. This demand is unlikely to diminish as applications and datasets become ever more enormous and sophisticated.
It’s not only supercomputing applications driving huge demand. Whilst the data generated by the Large Hadron Collider may be an extreme example (it currently generates 2GB per 10 seconds of use), there are many less esoteric applications demanding huge volumes of storage: think genome, DNA and RNA analysis, pharmaceutical research, financial modelling, Internet search, email and Web 2.0 social networking sites.
The latter examples seem less obvious until you consider the sheer number of users: Facebook recently surpassed 400m customer accounts. It’s no surprise, then, that the leading Internet companies have taken a different approach to meeting growing storage demands rather than relying solely on the bottom-up vertical approach of the traditional storage vendors.
Google and Yahoo have been key players in the development of distributed storage and analysis efforts (where there is data, there is a demand to analyse and report on that data) that have yielded, amongst others, Hadoop, MapReduce, HDFS (Hadoop Distributed File System) and GFS (Google File System).
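As an aside, the MapReduce model mentioned here can be illustrated in a few lines of plain Python. This is a toy sketch of the three conceptual phases (map, shuffle, reduce) applied to counting HTTP status codes; the log lines and their format are invented for illustration, and a real Hadoop job would distribute each phase across many nodes:

```python
from collections import defaultdict

# Hypothetical web-server log lines, ending in an HTTP status code.
LOG_LINES = [
    '10.0.0.1 GET /index.html 200',
    '10.0.0.2 GET /missing 404',
    '10.0.0.1 GET /about.html 200',
]

def map_phase(lines):
    """Map: emit a (key, value) pair per record - here (status_code, 1)."""
    for line in lines:
        status = line.rsplit(' ', 1)[-1]
        yield status, 1

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single result."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(LOG_LINES)))
print(counts)  # {'200': 2, '404': 1}
```

The point of the model is that map and reduce are pure, per-record or per-key operations, so the framework is free to run them in parallel wherever the data happens to live.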
In the massive architectures required to drive the Google Searches and Facebooks of the world, horizontal scale-out is king. Tiered architectures remain valid, but they are increasingly underpinned by free open source software.
It’s not only web start-ups grown from small beginnings into large corporates that have embraced the free software stack (Apache, Linux, Squid, MySQL, Perl, Python, Nagios etc.) to support expansion whilst avoiding crippling licensing costs; small and large enterprises alike have joined the bandwagon as many of the barriers to entry have become irrelevant.
Product stability, maturity, widespread adoption and readily available support have mitigated many of the perceived risks. The architecture scales, the software works, and it can all be built on a foundation of cheap commodity servers. Virtualization and Cloud Computing have only reinforced this trend, and Infrastructure is increasingly provided as a Service (IaaS), where the bare-metal platform is entirely abstracted and increasingly irrelevant.
The distributed architecture and horizontal scale-out approach is now beginning to shake up the storage and database tier and, therefore, the storage marketplace. Customers want massive capacity, reliability and good performance, but they also want to avoid vendor lock-in and large upfront investment costs. They also want more effective ways to process such huge volumes of data.
Distributed file systems and distributed compute processing make all of this possible. An emerging sector with players such as GlusterFS, Lustre and Ibrix has grown, and the traditional storage vendors are shoring up their product ranges with similar solutions. HP bought Ibrix, whilst Gluster is going down the monetized open source services route.
Log file collection and processing provide a highly relevant, if more mundane, example of how these building blocks can be pulled together to form an innovative and cost-effective solution that grows as customer demands increase. In an infrastructure supporting a web-based service with just under two million users, I’ve recently seen systems generate over 100GB of log file data per day.
Historically, collecting and storing such data, if it is done at all, is often overlooked or poorly implemented. It is often seen as a costly process of limited use (typically because the value in the data is widely spread out and cannot easily be retrieved in a meaningful way) and ultimately becomes little more than a burdensome risk, retention and compliance requirement for many organisations.
Much of the data that is kept ends up on tape gathering dust. How can a customer expect to grow their service from two million users to five, then twenty and beyond without crippling storage costs, let alone handle such large volumes of log file data and do something useful with it?
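To put that growth in perspective, here is a back-of-envelope sketch. The only figure from the text is the 100GB/day at two million users; the linear scaling with user count and the replication factor of 3 (HDFS’s default) are assumptions for illustration:

```python
DAILY_GB_AT_2M_USERS = 100  # figure quoted above; the rest is assumed

def yearly_storage_tb(users_millions, replication=3):
    """Rough yearly raw-capacity estimate in TB, assuming log volume
    scales linearly with users and each block is stored 3 times
    (the HDFS default replication factor)."""
    daily_gb = DAILY_GB_AT_2M_USERS * users_millions / 2
    return daily_gb * 365 * replication / 1024

for users in (2, 5, 20):
    print(f"{users}M users: ~{yearly_storage_tb(users):.0f} TB/year")
# 2M users: ~107 TB/year
# 5M users: ~267 TB/year
# 20M users: ~1069 TB/year
```

Even with these crude assumptions, a year of logs at twenty million users is comfortably into petabyte territory once several years of retention are stacked up, which is why affordability per terabyte dominates the design.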
A storage platform fronted by a Distributed File System provides one possible answer. The DFS can be built upon multiple nodes running on cheap commodity hardware. More nodes can be added as required; the underlying hardware can be changed and can comprise many different nodes running on different platforms. The DFS provides the clustering, reliability and scale-out storage architecture under a single namespace, accessible by any number of standard protocols, e.g. CIFS, NFS, HTTP, iSCSI etc. What’s more, a multiple-node system provides readily available processing power, suitable for MapReduce-type applications. Of course, an alternative is to stick with large-scale, vendor-specific storage platforms, where cost is reduced through economies of scale and risk is somewhat mitigated, at the expense of lock-in.
A similar DFS approach has been successfully implemented by MailTrust (Rackspace’s mail division) to capture, collate and process huge volumes of daily log files using Syslog, Hadoop and MySQL. This may be ‘just’ log files, but the power of the data can be harnessed to improve support operations and identify trends.
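A pipeline of this shape is often built with Hadoop Streaming, where the map and reduce steps are small scripts reading lines on stdin and writing tab-separated key/value pairs on stdout. The sketch below counts syslog entries per host; it is not MailTrust’s actual code, and it assumes the traditional syslog line layout (`Mmm dd hh:mm:ss host tag: message`):

```python
def mapper(lines):
    """Map step: emit 'host<TAB>1' for each syslog-style line, where the
    hostname is the fourth whitespace-separated field."""
    for line in lines:
        parts = line.split()
        if len(parts) >= 4:
            yield f"{parts[3]}\t1"

def reducer(lines):
    """Reduce step: sum counts per host. Hadoop Streaming delivers the
    mapper output sorted by key, so equal keys arrive consecutively."""
    current, total = None, 0
    for line in lines:
        host, count = line.rsplit("\t", 1)
        if host != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = host, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"

demo = [
    "Mar 10 12:00:01 web1 sshd[101]: accepted connection",
    "Mar 10 12:00:02 web2 sshd[102]: accepted connection",
    "Mar 10 12:00:03 web1 cron[103]: job started",
]
print(list(reducer(sorted(mapper(demo)))))
```

In a real streaming job each function would be wrapped around `sys.stdin` in its own script, and results could then be bulk-loaded into MySQL for reporting.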
Of course, this is possible with traditional tools and storage, but the key here is scale and affordability. I’ve recently seen other companies looking to build similar distributed storage platforms that will also form the backbone of a private storage cloud, fronted by Eucalyptus software. Again, the whole architecture can consist of open source software running on cheap commodity hardware.
It is software and open standards that are increasingly enabling organisations to build massive Internet web services requiring massive storage. The database and storage layers remain the last vertical bottleneck, but this is changing. SAN and NAS technology will not disappear; rather, consumption will probably continue to grow (in line with Kryder’s Law), but DFS and greater flexibility are here to stay.
The success of companies such as Gluster, and the wider adoption of HDFS and Google FS, will be key to how many customers, and by how much, move from hardware-specific storage platforms provided by the likes of HP, IBM and NetApp to more open-standards-based solutions that do not require proprietary hardware. The same vendors will be providing much of the commodity storage anyway, but it’ll make interesting viewing watching the larger vendors respond.”