Figure 1: A (simplified) Big Data Ecosystem

[source: Steve Nimmons]

bigdataecosystem

 

In terms of ‘forces’ affecting the CIO Agenda, Information Strategy and Enterprise Architecture, Big Data is increasingly important. This is due to explosive growth in number of data source types: applications, digital media, mobiles, users, customers, unstructured data sets, sensors, emails, blogs etc. Data is complex and in mixed formats (text, video, audio), on-demand infrastructure scalability (including massively scalable storage) is needed to deliver Big Data capabilities, as are robust analytics and visualisation tools and techniques for distributed, parallel systems. Increasing bandwidth availability has also led to exponential data growth rates and capabilities e.g. social networks, video and microblogging.

Where do you start in formulating a reference architecture for Big Data and sourcing suppliers for a Big Data ecosystem?

Should you believe the Hype?

The Gartner Hype Cycle places Big Data on ‘the upslope’ towards the ‘peak of inflated expectations’. Big Data is of course already underpinning many of the web giant’s architectures (typically because necessity has been the mother of invention).

Figure 2: Gartner Hype Cycle for Emerging Tech (2011)

[Source: Gartner]

image

  • Facebook uses Hadoop to store copies of internal log and dimension data sources and as a source for reporting/analytics and machine learning. There are two clusters, a 1100-machine cluster with 8800 cores and about 12 PB raw storage and a a 300-machine cluster with 2400 cores and about 3 PB raw storage.
  • Yahoo! deploys more than 100,000 CPUs in > 40,000 computers running Hadoop. The biggest cluster has 4500 nodes (2*4cpu boxes w 4*1TB disk & 16GB RAM). This is used to support research for Ad Systems and Web Search and to do scaling tests to support development of Hadoop on larger clusters
  • eBay uses a 532 nodes cluster (8 * 532 cores, 5.3PB), Java MapReduce, Pig, Hive and HBase
  • Twitter uses Hadoop to store and process tweets, log files, and other data generated across Twitter. They use Cloudera’s CDH2 distribution of Hadoop. They use both Scala and Java to access Hadoop’s MapReduce APIs as well as Pig, Avro, Hive, and Cassandra.

Other Hadoop users include:  1&1, A9.com, About.com, Amazon.com, American Airlines, AOL, Apple, Booz Allen Hamilton, Cerner, ChaCha, comScore, EHarmony, Federal Reserve Board of Governors, foursquare, Fox Interactive Media, Freebase, Hewlett-Packard, IBM, InMobi, ImageShack, ISI, Joost, Last.fm, LinkedIn, Microsoft, Meebo, Mendeley, Metaweb, Netflix, The New York Times, Ning, Outbrain, Playdom (now part of Disney Interactive Media Group), Powerset (now part of Microsoft), Rackspace, Razorfish, StumbleUpon and Twitter.

Hadoop Overview

Figure 3: Hadoop Overview

[source: Steve Nimmons]

hadoop

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

Hadoop has Commons, MapReduce and Distributed File System capabilities (HDFS) as well as sub-projects: HBase, Cassandra, Avro, Hive, Mahout, Pig, ZooKeeper and Chukwa.

Given the pervasive nature of Hadoop, this is a strong contender for any Big Data implementation. HBase is the Hadoop database. Cassandra is also a NoSQL database. Mahout is a data mining and machine learning component, Hive and Pig are querying components, Zookeeper a coordination component.

Hadoop Distributions, such as that from Cloudera, bundle Apache Hadoop with other Open Source tools to create a more feature rich ‘platform’. The Cloudera distribution is definitely one to evaluate.

A simple Reference Model

In terms of implementing ‘Big Data’ architectures there are a number of choices, particularly in the visualisation and analytics space (refer to Figure 1). A simplified reference model is provided in Table 1. This will be expanded in a series of future posts on architectures for Big Data, exploring key features and design trade-offs.

Table 1: Simplified Big Data Reference Model

[source: Steve Nimmons]

Function

Candidate Options

Storage NoSQL Databases – e.g. Cassandra, HBase, Voldemort, Membase
Processing MapReduce
Query Hive, Pig (assuming Hadoop is being used)
Analytics & Visualisation Refer Figure 1 (and Mahout for Data Mining)

Data Loaders (e.g. Sqoop) and log management (e.g. Flume, Scribe) could also be included in the reference model / ecosystem.

Further Reading and Interesting Tools

Enhanced by Zemanta

Related Posts:

 
transform

Article originally published with the BCS in Feb 2009

Legacy applications are often defined as those that work. But how do you address the issue of modernisation of large COBOL code bases many of which often contain key IPR, are difficult to maintain and lack documentation, asks Steve Nimmons CEng FBCS CITP.

There a number of potential approaches to meet the technical and business objectives including stock migration, COTS replacement, service enablement of legacy, reengineering and lift and drop. Each has distinct advantages and disadvantages. Continue reading »

Related Posts:

© 2012 Steve Nimmons Suffusion theme by Sayontan Sinha