Chém gió

Monday, January 19, 2015

Hadoop Ecosystem quick reference

Hadoop Ecosystem


  • Data storage: This is where raw data resides. There are multiple data file systems supported by Hadoop.
    • HDFS: This comes with the Hadoop framework. Big files are split into blocks, and these blocks are replicated automatically across the cluster.
    • Amazon S3: This comes from Amazon Web Services (AWS) and is internet-based storage. Performance might be negatively affected by network traffic.
    • MapR-FS: This provides higher availability and higher performance than HDFS. It comes with MapR's Hadoop distribution.
    • HBase: This is a columnar, multidimensional database derived from Google's Bigtable, also built on top of HDFS. It maintains data in sorted partitions and can therefore serve data efficiently in sorted order.
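The HDFS behavior described above, splitting big files into fixed-size blocks and replicating each block across several nodes, can be sketched as a toy model. This is a conceptual illustration only, not the real HDFS API; the block size, node names, and round-robin placement below are all illustrative (real HDFS defaults to 128 MB blocks and a replication factor of 3, with rack-aware placement):

```python
# Toy model of HDFS-style block splitting and replica placement.
# Block size, node names, and round-robin placement are illustrative only.
import itertools

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Split a file's bytes into fixed-size blocks (the last may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> list[list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    placements = []
    ring = itertools.cycle(range(len(nodes)))
    for _ in range(num_blocks):
        start = next(ring)
        placements.append([nodes[(start + r) % len(nodes)] for r in range(replication)])
    return placements

data = b"x" * 1000  # a "big" file of 1000 bytes for demonstration
blocks = split_into_blocks(data, block_size=256)
nodes = ["node1", "node2", "node3", "node4"]
placements = place_replicas(len(blocks), nodes, replication=3)
```

With these numbers the 1000-byte file becomes four blocks, and each block lives on three distinct nodes, so losing any single node still leaves two copies of every block.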
  • Data access: this layer helps in accessing data from multiple stores.
    • Hive: This provides SQL-like querying capabilities (HiveQL) on top of Hadoop.
    • Pig: This is a data flow engine and multiprocess execution framework. Its scripting language is called Pig Latin. The Pig interpreter translates these scripts into MapReduce jobs.
    • Avro: This is a serialization system, which provides rich data structures, a container file to store persistent data, remote procedure calls, and so on. It uses JSON to define data types, and data is serialized into a compact binary format.
    • Mahout: This is machine learning software with core algorithms such as recommendation, collaborative filtering, and clustering. The algorithms are implemented on top of Hadoop using the MapReduce framework.
    • Sqoop: This is used to transfer data between the Hadoop world and the RDBMS world.
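Hive queries and Pig Latin scripts are both ultimately compiled down to MapReduce jobs. A minimal sketch of the MapReduce model itself, using word count as the canonical example (the map, shuffle, and reduce phases here are simulated in-process in plain Python, not distributed):

```python
# In-process sketch of the MapReduce model: map -> shuffle -> reduce.
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["hadoop pig hive", "pig hive", "hive"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

A one-line HiveQL aggregation or a Pig Latin `GROUP ... FOREACH ... COUNT` script expands to essentially this shape, with the shuffle handled by the framework across the cluster.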
  • Management layer: This comprises tools that assist in administering the Hadoop infrastructure.
    • Oozie: This is a workflow scheduler system to manage Apache Hadoop jobs.
    • Elastic MapReduce: This is an AWS service that provisions Hadoop clusters, runs and terminates jobs, and handles data transfer between EC2 and S3.
    • Chukwa: This is a data collection system for monitoring large distributed systems. It is built on top of HDFS and the MapReduce framework.
    • Flume: This is a distributed service comprising multiple agents that collect, aggregate, and move large amounts of log data into HDFS.
    • ZooKeeper: This provides open source distributed coordination and synchronization services, as well as a naming registry for large distributed systems. ZooKeeper's architecture supports high availability through redundant services. It uses a hierarchical namespace and is fault tolerant and high performing, facilitating loose coupling. ZooKeeper is already used by many Apache projects such as HDFS and HBase, and it runs in production at Yahoo!, Facebook, and Rackspace.
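ZooKeeper's hierarchical namespace, mentioned above, is a filesystem-like tree of "znodes", each holding a small piece of data. A toy in-memory model of that idea (the paths and values are illustrative, and the real client API is very different from this sketch):

```python
# Toy in-memory model of ZooKeeper's hierarchical znode namespace.
class ZNodeTree:
    def __init__(self):
        self.nodes = {"/": b""}  # path -> data; the root always exists

    def create(self, path: str, data: bytes) -> None:
        """Create a znode; like ZooKeeper, the parent must already exist."""
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.nodes:
            raise KeyError(f"parent znode {parent!r} does not exist")
        self.nodes[path] = data

    def get(self, path: str) -> bytes:
        """Read the data stored at a znode."""
        return self.nodes[path]

tree = ZNodeTree()
tree.create("/services", b"")
tree.create("/services/hbase", b"master=node1:16000")  # illustrative value
```

Coordination patterns such as service discovery and leader election are built on top of exactly this kind of small, consistent, hierarchical store.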
  • Data analytics: This is third-party software for understanding data and getting insights from it.
    • Pentaho: This provides data integration (Kettle), analytics, reporting, dashboards, and predictive analytics directly from the Hadoop nodes. It is available with enterprise support as well as a community edition.
    • Storm: This is a free and open source distributed, fault tolerant, and real-time computation system for unbounded streams of data.
    • Splunk: This is an enterprise application, which can perform real-time and historical searches, reporting, and statistical analysis. It also provides the cloud-based flavor, Splunk Storm.
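Storm's model of continuous computation over an unbounded stream can be pictured with Python generators: a "spout" yields tuples indefinitely, and a "bolt" updates state on every tuple. This is a conceptual sketch only; Storm's actual topology API (spouts, bolts, and stream groupings wired into a cluster) looks nothing like this:

```python
# Conceptual sketch of stream processing: an unbounded spout feeding a
# stateful counting bolt. Event names are illustrative.
from collections import Counter
import itertools

def spout():
    """Spout: an unbounded source of tuples (here, a repeating event stream)."""
    for event in itertools.cycle(["click", "view", "click"]):
        yield event

def counting_bolt(stream):
    """Bolt: update a running count on every tuple, emitting a snapshot each time."""
    counts = Counter()
    for event in stream:
        counts[event] += 1
        yield dict(counts)

# The stream never ends, so consume only a finite prefix for demonstration.
snapshots = list(itertools.islice(counting_bolt(spout()), 6))
```

The key contrast with MapReduce is that there is no final "end of input": results are continuously updated as tuples arrive, which is what makes this style suitable for real-time analytics.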
