Chém gió

Monday, January 19, 2015

Hadoop Ecosystem quick reference

Hadoop Ecosystem


  • Data storage: This is where raw data resides. Hadoop supports multiple file systems and data stores.
    • HDFS: comes with the Hadoop framework. Big files are split into blocks, and these blocks are replicated automatically across the cluster.
    • Amazon S3: an internet-based storage service from Amazon Web Services (AWS). Performance might be negatively affected by network traffic.
    • MapR-FS: provides higher availability and higher performance than HDFS. It comes with MapR's Hadoop distribution.
    • HBase: a column-oriented, multidimensional database modeled on Google Bigtable and built on top of the HDFS file system. It keeps data in sorted partitions and can therefore serve data efficiently in sorted order.
  • Data access: this layer helps access data from multiple stores.
    • Hive: provides SQL-like querying capabilities that run on top of Hadoop.
    • Pig: a data flow engine and multiprocess execution framework. Its scripting language is called Pig Latin. The Pig interpreter translates these scripts into MapReduce jobs.
    • Avro: a serialization system that provides a rich data format, a container file to store persistent data, remote procedure calls, and so on. It uses JSON to define data types, and data is serialized in a compact binary format.
    • Mahout: machine learning software with core algorithms such as recommendation, collaborative filtering, and clustering. The algorithms are implemented on top of Hadoop using the MapReduce framework.
    • Sqoop: used to transfer data between the Hadoop world and the RDBMS world.
  • Management layer: this comprises tools that assist in administering the Hadoop infrastructure.
    • Oozie: a workflow scheduler system to manage Apache Hadoop jobs.
    • Elastic MapReduce: provisions the Hadoop cluster, runs and terminates jobs, and handles data transfer between EC2 and S3.
    • Chukwa: a data collection system for monitoring large distributed systems. It is built on top of HDFS and the MapReduce framework.
    • Flume: a distributed service comprising multiple agents that collect, aggregate, and move streaming data such as logs.
    • ZooKeeper: provides open source distributed coordination and synchronization services, as well as a naming registry for large distributed systems. ZooKeeper's architecture supports high availability through redundant services. It uses a hierarchical, filesystem-like namespace and is fault tolerant and high performing, facilitating loose coupling. ZooKeeper is already used by many Apache projects such as HDFS and HBase, and it is run in production by Yahoo!, Facebook, and Rackspace.
  • Data analytics: third-party software for understanding data and getting insights from it.
    • Pentaho: offers data integration (Kettle), analytics, reporting, dashboards, and predictive analytics directly from the Hadoop nodes. It is available with enterprise support as well as in a community edition.
    • Storm: a free and open source distributed, fault-tolerant, real-time computation system for unbounded streams of data.
    • Splunk: an enterprise application that can perform real-time and historical searches, reporting, and statistical analysis. It also provides a cloud-based flavor, Splunk Storm.
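
To make the HDFS bullet above a bit more concrete, here is a rough Python sketch of how a big file turns into blocks and replicas. The 128 MB block size and the replication factor of 3 are assumptions (they are common defaults, but both are configurable per cluster), and the function name hdfs_footprint is just something made up for this illustration.

```python
import math


def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Estimate how a file is laid out in HDFS.

    Returns a tuple: (number of blocks, number of block replicas stored,
    raw storage used in MB). Block size and replication are assumed values.
    """
    if file_size_mb < 0 or block_size_mb <= 0 or replication < 1:
        raise ValueError("sizes must be positive and replication >= 1")
    # A file is cut into fixed-size blocks; the last block may be smaller.
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is copied `replication` times across the cluster.
    replicas = blocks * replication
    # The last block only occupies its real size, so raw usage is size * replication.
    raw_mb = file_size_mb * replication
    return blocks, replicas, raw_mb


blocks, replicas, raw_mb = hdfs_footprint(1000)
print(blocks, replicas, raw_mb)  # 8 24 3000
```

So under these assumptions a 1000 MB file is stored as 8 blocks, and the cluster keeps 24 block copies that use about 3000 MB of raw disk.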

Saturday, January 10, 2015

Scrape all links on a web page using BeautifulSoup in Python

What I am trying to do is manipulate an HTML document.
After searching around and trying the wget utility on Ubuntu, the html2text library, and some XML manipulation libraries such as DOM and etree (quite hardcore for a while), I found BeautifulSoup, a very cool library for scraping the web and parsing HTML documents.
Documentation for BeautifulSoup can be found here
I will just list some awesome features of BeautifulSoup:

  • Easy navigation through an HTML document
  • Easy handling of tags, names, and attributes
  • Searching ...

Let's do it.

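> sudo pip install beautifulsoup4
> sudo pip install requests

import requests
from bs4 import BeautifulSoup as bfs

# Python 3: input() replaces Python 2's raw_input()
url = input("Enter the URL you want to scrape: ")
r = requests.get(url)
data = r.text
# Name the parser explicitly so bs4 does not have to guess one
soup = bfs(data, "html.parser")
# Print the href attribute of every <a> tag (None if the tag has no href)
for link in soup.find_all('a'):
    print(link.get('href'))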

Or you can print the text content of every <p> tag:

for paragraph in soup.find_all('p'):
    print(paragraph.text)
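
One thing to watch out for: link.get('href') returns None for an <a> tag without an href, and many hrefs are relative paths or page anchors. Below is a small sketch, assuming Python 3 and only the standard library, that turns a list of raw href values into clean absolute URLs. The helper name absolute_links and the example.com URLs are made up for the illustration.

```python
from urllib.parse import urljoin, urldefrag


def absolute_links(base_url, hrefs):
    """Resolve raw href values against base_url, dropping empties and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # href is None when the <a> tag has no href attribute
        href = (href or "").strip()
        # Skip empty values, in-page anchors, and non-web schemes
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            continue
        # Make the link absolute, then drop any trailing #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


hrefs = ["/about", None, "#top", "post.html#comments",
         "http://example.com/about", "mailto:me@example.com"]
print(absolute_links("http://example.com/blog/", hrefs))
# ['http://example.com/about', 'http://example.com/blog/post.html']
```

With the scraper above you could feed it the page URL and the hrefs collected from soup.find_all('a').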

That's it.

Tuesday, January 6, 2015

Using RHive to connect from RStudio to Hive Server on Hadoop Cluster

Scenario:
1. We have a cluster running Hadoop that stores, for example, web logs.
2. We have Hive running on top of the Hadoop cluster to query data from our web logs.
3. You are a data analyst / scientist (like me :D) using RStudio, and you might want to use Hive from RStudio. The big benefit is that you can store the result of a query in an R data frame for later processing (if you issue the query from the Hive CLI, the result is not kept for you).
How to do that?
1. Install the RHive package from here; you can also find the docs on that website.
Just one reminder: RHive requires the rJava package, and to install rJava you need to install the JDK from Oracle.
2. I assume that you have already set up a cluster running Hadoop and Hive successfully.
3. On the master node of your cluster, start the Hive server using:
hive/bin/> ./hive --service hiveserver2
This command starts the Hive server, which listens on port 10000 by default.
4. On your machine, after successfully installing RHive, type this:
library("RHive")
Sys.setenv(HADOOP_HOME='/home/tin/tools/hadoop')
Sys.setenv(HIVE_HOME='/home/tin/tools/hive')
Sys.setenv(HADOOP_CONF_DIR='/home/tin/tools/hadoop/conf')
rhive.init()
rhive.env()
rhive.connect(host="ip_of_your_master",user="user_running_hadoop_on_master")

Attention:

On your own machine you need to download Hadoop and Hive; as you can see above, I put Hadoop and Hive in the tools directory under my home directory.
You might need to configure Hadoop and Hive to point to your JDK.
You don't need to run Hadoop or Hive on your machine.
They are only there so that RStudio can find the libraries it needs to run.

That's it. You can now issue queries from RStudio using Apache Hive.

If you have any problem with this procedure, just email me: imthientin(at)gmail.com. I'm willing to help.

Thursday, January 1, 2015

[Ubuntu] Change gnome-terminal title dynamically

For some reason you might want to change your terminal title.
1. Open a terminal.
2. Edit the file ~/.bashrc and add this function at the end of the file:

function title {
    echo -en "\033]2;$1\007"
}
Comment out the PS1 line that sets the title, otherwise the prompt resets the title every time it is displayed: #PS1="\[\e]0;${debian_chroot:+($debian_chroot)}\u@\h: \w\a\]$PS1"
Save the file.
3. From a terminal in your home directory, type:
> source .bashrc
so that the setting takes effect in the current shell.
To change the title of the current terminal, just type:
> title "Your Title Here"
That's it.
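
If you are curious what the title function actually sends, it is an xterm-style control sequence: ESC, then "]2;", then the title text, then BEL. Here is a minimal Python sketch of the same idea, in case you want to set the title from a script instead of from bash. The names title_sequence and set_title are made up for this example, and it assumes a terminal (such as gnome-terminal) that understands this sequence.

```python
import sys

ESC = "\033"  # escape character, starts the control sequence
BEL = "\007"  # bell character, ends the control sequence


def title_sequence(title):
    """Build the same bytes the bash function echoes: ESC ] 2 ; title BEL."""
    return f"{ESC}]2;{title}{BEL}"


def set_title(title, stream=sys.stdout):
    """Write the control sequence to the terminal and flush it right away."""
    stream.write(title_sequence(title))
    stream.flush()
```

Calling set_title("Your Title Here") from a script running in the terminal should have the same effect as the title function above, as long as PS1 does not overwrite the title again.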