Chém gió

Tuesday, January 6, 2015

Using RHive to connect from RStudio to Hive Server on Hadoop Cluster

Scenario:
1. We have a cluster running Hadoop that stores, for example, web logs.
2. We have Hive running on top of the Hadoop cluster to query the data in our web logs.
3. You are a data analyst / scientist (like me :D) using RStudio, and you want to use Hive from RStudio. The big benefit is that you can store the result of a query in an R data frame for later processing (if you issue the query from the Hive CLI, you might not keep the result).
How to do that?
1. Install the RHive package from here; you can also find the docs on that website.
Just one thing to remember: RHive requires the rJava package, and to install rJava you need to install the JDK from Oracle.
2. I assume that you have already successfully set up a cluster running Hadoop and Hive.
3. On the master node of your cluster, start the Hive server using:
hive/bin/> ./hive --service hiveserver2
This command starts the Hive server, which listens on port 10000 by default.
4. On your own machine, after successfully installing RHive, type this:
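library("RHive")
# Point RHive at the local copies of Hadoop and Hive (adjust to your own paths)
Sys.setenv(HADOOP_HOME='/home/tin/tools/hadoop')
Sys.setenv(HIVE_HOME='/home/tin/tools/hive')
Sys.setenv(HADOOP_CONF_DIR='/home/tin/tools/hadoop/conf')
# Initialise RHive with the settings above, then print them to double-check
rhive.init()
rhive.env()
# Connect to the Hive server running on the master node
rhive.connect(host="ip_of_your_master",user="user_running_hadoop_on_master")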
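
If rhive.connect hangs or fails, it can help to first confirm that the Hive server port is reachable from your machine. The helper below is a minimal sketch in base R. The name hive_port_open is my own and is not part of RHive, and it only checks that something accepts TCP connections on the port, not that it is really the Hive server.

```r
# Hypothetical helper (not part of RHive): returns TRUE if a TCP connection
# to host:port can be opened within `timeout` seconds, FALSE otherwise.
hive_port_open <- function(host, port = 10000, timeout = 5) {
  con <- tryCatch(
    suppressWarnings(
      socketConnection(host = host, port = port, blocking = TRUE,
                       open = "r+", timeout = timeout)
    ),
    error = function(e) NULL  # connection refused, timed out, or host unknown
  )
  if (is.null(con)) {
    return(FALSE)
  }
  close(con)
  TRUE
}

# Example call (replace the placeholder with the address of your master node):
# hive_port_open("ip_of_your_master")
```

If this returns FALSE, check that the Hive server is still running on the master and that no firewall blocks port 10000.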

Attention:

You need to download Hadoop and Hive onto your own system; as you can see above, I put them in the tools directory under my home directory.
You might need to configure Hadoop and Hive to point to your JDK.
You don't need to run Hadoop or Hive on your system.
They are only there so that RStudio can find the libraries it needs to run.
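
A wrong path in one of the environment variables is an easy mistake to make, so here is a small sketch in base R that reports which of the configured directories actually exist on your machine. The function name check_local_dirs is my own, and including JAVA_HOME assumes that this variable is how your JDK is located, which may not match your setup.

```r
# Hypothetical helper: takes a named character vector of directory paths and
# returns a named logical vector saying which of them exist.
check_local_dirs <- function(dirs) {
  found <- dir.exists(dirs)
  names(found) <- names(dirs)
  found
}

# Read the variables set with Sys.setenv() in step 4.
# JAVA_HOME is an assumption about how the JDK is located on your system.
dirs <- c(
  HADOOP_HOME     = Sys.getenv("HADOOP_HOME"),
  HIVE_HOME       = Sys.getenv("HIVE_HOME"),
  HADOOP_CONF_DIR = Sys.getenv("HADOOP_CONF_DIR"),
  JAVA_HOME       = Sys.getenv("JAVA_HOME")
)

# An unset variable or a missing directory shows up as FALSE.
print(check_local_dirs(dirs))
```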

That's it. You can now issue queries to Apache Hive from RStudio.
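
To show what storing a query result in a data frame can look like, here is a minimal sketch built on RHive's rhive.query function. The helper count_by_column, the table web_logs and the column status are hypothetical names for illustration only. The query_fun parameter exists just so the helper can be tried without a cluster; by default it uses rhive.query and assumes rhive.connect has already succeeded as in step 4.

```r
# Hypothetical helper: count rows per distinct value of `column` in `table`
# and return the result as a data frame.
# `table` and `column` are pasted into the query text as-is, so only pass
# names you trust.
count_by_column <- function(table, column, query_fun = RHive::rhive.query) {
  sql <- sprintf("SELECT %s, COUNT(*) AS hits FROM %s GROUP BY %s",
                 column, table, column)
  query_fun(sql)
}

# Example call after rhive.connect() (web_logs and status are made-up names):
# status_counts <- count_by_column("web_logs", "status")
# head(status_counts)   # an ordinary R data frame
# rhive.close()         # close the connection when you are done
```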

If you have any problems with this procedure, just email me: imthientin(at)gmail.com. I'm willing to help.
