Chém gió

Thursday, January 2, 2014

UDF in Hive

When working with log files, you sometimes need to clean up or normalize the data. For example, you track users' devices, but there are many models out there: Samsung Galaxy, Samsung Galaxy I, Samsung Galaxy II, Samsung Trend, Samsung Note ..., LG LTE, LG Vu, LG ..., Nokia, Lumia ... and you only want to track the brand name of the device: Samsung, LG, Nokia, Huawei, Lenovo and so on. That is difficult to do in a plain Hive query, so Hive lets us write a UDF (user-defined function) to transform the data.
UDFs are written in Java (or another JVM language); code in other languages such as Python can be plugged in as a script through Hive's TRANSFORM clause.
In Java: create a class that extends the UDF class. For example, my class is ISP:
public final class ISP extends UDF {}
Inside this class we have to implement one method named evaluate:
public String evaluate(final String S){}
Note: in the Java build path settings, add hadoop-core-0.x.jar (x is the version) and hive-exec-0.x.jar (both can be found in Hive's lib dir).
So in general our class looks something like this:
package your_package;

import org.apache.hadoop.hive.ql.exec.UDF;

public final class ISP extends UDF
{
    // Hive calls evaluate once per row with the column value.
    public String evaluate(final String S)
    {
        // your code goes here
        return "string after processing";
    }
}
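To make the device example from the top of this post concrete, here is a minimal sketch of what the body of evaluate could look like. It is written as plain Java so it runs without the Hive jars; in a real UDF the class would extend UDF as shown above and keep the same evaluate method. The class name DeviceBrand, the brand list, and the "other" fallback are my own assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Plain-Java sketch of the brand lookup (hypothetical class name).
// In a real UDF this class would extend org.apache.hadoop.hive.ql.exec.UDF.
final class DeviceBrand
{
    // Assumed brand list, taken from the examples in this post.
    private static final List<String> BRANDS =
        Arrays.asList("samsung", "lg", "nokia", "huawei", "lenovo");

    public String evaluate(final String model)
    {
        // Hive passes SQL NULL as a Java null, so guard against it.
        if (model == null) {
            return null;
        }
        // Compare case-insensitively and ignore surrounding whitespace.
        final String normalized = model.trim().toLowerCase(Locale.ROOT);
        for (final String brand : BRANDS) {
            if (normalized.startsWith(brand)) {
                return brand;
            }
        }
        // Assumed fallback for models that match no known brand.
        return "other";
    }
}
```

With this sketch, evaluate("Samsung Galaxy II") returns "samsung" and evaluate("LG Vu") returns "lg", while a null input stays null.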

Export a jar file from your UDF class, for example: isp.jar
From the Hive CLI:
>add jar /path/to/isp.jar;
>create temporary function isp_name as 'your_package.ISP';
From now on you can use your UDF in a select:
>select isp_name(column_name) from table_name;
P.S.: these commands only take effect for the current session. That means that after we exit the Hive CLI and log in again, Hive no longer recognizes the isp_name function.
If you want to use this UDF often without typing the commands above again and again, add them to the .hiverc file in Hive's conf dir (create it if needed).
----------
That's all.