Chém gió

Wednesday, February 17, 2016

Show only the current directory in the bash prompt instead of the full path


Edit ~/.bashrc and change \w (the full path of the working directory) to \W (just the name of the current directory) in the PS1 definition.
change:
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\w\[\033[00m\]\$ '
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\w\$ '
to:
PS1='${debian_chroot:+($debian_chroot)}\[\033[01;32m\]\u@\h\[\033[00m\]:\[\033[01;34m\]\W\[\033[00m\]\$ '
PS1='${debian_chroot:+($debian_chroot)}\u@\h:\W\$ '
Then run source ~/.bashrc (or open a new terminal) for the change to take effect.

Friday, December 11, 2015

Send a list of JSON docs using curl with the POST method.

curl -H "Content-Type: application/json" -X POST -d '{"student_id":"1234567","test_id":"t123","answers":[{"question_id":"q1","answer":"aaa"},{"question_id":"q2","answer":"bbb"},{"question_id":"q3","answer":"abc"}]}' http://localhost:3008/testsave
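For reference, roughly the same request can be sent from Python with the requests library; this is only a sketch that mirrors the curl call above (same hypothetical endpoint and payload).

import requests

# Payload mirroring the curl example above
payload = {
    "student_id": "1234567",
    "test_id": "t123",
    "answers": [
        {"question_id": "q1", "answer": "aaa"},
        {"question_id": "q2", "answer": "bbb"},
        {"question_id": "q3", "answer": "abc"},
    ],
}

# json= serializes the dict and sets the Content-Type: application/json header
response = requests.post("http://localhost:3008/testsave", json=payload)
print(response.status_code, response.text)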



Monday, August 17, 2015

[Datastax] Cassandra Java Driver

I have been working with Cassandra for a while.
Today I'm going to blog about how to connect to Cassandra from Java using the DataStax driver.
You can download the driver here and compile it.
In this post I used these jar files: cassandra-driver-core-2.1.8-SNAPSHOT.jar, which was compiled in the step above.
Also included are guava-18.0.jar, metrics-core-3.0.0.jar, netty-all-4.0.12.Final.jar, and, for logging purposes, the slf4j-*.jar files.
In order to connect to a Cassandra cluster we need two parameters:

  • contact points: a list of IP addresses of your cluster's contact points.
  • keyspace name: the name of the keyspace in the Cassandra cluster you want to connect to.
       

    // Contact points: IP addresses of nodes used for the initial connection
    String[] contactPoints = {"10.10.10.10", "10.10.10.20"};
    String keyspaceName = "test";

    // Build the Cluster object and open a Session bound to the keyspace
    Cluster cluster = Cluster.builder().addContactPoints(contactPoints).build();
    Session session = cluster.connect(keyspaceName);
       

After this step we will use the session object to issue our queries.
For example, to add a column to an existing table:
       
     String columnFamily = "test_table";
     String colName = "new_column";
     String dataType = "text";
     String query = String.format("ALTER TABLE %s.%s ADD %s %s",
             keyspaceName, columnFamily, colName, dataType);

     try {
         session.execute(query);
     } catch (Exception e) {
         // slf4jLogger is the SLF4J Logger instance used by this class
         slf4jLogger.error(e.toString());
     }

Insert records into an existing table, in this case the users table:
     session.execute(
         "INSERT INTO users (firstname, lastname, age, city, email ) VALUES ('Tin', 'Ho', 27, 'Sai Gon', 'imthientin@gmail.com')");

To read data from an existing table:

     ResultSet results = session.execute("SELECT * FROM users WHERE firstname='Tin'");
     for (Row row : results) {
         System.out.format("%s %d\n", row.getString("email"), row.getInt("age"));
     }
Some other things we can do:

  • Drop a column (non-primary-key columns only)
  • Drop a table/column family
  • Drop a keyspace
  • Rename a column (primary-key columns only)
  • etc ...
My GitHub repo with the source code:
https://github.com/thientin/cassandra-utils
Hope this is useful.

Tuesday, May 5, 2015

Things to pay attention to when scraping website content

I am trying to automatically get a list of products for a keyword search on the Amazon website.
I use BeautifulSoup and urllib2.
But the website I see in the browser and what I scraped were slightly different (on the website I see more items).
After googling around I found that when we use an automated tool to scrape a website, we have to fake a browser by providing a User-Agent header. We can do this as follows:
>>> import urllib2
>>> opener = urllib2.build_opener()
>>> opener.addheaders = [('User-agent', 'Mozilla/5.0')]
>>> url = "http://www.amazon.com/Sony-D6653-International-Version-Warranty/dp/B00TLAIFDE/ref=sr_1_1?ie=UTF8&qid=1430831439&sr=8-1&keywords=sony+z3"
>>> response = opener.open(url)
>>> page = response.read()
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(page)
Works like a charm :D
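For comparison, here is a rough equivalent using the requests library and bs4 instead of urllib2; the URL is the one from the example above and the parser choice is just an illustration.

import requests
from bs4 import BeautifulSoup

url = "http://www.amazon.com/Sony-D6653-International-Version-Warranty/dp/B00TLAIFDE/ref=sr_1_1?ie=UTF8&qid=1430831439&sr=8-1&keywords=sony+z3"

# Send a browser-like User-Agent so the site serves the same page a browser would see
headers = {"User-agent": "Mozilla/5.0"}
response = requests.get(url, headers=headers)

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title)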

Monday, January 19, 2015

Hadoop Ecosystem quick reference

Hadoop Ecosystem


  • Data storage: this is where raw data resides. There are multiple data file systems supported by Hadoop.
    • HDFS: comes with the Hadoop framework. Big files are split into chunks and these chunks are replicated automatically over the cluster.
    • Amazon S3: this comes from Amazon Web Services (AWS) and is internet-based storage. Performance might be negatively affected by network traffic.
    • MapR-FS: provides higher availability and higher performance than HDFS. Comes with MapR's Hadoop distribution.
    • HBase: this is a columnar, multidimensional database derived from Google Bigtable, also based on the HDFS filesystem. It maintains data in partitions and therefore can serve data efficiently in sorted order.
  • Data access: this layer helps in accessing data from multiple stores.
    • Hive: SQL-like querying capabilities running on top of Hadoop.
    • Pig: this is a data flow engine and multiprocess execution framework. Its scripting language is called Pig Latin. The Pig interpreter translates these scripts into MapReduce jobs (a minimal hand-written MapReduce example is sketched right after this list).
    • Avro: this is one of the serialization systems, which provides a rich data format, a container file to store persistent data, remote procedure calls, and so on. It uses JSON to define data types, and data is serialized into a compact binary format.
    • Mahout: machine learning software with core algorithms such as recommendation, collaborative filtering, and clustering. The algorithms are implemented on top of Hadoop using the MapReduce framework.
    • Sqoop: this is used to transfer data between the Hadoop world and the RDBMS world.
  • Management layer: this comprises tools that assist in administering the Hadoop infrastructure.
    • Oozie: this is a workflow scheduler system to manage Apache Hadoop jobs.
    • Elastic MapReduce: this provisions the Hadoop cluster, runs and terminates jobs, and handles data transfer between EC2 and S3.
    • Chukwa: a data collection system for monitoring large distributed systems. Built on top of HDFS and the MapReduce framework.
    • Flume: this is a distributed service comprising multiple agents that collect, aggregate, and move large amounts of log data.
    • ZooKeeper: this provides open source distributed coordination and synchronization services as well as a naming registry for large distributed systems. ZooKeeper's architecture supports high availability through redundant services. It uses a hierarchical filesystem and is fault tolerant and high performing, facilitating loose coupling. ZooKeeper is already used by many Apache projects such as HDFS and HBase, and it is run in production by Yahoo!, Facebook, and Rackspace.
  • Data analytics: third-party software for understanding data and getting insights from it.
    • Pentaho: this has the capability of data integration (Kettle), analytics, reporting, creating dashboards, and predictive analytics directly from the Hadoop nodes. It is available with enterprise support as well as a community edition.
    • Storm: this is a free and open source distributed, fault-tolerant, real-time computation system for unbounded streams of data.
    • Splunk: this is an enterprise application which can perform real-time and historical searches, reporting, and statistical analysis. It also provides a cloud-based flavor, Splunk Storm.
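To make the MapReduce layer referenced throughout this list a bit more concrete, here is a minimal word-count sketch written for Hadoop Streaming in Python; the script names and any input/output paths are assumptions for illustration only.

#!/usr/bin/env python
# mapper.py - read raw lines from stdin and emit "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word, 1))

#!/usr/bin/env python
# reducer.py - sum the counts per word; Hadoop sorts the mapper output by key before this runs
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)

if current_word is not None:
    print("%s\t%d" % (current_word, current_count))

The two scripts would be submitted with the hadoop-streaming jar (-mapper mapper.py -reducer reducer.py plus HDFS -input and -output paths), which is essentially the kind of job that Pig and Hive generate for you.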

Saturday, January 10, 2015

Scrape all links on a webpage using BeautifulSoup in Python

What I am trying to do is manipulate an HTML document.
After searching around for a while, trying the wget utility on Ubuntu, the html2text lib, and some XML manipulation libs like DOM and etree (quite hardcore), I found BeautifulSoup, a very cool lib for scraping the web and parsing HTML documents.
Documentation for BeautifulSoup can be found here
I'll just list some of BeautifulSoup's awesome features here:

  • Easy navigation through an HTML document
  • Handles tags, names, and attributes easily
  • Searching ...

Let's do it.

> sudo pip install beautifulsoup4
> sudo pip install requests
import requests
from bs4 import BeautifulSoup as bfs

# Fetch the page and parse its HTML
url = raw_input("Enter url you want to scrape ")
r = requests.get(url)
data = r.text
soup = bfs(data)

# Print the href attribute of every <a> tag on the page
for link in soup.find_all('a'):
    print (link.get('href'))

or you can grab the text content of every <p> tag:

for link in soup.find_all('p'):
    print (link.text)
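find_all can also filter on attributes, which is handy when you only want certain links or elements; a small sketch (the 'product' class name here is made up for illustration):

# Only anchors that actually have an href attribute
for link in soup.find_all('a', href=True):
    print (link['href'])

# Only tags with a specific CSS class (the class name is hypothetical)
for tag in soup.find_all('div', class_='product'):
    print (tag.get_text())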

That's it.

Tuesday, January 6, 2015

Using RHive to connect from RStudio to Hive Server on Hadoop Cluster

Scenario:
1. We have a cluster running Hadoop that stores, for example, web logs.
2. We have Hive running on top of the Hadoop cluster to query data from our web logs.
3. You are a data analyst / scientist (like me :D) using RStudio and you might want to use Hive from RStudio: the big benefit of this is that you can store the result of a query in an R data frame for later processing (if you issue the query from the Hive CLI you might not be able to store the result).
How to do that?
1. Install the RHive package here; you can also find docs on that website.
Just one thing to remember: RHive requires the rJava package, and to install rJava you need to install the JDK from Oracle.
2. I assume that you have already successfully set up a cluster running Hadoop and Hive.
3. From the master node of your cluster, start the Hive server using:
hive/bin/> ./hive --service hiveserver2
This command starts HiveServer2 listening on port 10000 by default.
4. From your machine, after successfully installing RHive, type this:
library("RHive")
Sys.setenv(HADOOP_HOME='/home/tin/tools/hadoop')
Sys.setenv(HIVE_HOME='/home/tin/tools/hive')
Sys.setenv(HADOOP_CONF_DIR='/home/tin/tools/hadoop/conf')
rhive.init()
rhive.env()
rhive.connect(host="ip_of_your_master",user="user_running_hadoop_on_master")

Attention:

On your system you need to download Hadoop and Hive; as you can see above, I put hadoop and hive in the tools directory in my home directory.
You might need to configure Hadoop and Hive to point to your JDK.
You don't need to actually run Hadoop or Hive on your system.
These local copies are only there so that RStudio can find the libraries it needs to run.

That's it, you can now issue queries from RStudio using Apache Hive.

In case you have any problem with this procedure, just email me at imthientin(at)gmail.com. I'm willing to help.