Tuesday, November 13, 2012

Storing Apache Hadoop WordCount Example Output to a Database

The Apache Hadoop WordCount example is the "Hello, World" of Hadoop, which makes it a convenient vehicle for showing how to sink Hadoop output into a database. The database I used is MySQL, and the DDL for the table is as follows;

CREATE TABLE word_count(word VARCHAR(254), count INT);
After creating the table, I wrote the following Apache Hadoop job, along with its Mapper and Reducer, to sink the output to the database. For this I use DBOutputFormat as the OutputFormat and DBConfiguration to specify the database connection parameters.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class WordCount {

    public static class WordCountMapper extends MapReduceBase implements Mapper<LongWritable, Text, DBOutput, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private static final DBOutput text = new DBOutput();

        public void map(LongWritable key, Text value,
                OutputCollector<DBOutput, IntWritable> collect, Reporter reporter)
                throws IOException {
            StringTokenizer token = new StringTokenizer(value.toString());
            while (token.hasMoreTokens()) {
                // Emit (word, 1) for every token; the key is the custom DBOutput type
                text.setText(token.nextToken());
                collect.collect(text, one);
            }
        }
    }

    public static class WordCountReducer extends MapReduceBase implements Reducer<DBOutput, IntWritable, DBOutput, IntWritable> {

        public void reduce(DBOutput key, Iterator<IntWritable> values,
                OutputCollector<DBOutput, IntWritable> collect, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            // The key carries both columns: DBOutputFormat writes the key to the table
            DBOutput dbKey = new DBOutput();
            dbKey.setText(key.getText());
            dbKey.setNo(sum);
            collect.collect(dbKey, new IntWritable(sum));
        }
    }

    public void run(String inputPath, String outputPath) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        // Ship the MySQL JDBC driver to the task nodes
        DistributedCache.addFileToClassPath(new Path("<Absolute Path>/mysql-connector-java-5.1.7-bin.jar"), conf);

        // the keys are DBOutput
        conf.setOutputKeyClass(DBOutput.class);
        // the values are counts (ints)
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.addInputPath(conf, new Path(inputPath));
        // setOutput also sets DBOutputFormat as the job's OutputFormat
        DBOutputFormat.setOutput(conf, "word_count", "word", "count");
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/sample", "root", "root");
        //FileOutputFormat.setOutputPath(conf, new Path(outputPath));

        JobClient.runJob(conf);
    }

    public static void main(String[] args) throws Exception {
        WordCount wordCount = new WordCount();
        wordCount.run(args[0], args[1]);
    }

    private static class DBOutput implements DBWritable, WritableComparable<DBOutput> {
        private String text;
        private int no;

        public void readFields(ResultSet rs) throws SQLException {
            text = rs.getString("word");
            no = rs.getInt("count");
        }

        public void write(PreparedStatement ps) throws SQLException {
            ps.setString(1, text);
            ps.setInt(2, no);
        }

        public void setText(String text) {
            this.text = text;
        }

        public String getText() {
            return text;
        }

        public void setNo(int no) {
            this.no = no;
        }

        public int getNo() {
            return no;
        }

        public void readFields(DataInput input) throws IOException {
            text = input.readUTF();
            no = input.readInt();
        }

        public void write(DataOutput output) throws IOException {
            output.writeUTF(text);
            output.writeInt(no);
        }

        public int compareTo(DBOutput o) {
            return text.compareTo(o.getText());
        }
    }
}
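The DataInput/DataOutput half of the custom key type is what Hadoop uses to shuffle keys between map and reduce, so it is worth checking that round trip outside of Hadoop. The sketch below is a standalone stand-in (the class name RoundTripDemo is mine, not part of the job) that writes the same field layout as DBOutput, a UTF string followed by an int, using only plain java.io streams, and reads it back in the same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RoundTripDemo {

    // Same field layout as DBOutput.write(DataOutput): a UTF string, then an int.
    static byte[] serialize(String text, int no) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeUTF(text);
        out.writeInt(no);
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = serialize("hadoop", 42);
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        // Mirrors DBOutput.readFields(DataInput): read in the same order as written
        System.out.println(in.readUTF() + "=" + in.readInt()); // prints hadoop=42
    }
}
```

If the read order ever drifts from the write order, the job fails at shuffle time with unhelpful EOF or UTF errors, so this is a cheap sanity check.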
Furthermore, I have written a custom Hadoop type for the key, which implements both DBWritable and WritableComparable, and used it as the output key class. The command to run this is as follows;
./bin/hadoop jar <Path to Jar>/HadoopTest.jar WordCount <Input Folder> <Dummy Output Folder>
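Because the job writes through DBOutputFormat, nothing useful lands in the dummy output folder; the counts end up in the word_count table defined above, and can be checked with an ordinary query (table and column names as in the DDL):

```sql
SELECT word, count
FROM word_count
ORDER BY count DESC
LIMIT 10;
```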


lotus said...


I just followed your blog and the job stored the data into the database as expected, but now I want to read that data back and I am running into a problem. It would be of great help if you could post a job that retrieves the same data from the DB.

Shazin Sadakath said...

Hi Lotus,

You can either use Sqoop to export the data from your database as flat files and use them as input to your MapReduce job, or you can read directly with DBInputFormat and a custom DBWritable.



Thanks man. Saved the day !!!


Hey, can you write a similar tutorial that reads data from the database? It would be very helpful!

Punit said...

I compiled this example, and after the map phase reaches 50%, I get an error saying "DBOutput cannot be cast to DBWritable".
Please help
