Powering AzureML with Hadoop HBase

Hello World!

Today is a freaking cool day.  Why, you ask?  Because today I am writing an article on how to use two of the coolest freaking big data/data science tools out there together to do epic shit!  Let's start with HBase.  HBase gives you a big data solution with query performance at an interactive level, which is why so many folks are starting to just dump data into HBase.  In the Project Teddy solution, we are dumping tweets, dialogue, and dialogue annotations to power our open domain conversational API.  For us, there really is no easier way to do this.

The second part of Project Teddy is to predict, based on an incoming conversational component, what sort of response the speaker is attempting to elicit from the teddy bear.  If we can power our teddy bear with predictive analytics and big data, that would be perfect.  What better platform to do this quickly and easily than AzureML?

This is a follow-up article to this one: http://indiedevspot.com/2015/06/30/writing-tweets-to-hbase-simply/

The HBase Table

From the previous article, we created a table in HBase and wrote rows to it with the following code, giving each row a rowKey plus 5 string columns in the "d" column family.

        private void CreateTweetByWordsCells(CellSet set, TweetSentimentData tweet)
        {
            // Create a row with a key
            var row = new CellSet.Row { key = Encoding.UTF8.GetBytes(tweet.Id) };
            // Add columns to the row
            row.values.Add(
                new Cell { column = Encoding.UTF8.GetBytes("d:Text"), 
                    data = Encoding.UTF8.GetBytes(tweet.Text) });
            row.values.Add(
                new Cell { column = Encoding.UTF8.GetBytes("d:CreatedOn"),
                    data = Encoding.UTF8.GetBytes(tweet.CreatedOn.ToString()) });
            row.values.Add(
                new Cell { column = Encoding.UTF8.GetBytes("d:ReplyToId"),
                    data = Encoding.UTF8.GetBytes(tweet.ReplyToId) });
            row.values.Add(
                new Cell { column = Encoding.UTF8.GetBytes("d:Sentiment"),
                    data = Encoding.UTF8.GetBytes(tweet.Sentiment.ToString()) });
            if (tweet.Coordinates != null)
            {
                row.values.Add(
                    new Cell { column = Encoding.UTF8.GetBytes("d:Coordinates"),
                        data = Encoding.UTF8.GetBytes(tweet.Coordinates) });
            }
            set.rows.Add(row);
        }
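
For context, below is a rough sketch of how that table gets created and how the CellSet gets written to it.  This is a minimal sketch assuming the Microsoft.HBase.Client REST SDK used in the previous article; the cluster URI, credentials and table name are placeholders, and exact method names (CreateTable/StoreCells vs. CreateTableAsync/StoreCellsAsync) vary between SDK versions.

// Minimal sketch, assuming the Microsoft.HBase.Client REST SDK from the previous article.
// Cluster URI, credentials and table name are placeholders for your own values.
using System;
using Microsoft.HBase.Client;
using org.apache.hadoop.hbase.rest.protobuf.generated;

class HBaseWriterSketch
{
    static void Main()
    {
        var credentials = new ClusterCredentials(
            new Uri("https://yourclustername.azurehdinsight.net"), "admin", "yourpassword");
        var client = new HBaseClient(credentials);

        // Create the table with the single "d" column family used by the cells above.
        var schema = new TableSchema { name = "tweetSampleSentiment" };
        schema.columns.Add(new ColumnSchema { name = "d" });
        client.CreateTable(schema);

        // Build a CellSet (for example via CreateTweetByWordsCells above) and store it.
        var set = new CellSet();
        // ... CreateTweetByWordsCells(set, tweet) for each tweet ...
        client.StoreCells("tweetSampleSentiment", set);
    }
}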

Mapping Hive to HBase

This is the actual magic.  AzureML can connect to Hadoop; however, it does so by running Hive queries.  To execute Hive queries against data stored in an HBase table, a mapping has to be made between a Hive table and the source HBase table that actually contains your data.  Note that as of this writing the documentation in the Azure Portal is incorrect.  It took me a while to understand how the table mapping code actually worked, so I will point out some of the misconceptions I had and describe each portion of the query.  You execute the query below in the Hive Query Console in Azure once you have your cluster up, the table created, and some data in the table.

CREATE EXTERNAL TABLE TweetSampleOne(
	rowkey STRING, Text STRING, CreatedOn STRING, ReplyToId STRING, Sentiment STRING, Coordinates STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,d:Text,d:CreatedOn,d:ReplyToId,d:Sentiment,d:Coordinates')
TBLPROPERTIES ('hbase.table.name' = 'tweetSampleSentiment');

CREATE EXTERNAL TABLE simply says that the source table (in this case ‘tweetSampleSentiment’) already exists; we are creating a table external to it that maps onto it.

In the parentheses after TweetSampleOne (which is the name of the new table) are the names of the columns.  These should match the number and order of the columns in the source table.

STORED BY indicates which storage handler should be used.  In our case we are using HBase, therefore we point to its storage handler.

WITH SERDEPROPERTIES confused me for quite a while.  I was originally under the impression that the colons were delimiters between old and new column names in the sample code.  I finally realized that no, the colon is just part of the HBase column name: it separates the column family (‘d’) from the column qualifier (e.g. ‘Text’), and ‘:key’ maps to the HBase row key.  Notice that the column names are comma delimited and each one matches the columns written by our C# code.

Testing the Mapping

After the above query is executed, we need to test that the mapping has in fact been made.  We can do this very simply.  Execute the following query from the Hive Query Console:

SELECT * FROM TweetSampleOne;

We should see something like the output below in the Job Log:

[Screenshot: successful query output in the Job Log]

Success, we are ready to hook it up to AzureML!

Enter AzureML

If you are unfamiliar with AzureML, please read one of these articles to get familiar with the basics.

http://indiedevspot.com/2014/10/29/understanding-azureml-part-1-regression/

http://indiedevspot.azurewebsites.net/2014/10/29/understanding-azureml-part-2-binary-classification/

Now that you understand AzureML, let's talk about creating a data reader that runs Hive queries against our HBase cluster.  Start by dragging out a Reader.  In the search bar on the left, type “Reader”.

[Screenshot: the Reader module in the search results]

Click, hold, and drag the module into the experimentation space as shown below.

[Screenshot: the Reader module in the experiment canvas]

Now all we need to do is fill in the properties.  It is important to note that our HBase cluster is an HDInsight cluster (a PaaS solution on Azure).

  1. Click on the Reader so that its properties expand out.
  2. The data source is Hive Query.
  3. The query is Select * from TweetSampleOne.
  4. [Screenshot: the first half of the Reader properties]
  5. The HCatalog server URI is the URI that points to your HBase cluster.  This can be found on the dashboard for your cluster; it is also https://yourclustername.azurehdinsight.net.
  6. The username is your admin user name or query user name.
  7. The password is your password.
  8. The output data location is Azure.  I thought it would be HDFS at first as well; however, it is Azure based.  The output of Hive queries is placed in Azure Blob Storage, and the container name is the default container you set up when initializing your cluster (one way to sanity check this is sketched after this walkthrough).
  9. The storage key can be obtained by navigating to the dashboard of the storage account and clicking “Manage Storage Keys” at the bottom.
  10. [Screenshot: the second half of the Reader properties]
  11. After all of this is complete, click RUN in the AzureML bottom context menu.
  12. After the run is complete, you should get a green check.
  13. Let's right click the output node and visualize the data from the reader to verify…
  14. [Screenshot: the visualized results]

SUCCESS!!!!  We can now begin using our data from Hadoop HBase for Machine Learning and Predictive Analytics!
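
If you are not sure which container is your cluster's default (step 8 above), here is a quick way to peek at it from code.  This is a minimal sketch using the classic Azure Storage .NET SDK (Microsoft.WindowsAzure.Storage); the account name, key and container name are placeholders for your own values.

// Minimal sketch using the classic Azure Storage .NET SDK (Microsoft.WindowsAzure.Storage).
// Account name, key and container name are placeholders for your own cluster's defaults.
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class DefaultContainerCheck
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=yourstorageaccount;AccountKey=yourstoragekey");
        CloudBlobClient blobClient = account.CreateCloudBlobClient();

        // The default container is the one you chose when creating the HDInsight cluster.
        CloudBlobContainer container = blobClient.GetContainerReference("yourdefaultcontainer");

        // List what the cluster (including Hive query output) has written into the container.
        foreach (IListBlobItem item in container.ListBlobs(null, useFlatBlobListing: true))
        {
            Console.WriteLine(item.Uri);
        }
    }
}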

Summary

We went through how to take an existing HDInsight HBase cluster and table, map a Hive table to it so we can query HBase via HQL, and then power AzureML from that same cluster!  That is super freaking cool!

 
