
Getting Started with Google Prediction API : Machine Learning on Cloud

October 17, 2010

In this post, I will talk about the Google Prediction API, one of the many services Google offers that allow enterprises to run different kinds of applications on its cloud infrastructure. The API was officially announced during Google I/O 2010. Access is still by invitation, and you need to sign up on the waiting list here. To use the Prediction API, you must also have access to Google Storage for Developers, another service launched during Google I/O 2010, which lets you manage large data-sets on Google’s infrastructure. Like the Prediction API, its access is by invitation only, and you can request it in a similar way here. Neither the grant nor the timing of access is guaranteed; it took a couple of days for me. Once you have access to both services, let us try to build something using them.

What does the Prediction API offer? In a nutshell, it allows us to get more information out of our data. Google’s machine learning algorithms are trained on a data-set we supply and then predict the outcome for similar kinds of input events. These capabilities are exposed as a RESTful web service, so we can use them in our applications quite easily. We can get started with the Prediction API in four simple steps:

  • Prepare : Prepare the data-set the Prediction API will be trained on.
  • Upload : Upload this data-set to Google Storage for Developers.
  • Train : Request the Prediction API to train on the supplied data-set.
  • Predict : Use the Prediction API on new input objects to make predictions.

Before going further, let us first get acquainted with some of the terminology commonly used in any Supervised Learning context:

Training Data-Set The set of training examples given to a supervised learning algorithm. The algorithm analyzes this data to build inferences and make its predictions.

Training Example A pair of an input object and a desired output value.

Input Object A vector of one or more features. A feature can be a string or a number.

Output Value The prediction result we expect from the algorithm when given a list of input features. Output values can be categorical or continuous (continuous outputs are what a regression model produces). In this post, I will be talking about categorical outputs only.

Example Scenario All these terms may not be very clear from just a few words to people new to this domain, so let’s try to understand them better through a concrete example scenario that we will build using the Prediction API. Imagine we want to predict the price of a used car based on different criteria. In the real world there are a lot of features to take into account when making such a prediction, but to keep the example both easy and realistic, we consider the following features:

  • Make
  • Year of Manufacture
  • Mileage (Odometer Reading)
  • Color
  • Number of Accidents
  • City of Registration

Step 1 : Preparing the Training Data-Set Like any other supervised learning algorithm, the Prediction API expects training data in a certain format to be able to parse it properly. Here it should be in CSV format, with the left-most column holding the output value followed by one or more input features, one training example per row. For our scenario, some typical training example rows could be:

"TwoOneZero", Alto, 2008, 20000, Silver, 0, Delhi
"TwoZeroEight", Alto, 2008, 20000, Grey, 0, Delhi
"TwoZeroEight", Alto, 2008, 20000, White, 1, Delhi
//Values encapsulated in double quotes are output values / categories we would expect as results from the Prediction API
//All other values following the output value correspond to the feature list above, in the same order
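If you are generating the data-set programmatically rather than by hand, rows in this shape can be produced with a short helper. This is a minimal sketch, not part of the API; the helper name and the fixed six-feature layout are my own assumptions:

```python
def to_row(label, make, year, mileage, color, accidents, city):
    """Format one training example: quoted output label first,
    then the six input features in the order listed above."""
    return f'"{label}",{make},{year},{mileage},{color},{accidents},{city}'

# Build the three sample rows and join them into CSV text.
examples = [
    ("TwoOneZero", "Alto", 2008, 20000, "Silver", 0, "Delhi"),
    ("TwoZeroEight", "Alto", 2008, 20000, "Grey", 0, "Delhi"),
    ("TwoZeroEight", "Alto", 2008, 20000, "White", 1, "Delhi"),
]
csv_text = "\n".join(to_row(*e) for e in examples)
```

Writing the file this way keeps the label in the left-most column, which is the only ordering the API accepts.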

One thing worth mentioning here is the value in the output column. Ideally we would expect this to be a numeric value corresponding to the price of the car, e.g. 21000 or 10000. But any numeric figure in the output column makes the Prediction API return continuous-value results. As we are interested in categorical output, I have used a simple string representation of the numeric value (TwoOneZero corresponds to 21000). You can find more detailed information about the supported formats on the Prediction API’s documentation page.
We should now add more training examples covering different scenarios to the training data-set. The quality of the Prediction API’s results corresponds directly to the quality and amount of the training data. To try it out yourself, have a look at the sample training data-set file attached to this post. (You may need to copy all the text into a separate file before editing it or uploading it to Google Storage.) As the API is in preview mode, the maximum size of an individual training data-set is restricted to 100 MB, but you can train multiple data-sets which together may exceed this limit.

Step 2 : Upload the Training Data-Set We need to upload the created data-set to Google Storage for Developers for the Prediction API to use. While there are many ways to interact with Google Storage, the easiest and fastest way to get started is the Google Storage for Developers Manager web interface. You first need to create a bucket; as it lives in a global namespace, its name has to be unique. Once the bucket is created, you can upload the data-set file into it. The interface is quite intuitive and simple to use. For more information, refer to the Google Storage for Developers home page. Suppose our bucket name is predictiveapitest and the training data-set file is CarScan_Categoric.

Step 3 : Train Data-Set

  • Authentication Token from Google : As we are accessing a Google service protected by a Google account, we first need to obtain an authentication token. Detailed information about the various options is available at Google Client Login, but for our case we can use the cURL tool. If it’s not already on your system, you may need to install it, and since we pass the information over HTTPS, you will need an SSL-enabled cURL build. More information about this is provided at the end of the blog. The command for getting the authentication token is:
    # Replace Email and Passwd below with your own Google account credentials
    curl -X POST \
      -d accountType=HOSTED_OR_GOOGLE \
      -d Email=account@domain.com \
      --data-urlencode Passwd=email_account_password \
      -d service=xapi \
      -d source=account \
      -H "Content-Type:application/x-www-form-urlencoded" \
      https://www.google.com/accounts/ClientLogin
    

    If everything is correct, the Google ClientLogin API will return a successful response containing SID, LSID and Auth tokens. For us, only the Auth token is relevant, and it is a good idea to save it in a plain-text file for making subsequent calls to the Prediction API. A typical response contains information like:

    SID=DQAAAK4AAACXXXXXXXXXXXXXXXXXXXXXXXXXSPUirehoXg
    LSID=DQAAALEAAADmYYYYYYYYYYYYYYYYYYYYYYrac1rwXHJz9whk
    Auth=DQAAALEAAADmZZZZZZZZZZZZZZZZZZ7Ohhbcwv4AnP820
    

    In case of any problems, refer to the Google Client Login API web page.
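    If you are scripting these calls rather than copying the token by hand, the Auth line can be pulled out of the ClientLogin response body with a few lines of code. A minimal sketch (the function name is my own):

```python
def extract_auth_token(response_body):
    """Return the value of the Auth=... line from a ClientLogin response."""
    for line in response_body.splitlines():
        if line.startswith("Auth="):
            return line[len("Auth="):]
    raise ValueError("no Auth token in ClientLogin response")
```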

  • Invoke Train Mechanism This is where our first interaction with the Prediction API happens. We make an HTTP POST request asking the Prediction API to train on our supplied data-set. We will again use cURL for the purpose. The syntax of the command is:
    curl -X POST \
      -H "Content-Type:application/json" \
      -d "{\"data\":{}}" \
      -H "Authorization: GoogleLogin auth=<<Google Auth Token>>" \
      https://www.googleapis.com/prediction/v1.1/training?data=<<Bucket Name>>%2F<<Training Data-Set Name>>
    

    If your request is successful, you should get a response like:

    {"data":{"data":"<<Your Bucket Name>>/<<Your Training Data-Set Name>>"}}
    

    As you might have noticed, we are making a call to the Prediction API’s exposed method https://www.googleapis.com/prediction/v1.1/training?data=bucket_name%2Fdataset_name. This is the first of the four methods exposed by the Prediction API that we will be using regularly.
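    Because the bucket and data-set names travel inside a single data value, the slash between them must stay percent-encoded as %2F. If you build the URL in code, a small helper keeps that safe. A sketch under the URL format shown above (the helper name and constant are my own):

```python
from urllib.parse import quote

PREDICTION_BASE = "https://www.googleapis.com/prediction/v1.1"

def training_url(bucket, dataset):
    # Percent-encode each part so the '/' between bucket and
    # data-set name is sent as %2F, as the API expects.
    return "%s/training?data=%s%%2F%s" % (
        PREDICTION_BASE, quote(bucket, safe=""), quote(dataset, safe=""))
```

For our scenario, `training_url("predictiveapitest", "CarScan_Categoric")` yields the URL used in the cURL call above.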

  • Check Training Status Depending upon the size of the training data-set we have supplied, training may take anywhere from a few seconds to more than an hour. My sample training data-set had around 80 rows with 6 input features each, and training took only a few seconds. I have also tried a few other, larger data-sets, and the response has been reasonably fast so far. We can always check whether the Prediction API has finished training by making an HTTP GET call. Similar to the earlier calls, we use cURL; here is the syntax:
    curl -H "Authorization: GoogleLogin auth=<<Google Auth Token>>" \
      https://www.googleapis.com/prediction/v1.1/training/<<Bucket Name>>%2F<<Training Data-Set Name>>
    

    If your training has not been completed, you will get a response like:

    {"data":{
       "data":"<<Bucket Name>>/<<Training Data-Set Name>>", "modelinfo":"Training hasn't completed."}}
    

    In case of successful completion, you will get a response like:

    {"data":{
       "data":"<<Bucket Name>>/<<Training Data-Set Name>>", "modelinfo":"estimated accuracy: 0.xx"}}
    

    The estimated accuracy value gives us an idea of the quality of results we should expect from the Prediction API: a low value indicates lower quality, a higher value better prediction results. For categorical outputs, the value ranges from 0 to 1. In our sample scenario it was 0.12 (pretty low). We can improve the accuracy estimate by providing a larger, better-quality data-set for the Prediction API to train with.
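    The modelinfo field is a free-text string, so a script polling for training completion has to inspect it. One way to do that, sketched against the two response shapes shown above (the function name is my own):

```python
import json
import re

def estimated_accuracy(response_body):
    """Return the estimated accuracy as a float, or None while
    the modelinfo string says training hasn't completed."""
    info = json.loads(response_body)["data"]["modelinfo"]
    match = re.search(r"estimated accuracy: ([0-9.]+)", info)
    return float(match.group(1)) if match else None
```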

Step 4 : Make Predictions Everything is now ready for us to test our work with some predictions. For that purpose, similar to the earlier steps, we again make an HTTP call to the Prediction API. With our sample data-set, the exact syntax is:

    curl -X POST \
      -H "Content-Type:application/json" \
      -d "{\"data\":{\"input\":{\"mixture\":[\"Alto\",2008,20000,\"Grey\",0,\"Delhi\"]}}}" \
      -H "Authorization: GoogleLogin auth=<<Google Auth Token>>" \
      https://www.googleapis.com/prediction/v1.1/training/<<Bucket Name>>%2F<<Training Data-Set Name>>/predict

A couple of things to note in our request:

  • Alto, Grey and Delhi are wrapped in (escaped) double quotes. This is how we pass string values; numeric values don’t need the quoting.
  • input is followed by mixture. This is the way of telling the Prediction API what kind of data we are supplying for the prediction. The API supports three types of data: text, numeric and mixture. The names are self-explanatory.
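Escaping the nested quotes by hand is error-prone; if you build the request body in code, a JSON library handles the string-versus-numeric distinction for you. A minimal sketch (the helper name is my own):

```python
import json

def mixture_payload(features):
    """Build the JSON body for a predict call on a mixed
    string/numeric input row; strings get quoted, numbers do not."""
    return json.dumps({"data": {"input": {"mixture": list(features)}}})
```

For example, `mixture_payload(["Alto", 2008, 20000, "Grey", 0, "Delhi"])` produces the body passed with -d in the cURL call above.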

In our case, we get the following response from the Prediction API:

    {"data":{"kind":"prediction#output","outputLabel":"TwoZeroEight",
      "outputMulti":[{"label":"TwoOneZero","score":0.14988209307193756},
        {"label":"TwoZeroEight","score":0.3504120409488678},
        {"label":"TwoZeroSix","score":0.32314565777778625},
        {"label":"TwoZeroFour","score":0.14953596889972687},
        {"label":"TwoZeroTwo","score":0.027024220675230026}]}}
      

Let us try to understand the response in a bit more detail. For each possible output value, the Prediction API gives us a relative score for the supplied input: the higher the score, the higher the probability that the input corresponds to that output. The output value with the highest score is reported first as outputLabel, followed by all the categories and their respective scores. In our case, the Prediction API considers that a car with features Alto, 2008, 20000, Grey, 0, Delhi is most likely to be priced at TwoZeroEight, with TwoZeroSix as the second most likely price, and so on. I have tried different values for these input features that are not present in the data-set (this is exactly the point of the Prediction API), and the results have been fairly consistent and in line with expectations. As I said earlier, we need to provide a larger and more varied data-set for the Prediction API to return better results.
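To consume this response in a script, we usually just need the highest-scoring category (which should match outputLabel). A sketch, assuming the response shape shown above (the function name is my own):

```python
import json

def best_label(response_body):
    """Return (label, score) for the highest-scoring category
    in a prediction response."""
    data = json.loads(response_body)["data"]
    top = max(data["outputMulti"], key=lambda entry: entry["score"])
    return top["label"], top["score"]
```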

Summary

  • No custom coding is required; everything is made available to us. Of course, we don’t get the same fine-grained control over the underlying algorithms.
  • Everything is made available on the cloud. When there is a large amount of data to analyze, we can leverage Google’s infrastructure for the purpose.
  • Usage is fairly simple, and as all the required functionality is exposed through RESTful web services, integration into our applications is quite easy.
  • It will be interesting to see the pricing structure once Google makes the API publicly available.

That’s it for a brief tour of the Google Prediction API; I hope you find it useful. Google has been using predictive modelling algorithms internally in a variety of ways, such as spam detection, language detection, customer sentiment and up-sell opportunities. With more and more data at our disposal and a growing desire to make meaningful decisions from it, this API provides a useful tool to achieve those goals without any custom implementation or infrastructure set-up.

Looking forward to your comments; I would be interested to know how you are using the Prediction API in your applications.

Additional Information

  • You might get quite unpredictable results with very small data-sets (around 10 – 15 lines). The Prediction API team is still working on resolving this issue.
  • If you run into problems and would like to start all over again, the API also exposes a method to delete an already trained model. It’s a web-service call similar to the ones used for training or prediction. You can find more information on the official documentation page.
  • cURL download : You can download the tool for different operating systems here. On Windows, please make sure you download an SSL-supporting binary along with the required libraries.
  • Announcement video : You can watch the official announcement and overview presentation given by the Google team members behind the Prediction and BigQuery APIs during Google I/O 2010.