
Google Prediction API Part-II, Getting Continuously Valued Output through Regression

October 23, 2010

In the previous post, I provided an overview of the Google Prediction API followed by a working example on getting started with it. There, we used a sample scenario to get Categorical output from the Prediction API. In this post, I will talk a bit further about the Prediction API and how we can get Continuously Valued output from it. If you are new to prediction algorithms or have not tried the Google Prediction API before, I recommend a quick read of my previous post or the Prediction API web-site, as I will be referring to several things from there. Before getting started, let us try to understand the meaning of, and differences between, Categorical and Continuously Valued output in general:

Categorical Output Also referred to as Classification, here we assign a given set of input features to one or more categories. Based upon the supplied set of category-input pairs (referred to as the Training Data-Set), a prediction algorithm deduces inferences, which it then uses to assign any similar valid input features to one of the supplied categories. Typical usages of categorical output are mail classification (spam, personal, official, groups etc.), language recognition (French, English, Spanish etc.) and population classification (high, middle and low income). In all the above scenarios, the algorithm will give us a single categorical response, i.e. this mail is either spam, personal or official, and the result will always be one of the categories supplied in the Training Data-Set. For more information on categorical outputs and related algorithms, you can refer to this article.

Continuously Valued Output Everything in terms of categories and input features remains the same here, but the key difference is that the prediction algorithm in this case can give us a result which does not correspond to any of the supplied categories in our Training Data-Set. As an example, suppose we have provided the stock price of a company for the past 10 years, along with a list of related features like the profitability of the company, competitor information, the general stock index etc., and we would like to use this data to predict the stock price six months from now. In that case the algorithm, based upon its learning from the data-set, can return a value either in between the supplied values or beyond them. Prediction algorithms generally use regression to understand the correlation between how the value of the output (category) changes with changes in the input features (which are mostly independent of each other). This type of prediction finds great usage in the financial and medical domains, meteorological forecasting, demographics, population estimation and related planning. For more information about continuous output and available regression models, you can refer to this article.

Having understood the two terms, and having already seen in the previous post how to get Categorical output from a supplied data-set through the Prediction API, let's see how we can achieve Continuously Valued output.

Data-Set Format As we would expect, for continuously valued output only numeric categories make sense. We cannot think of a continuous output when our categories are strings like spam, personal or official. That is exactly the requirement of the Google Prediction API as well, i.e. we need to supply numeric values in the left-most column of the training data-set to get continuously valued output. In fact, that is how the Prediction API decides whether we are expecting Continuous or Categorical output from it. In terms of input features, we can use text, numeric values or a mixture of both. According to the documentation, the API can take hundreds of categories and tens of thousands of features for each category.
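Since the API picks regression purely from the left-most column being numeric, a quick sanity check on the data-set before uploading can save a mis-trained model. Here is a minimal sketch using standard Unix tools; the file name and sample rows are just for illustration:

```shell
# The Prediction API decides between regression and classification from the
# left-most column, so it is worth checking that every row starts with a number.
cat > training-sample.csv <<'EOF'
210000, Alto, 2008, 20000, White, 0, Delhi
206000, Alto, 2007, 30000, White, 0, Delhi
EOF

# Flag any row whose first field is not purely numeric
awk -F',' '$1 !~ /^[0-9]+(\.[0-9]+)?$/ { print "Non-numeric target on line " NR; bad = 1 }
           END { exit bad }' training-sample.csv \
  && echo "All target values are numeric"
```

If any row starts with a string category, the command prints the offending line number instead, telling us the API would treat the data-set as a classification problem.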

Example Scenario For this post, we will reuse the scenario of the previous post but change the data-set to get continuously valued output this time. As last time, we will use the following input features for making predictions:

  • Make
  • Year of Manufacture
  • Mileage (Odometer Reading)
  • Color
  • Number of Accidents
  • City of Registration

Some rows from our training data-set are:

210000, Alto, 2008, 20000, White, 0, Delhi
210000, Alto, 2008, 20000, Silver, 0, Delhi
206000, Alto, 2007, 30000, White, 0, Delhi
206000, Alto, 2007, 30000, Silver, 0, Delhi
208000, Alto, 2008, 20000, Grey, 0, Delhi
208000, Alto, 2008, 20000, Brown, 0, Delhi
208000, Alto, 2008, 30000, White, 0, Delhi
208000, Alto, 2008, 20000, White, 1, Delhi
208000, Alto, 2008, 20000, White, 0, Gurgaon
208000, Alto, 2008, 20000, Grey, 0, Indore
210000, Swift, 2008, 25000, White, 0, Delhi
210000, Spark, 2008, 27000, White, 0, Delhi

We can infer the following main things from the above rows:

  • An older car (keeping other factors similar) is less expensive than a relatively newer car
  • White and Silver colored cars are more expensive than Grey, Brown and Black colored cars
  • A car with a higher odometer reading is less expensive than one with a lower reading
  • A car with more accidents is less expensive than one with fewer accidents
  • A car being resold in Delhi or Gurgaon is less expensive than one in Indore or Meerut
  • The Spark is the most expensive of the cars, followed by the Swift, with the Alto last

While some of the above points are quite obvious (an older car, accidents or mileage), I have made up a few other assumptions for testing purposes. You can access the full sample training data-set here (you may need to copy all the text into a separate file for editing/uploading it into Google Storage). Once we have the training data-set ready, we need to perform the following steps before we can make predictions:

  1. Upload training data-set to Google Storage for developers
  2. Train Prediction API on the uploaded data-set
  3. Check the status of the training and ensure that it’s finished
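The three steps above can be sketched as commands. The bucket name, object name and auth token below are placeholders (not real values), and the training/status URLs follow the v1.1 form used in my previous post; the commands are echoed here so the sketch can be inspected before being run against a live account:

```shell
# Placeholders for this sketch; substitute real values before running.
BUCKET="mybucket"          # hypothetical Google Storage bucket
OBJECT="car-prices.csv"    # hypothetical training data-set name
AUTH="<<Google Auth Token>>"

TRAIN_URL="https://www.googleapis.com/prediction/v1.1/training?data=${BUCKET}%2F${OBJECT}"
STATUS_URL="https://www.googleapis.com/prediction/v1.1/training/${BUCKET}%2F${OBJECT}"

# 1. Upload the training data-set to Google Storage
echo gsutil cp "${OBJECT}" "gs://${BUCKET}/"

# 2. Train the Prediction API on the uploaded data-set
echo curl -X POST -H "Content-Type:application/json" -d '{"data":{}}' \
     -H "Authorization: GoogleLogin auth=${AUTH}" "${TRAIN_URL}"

# 3. Check training status; the modelinfo field reports the estimated accuracy
echo curl -H "Authorization: GoogleLogin auth=${AUTH}" "${STATUS_URL}"
```

Dropping the leading `echo` from steps 1-3 runs the commands for real once a valid auth token is in place.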

Instructions for performing all the above steps are explained in my previous post. After completing these steps with the sample training data-set, I get the following estimated accuracy from the Prediction API:

{"data":{"data":"<<Bucket Name>>/<<Training Data-Set Name>>","modelinfo":"estimated accuracy: 0.08"}}

Estimated Accuracy A lower number suggests less accurate results coming from the Prediction API. The main factors affecting the estimated accuracy are:

  • Quality of the training data-set The Prediction API works on the principle of Garbage In, Garbage Out. The more complete our data-set is in terms of varied input features, the more accurate the results will be, and vice-versa.
  • Categories vs. number of input features in the training data-set This is based on my experimentation. If we have a huge number of categories relative to the total number of rows in the training data-set, we tend to get lower accuracy rates. As a general trend, I have seen the Prediction API give better results with one category corresponding to 20 or more input feature rows.
  • Size of the training data-set If you have a very small training data-set (less than 100 rows), you may get very unpredictable results from the API at the moment. The Prediction API team is currently working on resolving the issue, but for the time being you may have to add more rows to the data-set to work around it. This problem should only surface during the development phase, as any serious prediction would mostly mean you have a substantial data-set to work with.
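The second point above is easy to check mechanically: counting rows per target value (the left-most column) shows which categories fall below the roughly 20-rows-per-category trend I observed. A small sketch, using a few sample rows in place of the full data-set file:

```shell
# Count training rows per target value; thin categories are candidates
# for more data. Point this at the full data-set file in practice.
cat > sample-rows.csv <<'EOF'
210000, Alto, 2008, 20000, White, 0, Delhi
210000, Alto, 2008, 20000, Silver, 0, Delhi
206000, Alto, 2007, 30000, White, 0, Delhi
208000, Alto, 2008, 20000, Grey, 0, Delhi
EOF

cut -d',' -f1 sample-rows.csv | sort | uniq -c | sort -rn
```

Each output line is a row count followed by a target value, most frequent first, making thin categories immediately visible.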

Make Predictions Let us now use the Prediction API in some sample scenarios. It is important to keep the test data separate from the training data-set while verifying the accuracy of the results coming from the Prediction API.

Test 1 : Varying a Single Input Feature (Mileage) Let us try changing one parameter, Mileage, in our tests and see how the API responds. As a reference, we have the following information in our training data-set:

206000, Alto, 2008, 20000, Black, 0, Delhi
202000, Alto, 2007, 30000, Black, 0, Delhi

We will use the following command to get results from the Prediction API:

curl -X POST \
  -H "Content-Type:application/json" \
  -d "{\"data\" : { \"input\" : { \"mixture\" : [ \"<<Make>>\", <<Year>>, <<Mileage>>, \"<<Color>>\", <<Number of Accidents>>, \"<<City>>\" ] }}}" \
  -H "Authorization: GoogleLogin auth=<<Google Auth Token>>" \
  https://www.googleapis.com/prediction/v1.1/training/<<Bucket Name>>%2F<<Training Data-Set Name>>/predict

Now let’s try predictions with following test-data:

Alto, 2008, 25000, Black, 0, Delhi
Alto, 2007, 25000, Black, 0, Delhi
Alto, 2007, 40000, Black, 0, Delhi
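For the first of these rows, the template filled in would produce the request body below (the bucket, data-set name and auth token remain placeholders, so the curl line is shown commented rather than run):

```shell
# Request body for the first test row, following the mixture format of the
# template above: string features quoted, numeric features bare.
BODY='{"data":{"input":{"mixture":["Alto",2008,25000,"Black",0,"Delhi"]}}}'
echo "${BODY}"

# With a real auth token the request would be (placeholders left as-is):
# curl -X POST -H "Content-Type:application/json" -d "${BODY}" \
#      -H "Authorization: GoogleLogin auth=<<Google Auth Token>>" \
#      https://www.googleapis.com/prediction/v1.1/training/<<Bucket Name>>%2F<<Training Data-Set Name>>/predict
```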

Following are the results:


In the first case, the car has been driven more than 20000 (the information in the training data-set), so we would expect its price to be lower than 206000, and the result is indeed 204522.953125. Similarly, in the second case, the car has been driven less than 30000 (the information in the training data-set), so we would expect the price to be higher than 202000; we get 204416.96875, which is as per our expectations. The last case is similar to the first one, where the car has been driven longer, and we get the expected lower price. The interesting thing here is that we have not provided any information in the training data-set for this kind of mileage, yet the API has still been able to understand the correlation that increasing mileage results in lower prices.

Test 2 : Varying Multiple Input Features (Year, Mileage and Accidents) Let us now change multiple input features together and see how the Prediction API handles these scenarios. We will keep the following rows from the training data-set as reference for easier understanding:

210000, Spark, 2008, 27000, White, 0, Meerut
208000, Spark, 2008, 27000, White, 1, Meerut
206000, Spark, 2007, 37000, White, 0, Meerut

Following is our test data:

Spark, 2008, 30000, White, 1, Meerut
Spark, 2007, 35000, White, 2, Meerut
Spark, 2006, 45000, White, 3, Meerut
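When there are several test rows like this, the request bodies can be generated with a small loop rather than hand-edited each time. A sketch, assuming comma-separated rows with fields 1, 4 and 6 (make, color, city) as strings:

```shell
# Build one prediction request body per test row: string features quoted,
# numeric features bare, matching the mixture format used earlier.
while IFS=',' read -r make year mileage color accidents city; do
  printf '{"data":{"input":{"mixture":["%s",%s,%s,"%s",%s,"%s"]}}}\n' \
    "$make" "$year" "$mileage" "$color" "$accidents" "$city"
done <<'EOF'
Spark,2008,30000,White,1,Meerut
Spark,2007,35000,White,2,Meerut
Spark,2006,45000,White,3,Meerut
EOF
```

Each printed line can then be passed as the `-d` payload of the curl command shown in Test 1.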

Here are the corresponding results:


In the first case, the car has been driven more and had an accident, so its value is 207775.75, lower than 208000. The second scenario has more accidents and more mileage, so its value comes out to be 205658.125, lower than 206000. In the last scenario, we are predicting against a year and accident count that are not even mentioned in the training data-set, and we still get a good approximate result. This is where the real power of the Prediction API comes into play.

Test 3 : Missing/New Input Features There may be scenarios where we don't have the information for some of the input features, or we come across completely new feature values. Let's test and see how the API performs:

Spark, 2008, 27000, White
Polo, 2007, 35000, White, 2, Meerut
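For the first of these rows, one plausible way to express the missing features (my assumption, following the same mixture format as before) is simply a shorter mixture array:

```shell
# Hypothetical request body for the Spark row with the accident-count and
# city features absent: the mixture array just carries fewer entries.
BODY_MISSING='{"data":{"input":{"mixture":["Spark",2008,27000,"White"]}}}'
echo "${BODY_MISSING}"
```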

Following are the results:


In the first case, the API holds up quite well, as the price of a Spark car with 2008 as the year of manufacture and 27000 mileage hovers between 208000 and 210000 across all the cities. But the second case may be completely erroneous; it depends entirely upon the kind of input features involved. The good thing here is that the Prediction API still gives us results even with incomplete input features. One way to improve our results in the case of missing features is to add similar types of information to the training data-set, which allows the Prediction API to train accordingly. That's all for this post. Here is a quick summary of my key learnings:


  • A better understanding of Continuously Valued output and its potential usage
  • Using the Prediction API to get continuously valued results
  • How to improve the training data-set to get better results from the Prediction API

As always, looking forward to your feedback and comments.
