The short time I have spent on Kaggle, I have realized ensembling (stacking models) is the best way to perform well.

Well, I am not the only one to think so !!

Stacking is a Model Ensembling technique that combines predictions from multiple models and generates a new model.

model ensembling

I am gonna write a new post on model ensembling 🙂

I have experimented with multiple ensembling techniques and made a model with XGboost, LightGBM, and Keras for Zillow Zestimate problem which did perform well.

Hyper-Parameter tuning for the base models was done using Cross-Validation + Grid Search. Tuning the parameters of the combined model is where things get strenuous.

There, I began to search for a better way to build ensembled models. I found few frameworks to build better-ensembled models like Auto-sklearn, TPOT, Auto-Weka, machineJS and AutoML.

Auto-sklearn and TPOT provide a Sklearn styled API that can help you get things going quite fast. But Auto ML got better results for me atleast 🙂 is an open source Machine Learning platform which gives you a good bunch of Machine Learning algorithms to build scalable prediction models.

H20 AutoML can help in automating the machine learning workflow, which includes training and tuning of hyper-parameters of models. The AutoML process can be controlled by specifying a time-limit or defining a performance metric-based stopping criteria. AutoML returns a leaderboard with the best models ensembled.

h2o automl leaderboard

AutoML provides APIs in Python and R that comes with H2O library.

I have decided to give a try on H20 AutoML for Zillow Zestimate problem. I have used R for making the model for making the submission.


# Load train and properties data

properties <- fread("../input/properties_2016.csv", header=TRUE, stringsAsFactors=FALSE, colClasses = list(character = 50))
train      <- fread("../input/train_2016_v2.csv")
training   <- merge(properties, train, by="parcelid",all.y=TRUE)

# Initialise h20
h2o.init(nthreads = -1, max_mem_size = "8g")

# Mark predictor and response variables
x <- names(training)[which(names(training)!="logerror")]
y <- "logerror"

# Import data into H2O
train <- as.h2o(training)
test <- as.h2o(properties)

# Fit H2O AutoML Mode;
aml <- h2o.automl(x = x, y = y,
                  training_frame = train,
                  max_runtime_secs = 1800, stopping_metric='MAE')

# Store the H2O AutoML Leaderboard                  
lb <- aml@leaderboard

# Use Best Model in the leaderboard

# Generate Predictions using the leader Model
pred <- h2o.predict(aml, test)

predictions <- round(as.vector(pred), 4)

# Prepare predictions for submission file
result <- data.frame(cbind(properties$parcelid, predictions, predictions,
                          predictions, predictions, predictions,

options(scipen = 999)

# Wite results to submission file
write.csv(result, file = "submission_xgb_ensemble.csv", row.names = FALSE )

Running the AutoML model for 1800 seconds with stopping metric as MAE gave me a Public Leaderboard score of 0.06564.

That’s a good score considering that I haven’t even dealt with basic data preprocessing 🙂