In the short time I have spent on Kaggle, I have realized that ensembling (stacking models) is the best way to perform well.
Stacking is a model ensembling technique that feeds the predictions of multiple base models into a new model, which learns how to combine them.
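To make the idea concrete, here is a minimal stacking sketch in base R. It is not the pipeline used for the competition: the data is synthetic, the two base learners are deliberately weak linear models standing in for real ones, and every name in it (`df`, `oof`, `meta`, `mae`) is made up for illustration.

```r
# Toy stacking example on synthetic data (all names are illustrative).
set.seed(42)
n <- 200
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(n, sd = 0.3)

# Assign every row to one of k folds.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Out-of-fold predictions: each row is predicted by base models
# that never saw it during training.
oof <- data.frame(a = numeric(n), b = numeric(n))
for (i in 1:k) {
  tr <- df[fold != i, ]
  te <- df[fold == i, ]
  oof$a[fold == i] <- predict(lm(y ~ x1, data = tr), newdata = te)       # base model A
  oof$b[fold == i] <- predict(lm(y ~ I(x2^2), data = tr), newdata = te)  # base model B
}
oof$y <- df$y

# The meta-model (the "new model") learns how to weight the base predictions.
meta <- lm(y ~ a + b, data = oof)

mae <- function(actual, predicted) mean(abs(actual - predicted))
mae_a <- mae(df$y, oof$a)
mae_b <- mae(df$y, oof$b)
# Note: this is the meta-model's in-sample error, so it is slightly optimistic.
mae_stack <- mae(df$y, fitted(meta))
print(c(model_a = mae_a, model_b = mae_b, stacked = mae_stack))
```

Each base model only sees part of the signal here, so the stacked model should come out clearly ahead of both. With real learners the gain is usually much smaller.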
I am going to write a separate post on model ensembling 🙂
I have experimented with multiple ensembling techniques and built a model with XGBoost, LightGBM, and Keras for the Zillow Zestimate problem, which performed well.
Hyperparameter tuning for the base models was done using cross-validation combined with grid search. Tuning the parameters of the combined model is where things get strenuous.
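For readers who have not seen the combination before, the following base R sketch shows what cross-validation plus grid search boils down to. It is a toy, not the tuning code used for the models above: the data is synthetic, the "hyperparameters" are just two polynomial degrees, and the names `grid`, `cv_mae`, and `best` are invented for this example.

```r
# Toy grid search with k-fold cross-validation (all names are illustrative).
set.seed(1)
n <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 + df$x2^2 + rnorm(n, sd = 0.3)

# The grid: every combination of the two candidate settings.
grid <- expand.grid(deg1 = 1:3, deg2 = 1:3)

# Fixed fold assignment so every grid point is scored on the same splits.
k <- 5
fold <- sample(rep(1:k, length.out = n))

# Mean absolute error averaged over the k held-out folds.
cv_mae <- function(deg1, deg2) {
  form <- as.formula(sprintf("y ~ poly(x1, %d) + poly(x2, %d)", deg1, deg2))
  errs <- sapply(1:k, function(i) {
    fit <- lm(form, data = df[fold != i, ])
    pred <- predict(fit, newdata = df[fold == i, ])
    mean(abs(df$y[fold == i] - pred))
  })
  mean(errs)
}

# Score every grid point and keep the one with the lowest CV error.
grid$mae <- mapply(cv_mae, grid$deg1, grid$deg2)
best <- grid[which.min(grid$mae), ]
print(grid[order(grid$mae), ])
print(best)
```

The cost grows with the number of grid points times the number of folds, which is why doing this by hand for several base models and then again for the combined model gets tedious quickly.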
Auto-sklearn and TPOT provide a scikit-learn-style API that can help you get things going quite fast. But H2O AutoML got better results, for me at least 🙂
H2O.ai is an open source machine learning platform that provides a wide range of machine learning algorithms for building scalable prediction models.
H2O AutoML can help automate the machine learning workflow, including the training of models and the tuning of their hyperparameters. The AutoML process can be controlled by specifying a time limit or by defining a stopping criterion based on a performance metric. AutoML returns a leaderboard of the trained models, which includes stacked ensembles built from the best of them.
AutoML is available through the Python and R APIs that come with the H2O library.
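As a rough illustration of those two kinds of control, the sketch below wraps `h2o.automl` in a small helper. The helper name `run_automl_capped`, its default values, and the `my_data` frame in the usage comment are my own placeholders, not part of H2O. The arguments passed to `h2o.automl` (`max_runtime_secs`, `max_models`, `stopping_metric`, `stopping_rounds`, `stopping_tolerance`) are the ones I understand the R API to expose, so check them against the documentation for your H2O version.

```r
# Hypothetical helper: run H2O AutoML with both a budget and an
# early-stopping rule. Defaults are arbitrary illustration values.
run_automl_capped <- function(frame, response,
                              max_secs = 600, max_models = 20) {
  predictors <- setdiff(names(frame), response)
  h2o::h2o.automl(
    x = predictors,
    y = response,
    training_frame = frame,
    max_runtime_secs = max_secs,   # time limit for the whole run
    max_models = max_models,       # cap on the number of models trained
    stopping_metric = "MAE",       # metric watched for early stopping
    stopping_rounds = 3,           # stop after 3 rounds without improvement
    stopping_tolerance = 0.001,    # minimum relative improvement that counts
    seed = 1
  )
}

# Usage (needs a running H2O cluster; `my_data` is a placeholder data frame):
# h2o::h2o.init()
# aml <- run_automl_capped(h2o::as.h2o(my_data), "logerror")
# print(aml@leaderboard)
```

The run ends when the first budget is exhausted, so a short time limit takes effect even if the model cap has not been reached.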
I decided to give H2O AutoML a try on the Zillow Zestimate problem. I used R to build the model and generate the submission.
    library(data.table)
    library(h2o)

    # Load train and properties data
    properties <- fread("../input/properties_2016.csv", header = TRUE,
                        stringsAsFactors = FALSE,
                        colClasses = list(character = 50))
    train <- fread("../input/train_2016_v2.csv")
    training <- merge(properties, train, by = "parcelid", all.y = TRUE)

    # Initialise H2O
    h2o.init(nthreads = -1, max_mem_size = "8g")

    # Mark predictor and response variables
    x <- names(training)[which(names(training) != "logerror")]
    y <- "logerror"

    # Import data into H2O
    train <- as.h2o(training)
    test <- as.h2o(properties)

    # Fit H2O AutoML model
    aml <- h2o.automl(x = x, y = y, training_frame = train,
                      max_runtime_secs = 1800, stopping_metric = "MAE")

    # Store the H2O AutoML leaderboard
    lb <- aml@leaderboard
    lb

    # Inspect the best model in the leaderboard
    aml@leader

    # Generate predictions using the leader model
    pred <- h2o.predict(aml@leader, test)
    predictions <- round(as.vector(pred), 4)

    # Prepare predictions for submission file (same prediction for every month)
    result <- data.frame(cbind(properties$parcelid, predictions, predictions,
                               predictions, predictions, predictions,
                               predictions))
    colnames(result) <- c("parcelid", "201610", "201611", "201612",
                          "201710", "201711", "201712")
    options(scipen = 999)

    # Write results to submission file
    write.csv(result, file = "submission_xgb_ensemble.csv", row.names = FALSE)
Running AutoML for 1800 seconds with MAE as the stopping metric gave me a public leaderboard score of 0.06564.
That’s a good score considering that I haven’t even dealt with basic data preprocessing 🙂