One of the most important aspects of building any supervised learning model on numeric data is understanding the features well. Looking at a model's partial dependence plots shows how its output changes with each feature. This kind of feature exploration helps with the following (a minimal code sketch follows the list):
- Feature understanding
- Identifying noisy features (the most interesting part!)
- Feature engineering
- Feature importance
- Feature debugging
- Leakage detection and understanding
- Model monitoring
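For concreteness, here is a minimal sketch of the kind of binned feature-vs-target trend view described above. The function name, column names, bin count, and the `train_df` DataFrame in the usage comment are placeholder assumptions, not something taken from the original write-up.

```python
# Minimal sketch: bin a numeric feature and look at the mean target per bin,
# which is the univariate trend view discussed above.
import pandas as pd
import matplotlib.pyplot as plt

def feature_trend(df: pd.DataFrame, feature: str, target: str, bins: int = 10) -> pd.DataFrame:
    """Return mean target and population per quantile bin of `feature`."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    return df.groupby(binned, observed=True).agg(
        mean_target=(target, "mean"),   # trend of the target across the feature
        population=(target, "size"),    # how many rows fall in each bin
    )

# Hypothetical usage: train_df with a numeric feature 'age' and a binary 'target'
# trend = feature_trend(train_df, "age", "target")
# trend["mean_target"].plot(marker="o"); plt.xlabel("age bins"); plt.show()
```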
We can see that the higher the trend-correlation threshold used to drop features, the higher the leaderboard (LB) AUC.
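As a hedged illustration of how such a trend-correlation filter could work (not necessarily the exact computation behind the quoted result), the sketch below bins each feature with train-based edges, computes the mean-target trend on two splits, and keeps only features whose trends correlate above a threshold. The DataFrames, feature list, and the 0.95 cutoff are placeholder assumptions.

```python
# Sketch of a trend-correlation filter between two data splits.
import numpy as np
import pandas as pd

def trend_correlation(train: pd.DataFrame, other: pd.DataFrame,
                      feature: str, target: str, bins: int = 10) -> float:
    """Correlate the binned mean-target trend of `feature` on two splits."""
    # Bin edges come from the training split so both trends use the same bins.
    edges = np.unique(np.quantile(train[feature].dropna(), np.linspace(0, 1, bins + 1)))
    t1 = train.groupby(pd.cut(train[feature], edges, include_lowest=True), observed=True)[target].mean()
    t2 = other.groupby(pd.cut(other[feature], edges, include_lowest=True), observed=True)[target].mean()
    aligned = pd.concat([t1.rename("train"), t2.rename("other")], axis=1).dropna()
    return aligned["train"].corr(aligned["other"])

# Hypothetical usage: drop features whose train/validation trends disagree.
# keep = [f for f in numeric_features
#         if trend_correlation(train_df, valid_df, f, "target") > 0.95]
```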
Understanding some of the concepts
For example, feature debugging:
- Checking whether the feature's population distribution looks right. I've personally run into extreme cases like the one above numerous times because of minor bugs (a quick sanity-check sketch follows this list).
- Always hypothesize what a feature's trend will look like before looking at these plots. A trend that doesn't match your expectation may hint at a problem. And frankly, this process of hypothesizing trends makes building ML models much more fun!
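As referenced in the first bullet, a handful of summary numbers is often enough to expose a distribution that "looks wrong". The sketch below is a generic sanity check; the DataFrame and column name in the usage comment are hypothetical.

```python
# Quick distribution sanity checks for a suspect feature.
import pandas as pd

def describe_feature(df: pd.DataFrame, feature: str) -> pd.Series:
    """A few summary numbers that frequently expose pipeline bugs."""
    col = df[feature]
    return pd.Series({
        "n_missing": col.isna().sum(),
        "pct_zero": float((col == 0).mean()),  # a spike at 0 often means a bad fill value
        "n_unique": col.nunique(),
        "min": col.min(),
        "median": col.median(),
        "max": col.max(),                      # absurd extremes hint at unit or join errors
    })

# Hypothetical usage:
# print(describe_feature(train_df, "days_since_last_purchase"))
```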
The leakage detection part
Data leakage from the target into the features leads to overfitting. Leaky features show high feature importance.
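One simple, hedged way to act on this observation is to fit a quick model and look at which features dominate the importance ranking; a single overwhelmingly important feature is a natural leakage suspect. The model choice and column names below are assumptions for illustration.

```python
# Sketch: rank feature importances to spot suspiciously dominant features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_importances(df: pd.DataFrame, target: str, n: int = 10) -> pd.Series:
    """Fit a quick model and return the n largest feature importances."""
    X = df.drop(columns=[target]).select_dtypes("number").fillna(0)
    y = df[target]
    model = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    model.fit(X, y)
    return (pd.Series(model.feature_importances_, index=X.columns)
              .sort_values(ascending=False).head(n))

# Hypothetical usage: a feature that towers over the rest deserves a leakage check.
# print(top_importances(train_df, "target"))
```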
The supervised learning part: exploring the features.