DevilKing's blog


Secret sauce to be in the top 2% on Kaggle

Link to the original article

One of the most important aspects of building any supervised learning model on numeric data is to understand the features well. Looking at partial dependence plots of a model helps you understand how the model’s output changes with any feature.

  1. Feature understanding
  2. Identifying noisy features (the most interesting part!)
  3. Feature engineering
  4. Feature importance
  5. Feature debugging
  6. Leakage detection and understanding
  7. Model monitoring
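The core building block behind all seven uses above is the "feature trend": bin a numeric feature into equal-population bins and look at the mean target per bin. A minimal sketch of that idea with plain pandas (this is an illustration of the technique, not featexp's actual implementation; `feature_trend` is a hypothetical name):

```python
import numpy as np
import pandas as pd

def feature_trend(df, feature, target, bins=10):
    """Bin a numeric feature into equal-population bins and return the
    mean target per bin -- the 'feature trend' that featexp plots."""
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    grouped = df.groupby(binned, observed=True)[target].agg(["mean", "size"])
    return grouped.rename(columns={"mean": "target_mean", "size": "population"})

# Toy data: the target probability rises smoothly with the feature.
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = (rng.random(2000) < 1 / (1 + np.exp(-x))).astype(int)
df = pd.DataFrame({"age": x, "target": y})

trend = feature_trend(df, "age", "target", bins=10)
```

Hypothesizing how `target_mean` should move across bins before looking at the plot is exactly the habit the article recommends.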

We can see that the higher the trend-correlation threshold used to drop features, the higher the leaderboard (LB) AUC.
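A hedged sketch of that thresholding, assuming trend correlation means the correlation between per-bin mean targets computed on train and on test (bin edges learned on train); the 0.9 threshold here is illustrative, not the article's:

```python
import numpy as np
import pandas as pd

def trend_correlation(train, test, feature, target, bins=10):
    """Correlate the per-bin mean target of train vs. test for one feature.
    Noisy features have trends that do not reproduce across the two sets."""
    _, edges = pd.qcut(train[feature], q=bins, retbins=True, duplicates="drop")
    edges[0], edges[-1] = -np.inf, np.inf  # cover out-of-range test values
    t_train = train.groupby(pd.cut(train[feature], edges), observed=True)[target].mean()
    t_test = test.groupby(pd.cut(test[feature], edges), observed=True)[target].mean()
    return float(np.corrcoef(t_train.values, t_test.values)[0, 1])

rng = np.random.default_rng(1)

def make(n):
    x = rng.normal(size=n)                  # real signal
    noise = rng.normal(size=n)              # unrelated to the target
    y = (rng.random(n) < 1 / (1 + np.exp(-x))).astype(int)
    return pd.DataFrame({"signal": x, "noise": noise, "target": y})

train, test = make(5000), make(5000)

# Keep only features whose trend generalises (hypothetical 0.9 threshold).
keep = [f for f in ["signal", "noise"]
        if trend_correlation(train, test, f, "target") > 0.9]
```

The "noise" column's bin means are random in both halves, so its trend correlation collapses while "signal" stays near 1.0.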

Understanding some of the concepts

For example, feature debugging:

  1. Checking if the feature’s population distribution looks right. I’ve personally encountered extreme cases like the one above numerous times due to minor bugs.
  2. Always hypothesize what the feature trend will look like before looking at these plots. Feature trend not looking like what you expected might hint towards some problem. And frankly, this process of hypothesizing trends makes building ML models much more fun!
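One cheap automated check on population distributions is flagging features where a single value covers an unusually large share of rows, which often signals a pipeline bug such as missing values coded as 0. A sketch (the function name and the 0.5 threshold are hypothetical, not from featexp):

```python
import numpy as np
import pandas as pd

def suspicious_features(df, max_single_value_share=0.5):
    """Flag numeric features whose most common value covers more than
    max_single_value_share of all rows -- a common symptom of minor bugs."""
    flagged = {}
    for col in df.select_dtypes(include=np.number).columns:
        share = df[col].value_counts(normalize=True).iloc[0]
        if share > max_single_value_share:
            flagged[col] = round(float(share), 3)
    return flagged

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "income": rng.lognormal(10, 1, size=1000),  # healthy continuous feature
    # Buggy feature: ~90% of rows collapsed to 0 by an upstream join.
    "days_active": np.where(rng.random(1000) < 0.9, 0,
                            rng.integers(1, 365, 1000)),
})
flags = suspicious_features(df)
```

Anything flagged here is worth a look at its univariate plot before trusting the model's use of it.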

The leakage detection part:

Data leakage from target to features leads to overfitting. Leaky features have high feature importance.
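A quick way to surface leaky features is to check the AUC of ranking by each single feature on its own: a raw feature that alone separates the classes almost perfectly is suspect. A self-contained sketch using the Mann–Whitney rank formula for AUC (illustrative only; ties are not handled, and the function name is hypothetical):

```python
import numpy as np

def single_feature_auc(x, y):
    """AUC of ranking samples by a single feature (Mann-Whitney rank
    formula, no tie correction). Values near 1.0 hint at target leakage."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    pos = y == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    auc = (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return max(auc, 1 - auc)  # direction of the feature does not matter

rng = np.random.default_rng(3)
y = rng.integers(0, 2, 2000)
leaky = y + rng.normal(0, 0.01, 2000)       # leaks the label almost exactly
honest = rng.normal(0, 1, 2000) + 0.3 * y   # weak but genuine signal
```

Here `single_feature_auc(leaky, y)` sits at essentially 1.0 while the honest feature stays modest, which is the pattern the article describes: leaky features dominate feature importance and single-feature performance.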

featexp

Feature exploration for supervised learning.