Benchmarking AutoML frameworks

Some ideas:

Each of the four frameworks (auto_ml, auto-sklearn, TPOT, and H2O) was tested with its suggested parameters across 10 random seeds per dataset. Weighted F1 score and mean squared error were selected as the evaluation criteria for classification and regression problems, respectively.
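As a rough illustration of the two evaluation criteria and the per-seed repetition, here is a minimal standard-library sketch. The function names and the `run_once` callable are hypothetical and not taken from the benchmark's code; in practice a library such as scikit-learn would normally compute these metrics.

```python
from collections import Counter
from statistics import mean


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights equal to each class's share of y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        # n > 0 for every label in support, so the denominator is never zero.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def evaluate_over_seeds(run_once, seeds=range(10)):
    """Call run_once(seed) for each seed (hypothetical callable returning a score)
    and return the mean score together with the individual scores."""
    scores = [run_once(seed) for seed in seeds]
    return mean(scores), scores
```

Here `run_once` stands in for "fit one framework on one dataset with this seed and return its score"; reporting the per-seed scores alongside the mean keeps the spread across seeds visible.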
We used a best-effort approach to ensure that all tests completed and that each test had at least 3 chances to succeed within the 3-hour limit (that is, both the time and the number of attempts were capped).
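One way to read "at least 3 chances within the 3-hour limit" is sketched below. This is an assumption about how such a policy could be implemented, not the benchmark's actual harness: the helper `run_with_retries` and its parameters are hypothetical, and the remaining time is split evenly over the attempts that are left so that a failed run cannot use up the whole budget.

```python
import time


def run_with_retries(task, max_attempts=3, time_limit_s=3 * 60 * 60,
                     clock=time.monotonic):
    """Call task(budget_s) up to max_attempts times within an overall time limit.

    Returns (result, errors). result is None if every attempt failed or the
    time limit ran out. The task is expected to respect budget_s itself;
    this sketch does not interrupt a running task.
    """
    deadline = clock() + time_limit_s
    errors = []
    for attempt in range(max_attempts):
        remaining = deadline - clock()
        if remaining <= 0:
            break  # overall limit exhausted
        # Split what is left evenly across the attempts still available.
        budget_s = remaining / (max_attempts - attempt)
        try:
            return task(budget_s), errors
        except Exception as exc:  # best effort: record the failure and retry
            errors.append(exc)
    return None, errors
```

A caller would wrap one framework run in `task` and pass `budget_s` on to the framework's own time-limit setting. Enforcing the limit from outside would need a subprocess with a timeout, which is left out here.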

Some analysis of the datasets