DevilKing's blog

冷灯看剑,剑上几分功名?炉香无需计苍生,纵一穿烟逝,万丈云埋,孤阳还照古陵

0%

One Hot编码

独热编码,又成为一位有效编码

其方法是使用N位状态寄存器来对N个状态进行编码,每个状态都由他独立的寄存器位,并且在任意时候,其中只有一位有效。可以这样理解,对于每一个特征,如果它有m个可能值,那么经过独热编码后,就变成了m个二元特征。并且,这些特征互斥,每次只有一个激活。因此,数据会变成稀疏的

优点部分:

  • 能够处理非连续型数值特征

  • 在一定程度上也扩充了特征。比如性别本身是一个特征,经过one hot编码以后,就变成了男或女两个特征

采用one hot的原因:

  • 将离散特征的取值扩展到了欧式空间,离散特征的某个取值就对应欧式空间的某个点

  • 在回归,分类,聚类等机器学习算法中,特征之间距离的计算或相似度的计算是非常重要的,而我们常用的距离或相似度的计算都是在欧式空间的相似度计算,计算余弦相似性,基于的就是欧式空间

  • 将离散型特征使用one-hot编码,可以会让特征之间的距离计算更加合理

1
2
3
4
5
6
7
8
9
10

import numpy as np
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1],[1, 0, 2]])
print "enc.n_values_ is:",enc.n_values_
print "enc.feature_indices_ is:",enc.feature_indices_
print enc.transform([[0, 1, 1]]).toarray()

输出结果部分:

1
2
3
4
enc.n_values_ is: [2 3 4]
enc.feature_indices_ is: [0 2 5 9]
[[ 1. 0. 0. 1. 0. 0. 1. 0. 0.]]

横向为相关的样本空间,纵向表示相关的特征取值范围,

eature_indices_:根据说明,明显可以看出其是对n_values的一个累加。

最后表示的为相关的one hot编码?

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
"""Encode categorical integer features using a one-hot aka one-of-K scheme.

The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features. The output will be
a sparse matrix where each column corresponds to one possible value of one
feature. It is assumed that input features take on values in the range
[0, n_values).

This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.

Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

Attributes
----------
active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.

feature_indices_ : array of shape (n_features,)
Indices to feature ranges.
Feature ``i`` in the original data is mapped to features
from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
(and then potentially masked by `active_features_` afterwards)

n_values_ : array of shape (n_features,)
Maximum number of values per feature.