DevilKing's blog

Under a cold lamp I study my sword: how much glory rests on the blade? The incense need not account for all the living; though I vanish in a wisp of smoke, buried under ten thousand fathoms of cloud, the lone sun still shines on the ancient tombs.


Life and work lately feel like one tangled mess: procrastination plus things piling up, and I feel like I'm about to explode...

So much is piled up, yet I don't feel like doing any of it, idly burning time in a slump... In the end none of it can be escaped, and many of the things I want to do get pushed back for lack of time...

No future or hope in sight...

And my older sister's marriage: where will it finally take her? Neither of them has really grown up, and in the end their life together may just keep dragging along...

As for me, maybe something really is wrong with me. Even if I can earn money, so what? It's a drop in the bucket, and the inferiority in my bones won't go away. Whichever way I choose, I can't get past that final hurdle; maybe I really am unwell... And with all that self-doubt, the idea of ending the long-distance relationship this year looks just as unrealistic. What do I even have to end it with? I can't see it...

Honestly, it also comes down to my lack of recent growth. Technical improvement keeps getting put off while I invent excuses to escape, and in the end the work is still the same work; all I've gained is regret...

Then the housing question arrived out of nowhere. All the hesitation, at its root, is probably my own problem: unwilling to let go of the one last straw, still clinging to a fantasy... heh... while the other family runs around everywhere because of it.

Things to get done next:

  • Japan visa: submit the application materials

  • Word cloud of our chats: finish the first version (done)

  • Tidy up the apartment, take a few photos, and get it ready to rent out

  • For the Qixi mini program, see whether a small demo can be put together

Start slowly working through deep learning and Jupyter notebooks.

Try to make the evenings count and kick the procrastination habit...

Evolution of the Kafka Message Format

Kafka 0.7.x

Message Format:

magic | attribute | crc | value

Message Set:

offset | size | message

magic: 1 byte, identifies the message format version

attribute: 1 byte, stores the codec used to compress the message

crc: 4 bytes, checksum of the message contents

value: N - 6 bytes, where N is the total size of the message in bytes

Within a message set:

offset: 8 bytes, the physical offset of the message after it is stored on disk

size: 4 bytes, the size of the message

Messages are sent in units of message sets, and compression is also applied at the message-set level: the value field of a wrapper message holds an entire compressed message set, i.e. multiple messages at once.
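As a rough illustration, decoding one such message in Python might look like the sketch below (assuming exactly the byte layout described above; this is not the official client code):

import struct
import zlib

def parse_v07_message(buf):
    # Layout per this post: magic (1 byte) + attribute (1 byte) + crc (4 bytes) + value (N - 6 bytes).
    magic, attribute, crc = struct.unpack_from(">BBI", buf, 0)
    value = buf[6:]
    # A consumer would verify the checksum over the value.
    if zlib.crc32(value) & 0xFFFFFFFF != crc:
        raise ValueError("CRC mismatch")
    return magic, attribute, value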

Kafka 0.8.x

0.8.x adds key-related fields plus explicit length fields for the content, so sizes no longer need to be derived via the N - 6 convention.

crc | magic | attribute | key length | key | value length | value

When multiple messages are compressed together, the wrapper message carries no key, and at that point its value is the compressed content of the inner messages.
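A matching sketch for building a single 0.8-style message under the layout above (length -1 is the usual sentinel for a null key or value; again illustrative, not the real client):

import struct
import zlib

def encode_v08_message(key, value, magic=0, attributes=0):
    def length_prefixed(b):
        # A length of -1 marks a null key/value.
        return struct.pack(">i", -1) if b is None else struct.pack(">i", len(b)) + b
    body = struct.pack(">BB", magic, attributes) + length_prefixed(key) + length_prefixed(value)
    crc = zlib.crc32(body) & 0xFFFFFFFF  # the crc covers everything after itself
    return struct.pack(">I", crc) + body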

Kafka 0.10.x

0.10.x introduces Kafka Streams, and the message format gains a timestamp field:

crc | magic | attribute | timestamp | key length | key | value length | value

Kafka 0.11.x

A major break from everything before: the message format changed completely (messages became records grouped into record batches).

Ever wondered how to configure --num-executors, --executor-memory and --executor-cores Spark config params for your cluster?

  • Lil bit theory: Let's see some key recommendations that will help us understand it better

  • Hands on: Next, we'll take an example cluster and come up with recommended numbers for these Spark params

Lil bit theory:

  • Hadoop/YARN/OS daemons:

Several daemons run in the background, such as NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker.

So while sizing num-executors, we need to make sure that we leave aside enough cores (~1 core per node) for these daemons.

  • YARN ApplicationMaster (AM)

If we are running Spark on YARN, then we need to budget in the resources that the AM would need (~1024MB and 1 executor).

  • HDFS Throughput: the HDFS client struggles with lots of concurrent threads, so it is commonly recommended to keep to ~5 tasks per executor to achieve full HDFS write throughput

  • MemoryOverhead

Full memory requested to YARN per executor = spark.executor.memory + spark.yarn.executor.memoryOverhead

spark.yarn.executor.memoryOverhead = max(384MB, 7% of spark.executor.memory)

So, if we request 20GB per executor, YARN will actually allocate 20GB + memoryOverhead = 20GB + 7% of 20GB ≈ 21.4GB of memory for us.
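As a quick sanity check of that formula in plain Python (values from the example above):

executor_mem_gb = 20
overhead_gb = max(0.384, 0.07 * executor_mem_gb)   # max(384MB, 7% of executor memory)
print(executor_mem_gb + overhead_gb)               # 21.4 (GB actually requested from YARN)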

tips:

Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM.

The example cluster config:

Cluster Config:
10 Nodes
16 cores per Node
64GB RAM per Node

  • First Approach: Tiny executors [One Executor per core]:
  • --num-executors = In this approach, we'll assign one executor per core
    = `total-cores-in-cluster`
    = `num-cores-per-node * total-nodes-in-cluster`
    = 16 x 10 = 160
  • --executor-cores = 1 (one executor per core)
  • --executor-memory = amount of memory per executor
    = `mem-per-node / num-executors-per-node`
    = 64GB / 16 = 4GB

Not Good!

  • Second Approach: Fat executors (One Executor per node):
  • --num-executors = In this approach, we'll assign one executor per node
    = `total-nodes-in-cluster`
    = 10
  • --executor-cores = one executor per node means all the cores of the node are assigned to one executor
    = `total-cores-in-a-node`
    = 16
  • --executor-memory = amount of memory per executor
    = `mem-per-node / num-executors-per-node`
    = 64GB / 1 = 64GB

Not good! HDFS throughput suffers with this many concurrent tasks per JVM.

  • Third Approach: Balance between Fat (vs) Tiny

    • Based on the recommendations above, let's assign 5 cores per executor => --executor-cores = 5 (for good HDFS throughput)
    • Leave 1 core per node for Hadoop/YARN daemons => num cores available per node = 16 - 1 = 15
    • So, total available cores in the cluster = 15 x 10 = 150
    • Number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30
    • Leaving 1 executor for the ApplicationMaster => --num-executors = 29
    • Number of executors per node = 30 / 10 = 3
    • Memory per executor = 64GB / 3 = 21GB
    • Subtracting off-heap overhead: max(384MB, 7% of 21GB) ≈ 1.5GB; rounding down generously for headroom, actual --executor-memory = 21 - 3 = 18GB

recommended config is: 29 executors, 18GB memory each and 5 cores each!!
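To make the recipe concrete, here is a small Python helper that mechanizes the arithmetic above (a sketch under this post's assumptions: 1 core per node reserved for daemons, 1 executor for the AM, 7% memory overhead; the function name is ours, not a Spark API):

def size_executors(nodes, cores_per_node, mem_per_node_gb,
                   cores_per_executor=5, overhead_frac=0.07):
    # Leave 1 core per node for Hadoop/YARN/OS daemons.
    usable_cores = (cores_per_node - 1) * nodes
    total_executors = usable_cores // cores_per_executor
    num_executors = total_executors - 1          # leave 1 executor for the YARN AM
    execs_per_node = total_executors // nodes
    mem_share_gb = mem_per_node_gb / execs_per_node
    overhead_gb = max(0.384, overhead_frac * mem_share_gb)
    heap_gb = int(mem_share_gb - overhead_gb)    # conservative floor
    return num_executors, cores_per_executor, heap_gb

print(size_executors(10, 16, 64))  # (29, 5, 19); the post rounds down further to 18GB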

One-hot encoding is also known as one-of-K encoding.

The idea is to use an N-bit status register to encode N states: each state gets its own register bit, and at any moment exactly one bit is active. Put differently, if a feature has m possible values, one-hot encoding turns it into m binary features. These features are mutually exclusive, with only one active at a time, so the encoded data becomes sparse.

Advantages:

  • It can handle non-continuous (categorical) numeric features

  • To some extent it also expands the feature set. For example, gender is a single feature, but after one-hot encoding it becomes two features, male and female

Reasons for using one-hot:

  • It extends the values of a discrete feature into Euclidean space: each value of the feature corresponds to a point in that space

  • In regression, classification, clustering and other machine-learning algorithms, computing distances or similarities between features matters a great deal, and the measures we commonly use, cosine similarity included, are defined over Euclidean space

  • One-hot encoding discrete features therefore makes the distance computations between them more reasonable


import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Note: this uses the older sklearn API; n_values_ and feature_indices_
# were deprecated in scikit-learn 0.20 and removed in later releases.
enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])
print("enc.n_values_ is:", enc.n_values_)
print("enc.feature_indices_ is:", enc.feature_indices_)
print(enc.transform([[0, 1, 1]]).toarray())

Output:

enc.n_values_ is: [2 3 4]
enc.feature_indices_ is: [0 2 5 9]
[[ 1. 0. 0. 1. 0. 0. 1. 0. 0.]]

Across the fitted matrix, each row is a sample and each column a feature; n_values_ reports how many distinct values each feature takes (here 2, 3 and 4).

feature_indices_: as the docstring below spells out, it is simply the cumulative sum of n_values_ ([0 2 5 9]), giving the column range each feature occupies in the encoded output.

The last line is the one-hot encoding of [0, 1, 1]: feature 0 (2 values) contributes [1, 0], feature 1 (3 values) contributes [0, 1, 0], and feature 2 (4 values) contributes [0, 1, 0, 0], which concatenate into the 9-column row above.
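The same decomposition can be re-derived by hand with plain numpy (a sketch, using the n_values_ and feature_indices_ values reported above):

import numpy as np

n_values = [2, 3, 4]                          # distinct values per feature
feature_indices = np.cumsum([0] + n_values)   # [0 2 5 9]
row = [0, 1, 1]

encoded = np.zeros(feature_indices[-1])
for i, v in enumerate(row):
    encoded[feature_indices[i] + v] = 1.0     # one active bit per feature
print(encoded)  # [1. 0. 0. 1. 0. 0. 1. 0. 0.]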

"""Encode categorical integer features using a one-hot aka one-of-K scheme.

The input to this transformer should be a matrix of integers, denoting
the values taken on by categorical (discrete) features. The output will be
a sparse matrix where each column corresponds to one possible value of one
feature. It is assumed that input features take on values in the range
[0, n_values).

This encoding is needed for feeding categorical data to many scikit-learn
estimators, notably linear models and SVMs with the standard kernels.

Read more in the :ref:`User Guide <preprocessing_categorical_features>`.

Attributes
----------
active_features_ : array
Indices for active features, meaning values that actually occur
in the training set. Only available when n_values is ``'auto'``.

feature_indices_ : array of shape (n_features,)
Indices to feature ranges.
Feature ``i`` in the original data is mapped to features
from ``feature_indices_[i]`` to ``feature_indices_[i+1]``
(and then potentially masked by `active_features_` afterwards)

n_values_ : array of shape (n_features,)
Maximum number of values per feature.

Notes on Go goroutines

M:N scheduling: M OS threads run N goroutines, where P refers to the scheduling context (the processor).

Goroutines and channels go hand in hand; channels themselves come in two kinds, unbuffered and buffered.

Locking and unlocking are done through sync.Mutex.

Blocking and wake-up (recovery) are handled internally through send and recv function pointers.
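Since this blog's code samples are in Python, here is a loose Python analogy of buffered channels and mutex locking (threading.Lock standing in for sync.Mutex, queue.Queue for a buffered channel; Go's runtime scheduler itself has no Python equivalent):

import threading
import queue

ch = queue.Queue(maxsize=2)   # rough analogue of make(chan string, 2)
counter = 0
lock = threading.Lock()       # plays the role of sync.Mutex

def worker():
    global counter
    for _ in range(1000):
        with lock:            # Lock()/Unlock() pair
            counter += 1
    ch.put("done")            # send on the channel

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for _ in threads:
    print(ch.get())           # receive blocks until a value arrives
for t in threads:
    t.join()
print(counter)                # 2000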