DevilKing's blog


Original article link

A common categorization for data visualizations is exploratory and explanatory.

The key to data visualization is that it can be explained and can tell a story.

Commonly we reduce these two modes into:
Explanatory = has a story, e.g. scrollytelling
Exploratory = doesn't have a story, e.g. dashboards

A Visual Introduction to Machine Learning

On the decision tree part: the terminal nodes of a decision tree are also called leaf nodes, and the prediction is determined by which class has relatively more samples inside that leaf node.
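
To make the leaf-node rule concrete, here is a minimal Go sketch of majority-class prediction at a leaf (the Leaf type and its fields are illustrative assumptions, not from the article):

// A leaf stores how many training samples of each class ended up in it.
type Leaf struct {
	ClassCounts map[string]int
}

// Predict returns the class with the most samples in this leaf
// (ties broken arbitrarily).
func (l Leaf) Predict() string {
	best, bestCount := "", -1
	for class, count := range l.ClassCounts {
		if count > bestCount {
			best, bestCount = class, count
		}
	}
	return best
}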

Original article link

  • Use-Case View: walk through some use cases that cover all the components
  • Logical View: break the system into logical layers
  • Process View: the physical distribution; here you can talk about processes, threads, nodes, clusters, or the currently popular Pods
  • Deployment View: how the application is deployed

Finally come the remaining corner concerns, such as security and performance.

Template link

1. Introduction

1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms and Abbreviations
1.4 References

2. Architectural Representation

3. Architectural Goals and Constraints

4. Use-Case View

4.1 Architecturally-Significant Use Cases

5. Logical View

5.1 Architecture Overview – Package and Subsystem Layering

6. Process View

6.1 Processes
6.2 Process to Design Elements
6.3 Process Model to Design Model Dependencies
6.4 Processes to the Implementation

7. Deployment View

7.1 External Desktop PC
7.2 Desktop PC
7.3 Registration Server
7.4 Course Catalog
7.5 Billing System

8. Size and Performance

9. Quality

Use case

The Logical View is expressed with an E-R diagram.

The Process View is expressed with a class diagram.

Class diagram

Original article link

mod-N hashing

This tends to rule out cryptographic hash functions like SHA-1 or MD5. Yes, they are well distributed, but they are also too expensive to compute; there are much cheaper options available. Something like MurmurHash is good, but there are slightly better ones out there now. Non-cryptographic hash functions like xxHash, MetroHash or SipHash-1-3 are all good replacements.

server := serverList[hash(key) % N]

What are the downsides of this approach? The first is that if you change the number of servers, almost every key will map somewhere else. This is bad.
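
A rough way to see the scale of the problem (this demo is not from the original article; fnv is just a stand-in hash):

package main

import (
	"fmt"
	"hash/fnv"
)

// bucket picks a server index the mod-N way.
func bucket(key string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % n
}

func main() {
	const n = 10
	moved := 0
	for i := 0; i < 10000; i++ {
		key := fmt.Sprintf("key-%d", i)
		if bucket(key, n) != bucket(key, n+1) {
			moved++
		}
	}
	// Going from n to n+1 servers remaps roughly n/(n+1) of the keys
	// (about 90% here), far more than the ideal 1/(n+1).
	fmt.Printf("moved %d of 10000 keys\n", moved)
}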

The ideal behavior

  • When adding or removing servers, only 1/nth of the keys should move.
  • Don’t move any keys that don’t need to move.

Ring hashing (the consistent hashing paper)

In practice, each server appears multiple times on the circle. These extra points are called “virtual nodes”, or “vnodes”. This reduces the load variance among servers: with only a small number of vnodes, different servers could be assigned wildly different numbers of keys, so virtual nodes are introduced to smooth this out.

One of the other nice things about ring hashing is that the algorithm is straight-forward.

func (m *Map) Add(nodes ...string) {
	for _, n := range nodes {
		// Hash each node m.replicas times under distinct virtual-node names.
		for i := 0; i < m.replicas; i++ {
			hash := int(m.hash([]byte(strconv.Itoa(i) + " " + n)))
			m.keys = append(m.keys, hash)
			m.hashMap[hash] = n
		}
	}
	// Keep the ring sorted so Get can binary-search it.
	sort.Ints(m.keys)
}

Each node is hashed m.replicas times with slightly different names ("0 node1", "1 node1", "2 node1", …).

Note the concept of replicas here: each node is effectively written onto the ring several times. The keys are then sorted so that lookups can use binary search.
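
For reference, a minimal sketch of the Map struct these two methods seem to assume (the field names are inferred from the snippets, so treat this as an assumption rather than the library's actual definition):

// A Map is a hash ring: the sorted hashes of every virtual node plus a
// table mapping each virtual-node hash back to its server name.
type Map struct {
	hash     func(data []byte) uint32 // e.g. crc32.ChecksumIEEE or a faster non-cryptographic hash
	replicas int                      // virtual nodes per server
	keys     []int                    // sorted virtual-node hashes
	hashMap  map[int]string           // virtual-node hash -> server name
}

func New(replicas int, fn func(data []byte) uint32) *Map {
	return &Map{
		hash:     fn,
		replicas: replicas,
		hashMap:  make(map[int]string),
	}
}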

func (m *Map) Get(key string) string {
	hash := int(m.hash([]byte(key)))
	// Binary search for the first virtual node clockwise from the key's hash.
	idx := sort.Search(len(m.keys),
		func(i int) bool { return m.keys[i] >= hash },
	)
	// Wrap around the ring if the search ran off the end.
	if idx == len(m.keys) {
		idx = 0
	}
	return m.hashMap[m.keys[idx]]
}
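
Usage would then look something like this (assuming the hypothetical New constructor sketched above; needs "fmt" and "hash/crc32"):

m := New(100, crc32.ChecksumIEEE) // 100 virtual nodes per server
m.Add("node1", "node2", "node3")
fmt.Println(m.Get("user:42")) // the same key keeps landing on the same node while membership is unchanged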

Ketama

go-ketama

Not enough

Ring hashing still has some problems.

  • The load distribution across the nodes can still be uneven.
  • The ring of virtual nodes has a real memory cost, which grows with the number of nodes times the replica count.

Jump hash

Jump Hash addresses both of these disadvantages of ring hashing: it has no memory overhead and virtually perfect key distribution.

Implementation: github.com/dgryski/go-jump

func Hash(key uint64, numBuckets int) int32 {
	var b int64 = -1
	var j int64
	for j < int64(numBuckets) {
		b = j
		// Advance a cheap pseudo-random sequence seeded by the key.
		key = key*2862933555777941757 + 1
		// Use it to jump forward to the next candidate bucket.
		j = int64(float64(b+1) *
			(float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

It then uses the random numbers to “jump forward” in the list of buckets until it falls off the end; the last bucket it landed in is the result.

It’s fast and splits the load evenly. What’s the catch? The main limitation is that it only returns an integer in the range 0..numBuckets-1. It doesn’t support arbitrary bucket names. (With ring hash, even if two different instances receive their server lists in a different order, the resulting key mapping will still be the same.)
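
In practice that means you keep your own ordered slice of server names and index into it with the bucket Jump Hash returns, and that order has to be identical on every client or they will disagree. A rough sketch (hashString is a hypothetical helper; Hash is the function above):

import "hash/fnv"

// The slice order matters: Jump Hash hands back positions, not names.
var servers = []string{"node1", "node2", "node3"}

// hashString turns a string key into the uint64 that Jump Hash expects.
func hashString(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func lookup(key string) string {
	return servers[Hash(hashString(key), len(servers))]
}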

Multi-Probe Consistent Hashing

The basic idea is that instead of hashing the nodes multiple times and bloating the memory usage, the nodes are hashed only once but the key is hashed k times on lookup and the closest node over all queries is returned. The value of k is determined by the desired variance. For a peak-to-mean-ratio of 1.05 (meaning that the most heavily loaded node is at most 5% higher than the average), k is 21. With a tricky data structure you can get the total lookup cost from O(k log n) down to just O(k). My implementation uses the tricky data structure.
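
A naive sketch of that lookup, written as an O(k·n) scan rather than the tricky data structure (the seeded hashing below is only an illustration, not the referenced implementation's scheme):

import (
	"hash/fnv"
	"strconv"
)

// Each node is hashed exactly once onto a 64-bit ring.
type mpchRing struct {
	k      int      // probes per lookup; ~21 for a 1.05 peak-to-mean ratio
	names  []string
	points []uint64 // points[i] is the ring position of names[i]
}

func seededHash(s string, seed int) uint64 {
	h := fnv.New64a()
	h.Write([]byte(strconv.Itoa(seed) + " " + s))
	return h.Sum64()
}

func newMPCHRing(k int, nodes ...string) *mpchRing {
	r := &mpchRing{k: k}
	for _, n := range nodes {
		r.names = append(r.names, n)
		r.points = append(r.points, seededHash(n, 0))
	}
	return r
}

// Get hashes the key k times and returns the node whose point is closest
// (clockwise) to any of the k probes.
func (r *mpchRing) Get(key string) string {
	best := ""
	bestDist := ^uint64(0)
	for probe := 0; probe < r.k; probe++ {
		hp := seededHash(key, probe)
		for i, p := range r.points {
			if d := p - hp; d < bestDist { // unsigned subtraction wraps around the ring
				bestDist = d
				best = r.names[i]
			}
		}
	}
	return best
}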

Maglev Hashing

Replication

Replica mechanism

Weighted Hosts

Load Balancing

Benchmarks


Conclusion

References