DevilKing's blog

Designing-Data-Intensive-Applications

Posted on 2017-05-02 In data-intensive

Original article for reference:

A Review of “Designing Data-Intensive Applications”

Foundations of Data Systems

In the section describing performance, the concepts of percentiles, outliers, latency and response time are introduced to help quantify performance.
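As a rough illustration of why percentiles describe response times better than an average, here is a minimal sketch using the nearest-rank method. The sample values and the helper name `percentile` are made up for this example and do not come from the book.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical response times in milliseconds, including one outlier.
const responseTimesMs = [12, 15, 11, 14, 13, 16, 12, 950, 15, 14];

const p50 = percentile(responseTimesMs, 50); // median, barely affected by the outlier
const p99 = percentile(responseTimesMs, 99); // tail latency, dominated by the outlier
const mean =
  responseTimesMs.reduce((sum, t) => sum + t, 0) / responseTimesMs.length;

console.log({ p50, p99, mean });
```

The single slow request pulls the mean far above what a typical user sees, while the median stays put and the high percentile exposes the outlier.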

Approaches to interacting with data are discussed, with SQL and MapReduce receiving the most attention.

Big-O notation is introduced to explain the computational complexity of algorithms.

Append-only systems, B-trees, Bloom filters, hash maps, sorted string tables and log-structured merge-trees are all brought up.

Storage system implementation details such as how to handle deleting records, crash recovery, partially-written records and concurrency control are covered as well. It’s also explained how the above play a role in systems such as Google’s Bigtable, HBase, Cassandra and Elasticsearch to name a few.
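To make the append-only idea concrete, here is a toy sketch of a key-value store whose writes only append to a log, with an in-memory hash map as the index. Deletes are recorded as tombstones and the index is rebuilt by replaying the log after a crash. The class `AppendOnlyStore` and its methods are hypothetical names for illustration; a real storage engine also has to deal with files, checksums and partially written records.

```javascript
// A toy append-only key-value store: writes only ever append to the log,
// and an in-memory hash map points each key at its latest record.
class AppendOnlyStore {
  constructor() {
    this.log = [];          // stands in for an append-only file on disk
    this.index = new Map(); // key -> position of the latest record in the log
  }

  set(key, value) {
    this.log.push({ key, value, deleted: false });
    this.index.set(key, this.log.length - 1);
  }

  // Deleting appends a tombstone record instead of rewriting old data.
  delete(key) {
    this.log.push({ key, value: null, deleted: true });
    this.index.delete(key);
  }

  get(key) {
    const pos = this.index.get(key);
    return pos === undefined ? undefined : this.log[pos].value;
  }

  // Crash recovery: rebuild the hash index by replaying the log from the start.
  static recover(log) {
    const store = new AppendOnlyStore();
    store.log = log;
    log.forEach((record, pos) => {
      if (record.deleted) store.index.delete(record.key);
      else store.index.set(record.key, pos);
    });
    return store;
  }

  // Compaction: keep only the latest live record for each key.
  compact() {
    const compacted = new AppendOnlyStore();
    for (const [key, pos] of this.index) {
      compacted.set(key, this.log[pos].value);
    }
    return compacted;
  }
}

const store = new AppendOnlyStore();
store.set("a", 1);
store.set("b", 2);
store.set("a", 3); // the old record for "a" stays in the log
store.delete("b"); // appends a tombstone

const recovered = AppendOnlyStore.recover(store.log);
const compactedStore = store.compact();
```

After these operations the log holds four records, while the compacted store holds only the one live record for "a".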

Page 88 onward does a good job of contrasting OLTP and OLAP systems and uses this as a segue into data warehousing.

Data cubes, ETL, column-oriented storage, star and snowflake schemas, fact and dimension tables, sort orderings and aggregation are all discussed. Teradata, Vertica, SAP HANA, ParAccel, Redshift and Hadoop are mentioned as systems incorporating these concepts into their offerings.

data flow:
REST, RPC, microservices, streams and message brokers are explained and implementations such as TIBCO, IBM WebSphere, webMethods, RabbitMQ, ActiveMQ, HornetQ, NATS and Kafka are referenced.

Distributed Data

problems with multiple machines:
Single-leader, multi-leader and leaderless replication, synchronous and asynchronous replication, fault tolerance, node outages, leadership elections, replication logs, consistency, monotonic reads, consistent prefix reads and replication lag are all discussed.

partitioning data in order to achieve scalability:
shards in MongoDB, Elasticsearch and SolrCloud, regions in HBase, tablets in Bigtable, vnodes in Cassandra and Riak, and vBuckets in Couchbase.

transaction:
ACID, weak isolation levels, dirty reads and writes, materialising conflicts, locks and MVCC.

problems in distributed systems:
Network partitions, unreliable clocks, process pauses and mitigating garbage collection issues.

consistency and consensus:
The CAP theorem, linearisability, serialisability, quorums, ordering guarantees, coordinator failure, exactly-once message processing among many other topics
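As a small illustration of the quorum idea: with n replicas, a write acknowledged by w nodes and a read that queries r nodes are guaranteed to share at least one node when w + r > n. The function name `quorumsOverlap` is an assumption for this sketch, and the check says nothing about the stronger guarantees (such as linearisability) that the book discusses separately.

```javascript
// With n replicas, every read of r nodes overlaps every write acknowledged
// by w nodes exactly when w + r > n (pigeonhole principle).
function quorumsOverlap(n, w, r) {
  if (w < 1 || r < 1 || w > n || r > n) {
    throw new RangeError("w and r must be between 1 and n");
  }
  return w + r > n;
}

// Any 2 of 3 nodes read must include one of the 2 nodes written.
const majorityQuorum = quorumsOverlap(3, 2, 2);
// Reading 1 node after writing 1 node may miss the latest write.
const singleNodeQuorum = quorumsOverlap(3, 1, 1);
```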

Derived Data

data flow engines such as Spark, Tez and Flink.

batch processing against stream processing: “offline” data vs “online” data

stream processing:
Producers, consumers, brokers, logs, offsets, topics, partitions, replaying, immutability, windowing methods, joins and fault tolerance
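One of the windowing methods mentioned above, the tumbling window, can be sketched in a few lines: events are grouped into fixed-size, non-overlapping windows by their timestamp. The event shape (`timestampMs`) and the function name `tumblingWindowCounts` are assumptions made for this example only.

```javascript
// Group events into fixed-size, non-overlapping (tumbling) windows by event
// time and count the events that fall into each window.
function tumblingWindowCounts(events, windowSizeMs) {
  const counts = new Map(); // window start time -> number of events
  for (const event of events) {
    const windowStart =
      Math.floor(event.timestampMs / windowSizeMs) * windowSizeMs;
    counts.set(windowStart, (counts.get(windowStart) || 0) + 1);
  }
  return counts;
}

// Hypothetical click events with timestamps in milliseconds.
const clickEvents = [
  { timestampMs: 1000 },
  { timestampMs: 1500 },
  { timestampMs: 61000 },
  { timestampMs: 119999 },
  { timestampMs: 120000 },
];

// One-minute windows starting at 0, 60000 and 120000.
const clicksPerMinute = tumblingWindowCounts(clickEvents, 60000);
```

Each event belongs to exactly one window, which is what distinguishes tumbling windows from hopping or sliding ones.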

the future of data

The second topic is migration of data between systems. Making it as easy as the following would go a long way toward making data systems act in a more Unix pipe-like fashion.

$ mysql | elasticsearch

There is also discussion of databases that check themselves for failure and auto-heal.

A First Look at Golang Memory Management

Posted on 2017-04-27 In golang

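Golang memory management (part 1)

An introduction to tcmalloc

Goroutine scheduling consists of three parts: G, M and P

G: the execution context of a goroutine
M: an operating system thread
P: Processor, the scheduler

Escape analysis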

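package main

import ()

func foo() *int {
  var x int // escapes: its address is returned to the caller
  return &x
}

func bar() int {
  x := new(int) // does not escape: only the value *x is returned
  *x = 1
  return *x
}

func main() {}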

The command to run, and its output:

$ go run -gcflags '-m -l' escape.go
./escape.go:6: moved to heap: x
./escape.go:7: &x escape to heap
./escape.go:11: bar new(int) does not escape

Note that in the code above, x in foo() is allocated on the heap, while x in bar() is allocated on the stack.

Back to My Original Intentions

Posted on 2017-04-24 In weekly

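I had gotten used to NetEase Cloud Music, but it feels too focused on the comments. Sometimes I just want to listen to music quietly. Going back to Douban Music, simply listening in peace, is actually quite nice.

Done this week:

Finished integrating the bidder data
Refactored code

Not done:

Organizing the hawkeye-related data, regarding the missing data
How to handle the invalid connection error?

Learned this week:

On the redis side, pay attention to how empty values are handled; try not to fetch data with an empty key
On using goroutines, mainly their memory and CPU usage

Plan for next week:

Organize the hawkeye data
Organize the bidder data and documentation

Because of laziness there was a fairly long gap this time; I should not let that happen again...

As for exercise and learning guitar, I need to slowly get back into them...

Things will work out when the time comes...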

Function Program

Posted on 2017-04-01 In function program

Original article

offers a helpful twist on typical functional programming operations (like map(..), compose(..), etc)

It mainly explains functional programming from a JS perspective, for example with ES6.

There are many FP libraries in JS.


function uppercase(v) { return v.toUpperCase(); }

var words = ["Now","Is","The","Time"];
var moreWords = ["The","Quick","Brown","Fox"];

var f = R.map( uppercase );
f( words );                        // ["NOW","IS","THE","TIME"]
f( moreWords );                    // ["THE","QUICK","BROWN","FOX"]

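With Ramda, R.map(..) is curried.

Note the order: the mapper function (here uppercase) is supplied first, and the array to iterate over comes second.

Because it is curried, we can flip the argument order: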


function lowercase(v) { return v.toLowerCase(); }

var p = R.flip( R.map )( words );

p( lowercase );                 // ["now","is","the","time"]
p( uppercase );                 // ["NOW","IS","THE","TIME"]

This way the array is supplied first and the mapper second. R.flip(..) is what swaps the two arguments.

This raises a problem: in functional programming you have to remember the order of the arguments.


function concatStr(s1,s2) { return s1 + s2; }

var words = ["Now","Is","The","Time"];

_.reduce( concatStr, _, words );
// NowIsTheTime

_.reduce( concatStr, "Str: ", words );
// Str: NowIsTheTime

Here _.reduce(..) takes its arguments in the order reducerFunction, initialValue, arr. In practice, though, we rarely supply initialValue.

One solution is named arguments. Isn't this approach quite similar to Python's?
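A minimal sketch of what named arguments could look like in plain JS, using object destructuring. The helper `reduceNamed` and its parameter names `fn`, `arr` and `initialValue` are hypothetical; they are not the API of any particular library.

```javascript
function concatStr(s1, s2) { return s1 + s2; }

// Hypothetical helper: it takes a single object, so callers name each
// argument instead of remembering its position.
function reduceNamed({ fn, arr, initialValue }) {
  return initialValue === undefined
    ? arr.reduce((acc, v) => fn(acc, v))
    : arr.reduce((acc, v) => fn(acc, v), initialValue);
}

const wordList = ["Now", "Is", "The", "Time"];

// Argument order no longer matters, and initialValue can simply be left out.
const joined = reduceNamed({ arr: wordList, fn: concatStr });
// "NowIsTheTime"

const prefixed = reduceNamed({ fn: concatStr, initialValue: "Str: ", arr: wordList });
// "Str: NowIsTheTime"
```

The trade-off is a slightly more verbose call site in exchange for not having to memorize positions or pass a placeholder for a skipped argument.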

Cache Strategy

Posted on 2017-03-29 In cache

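Approaches to high-performance service architecture

Agenda:

Caching strategies: concepts and examples
The hard part of caching strategies: invalidation mechanisms for cached data with different characteristics
Distribution strategies: concepts and examples
The hard part of distribution strategies: balancing the safety of shared data against code complexity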

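Caching

Caching strategy

Symptoms of performance problems:

Insufficient concurrency: too many clients send requests at the same time and the server runs out of memory
Processing latency is too high: CPU usage is pegged at 100%

Abstract resources:

Time resources: CPU and disk reads/writes
Space resources: memory and network card bandwidth

A caching strategy essentially trades memory space for disk read/write time.

In the server program of an online game, every write first goes to an in-memory structure, and the server periodically writes the data back to the database on its own initiative. This turns many database writes into a single write and saves much of the cost of writing to the database. This is also a space-for-time strategy.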
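The write-back approach described above can be sketched as follows. The class names `WriteBehindCache` and `FakeDatabase`, and the assumed store interface (`get(key)` and `writeBatch(entries)`), are inventions for this example; a real server would call `flush()` on a timer and would have to accept that unflushed writes are lost if the process crashes.

```javascript
// Write-behind cache: writes land in memory and are flushed to the slow
// store in one batch, so repeated writes to a key cost a single store write.
class WriteBehindCache {
  constructor(store) {
    this.store = store;     // assumed to expose get(key) and writeBatch(entries)
    this.cache = new Map(); // key -> latest value in memory
    this.dirty = new Set(); // keys changed since the last flush
  }

  set(key, value) {
    this.cache.set(key, value);
    this.dirty.add(key);
  }

  get(key) {
    if (!this.cache.has(key)) this.cache.set(key, this.store.get(key));
    return this.cache.get(key);
  }

  // Write all dirty keys back in a single batch; returns how many were written.
  flush() {
    if (this.dirty.size === 0) return 0;
    const entries = [...this.dirty].map((key) => [key, this.cache.get(key)]);
    this.store.writeBatch(entries);
    this.dirty.clear();
    return entries.length;
  }
}

// Stand-in for a database that counts how many write operations it receives.
class FakeDatabase {
  constructor() {
    this.rows = new Map();
    this.writeCalls = 0;
  }
  get(key) { return this.rows.get(key); }
  writeBatch(entries) {
    this.writeCalls += 1;
    for (const [key, value] of entries) this.rows.set(key, value);
  }
}

const db = new FakeDatabase();
const playerCache = new WriteBehindCache(db);
playerCache.set("player:1:hp", 100);
playerCache.set("player:1:hp", 90);
playerCache.set("player:1:hp", 75);
playerCache.set("player:2:hp", 50);
const flushedKeys = playerCache.flush();
```

Four in-memory writes turn into one database write that carries only the final value of each key.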

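The essence of caching is not only that "data that has already been processed does not need to be processed again", but also the strategy of "replacing reads and writes on slower storage with reads and writes on faster storage".

In code, cached data can be handled by serialization or by deep copying.

Generally, the closer the cached data is to the in-memory structure used at runtime, the faster the conversion. On this point, Protocol Buffers, which uses TLV encoding, cannot match a C struct that is copied directly with memcpy, but it is faster than encoding to plain-text XML or JSON, because encoding and decoding often involve complex operations such as table lookups and building list structures.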

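The hard part of caching: cache invalidation

Clear the relevant memory with a command
Use field-based logic: check certain characteristic data to determine whether anything is inconsistent

Static cache data

and cache data that changes at runtime

Split by importance
Split by the part that uses the data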

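Distribution

Distribution strategy

There are generally two strategies for splitting an architecture into multiple processes. One is to split by function: for example, one process handles networking, one handles the database, and one computes some piece of business logic. The other is to give every process the same function and have each take a different share of the computing work.

When calling multi-process services there are also strategy choices. The three best known are dynamic load balancing, read/write splitting and consistent hashing.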
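Of the three strategies, consistent hashing is the easiest to show in a few lines. Below is a minimal sketch of a hash ring with virtual points; the class name `HashRing`, the node names and the choice of FNV-1a as the hash function are all assumptions for illustration, not something the original article prescribes.

```javascript
// FNV-1a over UTF-16 code units: a simple non-cryptographic 32-bit hash.
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

class HashRing {
  constructor(nodes, replicas = 100) {
    this.replicas = replicas; // virtual points per node, smooths the distribution
    this.ring = [];           // list of { point, node }, sorted by point
    for (const node of nodes) this.addNode(node);
  }

  addNode(node) {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  removeNode(node) {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // A key belongs to the first point clockwise from its hash, wrapping around.
  nodeFor(key) {
    if (this.ring.length === 0) throw new Error("ring is empty");
    const h = fnv1a(key);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) { // binary search for the first point >= h
      const mid = (lo + hi) >> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node;
  }
}

// Hypothetical cache processes.
const hashRing = new HashRing(["cache-a", "cache-b", "cache-c"]);
const sampleKeys = Array.from({ length: 200 }, (_, i) => `user:${i}`);
const ownersBefore = sampleKeys.map((key) => hashRing.nodeFor(key));

hashRing.removeNode("cache-b");
const ownersAfter = sampleKeys.map((key) => hashRing.nodeFor(key));
```

The useful property is that when a node leaves, only the keys it owned move elsewhere; keys owned by the remaining nodes keep their owner, which a plain `hash(key) % nodeCount` scheme does not guarantee.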

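Multi-process techniques

Multithreading: memory usage is hard to control, and access to the same data requires dealing with complex "locking" problems.

select/epoll: non-blocking operations.

The best-known example of multithreading versus asynchronous processing is the Apache and Nginx models in the web server field. Apache uses a multi-process/multi-threaded model: at startup it launches a batch of processes as a process pool, and when a user request arrives it assigns a process from the pool to handle it. This saves the cost of creating and destroying processes/threads, but a large number of simultaneous requests still incurs a relatively high process/thread switching cost. Nginx uses epoll; this non-blocking approach lets a single process handle a large number of concurrent requests without repeated switching. Under heavy user traffic, Apache ends up with a large number of processes, while Nginx can run with a limited number (for example, one per CPU core), which saves much of the "process switching" cost compared with Apache, so its concurrency performance is better.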

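Complexity of distributed programming

Multithreading

Callback-based
Coroutine-based

Lambda expressions, closures and promises

Dynamic multi-process fork: homogeneous parallel tasks
Multithreading: parallel tasks with complex logic that can be clearly divided
Asynchronous concurrent callbacks: parallel tasks with high performance requirements but few steps that block along the way
Coroutines: concurrent tasks written in a synchronous style, though not suitable for launching complex dynamic parallel operations
Functional programming: parallel processing tasks modeled as data flows

In functional programming languages such as LISP or Erlang, the core data structure is the linked list, a structure that can represent any data structure. We can put all state into the linked list, this "data train", and then let one function after another process the data, which also passes the program's state along. This is a way of programming that uses the stack in place of the heap, and it is very valuable in concurrent multithreaded environments.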

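Distributed data communication

Message queues

Distributed caching strategy

Use a hash table to solve data synchronization for the cache

Our "table" must be able to synchronize data between two processes, A and B. We therefore generally use one of three strategies: lease cleanup, lease forwarding and modification broadcast.

Lease cleanup
Lease forwarding

This way no cleanup is needed.

Modification broadcast

The DNS caching strategy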
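DNS resolvers cache each record only for the time-to-live (TTL) that comes with it, which is similar in spirit to a lease that expires on its own. A minimal sketch of such a time-limited cache follows; the class name `TtlCache`, the injectable clock and the sample address (taken from the documentation-only range 192.0.2.0/24) are assumptions for this example, not details from the notes above.

```javascript
// Cache whose entries expire on their own after a time-to-live, so stale
// data disappears without an explicit invalidation message.
class TtlCache {
  // `now` is injectable so the expiry logic can be exercised without waiting.
  constructor(now = () => Date.now()) {
    this.now = now;
    this.entries = new Map(); // key -> { value, expiresAt }
  }

  set(key, value, ttlMs) {
    this.entries.set(key, { value, expiresAt: this.now() + ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // lease expired: drop the entry and report a miss
      return undefined;
    }
    return entry.value;
  }
}

// A fake clock in milliseconds, advanced by hand.
let fakeNowMs = 0;
const dnsCache = new TtlCache(() => fakeNowMs);
dnsCache.set("example.com", "192.0.2.1", 300000); // cache for five minutes

const freshLookup = dnsCache.get("example.com");
fakeNowMs = 300000; // five minutes later
const expiredLookup = dnsCache.get("example.com");
```

After expiry the caller has to fetch the record from its source again, which bounds how long a stale value can be served.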