DevilKing's blog

冷灯看剑,剑上几分功名?炉香无需计苍生,纵一穿烟逝,万丈云埋,孤阳还照古陵

0%

Designing-Data-Intensive-Applications

原文参考:

A Review of “Designing Data-Intensive Applications”

Foundations of Data Systems

In a section of describing performance the concepts of percentiles, outliers, latency and response times are introduced to help quantify performance.

Approaches to interacting with data are discussed with SQL and MapReduce receiving the most attention

Big-O notation is introduced to explain computational complexity of algorithms

Append-only systems, b-trees, bloom filters, hash maps, sorted string tables, log-structured merge-trees are all brought up.

Storage system implementation details such as how to handle deleting records, crash recovery, partially-written records and concurrency control are covered as well. It’s also explained how the above play a role in systems such as Google’s Bigtable, HBase, Cassandra and Elasticsearch to name a few.

Page 88 onward does a good job of contrasting OLTP and OLAP systems and uses this as a segue into data warehousing.

Data cubes, ETL, column-oriented storage, star- and snowflake schemas, fact and dimension tables, sort orderings and aggregation are all discussed. Teradata, Vertica, SAP HANA and ParAccel, Redshift and Hadoop are mentioned as systems incorporating these concepts into their offerings

data flow:
REST, RPC, microservices, streams and message brokers are explained and implementations such as TIBCO, IBM WebSphere, webMethods, RabbitMQ, ActiveMQ, HornetQ, NATS and Kafka are referenced.

Distributed Data

multiple machines problems:
Single-leader, multi-leader and leaderless replication, synchronous and asynchronous replication, fault tolerance, node outages, leadership elections, replication logs, consistency, monotonic reads, consistent prefix reads and replication lag are all discussed

partitioning data in order to achieve scalability:
shards in MongoDB, Elasticsearch and SolrCloud, regions in HBase, tablets in BigTable, vnodes in Cassandra and Riak and vBuckets in Couchbase.

transaction:
ACID, weak isolation levels, dirty reads and writes, materialising conflicts, locks and MVCC.

problems about distributed systems:
Network partitions, unreliable clocks, process pauses and mitigating garbage collection issues

consistency and consensus:
The CAP theorem, linearisability, serialisability, quorums, ordering guarantees, coordinator failure, exactly-once message processing among many other topics

Derived Data

data flow engines such as Spark, Tez and Flink.

batch processing against stream processing: “offline” data vs “online” data

stream processing:
Producers, consumers, brokers, logs, offsets, topics, partitions, replaying, immutability, windowing methods, joins and fault tolerance

the future of data

  • The second topic is migration of data between systems becoming as easy as the following would go a long way to make data systems act in a more unix pipe-like fashion.
1
2
$ mysql | elasticsearch

  • on databases that check themselves for failure and auto-heal.