A Review of “Designing Data-Intensive Applications”
Foundations of Data Systems
In a section on describing performance, the concepts of percentiles, outliers, latency and response times are introduced to help quantify it.
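The book's point about percentiles is that tail latencies (p95, p99) matter more than averages because they are dominated by outliers. A minimal sketch, using the nearest-rank method and made-up sample data:

```python
# Compute response-time percentiles from sampled measurements.
# The data below is hypothetical; times are in milliseconds.
def percentile(samples, p):
    """Return the p-th percentile (0-100) using the nearest-rank method."""
    ordered = sorted(samples)
    # Index of the smallest value covering p% of the samples.
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

times_ms = [12, 15, 14, 500, 13, 16, 15, 14, 13, 900]
print(percentile(times_ms, 50))   # 14  — the median looks healthy
print(percentile(times_ms, 99))   # 900 — the tail is dominated by outliers
```

The gap between the median and p99 here is exactly the phenomenon the book uses to argue against reporting mean response times.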
Approaches to interacting with data are discussed, with SQL and MapReduce receiving the most attention.
Big-O notation is introduced to explain the computational complexity of algorithms.
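A toy illustration of what Big-O captures: counting comparisons in an O(n) linear search versus an O(log n) binary search over the same sorted list (function names are ours, not the book's):

```python
# Linear search inspects elements one by one: O(n) comparisons.
def linear_search(items, target):
    steps = 0
    for i, v in enumerate(items):
        steps += 1
        if v == target:
            return i, steps
    return -1, steps

# Binary search halves the remaining range each step: O(log n) comparisons.
def binary_search(items, target):
    lo, hi, steps = 0, len(items) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            return mid, steps
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1, steps

data = list(range(1024))
print(linear_search(data, 1023)[1])  # 1024 comparisons
print(binary_search(data, 1023)[1])  # 11 comparisons
```

Doubling the list size adds roughly one comparison to the binary search but a thousand to the linear one, which is the intuition Big-O formalises.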
Append-only systems, B-trees, Bloom filters, hash maps, sorted string tables (SSTables) and log-structured merge-trees (LSM-trees) are all brought up.
Storage system implementation details, such as how to handle deleting records, crash recovery, partially written records and concurrency control, are covered as well. It's also explained how the above play a role in systems such as Google's Bigtable, HBase, Cassandra and Elasticsearch, to name a few.
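In the spirit of the log-structured designs the book describes, here is a toy append-only key-value store with an in-memory hash index, where deletes append a tombstone record rather than mutating the log (a sketch only; real engines add segment files, compaction and crash recovery):

```python
# Sentinel marking a deleted key in the append-only log.
TOMBSTONE = object()

class AppendOnlyStore:
    def __init__(self):
        self.log = []      # stand-in for an append-only file on disk
        self.index = {}    # key -> offset of the latest record for that key

    def put(self, key, value):
        # Writes only ever append; the index is the only mutable state.
        self.index[key] = len(self.log)
        self.log.append((key, value))

    def delete(self, key):
        # Append a tombstone; a later compaction pass would drop the key.
        self.log.append((key, TOMBSTONE))
        self.index.pop(key, None)

    def get(self, key):
        off = self.index.get(key)
        if off is None:
            return None
        _, value = self.log[off]
        return value

store = AppendOnlyStore()
store.put("a", 1)
store.put("a", 2)      # the old record stays in the log; the index moves on
store.delete("a")
print(store.get("a"))  # None
```

The tombstone trick is how append-only storage reconciles "never overwrite" with the need to delete: the deletion itself is just another appended record.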
Page 88 onward does a good job of contrasting OLTP and OLAP systems and uses this as a segue into data warehousing.
Data cubes, ETL, column-oriented storage, star and snowflake schemas, fact and dimension tables, sort orderings and aggregation are all discussed. Teradata, Vertica, SAP HANA, ParAccel, Redshift and Hadoop are mentioned as systems incorporating these concepts into their offerings.
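A minimal sketch of why column-oriented storage suits analytic workloads: an aggregation over one column need not read whole records. The table below is made up; the point is only the access pattern:

```python
# Row-oriented layout: each record stored together.
rows = [
    {"date": "2024-01-01", "product": "widget", "quantity": 3},
    {"date": "2024-01-01", "product": "gadget", "quantity": 5},
    {"date": "2024-01-02", "product": "widget", "quantity": 2},
]
# Aggregating quantity walks every field of every record.
total_rows = sum(r["quantity"] for r in rows)

# Column-oriented layout: the same table as one array per column.
columns = {
    "date":     [r["date"] for r in rows],
    "product":  [r["product"] for r in rows],
    "quantity": [r["quantity"] for r in rows],
}
# The same aggregation now scans only the quantity column.
total_cols = sum(columns["quantity"])

print(total_rows, total_cols)  # 10 10
```

Same answer either way, but the columnar scan touches a fraction of the data, and storing each column contiguously also makes it highly compressible.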
data flow:
REST, RPC, microservices, streams and message brokers are explained and implementations such as TIBCO, IBM WebSphere, webMethods, RabbitMQ, ActiveMQ, HornetQ, NATS and Kafka are referenced.
Distributed Data
problems of replicating data across multiple machines:
Single-leader, multi-leader and leaderless replication, synchronous and asynchronous replication, fault tolerance, node outages, leadership elections, replication logs, consistency, monotonic reads, consistent prefix reads and replication lag are all discussed.
partitioning data in order to achieve scalability:
The book notes that partitions go by many names: shards in MongoDB, Elasticsearch and SolrCloud; regions in HBase; tablets in Bigtable; vnodes in Cassandra and Riak; and vBuckets in Couchbase.
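Whatever the name, the core idea is routing each key to one partition, commonly by a hash of the key. A minimal sketch, assuming a fixed partition count (the book points out that real systems hash into key ranges or many vnodes so partitions can be rebalanced without rehashing everything):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    # Use a stable hash: Python's built-in hash() is salted per process,
    # so it cannot be used for routing between machines.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Every node computes the same partition for the same key.
print(partition_for("user:42", 8))
```

The drawback of the naive modulo shown here, which the vnode designs above avoid, is that changing `num_partitions` remaps almost every key.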
transactions:
ACID, weak isolation levels, dirty reads and writes, materialising conflicts, locks and MVCC.
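A toy sketch of the MVCC idea: each write is tagged with a transaction id, and a reader sees the newest version created at or before its snapshot, so readers never block writers. The class and field names are ours; real engines also track deletions, aborts and garbage-collect old versions:

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (txid, value), txid ascending
        self.next_txid = 1

    def write(self, key, value):
        # Each write creates a new version instead of overwriting.
        txid = self.next_txid
        self.next_txid += 1
        self.versions.setdefault(key, []).append((txid, value))
        return txid

    def read(self, key, snapshot_txid):
        # Return the newest version visible to this snapshot.
        for txid, value in reversed(self.versions.get(key, [])):
            if txid <= snapshot_txid:
                return value
        return None

db = MVCCStore()
t1 = db.write("balance", 100)
t2 = db.write("balance", 50)
print(db.read("balance", t1))  # 100 — the old snapshot ignores the later write
print(db.read("balance", t2))  # 50
```

This is the mechanism behind snapshot isolation: long-running reads see a consistent point-in-time view without locking out concurrent writers.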
problems with distributed systems:
Network partitions, unreliable clocks, process pauses and mitigating garbage-collection issues.
consistency and consensus:
The CAP theorem, linearisability, serialisability, quorums, ordering guarantees, coordinator failure and exactly-once message processing are covered, among many other topics.
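The quorum rule among these topics is compact enough to state in code: with n replicas, a write acknowledged by w nodes and a read consulting r nodes, the condition w + r > n guarantees every read quorum overlaps every write quorum, so at least one node in the read set has the latest value. A minimal sketch:

```python
def quorums_overlap(n: int, w: int, r: int) -> bool:
    """True if any r-node read set must intersect any w-node write set
    among n replicas (pigeonhole: w + r > n forces an overlap)."""
    return w + r > n

print(quorums_overlap(3, 2, 2))  # True  — a typical Dynamo-style setting
print(quorums_overlap(3, 1, 1))  # False — a read may miss the latest write
```

Lowering w or r trades that overlap guarantee for lower latency and higher availability, which is the tuning knob leaderless systems expose.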
Derived Data
dataflow engines such as Spark, Tez and Flink.
batch processing versus stream processing: "offline" data vs. "online" data.
stream processing:
Producers, consumers, brokers, logs, offsets, topics, partitions, replaying, immutability, windowing methods, joins and fault tolerance.
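Of these, windowing is the easiest to show concretely. A toy sketch of tumbling-window aggregation: events carry a timestamp and counts are grouped into fixed, non-overlapping windows (the event shape here is hypothetical):

```python
from collections import Counter

def tumbling_window_counts(events, window_secs=60):
    """Count events per fixed-size window; events are (timestamp, payload)."""
    counts = Counter()
    for ts, _payload in events:
        # Each event belongs to exactly one window, keyed by its start time.
        window_start = (ts // window_secs) * window_secs
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (30, "b"), (59, "c"), (60, "d"), (130, "e")]
print(tumbling_window_counts(events))  # {0: 3, 60: 1, 120: 1}
```

Hopping and sliding windows generalise this by letting one event fall into several overlapping windows; the fault-tolerance material in the chapter is about what happens to such partial window state when a processor crashes mid-window.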
The Future of Data
- One topic is the migration of data between systems: if it became as easy as the following, data systems would act in a more Unix-pipe-like fashion:

    mysql | elasticsearch
- Another is databases that check themselves for failure and auto-heal.