gqlxj1987's Blog

The Dataflow Model


Paper link

The paper targets unbounded, unordered, global-scale datasets, and weighs the trade-offs among correctness, latency, and cost.

The example it uses is ad delivery for streaming video.

Several key points about data processing:

a single unified model

To summarize:

Window

When dealing with unbounded data, windowing is required for some operations (to delineate finite boundaries in most forms of grouping: aggregation, outer joins, time-bounded operations, etc.), and unnecessary for others (filtering, mapping, inner joins, etc.).
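As a minimal sketch of that distinction (the generator, the filter predicate, and the 5-unit window size are all hypothetical choices for illustration), an element-wise filter can run directly on an endless stream, while a sum only becomes well defined once each element is assigned to a finite window:

```python
from collections import defaultdict
from itertools import count, islice


def unbounded_events():
    """Hypothetical endless source of (event_time, value) pairs."""
    for t in count():
        yield (t, t % 3)


def keep_nonzero(events):
    """Filtering is element-wise, so it needs no window."""
    return (e for e in events if e[1] != 0)


def sum_per_fixed_window(events, size):
    """Aggregation: group values by the fixed window [start, start + size)."""
    sums = defaultdict(int)
    for event_time, value in events:
        start = event_time - event_time % size
        sums[(start, start + size)] += value
    return dict(sums)


# The filter is lazy and works on the unbounded stream itself.
first_three = list(islice(keep_nonzero(unbounded_events()), 3))

# For the demo we only look at the first 10 events; in a real pipeline
# each window's sum would be emitted once that window is considered done.
per_window = sum_per_fixed_window(
    keep_nonzero(islice(unbounded_events(), 10)), size=5
)
```

Here `per_window` holds one finite result per window, which is what makes the aggregation answerable without waiting for the end of the input.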

(particularly time management [28] and semantic models [9], but also windowing [22], out-of-order processing [23], punctuations [30], heartbeats [21], watermarks [2], frames)

DATA FLOW MODEL

Core elements:

Data can be both mapped and reduced.

Elements are passed around as (key, value, event time, window) 4-tuples.
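One way to picture the 4-tuple is as a small record whose window field is rewritten by a window-assignment step. The sketch below is an assumption-laden illustration, not the paper's implementation: the `Element` type, the `GLOBAL_WINDOW` placeholder, and the `assign_sliding_windows` helper are hypothetical names, and times are plain integers in seconds.

```python
from typing import List, NamedTuple, Tuple

# Placeholder window for elements that have not been windowed yet (assumed).
GLOBAL_WINDOW: Tuple[float, float] = (float("-inf"), float("inf"))


class Element(NamedTuple):
    key: str
    value: int
    event_time: int               # seconds, hypothetical unit
    window: Tuple[float, float]   # half-open interval [start, end)


def assign_sliding_windows(elem: Element, size: int, period: int) -> List[Element]:
    """Return one copy of `elem` per sliding window that contains its event time."""
    # Latest window start that is at or before the event time.
    start = elem.event_time - (elem.event_time % period)
    copies = []
    # Walk backwards while the window [start, start + size) still covers the event.
    while start > elem.event_time - size:
        copies.append(elem._replace(window=(start, start + size)))
        start -= period
    return sorted(copies, key=lambda e: e.window)


elem = Element("k1", 3, 125, GLOBAL_WINDOW)
copies = assign_sliding_windows(elem, size=120, period=60)
windows = [c.window for c in copies]
```

With a 120-second window sliding every 60 seconds, the single input element is duplicated into two 4-tuples, one for each window that covers second 125.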

Window assignment -> window merging:
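The assign-then-merge flow is easiest to see with session windows. The following is a rough sketch under stated assumptions: each element first gets a proto-session `[t, t + gap)`, then overlapping windows of the same key are merged. The function names, the minute-based timestamps, and the rule that windows which merely touch do not merge are illustrative choices.

```python
from collections import defaultdict


def assign_proto_sessions(events, gap):
    """events: (key, value, event_time). Attach the window [t, t + gap) to each."""
    return [(k, v, t, (t, t + gap)) for k, v, t in events]


def merge_sessions(tuples):
    """Merge overlapping windows per key and rewrite each element's window."""
    by_key = defaultdict(list)
    for k, v, t, w in tuples:
        by_key[k].append((w, v, t))

    merged = []
    for k, items in by_key.items():
        items.sort(key=lambda item: item[0])       # order by window start
        groups = []                                # each: [start, end, members]
        for (start, end), v, t in items:
            if groups and start < groups[-1][1]:   # overlaps the previous group
                groups[-1][1] = max(groups[-1][1], end)
                groups[-1][2].append((v, t))
            else:
                groups.append([start, end, [(v, t)]])
        for start, end, members in groups:
            for v, t in members:
                merged.append((k, v, t, (start, end)))
    return merged


# Event times are minutes since midnight (782 = 13:02), with a 30-minute gap.
events = [("k1", "v1", 782), ("k2", "v2", 794), ("k1", "v3", 837), ("k1", "v4", 800)]
out = merge_sessions(assign_proto_sessions(events, gap=30))
session_of = {(k, v): w for k, v, t, w in out}
```

In this run `v1` and `v4` end up in one session for `k1`, while `v3` arrives after the gap and starts a new one. Merging happens per key, so `k2` is unaffected.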

On watermarks: they can be too fast or too slow.
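The too-fast/too-slow tension can be illustrated with a toy heuristic watermark. This sketch assumes a watermark defined as "largest event time seen so far minus a fixed lag"; that definition, the `simulate` helper, and the arrival sequence are hypothetical and only meant to show the trade-off.

```python
def simulate(arrivals, window_end, lag):
    """arrivals: event times in arrival order.

    Returns (close_index, late_count): the arrival index at which the
    watermark passes `window_end`, and how many elements belonging to the
    window [.., window_end) arrive after that point.
    """
    max_seen = float("-inf")
    close_index = None
    late = 0
    for i, event_time in enumerate(arrivals):
        # An element for an already-closed window counts as late data.
        if close_index is not None and event_time < window_end:
            late += 1
        max_seen = max(max_seen, event_time)
        watermark = max_seen - lag          # heuristic watermark (assumed)
        if close_index is None and watermark >= window_end:
            close_index = i
    return close_index, late


arrivals = [1, 3, 11, 2, 9, 14, 4, 20]     # out-of-order event times

fast = simulate(arrivals, window_end=10, lag=0)   # aggressive watermark
mid = simulate(arrivals, window_end=10, lag=3)
slow = simulate(arrivals, window_end=10, lag=5)   # conservative watermark
```

With no lag the window closes early and three elements show up late; with a lag of 5 nothing is late, but the window is not declared complete until the very last arrival. A single watermark cannot give both low latency and completeness.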

A useful insight in addressing the completeness problem is that the Lambda Architecture effectively sidesteps the issue: it does not solve the completeness problem by somehow providing correct answers faster; it simply provides the best low-latency estimate of a result that the streaming pipeline can provide, with the promise of eventual consistency and correctness once the batch pipeline runs.

Combining triggers with windows
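A small sketch of how a trigger interacts with a single window's contents, assuming a count-based trigger that fires every `every_n` elements plus a final firing for leftovers. It contrasts two refinement modes, discarding (each pane carries only the new data) and accumulating (each pane carries the running result). The `fire_panes` helper and its parameters are hypothetical.

```python
def fire_panes(values, every_n, accumulating):
    """Return the sums emitted for one window under a count-based trigger."""
    panes = []
    total = 0      # everything seen so far in this window
    pending = 0    # only what arrived since the last firing
    seen = 0
    for v in values:
        total += v
        pending += v
        seen += 1
        if seen % every_n == 0:
            panes.append(total if accumulating else pending)
            pending = 0
    # Final firing for any elements that did not fill a whole batch.
    if seen % every_n != 0:
        panes.append(total if accumulating else pending)
    return panes


window_values = [5, 7, 3, 4, 3]

discarding = fire_panes(window_values, every_n=2, accumulating=False)
accumulated = fire_panes(window_values, every_n=2, accumulating=True)
```

In discarding mode a downstream consumer has to add the panes up itself; in accumulating mode the latest pane simply overwrites the previous one. Which is preferable depends on what the consumer does with the results.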

design principles

One particularly large log join pipeline runs in streaming mode on MillWheel by default, but has a separate FlumeJava batch implementation used for large scale backfills.

