DevilKing's blog

Alluxio in Practice

Posted on 2022-01-21 In data

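Alluxio gives upper-layer compute frameworks a unified client and a unified API with a global namespace. In our AI scenario, the underlying storage is Ceph, the application on top is feature computation, and Alluxio sits in between as a distributed shared cache service.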

Core features:

The practical part is explained mainly through an Alluxio on Ceph example.

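Because ceph-mds does not perform well enough, we run Alluxio on top of CephFS.
We want business pods and Alluxio worker pods deployed with affinity so that, as far as possible, they run on the same node and can use domain sockets to improve read performance. Before a service goes live, we use the distributedLoad command to load hot data from Ceph into the Alluxio workers.
The master was likewise under heavy load, so we adopted ratis-shell to extend our use of HA.
We also adjusted the fuse-shell part to make FUSE more efficient to use.
We added dynamic parameter configuration between the master and its clients.
Other tuning strategies:
- Logging
- Metadata synchronization under HA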
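
The pre-warming step mentioned above (loading hot Ceph data into the Alluxio workers before a service goes live) could be scripted roughly as follows. This is only a sketch: it assumes the Alluxio 2.x `alluxio fs distributedLoad` subcommand with its `--replication` option and an `alluxio` binary on `PATH`. The helper names and the example paths are hypothetical and do not come from the original setup, so check your version's CLI help before relying on it.

```python
import subprocess


def build_distributed_load_command(path, replication=1, alluxio_bin="alluxio"):
    """Return the argv list that asks Alluxio to load one path into worker storage.

    Assumes the Alluxio 2.x CLI form: alluxio fs distributedLoad --replication N <path>
    """
    return [alluxio_bin, "fs", "distributedLoad",
            "--replication", str(replication), path]


def prewarm(paths, replication=1, dry_run=True):
    """Build (and optionally run) one distributedLoad command per hot path."""
    commands = [build_distributed_load_command(p, replication) for p in paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if a load fails,
            # so a failed pre-warm does not go unnoticed.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical hot directories; replace with the real feature-data paths.
    for cmd in prewarm(["/features/daily", "/features/embeddings"], replication=2):
        print(" ".join(cmd))
```

With `dry_run=True` the script only prints the commands, which makes it easy to review the list of hot paths before anything is actually loaded.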

Follow-up plans:

Posted on 2022-01-15 In data

Goals:

Design approach:

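Design around the hardware (memory, CPU, cache), starting from the low level rather than working around the problem purely in software at the periphery.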

Solving a problem means distinguishing scenarios: different scenarios call for different solutions.
- Hash Table
- memcpy
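- There is even a specialized version for small data sizes, memcpySmallAllowReadWriteOverflow15
- New algorithms are welcome; pick whichever performs best in practice
Different data sizes get different implementations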
- quantileTiming
- uniqCombined
  - Small: flat array
  - Medium: hash table
  - Very large: HyperLogLog
- keep in mind low-level details when designing your system
- design based on hardware capabilities
- choose data structures and abstractions based on the needs of the task
- provide specializations for special cases
- try the new, “best” algorithms, that you read about yesterday
- choose the algorithm at runtime based on statistics
- benchmark on real datasets
- test for performance regressions in CI
- measure and observe everything
- even in production environment
- and rewrite code all the time
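
To make the size-dependent strategy noted for uniqCombined more concrete (flat array for small inputs, hash table for medium ones, HyperLogLog for very large ones), here is a minimal sketch of a distinct counter that switches representation at runtime. It is not the real uniqCombined implementation: the class names, the thresholds (16 and 4096), the 12-bit precision, and the BLAKE2-based hash are all assumptions picked for illustration.

```python
import hashlib
import math


def _hash64(value):
    """Map a value to a 64-bit integer via its repr (an illustrative choice)."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class HyperLogLog:
    """Basic HyperLogLog with 2**precision registers (precision >= 7 assumed)."""

    def __init__(self, precision=12):
        self.p = precision
        self.m = 1 << precision
        self.registers = [0] * self.m

    def add(self, value):
        h = _hash64(value)
        index = h >> (64 - self.p)              # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the first 1-bit in `rest`, counted from the left.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting).
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


class AdaptiveDistinctCounter:
    """Counts distinct values, upgrading array -> hash set -> HyperLogLog."""

    def __init__(self, small_limit=16, medium_limit=4096):
        self.small_limit = small_limit      # hypothetical threshold
        self.medium_limit = medium_limit    # hypothetical threshold
        self.mode = "array"
        self.store = []

    def add(self, value):
        if self.mode == "array":
            if value not in self.store:     # linear scan is cheap when tiny
                self.store.append(value)
                if len(self.store) > self.small_limit:
                    self.store = set(self.store)
                    self.mode = "hash"
        elif self.mode == "hash":
            self.store.add(value)
            if len(self.store) > self.medium_limit:
                hll = HyperLogLog()
                for item in self.store:
                    hll.add(item)
                self.store = hll
                self.mode = "hll"
        else:
            self.store.add(value)

    def count(self):
        """Exact in array and hash modes, approximate in HyperLogLog mode."""
        if self.mode == "hll":
            return self.store.count()
        return len(self.store)
```

The counter stays exact while the data is small and only gives up exactness once memory would otherwise grow with the number of distinct values, which is the same trade-off the list above describes: specialize for the common small case and choose the representation from what the data actually looks like at runtime.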