JindoFS: High-performance data lake storage solution for big data on the cloud

EMR JindoFS Background

Compute-storage separation has become a trend in cloud computing. Before it emerged, the traditional converged compute-storage architecture (left side of the figure below) was commonly used, but this architecture has several problems. First, compute and storage capacity do not grow in step during cluster expansion: users often only need more computing power or more storage, and the converged architecture cannot scale either one independently. Second, scaling down may require manual intervention, after which data must be re-synchronized across multiple nodes, and synchronizing multiple replicas can lead to data loss. The compute-storage separation architecture (right side of the figure below) solves these problems well, so users only need to care about the computing power of the whole cluster.

EMR's existing compute-storage separation solution is based on OSS and provides the Hadoop-compatible OssFS. It offers massive storage capacity, so users do not need to worry about storage falling short of business needs, and because OssFS accesses data on OSS, it also retains OSS advantages such as low cost and high reliability. However, OssFS also has drawbacks. When a job finishes it typically moves its output to a final directory, and file move and rename operations are slow, mainly because OssFS simulates file system semantics on top of an object store, and object storage does not support atomic rename or move. In addition, OSS is a public cloud service whose bandwidth is limited, and high-frequency data access consumes too much OSS bandwidth. JindoFS overcomes these problems while retaining the advantages of OssFS.

Introduction to EMR JindoFS

1) Architecture Introduction

The overall architecture of EMR JindoFS is shown in the following figure. It consists of two main parts: the Namespace service and the Storage service. The Namespace service is responsible for managing JindoFS metadata and the Storage services, while the Storage service is responsible for managing user data (local data and remote OSS data). EMR JindoFS is a cloud-native file system that provides the performance of local storage together with the massive capacity of OSS.



Namespace Service
The Namespace service is the central service of JindoFS. It manages user metadata, including the metadata of the JindoFS file system itself, the metadata of sliced blocks, and the metadata of the Storage services. The Namespace service supports multiple namespaces, so users can create separate namespaces for different businesses, and different namespaces store different business data without interfering with each other. Namespaces have no global lock, so concurrency is controlled at the directory level and files can be created and deleted concurrently. In addition, a namespace can be configured with different backend stores; at this stage RocksDB is supported, and Alibaba Cloud OTS support is expected in the next version. The benefit of using OTS as the backend is that the local EMR cluster can be destroyed at any time, and when a cluster is created again the data in JindoFS is still accessible, which increases flexibility for users.
Storage Service
The Storage service manages the cluster's local disk data, local cache data, and OSS data. It supports different storage backends and storage media: at this stage the backends are the local file system and OSS, and the local storage system supports HDD, SSD, DCPM, and other media for cache acceleration. In addition, the Storage service is optimized for workloads with many small files, so that large numbers of small files do not put excessive pressure on the local file system and degrade overall performance.

In terms of the overall ecosystem, JindoFS currently supports all major big data components, including Hadoop, Hive, Spark, Flink, Impala, Presto, and HBase. Users can access EMR JindoFS simply by changing the scheme of the file access path to jfs. In addition, the next version of JindoFS will provide a Python SDK so that machine learning users can access data on JindoFS efficiently, and EMR JindoFS is also deeply integrated with EMR Spark, for example with Relational Cache, Spark-based materialized views, and Cube optimizations.
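To make the path change concrete, below is a minimal sketch of reading and writing through a jfs:// path from Spark. It assumes the JindoFS client is already configured on the EMR cluster; the namespace name test, the directory layout, and the event_type column are hypothetical and only for illustration.

```scala
import org.apache.spark.sql.SparkSession

object JindoFsPathExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jindofs-path-example")
      .getOrCreate()

    // The same data previously addressed as oss://bucket/warehouse/events/ (or hdfs://...)
    // is now addressed through the jfs scheme and a JindoFS namespace ("test" is hypothetical).
    val input  = "jfs://test/warehouse/events/"
    val output = "jfs://test/warehouse/events_agg/"

    val events = spark.read.parquet(input)
    events.groupBy("event_type").count()      // "event_type" is an illustrative column
      .write.mode("overwrite").parquet(output)

    spark.stop()
  }
}
```

No other job changes are needed; the Hadoop-compatible JindoFS client resolves the jfs scheme.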

EMR Jindo Usage Model

There are two main modes of EMR-Jindo usage: Block mode and Cache mode.
Block Mode
Block mode slices JindoFS files into blocks and stores them on local disks and on OSS; through OSS, users can only see the block data, not the original files. This mode relies on the local Namespace service, which manages the metadata and reconstructs file data from the local metadata and the blocks. Of the two modes, Block mode gives JindoFS the best performance. It is suitable for scenarios where users have high performance requirements for both data and metadata, and it requires users to migrate their data into JindoFS.

Block mode supports different storage policies to fit different application scenarios; the default policy is WARM. The policies are listed below, followed by a usage sketch.
a) WARM: the default policy; data has one replica on OSS and one replica locally. The local replica effectively accelerates subsequent reads, while the OSS replica provides high availability;
b) COLD: data has only one replica on OSS and no local replica, suitable for storing cold data;
c) HOT: data has one replica on OSS and multiple local replicas, providing further acceleration for the hottest data;
d) TEMP: data has only one local replica, providing high-performance reads and writes for temporary data at the expense of data reliability.
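The sketch below shows how a policy might be applied per directory. It is hypothetical: it assumes the policies above are exposed through the standard Hadoop FileSystem#setStoragePolicy call (available since Hadoop 2.8), while the real JindoFS mechanism may be a dedicated command or configuration; the jfs:// namespace and paths are likewise illustrative.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object StoragePolicySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical namespace "test"; the JindoFS client must already be configured.
    val fs = FileSystem.get(new URI("jfs://test/"), new Configuration())

    // Frequently scanned dimension tables: OSS replica plus multiple local replicas.
    fs.setStoragePolicy(new Path("jfs://test/warehouse/dim/"), "HOT")

    // Old partitions: keep only the OSS replica.
    fs.setStoragePolicy(new Path("jfs://test/warehouse/fact/year=2017/"), "COLD")

    // Temporary data: local replica only, fastest reads and writes, lower reliability.
    fs.setStoragePolicy(new Path("jfs://test/tmp/staging/"), "TEMP")

    fs.close()
  }
}
```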

Compared with HDFS, Block mode of JindoFS has the following advantages:
a) It utilizes the cheap and virtually unlimited capacity of OSS, giving it advantages in both cost and capacity;
b) Hot and cold data are separated automatically and transparently to computation, and migrate between tiers without modifying the location information in table metadata;
c) Maintenance is simple: no decommissioning is required, a broken node is simply removed, and no data is lost because OSS holds a replica;
d) System upgrade, restart, and recovery are fast, since there are no block reports;
e) Small files are supported natively, avoiding the excessive pressure that large numbers of small files would otherwise put on the file system.

Cache Mode
Unlike Block mode, Cache mode stores JindoFS files as objects on OSS. This mode is compatible with the existing OSS file layout: users can still see the original directory structure and files through OSS. At the same time, the mode caches both data and metadata to accelerate reads and writes. Users of this mode do not need to migrate data; existing OSS data can be used as-is, although there is some performance loss compared with Block mode. For metadata synchronization, users can choose different synchronization strategies according to their needs.

Compared with OssFS, JindoFS's Cache mode offers the following advantages:
a) Because of local replicas, read and write throughput is comparable to that of HDFS;
b) The full set of HDFS interfaces is supported, enabling more scenarios such as Delta Lake and HBase on JindoFS (see the sketch after this list);
c) Acting as a data and metadata cache, JindoFS improves the performance of data reads and writes and of List/Status operations relative to OssFS;
d) As a data cache, JindoFS accelerates users' data reads and writes.
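As a hedged illustration of point b): because JindoFS implements the Hadoop FileSystem contract, operations that plain object storage handles poorly, such as renaming a directory when committing job output, can be issued directly against a jfs:// path. The namespace and paths below are hypothetical, and the JindoFS client is assumed to be configured on the cluster.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object RenameSketch {
  def main(args: Array[String]): Unit = {
    val fs = FileSystem.get(new URI("jfs://test/"), new Configuration())

    // Typical job-commit pattern: move output from a staging directory to the final
    // directory. On raw object storage this becomes a copy-and-delete per object;
    // through the Hadoop FileSystem interface it is a single rename call.
    val staging  = new Path("jfs://test/warehouse/events_agg/_temporary/attempt_0")
    val finalDir = new Path("jfs://test/warehouse/events_agg/dt=2019-12-01")

    if (fs.rename(staging, finalDir)) println(s"Committed $staging -> $finalDir")
    else println("Rename failed")

    fs.close()
  }
}
```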

The external client provides a way for users to access JindoFS from outside the EMR cluster. At this stage, the client only supports the Block mode of JindoFS, and its permissions are bound to OSS permissions, so users need the corresponding OSS permissions to access JindoFS data through the external client.

EMR JindoFS + DCPM Performance

The following describes how EMR JindoFS can be accelerated with Intel's new hardware, Optane DC persistent memory (DCPM). The figure below shows typical usage scenarios for Intel Optane data-center persistent memory. Starting from the storage layer, the new storage medium can achieve higher I/O performance in RDMA/replication scenarios. At the infrastructure layer, high-capacity persistent memory is a better solution when dense applications require more instances. At the database layer, in-memory databases in particular, such as SAP HANA and Redis, can use high-capacity persistent memory to support more instances. Similarly, in AI and data analytics, low-latency devices can accelerate real-time analysis, for example with SAS, or with Databricks for machine learning scenarios. Persistent memory can also be used in the HPC and communications domains.

The test environment configuration for accelerating EMR JindoFS with DCPM is shown below. Spark is EMR Spark, which is based on open source Spark 2.4.3 with extensive modifications and many new features, including Relational Cache and JindoFS support. The persistent memory is used in Storage over App Direct (SoAD) mode, i.e., it is presented as a fast device. Three levels of benchmarks were chosen: Micro-benchmark, TPC-DS queries, and the Star Schema Benchmark (SSB).

The performance test results of EMR JindoFS with DCPM show that DCPM brings a significant improvement for small-file reads, especially with larger files and more read processes; using decision-support queries as a benchmark, DCPM brings a 1.53x overall improvement for the execution of the 99 queries on 2TB of data; and DCPM delivers an overall 2.7x improvement for SSB with Spark Relational Cache.



The following figure shows the Micro-benchmark results, which measure 100 small-file read operations at different file sizes (512K, 1M, 2M, 4M, and 8M) and different degrees of parallelism (1-10). As the figure shows, DCPM brings a significant performance improvement for small-file reads, and the improvement becomes more pronounced as file size and parallelism increase.
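For reference, below is a minimal sketch of what such a small-file read micro-benchmark could look like. It is not the tool used in the test: the jfs:// paths, file naming, and thread-based parallelism are assumptions made for illustration.

```scala
import java.net.URI
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object SmallFileReadBench {
  def main(args: Array[String]): Unit = {
    val parallelism = 10   // the test swept parallelism from 1 to 10
    val fileCount   = 100  // 100 small-file reads per run, as in the test
    // Hypothetical directory of pre-generated small files (e.g. 1 MB each).
    val files = (0 until fileCount).map(i => new Path(s"jfs://test/bench/1m/file-$i"))

    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    val fs = FileSystem.get(new URI("jfs://test/"), new Configuration())

    val start = System.nanoTime()
    val reads = files.map { p =>
      Future {
        val in  = fs.open(p)
        val buf = new Array[Byte](1 << 20)
        try { var n = in.read(buf); while (n >= 0) n = in.read(buf) } // drain the file
        finally in.close()
      }
    }
    Await.result(Future.sequence(reads), Duration.Inf)
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(f"Read $fileCount files with parallelism $parallelism in $elapsedMs%.1f ms")
    fs.close()
    pool.shutdown()
  }
}
```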

The following figure shows the TPC-DS results: the data volume is 2TB, and all 99 TPC-DS queries are run. Based on normalized time, DCPM brings a 1.53x overall performance improvement. To analyze the root cause of the improvement, as shown in the two subcharts on the right of the figure, the peak memory bandwidth for read operations is 6.23GB/s, while the peak for write operations is 2.64GB/s; Spark workloads involve far more reads, which is where the improvement comes from.

The following figure shows the SSB results with Spark Relational Cache + JindoFS. SSB (Star Schema Benchmark) is a TPC-H-based benchmark for the performance of star-schema database systems. Relational Cache is an important feature of EMR Spark: it pre-organizes and pre-computes data to accelerate analysis, providing functionality similar to the materialized views of a traditional data warehouse. In the SSB test, 1TB of data was used, each query was executed individually, and the system cache was cleared between queries. Based on normalized time, DCPM delivers an overall 2.7x performance improvement, and for individual queries the improvement ranges from 1.9x to 3.4x. In summary, the new DCPM hardware not only solves the I/O problem but also brings end-to-end performance improvements to JindoFS.
