8 Matching Annotations
- Mar 2020
-
developer.sh developer.sh
-
Here is a high level comparison of the tools we reviewed above:
Comparison of Delta Lake, Apache Iceberg and Apache Hive:
-
To address Hadoop’s complications and scaling challenges, Industry is now moving towards a disaggregated architecture, with Storage and Analytics layers very loosely coupled using REST APIs.
Things used to address Hadoop's lacks
-
Hive is now trying to address consistency and usability. It facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage.
Apache Hive offers:
- Streaming ingest of data - allowing readers to get a consistent view of data and avoiding too many files
- Slow changing dimensions - dimensions of table change slowly over time
- Data restatement - supported via INSERT, UPDATE, and DELETE
- Bulk updates with SQL MERGE
-
Delta Lake is an open-source platform that brings ACID transactions to Apache Spark™. Delta Lake is developed by Spark experts, Databricks. It runs on top of your existing storage platform (S3, HDFS, Azure) and is fully compatible with Apache Spark APIs.
Delta Lake offers:
- ACID transactions on Spark
- Scalable metadata handling
- Streaming and batch unification
- Schema enforcement
- Time travel
- Upserts and deletes
-
Apache Iceberg is an open table format for huge analytic data sets. Iceberg adds tables to Presto and Spark that use a high-performance format that works just like a SQL table. Iceberg is focussed towards avoiding unpleasant surprises, helping evolve schema and avoid inadvertent data deletion.
Apache Iceberg offers:
- Schema evolution (add, drop, update, rename)
- Hidden partitioning
- Partition layout evolution
- Time travel (reproducible queries)
- Version rollback (resetting tables)
-
in this disaggregated model, users can choose to use Spark for batch workloads for analytics, while Presto for SQL heavy workloads, with both Spark and Presto using the same backend storage platform.
Disaggregated model allows more flexible choice of tools
-
These projects sit between the storage and analytical platforms and offer strong ACID guarantees to the end user while dealing with the object storage platforms in a native manner.
Solutions to the disaggregated models:
- Delta Lake
- Apache Iceberg
- Apache Hive
-
rise of Hadoop as the defacto Big Data platform and its subsequent downfall. Initially, HDFS served as the storage layer, and Hive as the analytics layer. When pushed really hard, Hadoop was able to go up to few 100s of TBs, allowed SQL like querying on semi-structured data and was fast enough for its time.
Hadoop's HDFS and Hive became unprepared for even larger sets of data
-