Transactions vs Analytics — How Hive Brings SQL to Big Data Technologies
Transactional systems are optimized for day-to-day transactions that involve inserting, updating, and deleting data. Examples include ATM transactions and placing orders on Amazon. These systems typically use relational databases (RDBMS), which store data on a single machine rather than across a distributed cluster.
Analytical systems, by contrast, are optimized for analyzing historical data in batch. Data warehouses like Teradata fall into this category, and distributed systems like Hadoop suit these workloads well because they can scan large volumes of data in parallel.
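The difference is easiest to see in the queries themselves. Here is a minimal sketch using a hypothetical `orders` table: a transactional system handles many small, single-row operations, while an analytical system runs large scans and aggregations over history.

```sql
-- Transactional (OLTP): a small, single-row operation on current data.
UPDATE orders
SET status = 'SHIPPED'
WHERE order_id = 1001;

-- Analytical (OLAP): a batch scan and aggregation over historical data.
SELECT order_date, SUM(amount) AS daily_revenue
FROM orders
WHERE order_date >= '2023-01-01'
GROUP BY order_date;
```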
Apache Hive is an open source distributed data warehouse system built on top of Hadoop. It provides a SQL-like interface for analyzing petabytes of data stored in HDFS, S3, ADLS, or other distributed storage.
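To illustrate that interface, the query below is ordinary HiveQL; the table and column names (`page_views`, `country`) are hypothetical. Hive compiles a statement like this into distributed jobs that run across the cluster, wherever the underlying files live.

```sql
-- HiveQL looks like standard SQL; Hive translates it into
-- distributed jobs over data in HDFS, S3, or ADLS.
SELECT country, COUNT(*) AS views
FROM page_views
GROUP BY country
ORDER BY views DESC
LIMIT 10;
```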
Hive tables have two components: the actual data files, and the metadata/schema stored in a metastore database (typically a relational database such as MySQL or PostgreSQL). Keeping the metadata outside of HDFS means it can be updated in place, since HDFS files are write-once.
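You can see this split between data and metadata directly in Hive. `DESCRIBE FORMATTED` is a standard HiveQL statement; `page_views` is the hypothetical table from the earlier example.

```sql
-- Prints the column schema plus metastore details such as the
-- table's storage location, file format, and owner.
DESCRIBE FORMATTED page_views;
```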
Data lakes offer high throughput for processing large amounts of data, but they cannot match the low-latency reads and writes of an RDBMS.
RDBMS use schema-on-write: the schema is defined when the table is created, and incoming data must conform to it. Hive uses schema-on-read: raw data is loaded into HDFS first, then a Hive table is defined on top of it, and the schema is applied only when the data is read.
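A minimal schema-on-read sketch, assuming CSV files already copied into HDFS at a hypothetical path: the external table simply overlays a schema on files that are already there, and nothing is validated until the data is queried.

```sql
-- Assumes raw CSV files already exist in HDFS, e.g. uploaded with:
--   hdfs dfs -put page_views.csv /data/page_views/
-- The table definition overlays a schema on those files;
-- the schema is applied only when the data is read.
CREATE EXTERNAL TABLE page_views (
  view_time  TIMESTAMP,
  user_id    BIGINT,
  url        STRING,
  country    STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/data/page_views/';
```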
Key Hive components:
- Metastore: stores table schemas and metadata
- HiveServer2: the service that accepts queries from clients and executes them
- Beeline: a JDBC command-line client for connecting to HiveServer2 (see the session sketch after this list)
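As a rough sketch of how these pieces fit together, the session below assumes a local HiveServer2 on its default port (10000); the connection command runs in a shell, and everything after it is HiveQL sent through HiveServer2.

```sql
-- Connect with Beeline (a shell command, shown here as a comment):
--   beeline -u jdbc:hive2://localhost:10000
-- Once connected, statements go to HiveServer2, which consults
-- the metastore and runs the queries.
SHOW DATABASES;
USE default;
SELECT COUNT(*) FROM page_views;
```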
In conclusion, Hive leverages Hadoop’s distributed processing to enable fast analytics on big data stored across a cluster. Its SQL interface and separate metadata management provide flexibility for data analysis.