Tag Archives: Apache Hive

Apache Hive 101: Enabling ACID Transactions

To create ACID tables, ensure Hive is configured to support ACID transactions by setting the following properties:

SET hive.support.concurrency=true;
SET hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on=true;
SET hive.compactor.worker.threads=1;

Hive Version Compatibility

ACID transactions in Hive are supported from version 0.14.0 onwards. Ensure that your Hive installation is at least this version. Hive 3.x introduced significant improvements and additional features for ACID transactions.

Creating ACID Tables

Full ACID Table

Full ACID tables support all CRUD (Create, Retrieve, Update, Delete) operations and require the ORC file format:

CREATE TABLE acidtbl (
key int,
value string
)
STORED AS ORC
TBLPROPERTIES ("transactional"="true")
;

Insert-Only ACID Table

Insert-only ACID tables support only insert operations and can use various storage formats:

CREATE TABLE acidtbl_insert_only (
key int,
value string
)
STORED AS TEXTFILE
TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only")
;

Converting Tables to ACID

Non-ACID to Full ACID

To convert a non-ACID managed table to a full ACID table (requires ORC format):

ALTER TABLE nonacidtbl SET TBLPROPERTIES ('transactional'='true');

Non-ACID to Insert-Only ACID

To convert a non-ACID managed table to an insert-only ACID table:

ALTER TABLE nonacidtbl SET TBLPROPERTIES ('transactional'='true', 'transactional_properties'='insert_only');

Data Operations on ACID Tables

Inserting Data

INSERT INTO acidtbl VALUES (1, 'a');
INSERT INTO acidtbl VALUES (2, 'b');

Updating Data

UPDATE acidtbl SET value='updated' WHERE key=1;

Performing Merge Operations

MERGE INTO acidtbl USING src
ON acidtbl.key = src.key
WHEN MATCHED AND src.value IS NULL THEN DELETE
WHEN MATCHED AND (acidtbl.value != src.value) THEN UPDATE SET value = src.value
WHEN NOT MATCHED THEN INSERT VALUES (src.key, src.value);

Understanding Table Structures

ACID Tables (Transactional)

ACID tables have a specific directory structure in HDFS:

/user/hive/warehouse/t/
├── base_0000022/
│ └── bucket_00000
├── delta_0000023_0000023_0000/
│ └── bucket_00000
└── delta_0000024_0000024_0000/
└── bucket_00000
  • Base Directory: Contains the original data files.
  • Delta Directories: Store changes (inserts, updates, deletes).

Non-ACID Tables

Non-ACID tables have a simpler structure:

/user/hive/warehouse/table_name/
├── file1.orc
├── file2.orc
└── file3.orc

# For partitioned tables:
/user/hive/warehouse/table_name/
├── partition_column=value1/
│ ├── file1.orc
│ └── file2.orc
└── partition_column=value2/
├── file3.orc
└── file4.orc

File Format Considerations

  • Full ACID Tables: Only the ORC (Optimized Row Columnar) file format is supported for full ACID tables that allow all CRUD operations.
  • Insert-Only ACID Tables: These tables support various file formats, not limited to ORC. You can use formats like TEXTFILE, CSV, AVRO, or JSON.
  • Managed Tables: The managed table storage type is required for ACID tables.
  • External Tables: ACID properties cannot be applied to external tables, as changes to external tables are beyond Hive’s control.
  • Converting Existing Tables: When converting a non-ACID managed table to a full ACID table, the data must be in ORC format.
  • Default Format: When creating a full ACID table without specifying the storage format, Hive defaults to using ORC.

Managed vs. External Tables

  • Managed Tables: Support ACID transactions. They can be created as transactional (either full ACID or insert-only).

Example of a full ACID managed table:

CREATE TABLE managed_acid_table (
id INT,
name STRING
)
STORED AS ORC
TBLPROPERTIES ("transactional"="true");

Example of an insert-only ACID managed table:

CREATE TABLE managed_insert_only_table (
id INT,
name STRING
)
STORED AS TEXTFILE
TBLPROPERTIES ("transactional"="true", "transactional_properties"="insert_only");

External Tables: Do not support ACID transactions. These tables are used for data managed outside Hive’s control.

Example of an external table:

CREATE EXTERNAL TABLE external_table (
id INT,
name STRING
)
STORED AS TEXTFILE
LOCATION '/path/to/external/data';

Limitations of ACID Tables

  1. Performance Overhead: ACID tables introduce additional overhead due to the need for transactional logging and compaction processes.
  2. Storage Requirements: The delta files and base files can increase storage requirements.
  3. Compaction: Regular compaction is necessary to maintain performance and manage storage, which can add complexity.
  4. Version Dependency: Ensure that you are using a Hive version that supports the desired ACID features, as improvements and bug fixes are version-dependent.
  5. External Table Limitation: ACID properties cannot be applied to external tables.

Key Points

  1. Only managed tables can be converted to ACID tables.
  2. External tables cannot be made transactional.
  3. Full ACID tables require the ORC file format.
  4. Converting ACID tables back to non-ACID tables is not supported.

Stackademic 🎓

Thank you for reading until the end. Before you go:

Apache Hive 101: MSCK Repair Table

The MSCK REPAIR TABLE command in Hive is used to update the metadata in the Hive metastore to reflect the current state of the partitions in the file system. This is particularly necessary for external tables where partitions might be added directly to the file system (such as HDFS or Amazon S3) without using Hive commands.

What MSCK REPAIR TABLE Does

  1. Scans the File System: It scans the file system (e.g., HDFS or S3) for Hive-compatible partitions that were added after the table was created.
  2. Updates Metadata: It compares the partitions in the table metadata with those in the file system. If it finds new partitions in the file system that are not in the metadata, it adds them to the Hive metastore.
  3. Partition Detection: It detects partitions by reading the directory structure and creating partitions based on the folder names.

Why MSCK REPAIR TABLE is Needed

  1. Partition Awareness: Hive stores a list of partitions for each table in its metastore. When new partitions are added directly to the file system, Hive is not aware of these partitions unless the metadata is updated. Running MSCK REPAIR TABLE ensures that the Hive metastore is synchronized with the actual data layout in the file system.
  2. Querying New Data: Without updating the metadata, queries on the table will not include the data in the new partitions. By running MSCK REPAIR TABLE, you make the new data available for querying.
  3. Automated Ingestion: For workflows that involve automated data ingestion, running MSCK REPAIR TABLE after each data load ensures that the newly ingested data is recognized by Hive without manually adding each partition.

Command to Run MSCK REPAIR TABLE

MSCK REPAIR TABLE table_name;

Replace table_name with the name of your Hive table.

Considerations and Limitations

  1. Performance: The operation can be slow, especially with a large number of partitions, as it involves scanning the entire directory structure.
  2. Incomplete Updates: If the operation times out, it may leave the table in an incomplete state where only some partitions are added. It may be necessary to run the command multiple times until all partitions are included.
  3. Compatibility: MSCK REPAIR TABLE only adds partitions to metadata; it does not remove them. For removing partitions, other commands like ALTER TABLE DROP PARTITION must be used.
  4. Hive Compatibility: Partitions must be Hive-compatible. For partitions that are not, manual addition using ALTER TABLE ADD PARTITION is required.