May 20

Hive on HBase

Hive provides insights into the data present in HBase (and HDFS) by responding to ad hoc queries. Queries are written in HQL (Hive Query Language), which is SQL-like. Hive internally converts each query into MapReduce jobs, which run in a distributed fashion over HBase and HDFS.

Hive vs HBase

  • Hive is structured, whereas HBase is unstructured.
  • Unlike HBase, Hive is not suitable for low-latency queries. Hive is best suited to running ad hoc MapReduce jobs that deliver useful insights.
  • Hive supports data types such as String, Int, and Date, whereas HBase treats everything as a byte array.
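
To make the last point concrete, here is a minimal Python sketch (it is not HBase or Hive code). It assumes integers were written as 4-byte big-endian values and strings as UTF-8, and the cell contents are made up for illustration. The bytes mean nothing on their own; a declared column type is what tells the reader how to interpret them.

```python
import struct

# Hypothetical cell values as a byte-oriented store would hold them.
# Assumption: ints are 4-byte big-endian and strings are UTF-8.
raw_age = struct.pack(">i", 42)
raw_name = "alice".encode("utf-8")

def decode(raw: bytes, hive_type: str):
    """Interpret raw bytes using a declared, Hive-style column type."""
    if hive_type == "int":
        return struct.unpack(">i", raw)[0]
    if hive_type == "string":
        return raw.decode("utf-8")
    raise ValueError(f"unsupported type: {hive_type}")

age = decode(raw_age, "int")
name = decode(raw_name, "string")
print(raw_age, "->", age)
print(raw_name, "->", name)
```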

Hive integrates with HBase and lets us query the data stored in HBase tables. It works well for certain use cases, but it should not be seen as the only means of communicating with HBase. More on that in the following sections. First, let us look at how Hive integrates with HBase.

Hive on HBase

Though Hive was primarily built to analyse data stored on HDFS, it also provides an optional integration library for working with HBase tables. Below is the architecture of Hive on HBase.

Hive Architecture

We are not going to discuss the architecture of Hive in detail here; perhaps in another post. The key point is that Hive queries are parsed and internally converted into MapReduce jobs, which are then scheduled for execution. This allows us to run complex queries on the operational data in parallel.
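
As a rough mental model of that conversion, the sketch below runs an aggregation equivalent to `SELECT city, COUNT(*) FROM users GROUP BY city` as explicit map, shuffle, and reduce steps in plain Python. The `users` rows and column names are hypothetical, and this is only a conceptual illustration, not how Hive's planner is actually implemented.

```python
from collections import defaultdict

# Hypothetical rows, standing in for records read from HBase or HDFS.
users = [
    {"id": "u1", "city": "Pune"},
    {"id": "u2", "city": "Delhi"},
    {"id": "u3", "city": "Pune"},
]

def map_phase(row):
    # Emit (group key, 1) for each input row.
    yield row["city"], 1

def shuffle(pairs):
    # Group emitted values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # COUNT(*) per group is the sum of the emitted ones.
    return key, sum(values)

pairs = [pair for row in users for pair in map_phase(row)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)
```

In a real cluster the map and reduce steps run on many machines at once, which is where the parallelism comes from.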

When to use Hive

Generating analytics from HBase tables typically requires writing MapReduce jobs. For simple operations such as get and scan, HBase clients can be used directly, but deeper analytics usually involve computation and scanning columns other than the row key. The biggest advantage Hive delivers is that it removes the need to code those MapReduce jobs. Instead, Hive lets us write SQL-like queries in HQL (Hive Query Language).

Hive is ideal for data scientists and business analysts, and helps them get insights from huge volumes of HBase/HDFS data without having to code MapReduce jobs.

Hive client side

  • Provides JDBC/ODBC drivers
  • Provides the following communication interfaces:
    • Thrift
    • Web
    • CLI

HQL

  • Hive tables can be created on top of existing HBase tables and then queried.
  • As Hive is structured, it requires tables and columns to be mapped with data types.
  • It is not mandatory to map all the columns in HBase to the corresponding Hive table. Just map what is required for the query.
  • Multiple Hive tables can be created for a single table in HBase.
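
The points above can be sketched as a small Python helper that builds the HQL statement for a Hive table backed by an existing HBase table. The storage handler class and the `hbase.columns.mapping` and `hbase.table.name` properties come from Hive's HBase integration; the helper function itself and the table, column family, and column names are hypothetical and only serve as an illustration.

```python
def hbase_backed_table_ddl(hive_table, hbase_table, columns):
    """Build a CREATE EXTERNAL TABLE statement over an existing HBase table.

    columns: list of (hive_column, hive_type, hbase_mapping) tuples, where
    hbase_mapping is ':key' for the row key or 'family:qualifier'.
    """
    col_defs = ", ".join(f"{name} {htype}" for name, htype, _ in columns)
    # Mapping entries are comma-separated with no whitespace between them.
    mapping = ",".join(m for _, _, m in columns)
    return (
        f"CREATE EXTERNAL TABLE {hive_table} ({col_defs})\n"
        "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'\n"
        f"WITH SERDEPROPERTIES ('hbase.columns.mapping' = '{mapping}')\n"
        f"TBLPROPERTIES ('hbase.table.name' = '{hbase_table}')"
    )

# Hypothetical names: an HBase table 'users' with column family 'info'.
ddl = hbase_backed_table_ddl(
    "hive_users",
    "users",
    [("user_id", "string", ":key"), ("city", "string", "info:city")],
)
print(ddl)
```

Only the columns listed in the mapping are visible to Hive; any other HBase columns are simply left out. Calling the helper again with a different Hive table name and mapping would describe a second Hive table over the same HBase table.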

When not to use Hive

For low-latency or high-throughput queries, such as a simple get or a range scan, a direct HBase query is the better choice. You may want to explore Apache Phoenix for high-TPS queries over HBase.
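
The reason direct access is fast for these cases is that HBase keeps rows sorted by row key, so a get or a range scan touches only the rows it needs. The toy model below shows the idea with a sorted list and binary search in plain Python. It is not the HBase client API, and the row keys and values are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rows kept sorted by row key, as HBase keeps them.
rows = sorted([
    ("user#001", "Pune"),
    ("user#002", "Delhi"),
    ("user#010", "Mumbai"),
    ("visit#001", "home"),
])
keys = [k for k, _ in rows]

def get(row_key):
    # Point lookup: binary search straight to the row.
    i = bisect_left(keys, row_key)
    if i < len(keys) and keys[i] == row_key:
        return rows[i][1]
    return None

def range_scan(start, stop):
    # Read only rows with start <= key < stop.
    lo, hi = bisect_left(keys, start), bisect_left(keys, stop)
    return rows[lo:hi]

one = get("user#002")
some = range_scan("user#", "user$")
print(one, some)
```

A Hive query over the same data would instead start a job that reads through the table, which is fine for analytics but too slow for this kind of lookup.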

Conclusion:
Hive provides a SQL layer on top of HBase that is useful for analysing data and generating deep insights over large datasets. But it is not ideal for low-latency operations such as simple gets or updates. Apache Phoenix also provides a SQL layer on top of HBase, and it is better suited to low-latency queries. However, it would be great to have a unified SQL layer over HBase that suits both.
