
Best practices for Hive

These are a few best practices for Hive from a Standalone/Cloudera/Hortonworks cluster perspective.

  • Using ORCFile for Hive tables:

Using ORCFile for every Hive table is extremely beneficial for getting fast response times for your Hive queries.
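As a sketch (table and column names are hypothetical), the format is specified when the table is created, and an existing table can be converted by copying its data across:

CREATE TABLE orders_orc (
  order_id BIGINT,
  customer_id BIGINT,
  amount DOUBLE
)
STORED AS ORC;

-- convert an existing text-format table by inserting its rows into the ORC table
INSERT INTO TABLE orders_orc SELECT * FROM orders_text;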

  • Query execution engine:

Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine for faster query execution:
set hive.execution.engine=tez;

  • Use of partitions:

Partitioning splits the table data so that queries scan only the relevant partitions instead of the whole table.
Keep in mind not to create too many partitions, and choose the partitioning columns carefully.
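A minimal sketch, with a hypothetical table partitioned by a date column:

CREATE TABLE orders_part (
  order_id BIGINT,
  amount DOUBLE
)
PARTITIONED BY (order_date STRING)
STORED AS ORC;

-- queries that filter on the partition column scan only the matching partitions
SELECT SUM(amount) FROM orders_part WHERE order_date = '2017-07-01';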

  • Use Bucketing:

A Hive partition can be further subdivided into clusters or buckets; this is called bucketing or clustering. Bucketing should be done on the columns that are primarily used for JOIN operations.
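For example (hypothetical table, bucketed on the join key customer_id; the bucket count of 32 is arbitrary):

set hive.enforce.bucketing = true; -- needed on older Hive versions so inserts respect the bucket layout

CREATE TABLE customers_bucketed (
  customer_id BIGINT,
  name STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;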

  • Vectorization in Hive:

Introduced in Hive 0.13, vectorization helps improve the performance of operations such as scans, aggregations, filters and joins.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true; -- vectorization on the reduce side
set hive.vectorized.execution.reduce.groupby.enabled = true; -- default: true

If vectorization is not supported for a query, it is automatically turned off.
Note: To use vectorized query execution, you must store your data in ORC format.
Partitioned tables are supported.

  • Cost-based optimization:

To use cost-based optimisation (also known as CBO), set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
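CBO relies on table and column statistics being available; as a sketch (table and column names are hypothetical), they can be collected with:

ANALYZE TABLE orders_orc COMPUTE STATISTICS;
ANALYZE TABLE orders_orc COMPUTE STATISTICS FOR COLUMNS customer_id, amount;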

  • Heap sizes for concurrent connections:

https://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hive_configure.html


  • Memory for Hive and Tez JVMs:

The memory available to the Tez Application Master JVM is controlled with tez.am.launch.cmd-opts, which is typically set to about 80% of tez.am.resource.memory.mb.
The container size is set with hive.tez.container.size and must be a multiple of yarn.scheduler.minimum-allocation-mb. The child Java options (heap size) are controlled through hive.tez.java.opts, which should be set to approximately 80% of hive.tez.container.size.
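As an illustration only (the values below are hypothetical and depend on your cluster; they assume a yarn.scheduler.minimum-allocation-mb of 1024 MB):

set hive.tez.container.size=4096;
-- roughly 80% of the container size
set hive.tez.java.opts=-Xmx3276m;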

  • Enable indexing in Hive:

For enabling indexing in Hive:

set hive.optimize.index.filter = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

Indexes should be created on the columns that are used for lookups in queries.
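A minimal sketch of creating and building an index on a lookup column (table, column and index names are hypothetical):

CREATE INDEX idx_orders_customer
ON TABLE orders_orc (customer_id)
AS 'COMPACT'
WITH DEFERRED REBUILD;

ALTER INDEX idx_orders_customer ON orders_orc REBUILD; -- populate the index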

  • Enhancements for smaller tables:

Hive supports MAPJOINs, which are well suited for scenarios where one of the tables in a join is small enough to fit in memory. This can be enabled via automatic join conversion:

set hive.auto.convert.join=true;
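With auto conversion enabled, a join such as the following (hypothetical tables, with the smaller table fitting within the configured size threshold) is executed as a map join:

SELECT o.order_id, c.name
FROM orders_orc o
JOIN customers_bucketed c
  ON o.customer_id = c.customer_id;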

  • Enable Bucketed Map Joins:

If the tables being joined are bucketed on the join column, then we can enable bucketed map join to improve performance. This can be done by:

set hive.optimize.bucketmapjoin = true;

  • Skew Join optimisation:

For enabling skew join optimisation:

set hive.optimize.skewjoin=true;
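The threshold at which a join key is treated as skewed can also be tuned; the value below is simply the commonly documented default, not a recommendation:

set hive.skewjoin.key=100000;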

  • Compressing final job output:

If you want to compress the final output of the job before it is written back to the filesystem:

set mapred.output.compress=true;

If you want to compress the intermediate outputs between stages:

set hive.exec.compress.intermediate=true;
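A compression codec can also be chosen explicitly; Snappy below is just one common choice, not a requirement:

set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;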

Notes:

  • Write intermediate results to a temporary table in the database instead of Hashmaps.
    https://hvivani.com.ar/2014/12/06/hive-dealing-with-out-of-memory-and-garbage-collector-errors/
  • The recommendation is to set the minimum heap size (-Xms) and maximum heap size (-Xmx) to the same value.
  • For heap size greater than 4GB, G1GC is recommended especially when multiple JVMs are running per machine.
  • Instead of ORDER BY, use DISTRIBUTE BY and SORT BY so that more reducers can be used and the job is not forced through a single reducer (see the sketch after these notes).
  • To diagnose bottlenecks, enable GC logging by adding the parameters below to JAVA_OPTIONS when starting the JVM:
    -XX:+PrintGC -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  • Add -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/path/ so that a heap dump is created whenever an OutOfMemoryError occurs, which will help in debugging.
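The DISTRIBUTE BY / SORT BY pattern referenced above, as a sketch on a hypothetical table (data is sorted within each reducer rather than forcing a single global sort):

SELECT customer_id, amount
FROM orders_orc
DISTRIBUTE BY customer_id
SORT BY customer_id, amount DESC;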