These are a few best practices for Hive, from a standalone, Cloudera, or Hortonworks cluster perspective.
Using ORCFile for Hive table:
Using the ORC file format for Hive tables is extremely beneficial for getting fast response times from Hive queries.
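As a sketch, an ORC-backed table can be created as follows (the table and column names are hypothetical):

```sql
-- Create a Hive table stored in the ORC file format.
-- ZLIB is the ORC default codec; SNAPPY trades some compression
-- ratio for faster reads and writes.
CREATE TABLE sales_orc (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");

-- Load data from an existing text-format table (hypothetical name):
INSERT INTO TABLE sales_orc SELECT * FROM sales_text;
```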
Query execution engine:
Hive can use the Apache Tez execution engine instead of the older MapReduce engine for faster query execution.
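Assuming Tez is installed on the cluster, the engine can be switched per session (or cluster-wide in hive-site.xml):

```sql
-- Switch the query execution engine from MapReduce to Tez.
SET hive.execution.engine = tez;
```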
Use of partitions:
Partitioning splits a table's data by the values of one or more columns, so queries that filter on those columns scan only the relevant partitions instead of the whole table.
Keep in mind not to create too many partitions, and choose the partitioning columns carefully.
A Hive partition can be further subdivided into buckets (also called clusters); this is known as bucketing or clustering. Bucketing should be done on the columns that are primarily used in JOIN operations.
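A sketch combining both techniques (table and column names hypothetical): partition by a coarse, low-cardinality column and bucket by the common join key:

```sql
-- Partition by country (coarse, low-cardinality column),
-- bucket by customer_id (the column used in joins).
CREATE TABLE orders (
  order_id    BIGINT,
  customer_id BIGINT,
  amount      DECIMAL(10,2)
)
PARTITIONED BY (country STRING)
CLUSTERED BY (customer_id) INTO 32 BUCKETS
STORED AS ORC;
```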
Vectorization in Hive:
Vectorized query execution, introduced in Hive 0.13, improves the performance of operations such as scans, aggregations, filters, and joins.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;          -- vectorization on the reduce side
set hive.vectorized.execution.reduce.groupby.enabled = true;  -- vectorized GROUP BY on the reduce side
If vectorization is not supported for a given operation, it is automatically turned off for that operation.
Note: To use vectorized query execution, you must store your data in ORC format.
Partitioned tables are supported.
To use cost-based optimisation (also known as CBO), set the following parameters at the beginning of your query:
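The parameter list itself is missing from the original; the settings usually associated with enabling CBO are shown below (the values are the standard ones, stated here as an assumption rather than taken from the source):

```sql
-- Enable the cost-based optimiser.
SET hive.cbo.enable = true;
-- Let the optimiser gather and use column and partition statistics.
SET hive.compute.query.column.stats = true;
SET hive.stats.fetch.column.stats = true;
SET hive.stats.fetch.partition.stats = true;
```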
Heap sizes for concurrent connections:
Memory for Hive and Tez JVMs:
The memory available for the Tez ApplicationMaster JVM is controlled with tez.am.launch.cmd-opts, which is typically set to 80% of tez.am.resource.memory.mb.
The container size is set with hive.tez.container.size and must be a multiple of yarn.scheduler.minimum-allocation-mb. The child Java options are controlled through hive.tez.java.opts and should be set to approximately 80% of hive.tez.container.size.
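For example, with a 4 GB container (the sizes are illustrative, not recommendations):

```sql
-- Container size: a multiple of yarn.scheduler.minimum-allocation-mb.
SET hive.tez.container.size = 4096;
-- Child JVM heap: roughly 80% of the container size.
SET hive.tez.java.opts = -Xmx3276m;
```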
Enable indexing in Hive:
For enabling indexing in Hive,
set hive.optimize.index.filter = true;
This should be done on the columns which are used in the query for lookup.
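A sketch of creating an index on a lookup column (table and index names hypothetical; note that Hive indexes were removed in Hive 3.0, so this applies to Hive 2.x and earlier):

```sql
-- Create a compact index on the lookup column, then build it.
CREATE INDEX idx_orders_customer
ON TABLE orders (customer_id)
AS 'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX idx_orders_customer ON orders REBUILD;
```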
Enhancements for smaller tables:
Hive supports MAPJOIN, which is well suited to scenarios where one of the tables is small enough to fit in memory. It can be enabled via automatic join conversion:
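The conversion is enabled with the settings below (the size threshold shown is illustrative):

```sql
-- Automatically convert common joins to map joins when the
-- small table fits under the size threshold.
SET hive.auto.convert.join = true;
-- Size threshold in bytes; 10 MB here is an illustrative value.
SET hive.auto.convert.join.noconditionaltask.size = 10000000;
```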
Enable Bucketed Map Joins
If tables are bucketed by a particular column and these tables are being used in joins then we can enable bucketed map join to improve the performance. This can be done by:
set hive.optimize.bucketmapjoin = true;
Skew Join optimisation:
To enable skew join optimisation:
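The settings themselves are not listed in the original; the standard ones are:

```sql
-- Handle heavily skewed join keys in a separate map join.
SET hive.optimize.skewjoin = true;
-- Row-count threshold beyond which a key is treated as skewed
-- (100000 is the Hive default).
SET hive.skewjoin.key = 100000;
```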
Compressing final job output:
You can also compress the outputs of the map phase before they are sent across the network,
as well as the intermediate outputs between chained jobs:
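The corresponding settings are:

```sql
-- Compress the final output of the job.
SET hive.exec.compress.output = true;
-- Compress map output before it is shuffled across the network.
SET mapreduce.map.output.compress = true;
-- Compress intermediate data between chained MapReduce jobs.
SET hive.exec.compress.intermediate = true;
```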
- Write intermediate results to a temporary table in the database instead of holding them in in-memory hash maps.
- The recommendation is to set the minimum heap size (-Xms) and the maximum heap size (-Xmx) to the same value.
- For heap size greater than 4GB, G1GC is recommended especially when multiple JVMs are running per machine.
- Instead of ORDER BY, use DISTRIBUTE BY together with SORT BY, so that more reducers can be used and the work is not forced through a single reducer.
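A sketch of the rewrite (table and column names hypothetical); note that this yields per-reducer ordering, not a single globally sorted result:

```sql
-- Global ordering: ORDER BY forces all rows through one reducer.
-- SELECT * FROM orders ORDER BY customer_id;

-- Per-reducer ordering: rows with the same customer_id go to the
-- same reducer (DISTRIBUTE BY) and are sorted within it (SORT BY).
SELECT * FROM orders
DISTRIBUTE BY customer_id
SORT BY customer_id;
```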
- To diagnose GC bottlenecks, enable GC logging by adding the following parameters to JAVA_OPTIONS when starting the JVM:
-XX:+PrintGC -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
- Add -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/path/ so that a heap dump is created whenever an OutOfMemoryError occurs, which helps with debugging.