Best practices for Hive

These are a few best practices for Hive from a standalone/Cloudera/Hortonworks cluster perspective.

  • Using ORC files for Hive tables:

Using the ORC file format for every Hive table is extremely beneficial for getting fast response times from your Hive queries.
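A table can be stored as ORC at creation time; a minimal sketch, where the table and column names are illustrative:

-- Hypothetical table stored as ORC with Snappy compression.
CREATE TABLE orders_orc (
  order_id BIGINT,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");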

  • Query execution engine:

Hive can use the Apache Tez execution engine instead of the venerable MapReduce engine for faster query execution:

set hive.execution.engine=tez;

  • Use of partitions:

Partitioning can be used to avoid scanning the whole table. Keep in mind not to create too many partitions, and choose the partitioning columns carefully.
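A sketch of a partitioned table and a partition-pruned query (table and column names are illustrative):

-- Hypothetical table partitioned by date.
CREATE TABLE logs (
  msg STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Filtering on the partition column scans only the matching partitions.
SELECT msg FROM logs WHERE dt = '2016-01-01';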

  • Use Bucketing:

A Hive partition can be further subdivided into clusters, or buckets; this is called bucketing or clustering. Bucketing should be done on the columns that are primarily used for JOIN operations.
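A sketch of a bucketed table, assuming user_id is the usual join key (names are illustrative):

-- Hypothetical table bucketed on the join column.
CREATE TABLE users (
  user_id BIGINT,
  name    STRING
)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;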

  • Vectorization in Hive:

Introduced in Hive 0.13, vectorization helps improve the performance of operations like scans, aggregations, filters, and joins.
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true; -- vectorization on the reduce side
set hive.vectorized.execution.reduce.groupby.enabled = true; -- default: true

If vectorization is not supported for an operation, it is automatically turned off.
Note: To use vectorized query execution, you must store your data in ORC format.
Partitioned tables are supported.

  • Cost-based optimization:

To use cost-based optimization (also known as CBO), set the following parameters at the beginning of your query:
set hive.cbo.enable=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.column.stats=true;
set hive.stats.fetch.partition.stats=true;
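CBO relies on table and column statistics, so they need to be gathered first; a sketch, where the table name is illustrative:

-- Gather table/partition-level statistics.
ANALYZE TABLE orders_orc COMPUTE STATISTICS;
-- Gather the column-level statistics used by the optimizer.
ANALYZE TABLE orders_orc COMPUTE STATISTICS FOR COLUMNS;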

  • Heap sizes for concurrent connections:

https://www.cloudera.com/documentation/enterprise/5-4-x/topics/admin_hive_configure.html


  • Memory for Hive and Tez JVMs:

The memory available for the containers (JVMs) is controlled with tez.am.launch.cmd-opts, which is typically set to 80% of tez.am.resource.memory.mb.
The container size is set with hive.tez.container.size and must be a multiple of yarn.scheduler.minimum-allocation-mb. The child Java options are controlled through hive.tez.java.opts, which should be set to approximately 80% of hive.tez.container.size.
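An illustrative sizing, assuming a 4 GB container; actual values depend on cluster memory and YARN settings:

-- Must be a multiple of yarn.scheduler.minimum-allocation-mb.
set hive.tez.container.size=4096;
-- Child JVM heap, roughly 80% of the container size.
set hive.tez.java.opts=-Xmx3276m;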

  • Enable indexing in Hive:

To enable indexing in Hive:

set hive.optimize.index.filter = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;

Indexes should be created on the columns that are used for lookups in queries.
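A sketch of creating a compact index, assuming lookups on order_id (table and index names are illustrative):

-- Hypothetical compact index on the lookup column.
CREATE INDEX idx_orders_order_id
ON TABLE orders_orc (order_id)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
-- Build (and later refresh) the index data.
ALTER INDEX idx_orders_order_id ON orders_orc REBUILD;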

  • Enhancements for smaller tables:

Hive supports MAPJOINs, which are well suited to scenarios where one of the tables is small enough to fit in memory. This can be enabled via automatic join conversion:

set hive.auto.convert.join=true;
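The size threshold below which a table is considered small enough to load into memory is controlled by hive.mapjoin.smalltable.filesize; an illustrative setting:

-- Threshold in bytes (25 MB here, the default) for converting a join to a map join.
set hive.mapjoin.smalltable.filesize=25000000;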

  • Enable Bucketed Map Joins

If tables are bucketed by a particular column and are used in joins, we can enable bucketed map joins to improve performance. This can be done by:

set hive.optimize.bucketmapjoin = true;
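If the bucketed tables are also sorted on the join key, a sort-merge bucket (SMB) map join can additionally be enabled:

-- Requires the tables to be bucketed and sorted on the same join columns.
set hive.optimize.bucketmapjoin.sortedmerge = true;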

  • Skew join optimization:

To enable skew join optimization:

set hive.optimize.skewjoin=true;
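The point at which a join key is treated as skewed can be tuned via hive.skewjoin.key:

-- Number of rows for a single key beyond which it is handled as a skew key (default 100000).
set hive.skewjoin.key=100000;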

  • Compressing final job output:

To compress the final output of the job before it is written out:

set mapred.output.compress=true;

To compress the intermediate outputs passed between stages:

set hive.exec.compress.intermediate=true;
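A compression codec can also be chosen explicitly (the property follows the old mapred.* naming used above); Snappy is a common choice for speed:

-- Illustrative codec selection for the compressed output.
set mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;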

Notes:

  • Write intermediate results to a temporary table in the database instead of in-memory HashMaps.
    https://hvivani.com.ar/2014/12/06/hive-dealing-with-out-of-memory-and-garbage-collector-errors/
  • The recommendation is to set the minimum heap size (-Xms) and the maximum heap size (-Xmx) to the same value.
  • For heap sizes greater than 4 GB, G1GC is recommended, especially when multiple JVMs are running per machine.
  • Instead of ORDER BY, use DISTRIBUTE BY and SORT BY so that more reducers can be used and the job is not forced onto a single reducer (see the sketch after this list).
  • To find GC bottlenecks, enable GC logging by adding the following parameters to JAVA_OPTIONS when starting the JVM:
    -XX:+PrintGC -Xloggc:gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps
  • Add -XX:+HeapDumpOnOutOfMemoryError and -XX:HeapDumpPath=/path/ so that a heap dump is created whenever an OutOfMemoryError occurs, which helps with debugging.
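A sketch of the ORDER BY alternative mentioned above (table and column names are illustrative):

-- ORDER BY enforces a total order through a single reducer;
-- DISTRIBUTE BY + SORT BY spreads the work across reducers and sorts within each.
SELECT order_id, amount
FROM orders_orc
DISTRIBUTE BY order_id
SORT BY order_id;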

Retrieving columns from multiple tables using hibernate

A while ago, I ran into a problem where I had to get data from two different tables (entities). There exists a primary key–foreign key relation between these two entities. Let me explain with an example:

Table1:

ID    Column11   Column12
ID1   .          .
ID2   .          .

Table2:

ID    Column21   Column22
ID1   .          .
ID2   .          .

Now I had to retrieve distinct pairs of Column11 and Column22 for a set of IDs.

Here are two approaches to achieve this:

Approach 1: (Using criteria.createCriteria())

final Criteria criteria = getSession().createCriteria(Table1.class).add(Restrictions.in("id", ids));

final ProjectionList projList = Projections.projectionList();

projList.add(Projections.property("Column11"));

criteria.createCriteria("Table2", "Table2");

projList.add(Projections.property("Table2.Column22"));

criteria.setProjection(Projections.distinct(projList));

final List values = criteria.list();

Note: This requires the projections to be added in this particular order. If we add all the projections after creating all the criteria, Hibernate will validate all the properties against the class on which the last criteria was created, and the query will not work.

 

Approach 2: (Using criteria.createAlias())

final Criteria criteria = getSession().createCriteria(Table1.class).add(Restrictions.in("id", ids));

criteria.createAlias("Table2", "Table2");

final ProjectionList projList = Projections.projectionList();

projList.add(Projections.property("Column11"));

projList.add(Projections.property("Table2.Column22"));

criteria.setProjection(Projections.distinct(projList));

final List values = criteria.list();

 

I recommend using Approach 2, as the order of the projection list doesn't matter with createAlias(), which makes the code handier.

 

First hibernate application

Tools used:

  1. PostgreSQL database (9.3) – http://www.postgresql.org/download/
  2. Hibernate JARs (4.3.4) – http://hibernate.org/orm/downloads/
  3. PostgreSQL JDBC driver – http://jdbc.postgresql.org/

Steps to create:

  1. Create a testProject in your Eclipse workspace and include the Hibernate JARs and the PostgreSQL JDBC driver in your classpath.
  2. Create a hibernate.cfg.xml file; a minimal sketch of it follows after this list.
  3. Create a DTO POJO with getter and setter methods, and add the @Entity annotation to the top of that class.
  4. Create a test class with a main() function; it will look something like the sketch after this list.
  5. Run the application as a Java application and see the results in the Postgres DB.
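A minimal sketch of the hibernate.cfg.xml from step 2; the connection URL, credentials, and mapped class name are illustrative placeholders:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
  "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
  "http://www.hibernate.org/dtd/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
  <session-factory>
    <property name="hibernate.connection.driver_class">org.postgresql.Driver</property>
    <property name="hibernate.connection.url">jdbc:postgresql://localhost:5432/testdb</property>
    <property name="hibernate.connection.username">postgres</property>
    <property name="hibernate.connection.password">password</property>
    <property name="hibernate.dialect">org.hibernate.dialect.PostgreSQLDialect</property>
    <property name="hibernate.hbm2ddl.auto">update</property>
    <!-- The annotated DTO class from step 3; the name is illustrative. -->
    <mapping class="com.example.UserDetails"/>
  </session-factory>
</hibernate-configuration>

And a sketch of the main() from step 4, using the Hibernate 4.3 API (UserDetails is the hypothetical DTO from step 3):

import org.hibernate.Session;
import org.hibernate.SessionFactory;
import org.hibernate.boot.registry.StandardServiceRegistryBuilder;
import org.hibernate.cfg.Configuration;

public class HibernateTest {
    public static void main(String[] args) {
        // Reads hibernate.cfg.xml from the classpath.
        Configuration configuration = new Configuration().configure();
        SessionFactory sessionFactory = configuration.buildSessionFactory(
                new StandardServiceRegistryBuilder()
                        .applySettings(configuration.getProperties())
                        .build());

        Session session = sessionFactory.openSession();
        session.beginTransaction();
        // Persist the hypothetical DTO entity.
        session.save(new UserDetails());
        session.getTransaction().commit();
        session.close();
        sessionFactory.close();
    }
}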

The complete code can be found at: https://github.com/volcanicprateek/hibernateTest.git

References: