hadoop - Unable to use bucket map join in hive 0.12 -

April 15, 2010

i trying out hive optimization features , encountered such problem: cannot use bucket map join in hive 0.12. after setting tried below, only 1 hashtable file generated , join turn out map join.

i have 2 tables both in rcfile format , both bucktized 10 bucket, created , populated follows(origin data generated tpc-h):

hive> create table lsm (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipstruct string, l_shipmode string, l_comment string) clustered (l_orderkey) 10 buckets stored rcfile; hive> create table osm (o_orderkey int, o_custkey int) clustered (o_orderkey) 10 buckets stored rcfile; hive> set hive.enforce.bucketing=true; hive> insert overwrite table lsm select * orili; hive> insert overwrite table osm select o_orderkey, o_custkey orior;

and can query both table’s data normally, , lsm 790mb, osm 11mb, both 10 bucket files, want try bucket map join:

hive> set hive.auto.convert.join=true;  hive> set hive.optimize.bucketmapjoin=true; hive> set hive.enforce.bucketmapjoin=true; hive> set hive.auto.convert.join.noconditionaltask=true; hive> set hive.auto.convert.join.noconditionaltask.size=1000000000000000; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.bucketizedhiveinputformat;

and query follows:

hive> select /*+ mapjoin(osm) */ osm.o_orderkey, lsm.* osm join lsm on osm.o_orderkey = lsm.l_orderkey;

this query generate 1 hashtable of osm , fall map join, confused it. have hint set enable bucket map join feature, or there problems in query ?

short version:

set hive> set hive.ignore.mapjoin.hint=false; make bucket map join work expected. means 10 small tables's bucket files build hash table , hash join corresponding big file's buckets.

a longer version:

dive hive 0.12 code , find hive.ignore.mapjoin.hint in hiveconf.java , set true default, means /*+ mapjoin */ hint ignored deliberately. since there 2 phase of optimization in hive, logical optimization , physical optimization, both rule based optimizations.

logical optimization

in logical optimization, mapjoin optimization followed bucketmapjoin optimization, bucketmapjoin optimization take mapjoin operator tree input , convert bucketmapjoin, hinted query first transformed mapjoin , bucketmapjoin. therefore, hint disabled logical optimisation nothing join optimisation on join tree.
physical optimization

in physical optimisation, hive.auto.convert.join tested , mapjoinresolver used , convert reduce join mapjoin. no further bucketmapjoin optimization rules in phase. that's why mapjoin in question.

Search This Blog

Silver

hadoop - Unable to use bucket map join in hive 0.12 -

Comments

Post a Comment

Popular posts from this blog

user interface - How to replace the Python logo in a Tkinter-based Python GUI app? -

android - Get AccessToken using signpost OAuth without opening a browser (Two legged Oauth) -

org.mockito.exceptions.misusing.InvalidUseOfMatchersException: mockito -