hadoop - Unable to use bucket map join in hive 0.12 -
i trying out hive optimization features , encountered such problem: cannot use bucket map join in hive 0.12. after setting tried below, only 1 hashtable file generated , join turn out map join.
i have 2 tables both in rcfile format , both bucktized 10 bucket, created , populated follows(origin data generated tpc-h):
hive> create table lsm (l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int, l_quantity double, l_extendedprice double, l_discount double, l_tax double, l_returnflag string, l_linestatus string, l_shipdate string, l_commitdate string, l_receiptdate string, l_shipstruct string, l_shipmode string, l_comment string) clustered (l_orderkey) 10 buckets stored rcfile; hive> create table osm (o_orderkey int, o_custkey int) clustered (o_orderkey) 10 buckets stored rcfile; hive> set hive.enforce.bucketing=true; hive> insert overwrite table lsm select * orili; hive> insert overwrite table osm select o_orderkey, o_custkey orior;
and can query both table’s data normally, , lsm 790mb, osm 11mb, both 10 bucket files, want try bucket map join:
hive> set hive.auto.convert.join=true; hive> set hive.optimize.bucketmapjoin=true; hive> set hive.enforce.bucketmapjoin=true; hive> set hive.auto.convert.join.noconditionaltask=true; hive> set hive.auto.convert.join.noconditionaltask.size=1000000000000000; hive> set hive.input.format=org.apache.hadoop.hive.ql.io.bucketizedhiveinputformat;
and query follows:
hive> select /*+ mapjoin(osm) */ osm.o_orderkey, lsm.* osm join lsm on osm.o_orderkey = lsm.l_orderkey;
this query generate 1 hashtable of osm , fall map join, confused it. have hint set enable bucket map join feature, or there problems in query ?
short version:
set hive> set hive.ignore.mapjoin.hint=false;
make bucket map join work expected. means 10 small tables's bucket files build hash table , hash join corresponding big file's buckets.
a longer version:
dive hive 0.12 code , find hive.ignore.mapjoin.hint
in hiveconf.java
, set true
default, means /*+ mapjoin */
hint ignored deliberately. since there 2 phase of optimization in hive, logical optimization , physical optimization, both rule based optimizations.
logical optimization
in logical optimization, mapjoin optimization followed bucketmapjoin optimization, bucketmapjoin optimization take mapjoin operator tree input , convert bucketmapjoin, hinted query first transformed mapjoin , bucketmapjoin. therefore, hint disabled logical optimisation nothing join optimisation on join tree.physical optimization
in physical optimisation,hive.auto.convert.join
tested ,mapjoinresolver
used , convert reduce join mapjoin. no further bucketmapjoin optimization rules in phase. that's why mapjoin in question.
Comments
Post a Comment