Bucket Map Join
1. 测试1:两个1亿多记录的表,不存在数据倾斜与笛卡尔积,测试下来与普通的join差不多;
2. 测试2:一个4000万和一个5000多万的表join,关联键数据倾斜,并且笛卡尔积,效果明显;
create table lxw_test(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 10 BUCKETS;
create table lxw_test1(imei string,sndaid string,data_time string)
CLUSTERED BY(imei) SORTED BY(imei) INTO 5 BUCKETS;
两个表关联键为imei,需要按imei分桶并且排序,小表(lxw_test)分桶数是大表(lxw_test1)的倍数(这点是在网上看的,需要这样,暂且这么做了);
set hive.enforce.bucketing = true;
插入数据前需要打开该选项;
insert overwrite table lxw_test
select imei,sndaid,null
from woa_all_user_info_his
where pt = '2012-05-28'
limit 40000000;
insert overwrite table lxw_test1
select imei,sndaid,data_time
from woa_all_user_info_his
where pt = '2012-05-28';
join时需要打开的参数:
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
select /*+ mapjoin(b) */ count(1)
from lxw_test1 a
join lxw_test b
on a.imei = b.imei
将小表做为驱动表,mapjoin;
包括insert数据,差不多10分钟左右;
如果这两个表做普通的join, 耗时1个多小时,没跑完,kill掉了。
分享到:
相关推荐
apache-hive-2.1.1-bin.tar apache-hive-2.1.1-bin.tar apache-hive-2.1.1-bin.tarapache-hive-2.1.1-bin.tar apache-hive-2.1.1-bin.tar apache-hive-2.1.1-bin.tarapache-hive-2.1.1-bin.tar apache-hive-2.1.1-...
含两个文件hive-jdbc-3.1.2-standalone.jar和apache-hive-3.1.2-bin.tar.gz 含两个文件hive-jdbc-3.1.2-standalone.jar和apache-hive-3.1.2-bin.tar.gz 含两个文件hive-jdbc-3.1.2-standalone.jar和apache-hive-...
02、hive-exec-2.1.1-cdh6.3.1.jar 03、hive-jdbc-2.1.1-cdh6.3.1.jar 04、hive-jdbc-2.1.1-cdh6.3.1-standalone.jar 05、hive-metastore-2.1.1-cdh6.3.1.jar 06、hive-service-2.1.1-cdh6.3.1.jar 07、libfb303-...
catalog, Hive-Metastore, containing schemas and statistics, which is useful in data exploration and query optimization. In Facebook, the Hive warehouse contains several thousand tables with over 700 ...
hive-jdbc-1.2.1-standalone.jar hive-jdbc驱动jar包,欢迎下载
dbeaver连接hive时需要的驱动包hive-jdbc-uber-2.6.5.0-292.jar
Missing Hive Execution Jar: /hive/hive1.2.1/lib/hive-exec-*.jar
hive-jdbc-uber-2.6.5.0-292.jar 驱动下载,hive-jdbc-uber-2.6.5.0-292.jar,hive-jdbc-uber-2.6.5.0-292.jar
hive-jdbc-3.1.2-standalone适用于linux
hive客户端工具DBever连接时所使用的驱动,hive-jdbc-uber-2.6.5.0-292.jar,hive-jdbc-uber-2.6.5.0-292.jar,hive-jdbc-uber-2.6.5.0-292.jar
hive-jdbc-uber-2.6.5.0-292.jar驱动
hive2.1.1 show create table 表名,hive中文乱码,替换hive-exec-2.1.1.jar
hive-jdbc-2.1.0.jar
hive-exec-3.1.2 排除guava
hive-jdbc-2.3.7-standalone,可用dbeaver连接hive数据库,在工具中进行数据库记录的新增改查
Hive连接的jar包——hive-jdbc-3.1.2-standalone.jar,使用数据库连接软件连接数据仓库时需要使用相应的驱动器驱动,希望对大家有所帮助
hive-jdbc-uber-2.6.5.0-292.jar DbVisualizer (as of version 9.5.5) Below is an example configuration using DbVisualizer: Open the Diver Manager dialog ("Tools" > "Driver Manager...") and hit the ...
hive-jdbc-2.1.1-cdh6.2.0(ieda等jdbc链接hive2.1.1);cdh6.2.0安装的hive2.1.1
apache-hive-3.1.2-bin.tar.gz, 下载自:https://mirrors.bfsu.edu.cn/apache/hive/hive-3.1.2/, 上传至CSDN备份,本资源下载后需要解压缩zip文件,才是原本的apache-hive-3.1.2-bin.tar.gz文件
用java连接hive所必须的jar包,包括hadoop-commmon-2.6.4以及hive-jdbc-1.2.1-standalone