1. Requirements and current state:

Source table: pcup_3month_login_dtl_mes, about 1.2 billion rows in 300 files.

The statistics SQL:

insert overwrite table pcup_logininfo_tmp partition(data_type = 1)
select popt_id,
       null as sndaid,
       count(distinct case when login_date >= '2012-02-01' and login_date < '2012-05-01' then login_date else null end) as m3_login,
       null as m3_login_top5,
       count(distinct case when login_date >= '2012-05-01' and login_date <= '2012-05-09' then login_date else null end) as mn_login,
       null as mn_login_top5,
       null as m3_apptype,
       null as mn_apptype,
       count(distinct case when login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '1' then login_date else null end) as m3_g_login,
       null as m3_g_login_top5,
       count(distinct case when login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '2' then login_date else null end) as m3_l_login,
       null as m3_l_login_top5,
       count(distinct case when login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '3' then login_date else null end) as m3_s_login,
       null as m3_s_login_top5,
       count(distinct case when login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '4' then login_date else null end) as m3_o_login,
       null as m3_o_login_top5,
       count(distinct case when login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '1' then login_date else null end) as mn_g_login,
       null as mn_g_login_top5,
       count(distinct case when login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '2' then login_date else null end) as mn_l_login,
       null as mn_l_login_top5,
       count(distinct case when login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '3' then login_date else null end) as mn_s_login,
       null as mn_s_login_top5,
       count(distinct case when login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '4' then login_date else null end) as mn_o_login,
       null as mn_o_login_top5
from pcup_3month_login_dtl_mes
group by popt_id;

Characteristics: few group by dimensions, many count(distinct) expressions, and very few reduce tasks (only 7).
Runtime: more than an hour and a half.
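The slowness comes from the query shape: with group by popt_id as the only key, all of the distinct-value bookkeeping is funneled through a handful of reducers. The standard escape, and the basis of the rewrite below, is that count(distinct x) is equivalent to a two-stage group by, where the first stage deduplicates with full reduce parallelism. A minimal sketch of the equivalence (table name t is illustrative):

```sql
-- Original form: deduplication of login_date happens inside few reducers.
select popt_id, count(distinct login_date)
from t
group by popt_id;

-- Equivalent two-stage form: stage 1 deduplicates (popt_id, login_date)
-- pairs across many reducers; stage 2 is an ordinary count.
select popt_id, count(1)
from (
    select popt_id, login_date
    from t
    group by popt_id, login_date
) x
group by popt_id;
```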
2. Optimization approach:
Replace all of the count(distinct) expressions with union all + group by + row_number;
Set a reasonable number of reduce tasks based on the input file sizes.
3. Optimized code (runtime: roughly 20 minutes):
SET mapred.reduce.tasks = 100;
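The reducer count above is hard-coded from the known input size. As an alternative, Hive can size the reduce stage itself from the job's input via hive.exec.reducers.bytes.per.reducer; the value below is an illustrative assumption, to be tuned per cluster:

```sql
-- Let Hive derive the reduce task count: roughly one reducer per ~256MB
-- of input (value is an assumption; the default in old releases was 1GB).
SET hive.exec.reducers.bytes.per.reducer = 256000000;
```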
-- Step 1: initial filtering plus deduplication
create table lxw_test3 as
select popt_id, login_date, apptypeid
from pcup_3month_login_dtl_mes
where login_date >= '2012-02-01' and login_date <= '2012-05-09'
group by popt_id, login_date, apptypeid;

-- Step 2: use the row_number UDF to flag duplicates
add jar hdfs://nn.dc.sh-wgq.sdo.com:8020/group/p_sdo_data/udf/snda_udf.jar;
CREATE TEMPORARY FUNCTION row_number AS 'com.snda.hive.udf.UDFrow_number';

create table lxw_test4 as
select type, popt_id, login_date,
       row_number(type, login_date, popt_id) as rn
from (
    select type, popt_id, login_date
    from (
        select 'm3_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-02-01' and login_date < '2012-05-01'
        union all
        select 'mn_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-05-01' and login_date <= '2012-05-09'
        union all
        select 'm3_g_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '1'
        union all
        select 'm3_l_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '2'
        union all
        select 'm3_s_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '3'
        union all
        select 'm3_o_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-02-01' and login_date < '2012-05-01' and apptypeid = '4'
        union all
        select 'mn_g_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '1'
        union all
        select 'mn_l_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '2'
        union all
        select 'mn_s_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '3'
        union all
        select 'mn_o_login' as type, popt_id, login_date from lxw_test3
        where login_date >= '2012-05-01' and login_date <= '2012-05-09' and apptypeid = '4'
    ) x
    distribute by type, login_date, popt_id
    sort by type, login_date, popt_id
) y;

-- Step 3: aggregate with ordinary aggregate functions
insert overwrite table pcup_logininfo_tmp partition(data_type = 99)
select popt_id,
       null as sndaid,
       sum(case when type = 'm3_login' and rn = 1 then 1 else 0 end) as m3_login,
       null as m3_login_top5,
       sum(case when type = 'mn_login' and rn = 1 then 1 else 0 end) as mn_login,
       null as mn_login_top5,
       null as m3_apptype,
       null as mn_apptype,
       sum(case when type = 'm3_g_login' and rn = 1 then 1 else 0 end) as m3_g_login,
       null as m3_g_login_top5,
       sum(case when type = 'm3_l_login' and rn = 1 then 1 else 0 end) as m3_l_login,
       null as m3_l_login_top5,
       sum(case when type = 'm3_s_login' and rn = 1 then 1 else 0 end) as m3_s_login,
       null as m3_s_login_top5,
       sum(case when type = 'm3_o_login' and rn = 1 then 1 else 0 end) as m3_o_login,
       null as m3_o_login_top5,
       sum(case when type = 'mn_g_login' and rn = 1 then 1 else 0 end) as mn_g_login,
       null as mn_g_login_top5,
       sum(case when type = 'mn_l_login' and rn = 1 then 1 else 0 end) as mn_l_login,
       null as mn_l_login_top5,
       sum(case when type = 'mn_s_login' and rn = 1 then 1 else 0 end) as mn_s_login,
       null as mn_s_login_top5,
       sum(case when type = 'mn_o_login' and rn = 1 then 1 else 0 end) as mn_o_login,
       null as mn_o_login_top5
from lxw_test4
group by popt_id;
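A side note: on Hive 0.11 and later (not yet available when this post was written), the add jar / CREATE TEMPORARY FUNCTION pair in step 2 could likely be replaced by the built-in row_number() window function. A hedged sketch, shown for a single metric type only; the same expression would apply to the full union all subquery:

```sql
-- Sketch, assuming Hive 0.11+: the built-in row_number() window function
-- numbers rows within each (type, popt_id, login_date) group, so rn = 1
-- marks exactly one row per distinct combination, as the custom UDF did.
select type, popt_id, login_date,
       row_number() over (partition by type, popt_id, login_date
                          order by login_date) as rn
from (
    select 'm3_login' as type, popt_id, login_date
    from lxw_test3
    where login_date >= '2012-02-01' and login_date < '2012-05-01'
) x;
```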