hive创建表指定分隔符，不支持多个字符作为分隔符

superlxw1234

浏览: 541878 次
性别:
来自: 西安

最近访客更多访客>>

huageng520

rattersnake

yuanyuan7891

ticojj

博主相关

博客

微博

相册

留言

关于我

博客专栏

: Hive入门
浏览量：43133

文章分类

社区版块

存档分类

博客分类：

hive

hive 分隔符

hadoop fs -cat /tmp/liuxiaowen/1.txt

000377201207221125^^APPLE IPHONE 4S^^2
132288201210331629^^THINKING IN JAVA^^1
132288201210331629^^THIN ssss^^1111
132288201210331629^^THdd dd ddJAVA^^10

文本文件以两个尖角符作为列分隔符

hive中建表：

create external table tt(times string,
product_name string,
sale_num string
) row format delimited
fields terminated by '^^'
STORED AS TEXTFILE 
location 'hdfs://nn/tmp/liuxiaowen/';

hive> select times from tt;

000377201207221125
132288201210331629
132288201210331629
132288201210331629


hive> select product_name from tt; (结果均为空字符)
OK




Time taken: 13.498 seconds

hive> select sale_num from tt;
APPLE IPHONE 4S
THINKING IN JAVA
THIN ssss
THdd dd ddJAVA

本来应该放在product_name中的值，被分到了第三个字段sale_num.

原因是，hive中create table时候指定列分隔符，只能用一个字符，上面的建表语句真实采用的分隔符为一个尖角符^,

所以，真正的是将数据000377201207221125^^APPLE IPHONE 4S

以^为分隔符，分到了三个字段中，所以才有上面的select结果。

解决办法：

1. 将源文件中的分隔符替换为一个字符，比如\001（Ctrl+A），创建表时候指定\001作为分隔符；

2. 自定义 outputformat 和 inputformat。

Hive 的 outputformat/inputformat 与 hadoop 的 outputformat/inputformat 相当类似， inputformat 负责把输入数据进行格式化，然后提供给 Hive ， outputformat 负责把 Hive 输出的数据重新格式化成目标格式再输出到文件，这种对格式进行定制的方式较为底层，对其进行定制也相对简单，重写 InputFormat 中 RecordReader 类中的 next 方法即可。

示例代码如下：

public boolean next(LongWritable key, BytesWritable value) throws IOException {
        while (reader.next(key,text)) {
	        String strReplace = text .toString().toLowerCase().replace( "^^" , "\001" );
	        Text txtReplace = new Text();
	        txtReplace.set(strReplace );
	        value.set(txtReplace.getBytes(), 0, txtReplace.getLength());
	        return true ;
      	}
      return false ;
}

重写 Hive IgnoreKeyTextOutputFormat 中 RecordWriter 中的 write 方法，示例代码如下：

public void write (Writable w) throws IOException {
      String strReplace = ((Text)w).toString().replace( "\001" , "^^" );
      Text txtReplace = new Text();
      txtReplace.set(strReplace);
      byte [] output = txtReplace.getBytes();
      bytesWritable .set(output, 0, output. length );
      writer .write( bytesWritable );
}

自定义 outputformat/inputformat 后，在建表时需要指定 outputformat/inputformat ，如下示例：

stored as INPUTFORMAT 'com.aspire.search.loganalysis.Hive .SearchLogInputFormat' OUTPUTFORMAT 'com.aspire.search.loganalysis.Hive .SearchLogOutputFormat'

3. 通过 SerDe(serialize/deserialize) ，在数据序列化和反序列化时格式化数据。

这种方式稍微复杂一点，对数据的控制能力也要弱一些，它使用正则表达式来匹配和处理数据，性能也会有所影响。但它的优点是可以自定义表属性信息 SERDEPROPERTIES ，在 SerDe 中通过这些属性信息可以有更多的定制行为。

add jar /opt/app/hive-0.7.0-rc1/lib/hive-contrib-0.7.0.jar ;

create external table tt(times string,
product_name string,
sale_num string
) ROW FORMAT
SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES
( 'input.regex' = '([^^]*)\\^\\^([^^]*)\\^\\^([^^]*)',
'output.format.string' = '%1$s %2$s %3$s')
STORED AS TEXTFILE 
location 'hdfs://nn/tmp/liuxiaowen/';

hive> select product_name from tt;

APPLE IPHONE 4S
THINKING IN JAVA
THIN ssss
THdd dd ddJAVA

hive> select times,sale_num from tt;

000377201207221125      2
132288201210331629      1
132288201210331629      1111
132288201210331629      10

0
顶

0
踩

分享到：

hadoop-error:DiskChecker$DiskErrorExcept ... | 记录一下Hive中间和最终结果压缩

2012-12-10 14:43
浏览 24030
评论(1)
分类:编程语言
查看更多

1 楼 top_cai 2013-09-27

请问有没有具体代码啊，这个 while (reader.next(key,text)) 是什么意思啊？，没有reader /text的变量的？

发表评论

您还没有登录,请您登录后再发表评论