Can Impala query XML files stored in Hadoop/HDFS?
I'm looking into whether a Hadoop/Impala combination would meet our archiving, batch processing, and real-time ad hoc query requirements.
We will be persisting XML files (which are well formed and conform to our own XSD schema) into Hadoop and using MapReduce to process end-of-day batch queries, etc. For ad hoc user queries and application queries requiring low latency and relatively high performance, we're considering Impala.
What I can't figure out is how Impala would understand the structure of the XML files in order to query them effectively. Can Impala be used to query across XML documents in a meaningful way?
Thanks in advance.
Hive and Impala don't have a mechanism to work with XML files (which is odd, considering XML support in most databases).
That being said, if I were faced with this problem, I'd use Pig to import the data into HCatalog. At that point, it's usable by Hive and Impala.
Here's a quick and dirty example of getting XML data into HCatalog using Pig:
--rss.pig
REGISTER piggybank.jar;

-- XMLLoader emits one record per <item> element, as a single chararray.
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull each column out of the raw item text with a regex.
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubdate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubdate>', 1) AS pubdate:chararray;

STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

-- Read the table back through HCatalog to check what was stored.
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;
--results
(http://www.hannonhill.com/news/item1.html,news item 1,description of news item 1 here.,03 jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,news item 2,description of news item 2 here.,30 may 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,news item 3,description of news item 3 here.,20 may 2003 08:56:02)
--impala query
select * from rss_items;
--impala results
   link                                       title        description                       pubdate
0  http://www.hannonhill.com/news/item1.html  news item 1  description of news item 1 here.  03 jun 2003 09:39:21
1  http://www.hannonhill.com/news/item2.html  news item 2  description of news item 2 here.  30 may 2003 11:06:42
2  http://www.hannonhill.com/news/item3.html  news item 3  description of news item 3 here.  20 may 2003 08:56:02
--rss.txt data file
<rss version="2.0">
  <channel>
    <title>news</title>
    <link>http://www.hannonhill.com</link>
    <description>hannon hill news</description>
    <language>en-us</language>
    <pubdate>tue, 10 jun 2003 04:00:00 gmt</pubdate>
    <generator>cascade server</generator>
    <webmaster>webmaster@hannonhill.com</webmaster>
    <item>
      <title>news item 1</title>
      <link>http://www.hannonhill.com/news/item1.html</link>
      <description>description of news item 1 here.</description>
      <pubdate>tue, 03 jun 2003 09:39:21 gmt</pubdate>
      <guid>http://www.hannonhill.com/news/item1.html</guid>
    </item>
    <item>
      <title>news item 2</title>
      <link>http://www.hannonhill.com/news/item2.html</link>
      <description>description of news item 2 here.</description>
      <pubdate>fri, 30 may 2003 11:06:42 gmt</pubdate>
      <guid>http://www.hannonhill.com/news/item2.html</guid>
    </item>
    <item>
      <title>news item 3</title>
      <link>http://www.hannonhill.com/news/item3.html</link>
      <description>description of news item 3 here.</description>
      <pubdate>tue, 20 may 2003 08:56:02 gmt</pubdate>
      <guid>http://www.hannonhill.com/news/item3.html</guid>
    </item>
  </channel>
</rss>
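If you want to check the extraction patterns before running the Pig job, here is a minimal Python sketch that applies the same four regexes to a single item from the sample data. It is not part of the Pig/HCatalog pipeline. The names ITEM, PATTERNS, and extract_fields are made up for this illustration, and the sketch assumes XMLLoader hands each <item> element to the script as one string, which is how the Pig script above treats it.

```python
import re

# One <item> element from the rss.txt sample, used as a stand-in for a single
# record produced by XMLLoader('item'). Tag names are lower case, as in the sample.
ITEM = """<item>
<title>news item 1</title>
<link>http://www.hannonhill.com/news/item1.html</link>
<description>description of news item 1 here.</description>
<pubdate>tue, 03 jun 2003 09:39:21 gmt</pubdate>
<guid>http://www.hannonhill.com/news/item1.html</guid>
</item>"""

# Same patterns as the REGEX_EXTRACT calls in the Pig script, keyed by column name.
PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubdate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubdate>",
}


def extract_fields(item_xml):
    """Return a dict of column name -> first captured group (None if no match)."""
    row = {}
    for column, pattern in PATTERNS.items():
        match = re.search(pattern, item_xml)
        row[column] = match.group(1) if match else None
    return row


row = extract_fields(ITEM)
for column, value in row.items():
    print(column, "=", value)
```

Note that the patterns are case sensitive, so they only match if the tag names in your real files are spelled exactly as in the patterns. Regex extraction like this is fine for flat, predictable elements, but for nested or attribute-heavy documents a real XML parser would be the safer choice.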