Can Impala query XML files stored in Hadoop/HDFS?

September 15, 2014

I'm looking at whether a Hadoop/Impala combination would meet our archiving, batch processing, and real-time ad hoc query requirements.

We are persisting XML files (which are well formed and conform to our own XSD schema) into Hadoop and using MapReduce to process end-of-day batch queries etc. For ad hoc user queries and application queries requiring low latency and relatively high performance, we're considering Impala.

What I can't figure out is how Impala would understand the structure of the XML files in order to query them effectively. Can Impala be used to query across XML documents in a meaningful way?

Thanks in advance.

Hive and Impala don't have a mechanism for working with XML files (which is odd, considering the XML support in databases).

That being said, if I were faced with this problem, I would use Pig to import the data into HCatalog. At that point, it's usable by Hive and Impala.

Here's a quick and dirty example of getting XML data into HCatalog using Pig:

--rss.pig

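-- Make the piggybank UDFs (including XMLLoader) available to the script.
REGISTER piggybank.jar;

-- Load each <item>...</item> element as one chararray record.
items = LOAD 'rss.txt' USING org.apache.pig.piggybank.storage.XMLLoader('item') AS (item:chararray);

-- Pull the individual fields out of each item with regular expressions.
data = FOREACH items GENERATE
    REGEX_EXTRACT(item, '<link>(.*)</link>', 1) AS link:chararray,
    REGEX_EXTRACT(item, '<title>(.*)</title>', 1) AS title:chararray,
    REGEX_EXTRACT(item, '<description>(.*)</description>', 1) AS description:chararray,
    REGEX_EXTRACT(item, '<pubdate>.*(\\d{2}\\s[a-zA-Z]{3}\\s\\d{4}\\s\\d{2}:\\d{2}:\\d{2}).*</pubdate>', 1) AS pubdate:chararray;

-- Store the rows in the HCatalog table rss_items.
STORE data INTO 'rss_items' USING org.apache.hcatalog.pig.HCatStorer();

-- Read the table back through HCatalog to check the load.
validate = LOAD 'default.rss_items' USING org.apache.hcatalog.pig.HCatLoader();
DUMP validate;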

--results

(http://www.hannonhill.com/news/item1.html,news item 1,description of news item 1 here.,03 jun 2003 09:39:21)
(http://www.hannonhill.com/news/item2.html,news item 2,description of news item 2 here.,30 may 2003 11:06:42)
(http://www.hannonhill.com/news/item3.html,news item 3,description of news item 3 here.,20 may 2003 08:56:02)

--impala query

select * from rss_items;

--impala results

   link                                       title        description                       pubdate
0  http://www.hannonhill.com/news/item1.html  news item 1  description of news item 1 here.  03 jun 2003 09:39:21
1  http://www.hannonhill.com/news/item2.html  news item 2  description of news item 2 here.  30 may 2003 11:06:42
2  http://www.hannonhill.com/news/item3.html  news item 3  description of news item 3 here.  20 may 2003 08:56:02

--rss.txt data file

<rss version="2.0">
   <channel>
      <title>news</title>
      <link>http://www.hannonhill.com</link>
      <description>hannon hill news</description>
      <language>en-us</language>
      <pubdate>tue, 10 jun 2003 04:00:00 gmt</pubdate>
      <generator>cascade server</generator>
      <webmaster>webmaster@hannonhill.com</webmaster>
      <item>
         <title>news item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>description of news item 1 here.</description>
         <pubdate>tue, 03 jun 2003 09:39:21 gmt</pubdate>
         <guid>http://www.hannonhill.com/news/item1.html</guid>
      </item>
      <item>
         <title>news item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>description of news item 2 here.</description>
         <pubdate>fri, 30 may 2003 11:06:42 gmt</pubdate>
         <guid>http://www.hannonhill.com/news/item2.html</guid>
      </item>
      <item>
         <title>news item 3</title>
         <link>http://www.hannonhill.com/news/item3.html</link>
         <description>description of news item 3 here.</description>
         <pubdate>tue, 20 may 2003 08:56:02 gmt</pubdate>
         <guid>http://www.hannonhill.com/news/item3.html</guid>
      </item>
   </channel>
</rss>
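The regular expressions in the Pig script can be tried out locally before running anything on a cluster. The following is only a rough Python sketch of the same extraction, not part of the Pig/HCatalog workflow itself. The names RSS_SAMPLE, FIELD_PATTERNS, split_items and extract_row are made up for illustration, the sample is a shortened copy of the rss.txt data, and split_items only approximates what XMLLoader('item') does.

```python
import re

# Shortened, hypothetical stand-in for rss.txt (two items from the sample feed).
RSS_SAMPLE = """<rss version="2.0">
   <channel>
      <title>news</title>
      <item>
         <title>news item 1</title>
         <link>http://www.hannonhill.com/news/item1.html</link>
         <description>description of news item 1 here.</description>
         <pubdate>tue, 03 jun 2003 09:39:21 gmt</pubdate>
      </item>
      <item>
         <title>news item 2</title>
         <link>http://www.hannonhill.com/news/item2.html</link>
         <description>description of news item 2 here.</description>
         <pubdate>fri, 30 may 2003 11:06:42 gmt</pubdate>
      </item>
   </channel>
</rss>"""

# Same patterns as the REGEX_EXTRACT calls in the Pig script.
FIELD_PATTERNS = {
    "link": r"<link>(.*)</link>",
    "title": r"<title>(.*)</title>",
    "description": r"<description>(.*)</description>",
    "pubdate": r"<pubdate>.*(\d{2}\s[a-zA-Z]{3}\s\d{4}\s\d{2}:\d{2}:\d{2}).*</pubdate>",
}


def split_items(xml_text, tag="item"):
    """Return every <tag>...</tag> block as a string (an approximation of XMLLoader)."""
    return re.findall(rf"<{tag}>.*?</{tag}>", xml_text, flags=re.DOTALL)


def extract_row(item):
    """Apply each field pattern to one item; a field that does not match becomes None."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, item)
        row[name] = match.group(1) if match else None
    return row


rows = [extract_row(item) for item in split_items(RSS_SAMPLE)]
for row in rows:
    print(row)
```

Because the items are split out first, the channel-level title and link are never picked up, which is the same reason the Pig script loads on 'item' rather than on the whole document.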
