join - How to denormalized 2 csv files with cardinality 0,1 and mainly 1,n with Pig? -


i need pig script. have 2 csv file , want join between them common id.

customer.csv : 1   ; nom1   ; prenom1    2   ; nom2   ; prenom2    3   ; nom3   ; prenom3      child.csv 1  ; enfant_1_1   2  ; enfant_1_2   3  ; enfant_1_3   1  ; enfant_2_1   1  ; enfant_3_1 

so 1 customer have many child child have 1 "customer".

i want create file :

1   ; nom1   ; prenom1  ; enfant_1_1  ; enfant_2_1  ; enfant_3_1     2   ; nom2   ; prenom2  ; enfant_1_2    3   ; nom3   ; prenom3  ; enfant_1_3    

this method :

first try have :

1  ; enfant_1_1  ; enfant_2_1  ; enfant_3_1 2  ; enfant_1_2 3  ; enfant_1_3 

and after join custome.csv

tell me think there easiest way :)

this script :

donnees_enfants = load '/user/cloudera/jeux/mini_jeu2.csv' using pigstorage(';') (id_parent:int,nom_enfant:chararray);  group_enfants = group donnees_enfants id_parent;  enfant_uneligne = foreach group_enfants generate group, donnees_enfants.nom_enfant;  grunt> echantillon = limit enfant_uneligne 50; grunt> dump echantillon; 

with describe : group_enfants: {group: int,donnees_enfants: {(id_parent: int,nom_enfant: chararray)}} enfant_uneligne: {group: int,{(nom_enfant: chararray)}}

the result :

(1,{( enfant_2_1  ),( enfant_1_1  ),( enfant_3_1  )}) (2,{( enfant_2_2  )}) (3,{( enfant_2_3  )}) 

i tried flatten "enfant_1_2" ... consequences had lign per child... have difficulties play tuple , bags, can me ?

thanks in advance,

edit : found solution problem , more ^^ see below

angelik

finally, found solution , works more fields child : (id, name, age).

-- 1. load 2 files

donnees_enfants = load '/user/cloudera/jeuxdenormalisation/jeux/mini_jeu2.csv' using pigstorage(';') (id:int,nom_enfant:chararray);

donnees_parents = load '/user/cloudera/jeuxdenormalisation/jeux/mini_jeu1.csv' using pigstorage(';') (id_parent:int,nom_parent:chararray,prenom_parent:chararray);

-- 2. join files left outer keep customers don't have child.

denormalisation = join donnees_parents id_parent left outer, donnees_enfants id ;

(9, nom9   , prenom9   ,9, enfant_2_9  ) (9, nom9   , prenom9   ,9, enfant_3_9  ) (9, nom9   , prenom9   ,9, enfant_1_9  ) (10, nom10  , prenom10  ,10, enfant_3_10) (10, nom10  , prenom10  ,10, enfant_1_10 ) (10, nom10  , prenom10  ,10, enfant_2_10 ) 

-- 3. groupby on customer have 1 row customer

unparent_parligne = group denormalisation (id_parent, nom_parent, prenom_parent);

((48, nom48  , prenom48  ),{(48, nom48  , prenom48  ,48, enfant_2_48 ),(48, nom48  , prenom48  ,48, enfant_1_48 )}) ((49, nom49  , prenom49  ),{(49, nom49  , prenom49  ,49, enfant_2_49 ),(49, nom49  , prenom49  ,49, enfant_1_49 )}) ((50, nom50  , prenom50  ),{(50, nom50  , prenom50  ,50, enfant_2_50 ),(50, nom50  , prenom50  ,50, enfant_1_50 )}) ((51, nom51  , prenom51  ),{(51, nom51  , prenom51  ,51, enfant_1_51 )}) 

-- 4. flatten on rows :

ligne_finale = foreach unparent_parligne generate flatten (group), flatten(bagtotuple(denormalisation.(donnees_enfants::nom_enfant,donnees_enfants::age)));

(9, nom9   , prenom9   , enfant_2_9  , enfant_3_9  , enfant_1_9  ) (10, nom10  , prenom10  , enfant_3_10, enfant_1_10 , enfant_2_10 ) (11, nom11  , prenom11  , enfant_1_11 , enfant_2_11 ) 

or if there more fields (with "donnees_enfants::age"):

(8, nom8   , prenom8   , enfant_3_8  , age_3_8 , enfant_2_8  , age_2_8 , enfant_1_8  , age_1_8 ) (9, nom9   , prenom9   , enfant_2_9  , age_2_9 , enfant_3_9  , age_3_9 , enfant_1_9  , age_1_9 ) (10, nom10  , prenom10  , enfant_3_10 , age_3_10, enfant_1_10 , age_1_10, enfant_2_10 , age_2_10) 

-- 5. store data in csv file store ligne_finale '/user/cloudera/jeuxdenormalisation/resultats/test4' using org.apache.pig.piggybank.storage.pigstorageschema(";");


Comments

Popular posts from this blog

android - Get AccessToken using signpost OAuth without opening a browser (Two legged Oauth) -

org.mockito.exceptions.misusing.InvalidUseOfMatchersException: mockito -

google shop client API returns 400 bad request error while adding an item -