Python - parsing space-delimited, named fields


I have data in a specific format (exported from Splunk) that is a mixture of CSV and named fields. I would like to understand whether it is possible in Python to parse such data via a template (or a simplified, average-human-understandable regex).

    "harry potter", "book", "12 mar 2014 note=""good"" language=""english"""
    "forrest gump", "movie", "14 march 2015 note=""good"" language=""aztec"""

As you can see, the first fields are comma separated; then comes one long string that starts with a date and has a few named fields (note, language).

I would like to build a list of dicts solely from the named fields:

    [
        {'note': 'good', 'language': 'english'},
        {'note': 'good', 'language': 'aztec'}
    ]

After parsing the CSV I end up with the last field (e.g. "12 mar 2014 note=""good"" language=""english""" for the first line) and I am stuck. The only solution I can think of is to try to describe the line in a regex (which is scary :). And if I managed to extract tuples, how would I translate them into a dict?

The csv module handles the outer and doubled quoting for you, out of the box. The columns have outer quotes (making sure delimiters, quotes, and newlines in the values are preserved), and any quotes inside the values are doubled; csv.reader() removes the outer quotes and returns strings in which the doubled quotes are collapsed to single ones for the 3rd column.
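As a small illustration of that quoting behaviour, here is a sketch that parses one line of the sample data on its own. The variable names `line` and `row` are only illustrative, and `skipinitialspace=True` is used so that the spaces after the commas do not interfere with the quoting.

```python
import csv

# One line of the sample data from the question.
line = '"harry potter", "book", "12 mar 2014 note=""good"" language=""english"""'

# csv.reader accepts any iterable of strings, so a one-element list works.
row = next(csv.reader([line], skipinitialspace=True))

print(row[0])  # harry potter
print(row[2])  # 12 mar 2014 note="good" language="english"
```

The outer quotes are gone and each `""` has become a plain `"`, which is the form the named-field matching needs.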

The named fields can then be handled with a regular expression:

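    import csv
    import re

    # Matches key="value" pairs; keys contain no quotes, '=' or spaces
    keyvalue = re.compile(r'([^"= ]+)="([^"]+)"')

    # filename is the path to the exported file.
    # Python 3: open in text mode with newline='' (Python 2 would use 'rb').
    with open(filename, newline='') as infh:
        reader = csv.reader(infh, skipinitialspace=True)
        namedfields = [dict(keyvalue.findall(row[2])) for row in reader]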

The skipinitialspace option removes the spaces directly after a delimiter; this is needed to ensure that the spaces before the quoted column values are removed, which in turn ensures that the quoting is handled correctly.
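To see why the option matters, the following sketch parses the same line with and without it. The names `strict` and `loose` are made up for this example. Without the option, a field that begins with a space is not treated as quoted, so its quotes are kept literally.

```python
import csv

line = '"harry potter", "book", "12 mar 2014 note=""good"" language=""english"""'

# With skipinitialspace the leading space is dropped, so the quotes are parsed.
strict = next(csv.reader([line], skipinitialspace=True))

# Without it the field starts with a space and the quotes stay in the value.
loose = next(csv.reader([line]))

print(repr(strict[1]))  # 'book'
print(repr(loose[1]))   # ' "book"'
print(strict[2] == loose[2])  # False
```

In the second case the third column still contains the doubled quotes, so a pattern that expects `key="value"` would no longer find the named fields.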

The findall() method of the compiled pattern returns a list of (key, value) tuples here, and the dict() type can turn such a list directly into a dictionary.
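The tuple-to-dict step can be looked at in isolation. This sketch assumes the third column has already been unquoted by csv.reader(); `field` and `pairs` are illustrative names.

```python
import re

keyvalue = re.compile(r'([^"= ]+)="([^"]+)"')

# The third column as csv.reader() returns it.
field = '12 mar 2014 note="good" language="english"'

# Two capture groups, so findall() yields one (key, value) tuple per match.
pairs = keyvalue.findall(field)
print(pairs)        # [('note', 'good'), ('language', 'english')]

# dict() accepts any iterable of 2-item sequences.
print(dict(pairs))  # {'note': 'good', 'language': 'english'}
```

The leading date is skipped simply because it contains no `key="value"` pair for the pattern to match.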

Demo:

    >>> import csv
    >>> import re
    >>> keyvalue = re.compile(r'([^"= ]+)="([^"]+)"')
    >>> sample = '''\
    ... "harry potter", "book", "12 mar 2014 note=""good"" language=""english"""
    ... "forrest gump", "movie", "14 march 2015 note=""good"" language=""aztec"""
    ... '''
    >>> reader = csv.reader(sample.splitlines(True), skipinitialspace=True)
    >>> [dict(keyvalue.findall(row[2])) for row in reader]
    [{'note': 'good', 'language': 'english'}, {'note': 'good', 'language': 'aztec'}]
