Information Extraction from Text into Structured Data with Python


I'm a near-total outsider to programming, but I'm interested in it. I work in a shipbroking company, and we need to match positions (which ship is open, where and when) against orders (what kind of ship is needed, where, when, and for what kind of employment). We send and receive such information (positions and orders) by email, to and from our principals and co-brokers. There are thousands of such emails each day, and we do the matching by reading the emails manually.

I want to build an app that does this matching for us.

One important part of the app is extracting information from the email text.

==> My question is how to use Python to extract unstructured information into structured data.

A sample order email [annotations in brackets are not part of the email]:

    email subject: 20k dwt requirement, 20-30/mar, santos-conti

    content:

    acct abc [account name]
    abt 20,000 mt deadweight [size of ship needed]
    delivery make santos [delivery point/range; owners deliver the ship to charterers here]
    laycan 20-30/mar [laycan (the time spread in which delivery can be accepted)]
    1 time charter grains [what kind of employment/trade, cargo]
    duration 35 days [duration]
    redelivery 1 safe port continent [redelivery point/range; charterers redeliver the ship to owners here]

    broker name/email/phone...

    end email

The same email can be written in many different ways: some people write it all on one line, some use l/c instead of laycan, and so on. There are also emails for positions, giving the ship's name, open port, date range, the ship's deadweight, and other specs.

How can I extract this information and put it into structured data with Python? Let's say I have already put the email contents into text files. Thanks.

Below is a possible approach:

Step 1: Classify the mails into categories using the subject and/or the message body.

As you stated, one category is mails about positions and the other is mails about orders. Machine learning can be used to classify them, with a set of previous mails as the training corpus. You might consider using NLTK (Natural Language Toolkit) for Python. Here is a link on text classification using NLTK.
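To make Step 1 concrete, here is a minimal sketch of a bag-of-words classifier. It is a small hand-rolled naive Bayes written with the standard library only, standing in for a library classifier such as the ones NLTK provides; it is not NLTK's API. The class name, the tokenizer, and the four training mails are all made up for illustration, and a real training corpus would need to be far larger.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of mails
        self.vocab = set()

    def train(self, labelled_mails):
        for text, label in labelled_mails:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical, hand-written training mails (assumption, not real data).
TRAINING = [
    ("acct abc abt 20,000 mt deadweight delivery santos laycan 20-30/mar "
     "duration 35 days redelivery continent", "order"),
    ("requirement 30k dwt delivery singapore l/c 1-10/apr time charter "
     "redelivery china", "order"),
    ("mv example star open rotterdam 15-20/mar dwt 28,000 built 2010 geared",
     "position"),
    ("vessel open santos 22/mar 32k dwt grabs fitted", "position"),
]

clf = NaiveBayes()
clf.train(TRAINING)
print(clf.classify("acct xyz delivery santos laycan 1-5/apr redelivery continent"))
print(clf.classify("mv sample open rotterdam 18/mar dwt 25,000"))
```

With so little training data the result only reflects which words happen to appear in the four sample mails; the point is the shape of the pipeline (tokenize, train on labelled mails, classify a new mail), not the accuracy.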

Step 2: Once an email is identified as an order mail, process it to fetch the details (account name, size, time spread, etc.). As you mentioned, the challenge here is that there is no fixed format for these data. To solve this problem, you might consider preparing an exhaustive list of synonyms for each label (for example, for account the list would be ['acct', 'a/c', 'account', 'acnt']). This should only need to be done once, by going through a fixed volume of previous mails.
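The synonym list can be kept as a plain dictionary and inverted for lookup. The sketch below assumes each detail sits on its own line and starts with its label; only the account synonyms come from the text above, while the other entries and the function name resolve_label are illustrative guesses.

```python
# Synonym table: canonical label -> spellings seen in mails.
# Only the "account" entry comes from the text; the rest are assumptions.
SYNONYMS = {
    "account": ["acct", "a/c", "account", "acnt"],
    "laycan": ["laycan", "l/c"],
    "delivery": ["delivery"],
    "redelivery": ["redelivery"],
    "duration": ["duration"],
}

# Invert the table so a spelling maps straight to its canonical label.
LOOKUP = {syn: label for label, syns in SYNONYMS.items() for syn in syns}


def resolve_label(line):
    """Split a line into (canonical_label, value).

    Returns (None, line) when the first word is not a known synonym.
    """
    stripped = line.strip()
    first, _, rest = stripped.partition(" ")
    label = LOOKUP.get(first.lower().rstrip(":"))
    if label is None:
        return None, stripped
    return label, rest.strip()


print(resolve_label("acct abc"))
print(resolve_label("L/C 20-30/mar"))
print(resolve_label("accnt abc"))
```

Lines whose first word is unknown come back with a label of None, which makes them easy to collect and review later.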

To make the solution more effective, consider implementing an active learning option, i.e., prompt the user whenever a mail contains a label that is not found in the list. For example, if "accnt" is used in a mail, it won't be resolved, so the user should be prompted to say which category it falls into.
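One way to sketch this prompt-on-unknown behaviour is a lookup that falls back to asking the user and then remembers the answer. The function name resolve_or_ask and its ask parameter are hypothetical; ask defaults to the built-in input, and the demo passes a stand-in function so the example runs without a real prompt.

```python
def resolve_or_ask(keyword, lookup, labels, ask=input):
    """Return the canonical label for keyword, asking the user if it is unknown.

    A valid answer is stored in lookup, so the same spelling is not asked twice.
    """
    key = keyword.lower()
    if key in lookup:
        return lookup[key]
    prompt = f"Unknown label '{keyword}'. Which category is it? {sorted(labels)}: "
    answer = ask(prompt).strip().lower()
    if answer in labels:
        lookup[key] = answer  # learn the new synonym
        return answer
    return None  # the user's answer was not a known category


lookup = {"acct": "account", "a/c": "account", "account": "account", "acnt": "account"}
labels = {"account", "laycan"}

# Simulated user reply (assumption); in real use, leave ask at its default.
print(resolve_or_ask("accnt", lookup, labels, ask=lambda prompt: "account"))
print(lookup["accnt"])
```

In a real app the learned synonyms would also need to be saved somewhere (a file or database) so they survive between runs.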

Once the label is identified, you can use basic string operations to parse the email and fetch the relevant data in a structured format.
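As a rough illustration of that parsing step, the sketch below turns the body of the sample order email from the question into a dictionary, using string methods plus a few regular expressions. The field names, the accepted label spellings, and the patterns are assumptions that fit this one sample; real mails would need more patterns and more careful date handling.

```python
import re

SAMPLE = """acct abc
abt 20,000 mt deadweight
delivery make santos
laycan 20-30/mar
1 time charter grains
duration 35 days
redelivery 1 safe port continent"""


def parse_order(text):
    """Parse a one-detail-per-line order mail into a dict (illustrative only)."""
    order = {}
    for raw in text.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if lowered.startswith(("acct ", "a/c ", "account ", "acnt ")):
            order["account"] = line.split(None, 1)[1]
        elif lowered.startswith(("laycan ", "l/c ")):
            # e.g. "20-30/mar" -> days 20 to 30, month "mar"
            m = re.search(r"(\d{1,2})-(\d{1,2})/([a-z]{3})", lowered)
            if m:
                order["laycan"] = {
                    "from_day": int(m.group(1)),
                    "to_day": int(m.group(2)),
                    "month": m.group(3),
                }
        elif lowered.startswith("redelivery "):
            order["redelivery"] = line.split(None, 1)[1]
        elif lowered.startswith("delivery "):
            order["delivery"] = line.split(None, 1)[1]
        elif lowered.startswith("duration "):
            m = re.search(r"(\d+)\s*days", lowered)
            if m:
                order["duration_days"] = int(m.group(1))
        elif "deadweight" in lowered or "dwt" in lowered:
            # accepts "20,000" as well as the shorthand "20k"
            m = re.search(r"(\d[\d,]*)\s*(k)?", lowered)
            if m:
                size = int(m.group(1).replace(",", ""))
                order["dwt"] = size * 1000 if m.group(2) else size
    return order


order = parse_order(SAMPLE)
print(order)
```

Lines that match no rule, such as "1 time charter grains" here, are skipped without any warning; in practice they are good candidates for prompting the user.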

You can refer to this discussion for a better understanding.

