Merge many tsv files by common key using Python pandas -
i have several hundred tsv files 2 fields, common key , unique sample id:
==> test1.vmat <== chrom:pos:ref:alt 144-93-02 1:14653:c:t 1 1:14677:g:a 1 1:14907:a:g 1 ==> test2.vmat <== chrom:pos:ref:alt 144-93-01 1:14653:c:t 1 1:14522:g:a 1 1:14907:a:g 1 i perform outer join on of files using field "chr:pos:ref:alt" form 1 giant matrix. example 2 files:
chrom:pos:ref:alt 144-93-02 144-93-01 1:14653:c:t 1.0 1.0 1:14522:g:a na 1.0 1:14677:g:a 1.0 na 1:14907:a:g 1.0 1.0 i got output above using following code, having trouble looping on hundreds of *tsv files in directory (path/to/testn.vmat). how can modify merge individual *tsv files directory single tsv file?
variant_field = "chrom:pos:ref:alt" outfile = "everyone.vmat" df1 = pandas.read_csv("path/to/test1.vmat", sep='\t', parse_dates=false) df2 = pandas.read_csv("path/to/test2.vmat", sep='\t', parse_dates=false) df3 = pandas.merge(df1,df2,on=variant_field, how='outer') df3.to_csv(str(outfile), sep="\t", header=true, index=false, na_rep="na", engine='python')
if make 'chrom:pos:ref:alt' index join multiple frames:
df1 = pandas.read_csv("path/to/test1.vmat", sep='\t', parse_dates=false, index_col='chrom:pos:ref:alt') in [11]: df1.join([df2], how='outer') out[11]: 144-93-02 144-93-01 1:14522:g:a nan 1 1:14653:c:t 1 1 1:14677:g:a 1 nan 1:14907:a:g 1 1 in someways it's more honest think of concat rather join:
in [12]: pd.concat([df1, df2], axis=1) out[12]: 144-93-02 144-93-01 1:14522:g:a nan 1 1:14653:c:t 1 1 1:14677:g:a 1 nan 1:14907:a:g 1 1 you can iterate through files using glob:
from glob import iglob pd.concat((pd.read_csv(f, ...) f in glob.iglob(*.vmat)), axis=1)
Comments
Post a Comment