python - Efficiently joining two DataFrames based on multiple levels of a MultiIndex
I have a DataFrame with a large MultiIndex, and a secondary DataFrame whose MultiIndex is a subset of the larger one. The secondary DataFrame is a kind of lookup table, and I want to add its columns to the larger DataFrame. Since the primary DataFrame is large, I want to do this efficiently.
Here is an imaginary example, where I want to join df2 onto df1:
```python
In [11]: arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
   ....:           ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
   ....:           ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]

In [12]: tuples = list(zip(*arrays))

In [13]: index = pd.MultiIndex.from_tuples(tuples, names=['body', 'season', 'item'])

In [14]: df1 = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['a', 'b'])

In [15]: df1
Out[15]:
                          a         b
body season item
sun  summer one   -0.121588  0.272774
     winter one    0.233562 -2.005623
            two   -1.034642  0.315065
moon summer one    0.184548  0.820873
            two    0.838290  0.495047
            three  0.450813 -2.040089
     winter one   -1.149993 -0.498148
            two    2.406824 -2.031849

[8 rows x 2 columns]

In [16]: index2 = pd.MultiIndex.from_tuples([('sun', 'summer'), ('sun', 'winter'),
   ....:                                     ('moon', 'summer'), ('moon', 'winter')],
   ....:                                    names=['body', 'season'])

In [17]: df2 = pd.DataFrame(['good', 'bad', 'ugly', 'confused'], index=index2, columns=['mood'])

In [18]: df2
Out[18]:
                 mood
body season
sun  summer      good
     winter       bad
moon summer      ugly
     winter  confused

[4 rows x 1 columns]
```
Now, suppose I want to add the columns of df2 to df1. The one-liner below is the best way I've found to do the job:
```python
In [19]: df1 = df1.reset_index().join(df2, on=['body', 'season']).set_index(df1.index.names)

In [20]: df1
Out[20]:
                          a         b      mood
body season item
sun  summer one   -0.121588  0.272774      good
     winter one    0.233562 -2.005623       bad
            two   -1.034642  0.315065       bad
moon summer one    0.184548  0.820873      ugly
            two    0.838290  0.495047      ugly
            three  0.450813 -2.040089      ugly
     winter one   -1.149993 -0.498148  confused
            two    2.406824 -2.031849  confused

[8 rows x 3 columns]
```
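For reference, the one-liner above can be run as a standalone script (this is just the question's session collected into one file, with the imports it assumes):

```python
import numpy as np
import pandas as pd

# Build the example frames from the question
arrays = [['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],
          ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],
          ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]
index = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['body', 'season', 'item'])
df1 = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['a', 'b'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'summer'), ('moon', 'winter')],
    names=['body', 'season'])
df2 = pd.DataFrame(['good', 'bad', 'ugly', 'confused'], index=index2, columns=['mood'])

# Flatten df1's index into columns, join on two of its three levels,
# then restore the original MultiIndex
result = df1.reset_index().join(df2, on=['body', 'season']).set_index(df1.index.names)
print(result)
```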
It works, but there are two problems with this method. First, the line is ugly. Needing to reset the index and then recreate the MultiIndex makes a simple operation seem needlessly complicated. Second, if I understand correctly, every time I run reset_index() and set_index(), a copy of the DataFrame is created. I am working with large DataFrames, and this seems inefficient.
Is there a better way to do this?
This is not implemented internally ATM; your solution is the recommended one, see the issue here.
You can wrap it in a function if you want to make it nicer. reset_index/set_index
do copy (though you can pass the inplace=True
argument if you want); inplace here just means changing the index attribute in place.
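To make the inplace variant concrete, here is a sketch of the same join spelled with inplace=True at each index step (a small illustrative frame; note the join itself still allocates a new frame regardless):

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one')],
    names=['body', 'season', 'item'])
df1 = pd.DataFrame(np.random.randn(2, 2), index=index, columns=['a', 'b'])

index2 = pd.MultiIndex.from_tuples([('sun', 'summer'), ('sun', 'winter')],
                                   names=['body', 'season'])
df2 = pd.DataFrame(['good', 'bad'], index=index2, columns=['mood'])

names = list(df1.index.names)     # remember the original levels
df1.reset_index(inplace=True)     # flatten the MultiIndex into columns, in place
df1 = df1.join(df2, on=['body', 'season'])  # the join still returns a new frame
df1.set_index(names, inplace=True)          # restore the MultiIndex in place
```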
You can patch in a nice function like:
```python
def merge_multi(self, df, on):
    return self.reset_index().join(df, on=on).set_index(self.index.names)

DataFrame.merge_multi = merge_multi

df1.merge_multi(df2, on=['body', 'season'])
```
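As a quick sanity check, the patched method gives the same result as the explicit reset_index/join/set_index chain (a self-contained sketch with a small made-up frame):

```python
import numpy as np
import pandas as pd
from pandas import DataFrame

def merge_multi(self, df, on):
    # Flatten the index, join on the named levels, then restore the index
    return self.reset_index().join(df, on=on).set_index(self.index.names)

DataFrame.merge_multi = merge_multi

index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('moon', 'winter', 'two')],
    names=['body', 'season', 'item'])
df1 = pd.DataFrame(np.random.randn(2, 2), index=index, columns=['a', 'b'])
df2 = pd.DataFrame(['good', 'confused'],
                   index=pd.MultiIndex.from_tuples(
                       [('sun', 'summer'), ('moon', 'winter')],
                       names=['body', 'season']),
                   columns=['mood'])

merged = df1.merge_multi(df2, on=['body', 'season'])
explicit = df1.reset_index().join(df2, on=['body', 'season']).set_index(df1.index.names)
```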
However, merging by definition creates new data, so I'm not sure how much this will save you.
A better method is to build up the smaller frames first, then do the larger merge at the end. You might want to do it this way.
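For what it's worth, newer pandas versions (0.23+, if I remember the changelog correctly) can join two MultiIndexed frames directly when the right frame's index levels are a subset of the left's, which avoids the reset_index/set_index dance entirely. A sketch:

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('sun', 'summer', 'one'), ('sun', 'winter', 'one'), ('moon', 'winter', 'two')],
    names=['body', 'season', 'item'])
df1 = pd.DataFrame(np.random.randn(3, 2), index=index, columns=['a', 'b'])

index2 = pd.MultiIndex.from_tuples(
    [('sun', 'summer'), ('sun', 'winter'), ('moon', 'winter')],
    names=['body', 'season'])
df2 = pd.DataFrame(['good', 'bad', 'confused'], index=index2, columns=['mood'])

# df2's index levels (body, season) are a subset of df1's,
# so join aligns on the shared level names and keeps df1's full index
result = df1.join(df2)
print(result[['mood']])
```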