python - Efficiently joining two dataframes based on multiple levels of a multiindex -


i have dataframe large multiindex, , secondary dataframe multiindex subset of larger one. secondary dataframe kind of lookup table. want add columns lookup table larger dataframe. primary dataframe large, want efficiently.

here imaginary example, want join df2 df1:

   in [11]: arrays = [    ['sun', 'sun', 'sun', 'moon', 'moon', 'moon', 'moon', 'moon'],    ....:               ['summer', 'winter', 'winter', 'summer', 'summer', 'summer', 'winter', 'winter'],    ....:               ['one', 'one', 'two', 'one', 'two', 'three', 'one', 'two']]  in [12]: tuples = list(zip(*arrays))  in [13]: index = pd.multiindex.from_tuples(tuples, names=['body', 'season','item'])  in [14]: df1 = pd.dataframe(np.random.randn(8,2), index=index,columns=['a','b'])  in [15]: df1 out[15]:                                   b body season item sun  summer 1   -0.121588  0.272774      winter 1    0.233562 -2.005623             2   -1.034642  0.315065 moon summer 1    0.184548  0.820873             2    0.838290  0.495047             3  0.450813 -2.040089      winter 1   -1.149993 -0.498148             2    2.406824 -2.031849  [8 rows x 2 columns]   in [16]: index2= pd.multiindex.from_tuples([('sun','summer'),('sun','winter'),('moon','summer'),('moon','winter')],names=['body','season'])  in [17]: df2 = pd.dataframe(['good','bad','ugly','confused'],index=index2,columns = ['mood'])  in [18]: df2 out[18]:                  mood body season sun  summer           winter       bad moon summer      ugly      winter  confused  [4 rows x 1 columns] 

now, suppose want add columns df2 df1? line way find job:

in [19]: df1 = df1.reset_index().join(df2,on=['body','season']).set_index(df1.index.names)  in [20]: df1 out[20]:                                   b      mood body season item sun  summer 1   -0.121588  0.272774           winter 1    0.233562 -2.005623       bad             2   -1.034642  0.315065       bad moon summer 1    0.184548  0.820873      ugly             2    0.838290  0.495047      ugly             3  0.450813 -2.040089      ugly      winter 1   -1.149993 -0.498148  confused             2    2.406824 -2.031849  confused  [8 rows x 3 columns] 

it works, there 2 problems method. first, line ugly. needing reset index, recreate multiindex, makes simple operation seem needlessly complicated. second, if understand correctly, every time run reset_index() , set_index(), copy of dataframe created. working large dataframes, , seems inefficient.

is there better way this?

this not implemented internally atm, soln recommended one, see here issue

you can wrap in function if want make nicer. reset_index/set_index copy (though can pass inplace=true argument if want); inplace these changing index attribute.

you patch in nice function like:

def merge_multi(self, df, on):     return self.reset_index().join(df,on=on).set_index(self.index.names) dataframe.merge_multi = merge_multi  df1.merge_multi(df2,on=['body','season']) 

however, merging definition creates new data, not sure how save you.

a better method build smaller frames, larger merge. might want this


Comments

Popular posts from this blog

c++ - OpenCV Error: Assertion failed <scn == 3 ::scn == 4> in unknown function, -

php - render data via PDO::FETCH_FUNC vs loop -

The canvas has been tainted by cross-origin data in chrome only -