Chapter 8. Data wrangling: join, combine, and reshape
8.1 Hierarchical indexing
This feature of pandas allows one Series or DataFrame to have multiple levels of indexing.
import pandas as pd
import numpy as np
np .random .seed (0 )
data = pd .Series (np .random .randn (9 ),
index = [['a' , 'a' , 'a' , 'b' , 'b' , 'c' , 'c' , 'd' , 'd' ],
[1 ,2 ,3 ,1 ,3 ,1 ,2 ,2 ,3 ]])
data
a 1 1.764052
2 0.400157
3 0.978738
b 1 2.240893
3 1.867558
c 1 -0.977278
2 0.950088
d 2 -0.151357
3 -0.103219
dtype: float64
MultiIndex([('a', 1),
('a', 2),
('a', 3),
('b', 1),
('b', 3),
('c', 1),
('c', 2),
('d', 2),
('d', 3)],
)
1 2.240893
3 1.867558
dtype: float64
b 1 2.240893
3 1.867558
c 1 -0.977278
2 0.950088
dtype: float64
b 1 2.240893
3 1.867558
d 2 -0.151357
3 -0.103219
dtype: float64
a 0.400157
c 0.950088
d -0.151357
dtype: float64
This example data can easily be converted into a DataFrame.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
1
2
3
a
1.764052
0.400157
0.978738
b
2.240893
NaN
1.867558
c
-0.977278
0.950088
NaN
d
NaN
-0.151357
-0.103219
a 1 1.764052
2 0.400157
3 0.978738
b 1 2.240893
3 1.867558
c 1 -0.977278
2 0.950088
d 2 -0.151357
3 -0.103219
dtype: float64
Either axis of a DataFrame can have multilevel indexing.
frame = pd .DataFrame (np .arange (12 ).reshape ((4 , 3 )),
index = [['a' , 'a' , 'b' , 'b' ], [1 , 2 , 1 , 2 ]],
columns = [['Ohio' , 'Ohio' , 'Colorado' ],
['Green' , 'Red' , 'Green' ]])
frame
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
</style>
Ohio
Colorado
Green
Red
Green
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
frame .index .names = ['key1' , 'key2' ]
frame
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
Ohio
Colorado
Green
Red
Green
key1
key2
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
frame .columns .names = ['state' , 'color' ]
frame
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
2
3
4
5
b
1
6
7
8
2
9
10
11
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
color
Green
Red
key1
key2
a
1
0
1
2
3
4
b
1
6
7
2
9
10
Reordering and sorting levels
The swaplevel() method takes two level numbers or names and return a new object with those levels swapped.
frame .swaplevel ('key1' , 'key2' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
state
Ohio
Colorado
color
Green
Red
Green
key2
key1
1
a
0
1
2
2
a
3
4
5
1
b
6
7
8
2
b
9
10
11
The sort_index() method sorts the data using only the values of a single level.
frame .sort_index (level = 1 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
state
Ohio
Colorado
color
Green
Red
Green
key1
key2
a
1
0
1
2
b
1
6
7
8
a
2
3
4
5
b
2
9
10
11
It is common to use these in tandem to re-order a DataFrame after swapping levels.
frame .swaplevel (0 , 1 ).sort_index (level = 0 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
state
Ohio
Colorado
color
Green
Red
Green
key2
key1
1
a
0
1
2
b
6
7
8
2
a
3
4
5
b
9
10
11
Summary statistics by level
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
state
Ohio
Colorado
color
Green
Red
Green
key2
1
6
8
10
2
12
14
16
frame .sum (level = 'color' , axis = 1 )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
color
Green
Red
key1
key2
a
1
2
1
2
8
4
b
1
14
7
2
20
10
Indexing with a DataFrame's columns
It is common to want to use a column for a row index or, vice versa, turn a row index into a column.
frame = pd .DataFrame ({
'a' : range (7 ), 'b' : range (7 , 0 , - 1 ),
'c' : ['one' , 'one' , 'one' , 'two' , 'two' , 'two' , 'two' ],
'd' :[0 ,1 ,2 ,0 ,1 ,2 ,3 ]
})
frame
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
0
0
7
one
0
1
1
6
one
1
2
2
5
one
2
3
3
4
two
0
4
4
3
two
1
5
5
2
two
2
6
6
1
two
3
frame2 = frame .set_index (['c' , 'd' ])
frame2
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
one
0
0
7
1
1
6
2
2
5
two
0
3
4
1
4
3
2
5
2
3
6
1
frame .set_index (['c' , 'd' ], drop = False )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
c
d
one
0
0
7
one
0
1
1
6
one
1
2
2
5
one
2
two
0
3
4
two
0
1
4
3
two
1
2
5
2
two
2
3
6
1
two
3
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
c
d
a
b
0
one
0
0
7
1
one
1
1
6
2
one
2
2
5
3
two
0
3
4
4
two
1
4
3
5
two
2
5
2
6
two
3
6
1
8.2 Combining and merging datasets
There are multiple ways to combine DataFrames:
pandas.merge() connects rows of DataFrames based on keys (join)
pandas.concat() binds DataFrames together.
df.combine_first() splices overlapping data to fill in missing values in one DataFrame with those that exist in another.
Database-style DataFrame joins
Merge or join opterations combine datasets by linking rows using one or more keys .
The default merge is an inner join.
df1 = pd .DataFrame ({'key' : ['b' , 'b' , 'a' , 'c' , 'a' , 'a' , 'b' ], 'data1' : range (7 )})
df1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
data1
0
b
0
1
b
1
2
a
2
3
c
3
4
a
4
5
a
5
6
b
6
df2 = pd .DataFrame ({'key' : ['a' , 'b' , 'd' ], 'data2' : range (3 )})
df2
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
data2
0
a
0
1
b
1
2
d
2
pd .merge (df1 , df2 , on = 'key' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
data1
data2
0
b
0
1
1
b
1
1
2
b
6
1
3
a
2
0
4
a
4
0
5
a
5
0
If the column names for the keys are different in each DataFrame, they must be specified.
df3 = pd .DataFrame ({'lkey' : ['b' , 'b' , 'a' , 'c' , 'a' , 'a' , 'b' ], 'data1' : range (7 )})
df4 = pd .DataFrame ({'rkey' : ['a' , 'b' , 'd' ], 'data2' : range (3 )})
pd .merge (df3 , df4 , left_on = 'lkey' , right_on = 'rkey' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
lkey
data1
rkey
data2
0
b
0
b
1
1
b
1
b
1
2
b
6
b
1
3
a
2
a
0
4
a
4
a
0
5
a
5
a
0
The type of join can be specified by passing the how argument "inner", "left", "right", or "outer".
pd .merge (df1 , df2 , how = 'outer' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
data1
data2
0
b
0.0
1.0
1
b
1.0
1.0
2
b
6.0
1.0
3
a
2.0
0.0
4
a
4.0
0.0
5
a
5.0
0.0
6
c
3.0
NaN
7
d
NaN
2.0
Multiple keys can be used for a merge by passing a list to the on argument.
left = pd .DataFrame ({
'key1' : ['foo' , 'foo' , 'bar' ],
'key2' : ['one' , 'two' , 'one' ],
'lval' : [1 , 2 , 3 ]
})
right = pd .DataFrame ({
'key1' : ['foo' , 'foo' , 'bar' , 'bar' ],
'key2' : ['one' , 'one' , 'one' , 'two' ],
'rval' : [4 , 5 , 6 , 7 ]
})
pd .merge (left , right , on = ['key1' , 'key2' ], how = 'outer' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
lval
rval
0
foo
one
1.0
4.0
1
foo
one
1.0
5.0
2
foo
two
2.0
NaN
3
bar
one
3.0
6.0
4
bar
two
NaN
7.0
The suffixes argument is used to deal with column name conflicts.
pd .merge (left , right , on = 'key1' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2_x
lval
key2_y
rval
0
foo
one
1
one
4
1
foo
one
1
one
5
2
foo
two
2
one
4
3
foo
two
2
one
5
4
bar
one
3
one
6
5
bar
one
3
two
7
pd .merge (left , right , on = 'key1' , suffixes = ('_left' , '_right' ))
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2_left
lval
key2_right
rval
0
foo
one
1
one
4
1
foo
one
1
one
5
2
foo
two
2
one
4
3
foo
two
2
one
5
4
bar
one
3
one
6
5
bar
one
3
two
7
left1 = pd .DataFrame ({'key' : ['a' , 'b' , 'a' , 'a' , 'b' , 'c' ], 'value' : range (6 )})
left1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
value
0
a
0
1
b
1
2
a
2
3
a
3
4
b
4
5
c
5
right1 = pd .DataFrame ({'group_val' : [3.5 , 7 ]}, index = ['a' , 'b' ])
right1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
pd .merge (left1 , right1 , left_on = 'key' , right_index = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
value
group_val
0
a
0
3.5
2
a
2
3.5
3
a
3
3.5
1
b
1
7.0
4
b
4
7.0
Hierarchically indexed data is a bit more complicated to merge, though can still be done by either passing the column names of the keys or by using the indices.
The multiple column names must be passed as a list.
lefth = pd .DataFrame ({'key1' : ['Ohio' , 'Ohio' , 'Ohio' , 'Nevada' , 'Nevada' ],
'key2' : [2000 , 2001 , 2002 , 2001 , 2002 ],
'data' : np .arange (5. )})
lefth
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data
0
Ohio
2000
0.0
1
Ohio
2001
1.0
2
Ohio
2002
2.0
3
Nevada
2001
3.0
4
Nevada
2002
4.0
righth = pd .DataFrame (np .arange (12 ).reshape ((6 , 2 )),
index = [['Nevada' , 'Nevada' , 'Ohio' , 'Ohio' ,
'Ohio' , 'Ohio' ],
[2001 , 2000 , 2000 , 2000 , 2001 , 2002 ]],
columns = ['event1' , 'event2' ])
righth
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
event1
event2
Nevada
2001
0
1
2000
2
3
Ohio
2000
4
5
2000
6
7
2001
8
9
2002
10
11
pd .merge (lefth , righth , left_on = ['key1' , 'key2' ], right_index = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data
event1
event2
0
Ohio
2000
0.0
4
5
0
Ohio
2000
0.0
6
7
1
Ohio
2001
1.0
8
9
2
Ohio
2002
2.0
10
11
3
Nevada
2001
3.0
0
1
pd .merge (lefth , righth , left_on = ['key1' , 'key2' ], right_index = True , how = 'outer' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key1
key2
data
event1
event2
0
Ohio
2000
0.0
4.0
5.0
0
Ohio
2000
0.0
6.0
7.0
1
Ohio
2001
1.0
8.0
9.0
2
Ohio
2002
2.0
10.0
11.0
3
Nevada
2001
3.0
0.0
1.0
4
Nevada
2002
4.0
NaN
NaN
4
Nevada
2000
NaN
2.0
3.0
Joining can also be done using the method instance.
pd .merge (left1 , right1 , left_on = 'key' , right_index = True )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
value
group_val
0
a
0
3.5
2
a
2
3.5
3
a
3
3.5
1
b
1
7.0
4
b
4
7.0
left1 .join (right1 , on = 'key' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
value
group_val
0
a
0
3.5
1
b
1
7.0
2
a
2
3.5
3
a
3
3.5
4
b
4
7.0
5
c
5
NaN
Concatenating along an axis
NumPy can bind arrays using the concatenate() function.
arr = np .arange (12 ).reshape ((3 , 4 ))
arr
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
np .concatenate ([arr , arr ], axis = 1 )
array([[ 0, 1, 2, 3, 0, 1, 2, 3],
[ 4, 5, 6, 7, 4, 5, 6, 7],
[ 8, 9, 10, 11, 8, 9, 10, 11]])
Series and DataFrames can be concatenated using the concat() function in pandas.
s1 = pd .Series ([0 , 1 ], index = ['a' , 'b' ])
s2 = pd .Series ([2 , 3 , 4 ], index = ['c' , 'd' , 'e' ])
s3 = pd .Series ([5 , 6 ], index = ['f' , 'g' ])
pd .concat ([s1 , s2 , s3 ])
a 0
b 1
c 2
d 3
e 4
f 5
g 6
dtype: int64
pd .concat ([s1 , s2 , s3 ], axis = 1 )
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
0
1
2
a
0.0
NaN
NaN
b
1.0
NaN
NaN
c
NaN
2.0
NaN
d
NaN
3.0
NaN
e
NaN
4.0
NaN
f
NaN
NaN
5.0
g
NaN
NaN
6.0
s4 = pd .concat ([s1 , s3 ])
s4
a 0
b 1
f 5
g 6
dtype: int64
pd .concat ([s1 , s4 ], axis = 1 )
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
0
1
a
0.0
0
b
1.0
1
f
NaN
5
g
NaN
6
pd .concat ([s1 , s4 ], axis = 1 , join = 'inner' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
The source of each piece of the joined product can be identified using the keys argument.
result = pd .concat ([s1 , s2 , s3 ], keys = ['one' , 'two' , 'three' ])
result
one a 0
b 1
two c 2
d 3
e 4
three f 5
g 6
dtype: int64
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
f
g
one
0.0
1.0
NaN
NaN
NaN
NaN
NaN
two
NaN
NaN
2.0
3.0
4.0
NaN
NaN
three
NaN
NaN
NaN
NaN
NaN
5.0
6.0
Most of the same logic applies to concatenating DataFrames.
df1 = pd .DataFrame (np .arange (6 ).reshape (3 , 2 ), index = ['a' , 'b' , 'c' ], columns = ['one' , 'two' ])
df1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
one
two
a
0
1
b
2
3
c
4
5
df2 = pd .DataFrame (5 + np .arange (4 ).reshape (2 , 2 ), index = ['a' , 'c' ], columns = ['three' , 'four' ])
df2
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
pd .concat ([df1 , df2 ], axis = 1 , keys = ['level1' , 'level2' ])
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
</style>
level1
level2
one
two
three
four
a
0
1
5.0
6.0
b
2
3
NaN
NaN
c
4
5
7.0
8.0
pd .concat ({'level1' : df1 , 'level2' : df2 }, axis = 1 )
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
</style>
level1
level2
one
two
three
four
a
0
1
5.0
6.0
b
2
3
NaN
NaN
c
4
5
7.0
8.0
pd .concat ([df1 , df2 ], axis = 1 , keys = ['level1' , 'level2' ], names = ['upper' , 'lower' ])
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
</style>
upper
level1
level2
lower
one
two
three
four
a
0
1
5.0
6.0
b
2
3
NaN
NaN
c
4
5
7.0
8.0
If one DataFrame does not contain a column that is in the other, the values are filled with NaN by default.
If the row indices are irrelevant, they can be ignored and not used for arranging the data.
df1 = pd .DataFrame (np .random .randn (3 , 4 ), columns = ['a' , 'b' , 'c' , 'd' ])
df1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
0
0.410599
0.144044
1.454274
0.761038
1
0.121675
0.443863
0.333674
1.494079
2
-0.205158
0.313068
-0.854096
-2.552990
df2 = pd .DataFrame (np .random .randn (2 , 3 ), columns = ['b' , 'd' , 'a' ])
df2
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
b
d
a
0
0.653619
0.864436
-0.742165
1
2.269755
-1.454366
0.045759
pd .concat ([df1 , df2 ], ignore_index = True )
/opt/anaconda3/envs/daysOfCode-env/lib/python3.7/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
"""Entry point for launching an IPython kernel.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
0
0.410599
0.144044
1.454274
0.761038
1
0.121675
0.443863
0.333674
1.494079
2
-0.205158
0.313068
-0.854096
-2.552990
3
-0.742165
0.653619
NaN
0.864436
4
0.045759
2.269755
NaN
-1.454366
Combining data with overlap
The combine_first method effectively merges two highly-simillar Series or DataFrames, but uses the values from the second to "patch" the missing values in the first.
a = pd .Series ([np .nan , 2.5 , np .nan , 3.5 , 4.5 , np .nan ],
index = ['f' , 'e' , 'd' , 'c' , 'b' , 'a' ])
b = pd .Series (np .arange (len (a ), dtype = np .float64 ),
index = ['f' , 'e' , 'd' , 'c' , 'b' , 'a' ])
b [- 1 ] = np .nan
f NaN
e 2.5
d NaN
c 3.5
b 4.5
a NaN
dtype: float64
f 0.0
e 1.0
d 2.0
c 3.0
b 4.0
a NaN
dtype: float64
np .where (pd .isnull (a ), b , a ) # The manual way of replacing missing data.
array([0. , 2.5, 2. , 3.5, 4.5, nan])
b [:- 2 ].combine_first (a [2 :]) # This is equivalent, but more consice.
a NaN
b 4.5
c 3.0
d 2.0
e 1.0
f 0.0
dtype: float64
DataFrames behave simillarly.
df1 = pd .DataFrame (
{'a' : [1. , np .nan , 5. , np .nan ],
'b' : [np .nan , 2. , np .nan , 6. ],
'c' : range (2 , 18 , 4 )
})
df1
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
0
1.0
NaN
2
1
NaN
2.0
6
2
5.0
NaN
10
3
NaN
6.0
14
df2 = pd .DataFrame ({
'a' : [5. , 4. , np .nan , 3. , 7. ],
'b' : [np .nan , 3. , 4. , 6. , 8. ]
})
df2
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
0
5.0
NaN
1
4.0
3.0
2
NaN
4.0
3
3.0
6.0
4
7.0
8.0
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
0
1.0
NaN
2.0
1
4.0
2.0
6.0
2
5.0
4.0
10.0
3
3.0
6.0
14.0
4
7.0
8.0
NaN
8.3 Reshaping and pivoting
Reshaping with hierarchical indexing
Hierarchical indexing provides a consistent way to rearrange data in a DataFrame.
The two primary actions are:
stack: pivots the columns to the rows
unstack: pivots from the rows into the columns
data = pd .DataFrame (np .arange (6 ).reshape ((2 , 3 )),
index = pd .Index (['Ohio' , 'Colorado' ], name = 'state' ),
columns = pd .Index (['one' , 'two' , 'three' ], name = 'number' ))
data
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
number
one
two
three
state
Ohio
0
1
2
Colorado
3
4
5
result = data .stack ()
result
state number
Ohio one 0
two 1
three 2
Colorado one 3
two 4
three 5
dtype: int64
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
number
one
two
three
state
Ohio
0
1
2
Colorado
3
4
5
By default, the innermost level is unstacked or stacked.
Different levels can be used by passing the level number or name.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
state
Ohio
Colorado
number
one
0
3
two
1
4
three
2
5
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
state
Ohio
Colorado
number
one
0
3
two
1
4
three
2
5
If not all of the values exist for a level, missing data will be introduced.
s1 = pd .Series ([0 , 1 , 2 , 3 ], index = ['a' , 'b' , 'c' , 'd' ])
s2 = pd .Series ([4 , 5 , 6 ], index = ['c' , 'd' , 'e' ])
data2 = pd .concat ([s1 , s2 ], keys = ['one' , 'two' ])
data2
one a 0
b 1
c 2
d 3
two c 4
d 5
e 6
dtype: int64
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
a
b
c
d
e
one
0.0
1.0
2.0
3.0
NaN
two
NaN
NaN
4.0
5.0
6.0
Stacking filters out the missing values by default, though this can be prevented by setting dropna=False.
one a 0.0
b 1.0
c 2.0
d 3.0
two c 4.0
d 5.0
e 6.0
dtype: float64
data2 .unstack ().stack (dropna = False )
one a 0.0
b 1.0
c 2.0
d 3.0
e NaN
two a NaN
b NaN
c 4.0
d 5.0
e 6.0
dtype: float64
When unstacking a DataFrame, the unstacked level becomes the lowest level in the result.
df = pd .DataFrame ({
'left' : result ,
'right' : result + 5 },
columns = pd .Index (['left' , 'right' ], name = 'side' ))
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
side
left
right
state
number
Ohio
one
0
5
two
1
6
three
2
7
Colorado
one
3
8
two
4
9
three
5
10
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
side
left
right
state
Ohio
Colorado
Ohio
Colorado
number
one
0
3
5
8
two
1
4
6
9
three
2
5
7
10
Just like with unstack, the axis for stacking can be specified.
df .unstack ('state' ).stack ('side' )
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
state
Colorado
Ohio
number
side
one
left
3
0
right
8
5
two
left
4
1
right
9
6
three
left
5
2
right
10
7
Pivoting "long" to "wide" format
The author uses a long format read in from a CSV as an example of wrangling and cleaning.
data = pd .read_csv ("assets/examples/macrodata.csv" )
data .head ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
year
quarter
realgdp
realcons
realinv
realgovt
realdpi
cpi
m1
tbilrate
unemp
pop
infl
realint
0
1959.0
1.0
2710.349
1707.4
286.898
470.045
1886.9
28.98
139.7
2.82
5.8
177.146
0.00
0.00
1
1959.0
2.0
2778.801
1733.7
310.859
481.301
1919.7
29.15
141.7
3.08
5.1
177.830
2.34
0.74
2
1959.0
3.0
2775.488
1751.8
289.226
491.260
1916.4
29.35
140.5
3.82
5.3
178.657
2.74
1.09
3
1959.0
4.0
2785.204
1753.7
299.356
484.052
1931.3
29.37
140.0
4.33
5.6
179.386
0.27
4.06
4
1960.0
1.0
2847.699
1770.5
331.722
462.199
1955.5
29.54
139.6
3.50
5.2
180.007
2.31
1.19
periods = pd .PeriodIndex (year = data .year , quarter = data .quarter , name = 'date' )
columns = pd .Index (['realgdp' , 'infl' , 'unemp' ], name = 'item' )
data = data .reindex (columns = columns )
data .index = periods .to_timestamp ('D' , 'end' )
ldata = data .stack ().reset_index ().rename (columns = {0 : 'value' })
ldata
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
date
item
value
0
1959-03-31 23:59:59.999999999
realgdp
2710.349
1
1959-03-31 23:59:59.999999999
infl
0.000
2
1959-03-31 23:59:59.999999999
unemp
5.800
3
1959-06-30 23:59:59.999999999
realgdp
2778.801
4
1959-06-30 23:59:59.999999999
infl
2.340
...
...
...
...
604
2009-06-30 23:59:59.999999999
infl
3.370
605
2009-06-30 23:59:59.999999999
unemp
9.200
606
2009-09-30 23:59:59.999999999
realgdp
12990.341
607
2009-09-30 23:59:59.999999999
infl
3.560
608
2009-09-30 23:59:59.999999999
unemp
9.600
609 rows × 3 columns
The pivot() method of a DataFrame spreads out the information into a wide format.
pivoted = ldata .pivot (index = 'date' , columns = 'item' , values = 'value' )
pivoted
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
item
infl
realgdp
unemp
date
1959-03-31 23:59:59.999999999
0.00
2710.349
5.8
1959-06-30 23:59:59.999999999
2.34
2778.801
5.1
1959-09-30 23:59:59.999999999
2.74
2775.488
5.3
1959-12-31 23:59:59.999999999
0.27
2785.204
5.6
1960-03-31 23:59:59.999999999
2.31
2847.699
5.2
...
...
...
...
2008-09-30 23:59:59.999999999
-3.16
13324.600
6.0
2008-12-31 23:59:59.999999999
-8.79
13141.920
6.9
2009-03-31 23:59:59.999999999
0.94
12925.410
8.1
2009-06-30 23:59:59.999999999
3.37
12901.504
9.2
2009-09-30 23:59:59.999999999
3.56
12990.341
9.6
203 rows × 3 columns
Here is an example of reshaping two columns simultaneously.
It results in hierarchically indexed columns.
ldata ['value2' ] = np .random .randn (len (ldata ))
ldata .head ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
date
item
value
value2
0
1959-03-31 23:59:59.999999999
realgdp
2710.349
-0.187184
1
1959-03-31 23:59:59.999999999
infl
0.000
1.532779
2
1959-03-31 23:59:59.999999999
unemp
5.800
1.469359
3
1959-06-30 23:59:59.999999999
realgdp
2778.801
0.154947
4
1959-06-30 23:59:59.999999999
infl
2.340
0.378163
pivoted = ldata .pivot ('date' , 'item' )
pivoted .head ()
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead tr th {
text-align: left;
}
.dataframe thead tr:last-of-type th {
text-align: right;
}
</style>
value
value2
item
infl
realgdp
unemp
infl
realgdp
unemp
date
1959-03-31 23:59:59.999999999
0.00
2710.349
5.8
1.532779
-0.187184
1.469359
1959-06-30 23:59:59.999999999
2.34
2778.801
5.1
0.378163
0.154947
-0.887786
1959-09-30 23:59:59.999999999
2.74
2775.488
5.3
-0.347912
-1.980796
0.156349
1959-12-31 23:59:59.999999999
0.27
2785.204
5.6
1.202380
1.230291
-0.387327
1960-03-31 23:59:59.999999999
2.31
2847.699
5.2
-1.048553
-0.302303
-1.420018
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
item
infl
realgdp
unemp
date
1959-03-31 23:59:59.999999999
0.00
2710.349
5.8
1959-06-30 23:59:59.999999999
2.34
2778.801
5.1
1959-09-30 23:59:59.999999999
2.74
2775.488
5.3
1959-12-31 23:59:59.999999999
0.27
2785.204
5.6
1960-03-31 23:59:59.999999999
2.31
2847.699
5.2
Pivoting "wide" to "long" format
The inverse for the pivot() method is melt().
df = pd .DataFrame (
{'key' : ['foo' , 'bar' , 'baz' ],
'A' : [1 , 2 , 3 ],
'B' : [4 , 5 , 6 ],
'C' : [7 , 8 , 9 ]
})
df
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
A
B
C
0
foo
1
4
7
1
bar
2
5
8
2
baz
3
6
9
melted = pd .melt (df , id_vars = ['key' ])
melted
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
variable
value
0
foo
A
1
1
bar
A
2
2
baz
A
3
3
foo
B
4
4
bar
B
5
5
baz
B
6
6
foo
C
7
7
bar
C
8
8
baz
C
9
A subset of volumns to use for values can be specified.
pd .melt (df , id_vars = ['key' ], value_vars = ['A' , 'B' ])
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
key
variable
value
0
foo
A
1
1
bar
A
2
2
baz
A
3
3
foo
B
4
4
bar
B
5
5
baz
B
6
Also, they can be no group identifiers.
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
variable
value
0
key
foo
1
key
bar
2
key
baz
3
A
1
4
A
2
5
A
3
6
B
4
7
B
5
8
B
6
9
C
7
10
C
8
11
C
9
pd .melt (df , value_vars = ['A' , 'B' , 'C' ])
<style scoped>
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
</style>
variable
value
0
A
1
1
A
2
2
A
3
3
B
4
4
B
5
5
B
6
6
C
7
7
C
8
8
C
9