Pandas in a nutshell
Introduction
Pandas is an open source data analysis library in Python and it is extensively used for Data analysis, Data munging and Cleaning. Pandas has a high performing and user friendly data structures.
What makes Pandas a great choice for data analysis? It is it’s rich and highly performant data structures which are built on top of numpy and leverage it’s fast array operations. In thispost we will see how Pandas can be used efficiently for the data analysis and will explore some of it’s basic operations. Pandas has two types of data structures:
a) Series – it’s a one dimensional array with indexes, it stores a single column or row of data in a Dataframe
b) Dataframe – it’s a tabular spreadsheet like structure representing rows each of which contains one or multiple columns
First, Let’s do some import
<span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">pd</span>
<span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">numpy</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">np</span>
<span class="kn" style="color: #008080;">from</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">import</span> <span class="n" style="color: #0000ff;">Series</span><span class="p">,</span> <span class="n" style="color: #0000ff;">DataFrame
</span>
<span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">pd</span>
<span class="kn" style="color: #008080;">import</span> <span class="nn" style="color: #0000ff;">numpy</span> <span class="k" style="color: #008080;">as</span> <span class="nn" style="color: #0000ff;">np</span>
<span class="kn" style="color: #008080;">from</span> <span class="nn" style="color: #0000ff;">pandas</span> <span class="k" style="color: #008080;">import</span> <span class="n" style="color: #0000ff;">Series</span><span class="p">,</span> <span class="n" style="color: #0000ff;">DataFrame
</span>
Series
Create a Series with the list of values and default indexes
<span class="n">this</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">2</span></span><span class="p">])</span>
<span class="n">this</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">2</span></span><span class="p">])</span>
Check how the Series object looks like
<span class="n">this</span>
0 6
1 5
2 4
3 9
4 2
dtype: int6
<span class="n">this</span><span class="o">.</span><span class="n">values</span>
array([6, 5, 4, 9, 2])
Check the Index for this Series, Pandas generates the indexes automatically and default starts from 0
<span class="n">this</span><span class="o">.</span><span class="n">index</span>
Int64Index([0, 1, 2, 3, 4], dtype=’int64′)
Create a Series of US Cities with their Population and Index is set as the City for the Series, The Object takes the Indexes as defined with their corresponding column values i.e Population
citypopulation_2016 = Series([8537673,3976322,2704958,2303482,1615017],
index=[‘NYC’,‘Los Angeles’,‘Chicago’,‘Houston’,‘Phoenix’])
<span class="n">citypopulation_2016</span>
NYC 8537673
Los Angeles 3976322
Chicago 2704958
Houston 2303482
Phoenix 1615017
dtype: int64
Find Population lesser than 2.5 million
Find the rows with population lower than 2.5 million
<span class="n">citypopulation_2016</span><span class="p">[</span><span style="color: #993300;"><span class="n">citypopulation_2016</span><span class="o"><</span><span class="mi">2500000</span></span><span class="p">]</span>
Houston 2303482
Phoenix 1615017
dtype: int64
Is Chicago in the Series
Check if Chicago is in the Series or not, it will return True or False boolean value
<span class="s">'Chicago'</span> <span class="ow">in</span> <span class="n">citypopulation_2016</span>
True
Create a list of Index
Create a list of Indexes with one extra value Philadelphia
<span class="n">cities</span> <span class="o">=</span> <span class="p">[</span><span style="color: #993300;"><span class="s">'NYC'</span><span class="p">,</span><span class="s">'Los Angeles'</span><span class="p">,</span><span class="s">'Chicago'</span><span class="p">,</span><span class="s">'Houston'</span><span class="p">,</span><span class="s">'Phoenix'</span><span class="p">,</span><span class="s">'Philadelhia'</span></span><span class="p">]</span>
Create a Series with this new Indexes, The value for Philadelhia is NaN (Not a Number)
<span class="n">Ser</span> <span class="o">=</span> <span class="n">Series</span><span class="p">(</span><span class="n">citypopulation_2016</span><span class="p">,</span><span class="n">index</span><span class="o">=</span><span class="n">cities</span><span class="p">)</span>
<span class="n">Ser</span>
NYC 8537673
Los Angeles 3976322
Chicago 2704958
Houston 2303482
Phoenix 1615017
Philadelhia NaN
dtype: float64
Philadelphia not in list that’s why it is null
Check what are the null values in the above Series
<span class="n">Ser</span><span class="o">.</span><span class="n">isnull</span><span class="p">()</span>
NYC False
Los Angeles False
Chicago False
Houston False
Phoenix False
Philadelhia True
dtype: bool
Adding two series
Add the two Series object Ser & citypopulation_2016 and store the sum to a new object Ser1
<span class="n">Ser1</span> <span class="o">=</span> <span class="n">Ser</span><span class="o">+</span><span class="n">citypopulation_2016</span>
<span class="n">Ser1</span>
Chicago 5409916
Houston 4606964
Los Angeles 7952644
NYC 17075346
Philadelhia NaN
Phoenix 3230034
dtype: float64
Name the Series
<span class="n">Ser1</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">"US Populate cities"</span>
<span class="n">Ser1</span>
Cities
Chicago 5409916
Houston 4606964
Los Angeles 7952644
NYC 17075346
Philadelhia NaN
Phoenix 3230034
Name: US Populate cities, dtype: float64
Name the Index
<span class="n">Ser1</span><span class="o">.</span><span class="n">index</span><span class="o">.</span><span class="n">name</span> <span class="o">=</span> <span class="s">'Cities'</span>
<span class="n">Ser1</span>
Cities
Chicago 5409916
Houston 4606964
Los Angeles 7952644
NYC 17075346
Philadelhia NaN
Phoenix 3230034
Name: US Populate cities, dtype: float64
Dataframes
Dataframe has many functions to read the data from various sources like Json, CSV, Excel, HTML etc. ro dataframe object.Here we would be reading the Salary data from the CSV file
<span class="n">df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">"<span style="color: #993300;">/Users/vbabu/Documents/personal/MyGitWork/csv/mls-salaries-2016.csv</span>"</span><span class="p">)</span>
Now using the ‘head’ function you can see the first five rows by default and similarly for the bottom 5 rows use ‘tail’ function
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
club | last_name | first_name | position | base_salary | guaranteed_compensation |
---|---|---|---|---|---|
ATL | Burgos | Efrain | M | 62508 | 62508.0 |
ATL | Oblitey Otoo | Jeffrey | M | 51504 | 51504.0 |
ATL | Tambakis | Alexander | GK | 63000 | 63000.0 |
CHI | Accam | David | F-M | 700000 | 770937.5 |
CHI | Alvarez | Arturo | F | 115000 | 118264.0 |
<span class="n">df</span><span class="o">.</span><span class="n">tail</span><span class="p">()</span>
club | last_name | first_name | position | base_salary | guaranteed_compensation |
---|---|---|---|---|---|
NaN | Martinez | Cristian | M | 62508.00 | 67008.00 |
NaN | Moore | Luke | F | 115000.00 | 137500.00 |
NaN | Nyassi | Sanna | M | 135000.00 | 141250.00 |
NaN | Ovalle | Adolfo | M | 63000.00 | 73500.00 |
NaN | Tshuma | Schillo | F | 81999.96 | 119999.96 |
Get the information about the dataframe using the ‘info’ function, Which gives details like field datatype, number of values, memory usage etc.
<class ‘pandas.core.frame.DataFrame’> Int64Index: 555 entries, 0 to 554 Data columns (total 6 columns): club 549 non-null object last_name 555 non-null object first_name 553 non-null object position 555 non-null object base_salary 555 non-null float64 guaranteed_compensation 555 non-null float64 dtypes: float64(2), object(4) memory usage: 30.4+ KB
Let’s check the data for a single Column ‘last_name”, Using ‘head’ function we will check the first five values for this column
<span class="n">df</span><span class="p">[</span><span class="s">'<span style="color: #993300;">last_name</span>'</span><span class="p">]</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
0 Burgos
1 Oblitey Otoo
2 Tambakis
3 Accam
4 Alvarez
Name: last_name, dtype: object
Add a New Column to Dataframe
Let’s add ‘Age’ column to the existing dataframe, You can see there is no value set for the new column so Pandas handles it internally and set the value as NaN
<span class="n">DataFrame</span><span class="p">(</span><span class="n">df</span><span class="p">,</span><span class="n">columns</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'last_name'</span><span class="p">,</span><span class="s">'first_name'</span><span class="p">,</span><span class="s">'Age'</span></span><span class="p">])</span>
last_name | first_name | Age |
---|---|---|
Burgos | Efrain | NaN |
Oblitey Otoo | Jeffrey | NaN |
Tambakis | Alexander | NaN |
Accam | David | NaN |
Alvarez | Arturo | NaN |
Calistri | Joey | NaN |
Campbell | Jonathan | NaN |
Cocis | Razvan | NaN |
Conner | Drew | NaN |
Doody | Patrick | NaN |
Fernandez | Collin | NaN |
Gehrig | Eric | NaN |
Goossens | John | NaN |
Harrington | Michael | NaN |
Igboananike | Kennedy | NaN |
Johnson | Sean | NaN |
Junior | Gilberto | NaN |
Kappelhof | Johan | NaN |
LaBrocca | Nick | NaN |
Lampson | Matt | NaN |
McLain | Patrick | NaN |
Meira | Joao | NaN |
Morrell | Alex | NaN |
Polster | Matt | NaN |
Ramos | Rodrigo | NaN |
Stephens | Michael | NaN |
Thiam | Khaly | NaN |
Vincent | Brandon | NaN |
Afful | Harrison | NaN |
Ashe | Corey | NaN |
… | … | … |
Carducci | Marco | NaN |
Dean | Christian | NaN |
Flores | Deybi | NaN |
Froese | Kianz | NaN |
Harvey | Jordan | NaN |
Hurtado | Erik | NaN |
Jacobson | Andrew | NaN |
Kah | Pa Modou | NaN |
Kudo | Masato | NaN |
Laba | Matias | NaN |
Manneh | Kekuta | NaN |
McKendry | Ben | NaN |
Mezquida | Nicolas | NaN |
Morales | Pedro | NaN |
Ousted | David | NaN |
Parker | Tim | NaN |
Perez | Blas | NaN |
Rivero | Octavio | NaN |
Seiler | Cole | NaN |
Smith | Jordan | NaN |
Techera | Cristian | NaN |
Teibert | Russell | NaN |
Tornaghi | Paolo | NaN |
Waston | Kendall | NaN |
Gargan | Dan | NaN |
Martinez | Cristian | NaN |
Moore | Luke | NaN |
Nyassi | Sanna | NaN |
Ovalle | Adolfo | NaN |
Tshuma | Schillo | NaN |
555 rows × 3 columns
Let’s see how we can get the value from the index, So here we want to get the 4th row value from the dataframe. All the columns of the 4th row is shown
<span class="n">df</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="mi">4</span><span class="p">]</span>
club CHI
last_name Alvarez
first_name Arturo
position F
base_salary 115000
guaranteed_compensation 118264
Name: 4, dtype: object
Add a new column – base salary * 2
This is another example of adding a new column ‘Double_Salary’ which is double the ‘base_salary’ of existing column.
<span class="n">df</span><span class="p">[</span><span class="s">'Double_Salary'</span><span class="p">]</span> <span class="o">=</span> <span class="n">df</span><span class="p">[</span><span class="s">'base_salary'</span><span class="p">]</span><span class="o">*</span><span class="mi">2</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
club | last_name | first_name | position | base_salary | guaranteed_compensation | Double_Salary | taxes |
---|---|---|---|---|---|---|---|
ATL | Burgos | Efrain | M | 62508.00 | 62508.00 | 125016.00 | Base Tax |
ATL | Oblitey Otoo | Jeffrey | M | 51504.00 | 51504.00 | 103008.00 | Base Tax |
ATL | Tambakis | Alexander | GK | 63000.00 | 63000.00 | 126000.00 | Base Tax |
CHI | Accam | David | F-M | 700000.00 | 770937.50 | 1400000.00 | Base Tax |
CHI | Alvarez | Arturo | F | 115000.00 | 118264.00 | 230000.00 | Base Tax |
CHI | Calistri | Joey | M | 51500.00 | 51500.00 | 103000.00 | Base Tax |
CHI | Campbell | Jonathan | D | 70000.00 | 78125.00 | 140000.00 | Base Tax |
CHI | Cocis | Razvan | M | 160000.00 | 160000.00 | 320000.00 | Base Tax |
CHI | Conner | Drew | M | 62500.00 | 62500.00 | 125000.00 | Base Tax |
CHI | Doody | Patrick | D | 53471.00 | 53471.00 | 106942.00 | Base Tax |
CHI | Fernandez | Collin | M | 75000.00 | 77000.00 | 150000.00 | Base Tax |
CHI | Gehrig | Eric | D | 106875.00 | 112708.33 | 213750.00 | Base Tax |
CHI | Goossens | John | M | 200000.00 | 200000.00 | 400000.00 | Base Tax |
CHI | Harrington | Michael | D | 125000.00 | 125000.00 | 250000.00 | Base Tax |
CHI | Igboananike | Kennedy | F | 800000.00 | 901666.67 | 1600000.00 | Base Tax |
CHI | Johnson | Sean | GK | 250000.00 | 253000.00 | 500000.00 | Base Tax |
CHI | Junior | Gilberto | F | 1145000.00 | 1145000.00 | 2290000.00 | Base Tax |
CHI | Kappelhof | Johan | D | 480000.00 | 520000.00 | 960000.00 | Base Tax |
CHI | LaBrocca | Nick | M | 110000.00 | 110000.00 | 220000.00 | Base Tax |
CHI | Lampson | Matt | GK | 71250.00 | 76625.00 | 142500.00 | Base Tax |
CHI | McLain | Patrick | GK | 72500.00 | 72500.00 | 145000.00 | Base Tax |
CHI | Meira | Joao | D-M | 110000.00 | 126500.00 | 220000.00 | Base Tax |
CHI | Morrell | Alex | M | 51500.00 | 51500.00 | 103000.00 | Base Tax |
CHI | Polster | Matt | D-M | 84000.00 | 99000.00 | 168000.00 | Base Tax |
CHI | Ramos | Rodrigo | D | 80000.00 | 84000.00 | 160000.00 | Base Tax |
CHI | Stephens | Michael | M | 105000.00 | 115000.00 | 210000.00 | Base Tax |
CHI | Thiam | Khaly | M | 144000.00 | 144000.00 | 288000.00 | Base Tax |
CHI | Vincent | Brandon | D | 70000.00 | 91875.00 | 140000.00 | Base Tax |
CLB | Afful | Harrison | D | 275000.00 | 291666.67 | 550000.00 | Base Tax |
CLB | Ashe | Corey | D | 95000.00 | 105500.00 | 190000.00 | Base Tax |
… | … | … | … | … | … | … | … |
VAN | Carducci | Marco | GK | 63000.00 | 63000.00 | 126000.00 | Base Tax |
VAN | Dean | Christian | D | 110000.00 | 191000.00 | 220000.00 | Base Tax |
VAN | Flores | Deybi | M | 62500.00 | 68385.00 | 125000.00 | Base Tax |
VAN | Froese | Kianz | M | 66000.00 | 70500.00 | 132000.00 | Base Tax |
VAN | Harvey | Jordan | D | 165000.00 | 165000.00 | 330000.00 | Base Tax |
VAN | Hurtado | Erik | F/M | 86091.50 | 121091.50 | 172183.00 | Base Tax |
VAN | Jacobson | Andrew | M | 62500.08 | 87500.08 | 125000.16 | Base Tax |
VAN | Kah | Pa Modou | D | 97000.00 | 101850.00 | 194000.00 | Base Tax |
VAN | Kudo | Masato | F | 310000.00 | 314000.00 | 620000.00 | Base Tax |
VAN | Laba | Matias | M | 560000.00 | 720500.00 | 1120000.00 | Base Tax |
VAN | Manneh | Kekuta | M-F | 127500.00 | 157000.00 | 255000.00 | Base Tax |
VAN | McKendry | Ben | M | 52500.00 | 52500.00 | 105000.00 | Base Tax |
VAN | Mezquida | Nicolas | M-F | 88000.00 | 88000.00 | 176000.00 | Base Tax |
VAN | Morales | Pedro | M | 1232500.00 | 1258900.00 | 2465000.00 | Base Tax |
VAN | Ousted | David | GK | 360000.00 | 378933.33 | 720000.00 | Base Tax |
VAN | Parker | Tim | D | 66000.00 | 84750.00 | 132000.00 | Base Tax |
VAN | Perez | Blas | F | 215000.00 | 215000.00 | 430000.00 | Base Tax |
VAN | Rivero | Octavio | F | 890850.00 | 890850.00 | 1781700.00 | Base Tax |
VAN | Seiler | Cole | D | 51500.00 | 51500.00 | 103000.00 | Base Tax |
VAN | Smith | Jordan | D | 115000.00 | 122743.00 | 230000.00 | Base Tax |
VAN | Techera | Cristian | M | 320000.00 | 345000.00 | 640000.00 | Base Tax |
VAN | Teibert | Russell | M | 115000.00 | 182500.00 | 230000.00 | Base Tax |
VAN | Tornaghi | Paolo | GK | 62500.00 | 62500.00 | 125000.00 | Base Tax |
VAN | Waston | Kendall | D | 300000.00 | 318125.00 | 600000.00 | Base Tax |
NaN | Gargan | Dan | D | 145000.00 | 145000.00 | 290000.00 | Base Tax |
NaN | Martinez | Cristian | M | 62508.00 | 67008.00 | 125016.00 | Base Tax |
NaN | Moore | Luke | F | 115000.00 | 137500.00 | 230000.00 | Base Tax |
NaN | Nyassi | Sanna | M | 135000.00 | 141250.00 | 270000.00 | Base Tax |
NaN | Ovalle | Adolfo | M | 63000.00 | 73500.00 | 126000.00 | Base Tax |
NaN | Tshuma | Schillo | F | 81999.96 | 119999.96 | 163999.92 | Base Tax |
Add specific values to the new column based on the indexes, So here for the indexes 0 & 6, the ‘taxes’ column value would be ‘Federal Tax’ & ‘State Tax’ and rest of the indexes will have the NaN value or null
<span class="n">tax</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Federal Tax'</span><span class="p">,</span><span class="s">'State Tax'</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="mi">0</span><span class="p">,</span><span class="mi">6</span></span><span class="p">])</span>
<span class="n">df</span><span class="p">[</span><span class="s">'taxes'</span><span class="p">]</span> <span class="o">=</span> <span class="n">tax</span>
<span class="n">df</span>
club | last_name | first_name | position | base_salary | guaranteed_compensation | Double_Salary | taxes |
---|---|---|---|---|---|---|---|
ATL | Burgos | Efrain | M | 62508.00 | 62508.00 | 125016.00 | Federal Tax |
ATL | Oblitey Otoo | Jeffrey | M | 51504.00 | 51504.00 | 103008.00 | NaN |
ATL | Tambakis | Alexander | GK | 63000.00 | 63000.00 | 126000.00 | NaN |
CHI | Accam | David | F-M | 700000.00 | 770937.50 | 1400000.00 | NaN |
CHI | Alvarez | Arturo | F | 115000.00 | 118264.00 | 230000.00 | NaN |
CHI | Calistri | Joey | M | 51500.00 | 51500.00 | 103000.00 | NaN |
CHI | Campbell | Jonathan | D | 70000.00 | 78125.00 | 140000.00 | State Tax |
CHI | Cocis | Razvan | M | 160000.00 | 160000.00 | 320000.00 | NaN |
CHI | Conner | Drew | M | 62500.00 | 62500.00 | 125000.00 | NaN |
CHI | Doody | Patrick | D | 53471.00 | 53471.00 | 106942.00 | NaN |
CHI | Fernandez | Collin | M | 75000.00 | 77000.00 | 150000.00 | NaN |
CHI | Gehrig | Eric | D | 106875.00 | 112708.33 | 213750.00 | NaN |
CHI | Goossens | John | M | 200000.00 | 200000.00 | 400000.00 | NaN |
CHI | Harrington | Michael | D | 125000.00 | 125000.00 | 250000.00 | NaN |
CHI | Igboananike | Kennedy | F | 800000.00 | 901666.67 | 1600000.00 | NaN |
CHI | Johnson | Sean | GK | 250000.00 | 253000.00 | 500000.00 | NaN |
CHI | Junior | Gilberto | F | 1145000.00 | 1145000.00 | 2290000.00 | NaN |
CHI | Kappelhof | Johan | D | 480000.00 | 520000.00 | 960000.00 | NaN |
CHI | LaBrocca | Nick | M | 110000.00 | 110000.00 | 220000.00 | NaN |
CHI | Lampson | Matt | GK | 71250.00 | 76625.00 | 142500.00 | NaN |
CHI | McLain | Patrick | GK | 72500.00 | 72500.00 | 145000.00 | NaN |
CHI | Meira | Joao | D-M | 110000.00 | 126500.00 | 220000.00 | NaN |
CHI | Morrell | Alex | M | 51500.00 | 51500.00 | 103000.00 | NaN |
CHI | Polster | Matt | D-M | 84000.00 | 99000.00 | 168000.00 | NaN |
CHI | Ramos | Rodrigo | D | 80000.00 | 84000.00 | 160000.00 | NaN |
CHI | Stephens | Michael | M | 105000.00 | 115000.00 | 210000.00 | NaN |
CHI | Thiam | Khaly | M | 144000.00 | 144000.00 | 288000.00 | NaN |
CHI | Vincent | Brandon | D | 70000.00 | 91875.00 | 140000.00 | NaN |
CLB | Afful | Harrison | D | 275000.00 | 291666.67 | 550000.00 | NaN |
CLB | Ashe | Corey | D | 95000.00 | 105500.00 | 190000.00 | NaN |
VAN | Carducci | Marco | GK | 63000.00 | 63000.00 | 126000.00 | NaN |
VAN | Dean | Christian | D | 110000.00 | 191000.00 | 220000.00 | NaN |
VAN | Flores | Deybi | M | 62500.00 | 68385.00 | 125000.00 | NaN |
VAN | Froese | Kianz | M | 66000.00 | 70500.00 | 132000.00 | NaN |
VAN | Harvey | Jordan | D | 165000.00 | 165000.00 | 330000.00 | NaN |
VAN | Hurtado | Erik | F/M | 86091.50 | 121091.50 | 172183.00 | NaN |
VAN | Jacobson | Andrew | M | 62500.08 | 87500.08 | 125000.16 | NaN |
VAN | Kah | Pa Modou | D | 97000.00 | 101850.00 | 194000.00 | NaN |
VAN | Kudo | Masato | F | 310000.00 | 314000.00 | 620000.00 | NaN |
VAN | Laba | Matias | M | 560000.00 | 720500.00 | 1120000.00 | NaN |
VAN | Manneh | Kekuta | M-F | 127500.00 | 157000.00 | 255000.00 | NaN |
VAN | McKendry | Ben | M | 52500.00 | 52500.00 | 105000.00 | NaN |
VAN | Mezquida | Nicolas | M-F | 88000.00 | 88000.00 | 176000.00 | NaN |
VAN | Morales | Pedro | M | 1232500.00 | 1258900.00 | 2465000.00 | NaN |
VAN | Ousted | David | GK | 360000.00 | 378933.33 | 720000.00 | NaN |
VAN | Parker | Tim | D | 66000.00 | 84750.00 | 132000.00 | NaN |
VAN | Perez | Blas | F | 215000.00 | 215000.00 | 430000.00 | NaN |
VAN | Rivero | Octavio | F | 890850.00 | 890850.00 | 1781700.00 | NaN |
VAN | Seiler | Cole | D | 51500.00 | 51500.00 | 103000.00 | NaN |
VAN | Smith | Jordan | D | 115000.00 | 122743.00 | 230000.00 | NaN |
VAN | Techera | Cristian | M | 320000.00 | 345000.00 | 640000.00 | NaN |
VAN | Teibert | Russell | M | 115000.00 | 182500.00 | 230000.00 | NaN |
VAN | Tornaghi | Paolo | GK | 62500.00 | 62500.00 | 125000.00 | NaN |
VAN | Waston | Kendall | D | 300000.00 | 318125.00 | 600000.00 | NaN |
NaN | Gargan | Dan | D | 145000.00 | 145000.00 | 290000.00 | NaN |
NaN | Martinez | Cristian | M | 62508.00 | 67008.00 | 125016.00 | NaN |
NaN | Moore | Luke | F | 115000.00 | 137500.00 | 230000.00 | NaN |
NaN | Nyassi | Sanna | M | 135000.00 | 141250.00 | 270000.00 | NaN |
NaN | Ovalle | Adolfo | M | 63000.00 | 73500.00 | 126000.00 | NaN |
NaN | Tshuma | Schillo | F | 81999.96 | 119999.96 | 163999.92 | NaN |
Delete Column
‘taxes’ column deleted from the dataframe
<span class="k">del</span> <span class="n">df</span><span class="p">[</span><span class="s">'taxes'</span><span class="p">]</span>
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">()</span>
club | last_name | first_name | position | base_salary | guaranteed_compensation | Double_Salary |
---|---|---|---|---|---|---|
ATL | Burgos | Efrain | M | 62508 | 62508.0 | 125016 |
ATL | Oblitey Otoo | Jeffrey | M | 51504 | 51504.0 | 103008 |
ATL | Tambakis | Alexander | GK | 63000 | 63000.0 | 126000 |
CHI | Accam | David | F-M | 700000 | 770937.5 | 1400000 |
CHI | Alvarez | Arturo | F | 115000 | 118264.0 | 230000 |
Dataframe from Dictionary
<span class="n">Myfruits</span> <span class="o">=</span> <span class="p">{</span><span class="s">'fruits'</span><span class="p">:[</span><span style="color: #993300;"><span class="s">'Orange'</span><span class="p">,</span><span class="s">'Mango'</span><span class="p">,</span><span class="s">'Grapes'</span><span class="p">,</span><span class="s">'Guava'</span></span><span class="p">],</span><span class="s">'Price'</span> <span class="p">:[</span><span style="color: #993300;"><span class="mi">20</span><span class="p">,</span><span class="mi">50</span><span class="p">,</span><span class="mi">90</span><span class="p">,</span><span class="mi">100</span></span><span class="p">]}</span>
<span class="n">dffruits</span><span class="o">=</span><span class="n">DataFrame</span><span class="p">(</span><span class="n">Myfruits</span><span class="p">)</span>
<span class="n">dffruits</span>
Price | fruits |
---|---|
20 | Orange |
50 | Mango |
90 | Grapes |
100 | Guava |
Reindexing
Create a new Series for colors as index
<span class="n">df</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">5</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">7</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span></span><span class="p">])</span>
<span class="n">df</span>
Red 6
Blue 4
Green 5
Brown 9
Violet 7
dtype: int64
Re-index and add a new index ‘Pink’ in the existing Series, Therefore the value of index ‘Pink’ is null
<span class="n">df</span><span class="o">.</span><span class="n">reindex</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span><span class="p">,</span><span class="s">'Pink'</span></span><span class="p">])</span>
Red 6
Blue 4
Green 5
Brown 9
Violet 7
Pink NaN
dtype: float64
Fill the null values with a default value
<span class="n">df</span><span class="o">.</span><span class="n">reindex</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Red'</span><span class="p">,</span><span class="s">'Blue'</span><span class="p">,</span><span class="s">'Green'</span><span class="p">,</span><span class="s">'Brown'</span><span class="p">,</span><span class="s">'Violet'</span><span class="p">,</span><span class="s">'Pink'</span></span><span class="p">],</span><span class="n">fill_value</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
Red 6
Blue 4
Green 5
Brown 9
Violet 7
Pink 10
dtype: int64
Reindexing rows, columns or both
Create a Dataframe with random integers and index value as defined below and columns also as shown
<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">random_integers</span><span class="p">(</span><span class="mi">25</span><span class="p">,</span><span class="n">size</span><span class="o">=</span><span class="mi">25</span><span class="p">)</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="mi">5</span><span class="p">,</span><span class="mi">5</span><span class="p">)),</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'A'</span><span class="p">,</span><span class="s">'B'</span><span class="p">,</span><span class="s">'D'</span><span class="p">,</span><span class="s">'E'</span><span class="p">,</span><span class="s">'F'</span></span><span class="p">],</span>
<span class="n">columns</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'col1'</span><span class="p">,</span><span class="s">'col2'</span><span class="p">,</span><span class="s">'col3'</span><span class="p">,</span><span class="s">'col4'</span><span class="p">,</span><span class="s">'col5'</span></span><span class="p">])</span>
<span class="n">df</span>
col1 | col2 | col3 | col4 | col5 | |
---|---|---|---|---|---|
A | 17 | 3 | 16 | 16 | 18 |
B | 2 | 24 | 5 | 3 | 15 |
C | 20 | 22 | 5 | 21 | 16 |
D | 13 | 13 | 10 | 3 | 20 |
E | 15 | 18 | 20 | 2 | 2 |
Summing up columns and rows
Sum along the rows by using axis =1
<span class="n">df</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
A 70
B 49
D 84
E 59
F 57
dtype: int64
Sum along the rows by using axis=0
<span class="n">dframe</span><span class="o">.</span><span class="n">sum</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
col1 65
col2 60
col3 53
col4 82
col5 77
dtype: int64
Simple Stats
<span class="n">dframe</span><span class="o">.</span><span class="n">describe</span><span class="p">()</span>
col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|
5.000000 | 5 | 5.000000 | 5.000000 | 5.00000 |
13.000000 | 12 | 10.600000 | 16.400000 | 15.40000 |
8.774964 | 10 | 5.549775 | 5.727128 | 8.70632 |
1.000000 | 2 | 1.000000 | 9.000000 | 2.00000 |
8.000000 | 3 | 11.000000 | 15.000000 | 12.00000 |
16.000000 | 11 | 13.000000 | 16.000000 | 18.00000 |
16.000000 | 19 | 13.000000 | 17.000000 | 21.00000 |
24.000000 | 25 | 15.000000 | 25.000000 | 24.00000 |
Find the correlation between each of the variables of the dataframe using ‘corr’ function
<span class="n">corr</span><span class="o">=</span><span class="n">dframe</span><span class="o">.</span><span class="n">corr</span><span class="p">()</span>
Plot the Correlation heatmap matrix using Seaborn – A Data Visualization Library
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="nn">sns</span>
<span class="o">%</span><span class="k">matplotlib</span> inline
<span class="n">sns</span><span class="o">.</span><span class="n">heatmap</span><span class="p">(</span><span class="n">corr</span><span class="p">)</span>
Missing Values
<span class="n">df</span> <span class="o">=</span> <span class="n">DataFrame</span><span class="p">([[</span><span style="color: #993300;"><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">7</span></span><span class="p">],[</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NaN</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NAN</span><span class="p">,</span><span class="mi">9</span></span><span class="p">],[</span><span style="color: #993300;"><span class="n">np</span><span class="o">.</span><span class="n">NaN</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><spn class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="n">np</span><span class="o">.</span><span class="n">NaN</span></span><span class="p">]])</span>
<span class="n">df</span>
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 5 | 4 | 3 | 2 | 7 |
1 | 6 | 3 | NaN | NaN | 9 |
2 | NaN | 3 | 4 | 2 | NaN |
drop all the columns where null value is available
<span class="n">df1</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="n">df1</span>
|1|0| |— |— | |0|4| |1|3| |2|3|
Show rows with atleast ‘thresh’ no. of values, Since all the rows contains minimum three non null values that’s why all rows are displayed
<span class="n">df2</span><span class="o">=</span><span class="n">df</span><span class="o">.</span><span class="n">dropna</span><span class="p">(</span><span class="n">thresh</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
<span class="n">df2</span>
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 5 | 4 | 3 | 2 | 7 |
1 | 6 | 3 | NaN | NaN | 9 |
2 | NaN | 3 | 4 | 2 | NaN |
fillna
fill nulls with values
<span class="n">df</span><span class="o">.</span><span class="n">fillna</span><span class="p">(</span><span class="s">'5'</span><span class="p">)</span>
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
0 | 5 | 4 | 3 | 2 | 7 |
1 | 6 | 3 | 5 | 5 | 9 |
2 | 5 | 3 | 4 | 2 | 5 |
Drop rows
<span class="n">dframe</span><span class="o">.</span><span class="n">drop</span><span class="p">(</span><span class="s">'B'</span><span class="p">)</span>
col1 | col2 | col3 | col4 | col5 |
---|---|---|---|---|
8 | 2 | 1 | 15 | 18 |
16 | 3 | 11 | 9 | 12 |
16 | 11 | 13 | 25 | 24 |
1 | 25 | 13 | 17 | 21 |
Selecting only 2 columns
<span class="n">dframe</span><span class="p">[[</span><span class="s">'col1'</span><span class="p">,</span><span class="s">'col2'</span><span class="p">]]</span>
col1 | col2 |
---|---|
8 | 2 |
24 | 19 |
16 | 3 |
16 | 11 |
1 | 25 |
Selecting Index
<span class="n">dframe</span><span class="o">.</span><span class="n">ix</span><span class="p">[</span><span class="s">'D'</span><span class="p">]</span>
col1 16
col2 3
col3 11
col4 9
col5 12
Name: D, dtype: int64
Selecting on Condition
<span class="n">dframe</span><span class="p">[</span><span class="n">dframe</span><span class="p">[</span><span class="s">'col1'</span><span class="p">]</span><span class="o">></span><span class="mi">10</span><span class="p">]</span>
Out[207]: col1 col2 col3 col4 col5 B 24 19 15 16 2 D 16 3 11 9 12 E 16 11 13 25 24
Selecting values
<span class="n">df</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">2</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'A'</span><span class="p">,</span><span class="s">'B'</span><span class="p">,</span><span class="s">'C'</span><span class="p">,</span><span class="s">'D'</span></span><span class="p">])</span>
<span class="n">df</span>
A 6
B 4
C 3
D 2
dtype: int64
Select the Index ‘C’ value
<span class="n">df</span><span class="p">[</span><span class="s">'C'</span><span class="p">]</span>
3
1
3
Select First two values
<span class="n">df</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>
A 6
B 4
dtype: int64
Sorting
Sort will be done by default on the Indexes, so before sorting index was P,R,S,Q and after sorting it changed to P,Q,R,S
<span class="n">df</span><span class="o">=</span><span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">6</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">1</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[</span><span style="color: #993300;"><span class="s">'P'</span><span class="p">,</span><span class="s">'R'</span><span class="p">,</span><span class="s">'S'</span><span class="p">,</span><span class="s">'Q'</span></span><span class="p">])</span>
<span class="n">df</span>
P 6
R 4
S 2
Q 1
dtype: int64
<span class="n">df</span><span class="o">.</span><span class="n">sort_index</span><span class="p">()</span>
P 6
Q 1
R 4
S 2
dtype: int64
Order by values
<span class="n">df</span><span class="o">.</span><span class="n">sort_values</span><span class="p">()</span>
Q 1
S 2
R 4
P 6
dtype: int64
<span class="n">df</span><span class="o">.</span><span class="n">rank</span><span class="p">()</span>
P 4
R 3
S 2
Q 1
dtype: float64
MultiIndexing
Here we have first index as 1,2,3 and the second index has a,b for all the first index values, as shown here
<span class="n">Sr</span> <span class="o">=</span> <span class="n">Series</span><span class="p">([</span><span style="color: #993300;"><span class="mi">5</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">9</span><span class="p">,</span><span class="mi">6</span></span><span class="p">],</span><span class="n">index</span><span class="o">=</span><span class="p">[[</span><span style="color: #993300;"><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">,</span><span class="mi">3</span></span><span class="p">],[</span><span style="color: #993300;"><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span><span class="p">,</span><span class="s">'a'</span><span class="p">,</span><span class="s">'b'</span></span><span class="p">]])</span>
<span class="n">Sr</span>
1 a 5
b 4
2 a 3
b 7
3 a 9
b 6
dtype: int64
Verify the Index Levels
<span class="n">Sr</span><span class="o">.</span><span class="n">index</span>
MultiIndex(levels=[[<span style="color: #993300;">1, 2, 3</span>], [<span style="color: #993300;">'a', 'b'</span>]],
labels=[[<span style="color: #993300;">0, 0, 1, 1, 2, 2</span>], [<span style="color: #993300;">0, 1, 0, 1, 0, 1</span>]])
Get Values by first Index
<span class="n">Sr</span><span class="p">[</span><span class="mi">3</span><span class="p">]</span>
a 9
b 6
dtype: int64
Get Values by Second Index
<span class="n">Sr</span><span class="p">[:,</span><span class="s">'b'</span><span class="p">]</span>
1 4
2 7
3 6
dtype: int64
<span class="n">Sr</span><span class="o">.</span><span class="n">unstack</span><span class="p">()</span>
a | b |
---|---|
5 | 4 |
3 | 7 |
9 | 6 |
Group By
Read a CSV file with the data of Booker Prize Winners from 1996-2016
<span class="n">df</span><span class="o">=</span><span class="n">pd</span><span class="o">.</span><span class="n">read_csv</span><span class="p">(</span><span class="s">'<span style="color: #993300;">./Desktop/Reports-Delete/Booker-Prize.csv</span>'</span><span class="p">)</span>
Get the first 3 rows of the dataset
<span class="n">df</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
Year | Author | Title | Genre(s) | Country | Unnamed: 5 | Unnamed: 6 |
---|---|---|---|---|---|---|
1969 | P. H. Newby | Something to Answer For | Novel | United Kingdom | NaN | NaN |
1970 | Bernice Rubens | The Elected Member | Novel | United Kingdom | NaN | NaN |
1970 | J. G. Farrell | Troubles | Novel | United Kingdom\nIreland | NaN | NaN |
Groupby Country, Author & Genre to know who are the Authors won the title more than once and which is the country won the Award maximum number of time
<span class="n">df</span><span class="o">.</span><span class="n">groupby</span><span class="p">([</span><span style="color: #993300;"><span class="s">'Country'</span><span class="p">,</span><span class="s">'Author'</span><span class="p">,</span><span class="s">'Genre(s)'</span></span><span class="p">])</span><span class="o">.</span><span class="n">count</span><span class="p">()</span><span class="o">.</span><span class="n">sort_values</span><span class="p">(</span><span class="n">by</span><span class="o">=</span><span class="p">[</span><span class="s">'<span style="color: #993300;">Year</span>'</span><span class="p">],</span><span class="n">ascending</span><span class="o">=</span><span class="p">[</span><span class="k">False</span><span class="p">])</span><span class="o">.</span><span class="n">head</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
Year | Title | ||||
---|---|---|---|---|---|
Country | Author | Genre(s) | |||
Australia | Peter Carey | Historical novel | 2 | 2 | |
United Kingdom\nIreland | J. G. Farrell | Novel | 2 | 2 | |
United Kingdom | Hilary Mantel | Historical novel | 2 | 2 | |
South Africa | J. M. Coetzee | Novel | 2 | 2 | |
United Kingdom | Kazuo Ishiguro | Historical novel | 1 | 1 |
Pandas is an amazing library for data analysis and I have shown some of the basic stuffs to learn but to get more hands on with the library and it’s functions, Use Varied set of data and practice a lot. The only mantra to learn Pandas and gain expertise is Practice, Practice & Practice