Pandas

I use the Jupyter Notebook that I installed with Anaconda, and I am working through the book Pandas for Everyone by Daniel Y. Chen. One thing I learned is that operations on Series and DataFrames are vectorized: they are applied to every element at once, without writing an explicit loop.

The examples use the scientists.csv dataset.

# Import the necessary library
import pandas as pd

# Load the dataset
scientists = pd.read_csv('scientists.csv')
scientists.head()

>>> # Data Output

Name    Born    Died    Age    Occupation
0    Rosaline Franklin    1920-07-25    1958-04-16    37    Chemist
1    William Gosset    1876-06-13    1937-10-16    61    Statistician
2    Florence Nightingale    1820-05-12    1910-08-13    90    Nurse
3    Marie Curie    1867-11-07    1934-07-04    66    Chemist
4    Rachel Carson    1907-05-27    1964-04-14    56    Biologist

ages = scientists['Age']
print(ages)
>>> # Data Output
0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

print(ages + ages)
>>> # Data Output
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
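
To make "vectorized" concrete, here is a small sketch that compares the one-line expression with an explicit loop. It assumes the ages are typed in by hand from the output above instead of being read from the CSV, and the names vectorized and looped are my own.

```python
import pandas as pd

# Ages copied from the printed output above, so the CSV file is not needed
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# Vectorized: a single expression operates on every element
vectorized = ages + ages

# The same calculation written as an explicit element-by-element loop
looped = pd.Series([a + a for a in ages], name="Age")

# Both approaches produce the same values
print(vectorized.tolist() == looped.tolist())
```

The vectorized form is shorter, and on large data it is usually faster because the loop runs inside the library instead of in Python.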

Vectors with integers (scalars)

When you operate on a vector using a scalar, the scalar will be recycled across the elements in the vector.

print(ages + 100)
>>> # Data Output
0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64

print(ages * 2)
>>> # Data Output
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
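
One way to picture the recycling is to build a Series that repeats the scalar once per row and add that instead. This is only an illustration of the idea, not how Pandas does it internally; hundreds is a name I made up, and the ages are typed in by hand.

```python
import pandas as pd

# Ages copied from the printed output above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")

# Scalar on the right-hand side: 100 is applied to every element
plus_scalar = ages + 100

# Hand-built equivalent: a Series holding 100 for every index label
hundreds = pd.Series(100, index=ages.index)
plus_series = ages + hundreds

# Both give the same values
print(plus_scalar.tolist() == plus_series.tolist())
```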

Vectors with different lengths

When you are working with vectors of different lengths, the behavior will depend on the type of vectors.

With a Series, the operation is matched up by index label. Any label that appears in only one of the two Series gets a 'missing' value in the result, which is denoted with NaN, for 'not a number'. Because NaN is a floating-point value, the result is converted to float64.

This type of behavior is called 'broadcasting', and it differs between languages. Broadcasting in Pandas refers to how operations are calculated between arrays with different shapes.

print(ages + pd.Series([1, 100]))
>>> # Data Output
0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64
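
Here is a hedged sketch of how the result depends on the type of the second vector. It assumes the ages are typed in by hand, and it uses Series.add with fill_value=0 as one possible way to avoid the NaN values, which is my own addition. The names short, filled and shapes_matched are made up for the example.

```python
import numpy as np
import pandas as pd

# Ages copied from the printed output above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
short = pd.Series([1, 100])

# Series + Series: aligned on index labels, unmatched labels become NaN
aligned = ages + short
print(aligned.isna().sum())  # number of labels without a partner

# Series.add with fill_value=0 treats a label missing from one side as 0
filled = ages.add(short, fill_value=0)
print(filled)

# Series + NumPy array: there is no index to align on, so the lengths must match
try:
    ages + np.array([1, 100])
    shapes_matched = True
except ValueError as err:
    shapes_matched = False
    print("ValueError:", err)
```

Whether 0 is a sensible fill value depends on the calculation; for addition it simply leaves the unmatched ages unchanged.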

Vectors with common index labels

Data alignment is pervasive in Pandas and almost always automatic. Whenever possible, the operands are aligned by their index labels before an operation is performed.

print(ages)
>>> # Data Output
0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64

rev_ages = ages.sort_index(ascending=False)
print(rev_ages)
>>> # Data Output
7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64

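# Adding ages to its reversed copy aligns on the index labels first,
# so the result is the same as ages * 2
print(ages + rev_ages)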
>>> # Data Output
0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
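
To see that the alignment really happens by label and not by row position, this sketch adds the reversed Series in two ways. It assumes the ages are typed in by hand; by_label and by_position are names I made up, and to_numpy() is used only to drop the index so the addition becomes positional.

```python
import pandas as pd

# Ages copied from the printed output above
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name="Age")
rev_ages = ages.sort_index(ascending=False)

# Label-based: label 0 is added to label 0, label 1 to label 1, and so on,
# no matter what order the rows are stored in (sorted here for display)
by_label = (ages + rev_ages).sort_index()

# Position-based: without an index, the first row is added to the first row
by_position = ages.to_numpy() + rev_ages.to_numpy()

print(by_label.tolist())
print(by_position.tolist())
```

The label-based result doubles every age, while the position-based result pairs the first scientist with the last one.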

You can download the book and the dataset and practice along.

Happy Learning!!!