Pandas: How to get started.
Pandas library in python provides a lot of features that are useful not only in data science projects but also for quick data manipulations or conversions.
Here I will list a few things that I used in my first Pandas project.
Import pandas
First and foremost you would need to import pandas into your script.
import pandas as pd
Read a comma separated file
Next you can load a data set saved as a comma separated file (.csv) using
df = pd.read_csv(‘datafile.csv’)
I prefer CSV format as it is the vanilla format and if need arises you use a simple text editor to peruse it.
Pandas uses data frame to store the data in memory, similar to R.
In this blog df will be used to denote the variable is a data frame.
Display stats
Displays stats of the data.
matdf.describe()
Print all the column names
Prints all the variables in the data frame
df.columns.values
Display the start and end of the data frame
Displays the first and the last ’n’ rows in a data frame respectively.
’n’ equals 5 by default.
matdf.head(n)<br>
matdf.tail(n)
Concatenate two data frames
Concatenates 2 data frames along the required axis.
When axis equals 0, more observations are added to the resulting data frame.
When axis equals 1, more variables (dimensions) are added to the resulting data frame.
A good explanation can be found here.
df = pd.concat([df1, df2], axis=1)
Unique values in a column
To display the unique values in a column.
Good to see different values of an ordinal variable.
df.columnName.unique()
To make a data frame from arrays
The Simplest way to create a data frame
pd.DataFrame([[1.0,2.0],[3.0,4.0]], columns=[‘a’, ‘b’])
Drop a column
To delete a column
df.drop(‘columnName’, axis = 1)
Set a column as index
Allows you to label rows using an already present column.
matdf = df.set_index(‘columnName’)
Select conditionally
loc command allows you to select rows (indices) from column2 and substitute them with value2, that equals a condition to the values in column1.
Note: columnName1 and columnName2 can be the same column name.
df.loc[df[‘columnName1’] == value1,’columnName2'] = value2
Get a subset of data frame
Get stats from a column for only those rows (indices) where the conditions are satisfied in one or more columns.
df.columnName3[(df.columnName1 == value1) & (df.columnName2 == value2 )].mean()
Write a data frame to CSV
After you preprocess the data you might want to save it to the disk.
to_csv function comes in handy to save a data frame as a CSV file,
matdf.to_csv(‘matYearCountry.csv’)
That's all, go fire up your first pandas project.