Pandas is an open-source Python library used for data manipulation and analysis. Pandas can work with different types of data sets, including CSV files, Excel files, JSON files, XML files, and relational database tables.
By the end of this article, we will have a good understanding of the basics of Pandas. We will learn about the installation and setup, the data structures of Pandas, why Pandas is used, some of its limitations, and common use cases.
So, without further ado, let’s get started!
1. Installing Pandas
To install Pandas, first make sure that Python 3.8, 3.9, 3.10, or 3.11 is installed, as Pandas officially supports these Python versions.
Let’s discuss two methods to install Pandas on our system.
1.1. Installing with pip
pip is Python's package installer, which makes it easy to install Python libraries and frameworks. pip is included by default with Python 3.4 and higher. Otherwise, we need to install pip before installing Pandas.
First, launch the command prompt and type the following command:
pip install pandas
Press Enter, and Pandas will start installing. Once the installation finishes, we can use Pandas in our Python programs.
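To confirm that the installation worked, a quick check like the following sketch can be run in a Python shell. It only assumes that Pandas was installed into the same interpreter that runs the check.

```python
import pandas as pd

# Printing the version confirms that the import succeeds.
print(pd.__version__)
```

If the import fails with a ModuleNotFoundError, pip most likely installed Pandas into a different Python installation.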

1.2. Installing with Anaconda
Another way to install Pandas is to install Anaconda. Anaconda is a Python distribution that bundles Python with tools such as Anaconda Navigator and Jupyter Notebook.
When we install Anaconda, it installs many popular data science libraries, including Pandas, automatically. To install Anaconda, first download it using this link: https://www.anaconda.com/download.
Now, launch the installer, but remember to check the following boxes:

Click the Install button and wait for the installation to complete. Once Anaconda is installed, we can use Pandas from the Windows command prompt, the VS Code editor, or the PowerShell prompt (one of the tools available in the Anaconda Navigator).
Jupyter Notebook is a popular environment for working with Pandas. It is a web-based, interactive computing notebook.
To use Jupyter, open Anaconda Navigator on your system and open the Jupyter Notebook. You can see the Jupyter Notebook option in the image below.

Just click on the Launch button and the notebook will open on the localhost page, as we can see below.

We can click on the New button and select Python 3. We are now ready to use the Jupyter Notebook.
3. Data Structures in Pandas
Data structures in Pandas are very useful for processing and analyzing data. Pandas has two key data structures:
- Series
- DataFrames
Let us learn more about these data structures in detail.
4. Series in Pandas
A Series is a one-dimensional, array-like structure with an index. It can hold any data type, including numbers, strings, booleans, and objects.

4.1. Creating a Series
To create a Series, you can use the pd.Series() function and pass a list or array of values.
import pandas as pd
arr = ['J', 'A', 'V', 'A']
result = pd.Series(arr)
print(result)
The program output:
0 J
1 A
2 V
3 A
We can also create a Series with custom indexes.
import pandas as pd
result = pd.Series(['J', 'A', 'V', 'A'], index = [5, 4, 3, 2])
print(result)
The program output:
5 J
4 A
3 V
2 A
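A Series can also be built from a dictionary, in which case the keys become the index labels. The following is a minimal sketch. The subject names and scores are made-up example data.

```python
import pandas as pd

# Hypothetical example data: dictionary keys become the index labels.
scores = pd.Series({"math": 90, "physics": 85, "chemistry": 78})

print(scores)
print(scores["physics"])  # look up a value by its label
```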
4.2. Accessing the Values
To access the values and indexes from a Series, we can use the .values and .index attributes, respectively.
import pandas as pd
arr = ['J', 'A', 'V', 'A']
result = pd.Series(arr)
print(result.index)
print(result.values)
The program output:
RangeIndex(start=0, stop=4, step=1)
['J' 'A' 'V' 'A']
We can also access part of a Series by slicing it. Suppose we want only the first three elements from a Series of 6 elements.
import pandas as pd
s = pd.Series(['P', 'Y', 'T', 'H', 'O', 'N'])
print(s[0:3])
The program output:
0 P
1 Y
2 T
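With the default index, positions and labels are the same numbers. With a custom index they differ, so it can help to be explicit about which one is meant. The sketch below reuses the custom-index Series from earlier and assumes we want the first element both ways, using .iloc for position and .loc for label.

```python
import pandas as pd

s = pd.Series(['J', 'A', 'V', 'A'], index=[5, 4, 3, 2])

first_by_position = s.iloc[0]  # first element, regardless of its label
first_by_label = s.loc[5]      # element whose index label is 5
first_three = s.iloc[0:3]      # positional slice, end position excluded

print(first_by_position, first_by_label)
print(first_three)
```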
Suppose we have a Series of numbers and we only want to print the elements that are greater than 3. We can pass a boolean condition inside the square brackets.
import pandas as pd
s = pd.Series([3, 4, 5, 1, 2])
print(s[s > 3])
The program output:
1 4
2 5
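Conditions can also be combined. As a small sketch, the example below keeps the values that are greater than 1 and less than 5. Each condition needs its own parentheses, and the operators are & (and) and | (or) rather than the Python keywords.

```python
import pandas as pd

s = pd.Series([3, 4, 5, 1, 2])

# Build a boolean mask from two conditions, then use it to filter.
mask = (s > 1) & (s < 5)
selected = s[mask]

print(selected)
```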
5. DataFrame in Pandas
A Pandas DataFrame is a two-dimensional data structure that can hold multiple Series. It is like a SQL table or spreadsheet, with multiple rows and columns, and can hold different types of data.

5.1. Creating a DataFrame
A simple way to create a DataFrame is by using a list of lists. We simply have to pass the list of lists to the pd.DataFrame() function. Each inner list represents a row in the DataFrame, and each element of an inner list is the value for one column of that row.
import pandas as pd
data = [['A', 'B', 'C'], ['D', 'E', 'F']]
df = pd.DataFrame(data)
print(df)
The program output:
   0  1  2
0  A  B  C
1  D  E  F
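In the output above, the columns are labeled 0, 1, and 2 because no names were given. A minimal sketch of naming them uses the columns parameter of pd.DataFrame(). The names 'first', 'second', and 'third' are placeholders chosen for illustration.

```python
import pandas as pd

data = [['A', 'B', 'C'], ['D', 'E', 'F']]

# Hypothetical column names, passed through the columns parameter.
df = pd.DataFrame(data, columns=['first', 'second', 'third'])

print(df)
```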
Another way to create a DataFrame is to use a dictionary. The keys of the dictionary will represent the column names of the DataFrame, while the values of the dictionary will represent the column data.
import pandas as pd
data = {
    "Name": ['Satyam', 'John', 'Kelvin'],
    "Age": [22, 30, 28]
}
df = pd.DataFrame(data)
print(df)
The program output:
     Name  Age
0  Satyam   22
1    John   30
2  Kelvin   28
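Once a DataFrame exists, it is often useful to check its size and column types before working with it. The following sketch uses the same dictionary and inspects the result with the shape, columns, and dtypes attributes.

```python
import pandas as pd

data = {
    "Name": ['Satyam', 'John', 'Kelvin'],
    "Age": [22, 30, 28]
}
df = pd.DataFrame(data)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names
print(df.dtypes)         # data type of each column
```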
5.2. Loading a DataFrame from CSV File
To load data from a CSV file into Pandas, we can use the read_csv() function. The read_csv() function takes the file path as its first argument, and it also takes other arguments that specify how the data should be loaded.

import pandas as pd
data = pd.read_csv("C:\\Users\\Downloads\\test_data.csv")
print(data)
The program output:
      Name  Age     Sex  Salary
0   Satyam   22    Male   20000
1     John   21    Male   21000
2  Nicolas   23  Female   45000
3    Julia   32  Female   62000
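As a sketch of the other arguments that read_csv() accepts, the example below loads only two columns and the first two rows by using usecols and nrows. To stay self-contained, it reads the same data from an in-memory string through io.StringIO instead of a file on disk; that substitution is an assumption made for illustration.

```python
import io
import pandas as pd

# The same records as above, held in a string so no file is needed.
csv_text = """Name,Age,Sex,Salary
Satyam,22,Male,20000
John,21,Male,21000
Nicolas,23,Female,45000
Julia,32,Female,62000
"""

# usecols limits the columns that are loaded; nrows limits the rows.
data = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Salary"], nrows=2)

print(data)
```

With a real file, the io.StringIO(...) argument would simply be replaced by the file path.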
5.3. Selecting Records in DataFrame
To access a column of a DataFrame, we can pass the column name inside square brackets. The result is a Series.
import pandas as pd
data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(data)
print(df['Name'])
The program output:
0 Satyam
1 Julia
2 Jacky
3 David
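To select several columns at once, a list of column names can be passed instead of a single name. A short sketch with the same data follows. Note that the result is a DataFrame rather than a Series.

```python
import pandas as pd

data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(data)

# A list of names inside the brackets returns a DataFrame.
subset = df[['Name', 'Age']]

print(subset)
```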
We can also use the .loc and .iloc indexers to access values in a DataFrame. The .loc indexer selects rows and columns by label, while the .iloc indexer selects rows and columns by integer position.
import pandas as pd
data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(data)
print(df.loc[2])
print(df.iloc[[0, 2], [1, 2]])
The program output:
Name      Jacky
Age          36
Gender        F
Name: 2, dtype: object
   Age Gender
0   22      M
2   36      F
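The .loc indexer also accepts a boolean condition for the rows together with a list of column labels. The following sketch assumes we want the names and ages of everyone older than 25.

```python
import pandas as pd

data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}
df = pd.DataFrame(data)

# Rows where Age > 25, restricted to the Name and Age columns.
older = df.loc[df['Age'] > 25, ['Name', 'Age']]

print(older)
```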
6. Data Visualization with Pandas
Pandas can be used to create basic data visualizations such as line plots, bar plots, scatter plots, and so on. To do this, we need to import another library called Matplotlib. Pandas is used to create a DataFrame of the data, and then Matplotlib is used to show the visualization of the data.
6.1. Line Plot
A line plot is a type of graph that is used to show the relationship between two variables. It is represented by a line.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Time": [10, 11, 12, 13, 14],
    "Temperature": [30, 32, 38, 40, 35]
})
plt.plot(df["Time"], df["Temperature"])
plt.xlabel("Time")
plt.ylabel("Temperature")
plt.show()
The program output:

As we can see above, we used Pandas to create a DataFrame and Matplotlib to draw the visualization. We put time on the X-axis and temperature on the Y-axis. We then used the show() function to display the line plot.
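Pandas also offers a plot() method on DataFrames that calls Matplotlib internally, so the same chart can be produced without calling plt.plot() directly. The sketch below assumes Matplotlib is installed, since plot() relies on it.

```python
import pandas as pd

df = pd.DataFrame({
    "Time": [10, 11, 12, 13, 14],
    "Temperature": [30, 32, 38, 40, 35]
})

# DataFrame.plot() draws with Matplotlib and returns the Axes object.
ax = df.plot(x="Time", y="Temperature", kind="line")
ax.set_ylabel("Temperature")
```

In a Jupyter Notebook the figure is rendered after the cell runs. In a plain script, calling plt.show() from matplotlib.pyplot displays it.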
6.2. Bar Plot
Bar plots are used to represent the data in the form of rectangular bars.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "Sales": [100, 150, 300, 180, 210]
})
plt.bar(df["Month"], df["Sales"])
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()
The program output:

6.3. Scatter Plot
Scatter plots show the relationship between two variables, using one dot for each pair of values.
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [45000, 50000, 65000, 60000, 70000]
})
plt.scatter(df["Age"], df["Income"])
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()
The program output:

7. Data Manipulation with Pandas
The Pandas library covers many common data manipulation tasks, which is why it is preferred by machine learning engineers and data scientists. We are going to use the following data to understand some of these tasks.

We can create the table above using this code:
import pandas as pd
df = pd.DataFrame({'City': ["London", "Paris", "New York", "Tokyo", "Rome", "Barcelona"],
                   'City Population': [88000, 21452, 88190, 139223, 28740, 16255],
                   'City Area': [1572, 105, 784, 219, 1285, 101],
                   'Currency': ['GBP', 'EUR', 'USD', 'JPY', 'EUR', 'EUR'],
                   'Continent': ["Europe", "Europe", "North America", "Asia", "Europe", "Europe"],
                   'Main Language': ["English", "French", "English", "Japanese", "Italian", "Catalan"]})
7.1. Sort Values
Sorting data is one of the most common use cases in Pandas. A dataset may have millions of rows, and to analyze it we often have to sort it in either ascending or descending order.
The example below sorts the rows by continent in ascending order, and if two rows have the same continent, it sorts them by city in descending order within that continent.
df.sort_values(by = ['Continent','City'], ascending=[True,False])
The program output:

The continents have been sorted in ascending order, and where two rows share the same continent, the cities are sorted in descending order.
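One detail worth noting is that sort_values() returns a new DataFrame and leaves the original unchanged unless the result is assigned. The sketch below uses only the City and Continent columns from the table to show the resulting order.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ["London", "Paris", "New York", "Tokyo", "Rome", "Barcelona"],
    'Continent': ["Europe", "Europe", "North America", "Asia", "Europe", "Europe"],
})

# Continent ascending, then City descending within each continent.
sorted_df = df.sort_values(by=['Continent', 'City'], ascending=[True, False])

print(sorted_df['City'].tolist())
# ['Tokyo', 'Rome', 'Paris', 'London', 'Barcelona', 'New York']
print(df['City'].tolist()[0])  # the original order is untouched: 'London'
```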
7.2. Add New Columns
Adding new columns in a specific location is a very common task that most data analysts do on a daily basis. Generally, a new column is added at the end, but it can also be added at a specific location.
Suppose we want to add a new column called “Population Density” at location 3. Population density can be calculated as the city’s population divided by the city’s area.
df.insert(loc=3, column='Population Density', value=(df['City Population']/df['City Area']))
The program output:

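Unlike sort_values(), insert() modifies the DataFrame in place, and calling it a second time with the same column name raises an error by default. For comparison, the sketch below also appends a column at the end with plain assignment. The 'Density Rounded' column name is a hypothetical one chosen for illustration, and only three cities are used to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ["London", "Paris", "New York"],
    'City Population': [88000, 21452, 88190],
    'City Area': [1572, 105, 784],
    'Currency': ['GBP', 'EUR', 'USD'],
})

density = df['City Population'] / df['City Area']

# insert() places the new column at position 3 and changes df in place.
df.insert(loc=3, column='Population Density', value=density)

# Plain assignment appends a new column at the end (hypothetical name).
df['Density Rounded'] = density.round(2)

print(list(df.columns))
```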
7.3. Column Selection
Column selection based on the data type is also a very common operation. For example, we might need to select all the int64 columns to perform an operation that only applies to integers. There are also many cases where we have to select only the string columns to perform basic operations like converting them to uppercase or lowercase.
df.select_dtypes(include=['object'])
The above example will only include object columns, and in our case, four columns are object.
The program output:

In a similar manner, we can only include or exclude floats, numbers, objects, etc.
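As a sketch of including and excluding by type, the example below splits a small version of the table into numeric and non-numeric columns. It uses 'number', which select_dtypes() accepts as a shorthand for all numeric types.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ["London", "Paris", "New York"],
    'City Population': [88000, 21452, 88190],
    'City Area': [1572, 105, 784],
    'Currency': ['GBP', 'EUR', 'USD'],
})

numeric = df.select_dtypes(include='number')      # only numeric columns
non_numeric = df.select_dtypes(exclude='number')  # everything else

print(list(numeric.columns))
print(list(non_numeric.columns))
```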
7.4. Partial Match
Partial matches are used when we have to find all the columns whose names contain a particular string. Suppose we want all the columns whose names contain the string “Popu”. To do this, we can use the filter function.
df.filter(like='Popu')
The program output:

Remember that the string we pass is case-sensitive. We cannot pass popu because all the columns that contain the word Popu start with a capital P.
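When the case of the column names is not known in advance, one option is the regex parameter of filter() with a case-insensitive pattern. The sketch below is one way to do it. The (?i) prefix is standard Python regular expression syntax for ignoring case.

```python
import pandas as pd

df = pd.DataFrame({
    'City Population': [88000, 21452],
    'City Area': [1572, 105],
    'Population Density': [55.98, 204.30],
})

case_sensitive = df.filter(like='popu')          # matches no column names
case_insensitive = df.filter(regex='(?i)popu')   # ignores case

print(list(case_sensitive.columns))
print(list(case_insensitive.columns))
```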
8. Advantages
Pandas has several advantages that make it a popular choice for machine learning engineers and data scientists.
- Pandas provides a DataFrame object, which stores data in a 2D form like a spreadsheet. This helps in performing various column operations, which simplifies data analytics tasks.
- Pandas provides a wide range of built-in functions that help analyze data easily and efficiently. Many common tasks take only one or two lines of code.
- Pandas can handle fairly large datasets, as long as they fit in memory, and provides functions such as sorting, filtering, and grouping for working with them.
- Pandas gives us the flexibility to reshape and customize the data according to our needs for analysis.
- We can use Pandas with other libraries such as NumPy, Matplotlib, and SciPy to use their functionality together for better data analysis.
9. Limitations
Pandas has some limitations in addition to its advantages.
- Pandas requires a large amount of memory, especially when working with large datasets. This can be a problem if you have limited memory availability.
- There may be performance issues when working with large datasets. This is because most Pandas operations run on a single thread.
- Pandas depends on other libraries. NumPy is required and is installed automatically with Pandas, while Matplotlib is optional and has to be installed separately if we want to create plots.
- The syntax of Pandas can be confusing and complex for beginners, especially when working with special operations.
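To get a feel for the memory limitation, the memory used by a column can be measured with memory_usage(). The sketch below uses made-up data with many repeated strings and shows one common way to reduce memory in that situation, which is converting the column to the category type. How much is saved depends on the data and the Pandas version, so no exact numbers are given.

```python
import pandas as pd

# Hypothetical data: 30,000 rows with only three distinct currency codes.
currency = pd.Series(['GBP', 'EUR', 'USD'] * 10_000)
as_category = currency.astype('category')

# deep=True also counts the memory held by the string values themselves.
plain_bytes = currency.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
```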
10. Conclusion
In this Python tutorial, we have discussed how to get started with Pandas. We took a look at the data structures of Pandas, how to use Pandas for data visualization, followed by the advantages and limitations of Pandas.
Happy Learning !!