Python Pandas 101: A Beginner's Guide

Pandas is an open-source Python library used for data manipulation and analysis. Pandas can work with different types of data sets, including CSV files, Excel files, JSON files, XML files, and relational database tables.

By the end of this article, we will have a good understanding of the basics of Pandas. We will learn about the installation and setup, the data structures of Pandas, why Pandas is used, some of its limitations, and common use cases.

So, without further ado, let’s get started!

1. Installing Pandas

Before installing Pandas, make sure that a supported Python version is installed. At the time of writing, Pandas officially supports Python 3.8, 3.9, 3.10, and 3.11.

Let’s discuss two methods to install Pandas in our system.

1.1 Installing with pip

pip is the Python package installer, which makes it easy to install Python libraries and frameworks. If you have Python version 3.4 or higher, pip is included by default. Otherwise, you will need to install pip before installing Pandas.

Now, first launch the command prompt and type the following command:

pip install pandas

Hit Enter, and Pandas will start installing. Once the installation finishes, we can use Pandas in our Python programs.
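A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal check, and the version string printed will depend on what pip installed on your machine.

```python
import pandas as pd

# If the import succeeds, Pandas is installed for this Python interpreter.
# The exact version printed depends on your environment.
print(pd.__version__)
```

If the import fails with ModuleNotFoundError, pip most likely installed Pandas for a different Python interpreter than the one running the script.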

1.2 Installing with Anaconda

Another way to install Pandas is to install Anaconda. Anaconda is a Python distribution that bundles Python with many commonly used data science tools.

When you install Anaconda, it installs many popular libraries, including Pandas, automatically. To install Anaconda, first download it using this link: https://www.anaconda.com/download.

Now, launch it, but remember to check the following boxes:

Click on the Install button and wait for the installation to complete. Once Anaconda is installed, we can use Pandas from the Windows command prompt, the VS Code editor, or the PowerShell prompt (one of the tools available in the Anaconda Navigator).

If you are going to use Pandas, a Jupyter Notebook is a convenient environment for it. Jupyter Notebook is a web-based, interactive computing notebook that lets us run code cell by cell and see the results immediately.

To use Jupyter, open Anaconda Navigator in your system and open the Jupyter Notebook. You can see the Jupyter Notebook option in the image below.

Just click on the Launch button and the notebook will open on the localhost page, as we can see below.

We can click on the New button and select Python 3. We are now ready to use the Jupyter Notebook.

3. Data Structures in Pandas

Data structures in Pandas are very useful for processing and analyzing data. Pandas has two key data structures:

  • Series
  • DataFrame

Let us learn more about these data structures in detail.

4. Series in Pandas

A Series is a one-dimensional, array-like structure that can hold any data type, including numbers, booleans, strings, and arbitrary Python objects. Each value has an associated index label.

4.1. Creating a Series

To create a Series, you can use the pd.Series() function and pass a list or array of values.

import pandas as pd
arr = ['J', 'A', 'V', 'A']
result = pd.Series(arr)
print(result)

The program output:

0    J
1    A
2    V
3    A

We can also create a Series with custom indexes.

import pandas as pd
result = pd.Series(['J', 'A', 'V', 'A'], index = [5, 4, 3, 2])
print(result)

The program output:

5    J
4    A
3    V
2    A
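Index labels do not have to be integers. The sketch below uses hypothetical names and scores (not part of the original example) to show that a value can then be looked up either by its label or by its position.

```python
import pandas as pd

# Hypothetical data: the names and scores are made up for illustration.
scores = pd.Series([90, 75, 82], index=["alice", "bob", "carol"])

# Look up a value by its index label.
print(scores["bob"])        # 75
print(scores.loc["carol"])  # 82

# Look up a value by its integer position, regardless of the labels.
print(scores.iloc[0])       # 90
```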

4.2. Accessing the Values

To access the values and indexes from a Series, we can use the .values and .index attributes, respectively.

import pandas as pd
arr = ['J', 'A', 'V', 'A']
result = pd.Series(arr)

print(result.index)
print(result.values)

The program output:

RangeIndex(start=0, stop=4, step=1)
['J' 'A' 'V' 'A']

We can also access a subset of the values by slicing the Series. Suppose we want only the first three elements from a Series of six elements.

import pandas as pd
s = pd.Series(['P', 'Y', 'T', 'H', 'O', 'N'])
print(s[0:3])

The program output:

0    P
1    Y
2    T

Suppose we have a Series of numbers and we only want to print the elements that are greater than 3. We can pass a boolean condition inside the square brackets.

import pandas as pd
s = pd.Series([3, 4, 5, 1, 2])
print(s[s > 3])

The program output:

1    4
2    5
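Conditions like the one above can also be combined. The following sketch reuses the same Series and assumes we want to filter on two conditions at once, using & for "and" and | for "or".

```python
import pandas as pd

s = pd.Series([3, 4, 5, 1, 2])

# Keep values greater than 1 AND less than 5.
# Each condition is wrapped in parentheses because & and | bind
# more tightly than the comparison operators.
middle = s[(s > 1) & (s < 5)]
print(middle.tolist())  # [3, 4, 2]

# Keep values equal to 1 OR equal to 5.
edges = s[(s == 1) | (s == 5)]
print(edges.tolist())   # [5, 1]
```

Note that the Python keywords and/or do not work element by element on a Series, which is why the & and | operators are used here.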

5. DataFrame in Pandas

A Pandas DataFrame is a two-dimensional data structure that can hold multiple Series. It is like a SQL table or spreadsheet, with multiple rows and columns, and can hold different types of data.

5.1. Creating a DataFrame

A common way to create a DataFrame is by using a list of lists. We simply have to pass the list of lists to the pd.DataFrame() function. Each inner list represents a row in the DataFrame, and each element of an inner list is the value for one column. Since we do not pass column names, Pandas numbers the columns starting from 0.

import pandas as pd

data = [['A', 'B', 'C'], ['D', 'E', 'F']]
df = pd.DataFrame(data)

print(df)

The program output:

   0  1  2
0  A  B  C
1  D  E  F
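When building a DataFrame from a list of lists, we can also name the columns with the columns argument. The column names below are hypothetical and chosen only for illustration.

```python
import pandas as pd

data = [['A', 'B', 'C'], ['D', 'E', 'F']]

# 'first', 'second', and 'third' are made-up column names.
df = pd.DataFrame(data, columns=['first', 'second', 'third'])

print(df.columns.tolist())  # ['first', 'second', 'third']
print(df.shape)             # (2, 3): two rows, three columns
```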

Another way to create a DataFrame is to use a dictionary. The keys of the dictionary will represent the column names of the DataFrame, while the values of the dictionary will represent the column data.

import pandas as pd

data = {
    "Name": ['Satyam', 'John', 'Kelvin'],
    "Age": [22, 30, 28]
}
df = pd.DataFrame(data)

print(df)

The program output:

     Name  Age
0  Satyam   22
1    John   30
2  Kelvin   28

5.2. Loading a DataFrame from CSV File

To load data from a file into pandas, we can use the read_csv() function. The read_csv() function takes the file path as its first argument, and it also takes other arguments that specify how the data should be loaded.

import pandas as pd
data = pd.read_csv("C:\\Users\\Downloads\\test_data.csv")
print(data)

The program output:

      Name   Age     Sex   Salary
0   Satyam    22    Male    20000
1     John    21    Male    21000
2  Nicolas    23  Female    45000
3    Julia    32  Female    62000
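The other arguments of read_csv() control what gets loaded. The sketch below reads from an in-memory string instead of a file so that it runs anywhere; the CSV text is an assumption modeled on the output shown above, not the actual contents of test_data.csv.

```python
import io
import pandas as pd

# Assumed CSV contents, modeled on the output shown above.
csv_text = """Name,Age,Sex,Salary
Satyam,22,Male,20000
John,21,Male,21000
Nicolas,23,Female,45000
Julia,32,Female,62000
"""

# usecols keeps only the listed columns; nrows limits how many rows are read.
data = pd.read_csv(io.StringIO(csv_text), usecols=["Name", "Salary"], nrows=3)

print(data.shape)            # (3, 2)
print(data["Salary"].sum())  # 86000
```

With a real file, the io.StringIO(csv_text) argument would simply be replaced by the file path.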

5.3. Selecting Records in DataFrame

To access a column of a DataFrame, we can pass the column name in square brackets. The result is a Series.

import pandas as pd

data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}

df = pd.DataFrame(data)

print(df['Name'])

The program output:

0    Satyam
1     Julia
2     Jacky
3     David

We can also use the .loc and .iloc indexers to access values in a DataFrame. The .loc indexer selects rows and columns by label, while the .iloc indexer selects them by integer position. In the example below, df.loc[2] works because the default row labels are the integers 0 to 3.

import pandas as pd

data = {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
        'Age': [22, 31, 36, 26],
        'Gender': ['M', 'F', 'F', 'M']}

df = pd.DataFrame(data)

print(df.loc[2])
print(df.iloc[[0, 2], [1, 2]])

The program output:

Name      Jacky
Age          36
Gender        F

   Age Gender
0   22      M
2   36      F
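The difference between the two becomes clearer when the row labels are not integers. The following sketch reuses the same data but assigns hypothetical string labels 'a' to 'd' as the index.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Satyam', 'Julia', 'Jacky', 'David'],
     'Age': [22, 31, 36, 26],
     'Gender': ['M', 'F', 'F', 'M']},
    index=['a', 'b', 'c', 'd'],  # hypothetical labels for illustration
)

# .loc uses labels; .iloc uses integer positions.
print(df.loc['c', 'Name'])  # Jacky
print(df.iloc[2, 0])        # Jacky

# A .loc slice includes both endpoints; an .iloc slice excludes the stop.
print(len(df.loc['a':'c']))  # 3
print(len(df.iloc[0:2]))     # 2

# .loc also accepts a boolean condition for the rows.
older = df.loc[df['Age'] > 30, ['Name', 'Age']]
print(older['Name'].tolist())  # ['Julia', 'Jacky']
```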

6. Data Visualization with Pandas

Pandas can be used together with Matplotlib to create basic data visualizations such as line plots, bar plots, and scatter plots. To do this, we need to import another library called Matplotlib (install it with pip install matplotlib if it is not already present). Pandas is used to hold the data in a DataFrame, and Matplotlib is used to draw the visualization.

6.1. Line Plot

A line plot is a type of graph that is used to show the relationship between two variables. It is represented by a line.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Time": [10, 11, 12, 13, 14],
    "Temperature": [30, 32, 38, 40, 35]
})

plt.plot(df["Time"], df["Temperature"])

plt.xlabel("Time")
plt.ylabel("Temperature")

plt.show()

The program output is a line plot of temperature against time.

In this example, we used Pandas to create a DataFrame and Matplotlib to draw the plot. We put time on the X-axis and temperature on the Y-axis, labeled both axes, and then called the show() function to display the line plot.

6.2. Bar Plot

Bar plots are used to represent the data in the form of rectangular bars.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Month": ["Jan", "Feb", "Mar", "Apr", "May"],
    "Sales": [100, 150, 300, 180, 210]
})

plt.bar(df["Month"], df["Sales"])

plt.xlabel("Month")
plt.ylabel("Sales")

plt.show()

The program output is a bar chart with one bar per month, where the height of each bar is the sales value.

6.3. Scatter Plot

Scatter plots are used to show the relationship between the variables and it uses dots to represent the relationship.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Income": [45000, 50000, 65000, 60000, 70000]
})

plt.scatter(df["Age"], df["Income"])

plt.xlabel("Age")
plt.ylabel("Income")

plt.show()

The program output is a scatter plot with one dot per row, with age on the X-axis and income on the Y-axis.

7. Data Manipulation with Pandas

The Pandas library has various popular use cases, which is why it is preferred by machine learning engineers and data scientists. We are going to use a small table of city data to understand some of them.

We can create the table using this code:

import pandas as pd

df = pd.DataFrame({'City': ["London", "Paris", "New York", "Tokyo", "Rome", "Barcelona"],
                   'City Population': [88000, 21452, 88190, 139223, 28740, 16255],
                   'City Area': [1572, 105, 784, 219, 1285, 101],
                   'Currency':['GBP','EUR','USD','JPY','EUR', 'EUR'],
                   'Continent':["Europe", "Europe", "North America", "Asia", "Europe", "Europe"],
                   'Main Language': ["English", "French", "English", "Japanese", "Italian", "Catalan"]})

7.1. Sort Values

Sorting data is one of the most common use cases in Pandas. Real datasets can have millions of rows, and to analyze them, we often have to sort the data in either ascending or descending order.

The example below sorts the rows by continent in ascending order, and where two rows have the same continent, it sorts them by city in descending order.

df.sort_values(by = ['Continent','City'], ascending=[True,False])

In the output, the continents are sorted in ascending order (Asia, Europe, North America). Where the continent is the same, the cities are sorted in descending order, so the European cities appear as Rome, Paris, London, Barcelona.
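One detail worth noting is that sort_values() returns a new DataFrame and leaves the original untouched. The sketch below uses only the City and Continent columns of the table to keep it short.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ["London", "Paris", "New York", "Tokyo", "Rome", "Barcelona"],
    'Continent': ["Europe", "Europe", "North America", "Asia", "Europe", "Europe"],
})

# The sorted result must be stored in a variable to be reused later.
sorted_df = df.sort_values(by=['Continent', 'City'], ascending=[True, False])

print(sorted_df['City'].tolist())
# ['Tokyo', 'Rome', 'Paris', 'London', 'Barcelona', 'New York']

# The original DataFrame keeps its original row order.
print(df['City'].tolist()[0])  # London
```

The sorted rows keep their original index labels. Calling reset_index(drop=True) on the result renumbers them from 0 if that is preferred.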

7.2. Add New Columns

Adding new columns in a specific location is a very common task that most data analysts do on a daily basis. Generally, a new column is added at the end, but it can also be added at a specific location.

Suppose we want to add a new column called “Population Density” at location 3. Population density can be calculated as the city’s population divided by the city’s area.

df.insert(loc=3, column='Population Density', value=(df['City Population']/df['City Area']))

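Note that insert() modifies the DataFrame in place and returns nothing. When we display df afterwards, the new Population Density column appears between City Area and Currency.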
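The in-place behavior of insert() can be checked with a short sketch. It uses a reduced version of the city table (two rows, four columns) so that it stays self-contained.

```python
import pandas as pd

# A reduced version of the city table, for illustration only.
df = pd.DataFrame({
    'City': ["London", "Paris"],
    'City Population': [88000, 21452],
    'City Area': [1572, 105],
    'Currency': ['GBP', 'EUR'],
})

# insert() changes df itself and returns None.
result = df.insert(loc=3, column='Population Density',
                   value=df['City Population'] / df['City Area'])

print(result)  # None
print(df.columns.tolist())
# ['City', 'City Population', 'City Area', 'Population Density', 'Currency']
```

Because the change is made in place, running the same insert() call a second time raises a ValueError, since the column already exists.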

7.3. Column Selection

Column selection based on the datatype is also a very common operation. For example, we might need to select all the int64 data types to perform some operation with the data type int64 only. There are also many cases where we have to select only the strings to perform some basic operations like converting them to uppercase or lowercase.

df.select_dtypes(include=['object'])

The above example selects only the object (string) columns. In our case, four columns are of type object: City, Currency, Continent, and Main Language.

In a similar manner, we can only include or exclude floats, numbers, objects, etc.
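As a sketch of the other selections mentioned here, the example below uses a reduced version of the city table with a computed Population Density column, and selects columns by the 'number' and 'float64' type names.

```python
import pandas as pd

# A reduced version of the city table, for illustration only.
df = pd.DataFrame({
    'City': ["London", "Paris", "New York"],
    'City Population': [88000, 21452, 88190],
    'City Area': [1572, 105, 784],
})
df['Population Density'] = df['City Population'] / df['City Area']

# 'number' matches every numeric column, both integers and floats.
numeric = df.select_dtypes(include='number')
print(numeric.columns.tolist())
# ['City Population', 'City Area', 'Population Density']

# exclude keeps everything except the listed types.
non_numeric = df.select_dtypes(exclude='number')
print(non_numeric.columns.tolist())  # ['City']

# A specific type name selects only columns of that exact type.
floats = df.select_dtypes(include='float64')
print(floats.columns.tolist())  # ['Population Density']
```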

7.4. Partial Match

Partial matches are used when we have to find all the columns whose names contain a particular string. Suppose we want all the columns that have the string “Popu” in their names. To do this, we can use the filter() function with the like argument.

df.filter(like='Popu')

The result contains only the columns whose names include Popu, such as City Population.

Remember that the string we pass is case-sensitive. We cannot pass popu because all the columns that contain the word Popu start with a capital P.
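If a case-insensitive match is needed, one option is the regex argument of filter() with the inline (?i) flag. The sketch below shows this on a reduced version of the city table.

```python
import pandas as pd

# A reduced version of the city table, for illustration only.
df = pd.DataFrame({
    'City': ["London", "Paris"],
    'City Population': [88000, 21452],
    'City Area': [1572, 105],
})

# like= is a plain, case-sensitive substring match on column names.
print(df.filter(like='Popu').columns.tolist())  # ['City Population']
print(df.filter(like='popu').columns.tolist())  # []

# regex= accepts a regular expression; (?i) makes it case-insensitive.
print(df.filter(regex='(?i)popu').columns.tolist())  # ['City Population']

# A regular expression can also anchor the match to the start of the name.
print(df.filter(regex='^City').columns.tolist())
# ['City', 'City Population', 'City Area']
```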

8. Advantages

Pandas has several advantages that make it a better choice for machine learning engineers and data scientists.

  • Pandas provides a DataFrame object, which stores data in a 2D form like a spreadsheet. This helps in performing various column operations, which simplifies data analytics tasks.
  • Pandas provides a wide range of built-in functions that help analyze data easily and efficiently. Many common tasks take only one or two lines of code.
  • Pandas can handle fairly large datasets, as long as they fit in memory, and provides functions that help when working with large amounts of data.
  • Pandas gives us the flexibility to reshape and customize the data according to our analysis needs.
  • We can use Pandas with other libraries such as NumPy, Matplotlib, and SciPy to use their functionality together for better data analysis.

9. Limitations

Pandas has some limitations in addition to its advantages.

  • Pandas requires a large amount of memory, especially when working with large datasets. This can be a problem if you have limited memory availability.
  • There may be performance issues when working with large datasets. This is because most Pandas operations run on a single thread.
  • Pandas depends on other libraries. NumPy is a required dependency and is installed automatically with Pandas, while Matplotlib is optional and must be installed separately for plotting.
  • The syntax of Pandas can be confusing and complex for beginners, especially when working with special operations.
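The memory limitation can often be reduced by choosing more compact column types. The following is a minimal sketch under stated assumptions: it repeats two columns of the city table 1,000 times to get a larger DataFrame, and the actual savings on real data will vary.

```python
import pandas as pd

# Repeat a few values many times to simulate a larger dataset.
df = pd.DataFrame({
    'Currency': ['GBP', 'EUR', 'USD', 'JPY', 'EUR', 'EUR'] * 1000,
    'City Area': [1572, 105, 784, 219, 1285, 101] * 1000,
})

# deep=True also counts the memory used by the string values.
before = df.memory_usage(deep=True).sum()

# A column with few distinct strings is stored more compactly as a category.
df['Currency'] = df['Currency'].astype('category')

# Downcast integers to the smallest integer type that can hold the values.
df['City Area'] = pd.to_numeric(df['City Area'], downcast='integer')

after = df.memory_usage(deep=True).sum()
print(before, after)  # the second number is smaller
```

The category type helps most when a column has few distinct values compared to its number of rows.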

10. Conclusion

In this Python tutorial, we have discussed how to get started with Pandas. We took a look at installing Pandas, its data structures, and how to use it for data visualization and data manipulation, followed by the advantages and limitations of Pandas.

Happy Learning !!
