Pandas Basics in Python

Pandas is an open-source Python library for manipulating, analyzing, and cleaning data. It is built around two objects: the Series (1-dimensional) and the DataFrame (2-dimensional).

Pandas Data Structures:

Pandas provides two primary data structures:

  • Series: A one-dimensional labeled array capable of holding any data type.
  • DataFrame: A two-dimensional labeled data structure with columns of potentially different types.

Creating a Pandas Series:

A Pandas Series can be created from a list, a dictionary, or a NumPy ndarray.

import pandas as pd
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)

# output
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64
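
The example above builds a Series from a list. As a minimal sketch of the other two inputs mentioned, the following builds one Series from a dictionary and one from a NumPy array. The variable names, values, and index labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: the keys become the index labels.
prices = pd.Series({"apple": 1.5, "banana": 0.5, "cherry": 3.0})
print(prices["banana"])  # look up a value by its label

# From a NumPy array, with explicit (hypothetical) index labels.
readings = pd.Series(np.array([10, 20, 30]), index=["a", "b", "c"])
print(readings)
```

If no index is passed, Pandas numbers the entries from 0, as in the list example.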

Creating a Pandas DataFrame:

A Pandas DataFrame can be created from a [dictionary](https://learngolangonline.com/python/dictionaries), a list of dictionaries, or a NumPy ndarray.

import pandas as pd
data = {'Name': ['John', 'Steve', 'Sarah', 'Mike'], 'Age': [25, 30, 28, 35], 'Salary': [50000, 60000, 55000, 70000]}
df = pd.DataFrame(data)
print(df)

Output:

    Name  Age  Salary
0   John   25   50000
1  Steve   30   60000
2  Sarah   28   55000
3   Mike   35   70000
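
The example above uses a dictionary of lists, where each key is a column. A short sketch of the other two inputs follows, assuming small made-up data: with a list of dictionaries each dictionary is one row, and with a 2-D array the column names are passed separately.

```python
import numpy as np
import pandas as pd

# From a list of dictionaries: each dictionary is one row.
rows = [
    {"Name": "John", "Age": 25},
    {"Name": "Steve", "Age": 30},
]
df_rows = pd.DataFrame(rows)

# From a 2-D NumPy array: column names ("x" and "y" are hypothetical)
# are supplied through the columns argument.
matrix = np.array([[1, 2], [3, 4], [5, 6]])
df_matrix = pd.DataFrame(matrix, columns=["x", "y"])

print(df_rows)
print(df_matrix)
```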

Pandas Basic Operations:

  1. Indexing:

    Pandas provides different ways to index and select data. The loc indexer is used for label-based indexing, and the iloc indexer is used for integer position-based indexing. Both are used with square brackets rather than called like functions.

    import pandas as pd
    data = {'Name': ['John', 'Steve', 'Sarah', 'Mike'], 'Age': [25, 30, 28, 35], 'Salary': [50000, 60000, 55000, 70000]}
    df = pd.DataFrame(data)
    print(df.loc[0])  # select the row whose index label is 0
    print(df.iloc[0])  # select the row at position 0 (the first row)
    print(df['Age'])  # select 'Age' column
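
With the default 0, 1, 2, ... index, loc and iloc return the same row, which hides the difference between them. The sketch below assumes a DataFrame with hypothetical string labels so the two can be told apart.

```python
import pandas as pd

df = pd.DataFrame(
    {"Age": [25, 30, 28]},
    index=["john", "steve", "sarah"],  # hypothetical string labels
)

print(df.loc["steve", "Age"])  # by label: row "steve", column "Age"
print(df.iloc[1, 0])           # by position: second row, first column

# Slices differ: loc includes the end label, iloc excludes the end position.
by_label = df.loc["john":"steve"]
by_position = df.iloc[0:1]
print(by_label)
print(by_position)
```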
    
  2. Filtering:

    We can filter rows based on certain conditions.

    import pandas as pd
    data = {'Name': ['John', 'Steve', 'Sarah', 'Mike'], 'Age': [25, 30, 28, 35], 'Salary': [50000, 60000, 55000, 70000]}
    df = pd.DataFrame(data)
    print(df[df['Age'] > 28])  # filter rows where Age is greater than 28
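
Conditions can also be combined. The following is a small sketch using the same sample data; the thresholds and names in the filters are chosen only for illustration.

```python
import pandas as pd

data = {
    "Name": ["John", "Steve", "Sarah", "Mike"],
    "Age": [25, 30, 28, 35],
    "Salary": [50000, 60000, 55000, 70000],
}
df = pd.DataFrame(data)

# Combine conditions with & (and) or | (or).
# Each condition needs its own parentheses.
mask = (df["Age"] > 25) & (df["Salary"] < 70000)
print(df[mask])

# isin() keeps rows whose value appears in the given list.
selected = df[df["Name"].isin(["John", "Mike"])]
print(selected)
```

The Python keywords `and` and `or` do not work here, because the comparison produces a whole Series of True/False values rather than a single boolean.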
    
  3. Adding and Removing Rows/Columns:

    We can add and remove rows and columns from a DataFrame.

    import pandas as pd
    
    # create a sample dataframe
    data = {'name': ['John', 'Emily', 'Kate'], 'age': [25, 30, 35], 'city': ['New York', 'Paris', 'London']}
    df = pd.DataFrame(data)
    
    # add a new row
    df.loc[3] = ['David', 28, 'Tokyo']
    print(df)
    
    # add a new column
    df['country'] = ['USA', 'France', 'UK', 'Japan']
    print(df)
    
    # remove a row
    df = df.drop(2)
    print(df)
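
The example above removes a row but not a column. As a minimal sketch, a column can be removed with the same drop() method by passing the columns argument. The variable names below are illustrative.

```python
import pandas as pd

data = {
    "name": ["John", "Emily", "Kate"],
    "age": [25, 30, 35],
    "city": ["New York", "Paris", "London"],
}
df = pd.DataFrame(data)

# drop() returns a new DataFrame; the original is left unchanged.
without_city = df.drop(columns=["city"])
print(without_city)

# After dropping a row, the remaining index labels keep their old values.
# reset_index(drop=True) renumbers them from 0.
without_emily = df.drop(1).reset_index(drop=True)
print(without_emily)
```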
    
  4. Loading data:

    Pandas can load data from various sources including CSV, Excel, SQL, and more. The read_csv() function is commonly used to load data from CSV files into a Pandas DataFrame. For example, to load a CSV file named "data.csv" into a DataFrame, we can use the following code:

    import pandas as pd
    df = pd.read_csv('data.csv')
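
To try read_csv() without a file on disk, the sketch below reads CSV text from an in-memory buffer instead. The column names and values are made up; with a real file, the path would be passed in place of the buffer.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (hypothetical data).
csv_text = "name,age\nJohn,25\nSteve,30\n"

# read_csv() accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(df.dtypes)  # column types are inferred from the data
```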
    
  5. Viewing data:

    To view the data in a DataFrame, we can use the head() or tail() methods to view the first or last rows, respectively. Both return 5 rows by default and accept a different number as an argument. For example, to view the first 5 rows of a DataFrame named "df", we can use the following code:

    print(df.head())
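
A slightly fuller sketch of inspecting a DataFrame, assuming a made-up ten-row frame with a single hypothetical column named "value":

```python
import pandas as pd

df = pd.DataFrame({"value": range(10)})

print(df.head(3))  # first three rows
print(df.tail(2))  # last two rows
print(df.shape)    # (number of rows, number of columns)
df.info()          # column names, non-null counts, and dtypes
```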
    
  6. Data selection:

    Pandas provides various ways to select data from a DataFrame. We can select columns by name, select rows using boolean indexing, and select subsets of rows and columns using the loc[] and iloc[] indexers. For example, to select a column named "column_name" from a DataFrame named "df", we can use the following code:

    column = df['column_name']
    

    To select rows based on a condition, we can use boolean indexing. For example, to select rows where a column named "column_name" equals a certain value, we can use the following code:

    subset = df[df['column_name'] == value]
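
The snippets above use placeholder names. The following runnable sketch applies the same ideas to the sample data from earlier, and adds selection of rows and columns together with loc[] and iloc[]. The salary threshold is arbitrary.

```python
import pandas as pd

data = {
    "Name": ["John", "Steve", "Sarah", "Mike"],
    "Age": [25, 30, 28, 35],
    "Salary": [50000, 60000, 55000, 70000],
}
df = pd.DataFrame(data)

# Several columns at once: pass a list of names.
names_and_ages = df[["Name", "Age"]]

# Rows by condition and columns by name in one step with loc.
high_earners = df.loc[df["Salary"] >= 60000, ["Name", "Salary"]]

# Rows and columns by position with iloc: first two rows, first two columns.
top_left = df.iloc[:2, :2]

print(names_and_ages)
print(high_earners)
print(top_left)
```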
    
  7. Data manipulation:

    Pandas provides various methods for manipulating data in a DataFrame. We can add, remove, or modify columns, and perform mathematical operations on the data. For example, to add a new column named "new_column" to a DataFrame named "df" that is the sum of two other columns, we can use the following code:

    df['new_column'] = df['column1'] + df['column2']
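
A runnable sketch of the add, modify, and rename steps, assuming a small made-up DataFrame with the column names used above. The name "renamed" is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": [10, 20, 30]})

# Add a column as the element-wise sum of two others.
df["new_column"] = df["column1"] + df["column2"]

# Modify an existing column by assigning back to it.
df["column1"] = df["column1"] * 2

# Rename a column; rename() returns a new DataFrame.
df = df.rename(columns={"column2": "renamed"})

print(df)
```

Arithmetic on columns is applied row by row, so no explicit loop is needed.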
    
  8. Data aggregation:

    Pandas provides various methods for aggregating data in a DataFrame. We can group data by a column and calculate statistics on the groups, or use pivot tables to summarize the data. For example, to group a DataFrame named "df" by a column named "column_name" and calculate the mean value of another column named "column2", we can use the following code:

    grouped_data = df.groupby('column_name')['column2'].mean()
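
As a minimal sketch of both approaches mentioned, the example below computes several statistics per group with agg() and then builds a pivot table. The "team", "year", and "sales" columns and their values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [100, 150, 200, 250],
})

# Several statistics per group in one call.
summary = df.groupby("team")["sales"].agg(["mean", "sum"])
print(summary)

# Pivot table: one row per team, one column per year.
pivot = df.pivot_table(index="team", columns="year", values="sales", aggfunc="sum")
print(pivot)
```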