Using Pandas info() and describe() Methods Effectively

Before you analyze or model a large dataset, you need a solid grasp of what it contains. Pandas, the Python data analysis library, offers two methods that make this first look quick: `info()`, which summarizes the structure of a DataFrame, and `describe()`, which produces a statistical summary of its columns. In this article, we’ll explore how to use both methods effectively, so that later decisions and analysis rest on an accurate picture of the data.

Understanding the info() Method

The `info()` method in Pandas is used primarily to get a concise summary of a DataFrame. When you call this method on a DataFrame object, it provides important information such as the number of entries, the column count, the data type of each column, how many non-null values are present, and the memory usage. This method is particularly useful when you want to quickly understand the structure of a new dataset or when you need to verify data types and missing values.

Using info() to Explore Dataset Structure

Let’s look at an example of how to use the `info()` method. Consider a DataFrame `df` that includes various types of data. You would call the method as follows:


import pandas as pd

# Sample data creation
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 23, 34, 29],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [70000, 80000, 75000, 65000]
}

df = pd.DataFrame(data)
# info() prints the summary itself and returns None, so there is nothing to assign
df.info()

Running the above code will output something similar to:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    4 non-null      object 
 1   Age     4 non-null      int64  
 2   City    4 non-null      object 
 3   Salary  4 non-null      int64  
dtypes: int64(2), object(2)
memory usage: 256.0+ bytes

From this output, you immediately know the DataFrame has 4 entries (rows) and 4 columns, each of which is completely filled with non-null values. The data types are also shown, with `Name` and `City` being `object` type (the dtype Pandas typically uses for string data), and `Age` and `Salary` being `int64` type. The memory usage figure helps you judge whether processing will fit within the available resources. The trailing `+` marks it as a lower bound, because the contents of `object` columns are not measured by default.
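
The example above has no gaps, so the sketch below uses a small hypothetical DataFrame (`people`, not part of the original data) with one missing age to show how the non-null counts expose missing values. Because `info()` prints its report rather than returning it, the sketch captures the text through the `buf` parameter, which is one way to keep the report for logging or later inspection.

```python
import io

import pandas as pd

# Hypothetical sample with one missing age, to show how info() reports gaps
people = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, None, 34, 29],
})

# info() writes its report to a buffer when one is supplied via `buf`
buffer = io.StringIO()
people.info(buf=buffer)
report = buffer.getvalue()

# The Age row of the report shows fewer non-null values than there are entries
print(report)
```

Comparing each column's non-null count with the total number of entries is a quick way to spot which columns need attention before analysis.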

Understanding the describe() Method

The `describe()` method goes a step further by providing a statistical summary of the data. This method gives us a quick overview of the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. By default, `describe()` summarizes only numeric fields, returning information such as count, mean, standard deviation, min, max, and various percentiles for each column.

Running describe() on Numeric Data

Now, let’s take the DataFrame `df` and use the `describe()` method to get a statistical summary:


# Sample data already defined in the earlier df variable
df_description = df.describe()
print(df_description)

The output will provide a statistical summary for the numerical columns in the DataFrame:


             Age        Salary
count   4.000000      4.000000
mean   28.500000  72500.000000
std     4.509250   6454.972244
min    23.000000  65000.000000
25%    26.750000  68750.000000
50%    28.500000  72500.000000
75%    30.250000  76250.000000
max    34.000000  80000.000000

From this summary, we can gather a wealth of information regarding the `Age` and `Salary` columns in our dataset. For instance, we can see that the average age is approximately 28.5 years, with a standard deviation of approximately 4.5, which indicates the variability of the ages in the dataset.
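
Since `describe()` returns a DataFrame indexed by statistic name, individual values can be read from it with `.loc` instead of being copied from printed output. The following is a minimal sketch using the numeric columns of the sample data. The `percentiles` argument is an addition not covered above, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [28, 23, 34, 29],
    'Salary': [70000, 80000, 75000, 65000],
})

summary = df.describe()

# Rows are labeled by statistic, columns by the original column name
mean_age = summary.loc['mean', 'Age']
median_salary = summary.loc['50%', 'Salary']
print(mean_age, median_salary)

# Ask for the 10th and 90th percentiles in place of the default 25th and 75th
tails = df.describe(percentiles=[0.1, 0.9])
print(tails.index.tolist())
```

Reading values this way makes it straightforward to reuse a statistic in later code, for example as a threshold when filtering rows.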

Furthermore, you can customize the `describe()` method to include all columns, including those that contain categorical data. By passing `include='all'` to the method, Pandas will output statistics for non-numeric data types as well.


df_full_description = df.describe(include='all')
print(df_full_description)

The output will now provide additional information for non-numeric data types:


        Name        Age      City        Salary
count      4   4.000000         4      4.000000
unique     4        NaN         4           NaN
top     John        NaN  New York           NaN
freq       1        NaN         1           NaN
mean     NaN  28.500000       NaN  72500.000000
std      NaN   4.509250       NaN   6454.972244
min      NaN  23.000000       NaN  65000.000000
25%      NaN  26.750000       NaN  68750.000000
50%      NaN  28.500000       NaN  72500.000000
75%      NaN  30.250000       NaN  76250.000000
max      NaN  34.000000       NaN  80000.000000

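For categorical columns like `Name` and `City`, we get the count, the number of unique entries, the most common value (top), and the frequency of the most common value (freq). Because every name and city in this sample appears exactly once, `freq` is 1 and `top` is simply one of the tied values. Statistics that do not apply to a column's type are shown as `NaN`.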
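
The `top` and `freq` rows are easier to interpret when a value actually repeats. The sketch below uses a hypothetical `cities` Series (not part of the original data) in which one city appears twice, and compares the summary with `value_counts()` as a cross-check.

```python
import pandas as pd

# Hypothetical city column with one repeated value
cities = pd.Series(['Paris', 'Berlin', 'Paris', 'London'], name='City')

# For non-numeric data, describe() reports count, unique, top, and freq
city_summary = cities.describe()
print(city_summary)

# The most frequent value and its count should match top and freq
counts = cities.value_counts()
print(counts.index[0], counts.iloc[0])
```

Here `top` is the repeated city and `freq` is the number of times it occurs, which agrees with the first row of `value_counts()`.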

Optimizing Pandas info() and describe() for Large Datasets

Efficient Memory Usage with info()

When dealing with large datasets, it is important to keep memory usage under control. By default, `info()` reports only a lower bound for DataFrames with `object` columns, because it does not measure the Python objects (such as strings) those columns refer to. Passing `memory_usage='deep'` makes Pandas inspect those objects and report the actual memory consumption. The result is more accurate but takes longer to compute on large data.


df.info(memory_usage='deep')
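
To see what the deep option changes, the per-column figures can be compared using `DataFrame.memory_usage()`, which accepts a `deep` flag. This is a minimal sketch on a trimmed version of the sample data. The string column is created explicitly with `dtype=object` as an assumption for illustration, since the size of the gap depends on how strings are stored.

```python
import pandas as pd

# Build the string column explicitly as object dtype for this illustration
df = pd.DataFrame({
    'Name': pd.Series(['John', 'Anna', 'Peter', 'Linda'], dtype=object),
    'Age': [28, 23, 34, 29],
})

# Shallow: object columns are counted only by the references they hold
shallow = df.memory_usage(deep=False)
# Deep: the Python string objects behind those references are measured too
deep = df.memory_usage(deep=True)

print(shallow)
print(deep)
print(deep.sum() - shallow.sum(), 'additional bytes found by the deep scan')
```

Only the `object` column changes between the two measurements; the integer column reports the same size either way.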

Describing Specific Columns or Data Types with describe()

With large datasets, it might be impractical to get a statistical summary for all columns. The `include` and `exclude` parameters of `describe()` restrict the summary to particular data types. To summarize particular columns by name, select them first and then call `describe()` on the selection. For example, to display only the statistics for the numerical data, you would run:


import numpy as np
df.describe(include=[np.number])

Conversely, if you want to leave out numeric data types and focus only on object or categorical data, pass the same selector to `exclude` instead:


df.describe(exclude=[np.number])
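
The snippets above can be combined into one self-contained sketch that rebuilds the sample DataFrame and shows both approaches: selecting columns by name before calling `describe()`, and filtering by data type with `include` and `exclude`. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 23, 34, 29],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [70000, 80000, 75000, 65000],
})

# Summarize specific columns by selecting them first
age_salary = df[['Age', 'Salary']].describe()

# Summarize by data type with include / exclude
numeric = df.describe(include=[np.number])
non_numeric = df.describe(exclude=[np.number])

print(age_salary.columns.tolist())
print(numeric.columns.tolist())
print(non_numeric.columns.tolist())
```

On wide tables, narrowing the summary this way keeps the output readable and avoids computing statistics for columns you do not need.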

Conclusion

In summary, understanding how to use Pandas `info()` and `describe()` methods effectively is a fundamental skill for anyone working with data in Python. These methods facilitate a rapid assessment of data health and structure, providing essential insights into the dataset’s nature. Whether you are pre-processing data, exploring a new dataset, or simply trying to get a high-level overview of your data, `info()` and `describe()` can help streamline the process, saving time and laying the groundwork for more in-depth analysis.
