Handling Excel Files in Python with pandas and openpyxl

Handling Excel files is a common task in data-driven applications, especially for data analysis and reporting. Python offers two libraries that cover most of this work: pandas, for loading and transforming tabular data, and openpyxl, for working with workbooks, sheets, and cells directly. This guide shows how to read and write Excel files with each library, and how to combine them so that pandas prepares the data and openpyxl handles the Excel-specific finishing.

Managing Excel Files with Python

Python’s ecosystem includes several libraries for handling Excel files. pandas is the most widely used for data manipulation, while openpyxl reads and writes Excel 2010 xlsx/xlsm/xltx/xltm files and gives access to workbook-level features. Used together, they let data engineers and analysts handle almost any data processing task involving Excel files.

Understanding Pandas

Pandas is a powerful data manipulation library in Python, known for its data structures, Series and DataFrames, which make data handling and analysis straightforward. It offers various IO capabilities to read from and write to Excel files, benefiting from a user-friendly interface and efficient operations.

Reading Excel Files with Pandas

To read an Excel file into a DataFrame, use the `read_excel()` function. It accepts a local path or a URL, and its `sheet_name` argument selects which sheet to load. For `.xlsx` files pandas uses openpyxl as the reading engine, so openpyxl needs to be installed alongside pandas.


import pandas as pd

# Reading a single sheet from an Excel file
df = pd.read_excel('sales_data.xlsx', sheet_name='Sheet1')

# Displaying the DataFrame
print(df.head())


        Date Product  Sales
0 2021-01-01  Widget     50
1 2021-01-02  Widget     70
2 2021-01-03  Gadget     40
3 2021-01-04  Widget     30
4 2021-01-05  Gadget     80
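
`read_excel()` also takes arguments that limit what gets loaded. The following is a minimal sketch, not a definitive recipe: it first writes a small throwaway workbook so that it can run on its own (the temporary path and the sample rows are made up for illustration), then reads back only some columns and rows using `usecols`, `nrows`, and `dtype`.

```python
import os
import tempfile

import pandas as pd

# Build a small sample workbook so the example is self-contained.
# The path and the rows are illustrative assumptions, not real data.
sample = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"]),
    "Product": ["Widget", "Widget", "Gadget"],
    "Sales": [50, 70, 40],
})
path = os.path.join(tempfile.mkdtemp(), "sales_data.xlsx")
sample.to_excel(path, sheet_name="Sheet1", index=False)

# Load two columns only, stop after two data rows, and read Sales as float.
subset = pd.read_excel(
    path,
    sheet_name="Sheet1",
    usecols=["Product", "Sales"],
    nrows=2,
    dtype={"Sales": "float64"},
)
print(subset)
```

Restricting columns and rows at read time is usually cheaper than loading the whole sheet and filtering afterwards, which matters most for wide or long workbooks.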

To read multiple sheets from an Excel file, you can specify a list of sheet names, or use `None` to read all sheets into a dictionary of DataFrames.


# Reading multiple sheets
dfs = pd.read_excel('sales_data.xlsx', sheet_name=['Sheet1', 'Sheet2'])

# Reading all sheets
all_sheets = pd.read_excel('sales_data.xlsx', sheet_name=None)
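
When several sheets are requested, the result is a dictionary keyed by sheet name rather than a single DataFrame. The sketch below illustrates one way to work with that dictionary; it creates its own two-sheet workbook first (sheet contents and the temporary path are assumptions for the example) and then stacks the sheets into one DataFrame.

```python
import os
import tempfile

import pandas as pd

# Create a two-sheet workbook to read back (illustrative data only).
path = os.path.join(tempfile.mkdtemp(), "sales_data.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"Product": ["Widget"], "Sales": [50]}).to_excel(
        writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"Product": ["Gadget"], "Sales": [40]}).to_excel(
        writer, sheet_name="Sheet2", index=False)

# sheet_name=None returns a dict mapping sheet name -> DataFrame.
all_sheets = pd.read_excel(path, sheet_name=None)
for name, frame in all_sheets.items():
    print(name, frame.shape)

# Stack all sheets into one DataFrame, keeping the sheet name as a column.
combined = (
    pd.concat(all_sheets, names=["sheet", "row"])
      .reset_index(level="sheet")
)
print(combined)
```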

Writing to Excel Files with Pandas

The `to_excel()` method in pandas allows you to export DataFrames to Excel files. This method requires the path to save the file and can write data to specified sheets as well.


# Writing to a single sheet
df.to_excel('output.xlsx', index=False)

# Writing to multiple sheets
with pd.ExcelWriter('multi_sheet_output.xlsx') as writer:
    dfs['Sheet1'].to_excel(writer, sheet_name='Sheet1')
    dfs['Sheet2'].to_excel(writer, sheet_name='Financials')

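`to_excel()` also accepts arguments that customize the output, such as `index_label` to name the index column, `float_format` to control how floating-point values are written, `columns` to choose which columns to export, `startrow` and `startcol` to offset the table, and `freeze_panes` to keep header rows visible.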
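
As a hedged illustration of these output options, the sketch below writes a small DataFrame with a few `to_excel()` arguments set, then opens the result with openpyxl to check where things landed. The data, sheet name, and temporary path are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook

# Illustrative data with a meaningful index.
df = pd.DataFrame(
    {"Product": ["Widget", "Gadget"], "Sales": [50.456, 40.0]},
    index=pd.Index([101, 102]),
)
path = os.path.join(tempfile.mkdtemp(), "output.xlsx")

df.to_excel(
    path,
    sheet_name="Report",
    index_label="Order ID",   # header text for the index column
    float_format="%.2f",      # round floats to two decimals on write
    startrow=1,               # leave the first worksheet row empty
    freeze_panes=(2, 0),      # freeze the two rows above the data
)

# Read the file back with openpyxl to inspect the result.
ws = load_workbook(path)["Report"]
print(ws["A2"].value, ws["C3"].value, ws.freeze_panes)
```

Because `startrow` is zero-based, the header ends up on worksheet row 2 and the first data row on row 3.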

Diving into openpyxl

While pandas handles data manipulation and format conversion, `openpyxl` works at the level of the Excel file itself: workbooks, worksheets, and individual cells. That makes it the better fit for tasks such as cell formatting, formulas, and charts, and it can both create new Excel files and modify existing ones.

Reading Excel Files with openpyxl

Openpyxl enables you to work with Excel files by exposing a workbook object that represents the file. To read data, access the specific sheet required and iterate through its contents.


from openpyxl import load_workbook

# Load an existing workbook
workbook = load_workbook(filename='sales_data.xlsx')

# Printing all sheet names
print(workbook.sheetnames)

# Accessing a specific sheet
sheet = workbook['Sheet1']

# Iterate over rows and print them
for row in sheet.iter_rows(values_only=True):
    print(row)


('Date', 'Product', 'Sales')
('2021-01-01', 'Widget', 50)
('2021-01-02', 'Widget', 70)
('2021-01-03', 'Gadget', 40)
('2021-01-04', 'Widget', 30)
('2021-01-05', 'Gadget', 80)
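
Rows come back as plain tuples, so the header has to be paired with the values by hand. The sketch below shows one possible way to turn a sheet into a list of dictionaries; it writes its own sample workbook first (data and path are illustrative assumptions) and opens it with `read_only=True`, which streams rows rather than loading the whole sheet.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Write a small workbook first so the example runs on its own.
path = os.path.join(tempfile.mkdtemp(), "sales_data.xlsx")
wb = Workbook()
ws = wb.active
ws.title = "Sheet1"
ws.append(["Date", "Product", "Sales"])
ws.append(["2021-01-01", "Widget", 50])
ws.append(["2021-01-02", "Gadget", 40])
wb.save(path)

# read_only=True streams rows instead of loading every cell into memory.
workbook = load_workbook(path, read_only=True)
sheet = workbook["Sheet1"]

rows = sheet.iter_rows(values_only=True)
header = next(rows)                                  # first row holds column names
records = [dict(zip(header, row)) for row in rows]   # remaining rows become dicts
workbook.close()                                     # read-only mode keeps the file open

print(records)
```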

Writing to Excel Files with openpyxl

Using openpyxl, one can create new Excel files, add sheets, and write data to cells effortlessly.


from openpyxl import Workbook

# Create a new Workbook
wb = Workbook()

# Select the active worksheet
ws = wb.active

# Write data to cells
ws['A1'] = 'Date'
ws['B1'] = 'Product'
ws['C1'] = 'Sales'
ws.append(['2021-01-06', 'Widget', 60])

# Save the workbook
wb.save('new_sales_data.xlsx')
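
A slightly fuller sketch of the same idea follows. It renames the default sheet, writes several rows, addresses a cell by row and column number, stores a formula, and adds a second sheet. The sheet names, sample rows, and temporary path are assumptions for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

wb = Workbook()
ws = wb.active
ws.title = "Sales"                       # rename the default sheet

# append() fills the next empty row on each call.
ws.append(["Date", "Product", "Sales"])
for row in [("2021-01-06", "Widget", 60), ("2021-01-07", "Gadget", 45)]:
    ws.append(row)

# Address cells by row and column number instead of an "A1"-style key.
ws.cell(row=4, column=2, value="Total")
ws.cell(row=4, column=3, value="=SUM(C2:C3)")   # stored as a formula string

# Add a second sheet to the same workbook.
notes = wb.create_sheet(title="Notes")
notes["A1"] = "Generated with openpyxl"

path = os.path.join(tempfile.mkdtemp(), "new_sales_data.xlsx")
wb.save(path)

reloaded = load_workbook(path)
print(reloaded.sheetnames)
```

openpyxl stores the formula text but does not evaluate it; the total is computed when the file is opened in Excel or another spreadsheet application.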

Combining pandas and openpyxl

By combining the strengths of pandas and openpyxl, users can utilize pandas for data analysis and transformations, then openpyxl for detailed Excel tasks such as formatting and adding charts.
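
One way to combine the two in a single pass is `pd.ExcelWriter` with `engine="openpyxl"`: inside the `with` block, `writer.sheets` exposes the underlying openpyxl worksheets, so they can be adjusted before the file is saved. The sketch below assumes a reasonably recent pandas version; the data, sheet names, and temporary path are made up for the example.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook
from openpyxl.styles import Font

df = pd.DataFrame({"Product": ["Widget", "Gadget"], "Sales": [50, 40]})
path = os.path.join(tempfile.mkdtemp(), "report.xlsx")

# Write with pandas, then touch up the openpyxl worksheet before saving.
with pd.ExcelWriter(path, engine="openpyxl") as writer:
    df.to_excel(writer, sheet_name="Data", index=False)
    ws = writer.sheets["Data"]               # an openpyxl Worksheet
    ws["A1"].font = Font(italic=True)        # restyle one header cell
    ws.column_dimensions["A"].width = 20     # widen the first column

# mode="a" (openpyxl engine only) adds a sheet to the existing file.
with pd.ExcelWriter(path, engine="openpyxl", mode="a") as writer:
    df[df["Product"] == "Widget"].to_excel(
        writer, sheet_name="Widgets", index=False)

result = load_workbook(path)
print(result.sheetnames)
```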

Transforming Data with Pandas

After using pandas to load and manipulate data, export it to an Excel file that openpyxl can further enhance with additional Excel-specific functionalities.


# Data manipulations
df['New Metric'] = df['Sales'] * 1.1

# Export with pandas
df.to_excel('enhanced_data.xlsx', index=False)
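
Transformations are often more than a single derived column. The following sketch, using made-up sample rows and a temporary path, adds the derived column and also builds a per-product summary, then writes both to separate sheets so that openpyxl can format them later.

```python
import os
import tempfile

import pandas as pd

# Illustrative sample data shaped like the sales sheet used in this guide.
df = pd.DataFrame({
    "Date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-03", "2021-01-04"]),
    "Product": ["Widget", "Widget", "Gadget", "Widget"],
    "Sales": [50, 70, 40, 30],
})

# Row-level transformation: a derived column, rounded for presentation.
df["New Metric"] = (df["Sales"] * 1.1).round(2)

# Aggregation: total and average sales per product.
summary = (
    df.groupby("Product", as_index=False)
      .agg(total_sales=("Sales", "sum"), average_sales=("Sales", "mean"))
)

# Write detail and summary to separate sheets of one workbook.
path = os.path.join(tempfile.mkdtemp(), "enhanced_data.xlsx")
with pd.ExcelWriter(path) as writer:
    df.to_excel(writer, sheet_name="Detail", index=False)
    summary.to_excel(writer, sheet_name="Summary", index=False)

print(summary)
```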

Adding Excel Features with openpyxl

Post-export, use openpyxl to add features such as cell styling, filtering, and charts to the already transformed file.


from openpyxl import load_workbook
from openpyxl.styles import Font

# Load the workbook written by pandas
workbook = load_workbook('enhanced_data.xlsx')
sheet = workbook.active

# Adding styles (example: change the font of the header row)
header_font = Font(bold=True, size=12)

for cell in sheet[1]:
    cell.font = header_font

# Saving the styled workbook under a new name
workbook.save('final_enhanced_data.xlsx')
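
Filtering and charts follow the same pattern as styling. The sketch below is one possible approach, built on a small workbook it creates itself (the rows, chart title, anchor cell, and temporary path are assumptions): it turns on filter dropdowns for the data range, freezes the header row, and adds a bar chart of the Sales column.

```python
import os
import tempfile

from openpyxl import Workbook
from openpyxl.chart import BarChart, Reference

# Sample sheet standing in for the file exported by pandas.
wb = Workbook()
ws = wb.active
ws.append(["Date", "Product", "Sales"])
for row in [("2021-01-01", "Widget", 50),
            ("2021-01-02", "Widget", 70),
            ("2021-01-03", "Gadget", 40)]:
    ws.append(row)

# Filter dropdowns over the used range, and a frozen header row.
ws.auto_filter.ref = ws.dimensions
ws.freeze_panes = "A2"

# Bar chart: Sales values (header cell used as series title) by Date.
chart = BarChart()
chart.title = "Sales by day"
data = Reference(ws, min_col=3, min_row=1, max_row=ws.max_row)
categories = Reference(ws, min_col=1, min_row=2, max_row=ws.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(categories)
ws.add_chart(chart, "E2")                # top-left corner of the chart

path = os.path.join(tempfile.mkdtemp(), "final_enhanced_data.xlsx")
wb.save(path)
```

Setting `auto_filter.ref` only adds the filter controls; openpyxl does not hide any rows itself, so the actual filtering happens when a user applies it in the spreadsheet application.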

Conclusion

Together, pandas and openpyxl provide a comprehensive toolkit for managing Excel files. pandas loads and transforms the data, and openpyxl handles the Excel-specific features such as styling, filters, and charts. Whether the goal is data analysis or a polished report, knowing where each library fits makes working with Excel files from Python more efficient.
