Recently, I decided to experiment with the Python library Polars at work. I’ve been hearing a lot about it, and I wanted to see how it compares to Pandas, which we currently use for our data preprocessing. Here’s what I learned!
Why Polars?
Polars has been trending in the data engineering world, and I was curious to see if it lived up to the hype. Plus, I’m a firm believer in learning by doing, so I thought this would be a great opportunity to get hands-on experience with a new tool.
It is a “blazingly fast” [1] DataFrame library written in Rust, which:
- Utilizes all available cores on your machine.
- Optimizes queries to reduce unneeded work/memory allocations.
- Handles datasets much larger than your available RAM.
- Offers a consistent and predictable API.
- Adheres to a strict schema (data-types should be known before running the query).
These promises are impressive, and they are backed up by existing benchmarks such as the most recent one from DuckDB. [2]
Finally, migrating from Pandas is made easier by a dedicated section of the user guide for people coming from Pandas. [3]
The Experiment
To get a fair comparison, I took an existing project we have in production and replaced the preprocessing pipeline. This pipeline, originally built with Pandas, does the following:
- Loads parquet data
- Performs data type casting
- Filters the data
- Adds new columns/features
- Saves the result back to parquet
I implemented the same pipeline using Polars and compared the performance.
For context, the dataset I used had:
- 2.4 million rows
- 121 columns
So it is a fairly small dataset.
The Results
I ran each version of the pipeline 10 times and took the average runtime. The results were impressive:
- Polars: 27.25 seconds
- Pandas: 123.7 seconds
That’s a 78% reduction in runtime, or roughly 4.5x faster! 🚀
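For anyone who wants to reproduce this kind of comparison, here is a small sketch of how the averaging and the percentage can be computed. The helper names (mean_runtime, compare) are my own for this example, and the lambda is only a stand-in for a real pipeline call.

```python
import statistics
import time
from typing import Callable


def mean_runtime(fn: Callable[[], object], runs: int = 10) -> float:
    """Run fn `runs` times and return the average wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def compare(baseline_s: float, candidate_s: float) -> tuple[float, float]:
    """Return (runtime reduction in percent, speedup factor)."""
    reduction_pct = (1 - candidate_s / baseline_s) * 100
    speedup = baseline_s / candidate_s
    return reduction_pct, speedup


# Stand-in workload; in practice this would call the Pandas or Polars pipeline.
demo_mean = mean_runtime(lambda: sum(range(1000)), runs=3)

# The averages reported above: Pandas 123.7 s, Polars 27.25 s.
reduction_pct, speedup = compare(baseline_s=123.7, candidate_s=27.25)
print(f"{reduction_pct:.0f}% less runtime, {speedup:.1f}x faster")
```

With the two averages above, this prints "78% less runtime, 4.5x faster".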
These results are very encouraging, as I wrote the Polars implementation without much knowledge of the library, and I expect you could squeeze out even better performance by writing better queries.
Challenges
It wasn’t all smooth sailing. Some functions I was used to in Pandas, like index_of on a List or Array, don’t exist in Polars yet. However, the Polars Discord community was incredibly helpful and pointed me to workarounds quickly.
Feedback and Next Steps
When I presented these results to my colleagues, they were impressed by the numbers. However, since this particular pipeline doesn’t run daily and isn’t too time-consuming, they didn’t see an immediate need to switch.
To build a stronger case, I’m planning two more experiments:
- Test the improvements for our inference pipeline
- Try Polars on a larger project with more than 10 million rows
Stay tuned for part 2 of this series, where I’ll share the results of these additional tests!
Have you tried Polars? What has your experience been? Let me know!
Sources
[1] https://docs.pola.rs/#key-features