Reading CSVs with Python: a comparison of methods


Introduction

Programming with Python can be a beautiful thing. For pretty much anything you want to do with the language, there's more than one way to make it happen. Some solutions are easier to read, some lean on a package, and some perform better than others. Today I'll be focusing on the performance side of things. Reading data from CSV files is something you're going to come across if, like me, you use Python for machine learning and data science. In areas like deep learning, models often benefit from very large datasets: thousands, if not millions, of data points. Because of this, reading in and manipulating your training, testing, and holdout data in a quick and efficient manner is very important. Today, we'll look at a few different ways to accomplish this in Python 3, along with the pros and cons of each method.


Method 1: The standard library

Python's built-in csv module is a good starting point for this experiment. With no need for any external packages, the standard library's solution makes sense for those worried about the problems that could arise from an external dependency, like security or long-term support. Performance-wise, this method does very well. It's lightweight yet flexible enough to handle most cases. Let's take a look at how to implement it.

import csv
import numpy as np

with open('./data.csv', 'r', encoding="utf8", newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    header = next(reader)  # consume the header row
    rows = []
    results = []

    for row in reader:
        # convert the last column's 'True'/'False' strings to 1 or 0
        results.append(int(row.pop().lower() == 'true'))
        rows.append(np.array(row, dtype=None))

    # stack the per-row arrays into NumPy arrays
    features = np.array(rows)
    labels = np.array(results)
Results: 6.03 seconds to read in 130k rows from a CSV file
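(A quick aside: the timings in this post came from my own machine. If you want to measure your own runs, a minimal sketch using the standard library's time module would look something like the following; the exact benchmarking setup behind the numbers above isn't shown here, so treat this as an assumption.)

import time

start = time.perf_counter()

# ... place one of the CSV-reading snippets from this post here ...

elapsed = time.perf_counter() - start
print(f"Read completed in {elapsed:.2f} seconds")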

Let's take a closer look at what exactly is going on here. The result of this block of code is two NumPy arrays: one called features and the other called labels. This setup is designed for a dataset with a bunch of rows where each column is a feature and the last column is the classification that matches that row of features. In the case above, we convert the last column from "true" (or anything else) into a 1 or 0: good old binary classification. This method can make a lot of sense when you have some kind of operation you want to perform on a row, column, or specific index before using it. While the per-row work adds some overhead, unless you can use NumPy to apply vectorized operations to many items at once, this is a good way to go.
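If you don't need any per-row work, the same true/false conversion can be done in one vectorized step after collecting the raw strings. Here's a minimal sketch; the sample values are made up for illustration:

import numpy as np

# hypothetical label strings collected from the last column
raw_labels = np.array(['True', 'False', 'true', 'False'])

# one vectorized comparison converts the whole column at once
labels = (np.char.lower(raw_labels) == 'true').astype(int)
print(labels)  # [1 0 1 0]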


Method 2: Pandas

In the next method, we'll take a look at using Pandas to read in our CSV data. For those unfamiliar, Pandas is a well-known, established library that excels at organizing and manipulating data. Built on top of NumPy, Pandas maintains a similar level of performance while also exposing an easy-to-use interface for working with the data.

import pandas as pd

# read the entire CSV into a DataFrame
df = pd.read_csv('./data.csv')

# check the shape of the data: (rows, columns)
print(df.shape)
Results: 2.47 seconds to read in 130k rows from a CSV file

And that's all there is to it. Pandas takes care of pretty much everything behind the scenes while still being tunable if needed. What's different about this approach is that instead of creating a NumPy matrix, it creates a Pandas DataFrame. If you've never used one before, you may be hesitant about adding yet another abstraction on top of your data. But again, we're focusing on performance, and that's something Pandas excels at. Not only can it read in large amounts of data very quickly, it also does really well when applying a function to an entire column, and it can make manipulating large datasets much less cumbersome. If you're unfamiliar with Pandas, I highly recommend checking it out.
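As a concrete sketch of that column-wise style, here's how Method 1's true/false conversion might look on a DataFrame. Note that the column name 'label' is an assumption on my part; the dataset's actual headers aren't shown in this post:

import pandas as pd

df = pd.read_csv('./data.csv')

# hypothetical column name: adjust 'label' to match your data.
# one vectorized string operation converts the whole column at once
df['label'] = (df['label'].str.lower() == 'true').astype(int)

# split into features and labels, mirroring Method 1
labels = df['label'].to_numpy()
features = df.drop(columns=['label']).to_numpy()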


Method 3: NumPy

The last method I'll go over is NumPy's built-in file reader, genfromtxt. This method reads a file into the program as a NumPy array. When using it, it's important to set the dtype properly, or set it to None to have NumPy try to infer each column's type automatically.

import numpy as np

# returns an n x m NumPy array; dtype=None lets NumPy infer column types
data = np.genfromtxt('./data.csv', delimiter=',', encoding="utf8", dtype=None)

I can't really say there's a good time to use this method when working with CSVs. It uses quite a bit more RAM and CPU than Pandas, and with large datasets it is much slower. I also ran into a lot of issues with quoted text data, where genfromtxt couldn't tell where one column started or ended. However, I didn't just add this section for no reason. This method is very powerful for reading in plain text files. It has a rich set of optional parameters that can handle many differently formatted text files and, as a result, can save you from writing a custom parser. Here is a list of all the options if you want to learn more.
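To give a sketch of what those options look like in practice (the file name and parameter values below are illustrative assumptions, not settings from this post's benchmark):

import numpy as np

# an illustrative use of genfromtxt's optional parameters
data = np.genfromtxt(
    './data.txt',
    delimiter=',',          # column separator
    names=True,             # read column names from the first row
    dtype=None,             # infer each column's type automatically
    encoding='utf8',
    missing_values='NA',    # strings to treat as missing
    filling_values=0,       # value substituted for missing entries
    autostrip=True,         # strip whitespace around values
    comments='#',           # ignore everything after this character
)

# with names=True, columns can then be accessed by name,
# e.g. data['some_column'] (hypothetical column name)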

Conclusion

After getting a look at these three methods, hopefully you have a better understanding of which method will work best for you and the cases in which each excels. In my opinion, and from the experience I've had, I would say that Pandas is the way to go. It's a battle-tested library that performs incredibly well, especially with larger datasets. It also provides a number of powerful operations that make mutating the data incredibly easy. If you're not already using it, I highly recommend checking it out. This helpful page will help you get started with all the basics in just 10 minutes.

