In my line of research (computational electrodynamics), I often have to manipulate quite large ASCII files, in the range of 10-100 MB in size. These files usually contain maps of electromagnetic fields, with all the information simply stored column-wise in ASCII format. I know very well that there are smarter ways to store this kind of data (the HDF5 format, to mention just one of many), but sometimes, either out of laziness or to avoid linking one more library, it is simpler to save everything to an ASCII file and gzip it to save some room on the hard drive.

When the time for data analysis and plotting comes, I love to use NumPy and Matplotlib, and loadtxt() is an excellent tool for loading small ASCII files efficiently. Reading files several million lines long, however, becomes cumbersome, and a different approach is needed to solve this speed (and size) issue. Long story short: say we have a bunch of ASCII *.gz files in a given folder, and that these data are n_col columns wide and several million lines long. The snippet below does the trick to import them all: it unzips each file, reads and reshapes the data, and stores everything in a Python list of NumPy arrays. First of all we need to import some libraries:
import numpy as np
import gzip
import glob
Then, for the useful part:
data_files = glob.glob("path/to/folder/*.gz")  # file names in the folder
data_list = []                                 # list for the imported arrays

for file in data_files:
    with gzip.open(file, "rt") as gunzipped_file:           # unzip (text mode)
        data = np.array(gunzipped_file.read().split(),
                        dtype=np.float32)                   # parse the numbers
    data = data.reshape(data.size // n_col, n_col)          # restore the columns
    data_list.append(data)                                  # store in the list
This is it, and it is fast.
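To see the whole thing working end to end, here is a minimal, self-contained sketch: it writes one small gzipped ASCII file with known contents into a throwaway folder, then loads it back with the same loop. The folder, file name, row count, and n_col value are all hypothetical, chosen just for this demo.

```python
import glob
import gzip
import os
import tempfile

import numpy as np

n_col = 3  # number of columns in the ASCII files (demo value)

# Create a temporary folder holding one *.gz file with 4 rows x 3 columns.
tmpdir = tempfile.mkdtemp()
with gzip.open(os.path.join(tmpdir, "fields.gz"), "wt") as f:
    for i in range(4):
        f.write(" ".join(str(float(3 * i + j)) for j in range(n_col)) + "\n")

# Same loop as above, pointed at the demo folder.
data_list = []
for file in glob.glob(os.path.join(tmpdir, "*.gz")):
    with gzip.open(file, "rt") as gunzipped_file:   # text mode gives str tokens
        data = np.array(gunzipped_file.read().split(), dtype=np.float32)
    data_list.append(data.reshape(data.size // n_col, n_col))

print(data_list[0].shape)  # → (4, 3)
```

Note the integer division (//) in the reshape: in Python 3, a plain / would produce a float and reshape() would refuse it.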