Archive for November, 2006

Parsing CSV files in kdb+

Monday, November 13th, 2006

During the last few days I’ve been assisting a new kdb+ user to generate a historical database from a bunch of large csv files (~5GB each). I’ve not been involved too much - just the odd email here and there, pointing them in the right direction. It’s been interesting both in terms of finding a working solution, and observing a new user’s progress. I think it is worthwhile documenting the path of learning to parse a csv file in kdb+. So here goes:

Firstly, a comma separated value (csv) file contains rows of fields seperated by commas. e.g.

2006.06.06,100,180.0,VOD.L
2006.06.07,100,180.0,BT.L
2006.06.08,100,180.0,BASF.DE

it may or may not have a header row decribing the field names. In the above example the following might be appropriate

date,size,price,sym

To begin with, we should look at how to parse an individual field. This can be done via the cast operator $

/ parse a date string into a date object
q)”D”$”2006.06.06″

You can replace “D” with any of the following characters, according to your desired target type

“*BXHIJEFCSMDZUVT ”

As we have several fields to parse, we can use the 0: function, which takes 2 parameters

(format;delimiter) 0:charvector

e.g.

q)(enlist “D”;”,”)0:”2006.06.06″

and with more than one field, seperated by a comma

q)(”DI”;”,”)0:”2006.06.06,100″

and to extend that further using a list of char vectors

q)(”DI”;”,”)0:(”2006.06.06,100″;”2006.06.07,200″)

and perhaps with a descriptive header as the first element in the list

q)(”DI”;enlist “,”)0:(”date,amount”;”2006.06.06,100″;”2006.06.07,200″)

If the “,” is enlisted, the 0: function will take the first element as the header.

So far this has all been about parsing char vectors and lists of char vectors. What happened to reading csv files? Well, even though you now know what’s going on inside 0: you should pause to consider how to read the file into memory. This can be done with the read0 function, which simply takes the filename as the only parameter, e.g.

q)data:read0`:myfile.csv

You can then pass it into our previous examples as

q)(”DI”;enlist “,”)0:data

More experienced users will recognise that this can all be done in one go, as

q)(”DI”;enlist “,”)0:`:myfile.csv

This happens to create a table just as we wanted.

But if the file has no header, and we use

q)(”DI”;”,”)0:`:myfile.csv

we get a list of lists of parsed data in memory. How do we get that into a table?

Users may recall from their first encounters with kdb+ that a table can be created from a dictionary, given a few constraints, e.g

q)flip `a`b`c!(10 20;30 40;50 60)

so we shall do the same here

q)flip `date`amount!(”DI”;”,”)0:`:myfile.csv

What about large csv files? Say 5GB files? Fortunately, there is a helper function .Q.fs which takes 2 parameters, a function to parse each chunk of data and the filename. i.e.

.Q.fs[fn;filename]

and it can be used as

.Q.fs[{`tmp insert flip `date`amount!(”DI”;”,”)0:x};`:myfile.csv]

On an example csv file I notice it is parsing around 600 rows at a time, but I expect this will be dependent on how much data is on each row.

Inside the function to parse the chunk of data we can do all kinds of things, including appending the data to a splayed table on disk. Beware that the table must be fully enumerated to do that.

Interested parties might also like to look at

http://kx.com/q/e/csvguess.q

and

http://kx.com/q/e/csvutil.q

Maybe in another post I’ll describe how to create a historical database from the csv files.