Writing a CSV Sniffer for Ruby

10/22/2015

I needed a way to detect the delimiter character of a CSV file I was processing in Ruby.

To move massive amounts of data between organizations, we (at my current job) have found file transfer to be significantly more performant than web service calls, so we often share data with our clients through Secure FTP. A particular project I was working on required parsing CSV files provided by clients, and those files could arrive in different formats. I needed an automated way to tell the CSV parser whether to split on commas, tabs or pipes, without manually examining the files or setting up per-client configuration. That's why I wrote the csv_sniffer Ruby gem. The word "sniffer" indicates that the properties of the CSV file are determined heuristically. Python also has a csv.Sniffer class that does many of the same things as my Ruby csv_sniffer, so the terminology is already familiar to developers in another dynamic language.

The design goals of csv_sniffer were:

  1. To answer the question in the fastest, simplest and most resource-conscious way possible while maintaining as high an accuracy standard as more complex methods
  2. To be easy to use

To that end, all csv_sniffer methods are static and take a single argument, the file name. As little of the file is read as possible, and file handles are closed as soon as they are no longer needed. The detection algorithms themselves are designed to be efficient, relying on pre-determined knowledge to stay simple and fast.
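To make that concrete, here is a minimal sketch of the idea, not the gem's actual source. The module name DelimiterSketch, the candidate list and the "most frequent character on the first line wins" rule are all assumptions made for illustration. It shows a static method that takes only a file name, reads a single line, and lets the block form of File.open close the handle right away.

```ruby
# Hypothetical illustration only; this is NOT the csv_sniffer implementation.
module DelimiterSketch
  # Assumed candidate delimiters: comma, tab and pipe.
  CANDIDATES = [",", "\t", "|"].freeze

  # Returns the candidate that occurs most often on the first line,
  # or nil when the file is empty or the line contains no candidate.
  def self.detect_delimiter(path)
    # The block form of File.open closes the handle as soon as the block exits,
    # and only one line of the file is ever read.
    first_line = File.open(path, "r") { |f| f.gets }
    return nil if first_line.nil?

    counts = CANDIDATES.map { |c| [c, first_line.count(c)] }
    best, hits = counts.max_by { |_, n| n }
    hits.zero? ? nil : best
  end
end

# Possible usage (the path is a placeholder):
#   require "csv"
#   delim = DelimiterSketch.detect_delimiter("incoming/client_file.txt")
#   CSV.foreach("incoming/client_file.txt", col_sep: delim) { |row| p row } if delim
```

Counting on a single line keeps the work tiny, but it can be fooled by delimiters that appear inside quoted values, which is one reason a real sniffer would want to account for quoting as well.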

Initially, csv_sniffer offered delimiter detection, quote-enclosed value detection and quote character detection. I later added header detection by porting Python's header detection algorithm to Ruby.
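For readers unfamiliar with the Python approach: csv.Sniffer.has_header samples some rows and lets each column "vote" on whether the first row looks different from the data beneath it. The sketch below is a pared-down take on that voting idea, not the ported code from the gem. It only considers columns whose sampled data rows are all numeric, and the names HeaderSketch, SAMPLE_ROWS and numeric? are hypothetical.

```ruby
require "csv"

# Hypothetical, simplified illustration of column voting; NOT the gem's code.
module HeaderSketch
  SAMPLE_ROWS = 20 # assumed number of data rows to sample

  # True when the value parses as a number (requires Ruby 2.6+ for exception: false).
  def self.numeric?(value)
    !Float(value.to_s, exception: false).nil?
  end

  def self.has_header?(path, col_sep: ",")
    rows = []
    # CSV.foreach closes the file even when we break out early.
    CSV.foreach(path, col_sep: col_sep) do |row|
      rows << row
      break if rows.size > SAMPLE_ROWS
    end

    header, *data = rows
    return false if header.nil? || data.empty?

    votes = 0
    header.each_index do |i|
      column = data.map { |r| r[i] }
      # Only columns that are consistently numeric get a vote in this sketch.
      next unless column.all? { |v| numeric?(v) }
      # A non-numeric first cell above numeric data suggests a label.
      votes += numeric?(header[i]) ? -1 : 1
    end
    votes > 0
  end
end
```

A file with no numeric columns yields no votes here and is reported as having no header; the Python version also compares string lengths to handle that case.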

All the source code, tests and usage examples are available on GitHub under the MIT license, so it's free to use, modify and redistribute. The gem itself is published to rubygems.org and can be installed with a simple gem install csv_sniffer.
