Writing a CSV Sniffer for Ruby
10/22/2015
I needed a way to detect the delimiter character of a csv file I was processing in Ruby.
In order to move massive amounts of data between organizations, we
(at my current job) have found file transfer to be significantly more
performant than web service calls. Therefore we often share data with
our clients through Secure FTP. A particular project I was working on
required parsing csv files provided by clients, which could arrive in
different formats. So I needed an automated way to tell the csv parser
whether to use commas, tabs or pipes, without having to manually
examine the files or set up some per-client configuration.
That's why I wrote the csv_sniffer
Ruby gem. The word "sniffer" is meant to indicate that the
properties of the csv file are determined heuristically.
Python also has a csv.Sniffer class that does many of the same things my
Ruby csv_sniffer does, so the terminology is already familiar to
developers in another dynamic language.
The design goals of csv_sniffer were:
- To answer the question in the fastest, simplest and most resource-conscious way possible while maintaining as high an accuracy standard as more complex methods
- To be easy to use
Initially csv_sniffer started out with delimiter detection, detection of quote-enclosed values and detection of the quote character. I later added header detection by porting the Python algorithm for header detection to Ruby.
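To give a feel for what heuristic delimiter detection can look like, here is a minimal sketch. It is not the code from csv_sniffer: the guess_delimiter method, the CANDIDATE_DELIMITERS constant and the sample_size parameter are names made up for this illustration. It assumes that the real delimiter shows up the same number of times on every sampled line, and it makes no attempt to handle delimiters that appear inside quoted values.

```ruby
# Hypothetical helper for illustration only; not part of the csv_sniffer gem.
# Candidates are the three delimiters mentioned in this post.
CANDIDATE_DELIMITERS = [",", "\t", "|"].freeze

# Guess the delimiter by looking at the first few non-empty lines and
# picking the candidate that occurs equally often on each of them.
# Returns nil when no candidate qualifies.
def guess_delimiter(text, sample_size: 10)
  lines = text.each_line.map(&:chomp).reject(&:empty?).first(sample_size)
  return nil if lines.empty?

  best = nil
  best_count = 0
  CANDIDATE_DELIMITERS.each do |delim|
    counts = lines.map { |line| line.count(delim) }
    # Skip candidates that are absent from the first line or whose
    # frequency varies from line to line.
    next unless counts.first > 0 && counts.uniq.size == 1

    # Among consistent candidates, prefer the one that splits the most.
    if counts.first > best_count
      best = delim
      best_count = counts.first
    end
  end
  best
end
```

The result could then be handed to Ruby's standard CSV parser through its col_sep option, for example CSV.parse(text, col_sep: guess_delimiter(text)), after checking that the guess is not nil.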
All the source code, tests and usage examples are available on GitHub under the MIT license, so it's free to use, modify and redistribute. The gem itself is published to rubygems.org and can be installed with a simple:
gem install csv_sniffer