CSV

From Just Solve the File Format Problem
Revision as of 03:45, 1 November 2012 by KevinAshley (Talk | contribs)

Jump to: navigation, search

Contents

CSV - description

CSV - comma separated values - is a text-based format typically used for the storage or exchange of database-like records. In essence, CSV files consist of a series of records each of which contains zero or more fields. The fields are separated by a known delimiter - canonically a comma - and the records are typically separated by whatever constitutes a newline on the system which generated the CSV file. A quote character is used to surround fields which themselves contain the delimiter character or the quote character, and in some implementations is used to surround any field which contains non-alphanumeric characters. The quote character is typically " but is often '.

A simplistic and quite possibly syntactically invalid BNF definition for CSV is as follows:

<CSVFile> ::= <Record>*

<Record> ::= { <Field> (<Delimiter> <Field>)* } <EOL>

<Field> ::= <SimpleField> <QuotedField>

<SimpleField> :== AlphaNum*  ; Any sequence of alpha-numeric characters

<QuotedField> :== <QuoteChar> <Anychar>* <QuoteChar>  ; See below for quite how flexible <anychar> is

<QuoteChar> :== " | '  ;but note that they generally must match

<Delimiter> :== ","


Implementations vary in their interpretation and generation of CSV files. The best, as ever, are strict in what they generate but generous in what they accept. Known variants of what is acceptable include:

  • Whether quoted fields must be quoted with " or ' or whether either is acceptable
  • Whether <EOL> is ASCII NL, CR, CR NL, NL CR or any combination of these. (For instance, some implementations expecting a bare NL but seeing a record ending in CR NL will treat the CR as part of the final field; some will see it as a record delimiter on its own, making a blank record following; some will correctly recognise CR NL as a variant form of <EOL>)
  • Whether all records must contain the same number of fields or not
  • Whether special interpretation can be given to the first record, naming the fields in subsequent records (implementations that accept this will typically expect every record to contain the same number of fields as the first record)
  • Whether quotes only appear around fields which themselves contain either a quote character, a delimiter or a newline or whether quotes can be placed around any field
  • Whether quotes inside a field are doubled or escaped. E.g. if the quote character is " and a field's value is you"re should the field appear as "you\"re" or "you""re" ? (The second option is the most widely adopted, and that used by MS Excel)
  • Whether it is possible to place quotes or delimiters inside field values at all (Some simplistic CSV libraries simply split the input line at every occurrence of a delimiter, ignoring quoting.)
  • Whether quoting a field allows it to contain a newline or whether a newline always terminates a record regardless of where it appears.

With the generic interpretation of CSV given above, it is possible to see Tab-Delimited-Values as a special case of CSV that frequently avoids many of these problems with quoting, since typical textual data is less likely to contain real tab characters than commas or quotes. Tab-delimited files rarely if ever need to use quote characters to surround fields and therefore can be handled simply by splitting lines at each tab character, as long as one does not need to support multi-line fields.


Examples

Other descriptions

Code

The public release of the NDAD code contained a generic set of routines to read and write CSV files which supported arbitrary quoting and multi-line fields, as well as supporting files which contained different numbers of fields in each record.


--KevinAshley (talk) 03:28, 1 November 2012 (UTC)

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox