CSV
Contents |
CSV - description
CSV - comma separated values - is a text-based format typically used for the storage or exchange of database-like records. In essence, CSV files consist of a series of records each of which contains zero or more fields. The fields are separated by a known delimiter - canonically a comma - and the records are typically separated by whatever constitutes a newline on the system which generated the CSV file. A quote character is used to surround fields which themselves contain the delimiter character or the quote character, and in some implementations is used to surround any field which contains non-alphanumeric characters. The quote character is typically " but is often '.
RFC 4180 gives a spec for the format.
A simplistic and quite possibly syntactically invalid BNF definition for CSV is as follows:
<CSVFile> ::= <Record>*
<Record> ::= { <Field> (<Delimiter> <Field>)* } <EOL>
<Field> ::= <SimpleField> <QuotedField>
<SimpleField> :== AlphaNum* ; Any sequence of alpha-numeric characters
<QuotedField> :== <QuoteChar> <Anychar>* <QuoteChar> ; See below for quite how flexible <anychar> is
<QuoteChar> :== " | ' ;but note that they generally must match
<Delimiter> :== ","
Implementations vary in their interpretation and generation of CSV files. The best, as ever, are strict in what they generate but generous in what they accept. Known variants of what is acceptable include:
- Whether quoted fields must be quoted with " or ' or whether either is acceptable
- Whether <EOL> is ASCII NL, CR, CR NL, NL CR or any combination of these. (For instance, some implementations expecting a bare NL but seeing a record ending in CR NL will treat the CR as part of the final field; some will see it as a record delimiter on its own, making a blank record following; some will correctly recognise CR NL as a variant form of <EOL>)
- Whether all records must contain the same number of fields or not
- Whether special interpretation can be given to the first record, naming the fields in subsequent records (implementations that accept this will typically expect every record to contain the same number of fields as the first record)
- Whether quotes only appear around fields which themselves contain either a quote character, a delimiter or a newline or whether quotes can be placed around any field
- Whether quotes inside a field are doubled or escaped. E.g. if the quote character is " and a field's value is you"re should the field appear as "you\"re" or "you""re" ? (The second option is the most widely adopted, and that used by MS Excel)
- Whether it is possible to place quotes or delimiters inside field values at all (Some simplistic CSV libraries simply split the input line at every occurrence of a delimiter, ignoring quoting.)
- Whether quoting a field allows it to contain a newline or whether a newline always terminates a record regardless of where it appears.
With the generic interpretation of CSV given above, it is possible to see Tab-Delimited-Values as a special case of CSV that frequently avoids many of these problems with quoting, since typical textual data is less likely to contain real tab characters than commas or quotes. Tab-delimited files rarely if ever need to use quote characters to surround fields and therefore can be handled simply by splitting lines at each tab character, as long as one does not need to support multi-line fields. The ASCII character set's C0 controls actually include control characters such as Unit Separator intended for use as delimiters, but those are not often encountered.
A Microsoft Excel CSV file may write or expect a semicolon as separator, depending on the system's regional settings. This is to provide support in areas where the comma is used as the decimal separator instead of the period.
Byte Order Mark
Some UTF8 CSV files are written with a prepending "Byte Order Mark" or BOM consisting of 3 bytes: EF BB BF
.
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]
Examples
Other descriptions
Code
- NDAD The public release of the NDAD code contained a generic set of routines to read and write CSV files which supported arbitrary quoting and multi-line fields, as well as supporting files which contained different numbers of fields in each record. readcsv_v & writecsv are buried in this tarball:
- There's also readcsv, a simpler implementation which expects the caller to know how many fields the input record will contain and returns an error if that isn't what it finds.
- GDAL/OGR, an open-source C++ library designed for geospatial data formats, includes routines to read and write CSV files. GDAL/OGR assumes CSV files are in UTF-8.
Identifiers
- File extension: .CSV
- MIME type (Internet media type): text/csv is the officially-registered type, but a number of other types have been used starting before the registration took place, but continuing afterward. text/x-csv and application/x-csv are common, and application/csv, text/x-comma-separated-values, and text/comma-separated-values are also encountered. Sometimes such files are even served as application/vnd.ms-excel in order to get them to load in MS Excel (and be unnecessarily perceived as in a proprietary format by those who can't or won't use Microsoft products). Incidentally, Windows in its default configuration refers to files with a .csv extension as "Microsoft Office Excel Comma Separated Values File", implying that MS is claiming ownership of the format.
References
- RFC 4180
- Comma-separated values : Wikipedia
- Comma Separated Value (.csv) from GDAL/OGR documentation