CSV

From Just Solve the File Format Problem
Revision as of 12:20, 15 July 2014 by Dan Tobias (Talk | contribs)

Jump to: navigation, search
File Format
Name CSV
Ontology
Extension(s) .csv
MIME Type(s) text/csv (officially registered type), application/csv, text/x-csv, application/x-csv, text/x-comma-separated-values, text/comma-separated-values
PRONOM x-fmt/18

CSV (Comma Separated Values) is a text-based format typically used for the storage or exchange of database-like records. In Microsoft Windows systems, it will commonly open in Excel, but it is not a proprietary format and can be used in many other programs.

Contents

Description

In essence, CSV files consist of a series of records each of which contains zero or more fields. The fields are separated by a known delimiter - canonically a comma - and the records are typically separated by whatever constitutes a newline on the system which generated the CSV file. A quote character is used to surround fields which themselves contain the delimiter character or the quote character, and in some implementations is used to surround any field which contains non-alphanumeric characters. The quote character is typically " but is often '.

RFC 4180 gives a spec for the format. CSV Schema is a format for providing formal definitions of a particular CSV-based format.

A simplistic and quite possibly syntactically invalid BNF definition for CSV is as follows:

<CSVFile> ::= <Record>*

<Record> ::= { <Field> (<Delimiter> <Field>)* } <EOL>

<Field> ::= <SimpleField> <QuotedField>

<SimpleField> :== AlphaNum*  ; Any sequence of alpha-numeric characters

<QuotedField> :== <QuoteChar> <Anychar>* <QuoteChar>  ; See below for quite how flexible <anychar> is

<QuoteChar> :== " | '  ;but note that they generally must match

<Delimiter> :== ","

Implementations vary in their interpretation and generation of CSV files. The best, as ever, are strict in what they generate but generous in what they accept. Known variants of what is acceptable include:

  • Whether quoted fields must be quoted with " or ' or whether either is acceptable
  • Whether <EOL> is ASCII NL, CR, CR NL, NL CR or any combination of these. (For instance, some implementations expecting a bare NL but seeing a record ending in CR NL will treat the CR as part of the final field; some will see it as a record delimiter on its own, making a blank record following; some will correctly recognise CR NL as a variant form of <EOL>)
  • Whether all records must contain the same number of fields or not
  • Whether special interpretation can be given to the first record, naming the fields in subsequent records (implementations that accept this will typically expect every record to contain the same number of fields as the first record)
  • Whether quotes only appear around fields which themselves contain either a quote character, a delimiter or a newline or whether quotes can be placed around any field
  • Whether quotes inside a field are doubled or escaped. E.g. if the quote character is " and a field's value is you"re should the field appear as "you\"re" or "you""re" ? (The second option is the most widely adopted, and that used by MS Excel)
  • Whether it is possible to place quotes or delimiters inside field values at all (Some simplistic CSV libraries simply split the input line at every occurrence of a delimiter, ignoring quoting.)
  • Whether quoting a field allows it to contain a newline or whether a newline always terminates a record regardless of where it appears.

With the generic interpretation of CSV given above, it is possible to see Tab-Delimited-Values as a special case of CSV that frequently avoids many of these problems with quoting, since typical textual data is less likely to contain real tab characters than commas or quotes. Tab delimited files rarely if ever need to use quote characters to surround fields and therefore can be handled simply by splitting lines at each tab character, as long as one does not need to support multi-line fields. Other characters are also sometimes used as delimiters, including the pipe (upright bar: |). The ASCII character set's C0 controls actually include control characters such as Unit Separator intended for use as delimiters, but those are not often encountered.

A Microsoft Excel CSV file may write or expect a semicolon as separator, depending on the system's regional settings. This is to provide support in areas where the comma is used as the decimal separator instead of the period.

CSVs loaded into Excel and saved back out might have some of their data "munged" in various ways; for instance, ZIP codes have their leading zeroes stripped, and account numbers (and other long items with all-numeric characters) could get placed into exponential notation, losing information. Files may also be truncated if they have too many lines for Excel to handle (varying by version). Unfortunately, with Windows defaulting to opening CSV files in Excel, and often browsers, e-mail programs, and other software launching downloaded CSV files automatically in that program, office workers often mess up data in this manner instead of transferring them unmolested. If you receive a CSV file from one source (e.g., a Web download or e-mail attachment) and need to send it on to another place (e.g., data import to accounting software or e-mail attachment to somebody else), it's best to save it straight to your computer's hard disk from the place you get it, then upload it from there; don't open it in Excel and save it from there.

Byte Order Mark

Some UTF-8 CSV files are written with a prepending "Byte Order Mark" or BOM consisting of 3 bytes: EF BB BF.

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature." [1]

See also

Examples

Software

Code

  • NDAD The public release of the NDAD code contained a generic set of routines to read and write CSV files which supported arbitrary quoting and multi-line fields, as well as supporting files which contained different numbers of fields in each record. readcsv_v & writecsv are buried in this tarball:
There's also readcsv, a simpler implementation which expects the caller to know how many fields the input record will contain and returns an error if that isn't what it finds.
  • GDAL/OGR, an open-source C++ library designed for geospatial data formats, includes routines to read and write CSV files. GDAL/OGR assumes CSV files are in UTF-8.
  • csv is a Python module in Python Standard Library since version 2.3. It implements classes to read and write tabular data in CSV format.
  • Unix: Chopping up CSV files (using Perl module)

Identifiers

  • File extension: .CSV
  • MIME type (Internet media type): text/csv is the officially-registered type, but a number of other types have been used starting before the registration took place, but continuing afterward. text/x-csv and application/x-csv are common, and application/csv, text/x-comma-separated-values, and text/comma-separated-values are also encountered. Sometimes such files are even served as application/vnd.ms-excel in order to get them to load in MS Excel (and be unnecessarily perceived as in a proprietary format by those who can't or won't use Microsoft products). Incidentally, Windows in its default configuration refers to files with a .csv extension as "Microsoft Office Excel Comma Separated Values File", implying that MS is claiming ownership of the format.

Hints and tips

Other links and references

Personal tools
Namespaces

Variants
Actions
Navigation
Toolbox