Avro

From Just Solve the File Format Problem

Revision as of 06:31, 14 June 2019

File Format
Name Avro
Ontology
Extension(s) .avsc, .avro
Released 2009[1]

Avro, commonly called "Apache Avro" after the organization that created it, is a serialization format (and RPC system) used largely within "big data" systems, specifically the Apache Hadoop environment. It was designed for extreme "flexibility", which in practice seems to mean meeting several apparently contradictory requirements at once; as a result, its structure is somewhat confusing.

At its basis, Avro is an imitation of Protocol Buffers, with a schema written in JSON that defines a wire format[Note 1] for serialized data that essentially consists of the field values converted to binary and concatenated. (Instead of this binary encoding, the message itself may also be serialized using JSON, although this is relatively rare.[2]) However, this is complicated by the fact that the schema is usually stored with the message, which gives rise to Avro's distinctive qualities.
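As a rough illustration of "field values converted to binary and concatenated", the sketch below encodes a two-field record by hand. It relies on two details of the Avro specification that this page does not spell out: int and long values are written as zig-zag variable-length integers, and strings are written as a length followed by UTF-8 bytes. The function names and the list-of-pairs schema representation are hypothetical and are not part of any Avro library.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zig-zag variable-length integer."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes become small unsigned values
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Length (as a zig-zag varint) followed by the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return zigzag_varint(len(raw)) + raw


# Hypothetical mini-encoder: only the two types used in the example.
ENCODERS = {"int": zigzag_varint, "long": zigzag_varint, "string": encode_string}


def encode_record(fields, values) -> bytes:
    """Concatenate the encoded fields in schema order; no names or tags are written."""
    return b"".join(ENCODERS[ftype](values[name]) for name, ftype in fields)


user_fields = [("name", "string"), ("age", "int")]  # assumed example schema
encoded = encode_record(user_fields, {"name": "Ada", "age": 36})
print(encoded.hex())
```

Nothing in the output identifies the fields or their types, which is why a reader needs the schema to make sense of the bytes.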

One of the features frequently touted as an advantage of Avro is that it can work without Protocol Buffers-like code generation[3][4] (although in practice this seems to usually be done, anyhow). This only affects ease of programming, not the data format. (Technically, it allows for messages to be parsed without knowing anything about the schema in advance, but this is only useful in the course of writing something like a debug tool.)

Extensions

Schema files use the extension ".avsc", while encoded files (of any sort) use the extension ".avro".[5]

Parsing Canonical Form

The "parsing canonical form" is used to test two schemas for equality. It is computed by taking a schema, specified in JSON, and performing a number of transformations on it. If the result of this process is the same for two schemas, they are considered identical.[6] The parsing canonical form becomes important when schemas are embedded in files (as is discussed in the next section).
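The idea of comparing schemas through a normalized form can be sketched as follows. This is a deliberately simplified subset of the transformations listed in the specification: it drops attributes that do not affect parsing, reduces `{"type": "int"}`-style primitives to their bare names, fixes the key order, and removes whitespace. It does not resolve namespaces into full names or normalize integer literals, so it should not be taken as a conforming implementation. The function names are hypothetical.

```python
import json

# Attributes retained by the canonical form, in the order they are emitted.
KEEP_ORDER = ["name", "type", "fields", "symbols", "items", "values", "size"]
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}


def canonical(schema):
    """Return a simplified canonical structure for a parsed JSON schema."""
    if isinstance(schema, str):
        return schema
    if isinstance(schema, list):  # a union is a JSON array of schemas
        return [canonical(branch) for branch in schema]
    inner = schema.get("type")
    if isinstance(inner, str) and inner in PRIMITIVES:
        return inner  # {"type": "int"} becomes "int"
    out = {}
    for key in KEEP_ORDER:
        if key not in schema:
            continue
        value = schema[key]
        if key in ("type", "items", "values"):
            value = canonical(value)
        elif key == "fields":
            # Record fields keep only their name and (canonicalized) type.
            value = [{"name": f["name"], "type": canonical(f["type"])} for f in value]
        out[key] = value
    return out


def canonical_text(schema_json: str) -> str:
    """Serialize the simplified canonical form with no insignificant whitespace."""
    parsed = json.loads(schema_json)
    return json.dumps(canonical(parsed), separators=(",", ":"), ensure_ascii=False)


a = '{"type": "record", "name": "User", "doc": "v1", "fields": [{"name": "age", "type": "int"}]}'
b = '{"fields": [{"type": {"type": "int"}, "name": "age", "doc": "years"}], "name": "User", "type": "record"}'
print(canonical_text(a) == canonical_text(b))
```

The two example schemas differ in key order, documentation strings, and how the primitive type is spelled, yet normalize to the same text.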

Embedded Schemas and their Consequences

The distinctive feature of Avro is that it almost always stores or transmits the serialized message along with its schema. This is how Avro tries to solve the problem of backwards and forwards compatibility in schema versions (a difficult problem for serialization formats; see e.g. [7]).

The Wrapper Formats

The schema and the message are stored together in one of two formats: the "single-object encoding" format and the "object container file" format. The former does not actually store the schema in full, only a reference to it.

Single-Object Encoding

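The "single-object encoding" consists of[8]:

  • The magic number c3 01.
  • A 64-bit fingerprint, generated by a variant of the Rabin fingerprint (itself apparently a type of CRC) called the "AVRO fingerprinting algorithm".
  • The encoded message.

The fingerprint allows the schema to be looked up in a database.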
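A minimal sketch of the single-object layout: the two-byte marker c3 01, a 64-bit schema fingerprint, then the encoded message. The fingerprint routine is a Python transcription of the table-driven algorithm given in the specification's schema fingerprints section. The little-endian byte order and the idea of fingerprinting the schema's canonical text come from the specification rather than from this page. The function names and the example schema and payload are hypothetical.

```python
import struct

# Seed and polynomial constant used by the specification's 64-bit fingerprint.
EMPTY = 0xC15D213AA4D7A795


def _build_table():
    """Precompute the 256-entry lookup table for the fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """64-bit fingerprint of a byte string (normally a schema's canonical text)."""
    fp = EMPTY
    for b in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ b) & 0xFF]
    return fp


def single_object(canonical_schema: str, payload: bytes) -> bytes:
    """Wrap an already-encoded message: marker, fingerprint, message."""
    fp = fingerprint64(canonical_schema.encode("utf-8"))
    return b"\xc3\x01" + struct.pack("<Q", fp) + payload


def split_single_object(blob: bytes):
    """Return (fingerprint, payload); the fingerprint is the schema lookup key."""
    if blob[:2] != b"\xc3\x01":
        raise ValueError("not a single-object encoded message")
    (fp,) = struct.unpack("<Q", blob[2:10])
    return fp, blob[10:]


message = single_object('"int"', b"\x48")  # assumed example: the int 36 under schema "int"
print(message.hex())
```

A reader would use the recovered fingerprint as a key into whatever schema store it maintains, then decode the remaining bytes with the schema it finds there.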

Object Container File

The "object container file", which seems to be more widely used, is more complicated. It consists of[9]:

  • The magic number 4f 62 6a 01.
  • The full JSON schema and the name of the compression algorithm, encoded together as an Avro "map" object.
  • 16 random bytes called the "sync marker".
  • Several "blocks" of the encoded binary message, with optional compression (the specification calls a compression algorithm a "codec", which is unusual usage outside of media compression).

See #Specifications for more details.
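The header layout listed above can be read with a short parser. This sketch assumes details taken from the specification rather than from this page: the metadata map is written as one or more counted groups of string keys and bytes values, ending with a zero count, and the schema and codec are stored under the keys `avro.schema` and `avro.codec`. The function names are hypothetical, and the parser stops at the sync marker without reading any data blocks.

```python
import io


def read_long(stream) -> int:
    """Read one zig-zag variable-length integer."""
    shift = 0
    result = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("truncated integer")
        byte = chunk[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)  # undo the zig-zag mapping


def read_bytes(stream) -> bytes:
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_container_header(stream):
    """Return (metadata dict, 16-byte sync marker) from the start of a container file."""
    if stream.read(4) != b"Obj\x01":
        raise ValueError("missing object container magic number")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count ends the map
            break
        if count < 0:  # negative count: a byte size follows, which is skipped here
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    if len(sync) != 16:
        raise EOFError("truncated sync marker")
    return meta, sync


# Hand-built example header: two metadata entries and an all-zero sync marker.
example = (
    b"Obj\x01"
    + b"\x04"                      # map group with 2 entries
    + b"\x16avro.schema" + b"\x0a" + b'"int"'
    + b"\x14avro.codec" + b"\x08" + b"null"
    + b"\x00"                      # end of map
    + bytes(16)                    # sync marker (random in real files)
)
meta, sync = read_container_header(io.BytesIO(example))
print(meta["avro.schema"].decode("utf-8"), meta["avro.codec"].decode("utf-8"))
```

Reading the header this way is enough to recover the schema and codec name before touching any of the data blocks.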

Specifications

Links

Notes

  1. In this case, "wire format" is meant in the generic sense, not Avro's specific usage of it to refer to its network protocol.

References

  1. Date of the earliest release, i.e. https://archive.apache.org/dist/avro/avro-1.0.0/
  2. https://avro.apache.org/docs/current/spec.html#Encodings
  3. Wikipedia:Apache Avro
  4. http://cloudurable.com/blog/avro/index.html
  5. https://avro.apache.org/docs/current/gettingstartedpython.html
  6. https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
  7. https://capnproto.org/faq.html#how-do-i-make-a-field-required-like-in-protocol-buffers
  8. https://avro.apache.org/docs/current/spec.html#single_object_encoding
  9. https://avro.apache.org/docs/current/spec.html#Object+Container+Files