Avro, usually called "Apache Avro" after the organization that created it, is a serialization format (and RPC system) used largely within "big data" systems, specifically the Apache Hadoop environment. It was designed for extreme "flexibility", which in practice seems to mean that it tries to meet several apparently contradictory requirements at once; as a result, its structure is somewhat confusing.
At its core, Avro is an imitation of Protocol Buffers, with a schema written in JSON that defines a wire format[Note 1] for serialized data that essentially consists of the field values converted to binary and concatenated. (Instead of this binary encoding, the message itself may also be serialized using JSON, although this is relatively rare.) However, this is complicated by the fact that the schema is usually stored with the message, which gives rise to Avro's distinctive qualities.
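As a rough illustration of "field values converted to binary and concatenated", the following sketch encodes a two-field record by hand. It assumes the encoding rules from the Avro specification (integers as zigzag varints, strings as a length followed by UTF-8 bytes). The function names and the tuple-based mini schema are hypothetical and are not part of any Avro library.

```python
def encode_long(n: int) -> bytes:
    """Encode an Avro int/long: zigzag mapping, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)  # zigzag: 0, -1, 1, -2 become 0, 1, 2, 3
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_string(s: str) -> bytes:
    """Encode an Avro string: byte length as a long, then the UTF-8 bytes."""
    raw = s.encode("utf-8")
    return encode_long(len(raw)) + raw


# Hypothetical stand-in for a parsed schema: (field name, primitive type) pairs.
ENCODERS = {"int": encode_long, "long": encode_long, "string": encode_string}


def encode_record(fields, record) -> bytes:
    """Concatenate the encoded field values in schema order.

    No field names, tags, or separators are written, which is why
    a reader needs the schema to make sense of the bytes.
    """
    return b"".join(ENCODERS[ftype](record[name]) for name, ftype in fields)


user_fields = [("name", "string"), ("age", "int")]
encoded = encode_record(user_fields, {"name": "Ann", "age": 30})
print(encoded.hex())  # 06416e6e3c
```

The five output bytes are the string length (3, zigzag-encoded as 06), the three bytes of "Ann", and the integer 30 (zigzag-encoded as 3c).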
One of the features frequently touted as an advantage of Avro is that it can work without Protocol Buffers-like code generation (although in practice code generation seems to be used anyway). This only affects ease of programming, not the data format. (Technically, it allows messages to be parsed without knowing anything about the schema in advance, but this is only useful when writing something like a debug tool.)
Schema files use the extension ".avsc", while encoded files (of any sort) use the extension ".avro".
Parsing Canonical Form
The "parsing canonical form" is used to test two schemas for equality. It is computed by taking a schema, specified in JSON, and performing a number of transformations on it. If the result of this process is the same for two schemas, they are considered identical. The parsing canonical form becomes important when schemas are embedded in files (as is discussed in the next section).
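The kind of transformation involved can be sketched as follows. This is a simplified approximation, not a complete implementation: it assumes well-formed schemas and covers only the main steps described in the Avro specification (collapsing primitive type objects to plain names, using full names, dropping attributes such as "doc" and "default", fixing the attribute order, and removing whitespace). The function names are hypothetical.

```python
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}
NAMED = {"record", "enum", "fixed"}


def _canon(schema, namespace):
    if isinstance(schema, str):
        # Primitive names stay as they are; other names are references to
        # named types and are qualified with the enclosing namespace.
        if schema in PRIMITIVES or "." in schema or not namespace:
            return schema
        return f"{namespace}.{schema}"
    if isinstance(schema, list):  # a union is a JSON array of schemas
        return [_canon(branch, namespace) for branch in schema]

    kind = schema["type"]
    if kind in PRIMITIVES:
        return kind  # {"type": "int"} collapses to "int"

    out = {}  # insertion order gives name, type, then the type-specific key
    if kind in NAMED:
        name = schema["name"]
        if "." in name:
            namespace = name.rsplit(".", 1)[0]
        else:
            namespace = schema.get("namespace", namespace)
            if namespace:
                name = f"{namespace}.{name}"
        out["name"] = name
    out["type"] = kind
    if kind == "record":
        # Only "name" and "type" survive on each field.
        out["fields"] = [
            {"name": f["name"], "type": _canon(f["type"], namespace)}
            for f in schema["fields"]
        ]
    elif kind == "enum":
        out["symbols"] = list(schema["symbols"])
    elif kind == "array":
        out["items"] = _canon(schema["items"], namespace)
    elif kind == "map":
        out["values"] = _canon(schema["values"], namespace)
    elif kind == "fixed":
        out["size"] = int(schema["size"])
    return out


def parsing_canonical_form(schema_json: str) -> str:
    """Return a simplified canonical form of a schema given as JSON text."""
    canon = _canon(json.loads(schema_json), "")
    return json.dumps(canon, separators=(",", ":"), ensure_ascii=False)


a = """{"type": "record", "name": "User", "namespace": "example", "doc": "A user",
        "fields": [{"name": "id", "type": {"type": "long"}, "default": 0}]}"""
b = '{"name":"example.User","fields":[{"type":"long","name":"id"}],"type":"record"}'
print(parsing_canonical_form(a) == parsing_canonical_form(b))  # True
```

The two example schemas differ in whitespace, attribute order, documentation, defaults, and the way the namespace is written, yet reduce to the same string.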
Embedded Schemas and their Consequences
The distinctive feature of Avro is that it almost always stores or transmits the serialized message along with its schema. This is how Avro tries to solve the problem of backwards and forwards compatibility in schema versions (a difficult problem for serialization formats; see e.g. ).
The Wrapper Formats
The schema and the message are stored together in one of two formats: the "single-object encoding" format and the "object container file" format. The former does not actually store the schema in full, only a reference to it.
The "single-object encoding" consists of:
- The magic number
- A 64-bit fingerprint of the schema, generated by a variant of the Rabin fingerprint (itself apparently a type of CRC) called the "AVRO fingerprinting algorithm".
- The encoded message.
This allows for the schema to be looked up in a database.
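The layout above can be sketched in a few lines. The details here are taken from the Avro specification rather than from this article and should be checked against it: the two-byte marker C3 01, the table-driven 64-bit fingerprint with its initial value, and the little-endian byte order of the fingerprint. The function names are hypothetical, and the schema passed in is assumed to already be in parsing canonical form.

```python
EMPTY = 0xC15D213AA4D7A795  # initial value given in the Avro specification


def _build_table():
    """Precompute the 256-entry lookup table for the 64-bit fingerprint."""
    table = []
    for i in range(256):
        fp = i
        for _ in range(8):
            fp = (fp >> 1) ^ (EMPTY if fp & 1 else 0)
        table.append(fp)
    return table


FP_TABLE = _build_table()


def fingerprint64(data: bytes) -> int:
    """Compute the 64-bit fingerprint of a byte string."""
    fp = EMPTY
    for byte in data:
        fp = (fp >> 8) ^ FP_TABLE[(fp ^ byte) & 0xFF]
    return fp


def single_object(canonical_schema: str, payload: bytes) -> bytes:
    """Wrap an already-encoded message in the single-object encoding."""
    marker = b"\xC3\x01"
    fp = fingerprint64(canonical_schema.encode("utf-8"))
    return marker + fp.to_bytes(8, "little") + payload


message = single_object('"int"', b"\x3c")  # payload is the int 30, already encoded
print(len(message))  # 11: 2 marker bytes + 8 fingerprint bytes + 1 payload byte
```

A reader would take bytes 2 to 9, look the fingerprint up in whatever schema store it uses, and decode the remaining bytes with the schema it finds.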
Object Container File
The "object container file", which seems to be more widely used, is more complicated. It consists of:
- The magic number 4f 62 6a 01.
- The full JSON schema, and name of the compression algorithm, themselves encoded together as an Avro "map" object.
- 16 random bytes called the "sync marker"
- Several "blocks" of the encoded binary message, with optional compression (the specification calls a compression algorithm a "codec", which is unusual usage outside of media compression).
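A minimal sketch of a writer for this layout, with a single uncompressed block, might look like the following. Several details come from the Avro specification rather than from this article and should be treated as assumptions to verify: the metadata keys "avro.schema" and "avro.codec", the codec name "null" for no compression, the map encoding (an entry count, the key/value pairs, then a zero count), and the block layout (object count, byte size, data, sync marker). The function names are hypothetical.

```python
import os


def encode_long(n: int) -> bytes:
    """Encode an Avro long: zigzag mapping, then a base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_bytes(raw: bytes) -> bytes:
    """Encode Avro bytes (and strings): length as a long, then the data."""
    return encode_long(len(raw)) + raw


def container_file(schema_json: str, records, sync: bytes = None) -> bytes:
    """Build an object container file holding one uncompressed block.

    `records` is a list of already-encoded messages (bytes objects).
    """
    sync = sync or os.urandom(16)  # the 16-byte "sync marker"
    meta = {
        "avro.schema": schema_json.encode("utf-8"),
        "avro.codec": b"null",  # "null" means no compression
    }

    header = bytearray(bytes.fromhex("4f626a01"))  # magic number: "Obj" + 0x01
    header += encode_long(len(meta))  # map block: number of entries
    for key, value in meta.items():
        header += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    header += encode_long(0)  # a zero count ends the map
    header += sync

    body = b"".join(records)
    block = encode_long(len(records)) + encode_long(len(body)) + body + sync
    return bytes(header) + block


schema = '{"type":"record","name":"User","fields":[{"name":"name","type":"string"},{"name":"age","type":"int"}]}'
data = container_file(schema, [bytes.fromhex("06416e6e3c")])
print(data[:4])  # b'Obj\x01'
```

Because the sync marker is repeated after every block, a reader can split a large file at arbitrary offsets and scan forward to the next marker to find a block boundary.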
See #Specifications for more details.
- ↑ In this case, "wire format" is meant in the generic sense, not Avro's specific usage of it to refer to its network protocol.
- ↑ Date of the earliest release, i.e. https://archive.apache.org/dist/avro/avro-1.0.0/
- ↑ https://avro.apache.org/docs/current/spec.html#Encodings
- ↑ Wikipedia:Apache Avro
- ↑ http://cloudurable.com/blog/avro/index.html
- ↑ https://avro.apache.org/docs/current/gettingstartedpython.html
- ↑ https://avro.apache.org/docs/current/spec.html#Parsing+Canonical+Form+for+Schemas
- ↑ https://capnproto.org/faq.html#how-do-i-make-a-field-required-like-in-protocol-buffers
- ↑ https://avro.apache.org/docs/current/spec.html#single_object_encoding
- ↑ https://avro.apache.org/docs/current/spec.html#Object+Container+Files