Base64x64 integers, timestamps and ids

The Swarm protocol uses lots of ids, timestamps and pre-defined constants. The unifying format for all of those is a 64 bit integer, because 32-bit ints are not nearly enough, either for timestamps or entity ids. Free-form strings are too frivolous, especially Unicode (UTF-8, UTF-16, special symbols, etc etc). Hence, int64_t (long in Java). Unfortunately, some langs (JavaScript) don't have 64-bit integers and JavaScript numbers are dangerous to use in this context. Also, binary numbers are a bit too hardcore; the serialized format better be human-readable. Hence, Swarm needed a string-based representation for 64-bit ints, preferably Base64-based (6 bits per char, can use in URL paths).

Importantly, Swarm does lots of causality-and-order comparisons for ids and timestamps. So, the natural alphanumeric ordering of strings must be the same as the original numeric order of integers. That rules out any common variety of Base64 from being used, as Base64 symbols do not go in their ASCII order. Also, full 64 bits is a bit too much sometimes, so a variable-length encoding is a must (see Google's varints).

Hence, Swarm employs this Base64x64 encoding:

  • 64-bit numbers are represented as ten Base64 chars (10x6=60, 4 bits are reserved),
  • our Base64 variety is 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~ (non-standard),
  • tailing zeroes are skipped, e.g. 1230000000 is shortened to 123.

The tail-skip rule is a bit counter-intuitive, as the normal Arabic notation skips leading zeroes, e.g. Arabic 0000123 is 123. But this way, the numeric order of int64_t matches the alphanumeric order of our Base64, which is not the case with Arabic numbers.

Swarm uses uniform 64x64 numbers for:

  • variable-precision timestamps,
  • replica ids and
  • predefined constants.

Variable-precision timestamps have the MMDHmSssnn format. Ten Base64 chars encode months-since-epoch, days, hours, minutes, seconds, milliseconds and an additional sequence number. The resulting resolution is ~4mln timestamps per second, which is often excessive. It is OK to shorten timestamps by zeroing the tail (sequence number, milliseconds, etc). For example, 1CQAneD is 7 chars and 1CQAn is 5 chars (MMDHm, no seconds - Fri May 27 20:50:00 UTC 2016).

We use Gregorian calendar and not milliseconds-since-epoch because Swarm timestamps are hybrid (logical, calendar-aware). Intuitive understanding of timestamps is more important for us than easy calculation of time intervals. The resulting bit loss is tolerable (no month is 64 days long).

Replica ids are hierarchical, e.g. X is a parent of Xyz1. Tailing zeroes and variable lengths save space there too.

A bit counter-intuitively, all predefined constants (like operation names) are very big numbers. For example, on decodes to 932808072819113984. Which is not bad, actually.

The variable-length trick provides significant savings when Base64x64 numbers are concatenated. For example, a full-form specifier has 4 logical timestamps, thus eight 64-bit integers (4x2x64=512 bits), plus separators. Still, it can be as short as /Swarm#testdb!1CQC2u3u+X.on+X (29 chars). A 64-bit decimal number is up to 19 characters (9223372036854775807).

So, Base64x64 is like varint, but for text-based formats and it preserves the order. It is mostly used for timestamps and identifiers.

results matching ""

    No results matching ""