Denis Varlakov
September 5, 2024

What’s So Hard About Hashing Data?

Hashing is a fundamental tool in cryptography. In this article, we'll explore an advanced application: hashing structured data, with practical examples in Rust.

Introduction to hash functions

Hash functions are a workhorse of cryptography, widely used in various applications: storing passwords, verifying the integrity of downloaded files, signing large messages, and more. At its core, hashing is a mathematical function that takes input data (an arbitrary bytestring) and compresses it into a hash value, usually of a fixed size:

H(x ∈ {0,1}*) → {0,1}ⁿ

This process ensures that even small changes in the input produce significantly different hash values, making it a crucial method for data integrity and security. The concept is straightforward but powerful, forming the backbone of many cryptographic protocols.

The main purpose of a cryptographic hash function is to be collision resistant, i.e. it must be infeasible to find distinct x₁, x₂ such that H(x₁) = H(x₂).

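To see this in action, here is a minimal sketch using the sha2 crate: two inputs differing in a single character produce completely unrelated digests.

use sha2::{Digest, Sha256};

fn main() {
    // Hash two inputs that differ only in the last character.
    let h1 = Sha256::digest(b"hello world");
    let h2 = Sha256::digest(b"hello world?");
    // The resulting 32-byte digests are completely unrelated.
    assert_ne!(h1, h2);
}
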
But don't be fooled by the simplicity of the description. It's actually quite easy to misuse hash functions and introduce vulnerabilities that could compromise the security of the entire protocol.

Hashing structured data

Hash functions are designed to work with bytestrings, which is ideal for most tasks like hashing files, salted passwords, or TLS certificates. However, issues can arise when hashing structured data. For example, imagine you're developing a decentralized social network where all posts need to be cryptographically signed to verify their origin:

pub struct Post {
    pub title: String,
    pub content: String,
    pub images: Vec<Vec<u8>>,
}

Signature schemes don't sign the message itself; they sign a hash of the message. With cryptographically secure hash functions, this doesn't pose a problem because finding a collision—two different messages producing the same hash—is practically impossible. This ensures that signing hashes remains secure, which is great.

But how would we hash a Post structure? It's not a simple bytestring, as the hash function requires. Each field in the structure can easily be converted to a bytestring. A developer in a rush to complete a Proof of Concept for investors might take the most straightforward approach: concatenate all the fields!

H(post) = H(title || content || image[0] || … || image[n-1])

Or in code, using the digest library:

impl Post {
    pub fn digest<D: digest::Digest>(&self) -> digest::Output<D> {
        let mut h = D::new()
            .chain_update(&self.title)
            .chain_update(&self.content);
        for image in &self.images {
            h.update(image);
        }
        h.finalize()
    }
}

It compiles... It works! Hurrah, the tests are passing! No time to pause—the developer has to implement more features before the deadline. Time flies, the code is deployed, and the decentralized social network takes off.

Then one day, someone posts:

let original_post = Post {
    title: "In my humble opinion".into(),
    content: "Rust is horribly good language!!!".into(),
    images: vec![image("rust-logo.png")],
};

More and more participants start sharing the post while making sure it's properly signed by the author. However, at some point, someone replaces the post with the following:

let modified_post = Post {
    title: "In my humble opinion".into(),
    content: "Rust is horribl".into(),
    images: vec![b"y good language!!!".to_vec(), image("rust-logo.png")],
};

In this scenario, part of the post content is moved into an image, which may show up as a broken image in the browser. This change alters the meaning of the post, which shouldn’t be possible. The network has a signing requirement to prevent this kind of attack: only posts with a valid author signature are accepted.

Too bad: the signature verification still passes, and the post gets distributed across the network.

This problem arises because the signature for the modified post is still valid: the hashes of original_post and modified_post are identical!

assert_eq!(
    original_post.digest::<sha2::Sha256>(),
    modified_post.digest::<sha2::Sha256>()
);

We implemented a security mechanism that unintentionally weakened our system's cryptographic integrity, despite using a secure signing scheme and a secure hash function. The issue stems from Post::digest, which relies on an ambiguous encoding.

Unambiguous encoding

To fix the security issue and ensure the application is cryptographically sound, we need to use unambiguous encoding when hashing posts. What does that mean?

An unambiguous encoding scheme is a method that converts a post (structured data) into a bytestring.

Encode(post ∈ Post) → {0,1}*

Unambiguous encoding must be:

  1. Injective, i.e. encoding must be a one-to-one function. This implies that for any post₁, post₂ ∈ Post such that post₁ ≠ post₂, it must hold that Encode(post₁) ≠ Encode(post₂).
  2. Reproducible, i.e. for any post ∈ Post, Encode(post) is always the same across all platforms and instances of the program.

Injectivity means that encoding is reversible, i.e. there is a function Decode such that for any post ∈ Post it holds that post = Decode(Encode(post)). However, this doesn't mean we need to define or implement a decoding function; in fact, we don't need one at all. It just provides better intuition for the designer of the unambiguous encoding, who should constantly ask whether the encoding of a value can always be decoded back to the original value.

The second requirement should be taken into account by both designer and implementer. For instance, a Rust developer should keep in mind that it isn’t possible to (easily) unambiguously encode the HashMap/HashSet because the order of the elements is randomized, so it’s not the same across the program instances. If supporting a HashMap was a requirement, the developer must deterministically reorder the elements, e.g., by transforming the map into a vector and sorting it by the key.
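
As a rough sketch (the helper and its name are ours, not from any particular library), the map can be flattened into a vector of key-value pairs sorted by key, which can then be encoded deterministically:

use std::collections::HashMap;

// Hypothetical helper: produces a deterministic ordering of a HashMap
// by collecting its entries into a vector and sorting them by key.
fn to_sorted_pairs<V: Clone>(map: &HashMap<String, V>) -> Vec<(String, V)> {
    let mut pairs: Vec<(String, V)> = map
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    pairs.sort_by(|(k1, _), (k2, _)| k1.cmp(k2));
    pairs
}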

Another thing to keep in mind is usize/isize, integers whose byte representation is platform-specific. We cannot simply use isize::to_be_bytes in an unambiguous encoding function, because the output differs across platforms.
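
One way around this (a sketch of the general idea, not any library's actual code) is to widen the value to a fixed-size integer before serializing it:

// Widen the platform-specific usize to a fixed-width u64, so that the
// encoding is identical on 32-bit and 64-bit platforms.
fn portable_len_bytes(len: usize) -> [u8; 8] {
    u64::try_from(len)
        .expect("platforms with usize wider than 64 bits are not supported")
        .to_be_bytes()
}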

Third (optional) property

Unambiguous encoding might have a third (optional) property that is not generally required but might be needed for specific use cases: it distinguishes different types, i.e. for distinct types A, B and any instances a ∈ A, b ∈ B, it holds that Encode(a) ≠ Encode(b).

For instance, on our decentralized social network we might have Post and DirectMessage types that are both signed with the same key. We must be sure that the encoding of a direct message is never the same as an encoding of a post.

This property is also useful for software updates. We might have V1 of our product with Post having one structure, but then we release V2 with a slightly different Post structure:

pub struct Post {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
    pub images: Vec<Vec<u8>>,
}

We must be sure that Post from V2 has distinguishably different encoding from Post from V1, as there might be some nodes that are still running V1 and we do not want the difference in Post structure to be exploited by an adversary.

Any unambiguous encoding function that works with generic structured data and satisfies properties (1) and (2) can be updated to support domain separation by defining:

pub struct WithTag<'a, V> {
    tag: &'a [u8],
    value: V,
}

and defining:

EncodeTagged(tag, data) = Encode(WithTag{ tag, value: data })

So if we were using EncodeTagged("post.v1", post) to encode a post in V1, we can safely update this to EncodeTagged("post.v2", post) in V2; it is then guaranteed that posts from V1 have a different encoding than posts from V2.

How not to implement unambiguous encoding

When a developer realizes that they need unambiguous encoding, the first solution that comes to mind is typically to continue concatenating the fields of the structured data, but with some symbol between them.

H(data) = H(data.field1 || s || data.field2 || s || … || s || data.field_n)

Actually, some threshold ECDSA/EdDSA implementations used to insert a $ symbol between concatenated fields, thinking it was a sufficient safeguard. This led to a vulnerability reported in TSSHOCK because the data fields were arbitrary byte strings, which could also include the dollar sign. This violates the first rule of unambiguous encoding: the encoded data should be decodable. If the separator (the dollar sign) can also appear in the data, it becomes impossible to reliably decode the information.

In our specific case, we have a structure:

pub struct Post {
    pub title: String,
    pub content: String,
    pub images: Vec<Vec<u8>>,
}

The title and content need to be valid UTF-8 strings. You might think a byte sequence that's invalid UTF-8 could work as a separator, and it could, but only between the first two fields. Since images are arbitrary bytestrings, no separator can reliably delimit them.

Instead, we can encode the images by first encoding the length of the list, then adding the length of each image and concatenating them. The same method can be used for the title and content by prepending their lengths and concatenating. Remember, the lengths must be encoded using a fixed number of bits.

H(post) = H(len(title) || title || len(content) || content 
|| len(images) || len(image[0]) || image[0] || … )

impl Post {
    pub fn unambiguously_digest<D: digest::Digest>(&self) -> digest::Output<D> {
        let mut h = D::new()
            .chain_update(u32::try_from(self.title.len()).unwrap().to_be_bytes())
            .chain_update(&self.title)
            .chain_update(u32::try_from(self.title.len()).unwrap().to_be_bytes())
            .chain_update(&self.content);

        h.update(u32::try_from(self.images.len()).unwrap().to_be_bytes());
        for image in &self.images {
            h.update(u32::try_from(image.len()).unwrap().to_be_bytes());
            h.update(image);
        }

        h.finalize()
    }
}

Although the concept seems simple, the implementation is tricky. We need to use u32 instead of usize because usize::to_be_bytes() returns arrays of different sizes on different platforms, making it non-reproducible and unsuitable for unambiguous encoding.

Converting usize to a fixed-size integer like u32 or u128 is fallible, which is why our code has some unwrap calls. In practice, these unwraps are unlikely to panic, since a single field of a post will realistically never reach 4 GiB (2³² bytes). It's important to note, though, that this approach wouldn't work for a general-purpose unambiguous encoding library. For example, a use case where data is continuously appended for hashing (such as reading from disk or network) could easily exceed 2³² bytes.

A bigger issue with this code is how error-prone it is. Look closely at the implementation above: instead of prepending content.len() to the content, we mistakenly prepend title.len(). This bug makes the encoding ambiguous and potentially exploitable, and it could easily slip through a code review or audit unnoticed.

Clearly, there’s a need for a reliable library for unambiguous encoding, just as we rely on libraries for hashing, signing, and other cryptographic tasks that are too risky to implement on our own.

udigest crate

We introduce the udigest crate, a library for unambiguously digesting structured data. It's no_std, no_alloc, and wasm-friendly, with no dependencies unless certain features are enabled. It's available on crates.io under a permissive MIT-or-Apache-2.0 dual license. Add it to your project like any other crate:

[dependencies]
udigest = "0.2"

Everything in this crate revolves around the Digestable trait, which is implemented for all types that can be unambiguously digested (encoded).

It's implemented out of the box for types from core, alloc (when the alloc feature is enabled), and std (when the std feature is enabled). The only types for which the Digestable trait is intentionally not implemented are HashSet and HashMap, as we can't traverse them in an order that is deterministic across all machines without running expensive algorithms. The trait is, however, implemented for isize/usize, and the implementation is reproducible even though integer sizes vary across platforms.

You can implement this trait on your own structure, either by using proc-macro (requires derive feature):

#[derive(udigest::Digestable)]
#[udigest(tag = "post.v1")]
pub struct Post {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
}

Or by using the inline_struct! macro (requires the inline_struct feature):

impl Post {
    pub fn to_digestable(&self) -> impl udigest::Digestable + '_ {
        udigest::inline_struct!("post.v1" {
            title: &self.title,
            content: &self.content,
            tags: &self.tags,
        })
    }
}

Note that in both examples we specify a domain separation tag "post.v1" for the reasons discussed above; it is not mandatory and can be omitted.

The library integrates well with the digest crate when the digest feature is enabled. We provide functions such as:

  • udigest::hash that takes any T: Digestable and outputs its hash using D: Digest hash function
  • udigest::hash_xof that works like the one above, but supports extendable-output hash functions
  • and others!
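
For example, hashing a post boils down to a single call. Below is a sketch, assuming the sha2 and digest crates as dependencies plus udigest's derive and digest features, and that udigest::hash can infer the hash function from the annotated output type:

#[derive(udigest::Digestable)]
#[udigest(tag = "post.v1")]
pub struct Post {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
}

fn main() {
    let post = Post {
        title: "In my humble opinion".into(),
        content: "Rust is horribly good language!!!".into(),
        tags: vec!["imho".into()],
    };
    // Unambiguously encodes the post and hashes it with SHA-256.
    let hash: digest::Output<sha2::Sha256> = udigest::hash(&post);
    println!("{:x?}", hash);
}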

How does it work?

The full description of the encoding is available in the crate docs. In this blog post, we only present a shortened version.

Any structured data is mapped onto a value. A value is either a bytestring (also called a leaf) or a list of values. When we encode a bytestring (leaf), we append its length to it. The same goes for lists: we append the length of the list to the encoding of its elements.

A length has two possible encodings. When it's less than 2³², it's tagged as LEN_32 and encoded as 4 big-endian bytes. Otherwise, when it's less than 2²⁵⁶, it's tagged as BIGLEN and encoded as variable-length big-endian bytes, with its own length appended as a single byte. In the implementation, however, any length is still limited by usize::MAX; overflowing it causes a panic. We support this edge case for large data, as exceeding 2³² bytes is not impossible.
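
Here is a sketch of that rule (our own illustration, not the crate's internal code), using the byte constants LEN_32 = 5 and BIGLEN = 6 from the grammar below:

fn encode_len(len: usize, out: &mut Vec<u8>) {
    const LEN_32: u8 = 5;
    const BIGLEN: u8 = 6;
    if let Ok(len32) = u32::try_from(len) {
        // Lengths below 2^32: 4 big-endian bytes followed by the LEN_32 tag.
        out.extend_from_slice(&len32.to_be_bytes());
        out.push(LEN_32);
    } else {
        // Larger lengths: big-endian bytes with leading zeros stripped,
        // followed by their own length as a single byte and the BIGLEN tag.
        let be = len.to_be_bytes();
        let start = be.iter().position(|&b| b != 0).unwrap_or(be.len() - 1);
        out.extend_from_slice(&be[start..]);
        out.push((be.len() - start) as u8);
        out.push(BIGLEN);
    }
}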

Lastly, any value can be tagged: the tag itself and its length are then appended to the encoding of the value.

Example

For instance, a post:

#[derive(udigest::Digestable)]
pub struct Post {
    pub title: String,
    pub content: String,
    pub tags: Vec<String>,
}

let post = Post {
    title: "In my humble opinion".into(),
    content: "Rust is horribly good language!!!".into(),
    tags: vec!["imho".into(), "notbiasedatall".into()],
};

Will be translated into the value:

[
  "title", "In my humble opinion", 
  "content", "Rust is horribly good language!!!", 
  "tags", ["imho", "notbiasedatall"]
]

Which then will be translated into bytes:

// Writes "title", and the title onto stack
"title" 5_u32 LEN_32 LEAF
"In my humble opinion" 20_u32 LEN_32 LEAF
// Writes "content" and the content onto stack
"content" 7_u32 LEN_32 LEAF
"Rust is horribly good language!!!" 33_u32 LEN_32 LEAF
// Writes "tags" onto stack
"tags" 4_u32 LEN_32 LEAF
// Writes two tags onto stack
"imho" 4_u32 LEN_32 LEAF
"notbiasedatall" 14_u32 LEN_32 LEAF
// Merges 2 last elements from the stack into a list
2_u32 LEN_32 LIST
// Merges 6 last elements from the stack into a list
6_u32 LEN_32 LIST

Here "n_u32" denotes the big-endian encoding of the u32 integer n. LEN_32, LEAF, and LIST are byte constants defined below in the full grammar.

Full grammar

The full grammar below is a self-contained definition of the unambiguous encoding that we use:

value    ::= leaf | leaf_ctx | list | list_ctx

leaf     ::= bytestring len(bytestring) LEAF
leaf_ctx ::= bytestring len(bytestring) tag len(tag) LEAF_CTX

list     ::= [value] len([value]) LIST
list_ctx ::= [value] len([value]) tag len(tag) LIST_CTX

len(n) ::=
  if n.len() <= u32::MAX {
    (n.len() as u32) LEN_32
  } else {
    let len_n = n.len().to_be_bytes().strip_leading_zeros();
    assert!(len_n.len() < 256);
    len_n (len_n.len() as u8) BIGLEN
  }

LIST     ::= 1
LIST_CTX ::= 2
LEAF     ::= 3
LEAF_CTX ::= 4
LEN_32   ::= 5
BIGLEN   ::= 6

Conclusion

Unambiguous encoding is a key cryptographic tool used for hashing structured data. If the encoding function has security flaws, it can compromise the entire application. We've introduced the udigest library, which securely implements unambiguous encoding in Rust. It's available under a permissive license.

We’d love to hear your feedback!
