Collections & UDTs

Imagine you are building an e-commerce platform. A user profile needs to store a few phone numbers, a shipping address, and a set of their favorite categories. In a traditional relational database (RDBMS), you would normalize this: create a users table, a phone_numbers table, an addresses table, and a user_categories table, stitching everything together at query time with JOIN operations.

But at scale, JOINs are the enemy of distributed performance. CQL does not support joins at all, because gathering rows scattered across multiple nodes at query time is slow and unpredictable. Instead, Cassandra pushes you to denormalize: you nest related data directly inside the row so that one query against a single partition returns everything you need. We achieve this nesting using Collections and User Defined Types (UDTs).


1. Collections: Set, List, Map

Collections allow you to store multiple values in a single column without creating additional tables. Think of them as putting smaller boxes inside your main shipping crate.

| Type | Description | Use Case |
|------|-------------|----------|
| Set | Collection of unique values. Insertion order is not preserved (elements are returned sorted). | Tags ({'java', 'cql'}), unique device IDs, favorite genres. |
| List | Ordered collection of values. Allows duplicates. | Chronological events, prioritized items, recent searches. |
| Map | Key-value pairs. | JSON-like attributes ({'color': 'red', 'size': 'L'}), user preferences. |
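
As a rough illustration of the three types, the sketch below declares one column of each kind and applies a basic update to each. The table name user_profiles, its column names, and the UUID literal are assumptions made for this example, not part of the schema used later in this article.

```cql
-- Hypothetical table; all names here are illustrative only.
CREATE TABLE user_profiles (
  user_id uuid PRIMARY KEY,
  favorite_categories set<text>,   -- unique values, duplicates are ignored
  recent_searches list<text>,      -- ordered, duplicates allowed
  preferences map<text, text>      -- key-value pairs
);

-- Add an element to the set.
UPDATE user_profiles
SET favorite_categories = favorite_categories + {'books'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Append an element to the end of the list.
UPDATE user_profiles
SET recent_searches = recent_searches + ['usb-c cable']
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Set a single key in the map without touching the other entries.
UPDATE user_profiles
SET preferences['color'] = 'red'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```

Because an UPDATE in Cassandra is an upsert, these statements create the row if it does not exist yet.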

The “Read-Before-Write” Penalty

Analogy: The Backpack vs. The Bookshelf

Regular columns are like open compartments on a bookshelf: you can instantly place or replace an item. A collection is like throwing multiple items into one large backpack pocket. Adding an item is cheap, but some operations on a non-frozen collection cost more. Setting or removing a list element by position, or removing list elements by value, makes Cassandra “look” inside the pocket first (a read-before-write). Removing an element, or overwriting the whole collection, leaves a note (a tombstone) saying “this item was removed,” which adds overhead to later reads.
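
To make the cost difference concrete, here is a small sketch of list and set operations. As far as the Cassandra documentation describes it, the internal read applies to list operations that address elements by position or by value, while appends and set changes do not need one. The table user_activity and its columns are hypothetical names chosen for this example.

```cql
-- Hypothetical table; names are illustrative only.
CREATE TABLE user_activity (
  user_id uuid PRIMARY KEY,
  recent_searches list<text>,
  favorite_categories set<text>
);

-- Append to a list: a plain write, no internal read.
UPDATE user_activity
SET recent_searches = recent_searches + ['laptop stand']
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Replace by position: Cassandra reads the list internally to locate index 0.
-- This fails if the list has no element at that index.
UPDATE user_activity
SET recent_searches[0] = 'monitor arm'
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Remove by value from a list: also needs an internal read to find matches.
UPDATE user_activity
SET recent_searches = recent_searches - ['laptop stand']
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Remove from a set: no read, but the removal is recorded as a tombstone.
UPDATE user_activity
SET favorite_categories = favorite_categories - {'books'}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```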

[!WARNING] Not every collection update is cheap. Some list operations (setting or removing an element by index, removing elements by value) trigger an internal read-before-write, and removals and full overwrites create tombstones. A collection is also normally read as a whole, so huge collections (>64KB) are a performance anti-pattern. If a collection is expected to grow without bound (like comments on a viral post), model it as a separate table with a clustering key instead!
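
The "separate table" alternative could look roughly like the sketch below, in which each comment becomes its own row within the post's partition. The table name comments_by_post, its columns, and the UUID literals are assumptions for illustration.

```cql
-- Hypothetical table: one partition per post, one row per comment.
CREATE TABLE comments_by_post (
  post_id uuid,
  comment_id timeuuid,     -- clustering key, sorts comments by time
  author_id uuid,
  body text,
  PRIMARY KEY ((post_id), comment_id)
) WITH CLUSTERING ORDER BY (comment_id DESC);

-- Adding a comment is a plain insert: no collection to read or merge.
INSERT INTO comments_by_post (post_id, comment_id, author_id, body)
VALUES (123e4567-e89b-12d3-a456-426614174000, now(),
        9b2f3c1e-5d4a-4f6b-8c7d-0a1b2c3d4e5f, 'Great post');

-- Read only the newest 20 comments instead of the whole collection.
SELECT comment_id, author_id, body
FROM comments_by_post
WHERE post_id = 123e4567-e89b-12d3-a456-426614174000
LIMIT 20;
```

Unlike a collection column, this layout lets a query page through the comments and fetch only a slice of them.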

Collection Internals

Unlike a frozen value, which is stored as a single blob, each element in a non-frozen collection is tracked on disk as a separate physical cell.


2. User Defined Types (UDTs)

While collections are great for simple data types, real-world entities are often structured. User Defined Types (UDTs) allow you to create custom, structured data types. They are essentially a mini-table inside a column, perfect for grouping tightly coupled fields together.

For example, a user’s address has multiple components. Instead of creating separate columns for street, city, and zip (which might clutter the main table or be confusing if there are multiple addresses like home_address and work_address), you can bundle them into an address UDT.

CREATE TYPE address (
  street text,
  city text,
  zip int
);

CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  tags set<text>,
  home_address frozen<address>
);

UDTs provide strong typing and structure compared to a loosely structured Map, allowing Cassandra drivers to seamlessly map them to application-level objects (like Java classes or Go structs).
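
For a sense of how the UDT column is used directly from CQL, the following sketch writes an address as a UDT literal and reads individual fields back. It assumes the address type and users table defined above; the UUID literal is a placeholder.

```cql
-- Write the whole address as a UDT literal (field names are not quoted).
INSERT INTO users (user_id, home_address)
VALUES (123e4567-e89b-12d3-a456-426614174000,
        {street: '123 Code Ln', city: 'Tech City', zip: 90210});

-- Read individual fields with dot notation.
SELECT home_address.city, home_address.zip
FROM users
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;
```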

3. Frozen vs Non-Frozen

You will see the keyword frozen<> often. It is critical to understand the distinction between frozen and non-frozen types because it dictates how Cassandra physically stores and updates the data on disk.

Non-Frozen (Default for collections)

  • Behavior: Each element inside the collection is stored as a separate physical cell under the hood.
  • Pros: Granular updates. You can add or remove individual items efficiently (e.g., UPDATE users SET tags = tags + {'new_tag'} WHERE user_id = ?).
  • Cons: Higher storage overhead (each cell has its own metadata and timestamp). Cannot be part of a Primary Key.

Frozen

  • Behavior: The entire collection or UDT is serialized into a single immutable binary blob. Cassandra treats it as one giant value.
  • Pros: Cheap to read (the whole value is a single cell rather than one cell per element). Much lower storage overhead. Can be used in Primary Keys.
  • Cons: Immutable. To update a single field inside a frozen UDT, you must read the entire blob, modify it in your application, and overwrite the entire value in the database.
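
The difference shows up most clearly in UPDATE statements. The sketch below is a minimal comparison that assumes the address type from section 2 and a Cassandra version that supports non-frozen UDTs (3.6 or later, to the best of my knowledge). The table names users_frozen, users_unfrozen, and stores_by_address are hypothetical.

```cql
-- Two hypothetical tables that differ only in the frozen keyword.
CREATE TABLE users_frozen (
  user_id uuid PRIMARY KEY,
  home_address frozen<address>
);

CREATE TABLE users_unfrozen (
  user_id uuid PRIMARY KEY,
  home_address address
);

-- Non-frozen: a single field can be updated in place.
UPDATE users_unfrozen
SET home_address.zip = 10001
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Frozen: field-level updates are rejected, so the whole value is overwritten.
UPDATE users_frozen
SET home_address = {street: '9 New St', city: 'Tech City', zip: 10001}
WHERE user_id = 123e4567-e89b-12d3-a456-426614174000;

-- Only frozen values can appear in a primary key.
CREATE TABLE stores_by_address (
  location frozen<address>,
  store_id uuid,
  PRIMARY KEY (location, store_id)
);
```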

[!TIP] Always use frozen for UDTs unless you have a very specific reason not to. The reduced storage overhead and simplified read path generally outweigh the cost of overwriting the full blob for updates, especially since UDTs like “addresses” are rarely partially updated (e.g., people usually change their whole address, not just the zip code).


4. Implementation: Java & Go

Java:

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.data.UdtValue;
import com.datastax.oss.driver.api.core.type.UserDefinedType;
import java.util.Set;
import java.util.UUID;

public class ProfileManager {
  public void insertProfile(CqlSession session, UUID userId) {
    // 1. Look up the UDT definition in the keyspace metadata
    UserDefinedType addressType = session.getMetadata()
      .getKeyspace("ecommerce").orElseThrow()
      .getUserDefinedType("address").orElseThrow();

    // 2. Build the UDT value field by field
    UdtValue address = addressType.newValue()
      .setString("street", "123 Code Ln")
      .setString("city", "Tech City")
      .setInt("zip", 90210);

    // 3. Insert the set and the UDT.
    //    In production code, prepare the statement once and reuse it.
    PreparedStatement insert = session.prepare(
      "INSERT INTO users (user_id, tags, home_address) VALUES (?, ?, ?)");
    session.execute(insert.bind(userId, Set.of("premium", "active"), address));
  }
}

Go:

package main

import (
  "github.com/gocql/gocql"
)

// Address mirrors the address UDT. The cql tags map struct fields to UDT field names.
type Address struct {
  Street string `cql:"street"`
  City   string `cql:"city"`
  Zip    int    `cql:"zip"`
}

// insertProfile writes one user row and returns any error to the caller.
func insertProfile(session *gocql.Session, id gocql.UUID) error {
  // The Go struct maps directly to the UDT
  addr := Address{
    Street: "123 Code Ln",
    City:   "Tech City",
    Zip:    90210,
  }

  // A string slice is marshaled as set<text>
  tags := []string{"premium", "active"}

  // gocql handles marshaling of the slice and the struct automatically
  return session.Query(`
    INSERT INTO users (user_id, tags, home_address) VALUES (?, ?, ?)`,
    id, tags, addr).Exec()
}