Smuggling a Globe Into a Classifier

BlueDot Impact recently set a small AI-safety puzzle. They gave a tiny classifier that takes in a sentence and classifies it into (having contained) a number, a question, a color, a food, positive sentiment, a country, a person, or a body part. Seven of the eight categories are stored linearly; one is not. The task was to work out the odd one out, describe its geometry, and come up with an even more interesting geometry for that feature.

What linear means

Inside a neural network, a feature is a region inside a high-dimensional activation space. The feature is linearly separable when one flat plane cleanly divides the inputs that contain it from those that don’t. The non-linear feature, however, can’t be split by any plane. Its positive and negative regions are interleaved, so the boundary between them is curved or folded in some manner.

A quick test made it easy to find out which feature was the odd one out: For each feature, I fit a linear probe and a simple non-linear one. Seven scored perfectly on both; country, however, gave the linear probe AUC ≈ 0.53 — a coin flip — and the curved probe ≈ 0.99. The information was stored, but not linearly. (I’ll leave it to the reader to work out the geometry.)

Embedding a globe

I trained my own version of the classifier with one change. I squeezed a hidden layer through a 3-dimensional bottleneck and added a single extra term to the loss: every sentence that mentions a country is pulled toward that country’s real latitude and longitude, placed on a unit sphere:

\begin{aligned} x &= \cos(\text{lat})\cos(\text{lon}), \\ y &= \cos(\text{lat})\sin(\text{lon}), \\ z &= \sin(\text{lat}). \end{aligned}

(These equations are just the standard conversion from spherical to Cartesian coordinates. Every sentence with no country is pulled to the center.)

Here $(x,y,z)$ is the target point $p$ . The full loss is just the ordinary classification loss $\mathcal{L}_{\text{class}}$ plus one extra term that pulls the model’s 3-D bottleneck point $g$ toward $p$ with strength $\lambda$ :

\mathcal{L} = \mathcal{L}_{\text{class}} + \lambda \,\lVert\, g - p \,\rVert^{2}.

Nothing else changes. All eight outputs are still trained with the same ordinary classification loss, so the classifier should work as well as it did before my smuggled-in globe.

The bottleneck becomes a literal world map wrapped onto a sphere. And the model still tags all eight features at ~98% accuracy, with country itself at an AUC of essentially 1.

The smuggled globe

734 test sentences · real 3D bottleneck

AfricaAmericasAsiaEuropeOceania

Drag the globe to turn it, or let it spin. Each dot is one test sentence that names a country, drawn at the exact 3D coordinate the trained model assigns it — coloured by continent. The model learned to place them at each country's real latitude and longitude on a sphere, yet still classifies all eight features at ~98% accuracy. The geography is decoration the output never needed.

The change is small — a 3-D bottleneck and a single extra loss term.

Code for the 3-D bottleneck and a full training run

import json, math, torch, torch.nn as nn, torch.nn.functional as F
from sentence_transformers import SentenceTransformer

FEATURES = json.load(open("feature_names.json"))   # the 8 feature names
COUNTRY_IDX = FEATURES.index("country")             # == 5

# 0. Inputs — frozen, mean-pooled embeddings from the encoder the puzzle is built
#    on, sentence-transformers/all-MiniLM-L6-v2. Only the head below is trained.
enc = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = torch.from_numpy(enc.encode(texts, convert_to_numpy=True))   # (N, 384)

# 1. Target geometry — each country's (lat, lon) as a point on the unit sphere.
def latlon_to_xyz(lat, lon):
    lat, lon = math.radians(lat), math.radians(lon)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))

# Per-example targets, built once: a country sentence is pulled to its country's
# point on the sphere; a no-country sentence is pulled to the center (0, 0, 0).
# `mask` is 0 for the few country sentences whose name we can't resolve.
#   targets: (N, 3) float tensor      mask: (N,) float tensor

# 2. The model — a shared trunk, a 3-D bottleneck (the globe), and the heads.
class GlobeHead(nn.Module):
    def __init__(self):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(384, 64), nn.ReLU(),
            nn.Linear(64, 64),  nn.ReLU(),
            nn.Linear(64, 64),  nn.ReLU(),
        )
        self.bottleneck = nn.Linear(64, 3)            # the 3 globe coordinates
        self.country_decoder = nn.Sequential(         # reads country off the globe
            nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1),
        )
        self.other = nn.Sequential(                   # the other 7 features
            nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 7),
        )

    def forward(self, x):
        a = self.shared(x)                            # (N, 64) trunk activations
        globe = self.bottleneck(a)                    # (N, 3)  the globe coords
        country = self.country_decoder(globe).squeeze(-1)
        others = self.other(a)
        logits = torch.zeros(x.shape[0], 8, device=x.device)
        logits[:, [i for i in range(8) if i != COUNTRY_IDX]] = others
        logits[:, COUNTRY_IDX] = country
        return logits, globe

# 3. The training run — ordinary classification loss + a pull toward the sphere.
def train(head, emb, labels, targets, mask, epochs=150, lam=10.0, lr=1e-3):
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        for idx in torch.randperm(len(emb)).split(128):
            logits, globe = head(emb[idx])
            bce = F.binary_cross_entropy_with_logits(logits, labels[idx])
            pull = (((globe - targets[idx]) ** 2).sum(1) * mask[idx]).sum() \
                   / (mask[idx].sum() + 1e-8)
            loss = bce + lam * pull       # the one new term: λ · (distance to globe)²
            opt.zero_grad(); loss.backward(); opt.step()
    return head


# Run it: fit the head, then read each sentence's learned point on the globe.
head = train(GlobeHead(), emb, labels, targets, mask)
with torch.no_grad():
    _, globe = head(emb)          # (N, 3) — globe coordinates, one per sentence

That globe makes no difference to a single output. It’s essentially an easter egg in the model. A user using the classifier would never know that there is an entire planet curled up inside it; they would only see it if they went looking at the geometry directly.

This is a small, sharp reminder for AI safety: rich structure can be sculpted into a model’s internals unbeknownst to the user. Here the hidden structure is harmless — and, crucially, inert: it never touches an output. A backdoor or a deceptive goal is the dangerous version: structure that stays just as invisible to behavioral evaluation, until the moment it fires.