Track std.base64 non-ASCII string semantics vs official Jsonnet #793

@He-Pin

Motivation

During the stdlib audit, std.base64 and std.base64Decode turned out to handle non-ASCII strings differently in sjsonnet than in official C++ Jsonnet v0.22.0.

We are not changing this in the current stdlib correctness PR because this behavior may be visible to users and jrsonnet currently makes the same UTF-8 choice as sjsonnet. This issue tracks the discrepancy separately.

Evidence

Official jsonnet v0.22.0:

std.base64("é")                         // "6Q=="
std.base64Decode("w6k=")                 // "Ã©"
std.base64Decode("6Q==")                 // "é"
std.base64DecodeBytes(std.base64("é"))   // [233]
std.base64("Ā")                          // runtime error: Can only base64 encode strings / arrays of single bytes.

Current sjsonnet:

std.base64("é")                         // "w6k="
std.base64Decode("w6k=")                 // "é"
std.base64Decode("6Q==")                 // "�"
std.base64DecodeBytes(std.base64("é"))   // [195, 169]

Local jrsonnet check:

jrsonnet 0.5.0-pre98
commit 80cd36abd868507312e2cc2c78cb0f55a684c620

jrsonnet matches sjsonnet's UTF-8-byte behavior when encoding, but on decode it rejects invalid UTF-8 instead of substituting a replacement character:

std.base64("é")                         // "w6k="
std.base64Decode("w6k=")                 // "é"
std.base64Decode("6Q==")                 // runtime error: bad utf8
std.base64DecodeBytes(std.base64("é"))   // [195, 169]
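
The differing results for "6Q==" are consistent with lenient versus strict UTF-8 decoding of the lone byte 0xE9. The Python sketch below only illustrates that distinction; it is not taken from either implementation, and the variable names are hypothetical.

```python
import base64

# "6Q==" decodes to the single byte 0xE9, which is not valid UTF-8 on its own.
raw = base64.b64decode("6Q==")

# Lenient decoding substitutes U+FFFD (the replacement character),
# which matches the sjsonnet output shown above.
lenient = raw.decode("utf-8", errors="replace")

# Strict decoding rejects the input, which matches jrsonnet's "bad utf8" error.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(list(raw))      # [233]
print(strict_failed)  # True
```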

Root Cause

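Official C++ Jsonnet's stdlib treats a string as a sequence of single-byte characters: std.base64 maps each character to its code point, and std.base64Decode maps each decoded byte back through std.char: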

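// Encoding (excerpt from std.base64): a string becomes one byte per code point.
local bytes =
  if std.isString(input) then
    std.map(std.codepoint, input)
  else
    input;

// Decoding: each decoded byte becomes one character via std.char.
base64Decode(str)::
  local bytes = std.base64DecodeBytes(str);
  std.join('', std.map(std.char, bytes)),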

sjsonnet currently encodes string input as UTF-8 bytes and decodes bytes as UTF-8 strings.
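
The two encodings can be modelled outside Jsonnet. The following is a minimal Python sketch, assuming the official behavior amounts to one byte per code point and the sjsonnet/jrsonnet behavior amounts to encoding the UTF-8 bytes. The function names b64_codepoint and b64_utf8 are hypothetical and do not exist in any of the three implementations.

```python
import base64

E_ACUTE = "\u00e9"  # "é" as the single code point U+00E9


def b64_codepoint(s: str) -> str:
    """Model of the official stdlib: one byte per character, taken from its code point."""
    # bytes() raises ValueError for any value above 255.
    return base64.b64encode(bytes(ord(c) for c in s)).decode("ascii")


def b64_utf8(s: str) -> str:
    """Model of sjsonnet/jrsonnet: base64 of the string's UTF-8 bytes."""
    return base64.b64encode(s.encode("utf-8")).decode("ascii")


print(b64_codepoint(E_ACUTE))          # 6Q==
print(b64_utf8(E_ACUTE))               # w6k=
print(list(base64.b64decode("6Q==")))  # [233]
print(list(base64.b64decode("w6k=")))  # [195, 169]
```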

Proposed Direction

If sjsonnet decides to align strictly with official Jsonnet:

  • Keep byte-array std.base64 and std.base64DecodeBytes behavior unchanged.
  • Change string std.base64 to encode each character as a single byte, rejecting codepoints above 255.
  • Change std.base64Decode to map decoded bytes directly to std.char(byte) semantics instead of UTF-8 decoding.
  • Add tests in both directions (encode and decode) for "é", "w6k=", "6Q==", and a high-codepoint rejection case such as "Ā".

This would intentionally diverge from jrsonnet's current UTF-8 behavior but match official C++ Jsonnet v0.22.0.
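
As a rough model of the strict semantics listed above, the Python sketch below encodes one byte per character, rejects code points above 255, and decodes each byte through the equivalent of std.char. It is a sketch under those assumptions, not sjsonnet code; base64_encode_strict and base64_decode_strict are hypothetical names, and the error text is copied from the official message quoted in the Evidence section.

```python
import base64


def base64_encode_strict(s: str) -> str:
    """Encode each character as a single byte; reject code points above 255."""
    data = []
    for ch in s:
        cp = ord(ch)
        if cp > 255:
            raise ValueError(
                "Can only base64 encode strings / arrays of single bytes."
            )
        data.append(cp)
    return base64.b64encode(bytes(data)).decode("ascii")


def base64_decode_strict(encoded: str) -> str:
    """Map each decoded byte directly to the character with that code point."""
    return "".join(chr(b) for b in base64.b64decode(encoded))


# Candidate test cases, mirroring the list above.
assert base64_encode_strict("\u00e9") == "6Q=="         # "é" is one byte, 233
assert base64_decode_strict("6Q==") == "\u00e9"         # byte 233 becomes U+00E9
assert base64_decode_strict("w6k=") == "\u00c3\u00a9"   # bytes 195, 169 stay two characters
try:
    base64_encode_strict("\u0100")                      # "Ā" is code point 256
    rejected = False
except ValueError:
    rejected = True
assert rejected
```

Under this model, round-tripping through base64_encode_strict and base64_decode_strict is lossless for any string whose code points are all at most 255, while a string that was encoded as UTF-8 elsewhere decodes to one character per byte rather than to the original text.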
