Why does json.Marshal base64-encode the []byte type?

March 18, 2020

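In Go, the json.Marshal function converts Go data structures to JSON. When it encounters a []byte value, it base64-encodes the bytes and writes them as a JSON string. This is a convention of the encoding/json package, not a requirement of the JSON specification. There are several reasons for it: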

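Binary safety: JSON is a text format designed to be easy to read and write, while []byte usually holds binary data, which may include unprintable characters or control characters that interfere with JSON parsing. Base64 turns the binary data into plain text that can travel safely inside JSON.
Compatibility: JSON is used across many programming environments and network transports. Base64 keeps []byte data consistent across them, because every system can handle text.
Data integrity: Base64 is reversible and restores the original binary data exactly, so binary data sent through JSON can be rebuilt byte for byte on the receiving side.
Avoiding encoding problems: embedding binary data directly in JSON can break the format, for example when the data contains the closing quote of a JSON string or other special characters. Base64 removes that risk.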

In short, json.Marshal base64-encodes []byte so that binary data stays safe, consistent, and intact inside JSON, which lets it be carried reliably across languages and platforms.


json.Marshal base64-encodes []byte by default

base64.go:

package main

import (
	"encoding/json"
	"fmt"
)

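// Go's json.Marshal base64-encodes []byte by default (the base64 logic is in
// the encoding/json source). On Unmarshal, the original bytes come back only
// when the field is received as []byte; received as interface{}, the value is
// the base64 text.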

type test1 struct {
	X string
	Y []byte
}
type test2 struct {
	X string
	Y interface{}
}

func main() {
	a := test1{X: "geek", Y: []byte("geek")}
	fmt.Println("Original a:", a)

	b, err := json.Marshal(a)
	if err != nil {
		panic(err)
	}
	fmt.Println("b after Marshal:", string(b))

	var c test1
	var d test2
	if err := json.Unmarshal(b, &c); err != nil {
		panic(err)
	}
	if err := json.Unmarshal(b, &d); err != nil {
		panic(err)
	}
	fmt.Println("Unmarshal b, receiving the []byte field as []byte:", c)
	fmt.Println("Unmarshal b, receiving the []byte field as interface{}:", d)
}

Output:

Original a: {geek [103 101 101 107]}
b after Marshal: {"X":"geek","Y":"Z2Vlaw=="}
Unmarshal b, receiving the []byte field as []byte: {geek [103 101 101 107]}
Unmarshal b, receiving the []byte field as interface{}: {geek Z2Vlaw==}
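
When the field has to be received as interface{}, the original bytes can still be recovered by decoding the base64 text by hand. The following is a minimal sketch; decodeY is a hypothetical helper name introduced here for illustration, not part of the original program.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// decodeY (hypothetical helper) unmarshals the JSON with Y typed as
// interface{} and then base64-decodes the resulting string manually.
func decodeY(raw []byte) ([]byte, error) {
	var v struct {
		Y interface{}
	}
	if err := json.Unmarshal(raw, &v); err != nil {
		return nil, err
	}
	// A JSON string stored into interface{} arrives as a Go string.
	s, ok := v.Y.(string)
	if !ok {
		return nil, fmt.Errorf("Y is %T, not a string", v.Y)
	}
	return base64.StdEncoding.DecodeString(s)
}

func main() {
	y, err := decodeY([]byte(`{"X":"geek","Y":"Z2Vlaw=="}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(y)) // geek
}
```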

src/encoding/json/encode.go

func encodeByteSlice(e *encodeState, v reflect.Value, _ encOpts) {
	if v.IsNil() {
		e.WriteString("null")
		return
	}
	s := v.Bytes()
	e.WriteByte('"')
	encodedLen := base64.StdEncoding.EncodedLen(len(s))
	if encodedLen <= len(e.scratch) {
		// If the encoded bytes fit in e.scratch, avoid an extra
		// allocation and use the cheaper Encoding.Encode.
		dst := e.scratch[:encodedLen]
		base64.StdEncoding.Encode(dst, s)
		e.Write(dst)
	} else if encodedLen <= 1024 {
		// The encoded bytes are short enough to allocate for, and
		// Encoding.Encode is still cheaper.
		dst := make([]byte, encodedLen)
		base64.StdEncoding.Encode(dst, s)
		e.Write(dst)
	} else {
		// The encoded bytes are too long to cheaply allocate, and
		// Encoding.Encode is no longer noticeably cheaper.
		enc := base64.NewEncoder(base64.StdEncoding, e)
		enc.Write(s)
		enc.Close()
	}
	e.WriteByte('"')
}
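
The IsNil branch at the top of encodeByteSlice has a visible consequence: a nil []byte is written as null, while an empty but non-nil slice is written as an empty string. A small sketch to observe this, where payload and marshalY are names made up for the example:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// payload is an illustrative type with a single []byte field.
type payload struct {
	Y []byte
}

// marshalY returns the JSON text produced for the given Y value.
func marshalY(y []byte) string {
	b, err := json.Marshal(payload{Y: y})
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshalY(nil))            // {"Y":null}
	fmt.Println(marshalY([]byte{}))       // {"Y":""}
	fmt.Println(marshalY([]byte("geek"))) // {"Y":"Z2Vlaw=="}
}
```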

json.Unmarshal performs the matching reverse step in src/encoding/json/decode.go: when the target is a []byte, the JSON string is base64-decoded into it.
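
One way to see the reverse step from the outside is to feed Unmarshal a string that is not valid base64: with a []byte target it reports an error instead of storing the text. This is a minimal sketch; unmarshalY is a hypothetical helper for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unmarshalY (hypothetical helper) decodes raw JSON into a struct whose
// Y field is a []byte, so the JSON string must be valid base64.
func unmarshalY(raw string) ([]byte, error) {
	var v struct {
		Y []byte
	}
	err := json.Unmarshal([]byte(raw), &v)
	return v.Y, err
}

func main() {
	y, err := unmarshalY(`{"Y":"Z2Vlaw=="}`)
	fmt.Println(string(y), err) // geek <nil>

	// Spaces and '!' are outside the base64 alphabet, so decoding fails.
	_, err = unmarshalY(`{"Y":"not base64!"}`)
	fmt.Println(err != nil) // true
}
```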

Why do it this way?

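JSON itself has no binary type. Binary data must be encoded in some way before it can be placed inside a JSON string.

In encoding/json, []byte is always encoded as base64 rather than being written out directly as a UTF-8 string.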

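The reason is that some ASCII characters cannot appear as-is inside a JSON string. Following the count used in the answer quoted below, the 33 ASCII control characters ([0..31] and 127) plus " and \ must be excluded, which leaves 128 - 35 = 93 characters. (Strictly speaking, RFC 8259 only requires escaping U+0000 through U+001F, " and \, but the conclusion is the same.)

Base64 is a way of representing binary data with 64 printable characters: the letters A-Z and a-z and the digits 0-9, 62 characters in total, plus two printable symbols that differ between variants (+ and / in the standard alphabet).

In other words, Base64 can turn any byte sequence into text made up only of A-Z, a-z, 0-9, and two variant-specific printable symbols (plus = for padding). None of these are among the 35 excluded characters, so the result is always a valid JSON string.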

For details, see the following explanation:

The problem with UTF-8 is that it is not the most space efficient encoding. Also, some random binary byte sequences are invalid UTF-8 encoding. So you can't just interpret a random binary byte sequence as some UTF-8 data because it will be invalid UTF-8 encoding. The benefit of this constraint on the UTF-8 encoding is that it makes it robust and possible to locate the start and end of multi-byte chars whatever byte we start looking at.

As a consequence, if encoding a byte value in the range [0..127] would need only one byte in UTF-8 encoding, encoding a byte value in the range [128..255] would require 2 bytes! Worse than that, in JSON, control chars, " and \ are not allowed to appear in a string. So the binary data would require some transformation to be properly encoded.

Let's see. If we assume uniformly distributed random byte values in our binary data then, on average, half of the bytes would be encoded in one byte and the other half in two bytes. The UTF-8 encoded binary data would have 150% of the initial size.

Base64 encoding grows only to 133% of the initial size. So Base64 encoding is more efficient.
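
The 150% versus 133% comparison can be checked with a small sketch. It assumes, as the quoted answer does, that each byte is stored as the code point with the same value, and it ignores the extra JSON escaping that control characters, " and \ would need. The helper names are made up for the example.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"unicode/utf8"
)

// utf8Len returns the size of data if every byte were written as the
// UTF-8 encoding of the code point with the same numeric value.
func utf8Len(data []byte) int {
	n := 0
	for _, b := range data {
		n += utf8.RuneLen(rune(b)) // 1 byte for 0..127, 2 bytes for 128..255
	}
	return n
}

// base64Len returns the length of the standard, padded Base64 encoding
// of n input bytes.
func base64Len(n int) int {
	return base64.StdEncoding.EncodedLen(n)
}

func main() {
	// One of each byte value stands in for uniformly distributed data.
	data := make([]byte, 256)
	for i := range data {
		data[i] = byte(i)
	}
	fmt.Println(utf8Len(data), base64Len(len(data))) // 384 344
}
```

For 256 input bytes this gives 384 bytes (150%) for the UTF-8 approach and 344 bytes (about 134%, including padding) for Base64.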

What about using another Base encoding? In UTF-8, encoding the 128 ASCII values is the most space efficient. In 8 bits you can store 7 bits. So if we cut the binary data in 7-bit chunks to store them in each byte of a UTF-8 encoded string, the encoded data would grow only to 114% of the initial size. Better than Base64. Unfortunately we can't use this easy trick because JSON doesn't allow some ASCII chars. The 33 control characters of ASCII ([0..31] and 127) and the " and \ must be excluded. This leaves us only 128-35 = 93 chars.

So in theory we could define a Base93 encoding which would grow the encoded size to 8/log2(93) = 8*log10(2)/log10(93) = 122%. But a Base93 encoding would not be as convenient as a Base64 encoding. Base64 requires cutting the input byte sequence in 6-bit chunks, for which simple bitwise operations work well. Besides, 133% is not much more than 122%.
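
The percentages quoted for Base64, Base93, and the 7-bit scheme all come from the same formula, 8/log2(n) for an alphabet of n symbols with one symbol per output byte. A short sketch that evaluates it, with expansion as a name chosen for the example:

```go
package main

import (
	"fmt"
	"math"
)

// expansion returns encoded size divided by raw size for an alphabet of
// n symbols, assuming each symbol occupies one output byte.
func expansion(n float64) float64 {
	return 8 / math.Log2(n)
}

func main() {
	for _, n := range []float64{64, 93, 128} {
		fmt.Printf("base%.0f: %.1f%%\n", n, 100*expansion(n))
	}
	// base64: 133.3%
	// base93: 122.3%
	// base128: 114.3%
}
```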

This is why I came independently to the common conclusion that Base64 is indeed the best choice to encode binary data in JSON. My answer presents a justification for it. I agree it isn't very attractive from the performance point of view, but consider also the benefit of using JSON with its human readable string representation easy to manipulate in all programming languages.

If performance is critical then a pure binary encoding should be considered as a replacement of JSON. But with JSON my conclusion is that Base64 is the best.

Image from "Go-Json编码解码" (recommended reading)

Resulting problems and solutions

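Base64-encoding []byte solves the problem that the raw bytes, written as a string, might not be valid JSON. The cost is that the encoded data is consistently about one third larger than the original (see the Base64 entry for details), which increases storage and transmission overhead.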

Whether there is a better approach is discussed in binary-data-in-json-string-something-better-than-base64

Extension: Base64 variants

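Standard Base64 is not suitable for direct use in URLs, because URL encoders turn the / and + characters of standard Base64 into the form %XX, and those % signs need yet another conversion when stored in a database, since ANSI SQL uses % as a wildcard.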

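To solve this, a modified Base64 for URLs can be used. It does not pad the end with = and replaces the + and / of standard Base64 with - and _ respectively. This removes the conversions otherwise needed for URL encoding and database storage, avoids the length increase they cause, and unifies the format of object identifiers across databases, forms, and so on.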
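
Go's standard library ships this URL variant as base64.RawURLEncoding (unpadded, with - and _). The sketch below compares it with the standard encoding on two bytes chosen so that both special characters and the padding show up; encodeBoth is a name made up for the example.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeBoth returns the standard padded encoding and the unpadded
// URL-safe encoding of the same data.
func encodeBoth(data []byte) (std, urlSafe string) {
	std = base64.StdEncoding.EncodeToString(data)
	urlSafe = base64.RawURLEncoding.EncodeToString(data)
	return std, urlSafe
}

func main() {
	std, urlSafe := encodeBoth([]byte{0xfb, 0xff})
	fmt.Println(std)     // +/8=
	fmt.Println(urlSafe) // -_8
}
```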

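Another variant, intended for regular expressions, changes + and / to ! and -, because +, *, and the [ and ] used in IRCu can all have special meaning in regular expressions.

There are further variants that change +/ to _- or ._ (for identifier names in programming languages), to .- (for Nmtoken in XML), or even to _: (for Name in XML).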

This is why code like the following shows up in many projects:

package TLSSigAPI

import (
	"encoding/base64"
	"strings"
)

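// base64urlEncode encodes data with standard Base64 and then replaces the
// three characters that are awkward in URLs. The mapping is specific to this
// project (+ to *, / to -, = to _) and differs from the - and _ alphabet of
// the URL variant described above.
func base64urlEncode(data []byte) string {
	str := base64.StdEncoding.EncodeToString(data)
	str = strings.ReplaceAll(str, "+", "*")
	str = strings.ReplaceAll(str, "/", "-")
	str = strings.ReplaceAll(str, "=", "_")
	return str
}

// base64urlDecode reverses base64urlEncode: it restores the standard
// alphabet and padding, then decodes.
func base64urlDecode(str string) ([]byte, error) {
	str = strings.ReplaceAll(str, "_", "=")
	str = strings.ReplaceAll(str, "-", "/")
	str = strings.ReplaceAll(str, "*", "+")
	return base64.StdEncoding.DecodeString(str)
}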

Original link: https://dashen.tech/2020/03/18/json-Marshal为什么会对-byte类型进行base64编码处理？/

Copyright notice: please credit the source when reposting.

清澄秋爽
