I happened to be rather busy the first few days after this paper came out; over the past couple of days I finally found time to read it. Below are some brief reading notes. Including the appendix, the paper runs to 97 pages, of which the main text is a bit over 40 pages.
In informal publicity (Weibo, WeChat, public accounts), the paper is often billed as "a deep network derived from first principles". It has also drawn some criticism: a first principle is a fundamental principle, like the symmetry principles in physics (conservation of energy, conservation of parity), so using that term to characterize this paper may be a bit grandiose.
ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction
In short: a white-box deep neural network derived from the principle of maximizing the coding-rate reduction.
maximizes the coding rate difference between the whole dataset and the average of all the subsets
That is, maximize the difference between the coding rate of the whole dataset and the average coding rate of all its subsets.
constructed layer-by-layer via forward propagation, instead of learned via back propagation
The network is built layer by layer in a purely forward fashion; there is no backpropagation.
What exactly do we try to learn from and about the data?
What exactly do we want to learn from the data?
What kind of representation do we want to learn, and what kind of network do we need to build to achieve that?
learn a low-dimensional linear discriminative representation of the data:
The goal is to learn a low-dimensional, linear, discriminative representation of the data.
rate reduction
The rate-reduction principle is what the paper uses to achieve this.
where within-class variability and structural information are completely suppressed and ignored
The cross-entropy loss has various problems; one of them is that within-class variability and structural information get suppressed and ignored.
neural collapsing phenomenon. That is, features of each class are mapped to a one-dimensional vector whereas all other information of the class is suppressed.
Representing each class by, in effect, a single dimension at the classification stage leads to the neural-collapse phenomenon described above.
The precise geometric and statistical properties of the learned features are also often obscured,
That is, it becomes hard to say exactly what geometric and statistical properties the learned features have.
Formally, it seeks to maximize the mutual information I(z, y) (Cover and Thomas, 2006) between z and y while minimizing I(x, z) between x and z:
x -> z -> y
Maximize the mutual information between z and y,
while minimizing the mutual information between x and z?
Isn't that a lot like a VAE?
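For reference, this trade-off is the classic information-bottleneck objective, usually written as a Lagrangian (the β weighting below is the standard textbook form, not notation taken from this paper):

max over p(z|x) of  I(z; y) − β · I(x; z),  with β > 0,

i.e., keep the information in z that predicts y while discarding as much information about x as possible, which is where the VAE-like flavour (a relevance term traded off against a compression term) comes from.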
our framework uses the label y as only side information to assist learning discriminative yet diverse (not minimal) representations; these representations optimize a different intrinsic objective based on the principle of rate reduction
So the paper uses labels only as side information to help learn the representation.
Typically, such representations are learned in an end-to-end fashion by imposing certain heuristics on geometric or statistical “compactness” of z
Autoencoders play a similar role.
fail to capture all internal subclass structures or to explicitly discriminate among them for classification or clustering purposes.
But an autoencoder may fail to learn the discrimination among classes (or subclasses) well.
model collapsing in learning generative models for data that have mixed multi-modal structures
So autoencoding can bring model collapse (just as cross entropy can bring neural collapse).
If the above contractive learning seeks to reduce the dimension of the learned representation, contrastive learning (Hadsell et al., 2006; Oord et al., 2018; He et al., 2019) seems to do just the opposite
Then the discussion turns to contrastive learning (the paper seems to pull in all sorts of prior work, aiming for something very general).
contractive learning and contrastive learning,
The paper calls autoencoder-style work contractive learning and self-supervised work contrastive learning (perhaps for the symmetry of the two words).
As we may see from the practice of both contractive learning and contrastive learning, for a good representation of the given data, people have striven to achieve certain tradeoff between the compactness and discriminativeness of the representation. Contractive learning aims to compress the features of the entire ensemble, whereas contrastive learning expands features of any pair of samples. Hence it is not entirely clear why either of these two seemingly opposite heuristics seems to help learn good features. Could it be the case that both mechanisms are needed but each acts on different part of the data? As we will see, the rate reduction principle precisely reconciles the tension between these two seemingly contradictory objectives by explicitly specifying to compress (or contract) similar features in each class whereas to expand (or contrast) the set of all features in multiple classes. In particular:
Contractive learning aims to compress the features of the entire ensemble,
In other words, contractive learning contracts the features, while...
whereas contrastive learning expands features of any pair of samples.
...contrastive learning expands them.
As we will see, the rate reduction principle precisely reconciles the tension between these two seemingly contradictory objectives by explicitly specifying to compress (or contract) similar features in each class whereas to expand (or contrast) the set of all features in multiple classes
The proposed method is claimed to combine the best of both: it contracts features within each class (but doesn't that fall right back into the cross-entropy weakness mentioned earlier?) while expanding features across different classes.
So, how do we design a neural network?
Design a network (by hand),
or search for a neural network (i.e., neural architecture search).
there has been an apparent lack of direct justification of the resulting network architectures from the desired learning objectives, e.g. cross entropy or contrastive learning.
The problem is that network architectures designed around objectives such as cross entropy or contrastive learning lack a direct justification from those objectives. The authors argue that their proposed objective fares better in this respect.
To a large extent, this work will resolve this issue and reveal some fundamental relationships between sparse coding and deep representation learning.
The paper explains, from a fairly high vantage point, the relationship between sparse coding and deep representation learning.
However, in both cases the forward-constructed networks seek a representation of the data that is not directly related to a specific (classification) task. To resolve the limitations of both the ScatteringNet and the PCANet, this work shows how to construct a data-dependent deep convolution network in a forward fashion that leads to a discriminative representation directly beneficial to the classification task.
The paper seems to regard heavy reliance on label information as undesirable, so its ideas are closely connected to self-supervised learning.
To do so, we require our learned representation to have the following properties, called a linear discriminative representation (LDR):
In short, the compressed encoding should be linearly separable: compressed within each class, discriminative between classes, and still diverse.
Here, however, although the intrinsic structures of each class/cluster may be low-dimensional, they are by no means simply linear (or Gaussian) in their original representation x and they need to be made linear through a nonlinear transform z = f(x).
Unlike LDA (or similarly SVM), here we do not directly seek a discriminant (linear) classifier. Instead, we use the nonlinear transform to seek a linear discriminative representation (LDR) for the data such that the subspaces that represent all the classes are maximally incoherent.
So this looks a lot like LDA/SVM. There is still a difference: (kernel) LDA extracts a feature with a linear map, adds a nonlinear kernel, and the result is finally linearly separable, whereas this paper does not extract features with a linear map at all; it uses a nonlinear mapping (e.g., a neural network) to extract the features directly and then classifies them linearly. (Note: the distinction seems slight. For LDA, one can regard the feature obtained after applying the nonlinear kernel as the desired feature, and from that viewpoint the paper's method is no different.)
In this paper, we attempt to provide some answers to the above questions and offer a plausible interpretation of deep neural networks by deriving a class of deep (convolution) networks from first principles. We contend that all key features and structures of modern deep (convolution) neural networks can be naturally derived from optimizing the rate reduction objective, which seeks an optimal (invariant) linear discriminative representation of the data. More specifically, the basic iterative projected gradient ascent scheme for optimizing this objective naturally takes the form of a deep neural network, one layer per iteration.
The paper claims that it explains deep neural networks from first principles.
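To make "one layer per iteration" concrete, here is a minimal sketch, assuming a function grad_delta_R that returns the gradient of the rate-reduction objective with respect to the features (that function name, the step size, and the layer count are my placeholders, not the paper's notation):

```python
import numpy as np

def forward_construct(Z, grad_delta_R, n_layers=20, eta=0.5):
    """Unroll projected gradient ascent on the rate-reduction objective:
    each ascent step plays the role of one 'layer', built purely forward,
    with no backpropagation."""
    for _ in range(n_layers):
        Z = Z + eta * grad_delta_R(Z)                        # one gradient-ascent step = one layer
        Z = Z / np.linalg.norm(Z, axis=0, keepdims=True)     # project each feature back onto the sphere
    return Z
```

As far as I understand, in the paper the gradient has a closed form built from matrix inverses of the whole-data and per-class second-moment matrices, and those closed-form operators are what each layer of the resulting ReduNet is made of.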
is there a simple but principled objective that can measure the goodness of the resulting representations in terms of all these properties? The key to these questions is to find a principled “measure of compactness” for the distribution of a random variable z or from its finite samples Z
How do we measure compactness?
To alleviate this difficulty, another related concept in information theory, more specifically in lossy data compression, that measures the “compactness” of a random distribution is the so-called rate distortion (Cover and Thomas, 2006):
Rate distortion can serve as such a measure of compactness.
Given a random variable z and a prescribed precision ε > 0, the rate distortion R(z, ε) is the minimal number of binary bits needed to encode z such that the expected decoding error is less than ε, i.e., the decoded ẑ satisfies E[‖z − ẑ‖₂] ≤ ε.
Rate distortion: given a precision ε, the minimum number of bits needed to encode z to within that distortion.
Therefore, the compactness of learned features as a whole can be measured in terms of the average coding length per sample (as the sample size m is large), a.k.a. the coding rate subject to the distortion:
In summary, compactness can be measured by the average coding length per sample.
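For a finite set of features Z = [z₁, …, z_m] ∈ R^(d×m), this coding rate has a closed log-det form; the expression below is my reconstruction from the paper (d is the feature dimension, m the number of samples), so it is worth checking against the paper's own equation:

R(Z, ε) = (1/2) · logdet( I + (d / (m ε²)) · Z Zᵀ )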
In general, the features Z of multi-class data may belong to multiple low-dimensional subspaces. To evaluate the rate distortion of such mixed data more accurately, we may partition the data Z into multiple subsets:
If the data contain multiple classes, compute the rate for each class separately and then take a weighted average.
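Concretely, writing Π_j for the diagonal membership matrix of class j (again my rendering of the paper's notation), the rate for the partitioned data becomes

R_c(Z, ε | Π) = Σ_{j=1..k} ( tr(Π_j) / (2m) ) · logdet( I + (d / (tr(Π_j) ε²)) · Z Π_j Zᵀ ),

i.e., the per-class coding rates weighted by each class's share of the samples.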
Shortly put, learned features should follow the basic rule that similarity contracts and dissimilarity contrasts.
The learning principle here is essentially the same as in ordinary metric learning.
To be more precise, a good (linear) discriminative representation Z of X is one such that, given a partition Π of Z, achieves a large difference between the coding rate for the whole and that for all the subsets:
Nice: this formula is very much in the same spirit as the triplet loss we used back in the day. It maximizes the coding rate of the whole (diversity) while minimizing the within-class coding rate. (But then a question arises: shrinking the within-class rate reduces within-class variability, which is exactly the cross-entropy problem the authors complained about, namely that within-class variability and structural information get suppressed and ignored. The paper seems to contradict itself here.)
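To make the objective concrete, here is a minimal PyTorch-style sketch of ∆R built from the formulas above (the function names, the default ε, and the per-class loop are my own illustration, not the paper's reference implementation):

```python
import torch

def coding_rate(Z, eps=0.5):
    """R(Z, eps): log-det estimate of the bits per sample needed to code the columns of Z."""
    d, m = Z.shape
    I = torch.eye(d)
    return 0.5 * torch.logdet(I + (d / (m * eps ** 2)) * Z @ Z.T)

def coding_rate_per_class(Z, labels, eps=0.5):
    """R_c(Z, eps | Pi): per-class coding rates, each weighted by the class's share of samples."""
    d, m = Z.shape
    I = torch.eye(d)
    rate = 0.0
    for c in labels.unique():
        Zc = Z[:, labels == c]                     # columns belonging to class c
        mc = Zc.shape[1]
        rate = rate + (mc / (2 * m)) * torch.logdet(I + (d / (mc * eps ** 2)) * Zc @ Zc.T)
    return rate

def delta_R(Z, labels, eps=0.5):
    """MCR^2 objective: expand the whole ensemble, compress each class."""
    return coding_rate(Z, eps) - coding_rate_per_class(Z, labels, eps)
```

With Z of shape (d, m) and integer labels of shape (m,), delta_R(Z, labels) is the quantity to be maximized, e.g., over the parameters of the feature map; as the next passage notes, it is meant to be used together with normalized features.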
In this work, to simplify the analysis and derivation, we adopt the simplest possible normalization schemes, by simply enforcing each sample on a sphere or the Frobenius norm of each subset being a constant
In this paragraph the authors also draw a connection to batch normalization. In the end they use a simple l2 normalization; such normalization techniques are very common in triplet-loss and contrastive-learning setups.
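A minimal sketch of the two normalization schemes just mentioned (toy tensors; the Frobenius-norm constant below is an illustrative choice of mine). One reason the normalization matters is that the coding rates above grow if the features are simply scaled up.

```python
import torch
import torch.nn.functional as F

Z = torch.randn(128, 1000)               # toy d x m feature matrix; columns are samples

# Scheme 1: put every sample (column) on the unit sphere.
Z_sphere = F.normalize(Z, dim=0)

# Scheme 2: fix the Frobenius norm of each subset to a constant; as a toy,
# rescale the whole batch so that ||Z||_F = sqrt(m) (the constant is my choice).
Z_frob = Z * (Z.shape[1] ** 0.5 / Z.norm(p='fro'))
```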
We refer to this as the principle of maximal coding rate reduction (MCR2 ), an embodiment of Aristotle’s famous quote: “the whole is greater than the sum of the parts.”
The authors invoke "the whole is greater than the sum of the parts", but I prefer the plainer reading: maximize the overall coding rate (diversity) while minimizing the within-class coding rate.
Another question arises here: if MCR² behaves like a triplet loss, is that really what the paper meant earlier by "using labels only as side information to learn features"? That claim seems a bit loose.
MCR2 focuses on learning representations z(θ) rather than fitting labels.
So this claim, too, is somewhat debatable.
The maximal coding rate reduction can be viewed as a generalization to information gain (IG), which aims to maximize the reduction of entropy of a random variable, say z, with respect to an observed attribute, say π: max_π IG(z, π) := H(z) − H(z | π), i.e., the mutual information between z and π (Cover and Thomas, 2006). Maximal information gain has been widely used in areas such as decision trees (Quinlan, 1986).
Indeed, formally MCR² does have a connection to mutual information: a kind of whole-versus-parts information measure.
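Put side by side (my paraphrase, with the coding rate playing the role of entropy):

IG(z, π) = H(z) − H(z | π)    versus    ∆R(Z, ε | Π) = R(Z, ε) − R_c(Z, ε | Π).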
We here reveal nice properties of the optimal representation with the special case of linear subspaces, which have many important use cases in machine learning
Next come the nice properties of MCR² (in the special case of linear subspaces).
Between-class Discriminative:
As long as the ambient space is adequately large (n ≥ Σ_{j=1..k} d_j), the subspaces are all orthogonal to each other
That is, the class subspaces are pairwise orthogonal.
Maximally Diverse Representation:
That is, the representation is maximally diverse.
In other words, the MCR2 principle promotes embedding of data into multiple independent subspaces, with features distributed isotropically in each subspace (except for possibly one dimension). In addition, among all such discriminative representations, it prefers the one with the highest dimensions in the ambient space
The proposed method thus better ensures that the per-class subspaces are close to orthogonal, and it prefers higher-dimensional representations; conventional cross entropy cannot achieve this.
Orthogonal low-rank embedding (OLE) loss: the OLE loss is always negative and achieves the maximal value 0 when the subspaces are orthogonal, regardless of their dimensions. So in contrast to ∆R, this loss serves as a geometric heuristic and does not promote diverse representations. In fact, OLE typically promotes learning one-dim representations per class, whereas MCR2 encourages learning subspaces with maximal dimensions (Figure 7 of Lezama et al. (2018) versus our Figure 17).
The difference from OLE: OLE, like cross entropy, tends to push each class down to one dimension, which is exactly the cross-entropy problem the authors keep citing, namely that within-class variability and structural information get suppressed and ignored. (I don't quite agree with this view. Even with cross entropy, the "features" we usually mean are those of the layer before the classifier; does that layer really tend to collapse to one dimension? Anyone with practical experience would immediately answer "No". In general, when the authors discuss the drawbacks of cross entropy, the text is not always consistent about whether it means the features before the classification layer or the classification layer itself.)
Relation to contractive or contrastive learning.
The proposed method is closely related to contrastive learning (and to what the authors call contractive learning), and especially to layer-wise contrastive learning. My own feeling is that contrastive learning may well outperform the method proposed here; the paper suggests that the two can be combined.