(ICML20) Streaming Submodular Maximization under a $k$-Set System

发表于 2020-12-01 更新于 2021-10-08 分类于学习，算法阅读次数：

本文字数： 8.2k

号称第一个解决 streaming + non-monotone + $k$-system constraint 问题的算法出现了. 究竟是怎么一回事呢? 现在就让小编带大家一起来看看吧.

A Framework

文章提出了一个新的将适用于单调次模函数的流算法转变为适用于非单调次模函数的流算法的框架. 其实这种 converting 框架在 AAAI2018 的一篇文章中已经被提出过一次了, 不过在那篇文章中要求所用的流算法具有 \[ f(S)\geq\alpha f(S\cup T) \] 这样的性质 ($S$ 为算法输出, $T$ 为任一可行解, $\alpha>0$ 为某个常数).

本文中的框架对输入算法要求具有 $(\alpha,\gamma)$-approximation 的性质:

Definition: Consider a data stream algorithm for maximizing a non-negative submodular function $f:2^{\mathcal{N}}\to\mathbb{R}_{\geq0}$ subject to a constraint $(\mathcal{N},\mathcal{I})$. We say that such an algorithm is an $(\alpha,\gamma)$-approximation algorithm, for some $\alpha>1$ and $\gamma\geq0$, if it returns two sets $S\subseteq A\subseteq\mathcal{N}$ such that $S\in\mathcal{I}$, and for all $T\in\mathcal{I}$ we have \[ \mathbb{E}[f(T\cup A)]\leq\alpha\mathbb{E}[f(S)]+\gamma \]

现在我们来看看这个 framework, 记 $\text{StreamingAlg}$ 为输入的流算法, $\text{ConstrainedAlg}$ 为同一问题的 offline 算法.

(Alg. 1) Non-monotone Data Stream Algorithm

Input: a positive integer $r$
Output: a set $S\in\mathcal{I}$
Initialize $r$ independent copies of $\text{StreamingAlg}$: $\text{StreamingAlg}^{(1)},\dots,\text{StreamingAlg}^{(r)}$
while there are more elements in the stream do
Let $D_0$ be a singleton set containing the next element of the stream
for $i=1$ to $r$ do $D_i\leftarrow\text{StreamingAlg}^{(i)}(D_{i-1})$
Let $D_0\leftarrow\emptyset$
for $i=1$ to $r$ do
$(S_i,A_i,D_i)\leftarrow\text{StreamingAlg}^{(i)}_{\mathbf{end}}(D_{i-1})$
$S_i'\leftarrow\text{ConstrainedAlg}(A_i)$
return the set $S$ maximizing $f$ among $\{S_i,S_i'\}_{i=1}^r$

算法的思想很简单, 首先对每个到来的元素 $u$, 使用 $r$ 个独立相同的 $\text{StreamingAlg}$ 程序来依次进行处理: 第一个程序的输入为 $\{u\}$, 之后每个程序以被上一个程序扔出内存的部分 (可能是因为 constraint 不满足, 或者是因为部分元素被置换出去) 为输入进行处理. 当 stream 整个 pass 结束后, 仍按次序输出这 $r$ 个子程序的结果. 由于 $\text{StreamingAlg}$ 是 $(\alpha,\gamma)$-approximation 算法, 我们将集合 $S,A$ 输出, 同时也对集合 $A$ 跑一次 offline 程序 $\text{ConstrainedAlg}$. 在输出 $S,A$ 的同时, 我们也将所有在该子程序内存中但不在 $A$ 中的数据构成的集合 $D$ 输出, 作为下一个子程序的输入 (第一个子程序的输入为空集). 这个过程可以简单描述成如下图所示.


仿佛就是 streaming 版本的 RepeatedGreedy (实际上接下来的 Proposition 1 的推导完全类似 RepeatedGreedy 的近似比推导)

对于这个框架, 我们有如下结论.

Proposition 1

Given an $(\alpha,\gamma)$-approximation data stream algorithm $\text{StreamingAlg}$ for maximizing a non-negative submodular function subject to some constraint and an offline $\beta$-approximation algorithm $\text{ConstrainedAlg}$ for the same problem. There exists a data stream algorithm returning a feasible set $S$ that obeys \[ \mathbb{E}[f(S)]\geq\frac{(r-1)f(O)-r\gamma}{r\alpha+r(r-1)\beta/2} \] Furthermore,

this algorithm is deterministic if $\text{StreamingAlg}$ and $\text{ConstrainedAlg}$ are both deterministic.
the space complexity of this algorithm is upper bounded by $O(rM_{\text{StreamingAlg}}+M_{\text{ConstrainedAlg}})$, where $M_{\text{StreamingAlg}}$ and $M_{\text{ConstrainedAlg}}$ represent the space complexities of thier matching algorithms under the assumption that the input for $\text{StreamingAlg}$ is a subset of the full input and the input for $\text{ConstrainedAlg}$ is the $A$ set produced by $\text{StreamingAlg}$ on some such subset.

Algorithms

下面介绍论文提出的解决 $k$-system 约束的算法. 为了方便分析, 先给出一个预先知道 $M=\max_{u\in\mathcal{N},\{u\}\in\mathcal{I}}f(\{u\})$ 和最大独立集大小 $\rho$ 的版本.

(Alg. 2) Streaming Algorithm for $k$-system

Input: a threshold $\tau\in[M,2M]$, the size $\rho$ of the largest independent set, and parameter $k$ of the constraint
Output: a solution $T\in\mathcal{I}$
Let $\ell\leftarrow\lfloor\log_2(4\rho)\rfloor$ and $h\leftarrow\lceil\log_2(2k+1)\rceil$
for $i=0$ to $\ell$ do Initialize $E_i\leftarrow\emptyset$
for every element $u$ arriving do
Let $m(u)\leftarrow f(u|\cup_{i=0}^{\ell}E_i)$
if $m(u)>0$ then Let $i(u)\leftarrow\lfloor\log_2(\tau/m(u))\rfloor$ else Let $i(u)\leftarrow\infty$
if $0\leq i(u)\leq\ell$ and $E_{i(u)}+u\in\mathcal{I}$ then Update $E_{i(u)}\leftarrow E_{i(u)}+u$
for $j=0$ to $h-1$ do
Let $i\leftarrow j$ and $T_j\leftarrow\emptyset$
while $i\leq\ell$ do
while there is an element $u\in E_i$ such that $T_j+u\in\mathcal{I}$ do Update $T_j\leftarrow T_j+u$
$i\leftarrow i+h$
return the set $T$ maximizing $f$ among $T_0,T_1,\dots,T_{h-1}$

下面我们将详细分析一下这个算法的性质. 首先我们可以看出 Alg. 2 仍采用了类似 sieve 的方法. 那么, 在 streaming 条件下使用 threshold 有什么困难? 对 knapsack, 同时超过 threshold 且超过 budget 的元素可以对 threshold 设置成和 budget 相关的一个量来 bound 住这部分元素的值. 但对 $k$-system, 如果硬上 multi-solution + threshold, 超过 threshold 但因 independent set 约束不能加入集合的这部分元素无法像 knapsack 那样简单地处理. 算法对不同 threshold 构造了 $\ell$ 个集合, 然后最终的结果通过在这 $\ell$ 个集合中按一定规则进行挑选得到. 首先我们需要说明集合 $E$ 的性质是好的, 这由以下引理保证.

Lemma 1: For every set $S\in\mathcal{I}$, \[ f(E|\emptyset)\geq\dfrac{f(S\cup E|\emptyset)-\tau/4}{2k+1} \]

Proof Sketch of Lemma 1: 算法构造了 $\ell+1$ 个区间 $(2^{-(i+1)}\tau,2^{-i}\tau]$ 和与之对应的集合 $E_i$, $i=0,1,\dots,\ell$. 每当 $u$ 到来时的 marginal gain 满足 \[ m(u)=f(u|\text{Pre}(u,E))\in (2^{-(i+1)}\tau,2^{-i}\tau] \] 时 (这里 $E=\cup_{i=0}^{\ell}E_i$), 我们就将 $u$ 加入到 $E_i$ 中 (在满足 $k$-system 条件约束下). 由 greedy 性质可知 $E_i$ 是 $U_i=\{u\in\mathcal{N}:i(u)=i\}$ 的一个 base, 再结合 $U_i$ 之间的不交性启发我们考虑 $S$ 在各 $U_i$ 中的部分: $|S\cap U_i|\leq k|E_i|$, $i=0,1,\dots,\ell$. 而对 $i<0\vee i>\ell$ 的部分, 我们就可以通过调整 $\ell,\tau$ 的值将其 bound 住.

由于我们现在得到的 $E$ 并不一定是 independet set, 所以我们需要从 $E$ 中选出满足约束条件的集合的同时, 也要要求其和 $E$ 能建立起某种联系 (方便近似比的导出). 下面来看这个关键的引理.

Lemma 2: For every integer $0\leq j<h$ we have \[ f(T_j|\emptyset)\geq\frac{1}{4k}\sum_{\substack{0\leq i\leq\ell\\ i\equiv j\pmod{h}}}\sum_{u\in E_i}m(u) \]

Proof of Lemma 2: 由次模性 \[ f(T_j|\emptyset)=\sum_{u\in T_j}f(u|\text{Pre}(u,T_j))\geq\sum_{u\in T_j}f(u|\text{Pre}(u,E))=\sum_{u\in T_j}m(u)\tag{1} \] 我们记 $T_j^i$ 是从 $E_j$ 开始对 $E_i$ 做完 greedy 后得到的集合 ($j\leq i\leq \ell, i\equiv j\pmod{h}$). 设 $T_j^{j-h}=\emptyset$, 下面我们证明一个更强的结论 \[ \sum_{u\in T_j^i}m(u)\geq\frac{1}{4k}\sum_{\substack{j\leq r\leq i\\ r\equiv j\pmod{h}}}\sum_{u\in E_r}m(u)+\frac{|T_j^i|\tau}{2^{i+2}k},\quad \forall i\in[-h,\ell],\ i\equiv j(\text{mod}\ h) \] 我们对 $i$ 作归纳证明. 当 $i<0$ 时, 显然不等式两边都为 0, 下面考虑 $0\leq i\leq\ell$ 的情况. 由于 $T_j^{i}$ 是 $T_j^{i-h}\cup E_i$ 的 base, 故 $|E_i|\leq k|T_j^i|$, 再使用 $i-h$ 的归纳猜想, 我们有 \[ \begin{aligned} \sum_{u\in T_j^i}m(u)=&\sum_{u\in T_j^i\backslash T_j^{i-h}}m(u)+\sum_{u\in T_j^{i-h}}m(u)\geq\Big(\frac{|E_i|}{k}-|T_j^{i-h}|\Big)\frac{\tau}{2^{i+1}}+\sum_{u\in T_j^{i-h}}m(u)\\ \geq&\frac{1}{4k}\sum_{u\in E_i}m(u)+\frac{|E_i|\tau}{2^{i+2}k}-\frac{|T_j^{i-h}|\tau}{2^{i+1}} + \frac{1}{4k}\sum_{\substack{j\leq r\leq i-h\\ r\equiv j\pmod{h}}}\sum_{u\in E_r}m(u)+\frac{|T_j^{i-h}|\tau}{2^{i-h+2}k}\\ =&\frac{1}{4k}\sum_{\substack{j\leq r\leq i\\ r\equiv j\pmod{h}}}\sum_{u\in E_r}m(u)+\frac{|E_i|\tau}{2^{i+2}k}+\frac{|T_j^{i-h}|\tau}{2^{i+2}k}(2^h-2k)\\ \geq&\frac{1}{4k}\sum_{\substack{j\leq r\leq i\\ r\equiv j\pmod{h}}}\sum_{u\in E_r}m(u)+\frac{|E_i|\tau}{2^{i+2}k}+\frac{|T_j^{i-h}|\tau}{2^{i+2}k}\\ \geq&\frac{1}{4k}\sum_{\substack{j\leq r\leq i\\ r\equiv j\pmod{h}}}\sum_{u\in E_r}m(u)+\frac{|T_j^i|\tau}{2^{i+2}k} \end{aligned} \] 故由归纳法知结论成立. 取 $i=j+h\lfloor(\ell-j)/h\rfloor$ 结合式 $(1)$ 即知 Lemma 2 欲证式成立, 证毕.

Proposition 2

Alg. 2 stores $O(\rho(\log\rho+\log k))$ elements at every given time point. For a $k$-system $(\mathcal{N},\mathcal{I})$, the algorithm returns a set $T$ such that \[ f(T)\geq\frac{f(E\cup S)-\tau/4}{4kh(2k+1)} \] for every set $S\in\mathcal{I}$.

Proof of Proposition 2: space complexity 是显然的. 另外可由 \[ \begin{aligned} f(T)\geq&\frac{\sum_{j=0}^{h-1}f(T_j)}{h}\geq f(\emptyset)+\frac{1}{4kh}\sum_{j=0}^{h-1}\sum_{\substack{0\leq i\leq\ell\\ i\equiv j\pmod{h}}}\sum_{u\in E_i}m(u)\\ =&f(\emptyset)+\frac{1}{4kh}\sum_{i=0}^{\ell}\sum_{u\in E_i}m(u)\geq f(\emptyset)+\frac{f(E|\emptyset)}{4kh}\geq\frac{f(E\cup S)-\tau/4}{4kh(2k+1)} \end{aligned} \] 得到.

有了 Proposition 2 后我们就可以来分析对单调和非单调的目标函数的结果了. 当目标函数 $f$ 是单调的时候直接使用 Alg. 2 即可: 取 $U=O$, 再由 $\tau\leq2M\leq 2f(O)$, 得到 \[ f(T)\geq\frac{f(O)-f(O)/2}{4kh(2k+1)}\Rightarrow f(O)\leq8kh(2k+1)f(T) \] 而对于非单调的情形我们需要采用 Alg. 1 框架: 取 $A=E$ 根据定义可知 Alg. 2 是一个 $(4kh(2k+1),\tau/4)$-approximation 的 streaming 算法. 我们使用 Alg. 2 作为 Alg. 1 中的 $\text{StreamingAlg}$, 使用 RepeatedGreedy (此时 $\beta=k+O(\sqrt{k})$) 作为 Alg. 1 中的 $\text{ConstrainedAlg}$, 取 $r=4$ 就有 \[ f(T)\geq\frac{(4-1)f(O)-\tau}{4\alpha+6\beta}\geq\frac{3f(O)-2f(O)}{16kh(2k+1)+6(k+O(\sqrt{k}))}\Rightarrow f(O)\leq O(k^2\log k)f(T) \] 所以综合以上两种情况我们就有如下关于近似比的结论.

Proposition 3

There is a streaming $O(k^2\log k)$-approximation algorithm for the problem of maximizing a non-negative submodular function subject to a $k$-system constraint.

对于更强的 $k$-extendible 约束, 有稍好的结果 (证略).

Proposition 4

There is a streaming $O(k\log k)$-approximation algorithm for the problem of maximizing a non-negative submodular function subject to a $k$-extendible system constraint.

由于无法预先知道 $M,\rho$, 文章采取了逐步减弱条件的方式来达到同样效果.

路人博客的写作方法