$$ X\in\mathbb{R}^{N\times m}: \text{input dataset, $N$ samples of the $m$-dimensional configuration space}\\Y\in\mathbb{R}^{N\times n}: \text{output dataset, $N$ samples of the $n$-dimensional task space} $$
$$ Z\in\{0,1\}^{m} : \text{masking vector for $X$, which defines}\\X_O=X\,\text{diag}(Z) \text{ and } X_{\bar{O}}=X(\mathbb{I}_m-\text{diag}(Z))\\X_O:\text{observable part of $X$, where } \operatorname{rank}(X_O)\le m $$
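The masking above can be sketched numerically. The following is a minimal NumPy illustration, not part of the original formulation: the sizes `N` and `m`, the random data, and the particular mask `Z` are all assumed values chosen for the example, and the names `X_obs` and `X_hid` are hypothetical stand-ins for $X_O$ and $X_{\bar{O}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 5, 4                       # toy sizes, chosen only for illustration
X = rng.normal(size=(N, m))       # N samples of an m-dimensional configuration
Z = np.array([1, 0, 1, 0])        # assumed mask: 1 keeps a dimension, 0 zeroes it

X_obs = X @ np.diag(Z)                  # X_O = X diag(Z)
X_hid = X @ (np.eye(m) - np.diag(Z))    # X_Obar = X (I_m - diag(Z))

# Columns with Z_j = 0 are zeroed, so the rank cannot exceed the number of ones in Z.
rank_obs = np.linalg.matrix_rank(X_obs)
```

The two parts are complementary, so `X_obs + X_hid` recovers `X`, and `rank_obs` is bounded by `Z.sum()`, which in turn is at most `m`.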
The goal is to find the optimal sensor placement, which satisfies
$$ \tau^*=\operatorname*{arg\,max}_\tau~\log p(Y|X,\tau) $$
where $\tau$ is the vector of masking probabilities, one for each of the $m$ dimensions.
The log probability can be further expanded as
$$ \log p(Y|X,\tau)= \log p(Y,Z|X,\tau) - \log p(Z|Y,X,\tau) $$
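where $Z$ is the masking variable for each dataset, sampled according to $\tau$. The left-hand side does not depend on $Z$, so taking the expectation of both sides with respect to $p(Z|Y,X,\tau^{(t)})$, the posterior at the current estimate $\tau^{(t)}$, leaves it unchanged. Then,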
$$ \log p(Y|X,\tau)= \sum_Z p(Z|Y,X,\tau^{(t)})\log p(Y,Z|X,\tau) - \sum_Z p(Z|Y,X,\tau^{(t)})\log p(Z|Y,X,\tau) $$
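Here the second term is the cross-entropy of the posterior at $\tau$ relative to the posterior at $\tau^{(t)}$. By Gibbs' inequality it is never smaller than the entropy of the posterior at $\tau^{(t)}$: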
$$ H(\tau|\tau^{(t)})=- \sum_Z p(Z|Y,X,\tau^{(t)})\log p(Z|Y,X,\tau)\ge H(\tau^{(t)}|\tau^{(t)}) $$
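The decomposition and the entropy bound can be checked numerically on a small example. The sketch below is an illustration under stated assumptions, not the model used in this derivation: it assumes independent masks $Z_j\sim\text{Bernoulli}(\tau_j)$, and it replaces $p(Y|X,Z)$ with a hypothetical table `lik` of random positive numbers, one per mask. All function and variable names are made up for the example.

```python
import itertools
import numpy as np

m = 3
masks = np.array(list(itertools.product([0, 1], repeat=m)))  # all 2^m values of Z

# Hypothetical stand-in for p(Y | X, Z): one positive number per mask.
rng = np.random.default_rng(1)
lik = rng.uniform(0.1, 1.0, size=len(masks))

def log_joint(tau):
    """log p(Y, Z | X, tau) for every mask, assuming Z_j ~ Bernoulli(tau_j)."""
    log_prior = (masks * np.log(tau) + (1 - masks) * np.log(1 - tau)).sum(axis=1)
    return np.log(lik) + log_prior

def log_marginal(tau):
    """log p(Y | X, tau) = log sum_Z p(Y, Z | X, tau)."""
    return np.log(np.exp(log_joint(tau)).sum())

def log_posterior(tau):
    """log p(Z | Y, X, tau) for every mask."""
    return log_joint(tau) - log_marginal(tau)

tau_t = np.array([0.5, 0.5, 0.5])   # current estimate tau^(t), assumed
tau = np.array([0.2, 0.7, 0.9])     # arbitrary candidate tau, assumed

w = np.exp(log_posterior(tau_t))            # weights p(Z | Y, X, tau^(t))
Q = (w * log_joint(tau)).sum()              # first term of the decomposition
H = -(w * log_posterior(tau)).sum()         # second term, H(tau | tau^(t))
H_t = -(w * log_posterior(tau_t)).sum()     # H(tau^(t) | tau^(t))
```

For any choice of `tau` and `tau_t` in the open interval $(0,1)$, `log_marginal(tau)` should equal `Q + H` up to floating-point error, and `H` should be at least `H_t`.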