Bayesian Variable Selection

Most use cases of millipede are covered by one of the four Bayesian variable selection classes documented below.

NormalLikelihoodVariableSelector

class NormalLikelihoodVariableSelector(dataframe, response_column, assumed_columns=[], sigma_scale_factor_column=None, S=5, prior='isotropic', include_intercept=True, tau=0.01, tau_intercept=0.0001, c=100.0, nu0=0.0, lambda0=0.0, precision='double', device='cpu', subset_size=None, explore=5, precompute_XX=False, xi_target=0.2)[source]

Bayesian variable selection for a linear model with a Normal likelihood. The likelihood variance is controlled by an Inverse Gamma prior. This class is appropriate for continuous-valued responses.

Usage:

selector = NormalLikelihoodVariableSelector(dataframe, 'response', ...)
selector.run(T=2000, T_burnin=1000)
print(selector.summary)

The details of the model used in NormalLikelihoodVariableSelector are as follows. The covariates \(X\) and responses \(Y\) are defined as:

\[X \in \mathbb{R}^{N \times P} \qquad \qquad Y \in \mathbb{R}^{N}\]

and are provided by the user. The user should put some thought into whether the covariates \(X\) and responses \(Y\) should be centered and/or normalized. This is generally a good idea for the responses \(Y\), but whether pre-processing for \(X\) is advisable depends on the nature of the dataset.
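For concreteness, a minimal pre-processing sketch (assuming a pandas.DataFrame df whose response column is named 'response'; both names are illustrative):

# Center the responses; optionally standardize the covariates as well.
df['response'] = df['response'] - df['response'].mean()
covariate_columns = [c for c in df.columns if c != 'response']
df[covariate_columns] = (df[covariate_columns] - df[covariate_columns].mean()) / df[covariate_columns].std()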

The inclusion of each covariate is governed by a Bernoulli random variable \(\gamma_p\). In particular \(\gamma_p = 0\) corresponds to exclusion and \(\gamma_p = 1\) corresponds to inclusion. The prior probability of inclusion is governed by \(h\) or alternatively \(S\):

\[h \in [0, 1] \qquad \rm{with} \qquad S \equiv hP\]

Alternatively, if \(h\) is not known a priori we can put a prior on \(h\):

\[h \sim {\rm Beta}(\alpha, \beta) \qquad \rm{with} \qquad \alpha > 0 \;\;\;\; \beta > 0\]
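For example, with \(P = 100\) covariates, setting \(S = 5\) corresponds to \(h = S / P = 0.05\); the same prior expectation can instead be encoded with \(h \sim {\rm Beta}(1, 19)\), since \(\frac{\alpha}{\alpha + \beta} \times P = \frac{1}{20} \times 100 = 5\).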

Putting this together, the model specification for an isotropic prior (with an intercept \(\beta_0\) included) is as follows:

\[ \begin{align}\begin{aligned}&\gamma_p \sim \rm{Bernoulli}(h) \qquad \rm{for} \qquad p=1,2,...,P\\&\sigma^2 \sim \rm{InverseGamma}(\nu_0 / 2, \nu_0 \lambda_0 / 2)\\&\beta_0 \sim \rm{Normal}(0, \sigma^2\tau_{\rm intercept}^{-1})\\&\beta_\gamma \sim \rm{Normal}(0, \sigma^2 \tau^{-1} \mathbb{1}_\gamma)\\&Y_n \sim \rm{Normal}(\beta_0 + X_{n, \gamma} \cdot \beta_\gamma, \sigma^2) \qquad \rm{for} \qquad n=1,2,...,N\end{aligned}\end{align} \]

Note that the dimension of \(\beta_\gamma\) depends on the number of covariates included in a particular model (i.e. on the number of non-zero entries in \(\gamma\)). The hyperparameters \(\nu_0\) and \(\lambda_0\) govern the prior over \(\sigma^2\). The default choice \(\nu_0=\lambda_0=0\) corresponds to an improper prior \(p(\sigma^2) \propto 1/\sigma^2\).

For the gprior, the prior over the coefficients (including the intercept \(\beta_0\) if it is included) is instead specified as follows:

\[\beta_{\gamma} \sim \rm{Normal}(0, c \sigma^2 (X_\gamma^{\rm{T}} X_\gamma)^{-1})\]

where \(c > 0\) is a user-specified hyperparameter.
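As an illustration, a selector using the gprior might be constructed as follows (df is an assumed dataframe; the hyperparameter values are illustrative):

# Sketch: use the gprior with precision hyperparameter c.
selector = NormalLikelihoodVariableSelector(df, 'response', prior='gprior', c=100.0)

# Sketch: alternatively, keep the isotropic prior but place a proper
# InverseGamma(nu0 / 2, nu0 * lambda0 / 2) prior on the likelihood variance
# instead of the improper default.
selector = NormalLikelihoodVariableSelector(df, 'response', nu0=2.0, lambda0=1.0)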

Parameters
  • dataframe (DataFrame) – A pandas.DataFrame that contains covariates and responses. Each row encodes a single datapoint. All columns apart from the response column (and the columns in assumed_columns if there are any) are assumed to be covariates.

  • response_column (str) – The name of the column in dataframe that contains the continuous-valued responses.

  • assumed_columns (list) – A list of the names of the columns in dataframe that correspond to covariates that are always assumed to be part of the model. Defaults to []. Note that these columns do not have PIPs, as they are always included in the model.

  • sigma_scale_factor_column (str) – The name of the (optional) column in dataframe that contains positive scale factors that are used to scale the standard deviation of the Normal likelihood for each datapoint. For example, specifying 2.0 for a particular datapoint results in \(\sigma \rightarrow 2 \times \sigma\) for that datapoint. Defaults to None.

  • S – Controls the expected number of covariates to include in the model a priori. Defaults to 5.0. To specify covariate-level prior inclusion probabilities, provide a pandas.Series whose index corresponds to the covariate columns in dataframe. If a tuple of positive floats (alpha, beta) is provided, the a priori inclusion probability is a latent variable governed by the corresponding Beta prior, so that the sparsity level is inferred from the data. Note that for a given choice of alpha and beta the expected number of covariates to include in the model a priori is \(\frac{\alpha}{\alpha + \beta} \times P\). Also note that the mean number of covariates in the posterior can differ significantly from prior expectations, since the posterior is in effect a compromise between the prior and the observed data. See the sketch after this parameter list for the three ways to specify S.

  • prior (str) – One of the two supported priors for the coefficients: ‘isotropic’ or ‘gprior’. Defaults to ‘isotropic’.

  • include_intercept (bool) – Whether to include an intercept term. If included, the intercept term is included in all models, so the corresponding coefficient does not have a PIP. Defaults to True.

  • tau (float) – Controls the precision of the coefficients in the isotropic prior. Defaults to 0.01.

  • tau_intercept (float) – Controls the precision of the intercept in the isotropic prior. Defaults to 1.0e-4.

  • c (float) – Controls the precision of the coefficients in the gprior. Defaults to 100.0.

  • nu0 (float) – Controls the prior over the variance in the Normal likelihood. Defaults to 0.0.

  • lambda0 (float) – Controls the prior over the variance in the Normal likelihood. Defaults to 0.0.

  • precision (str) – Whether computations should be done with ‘single’ (i.e. 32-bit) or ‘double’ (i.e. 64-bit) floating point precision. Defaults to ‘double’. Note that it may be ill-advised to use single precision.

  • device (str) – Whether computations should be done on CPU (‘cpu’) or GPU (‘gpu’). Defaults to ‘cpu’.

  • explore (float) – This hyperparameter controls how greedy the MCMC algorithm is. Defaults to 5.0. For expert users only.

  • subset_size (int) – If subset_size is not None, subset_size controls the amount of computational resources used during MCMC inference. Otherwise all available computational resources are used. This argument is intended to be used for datasets with a very large number of covariates (e.g. tens of thousands or more). A typical value might be ~5-10% of the total number of covariates; smaller values result in more MCMC iterations per second but may lead to high variance PIP estimates. Defaults to None.

  • precompute_XX (bool) – Whether the covariance matrix \(X^{\rm T} X \in \mathbb{R}^{P \times P}\) should be pre-computed. Defaults to False. Note that setting this to True may result in out-of-memory errors for sufficiently large covariate matrices \(X\). However, if sufficient memory is available, setting precompute_XX to True should be faster.

  • xi_target (float) – This hyperparameter controls how frequently the MCMC algorithm makes \(h\) updates if \(h\) is a latent variable. Defaults to 0.20. For expert users only.
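The following sketch illustrates the three ways of specifying S described above (df is an assumed dataframe with response column 'response'):

import pandas as pd

# (1) Scalar: expect about 5 covariates a priori.
selector = NormalLikelihoodVariableSelector(df, 'response', S=5)

# (2) Covariate-level prior inclusion probabilities.
covariates = [c for c in df.columns if c != 'response']
S = pd.Series(0.05, index=covariates)
selector = NormalLikelihoodVariableSelector(df, 'response', S=S)

# (3) Beta(alpha, beta) hyperprior on h; the sparsity level is inferred.
selector = NormalLikelihoodVariableSelector(df, 'response', S=(1.0, 19.0))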

run(T=2000, T_burnin=1000, verbosity='bar', report_frequency=200, streaming=True, seed=None)[source]

Run MCMC inference for \(T + T_{\rm burn-in}\) iterations. After completion the results of the MCMC run can be accessed in the summary and stats attributes. Additionally, if streaming == False the samples attribute will contain raw samples from the MCMC algorithm.

The summary DataFrame contains five columns. The first column lists the Posterior Inclusion Probability (PIP) for each covariate. The second column lists the posterior mean of the coefficient that corresponds to each covariate. The third column lists the posterior standard deviation for each coefficient. The fourth and fifth columns are analogous to the second and third columns, respectively, with the difference that the fourth and fifth columns report conditional posterior statistics. For example, the fourth column reports the posterior mean of each coefficient conditioned on the corresponding covariate being included in the model.
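For example, to rank covariates by PIP after a run (a sketch; the 'PIP' column label is assumed and may differ between versions):

# Sketch: sort the summary DataFrame by Posterior Inclusion Probability.
top_covariates = selector.summary.sort_values('PIP', ascending=False)
print(top_covariates.head(10))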

Parameters
  • T (int) – Positive integer that controls the number of MCMC samples that are generated (i.e. after burn-in/adaptation). Defaults to 2000.

  • T_burnin (int) – Positive integer that controls the number of MCMC samples that are generated during burn-in/adaptation. Defaults to 1000.

  • verbosity (str) – Controls the verbosity of the run method. If stdout, progress is reported via stdout. If bar, then progress is reported via a progress bar. If None, then nothing is reported. Defaults to bar.

  • report_frequency (int) – Controls the frequency with which progress is reported if the verbosity argument is stdout. Defaults to 200, i.e. every 200 MCMC iterations.

  • streaming (bool) – If True, MCMC samples are not stored in memory and summary statistics are computed online. Otherwise all T MCMC samples are stored in memory. Defaults to True. Only disable streaming if you wish to do something with the samples in the samples attribute (and have sufficient memory available).

  • seed (int) – Random number seed for reproducibility. Defaults to None.
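Putting the pieces together, here is a self-contained sketch on synthetic data (all names are illustrative, and the top-level import path is assumed):

import numpy as np
import pandas as pd
from millipede import NormalLikelihoodVariableSelector

# Synthetic sparse regression: only X0 and X3 affect the response.
rng = np.random.default_rng(0)
N, P = 500, 50
X = rng.normal(size=(N, P))
Y = X[:, 0] - 0.5 * X[:, 3] + 0.1 * rng.normal(size=N)

df = pd.DataFrame(X, columns=[f'X{p}' for p in range(P)])
df['response'] = Y

selector = NormalLikelihoodVariableSelector(df, 'response', S=5)
selector.run(T=2000, T_burnin=1000, seed=0)

# X0 and X3 should receive PIPs close to one.
print(selector.summary.sort_values('PIP', ascending=False).head())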

BernoulliLikelihoodVariableSelector

class BernoulliLikelihoodVariableSelector(dataframe, response_column, assumed_columns=[], S=5, tau=0.01, tau_intercept=0.0001, precision='double', device='cpu', explore=5, xi_target=0.25, subset_size=None)[source]

Bayesian variable selection for a generalized linear model with a Bernoulli likelihood and a logistic link function. This class is appropriate for binary-valued responses.

Usage:

selector = BernoulliLikelihoodVariableSelector(dataframe, 'response', ...)
selector.run(T=2000, T_burnin=1000)
print(selector.summary)

The details of the model used in BernoulliLikelihoodVariableSelector are as follows. The covariates \(X\) and responses \(Y\) are defined as:

\[X \in \mathbb{R}^{N \times P} \qquad \qquad Y \in \{0, 1\}^{N}\]

and are provided by the user. The user should put some thought into whether \(X\) should be centered and/or normalized.

The inclusion of each covariate is governed by a Bernoulli random variable \(\gamma_p\). In particular \(\gamma_p = 0\) corresponds to exclusion and \(\gamma_p = 1\) corresponds to inclusion. The prior probability of inclusion is governed by \(h\) or alternatively \(S\):

\[h \in [0, 1] \qquad \rm{with} \qquad S \equiv hP\]

Alternatively, if \(h\) is not known a priori we can put a prior on \(h\):

\[h \sim {\rm Beta}(\alpha, \beta) \qquad \rm{with} \qquad \alpha > 0 \;\;\;\; \beta > 0\]

The rest of the model is specified as:

\[ \begin{align}\begin{aligned}&\gamma_p \sim \rm{Bernoulli}(h) \qquad \rm{for} \qquad p=1,2,...,P\\&\beta_0 \sim \rm{Normal}(0, \tau_{\rm intercept}^{-1})\\&\beta_\gamma \sim \rm{Normal}(0, \tau^{-1} \mathbb{1}_\gamma)\\&Y_n \sim \rm{Bernoulli}(\sigma(\beta_0 + X_{n, \gamma} \cdot \beta_\gamma)) \qquad \rm{for} \qquad n=1,2,...,N\end{aligned}\end{align} \]

where \(\sigma(\cdot)\) is the logistic or sigmoid function. Note that the dimension of \(\beta_\gamma\) depends on the number of covariates included in a particular model (i.e. on the number of non-zero entries in \(\gamma\)). The intercept \(\beta_0\) is always included in the model.
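To make the likelihood concrete, the following sketch simulates binary responses from this model (the dimensions and coefficients are illustrative):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
N, P = 500, 20
X = rng.normal(size=(N, P))
logits = 0.5 + 1.0 * X[:, 0] - 2.0 * X[:, 1]  # beta_0 + X_{n, gamma} . beta_gamma
Y = rng.binomial(1, sigmoid(logits))          # Y_n ~ Bernoulli(sigma(...))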

Parameters
  • dataframe (DataFrame) – A pandas.DataFrame that contains covariates and responses. Each row encodes a single datapoint. All columns apart from the response column (and the columns in assumed_columns if there are any) are assumed to be covariates.

  • response_column (str) – The name of the column in dataframe that contains the binary-valued responses.

  • assumed_columns (list) – A list of the names of the columns in dataframe that correspond to covariates that are always assumed to be part of the model. Defaults to []. Note that these columns do not have PIPs, as they are always included in the model.

  • S – Controls the expected number of covariates to include in the model a priori. Defaults to 5.0. To specify covariate-level prior inclusion probabilities, provide a pandas.Series whose index corresponds to the covariate columns in dataframe. If a tuple of positive floats (alpha, beta) is provided, the a priori inclusion probability is a latent variable governed by the corresponding Beta prior, so that the sparsity level is inferred from the data. Note that for a given choice of alpha and beta the expected number of covariates to include in the model a priori is \(\frac{\alpha}{\alpha + \beta} \times P\). Also note that the mean number of covariates in the posterior can differ significantly from prior expectations, since the posterior is in effect a compromise between the prior and the observed data.

  • tau (float) – Controls the precision of the coefficients in the isotropic prior. Defaults to 0.01.

  • tau_intercept (float) – Controls the precision of the intercept in the isotropic prior. Defaults to 1.0e-4.

  • precision (str) – Whether computations should be done with ‘single’ (i.e. 32-bit) or ‘double’ (i.e. 64-bit) floating point precision. Defaults to ‘double’. Note that it may be ill-advised to use single precision.

  • device (str) – Whether computations should be done on CPU (‘cpu’) or GPU (‘gpu’). Defaults to ‘cpu’.

  • subset_size (int) – If subset_size is not None, subset_size controls the amount of computational resources used during MCMC inference. Otherwise all available computational resources are used. This argument is intended to be used for datasets with a very large number of covariates (e.g. tens of thousands or more). A typical value might be ~5-10% of the total number of covariates; smaller values result in more MCMC iterations per second but may lead to high variance PIP estimates. Defaults to None.

  • explore (float) – This hyperparameter controls how greedy the MCMC algorithm is. Defaults to 5.0. For expert users only.

  • xi_target (float) – This hyperparameter controls how frequently the MCMC algorithm makes Polya-Gamma updates or \(h\) updates if the latter is a latent variable. Defaults to 0.25. For expert users only.

run(T=2000, T_burnin=1000, verbosity='bar', report_frequency=100, streaming=True, seed=None)[source]

Run MCMC inference for \(T + T_{\rm burn-in}\) iterations. After completion the results of the MCMC run can be accessed in the summary and stats attributes. Additionally, if streaming == False the samples attribute will contain raw samples from the MCMC algorithm.

The summary DataFrame contains five columns. The first column lists the Posterior Inclusion Probability (PIP) for each covariate. The second column lists the posterior mean of the coefficient that corresponds to each covariate. The third column lists the posterior standard deviation for each coefficient. The fourth and fifth columns are analogous to the second and third columns, respectively, with the difference that the fourth and fifth columns report conditional posterior statistics. For example, the fourth column reports the posterior mean of each coefficient conditioned on the corresponding covariate being included in the model.

Parameters
  • T (int) – Positive integer that controls the number of MCMC samples that are generated (i.e. after burn-in/adaptation). Defaults to 2000.

  • T_burnin (int) – Positive integer that controls the number of MCMC samples that are generated during burn-in/adaptation. Defaults to 1000.

  • verbosity (str) – Controls the verbosity of the run method. If stdout, progress is reported via stdout. If bar, then progress is reported via a progress bar. If None, then nothing is reported. Defaults to bar.

  • report_frequency (int) – Controls the frequency with which progress is reported if the verbosity argument is stdout. Defaults to 100, i.e. every 100 MCMC iterations.

  • streaming (bool) – If True, MCMC samples are not stored in memory and summary statistics are computed online. Otherwise all T MCMC samples are stored in memory. Defaults to True. Only disable streaming if you wish to do something with the samples in the samples attribute (and have sufficient memory available).

  • seed (int) – Random number seed for reproducibility. Defaults to None.

BinomialLikelihoodVariableSelector

class BinomialLikelihoodVariableSelector(dataframe, response_column, total_count_column, assumed_columns=[], S=5, tau=0.01, tau_intercept=0.0001, precision='double', device='cpu', subset_size=None, explore=5, xi_target=0.25)[source]

Bayesian variable selection for a generalized linear model with a Binomial likelihood and a logistic link function. This class is appropriate for count-valued responses that are bounded.

Usage:

selector = BinomialLikelihoodVariableSelector(dataframe, 'response', 'total_count', ...)
selector.run(T=2000, T_burnin=1000)
print(selector.summary)

The details of the model used in BinomialLikelihoodVariableSelector are as follows. The covariates \(X\), responses \(Y\), and total counts \(T\) are defined as:

\[X \in \mathbb{R}^{N \times P} \qquad \qquad Y \in \mathbb{Z}_{\ge 0}^{N} \qquad \qquad T \in \mathbb{Z}_{\ge 1}^{N}\]

and are provided by the user. The user should put some thought into whether \(X\) should be centered and/or normalized.

The inclusion of each covariate is governed by a Bernoulli random variable \(\gamma_p\). In particular \(\gamma_p = 0\) corresponds to exclusion and \(\gamma_p = 1\) corresponds to inclusion. The prior probability of inclusion is governed by \(h\) or alternatively \(S\):

\[h \in [0, 1] \qquad \rm{with} \qquad S \equiv hP\]

Alternatively, if \(h\) is not known a priori we can put a prior on \(h\):

\[h \sim {\rm Beta}(\alpha, \beta) \qquad \rm{with} \qquad \alpha > 0 \;\;\;\; \beta > 0\]

The rest of the model is specified as:

\[ \begin{align}\begin{aligned}&\gamma_p \sim \rm{Bernoulli}(h) \qquad \rm{for} \qquad p=1,2,...,P\\&\beta_0 \sim \rm{Normal}(0, \tau_{\rm intercept}^{-1})\\&\beta_\gamma \sim \rm{Normal}(0, \tau^{-1} \mathbb{1}_\gamma)\\&Y_n \sim \rm{Binomial}(T_n, \sigma(\beta_0 + X_{n, \gamma} \cdot \beta_\gamma)) \qquad \rm{for} \qquad n=1,2,...,N\end{aligned}\end{align} \]

where \(\sigma(\cdot)\) is the logistic or sigmoid function and \(T_n\) denotes the total count for the \(n\)-th datapoint. That is, each Binomial likelihood is equivalent to \(T_n\) independent Bernoulli likelihoods. Note that the dimension of \(\beta_\gamma\) depends on the number of covariates included in a particular model (i.e. on the number of non-zero entries in \(\gamma\)). The intercept \(\beta_0\) is always included in the model.
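A minimal sketch of the expected dataframe layout (the column names and values are illustrative):

import pandas as pd

# Each row records 'response' successes out of 'total_count' trials.
df = pd.DataFrame({
    'x1': [0.2, -1.1, 0.7, 0.0],
    'x2': [1.0, 0.3, -0.4, 0.8],
    'response': [3, 0, 7, 2],
    'total_count': [10, 5, 12, 6],
})
selector = BinomialLikelihoodVariableSelector(df, 'response', 'total_count')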

Parameters
  • dataframe (DataFrame) – A pandas.DataFrame that contains covariates and responses. Each row encodes a single datapoint. All columns apart from the response and total count column (and the columns in assumed_columns if there are any) are assumed to be covariates.

  • response_column (str) – The name of the column in dataframe that contains the count-valued responses.

  • total_count_column (str) – The name of the column in dataframe that contains the total count for each datapoint.

  • assumed_columns (list) – A list of the names of the columns in dataframe that correspond to covariates that are always assumed to be part of the model. Defaults to []. Note that these columns do not have PIPs, as they are always included in the model.

  • S – Controls the expected number of covariates to include in the model a priori. Defaults to 5.0. To specify covariate-level prior inclusion probabilities, provide a pandas.Series whose index corresponds to the covariate columns in dataframe. If a tuple of positive floats (alpha, beta) is provided, the a priori inclusion probability is a latent variable governed by the corresponding Beta prior, so that the sparsity level is inferred from the data. Note that for a given choice of alpha and beta the expected number of covariates to include in the model a priori is \(\frac{\alpha}{\alpha + \beta} \times P\). Also note that the mean number of covariates in the posterior can differ significantly from prior expectations, since the posterior is in effect a compromise between the prior and the observed data.

  • tau (float) – Controls the precision of the coefficients in the isotropic prior. Defaults to 0.01.

  • tau_intercept (float) – Controls the precision of the intercept in the isotropic prior. Defaults to 1.0e-4.

  • precision (str) – Whether computations should be done with ‘single’ (i.e. 32-bit) or ‘double’ (i.e. 64-bit) floating point precision. Defaults to ‘double’. Note that it may be ill-advised to use single precision.

  • device (str) – Whether computations should be done on CPU (‘cpu’) or GPU (‘gpu’). Defaults to ‘cpu’.

  • subset_size (int) – If subset_size is not None, subset_size controls the amount of computational resources used during MCMC inference. Otherwise all available computational resources are used. This argument is intended to be used for datasets with a very large number of covariates (e.g. tens of thousands or more). A typical value might be ~5-10% of the total number of covariates; smaller values result in more MCMC iterations per second but may lead to high variance PIP estimates. Defaults to None.

  • explore (float) – This hyperparameter controls how greedy the MCMC algorithm is. Defaults to 5.0. For expert users only.

  • xi_target (float) – This hyperparameter controls how frequently the MCMC algorithm makes Polya-Gamma updates or \(h\) updates if the latter is a latent variable. Defaults to 0.25. For expert users only.

run(T=2000, T_burnin=1000, verbosity='bar', report_frequency=100, streaming=True, seed=None)[source]

Run MCMC inference for \(T + T_{\rm burn-in}\) iterations. After completion the results of the MCMC run can be accessed in the summary and stats attributes. Additionally, if streaming == False the samples attribute will contain raw samples from the MCMC algorithm.

The summary DataFrame contains five columns. The first column lists the Posterior Inclusion Probability (PIP) for each covariate. The second column lists the posterior mean of the coefficient that corresponds to each covariate. The third column lists the posterior standard deviation for each coefficient. The fourth and fifth columns are analogous to the second and third columns, respectively, with the difference that the fourth and fifth columns report conditional posterior statistics. For example, the fourth column reports the posterior mean of each coefficient conditioned on the corresponding covariate being included in the model.

Parameters
  • T (int) – Positive integer that controls the number of MCMC samples that are generated (i.e. after burn-in/adaptation). Defaults to 2000.

  • T_burnin (int) – Positive integer that controls the number of MCMC samples that are generated during burn-in/adaptation. Defaults to 1000.

  • verbosity (str) – Controls the verbosity of the run method. If stdout, progress is reported via stdout. If bar, then progress is reported via a progress bar. If None, then nothing is reported. Defaults to bar.

  • report_frequency (int) – Controls the frequency with which progress is reported if the verbosity argument is stdout. Defaults to 100, i.e. every 100 MCMC iterations.

  • streaming (bool) – If True, MCMC samples are not stored in memory and summary statistics are computed online. Otherwise all T MCMC samples are stored in memory. Defaults to True. Only disable streaming if you wish to do something with the samples in the samples attribute (and have sufficient memory available).

  • seed (int) – Random number seed for reproducibility. Defaults to None.

NegativeBinomialLikelihoodVariableSelector

class NegativeBinomialLikelihoodVariableSelector(dataframe, response_column, psi0_column, assumed_columns=[], S=5, tau=0.01, tau_intercept=0.0001, precision='double', device='cpu', log_nu_rw_scale=0.05, explore=5.0, xi_target=0.25, init_nu=5.0, subset_size=None)[source]

Bayesian variable selection for a generalized linear model with a Negative Binomial likelihood and an exponential link function. This class is appropriate for count-valued responses that are unbounded.

Usage:

selector = NegativeBinomialLikelihoodVariableSelector(dataframe, 'response', 'psi0', ...)
selector.run(T=2000, T_burnin=1000)
print(selector.summary)

The details of the model used in NegativeBinomialLikelihoodVariableSelector are as follows. The covariates \(X\), responses \(Y\), and offsets \(\psi_0\) are defined as:

\[X \in \mathbb{R}^{N \times P} \qquad \qquad Y \in \mathbb{Z}_{\ge 0}^{N} \qquad \qquad \psi_0 \in \mathbb{R}^{N}\]

and are provided by the user. The user should put some thought into whether \(X\) should be centered and/or normalized.

The inclusion of each covariate is governed by a Bernoulli random variable \(\gamma_p\). In particular \(\gamma_p = 0\) corresponds to exclusion and \(\gamma_p = 1\) corresponds to inclusion. The prior probability of inclusion is governed by \(h\) or alternatively \(S\):

\[h \in [0, 1] \qquad \rm{with} \qquad S \equiv hP\]

Alternatively, if \(h\) is not known a priori we can put a prior on \(h\):

\[h \sim {\rm Beta}(\alpha, \beta) \qquad \rm{with} \qquad \alpha > 0 \;\;\;\; \beta > 0\]

The full model specification for the Negative Binomial case is as follows:

\[ \begin{align}\begin{aligned}&\gamma_p \sim \rm{Bernoulli}(h) \qquad \rm{for} \qquad p=1,2,...,P\\&\beta_0 \sim \rm{Normal}(0, \tau_{\rm intercept}^{-1})\\&\beta_\gamma \sim \rm{Normal}(0, \tau^{-1} \mathbb{1}_\gamma)\\&\log \nu \sim \rm{ImproperPrior}(-\infty, \infty)\\&Y_n \sim \rm{NegBinomial}(\rm{mean}=\rm{exp}(\beta_0 + X_{n, \gamma} \cdot \beta_\gamma + \psi_{0, n}), \nu) \qquad \rm{for} \qquad n=1,2,...,N\end{aligned}\end{align} \]

Here \(\nu\) governs the dispersion or variance of the Negative Binomial likelihood. The vector \(\psi_0 \in \mathbb{R}^N\) allows the user to supply a datapoint-specific offset. In many cases setting \(\psi_{0, n} = 0\) is a reasonable choice. We note that we use a parameterization of the Negative Binomial distribution where the variance is given by

\[\rm{variance} = \rm{mean} + \rm{mean}^2 / \nu\]

so that small values of \(\nu\) correspond to large dispersion/variance and \(\nu \to \infty\) recovers the Poisson distribution.
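For example, a mean of 10 with \(\nu = 2\) gives a variance of \(10 + 10^2 / 2 = 60\), while \(\nu = 100\) gives a variance of \(11\), close to the Poisson value of 10.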

As above, the dimension of \(\beta_\gamma\) depends on the number of covariates included in a particular model (i.e. on the number of non-zero entries in \(\gamma\)). The intercept \(\beta_0\) is always included in the model.
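One common choice for the offset (a sketch; the positive 'exposure' column is an assumption) is a log-exposure offset, so that the Negative Binomial mean scales with each datapoint's exposure:

import numpy as np

# psi0 = log(exposure) makes the mean proportional to exposure.
df['psi0'] = np.log(df['exposure'])
# Drop 'exposure' so it is not treated as a covariate.
selector = NegativeBinomialLikelihoodVariableSelector(
    df.drop(columns=['exposure']), 'response', 'psi0'
)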

Parameters
  • dataframe (DataFrame) – A pandas.DataFrame that contains covariates and responses. Each row encodes a single datapoint. All columns apart from the response and psi0 column (and the columns in assumed_columns if there are any) are assumed to be covariates.

  • response_column (str) – The name of the column in dataframe that contains the count-valued responses.

  • psi0_column (str) – The name of the column in dataframe that contains the offset \(\psi_{0, n}\) for each datapoint.

  • assumed_columns (list) – A list of the names of the columns in dataframe that correspond to covariates that are always assumed to be part of the model. Defaults to []. Note that these columns do not have PIPs, as they are always included in the model.

  • S – Controls the expected number of covariates to include in the model a priori. Defaults to 5.0. To specify covariate-level prior inclusion probabilities, provide a pandas.Series whose index corresponds to the covariate columns in dataframe. If a tuple of positive floats (alpha, beta) is provided, the a priori inclusion probability is a latent variable governed by the corresponding Beta prior, so that the sparsity level is inferred from the data. Note that for a given choice of alpha and beta the expected number of covariates to include in the model a priori is \(\frac{\alpha}{\alpha + \beta} \times P\). Also note that the mean number of covariates in the posterior can differ significantly from prior expectations, since the posterior is in effect a compromise between the prior and the observed data.

  • tau (float) – Controls the precision of the coefficients in the isotropic prior. Defaults to 0.01.

  • tau_intercept (float) – Controls the precision of the intercept in the isotropic prior. Defaults to 1.0e-4.

  • precision (str) – Whether computations should be done with ‘single’ (i.e. 32-bit) or ‘double’ (i.e. 64-bit) floating point precision. Defaults to ‘double’. Note that it may be ill-advised to use single precision.

  • device (str) – Whether computations should be done on CPU (‘cpu’) or GPU (‘gpu’). Defaults to ‘cpu’.

  • log_nu_rw_scale (float) – This hyperparameter controls the scale of the random walk proposal distribution used for \(\log \nu\) updates. Defaults to 0.05. For expert users only.

  • subset_size (int) – If subset_size is not None, subset_size controls the amount of computational resources used during MCMC inference. Otherwise all available computational resources are used. This argument is intended to be used for datasets with a very large number of covariates (e.g. tens of thousands or more). A typical value might be ~5-10% of the total number of covariates; smaller values result in more MCMC iterations per second but may lead to high variance PIP estimates. Defaults to None.

  • explore (float) – This hyperparameter controls how greedy the MCMC algorithm is. Defaults to 5.0. For expert users only.

  • xi_target (float) – This hyperparameter controls how frequently the MCMC algorithm makes Polya-Gamma updates or \(h\) updates if the latter is a latent variable. Defaults to 0.25. For expert users only.

  • init_nu (float) – This hyperparameter controls the initial value of the dispersion parameter \(\nu\). Defaults to 5.0. For expert users only.

run(T=2000, T_burnin=1000, verbosity='bar', report_frequency=100, streaming=True, seed=None)[source]

Run MCMC inference for \(T + T_{\rm burn-in}\) iterations. After completion the results of the MCMC run can be accessed in the summary and stats attributes. Additionally, if streaming == False the samples attribute will contain raw samples from the MCMC algorithm.

The summary DataFrame contains five columns. The first column lists the Posterior Inclusion Probability (PIP) for each covariate. The second column lists the posterior mean of the coefficient that corresponds to each covariate. The third column lists the posterior standard deviation for each coefficient. The fourth and fifth columns are analogous to the second and third columns, respectively, with the difference that the fourth and fifth columns report conditional posterior statistics. For example, the fourth column reports the posterior mean of each coefficient conditioned on the corresponding covariate being included in the model.

Parameters
  • T (int) – Positive integer that controls the number of MCMC samples that are generated (i.e. after burn-in/adaptation). Defaults to 2000.

  • T_burnin (int) – Positive integer that controls the number of MCMC samples that are generated during burn-in/adaptation. Defaults to 1000.

  • verbosity (str) – Controls the verbosity of the run method. If stdout, progress is reported via stdout. If bar, then progress is reported via a progress bar. If None, then nothing is reported. Defaults to bar.

  • report_frequency (int) – Controls the frequency with which progress is reported if the verbosity argument is stdout. Defaults to 100, i.e. every 100 MCMC iterations.

  • streaming (bool) – If True, MCMC samples are not stored in memory and summary statistics are computed online. Otherwise all T MCMC samples are stored in memory. Defaults to True. Only disable streaming if you wish to do something with the samples in the samples attribute (and have sufficient memory available).

  • seed (int) – Random number seed for reproducibility. Defaults to None.