Bayesian Ensemble Estimation from SAS (BEES)
The BEES module is available from the "Bayesian Ensemble Estimator" selection in the Analyze section of the main menu.
The BEES module helps users fit their SAS data with an ensemble model built from a library of candidate states. The populations of the states in each combination of candidate profiles are reweighted using a Bayesian Monte Carlo (BMC) approach, and the module selects the model that best balances goodness-of-fit against the number of contributing states.
Theoretical profiles must be calculated at the same q-values as the supplied experimental scattering data.
The fitting routine can fit experimental SAS profiles and auxiliary data measurements simultaneously, but auxiliary data are not required.
Fitting can be done according to Shannon Sampling (χ²_free) or the standard χ² metric.
- While either metric may yield similar results, use of the χ²_free metric is highly recommended, because Shannon Sampling reduces the degree of correlation between q-values. Shannon Sampling also prevents the dense SAXS data from outweighing auxiliary data (if supplied) that contain significantly fewer data points.
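The general idea behind a Shannon-sampled χ² can be sketched as follows. This is an illustrative sketch only, not the BEES implementation: it assumes Shannon channels of width π/D_max, one randomly drawn data point per channel, and the median χ² over many draws. The function names, the per-point normalization, and the default number of rounds are all assumptions made for the example.

```python
import math
import random
import statistics


def shannon_bins(q, d_max):
    """Group data indices into Shannon channels of width pi / d_max."""
    width = math.pi / d_max
    bins = {}
    for index, q_value in enumerate(q):
        bins.setdefault(int(q_value // width), []).append(index)
    # Return the occupied channels in order of increasing q.
    return [bins[key] for key in sorted(bins)]


def chi2_free(q, i_exp, err, i_model, d_max, n_rounds=1001, seed=0):
    """Median chi^2 over random one-point-per-channel selections.

    Assumption: each round is normalized by the number of selected
    points; the real module may use a different normalization.
    """
    rng = random.Random(seed)
    bins = shannon_bins(q, d_max)
    samples = []
    for _ in range(n_rounds):
        picks = [rng.choice(channel) for channel in bins]
        chi2 = sum(((i_exp[j] - i_model[j]) / err[j]) ** 2 for j in picks)
        samples.append(chi2 / len(picks))
    return statistics.median(samples)
```

Because only one point per channel enters each round, a densely sampled curve contributes roughly q_max·D_max/π points rather than every measured point, which is why the dense SAS data no longer swamp a short auxiliary data set.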
Theoretical profiles are supplied as a .zip archive, which must also contain a file called "filelist.txt" that lists the name of each profile file within the archive.
- If no second dimension is supplied, "filelist.txt" is just a list of scattering profiles. If a second dimension is supplied, then "filelist.txt" is a 2-column file, where the first column of each row contains the name of the theoretical scattering profile, and the second column contains the name of its associated second-dimension theoretical profile. An example of filelist.txt can be found here
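An archive with this layout could be assembled with a short script such as the one below. It is a sketch under stated assumptions: the helper name `build_profile_archive` is hypothetical, the columns of "filelist.txt" are assumed to be separated by whitespace, and all files are assumed to sit at the top level of the archive.

```python
import zipfile
from pathlib import Path


def build_profile_archive(zip_path, sas_files, aux_files=None):
    """Write theoretical profiles and a matching filelist.txt to a zip.

    sas_files: paths to the theoretical scattering profiles.
    aux_files: optional paths to the second-dimension profiles, given
               in the same order as sas_files.
    Returns the rows written to filelist.txt.
    """
    if aux_files is not None and len(aux_files) != len(sas_files):
        raise ValueError("each scattering profile needs one auxiliary profile")

    rows = []
    for k, sas in enumerate(sas_files):
        row = [Path(sas).name]
        if aux_files is not None:
            row.append(Path(aux_files[k]).name)
        # Assumption: columns are separated by a single space.
        rows.append(" ".join(row))

    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in list(sas_files) + list(aux_files or []):
            # Store every file at the top level of the archive.
            archive.write(path, arcname=Path(path).name)
        archive.writestr("filelist.txt", "\n".join(rows) + "\n")
    return rows
```

Calling the helper with only `sas_files` produces the single-column form of "filelist.txt"; passing `aux_files` as well produces the 2-column form.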
This module is not designed to handle reweighting every single profile calculated from the conformations of a simulation run (e.g., a Complex Monte Carlo trajectory). Some form of clustering or data reduction needs to be done before using this module.
Progress tracking in BEES is handled in a highly discretized manner, where the tracker is updated only between steps in ensemble size. As a result, the SASSIE-web progress tracker may appear to hang, followed by large increases in completion percentage. This is the expected behavior.
Standard BEES Usage
- run name: Prefix for folder and file generation
- interpolated data file: The experimental scattering data. Must be a 3-column file: column 1 = q, column 2 = I(q), column 3 = err(q).
- D_max: Dmax value of the molecule, as estimated from the experimental scattering intensity. Used to conduct Shannon sampling of the scattering curve.
- theoretical profiles zip file: Zip archive containing the theoretical scattering profiles, as well as the associated "filelist.txt". Also must include theoretical profiles for auxiliary data, if supplied.
- number of Monte Carlo move attempts: Number of steps that each BMC parameter search conducts.
- number of Monte Carlo equilibration steps: Number of steps to remove from the beginning of each BMC parameter search (larger values reduce the effect of the initial, randomly selected distribution of population).
- number of Monte Carlo replicas: Number of independent BMC parameter searches to conduct for each combination of candidate states. Results of all replicas are combined to determine the estimated populations.
- number of processors: Number of parallel processor tasks to spawn for the BEES routine. Computational time may be reduced by using an increased number of processors, as long as (number of candidate combinations) > (number of processors). Limited to values between 1 and 16.
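Before uploading, it can be useful to confirm that the experimental file has three columns and that each theoretical profile shares its q-grid. The check below is a minimal sketch: the function names are hypothetical, and it assumes whitespace-delimited text, "#" comment lines, and theoretical profiles stored as 2-column (q, I(q)) files.

```python
import math


def read_columns(text, n_cols):
    """Parse whitespace-delimited numeric rows, skipping blanks and '#' lines."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != n_cols:
            raise ValueError(f"expected {n_cols} columns, found {len(fields)}")
        rows.append([float(value) for value in fields])
    return rows


def q_grids_match(exp_rows, theory_rows, rel_tol=1e-6):
    """Return True if both data sets have the same q-values, row by row."""
    if len(exp_rows) != len(theory_rows):
        return False
    return all(
        math.isclose(exp[0], theory[0], rel_tol=rel_tol)
        for exp, theory in zip(exp_rows, theory_rows)
    )
```

For example, `q_grids_match(read_columns(exp_text, 3), read_columns(theory_text, 2))` returns `False` if a theoretical profile was calculated on a different q-grid and therefore needs to be interpolated first.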
In order to recreate the inputs used above, users should download the following files:
Once the best ensemble of candidate states is determined, BEES will print information regarding the model (shown above), namely: the contributing scattering states and their populations, observed error in population, and individual ability to fit experimental data, as well as the goodness-of-fit metrics for the ensemble model and its associated AIC or BIC value. Also at the conclusion of the run, BEES will generate interactive plots below the progress bar that contain the best ensemble's fit to the experimental SAS profile, along with the model residual errors in the fit.
As noted in the output, results are also stored in the form of several files that are saved to the ./[runname]/bayesian_ensemble_estimator/ subdirectory:
- *_all_models.dat: A comma-separated values (CSV) file that contains model information for every sampled ensemble model. Each row represents a different model combination, and candidates with "0.0" population and "0.0" standard deviation were not considered in that particular model. Example
- *_ensemble_sas.dat: A 2-column file that contains the ensemble model scattering spectrum. Column 1 = q, column 2 = I(q). Example
- *_final_model.dat: A text file containing the same information as is presented in the GUI output text box. Example
- *_plots.html: Contains several interactive plots for examining not only the best identified model, but also for comparing the performance of the alternative models. Example
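The saved ensemble spectrum can be compared with the experimental data outside of the GUI. The sketch below assumes that the rows of *_ensemble_sas.dat (q, I(q)) and of the experimental file (q, I(q), err(q)) have already been read into lists on the same q-grid. The function name is hypothetical, and the error-normalized residual and per-point χ² used here are common conventions that may differ from what BEES reports.

```python
def fit_residuals(exp_rows, model_rows):
    """Error-normalized residuals and mean squared residual.

    exp_rows:   sequence of (q, I_exp, err) rows.
    model_rows: sequence of (q, I_model) rows on the same q-grid.
    Returns (residuals, chi2), where residuals is a list of
    (q, (I_exp - I_model) / err) pairs.
    """
    if len(exp_rows) != len(model_rows):
        raise ValueError("experimental and model data differ in length")

    residuals = []
    for (q, i_exp, err), (_, i_model) in zip(exp_rows, model_rows):
        residuals.append((q, (i_exp - i_model) / err))

    # Assumption: chi^2 is normalized by the number of data points.
    chi2 = sum(r * r for _, r in residuals) / len(residuals)
    return residuals, chi2
```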
The separate panels of the *_plots.html contain different levels of information regarding the BEES-observed models:
- Best Model: contains scattering spectrum and residual errors for the best IC-identified model.
- Top 10 Models: Allows users to compare the spectra and residuals of the 10 highest performing models. A small table provides the performance metric, ensemble size, a list of ensemble members, and the model χ² for each of these 10 models. Below the table, users can select the model of interest from the dropdown menu (model0 = best model, model9 = 10th highest performing model) to be plotted. Also included in these figures is the option to plot the profiles of each candidate state, and these profiles can be turned on and off by clicking the associated item in the plot legend.
- Compare All Models: At the top of this tab, a histogram of relative model performances is shown, and the histograms are grouped according to model ensemble size. Visualization of each group can be toggled by clicking on the associated legend entry. Since the relative performance measures model strength against the best IC-identified model (larger values = more comparable model performance), this plot can be used to quickly determine the strength of overfitting for different ensemble sizes. This histogram will also show the strength of evidence for the best IC-identified model over any other model. If the histogram is sparse (few bars with values > 0.7), then the IC-identified model can be considered robust. However, if there exist models with relative performances of ~0.8-0.9, then closer inspection of these alternative models is strongly encouraged.
- Also included within this tab is a table of information for all the sampled models. This table is further linked to the population bar plot below, which displays the model population of states for the table-selected model. Both model populations (thick grey bars) and their BMC-derived standard deviations (narrow black bars) are shown.
Using BEES with Auxiliary Data
Advanced BEES Inputs
- use auxiliary data: If selected, allows the user to upload an auxiliary data file for simultaneous fitting.
- auxiliary data file: If supplied, BEES will search for the population of states that simultaneously best satisfies both the SAS data and the auxiliary data. Must be a 2-column file: column 1 = auxiliary measurement, column 2 = error in auxiliary measurement.
- model every combination: If this option is selected, BEES will not stop searching for model parameters once over-fitting is observed. Instead, it will search for model parameters for every possible combination of candidate states, from single-state models up to the full ensemble in which all candidates are re-weighted simultaneously.
- model only full ensemble: If selected, the only model that will be generated is the solution containing all of the possible candidate states. This option overrides the "model every combination" selection.
- use bic: If selected, calculate over-fitting using the Bayesian Information Criterion (BIC) and not the Akaike Information Criterion (AIC).
- Advanced Monte Carlo Inputs: If selected, allows for the altering of BMC integrator parameters.
- walk width: Defines the width (σ) of the Gaussian distribution that is used to randomly determine the amount of population to add to each basis member per BMC iteration: $p(\delta w) \propto e^{-(\delta w)^2 / 2\sigma^2}$
- increment one population at a time: If selected, the BMC search routine will adjust the population of only one candidate member per iteration (prior to normalization). Default behavior is to randomly increment all populations and then normalize.
- minimum population cutoff: If any population within a model falls below the minimum population cutoff (after normalization), then that population is set to 0.00 and the weights are re-normalized.
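The three Monte Carlo options above can be pictured together as a single move proposal. The following is a rough sketch, not the BEES source: the function name `propose_populations` is hypothetical, clipping negative populations to zero is an assumption, and the rule for accepting or rejecting the proposed populations is deliberately left out because it is not described here.

```python
import random


def propose_populations(weights, walk_width, rng,
                        one_at_a_time=False, min_population=0.0):
    """Propose new normalized populations from the current ones."""
    new = list(weights)
    if one_at_a_time:
        # Perturb a single, randomly chosen candidate.
        k = rng.randrange(len(new))
        new[k] += rng.gauss(0.0, walk_width)
    else:
        # Default: perturb every candidate with its own Gaussian step.
        new = [w + rng.gauss(0.0, walk_width) for w in new]

    # Assumption: populations cannot be negative, so clip at zero.
    new = [max(w, 0.0) for w in new]
    total = sum(new)
    if total == 0.0:
        return list(weights)  # degenerate proposal, keep current state
    new = [w / total for w in new]

    # Zero out populations below the cutoff, then re-normalize.
    new = [w if w >= min_population else 0.0 for w in new]
    total = sum(new)
    if total == 0.0:
        return list(weights)
    return [w / total for w in new]
```

With `rng = random.Random(42)`, repeated calls such as `propose_populations([0.5, 0.3, 0.2], 0.05, rng)` always return populations that sum to 1; a larger `walk_width` produces larger jumps between successive proposals.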
Users that wish to recreate the inputs shown above should download the following files:
Outputs with Auxiliary Spectra
Most Advanced Options in BEES do not alter the outputs of the program; the exception is the case in which auxiliary data are supplied. If an auxiliary data set is supplied, then SASSIE-web will create an additional set of plots beneath the scattering profile figures that present the ensemble fit to the auxiliary spectrum and the associated error in the fit. As noted previously, results are also stored in the form of several files that are saved to the ./[runname]/bayesian_ensemble_estimator/ subdirectory:
"*" denotes the additional file that is only produced if an auxiliary data file is supplied. This file is a 1-column text file that contains the values for the ensemble model at each auxiliary measurement. An example of this output can be found here.
Furthermore, the first and second tabs of the HTML page will also contain spectra for the auxiliary data alongside the SAS profiles. An example can be found here.
- Fully sampling every sub-basis can be incredibly expensive! In fact, the total number of combinations (and therefore the computational time) scales exponentially with the number of candidate states (see below), so some form of data reduction is strongly encouraged before using the BEES module.
- BEES Benchmark vs Candidate Pool Size with varying processor counts:
- BEES Benchmark vs Candidate Pool Size on six processors, assuming overfitting is observed at 4 states:
- Cumulative BEES Benchmarks for the 14-member candidate pool states, run on six processors:
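The exponential scaling follows from simple counting: a pool of N candidates contains C(N, k) distinct ensembles of size k, so sampling every size gives 2^N - 1 models. The short sketch below tabulates this; the function name and the `max_size` argument (the largest ensemble size that is actually sampled before the search stops) are illustrative and not BEES parameters.

```python
from math import comb


def n_models(n_candidates, max_size=None):
    """Count candidate combinations of size 1 up to max_size.

    With max_size=None every ensemble size is included, which gives
    2**n_candidates - 1 models in total.
    """
    top = n_candidates if max_size is None else min(max_size, n_candidates)
    return sum(comb(n_candidates, k) for k in range(1, top + 1))
```

For a 14-member pool, `n_models(14)` gives 16383 models, whereas stopping after 4-state ensembles, `n_models(14, max_size=4)`, gives 1470. Extra processors only help while the number of combinations at a given ensemble size exceeds the processor count.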
References and Citations
- BEES: Bayesian Ensemble Estimation from SAS, a SASSIE-web Module. S. Bowerman, J.E. Curtis, J. Clayton, E.H. Brookes, and J. Wereszczynski. In Preparation.
- Determining Atomistic SAXS Models of Tri-Ubiquitin Chains from Bayesian Analysis of Accelerated Molecular Dynamics Simulations. S. Bowerman, A.S.J.B. Rana, A. Rice, G.H. Pham, E.R. Strieter, and J. Wereszczynski. J. Chem. Theory Comput., 13, 2418 (2017).