SASSIE-web is an online simulation and analysis tool for the modelling of biomolecular structures using small angle scattering data. It is based on the original, standalone, program SASSIE (Curtis et al. 2012) and retains all of its core features. This guide is designed to get you familiar with the basic features of the program as quickly as possible. The features covered will be:
Important Note: Before you start this tutorial you will need to register for an account and log in to SASSIE-web. Instructions on how to register can be found here.
The only data needed to work through this tutorial are:
You should download these files to your computer now. A good idea is to create a directory called SASSIE-web-tutorial and save all downloads from the tutorial in this location.
You will also need to familiarize yourself with a molecular viewing program that can display PDB and DCD files. We recommend VMD and provide a quick tutorial here.
Once logged in to SASSIE-web the page should look something like the figure below.
During this tutorial, when instructed to select something from the Main Menu but no menu is visible on the left hand side of the page you must click on the Main Menu toggle to reveal it.
To choose a project name where your work will be stored and to access SASSIE modules that are still in Beta status, click on the Head icon. The User Configuration menu will appear.
Choose an existing project or create a new project name. In addition, select the Beta checkbox. (You can also choose to select the Retired checkbox to access retired SASSIE modules. In addition, you can update the background and foreground screen colors by selecting the Update colors checkbox.) Click the 'Submit' button to connect to the project and to access the Beta modules. The project name should now appear at the top of the web page. Here, we have chosen project name 'test1'.
In this tutorial we will model the conformation of the HIV-1 Gag protein following the study of Datta et al. 2007. HIV Gag is a long polyprotein which is cleaved to form the functional proteins required by the virus. The viral proteins which form the domains in Gag are the matrix (MA), capsid (CA), p2 and nucleocapsid (NC). A structure stitched together from evidence based on crystal structures and models of the individual domains is shown below. A similar structure will be used as the starting structure for our simulations.
Datta et al. 2007 identified 5 flexible regions (labelled I-V), which are highlighted in the figure above. The table alongside the picture shows the residues which make up each region (we are going to need this information to select regions to be varied when we run Monte Carlo simulations).
Data interpolation is necessary to create a new data file that is spaced on a uniform grid from the experimental data file. More information on the module is available in the Data Interpolation documentation. Here we interpolate the SANS data from a HIV-1 Gag protein at a concentration of 1 mg/ml in a 100% D2O buffer.
Select the 'Tools' button at the top of the Main Menu. Click on Main Menu Toggle if necessary.
This will reveal a list of buttons for each tool running across the top of the page (just below the top bar with the Session Management and Main Menu Toggle icons).
Select the 'Data Interpolation' button from this menu.
You should now see a page like the one below. This page is used to enter all of the information needed to do the data interpolation.
The figure shows the values for each field as required for data interpolation.
Edit the values on your screen to match the screenshot. An explanation of each field and how to edit it can be found below.
run name: user defined name of folder that will contain the results.
experimental data file: Name of input file with experimental data with at least three columns: q, I(q), and error in I(q). Here we use the sans_data.sub file.
output file name: Name of file that will contain the interpolated data. Here we choose the name sans_data.dat.
I(0): Experimentally determined value of scattering intensity at q = 0. Here we used the value of 0.04 that was derived from a Guinier fit to the data.
I(0) error: Experimentally determined value of the error of the scattering intensity at q = 0. Here we use the value of 0.001 that was obtained from the Guinier fit to the data.
new delta q: Desired spacing of q-values (1/Angstrom). This should be chosen so that your first interpolated data point falls within the q-range of the experimental data. For this tutorial, the value has been set to 0.02 since the first data point occurs at a value of ~0.013.
number of new q-values: Integer number of desired q-values. For this tutorial, the value has been set to 16 so that the maximum q value is 0.3.
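The relationship between the two grid parameters above can be checked with a few lines of Python. This is a minimal sketch, not SASSIE's own code: the function names are hypothetical, the grid is assumed to start at q = 0 (where the supplied I(0) is used), and plain linear interpolation is swapped in for whatever scheme the Data Interpolation module actually applies. The experimental q-values are assumed to be positive and sorted in ascending order.

```python
from bisect import bisect_right


def uniform_q_grid(delta_q, n_points):
    """Return n_points q-values starting at q = 0 and spaced by delta_q."""
    # Rounding only removes floating-point noise such as 0.30000000000000004.
    return [round(i * delta_q, 10) for i in range(n_points)]


def interpolate_linear(q_exp, i_exp, q_new, i_zero):
    """Linearly interpolate I(q) onto q_new, using (0, I(0)) as the first point.

    Assumes q_exp is sorted in ascending order and all values are > 0.
    """
    # Prepend the Guinier-derived I(0) so the grid point at q = 0 is defined.
    qs = [0.0] + list(q_exp)
    vals = [i_zero] + list(i_exp)
    result = []
    for q in q_new:
        if q > qs[-1]:
            raise ValueError(f"q = {q} lies beyond the measured q-range")
        # Index of the segment [qs[k], qs[k + 1]] that contains q.
        k = min(bisect_right(qs, q) - 1, len(qs) - 2)
        frac = (q - qs[k]) / (qs[k + 1] - qs[k])
        result.append(vals[k] + frac * (vals[k + 1] - vals[k]))
    return result


# Tutorial values: new delta q = 0.02 and 16 new q-values.
grid = uniform_q_grid(0.02, 16)
print(grid[0], grid[1], grid[-1])
```

With these inputs the grid runs from 0.0 to 0.3 in steps of 0.02, and the first non-zero grid point (0.02) lies above the first measured point (~0.013), as the field description requires.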
Once you have understood the input fields and made sure that your values agree with the figure click on the 'Submit' button to start the interpolation.
As the run continues the progress bar beneath the submit button should update. Once complete the output should look similar to the figure below.
The output will show a plot of the original and interpolated data, the name of the input file and the name of the interpolated data file as well as the directory in which it is located.
Note that roll-over help will indicate options to resize, zoom and reset the view of the plot.
test1/run_0/data_interpolation
PDB Scan is used to assess whether an input PDB is ready for simulation and, where possible, to provide files enabling CHARMM force field parameterization. Missing atoms and residues, and those not covered as standard by the CHARMM 27 force field, are reported. PDB files do not need to have header information. At this time, only PDB files of proteins are supported. More information on the module is available in the PDB Scan documentation. Here we examine the PDB file that describes the starting HIV-1 Gag protein structure.
Select the 'Build' button from the Main Menu of SASSIE-web and then click on the PDB Scan button.
You should now see a page like the one below. This page is used to enter all of the information needed to check the PDB file.
pdb file input: The PDB file that we want to examine. Here we use the gag_start.pdb file.
Once you have entered the file name, click on the 'Submit' button to start the file scan.
As the run continues the progress bar beneath the submit button should update. Once complete the output should look similar to the figure below.
The text output region provides a brief summary of the PDB Scan report.
test1/run_0/pdbscan
A JSmol visualization of the protein is produced and is shown below the text output region. Holding down the left mouse button and moving the cursor over the picture allows you to rotate the view; the scroll wheel zooms in and out. Right clicking on the image allows you to access all of the JSmol options.
The full PDB scan report can be found below the image of the structure.
This PDB file is ready for simulation so we can proceed to create an ensemble of structures for comparison to the SANS data.
The primary way to vary structures in SASSIE is via Monte Carlo simulations which rotate the backbone dihedral angles of flexible regions within proteins to sample a wide range of structures. More information can be found in the Monomer Monte Carlo documentation. Here we set up and run such a simulation before visualizing the range of structures produced.
Select the 'Simulate' button from the Main Menu of SASSIE-web and then click on the 'Monomer Monte Carlo' button.
You should now see a page like the one below. This page is used to enter all of the information needed to run a Monte Carlo simulation.
The figure shows the values for each field as required for our simulation. An explanation of some of the fields can be found below.
reference pdb: The starting structure for the simulation. Here we use the gag_start.pdb file.
number of trial attempts: Number of times the simulation will try to vary the structure (some structures will be discarded by the Monte Carlo algorithm). For this tutorial set the value to 1000. For real studies tens of thousands of structures are needed.
return to previous structure: Number of discarded structures in a row that are considered before returning to a randomly-selected structure that was previously accepted
number of flexible regions to vary: single number
residue range for each flexible region: comma-separated list of the range of residues to vary for each flexible region
maximum angle(s): comma-separated list of the maximum angle sampled in a single Monte Carlo step for each flexible region
structure alignment region: a single range of residues for structural alignment of all the flexible segments. This makes it easy to make visual comparisons of each frame in the output trajectory.
overlap basis: Select either heavy atoms, all, backbone or enter atom name. The atom name option will spawn further inputs:
overlap basis: Enter an atom name to check for overlap.
overlap cutoff (angstroms): Overlap basis atoms closer than this distance define an overlap condition.
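The overlap condition described by the last two fields amounts to a pairwise distance test on the chosen basis atoms. The following is a minimal sketch of that test, assuming coordinates are given as (x, y, z) tuples in angstroms; the function name and example coordinates are hypothetical, and the real module may use a faster or more selective search.

```python
import math
from itertools import combinations


def has_overlap(coords, cutoff):
    """Return True if any two basis atoms are closer than cutoff (angstroms)."""
    for a, b in combinations(coords, 2):
        if math.dist(a, b) < cutoff:
            return True
    return False


# Three hypothetical basis atoms spaced 3.8 angstroms apart along a chain.
basis_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
print(has_overlap(basis_atoms, cutoff=3.0))                      # False
print(has_overlap(basis_atoms + [(0.5, 0.0, 0.0)], cutoff=3.0))  # True
```

A trial structure for which such a test returns True would be one of the structures discarded by the Monte Carlo algorithm.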
Once you have understood the input fields and made sure that your values agree with the figure click on the 'Submit' button to start the simulation.
As the run continues the progress bar beneath the submit button should update. A graph beneath this will show the variation of the radius of gyration over the steps of the Monte Carlo simulation. Once complete the output should look similar to the figure below.
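For reference, the radius of gyration plotted during the run is the root-mean-square distance of the atoms from their common center. The sketch below shows the standard definition; the function name is hypothetical, and equal weights are assumed when no masses are supplied, which may differ from the weighting SASSIE uses.

```python
import math


def radius_of_gyration(coords, masses=None):
    """Return Rg for a list of (x, y, z) coordinates, optionally mass-weighted."""
    if masses is None:
        masses = [1.0] * len(coords)  # assumption: equal weights
    total = sum(masses)
    # Weighted center of the structure.
    center = [
        sum(m * c[i] for m, c in zip(masses, coords)) / total for i in range(3)
    ]
    # Weighted mean squared distance from the center.
    mean_sq = sum(
        m * sum((c[i] - center[i]) ** 2 for i in range(3))
        for m, c in zip(masses, coords)
    ) / total
    return math.sqrt(mean_sq)


# Two points 2 angstroms apart have Rg = 1 angstrom.
print(radius_of_gyration([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))  # 1.0
```

Extended conformations give larger values than compact ones, which is why the Rg trace is a quick indicator of how widely the simulation is sampling.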
test1/run_0/monomer_monte_carlo
You should now download the output trajectory using the file browser.
Note: the 'Configurations and statistics saved in' line in the output gives a relative path under the project directory.
Check the box next to 'run_0' to select it for download.
Beneath the file tree is an option labelled 'Compression type'. Select an option suitable for your operating system from the list (for Windows select 'zipped' for Linux 'bzip2 tarball').
Click the 'Download' button.
A progress bar will appear monitoring the upload of your files to the server. Once complete a link will appear beneath the download button.
Once the download is complete uncompress the file in a location of your choice.
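If you prefer to script this step, Python's standard library can unpack both the 'zipped' and 'bzip2 tarball' formats. This is a small sketch only: the helper name is hypothetical, and the archive file name in the usage comment is a placeholder for whatever name your browser saved.

```python
import shutil
from pathlib import Path


def unpack_download(archive_path, target_dir="SASSIE-web-tutorial"):
    """Unpack a .zip or .tar.bz2 archive and list the files it contained."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # The archive format is inferred from the file extension.
    shutil.unpack_archive(str(archive_path), str(target))
    return sorted(
        p.relative_to(target).as_posix() for p in target.rglob("*") if p.is_file()
    )


# Example (placeholder file name):
# for name in unpack_download("my_download.zip"):
#     print(name)
```

The returned list makes it easy to confirm that the run_0 directory and its trajectory file arrived intact.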
You should now load the PDB gag_start.pdb (you will find this in the run_0 directory you just downloaded) and DCD (run_0/monomer_monte_carlo/run_0.dcd) into VMD to observe the variation produced even in our very short Monte Carlo simulation. Remember that the DCD file contains coordinates alone, so you need to load the PDB first so that the visualization software knows about the atoms they represent and how they are connected. You can also download and visualize the DCD file run_0.dcd generated from this particular Quick Start run for comparison.
Next we calculate a theoretical scattering curve for each of the trial structures we have generated. The SASSIE workflow operates by calculating the scattering intensities at evenly spaced Q values and matching these against interpolated experimental values.
The file sans_data.dat contains our previously interpolated experimental data. In order to create the correct data points in our theoretical curves we need three pieces of information:
A number of scattering calculators are available in SASSIE. Here we use SasCalc. More information can be found in the SasCalc documentation. The starting structure must be a complete structure without missing residues or atoms (including hydrogen atoms) in order to obtain accurate scattering profiles. Atom and residue naming must be compatible with those defined in the CHARMM force field.
Select 'Calculate' from the Main Menu.
Click the 'SasCalc' button.
Now you need to enter the information to run the scattering calculator. SasCalc can be used to calculate the scattering for SAXS and SANS and/or for several SANS contrasts at the same time.
The module is first run using the "converged number of golden vectors" option on just one structure. Choose this option from the SasCalc method menu in the Advanced Input section of the page.
Other than the values listed above you can keep the default values for this tutorial (see figure below).
The single structure that we used to start the simulation is used as both the reference pdb and the trajectory file filename (PDB in this case) so it is already uploaded to the SASSIE-web server. Thus, you can either upload it again from your local computer or locate it on the server and read it from there.
To read the file from the server:
Once you have understood the input fields and made sure that your values agree with the figure click on the 'Submit' button to start the calculation.
A scattering curve will be calculated for the starting structure (the progress bar should reach 100% and a message stating the run finished appear in the window beneath when the job has completed). Note that the files are written to a sub-directory of sascalc/ that is named according to the D2O percentage in the solvent. This is useful when calculating the scattering curves for more than one contrast.
test1/run_0/sascalc/neutron_D2Op_100
The run_0_00001.log file from the initial SAS Curve calculation indicates that 35 golden vectors were required for convergence to the desired tolerance (0.01 in this case).
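To give a feel for what "converging the number of golden vectors" means, the sketch below averages a scattering intensity over orientations placed on a golden-ratio spiral and increases their number until the result stops changing. It is an illustration of the general idea only, not SasCalc's implementation: the lattice formula, the point-scatterer intensity, the starting count, the step size and the convergence test are all assumptions, and every function name is hypothetical.

```python
import cmath
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2


def golden_vectors(n):
    """Return n quasi-uniform unit vectors from a golden-ratio spiral."""
    vectors = []
    for i in range(n):
        z = 1 - (2 * i + 1) / n                  # evenly spaced heights
        r = math.sqrt(max(0.0, 1 - z * z))       # radius of the circle at height z
        phi = 2 * math.pi * i / GOLDEN_RATIO     # golden-ratio azimuthal step
        vectors.append((r * math.cos(phi), r * math.sin(phi), z))
    return vectors


def intensity(q, coords, weights, n_vectors):
    """Average |sum_j b_j exp(i q.r_j)|^2 over n_vectors orientations."""
    total = 0.0
    for ux, uy, uz in golden_vectors(n_vectors):
        amplitude = sum(
            b * cmath.exp(1j * q * (ux * x + uy * y + uz * z))
            for b, (x, y, z) in zip(weights, coords)
        )
        total += abs(amplitude) ** 2
    return total / n_vectors


def converged_vector_count(q, coords, weights, tolerance=0.01,
                           start=11, step=2, limit=149):
    """Increase the vector count until the relative change is within tolerance."""
    previous = intensity(q, coords, weights, start)
    for n in range(start + step, limit + 1, step):
        current = intensity(q, coords, weights, n)
        if abs(current - previous) <= tolerance * abs(previous):
            return n
        previous = current
    raise RuntimeError("no convergence within the vector limit")
```

Once a converged count is known for one representative structure, reusing that fixed count for every frame avoids repeating the convergence search, which is the motivation for the two-stage procedure used in this tutorial.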
We now use this information to calculate the scattering curves for all of the generated structures using the "fixed number of golden vectors" option from the SasCalc method menu as shown below.
The reference pdb is the starting structure and the trajectory file filename (DCD) comes from the result of the Monte Carlo simulation, so both are already on the SASSIE-web server. Thus, you can either upload them again from your local computer or locate them on the server and read them from there.
When all input fields are complete:
A scattering curve will be calculated for all of the structures generated by the Monte Carlo simulation (the progress bar should reach 100% and a message stating the run finished appear in the window beneath when the job has completed). Note that the files written during the initial SAS curve calculation will be overwritten since we chose the same run name (run_0) in both cases. If you wish to save the files from the initial calculation, use a different run name.
test1/run_0/sascalc/neutron_D2Op_100
Now we compare our theoretical curves to the experimental data to see which of our structures are plausible models of the real protein using Chi-Square Filter. More information can be found in the Chi-Square Filter documentation.
Select 'Analyze' from the Main Menu.
Click the 'Chi-Square Filter' button.
We now need to select the path containing the theoretical scattering curves and the file containing the experimental data. In addition we need to input the value of I(0) to enable comparison of the two curves (see the picture below).
To set the path to the scattering curves generated in the previous step:
interpolated data file
I(0)
We eventually may want to create 'weight files' that record which frames meet criteria that make them successful models of our data. This means those with low chi square values. However, we don't know the range of chi square values we have at this stage. So, we set the 'number of weight files' to 0 at this time.
Sas type
number of weight files
Note: There are list boxes that allow the selection of the format of the input theoretical curves and the metric used to compare the curves. Here we wish to use the defaults of 'SasCalc' and 'reduced chi-square'.
Click 'Submit'.
Once the run is complete you should see outputs similar to those below.
In the text output you will see that the minimum chi square (X2) value is given.
The top plot shows the variation of chi squared (y-axis) with the radius of gyration (x-axis). Chi squared is a measure of the quality of fit of the theoretical curve to the experimental one. It is a dimensionless quantity, and the lower the value the better the fit.
The bottom plot shows a direct comparison of the best, worst and average theoretical curves with experiment (goal).
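As a rough guide to what the reduced chi-square measures, the sketch below compares one theoretical curve with the experimental data point by point. It assumes both curves share the same q grid, that the theoretical curve is first scaled so its q = 0 value equals the experimental I(0), and that the sum is normalized by N - 1; the exact scaling and normalization used by Chi-Square Filter may differ, and the function names are hypothetical.

```python
def scale_to_i_zero(i_calc, i_zero):
    """Scale a theoretical curve so that its first point (q = 0) equals i_zero."""
    factor = i_zero / i_calc[0]
    return [factor * value for value in i_calc]


def reduced_chi_square(i_exp, err_exp, i_calc):
    """Return sum(((I_exp - I_calc) / sigma)^2) / (N - 1)."""
    n = len(i_exp)
    if not (n == len(err_exp) == len(i_calc)) or n < 2:
        raise ValueError("curves must have the same length and at least 2 points")
    total = sum(
        ((e - c) / s) ** 2 for e, s, c in zip(i_exp, err_exp, i_calc)
    )
    return total / (n - 1)


# Made-up numbers purely to show the call sequence.
i_exp = [0.040, 0.030, 0.020]
err_exp = [0.001, 0.001, 0.001]
i_calc = scale_to_i_zero([1.00, 0.70, 0.50], i_zero=0.040)
print(reduced_chi_square(i_exp, err_exp, i_calc))
```

Values near 1 indicate agreement within the experimental errors, while much larger values indicate that the structure is a poor model of the data.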
test1/run_0/chi_square_filter/neutron_D2Op_100
/spectra
Now that we know the range of chi square values that we have, we can compare the theoretical curves to the data a second time and create a weight file that flags all structures with chi square values below a certain number. Now, we set the 'number of weight files' to 1.
run name:
number of weight files
Weight files contain information on which frames in our simulation meet specific criteria provided in the expression box.
enter expression
x2 < 3.0
This selects all frames with a chi square less than 3.0. Adjust this value if necessary to suit the results from your simulation.
weight file name
low Rg cutoff
Enter a value if you wish to also restrict the Rg range to be above this value. The default value is 0 so that all Rg values are acceptable.
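The effect of the expression and the low Rg cutoff can be pictured as a per-frame flag. The sketch below is illustrative only: the function names are hypothetical, the 1/0 flag convention and 1-based frame numbering are assumptions, and it does not reproduce the actual weight file format written by Chi-Square Filter.

```python
def frame_weights(chi_squares, rgs, x2_cutoff=3.0, low_rg_cutoff=0.0):
    """Return 1 for frames with x2 < x2_cutoff and Rg above low_rg_cutoff, else 0."""
    return [
        1 if (x2 < x2_cutoff and rg > low_rg_cutoff) else 0
        for x2, rg in zip(chi_squares, rgs)
    ]


def selected_frames(weights):
    """Return the 1-based frame numbers whose weight is 1."""
    return [index for index, w in enumerate(weights, start=1) if w == 1]


# Made-up chi-square and Rg values for five frames.
x2_values = [1.2, 5.4, 2.9, 3.0, 0.8]
rg_values = [33.0, 41.5, 35.2, 36.0, 31.7]
weights = frame_weights(x2_values, rg_values, x2_cutoff=3.0)
print(weights)                   # [1, 0, 1, 0, 1]
print(selected_frames(weights))  # [1, 3, 5]
```

Note that the comparison is strict, so a frame with a chi square of exactly 3.0 is not selected by 'x2 < 3.0'.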
Click 'Submit'.
Once the run is complete you should see outputs like those below.
These results are essentially the same as those from our first comparison above except that we have now generated a weight file.
test1/run_1/chi_square_filter/neutron_D2Op_100
/spectra
Now we can filter out the best fit structures and visualize them using the Extract Utilities. More information can be found in the Extract Utilities documentation.
Select 'Tools' from the Main Menu.
Click the 'Extract Utilities' button.
In this module we can select structures from the DCD we created from the Monte Carlo simulation using the weight files generated in the Chi-Square Filter module.
In this case, we chose to select the weight file from the server.
Check the tick box labelled 'extract trajectory' (this will reveal the options shown in the screenshot)
Select the usual 'reference pdb' and the DCD output from the Monte Carlo simulation
Input 'best_gag.dcd' as the 'output filename'
Choose 'weight file' from the 'select option' listbox.
Click 'Submit'
When the process is finished your output should look like the one below.
In the event that none of your frames pass the filter then you can download these preprepared files and try the filtering process:
test1/run_1/extract_utilities
Download the 'best_gag.dcd' file as you did the unfiltered DCD and then visualize the structure again in VMD (you will need to load a suitable PDB first as before). You should see that the filtered structures are all noticeably more compact than the starting structure and the majority of those in the unfiltered DCD.
Another way to visualize the structures sampled in the 'run_0.dcd' and 'best_gag.dcd' files for comparison is to use the Density Plot module. The density plot below shows the envelope sampled by all of the accepted structures as well as that sampled by only the best fit structures. The black region is the envelope represented by residues 283-353, which is approximately the alignment region that we defined in the Monomer Monte Carlo module. The blue and yellow regions represent the envelope sampled by residues 1-282 for all accepted structures (blue) and for the best fit structures (yellow). The red and green regions represent the envelope sampled by residues 354-431 for all accepted structures (red) and the best fit structures (green). This representation makes it easier to see that the envelope represented by the best fit structures is significantly smaller than that represented by all of the accepted structures. Remember that our sample has only 692 accepted structures and 27 best fit structures. For a real study, several thousand accepted structures would be needed to determine if this observation holds true.
More information on how to use the Density Plot module can be found in the Density Plot documentation.
Now we can minimize the best fit gag structures using the Energy Minimization module. More information can be found in the Energy Minimization documentation.
Select 'Simulate' from the Main Menu.
Click the 'Energy Minimization' button.
run name: user defined name of folder that will contain the results.
reference pdb: PDB file with naming information for coordinates that will be extracted. We are using the gag_start.pdb file.
input filename (dcd or pdb): file containing starting conformation(s) for simulation. The number of atoms must match that in the reference pdb. For files with multiple frames each one will be simulated. We are using the run_0.dcd file.
PSF file name (dcd or pdb): PSF file with topology information, must match the reference pdb and input dcd/pdb. Here we use the gag_start.psf file.
output file name (dcd): filename for the output DCD containing the final frames resulting from simulation
number of processors: number of processors used to run the simulation (1-4). We are using 2 processors.
keep run output files: choice of whether to retain NAMD log files and other output from each simulation
run type: select which of the four combinations of minimization and molecular dynamics to run
number of minimization steps: number of steps of the conjugate gradient minimization to apply to each structure. We are using 1000 steps for each structure in the DCD file.
Click 'Submit'
NOTE: This will take a while (~ 30 min).
When the process is finished your output should look like the one below.
test1/run_1/energy_minimization
Download the 'min_best_gag.dcd' and 'min_best_gag.dcd.pdb' files and then visualize the structures in VMD. You can load 'best_gag.dcd' (along with a suitable PDB file) again as well for comparison. Go through the two trajectories frame by frame. You should notice very little difference in the structures.
If desired, you can compare the minimized structures to the SANS data by calculating their theoretical SANS curves and comparing them to the SANS data again to see how different the best fit chi square values are after the minimization.
First, calculate the theoretical SANS curves using SasCalc.
run name:
Set the run name to run_2. Once the inputs have been entered, click 'Submit'.
Once the run is complete, you should see outputs like those below.
test1/run_2/sascalc/neutron_D2Op_100
Then, compare the theoretical SANS curves to the SANS data using Chi-Square Filter.
run name:
number of weight files
Set the 'number of weight files' to 0 since we are already dealing with the best fit structures.
Click 'Submit'.
Once the run is complete, you should see outputs like those below.
test1/run_2/chi_square_filter
/spectra
Conformation of the HIV-1 Gag Protein in Solution. S. A. K. Datta, J. E. Curtis, W. Ratcliff, P. K. Clark, R. M. Crist, J. Lebowitz, S. Krueger, A. Rein, J. Mol. Biol. 365, 812-824 (2007).
SASSIE: A program to study intrinsically disordered biological molecules and macromolecular ensembles using experimental scattering restraints. J. E. Curtis, S. Raghunandan, H. Nanda, S. Krueger, Comp. Phys. Comm. 183, 382-389 (2012).