SMRTSCAPE (SMRT Subread Coverage & Assembly Parameter Estimator) is tool in development as part of our PacBio sequencing projects for predicting and/or assessing the quantity and quality of useable data required/produced for HGAP3 de novo whole genome assembly. The current documentation is below. Some tutorials will be developed in the future - in the meantime, please get in touch if you want to use it and anything isn’t clear.
The main functions of
Estimate Genome Coverage and required numbers of SMRT cells given predicted read outputs. NOTE: Default settings for SMRT cell output are not reliable and you should speak to your sequencing provider for up-to-date figures in their hands.
Summarise the amount of sequence data obtained from one or more SMRT cells, including unique coverage (one read per ZMW).
Calculate predicted coverage from subread data for difference length and quality cutoffs.
Predict HGAP3 length and quality settings to achieve a given coverage and accuracy.
SMRTSCAPE will be available in the next SLiMSuite download. The
coverage=T mode can be run from the EdwardsLab server at: http://www.slimsuite.unsw.edu.au/servers/pacbio.php. (This is currently running a slightly old implementation but should be updated shortly.)
Version: 1.10.1 Last Edit: 26/05/16
### ~ General Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ### genomesize=X : Genome size (bp)  ### ~ Genome Coverage Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ### coverage=T/F : Whether to generate coverage report [False] avread=X : Average read length (bp)  smrtreads=X : Average assemble output of a SMRT cell  smrtunits=X : Units for smrtreads=X (reads/Gb/Mb) [reads] errperbase=X : Error-rate per base [0.14] maxcov=X : Maximmum X coverage to calculate  bysmrt=T/F : Whether to output estimated coverage by SMRT cell rather than X coverage [False] xnlist=LIST : Additional columns giving % sites with coverage >= Xn [1+`minanchorx`->`targetxcov`+`minanchorx`] ### ~ SubRead Summary Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ### summarise=T/F : Generate subread summary statistics including ZMW summary data [False] seqin=FILE : Subread sequence file for analysis [None] batch=FILELIST : Batch input of multiple subread fasta files (wildcards allowed) if seqin=None  targetcov=X : Target percentage coverage for final genome [99.999] targeterr=X : Target errors per base for preassembly [1/genome size] calculate=T/F : Calculate X coverage and target X coverage for given seed, anchor + RQ combinations [False] minanchorx=X : Minimum X coverage for anchor subreads  minreadlen=X : Absolute minimum read length for calculations (use minlen=X to affect summary also)  rq=X,Y : Minimum (X) and maximum (Y) values for read quality cutoffs [0.8,0.9] rqstep=X : Size of RQ jumps for calculation (min 0.001) [0.01] ### ~ Preassembly Fragmentation analysis Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ### preassembly=FILE: Preassembly fasta file to assess/correct over-fragmentation (use seqin=FILE for subreads) [None] ### ~ Assembly Parameter Options ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ### parameters=T/F : Whether to output predicted "best" set of parameters [False] targetxcov=X : Target 100% X Coverage for pre-assembly  xmargin=X : "Safety margin" inflation of X coverage  mapefficiency=X : [Adv.] Efficiency of mapping anchor subreads onto seed reads for correction [1.0] xsteplen=X : [Adv.] Size (bp) of increasing coverage steps for calculating required depths of coverage [1e6] parseparam=FILES: Parse parameter settings from 1+ assembly runs  paramlist=LIST : List of parameters to retain for parseparam output (file or comma separated, blank=all)  predict=T/F : Whether to add XCoverage prediction and efficiency estimation from parameters and subreads [False] ### ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ###
SMRTSCAPE has a number of functions for PacBio sequencing projects, concerned with predicting and/or assessing the quantity and quality of useable data produced:
- Summarise subreads (
summarise=T). This function summarises subread data from a given
seqin=FILEfasta file, or a set of subread fasta files given with
- Calculate length cutoffs (
calculate=T). Calculates length cutoffs for different XCoverage combinations from subread summary data.
- Parameters (
parameters=T). This function attempts to generate predicted optimum assembly settings from the
- Genome Coverage (
coverage=T). This method tries to predict genome coverage and accuracy for different depths of PacBio sequencing.
- Preassembly fragmentation analysis (
- Parse Parameters (
parseparam=FILES). This method parses assembly parameters from the SmrtPipeSettings Summary file.
- Prediction (
predict=T). This method compares predicted coverage from seed reads with estimated coverage from preassembly data.
These are explored in more detail below.
The setup phase primarily sorts out the
SMRTSCAPE objects. With the exception of pure
ParseParam runs, it will also set up:
- Genome Size. Unless given with
genomesize=X, this will be asked for as it is required X coverage/depth and accuracy calculations.
- Target Error Rate. Unless given with
targeterr=X, a target error rate of 1/
GenomeSizewill be set.
- Basefile. Finally, if no
basefile=Xsetting is given, the basename for output files will be set to the basename of
seqin=FILEif given (stripping
.subreadsif present), else
Summarise subreads (summarise=T)
This function summarises subread data from a given
seqin=FILE fasta file, or a set of subread fasta files given with
batch=FILELIST. This uses the
summarise=T function of
rje_seqlist to first give text output to the log file of some summary statistics:
- Total number of sequences.
- Total length of sequences.
- Min. length of sequences.
- Max. length of sequences.
- Mean length of sequences.
- Median length of sequences.
- N50 length of sequences.
Next, Pacbio-specific subread header information will be parsed to generate three summary tables (described in more detail below):
*.zmw.tdtsummary of all subreads.
*.unique.tdtsummary of the longest subread per ZMW.
*.rq.tdtsummary of subread data for different Read Quality values.
These are generated by parsing both the sequence data and the sequence name data:
>m150625_001530_42272_c100792502550000001823157609091582_s1_p0/9/0_3967 RQ=0.784 >m150625_001530_42272_c100792502550000001823157609091582_s1_p0/11/0_20195 RQ=0.868
These are split into component parts:
m150625_001530_42272_c100792502550000001823157609091582_s1_p0= SMRT cell (
/9= ZMW (
/0_3967= raw read positions used for subread (5’ adaptor removed) (
RQ=0.784= read quality (mean per base accuracy) (
The numbers and “best” subreads for each ZMW are summarised in
#RN log file entries.
Subread output (*.zmw.tdt)
The first output file is the
*.zmw.tdt, which stores the subread information:
RQ= as above.
RN= subread number within ZMW (1 to N).
Len= length of subread.
Seq= sequence file position.
The unique key for this file is:
Unique subread output (*.unique.tdt)
This table is made from
*.zmw.tdt by reducing each ZMW’s output to a single read. Reads are kept in preference of (a) longest, (b) read quality (if tied), and finally (c) earliest read (if tied for both). It hase the same fields as
*.zmw.tdt with keys determined by:
Read Quality output (*.rq.tdt)
This table outputs the number and percentage of subreads and longest reads at each read quality (
RQ= read quality (per base accuracy)
xerr= the Xdepth required @ that
RQto meet the
targeterr=Xerror per base accuracy.
subread= number of subreads with that
unique= number of “best” unique subread per ZMW with that
f.subreads= proportion of subreads with that
f.unique= proportion of unique subreads with that
cum.subreads= proportion of subreads with quality >=
cum.unique= proportion of unique subreads with quality >=
x.subreads= XCoverage of subreads with quality >=
x.unique= XCoverage of unqiue subreads with quality >=
MeanRQ= mean read quality of bases in subreads with quality >=
Mean.XErr= the Xdepth required at that
MeanRQto meet the
targeterr=Xerror per base accuracy.
Calculate length cutoffs (calculate=T)
This function calculates length cutoffs for different XCoverage combinations from subread summary data. It uses the
*.rq.tdt files from above, generating if required (and regenerating if
XCovLimit data are calculated. These are the summed read lengths required to generate the desired genome coverage (
targetcov=X) at different depths of X coverage. Note that the square root of the
targetcov=X value is used, as the HGAP assembly process involves two layers of genome coverage: (1) coverage of seed reads by anchor reads to generate the pre-assembly; (2) coverage of the pre-assembly.
XCovLimit data are calculated by incrementing total summed read lengths in 1 Mb increments (adjusted with
xsteplen=X). At each incremment,
genomesize=X is used to calculate the total X coverage. The probability of the target
X coverage (starting at
1X) given the total X coverage is then calculated using a Poisson distribution. If this probability exceeds the target genome coverage, the current summed length is set as the
X and the target
X increased by 1. The total summed read length is then incremented by
xsteplen and the process repeated until the summed length reaches the total length of all subreads of at least the size set by
minreadlen=X (default 500 bp).
Next, the read quality steps to calculate are established using
rq=X,Y (minimum (
X) and maximum (
Y) read quality cutoffs, defaults=0.8-0.9) and
rqstep=X (the size of
RQ steps for calculations, default=0.01 (min 0.001).
Finally, the seed and anchor lengths required to achieve certain minimum Xdepths of target genome coverage are calculated and output to
TargetErr is used to establish the min. required anchor read Xdepth (
AnchorMinX). If this exceeds the specified
minanchorx=X value, this will be used as the minimum target instead. Next, the seed read length (
SeedLen) required to get a combined unique seed read length to give a minimum Xdepth (
targetxcov=X is calculated. Where
xmargin > 0, additional seed lengths will also be calculated to give deeper minimum seed Xdepths. e.g.
targetxcov=3 xmargin=2 will calculate
5X. For each
SeedLen, anchor read length cutoffs are also calculated such that the summed length of unique reads where
AnchorLen <= length <
SeedLen is sufficent to give
AnchorMinX values from the minimum established above until
XMargin is reached.
If there is insufficient data to meet the minimum seed and anchor read depths, the “optimal” seed XCoverage will be calculated from the unique seed reads, as described for
coverage=T. In essence,
SeedMinX will be reduced to meet
AnchorMinX until it falls below zero and then the two values will be optimised to try and maximise genome coverage at the required target error. Users may prefer instead to relax the
minanchorx=X values. This optimisation is based on unique Xcoverage and the precise values subsequently output in
*.cutoffs.tdt may differ. (Relaxed variations could be run with higher
xmargin=X values to output a range from which the chosen values can be selected for
mapefficiency=X is used during this process to reduce the effective seed coverage at a given summed length and thus inflate the required
SeedX needed to achieve a given
This table contains the main output from the
calculate=T function. Unique entries are determined by combinations of
RQ= Read Quality Cutoff.
SeedMinX= Min seed XCoverage for TargetCov % of genome (unique reads). If this is <1, it indicates the proportion of the genome covered at 1X.
SeedLen= Seed length Cutoff.
SeedX= Total Seed read XCoverage (all reads).
AnchorMinX= Min anchor XCoverage for TargetCov % of genome (unique reads). If this is <
minanchorx=X, it indicates the proportion of the genome covered at
AnchorLen= Anchor read (i.e. overall subread) length cutoff.
AnchorX= Total Anchor read XCoverage (all reads).
Preassembly fragmentation analysis (preassembly=FILE)
[ ] : Add preassembly details here.
Optimal Assembly Parameters (parameters=T)
parameters=T function attempts to generate predicted optimum assembly settings from the
calculate=T table data.
*.cutoffs.tdttable is read in (or generated if required).
xmargin=Xsettings are used to determine the desired
SeedMinXvalue, i.e. the miniumum depth of coverage of the chosen seed reads. (NB. The
mapefficiency=Xsetting is used for inflating the required XCoverage needed to meet this Min XDepth during the calculate process.)
- Each potential
RQcutoff is now assessed and the
targeterr=Xsetting used to determine the required Xdepth per base for error correction. Note: This uses the minimum RQ value and should thus be conservative. Using the mean RQ was considered but it was deemed that conservative is best in this scenario.
- The desired
AnchorMinXis calculated from the target Xdepth and
xmargin=X. The final value is reduced by one to allow for the fact that the seed read counts as X=1 for error estimation.
- Parameter combinations have now been reduced to those with the desired Seed and Anchor minimum Xdepth. Two parameter combinations are then output:
- MaxLen parameters using the mininum RQ value and thus the longest seed and anchor read length cutoffs.
- MaxRQ parameters using the maximum RQ value.
Note that the predicted parameter settings are only output as
#PARAM log entries: the full set (including these “optimal” ones) are part of the
Genome Coverage (coverage=T)
This method tries to predict genome coverage and accuracy for different depths of PacBio sequencing based on predicted usable output statistics and the genome size. Statistics for existing runs can be generated using the
summarise=T option and used to inform
calculate=T if run together.
Setup. If the
summarise=T option is used and/or there is an existing
BASEFILE.unique.tdt file (and
force=F) then the
*.unique.tdt table will be used to generate SMRT cells statistics: mean read count (used to populate
smrtreads=X); mean total read count (used to populate
avread=X); mean RQ (used to populated
errperbase=X). Otherwise, the corresponding commandline options will be used. If
smrtreads=X will be recalculated as
XnList. The second stage of setup is to calculate the %coverage at certain X depth coverage to be calculated along with overall depth of coverage etc. These numbers are based on the subread
minanchorx. Any levels explicitly chosen by
xnlist=LIST will also be calculated.
TargetXDepth. Next, target Xdepth values are calculated in the same fashio as
XCovLimit data (above), except that these are now converted to Xcoverage (based on
GenomeSize) rather than total subread lengths.
Accuracy. The % genome coverage and accuracy for different X coverage of a genome are then calculated assuming a non-biased error distribution. Calculations use binomial/poisson distributions, assuming independence of sites. Accuracy is based on a majority reads covering a particular base with the correct call, assuming random calls at the other positions (i.e. the correct bases have to exceed 33% of the incorrect positions). The
errperbase=X parameter is used for this calculation.
Coverage. Coverage statistics are then calculated for each Xdepth of sequencing (or SMRT cell if
bysmrt=T) First, the optimal seed read Xcoverage is calculated. The target seed Xdepth (
targetxcov=X) and anchor depth (
minanchorx=X) are used to identify the total target Xcoverage. If this is met (inflating required seed coverage to account for
mapefficiency=X), the largest seedX value that meets the
minanchorx=X anchor depth is selected. If this cannot be achieved with seedX >= 1, the optimal balance between seed length and anchor length is achieved by maximising the probability of 1X seed coverage and
MinAnchorX+ anchor coverage. This seed read length is then used to generate the predicted coverage output.
Main output is the
*.covergae.tdt file. All calculations are based on subreads, and therefore using the “raw” polymerase read data for the
smrtreads=X value for SMRT cells will overestimate coverage. Note that
smrtreads=X can be used to input sequence capacity in Gb (or Mb) rather than read counts by changing
XCoverage= Total mean depth of coverage.
SMRT= Total number of SMRT cells.
%Coverage= Estimated %coverage of assembly.
%Accuracy= Estimated %accuracy of assembled genome. This is established by working out the predicted proportion of the genome at each Xcoverage (given the total
XCoverage) and the accuracy at that depth (as described above).
%Complete= Product of %coverage and %accuracy.
SeedX= Estimated optimal depth of coverage for seed reads.
Parsing assembly parameters (ParseParam=FILES)
This method will take a bunch of
*.settings text files (wildcards allowed) and parse out the assembly parameter settings into a delimited text file. The contents of these files should be consistent with the
Assembly_Metrics_*.xlsx file produced by the SMRT Portal.
Output for this method is a
*.settings.tdt file, which has the following field headers:
Setting. The full setting name, e.g.
Prefix. The setting prefix, e.g.
Suffix. The setting suffix, e.g.
minCorCov. (This is used for some of the
Variable. Whether the parameter is variable (“TRUE”) or fixed (“FALSE”) in the set of
*.settingsfiles being parsed.
Assemblies. Each assembly (the
*.settings) will get its own field containing the actual value used for that parameter in that assembly.
NOTE: The files coming off SMRT Portal have some undesirable non-unicode characters in them. These are hopefully stripped by
SMRTSCAPE but it is possible that some parameters may not be correctly parsed.
It is possible to parse a selected subset of parameters using
paramlist=LIST. (This is easiest where
LIST is a text file with one parameter per line.) This should be a list of the full parameter name, i.e. the content of the
[ ] : Add the recommended list of parameters here.
Predicting assembly coverage (Predict=T)
This function predicts coverage from parsed assembly parameters and compares to pre-assembly subreads if possible. Its primary function is to check that the parameter settings from
calculate=T are working as expected (at least in terms of preassembly generation) and to tweak the
mapefficiency=X option if required. Where a reference is available, it can also be used to test the make
SRMTSCAPE calculations in terms of coverage etc.
Predict uses data from the
parseparam=FILES functions. (These will be run if required.) As such, it requires the original subread data (
seqin=FILE) and the list of
*.settings files that identifies the assemblies. (See
*.preassembly.fasta files should match the
*.settings files. (The file looked for will be identified as a
#PREX log entry.)
If it already exists, the
*.predict.tdt will be loaded and updated. Otherwise a new
*.predict.tdt file will be created. (See below.)
Predict first loads in the relevant data and assembly parameters (see output) before calculating expected coverage from subread data and observed coverage from preassembly data.
The output of
Predict mode is a
*.predict.tdt output file with the following fields:
Assembly= The assembly base filename for a given file in
SeedX= mean depth of coverage for seed reads given seed length and min RQ score.
AnchorX= mean depth of coverage for anchor reads given seed length and min RQ score.
SeedMinX= minimum depth of coverage of unique seed reads to achieve
AnchorMinX= minimum depth of coverage of unique anchor reads to achieve
PreCov= predicted base coverage of pre-assembly.
CorPreCov= corrected predicted base coverage of pre-assembly given
PreX= average depth of coverage of
PreMinX= minimum depth of coverage of
*.preassembly.fastasequences to achieve
SeedXas an estimate of the loss of seed sequence during the preassembly mapping phase. Ideally, this should be close to the
mapefficiency=Xsetting. (NOTE: SMRTSCAPE has not undergone extensive testing of this assumption.