Notes for Stampy v1.0.20 -- Gerton Lunter, September 2012 ------------------------------------------------------------- 1. Summary ========== Stampy has the following features: - Maps single, paired-end, mate pair Illumina reads to a reference - Fast: about 10 (with BWA) or 15 hours (without) per Gbase - Low memory footprint: 2.7 Gb shared memory for a 3Gbase genome - High sensitivity for indels and divergent reads, up to 10-15% - Low mapping bias for reads with SNPs or indels - Well calibrated mapping quality scores - Input: Fastq and Fasta; gzipped or plain; SAM and BAM - Output: SAM, Maq's map file - Optionally calculates per-base alignment posteriors - Optionally processes part of the input - Handles reads up to 4500 bases Bug reports are highly appreciated. If you can, please design a small test file that reproduces the bug, and email the precise command line and details of the system you ran the program on. However, please do read this documentation before submitting a bug report. 2. Building =========== Just type "make" Currently the linux x86_64 platform is supported, and support for Mac OS-X 10.6 (x64_64 only) is experimental. Let me know if you need to run Stampy on another platform. Stampy needs Python version 2.6 or 2.7. Both 2-byte and 4-byte Unicode encodings are supported. If your default python is not 2.6 or 2.7, but say python2.6 is installed, use make python=python2.6 For Mac, only Python 2.6 with 2-byte Unicode is supported. Most errors that occur at this stage are related to the Python installation. Here is a checklist in case something goes wrong: - Check that python version 2.6 or 2.7 is installed on your system (type "python" in a shell). - Check that the executables python2.6/python2.7 and the related python2.6-config or python2.7-config can be found in your path, and that they live in the same directory as the Python executable you use. - If you get linking errors, check that the -L bit of the output of "python-config--ldflags" points to a directory that contains libpython. If not, your python installation is faulty; setting paths properly or re-installing Python might resolve the problem. - Make sure the installation and run-time versions of python agree. - For Mac users, check that you're using the standard Python ("which python"), rather than one from a third-party distribution such as Fink or Darwin. Some versions of these distributions have incomplete Python installations causing problems at the linking stage. 3. Quick-start guide ==================== Build a genome (.stidx) file: ./stampy.py --species=human --assembly=hg18_ncbi36 \ -G hg18 /data/genomes/hg18/*.fa.gz Build a hash (.sthash) file: ./stampy.py -g hg18 -H hg18 Single-end mapping: ./stampy.py -g hg18 -h hg18 -M reads.fastq.gz Paired-end mapping: ./stampy.py -g hg18 -h hg18 -M reads_1.fastq reads_2.fastq To speed up mapping, use BWA (recommended) and multithreading: bwa aln -q10 -t8 hg18 reads_1.fastq > 1.sai bwa aln -q10 -t8 hg18 reads_2.fastq > 2.sai bwa sampe hg18 1.sai 2.sai reads_1.fastq reads_2.fastq | \ samtools view -Sb - > bwa.bam ./stampy.py -g hg18 -h hg18 -t8 --bamkeepgoodreads -M bwa.bam Note that it is not necessary to sort the BWA BAM file. It is possible to re-map from sorted BAM file(s) directly; Stampy will pair up reads by label name before mapping: ./stampy.py -g hg18 -h hg18 -M sample.BAM sample_unmapped.BAM Set divergence for mapping to a foreign reference: ./stampy.py -g hg18 -h hg18 --substitutionrate=0.05 \ -M reads.fastq.gz Set the initial insert size distribution -- only inserts within 4sd of mean are considered for training the actual size distribution: ./stampy.py -g hg18 -h hg18 --insertsize=400 --insertsd=75 \ -M reads_1.fastq reads_2.fastq Use (post v1.3) Solexa quality scores; default is Sanger qualities: ./stampy.py -g hg18 -h hg18 --solexa -M reads.fastq.gz Process the 3rd of each set of 8 reads or read pairs: ./stampy.py -g hg18 -h hg18 --processpart=3/8 \ -M reads.fastq.gz 4. Running Stampy ================= First you have to build a genome file: ./stampy.py -G hg18 /data/genomes/hg18/*.fa.gz You may provide .fa or gzipped .fa files, and the .fa files may contain more than one chromosome or contig. As chromosome or contig identifer, the first word on the Fasta header line is taken. If the header follows NCBI formatting, the "ref" field is taken. The genome file has extension .stidx; building this takes a few minutes for large genomes. After building the genome file, you need to build a hash table: ./stampy.py -g hg18 -H hg18 This takes about 10 minutes for large genomes, and produces a file hg18.sthash. Finally, you're ready to map some reads: ./stampy.py -g hg18 -h hg18 -M illuminareads.fastq.gz Note that the hash file may be large (up to 2 Gb); to improve startup speed you may want to keep it on a local drive. Note also that the file system must support memory mapped files for Stampy to work; NFS does, but e.g. GlusterFS does not. Some file systems support memory mapped files but become exceedingly slow; if this happens try moving both files to a local drive. You need not unzip the input fastq files. You can use .fasta files too, using the option --inputformat=fasta. By default Stampy assumes that the quality scores use Sanger encoding (score 0 is '!', ascii 33), and use the phred (=log p) scale, not the logit (=log p/(1-p)) scale. Paired-end mapping is done by supplying two fastq files, one for each mate: ./stampy.py -g hg18 -h hg18 -M solexareads_1.fastq \ solexareads_2.fastq (If -M is not the last option on the command line, the input files need to be separated by a comma rather than a space.) It is not required to explicitly set parameters for the insert size distribution; Stampy automatically determines these within the first few hundred mappings, and adapts the scoring accordingly. The default settings are an average separation of 250 and standard deviation 60, which is broad enough to capture all currently used libraries within the default 4 standard deviations. If a library with an insert size distribution outside this range is used, you need to change these defaults for autocalibration to work. 5. Faster mapping with BWA ============================ For large reference genomes and not very divergent data sets (<1% say), it is highly recommended to use BWA to speed up mapping. First, create a BWA index for the same reference genome as used by Stampy, then map the reads using BWA in the usual way, and use samtools to convert the resulting SAM file into a BAM file. Then re-map the BAM file using Stampy, but keep the well-mapped reads: ./stampy.py -g hg18 -h hg18 --bamkeepgoodreads -M bwa.bam To further speed up mapping, tell both BWA and Stampy to use multiple threads using the -t option. Previous version of Stampy launched BWA itself, and processed reads in batches. For backward compatibility this is still supported: ./stampy.py --bwaoptions="-q10 BWAindex/hg18.fa" -g hg18 -h hg18 -M solexareads_1.fastq solexareads_2.fastq However this mode is not recommended: it takes longer because BWA has to be launched many times, and the BWA and Stampy tables are needed simultaneously so that the memory requirement is about twice as high. Finally, this mode only works with BWA versions 0.5.6, 0.5.7 and 0.5.9; the 0.6.x versions don't work as they occasionally segfault when BWA is given a small number of reads to process. The achieved speed-up depends on the quality of the input reads; if BWA fails to map a large proportion of reads, the speed-up will be comparatively low. On some filesystems, Stampy is particularly slow. In order to conserve main memory, Stampy shares its two large tables across multiple instances running on the same shared-memory node. This is achieved by memory-mapping a shared file. On standard filesystems (NFS, Linux etx3/4) this works fine, but certain higher-end distributed file systems this causes excessive slow-down. The solution is to copy the .stidx and .sthash files to a local drive, e.g. /tmp, and share those copies locally rather than through the globally shared file system. 6. Mapping divergent reads =========================== To calculate correct mapping qualities, Stampy needs to know the expected divergence from the reference. This is set with the --substitutionrate= option. The default is 0.001 substitutions per site. Increasing the read length, and using paired-end reads, helps mapping divergent reads. The following table gives an indication of the divergence at which a reasonable proportion of reads can be correctly mapped. These numbers were obtained by simulation, using the human genome as reference, and should be taken as an indication only; they are dependent on error rates, the repetitiveness of the genome, the insert size distribution, and local variations in divergence; in addition no indel mutations were included. 36bp 36bp 72bp 72bp divergence | single paired single paired ------------------------------------------------------- 0% | 82% 95% 87% 96% 3% | 73% 91% 80% 94% 6% | 60% 83% 72% 92% 9% | 41% 56% 56% 88% 12% | 28% 51% 48% 80% 7. Mapping mate pair libraries =============================== Mate pair libraries contain a mixture of ordinary (--> <--) paired-end reads, and mate pairs (<-- -->). To map these, Stampy trains separate insert size distributions for the two types of pairs, and attempts realignment within both regions. To allow Stampy to accurately train the two distributions, both distributions need to be seeded properly. This is best done by mapping a small fraction of the data, starting with a very wide distribution: ./stampy.py -g ref -h ref --numrecords=5000 \ --insertsize2=-2000 --insertsd2=750 -M read1.fq read2.fq Stampy trains a distribution with reads that fall within 3 standard deviations of the mean, using the current estimates of each. Stampy reports the estimated distributions when its done: stampy: # Paired-end insert size: 266.4 +/- 91.0 (903 pairs) stampy: # Mate pair insert size: -1519.1 +/- 164.5 (3374 pairs) The values -1519 and -165 can be used as new initial parameters, and the procedure repeated to obtain more accurate approximations. Note that the mean mate-pair insert size is negative, corresponding to the fact that the read mapping to the reverse strand maps to the left of the forward-mapping read. Protocols other than the currently standard Illumina mate-pair protocol may need positive insert sizes. Note also that with the current protocol, the standard deviation of the mate pair insert size distribution is always larger than that of the paired-end distribution, and that wider distributions slow down the realignment process, so that mate pair mapping can be slow. It is not recommended to use BWA for pre-mapping when mapping mate pair reads, as BWA does not consider the alternative positions, causing incorrect mappings in certain cases. 8. Reporting alternative mapping positions =========================================== Stampy can output alternative mapping locations for reads and read pairs, using the XA tag. By default this is switched off; it can be enabled with the options --xa-max=3 --xa-max-discordant=10 to report at most 3 alternative placements for single reads and concordant read pairs, and at most 10 for discordant read pairs (these are the BWA defaults). The XA tag reports, for each additional hit, the chromosome name, position and strand, CIGAR string, and the number of base mismatches plus the total length of any insertions or deletions. Note that when using BWA for pre-mapping, there is no need to set BWA's -n or -N options; Stampy automatically sets these to the correct values. 9. Testing Stampy ================== Stampy includes test code that generates reads under an empirical read error model and introduce SNPs van indel variants. From the output, stampy estimates the sensitivity for mapping reads back to their correct location, and it tabulates results by mapping posterior score to see if these are well calibrated. To run these tests, you need to create a genome file and corresponding hash file, and provide one or two .fastq files with short reads and quality scores. Stampy will use the length and qualities of these reads, but generate sequences from random locations in the genome provided. You run a test by using the -S command to simulate reads, followed by a -P command to parse the .sam output; or run the two together using -T: ./stampy.py -g hg18 -h hg18 -T solexareads_1.fastq,solexareads_2.fastq This uses paired-end reads; single-end reads are used if just one .fastq file is provided. The following options are useful in test mode: --substitutionrate=S Introduce an expected fraction S of Poisson-distributed substitutions (default: 0.001) --insertsize=N Set the mean insert size for paired-end reads (default: 250) --insertsd=N Set the standard deviation of the insert size distribution (default: 60) --numrecords=N Only map the first N reads, or read pairs (default: all) --simulate-minindellen=N Set the lower bound for simulated indel lengths (default: 0) --simulate-maxindellen=N Set the upper bound for simulated indel lengths (default: 0) --simulate-duplications Introduce duplications rather than insertions of random sequence (default) --simulate-numsubstitutions=N Introduce N substitutions in each read, rather than a Poisson-distributed number The settings of the first three options are used in simulations, and again when computing mapping statistics from the mapped reads. These options are therefore also meaningful outside of testing. The default for librarysd is generous, to train the model on a wide range of input data. For simulations this needs to be adjusted. Simulated indels are generated by drawing the indel length from a uniform distribution. This is not meant to be close to a real distribution, but is useful for testing the behaviour of stampy conditional on indels being present. Negative indel length are deletions, positive ones are insertions. When single-end reads are used, each read will contain a simulated indel; for paired-end reads, one of the mate pairs will contain an indel. 10. Mapping output =================== Output is written to stdout by default. An output file can be chosen with the -o or --output= option, however the SAM format is now the accepted standard and use of any other format is discouraged. A number of formats can be chosen using th -f (or --format=) option: -f sam : SAM format (default) -f maqtxt : Maq's text output format (produced by 'maq mapview') -f maqmap : Maq's binary .map format, new version (long reads) -f maqmapShort : Maq's binary .map format, old version (short reads) -f maqmapShortN : Maq's binary .map format, old version, including variant positions (produced with 'maq map -N') By default all reads are represented in the output. The default output format is the SAM format. See below for details on the various formats. The SAM format is the most comprehensive format, and is recommended. The Samtools program (samtools.sourceforge.net) is recommended for dealing with .SAM files. Several tools have been developed to use Maq's .map format as well, including those that are included with the Maq package, and therefore the .maq format was included as a convenience. However, Maq's .map format cannot represent all useful information. In particular indels are better represented in the SAM format. 10.1. A note on likelihoods and posteriors The single most often used statistic to judge the trustworthiness of a read map location is its "mapping quality". This is an approximation of the probability that a read is mapped to the wrong location (represented as a Phred score). The probabilistic model used by Stampy is a hybrid of three models: (1) a Bayesian model, which considers all candidate locations weighted by their likelihood. (2) an error model, which considers the possibility that read errors cause reads to be incorrectly mapped (3) a random model, which predicts how well a random sequence would match to the genome The first model deals with errors due to repetitive and nearly-repetitive sequence, and assumes that the correct mapping location was considered among the candidates. The second model estimates the probability that the correct candidate was missed because of (single-nucleotide) read errors. The third model acts as a post-hoc filter, and assesses whether a candidate locations looks better than a random best match. Together these models capture most of the error modes of read mapping. The most obvious exception is that the error model does not consider indel errors or mutations; these often lead to the correct candidate location being missed, particularly for short single-end reads. The "best" map that results is often caught by the third model; however if the sequence is mildly repetitive, it will also pass this filter. As a result, mapping quality is well calibrated for almost all cases, except short single-end maps of reads that contain indel mutations, in which case the mapping quality is overly optimistic by about an order of magnitude. Consequently, indels that are supported by single-end maps only should be treated with caution. Stampy computes posteriors and likelihoods both for pairs of reads, and for reads considered by themselves. The following table summarizes the SAM tags for these statistics for paired reads: Read | Posterior | Likelihood | -------------------------+-------------+------------+ Pair | MAPQ column | PQ:i: | This read as single read | SM:i: | UQ:i: | Mate as single read | MQ:i: | XQ:i: | For single reads, only the MAPQ column and SM:i: tag are present. 10.2. SAM output format This tab-delimited output format is described in detail on samtools.sourceforge.net. A few things to note: - The MAPQ field is the phred-scaled estimated probability that the read was mapped to the wrong location. See 8.1 for details. - In addition to the mapping quality (the probability that a read was mapped incorrectly), Stampy also reports the read likelihood: the likelihood that a read was produced from the reference, conditional on the mapping location being correct. This score is the sum of phred qualities on mismatching sites, and includes probabilities for indels and read separation as well. The single- read, paired-read and mate likelihoods are reported in the optional "UQ", "PQ" and "XQ" fields. See 8.1 for details. - The SAM format requires that paired reads share identifiers; if a trailer like "/1" is present, it will be removed from the identifier to conform to the standard. - If an identifier contains spaces, only the first word will be used (and paired-end /1, /2 trailers silently added if necessary). The option --keeplabel changes this behaviour, and instead replaces spaces by underscores. It is not possible to keep spaces, since BWA also only uses the first word. - The "proper pair" flag bit (value 2) is set if two reads are correctly oriented, and their separation is within 5 standard deviations from the mean - The "mate unmapped" bit (value 8) is never set by itself; pairs are either both mapped or both unmapped - The "UQ" optional field (single read likelihood) is always present. - The "PQ" and "XQ" fields are always present for paired-end reads, and represent the paired likelihood (which includes terms for mismatches, indels and read separation), and the single read posterior, respectively. - The optional "SM" and "MQ" fields (mapping quality of the read or its mate, considered as a single read) are always present for paired-end reads 11. Hints and features 11.1 Ordering of chromosomes in SAM output The SAM/BAM format does not specify the order in which the reference sequences appear in the SAM header. By default Stampy up to version 1.0.13 ordered them alphabetically; from 1.0.14 the default order is as the original .fa input file. Some downstream tools require the references to appear in the same order as in the original .fa input file. To enable this behaviour, you had to use --keepreforder option while mapping before version 1.0.14; this is now the default. To re-enable alphabetic mapping, use --alphabetical. Note that .stidx files created before Stampy v1.0.4 will not work correctly; you need to re-create the .stidx file. 11.2 Parsing NCBI fasta files NCBI fasta files use a >gi|nnn|ref|xxx identifier. By default Stampy parses this and uses only the "xxx" part. Use --noparseNCBI to switch off this behaviour and use the full NCBI identifier. 11.3 Base-level alignment qualities The placement of an indel into an alignment is uncertain, because of actual ambiguity, polymorphisms, and read errors. A probabilistic alignment considers all possibilities (under a suitable model) and this information can be used to compute a posterior score. To have Stampy compute this, use the --alignquals option. Note that this increases the output file size, and also increases the runtime, since the required forward and backward iterations take time. 11.4 Base quality recalibration For very old Solexa/Illumina reads it may be advantageous to recalibrate the quality scores: ./stampy.py -g hg18 -h hg18 -R solexareads.fastq.gz By default this maps 1 percent of the fastq file onto the reference, collects statistics, and writes a *.recaldata file. If you now start the mapper again, ./stampy.py -g hg18 -h hg18 -M solexareads.fastq.gz, Stampy will use the .recaldata file and apply the recalibration before mapping, which should improve the mapping quality statistics. (Note that if the original file is not ascii-33 based, or uses logit scores, you need to provide the relevant options both times.) For current Illumina data, the quality scores are sufficiently well calibrated, and will degrade by this recalibration procedure because polymorphisms are not taken into account. 11.5 Multiple mapping locations Normally, when Stampy identifies several equally good mapping locations for a read or read pair, it reports one of these at random (and assigns the choice a low mapping quality, of 3 or less). Alternatively Stampy can report a limited number of alternative mapping locations if you set the --xa-max option to a nonzero value. The alternative locations are reported in a single XA:Z:... tag, using the format that is used by BWA. Normally, no alternative mappings are reported for discordant read pairs. However, if you set --xa-max-discordant to a nonzero value, these will be reported as well. 11.6 Marking BWA-mapped reads The option --bwamark tells Stampy to mark reads that were mapped by BWA, and directly copied by Stampy, with a XP:Z:BWA tag. This feature causes the output file to increase in size, however it is more efficient than keeping separate BAM files for Stampy and BWA. In cases where Stampy re-considers reads that were also mapped by BWA (for instance, if these overlapped a large indel), such reads are reported as unmapped (flag value 4) and as secondary (flag value 256); the original flag value is reported as XP:Z:BWA:. In this case, the read mapped by Stampy is reported without XP:Z:BWA tag. Downstream tools should ignore reads that are marked as either unmapped or secondary. Since BWA-specific reads have both flags set, there is a good chance that such tools will indeed do so. However since many mappers don't include unmapped reads or secondary mappings, it is possible that downstream tools ignore these flags. Please be aware of this when using this feature. To reconstruct the BWA output, keep all reads with an XP:Z:BWA tag, reconstructing the flag value is necessary. Note that this feature is experimental, is not very well tested, and will for the moment not be supported. 11.7 Mapping from BAM files BAM files are useful containers for storing and archiving both mapping results and original read data. Stampy has a few options to simplify re-mapping reads contained in BAM files. Sometimes, reads from a sequencing experiment are split between multiple BAM files, to store them by chromosome for example, or unmapped reads may be stored in a separate BAM file. Stampy can map multiple BAM files simultaneously, and will merge the contents before mapping. BAMs often contain data from multiple sequencing runs, and/or multiple libraries. It is important to separate the data from different libraries, since these often have different insert size distributions. To only re-map data from one read group, use --readgroup=ID:xxx where xxx is the read group identifier. To map all reads from a single library across readgroups, use the same option but specify the library identifier. In the output, the reads will be assigned to a readgroup with the library identifier. If reads are mapped with BWA, a considerable speed-up can be achieved with the --bamkeepgoodreads option, which copies the well-mapped reads to the output without remapping them. Stampy pairs up the reads from a single fragment before re-mapping them using a generous read-ahead window, allowing most reads to be paired on the fly. Those mapping to very different locations are spilled to a temporary file, sorted, and mapped when all input has been read. Occasionally, this temporary file may become large. To avoid problems, you can specify a location for this temporary file using --bamsortprefix=S You can limit the amount of memory used when sorting by specifying --bamsortmemory=N This value is passed to samtools via its -m option. 12. License =========== This is a release version. Permission is granted for the normal use of the program and its output in an academic setting, including in publications. If the program is used to generate data for a publication, please cite this paper: G. Lunter and M. Goodson. Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Res. 2011 21:936-939. The program itself may not be modified in any way, and may not be reverse-engineered. This license does not allow the use of this program for any commercial purpose. If you wish to use this program for commercial purposes, please contact the author. No guarantees are given as to the program's correctness, or the accuracy or completeness of its output. The author accepts no liability for damage or otherwise following from using and interpreting the output of this program. The software is supplied "as is", without obligation by the author to provide any services or support. 13. Revision history ==================== 1.0.5, rev 786 (1 October 2010) - first release 1.0.6, rev 803 (20 October 2010) - added NM (number of mismatches) tag - added support for sorted BAM/SAM input - bugfix: unmapped reads were sometimes reversed 1.0.7, rev 827 (3 November 2010) - added FAQ section to README - bugfix: BWA support was broken in 1.0.6 1.0.8, rev 828 (4 November 2010) - bugfix: occasional undefined variable reference 1.0.9, rev 852 (22 November 2010) - added --ignore-improper-pairs option, useful when remapping broken BAM files - bugfix: memory leak when mapping with BWA - bugfix: improper read pairing when remapping part of a paired BAM file - bugfix: v1.0.7/8 occasionally printed debug output 1.0.10, rev 854 (23 November 2010) - added --bwatmpdir option - better adherence to SAM v1.3 format specification; added option --referenceuri - bugfix: BWA paired-end mapping broken in 1.0.9 1.0.11, rev 880 (22 December 2010) - Faster, particularly for longer reads - Improved reporting of errors - Base Alignment Quality scores (BAQ, --baq) - Improved scoring of low-quality inserted bases - Improved indexing for references with many contigs - Fixed a bug requiring index and hash to be writeable - Fixed a bug throwing occasional segmentation faults for references with many contigs 1.0.12, rev 1010 (20 March 2011) - Improved specificity in low complexity regions - More informative error messages - Bugfix: underflow could cause negative BAQ values - Bugfix: all-N reads sometimes caused segfaults - Bugfix: -P sometimes failed to parse read label - Bugfix: Cigars sometimes started with I..D.. 1.0.13, rev 1160 (16 June 2011) - Added support for mate pairs - Added XA (alternative mappings) output tag - Added optional tagging of BWA-mapped reads - Bugfix: simulation of duplications did not work - Bugfix: estimation of insert sizes improved 1.0.14, rev 1307 (9 December 2011) - BWA tags allow reconstruction of BWA-mapped reads - bz2 input supported - Bugfix: highly repetitive reads occasionally were assigned high mapping qualities - Read pairs consisting of two identical sequences are now mapped as a single read (for GenomeSTRiP) - Option --keepreforder now default; added --alphabetical - Added option --maxpairseeds to allow increasing the sensitivity for mapping to highly repetitive genomes - The 'properly paired' flag is now set for insert sizes within 3 sd of the mean; this used to be 4. - Added option --overwrite - Fixed bug when read pairs were labeled as /1 and /3 - Fixed bug in auto-sensing code for FASTA files 1.0.15, rev 1360 (2 February 2012) - Added --casava8 option to keep (but not map) reads not passing QC filters - Bugfix: --overwrite now also works for .sam output - Bugfix: occassional math domain errors - Bugfix: the PQ tag value was float rather than int 1.0.16, rev 1430 (16 February 2012) - Bugfix: memory leak (present from revision 1307) - Bugfix: now does not require Python bz2 module 1.0.17, rev 1481 (4 April 2012) - Bugfix: --bwamark did not properly copy all flag bits - Bugfix: mapping positions of reads starting with adjacent insertions and deletions were sometimes slightly off - Changed computation of ISIZE field to match BWA's sign and value for nonstandard directions and overlap - Changed computation of "NM" tag to agree with Samtools: count now includes bases involved in indels - Added option --labelfilter to map a subset of reads - Added option --maxbasequal to accept high base qualities - Added option --gatkcigarworkaround removing adjacent I/D events from CIGAR strings, which trips up GATK - Option --casava8 with --keeplabel keeps multiplex tag 1.0.18, rev 1526 (17 May 2012) - Bugfix: BWA-mapped reads lost their fragment size sign - Regression: simulation stopped working in 1.0.17 - Added informative error message when mem-mapping fails 1.0.19, rev 1617 (10 October 2012) - Added capability to map from arbitrary BAM input - Added multithreading support - Added option --separator[1,2] to deal with alternatively formatted Casava v1.8 tags - Bugfix: issue with numeric-only read labels 1.0.20, rev 1642 (18 October 2012) - Added --xcigar option for extended CIGAR output (XC tag) - Added --untrimq option to tweak soft trimming removal for BWA/BAM input - Bugfix: --bamkeepgoodreads never trimmed reads, causing slowdown - Bugfix: --bamkeepgoodreads sometimes accepted large insert sizes - Bugfix: increased specificity for reads mapping to highly repetitive regions - Bugfix: Meaningful error message when re-mapping from BAM with incorrect reference - Bugfix: --readgroup can be used to annotate reads from BAM files without readgroup annotation 12. FAQ ======= 1. Q: Stampy complains that there are no input files, if I don't use a comma between the file names A: Use -M as the last option on the command line to avoid this. 2. Q: Stampy starts but doesn't produce data for hours - is this normal? A: No. Stampy loads its index files in memory, and uses memory mapping to allow the memory to be shared between instances of Stampy. However, this does not work on all file systems (notably, the Sanger file system). To avoid the problem, copy the .index and .hash files to a local drive, e.g. /tmp 3. Q: Using Stampy with BWA seems to be slow, and it's supposed to be fast? A1: Make sure you run BWA (use --bwaoptions="...") A2: You might be running out of memory. BWA and Stampy each need 3 Gb of memory. To run n copies side-by-side, and mapping to a human-size genome, 4+3n Gb of free memory is recommended. A3: Note that multithreading (BWA's -t option) has negligible effect. Stampy cannot be multithreaded at this moment. 4. Q: OK, Stampy can't be multithreaded at the moment. What do I do to make it faster? A: You can run multiple copies side by side (but see the answer to question 3), and have each copy process part of the file, using the option --processpart=N/M (1 <= N <= M). The resulting SAM files can be merged using e.g. Samtools. 5. Q: Stampy exits with "Fatal Python error: PyImport_GetModuleDict: no module dictionary". Should I worry? A: This is an obscure Python bug that occurs rarely at shutdown. As long as Stampy says "Stampy: Done", it finished OK. 6. Q: Stampy cannot find "python*-config.py" during installation; how can I solve this? A: This happens on Ubuntu installations, where python*-config is not always installed. Try installing the "python-dev" package. If that fails, try installing python from source (be sure to do "make install", not just "make") 7. Q: I'm getting "Suspiciously high Q scores" warnings, but I'm sure the fastq data is fine. A: Your fastq data is in Solexa format (base-64), while Stampy defaults to Sanger format (base-33). Use the option --solexa. 8. Q: Stampy requires python 2.6 -- how do I choose this version? A: Ensure Python 2.6 is installed; then use "python2.6 stampy.py ..." 9. Q: Stampy complains about "mixed single-and paired-end in input", but I'm using only paired-end reads A: Stampy likely got confused by Casava 1.8 fastq headers. Try using the --casava8 option.