Examples Of How To Run Platypus
This page describes how to run Platypus in a number of different scenarios. In most cases, Platypus works well with its default settings, and you don't need to change anything. Before running Platypus, make sure that all your BAM files are indexed, using either 'samtools index' or another program that generates compatible indexes. You must also index the reference file using 'samtools faidx', and you must use the same reference file for variant calling as was used to map the reads in the BAM files.
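As a quick pre-flight check, the sketch below looks for the index files before Platypus is started. It is not part of Platypus. It assumes the usual samtools naming, where 'samtools index data.bam' writes data.bam.bai (some tools write data.bai instead) and 'samtools faidx ref.fa' writes ref.fa.fai. The function name missing_indexes is hypothetical.

```python
import os


def missing_indexes(bam_paths, ref_path):
    """Return the index files that could not be found on disk."""
    missing = []
    for bam in bam_paths:
        # Assumption: the BAM index is either data.bam.bai or data.bai
        candidates = [bam + ".bai", os.path.splitext(bam)[0] + ".bai"]
        if not any(os.path.exists(c) for c in candidates):
            missing.append(candidates[0])
    # Assumption: the reference index sits next to the FASTA as ref.fa.fai
    fai = ref_path + ".fai"
    if not os.path.exists(fai):
        missing.append(fai)
    return missing


if __name__ == "__main__":
    for path in missing_indexes(["data.bam"], "ref.fa"):
        print("Missing index:", path)
```

An empty list means both kinds of index were found. The check does not verify that the BAM files were mapped against the same reference.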
Variant-calling on whole-genome data
Analysing whole genome data with Platypus is easy. Just run with default settings:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf
If you only want to run on a specific region, specify it using the '--regions' argument:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --regions=chr1
If you want to run on multiple regions, specify them all as a comma-separated list:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --regions=chr1,chr2,chr3,chr4
or even as specific coordinate ranges:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --regions=chr1:0-1000,chr2:-30000,chr3:2220-44444
or, if you have a lot of regions to look at, put them in a text file like this:
chr1:100-200
chr1:200-300
chr4:3330-9999
chr20:22229-999999
...
and specify the file:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --regions=FileOfRegions.txt
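If the list of regions is long, it may be easier to generate the file than to write it by hand. The sketch below is one possible way to do that and is not part of Platypus. It reads the lines of the 'samtools faidx' index (ref.fa.fai, whose first two tab-separated columns are the sequence name and length) and splits each sequence into fixed-size windows in the chr:start-end form shown above. The function name fai_to_windows and the window size are hypothetical, and the choice of 0 as the first coordinate and of shared window boundaries simply mirrors the examples on this page; check it against your own needs.

```python
def fai_to_windows(fai_lines, window=1000000):
    """Split every sequence listed in a .fai index into chr:start-end windows."""
    regions = []
    for line in fai_lines:
        if not line.strip():
            continue
        # The first two .fai columns are the sequence name and its length
        name, length = line.split("\t")[:2]
        length = int(length)
        for start in range(0, length, window):
            end = min(start + window, length)
            # Assumption: adjacent windows share a boundary, as in the
            # chr1:100-200 / chr1:200-300 example above
            regions.append("%s:%d-%d" % (name, start, end))
    return regions


if __name__ == "__main__":
    example = ["chr1\t2500\t6\t60\t61\n"]
    for region in fai_to_windows(example, window=1000):
        print(region)
```

Writing the returned strings to a text file, one per line, gives a file that can be passed to '--regions' in the same way as FileOfRegions.txt.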
Variant-calling on exome capture data
Analysing exome capture data with Platypus is also easy, and works pretty much as for whole genome. Default settings are fine. The only issue to be aware of is that Platypus will, by default, make calls on all the data, which means that you will end up with calls across the genome, wherever the sequence data was mapped to. This is because exome capture also captures quite a lot of non-exonic sequence, and this gets mapped back to various parts of the genome, and Platypus will try to make calls wherever there is data. If you only want calls in the exonic target regions, you need to specify this using the '--regions' argument, as shown above.
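Exome target regions are often distributed as a BED file, so a small conversion step may be needed before they can be passed to '--regions'. The sketch below is not part of Platypus. It turns BED records into the chr:start-end form used in the regions file described above. The function name bed_to_regions is hypothetical, and the coordinates are copied through unchanged: BED intervals are 0-based and half-open, so whether an off-by-one adjustment is needed for Platypus is an assumption you should verify.

```python
def bed_to_regions(bed_lines):
    """Convert BED records to 'chrom:start-end' strings, one per target."""
    regions = []
    for line in bed_lines:
        line = line.strip()
        # Skip blank lines and the optional BED header lines
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        # Assumption: start and end are passed through without adjustment
        regions.append("%s:%d-%d" % (chrom, int(start), int(end)))
    return regions


if __name__ == "__main__":
    example = ["track name=targets", "chr1\t100\t200\texon1", "chr4\t3330\t9999\texon2"]
    print("\n".join(bed_to_regions(example)))
```

The resulting lines can be saved to a text file and supplied with '--regions', so that calls are restricted to the exonic targets.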
Variant-calling on targeted capture data (gene/exon panels etc.)
Calling on targeted gene/exon panels, or other small regions, works well. By default, Platypus filters duplicate reads. You may want to turn this off, because when you sequence a small region there will be a higher rate of real duplicates (as opposed to PCR duplicates) than in exome or whole-genome data, and filtering these may significantly reduce the coverage. You can do this using the '--filterDuplicates=0' flag:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --filterDuplicates=0
Because there can be significant biases in targeted sequencing, it is a good idea to be extra careful when filtering variant calls. Some real variants may end up with lower-than-expected allele frequencies (e.g. a heterozygous variant being present in only 15-20% of reads), and might trigger the 'alleleBias' filter.
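Since calls that trigger the 'alleleBias' filter may still be real in targeted data, it can help to pull them out of the output for manual review. The sketch below is not part of Platypus. It assumes a standard tab-separated VCF, in which FILTER is the seventh column and multiple filter names are separated by semicolons. The function name records_with_filter is hypothetical.

```python
def records_with_filter(vcf_lines, filter_name="alleleBias"):
    """Yield (chrom, pos, ref, alt) for records whose FILTER includes filter_name."""
    for line in vcf_lines:
        # Header and metadata lines start with '#'
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue
        chrom, pos, _id, ref, alt, _qual, filt = fields[:7]
        # A record can fail several filters at once, e.g. "Q20;alleleBias"
        if filter_name in filt.split(";"):
            yield chrom, int(pos), ref, alt


if __name__ == "__main__":
    example = [
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t100\t.\tA\tG\t50\talleleBias\t.",
        "chr1\t200\t.\tC\tT\t90\tPASS\t.",
    ]
    for record in records_with_filter(example):
        print(record)
```

With a real file, pass the open file handle for out.vcf as vcf_lines. The listed sites can then be inspected individually rather than discarded outright.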
Haloplex Data
If you are using Haloplex, there are a couple of things to be aware of. Haloplex sequencing generates large blocks of reads, where all read-pairs in a block start and end at the same positions. This means that all reads in the block are treated as 'duplicates' by Platypus, so you must turn off duplicate filtering. In addition, to reduce error rates, Platypus effectively clips the ends of reads, ignoring variants that only appear at the ends. You must turn this off as well, otherwise some variants will not be called. Clipping is controlled by the '--minFlank' parameter. Here is an example of how to run Platypus on Haloplex data:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --filterDuplicates=0 --minFlank=0
Genotyping a list of known alleles
If you have a list of variants, and you want to genotype your data for those alleles, you can run Platypus like this:
python Platypus.py callVariants --bamFiles=data.bam --refFile=ref.fa --output=out.vcf --source=listOfVariants.vcf.gz --minPosterior=0 --getVariantsFromBAMs=0
You need to supply a VCF which has been compressed using bgzip, and indexed using tabix. If they are not already installed on your system, bgzip and tabix are distributed as part of HTSlib. To compress and index a normal VCF, run the following commands:
bgzip listOfVariants.vcf
tabix -p vcf listOfVariants.vcf.gz
The first command produces a compressed VCF file (.vcf.gz) from the original file. The second command creates an index (.vcf.gz.tbi) for the compressed VCF. This allows Platypus to query the VCF for specific regions.
The option '--minPosterior=0' sets the minimum quality score for reported variants to zero. This is so that Platypus can report reference genotypes for those variants in the input list which are not present in your data. If you don't want to see reference genotypes in the output, then remove this option.
The option '--getVariantsFromBAMs=0' prevents Platypus from doing its normal variant calling, and makes sure that only alleles in the input VCF are reported. If you want both normal calling and genotyping of the input alleles, then remove this option. This lets you use an external list of variants to augment the candidates generated by Platypus, which can sometimes be useful for calling, for example, larger indels.
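After genotyping a list of known alleles, a quick tally of the genotypes gives a sense of how many input alleles were found in the data and how many were reported as reference. The sketch below is not part of Platypus. It assumes a standard VCF with a FORMAT column that contains a GT key, and it reads only one sample column. The function name summarise_genotypes and the sample_index parameter are hypothetical.

```python
import re
from collections import Counter


def summarise_genotypes(vcf_lines, sample_index=0):
    """Count reference, variant and missing genotypes for one sample."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns 1-8 are fixed, column 9 is FORMAT, samples follow
        if len(fields) < 10 + sample_index:
            continue
        keys = fields[8].split(":")
        if "GT" not in keys:
            continue
        values = fields[9 + sample_index].split(":")
        genotype = values[keys.index("GT")]
        # Alleles are separated by '/' (unphased) or '|' (phased)
        alleles = re.split(r"[/|]", genotype)
        if "." in alleles:
            counts["missing"] += 1
        elif all(allele == "0" for allele in alleles):
            counts["reference"] += 1
        else:
            counts["variant"] += 1
    return counts


if __name__ == "__main__":
    example = [
        "chr1\t100\t.\tA\tG\t0\tPASS\t.\tGT\t0/0",
        "chr1\t200\t.\tC\tT\t80\tPASS\t.\tGT\t0/1",
    ]
    print(dict(summarise_genotypes(example)))
```

With a real file, pass the open file handle for out.vcf as vcf_lines. If '--minPosterior=0' was removed from the command, few or no reference genotypes should be expected in the tally.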