Pan1c (Pangenome at chromosome level) workflow
Pan1c: a Snakemake workflow for creating pangenomes at chromosome scale. The workflow uses a set of Apptainer images:
- PanGeTools (Odgi, VG, ...): https://forgemia.inra.fr/alexis.mergez/pangetools
- PanGraTools (PGGB): https://forgemia.inra.fr/alexis.mergez/pangratools
- Pan1c-Apps (Python, Snakemake): https://forgemia.inra.fr/alexis.mergez/pan1capps
An example of input files and a config file is available in example/.
Prepare your data
This workflow accepts chromosome-level as well as contig-level assemblies, but requires a reference assembly.
FASTA files need to be compressed using bgzip2 (included in PanGeTools).
Sequence names of the reference must follow this pattern: <sample>#<haplotype>#<contig or chromosome name>.
For example, the chromosomes of CHM13 (haploid) must be named CHM13#1#chr.. in the reference FASTA. Only the reference needs to follow this pattern for its sequence names. The sequences of the other haplotypes will be renamed based on the reference and on their respective FASTA file names.
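The naming rule above can be checked before launching the workflow. The following is a minimal sketch and not part of Pan1c: the names `PANSN_RE`, `is_pansn_name` and `check_headers` are illustrative, and the regular expression assumes that the haplotype field is an integer and that no field contains whitespace.

```python
import re

# Assumed pattern: <sample>#<haplotype>#<contig or chromosome name>.
# The integer haplotype is an assumption made for this sketch.
PANSN_RE = re.compile(r"^(?P<sample>[^#\s]+)#(?P<haplotype>\d+)#(?P<name>\S+)$")


def is_pansn_name(seq_name):
    """Return True if a sequence name matches the assumed pattern."""
    return PANSN_RE.match(seq_name) is not None


def check_headers(fasta_lines):
    """Return the sequence IDs of FASTA headers that do not match the pattern."""
    bad = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            # Keep only the ID; anything after the first space is a description.
            name = fields[0] if fields else ""
            if not is_pansn_name(name):
                bad.append(name)
    return bad
```

Calling `check_headers` on the lines of the decompressed reference FASTA returns an empty list when every header follows the pattern.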
FASTA files must also follow a naming pattern: <sample>.hap<haplotype>.fa.gz. Once again with CHM13, the FASTA file should be named CHM13.hap1.fa.gz.
See PanSN for more info on sequence naming.
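The file naming pattern `<sample>.hap<haplotype>.fa.gz` can be parsed in a similar way. This is a hedged sketch: `parse_haplotype_filename` is a hypothetical helper, and it assumes the haplotype is written as an integer, as in `CHM13.hap1.fa.gz`.

```python
import re
from pathlib import Path

# Assumed pattern: <sample>.hap<haplotype>.fa.gz with an integer haplotype.
FASTA_NAME_RE = re.compile(r"^(?P<sample>.+)\.hap(?P<haplotype>\d+)\.fa\.gz$")


def parse_haplotype_filename(path):
    """Return (sample, haplotype) for a matching file name, else None."""
    match = FASTA_NAME_RE.match(Path(path).name)
    if match is None:
        return None
    return match.group("sample"), int(match.group("haplotype"))
```

A `None` result points to a file that would need renaming before it is placed with the other haplotypes.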
Ideally, you should provide only chromosome-level assemblies. However, because the haplotypes are renamed using RagTag, scaffold-level or contig-level assemblies can also be given. Since RagTag scaffolds each assembly using the "reference" haplotype, it can also scaffold chromosome-level assemblies that contain unplaced scaffolds or contigs. If you do not want this behavior, remove all non-chromosome-level sequences from your FASTA files before providing them to Pan1c.
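Removing unplaced sequences can be done with any FASTA tool. As an illustration only, the sketch below filters records in plain Python. The rule in `looks_like_chromosome` (IDs starting with "chr") is a hypothetical convention, and `prune_fasta` and its arguments are names chosen for this example; adapt the predicate to the names used in your assemblies.

```python
import gzip


def looks_like_chromosome(seq_id):
    # Hypothetical rule: IDs starting with "chr" are chromosomes.
    return seq_id.startswith("chr")


def keep_chromosome_records(fasta_lines, is_chromosome):
    """Yield the lines of the records whose sequence ID satisfies is_chromosome."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            keep = bool(fields) and is_chromosome(fields[0])
        if keep:
            yield line


def prune_fasta(src_path, dst_path, is_chromosome=looks_like_chromosome):
    """Read a gzip-compatible FASTA and write the kept records as plain text."""
    with gzip.open(src_path, "rt") as src, open(dst_path, "w") as dst:
        dst.writelines(keep_chromosome_records(src, is_chromosome))
```

The output of `prune_fasta` is an uncompressed FASTA, so it has to be compressed again before being given to the workflow.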
Download Apptainer images
Before running the workflow, some Apptainer images need to be downloaded. Use the getApps.sh script to do so:
./getApps.sh -a <apps directory>
Make sure to use the latest version, or the workflow may return errors!
Running the workflow
Clone this repository and create a data/haplotypes directory where you will place all your haplotypes.
Update the reference name and the Apptainer image directory in config.yaml.
Then, modify the variables in runSnakemake.sh to match your requirements (number of threads, memory, job name, email, etc.).
Navigate to the root directory of the repository and execute sbatch runSnakemake.sh.
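Before submitting the job, the layout described in these steps can be verified with a small script. This is a sketch under assumptions: `preflight` is a hypothetical helper that only checks that the files and directory named above exist; it does not read config.yaml or validate its contents.

```python
from pathlib import Path


def preflight(repo_root):
    """Return a list of problems found in the expected pre-run layout."""
    root = Path(repo_root)
    problems = []
    # Files named in the steps above, expected at the repository root.
    for name in ("config.yaml", "runSnakemake.sh", "Snakefile"):
        if not (root / name).is_file():
            problems.append(f"missing file: {name}")
    hap_dir = root / "data" / "haplotypes"
    if not hap_dir.is_dir():
        problems.append("missing directory: data/haplotypes")
    elif not any(hap_dir.glob("*.hap*.fa.gz")):
        problems.append("no <sample>.hap<haplotype>.fa.gz files in data/haplotypes")
    return problems
```

An empty list from `preflight(".")`, run at the repository root, suggests the inputs are in place.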
Outputs
The workflow generates several key files:
- Aggregated graph including every chromosome-scale graph (output/pan1c.<panname>.gfa)
- Chromosome-scale graphs (data/chrGraphs/chr<id>.gfa)
- Panacus HTML reports for each chromosome-level graph (output/panacus.reports/chr<id>.histgrowth.html)
- Statistics on input sequences, graphs and resources used by the workflow (output/stats)
- Odgi 1D visualization of chromosome-level graphs (output/chrGraphs.figs)
- (Optional) SyRI structural variant figures (output/asm.syri.figs)
- (Optional) Quast results on your input haplotypes (output/quast)
- (Optional) Contig composition of chromosomes of your input haplotypes (output/hap.contig)
- (Optional) PAV matrices for each chromosome graph (output/pav.matrices/chr<id>.pav.matrix.tsv)
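The path patterns in this list can be expanded to check which outputs exist after a run. The sketch below is illustrative: `expected_outputs` and `missing_outputs` are hypothetical helpers, only the non-optional per-file patterns are covered, and how `<id>` is filled in depends on your reference chromosome names. The example tree further down shows a slightly different name for the aggregated graph, so treat these patterns as assumptions to adjust.

```python
from pathlib import Path


def expected_outputs(panname, chrom_ids):
    """Expand the documented path patterns for one pangenome name."""
    paths = [f"output/pan1c.{panname}.gfa"]
    for cid in chrom_ids:
        paths.append(f"data/chrGraphs/chr{cid}.gfa")
        paths.append(f"output/panacus.reports/chr{cid}.histgrowth.html")
    return paths


def missing_outputs(root, panname, chrom_ids):
    """Return the expected paths that do not exist under root."""
    return [p for p in expected_outputs(panname, chrom_ids)
            if not (Path(root) / p).exists()]
```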
File architecture
Before running the workflow
Pan1c/
├── config.yaml
├── data
│   └── haplotypes
│       ├── ref.hap<x>.fa.gz
│       ├── samp1.hap<x>.fa.gz
│       └── ...
├── example
│   └── ...
├── getApps.sh
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
└── Snakefile
After the workflow (Arabidopsis thaliana example)
The following tree is non-exhaustive for clarity. Temporary files are not listed, but key files are included.
The name of the pangenome is 06AT-v3.
Pan1c-06AT-v3
├── chrInputs
│
├── config.yaml
├── data
│   ├── chrGraphs
│   │   ├── chr<id>
│   │   ├── chr<id>.gfa
│   │   └── graphsList.txt
│   ├── chrInputs
│   │   └── chr<id>.fa.gz
│   ├── haplotypes
│   └── hap.ragtagged
│       ├── <sample>.hap<hid>
│       └── <sample>.hap<hid>.ragtagged.fa.gz
├── logs
│   ├── pan1c.pggb.06AT-v3.logs.tar.gz
│   └── pggb
│       ├── chr<id>.pggb.cmd.log
│       └── chr<id>.pggb.time.log
├── output
│   ├── figures
│   │   ├── chr<id>.1Dviz.png
│   │   └── chr<id>.pcov.png
│   ├── stats
│   │   ├── pan1c.pggb.06AT-v3.core.stats.tsv
│   │   ├── pan1c.pggb.06AT-v3.chrGraph.general.stats.tsv
│   │   └── pan1c.pggb.06AT-v3.chrGraph.path.stats.tsv
│   ├── pan1c.pggb.06AT-v3.gfa
│   ├── panacus.reports
│   │   └── chr<id>.histgrowth.html
│   └── chrGraphs.stats
│       └── chr<id>.stats.tsv
├── Pan1c-06AT-v3.log
├── README.md
├── runSnakemake.sh
├── scripts
│   └── ...
├── Snakefile
└── workflow.svg
Example DAG (Saccharomyces cerevisiae example)
This DAG shows the workflow for a pangenome of Saccharomyces cerevisiae using the R64 reference.