Skip to content
Snippets Groups Projects
README.md 8.76 KiB
Newer Older
Raphaël Flores's avatar
Dev
Raphaël Flores committed
# panREPET

Johann Confais's avatar
Johann Confais committed
panREPET is described in this preprint available on [BioRxiv](https://doi.org/10.1101/2024.06.17.598857)
Raphaël Flores's avatar
Dev
Raphaël Flores committed
## Description

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Transposable elements (TEs) are mobile DNA elements that can invade genomes by transposition. Despite their reputation as parasitic sequences, they can enrich genomes with functional novelties that foster genome evolution. The impact of TEs in a genome can be explored by searching for their insertions. Individuals of the same species independently undergo TE insertions causing inter-individual genetic variability. This variability between individuals is the basis of the natural selection that leads to an increased adaptation of individuals to their environment. A way to search for the potential role of TEs in host adaptation is through a pangenomic approach. The TE pangenome can be described by (i) TE insertions present in all individuals of the species (core-genome), (ii) insertions present only among a subset of individuals (dispensable-genome) or (iii) ecogenome when the individuals share the same environment, and finally (iv) individual-specific insertions.
A majority of current pangenomic analysis methods are based on the alignment of reads from different genomes of a species to an assembled reference genome. But, the advent of the third-generation sequencing makes now possible to better approach this question using **multiple </m>de novo</em> assembled genomes** of the same species to avoid the bias introduced by a single reference genome. panREPET is a new pipeline to take in charge this type of data. It compares TE copies between each pair of genomes then identifies **TE copies shared by a group of individuals**. As it work on multiple assembled genomes, it allows accessing to exact genomic contexts of TE copies and their exact sequences.
Raphaël Flores's avatar
Dev
Raphaël Flores committed

## Installation

Install [Conda](https://docs.conda.io/en/latest/miniconda.html) (>=4.12.0)

Install Snakemake via [mamba](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) (mamba >=0.22.1, snakemake >=7.3.8)

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Install [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html) (>=3.8.7-1.el7)

Raphaël Flores's avatar
Dev
Raphaël Flores committed
## Directory tree

```
.
├── README.md
├── Snakefile
├── run_Snakemake.sh
│   └── scripts
│       ├── extractCopies.py
SAIDI SOMIA's avatar
SAIDI SOMIA committed
│       ├── seqIdLenFasta.py
Raphaël Flores's avatar
Dev
Raphaël Flores committed
│       ├── fastaLength.py
│       ├── reformat.py
│       ├── launchMinimap2.py
│       ├── minimap2align.py
│       ├── bestHits.py
│       ├── cliques.py
│       ├── uniqueCopies.py
│       └── stats.py
│   ├── config
│   ├── containers
SAIDI SOMIA's avatar
SAIDI SOMIA committed
│   ├── data
│   └── envs
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```

## How configurate the configfile

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Set up the configfile in config/example.yaml:
Raphaël Flores's avatar
Dev
Raphaël Flores committed

* `output_directory`: Output directory path, the path must be absolute.
Johann Confais's avatar
Johann Confais committed
* `GFF`: Specify the path to TE annotation from your genomes of interest (GFF format, of the same shape as the GFF from [panTEannot](https://forgemia.inra.fr/urgi-anagen/panTEannot)).
Raphaël Flores's avatar
Dev
Raphaël Flores committed
* `Ref`: Specify the path to the reference sequence of your genomes. Each genome for GFF and Ref must have the same identifier.
* `Consensus`: Specify the path to the TE consensus library used for TE annotation of genomes (fasta format).

**params:**

SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `cov_consensus:` Select TE copies which cover their consensus betwen x% and (100+(100-x))% (95 by default meaning over 95-105%). The higher the coverage, the more complete copies are selected.
Raphaël Flores's avatar
Dev
Raphaël Flores committed
* `copies_type`: Specify the complete TE copy type you want to extract. Choose between **FLF or FLC (FLC by default).** The copies are classified as **Full-Length Copies (FLC)** when they are aligned (potentially with large gaps if there are insertions) over x%-(100+(100-x))% of their reference sequence and as **Full-Length Fragment copies (FLF)**, when they are unfragmented.
* `extension_length:` Flanking regions length of TE copies in base pairs (500 bp by default).
* `cov_flank:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage **of the two flankers from** its match (80% by default). This filter diminishes the number of false positive copies.
SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `cov_match:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage of its match (not useful, 0 by default).

Raphaël Flores's avatar
Dev
Raphaël Flores committed

**params (which are optional):**
* `select_region_bed:` Specify regions to select or to mask (bedfile). None by default. None by default means that all submitted genomes (params Ref) are considered.
* `select_type:` Select or mask a part of genomes (e.g. you can mask the centromeres). Choose between **mask or select (select by default).**
* `is_chr_level:` You can specify at which chromosomal level TE insertions should be compared (e.g. compare only same chromosomes between genomes). Caution 1: For assemblies that are not on a chromosomal scale, you must set false. Caution 2: If chromosome names are not the same between genomes, the comparison will not be possible.

**Example:**
```
output_directory = /absolute_path/my_output_directory

GFF:
   genome1: genome1.gff
   genome2: genome2.gff

Ref:
   genome1: genome1.fa
   genome2: genome2.fa

Consensus: my_TE_library.fa

params:
SAIDI SOMIA's avatar
SAIDI SOMIA committed
   cov_consensus: 95
Raphaël Flores's avatar
Dev
Raphaël Flores committed
   copies_type: FLC
   extension_length: 500
   cov_flank: 80
   cov_match: 0.0
   select_region_bed: centromeres.bed (optional)
   select_type: mask (optional)
   is_chr_level: False (optional)
```

centromeres.bed must have this format:
```
genome	chr	start	end
genome1	1	1000	5000
genome1	2	700	1000
genome2	1	1000	5000
genome2	2	800	1100
```
During the "extractCopies" step, panREPET does not extract or extracts only the TE copies in the regions specified in select_region_bed params.

**tools:**

SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `container :` Singularity container containing TEfinder 2.31 with matcherThread2.31, already includes in git folder containers/te_finder_2.31.sif
Raphaël Flores's avatar
Dev
Raphaël Flores committed

SAIDI SOMIA's avatar
SAIDI SOMIA committed
## About TEfinder container
Raphaël Flores's avatar
Dev
Raphaël Flores committed

Another way to get the TEfinder container:
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
singularity pull --arch amd64 library://hquesneville/default/te_finder:2.31
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
Source: https://cloud.sylabs.io/library/hquesneville/default/te_finder:2.31
Raphaël Flores's avatar
Dev
Raphaël Flores committed

For more details about TEfinder and Matcher tools: Quesneville, H., Nouaud, D. & Anxolabéhère, D.  Detection of New Transposable Element Families in Drosophila melanogaster and Anopheles gambiae Genomes . J Mol Evol57 (Suppl 1), S50–S59 (2003). https://doi.org/10.1007/s00239-003-0007-2

## Execution

You can execute an example based on data you can find in data folder composed of:
- 2 whole-genome assemblies of <em>Brachypodium distachyon</em> from Phytozome database (https://phytozome.jgi.doe.gov/, Gordon et al. 2017)
- their TE annotation from panTEannot
SAIDI SOMIA's avatar
SAIDI SOMIA committed
- and a TE library built <em>denovo</em> with TEdenovo pipeline from REPET v3.0 (https://urgi.versailles.inra.fr/Tools/REPET, Flutre et al. 2011) on Bd21 v3.2 genome.
Raphaël Flores's avatar
Dev
Raphaël Flores committed

```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
conda activate snakemake
Raphaël Flores's avatar
Dev
Raphaël Flores committed
nohup ./run_Snakemake.sh &> panREPET.log &
```

## Build DAG in png

To visualize your DAG (Directed Acyclic Graph):
```
snakemake --forceall --dag --configfile config/example.yaml | dot -Tpng > dag.png
```

## How to deal with outputs

```
.
├── benchmarks
SAIDI SOMIA's avatar
SAIDI SOMIA committed
├── log
├── extractCopies
├── extendedGFF
├── extendedFasta
├── minimap2
├── matcher
Raphaël Flores's avatar
Dev
Raphaël Flores committed
├── bestHits
├── clique
│   └── filter80
│       ├── cliques_copy_filter80.tsv
│       ├── cliques_stats_filter80.tsv
│       └── filter80.stats
├── uniqueCopies
│   └── filter80
SAIDI SOMIA's avatar
SAIDI SOMIA committed
│       └── concat_uniqueCopies_filter80.tsv
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```

SAIDI SOMIA's avatar
SAIDI SOMIA committed
`cliques_copy_filter80.tsv :` this output gives in each line a TE copy belonging to an accession and which has been cliqued (i.e. a copy shared with at least one other accession). The id column indicates the clique ID, i.e which TE copies are shared by the same accessions.
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
id      accession       chromosome      start   end    copy_name       consensus       clique_size    pangenomic_compartment
Raphaël Flores's avatar
Dev
Raphaël Flores committed
1       genome1     chr1      1   3000   {copy_id}_{consensus_name}    consensus_name   2       core
1       genome2     chr1    100 3300 {copy_id}_{consensus_name}    consensus_name   2       core
```

`cliques_stats_filter80.tsv :` this output gives in each line a clique (i.e. a shared TE insertion).
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
id      clique_size    core/dispensable        pangenomic_compartment  accessions      clique
Raphaël Flores's avatar
Dev
Raphaël Flores committed
1       2       core    core    ['genome1', 'genome2']  ['genome1/chr1/{copy_id}_{consensus_name}/{consensus_name}/1-3000', 'genome2/chr1/{copy_id}_{consensus_name}/{consensus_name}/100-3300']

```

`filter80.stats :` Statistics about cliques (i.e. shared TE insertions).

`concat_uniqueCopies_filter80.tsv :` gff file containing TE copies that could not be clicked on (i.e. unique TE insertions which are not shared). The last column gives the name of the accession to which the TE copy belongs.
Johann Confais's avatar
Johann Confais committed



![dag](dag.png "dag")