README.md

# panREPET

panREPET is described in this preprint available on [BioRxiv](https://doi.org/10.1101/2024.06.17.598857)

## Description

Transposable elements (TEs) are mobile DNA elements that can invade genomes by transposition. Despite their reputation as parasitic sequences, they can enrich genomes with functional novelties that foster genome evolution. The impact of TEs in a genome can be explored by searching for their insertions. Individuals of the same species independently undergo TE insertions causing inter-individual genetic variability. This variability between individuals is the basis of the natural selection that leads to an increased adaptation of individuals to their environment. A way to search for the potential role of TEs in host adaptation is through a pangenomic approach. The TE pangenome can be described by (i) TE insertions present in all individuals of the species (core-genome), (ii) insertions present only among a subset of individuals (dispensable-genome) or (iii) ecogenome when the individuals share the same environment, and finally (iv) individual-specific insertions.

A majority of current pangenomic analysis methods are based on the alignment of reads from different genomes of a species to an assembled reference genome. But, the advent of the third-generation sequencing makes now possible to better approach this question using **multiple </m>de novo</em> assembled genomes** of the same species to avoid the bias introduced by a single reference genome. panREPET is a new pipeline to take in charge this type of data. It compares TE copies between each pair of genomes then identifies **TE copies shared by a group of individuals**. As it work on multiple assembled genomes, it allows accessing to exact genomic contexts of TE copies and their exact sequences.

![panREPET_figure1](panREPET.png)

## Installation

Install [Conda](https://docs.conda.io/en/latest/miniconda.html) (>=4.12.0)

Install Snakemake via [mamba](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) (mamba >=0.22.1, snakemake >=7.3.8)

Install [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html) (>=3.8.7-1.el7)

Load the TEfinder2.31 Singularity Image File (SIF) containing the Matcher tool:
```
cd containers
conda activate snakemake
singularity pull --arch amd64 library://hquesneville/default/te_finder:2.31
```
For more details about TEfinder and Matcher tools: Quesneville, H., Nouaud, D. & Anxolabéhère, D.  Detection of New Transposable Element Families in Drosophila melanogaster and Anopheles gambiae Genomes . J Mol Evol57 (Suppl 1), S50–S59 (2003). https://doi.org/10.1007/s00239-003-0007-2

## Directory tree

```
.
├── README.md
├── Snakefile
├── run_Snakemake.sh
├── scripts
│   ├── extractCopies.py
│   ├── seqIdLenFasta.py
│   ├── fastaLength.py
│   ├── reformat.py
│   ├── launchMinimap2.py
│   ├── minimap2align.py
│   ├── bestHits.py
│   ├── cliques.py
│   ├── uniqueCopies.py
│   └── stats.py
├── config
├── containers
│   └── how_install\_te-finder.txt
├── data
└── envs
```

## How configurate the configfile

Set up the configfile in config/example.yaml:

* `output_directory`: Output directory path, the path must be absolute.
* `GFF`: Specify the path to TE annotation from your genomes of interest (GFF format, of the same shape as the GFF from [panTEannot](https://forgemia.inra.fr/urgi-anagen/panTEannot)).
* `Ref`: Specify the path to the reference sequence of your genomes. Each genome for GFF and Ref must have the same identifier.
* `Consensus`: Specify the path to the TE consensus library used for TE annotation of genomes (fasta format).

**params:**

* `cov_consensus:` Select TE copies which cover their consensus betwen x% and (100+(100-x))% (95 by default meaning over 95-105%). The higher the coverage, the more complete copies are selected.
* `copies_type`: Specify the complete TE copy type you want to extract. Choose between **FLF or FLC (FLC by default).** The copies are classified as **Full-Length Copies (FLC)** when they are aligned (potentially with large gaps if there are insertions) over x%-(100+(100-x))% of their reference sequence and as **Full-Length Fragment copies (FLF)**, when they are unfragmented.
* `extension_length:` Flanking regions length of TE copies in base pairs (500 bp by default).
* `cov_flank:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage **of the two flankers from** its match (80% by default). This filter diminishes the number of false positive copies.
* `cov_match:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage of its match (not useful, 0 by default).


**params (which are optional):**
* `select_region_bed:` Specify regions to select or to mask (bedfile). None by default. None by default means that all submitted genomes (params Ref) are considered.
* `select_type:` Select or mask a part of genomes (e.g. you can mask the centromeres). Choose between **mask or select (select by default).**
* `is_chr_level:` You can specify at which chromosomal level TE insertions should be compared (e.g. compare only same chromosomes between genomes). Caution 1: For assemblies that are not on a chromosomal scale, you must set false. Caution 2: If chromosome names are not the same between genomes, the comparison will not be possible.

**Example:**
```
output_directory = /absolute_path/my_output_directory

GFF:
   genome1: genome1.gff
   genome2: genome2.gff

Ref:
   genome1: genome1.fa
   genome2: genome2.fa

Consensus: my_TE_library.fa

params:
   cov_consensus: 95
   copies_type: FLC
   extension_length: 500
   cov_flank: 80
   cov_match: 0.0
   select_region_bed: centromeres.bed (optional)
   select_type: mask (optional)
   is_chr_level: False (optional)
```

The gff files must have this format:
```
chr1	matcher	match	30000	31000	900	-	.	ID=1;Name=consensus_name_1;Target=consensus_name_1 1 1000;Note=e-value:0.0,identity:95
chr1	matcher	match_part	30000	31000	900	-	.	ID=1.1;Parent=1;Name=consensus_name_1;Target=Bdis_TEdenovoGr-B-G6303-Map3_reversed 1 1000;Note=e-value:0.0,identity:95
```

centromeres.bed must have this format:
```
genome	chr	start	end
genome1	1	1000	5000
genome1	2	700	1000
genome2	1	1000	5000
genome2	2	800	1100
```
During the "extractCopies" step, panREPET does not extract or extracts only the TE copies in the regions specified in select_region_bed params.

**tools:**

* `container :` Singularity container containing TEfinder 2.31 path (containers/te_finder_2.31.sif)


## Build DAG in png

To visualize your DAG (Directed Acyclic Graph):
```
snakemake --forceall --dag --configfile config/example.yaml | dot -Tpng > dag.png
```
![dag](dag.png "dag")


## How to deal with outputs

```
.
├── benchmarks
├── log
├── extractCopies
├── extendedGFF
├── extendedFasta
├── minimap2
├── matcher
├── bestHits
├── clique
│   └── filter80
│       ├── cliques_copy_filter80.tsv
│       ├── cliques_stats_filter80.tsv
│       └── filter80.stats
├── uniqueCopies
│   └── filter80
│       └── concat_uniqueCopies_filter80.tsv
```

`cliques_copy_filter80.tsv :` this output gives in each line a TE copy belonging to an accession and which has been cliqued (i.e. a copy shared with at least one other accession). The id column indicates the clique ID, i.e which TE copies are shared by the same accessions.
```
id      accession       chromosome      start   end    copy_name       consensus       clique_size    pangenomic_compartment
1       genome1     chr1      1   3000   {copy_id}_{consensus_name}    consensus_name   2       core
1       genome2     chr1    100 3300 {copy_id}_{consensus_name}    consensus_name   2       core
```

`cliques_stats_filter80.tsv :` this output gives in each line a clique (i.e. a shared TE insertion).
```
id      clique_size    core/dispensable        pangenomic_compartment  accessions      clique
1       2       core    core    ['genome1', 'genome2']  ['genome1/chr1/{copy_id}_{consensus_name}/{consensus_name}/1-3000', 'genome2/chr1/{copy_id}_{consensus_name}/{consensus_name}/100-3300']

```

`filter80.stats :` Statistics about cliques (i.e. shared TE insertions).

`concat_uniqueCopies_filter80.tsv :` gff file containing TE copies that could not be clicked on (i.e. unique TE insertions which are not shared). The last column gives the name of the accession to which the TE copy belongs.


## Execution

You can execute an example based on data you can find in data folder composed of:
- 2 whole-genome assemblies of Brachypodium distachyon from [Phytozome database](https://phytozome.jgi.doe.gov/) (Gordon _et al._ 2017) : [data/Bdis_TEdenovoGr.fa](https://phytozome.jgi.doe.gov/info/556) (Bd21 v3.2) and [data/ABR2_337.fa.formated](https://phytozome.jgi.doe.gov/info/337)
- their TE annotation from panTEannot
- a TE library built denovo with TEdenovo pipeline from REPET v3.0 (https://urgi.versailles.inra.fr/Tools/REPET, Flutre _et al._ 2011) on Bd21 v3.2 genome (data/Bdis_TEannotGr_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthCopy.fa)

Please specify in run_Snakemake.sh file, the argument `--singularity-args '--bind /home/myhome'` if necessary.

```
conda activate snakemake
nohup ./run_Snakemake.sh &> test.log &
```

Expected results are in results folder (results/clique/filter80/cliques_copy_filter80.tsv and results/clique/filter80/cliques_stats_filter80.tsv). The test execution took 45 minutes on 8 cores 16 RAM (specify --cores 8 --resources mem_gb=16). However, note that minimap2 rule is not parallelized per TE copy for a given accession, but is parallelized for the number of accessions compared in pairs.