Skip to content
Snippets Groups Projects
README.md 9.72 KiB
Newer Older
Raphaël Flores's avatar
Dev
Raphaël Flores committed
# panREPET

Johann Confais's avatar
Johann Confais committed
panREPET is described in this preprint available on [BioRxiv](https://doi.org/10.1101/2024.06.17.598857)
Raphaël Flores's avatar
Dev
Raphaël Flores committed
## Description

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Transposable elements (TEs) are mobile DNA elements that can invade genomes by transposition. Despite their reputation as parasitic sequences, they can enrich genomes with functional novelties that foster genome evolution. The impact of TEs in a genome can be explored by searching for their insertions. Individuals of the same species independently undergo TE insertions causing inter-individual genetic variability. This variability between individuals is the basis of the natural selection that leads to an increased adaptation of individuals to their environment. A way to search for the potential role of TEs in host adaptation is through a pangenomic approach. The TE pangenome can be described by (i) TE insertions present in all individuals of the species (core-genome), (ii) insertions present only among a subset of individuals (dispensable-genome) or (iii) ecogenome when the individuals share the same environment, and finally (iv) individual-specific insertions.
SAIDI SOMIA's avatar
SAIDI SOMIA committed

SAIDI SOMIA's avatar
SAIDI SOMIA committed
A majority of current pangenomic analysis methods are based on the alignment of reads from different genomes of a species to an assembled reference genome. But, the advent of the third-generation sequencing makes now possible to better approach this question using **multiple </m>de novo</em> assembled genomes** of the same species to avoid the bias introduced by a single reference genome. panREPET is a new pipeline to take in charge this type of data. It compares TE copies between each pair of genomes then identifies **TE copies shared by a group of individuals**. As it work on multiple assembled genomes, it allows accessing to exact genomic contexts of TE copies and their exact sequences.
Raphaël Flores's avatar
Dev
Raphaël Flores committed

SAIDI SOMIA's avatar
SAIDI SOMIA committed
![panREPET_figure1](panREPET.png)

Raphaël Flores's avatar
Dev
Raphaël Flores committed
## Installation

Install [Conda](https://docs.conda.io/en/latest/miniconda.html) (>=4.12.0)

Install Snakemake via [mamba](https://snakemake.readthedocs.io/en/stable/getting_started/installation.html) (mamba >=0.22.1, snakemake >=7.3.8)

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Install [Singularity](https://docs.sylabs.io/guides/3.0/user-guide/installation.html) (>=3.8.7-1.el7)

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Load the TEfinder2.31 Singularity Image File (SIF) containing the Matcher tool:
```
cd containers
conda activate snakemake
singularity pull --arch amd64 library://hquesneville/default/te_finder:2.31
```
For more details about TEfinder and Matcher tools: Quesneville, H., Nouaud, D. & Anxolabéhère, D.  Detection of New Transposable Element Families in Drosophila melanogaster and Anopheles gambiae Genomes . J Mol Evol57 (Suppl 1), S50–S59 (2003). https://doi.org/10.1007/s00239-003-0007-2

Raphaël Flores's avatar
Dev
Raphaël Flores committed
## Directory tree

```
.
├── README.md
├── Snakefile
├── run_Snakemake.sh
SAIDI SOMIA's avatar
SAIDI SOMIA committed
├── scripts
│   ├── extractCopies.py
│   ├── seqIdLenFasta.py
│   ├── fastaLength.py
│   ├── reformat.py
│   ├── launchMinimap2.py
│   ├── minimap2align.py
│   ├── bestHits.py
│   ├── cliques.py
│   ├── uniqueCopies.py
│   └── stats.py
├── config
├── containers
│   └── how_install\_te-finder.txt
├── data
└── envs
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```

## How configurate the configfile

SAIDI SOMIA's avatar
SAIDI SOMIA committed
Set up the configfile in config/example.yaml:
Raphaël Flores's avatar
Dev
Raphaël Flores committed

* `output_directory`: Output directory path, the path must be absolute.
Johann Confais's avatar
Johann Confais committed
* `GFF`: Specify the path to TE annotation from your genomes of interest (GFF format, of the same shape as the GFF from [panTEannot](https://forgemia.inra.fr/urgi-anagen/panTEannot)).
Raphaël Flores's avatar
Dev
Raphaël Flores committed
* `Ref`: Specify the path to the reference sequence of your genomes. Each genome for GFF and Ref must have the same identifier.
* `Consensus`: Specify the path to the TE consensus library used for TE annotation of genomes (fasta format).

**params:**

SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `cov_consensus:` Select TE copies which cover their consensus betwen x% and (100+(100-x))% (95 by default meaning over 95-105%). The higher the coverage, the more complete copies are selected.
Raphaël Flores's avatar
Dev
Raphaël Flores committed
* `copies_type`: Specify the complete TE copy type you want to extract. Choose between **FLF or FLC (FLC by default).** The copies are classified as **Full-Length Copies (FLC)** when they are aligned (potentially with large gaps if there are insertions) over x%-(100+(100-x))% of their reference sequence and as **Full-Length Fragment copies (FLF)**, when they are unfragmented.
* `extension_length:` Flanking regions length of TE copies in base pairs (500 bp by default).
* `cov_flank:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage **of the two flankers from** its match (80% by default). This filter diminishes the number of false positive copies.
SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `cov_match:` Filter which checks that the observed TE copy and its flanking regions covers a certain defined percentage of its match (not useful, 0 by default).

Raphaël Flores's avatar
Dev
Raphaël Flores committed

**params (which are optional):**
* `select_region_bed:` Specify regions to select or to mask (bedfile). None by default. None by default means that all submitted genomes (params Ref) are considered.
* `select_type:` Select or mask a part of genomes (e.g. you can mask the centromeres). Choose between **mask or select (select by default).**
* `is_chr_level:` You can specify at which chromosomal level TE insertions should be compared (e.g. compare only same chromosomes between genomes). Caution 1: For assemblies that are not on a chromosomal scale, you must set false. Caution 2: If chromosome names are not the same between genomes, the comparison will not be possible.

**Example:**
```
output_directory = /absolute_path/my_output_directory

GFF:
   genome1: genome1.gff
   genome2: genome2.gff

Ref:
   genome1: genome1.fa
   genome2: genome2.fa

Consensus: my_TE_library.fa

params:
SAIDI SOMIA's avatar
SAIDI SOMIA committed
   cov_consensus: 95
Raphaël Flores's avatar
Dev
Raphaël Flores committed
   copies_type: FLC
   extension_length: 500
   cov_flank: 80
   cov_match: 0.0
   select_region_bed: centromeres.bed (optional)
   select_type: mask (optional)
   is_chr_level: False (optional)
```

SAIDI SOMIA's avatar
SAIDI SOMIA committed
The gff files must have this format:
```
chr1	matcher	match	30000	31000	900	-	.	ID=1;Name=consensus_name_1;Target=consensus_name_1 1 1000;Note=e-value:0.0,identity:95
chr1	matcher	match_part	30000	31000	900	-	.	ID=1.1;Parent=1;Name=consensus_name_1;Target=Bdis_TEdenovoGr-B-G6303-Map3_reversed 1 1000;Note=e-value:0.0,identity:95
```

Raphaël Flores's avatar
Dev
Raphaël Flores committed
centromeres.bed must have this format:
```
genome	chr	start	end
genome1	1	1000	5000
genome1	2	700	1000
genome2	1	1000	5000
genome2	2	800	1100
```
During the "extractCopies" step, panREPET does not extract or extracts only the TE copies in the regions specified in select_region_bed params.

**tools:**

SAIDI SOMIA's avatar
SAIDI SOMIA committed
* `container :` Singularity container containing TEfinder 2.31 path (containers/te_finder_2.31.sif)
Raphaël Flores's avatar
Dev
Raphaël Flores committed


## Build DAG in png

To visualize your DAG (Directed Acyclic Graph):
```
snakemake --forceall --dag --configfile config/example.yaml | dot -Tpng > dag.png
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
![dag](dag.png "dag")

Raphaël Flores's avatar
Dev
Raphaël Flores committed

## How to deal with outputs

```
.
├── benchmarks
SAIDI SOMIA's avatar
SAIDI SOMIA committed
├── log
├── extractCopies
├── extendedGFF
├── extendedFasta
├── minimap2
├── matcher
Raphaël Flores's avatar
Dev
Raphaël Flores committed
├── bestHits
├── clique
│   └── filter80
│       ├── cliques_copy_filter80.tsv
│       ├── cliques_stats_filter80.tsv
│       └── filter80.stats
├── uniqueCopies
│   └── filter80
SAIDI SOMIA's avatar
SAIDI SOMIA committed
│       └── concat_uniqueCopies_filter80.tsv
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```

SAIDI SOMIA's avatar
SAIDI SOMIA committed
`cliques_copy_filter80.tsv :` this output gives in each line a TE copy belonging to an accession and which has been cliqued (i.e. a copy shared with at least one other accession). The id column indicates the clique ID, i.e which TE copies are shared by the same accessions.
Raphaël Flores's avatar
Dev
Raphaël Flores committed
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
id      accession       chromosome      start   end    copy_name       consensus       clique_size    pangenomic_compartment
Raphaël Flores's avatar
Dev
Raphaël Flores committed
1       genome1     chr1      1   3000   {copy_id}_{consensus_name}    consensus_name   2       core
1       genome2     chr1    100 3300 {copy_id}_{consensus_name}    consensus_name   2       core
```

`cliques_stats_filter80.tsv :` this output gives in each line a clique (i.e. a shared TE insertion).
```
SAIDI SOMIA's avatar
SAIDI SOMIA committed
id      clique_size    core/dispensable        pangenomic_compartment  accessions      clique
Raphaël Flores's avatar
Dev
Raphaël Flores committed
1       2       core    core    ['genome1', 'genome2']  ['genome1/chr1/{copy_id}_{consensus_name}/{consensus_name}/1-3000', 'genome2/chr1/{copy_id}_{consensus_name}/{consensus_name}/100-3300']

```

`filter80.stats :` Statistics about cliques (i.e. shared TE insertions).

`concat_uniqueCopies_filter80.tsv :` gff file containing TE copies that could not be clicked on (i.e. unique TE insertions which are not shared). The last column gives the name of the accession to which the TE copy belongs.
SAIDI SOMIA's avatar
SAIDI SOMIA committed
## Execution
Johann Confais's avatar
Johann Confais committed

SAIDI SOMIA's avatar
SAIDI SOMIA committed
You can execute an example based on data you can find in data folder composed of:
- 2 whole-genome assemblies of Brachypodium distachyon from [Phytozome database](https://phytozome.jgi.doe.gov/) (Gordon _et al._ 2017) : [data/Bdis_TEdenovoGr.fa](https://phytozome.jgi.doe.gov/info/556) (Bd21 v3.2) and [data/ABR2_337.fa.formated](https://phytozome.jgi.doe.gov/info/337)
- their TE annotation from panTEannot
- a TE library built denovo with TEdenovo pipeline from REPET v3.0 (https://urgi.versailles.inra.fr/Tools/REPET, Flutre _et al._ 2011) on Bd21 v3.2 genome (data/Bdis_TEannotGr_chr_allTEs_nr_noSSR_join_path.annotStatsPerTE_FullLengthCopy.fa)

Please specify in run_Snakemake.sh file, the argument `--singularity-args '--bind /home/myhome'` if necessary.

```
conda activate snakemake
nohup ./run_Snakemake.sh &> test.log &
```

Expected results are in results folder (results/clique/filter80/cliques_copy_filter80.tsv and results/clique/filter80/cliques_stats_filter80.tsv). The test execution took 45 minutes on 8 cores 16 RAM (specify --cores 8 --resources mem_gb=16). However, note that minimap2 rule is not parallelized per TE copy for a given accession, but is parallelized for the number of accessions compared in pairs.