INSERT-seq is a bioinformatics pipeline that can be used to analyse the landscape of genomic integrations from samples sequenced following INSERT-seq library prep protocol.
A command line version of the pipeline is available to download in a Bitbucket repository.
- 1. Adapter sequence before UMI: Write the sequence of the adapter part before the UMI.
- 2. Ontarget primer sequence: Write the on-target primer to detect the integrated sequence and trim it.
- 3. Upload fastq file: Upload one file per sample. The file name must have the form SampleID.fastq. The file must contain basecalled Oxford Nanopore reads obtained by following the INSERT-seq library-prep protocol. For each sample, provide the adapter sequence and the ontarget sequence in the previous fields.
- 4. Select the reference organism. Currently only humand and mouse genomes are available.
- 5. Provide the second adapter sequence, the part of the adapter after the UMI.
- 6. You can change the error rate for primer and adapter detection, by default 0.2 is used.
- 7. Choose if the UMI clustering step is required, only if you have used UMIs during the library-prep.
- 8. If the UMI clustering step is required, you can select the UMI type. Currently only TV (TTTVVVVTTVVVVTTVVVVTTVVVVTTT) and NYR (NNNYRNNNYRNNNYRNNN) UMIs are available.
- 9 & 10. Select or modify the maximum and minimum UMI length. By default the UMI length +/- 1 bp is used.
- 11. Modify the clustering identity value, by default 0.99.
- 12. Modify the minimum UMI bin size, by default 1, all UMIs are selected regardless of the UMI bin size.
After the library-pre protocol and Oxford Nanopore sequencing, the reads will have the following structure.
Sequencing adapters are ligated at both ends.
The UMI & barcode adapter used for PCR amplifications will be presend in one end, consisting of three main parts:
- A. The primer binding site (complementary to the sequencing adapter) and the barcode sequence.
- UMI. The unimolecular identifyer sequence.
- B. The target binding site (complementary between PCR 1 and PCR2 adapters).
This adapter is bound to the genome and at the other end of the read we will find the integration site.
- C. The sequence corresponding to the paylod (or integration site) right next to the juntion is required by the analysis pipeline in the section "Ontarget primer sequence" of the input.
When a run end, a compressed file will be generated for you to download the results.
Results are organized in folders.
- FILTERED. Contains quality filtered and trimmed reads.
- MAPPED. Contains SAM and BAM files for read alignment inspection. Alignments can be visualized with any genome browser program such as IGV or Geneious.
- *_peaks.bed Bed file containing sigfinicant insertions. The columns correspond to chromosome, start, end, number of reads per insertion, DHD distance to a reference peak.
- *_filtered_peakSite.bed Bed file containing significant insertions with only the predicted insertion point instead of a range.
- *_prePeakCalling_sorted_insertions.bed Bed file with all genomic regions with mapped reads, contains significant and non significant insertions or noise.
- STATS. Contains one folder per sample with histograms of raw read length or Nanopore sequencing quality.
- UMI. Containes extracted UMIs from each read and the consensus reads after UMI clustering.
- OUTPUT. Contains plots and tables of gene feature annotations and unanchored peak calling.
- *_unanchored_peaks.bed Bed file containing unanchored insertions.