BroCOLI Tutorials

Welcome to the BroCOLI Tutorials. This section provides detailed examples and guides to help you understand and use BroCOLI effectively.

%%{init: {'themeVariables': { 'fontSize': '20px' }}}%%
graph LR
  classDef input fill:#e3f2fd,stroke:#1565c0,stroke-width:2px,color:black;
  classDef process fill:#fff9c4,stroke:#fbc02d,stroke-width:2px,color:black;
  classDef result fill:#e8f5e9,stroke:#2e7d32,stroke-width:2px,color:black;

  A[raw fastq.gz]:::input
  C[new fastq]:::process
  B{sorted.sam}:::process
  E[gene & transcript counts]:::result

  A --> |"Bulk: minimap2"| B
  A --> |"SC: pre-BroCOLI"| C
  C --> |"SC: minimap2"| B
  B --> |"BroCOLI_bulk"| E
  B --> |"BroCOLI_sc"| E

  linkStyle 0,3 stroke:#ff9800,stroke-width:4px;
  linkStyle 1,2,4 stroke:#9c27b0,stroke-width:4px;

BroCOLI input files

To run BroCOLI, you should provide:

Sorted SAM file(s). FASTQ (or FASTQ.gz) reads must first be aligned with minimap2 and the alignments sorted with samtools.
Reference sequence in FASTA format.
Optionally, a reference gene annotation in GTF format (recommended).

BroCOLI General Usage

Bulk

Step 1 Mapping of the FASTQ files with minimap2

minimap2 -ax splice -ub --secondary=no -t 20 ref.fasta raw.fastq.gz > raw.sam
samtools sort -@ 20 -o raw_sorted.sam raw.sam

Note

The SAM files produced by mapping must be sorted with samtools before running BroCOLI.
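As a quick sanity check before running BroCOLI, the sort order recorded in a SAM header can be inspected. The sketch below is only an illustration: the helper name sam_sort_order is hypothetical, and it assumes that the file carries an @HD header line with an SO tag, which samtools sort normally writes (SO:coordinate for the default sort). It is not part of BroCOLI.

```python
import tempfile
from pathlib import Path


def sam_sort_order(sam_path):
    """Return the SO value of the @HD header line, or None if it is absent.

    Hypothetical helper: it only reads the header and never touches
    the alignment records.
    """
    with open(sam_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.startswith("@"):
                break  # header section is over and no @HD line was seen
            if line.startswith("@HD"):
                for field in line.rstrip("\n").split("\t")[1:]:
                    if field.startswith("SO:"):
                        return field[3:]
                return None  # @HD present but without an SO tag
    return None


# Demonstration on a made-up header; real files come from samtools sort.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_sorted.sam"
    demo.write_text("@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n")
    demo_order = sam_sort_order(demo)
    print(demo_order)
```

A file whose header reports anything other than coordinate order (or no order at all) is a candidate for re-sorting with the samtools sort command shown above.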

Recommended parameters for noisy cDNA data

For noisy 1D cDNA Nanopore data, the developer of minimap2 suggests adding -k14 and -w 4:

minimap2 -ax splice -ub -k14 -w 4 --secondary=no -t 20 ref.fasta raw.fastq.gz > raw.sam

A BED file derived from the annotation can be provided to assist the mapping:

paftools.js gff2bed anno.gtf > junctions.bed
minimap2 -ax splice -ub -k14 -w 4 --junc-bed junctions.bed --secondary=no -t 20 ref.fasta raw.fastq.gz > raw.sam
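The three mapping variants above differ only in the optional -k14 -w 4 and --junc-bed flags. A minimal sketch of how a wrapper script might assemble them is given below; the function minimap2_command and its parameter names are hypothetical, and only flags that appear in the commands above are used.

```python
import shlex


def minimap2_command(ref, reads, threads=20, noisy=False, junc_bed=None):
    """Build the minimap2 argument list for spliced long-read mapping.

    Hypothetical helper: 'noisy' adds the -k14 -w 4 settings suggested
    for noisy 1D cDNA data, 'junc_bed' adds an annotation-derived BED file.
    """
    cmd = ["minimap2", "-ax", "splice", "-ub"]
    if noisy:
        cmd += ["-k14", "-w", "4"]
    if junc_bed:
        cmd += ["--junc-bed", junc_bed]
    cmd += ["--secondary=no", "-t", str(threads), ref, reads]
    return cmd


basic = minimap2_command("ref.fasta", "raw.fastq.gz")
guided = minimap2_command(
    "ref.fasta", "raw.fastq.gz", noisy=True, junc_bed="junctions.bed"
)
print(shlex.join(basic))
print(shlex.join(guided))
```

The returned list can be passed to subprocess.run with stdout pointing at an open file such as raw.sam, which mirrors the shell redirection used in the commands above.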

Step 2 Transcript identification and quantification
For a single SAM file, use the -s parameter to specify its absolute path (i).
For multiple files, set the -s parameter to the directory containing the sorted SAM files (ii). Alternatively, you can provide a TXT/TSV file listing the absolute path to each input SAM file on a separate line. The output order will correspond to the order listed in the file (iii).

(i)                     (ii)                    (iii)    
input_reads.sam     ─── input_directory      ─── input.txt(.tsv)
                        ├── sample0.sam          ├── sample0.sam
                        └── sample1.sam          ├── sample1.sam
                                                 └── sample2.sam

./BroCOLI_bulk -s sam_files_path -f ref.fasta -g GTF.gtf -o output_path
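For input mode (iii), the list file can be written by hand or generated. The following is a minimal sketch, assuming that all sorted SAM files sit in one directory and end in .sam; the helper name write_sam_list and the file names in the demonstration are hypothetical. Files are listed in sorted name order, so the output columns would follow that order.

```python
import tempfile
from pathlib import Path


def write_sam_list(sam_dir, list_path):
    """Write the absolute path of every *.sam file in sam_dir, one per line."""
    sam_files = sorted(Path(sam_dir).resolve().glob("*.sam"))
    if not sam_files:
        raise FileNotFoundError(f"no .sam files found in {sam_dir}")
    Path(list_path).write_text("".join(f"{path}\n" for path in sam_files))
    return sam_files


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sam_dir = Path(tmp) / "input_directory"
    sam_dir.mkdir()
    for name in ("sample1.sam", "sample0.sam", "notes.txt"):
        (sam_dir / name).touch()
    list_file = Path(tmp) / "input.txt"
    write_sam_list(sam_dir, list_file)
    listed = list_file.read_text().splitlines()
    print(listed)
```

The resulting input.txt would then be passed to the -s parameter in place of a single file or a directory.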

Single cell and spatial

Step 1 Processing FASTQ files with preBroCOLI

./preBroCOLI -q visium -p 20 -w barcode_whitelist.txt raw.fastq > new.fastq

-q indicates the data type, for example visium, 10x3v3, or magicseq.
-p represents the number of threads.
-w is the whitelist of cell barcodes. Either (i) provide a TSV file containing only barcodes, such as the filtered whitelist generated by Cell Ranger, or (ii) use the provided ext_bc_and_umi.py to obtain a whitelist that includes both barcodes and UMIs. In mode (ii), UMI correction is performed automatically.

ext_bc_and_umi.py

python ext_bc_and_umi.py --bam cellranger_processed.bam -f cellranger_filter_barcodes.tsv -o sc_bc_umi.txt

You can also use Flexiplex for preprocessing

You can visit its GitHub page to learn more about its detailed usage.

First, assign the reads (short reads or single-cell long reads) to cell barcodes.

flexiplex -d 10x3v3 -p 20 -k cellRangerbarcodes.tsv raw.fastq > new_reads.fastq

Second, mapping.

minimap2 -ax splice -ub -k14 -w 4 --secondary=no -t 20 ref.fasta new_reads.fastq > new_reads.sam
samtools sort -@ 20 -o new_reads_sorted.sam new_reads.sam

You can also use Sicelore-2.1 for preprocessing

You can visit its GitHub page to learn more about its detailed usage. Before you run Sicelore, you need to set up the Java environment it requires.

First, scan the Nanopore reads to assign cell barcodes.

java -jar -Xmx80g <path>/NanoporeBC_UMI_finder-2.1.jar scanfastq -d <directory to start recursive search for fastq files> -o outPutDirectory --bcEditDistance 1 --cellRangerBCs cellRangerbarcodes.tsv

The --cellRangerBCs parameter is optional. If Illumina data is available, a TSV file containing cell barcodes (e.g., from Cell Ranger) can be provided, which will improve the accuracy of barcode identification.

Second, mapping.

minimap2 -ax splice -ub -k14 -w 4 --junc-bed junctions.bed --sam-hit-only --secondary=no -t 20 ref.fasta <fastq.gz path> > raw.sam
samtools view -bS -@ 20 raw.sam > raw.bam
samtools sort -@ 20 -o raw_sorted.bam raw.bam
samtools index -@ 20 raw_sorted.bam

Third, UMI assignment.

java -jar -Xmx80g <path>/NanoporeBC_UMI_finder-2.1.jar assignumis --inFileNanopore raw_sorted.bam --outfile raw_sorted_umi.bam --ONTgene GE --annotationFile GTF.gtf

The BAM file produced by cell barcode and UMI assignment must then be converted to a SAM file (for example, with samtools view -h) to serve as input for BroCOLI.

Step 2 Mapping of the FASTQ files with minimap2

The processing is identical to Step 1 in the bulk workflow, except that the new FASTQ file produced in Step 1 is used as input.

Step 3 Transcript identification and quantification

The input is specified in the same way as in the bulk workflow.

(i)                     (ii)                    (iii)    
input_reads.sam     ─── input_directory      ─── input.txt(.tsv)
                        ├── sample0.sam          ├── sample0.sam
                        └── sample1.sam          ├── sample1.sam
                                                 └── sample2.sam

./BroCOLI_sc -s sam_files_path -f ref.fasta -g GTF.gtf -o output_path

Examples

Simple test

Bulk: SIRV4 dataset

./BroCOLI_bulk -t 1 -s example/example_SIRV.sam -g example/example_SIRV.gtf -f example/example_SIRV.fasta -o TestResult

Single cell

./BroCOLI_sc

All Arguments

./BroCOLI_bulk -h
./BroCOLI_sc -h

Advanced testing of BroCOLI can be performed using the following parameters:

Arguments: 
-s, --sam
      SAM file path. We recommend using absolute paths. If you have a single file, you can directly provide its absolute path. If you have multiple files, you can specify the path to a folder that contains all the sorted SAM files you want to process. (required)

-f, --fasta
      FASTA file path. The chromosome names in the FASTA file must match those in the GTF file. (required)

-o, --output
      output folder path. (required)

-g, --gtf
      input annotation file in GTF format. (optional, recommended)

-n, --support 
      minimum number of reads that must perfectly support all splice junctions of a novel isoform. (optional, default:2)

-j, --SJDistance
      the minimum distance that is treated as an intron. (optional, default:18)

-e, --single_exon_boundary
      the boundary range within which a read is considered to belong to a single-exon isoform. (optional, default:60)

-d, --graph_distance
      the distance threshold for constructing the isoform candidate distance graph. (optional, default:60)

-t, --thread
      thread number (optional, default:8).

-h, --help
      show this help information.
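
Because the -f option requires FASTA chromosome names to match the GTF file, it can help to compare the two before a long run. The sketch below is an illustration only and is not part of BroCOLI: the helper names are hypothetical, and it assumes that a FASTA sequence name is the first whitespace-delimited token of the header line and that the GTF sequence name is the first tab-separated column.

```python
import tempfile
from pathlib import Path


def fasta_names(fasta_path):
    """Collect sequence names (first token after '>') from a FASTA file."""
    names = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                if fields:
                    names.add(fields[0])
    return names


def gtf_names(gtf_path):
    """Collect sequence names (first column) from a GTF file."""
    names = set()
    with open(gtf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip comments and blank lines
            names.add(line.split("\t", 1)[0])
    return names


def missing_from_fasta(fasta_path, gtf_path):
    """Return GTF sequence names that have no counterpart in the FASTA."""
    return sorted(gtf_names(gtf_path) - fasta_names(fasta_path))


# Demonstration with made-up files: 'chr1' matches, '3' does not.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "ref.fasta"
    gtf = Path(tmp) / "anno.gtf"
    fasta.write_text(">chr1 test sequence\nACGT\n>chr2\nGGCC\n")
    gtf.write_text(
        "#!comment line\n"
        "chr1\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g1\";\n"
        "3\tsrc\texon\t1\t4\t.\t+\t.\tgene_id \"g2\";\n"
    )
    missing = missing_from_fasta(fasta, gtf)
    print(missing)
```

An empty result suggests the naming schemes agree; a non-empty one typically points to a mix of conventions such as "chr1" versus "1".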