{"body": "## Overview\n\nThe long-read analysis pipeline for PacBio Kinnex RNA-seq data follows Iso-Seq processing guidelines<sup><sub>1</sub></sup>. It is designed for per-sample execution, handling one or more Full Length Non Chimeric (FLNC) BAM files. The pipeline has been extended to align and annotate FLNC reads with isoform-level information directly within the BAM file.\n\n### Key Pipeline Steps\n\n1. **Read Clustering:** Generation of high-quality consensus transcript sequences through clustering of FLNC reads.\n2. **Alignment with pbmm2:** Alignment of the consensus transcripts and FLNC reads to the reference genome using pbmm2.\n3. **Transcript Collapsing:** Collapsing of redundant transcripts based on exon-intron structure to define a unique set of isoforms.\n4. **Isoform Classification and Filtering:** Classification and filtering of isoforms to remove potential artifacts and retain high-confidence transcripts.\n5. **Read Annotation:** Annotation of aligned FLNC reads with isoform-level information embedded in the BAM file.\n\n### Pipeline Chart\n\n![flow_chart](/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_RNA-seq_PacBio_Kinnex.png)\n\n<sub><b>1</b>: PacBio Iso-Seq Analysis Guidelines. Available at: [https://isoseq.how](https://isoseq.how)</sub>\n\n---\n\n## Read Clustering\n\nIn this step, the pipeline generates high-quality consensus transcript sequences from Full Length Non Chimeric (FLNC) reads. It accepts and processes one or multiple FLNC BAM files in a single run, producing a unified output. 
The consensus transcripts are used in downstream alignment, collapsing, and quantification.\n\n### Clustering FLNC Reads\n\n##### Generate high-quality consensus transcripts\n\n<pre class=\"code-block copy-wrapper\">\nisoseq cluster2 --singletons flnc.fofn transcripts.bam\n</pre>\n\nArguments:\n\n- *flnc.fofn*: a file-of-filenames (FOFN) listing the FLNC BAM files to be processed together.\n- *-\\-singletons*: includes low-abundance transcripts supported by fewer than two FLNC reads, which are typically excluded. This allows retention of rare transcript isoforms that might still be biologically relevant.\n\n### Implementation with IsoSeq\n\nThe pipeline uses [IsoSeq](https://github.com/PacificBiosciences/IsoSeq) version 4.2.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [isoseq_cluster2.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_cluster2.sh) [cluster2]\n\n---\n\n## Alignment with pbmm2\n\nThe pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. An integrity check is then performed on the resulting BAM file.\n\npbmm2 is used to align both the Full Length Non Chimeric (FLNC) reads and the consensus transcripts generated by cluster2.\n\n### Aligning and Sorting\n\n##### Align and sort reads\n\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --preset ISOSEQ --sort --strip --unmapped reference.fasta unaligned.bam aligned.bam\n</pre>\n\nArguments:\n\n- *-\\-preset*: use parameters optimized for Iso-Seq data (ISOSEQ).\n- *-\\-sort*: sort the aligned reads by genomic coordinates.\n- *-\\-strip*: remove extraneous tags if present in the input BAM file. 
Tags removed: `dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st`.\n- *-\\-unmapped*: retain unmapped reads.\n\n### Integrity Check\n\nTo confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.\n\n### Implementation with pbmm2\n\nThe pipeline uses [pbmm2](https://github.com/PacificBiosciences/pbmm2) version 1.13.0, which wraps [minimap2](https://github.com/lh3/minimap2) version 2.26. Note that pbmm2 sets some defaults that may differ from those of standard minimap2.\n\nDefaults set by pbmm2 for minimap2:\n\n- Soft clipping is enabled with `-Y`.\n- Long cigars for the `CG` tag are set using `-L`.\n- X/= cigars are used instead of M with `--eqx`.\n- Overlapping query intervals are removed by trimming repeated matches.\n- Secondary alignments are excluded with `--secondary=no`.\n\n*Note: Due to multi-threading, the output alignment ordering can differ between runs with the same input parameters. The same can occur even with option -\\-sort for records that align to the same target sequence, the same position within that target, and in the same orientation, which are the only fields that samtools sort uses.*\n\n---\n\n## Transcript Collapsing\n\nIn this step, the pipeline merges redundant consensus transcripts that align to the same genomic loci. Transcripts with identical exon\u2013intron structures are collapsed into a single representative transcript model. 
The output includes unique isoforms in GFF format, a FASTA sequence file, and several supporting metric files.\n\nBoth the aligned consensus reads and the original Full Length Non Chimeric (FLNC) reads are used to determine transcript structure and quantify read support.\n\n### Collapsing Consensus Transcripts\n\n##### Collapse redundant transcripts into unique isoforms\n\n<pre class=\"code-block copy-wrapper\">\nisoseq collapse aligned_transcripts.bam flnc.bam collapsed_isoforms.gff\n</pre>\n\nArguments:\n\n- *flnc.bam*: original FLNC reads used to assess transcript support by counting the number of reads mapped to each isoform.\n\n*Note: In addition to the GFF, the output includes a TXT file with read-to-isoform mappings, a TXT file listing transcript support statistics (FLNC counts), and a JSON file with detailed metrics. These files are required for downstream annotation and quality control.*\n\n### Implementation with IsoSeq\n\nThe pipeline uses [IsoSeq](https://github.com/PacificBiosciences/IsoSeq) version 4.2.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [isoseq_collapse.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_collapse.sh) [collapse]\n\n---\n\n## Isoform Classification and Filtering\n\nIn this step, the pipeline uses Pigeon to classify collapsed isoforms and filter out potential artifacts. Pigeon categorizes isoforms based on their splice junctions and structural similarity to known reference annotations. 
This classification helps in identifying known, novel, and potentially artifactual transcripts.\n\nThe output includes a set of high-confidence isoforms, along with detailed classification metrics and annotations.\n\n### Classifying and Filtering Isoforms\n\n##### Prepare the reference annotation and genome\n\n<pre class=\"code-block copy-wrapper\">\npigeon prepare annotation.gtf reference.fasta\n</pre>\n\nThis command sorts and indexes the GTF annotation and genome FASTA files for compatibility with Pigeon.\n\n*Note: The annotation file used is the GENCODE comprehensive gene annotations. For more detailed information please refer to the GENCODE documentation under \u201cGenome Annotations\u201d section.*\n\n#####  Prepare the collapsed isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon prepare collapsed_isoforms.gff\n</pre>\n\nThis command sorts and indexes the collapsed isoforms generated in the previous step for compatibility with Pigeon.\n\n##### Classify isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon classify \\\n  sorted_isoforms.gff \\\n  annotation.gtf \\\n  reference.fasta  \\\n  --fl flnc_count.txt \\\n  --cage-peak refTSS.bed \\\n  --poly-a polyA.list\n</pre>\n\nArguments:\n\n- *isoform, annotation, and reference input files must be preprocessed in the prepare step.*\n- *-\\-fl*: file with Full Length Non Chimeric (FLNC) read counts from the collapsing step. Required to include read support in the classification output.\n- *-\\-cage-peak*: BED file with CAGE peaks information. Used to improve annotation of transcription start sites (TSS).\n- *-\\-poly-a*: file in Pigeon custom format with polyA motifs. 
Used to improve annotation of polyA sites.\n\n*Note: The refTSS.bed and polyA.list files used by the pipeline are provided by PacBio as part of their reference resource bundle and can be downloaded [here](https://downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA).*\n\n##### Filter high-confidence isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon filter classification.txt --isoforms sorted_isoforms.gff\n</pre>\n\nArguments:\n\n- *-\\-isoforms*: enables generation of a filtered GFF file as an additional output. The input isoform file must be preprocessed in the prepare step.\n\nThis command filters the classification output to retain high-confidence isoforms.\n\n### Implementation with Pigeon\n\nThe pipeline uses [Pigeon](https://github.com/PacificBiosciences/Pigeon) version 1.3.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [pigeon_prepare.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_prepare.sh) [prepare]\n  - [pigeon_classify.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_classify.sh) [classify]\n  - [pigeon_filter.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_filter.sh) [filter]\n\n---\n\n## Read Annotation\n\nIn this step, the pipeline annotates individual Full Length Non Chimeric (FLNC) reads with the isoform-level classification generated in the previous step. This allows downstream analyses to trace high-confidence isoforms back to the specific supporting reads.\n\nThe annotation is performed using a custom in-house script that lifts isoform classification to the read level.\n\n### Annotation Tags\n\nReads are annotated in the BAM format using the following custom tags:\n\n| Tag        | Format | Description |\n|------------|--------|-------------|\n| `in:Z:`    | string | Isoform ID. |\n| `sc:Z:`    | string | Structural category. 
One of: `full-splice_match`, `incomplete-splice_match`, `novel_in_catalog`, `novel_not_in_catalog`, `genic`, `antisense`, `fusion`, `intergenic`, `genic_intron`. |\n| `gn:Z:`    | string | Associated reference gene name. |\n| `tn:Z:`    | string | Associated reference transcript name. |\n| `sb:Z:`    | string | Subcategory for additional splicing information. Values may include `mono-exon`, `multi-exon`, and `intron_retention` (separated by semicolons). |\n| `ct:i:`    | int    | Total number of reads supporting the isoform. |\n\n### Annotating FLNC Reads by Isoform Class\n\n##### Annotate FLNC reads\n\n<pre class=\"code-block copy-wrapper\">\nFLNC_ImportTags.py \\\n  --input_flnc aligned_flnc.bam \\\n  --output_flnc annotated_flnc.bam \\\n  --read_stat read_stat.txt \\\n  --classification filtered_classification.txt \\\n  --index\n</pre>\n\n**Arguments:**\n\n- *-\\-input_flnc*: input BAM file containing aligned FLNC reads to annotate.\n- *-\\-read_stat*: file from the collapsing step with read-to-isoform mappings (read_stat).\n- *-\\-classification*: classification file from the filtering step.\n- *-\\-index*: flag to index the output BAM file. 
Requires the reads to be sorted.\n\n### Implementation\n\nThe annotation step is implemented using a custom Python script maintained in-house.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n- [FLNC_ImportTags.py](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/utils/FLNC_ImportTags.py) [FLNC_ImportTags.py]\n", "title": "PacBio Kinnex (long-read)", "status": "open", "options": {"filetype": "md", "collapsible": false, "default_open": true, "convert_ext_links": true, "initial_header_level": 2}, "consortia": [{"display_title": "SMaHT", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "status": "open", "@type": ["Consortium", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "identifier": "long-read_rna-seq_pacbio_kinnex", "date_created": "2026-01-09T20:07:19.308154+00:00", "section_type": "Page Section", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-05-20T19:29:08.157252+00:00"}, "schema_version": "1", "submission_centers": [{"@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "display_title": "HMS DAC", "status": "open", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "@type": ["SubmissionCenter", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/static-sections/22b22be8-8380-4407-a952-c8106d70c4bc/", "@type": ["StaticSection", "UserContent", "Item"], "uuid": "22b22be8-8380-4407-a952-c8106d70c4bc", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "PacBio Kinnex (long-read)", "content_as_html": "<div><h2>Overview</h2>\n<p>The long-read analysis pipeline for PacBio Kinnex RNA-seq data follows Iso-Seq processing guidelines<sup><sub>1</sub></sup>. 
It is designed for per-sample execution, handling one or more Full Length Non Chimeric (FLNC) BAM files. The pipeline has been extended to align and annotate FLNC reads with isoform-level information directly within the BAM file.</p>\n<h3>Key Pipeline Steps</h3>\n<ol>\n<li><strong>Read Clustering:</strong> Generation of high-quality consensus transcript sequences through clustering of FLNC reads.</li>\n<li><strong>Alignment with pbmm2:</strong> Alignment of the consensus transcripts and FLNC reads to the reference genome using pbmm2.</li>\n<li><strong>Transcript Collapsing:</strong> Collapsing of redundant transcripts based on exon-intron structure to define a unique set of isoforms.</li>\n<li><strong>Isoform Classification and Filtering:</strong> Classification and filtering of isoforms to remove potential artifacts and retain high-confidence transcripts.</li>\n<li><strong>Read Annotation:</strong> Annotation of aligned FLNC reads with isoform-level information embedded in the BAM file.</li>\n</ol>\n<h3>Pipeline Chart</h3>\n<p><img alt=\"flow_chart\" src=\"/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_RNA-seq_PacBio_Kinnex.png\" /></p>\n<p><sub><b>1</b>: PacBio Iso-Seq Analysis Guidelines. Available at: <a href=\"https://isoseq.how\" target=\"_blank\" rel=\"noopener noreferrer\">https://isoseq.how</a></sub></p>\n<hr />\n<h2>Read Clustering</h2>\n<p>In this step, the pipeline generates high-quality consensus transcript sequences from Full Length Non Chimeric (FLNC) reads. It accepts and processes one or multiple FLNC BAM files in a single run, producing a unified output. 
The consensus transcripts are used in downstream alignment, collapsing, and quantification.</p>\n<h3>Clustering FLNC Reads</h3>\n<h5>Generate high-quality consensus transcripts</h5>\n<pre class=\"code-block copy-wrapper\">\nisoseq cluster2 --singletons flnc.fofn transcripts.bam\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>flnc.fofn</em>: a file-of-filenames (FOFN) listing the FLNC BAM files to be processed together.</li>\n<li><em>--singletons</em>: includes low-abundance transcripts supported by fewer than two FLNC reads, which are typically excluded. This allows retention of rare transcript isoforms that might still be biologically relevant.</li>\n</ul>\n<h3>Implementation with IsoSeq</h3>\n<p>The pipeline uses <a href=\"https://github.com/PacificBiosciences/IsoSeq\" target=\"_blank\" rel=\"noopener noreferrer\">IsoSeq</a> version 4.2.0.</p>\n<h3>Source Code</h3>\n<p>All the relevant code can be accessed in the GitHub repository:</p>\n<ul>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_cluster2.sh\" target=\"_blank\" rel=\"noopener noreferrer\">isoseq_cluster2.sh</a> [cluster2]</li>\n</ul>\n<hr />\n<h2>Alignment with pbmm2</h2>\n<p>The pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. 
An integrity check is then performed on the resulting BAM file.</p>\n<p>pbmm2 is used to align both the Full Length Non Chimeric (FLNC) reads and the consensus transcripts generated by cluster2.</p>\n<h3>Aligning and Sorting</h3>\n<h5>Align and sort reads</h5>\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --preset ISOSEQ --sort --strip --unmapped reference.fasta unaligned.bam aligned.bam\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>--preset</em>: use parameters optimized for Iso-Seq data (ISOSEQ).</li>\n<li><em>--sort</em>: sort the aligned reads by genomic coordinates.</li>\n<li><em>--strip</em>: remove extraneous tags if present in the input BAM file. Tags removed: <code>dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st</code>.</li>\n<li><em>--unmapped</em>: retain unmapped reads.</li>\n</ul>\n<h3>Integrity Check</h3>\n<p>To confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.</p>\n<h3>Implementation with pbmm2</h3>\n<p>The pipeline uses <a href=\"https://github.com/PacificBiosciences/pbmm2\" target=\"_blank\" rel=\"noopener noreferrer\">pbmm2</a> version 1.13.0, which wraps <a href=\"https://github.com/lh3/minimap2\" target=\"_blank\" rel=\"noopener noreferrer\">minimap2</a> version 2.26. 
Note that pbmm2 sets some defaults that may differ from those of standard minimap2.</p>\n<p>Defaults set by pbmm2 for minimap2:</p>\n<ul>\n<li>Soft clipping is enabled with <code>-Y</code>.</li>\n<li>Long cigars for the <code>CG</code> tag are set using <code>-L</code>.</li>\n<li>X/= cigars are used instead of M with <code>--eqx</code>.</li>\n<li>Overlapping query intervals are removed by trimming repeated matches.</li>\n<li>Secondary alignments are excluded with <code>--secondary=no</code>.</li>\n</ul>\n<p><em>Note: Due to multi-threading, the output alignment ordering can differ between runs with the same input parameters. The same can occur even with option --sort for records that align to the same target sequence, the same position within that target, and in the same orientation, which are the only fields that samtools sort uses.</em></p>\n<hr />\n<h2>Transcript Collapsing</h2>\n<p>In this step, the pipeline merges redundant consensus transcripts that align to the same genomic loci. Transcripts with identical exon\u2013intron structures are collapsed into a single representative transcript model. 
The output includes unique isoforms in GFF format, a FASTA sequence file, and several supporting metric files.</p>\n<p>Both the aligned consensus reads and the original Full Length Non Chimeric (FLNC) reads are used to determine transcript structure and quantify read support.</p>\n<h3>Collapsing Consensus Transcripts</h3>\n<h5>Collapse redundant transcripts into unique isoforms</h5>\n<pre class=\"code-block copy-wrapper\">\nisoseq collapse aligned_transcripts.bam flnc.bam collapsed_isoforms.gff\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>flnc.bam</em>: original FLNC reads used to assess transcript support by counting the number of reads mapped to each isoform.</li>\n</ul>\n<p><em>Note: In addition to the GFF, the output includes a TXT file with read-to-isoform mappings, a TXT file listing transcript support statistics (FLNC counts), and a JSON file with detailed metrics. These files are required for downstream annotation and quality control.</em></p>\n<h3>Implementation with IsoSeq</h3>\n<p>The pipeline uses <a href=\"https://github.com/PacificBiosciences/IsoSeq\" target=\"_blank\" rel=\"noopener noreferrer\">IsoSeq</a> version 4.2.0.</p>\n<h3>Source Code</h3>\n<p>All the relevant code can be accessed in the GitHub repository:</p>\n<ul>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_collapse.sh\" target=\"_blank\" rel=\"noopener noreferrer\">isoseq_collapse.sh</a> [collapse]</li>\n</ul>\n<hr />\n<h2>Isoform Classification and Filtering</h2>\n<p>In this step, the pipeline uses Pigeon to classify collapsed isoforms and filter out potential artifacts. Pigeon categorizes isoforms based on their splice junctions and structural similarity to known reference annotations. 
This classification helps in identifying known, novel, and potentially artifactual transcripts.</p>\n<p>The output includes a set of high-confidence isoforms, along with detailed classification metrics and annotations.</p>\n<h3>Classifying and Filtering Isoforms</h3>\n<h5>Prepare the reference annotation and genome</h5>\n<pre class=\"code-block copy-wrapper\">\npigeon prepare annotation.gtf reference.fasta\n</pre>\n\n<p>This command sorts and indexes the GTF annotation and genome FASTA files for compatibility with Pigeon.</p>\n<p><em>Note: The annotation file used is the GENCODE comprehensive gene annotations. For more detailed information please refer to the GENCODE documentation under \u201cGenome Annotations\u201d section.</em></p>\n<h5>Prepare the collapsed isoforms</h5>\n<pre class=\"code-block copy-wrapper\">\npigeon prepare collapsed_isoforms.gff\n</pre>\n\n<p>This command sorts and indexes the collapsed isoforms generated in the previous step for compatibility with Pigeon.</p>\n<h5>Classify isoforms</h5>\n<pre class=\"code-block copy-wrapper\">\npigeon classify \\\n  sorted_isoforms.gff \\\n  annotation.gtf \\\n  reference.fasta  \\\n  --fl flnc_count.txt \\\n  --cage-peak refTSS.bed \\\n  --poly-a polyA.list\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>isoform, annotation, and reference input files must be preprocessed in the prepare step.</em></li>\n<li><em>--fl</em>: file with Full Length Non Chimeric (FLNC) read counts from the collapsing step. Required to include read support in the classification output.</li>\n<li><em>--cage-peak</em>: BED file with CAGE peaks information. Used to improve annotation of transcription start sites (TSS).</li>\n<li><em>--poly-a</em>: file in Pigeon custom format with polyA motifs. 
Used to improve annotation of polyA sites.</li>\n</ul>\n<p><em>Note: The refTSS.bed and polyA.list files used by the pipeline are provided by PacBio as part of their reference resource bundle and can be downloaded <a href=\"https://downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA\" target=\"_blank\" rel=\"noopener noreferrer\">here</a>.</em></p>\n<h5>Filter high-confidence isoforms</h5>\n<pre class=\"code-block copy-wrapper\">\npigeon filter classification.txt --isoforms sorted_isoforms.gff\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>--isoforms</em>: enables generation of a filtered GFF file as an additional output. The input isoform file must be preprocessed in the prepare step.</li>\n</ul>\n<p>This command filters the classification output to retain high-confidence isoforms.</p>\n<h3>Implementation with Pigeon</h3>\n<p>The pipeline uses <a href=\"https://github.com/PacificBiosciences/Pigeon\" target=\"_blank\" rel=\"noopener noreferrer\">Pigeon</a> version 1.3.0.</p>\n<h3>Source Code</h3>\n<p>All the relevant code can be accessed in the GitHub repository:</p>\n<ul>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_prepare.sh\" target=\"_blank\" rel=\"noopener noreferrer\">pigeon_prepare.sh</a> [prepare]</li>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_classify.sh\" target=\"_blank\" rel=\"noopener noreferrer\">pigeon_classify.sh</a> [classify]</li>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_filter.sh\" target=\"_blank\" rel=\"noopener noreferrer\">pigeon_filter.sh</a> [filter]</li>\n</ul>\n<hr />\n<h2>Read Annotation</h2>\n<p>In this step, the pipeline annotates individual Full Length Non Chimeric (FLNC) reads with the isoform-level classification generated in the previous step. 
This allows downstream analyses to trace high-confidence isoforms back to the specific supporting reads.</p>\n<p>The annotation is performed using a custom in-house script that lifts isoform classification to the read level.</p>\n<h3>Annotation Tags</h3>\n<p>Reads are annotated in the BAM format using the following custom tags:</p>\n<table>\n<thead>\n<tr>\n<th>Tag</th>\n<th>Format</th>\n<th>Description</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td><code>in:Z:</code></td>\n<td>string</td>\n<td>Isoform ID.</td>\n</tr>\n<tr>\n<td><code>sc:Z:</code></td>\n<td>string</td>\n<td>Structural category. One of: <code>full-splice_match</code>, <code>incomplete-splice_match</code>, <code>novel_in_catalog</code>, <code>novel_not_in_catalog</code>, <code>genic</code>, <code>antisense</code>, <code>fusion</code>, <code>intergenic</code>, <code>genic_intron</code>.</td>\n</tr>\n<tr>\n<td><code>gn:Z:</code></td>\n<td>string</td>\n<td>Associated reference gene name.</td>\n</tr>\n<tr>\n<td><code>tn:Z:</code></td>\n<td>string</td>\n<td>Associated reference transcript name.</td>\n</tr>\n<tr>\n<td><code>sb:Z:</code></td>\n<td>string</td>\n<td>Subcategory for additional splicing information. 
Values may include <code>mono-exon</code>, <code>multi-exon</code>, and <code>intron_retention</code> (separated by semicolons).</td>\n</tr>\n<tr>\n<td><code>ct:i:</code></td>\n<td>int</td>\n<td>Total number of reads supporting the isoform.</td>\n</tr>\n</tbody>\n</table>\n<h3>Annotating FLNC Reads by Isoform Class</h3>\n<h5>Annotate FLNC reads</h5>\n<pre class=\"code-block copy-wrapper\">\nFLNC_ImportTags.py \\\n  --input_flnc aligned_flnc.bam \\\n  --output_flnc annotated_flnc.bam \\\n  --read_stat read_stat.txt \\\n  --classification filtered_classification.txt \\\n  --index\n</pre>\n\n<p><strong>Arguments:</strong></p>\n<ul>\n<li><em>--input_flnc</em>: input BAM file containing aligned FLNC reads to annotate.</li>\n<li><em>--read_stat</em>: file from the collapsing step with read-to-isoform mappings (read_stat).</li>\n<li><em>--classification</em>: classification file from the filtering step.</li>\n<li><em>--index</em>: flag to index the output BAM file. Requires the reads to be sorted.</li>\n</ul>\n<h3>Implementation</h3>\n<p>The annotation step is implemented using a custom Python script maintained in-house.</p>\n<h3>Source Code</h3>\n<p>All the relevant code can be accessed in the GitHub repository:</p>\n<ul>\n<li><a href=\"https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/utils/FLNC_ImportTags.py\" target=\"_blank\" rel=\"noopener noreferrer\">FLNC_ImportTags.py</a> [FLNC_ImportTags.py]</li>\n</ul></div>", "content": "## Overview\n\nThe long-read analysis pipeline for PacBio Kinnex RNA-seq data follows Iso-Seq processing guidelines<sup><sub>1</sub></sup>. It is designed for per-sample execution, handling one or more Full Length Non Chimeric (FLNC) BAM files. The pipeline has been extended to align and annotate FLNC reads with isoform-level information directly within the BAM file.\n\n### Key Pipeline Steps\n\n1. **Read Clustering:** Generation of high-quality consensus transcript sequences through clustering of FLNC reads.\n2. 
**Alignment with pbmm2:** Alignment of the consensus transcripts and FLNC reads to the reference genome using pbmm2.\n3. **Transcript Collapsing:** Collapsing of redundant transcripts based on exon-intron structure to define a unique set of isoforms.\n4. **Isoform Classification and Filtering:** Classification and filtering of isoforms to remove potential artifacts and retain high-confidence transcripts.\n5. **Read Annotation:** Annotation of aligned FLNC reads with isoform-level information embedded in the BAM file.\n\n### Pipeline Chart\n\n![flow_chart](/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_RNA-seq_PacBio_Kinnex.png)\n\n<sub><b>1</b>: PacBio Iso-Seq Analysis Guidelines. Available at: [https://isoseq.how](https://isoseq.how)</sub>\n\n---\n\n## Read Clustering\n\nIn this step, the pipeline generates high-quality consensus transcript sequences from Full Length Non Chimeric (FLNC) reads. It accepts and processes one or multiple FLNC BAM files in a single run, producing a unified output. The consensus transcripts are used in downstream alignment, collapsing, and quantification.\n\n### Clustering FLNC Reads\n\n##### Generate high-quality consensus transcripts\n\n<pre class=\"code-block copy-wrapper\">\nisoseq cluster2 --singletons flnc.fofn transcripts.bam\n</pre>\n\nArguments:\n\n- *flnc.fofn*: a file-of-filenames (FOFN) listing the FLNC BAM files to be processed together.\n- *-\\-singletons*: includes low-abundance transcripts supported by fewer than two FLNC reads, which are typically excluded. 
This allows retention of rare transcript isoforms that might still be biologically relevant.\n\n### Implementation with IsoSeq\n\nThe pipeline uses [IsoSeq](https://github.com/PacificBiosciences/IsoSeq) version 4.2.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [isoseq_cluster2.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_cluster2.sh) [cluster2]\n\n---\n\n## Alignment with pbmm2\n\nThe pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. An integrity check is then performed on the resulting BAM file.\n\npbmm2 is used to align both the Full Length Non Chimeric (FLNC) reads and the consensus transcripts generated by cluster2.\n\n### Aligning and Sorting\n\n##### Align and sort reads\n\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --preset ISOSEQ --sort --strip --unmapped reference.fasta unaligned.bam aligned.bam\n</pre>\n\nArguments:\n\n- *-\\-preset*: use parameters optimized for Iso-Seq data (ISOSEQ).\n- *-\\-sort*: sort the aligned reads by genomic coordinates.\n- *-\\-strip*: remove extraneous tags if present in the input BAM file. Tags removed: `dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st`.\n- *-\\-unmapped*: retain unmapped reads.\n\n### Integrity Check\n\nTo confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.\n\n### Implementation with pbmm2\n\nThe pipeline uses [pbmm2](https://github.com/PacificBiosciences/pbmm2) version 1.13.0, which wraps [minimap2](https://github.com/lh3/minimap2) version 2.26. 
Note that pbmm2 sets some defaults that may differ from those of standard minimap2.\n\nDefaults set by pbmm2 for minimap2:\n\n- Soft clipping is enabled with `-Y`.\n- Long cigars for the `CG` tag are set using `-L`.\n- X/= cigars are used instead of M with `--eqx`.\n- Overlapping query intervals are removed by trimming repeated matches.\n- Secondary alignments are excluded with `--secondary=no`.\n\n*Note: Due to multi-threading, the output alignment ordering can differ between runs with the same input parameters. The same can occur even with option -\\-sort for records that align to the same target sequence, the same position within that target, and in the same orientation, which are the only fields that samtools sort uses.*\n\n---\n\n## Transcript Collapsing\n\nIn this step, the pipeline merges redundant consensus transcripts that align to the same genomic loci. Transcripts with identical exon\u2013intron structures are collapsed into a single representative transcript model. The output includes unique isoforms in GFF format, a FASTA sequence file, and several supporting metric files.\n\nBoth the aligned consensus reads and the original Full Length Non Chimeric (FLNC) reads are used to determine transcript structure and quantify read support.\n\n### Collapsing Consensus Transcripts\n\n##### Collapse redundant transcripts into unique isoforms\n\n<pre class=\"code-block copy-wrapper\">\nisoseq collapse aligned_transcripts.bam flnc.bam collapsed_isoforms.gff\n</pre>\n\nArguments:\n\n- *flnc.bam*: original FLNC reads used to assess transcript support by counting the number of reads mapped to each isoform.\n\n*Note: In addition to the GFF, the output includes a TXT file with read-to-isoform mappings, a TXT file listing transcript support statistics (FLNC counts), and a JSON file with detailed metrics. 
These files are required for downstream annotation and quality control.*\n\n### Implementation with IsoSeq\n\nThe pipeline uses [IsoSeq](https://github.com/PacificBiosciences/IsoSeq) version 4.2.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [isoseq_collapse.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/isoseq/isoseq_collapse.sh) [collapse]\n\n---\n\n## Isoform Classification and Filtering\n\nIn this step, the pipeline uses Pigeon to classify collapsed isoforms and filter out potential artifacts. Pigeon categorizes isoforms based on their splice junctions and structural similarity to known reference annotations. This classification helps in identifying known, novel, and potentially artifactual transcripts.\n\nThe output includes a set of high-confidence isoforms, along with detailed classification metrics and annotations.\n\n### Classifying and Filtering Isoforms\n\n##### Prepare the reference annotation and genome\n\n<pre class=\"code-block copy-wrapper\">\npigeon prepare annotation.gtf reference.fasta\n</pre>\n\nThis command sorts and indexes the GTF annotation and genome FASTA files for compatibility with Pigeon.\n\n*Note: The annotation file used is the GENCODE comprehensive gene annotations. 
For more details, refer to the \u201cGenome Annotations\u201d section of the GENCODE documentation.*\n\n##### Prepare the collapsed isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon prepare collapsed_isoforms.gff\n</pre>\n\nThis command sorts and indexes the collapsed isoforms generated in the previous step for compatibility with Pigeon.\n\n##### Classify isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon classify \\\n  sorted_isoforms.gff \\\n  annotation.gtf \\\n  reference.fasta \\\n  --fl flnc_count.txt \\\n  --cage-peak refTSS.bed \\\n  --poly-a polyA.list\n</pre>\n\nArguments:\n\n- *The isoform, annotation, and reference input files must be preprocessed in the prepare step.*\n- *-\\-fl*: file with Full Length Non Chimeric (FLNC) read counts from the collapsing step. Required to include read support in the classification output.\n- *-\\-cage-peak*: BED file with CAGE peak information. Used to improve annotation of transcription start sites (TSS).\n- *-\\-poly-a*: file in Pigeon custom format with polyA motifs. Used to improve annotation of polyA sites.\n\n*Note: The refTSS.bed and polyA.list files used by the pipeline are provided by PacBio as part of their reference resource bundle and can be downloaded [here](https://downloads.pacbcloud.com/public/dataset/Kinnex-full-length-RNA).*\n\n##### Filter high-confidence isoforms\n\n<pre class=\"code-block copy-wrapper\">\npigeon filter classification.txt --isoforms sorted_isoforms.gff\n</pre>\n\nArguments:\n\n- *-\\-isoforms*: enables generation of a filtered GFF file as an additional output. 
Input isoform file must be preprocessed in the prepare step.\n\nThis command filters isoforms from the classification output.\n\n### Implementation with Pigeon\n\nThe pipeline uses [Pigeon](https://github.com/PacificBiosciences/Pigeon) version 1.3.0.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [pigeon_prepare.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_prepare.sh) [prepare]\n  - [pigeon_classify.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_classify.sh) [classify]\n  - [pigeon_filter.sh](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/pigeon/pigeon_filter.sh) [filter]\n\n---\n\n## Read Annotation\n\nIn this step, the pipeline annotates individual Full Length Non Chimeric (FLNC) reads with the isoform-level classification generated in the previous step. This allows downstream analyses to trace high-confidence isoforms back to the specific supporting reads.\n\nThe annotation is performed using a custom in-house script that lifts isoform classification to the read level.\n\n### Annotation Tags\n\nReads are annotated in the BAM format using the following custom tags:\n\n| Tag        | Format | Description |\n|------------|--------|-------------|\n| `in:Z:`    | string | Isoform ID. |\n| `sc:Z:`    | string | Structural category. One of: `full-splice_match`, `incomplete-splice_match`, `novel_in_catalog`, `novel_not_in_catalog`, `genic`, `antisense`, `fusion`, `intergenic`, `genic_intron`. |\n| `gn:Z:`    | string | Associated reference gene name. |\n| `tn:Z:`    | string | Associated reference transcript name. |\n| `sb:Z:`    | string | Subcategory for additional splicing information. Values may include `mono-exon`, `multi-exon`, and `intron_retention` (separated by semicolons). |\n| `ct:i:`    | int    | Total number of reads supporting the isoform. 
|\n\n### Annotating FLNC Reads by Isoform Class\n\n##### Annotate FLNC reads\n\n<pre class=\"code-block copy-wrapper\">\nFLNC_ImportTags.py \\\n  --input_flnc aligned_flnc.bam \\\n  --output_flnc annotated_flnc.bam \\\n  --read_stat read_stat.txt \\\n  --classification filtered_classification.txt \\\n  --index\n</pre>\n\nArguments:\n\n- *-\\-input_flnc*: input BAM file containing aligned FLNC reads to annotate.\n- *-\\-output_flnc*: output BAM file to which the annotated FLNC reads are written.\n- *-\\-read_stat*: file from the collapsing step with read-to-isoform mappings (read_stat).\n- *-\\-classification*: classification file from the filtering step.\n- *-\\-index*: flag to index the output BAM file. Requires the reads to be sorted.\n\n### Implementation\n\nThe annotation step is implemented using a custom Python script maintained in-house.\n\n### Source Code\n\nAll the relevant code can be accessed in the GitHub repository:\n\n  - [FLNC_ImportTags.py](https://github.com/smaht-dac/rnaseq-pipelines/blob/main/dockerfiles/utils/FLNC_ImportTags.py) [FLNC_ImportTags.py]\n", "filetype": "md", "@context": "/terms/", "aggregated-items": {}, "validation-errors": []}