{"title": "PacBio HiFi (long-read)", "status": "open", "content": [{"body": "## Overview\n\nThe long-read alignment pipeline for PacBio HiFi data is designed for per-sample and per-library execution, handling one or multiple unaligned BAM files. The pipeline is optimized for distributed processing, requiring each unaligned BAM file to correspond to a single SMRT Cell.\n\n### Key Pipeline Steps\n\n1. **Alignment with pbmm2:** Initial alignment of the raw reads to the reference genome using pbmm2.\n2. **Read Groups Assignment:** Assignment of reads to specific groups.\n3. **Methylation and Tags Linking:** Linking the methylation status information and other specific tags from the unaligned to the alignment BAM file.\n\n### Pipeline Chart\n\n![flow_chart](/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_PacBio_HiFi.png)\n\n---\n## Alignment with pbmm2\n\nThe pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. An integrity check is then performed on the resulting BAM file.\n\n### Aligning and Sorting\n\n###### Align and sort reads\n\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --sort --strip --unmapped reference.fasta unaligned.bam sorted.bam\n</pre>\n\nArguments:\n\n- *-\\-sort*: sort the aligned reads by genomic coordinates.\n- *-\\-strip*: remove extraneous tags if present in the input BAM file. Tags removed: `dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st`.\n- *-\\-unmapped*: retain unmapped reads.\n\n### Integrity Check\n\nTo confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.\n\n### Implementation with pbmm2\n\nThe pipeline uses [pbmm2](https://github.com/PacificBiosciences/pbmm2) version 1.13.0, which wraps [minimap2](https://github.com/lh3/minimap2) version 2.26. 
Note that pbmm2 sets some defaults that may differ from the standard minimap2 defaults.\n\nDefaults set by pbmm2 for minimap2:\n\n- Soft clipping is enabled with `-Y`.\n- Long CIGAR strings are stored in the `CG` tag with `-L`.\n- X/= CIGAR operations are used instead of M with `--eqx`.\n- Overlapping query intervals are prevented by trimming repeated matches.\n- Secondary alignments are excluded with `--secondary=no`.\n\n*Note: Due to multi-threading, the order of output alignments can differ between runs with the same input parameters. This can occur even with option -\\-sort for records that align to the same target sequence, at the same position within that target, and in the same orientation, since these are the only fields that samtools sort uses.*\n\n---\n\n## Read Groups\n\nA read group (`@RG`) groups reads together under a unique identifier and captures relevant information about the sample and the sequencing process and technology, which various downstream bioinformatics tools use.\n\nThe relevant fields that define a read group include:\n\n- **ID (Identifier):** A unique identifier for the read group within the BAM file and across multiple BAM files used in the same dataset.\n- **SM (Sample):** The sample to which the reads belong.\n- **PL (Platform):** The technology used to sequence the reads (e.g., PACBIO).\n- **PM (Platform Model):** The platform model reflecting the instrument series (e.g., REVIO, ASTRO, SEQUEL, RS).\n- **PU (Platform Unit):** A unique identifier for the sequencer unit used for sequencing (i.e., PacBio movie name).\n- **LB (Library):** The library used to sequence the reads.\n- **DS (Description):** Semantic information about the reads in the group, encoded as a semicolon-delimited list of \u201cKey=Value\u201d strings.\n\n###### Mandatory Description (DS) Information\n\n| Key               | Value Specification                                                                | Example     
|\n|-------------------|------------------------------------------------------------------------------------|-------------|\n| READTYPE          | One of ZMW, HQREGION, SUBREAD, CCS, SCRAP, or UNKNOWN                              | CCS         |\n| BINDINGKIT        | Binding kit part number                                                            | 102-739-100 |\n| SEQUENCINGKIT     | Sequencing kit part number                                                         | 102-118-800 |\n| BASECALLERVERSION | Basecaller version number                                                          | 5.0         |\n| FRAMERATEHZ       | Frame rate in Hz                                                                   | 100         |\n| CONTROL           | TRUE if reads are classified as spike-in controls, otherwise CONTROL key is absent | TRUE        |\n\n###### Optional Description (DS) Information\n\n| Key            | Value Specification                                                                        | Example                                |\n|----------------|--------------------------------------------------------------------------------------------|----------------------------------------|\n| BarcodeFile    | Name of the FASTA file containing the sequences of the barcodes used                       | m84046_230828_225743_s2.barcodes.fasta |\n| BarcodeHash    | The MD5 hash of the contents of the barcoding sequence file                                | e7c4279103df8c8de7036efdbdca9008       |\n| BarcodeCount   | The number of barcode sequences in the barcode file                                        | 113                                    |\n| BarcodeMode    | Experimental design of the barcodes. Must be Symmetric/Asymmetric/Tailed or None           | Symmetric                              |\n| BarcodeQuality | The type of value encoded by the bq tag. 
Must be Score/Probability/None                    | Score                                  |\n\n### Assigning Read Groups\n\nDuring the alignment process using pbmm2, the original read groups from the unaligned BAM files are linked and maintained in the corresponding alignment BAM files. In-house Bash code that uses samtools replaces `SM` and `LB` information with the correct identifiers used by the portal, as follows:\n\n- **SM:** `<sample name>`\n- **LB:** `<sample name>.<library>`\n\nFor example, in a BAM file:\n\n<pre class=\"code-block copy-wrapper\">\n@RG\tID:f115ea06/25--25\tPL:PACBIO\tDS:READTYPE=CCS;Ipd:Frames=ip;PulseWidth:Frames=pw;BINDINGKIT=102-739-100;SEQUENCINGKIT=102-118-800;BASECALLERVERSION=5.0;FRAMERATEHZ=100.000000;BarcodeFile=metadata/m84046_230828_225743_s2.barcodes.fasta;BarcodeHash=e7c4279103df8c8de7036efdbdca9008;BarcodeCount=113;BarcodeMode=Symmetric;BarcodeQuality=Score\tLB:SMACUBS146SV.SMALIRA4HLNS\tPU:m84046_230828_225743_s2\tSM:SMACUBS146SV\tPM:REVIO\tBC:ACGCACGTACGAGTAT\tCM:R/P1-C1/5.0-25M\n</pre>\n\n### Source Code\n\nAll the relevant code is accessible in the GitHub repository:\n\n  - [ReplaceReadGroups.sh](https://github.com/smaht-dac/pipelines-scripts/blob/main/processing_scripts/ReplaceReadGroups.sh) [ReplaceReadGroups]\n\n---\n\n## Methylation and Tags\n\nDuring the alignment process using pbmm2, methylation and other tags are linked from the unaligned to the alignment BAM file. The specific tags may vary depending on the method used to generate the data. The tags are defined below.\n\n### Fiber-seq\n\nFiber-seq<sup><sub>1</sub></sup> is a chromatin mapping technique that employs methyltransferases to mark accessible adenines in DNA with methyl groups. The chromatin structure (e.g., nucleosomes and bound transcription factors) is used as a \"stencil\" for the methyltransferase, mapping the structure of chromatin fibers onto the underlying DNA template. 
The positions of the methylated adenines are then used to infer DNA accessibility along the template, offering high-resolution insight into chromatin structure at nearly the single-molecule level.\n\nRaw PacBio HiFi data are processed through [fibertools-rs](https://github.com/fiberseq/fibertools-rs) to generate unaligned BAM files. fibertools-rs adds information to these files as tags, which pbmm2 links during alignment.\n\nFiber-seq generates the following tags:\n\n- **MM (mCpG and m6A Methylation Positions):** Positions of mCpG and m6A methylation along the read.\n- **ML (Methylation Precision Values):** Precision values for each mCpG and m6A methylation call.\n- **ns (Nucleosome Start Positions):** Start positions of identified nucleosomes along the read relative to the first base.\n- **nl (Nucleosome Lengths):** Lengths of each nucleosome along the read.\n- **as (Methyltransferase Accessible Patch Start Positions):** Start positions of identified methyltransferase accessible patches (MSP) along the read.\n- **al (MSP Lengths):** Lengths of each MSP along the read.\n\n<sub><b>1</b>: *Andrew B. Stergachis et al.* Single-molecule regulatory architectures captured by chromatin fiber sequencing. 
*Science 368, 1449-1454(2020).* doi: 10.1126/science.aaz1646</sub>", "title": "PacBio HiFi (long-read)", "status": "open", "options": {"filetype": "md", "collapsible": false, "default_open": true, "convert_ext_links": true, "initial_header_level": 2}, "consortia": [{"@type": ["Consortium", "Item"], "status": "open", "display_title": "SMaHT", "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "identifier": "long-read_pacbio_hifi", "date_created": "2026-01-09T20:07:18.962645+00:00", "section_type": "Page Section", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:50:46.181092+00:00"}, "schema_version": "1", "submission_centers": [{"display_title": "HMS DAC", "status": "open", "@type": ["SubmissionCenter", "Item"], "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/static-sections/23bd319b-19eb-4408-869b-eda806da56a5/", "@type": ["StaticSection", "UserContent", "Item"], "uuid": "23bd319b-19eb-4408-869b-eda806da56a5", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "PacBio HiFi (long-read)", "content_as_html": "<div><h2>Overview</h2>\n<p>The long-read alignment pipeline for PacBio HiFi data is designed for per-sample and per-library execution, handling one or multiple unaligned BAM files. 
The pipeline is optimized for distributed processing, requiring each unaligned BAM file to correspond to a single SMRT Cell.</p>\n<h3>Key Pipeline Steps</h3>\n<ol>\n<li><strong>Alignment with pbmm2:</strong> Initial alignment of the raw reads to the reference genome using pbmm2.</li>\n<li><strong>Read Groups Assignment:</strong> Assignment of reads to specific groups.</li>\n<li><strong>Methylation and Tags Linking:</strong> Linking the methylation status information and other specific tags from the unaligned to the alignment BAM file.</li>\n</ol>\n<h3>Pipeline Chart</h3>\n<p><img alt=\"flow_chart\" src=\"/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_PacBio_HiFi.png\" /></p>\n<hr />\n<h2>Alignment with pbmm2</h2>\n<p>The pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. An integrity check is then performed on the resulting BAM file.</p>\n<h3>Aligning and Sorting</h3>\n<h6>Align and sort reads</h6>\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --sort --strip --unmapped reference.fasta unaligned.bam sorted.bam\n</pre>\n\n<p>Arguments:</p>\n<ul>\n<li><em>--sort</em>: sort the aligned reads by genomic coordinates.</li>\n<li><em>--strip</em>: remove extraneous tags if present in the input BAM file. 
Tags removed: <code>dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st</code>.</li>\n<li><em>--unmapped</em>: retain unmapped reads.</li>\n</ul>\n<h3>Integrity Check</h3>\n<p>To confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.</p>\n<h3>Implementation with pbmm2</h3>\n<p>The pipeline uses <a href=\"https://github.com/PacificBiosciences/pbmm2\" target=\"_blank\" rel=\"noopener noreferrer\">pbmm2</a> version 1.13.0, which wraps <a href=\"https://github.com/lh3/minimap2\" target=\"_blank\" rel=\"noopener noreferrer\">minimap2</a> version 2.26. Note that pbmm2 sets some defaults that may differ from the standard minimap2 defaults.</p>\n<p>Defaults set by pbmm2 for minimap2:</p>\n<ul>\n<li>Soft clipping is enabled with <code>-Y</code>.</li>\n<li>Long CIGAR strings are stored in the <code>CG</code> tag with <code>-L</code>.</li>\n<li>X/= CIGAR operations are used instead of M with <code>--eqx</code>.</li>\n<li>Overlapping query intervals are prevented by trimming repeated matches.</li>\n<li>Secondary alignments are excluded with <code>--secondary=no</code>.</li>\n</ul>\n<p><em>Note: Due to multi-threading, the order of output alignments can differ between runs with the same input parameters. 
This can occur even with option --sort for records that align to the same target sequence, at the same position within that target, and in the same orientation, since these are the only fields that samtools sort uses.</em></p>\n<hr />\n<h2>Read Groups</h2>\n<p>A read group (<code>@RG</code>) groups reads together under a unique identifier and captures relevant information about the sample and the sequencing process and technology, which various downstream bioinformatics tools use.</p>\n<p>The relevant fields that define a read group include:</p>\n<ul>\n<li><strong>ID (Identifier):</strong> A unique identifier for the read group within the BAM file and across multiple BAM files used in the same dataset.</li>\n<li><strong>SM (Sample):</strong> The sample to which the reads belong.</li>\n<li><strong>PL (Platform):</strong> The technology used to sequence the reads (e.g., PACBIO).</li>\n<li><strong>PM (Platform Model):</strong> The platform model reflecting the instrument series (e.g., REVIO, ASTRO, SEQUEL, RS).</li>\n<li><strong>PU (Platform Unit):</strong> A unique identifier for the sequencer unit used for sequencing (i.e., PacBio movie name).</li>\n<li><strong>LB (Library):</strong> The library used to sequence the reads.</li>\n<li><strong>DS (Description):</strong> Semantic information about the reads in the group, encoded as a semicolon-delimited list of \u201cKey=Value\u201d strings.</li>\n</ul>\n<h6>Mandatory Description (DS) Information</h6>\n<table>\n<thead>\n<tr>\n<th>Key</th>\n<th>Value Specification</th>\n<th>Example</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>READTYPE</td>\n<td>One of ZMW, HQREGION, SUBREAD, CCS, SCRAP, or UNKNOWN</td>\n<td>CCS</td>\n</tr>\n<tr>\n<td>BINDINGKIT</td>\n<td>Binding kit part number</td>\n<td>102-739-100</td>\n</tr>\n<tr>\n<td>SEQUENCINGKIT</td>\n<td>Sequencing kit part number</td>\n<td>102-118-800</td>\n</tr>\n<tr>\n<td>BASECALLERVERSION</td>\n<td>Basecaller version 
number</td>\n<td>5.0</td>\n</tr>\n<tr>\n<td>FRAMERATEHZ</td>\n<td>Frame rate in Hz</td>\n<td>100</td>\n</tr>\n<tr>\n<td>CONTROL</td>\n<td>TRUE if reads are classified as spike-in controls, otherwise CONTROL key is absent</td>\n<td>TRUE</td>\n</tr>\n</tbody>\n</table>\n<h6>Optional Description (DS) Information</h6>\n<table>\n<thead>\n<tr>\n<th>Key</th>\n<th>Value Specification</th>\n<th>Example</th>\n</tr>\n</thead>\n<tbody>\n<tr>\n<td>BarcodeFile</td>\n<td>Name of the FASTA file containing the sequences of the barcodes used</td>\n<td>m84046_230828_225743_s2.barcodes.fasta</td>\n</tr>\n<tr>\n<td>BarcodeHash</td>\n<td>The MD5 hash of the contents of the barcoding sequence file</td>\n<td>e7c4279103df8c8de7036efdbdca9008</td>\n</tr>\n<tr>\n<td>BarcodeCount</td>\n<td>The number of barcode sequences in the barcode file</td>\n<td>113</td>\n</tr>\n<tr>\n<td>BarcodeMode</td>\n<td>Experimental design of the barcodes. Must be Symmetric/Asymmetric/Tailed or None</td>\n<td>Symmetric</td>\n</tr>\n<tr>\n<td>BarcodeQuality</td>\n<td>The type of value encoded by the bq tag. Must be Score/Probability/None</td>\n<td>Score</td>\n</tr>\n</tbody>\n</table>\n<h3>Assigning Read Groups</h3>\n<p>During the alignment process using pbmm2, the original read groups from the unaligned BAM files are linked and maintained in the corresponding alignment BAM files. 
In-house Bash code that uses samtools replaces <code>SM</code> and <code>LB</code> information with the correct identifiers used by the portal, as follows:</p>\n<ul>\n<li><strong>SM:</strong> <code>&lt;sample name&gt;</code></li>\n<li><strong>LB:</strong> <code>&lt;sample name&gt;.&lt;library&gt;</code></li>\n</ul>\n<p>For example, in a BAM file:</p>\n<pre class=\"code-block copy-wrapper\">\n@RG ID:f115ea06/25--25  PL:PACBIO   DS:READTYPE=CCS;Ipd:Frames=ip;PulseWidth:Frames=pw;BINDINGKIT=102-739-100;SEQUENCINGKIT=102-118-800;BASECALLERVERSION=5.0;FRAMERATEHZ=100.000000;BarcodeFile=metadata/m84046_230828_225743_s2.barcodes.fasta;BarcodeHash=e7c4279103df8c8de7036efdbdca9008;BarcodeCount=113;BarcodeMode=Symmetric;BarcodeQuality=Score   LB:SMACUBS146SV.SMALIRA4HLNS    PU:m84046_230828_225743_s2  SM:SMACUBS146SV PM:REVIO    BC:ACGCACGTACGAGTAT CM:R/P1-C1/5.0-25M\n</pre>\n\n<h3>Source Code</h3>\n<p>All the relevant code is accessible in the GitHub repository:</p>\n<ul>\n<li><a href=\"https://github.com/smaht-dac/pipelines-scripts/blob/main/processing_scripts/ReplaceReadGroups.sh\" target=\"_blank\" rel=\"noopener noreferrer\">ReplaceReadGroups.sh</a> [ReplaceReadGroups]</li>\n</ul>\n<hr />\n<h2>Methylation and Tags</h2>\n<p>During the alignment process using pbmm2, methylation and other tags are linked from the unaligned to the alignment BAM file. The specific tags may vary depending on the method used to generate the data. The tags are defined below.</p>\n<h3>Fiber-seq</h3>\n<p>Fiber-seq<sup><sub>1</sub></sup> is a chromatin mapping technique that employs methyltransferases to mark accessible adenines in DNA with methyl groups. The chromatin structure (e.g., nucleosomes and bound transcription factors) is used as a \"stencil\" for the methyltransferase, mapping the structure of chromatin fibers onto the underlying DNA template. 
The positions of the methylated adenines are then used to infer DNA accessibility along the template, offering high-resolution insight into chromatin structure at nearly the single-molecule level.</p>\n<p>Raw PacBio HiFi data are processed through <a href=\"https://github.com/fiberseq/fibertools-rs\" target=\"_blank\" rel=\"noopener noreferrer\">fibertools-rs</a> to generate unaligned BAM files. fibertools-rs adds information to these files as tags, which pbmm2 links during alignment.</p>\n<p>Fiber-seq generates the following tags:</p>\n<ul>\n<li><strong>MM (mCpG and m6A Methylation Positions):</strong> Positions of mCpG and m6A methylation along the read.</li>\n<li><strong>ML (Methylation Precision Values):</strong> Precision values for each mCpG and m6A methylation call.</li>\n<li><strong>ns (Nucleosome Start Positions):</strong> Start positions of identified nucleosomes along the read relative to the first base.</li>\n<li><strong>nl (Nucleosome Lengths):</strong> Lengths of each nucleosome along the read.</li>\n<li><strong>as (Methyltransferase Accessible Patch Start Positions):</strong> Start positions of identified methyltransferase accessible patches (MSP) along the read.</li>\n<li><strong>al (MSP Lengths):</strong> Lengths of each MSP along the read.</li>\n</ul>\n<p><sub><b>1</b>: <em>Andrew B. Stergachis et al.</em> Single-molecule regulatory architectures captured by chromatin fiber sequencing. <em>Science 368, 1449-1454(2020).</em> doi: 10.1126/science.aaz1646</sub></p></div>", "content": "## Overview\n\nThe long-read alignment pipeline for PacBio HiFi data is designed for per-sample and per-library execution, handling one or multiple unaligned BAM files. The pipeline is optimized for distributed processing, requiring each unaligned BAM file to correspond to a single SMRT Cell.\n\n### Key Pipeline Steps\n\n1. **Alignment with pbmm2:** Initial alignment of the raw reads to the reference genome using pbmm2.\n2. 
**Read Groups Assignment:** Assignment of reads to specific groups.\n3. **Methylation and Tags Linking:** Linking the methylation status information and other specific tags from the unaligned to the alignment BAM file.\n\n### Pipeline Chart\n\n![flow_chart](/static/img/pipeline-docs/Flow_Chart_Pipeline_Long-Read_PacBio_HiFi.png)\n\n---\n## Alignment with pbmm2\n\nThe pipeline uses pbmm2 to align each unaligned BAM file to the reference genome. The software also sorts the reads by genomic coordinates, strips unnecessary tags, and links methylation tags if present. An integrity check is then performed on the resulting BAM file.\n\n### Aligning and Sorting\n\n###### Align and sort reads\n\n<pre class=\"code-block copy-wrapper\">\npbmm2 align --sort --strip --unmapped reference.fasta unaligned.bam sorted.bam\n</pre>\n\nArguments:\n\n- *-\\-sort*: sort the aligned reads by genomic coordinates.\n- *-\\-strip*: remove extraneous tags if present in the input BAM file. Tags removed: `dq, dt, ip, iq, mq, pa, pc, pd, pe, pg, pm, pq, pt, pv, pw, px, sf, sq, st`.\n- *-\\-unmapped*: retain unmapped reads.\n\n### Integrity Check\n\nTo confirm the integrity of the alignment BAM file, in-house Python code checks for the presence of the 28-byte empty block representing the EOF marker in BAM format.\n\n### Implementation with pbmm2\n\nThe pipeline uses [pbmm2](https://github.com/PacificBiosciences/pbmm2) version 1.13.0, which wraps [minimap2](https://github.com/lh3/minimap2) version 2.26. 
Note that pbmm2 sets some defaults that may differ from the standard minimap2 defaults.\n\nDefaults set by pbmm2 for minimap2:\n\n- Soft clipping is enabled with `-Y`.\n- Long CIGAR strings are stored in the `CG` tag with `-L`.\n- X/= CIGAR operations are used instead of M with `--eqx`.\n- Overlapping query intervals are prevented by trimming repeated matches.\n- Secondary alignments are excluded with `--secondary=no`.\n\n*Note: Due to multi-threading, the order of output alignments can differ between runs with the same input parameters. This can occur even with option -\\-sort for records that align to the same target sequence, at the same position within that target, and in the same orientation, since these are the only fields that samtools sort uses.*\n\n---\n\n## Read Groups\n\nA read group (`@RG`) groups reads together under a unique identifier and captures relevant information about the sample and the sequencing process and technology, which various downstream bioinformatics tools use.\n\nThe relevant fields that define a read group include:\n\n- **ID (Identifier):** A unique identifier for the read group within the BAM file and across multiple BAM files used in the same dataset.\n- **SM (Sample):** The sample to which the reads belong.\n- **PL (Platform):** The technology used to sequence the reads (e.g., PACBIO).\n- **PM (Platform Model):** The platform model reflecting the instrument series (e.g., REVIO, ASTRO, SEQUEL, RS).\n- **PU (Platform Unit):** A unique identifier for the sequencer unit used for sequencing (i.e., PacBio movie name).\n- **LB (Library):** The library used to sequence the reads.\n- **DS (Description):** Semantic information about the reads in the group, encoded as a semicolon-delimited list of \u201cKey=Value\u201d strings.\n\n###### Mandatory Description (DS) Information\n\n| Key               | Value Specification                                                                | Example     
|\n|-------------------|------------------------------------------------------------------------------------|-------------|\n| READTYPE          | One of ZMW, HQREGION, SUBREAD, CCS, SCRAP, or UNKNOWN                              | CCS         |\n| BINDINGKIT        | Binding kit part number                                                            | 102-739-100 |\n| SEQUENCINGKIT     | Sequencing kit part number                                                         | 102-118-800 |\n| BASECALLERVERSION | Basecaller version number                                                          | 5.0         |\n| FRAMERATEHZ       | Frame rate in Hz                                                                   | 100         |\n| CONTROL           | TRUE if reads are classified as spike-in controls, otherwise CONTROL key is absent | TRUE        |\n\n###### Optional Description (DS) Information\n\n| Key            | Value Specification                                                                        | Example                                |\n|----------------|--------------------------------------------------------------------------------------------|----------------------------------------|\n| BarcodeFile    | Name of the FASTA file containing the sequences of the barcodes used                       | m84046_230828_225743_s2.barcodes.fasta |\n| BarcodeHash    | The MD5 hash of the contents of the barcoding sequence file                                | e7c4279103df8c8de7036efdbdca9008       |\n| BarcodeCount   | The number of barcode sequences in the barcode file                                        | 113                                    |\n| BarcodeMode    | Experimental design of the barcodes. Must be Symmetric/Asymmetric/Tailed or None           | Symmetric                              |\n| BarcodeQuality | The type of value encoded by the bq tag. 
Must be Score/Probability/None                    | Score                                  |\n\n### Assigning Read Groups\n\nDuring the alignment process using pbmm2, the original read groups from the unaligned BAM files are linked and maintained in the corresponding alignment BAM files. In-house Bash code that uses samtools replaces `SM` and `LB` information with the correct identifiers used by the portal, as follows:\n\n- **SM:** `<sample name>`\n- **LB:** `<sample name>.<library>`\n\nFor example, in a BAM file:\n\n<pre class=\"code-block copy-wrapper\">\n@RG\tID:f115ea06/25--25\tPL:PACBIO\tDS:READTYPE=CCS;Ipd:Frames=ip;PulseWidth:Frames=pw;BINDINGKIT=102-739-100;SEQUENCINGKIT=102-118-800;BASECALLERVERSION=5.0;FRAMERATEHZ=100.000000;BarcodeFile=metadata/m84046_230828_225743_s2.barcodes.fasta;BarcodeHash=e7c4279103df8c8de7036efdbdca9008;BarcodeCount=113;BarcodeMode=Symmetric;BarcodeQuality=Score\tLB:SMACUBS146SV.SMALIRA4HLNS\tPU:m84046_230828_225743_s2\tSM:SMACUBS146SV\tPM:REVIO\tBC:ACGCACGTACGAGTAT\tCM:R/P1-C1/5.0-25M\n</pre>\n\n### Source Code\n\nAll the relevant code is accessible in the GitHub repository:\n\n  - [ReplaceReadGroups.sh](https://github.com/smaht-dac/pipelines-scripts/blob/main/processing_scripts/ReplaceReadGroups.sh) [ReplaceReadGroups]\n\n---\n\n## Methylation and Tags\n\nDuring the alignment process using pbmm2, methylation and other tags are linked from the unaligned to the alignment BAM file. The specific tags may vary depending on the method used to generate the data. The tags are defined below.\n\n### Fiber-seq\n\nFiber-seq<sup><sub>1</sub></sup> is a chromatin mapping technique that employs methyltransferases to mark accessible adenines in DNA with methyl groups. The chromatin structure (e.g., nucleosomes and bound transcription factors) is used as a \"stencil\" for the methyltransferase, mapping the structure of chromatin fibers onto the underlying DNA template. 
The positions of the methylated adenines are then used to infer DNA accessibility along the template, offering high-resolution insight into chromatin structure at nearly the single-molecule level.\n\nRaw PacBio HiFi data are processed through [fibertools-rs](https://github.com/fiberseq/fibertools-rs) to generate unaligned BAM files. fibertools-rs adds information to these files as tags, which pbmm2 links during alignment.\n\nFiber-seq generates the following tags:\n\n- **MM (mCpG and m6A Methylation Positions):** Positions of mCpG and m6A methylation along the read.\n- **ML (Methylation Precision Values):** Precision values for each mCpG and m6A methylation call.\n- **ns (Nucleosome Start Positions):** Start positions of identified nucleosomes along the read relative to the first base.\n- **nl (Nucleosome Lengths):** Lengths of each nucleosome along the read.\n- **as (Methyltransferase Accessible Patch Start Positions):** Start positions of identified methyltransferase accessible patches (MSP) along the read.\n- **al (MSP Lengths):** Lengths of each MSP along the read.\n\n<sub><b>1</b>: *Andrew B. Stergachis et al.* Single-molecule regulatory architectures captured by chromatin fiber sequencing. 
*Science 368, 1449-1454(2020).* doi: 10.1126/science.aaz1646</sub>", "filetype": "md"}], "consortia": [{"status": "open", "@type": ["Consortium", "Item"], "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "display_title": "SMaHT", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "identifier": "docs/additional-resources/pipeline-docs/long-read_pacbio_hifi", "date_created": "2026-01-09T20:07:38.631372+00:00", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:51:19.077089+00:00"}, "schema_version": "1", "table-of-contents": {"enabled": true, "skip-depth": 1, "header-depth": 2, "include-top-link": false}, "submission_centers": [{"@type": ["SubmissionCenter", "Item"], "status": "open", "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "display_title": "HMS DAC", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/docs/additional-resources/pipeline-docs/long-read_pacbio_hifi", "@type": ["DocsAdditional-resourcesPipeline-docsLong-read_pacbio_hifiPage", "DocsAdditional-resourcesPipeline-docsPage", "DocsAdditional-resourcesPage", "DocsPage", "StaticPage", "Portal"], "uuid": "ab58ce07-b8f9-4807-aa8a-d651b6cd733d", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "PacBio HiFi (long-read)", "@context": "/docs/additional-resources/pipeline-docs/long-read_pacbio_hifi", "is_leaf": true, "toc": {"enabled": true, "skip-depth": 1, "header-depth": 2, "include-top-link": false}, "next": {"identifier": "docs/additional-resources/pipeline-docs/long-read_oxford_nanopore", "title": "Oxford Nanopore (long-read)", "status": "open", "content": [{"uuid": "9cc46136-477d-4f6e-8751-94d41e6fb5cf", "display_title": "Oxford Nanopore (long-read)", "status": "open", 
"@id": "/static-sections/9cc46136-477d-4f6e-8751-94d41e6fb5cf/", "@type": ["StaticSection", "UserContent", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "consortia": [{"@type": ["Consortium", "Item"], "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "status": "open", "display_title": "SMaHT", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "date_created": "2026-01-09T20:07:38.786682+00:00", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:51:19.175913+00:00"}, "schema_version": "1", "table-of-contents": {"enabled": true, "skip-depth": 1, "header-depth": 2, "include-top-link": false}, "submission_centers": [{"display_title": "HMS DAC", "status": "open", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "@type": ["SubmissionCenter", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/docs/additional-resources/pipeline-docs/long-read_oxford_nanopore", "@type": ["DocsAdditional-resourcesPipeline-docsLong-read_oxford_nanoporePage", "DocsAdditional-resourcesPipeline-docsPage", "DocsAdditional-resourcesPage", "DocsPage", "StaticPage", "Portal"], "uuid": "041ca9a2-2b67-409c-ac50-4705f995fb99", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "Oxford Nanopore (long-read)", "is_leaf": true, "sibling_length": 11, "sibling_position": 3}, "previous": {"identifier": "docs/additional-resources/pipeline-docs/short-read_illumina_paired-end", "title": "Illumina (short-read)", "status": "open", "content": [{"uuid": "ce182c42-4f0e-4c1b-b325-213f730a0b42", "display_title": "Illumina (short-read)", "status": "open", "@id": "/static-sections/ce182c42-4f0e-4c1b-b325-213f730a0b42/", "@type": 
["StaticSection", "UserContent", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "consortia": [{"@type": ["Consortium", "Item"], "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "status": "open", "display_title": "SMaHT", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "date_created": "2026-01-09T20:07:38.467642+00:00", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:51:18.983582+00:00"}, "schema_version": "1", "table-of-contents": {"enabled": true, "skip-depth": 1, "header-depth": 2, "include-top-link": false}, "submission_centers": [{"display_title": "HMS DAC", "status": "open", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "@type": ["SubmissionCenter", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/docs/additional-resources/pipeline-docs/short-read_illumina_paired-end", "@type": ["DocsAdditional-resourcesPipeline-docsShort-read_illumina_paired-endPage", "DocsAdditional-resourcesPipeline-docsPage", "DocsAdditional-resourcesPage", "DocsPage", "StaticPage", "Portal"], "uuid": "1756229d-7498-444f-b574-983d3020f948", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "Illumina (short-read)", "is_leaf": true, "sibling_length": 11, "sibling_position": 1}, "parent": {"identifier": "docs/additional-resources/pipeline-docs", "parent": {"identifier": "docs/additional-resources", "parent": {"identifier": "docs", "parent": {"identifier": "", "@id": "/", "display_title": "Home", "@type": ["DirectoryPage", "StaticPage", "Portal"]}, "@id": "/docs", "uuid": "089319c4-3ce9-4ec1-bd0b-5451a48bd99e", "display_title": "Documentation", "@type": ["DocsPage", "DirectoryPage", "StaticPage", 
"Portal"], "sibling_length": 5, "sibling_position": 3}, "title": "Analysis & Additional Resources", "status": "open", "consortia": [{"status": "open", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "display_title": "SMaHT", "@type": ["Consortium", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "date_created": "2024-03-01T19:21:24.278212+00:00", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:51:22.981232+00:00"}, "schema_version": "1", "table-of-contents": {"enabled": true, "skip-depth": 1, "header-depth": 4, "include-top-link": false}, "submission_centers": [{"status": "open", "uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "display_title": "HMS DAC", "@type": ["SubmissionCenter", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/docs/additional-resources", "@type": ["DocsAdditional-resourcesPage", "DocsPage", "DirectoryPage", "StaticPage", "Portal"], "uuid": "1ada4fca-af4b-4304-947d-59e2918ab728", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "Analysis & Additional Resources", "sibling_length": 3, "sibling_position": 2}, "title": "Analysis Pipelines", "status": "open", "content": [{"@id": "/static-sections/b78b2ebb-d01c-4635-8a67-76a98ab81772/", "@type": ["StaticSection", "UserContent", "Item"], "status": "open", "uuid": "b78b2ebb-d01c-4635-8a67-76a98ab81772", "display_title": "Analysis Pipelines", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "redirect": {"code": 307, "enabled": false}, "consortia": [{"@id": "/consortia/358aed10-9b9d-4e26-ab84-4bd162da182b/", "uuid": "358aed10-9b9d-4e26-ab84-4bd162da182b", "display_title": "SMaHT", "@type": ["Consortium", "Item"], 
"status": "open", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "date_created": "2026-01-09T20:07:37.883644+00:00", "submitted_by": {"error": "no view permissions"}, "last_modified": {"modified_by": {"error": "no view permissions"}, "date_modified": "2026-04-17T19:51:18.780233+00:00"}, "schema_version": "1", "submission_centers": [{"uuid": "9626d82e-8110-4213-ac75-0a50adf890ff", "@id": "/submission-centers/9626d82e-8110-4213-ac75-0a50adf890ff/", "status": "open", "display_title": "HMS DAC", "@type": ["SubmissionCenter", "Item"], "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}}], "@id": "/docs/additional-resources/pipeline-docs", "@type": ["DocsAdditional-resourcesPipeline-docsPage", "DocsAdditional-resourcesPage", "DocsPage", "DirectoryPage", "StaticPage", "Portal"], "uuid": "6e144832-6abc-47e2-bea5-f720598cf61a", "principals_allowed": {"view": ["system.Everyone"], "edit": ["group.admin"]}, "display_title": "Analysis Pipelines", "sibling_length": 5, "sibling_position": 0}, "sibling_length": 11, "sibling_position": 2}