.. _user-guide:

==========
User Guide
==========

The following subcommands are available in gaftools

* :ref:`find_path <gaftools-find-path>` - Extracting base sequencing from node path.
* :ref:`gfa2rgfa <gaftools-gfa2rgfa>` - Converting GFA to rGFA using W lines.
* :ref:`index <gaftools-index>` - Indexing a GAF file for the view command.
* :ref:`order_gfa <gaftools-order-gfa>` - Ordering GFA file using BO and NO tags.
* :ref:`phase <gaftools-phase>` - Adding phase information of the reads from a haplotag TSV.
* :ref:`realign <gaftools-realign>` - Realigning GAF alignments using wavefront alignment.
* :ref:`sort <gaftools-sort>` - Sorting GAF alignments using BO and NO tags of the corresponding graph.
* :ref:`stat <gaftools-stat>` - Basic statistics of the GAF file.
* :ref:`view <gaftools-view>` - Viewing and subsetting the GAF file.

To get help for all the subcommands, please use::

    gaftools <subcommand> --help

Links to related resources:

* `rGFA and GAF documentation <https://github.com/lh3/gfatools/blob/master/doc/rGFA.md>`_
* `GFA v1.0 documentation <https://github.com/GFA-spec/GFA-spec/blob/master/GFA1.md>`_
* `GFA v2.0 documentation <https://github.com/GFA-spec/GFA-spec/blob/master/GFA2.md>`_

Some of the tags used in GAF files are derieved from :code:`minimap2` defined `here <https://lh3.github.io/minimap2/minimap2.html>`_

Requirement Summary by Subcommands
==================================

.. list-table::
  :widths: 33 33 33
  :header-rows: 1

  * - Subcommands
    - rGFA Required?
    - Ordering Required?
  * - find_path
    - No
    - No
  * - gfa2rgfa
    - No
    - No
  * - index
    - Yes
    - No
  * - order_gfa
    - No
    - No
  * - phase
    - No
    - No
  * - realign
    - No
    - No
  * - sort
    - Yes
    - Yes
  * - stat
    - No
    - No
  * - view
    - Yes
    - No

.. _gaftools-find-path:

gaftools find_path
==================

This subcommand retrieves the base seqeunce of paths in the given GFA.

Usage
-----

The :code:`find_path` subcommand takes 2 obligatory inputs, a GFA file and either a node path :code:`-p, --path` (e.g., :code:`">s82312<s82313"` (with the quotes))
or file path which has node paths each on a separate line with :code:`--paths_file`.
It returns the sequence of the path(s) by default but using the :code:`--fasta` flag, the sequences will be returned as a FASTA file.

If one of the paths does not exist, the process will terminate, the user can use :code:`-k, --keep-going` to skip the paths that do not exist and continue with the next path if available.

.. code-block:: none
  :caption: find_path arguments

  Find the genomic sequence of a given connected GFA path.

  positional arguments:
    GFA                   GFA file (can be bgzip-compressed)

  options:
    -h, --help            show this help message and exit
    -p, --path PATH       GFA node path to retrieve the sequence (e.g., ">s82312<s82313") with the quotes
    --paths-file PATHS_FILE
                          File containing the paths to retrieve the sequences for, each path on new line
    -k, --keep-going      Keep going after instead of stopping when a path does not exist
    -o, --output OUTPUT   Output file. If omitted, use standard output.
    -f, --fasta           Flag to output the sequence as a FASTA file with the seqeunce named seq_<node path>

.. _gaftools-gfa2rgfa:

gaftools gfa2rgfa
=================

This subcommand creates rGFA equivalent of GFA files using the W lines present in the GFA.

For creating the rGFA tags for all the reference nodes, the minimum requirements (as specified below) should be given. We assume that the pangenome graph is based on a base reference assembly (as is the case with minigraph-based graph constructions methods).

For further tagging the non-reference nodes coming from assemblies added to the reference assembly, the :code:`seqfile` can be provided and corresponding W lines should be available in the GFA (refer to 'Format of seqfile').
If the :code:`seqfile` is not provided, then the order in which walks are given in the GFA will be used to tag the non-reference nodes. Users can provide their own order of assemblies in the seqfile, and the coordinates will be generated accordingly.

**Note**: If the proper :code:`seqfile` is not provided, the tagging will not show the coordinates intended by the graph creator since the order of assemblies might be different.

Any node not found in the W lines will be tagged with the :code:`SN` tag as `unknown`, the :code:`SO` tag as `0`, and the :code:`SN` tag as `-1`.

If the GFA already has the reference nodes tagged (as is the case of the GFA generated by minigraph-cactus), then the :code:`--reference-tagged` flag can be provided and they will not be retagged.

Minimum Requirements
--------------------

* The GFA should have a reference path defined in the Header lines with :code:`RS:Z` tag or given by the user with :code:`--reference-name`. There should also be W-lines for the reference path.
* The reference path described by the W line should be cycle-free and oriented in the forward direction (path of the format :code:`>node_1>node_2....>node_N`)

Note: From the Human Pangenome Reference Consortium, the graphs generated using minigraph-cactus follow these requirements. The PGGB graphs do not.

Format of seqfile
-----------------

The seqfile asked by the subcommand has two columns separated by a tab: :code:`<assembly name> <tab> <path to the assembly>`.
This file is part of the minigraph-cactus graph generation pipeline (in case the user is attempting to convert the GFA of that pipeline).

The assembly names should be of the form :code:`<sample name>.<assembly number>` which should correspond to W lines starting with :code:`W <tab> <sample name> <tab> <assembly name> ...`

In the case that the assembly is not based on a sample (as is the case of the chm13-minigraph-cactus graph which has a W line for GRCh38 added later and vice-versa for the grch38-minigraph-catus graph), the seqfile should contain a line for :code:`GRCh38` and GFA should have a corresponding W line of format :code:`W <tab> GRCh38 <tab> 0 ...`

Users can provide their own order of assemblies and the coordinates will be generated accordingly. The reference path should be cycle-free.

**Note**: The code does not require the actual assemblies. So the assemblies provided in the seqfile do not need to be downloaded.


Usage
-----

The :code:`gfa2rgfa` subcommand takes 1 obligatory input, a GFA file. It creates a rGFA file with appropriate tags.

.. code-block:: none
  :caption: gfa2rgfa arguments

  usage: gaftools gfa2rgfa [-h] [--reference-name REFERENCE NAME] [--override-reference] [--seqfile SEQFILE] [--output rGFA] GFA

  Converting a GFA file to rGFA format using the W-lines and the acyclic reference path. (e.g., minigraph-based graphs)

  positional arguments:
    GFA                   GFA file (can be bgzip-compressed). This GFA should have a W-line corresponding to the reference genome or the reference nodes have to be tagged already.

  options:
    -h, --help            show this help message and exit
    --reference-name REFERENCE NAME
                          The name of the reference genome given in the W-line. If --reference-name is not specified, it looks for reference name in H lines.
    --override-reference  Flag to override any nodes tagged in the GFA as reference nodes.
    --seqfile SEQFILE     File containing the sequence in which assemblies were given. Provide the seqfile given as part of running minigraph-cactus (only the first column is required). It has the format: <assembly_name><tab><assembly_path>. If not provided, the order of assemblies will be taken from
                          the GFA file. User-defined order of assemblies can also be given. There should be W lines for each assembly in the GFA.
    --output rGFA         Output rGFA (bgzipped if the file ends with .gz). If omitted, use standard output.


.. _gaftools-index:

gaftools index
==============

This subcommand creates a index file with the extension :code:`.gvi` which is used by the :code:`view` command to subset alignments.

View Index Structure
--------------------

The index is a reverse look-up table with the keys being nodes in the graph and the values being the location of the alignments which have the nodes.
The index is stored as a Python dictionary with the node IDs (as strings) as keys and a list of offset values (as integers) as values for the key.
This dictionary is written to memory using the :code:`pickle` module.

.. code-block:: none
  :caption: view index structure

  {
      "node_1": [offset_1, offset_2, ...],
      "node_2": [offset_3, offset_4, ...],
      ...
  }

Usage
-----

The :code:`index` subcommand takes 2 obligatory inputs, a GAF alignment file and its corresponding rGFA graph. It creates an viewing index file with the
extension :code:`.gvi` which is required by the :code:`view` command.

.. code-block:: none
  :caption: index arguments

  usage: gaftools index [-h] [-o OUTPUT] GAF rGFA

  Index a GAF file for the view functionality.

  This script creates an inverse look-up table where:
      - key: the node information
      - value: the offsets in the GAF file where the node is present

  positional arguments:
    GAF                  GAF file (can be bgzip-compressed)
    rGFA                 rGFA file (can be bgzip-compressed)

  options:
    -h, --help           show this help message and exit
    -o, --output OUTPUT  Output GAF View Index (GVI) file. Default: <GAF file>.gvi

.. _gaftools-order-gfa:

gaftools order_gfa
==================

This subcommand establishes an order to the graph based on the "bubbles" in the graph.
Here, we define the bubbles as biconnected components, i.e. not the strict definition of a bubble found in other papers.

The graph input here has to be an `rGFA <https://github.com/lh3/gfatools/blob/master/doc/rGFA.md>`_, with SN and SO tags.

The basic idea here is that we first detect all biconnected components (bubbles), and collapse the detected bubbles into one node,
which should result in a line graph made from scaffold nodes that belong to the reference and the collapsed bubbles inbetween. We then
order this line graph in ascending order based on the coordinates in the SO tag. Each node in this ordered line graph
gets an ascending BO tag from 1 to N (N being the number of nodes in the line graph). For the nodes that represent a collapsed
bubbles, all the nodes in that bubble will get the same BO tag (Figure 1). As for the NO tag, the nodes in a bubble get an ascending
number from 1 to M (M being the number of nodes in a bubble). However, the NO tag inside a bubble is assigned based on the node id order, i.e.
in a lexicographic order of the node IDs.
In the graph shown below, which is a part of a longer graph, when the bubbles collapsed,
this will result in a line graph of 9 nodes.

Below we see a chain of 4 bubbles (biconnected components) and 5 scaffold nodes, where the nodes inside
the bubbles are colored blue and the scaffold nodes are colored yellow. The numbers of the node is the
BO tag, where it increases by 1 starting from the first scaffold node on the left (19 to 27), and we see that
all the nodes in a bubble have the same BO tag

.. image:: _static/bo_tags.png
    :width: 600

In this figure, we see the same graph but with the NO tags marked on the nodes. Scaffold nodes always
have a NO tag of 0, and the nodes inside a bubble are marked with an increasing order of the NO tag.

.. image:: _static/no_tags.png
    :width: 600

Note: The :code:`order_gfa` subcommand does not change the name of the nodes and only adds the BO and NO tags as additional information.


Non-linear scaffold graphs:
---------------------------

In the cases where after collapsing the bi-connected component, the scaffold graph is not a line graph, the user can use
:code:`--ignore-branching` to enforce an ordering where only the linear traversals in the scaffold graph where the articulation points
have an SN tag matching the reference will be ordered and branches that belong to other haplotypes will be ignored, i.e., give BO and NO tag of -1.

Usage
-----
The :code:`order_gfa` subcommand takes an rGFA as an obligatory input to order.
Optionally, the user can specify the order/subset of the chromosomes in the output GFA using the :code:`--chromosome-order` option.
By default, the chromosomes are ordered in a lexicographic order of the component names.
With the :code:`--by-chrom` flag, all the chromosomal graphs are output separately.
Users can also specify an output directory.

The outputs of :code:`order_gfa` are separate rGFA graphs for each chromosome and a graph for all chromosomes both ordered by S lines first then L lines, and the S lines are ordered by
their BO tag then NO tag, also will output a CSV file with node colors similar to the figure above that works with Bandage.

.. code-block:: none
  :caption: order_gfa arguments

  usage: gaftools order_gfa [-h] [--chromosome-order CHROMOSOME_ORDER] [--without-sequence] [--outdir OUTDIR] [--by-chrom] [--ignore-branching] [--output-scaffold] GRAPH

  Order bubbles in the GFA by adding BO and NO tags.

  The BO (Bubble Order) tags order bubbles in the GFA.
  The NO (Node Order) tags order the nodes in a bubble (in a lexicographic order).

  positional arguments:
    GRAPH                 Input rGFA file

  options:
    -h, --help            show this help message and exit
    --chromosome-order CHROMOSOME_ORDER
                          Order in which to arrange chromosomes in terms of BO sorting. By default, it is arranged in the lexicographic order of identified component names. Expecting comma-separated list. Example: 'chr1,chr2,chr3'
    --without-sequence    Strip sequences from the output graph (for less memory usage and easier visualization). Default = False
    --outdir OUTDIR       Output Directory to store all the GFA and CSV files. Default location is a "out" folder from the directory of execution.
    --by-chrom            Outputs each chromosome as a separate GFA, otherwise, all chromosomes in one GFA file
    --ignore-branching    Force the order even when branching paths occur in the scaffold graph. Alternative alleles will not be ordered
    --output-scaffold     Output the scaffold graph in GFA format. The scaffold graph is the graph created from collapsing all the biconnected components.

.. _gaftools-phase:

gaftools phase
==============

This subcommands adds the phase information of the GAF reads from a haplotag TSV file generated using
:code:`whatshap haplotag`.

Usage
-----

The :code:`phase` subcommand takes 2 obligatory inputs, a GAF alignment file and a haplotag TSV file generated from :code:`whatshap haplotag`.
The TSV file has tags for each read labelled as `H1`, `H2`, or `none` for reads that has been determined to belong to the first haplotpye,
second haplotype or unknown haplotype. It also has the `PS` tag which is the `phaseset` that the read is in. Refer to WhatsHap documentation for
further details.

The :code:`phase` command adds these tags to the GAF file so that downstream processes can utilize them.

.. code-block:: none
  :caption: phase arguments

  usage: gaftools phase [-h] [-o OUTPUT] GAF TSV

  Add phasing information to the GAF file from a haplotag TSV file.

  The script uses the TSV file containing the haplotag information generated from WhatsHap's haplotag command.
  The H1 and H2 labels for each read are then added to the reads in the GAF file.

  positional arguments:
    GAF                  GAF file (can be bgzip-compressed)
    TSV                  WhatsHap haplotag TSV file. Refer to https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag

  options:
    -h, --help           show this help message and exit
    -o, --output OUTPUT  Output GAF file (bgzipped if the file ends with .gz). If omitted, output is directed to stdout.


.. _gaftools-realign:

gaftools realign
================

This subcommand realigns all the alignments in GAF back the rGFA it was originally aligned to using Wavefront Alignment.
This fixes alignment issues found in GraphAligner where large indels are represented as a series of small indels in the
CIGAR string.

Usage
-----
The :code:`realign` subcommand takes 3 obligatory input files, the GAF alignments, the rGFA graph that was used for the alignments,
and the reads that correspond to the alignments in the GAF file.

Due to the high memory consumption of :code:`pyWFA` with longer alignments, :code:`gaftools` limits the alignments to 60,000 base pairs in length
and  the alignments that are longer will be outputted as is from the input file.
Moreover, :code:`realign` can be sped up by using more cores. However, for longer alignments, memory can peak substantially, so users should be aware
that they need to maybe use a cluster with sufficient memory. For example, we tested alignments that were between 50,000 and 60,000 bp long, and when
:code:`gaftools` was given 10 cores, the memory peaked to around 100 Gb at certain points, and with 1 core, it peaked at around 20 Gb. In case one of the subprocesses gets killed
by the system due to high memory consumption, the realignment run will be aborted.


.. code-block:: none
  :caption: realign arguments

  usage: gaftools realign [-h] [-o OUTPUT] [-c CORES] GAF GFA READS

  Realign a GAF file using wavefront alignment algorithm (WFA).

  positional arguments:
    GAF                  GAF file (can be bgzip-compressed)
    GFA                  GFA file (can be bgzip-compressed)
    READS                Reads file in FASTA or FASTQ format (optionally gzipped)

  options:
    -h, --help           show this help message and exit
    -o, --output OUTPUT  Output GAF file (bgzipped if the file ends with .gz). If omitted, use standard output.
    -c, --cores CORES    Number of cores to use for alignments.


.. _gaftools-sort:

gaftools sort
=============

This subcommand sorts the alignments in the GAF file using the BO and NO tags generated by :code:`gaftools order_gfa`. Hence this
subcommand requires initial processing of the rGFA with :code:`order_gfa`.
Along with a sorted GAF file, the command also creates a sorting index based on the command flags.

Sort Index Structure
--------------------

The current index is just pointers to the start and end position of alignments to chromosomes in the sorted GAF file.

The index is stored as a Python dictionary with the chromosome names (as strings) as keys and the start offset and end offset of the alignments in the GAF file as values.
This dictionary is written to memory using the :code:`pickle` module.

.. code-block:: none
  :caption: sort index structure

  {
      "chr1": [start_offset_1, end_offset_1],
      "chr2": [start_offset_2, end_offset_2],
      ...
  }

Usage
-----

The :code:`sort` subcommand takes 2 obligatory input files, the GAF alignments, the rGFA graph that was used for the alignments.
The rGFA graph should have been processed by the :code:`order_gfa` command and has the BO and NO tags.

The output GAF file is sorted based on the BO and NO tags of the rGFA graph and has additional tags added to each alignment. The
tags in the GAF are :code:`bo:i` (which is the BO tag of the first node of the alignment), :code:`sn:Z`
(which is the name of the reference chromosome the read aligned to), and :code:`iv:i` (which is 1 is the alignment has an inversion).

By default, the GAF is outputted in :code:`stdout` and without a sorting index. The flags :code:`--outgaf` and :code:`--outind`
can be used to provide path to the output GAF and its sorting index. If no :code:`--outind` is given, the command automatically creates one
with the name of the GAF file provided and :code:`.gsi` extension.

.. code-block:: none
  :caption: sort arguments

  usage: gaftools sort [-h] [--outgaf OUTGAF] [--outind OUTIND] GAF GFA

  Sort the GAF alignments using BO and NO tags of the corresponding graph.

  positional arguments:
    GAF              GAF File (can be bgzip-compressed)
    GFA              GFA file with the sort keys (BO and NO tagged). This is done with gaftools order_gfa

  options:
    -h, --help       show this help message and exit
    --outgaf OUTGAF  Output GAF (bgzipped if the file ends with .gz). If omitted, use standard output.
    --outind OUTIND  Output GAF Sorting Index file. When --outgaf is not given, no index is created. If it is given and --outind is not specified, it will have same file name with .gsi extension)


.. _gaftools-stat:

gaftools stat
=============

This subcommand returns basic statistics of the GAF alignments like number of primary and secondary alignments, total aligned bases,
average mapping quality, etc.

Usage
-----

The :code:`stat` subcommand takes 1 obligatory inputs, a GAF alignment file. It outputs statistics for the GAF file in the
:code:`stdout` by default. The :code:`--cigar` flag can be provided for more detailed statistics but requires more time.

.. code-block:: none
  :caption: index arguments

  usage: gaftools stat [-h] [-o OUTPUT] [--cigar] GAF

  Calculate statistics of the given GAF file.

  positional arguments:
    GAF                  GAF file (can be bgzip-compressed)

  options:
    -h, --help           show this help message and exit
    -o, --output OUTPUT  Output file. If omitted, use standard output.
    --cigar              Outputs cigar related statistics (requires more time)


.. _gaftools-view:

gaftools view
=============

This subcommand helps view the GAF alignments, convert formatting from stable to unstable and vice-versa, and subsetting
the files based on nodes or regions given by the user.

Note: The subcommand currently does not support the :code:`ds` tag that has been introduced in minigraph v0.21. The subcommand ignores that tag and outputs the rest of the fields.

Usage
-----

The :code:`view` subcommand takes 1 obligatory input, the GAF alignment file. But for full functionaility, it requires
the rGFA file which was used for the alignment and the index file created with :code:`gaftools index`.

By only providing the GAF file, the :code:`view` command can output the entire file.

Additional files required for certain functionality:
 * Subsetting alignments based on region (given with :code:`--region` or :code:`-r`) or node (given with :code:`--node` or :code:`-n`) requires the viewing index generated by :code:`gaftools index`.
 * Converting coordinate system with :code:`--format` or :code:`-f` requires an GFA file.

Modes of selecting alignments (when giving multiple nodes or regions):
 * Union selects alignments that align to at least one of the nodes or regions.
 * Intersection selects alignments that align to all of the nodes or regions.

.. code-block:: none
  :caption: view arguments

  usage: gaftools view [-h] [-g GFA] [-o OUTPUT] [-i INDEX] [-n NODE] [-r REGION] [-f FORMAT] [-m MODE] GAF

  View, subset or convert a GAF file (GAF file should be indexed first, using gaftools index, for subsetting).

  The view command allows subsetting the GAF file based on node IDs or regions available.

  positional arguments:
    GAF                  GAF file (can be bgzip-compressed)

  options:
    -h, --help           show this help message and exit
    -g, --gfa GFA        GFA file (can be bgzip-compressed). Required when converting from one coordinate system to another.
    -o, --output OUTPUT  Output GAF (bgzipped if the file ends with .gz). If omitted, use standard output.
    -i, --index INDEX    Path to GAF Viewing Index file. Required when subsetting alignments based on region or nodes. This index is created using gaftools index. If path is not provided, it is assumed to be in the same directory as GAF file with the same name and .gvi extension (default location of
                        the index script)
    -n, --node NODE      Nodes to search. Multiple can be provided (Eg. gaftools view .... -n s1 -n s2 -n s3 .....).
    -r, --region REGION  Regions to search. Multiple can be provided (Eg. gaftools view .... -r chr1:10-20 -r chr1:50-60 .....).
    -f, --format FORMAT  Format of output path (unstable | stable)
    -m, --mode MODE      Mode of selecting alignment with multiple regions/nodes provided ("I" | "U"). Choose "I" for finding alignments with all provided regions/nodes (intersection) and "U" for finding alignments with at least one of the provided regions/nodes (union). (default: U)