User Guide

The following subcommands are available in gaftools

find_path - Extracting base sequencing from node path.
gfa2rgfa - Converting GFA to rGFA using W lines.
index - Indexing a GAF file for the view command.
order_gfa - Ordering GFA file using BO and NO tags.
phase - Adding phase information of the reads from a haplotag TSV.
realign - Realigning GAF alignments using wavefront alignment.
sort - Sorting GAF alignments using BO and NO tags of the corresponding graph.
stat - Basic statistics of the GAF file.
view - Viewing and subsetting the GAF file.

To get help for all the subcommands, please use:

gaftools <subcommand> --help

Links to related resources:

Some of the tags used in GAF files are derieved from minimap2 defined here

Requirement Summary by Subcommands

Subcommands	rGFA Required?	Ordering Required?
find_path	No	No
gfa2rgfa	No	No
index	Yes	No
order_gfa	No	No
phase	No	No
realign	No	No
sort	Yes	Yes
stat	No	No
view	Yes	No

gaftools find_path

This subcommand retrieves the base seqeunce of paths in the given GFA.

Usage

The find_path subcommand takes 2 obligatory inputs, a GFA file and either a node path -p, --path (e.g., ">s82312<s82313" (with the quotes)) or file path which has node paths each on a separate line with --paths_file. It returns the sequence of the path(s) by default but using the --fasta flag, the sequences will be returned as a FASTA file.

If one of the paths does not exist, the process will terminate, the user can use -k, --keep-going to skip the paths that do not exist and continue with the next path if available.

find_path arguments

Find the genomic sequence of a given connected GFA path.

positional arguments:
  GFA                   GFA file (can be bgzip-compressed)

options:
  -h, --help            show this help message and exit
  -p, --path PATH       GFA node path to retrieve the sequence (e.g., ">s82312<s82313") with the quotes
  --paths-file PATHS_FILE
                        File containing the paths to retrieve the sequences for, each path on new line
  -k, --keep-going      Keep going after instead of stopping when a path does not exist
  -o, --output OUTPUT   Output file. If omitted, use standard output.
  -f, --fasta           Flag to output the sequence as a FASTA file with the seqeunce named seq_<node path>

gaftools gfa2rgfa

This subcommand creates rGFA equivalent of GFA files using the W lines present in the GFA.

For creating the rGFA tags for all the reference nodes, the minimum requirements (as specified below) should be given. We assume that the pangenome graph is based on a base reference assembly (as is the case with minigraph-based graph constructions methods).

For further tagging the non-reference nodes coming from assemblies added to the reference assembly, the seqfile can be provided and corresponding W lines should be available in the GFA (refer to ‘Format of seqfile’). If the seqfile is not provided, then the order in which walks are given in the GFA will be used to tag the non-reference nodes. Users can provide their own order of assemblies in the seqfile, and the coordinates will be generated accordingly.

Note: If the proper seqfile is not provided, the tagging will not show the coordinates intended by the graph creator since the order of assemblies might be different.

Any node not found in the W lines will be tagged with the SN tag as unknown, the SO tag as 0, and the SN tag as -1.

If the GFA already has the reference nodes tagged (as is the case of the GFA generated by minigraph-cactus), then the --reference-tagged flag can be provided and they will not be retagged.

Minimum Requirements

The GFA should have a reference path defined in the Header lines with RS:Z tag or given by the user with --reference-name. There should also be W-lines for the reference path.
The reference path described by the W line should be cycle-free and oriented in the forward direction (path of the format >node_1>node_2....>node_N)

Note: From the Human Pangenome Reference Consortium, the graphs generated using minigraph-cactus follow these requirements. The PGGB graphs do not.

Format of seqfile

The seqfile asked by the subcommand has two columns separated by a tab: <assembly name> <tab> <path to the assembly>. This file is part of the minigraph-cactus graph generation pipeline (in case the user is attempting to convert the GFA of that pipeline).

The assembly names should be of the form <sample name>.<assembly number> which should correspond to W lines starting with W <tab> <sample name> <tab> <assembly name> ...

In the case that the assembly is not based on a sample (as is the case of the chm13-minigraph-cactus graph which has a W line for GRCh38 added later and vice-versa for the grch38-minigraph-catus graph), the seqfile should contain a line for GRCh38 and GFA should have a corresponding W line of format W <tab> GRCh38 <tab> 0 ...

Users can provide their own order of assemblies and the coordinates will be generated accordingly. The reference path should be cycle-free.

Note: The code does not require the actual assemblies. So the assemblies provided in the seqfile do not need to be downloaded.

Usage

The gfa2rgfa subcommand takes 1 obligatory input, a GFA file. It creates a rGFA file with appropriate tags.

gfa2rgfa arguments

usage: gaftools gfa2rgfa [-h] [--reference-name REFERENCE NAME] [--override-reference] [--seqfile SEQFILE] [--output rGFA] GFA

Converting a GFA file to rGFA format using the W-lines and the acyclic reference path. (e.g., minigraph-based graphs)

positional arguments:
  GFA                   GFA file (can be bgzip-compressed). This GFA should have a W-line corresponding to the reference genome or the reference nodes have to be tagged already.

options:
  -h, --help            show this help message and exit
  --reference-name REFERENCE NAME
                        The name of the reference genome given in the W-line. If --reference-name is not specified, it looks for reference name in H lines.
  --override-reference  Flag to override any nodes tagged in the GFA as reference nodes.
  --seqfile SEQFILE     File containing the sequence in which assemblies were given. Provide the seqfile given as part of running minigraph-cactus (only the first column is required). It has the format: <assembly_name><tab><assembly_path>. If not provided, the order of assemblies will be taken from
                        the GFA file. User-defined order of assemblies can also be given. There should be W lines for each assembly in the GFA.
  --output rGFA         Output rGFA (bgzipped if the file ends with .gz). If omitted, use standard output.

gaftools index

This subcommand creates a index file with the extension .gvi which is used by the view command to subset alignments.

View Index Structure

The index is a reverse look-up table with the keys being nodes in the graph and the values being the location of the alignments which have the nodes. The index is stored as a Python dictionary with the node IDs (as strings) as keys and a list of offset values (as integers) as values for the key. This dictionary is written to memory using the pickle module.

view index structure

{
    "node_1": [offset_1, offset_2, ...],
    "node_2": [offset_3, offset_4, ...],
    ...
}

Usage

The index subcommand takes 2 obligatory inputs, a GAF alignment file and its corresponding rGFA graph. It creates an viewing index file with the extension .gvi which is required by the view command.

index arguments

usage: gaftools index [-h] [-o OUTPUT] GAF rGFA

Index a GAF file for the view functionality.

This script creates an inverse look-up table where:
    - key: the node information
    - value: the offsets in the GAF file where the node is present

positional arguments:
  GAF                  GAF file (can be bgzip-compressed)
  rGFA                 rGFA file (can be bgzip-compressed)

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output GAF View Index (GVI) file. Default: <GAF file>.gvi

gaftools order_gfa

This subcommand establishes an order to the graph based on the “bubbles” in the graph. Here, we define the bubbles as biconnected components, i.e. not the strict definition of a bubble found in other papers.

The graph input here has to be an rGFA, with SN and SO tags.

The basic idea here is that we first detect all biconnected components (bubbles), and collapse the detected bubbles into one node, which should result in a line graph made from scaffold nodes that belong to the reference and the collapsed bubbles inbetween. We then order this line graph in ascending order based on the coordinates in the SO tag. Each node in this ordered line graph gets an ascending BO tag from 1 to N (N being the number of nodes in the line graph). For the nodes that represent a collapsed bubbles, all the nodes in that bubble will get the same BO tag (Figure 1). As for the NO tag, the nodes in a bubble get an ascending number from 1 to M (M being the number of nodes in a bubble). However, the NO tag inside a bubble is assigned based on the node id order, i.e. in a lexicographic order of the node IDs. In the graph shown below, which is a part of a longer graph, when the bubbles collapsed, this will result in a line graph of 9 nodes.

Below we see a chain of 4 bubbles (biconnected components) and 5 scaffold nodes, where the nodes inside the bubbles are colored blue and the scaffold nodes are colored yellow. The numbers of the node is the BO tag, where it increases by 1 starting from the first scaffold node on the left (19 to 27), and we see that all the nodes in a bubble have the same BO tag

In this figure, we see the same graph but with the NO tags marked on the nodes. Scaffold nodes always have a NO tag of 0, and the nodes inside a bubble are marked with an increasing order of the NO tag.

Note: The order_gfa subcommand does not change the name of the nodes and only adds the BO and NO tags as additional information.

Non-linear scaffold graphs:

In the cases where after collapsing the bi-connected component, the scaffold graph is not a line graph, the user can use --ignore-branching to enforce an ordering where only the linear traversals in the scaffold graph where the articulation points have an SN tag matching the reference will be ordered and branches that belong to other haplotypes will be ignored, i.e., give BO and NO tag of -1.

Usage

The order_gfa subcommand takes an rGFA as an obligatory input to order. Optionally, the user can specify the order/subset of the chromosomes in the output GFA using the --chromosome-order option. By default, the chromosomes are ordered in a lexicographic order of the component names. With the --by-chrom flag, all the chromosomal graphs are output separately. Users can also specify an output directory.

The outputs of order_gfa are separate rGFA graphs for each chromosome and a graph for all chromosomes both ordered by S lines first then L lines, and the S lines are ordered by their BO tag then NO tag, also will output a CSV file with node colors similar to the figure above that works with Bandage.

order_gfa arguments

usage: gaftools order_gfa [-h] [--chromosome-order CHROMOSOME_ORDER] [--without-sequence] [--outdir OUTDIR] [--by-chrom] [--ignore-branching] [--output-scaffold] GRAPH

Order bubbles in the GFA by adding BO and NO tags.

The BO (Bubble Order) tags order bubbles in the GFA.
The NO (Node Order) tags order the nodes in a bubble (in a lexicographic order).

positional arguments:
  GRAPH                 Input rGFA file

options:
  -h, --help            show this help message and exit
  --chromosome-order CHROMOSOME_ORDER
                        Order in which to arrange chromosomes in terms of BO sorting. By default, it is arranged in the lexicographic order of identified component names. Expecting comma-separated list. Example: 'chr1,chr2,chr3'
  --without-sequence    Strip sequences from the output graph (for less memory usage and easier visualization). Default = False
  --outdir OUTDIR       Output Directory to store all the GFA and CSV files. Default location is a "out" folder from the directory of execution.
  --by-chrom            Outputs each chromosome as a separate GFA, otherwise, all chromosomes in one GFA file
  --ignore-branching    Force the order even when branching paths occur in the scaffold graph. Alternative alleles will not be ordered
  --output-scaffold     Output the scaffold graph in GFA format. The scaffold graph is the graph created from collapsing all the biconnected components.

gaftools phase

This subcommands adds the phase information of the GAF reads from a haplotag TSV file generated using whatshap haplotag.

Usage

The phase subcommand takes 2 obligatory inputs, a GAF alignment file and a haplotag TSV file generated from whatshap haplotag. The TSV file has tags for each read labelled as H1, H2, or none for reads that has been determined to belong to the first haplotpye, second haplotype or unknown haplotype. It also has the PS tag which is the phaseset that the read is in. Refer to WhatsHap documentation for further details.

The phase command adds these tags to the GAF file so that downstream processes can utilize them.

phase arguments

usage: gaftools phase [-h] [-o OUTPUT] GAF TSV

Add phasing information to the GAF file from a haplotag TSV file.

The script uses the TSV file containing the haplotag information generated from WhatsHap's haplotag command.
The H1 and H2 labels for each read are then added to the reads in the GAF file.

positional arguments:
  GAF                  GAF file (can be bgzip-compressed)
  TSV                  WhatsHap haplotag TSV file. Refer to https://whatshap.readthedocs.io/en/latest/guide.html#whatshap-haplotag

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output GAF file (bgzipped if the file ends with .gz). If omitted, output is directed to stdout.

gaftools realign

This subcommand realigns all the alignments in GAF back the rGFA it was originally aligned to using Wavefront Alignment. This fixes alignment issues found in GraphAligner where large indels are represented as a series of small indels in the CIGAR string.

Usage

The realign subcommand takes 3 obligatory input files, the GAF alignments, the rGFA graph that was used for the alignments, and the reads that correspond to the alignments in the GAF file.

Due to the high memory consumption of pyWFA with longer alignments, gaftools limits the alignments to 60,000 base pairs in length and the alignments that are longer will be outputted as is from the input file. Moreover, realign can be sped up by using more cores. However, for longer alignments, memory can peak substantially, so users should be aware that they need to maybe use a cluster with sufficient memory. For example, we tested alignments that were between 50,000 and 60,000 bp long, and when gaftools was given 10 cores, the memory peaked to around 100 Gb at certain points, and with 1 core, it peaked at around 20 Gb. In case one of the subprocesses gets killed by the system due to high memory consumption, the realignment run will be aborted.

realign arguments

usage: gaftools realign [-h] [-o OUTPUT] [-c CORES] GAF GFA READS

Realign a GAF file using wavefront alignment algorithm (WFA).

positional arguments:
  GAF                  GAF file (can be bgzip-compressed)
  GFA                  GFA file (can be bgzip-compressed)
  READS                Reads file in FASTA or FASTQ format (optionally gzipped)

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output GAF file (bgzipped if the file ends with .gz). If omitted, use standard output.
  -c, --cores CORES    Number of cores to use for alignments.

gaftools sort

This subcommand sorts the alignments in the GAF file using the BO and NO tags generated by gaftools order_gfa. Hence this subcommand requires initial processing of the rGFA with order_gfa. Along with a sorted GAF file, the command also creates a sorting index based on the command flags.

Sort Index Structure

The current index is just pointers to the start and end position of alignments to chromosomes in the sorted GAF file.

The index is stored as a Python dictionary with the chromosome names (as strings) as keys and the start offset and end offset of the alignments in the GAF file as values. This dictionary is written to memory using the pickle module.

sort index structure

{
    "chr1": [start_offset_1, end_offset_1],
    "chr2": [start_offset_2, end_offset_2],
    ...
}

Usage

The sort subcommand takes 2 obligatory input files, the GAF alignments, the rGFA graph that was used for the alignments. The rGFA graph should have been processed by the order_gfa command and has the BO and NO tags.

The output GAF file is sorted based on the BO and NO tags of the rGFA graph and has additional tags added to each alignment. The tags in the GAF are bo:i (which is the BO tag of the first node of the alignment), sn:Z (which is the name of the reference chromosome the read aligned to), and iv:i (which is 1 is the alignment has an inversion).

By default, the GAF is outputted in stdout and without a sorting index. The flags --outgaf and --outind can be used to provide path to the output GAF and its sorting index. If no --outind is given, the command automatically creates one with the name of the GAF file provided and .gsi extension.

sort arguments

usage: gaftools sort [-h] [--outgaf OUTGAF] [--outind OUTIND] GAF GFA

Sort the GAF alignments using BO and NO tags of the corresponding graph.

positional arguments:
  GAF              GAF File (can be bgzip-compressed)
  GFA              GFA file with the sort keys (BO and NO tagged). This is done with gaftools order_gfa

options:
  -h, --help       show this help message and exit
  --outgaf OUTGAF  Output GAF (bgzipped if the file ends with .gz). If omitted, use standard output.
  --outind OUTIND  Output GAF Sorting Index file. When --outgaf is not given, no index is created. If it is given and --outind is not specified, it will have same file name with .gsi extension)

gaftools stat

This subcommand returns basic statistics of the GAF alignments like number of primary and secondary alignments, total aligned bases, average mapping quality, etc.

Usage

The stat subcommand takes 1 obligatory inputs, a GAF alignment file. It outputs statistics for the GAF file in the stdout by default. The --cigar flag can be provided for more detailed statistics but requires more time.

index arguments

usage: gaftools stat [-h] [-o OUTPUT] [--cigar] GAF

Calculate statistics of the given GAF file.

positional arguments:
  GAF                  GAF file (can be bgzip-compressed)

options:
  -h, --help           show this help message and exit
  -o, --output OUTPUT  Output file. If omitted, use standard output.
  --cigar              Outputs cigar related statistics (requires more time)

gaftools view

This subcommand helps view the GAF alignments, convert formatting from stable to unstable and vice-versa, and subsetting the files based on nodes or regions given by the user.

Note: The subcommand currently does not support the ds tag that has been introduced in minigraph v0.21. The subcommand ignores that tag and outputs the rest of the fields.

Usage

The view subcommand takes 1 obligatory input, the GAF alignment file. But for full functionaility, it requires the rGFA file which was used for the alignment and the index file created with gaftools index.

By only providing the GAF file, the view command can output the entire file.

Additional files required for certain functionality:

Subsetting alignments based on region (given with --region or -r) or node (given with --node or -n) requires the viewing index generated by gaftools index.
Converting coordinate system with --format or -f requires an GFA file.

Modes of selecting alignments (when giving multiple nodes or regions):

Union selects alignments that align to at least one of the nodes or regions.
Intersection selects alignments that align to all of the nodes or regions.

view arguments

usage: gaftools view [-h] [-g GFA] [-o OUTPUT] [-i INDEX] [-n NODE] [-r REGION] [-f FORMAT] [-m MODE] GAF

View, subset or convert a GAF file (GAF file should be indexed first, using gaftools index, for subsetting).

The view command allows subsetting the GAF file based on node IDs or regions available.

positional arguments:
  GAF                  GAF file (can be bgzip-compressed)

options:
  -h, --help           show this help message and exit
  -g, --gfa GFA        GFA file (can be bgzip-compressed). Required when converting from one coordinate system to another.
  -o, --output OUTPUT  Output GAF (bgzipped if the file ends with .gz). If omitted, use standard output.
  -i, --index INDEX    Path to GAF Viewing Index file. Required when subsetting alignments based on region or nodes. This index is created using gaftools index. If path is not provided, it is assumed to be in the same directory as GAF file with the same name and .gvi extension (default location of
                      the index script)
  -n, --node NODE      Nodes to search. Multiple can be provided (Eg. gaftools view .... -n s1 -n s2 -n s3 .....).
  -r, --region REGION  Regions to search. Multiple can be provided (Eg. gaftools view .... -r chr1:10-20 -r chr1:50-60 .....).
  -f, --format FORMAT  Format of output path (unstable | stable)
  -m, --mode MODE      Mode of selecting alignment with multiple regions/nodes provided ("I" | "U"). Choose "I" for finding alignments with all provided regions/nodes (intersection) and "U" for finding alignments with at least one of the provided regions/nodes (union). (default: U)