📖

grcMalaria R package - User Guide

Welcome

grcMalaria is a user-friendly, open-source R package, designed to make genetic epidemiology analysis tasks accessible.

The package facilitates the translation of genetic information derived from malaria parasites from SpotMalaria Genetic Report Cards (GRC) into intuitive geographical maps of prevalence, diversity, relatedness. This software library is also capable of identifying circulating strains, characterising drug resistance profiles, and mapping spread.

Table of Contents

Welcome
About
First time set-up of grcMalaria
System requirements
Installation of packages in R
📁 Files to download
Initialization - start from here every session ⭐
1. Load libraries
2. Load data file, i.e. the GRC .xlsx file
3. Create data folder for the analysis
4. Select sample set to work on
The Analyses 📊
Sample Count Map
Drug Resistance Map (Predicted Phenotype)
Mutation Map
Allele Proportion Map
Diversity Map
Connectedness Maps
Barcode Frequency Maps
🕓 Time series maps
Cluster / Relatedness Analyses 📊
Find Clusters
Cluster Group Sharing Maps
Cluster Prevalence Maps
Analyses Parameters ⚙️
Common parameters
🎨 Assigning custom marker colours
PCA and NJ Tree Analyses🌴
Step1: User-defined Graphical Attributes for PCA and Tree plot
Step 2: Loading Attributes
Step 3: Applying Attributes to PCA or NJ Tree plots
Principal Component Analyses
Neighbor-Joining Tree
Useful Links
Getting Help & Contacting Us
🗨️ Need help or leave a comment

grcMalaria R package transforms GRC datasheet into genetic epidemiological maps

‣

About

This user-guide provides comprehensive information on how to perform analyses with the grcMalaria R package, including system setup, input dataset, and details of the R command lines.

First time set-up of grcMalaria

grcMalaria is installed on Rstudio by first installing dependencies and, subsequently, the grcMalariaGeodata package and the grcMalaria package using devtools.

System requirements

Make sure you have R, Rtools and Rstudio on your computer before proceeding: Instructions for R installation

Note: Rtools is not used for Mac. For Mac users R studio will prompt you to install ‘xcode-select’ during the installation process of the grcMalaria R package.

Installation of packages in R

Install dependencies

## Install devtools and rgeos
install.packages("devtools")
install.packages("rgeos")

##Install grcMalariaGeodata from Github
devtools::install_github("malariagen/grcMalariaGeodata")

## Require these dependencies to install 'malariagen/grcMalaria'
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("pcaMethods")

Install grcMalaria

✅

Latest stable version of grcMalaria: V2.0.0 (compatible with GRC dataset v1.4)

## Install grcMalaria from Github 
devtools::install_github("malariagen/grcMalaria", ref="V2.0.0")

📁 Files to download

Dataset: GRC v1.4
Codes to load in RStudio, available in two formats

.R file (recommended for intermediate/advance R users):

grcMalaria-2.0.0.R.zip2.2KB

R notebook (recommended for R beginners):

grcMalaria-2.0.0.Rmd.zip9.6KB

Initialization - start from here every session ⭐

For every session, you will have to start at this section

1. Load libraries

# Load libraries
library(grcMalariaGeodata)
library(grcMalaria)

Check that you have the latest version of R package.

# Check package versions grcMalaria and grcMalariaGeodata
# Latest version grcMalariaGeodata: 0.4.0
# Latest stable version of grcMalaria: 2.0.0
packageVersion('grcMalaria')
packageVersion('grcMalariaGeodata')

‣

If you don’t have the latest package version of the package, remove the package and install the package again (press the black triangle to see code):

2. Load data file, i.e. the GRC .xlsx file

Genetic Report Cards (GRC) data v1.4 are available at genremekong.org/data
Save the GRC onto your computer in order to load onto grcMalaria
Load a GRC data file (Microsoft Excel Worksheet format), by indicating the file path in the loadGrc() function . All types of GRC data file are supported, whether it is a public data file, or privately owned GRC data for study owners. A GRC data file consists of groups of variables, including metadata fields with geographical and spatial data (grey headers), private metadata (pink, not included in public data), predicted drug resistance (orange), sample characteristics including species determination, and complexity of infection (light blue), drug resistance amino acid haplotypes (green), amino acid variant for individual drug resistance variants (blue), quantitative polymerase chain reaction (qPCR) results for gene copy number variation (CNV) for plasmepsim2/3 (pm23), and multidrug resistance protein 1 (Pfmdr1) (purple), and lastly genotypes used for genetic barcodes (red).

💡For more information on the Genetic Report Card (GRC) v1.4 structure input file, see this page: GRC v1.4 User Guide

# Load data file
# Change the path to where your file is located before running the code
Data <- loadGrc("D:/.../GRC.xlsx", 
    sheet = "GenRe-Mekong", 
    species = "Pf", version = "1.4")

Parameters:

"D:/…/ GRC.xlsx" - change the path to the GRC Microsoft Excel file located in your computer (use forward slashes /in the path directory)
sheet - name of datasheet or tab name within the Excel file containing the GRC data
species - the species being analysed. Currently, the platform can only analysed "Pf" (i.e. Plasmodium falciparum)
version - the version number of the GRC data file format. Current version is 1.4

‣

😎 Optional function to merge two datasets (Toggle to expand)

3. Create data folder for the analysis

initializeContext() creates a context in preparation of subsequent analysis.

This function removes samples and SNPs with high missingness, based on the thresholds specified. It then performs barcode imputation and calculates genetic distance matrices using the remaining data. Filtering is perform to minimise errors that can occur due to insufficient DNA concentrations for sequencing. Insufficiency in DNA could be attributed to low parasitemia, or incorrect Plasmodium species classification.

By default, SNPs with missing calls in more than 20% of samples (minSnpTypability=0.8), and samples with missing alleles in more than 25% of the remaining SNPs (minSampleTypability=0.75), are removed.

The function creates two directories, ‘data’, and ‘out’, within the specified path.

The ‘data’ directory contains a subdirectory for each GRC dataset. Within this subdirectories, four folders are created:

(i) ‘barcode’: contains filtered and imputed barcodes in tabular and fasta format

(ii) ‘distance’: contains a filtered and an imputed distance matrix

(iii) ‘genotype’: contains filtered and imputed genotypes

(iv) ‘meta’: contains unfiltered, filtered, and imputed sample metadata

‣

Output folder: data

ctx <- initializeContext(Data, 
                          dir="D:/...", #Change the path to where you want output file to be 
															 minSnpTypability=0.8, minSampleTypability=0.75)

Parameters:

Data - The data obtained from reading the GRC data with loadGrc()
dir - You will want to change the path and folder name to where you want the output file to be stored on your computer
minSnpTypability - The minimum proportion of non-missing samples for a barcode position to be retained in analysis. The default is 0.8. This means SNPs where the allele is missing in more than 20 percent of samples are removed.
minSampleTypability - The minimum proportion of non-missing positions for a sample to be retained in analysis. The default is 0.75. This means samples where more than 25 percent of the remaining SNPs are missing are removed.

How to read the output:

4. Select sample set to work on

Selects a set of sample for analysis, based on their meta values in Green. A given analysis context can contain multiple sample sets. These will be labelled with different names.

The selection criteria are specified as a list containing a sequence of lists with two elements each, a selected sample must match all the criteria:

1. "field" which is the column name to be checked

2. "values" which is an array of possible values that can be matched for a sample to be selected

## Select sample set to work on. 

# To select samples from 1 field (1 column in the GRC)
selectSampleSet(ctx, sampleSetName="EBKK", select=list(
								list(field="Country", values=c("VN", "KH", "LA")) ))

# To select samples from 2 fields
selectSampleSet(ctx, sampleSetName="Laos", select=list(
								list(field="TimePoint", values=c("D00H00","-")),
								list(field="Study", values=c("1208-PF-LA-CMPE-GENRE")) ))

# To select samples from 3 fields
selectSampleSet(ctx, sampleSetName="SouthLA_2017", select=list(
                 list(field="Country", values="LA"),
                 list(field="AdmDiv1", values=c("Attapeu", "Champasak")),
								 list(field="Year", values=c("2017", "2018")) ))

# To select samples from more fields, follow the example above to add more -> list(field=" ", values = " ")

💡 Repeat this step for the different sample set you would like to analyse

Parameters:

ctx - the analysis context, created by intializeContext()
sampleSetName – set a name for the sample set. You will use this for subsequent analysis tasks. Once run, this will create a folder with the same name e.g. “Laos”
select function will select samples matching all the specified criteria. Selected samples will match all criteria. The selection criteria are specified as a list containing a sequence of lists with two elements each:

field corresponds to a column name in the GRC datafile
values corresponds to an array of possible values that can be matched for a sample to be selected. Use comma , to add additional parameters inside a quotation mark " "

The Analyses 📊

The main output for the analyses are the plots and .tab data file. The outputs are placed in the output folder set in intializeContext().

➡️ Parameters of analyses can be adjusted. See Common parameters

➡️ Aesthetics of output can be adjusted. See Graphical parameters

Sample Count Map

This creates maps showing the numbers of samples collected at different aggregation levels.

mapSampleCounts (ctx, sampleSet="EBKK", timePeriods=NULL,
                   aggregate=c("Province","District"), 
                   minAggregateCount=1, 
                   markerSize=c(10,40), 
                   colourBy="Province", 
                   showNames=TRUE,
                   ...)

⚙️ Parameters:

See Common parameters for description of the parameters available for all the functions
… additional graphical parameters to adjust aesthetics of the plot
timePeriods parameter is available for this analysis. See the usage in Time interval parameters

🤔 How to read the output:

Number of samples are shown on the marker
If number of samples is less than minimum count of aggregated samples (minAggregateCount), the marker will not appear.
File name with

‘filtered’ is showing samples that passed the quality filtering threshold (default is minSampleTypability=0.75, meaning samples that have more than 25% missingness in their barcode are filtered out and not shown on the map).
‘unfiltered’ is showing all the samples without applying quality filtering

Drug Resistance Map (Predicted Phenotype)

This creates maps showing the levels of resistance to a particular antimalarial drug for different administrative divisions. Based on published genetic markers, each sample has a predicted phenotype for different types of antimalarial drugs. If multiple drugs are specified, then different maps will be created for different drugs.

mapDrugResistancePrevalence (ctx, sampleSet="EBKK", timePeriods=NULL,
                   drugs="ALL", aggregate=c("Province","District"),
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   ...)

⚙️Parameters:

See Common parameters
… additional graphical parameters to adjust aesthetics of the plot
timePeriods parameter is available for this analysis. See the usage in Time interval parameters
drugs - An array of drugs for which prevalence of resistance will be estimated; "ALL" creates maps for all the drugs for which we have resistance predictions. These include:

Artemisinin	Chloroquine
Piperaquine	DHA-PPQ
Mefloquine	AS-MQ
Sulfadoxine	Pyrimethamine
S-P	S-P-IPTp

To specify a drug put drug name in between quotation marks e.g. "Artemisinin" or c("Artemisinin", "Chloroquine", "S-P") to select several specified drugs.

🤔 How to read the output:

Number on the marker shows proportion of samples with resistant markers for the specified drug, ranging between 0 and 1 where 0 means no samples carry the resistant marker and 1 means 100% of the samples carry the markers used to predict resistance.

For example, the figure shows 0.03 on Bac Ai and 0.85 in Phu Yen, which means 3% of the samples in Bac Ai and 85% of samples in Phu Yen predicted to be resistant to DHA-PPQ

Resistance prevalence is calculated by dividing the number of resistant samples by the sum of resistant and sensitive samples,

fr = Resistant samples/(Resistant + Sensitive sample)

Mutation Map

This creates maps of frequency of variants known to be associated with drug resistance. Mutation frequency is the ratio of known mutant variants divided by total variants in the population.

mapMutationPrevalence (ctx, sampleSet="EBKK", timePeriods=NULL,
                   mutations="ALL", aggregate="District",
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   ...)

⚙️ Parameters:

See Common parameters
timePeriods parameter is available for this analysis. See the usage in Time interval parameters
mutations - An array of known variants associated with drug resistance "ALL" creates maps for all the variants which we have markers for. Below a table of available markers, with associated antimalarial. To specify a mutation, put the specific variant in between quotation marks e.g. "crt_C72S" or c(“mdr1_N86Y”, “mdr1_Y184F”) to select several specified mutations.

Gene	Mutations (use this column for the code)	Antimalarial*
crt	crt_C72S , crt_M74I , crt_N75E , crt_N75D , crt_K76T , crt_T93S , crt_H97Y , crt_H97L , crt_I218F , crt_A220S , crt_Q271E , crt_N326S , crt_N326D , crt_T333S , crt_I356T , crt_I356L , crt_R371I	chloroquine, piperaquine
dhfr	dhfr_N51I , dhfr_C59R , dhfr_C59H , dhfr_S108N , dhfr_S108T , dhfr_I164L	pyrimethamine, cycloguanil
dhps	dhps_S436A , dhps_S436F , dhps_A437G , dhps_K540E , dhps_K540N , dhps_A581G , dhps_A613T , dhps_A613S	sulfadoxine
mdr1	mdr1_N86Y , mdr1_Y184F , mdr1_S1034I , mdr1_F1226Y , mdr1_D1246Y	chloroquine, amodiaquine, lumefantrine, mefloquine
mdr2	mdr2_T484I	artemisinin
arps10	arps10_V127M , arps10_D128H , arps10_D128Y	artemisinin
fd (ferredoxin)	fd_D193Y	artemisinin
exo (exonuclease)	exo_E415G	piperaquine

*details of genetic markers used in SPOTMalaria: 20200705-GenRe-04a-SpotMalaria-0.39.pdf (sanger.ac.uk), pages 4-7

🤔 How to read the output:

Number on the marker shows frequency of specified variant. The scale ranges between 0 and 1 where 0 means no samples carry the specified allele and 1 means 100% of the samples carry the specified allele.
Mutation frequency = selected variant / total variants in the population

Allele Proportion Map

This function creates maps with pie-charts showing the proportion of selected alleles in each administrative division. Missing genotypes and heterozygous genotypes are excluded in the calculations.

mapAlleleProportions (ctx, sampleSet="EBKK", timePeriods=periods,
                       mutations="ALL",
                       aggregate="Province", minAggregateCount=10,
                       showNames=TRUE, markerSize=c(10,25))

⚙️ Parameters:

See Common parameters
timePeriods parameter is available for this analysis. See the usage in Time interval parameters
mutations - An array of known variants associated with drug resistance. "ALL" creates maps for all the variants which we have markers for. To specify a mutation, put specific variant in between quotation marks e.g. "Pfkelch13" or c("crt_C72", "crt_M74", "mdr1-Amp") to select several specified mutations.
alleleColours parameter allows user to specify colours to alleles. Without this parameter, default colour palette is used. This parameter takes a vector of R colours (specified either as R colour names, or HTML colour codes) which is used to colour the slices in the pie symbols of the plot. All elements of the vector must be named with the name of the value (allele) they will represent. The sequence of names will determine the order in which the values are displayed. The name "Other" will indicate the colour used to group together samples that carry alleles other than those names. Usually, "Other" will be listed at the end of the sequence. If no colour is specified for "Other" in the sequence, then an additional "Other"="white" element will be automatically added to the end of the sequence.

‣

🎨 How to assign colours to specific alleles using alleleColours (Click to view):

Available markers:

Gene	mutations/haplotypes (use this column for the code)	Antimalarial
pm23-Amp	pm23-Amp	piperaquine
mdr1-Amp	mdr1-Amp	mefloquine, lumefantrine
Artemisinin background	ART-R backgound	artemisinin
Pfkelch13	kelch13	artemisinin
crt	crt, crt_C72, crt_M74, crt_N75, crt_K76, crt_T93, crt_H97, crt_I218, crt_A220, crt_Q271, crt_N326, crt_T333, crt_I356, crt_R371	chloroquine
dhfr	dhfr, dhfr_N51, dhfr_C59, dhfr_S108, dhfr_I164	pyrimethamine, cycloguanil
dhps	dhps, dhps_S436, dhps_A437, dhps_K540, dhps_A581, dhps_A613	sulfadoxine
mdr1	mdr1, mdr1_N86, mdr1_Y184, mdr1_S1034, mdr1_F1226, mdr1_D1246	chloroquine, amodiaquine, lumefantrine, mefloquine
mdr2	mdr2_T484	artemisinin-resistant (ART-R) genetic background
arps10	arps10_V127, arps10_D128	ART-R genetic background
fd	fd_D193	ART-R genetic background
exo	exo_E415	ART-R genetic background

🤔 How to read the output:

Markers are displayed as pie-chart showing proportion of all the homozygous alleles for the specified gene. Missing and heterozygous genotypes are excluded from the analysis.

Diversity Map

This creates maps of expected heterozygosity or genetic diversity for different administrative divisions.

mapDiversity (ctx, sampleSet="EBKK", timePeriods=NULL,
                   measures="ALL", aggregate="Province", markerColours="red3",
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   ...)

⚙️ Parameters:

See Common parameters
timePeriods parameter is available for this analysis. See the usage in Time interval parameters
markerColours set colour for the marker
measures - name of methods for the estimation calculation of genetic diversity. "ALL" creates maps for all six methods. To specify a method, put method name in between quotation marks e.g. “haploHet1” or c(“haploHet1”, “meanSnpHet”) to select several specified methods.

Included `measures`	🤔 Explanation
haploHet	A measure of heterozygosity of the barcode. This gives probability of two randomly selected samples carrying different barcode. It is computed as (N/(N-1))(1-sum(p^2)), where N is number of samples and p is the proportion of samples carrying each barcode. How to read the output: Scale: 0 to 1 (0% to 100%) of the samples carry different barcode - Low value suggests low diversity - High value suggest high diversity
meanSnpHet	A measure of heterozygosity of SNPs across the loci. This gives probability of two randomly selected samples having different alleles at a locus. Mean heterozygosity of SNP is taken across all the SNP loci. How to read the output: Scale: 0 to 1. The value is the mean proportion of the samples carry different alleles across the loci - Low values suggests low heterozygosity and low genetic variability - High values suggest high heterozygosity and high genetic variability
maxHaploFreq	Shows proportion of samples carrying the most common haplotype. It’s a measure of loss of diversity How to read the output: Scale: 0 to 1 (0% to 100%) of the samples carry the most common haplotype - Low values suggest low proportion of samples carry the most common haplotype, i.e. not dominated by a single haplotype, high diversity in population - High values suggest high proportion of samples carry the most common haplotype, i.e. less diversity in population
medianDistance	Median of pairwise genetic distance, showing how related samples are in a population. It is computed as proportion of SNPs differ between two barcodes. The final value is the median distance of all the samples within the population. This is not a measure of diversity but it is related to it.

Connectedness Maps

This creates maps of connections of haplotypes between different administrative divisions. It produces multiple maps for different mean genetic distance thresholds.

The administrative divisions are compared in a pairwise fashion, within this comparison output is produced that gives either

i) the proportion of sample pairs that have at least the assigned percentage of similarity (default: 1), or
ii) the proportion of sample pairs with a mean genetic distance of at least the meanDistanceLevels (default: 0.5).

mapConnections (ctx, sampleSet="EBKK", 
                   measures="ALL", aggregate="Province",
                   meanDistanceLevels=c(1, 0.9, 0.8),
									 minIdentity = c(1, 0.9, 0.8),
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   ...)

⚙️Parameters:

See Common parameters
measures - default is “ALL” (default) where two measures, “similarity” and “meanDistance”, are plotted. "meanDistance" produces a map with pairwise comparisons between administrative divisions (e.g. sites) in which the thickness of the connection line corresponds to the proportion of sample pairs with a mean genetic distance of at least the meanDistanceLevels. "similarity" produces a map with pairwise comparisons between administrative divisions (e.g. sites) in which the thickness of the connection line corresponds to the proportion of sample pairs with at least the percentage of similarity given in minIdentity.
meanDistanceLevelsvalue between 0 and 1, default is 0.5. Several thresholds can be specified at once e.g. c(1, 0.9, 0.8)
minIdentity value between 0 and 1, default is 1

🤔 How to read output:

Maps are produced showing connections between the administrative divisions, in which the thickness of the connection line corresponds to the proportion of either

i) the proportion of sample pairs that have at least the assigned percentage of similarity (default: 1), or
ii) the proportion of sample pairs with a mean genetic distance of at least the meanDistanceLevels (default: 0.5).

Both similarity and mean genetic distance are calculated based on the 101-SNP genetic barcode.

Barcode Frequency Maps

This creates maps showing frequency of groups of Plasmodium samples that have an identical barcode (100% barcode similarity) for different administrative divisions. Barcode frequencies are calculated from the 101-SNP genetic barcode.

mapBarcodeFrequencies (ctx, sampleSet="EBKK", 
                   type=c("bar","pie"), 
                   aggregate="District", 
                   minAggregateCount=10, showNames=TRUE, markerScale=0.8, 
                   ...)

⚙️ Parameters:

See Common parameters
type – visualisation types of frequency as "bar" and "pie" charts.

🤔 How to read output:

Frequency of groups of samples that have an identical barcode (100% barcode similarity) are shown as either a white box in a bar chart or white segment in a pie-chart

The presence of a few large segments suggests that the location is dominated by several large groups, each representing a different "strain".
When we observe "black" colors in the bar or pie chart, it indicates the presence of many overlapping segment edges. This suggests the existence of numerous groups of strains with low frequency. Therefore, the color black can serve as an indicator of diversity. In other words, a high level of "black" implies a greater number of different strains in a given location.

🕓 Time series maps

Time-sequence maps can be implemented using a parameter called timePeriods parameter. When this is passed to a function, it will partition samples into time-interval plots. Time intervals parameters are available in 5 analyses: mapSampleCounts(), mapDrugResistancePrevalence(), mapMutationPrevalence(), mapAlleleProportions() and mapDiversity().

Why use time-intervals parameter:

It will produce maps with consistent geographical boundaries
Users can slice up the dataset in any specified time period

To implement the time-sequence maps, first we provide a list of specified time periods, then we incorporate the timePeriod into the function, for example:

Specify time periods

## To implement time-sequence maps
# 1. Specify time periods
periods <- list(
  list(name="2021", type="year", start="1-Jan-2021"),
  list(name="2020", type="year", start="1-Jan-2020"),
  list(name="2017-19", type="period", start="1-Jan-2017", end="31-Dec-2019") )

Each list specifies a time period, with the required three parameters:

name - the name parameter is required and is added to the end of the file name of the produced files
type - there are two types of time interval. type="year", uses start date and by default show until the end of the defined year. type="period" allows user to specify any time period using both the start and end date.
start and end - start and end date. The parameter start must be a date in the format “dd-MMM-yyyy”

**NOTE: The public GRC contains only collected years, therefore only the year in the provided “start” date is used to produce time-sequence maps

Apply timePeriods parameter

⏱️ Time intervals parameters, timePeriods, can be applied to 5 analyses: mapSampleCounts(), mapDrugResistancePrevalence(), mapMutationPrevalence(), mapAlleleProportions() and mapDiversity().

# 2. Apply timePeriods parameter to any of the five analyses:

mapSampleCounts(ctx, sampleSet="Laos", timePeriods = periods,
                   aggregate=c("Province","District"), minAggregateCount=1, 
                   markerSize=c(10,40), colourBy="Province", showNames=TRUE,
                   width=15, height=15)

mapDrugResistancePrevalence(ctx, sampleSet="Laos", timePeriods = periods,
                   drugs="ALL", aggregate=c("Province","District"),
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   width=15, height=15)

mapMutationPrevalence(ctx, sampleSet="Laos", timePeriods = periods,
                   mutations="ALL", aggregate="District",
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   width=15, height=15)

mapAlleleProportions(ctx, sampleSet="Laos", timePeriods = periods,
                   mutations="ALL", aggregate="District",
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   width=15, height=15)

mapDiversity (ctx, sampleSet="Laos", timePeriods = periods,
                   measures="ALL", aggregate="District", markerColours="red3",
                   minAggregateCount=10, showNames=TRUE, markerSize=16,
                   width=15, height=15)

Example of time-interval parameter applied to drug resistant prevalence outputs:

Cluster / Relatedness Analyses 📊

Find Clusters

In order to plot maps of cluster prevalence and cluster sharing, first we have to find clusters of similar genetic background. The function below partitions the Plasmodium samples into clusters with a similar genetic background, based on levels of genetic barcode similarity. This function was built on the igraph R package (Csárdi G, 2023), which computes a graph connecting sample pairs with $S$ greater than a minimum threshold $S_{min}$ , subsequently partitioning the graph into clusters.

## Find clusters of similar genetic background ##
findClusters(ctx, sampleSet="EBKK", clusterSet = "GMS",
             minIdentity = c(0.95, 0.80), impute=TRUE,
             clusteringMethod = "louvain", minClusterSize = 2)

⚙️ Parameters:

ctx - the analysis context, created by intializeContext()
sampleSet - the name of the sample set being used, which must have been previously created by selectSampleSet()
clusterSet - give the clustering set a name. In the above example the clustering set is called "GMS".
minIdentity - minimal similarity level set for a pair of samples to be in a cluster. For instance, 0.95 means the software only put pairs of samples whose genetic barcode are at least 95% matching. Multiple similarity levels can be set at once, as in the example above, by putting number inside c() separated by ,
impute - To use imputed or filtered data. The default is TRUE.
minClusterSize - minimal number of samples in a cluster
clusteringMethod - The clustering method. Two methods are available: "allNeighbours" and "louvain". The default is "allNeighbours". The "allNeighbours" method clusters samples together that are above the set "minIdentity" threshold. This method is less informative at low similarity levels, because each sample will be assigned to a separate cluster. The "louvain" method is the preferred method. Also known as Louvain Community-based clustering, this method uses an algorithm to identify clusters within a network that are strongly connected to each other, and more weakly connected to other clusters. This method is superior to "allNeighbours", particularly in sample sets that have low genetic similarity.

‣

📂 Output folder: cluster

Cluster Group Sharing Maps

This creates a map showing the proportions of assigned clusters (haplotype groups) for different administrative divisions. The different colours represent different haplotype groups. White colour represents samples that do not belong to a cluster because they did not meet the specified similarity threshold minIdentity.

⚠️ Make sure you have run findClusters() to identify clusters before running the code below

mapClusterSharing (ctx, sampleSet="EBKK", clusterSet = "GMS",
                   type=c("bar", "pie"),
                   aggregate=c("Province","District"),
                   minAggregateCount=5, showNames=TRUE, markerScale=0.8,
                   ...)

⚙️ Parameters:

See Common parameters
clusterSet - give the clustering set a name. In the above example the clustering set is called "GMS".
type – visualisation types of frequency as "bar" and "pie" charts.

🤔 How to read the output:

Shows proportion of haplotype groups for different administrative division as either bar-chart or pie-chart

Different haplotype groups are represented by different colours
Grey colour represents samples that do not belong to a cluster because they did not meet the specified similarity threshold

Cluster Prevalence Maps

This function produces network maps showing the prevalence of clusters (haplotype groups).

For each assigned cluster (haplotype group) a map is created, showing the prevalence and overlap in prevalence of that cluster across administrative divisions. Each map also contains information on sample count, drug-resistance and mutation frequencies.

⚠️ Make sure you have run findClusters() to add cluster information to ctx before running the code below

mapClusterPrevalence (ctx, sampleSet="EBKK", clusterSet = "GMS",
                      aggregate=c("Province","District"), 
                      minAggregateCount=5, showNames=TRUE, markerScale=0.8,
                      ...)

⚙️ Parameters:

See Common parameters
clusterSet - give the clustering set a name. In the above example the clustering set is called "GMS".

🤔 How to read output:

Each map is showing information for a single cluster

Number on the marker (’node’) is the proportion of sample belonging to this cluster in the area
Thickness of lines (’edges’) is the proportion samples belong to this cluster shared between two locations

The legend also shows information on sample count, drug-resistant prediction and mutation frequencies

‣

Toggle to see annotated example

Connecting Cluster information with Metadata (Optional)

There are several you can connect data between two files. If you have Microsoft Excel on your computer, you can following steps (the steps correspond to the video): 1. Open GRC datasheet 2. Create a copy of the datasheet (to retain original data in case of mistake), by right click on the tab > “Move or Copy” > “Create a copy” 3. Insert a new column for the cluster information, by right click on the column head > “Insert” 4. Open the ClusterMembers.tab file 5. In the GRC , select the first row in the column where cluster information would go 6. Press “fx” to insert function 7. Select “VLOOKUP” function and fill the following: - In the “Lookup_value”: select the SampleID column in the GRC - In the “Table_array”: select the column with SampleID and cluster information in ClusterMembers.tab - In the “Col_Index_num”: add the number in which it correspond to the column where the cluster member information is located. 8. In the “Range_lookup”: type “FALSE” to return exact match 9. Copy the code down the column

Once cluster information is connected with the GRC, you can use this for further analyses, such as analyzing the prevalence of clonal clusters

Linking cluster member information to GRC using VLOOKUP() function in Microsoft Excel:

Components of the VLOOKUP function. SampleID is typically used for connecting the datasheet together

Analyses Parameters ⚙️

Common parameters

The following are the common parameters shared across the functions in this package

ctx - The analysis context, created by intializeContext()
sampleSet - The name of the sample set being used; must have been previously created by selectSampleSet()
aggregate - The administrative level at which we group (”Province” or “District”). Separate maps are created for each administrative level.
minAggregateCount - The minimum count of aggregated samples. To avoid estimating on very small samples, one can set a minimum count of samples, below which the marker is not shown.
showNames - If TRUE, labels are shown with the name of the aggregation unit (”Province” or “District”)

Aesthetics (…)

Marker

colourby - Shows the aggregation level to be used to colour the markers (”Province” or “Country”). For each aggregation unit, we place a marker on the map, coloured according to the level of resistance to the drug or mutation, with a label indicating the prevalence.
markerSize - Allows adjustment of the size of markers on the map. If only one value is passed, all markers will be of that size; if two values are passed, they will be used as the min and max size of the marker, whose size will reflect the number of samples.

Font size

markerFontSize - Allows adjustment of the font size shown on the markers (numeric), default=6
nameFontSize - Allows adjustment of the place’s label name font size (numeric), default=5
axisTitleSize - specifies axis label font size (numeric), default=1
legendFontSize - specifies font size of the legend (numeric), default=4

Plot size

width - the width of the plot (numeric), default=15
height - the height of the plot (numeric), default=15
units - the units in which the width and height are expressed. Supported values: "in" (default), "cm", "mm", "px"
dpi - the resolution of the plot output, expressed as dots per inch, default=300
format - the file format in which the plot will be saved. Supported values: "png” (default), "pdf"

Legend

legendPosition - specifies where the legend should be plotted. Supported values: "inset" (default), "separate"
legendDirection - specifies location of legend. Supported values: "vertical" (default), "horizontal"

‣

🎨 Assigning custom marker colours

PCA and NJ Tree Analyses🌴

In the next section creation of Principal Component Analysis (PCA) and Neighbor-Joining (NJ) - trees will be explained.

However, firstly specification of graphical display (graphic attributes) based on metadata values is required. These graphical attributes will be used to visualise PCA and NJ analyses.

‣

Step1: User-defined Graphical Attributes for PCA and Tree plot

‣

Step 2: Loading Attributes

‣

Step 3: Applying Attributes to PCA or NJ Tree plots

Principal Component Analyses

Principal Component Analysis (PCA) or Principal Coordinate Analysis (PCoA) are methods of reducing complexity of the data (in this case distances of 101-SNP genetic barcodes) down to two dimensions, while preserving as much information contained in the original data as possible. This was done by creating principal components that explain most of the observed variance in the dataset.

PCoA is the default method in the grcMalaria but other PCA methods are available. One plot is created for each specified graphical attribute in the parameter 'plots', in which the samples are coloured according to user input.

# Perform a principal component analysis for province, kelch13, plasmepsin2/3 amplification status, 
# and kelch13 in combination with plasmepsin2/3 amplification status,
# using the "PCoA" method on sampleSet "EBKK"
plotPrincipalComponents (ctx, sampleSet="EBKK", type="PCoA",
     plots=list(
               list(name="ByProvince",       attributes="province"),
               list(name="ByKelch13",        attributes="k13"),
               list(name="ByPm23",           attributes="pm23"),
               list(name="ByKelch13AndPm23", attributes=c("k13","pm23-noColour"))
            ), 
     ...)

⚙️ Parameters:

See Common parameters
type - The type of method used for the principal component analysis. "PCoA" (default), performs multidimensional scaling on a distance matrix, in which each sample is a variable. PCA methods: "nipals"& "bpca" are methods applied to the set of genetic barcode genotypes (each barcode position is a variable).
plots - the list of attribute rule for which separate plots will be created with the results of this analysis

🤔 How to read output:

The PC (Principal Components) are arranged in descending order of explained variance, with PC1 axis showing the largest variation and PC2 axis showing the second the largest variation and so on.
Samples that are close to each other means they are similar to each other. Well separated clusters means there are differences between samples.
In the example plot, PC1 explains the variation we see in the Pfkelch13 variants, separating C580Y mutation and WT. The data suggests that these parasites carrying C580Y and WT are distinct from one another.

Neighbor-Joining Tree

Neighbor-joining (NJ) tree is a another method to show how samples are related based on the distances between barcodes. It is not a phylogeny tree.

NJ tree is unrooted, which means:

It clusters together related sequences using distance matrix
Tree has no orientation
Without assumption about ancestry (no ancestry defined)

This function creates tree plots, in which the similarity between members of the population (i.e. samples in the sampleSet) is reconstructed based on the 101-SNP genetic barcode. For each specified attribute, a plot is created for the phylogenetic tree, in which the nodes (i.e. samples) are coloured according to that attribute. Graphical display of attributes needs to be specified before using this function, see steps above. In addition to the plot, a newick file is generated, which can be used to upload in tree visualisation tools (for example https://microreact.org/)

# Produce a neighbour-joining tree for sampleSet "Laos"
# with plots in which the samples (i.e. nodes) are coloured according to i) province, ii) kelch13, 
# iii) plasmepsin2/3 amplification status, iv) kelch13 and plasmepsin2/3 amplification status
 plotTree (ctx, sampleSet="Laos", type="njt",
           plots=list(
                     list(name="ByProvince",       attributes="province"),
                     list(name="ByKelch13",        attributes="k13"),
                     list(name="ByPm23",           attributes="pm23"),
                     list(name="ByKelch13AndPm23", attributes=c("k13","pm23-noColour"))
           ), 
           width=15, height=10)

⚙️Parameters:

See Common parameters
type - currently the only option is "njt" which creates a neighbour-joining tree.
plots - the list of attribute rule for which separate plots will be created with the results of this analysis

🤔How to read the output:

Cluster:

Related samples are clustered together
Unrelated samples are seperated

Branching:

Related clusters branch off together

Leaf Colour (the nodes at the end of the lines):

Graphic Attribute set by user

Branch Colour (the lines):

Grey branch means not all the leaves branching from it are the same colour
Coloured branch = all the sub-branches and their leaves are in that colour

Useful Links

🧬 Genetic Report Cards (v1.4)

📺 Video Tutorials

Videos from the training workshop (8 -9th May 2023). (genremekong.org)

🧑🏽‍💻 Source code

https://github.com/malariagen/grcMalaria

🪲 Reporting bugs

https://github.com/malariagen/grcMalaria/issues

📔 Background information

Getting Help & Contacting Us

We are happy to support our partners with any problems they are running in to.

‣

🗨️ Need help or leave a comment

R Package developed by Olivo Miotto

Version 2.0.0 | Last update on @March 21, 2024

↑ Back to top