Welcome
grcMalaria
is a user-friendly, open-source R package, designed to make genetic epidemiology analysis tasks accessible.
The package facilitates the translation of genetic information derived from malaria parasites from SpotMalaria Genetic Report Cards (GRC) into intuitive geographical maps of prevalence, diversity, relatedness. This software library is also capable of identifying circulating strains, characterising drug resistance profiles, and mapping spread.
Table of Contents
- Welcome
- About
- First time set-up of grcMalaria
- System requirements
- Installation of packages in R
- 📁 Files to download
- Initialization - start from here every session ⭐
- 1. Load libraries
- 2. Load data file, i.e. the GRC .xlsx file
- 3. Create data folder for the analysis
- 4. Select sample set to work on
- The Analyses 📊
- Sample Count Map
- Drug Resistance Map (Predicted Phenotype)
- Mutation Map
- Allele Proportion Map
- Diversity Map
- Connectedness Maps
- Barcode Frequency Maps
- 🕓 Time series maps
- Cluster / Relatedness Analyses 📊
- Find Clusters
- Cluster Group Sharing Maps
- Cluster Prevalence Maps
- Analyses Parameters ⚙️
- Common parameters
- 🎨 Assigning custom marker colours
- PCA and NJ Tree Analyses🌴
- Step1: User-defined Graphical Attributes for PCA and Tree plot
- Step 2: Loading Attributes
- Step 3: Applying Attributes to PCA or NJ Tree plots
- Principal Component Analyses
- Neighbor-Joining Tree
- Useful Links
- Getting Help & Contacting Us
About
This user-guide provides comprehensive information on how to perform analyses with the grcMalaria
R package, including system setup, input dataset, and details of the R command lines.
First time set-up of grcMalaria
grcMalaria
is installed on Rstudio by first installing dependencies and, subsequently, the grcMalariaGeodata
package and the grcMalaria
package using devtools.
System requirements
- Make sure you have R, Rtools and Rstudio on your computer before proceeding: Instructions for R installation
- Note: Rtools is not used for Mac. For Mac users R studio will prompt you to install ‘xcode-select’ during the installation process of the grcMalaria R package.
Installation of packages in R
Install dependencies
## Install devtools and rgeos
install.packages("devtools")
install.packages("rgeos")
##Install grcMalariaGeodata from Github
devtools::install_github("malariagen/grcMalariaGeodata")
## Require these dependencies to install 'malariagen/grcMalaria'
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("pcaMethods")
Install grcMalaria
grcMalaria
: V2.0.0 (compatible with GRC dataset v1.4)## Install grcMalaria from Github
devtools::install_github("malariagen/grcMalaria", ref="V2.0.0")
📁 Files to download
- Dataset: GRC v1.4
- Codes to load in RStudio, available in two formats
- .R file (recommended for intermediate/advance R users):
- R notebook (recommended for R beginners):
Initialization - start from here every session ⭐
For every session, you will have to start at this section
1. Load libraries
# Load libraries
library(grcMalariaGeodata)
library(grcMalaria)
Check that you have the latest version of R package.
# Check package versions grcMalaria and grcMalariaGeodata
# Latest version grcMalariaGeodata: 0.4.0
# Latest stable version of grcMalaria: 2.0.0
packageVersion('grcMalaria')
packageVersion('grcMalariaGeodata')
2. Load data file, i.e. the GRC .xlsx file
- Genetic Report Cards (GRC) data v1.4 are available at genremekong.org/data
- Save the GRC onto your computer in order to load onto
grcMalaria
- Load a GRC data file (Microsoft Excel Worksheet format), by indicating the file path in the
loadGrc()
function . All types of GRC data file are supported, whether it is a public data file, or privately owned GRC data for study owners. A GRC data file consists of groups of variables, including metadata fields with geographical and spatial data (grey headers), private metadata (pink, not included in public data), predicted drug resistance (orange), sample characteristics including species determination, and complexity of infection (light blue), drug resistance amino acid haplotypes (green), amino acid variant for individual drug resistance variants (blue), quantitative polymerase chain reaction (qPCR) results for gene copy number variation (CNV) for plasmepsim2/3 (pm23), and multidrug resistance protein 1 (Pfmdr1) (purple), and lastly genotypes used for genetic barcodes (red).
💡For more information on the Genetic Report Card (GRC) v1.4 structure input file, see this page: GRC v1.4 User Guide
# Load data file
# Change the path to where your file is located before running the code
Data <- loadGrc("D:/.../GRC.xlsx",
sheet = "GenRe-Mekong",
species = "Pf", version = "1.4")
Parameters:
"D:/…/ GRC.xlsx"
- change the path to the GRC Microsoft Excel file located in your computer (use forward slashes/
in the path directory)sheet
- name of datasheet or tab name within the Excel file containing the GRC dataspecies
- the species being analysed. Currently, the platform can only analysed"Pf"
(i.e. Plasmodium falciparum)version
- the version number of the GRC data file format. Current version is 1.4
3. Create data folder for the analysis
initializeContext()
creates a context in preparation of subsequent analysis.
This function removes samples and SNPs with high missingness, based on the thresholds specified. It then performs barcode imputation and calculates genetic distance matrices using the remaining data. Filtering is perform to minimise errors that can occur due to insufficient DNA concentrations for sequencing. Insufficiency in DNA could be attributed to low parasitemia, or incorrect Plasmodium species classification.
By default, SNPs with missing calls in more than 20% of samples (minSnpTypability=0.8
), and samples with missing alleles in more than 25% of the remaining SNPs (minSampleTypability=0.75
), are removed.
The function creates two directories, ‘data’, and ‘out’, within the specified path.
The ‘data’ directory contains a subdirectory for each GRC dataset. Within this subdirectories, four folders are created:
(i) ‘barcode’: contains filtered and imputed barcodes in tabular and fasta format
(ii) ‘distance’: contains a filtered and an imputed distance matrix
(iii) ‘genotype’: contains filtered and imputed genotypes
(iv) ‘meta’: contains unfiltered, filtered, and imputed sample metadata
ctx <- initializeContext(Data,
dir="D:/...", #Change the path to where you want output file to be
minSnpTypability=0.8, minSampleTypability=0.75)
Parameters:
Data
- The data obtained from reading the GRC data withloadGrc()
dir
- You will want to change the path and folder name to where you want the output file to be stored on your computerminSnpTypability
- The minimum proportion of non-missing samples for a barcode position to be retained in analysis. The default is 0.8. This means SNPs where the allele is missing in more than 20 percent of samples are removed.minSampleTypability
- The minimum proportion of non-missing positions for a sample to be retained in analysis. The default is 0.75. This means samples where more than 25 percent of the remaining SNPs are missing are removed.
How to read the output:
4. Select sample set to work on
Selects a set of sample for analysis, based on their meta values in Green. A given analysis context can contain multiple sample sets. These will be labelled with different names.
The selection criteria are specified as a list containing a sequence of lists with two elements each, a selected sample must match all the criteria:
1. "field" which is the column name to be checked
2. "values" which is an array of possible values that can be matched for a sample to be selected
## Select sample set to work on.
# To select samples from 1 field (1 column in the GRC)
selectSampleSet(ctx, sampleSetName="EBKK", select=list(
list(field="Country", values=c("VN", "KH", "LA")) ))
# To select samples from 2 fields
selectSampleSet(ctx, sampleSetName="Laos", select=list(
list(field="TimePoint", values=c("D00H00","-")),
list(field="Study", values=c("1208-PF-LA-CMPE-GENRE")) ))
# To select samples from 3 fields
selectSampleSet(ctx, sampleSetName="SouthLA_2017", select=list(
list(field="Country", values="LA"),
list(field="AdmDiv1", values=c("Attapeu", "Champasak")),
list(field="Year", values=c("2017", "2018")) ))
# To select samples from more fields, follow the example above to add more -> list(field=" ", values = " ")
💡 Repeat this step for the different sample set you would like to analyse
Parameters:
ctx
- the analysis context, created byintializeContext()
sampleSetName
– set a name for the sample set. You will use this for subsequent analysis tasks. Once run, this will create a folder with the same name e.g. “Laos”select
function will select samples matching all the specified criteria. Selected samples will match all criteria. The selection criteria are specified as a list containing a sequence of lists with two elements each:field
corresponds to a column name in the GRC datafilevalues
corresponds to an array of possible values that can be matched for a sample to be selected. Use comma,
to add additional parameters inside a quotation mark" "
The Analyses 📊
The main output for the analyses are the plots and .tab
data file. The outputs are placed in the output folder set in intializeContext()
.
➡️ Parameters of analyses can be adjusted. See Common parameters
➡️ Aesthetics of output can be adjusted. See Graphical parameters
Sample Count Map
This creates maps showing the numbers of samples collected at different aggregation levels.
mapSampleCounts (ctx, sampleSet="EBKK", timePeriods=NULL,
aggregate=c("Province","District"),
minAggregateCount=1,
markerSize=c(10,40),
colourBy="Province",
showNames=TRUE,
...)
⚙️ Parameters:
- See Common parameters for description of the parameters available for all the functions
…
additional graphical parameters to adjust aesthetics of the plottimePeriods
parameter is available for this analysis. See the usage in Time interval parameters
🤔 How to read the output:
- Number of samples are shown on the marker
- If number of samples is less than minimum count of aggregated samples (
minAggregateCount
), the marker will not appear. - File name with
- ‘filtered’ is showing samples that passed the quality filtering threshold (default is
minSampleTypability=0.75
, meaning samples that have more than 25% missingness in their barcode are filtered out and not shown on the map). - ‘unfiltered’ is showing all the samples without applying quality filtering
Drug Resistance Map (Predicted Phenotype)
This creates maps showing the levels of resistance to a particular antimalarial drug for different administrative divisions. Based on published genetic markers, each sample has a predicted phenotype for different types of antimalarial drugs. If multiple drugs are specified, then different maps will be created for different drugs.
mapDrugResistancePrevalence (ctx, sampleSet="EBKK", timePeriods=NULL,
drugs="ALL", aggregate=c("Province","District"),
minAggregateCount=10, showNames=TRUE, markerSize=16,
...)
⚙️Parameters:
- See Common parameters
…
additional graphical parameters to adjust aesthetics of the plottimePeriods
parameter is available for this analysis. See the usage in Time interval parametersdrugs
- An array of drugs for which prevalence of resistance will be estimated;"ALL"
creates maps for all the drugs for which we have resistance predictions. These include:
Artemisinin | Chloroquine |
Piperaquine | DHA-PPQ |
Mefloquine | AS-MQ |
Sulfadoxine | Pyrimethamine |
S-P | S-P-IPTp |
To specify a drug put drug name in between quotation marks e.g. "Artemisinin"
or c("Artemisinin", "Chloroquine", "S-P")
to select several specified drugs.
🤔 How to read the output:
- Number on the marker shows proportion of samples with resistant markers for the specified drug, ranging between 0 and 1 where 0 means no samples carry the resistant marker and 1 means 100% of the samples carry the markers used to predict resistance.
- For example, the figure shows 0.03 on Bac Ai and 0.85 in Phu Yen, which means 3% of the samples in Bac Ai and 85% of samples in Phu Yen predicted to be resistant to DHA-PPQ
- Resistance prevalence is calculated by dividing the number of resistant samples by the sum of resistant and sensitive samples,
fr = Resistant samples/(Resistant + Sensitive sample)
Mutation Map
This creates maps of frequency of variants known to be associated with drug resistance. Mutation frequency is the ratio of known mutant variants divided by total variants in the population.
mapMutationPrevalence (ctx, sampleSet="EBKK", timePeriods=NULL,
mutations="ALL", aggregate="District",
minAggregateCount=10, showNames=TRUE, markerSize=16,
...)
⚙️ Parameters:
- See Common parameters
timePeriods
parameter is available for this analysis. See the usage in Time interval parametersmutations
- An array of known variants associated with drug resistance"ALL"
creates maps for all the variants which we have markers for. Below a table of available markers, with associated antimalarial. To specify a mutation, put the specific variant in between quotation marks e.g."crt_C72S"
orc(“mdr1_N86Y”, “mdr1_Y184F”)
to select several specified mutations.
Gene | Mutations (use this column for the code) | Antimalarial* |
crt | crt_C72S , crt_M74I , crt_N75E , crt_N75D , crt_K76T , crt_T93S , crt_H97Y , crt_H97L , crt_I218F , crt_A220S , crt_Q271E , crt_N326S , crt_N326D , crt_T333S , crt_I356T , crt_I356L , crt_R371I | chloroquine, piperaquine |
dhfr | dhfr_N51I , dhfr_C59R , dhfr_C59H , dhfr_S108N , dhfr_S108T , dhfr_I164L | pyrimethamine, cycloguanil |
dhps | dhps_S436A , dhps_S436F , dhps_A437G , dhps_K540E , dhps_K540N , dhps_A581G , dhps_A613T , dhps_A613S | sulfadoxine |
mdr1 | mdr1_N86Y , mdr1_Y184F , mdr1_S1034I , mdr1_F1226Y , mdr1_D1246Y | chloroquine, amodiaquine,
lumefantrine, mefloquine |
mdr2 | mdr2_T484I | artemisinin |
arps10 | arps10_V127M , arps10_D128H , arps10_D128Y | artemisinin |
fd (ferredoxin) | fd_D193Y | artemisinin |
exo (exonuclease) | exo_E415G | piperaquine |
*details of genetic markers used in SPOTMalaria: 20200705-GenRe-04a-SpotMalaria-0.39.pdf (sanger.ac.uk), pages 4-7
🤔 How to read the output:
- Number on the marker shows frequency of specified variant. The scale ranges between 0 and 1 where 0 means no samples carry the specified allele and 1 means 100% of the samples carry the specified allele.
- Mutation frequency = selected variant / total variants in the population
Allele Proportion Map
This function creates maps with pie-charts showing the proportion of selected alleles in each administrative division. Missing genotypes and heterozygous genotypes are excluded in the calculations.
mapAlleleProportions (ctx, sampleSet="EBKK", timePeriods=periods,
mutations="ALL",
aggregate="Province", minAggregateCount=10,
showNames=TRUE, markerSize=c(10,25))
⚙️ Parameters:
- See Common parameters
timePeriods
parameter is available for this analysis. See the usage in Time interval parametersmutations
- An array of known variants associated with drug resistance."ALL"
creates maps for all the variants which we have markers for. To specify a mutation, put specific variant in between quotation marks e.g."Pfkelch13"
orc("crt_C72", "crt_M74", "mdr1-Amp")
to select several specified mutations.alleleColours
parameter allows user to specify colours to alleles. Without this parameter, default colour palette is used. This parameter takes a vector of R colours (specified either as R colour names, or HTML colour codes) which is used to colour the slices in the pie symbols of the plot. All elements of the vector must be named with the name of the value (allele) they will represent. The sequence of names will determine the order in which the values are displayed. The name "Other" will indicate the colour used to group together samples that carry alleles other than those names. Usually, "Other" will be listed at the end of the sequence. If no colour is specified for "Other" in the sequence, then an additional "Other"="white" element will be automatically added to the end of the sequence.
alleleColours
(Click to view):Available markers:
Gene | mutations/haplotypes (use this column for the code) | Antimalarial |
Pfkelch13 | Pfkelch13 | artemisinin |
pm23-Amp | pm23-Amp | piperaquine |
mdr1-Amp | mdr1-Amp | mefloquine, lumefantrine |
crt | PfCRT,
crt_C72, crt_M74, crt_N75, crt_K76, crt_T93, crt_H97, crt_I218, crt_A220, crt_Q271, crt_N326, crt_T333, crt_I356, crt_R371 | chloroquine |
dhfr | PfDHFR,
dhfr_N51, dhfr_C59, dhfr_S108, dhfr_I164 | pyrimethamine, cycloguanil |
dhps | PfDHPS,
dhps_S436, dhps_A437, dhps_K540, dhps_A581, dhps_A613 | sulfadoxine |
mdr1 | PfMDR1,
mdr1_N86, mdr1_Y184, mdr1_S1034, mdr1_F1226, mdr1_D1246 | chloroquine, amodiaquine, lumefantrine, mefloquine |
mdr2 | mdr2_T484 | artemisinin-resistant (ART-R) genetic background |
arps10 | arps10_V127, arps10_D128 | ART-R genetic background |
fd | fd_D193 | ART-R genetic background |
exo | exo_E415 | ART-R genetic background |
🤔 How to read the output:
- Markers are displayed as pie-chart showing proportion of all the homozygous alleles for the specified gene. Missing and heterozygous genotypes are excluded from the analysis.
Diversity Map
This creates maps of expected heterozygosity or genetic diversity for different administrative divisions.
mapDiversity (ctx, sampleSet="EBKK", timePeriods=NULL,
measures="ALL", aggregate="Province", markerColours="red3",
minAggregateCount=10, showNames=TRUE, markerSize=16,
...)
⚙️ Parameters:
- See Common parameters
timePeriods
parameter is available for this analysis. See the usage in Time interval parametersmarkerColours
set colour for the markermeasures
- name of methods for the estimation calculation of genetic diversity."ALL"
creates maps for all six methods. To specify a method, put method name in between quotation marks e.g.“haploHet1”
orc(“haploHet1”, “meanSnpHet”)
to select several specified methods.
Included measures | 🤔 Explanation |
haploHet | A measure of heterozygosity of the barcode. This gives probability of two randomly selected samples carrying different barcode.
It is computed as (N/(N-1))(1-sum(p^2)), where N is number of samples and p is the proportion of samples carrying each barcode.
How to read the output:
Scale: 0 to 1 (0% to 100%) of the samples carry different barcode
- Low value suggests low diversity
- High value suggest high diversity |
meanSnpHet | A measure of heterozygosity of SNPs across the loci. This gives probability of two randomly selected samples having different alleles at a locus. Mean heterozygosity of SNP is taken across all the SNP loci.
How to read the output:
Scale: 0 to 1. The value is the mean proportion of the samples carry different alleles across the loci
- Low values suggests low heterozygosity and low genetic variability
- High values suggest high heterozygosity and high genetic variability |
maxHaploFreq | Shows proportion of samples carrying the most common haplotype. It’s a measure of loss of diversity
How to read the output:
Scale: 0 to 1 (0% to 100%) of the samples carry the most common haplotype
- Low values suggest low proportion of samples carry the most common haplotype, i.e. not dominated by a single haplotype, high diversity in population
- High values suggest high proportion of samples carry the most common haplotype, i.e. less diversity in population |
medianDistance | Median of pairwise genetic distance, showing how related samples are in a population. It is computed as proportion of SNPs differ between two barcodes. The final value is the median distance of all the samples within the population. This is not a measure of diversity but it is related to it. |
Connectedness Maps
This creates maps of connections of haplotypes between different administrative divisions. It produces multiple maps for different mean genetic distance thresholds.
The administrative divisions are compared in a pairwise fashion, within this comparison output is produced that gives either
- i) the proportion of sample pairs that have at least the assigned percentage of similarity (default: 1), or
- ii) the proportion of sample pairs with a mean genetic distance of at least the
meanDistanceLevels
(default:0.5
).
mapConnections (ctx, sampleSet="EBKK",
measures="ALL", aggregate="Province",
meanDistanceLevels=c(1, 0.9, 0.8),
minIdentity = c(1, 0.9, 0.8),
minAggregateCount=10, showNames=TRUE, markerSize=16,
...)
⚙️Parameters:
- See Common parameters
measures
- default is“ALL”
(default) where two measures,“similarity”
and“meanDistance”
, are plotted."meanDistance"
produces a map with pairwise comparisons between administrative divisions (e.g. sites) in which the thickness of the connection line corresponds to the proportion of sample pairs with a mean genetic distance of at least themeanDistanceLevels
."similarity"
produces a map with pairwise comparisons between administrative divisions (e.g. sites) in which the thickness of the connection line corresponds to the proportion of sample pairs with at least the percentage of similarity given inminIdentity
.meanDistanceLevels
value between0
and1
, default is0.5
. Several thresholds can be specified at once e.g.c(1, 0.9, 0.8)
minIdentity
value between0
and1
, default is1
🤔 How to read output:
- Maps are produced showing connections between the administrative divisions, in which the thickness of the connection line corresponds to the proportion of either
- i) the proportion of sample pairs that have at least the assigned percentage of similarity (default: 1), or
- ii) the proportion of sample pairs with a mean genetic distance of at least the
meanDistanceLevels
(default:0.5
).
Both similarity and mean genetic distance are calculated based on the 101-SNP genetic barcode.
Barcode Frequency Maps
This creates maps showing frequency of groups of Plasmodium samples that have an identical barcode (100% barcode similarity) for different administrative divisions. Barcode frequencies are calculated from the 101-SNP genetic barcode.
mapBarcodeFrequencies (ctx, sampleSet="EBKK",
type=c("bar","pie"),
aggregate="District",
minAggregateCount=10, showNames=TRUE, markerScale=0.8,
...)
⚙️ Parameters:
- See Common parameters
type
– visualisation types of frequency as"bar"
and"pie"
charts.
🤔 How to read output:
- Frequency of groups of samples that have an identical barcode (100% barcode similarity) are shown as either a white box in a bar chart or white segment in a pie-chart
- The presence of a few large segments suggests that the location is dominated by several large groups, each representing a different "strain".
- When we observe "black" colors in the bar or pie chart, it indicates the presence of many overlapping segment edges. This suggests the existence of numerous groups of strains with low frequency. Therefore, the color black can serve as an indicator of diversity. In other words, a high level of "black" implies a greater number of different strains in a given location.
🕓 Time series maps
Time-sequence maps can be implemented using a parameter called timePeriods
parameter. When this is passed to a function, it will partition samples into time-interval plots. Time intervals parameters are available in 5 analyses: mapSampleCounts()
, mapDrugResistancePrevalence()
, mapMutationPrevalence()
, mapAlleleProportions()
and mapDiversity()
.
Why use time-intervals parameter:
- It will produce maps with consistent geographical boundaries
- Users can slice up the dataset in any specified time period
To implement the time-sequence maps, first we provide a list of specified time periods, then we incorporate the timePeriod into the function, for example:
Specify time periods
## To implement time-sequence maps
# 1. Specify time periods
periods <- list(
list(name="2021", type="year", start="1-Jan-2021"),
list(name="2020", type="year", start="1-Jan-2020"),
list(name="2017-19", type="period", start="1-Jan-2017", end="31-Dec-2019") )
- Each list specifies a time period, with the required three parameters:
name
- the name parameter is required and is added to the end of the file name of the produced filestype
- there are two types of time interval.type=
"year"
, usesstart
date and by default show until the end of the defined year.type=
"period"
allows user to specify any time period using both thestart
andend
date.start
andend
- start and end date. The parameter start must be a date in the format “dd-MMM-yyyy”
**NOTE: The public GRC contains only collected years, therefore only the year in the provided “start” date is used to produce time-sequence maps
Apply timePeriods parameter
⏱️ Time intervals parameters, timePeriods, can be applied to 5 analyses: mapSampleCounts()
, mapDrugResistancePrevalence(), mapMutationPrevalence()
, mapAlleleProportions()
and mapDiversity()
.
# 2. Apply timePeriods parameter to any of the five analyses:
mapSampleCounts(ctx, sampleSet="Laos", timePeriods = periods,
aggregate=c("Province","District"), minAggregateCount=1,
markerSize=c(10,40), colourBy="Province", showNames=TRUE,
width=15, height=15)
mapDrugResistancePrevalence(ctx, sampleSet="Laos", timePeriods = periods,
drugs="ALL", aggregate=c("Province","District"),
minAggregateCount=10, showNames=TRUE, markerSize=16,
width=15, height=15)
mapMutationPrevalence(ctx, sampleSet="Laos", timePeriods = periods,
mutations="ALL", aggregate="District",
minAggregateCount=10, showNames=TRUE, markerSize=16,
width=15, height=15)
mapAlleleProportions(ctx, sampleSet="Laos", timePeriods = periods,
mutations="ALL", aggregate="District",
minAggregateCount=10, showNames=TRUE, markerSize=16,
width=15, height=15)
mapDiversity (ctx, sampleSet="Laos", timePeriods = periods,
measures="ALL", aggregate="District", markerColours="red3",
minAggregateCount=10, showNames=TRUE, markerSize=16,
width=15, height=15)
Example of time-interval parameter applied to drug resistant prevalence outputs:
Cluster / Relatedness Analyses 📊
Find Clusters
In order to plot maps of cluster prevalence and cluster sharing, first we have to find clusters of similar genetic background. The function below partitions the Plasmodium samples into clusters with a similar genetic background, based on levels of genetic barcode similarity. This function was built on the igraph R package (Csárdi G, 2023), which computes a graph connecting sample pairs with greater than a minimum threshold , subsequently partitioning the graph into clusters.
## Find clusters of similar genetic background ##
findClusters(ctx, sampleSet="EBKK", clusterSet = "GMS",
minIdentity = c(0.95, 0.80), impute=TRUE,
clusteringMethod = "louvain", minClusterSize = 2)
⚙️ Parameters:
ctx
- the analysis context, created by intializeContext()sampleSet
- the name of the sample set being used, which must have been previously created by selectSampleSet()clusterSet
- give the clustering set a name. In the above example the clustering set is called"GMS"
.minIdentity
- minimal similarity level set for a pair of samples to be in a cluster. For instance,0.95
means the software only put pairs of samples whose genetic barcode are at least 95% matching. Multiple similarity levels can be set at once, as in the example above, by putting number insidec()
separated by,
impute
- To use imputed or filtered data. The default isTRUE
.minClusterSize
- minimal number of samples in a clusterclusteringMethod
- The clustering method. Two methods are available:"allNeighbours"
and"louvain"
. The default is"allNeighbours"
. The"allNeighbours"
method clusters samples together that are above the set"minIdentity"
threshold. This method is less informative at low similarity levels, because each sample will be assigned to a separate cluster. The"louvain"
method is the preferred method. Also known as Louvain Community-based clustering, this method uses an algorithm to identify clusters within a network that are strongly connected to each other, and more weakly connected to other clusters. This method is superior to"allNeighbours"
, particularly in sample sets that have low genetic similarity.
Cluster Group Sharing Maps
This creates a map showing the proportions of assigned clusters (haplotype groups) for different administrative divisions. The different colours represent different haplotype groups. White colour represents samples that do not belong to a cluster because they did not meet the specified similarity threshold minIdentity
.
⚠️ Make sure you have run findClusters()
to identify clusters before running the code below
mapClusterSharing (ctx, sampleSet="EBKK", clusterSet = "GMS",
type=c("bar", "pie"),
aggregate=c("Province","District"),
minAggregateCount=5, showNames=TRUE, markerScale=0.8,
...)
⚙️ Parameters:
- See Common parameters
clusterSet
- give the clustering set a name. In the above example the clustering set is called"GMS"
.type
– visualisation types of frequency as"bar"
and"pie"
charts.
🤔 How to read the output:
- Shows proportion of haplotype groups for different administrative division as either bar-chart or pie-chart
- Different haplotype groups are represented by different colours
- Grey colour represents samples that do not belong to a cluster because they did not meet the specified similarity threshold
Cluster Prevalence Maps
This function produces network maps showing the prevalence of clusters (haplotype groups).
For each assigned cluster (haplotype group) a map is created, showing the prevalence and overlap in prevalence of that cluster across administrative divisions. Each map also contains information on sample count, drug-resistance and mutation frequencies.
⚠️ Make sure you have run findClusters()
to add cluster information to ctx
before running the code below
mapClusterPrevalence (ctx, sampleSet="EBKK", clusterSet = "GMS",
aggregate=c("Province","District"),
minAggregateCount=5, showNames=TRUE, markerScale=0.8,
...)
⚙️ Parameters:
- See Common parameters
clusterSet
- give the clustering set a name. In the above example the clustering set is called"GMS"
.
🤔 How to read output:
- Each map is showing information for a single cluster
- Number on the marker (’node’) is the proportion of sample belonging to this cluster in the area
- Thickness of lines (’edges’) is the proportion samples belong to this cluster shared between two locations
- The legend also shows information on sample count, drug-resistant prediction and mutation frequencies
Connecting Cluster information with Metadata (Optional)
Once cluster information is connected with the GRC, you can use this for further analyses, such as analyzing the prevalence of clonal clusters
Analyses Parameters ⚙️
Common parameters
The following are the common parameters shared across the functions in this package
ctx
- The analysis context, created byintializeContext()
sampleSet
- The name of the sample set being used; must have been previously created byselectSampleSet()
aggregate
- The administrative level at which we group (”Province” or “District”). Separate maps are created for each administrative level.minAggregateCount
- The minimum count of aggregated samples. To avoid estimating on very small samples, one can set a minimum count of samples, below which the marker is not shown.showNames
- IfTRUE
, labels are shown with the name of the aggregation unit (”Province” or “District”)colourby
- Shows the aggregation level to be used to colour the markers (”Province” or “Country”). For each aggregation unit, we place a marker on the map, coloured according to the level of resistance to the drug or mutation, with a label indicating the prevalence.markerSize
- Allows adjustment of the size of markers on the map. If only one value is passed, all markers will be of that size; if two values are passed, they will be used as the min and max size of the marker, whose size will reflect the number of samples.markerFontSize
- Allows adjustment of the font size shown on the markers (numeric), default=6
nameFontSize
- Allows adjustment of the place’s label name font size (numeric), default=5
axisTitleSize
- specifies axis label font size (numeric), default=1
legendFontSize
- specifies font size of the legend (numeric), default=4
width
- the width of the plot (numeric), default=15
height
- the height of the plot (numeric), default=15
units
- the units in which the width and height are expressed. Supported values:"in"
(default),"cm"
,"mm"
,"px"
dpi
- the resolution of the plot output, expressed as dots per inch, default=300
format
- the file format in which the plot will be saved. Supported values:"png”
(default),"pdf"
legendPosition
- specifies where the legend should be plotted. Supported values:"inset"
(default),"separate"
legendDirection
- specifies location of legend. Supported values:"vertical"
(default),"horizontal"
Aesthetics (…)
Marker
Font size
Plot size
Legend
🎨 Assigning custom marker colours
PCA and NJ Tree Analyses🌴
In the next section creation of Principal Component Analysis (PCA) and Neighbor-Joining (NJ) - trees will be explained.
However, firstly specification of graphical display (graphic attributes) based on metadata values is required. These graphical attributes will be used to visualise PCA and NJ analyses.
Step1: User-defined Graphical Attributes for PCA and Tree plot
Step 2: Loading Attributes
Step 3: Applying Attributes to PCA or NJ Tree plots
Principal Component Analyses
Principal Component Analysis (PCA) or Principal Coordinate Analysis (PCoA) are methods of reducing complexity of the data (in this case distances of 101-SNP genetic barcodes) down to two dimensions, while preserving as much information contained in the original data as possible. This was done by creating principal components that explain most of the observed variance in the dataset.
PCoA is the default method in the grcMalaria
but other PCA methods are available. One plot is created for each specified graphical attribute in the parameter 'plots', in which the samples are coloured according to user input.
# Perform a principal component analysis for province, kelch13, plasmepsin2/3 amplification status,
# and kelch13 in combination with plasmepsin2/3 amplification status,
# using the "PCoA" method on sampleSet "EBKK"
plotPrincipalComponents (ctx, sampleSet="EBKK", type="PCoA",
plots=list(
list(name="ByProvince", attributes="province"),
list(name="ByKelch13", attributes="k13"),
list(name="ByPm23", attributes="pm23"),
list(name="ByKelch13AndPm23", attributes=c("k13","pm23-noColour"))
),
...)
⚙️ Parameters:
- See Common parameters
type
- The type of method used for the principal component analysis."PCoA"
(default), performs multidimensional scaling on a distance matrix, in which each sample is a variable. PCA methods:"nipals"
&"bpca"
are methods applied to the set of genetic barcode genotypes (each barcode position is a variable).plots
- the list of attribute rule for which separate plots will be created with the results of this analysis
🤔 How to read output:
- The PC (Principal Components) are arranged in descending order of explained variance, with PC1 axis showing the largest variation and PC2 axis showing the second the largest variation and so on.
- Samples that are close to each other means they are similar to each other. Well separated clusters means there are differences between samples.
- In the example plot, PC1 explains the variation we see in the Pfkelch13 variants, separating C580Y mutation and WT. The data suggests that these parasites carrying C580Y and WT are distinct from one another.
Neighbor-Joining Tree
Neighbor-joining (NJ) tree is a another method to show how samples are related based on the distances between barcodes. It is not a phylogeny tree.
NJ tree is unrooted, which means:
- It clusters together related sequences using distance matrix
- Tree has no orientation
- Without assumption about ancestry (no ancestry defined)
This function creates tree plots, in which the similarity between members of the population (i.e. samples in the sampleSet) is reconstructed based on the 101-SNP genetic barcode. For each specified attribute, a plot is created for the phylogenetic tree, in which the nodes (i.e. samples) are coloured according to that attribute. Graphical display of attributes needs to be specified before using this function, see steps above. In addition to the plot, a newick file is generated, which can be used to upload in tree visualisation tools (for example https://microreact.org/)
# Produce a neighbour-joining tree for sampleSet "Laos"
# with plots in which the samples (i.e. nodes) are coloured according to i) province, ii) kelch13,
# iii) plasmepsin2/3 amplification status, iv) kelch13 and plasmepsin2/3 amplification status
plotTree (ctx, sampleSet="Laos", type="njt",
plots=list(
list(name="ByProvince", attributes="province"),
list(name="ByKelch13", attributes="k13"),
list(name="ByPm23", attributes="pm23"),
list(name="ByKelch13AndPm23", attributes=c("k13","pm23-noColour"))
),
width=15, height=10)
⚙️Parameters:
- See Common parameters
type
- currently the only option is"njt"
which creates a neighbour-joining tree.plots
- the list of attribute rule for which separate plots will be created with the results of this analysis
🤔How to read the output:
Cluster:
- Related samples are clustered together
- Unrelated samples are seperated
Branching:
- Related clusters branch off together
Leaf Colour (the nodes at the end of the lines):
- Graphic Attribute set by user
Branch Colour (the lines):
- Grey branch means not all the leaves branching from it are the same colour
- Coloured branch = all the sub-branches and their leaves are in that colour
Useful Links
🧬 Genetic Report Cards (v1.4)
🧑🏽💻 Source code
🪲 Reporting bugs
📔 Background information
Getting Help & Contacting Us
We are happy to support our partners with any problems they are running in to.
🔧 If you experience any issues, please report on GitHub
📨 If you have any questions, feedback or issues please contact us at genremekong@tropmedres.ac
R Package developed by Olivo Miotto
Version 2.0.0 | Last update on @March 21, 2024