USE OF SSR FOR CANNABIS SATIVA CROPS ORIGIN IDENTIFICATION: A PRELIMINARY ANALYSIS TO EVALUATE GENETIC MARKER CLASSIFICATION METHODS

Published on 21/11/2024 - ISBN: 978-65-272-0843-3

Paper Title
USE OF SSR FOR CANNABIS SATIVA CROPS ORIGIN IDENTIFICATION: A PRELIMINARY ANALYSIS TO EVALUATE GENETIC MARKER CLASSIFICATION METHODS
Authors
  • Cássio Augusto Rodrigues Bettim
  • Oscar Cardenas
  • Franciele Maboni Siqueira
  • Clarice Sampaio Alho
  • Marcio Dorn
Modality
Poster
Subject area
DNA and Genomics
Publishing Date
21/11/2024
Country of Publishing
Brazil
Language of Publishing
English
Paper Page
https://www.even3.com.br/anais/xmeeting-2024/837315-use-of-ssr-for-cannabis-sativa-crops-origin-identification--a-preliminary-analysis-to-evaluate-genetic-marker-cla
ISBN
978-65-272-0843-3
Keywords
Cannabis sativa; SSR; Machine Learning; Feature Selection
Summary
Cannabis sativa L. is the most widely distributed psychoactive drug globally and the most consumed illegal drug in Latin America. Dedicated efforts by the Brazilian Federal Police (BFP) to combat drug trafficking and eradicate large-scale cultivation have led to an increase in practices such as smaller-scale indoor cultivation and alternative forms of trafficking, such as sending seeds via postal services. In this context, there is a growing need for methods to differentiate and predict key phenotypes, such as the origin of cultivation, which is crucial for police intelligence. Simple sequence repeats (SSRs), also called microsatellites, are a class of short repeated sequences with motifs of 2 to 6 nucleotides that are widely used as molecular markers. They vary in the number of repeat units and in the repeat pattern, which gives them a multiallelic character, high variability between individuals, and therefore high discrimination power. In the present work, we sought to apply and compare two strategies: 1) identifying SSRs and selecting the markers that differ most between groups, and 2) machine learning methods based on feature selection, which obtain a panel of the markers most relevant to the target phenotype from a high-dimensional SSR data frame. The selected markers are then used to induce a machine learning classification model that predicts the origin of plant cultivation. The analysis was carried out on 38 genomes of Cannabis sativa seeds seized by the BFP, from four origin groups: Colombia, Paraguay, the Marijuana Polygon region, and a Foreign group. The sequences were aligned against the Cannabis sativa reference sequence GCF_029168945.1.
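To make the SSR definition above concrete (motifs of 2 to 6 nucleotides repeated in tandem), the following is a minimal Python sketch that scans a sequence for perfect tandem repeats with a regular expression. It is an illustration only and does not reproduce the algorithm of any tool used in this work; the function name `find_ssrs`, the `min_repeats` threshold, and the toy sequence are assumptions made for the example.

```python
import re
from typing import List, NamedTuple


class SSR(NamedTuple):
    motif: str    # repeated unit, e.g. "AT"
    repeats: int  # number of tandem copies of the motif
    start: int    # 0-based start position in the sequence


def find_ssrs(sequence: str, min_repeats: int = 3) -> List[SSR]:
    """Find perfect tandem repeats with motifs of 2 to 6 nucleotides.

    min_repeats is a hypothetical threshold, not a value from the study.
    """
    # The lazy motif group prefers the shortest repeating unit;
    # [ACGT] keeps "N" (missing data) out of any motif.
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # skip single-base runs such as "AAAAAA"
        hits.append(SSR(motif, len(match.group(0)) // len(motif), match.start()))
    return hits


# Toy sequence (invented): an (AT)5 repeat and a (GAA)4 repeat with flanks.
toy = "GGC" + "AT" * 5 + "CCGTC" + "GAA" * 4 + "TTC"
ssrs = find_ssrs(toy)
for ssr in ssrs:
    print(ssr)
```

The number of repeat units reported for each motif is the quantity that differs between individuals, and is what makes such markers useful for discrimination.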
The resulting alignments were converted into consensus sequences in FASTA format so that they could be analyzed with the Micro and Mini SATellite IdentificatioN (SATIN) program. The SSRs obtained were then processed with the same program according to the number of repetitions of each marker, as a pre-selection and subsequent classification of the markers that differ most and so allow the groups to be told apart. SATIN identified 467,527 SSRs, of which around 364 differed most in the number of repetitions. After filtering out markers containing missing data (“N” nucleotides) and keeping only those present in every sample of each region, a list of 30 SSRs was obtained, located in different coding regions. These SSRs are present in the samples from Colombia, the Marijuana Polygon, and the Foreign group; no SSR was identified as exclusive to all samples originating from Paraguay. The first strategy thus allowed the four groups of samples to be classified according to origin with at least 30 SSRs. The next steps of the work include applying the feature selection method to the SSR data frame to extract a second list of the most relevant features, and applying supervised machine learning techniques to induce predictive classification models for the crop's origin. These models will use the features selected by both methods as input, with grid search and cross-validation for model optimization, followed by a comparison of model performance through different evaluation metrics and further analysis of the selected features.
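The filtering step of the first strategy can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: it assumes that a marker with missing data is represented as `None`, and it reads "exclusive" as present in every sample of one group and absent from every sample of the other groups. The function name, marker identifiers, sample names, and repeat counts are all invented for the example.

```python
from collections import defaultdict


def group_exclusive_markers(samples):
    """Map each origin group to the markers found only in that group.

    samples: dict of sample name -> (group, {marker_id: repeat_count or None})
    """
    by_group = defaultdict(list)
    for group, markers in samples.values():
        # Drop markers with missing calls (None stands in for "N" data).
        called = {m for m, count in markers.items() if count is not None}
        by_group[group].append(called)

    # Markers present in every sample of a group.
    shared = {g: set.intersection(*sets) for g, sets in by_group.items()}
    # Markers seen in at least one sample of a group.
    seen = {g: set.union(*sets) for g, sets in by_group.items()}

    exclusive = {}
    for g in shared:
        others = set().union(*(seen[o] for o in seen if o != g))
        exclusive[g] = shared[g] - others
    return exclusive


# Toy input (invented values, two groups only).
samples = {
    "col_1": ("Colombia", {"ssr_a": 12, "ssr_b": 7, "ssr_c": None}),
    "col_2": ("Colombia", {"ssr_a": 11, "ssr_b": 7}),
    "par_1": ("Paraguay", {"ssr_b": 9, "ssr_d": 5}),
    "par_2": ("Paraguay", {"ssr_b": 9, "ssr_c": 4}),
}
exclusive = group_exclusive_markers(samples)
print(exclusive)
```

In this toy input one group ends up with no exclusive marker, which shows how a strict "present in every sample" criterion can leave a group without a dedicated marker even when the other groups have one.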
Title of the Event
20º Congresso Brasileiro de Bioinformática: X-Meeting 2024
City of the Event
Salvador
Title of the Proceedings of the event
X-Meeting presentations
Name of the Publisher
Even3
Means of Dissemination
Digital media

How to cite

BETTIM, Cássio Augusto Rodrigues et al. USE OF SSR FOR CANNABIS SATIVA CROPS ORIGIN IDENTIFICATION: A PRELIMINARY ANALYSIS TO EVALUATE GENETIC MARKER CLASSIFICATION METHODS. In: X-Meeting presentations. Anais... Salvador (BA) Hotel Deville Prime, 2024. Available at: https://www.even3.com.br/anais/xmeeting-2024/837315-USE-OF-SSR-FOR-CANNABIS-SATIVA-CROPS-ORIGIN-IDENTIFICATION--A-PRELIMINARY-ANALYSIS-TO-EVALUATE-GENETIC-MARKER-CLA. Accessed on: 25/04/2025
