Title
A Streamlined Workflow to Facilitate Genome-Scale AI
Contributor
Clo, Leonardo
Keogh, Aidan
Markuske, William
Wallace, Zachary
Date Created and/or Issued
2023-01 to 2023-06
Contributing Institution
UC San Diego, Research Data Curation Program
Collection
Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects
Rights Information
Under copyright
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Rights Holder and Contact
Wallace, Zachary; Keogh, Aidan; Markuske, William; Clo, Leonardo
Description
Since the first draft of the human genome was published in the early 2000s, the cost of sequencing has fallen, and its speed has risen, faster than Moore’s Law would predict, allowing scientists to collect a wealth of data on the code of life. When genomic sequencing is extended to large populations of humans or other animals, the genomic data of each individual (the genotype) can be paired with that individual's observable or measurable traits (the phenotype), so that relationships between genotypes and phenotypes can be analyzed. This type of work has powered numerous Genome-Wide Association Studies (GWAS), which have associated diseases with underlying genetic causes that can be traced to specific positions on the genome. In this study we have collected genotypes for a heterogeneous population of about 13,000 rats using unphased Whole-Genome Sequencing. The data for each rat are represented as a constellation of point mutations, also known as Single Nucleotide Polymorphisms (SNPs), where each mutation is characterized as a positional variation with respect to the rn6 rat reference genome, following a reference-guided assembly and alignment of nucleotide reads. Alongside the genotypes, we have also collected quantitative behavioral phenotypes, such as locomotor activity in response to various external stimuli. With this comprehensive collection of genotype data, represented by millions of SNPs per rat and paired with behavioral phenotype data, this work seeks to assess the capabilities of Machine Learning for predicting rat behavior on a whole-genome scale.
However, genomic data are complex: the sheer number of SNPs makes the data extremely high-dimensional, the relationships among SNPs that determine phenotypes are interdependent, and phenotype data are scarce relative to genotype data. As a result, there are many different ways to approach a Machine Learning solution for predicting phenotype from genotype. These approaches differ in their data processing and reduction techniques, their models, and their tuned hyperparameters. To handle the complexity of devising Machine Learning models that predict phenotypes from millions of SNPs, we propose a framework that combines powerful ML workflow tools, namely Hydra, Optuna, and MLFlow, with a highly scalable genomics toolkit, namely sgkit, into a streamlined pipeline meant to facilitate genome-scale AI.
Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)
Wallace, Zach; Keogh, Aidan; Markuske, William; Clo, Leonardo (2023). A Streamlined Workflow to Facilitate Genome-Scale AI. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0T72HNS
Type
dataset
Identifier
ark:/20775/bb4755006h
Language
English
Subject
Usability
Genotype-to-phenotype
Machine learning
MLFlow
Hydra
Genomics
Scalability
Data Science & Engineering Master of Advanced Study (DSE MAS)
Capstone projects
Genome wide association studies (GWAS)
Bioinformatics
DSE MAS - 2023 Cohort
