Under copyright
Constraint(s) on Use: This work is protected by the U.S. Copyright Law (Title 17, U.S.C.). Use of this work beyond that allowed by "fair use" or any license applied to this work requires written permission of the copyright holder(s). Responsibility for obtaining permissions and any use and distribution of this work rests exclusively with the user and not the UC San Diego Library. Inquiries can be made to the UC San Diego Library program having custody of the work.
Use: This work is available from the UC San Diego Library. This digital copy of the work is intended to support research, teaching, and private study.
Abstract: The American Gut Project (AGP) [1] is the largest crowd-sourced citizen-science collection of gut microbiome samples available today. Knowledge of the microbiome is still in its early stages, and the enormous number of poorly understood organism and gene effects makes it difficult to interpret results accurately. Reducing this high-dimensional space with fundamentally different embedding techniques can capture different aspects of the microbiome data and so aid research. Dimensionality reduction techniques such as Word2Vec, Hyperbolic Embeddings, and Principal Coordinates Analysis (PCoA) were used to reduce each sample's dimensionality and to explore the techniques' different strengths. The embeddings were validated by using them as features for a supervised machine learning model that classifies microbiome body sites (e.g. sebum, feces, saliva). Compared against the state-of-the-art approach, PCoA on underlying phylogenetic distances, the different embeddings kept the baseline logistic regression model's F1 score within an acceptable margin of +/- 0.1. The comparisons covered the resulting dimension sizes, model prediction metrics, and a representation of sample clusters. This paper discusses the analysis, architecture, and visualization work of the project, which addressed the main technical challenge of gaining a better understanding of microbiota. The project was completed by the Cohort 4 (2017-2019) group of the MAS DSE Master's program. The data come from the Rob Knight Lab at UCSD and are available on the Qiita website under study #10317. The project contains various analyses of microbiome data, survey data, drug data, and diet data. It also contains a Luigi pipeline and a Plotly Dash application for front-end use.
Research Data Curation Program, UC San Diego, La Jolla, 92093-0175 (https://lib.ucsd.edu/rdcp)
Conrad, Ryan; Inghilterra, Ryan; Rowan, Sean; Westerberg, Brandon; McDonald, Daniel; Knight, Rob (2019). DSE Capstone - American Gut Project Cohort4 2019. In Data Science & Engineering Master of Advanced Study (DSE MAS) Capstone Projects. UC San Diego Library Digital Collections. https://doi.org/10.6075/J0HT2MN3
Type
dataset
Identifier
ark:/20775/bb2666864s
Language
English
Subject
Luigi (Python package); Data Science & Engineering Master of Advanced Study (DSE MAS); Word2vec; Principal Coordinates Analysis (PCoA); Hyperbolic; Capstone projects; Microbiome; UniFrac; DSE MAS - 2019 Cohort