Artificial Intelligence and Machine Learning Resources at MaizeGDB

MaizeGDB Tools


PanEffect: This tool visualizes the potential effects of missense variants based on (1) all possible amino acid substitutions for proteins in the B73 genome and (2) the predicted impacts across the natural variation in the maize pan-genome.

Protein Structure Resources
  • Foldseek Search: This tool takes a B73 gene model and finds the top protein structure hits (using FoldSeek) against the proteomes of Arabidopsis thaliana (Arabidopsis), Glycine max (Soybean), Oryza sativa (Asian rice), Sorghum bicolor (Sorghum), Zea mays (Maize), Homo sapiens (Human), Saccharomyces cerevisiae (Budding yeast), and Schizosaccharomyces pombe (Fission yeast).
  • FATCAT Comparison: This tool shows structural alignments between a query maize protein and the top hits from the sequence-based tool Diamond and the two structural alignment tools FoldSeek and FATCAT, based on alignments against the following four plant proteomes: Arabidopsis thaliana (Arabidopsis), Glycine max (Soybean), Oryza sativa (Asian rice), and Sorghum bicolor (Sorghum).

Maize Feature Store: The Maize Feature Store is a centralized repository of maize data sets formatted for use as gene-based features in machine learning applications.

Downloads


Gene Ontology (GO) Terms:
  • Genomes analyzed: B73 v4, B73 v5, Mo17 (CAU), W22, and the 25 NAM founder genomes.
  • Methodology: Used Pannzer, a machine-learning tool, to predict Gene Ontology terms and UniProt descriptions.

Functional Annotations:
  • Genome analyzed: B73 v5.
  • Methodology: Inferred UniProt annotations from sequence homology using Diamond and from structural alignment using FoldSeek and FATCAT. The FASSO pipeline was used to merge the results and predict potential orthologs and functional annotations.

Metabolic Pathways:
  • Genomes analyzed: B73 v5 and the 25 NAM founder genomes.
  • Methodology: Employed the E2P2 pipeline on official protein annotations to predict enzymatic function. Pathway assignments were determined using data from CornCyc, AraCyc, and OryzaCyc. Assignments were validated against SAVI validation lists for evidence codes.

Protein Embeddings:
  • Genomes analyzed: B73 v5 and the 25 NAM founder genomes.
  • Methodology: Computed protein embeddings using three distinct protein language models (pLMs): ESM, ProtBERT, and T5. The embeddings are provided in both .h5 and .npy formats.
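
As a rough illustration of working with embeddings in .npy format, the sketch below saves a toy matrix to a temporary file, loads it back with NumPy, and compares two vectors by cosine similarity. The file name, the array shape (one row per protein), and the random values are assumptions made for this example and do not describe the layout of the actual MaizeGDB downloads. The .h5 files would likely need an HDF5 reader such as h5py, which is not shown here.

```python
import os
import tempfile

import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy stand-in for a downloaded file: 3 hypothetical proteins, 8 dimensions each.
# Real pLM embeddings are much larger; the shape here is only for illustration.
rng = np.random.default_rng(0)
toy = rng.normal(size=(3, 8)).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_embeddings.npy")  # hypothetical file name
    np.save(path, toy)
    embeddings = np.load(path)  # reads the whole array into memory

sim_self = cosine_similarity(embeddings[0], embeddings[0])
sim_other = cosine_similarity(embeddings[0], embeddings[1])
print(embeddings.shape, sim_self, sim_other)
```

Cosine similarity is one common way to compare embedding vectors; any downstream model that accepts fixed-length numeric features could use the loaded array directly.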

Protein Structures:
  • Genomes analyzed: B73 v5, W22, and the 25 NAM founder genomes.
  • Methodology: Predicted the 3D protein structures utilizing the ESMFold pipeline.

Variant Effect Scores:
  • Genomes analyzed: B73 v4, B73 v5, W22, the NAM founder genomes, and 25 other maize genomes.
  • Methodology: Used the esm-variants tool, which is built on the ESM protein language model, to predict variant effect scores. Each score is the log-likelihood ratio (LLR) of the variant amino acid relative to the wild-type amino acid.
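
To make the log-likelihood ratio concrete, here is a minimal sketch that scores substitutions at a single position. It does not call esm-variants or any language model: the per-residue probabilities are invented for the example, and `llr_score` is a hypothetical helper name. Under this convention, a more negative score means the model considers the variant less likely than the wild type.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def llr_score(log_probs, wild_type, variant):
    """Log-likelihood ratio of the variant residue versus the wild-type residue."""
    return log_probs[variant] - log_probs[wild_type]


# Invented weights for one position (an assumption, not model output):
# leucine is strongly favoured, isoleucine is tolerated, everything else is rare.
raw = {aa: 1.0 for aa in AMINO_ACIDS}
raw["L"] = 50.0
raw["I"] = 10.0
total = sum(raw.values())
log_probs = {aa: math.log(weight / total) for aa, weight in raw.items()}

score_LI = llr_score(log_probs, "L", "I")  # conservative substitution
score_LP = llr_score(log_probs, "L", "P")  # disfavoured substitution
print(round(score_LI, 3), round(score_LP, 3))
```

Because the normalizing constant cancels in the difference, the score depends only on the ratio of the two residue probabilities.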

Source Code


FASSO: Functional Annotations by Combining Sequence and Structure Orthology. FASSO generates structure-based orthologs by combining three methods (Diamond, FoldSeek, and FATCAT) and assigns UniProt function annotations.
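
One simple way to think about combining hits from several tools is a consensus filter that keeps targets reported by more than one method. The sketch below is only an illustration of that general idea: the `consensus_hits` function, the `min_tools` threshold, and the protein names are all hypothetical, and the actual merging rules used by FASSO may differ.

```python
def consensus_hits(hits_by_tool, min_tools=2):
    """Map each target reported by at least `min_tools` tools to the tools that found it."""
    support = {}
    for tool, targets in hits_by_tool.items():
        for target in set(targets):  # count each tool at most once per target
            support.setdefault(target, []).append(tool)
    return {
        target: sorted(tools)
        for target, tools in support.items()
        if len(tools) >= min_tools
    }


# Hypothetical top hits for one query protein from each method.
hits = {
    "Diamond": ["protA", "protB"],
    "FoldSeek": ["protA", "protC"],
    "FATCAT": ["protA", "protC"],
}

result = consensus_hits(hits)
print(result)
```

In this toy input, a target found only by the sequence-based search is dropped, while a target supported by both structural methods is kept even without sequence support.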

PanEffect: PanEffect is a JavaScript framework to explore variant effects across a pangenome. The tool has two views that allow a user to (1) explore all possible amino acid substitutions and their variant effects for a reference genome, and (2) view the natural variation and its effects across a pangenome.

Publications


Andorf CM, Haley OC, Hayford RK, Portwood JL, Sen S, Cannon EK, Gardiner JM, Woodhouse MR. (2023). PanEffect: A pan-genome visualization tool for variant effects in maize. In bioRxiv (p. 2023.09.25.559155). https://doi.org/10.1101/2023.09.25.559155.

Woodhouse MR, Portwood JL, Sen S, Hayford RK, Gardiner JM, Cannon EK, Harper LC, Andorf CM. (2023). Maize Protein Structure Resources at the Maize Genetics and Genomics Database. Genetics. https://doi.org/10.1093/genetics/iyad016.

Andorf CM, Sen S, Hayford RK, Portwood JL, Cannon EK, Harper LC, Gardiner JM, Sen TZ, Woodhouse MR. (2022). FASSO: An AlphaFold based method to assign functional annotations by combining sequence and structure orthology. In bioRxiv (p. 2022.11.10.516002). https://doi.org/10.1101/2022.11.10.516002.

Cho KT, Sen TZ, Andorf CM. (2022). Predicting Tissue-Specific mRNA and Protein Abundance in Maize: A Machine Learning Approach. Frontiers in Artificial Intelligence, Sec. AI in Food, Agriculture and Water (26 May 2022). https://doi.org/10.3389/frai.2022.830170.