Graph Representation Learning for Structured Data and Genomic Analysis

Xue, Hansheng

Date

2024

Authors

Xue, Hansheng

Abstract

Graphs serve as a universal language for modeling interactions in the real world, across domains as diverse as recommendation systems and genomics. Analyzing graph-structured data helps reveal patterns that are otherwise hidden in it. This thesis uses deep learning to develop graph representation learning models for graph-structured data in e-commerce and genomics. In e-commerce, we devise graph representation learning methods for two prevalent graph structures, the multiplex bipartite graph and the dynamic heterogeneous graph, to improve the performance of recommendation systems. In genomics, we address two typical challenges, metagenomic binning and haplotype phasing, and develop two graph neural networks equipped with constraint satisfaction models.

A multiplex bipartite graph comprises nodes from two distinct domains, with interactions occurring only between the domains. Effectively modeling multiplex bipartite graphs entails addressing two key challenges: a) managing disparate node attributes within bipartite structures, and b) handling the multiple edge types that connect the two domains. We present DualHGCN, a graph neural network model that transforms a multiplex bipartite network into two sets of hypergraphs. This transformation lets DualHGCN encode node representations using spectral hypergraph convolutional operators. DualHGCN also incorporates intra- and inter-message passing strategies to exchange messages across node and edge types.

Dynamic heterogeneous graphs are typically represented as a series of static graph snapshots, where each snapshot is itself a heterogeneous graph.
Effectively representing dynamic heterogeneous graphs entails not only capturing the structural information within individual static snapshots but also learning the evolutionary patterns between consecutive snapshots. To tackle this challenge, we present DyHATR, a method that uses hierarchical attention mechanisms to capture heterogeneous information within each snapshot and combines recurrent neural networks with temporal attention to capture the evolutionary patterns between consecutive snapshots.

In metagenomic contig binning, many existing tools overlook the valuable information in the assembly graph and instead depend primarily on the composition and coverage attributes of contigs. We introduce RepBin, a binning tool that uses a graph neural network to capture the structure of the assembly graph while respecting the heterophilous constraints derived from single-copy marker genes. RepBin further employs graph convolutional networks to label unknown contigs, using the constrained contigs as the initial source of labels.

Reference-based polyploid haplotype phasing aims to partition the reads in a SNP matrix into clusters, each corresponding to a distinct haplotype. Phasing methods frequently use a minimum error correction (MEC) score to measure the discrepancies between each cluster's consensus haplotype and the reads assigned to that cluster. Optimizing the MEC score is NP-hard. We introduce NeurHap, an algorithm that frames haplotype phasing as a graph coloring problem, with colors denoting haplotypes. NeurHap comprises two components: NeurHap-search, a graph neural network that learns vertex representations and color assignments, followed by NeurHap-refine, a local refinement strategy that adjusts colors and optimizes the MEC score.
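
To make the MEC score concrete, the following is a minimal sketch of how it might be computed for a given clustering of reads. It is not taken from NeurHap. The read encoding (strings over "0"/"1" with "-" for uncovered SNP sites), the majority-vote consensus, the tie-breaking rule, and all function names are assumptions made for illustration.

```python
from typing import Sequence

GAP = "-"  # assumed symbol for SNP sites that a read does not cover


def consensus(reads: Sequence[str], n_sites: int) -> str:
    """Majority allele at each SNP site among the reads of one cluster.

    Ties go to the lexicographically smallest allele (an arbitrary,
    illustrative choice); sites covered by no read stay as GAP.
    """
    out = []
    for j in range(n_sites):
        counts = {}
        for read in reads:
            if read[j] != GAP:
                counts[read[j]] = counts.get(read[j], 0) + 1
        out.append(max(sorted(counts), key=counts.get) if counts else GAP)
    return "".join(out)


def mismatches(read: str, haplotype: str) -> int:
    """Number of covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, haplotype) if r != GAP and r != h)


def mec_score(reads: Sequence[str], assignment: Sequence[int], k: int) -> int:
    """Sum, over k clusters, of read-vs-consensus mismatches in each cluster."""
    n_sites = len(reads[0])
    total = 0
    for c in range(k):
        cluster = [r for r, a in zip(reads, assignment) if a == c]
        hap = consensus(cluster, n_sites)
        total += sum(mismatches(r, hap) for r in cluster)
    return total


if __name__ == "__main__":
    reads = ["0011", "0-11", "1100", "110-", "0111"]
    print(mec_score(reads, [0, 0, 1, 1, 0], k=2))
    print(mec_score(reads, [0, 0, 1, 1, 1], k=2))
```

In the graph coloring view, each entry of `assignment` plays the role of a vertex color, so comparing the two calls above shows how moving a single read to a different color changes the score.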