Geometric and Topological Methods in Data Science

Institute for Computational and Experimental Research in Mathematics (ICERM)

December 16, 2021 - December 17, 2021
Thursday, December 16, 2021
  • 8:45 - 9:00 am EST
    Welcome
    11th Floor Lecture Hall
    • Jeffrey Brock, Yale University
    • Bjorn Sandstede, Brown University
  • 9:00 - 9:45 am EST
    Geometry of Molecular Conformations in Cryo-EM
    11th Floor Lecture Hall
    • Speaker
    • Roy Lederman, Yale University
    • Session Chair
    • Jeffrey Brock, Yale University
    Abstract
    Cryo-Electron Microscopy (cryo-EM) is an imaging technology that is revolutionizing structural biology. Cryo-electron microscopes produce many very noisy two-dimensional projection images of individual frozen molecules; unlike related methods, such as computed tomography (CT), the viewing direction of each particle image is unknown. The unknown directions and extreme noise make the determination of the structure of molecules challenging. While other methods for structure determination, such as x-ray crystallography and NMR, measure ensembles of molecules, cryo-electron microscopes produce images of individual particles. Therefore, cryo-EM could potentially be used to study mixtures of conformations of molecules. We will discuss a range of recent methods for analyzing the geometry of molecular conformations using cryo-EM data.
  • 10:00 - 10:30 am EST
    Coffee Break
    11th Floor Collaborative Space
  • 11:00 - 11:45 am EST
    Geometric and Topological Approaches to Representation Learning in Biomedical Data
    11th Floor Lecture Hall
    • Speaker
    • Smita Krishnaswamy, Yale University
    • Session Chair
    • Jeffrey Brock, Yale University
    Abstract
    High-throughput, high-dimensional data has become ubiquitous in the biomedical sciences as a result of breakthroughs in measurement technologies and data collection. While these large datasets containing millions of observations of cells, people, or brain voxels hold great potential for understanding the generative state space of the data, as well as drivers of differentiation, disease and progression, they also pose new challenges in terms of noise, missing data, measurement artifacts, and the so-called “curse of dimensionality.” In this talk, I will cover data geometric and topological approaches to understanding the shape and structure of the data. First, we show how diffusion geometry and deep learning can be used to obtain useful representations of the data that enable denoising and dimensionality reduction. Next, we show how to combine diffusion geometry with topology to extract multi-granular features from the data to assist in differential and predictive analysis. On the flip side, we also create a manifold geometry from topological descriptors, and show its applications to neuroscience. Finally, we will show how to learn dynamics from static snapshot data by using a manifold-regularized neural ODE-based optimal transport. Together, we will show a complete framework for exploratory and unsupervised analysis of big biomedical data.
  • 12:00 - 12:10 pm EST
    Group Photo
    11th Floor Lecture Hall
  • 12:10 - 1:30 pm EST
    Lunch/Free Time
  • 1:30 - 2:15 pm EST
    Metric Repair
    11th Floor Lecture Hall
    • Speaker
    • Anna Gilbert, Yale University
    • Session Chair
    • Bjorn Sandstede, Brown University
    Abstract
    Metric embeddings are key algorithmic and mathematical techniques in applied mathematics and approximation algorithms, and their adaptations are ubiquitous in machine learning. They are used to embed one metric space into another with the hope of revealing hidden structure or reducing the dimension of a data set. Examples include the random projection of a set of points in high dimensions to a lower dimension and the embedding of a graph into a tree-like structure. The fundamental limitation with the application of metric embeddings to machine learning is that their use in data analysis is predicated upon the input data coming from a metric space. Real data, however, do not necessarily conform to a metric; they are messy. The fundamental problem in our research program is metric repair: given a set of input distances, adjust them so that they conform to a metric.
  • 2:30 - 3:00 pm EST
    Coffee Break
    11th Floor Collaborative Space
  • 3:00 - 3:45 pm EST
    From Questionnaires to PDEs: Dynamics and Emergent Models from Disorganized Data
    11th Floor Lecture Hall
    • Virtual Speaker
    • Yannis Kevrekidis, Johns Hopkins University
    • Session Chair
    • Bjorn Sandstede, Brown University
    Abstract
    Starting with sets of disorganized observations of spatiotemporally evolving systems obtained at different (also disorganized) sets of parameters, we demonstrate the data-driven derivation of generative, parameter dependent, evolutionary partial differential equation models of the data. We know what observations were made at the same physical location, the same time or the same set of parameter values - knowing neither where the physical location is, nor when the temporal moment is, nor what the parameter values are; this tensor type of data is reminiscent of shuffled (multi)-puzzle tiles.
    The independent variables for the evolution equations (their “space” and “time”) as well as their effective parameters are all “emergent”, i.e. determined in a data-driven way from our disorganized observations of behavior in them.
    We use a diffusion map based “questionnaire” approach to build a parametrization of our emergent space for the data. This approach iteratively processes the data by successively observing them on the “space”, the “time” and the “parameter” axes of a tensor. Once the data are organized, we use neural-network-based learning to approximate the operators governing the evolution equations in this emergent space. Our illustrative example is based on a previously developed vertex-plus-signaling model of Drosophila embryonic development. This allows us to discuss features of the process like symmetry breaking, translational invariance of the emergent PDE model, and interpretability.
  • 4:00 - 4:45 pm EST
    Topological data analysis of zebrafish patterns
    11th Floor Lecture Hall
    • Virtual Speaker
    • Alexandria Volkening, Purdue University
    • Session Chair
    • Bjorn Sandstede, Brown University
    Abstract
    Self-organization is present at many scales in biology, and here I will focus specifically on elucidating how brightly colored cells interact to form skin patterns in zebrafish. Wild-type zebrafish are named for their dark and light stripes, but mutant zebrafish feature variable skin patterns, including spots and labyrinth curves. All of these patterns form as the fish grow due to the interactions of tens of thousands of pigment cells, making agent-based modeling a natural approach for describing pattern formation. By identifying cell interactions that may change to create mutant patterns, my long-term goal is to help link genes, cell behavior, and visible animal characteristics in fish. However, agent-based models are stochastic and have many parameters, so comparing simulated patterns and fish images is often a qualitative process. Developing analytically tractable continuum models from agent-based systems is one means of addressing these challenges and better understanding the roles of different parameters in pattern formation. Alternatively, methods from topological data analysis can be applied to cell-based systems directly. In this talk, I will give an overview of our models and present quantitative comparisons of in silico and in vivo cell-based patterns using our topological methods.
  • 5:00 - 6:00 pm EST
    Reception
    11th Floor Collaborative Space
Friday, December 17, 2021
  • 9:00 - 9:45 am EST
    Robust and Scalable Learning of Gaussian Mixture Models
    11th Floor Lecture Hall
    • Speaker
    • Kisung You, Yale University
    • Session Chair
    • Ian Adelstein, Yale University
    Abstract
    A Gaussian mixture model (GMM) is one of the prominent methods in both the machine learning and statistics communities for probabilistic clustering and density estimation. Estimation of the model is usually carried out by an expectation-maximization (EM)-like algorithm. When the sample size is large, however, the EM algorithm may not be a convenient option due to exponential growth in computational costs. In this talk, I present a divide-and-conquer approach with minimal communication to resolve this problem by working with a Hilbertian structure of GMMs induced by kernel embedding of Gaussian measures. This is done by estimating multiple models on independent subsets of the data and aggregating those into a single GMM by geometric median in the Hilbert space, which guarantees robustness of the estimate under mild conditions. Next, once the estimate is obtained, it may contain redundant components, so that the resulting clustering is not meaningful and the individual components become difficult to interpret. Based on this observation, two postprocessing strategies for model reduction and clustering characterization are proposed.
  • 10:00 - 10:30 am EST
    Coffee Break
    11th Floor Collaborative Space
  • 11:00 - 11:45 am EST
    Characterizing Transitions in Developmental Biology using Topological Machine Learning
    11th Floor Lecture Hall
    • Speaker
    • Dhananjay Bhaskar, Yale University
    • Session Chair
    • Ian Adelstein, Yale University
    Abstract
    I will present ongoing work applying topological data analysis (TDA) and machine learning to identify transitions in cell organization and cell state within the context of developmental biology. First, using cell positions obtained from agent-based simulations of cell sorting and skin pigmentation, the complex relationship between cell-cell interactions and emergent patterns is automatically discovered via unsupervised classification of persistence images. This approach is used to analyze phase transitions in proliferating, heterogeneous populations and is found to be empirically robust to random perturbations and finite-size effects. Next, I will discuss challenges associated with TDA of high-dimensional single-cell sequencing datasets. In particular, the lack of suitable techniques for intrinsic dimension and curvature estimation is limiting the use of multi-parameter filtration as a tool for understanding these data. I will briefly outline a novel approach for tackling this problem, using graph diffusion probabilities to predict curvature on toy data consisting of points sampled from quadric surfaces.
  • 12:00 - 1:15 pm EST
    Lunch/Free Time
  • 1:15 - 2:00 pm EST
    Geometry of Neural Representations Shapes Multi-Task Function in Neural Networks and Humans
    11th Floor Lecture Hall
    • Speaker
    • John Murray, Yale University
    • Session Chair
    • Smita Krishnaswamy, Yale University
    Abstract
    Flexible cognitive behavior requires the ability to learn and perform a diversity of tasks without detrimental interference. What are the geometric properties of neural representations that support multi-task learning and function? In this talk I will present recent and ongoing studies integrating computational modeling and empirical data to link the representational geometry of neural networks to cognitive function.
  • 2:15 - 2:45 pm EST
    Coffee Break
    11th Floor Collaborative Space
  • 2:45 - 3:30 pm EST
    Connecting molecules to individual cell behavior to emergent collective behavior
    11th Floor Lecture Hall
    • Speaker
    • Thierry Emonet, Yale University
    • Session Chair
    • Smita Krishnaswamy, Yale University
    Abstract
    Cells live in communities where they interact with each other and their environment. By coordinating individuals, such interactions often result in collective behavior and function that emerge on scales larger than the individuals and are beneficial to the population. At the same time, populations of individuals, even isogenic ones, display phenotypic heterogeneity, which diversifies individual behavior and enhances the resilience of the population in unexpected situations. This raises a dilemma: although individuality provides advantages, it also tends to reduce coordination. I will discuss our experimental and theoretical efforts that use bacterial chemotaxis as a model system to understand the origin of individual cellular behavior and performance, and how populations of cells reconcile individuality with group behavior to robustly operate in multiple environments. Bacterial chemotaxis is one of the best understood model systems in all of biology. As such it enables us to examine both experimentally and theoretically how dynamical interactions at one scale give rise to structure and function at the next (larger) scale. Thus, it is a great testbed for novel mathematical methods to study data.
  • 3:45 - 4:30 pm EST
    Geometric Scattering And Applications
    11th Floor Lecture Hall
    • Speaker
    • Michael Perlmutter, University of California, Los Angeles
    • Session Chair
    • Smita Krishnaswamy, Yale University
    Abstract
    The scattering transform is a mathematical model of convolutional neural networks (CNNs) introduced for functions defined on Euclidean space by Stéphane Mallat. It differs from traditional CNNs by using predesigned wavelet filters rather than filters which are learned from training data. This leads to a network which provably has stability and invariance guarantees. Moreover, in situations where the wavelets can be designed in correspondence to underlying physics, it can produce very good numerical results. The rise of geometric deep learning motivated the introduction of geometric scattering transforms for data sets modeled as graphs or manifolds. These networks use wavelets constructed using the spectral decompositions of an appropriate Laplacian operator or via polynomials of a diffusion operator. In my talk, I will discuss applications of these networks to a variety of geometric deep learning tasks and show that they have analogous stability and invariance guarantees to their Euclidean predecessor. I will then talk about modifications of the graph scattering transform which can increase numerical performance, as well as work that uses the graph scattering transform as the front end of an encoder-decoder network for the purposes of molecule generation.

All event times are listed in ICERM local time in Providence, RI (Eastern Standard Time / UTC-5).
