{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FastEMRIWaveforms Tutorial\n", "## ICERM Workshop: Waveform acceleration with machine learning and GPUs\n", "\n", "### Michael Katz, Max Planck Institute for Gravitational Physics (Albert Einstein Institute), Lead developer for FastEMRIWaveforms" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this tutorial, you will learn the basics of building an accelereated EMRI waveform. We encourage participants to see our paper ([arxiv.org/2008.06071](https://arxiv.org/abs/2008.06071)) and the FastEMRIWaveforms [package documentation](https://bhptoolkit.org/FastEMRIWaveforms/) for more information, as well as our forthcoming paper that will describe the waveform in much more detail than our PRL submission. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import packages:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import numpy as np\n", "\n", "import h5py\n", "\n", "\n", "from few.amplitude import romannet\n", "from few.utils.utility import check_for_file_download, p_to_y\n", "from few.trajectory.flux import RunSchwarzEccFluxInspiral\n", "from few.amplitude.romannet import RomanAmplitude\n", "from few.amplitude.interp2dcubicspline import Interp2DAmplitude\n", "from few.waveform import FastSchwarzschildEccentricFlux" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A Quick Introduction to FastEMRIWaveforms\n", "\n", "### Collaborators: Alvin Chua, Niels Warburton, Scott Hughes, Lorenzo Speri" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "generate = FastSchwarzschildEccentricFlux()\n", "\n", "M = 1e6 # large mass\n", "mu = 1e1 # small mass\n", "p0 = 12.0 # separation \n", "e0 = 0.4 # eccentricity\n", "theta = np.pi/3. # polar viewing angle\n", "phi = np.pi/4. # azimuthal viewing angle\n", "dist = 1.0 # distance in Gpc\n", "\n", "T = 1/365. # in years\n", "dt = 10.0 # time spacing of data stream\n", "\n", "wave = generate(M, mu, p0, e0, theta, phi, dist, T=T, dt=dt)\n", "\n", "plt.plot(wave.real)\n", "plt.plot(wave.imag)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "FastEMRIWaveforms is the first fully relativistic template generation tool for extreme mass ratio inspirals. We think of it more as a framework than any specific waveform. In this tutorial, we will discuss the main pieces of this framework and how they relate to our first fully relativistic waveform model shown above. \n", "\n", "As a basic primer, our five key points we want to achieve with this framework are:\n", "\n", "* Accuracy: Our fast waveforms must be accurate when compared to slow and accurate waveforms generated by the waveform modeling community.\n", "\n", "* Modularity: These waveforms are to be built out of a set of modules. These modules are to be easily interchangeable, as well as stand alone tools available for more in-depth analysis.\n", "\n", "* Flexibility: This framework must be easily adaptable to new computational methods and/or improvements in EMRI physics. \n", "\n", "* Easy User Interface: All modules and complete waveforms are to have a front-facing python interface with clear and extensive documentation, as well as many examples.\n", "\n", "* Parallelization: These waveforms must take advantage of parallelization techniques such as OpenMP, as well as accelerator hardware such as GPUs. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Basics of an EMRI Waveform" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The guiding equation for building an EMRI waveform is given by\n", "\n", "\\begin{equation}\n", "h_+-ih_x = \\frac{1}{r}\\sum_{lmkn}\\left(-\\frac{Z_{lmkn}}{\\omega_{mkn}^2}\\right)\\left(S_{lmkn}(\\theta)e^{-im\\phi}\\right)e^{i\\Phi_{mkn}} = \\frac{1}{r}\\sum_{lmkn}A_{lmkn}\\Theta_{lmkn}e^{i\\Phi_{mkn}}.\n", "\\end{equation}\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we are concerned with generating fast and accurate waveforms. These waveforms are created with a sequence of modules. We will discuss the three main modules used to produce these waveforms. The first module is the Trajectory module which takes initial parameters and produces arrays for all of the evolving quantities of concern. These include the phase evolution, {$\\Phi_\\varphi,\\Phi_\\theta, \\Phi_r$}, from the start to the end of the EMRI orbit. With these quantities, we can produce:\n", "\n", "\\begin{equation}\n", "\\Phi_{mkn} = m\\Phi_\\varphi + k\\Phi_\\theta + n\\Phi_r.\n", "\\end{equation}\n", "\n", "The trajectory also produces orbital quantities over time. These include the separation ($p$), eccentricity ($e$), and the inclination angle of the orbit ($\\iota$). With these arrays containing the orbital evolution of these values, we generate the amplitudes, $A_{lmkn}$. While the evolution is over time, the various phasing and amplitude computations are produced within a frequency decomposition into $(l,m,k,n)$ modes.\n", "\n", "With phases and amplitdues in hand, we combine this with the angular harmonic information ($\\Theta_{lmkn}$) to produce the final waveform. In this step, we calculate the sum of all modes at each time point in the template data stream. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Overall Waveform Strategy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* We calculate our trajectories and amplitudes as sparse arrays in time by using a large adaptive stepping integrator. This produces arrays with $\\sim100$ points. All of these quantities vary slowly and smoothly. After these calculations are complete, we scale this up to the actual data stream cadence. \n", "\n", "* The speed of EMRI waveforms is strongly determined by the amount of harmonic content. Higher eccentricities require more modes to produce a high fidelity waveform. In order to make our waveforms as efficient as possible, we perform an online mode content calculation that removes modes from consideration if they do not contribute to the waveform power determined by a user-defined threshold. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Current Waveform Model: Schwarzschild Eccentric" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* No $k$ modes\n", "* Orbit is equatorial\n", "* $S_{lmkn}(\\theta)e^{-im\\phi}$ reduces to $_{-2}Y_{lm}(\\theta,\\phi)$\n", "* $l:\\{2,10\\}$, $m:\\{-l,l\\}$, $n:\\{-30,30\\}$ $\\rightarrow$ 3843 modes. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Fast Trajectories: $\\{p, e, \\Phi_\\varphi, \\Phi_r\\}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We are not going to spend too much time on the trajectory part. However, we need to generate it in order to build the rest of our waveform. To build the trajectory, we integrate with large steps using an RK8 integrator. 
" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "traj = RunSchwarzEccFluxInspiral()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "p0 = 16.0 # initial separation\n", "e0 = 0.4 # initial eccentricity\n", "mu = 180. # iniital small mass in solar masses, produces approximately 1 yr waveform\n", "M = 1e6 # initial large mass in solar masses\n", "dt = 10.0 # sets initial step size\n", "T = 1.0 # in years\n", "\n", "t, p, e, Phi_phi, Phi_r, flux = traj(M, mu, p0, e0, T=T, dt=dt)\n", "print(\"length:\", len(t), \"duration:\", t[-1])" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(2, 3)\n", "plt.subplots_adjust(wspace=0.3)\n", "fig.set_size_inches(14, 8)\n", "axes = axes.ravel()\n", "\n", "ylabels = [r'$e$', r'$p$', r'$e$', r'$\\Phi_\\phi$', r'$\\Phi_r$', r'Flux']\n", "xlabels = [r'$p$', r'$t$', r'$t$', r'$t$', r'$t$', r'$t$', r'$t$', r'$t$']\n", "ys = [e, p, e, Phi_phi, Phi_r, flux]\n", "xs = [p, t, t, t, t, t]\n", "\n", "for i, (ax, x, y, xlab, ylab) in enumerate(zip(axes, xs, ys, xlabels, ylabels)):\n", " ax.plot(x, y, lw=0.5)\n", " ax.scatter(x, y, s=5)\n", " ax.set_xlabel(xlab, fontsize=16)\n", " ax.set_ylabel(ylab, fontsize=16)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## RomanNet Amplitudes: $A_{lmn}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To generate the amplitudes, we use a RomanNet ([arXiv:1811.05491](https://arxiv.org/abs/1811.05491)). Roman stands for Reduced Order Modelling with Artificial Neurons. When training a neural network, it can be crucial to hand it data that is distilled to highlight the most pertinent information. A nice tool for doing this is Reduced Order Modeling. Reduced order modelling projects the information with lossless compression down to a lower dimensional space. We take our complex mode amplitude vectors containing 3843 modes and project this down to a real-valued space with 198 values. The neural network is then trained with inputs given by the $p$ and $e$ values and outputs given by the reduced order coefficients ($\\alpha_i$):\n", "\n", "\\begin{equation}\n", "A_{lmn}\\in\\mathbb{C}^{3843}\\xrightarrow{\\mathit{ROM}}\\alpha_i\\in\\mathbb{R}^{198}\\xrightarrow{\\mathit{train}} f(p,e)=\\alpha_i\n", "\\end{equation}\n", "\n", "The neural network itself is extremely simple. It is a fully connected network with a LeakyReLU activation on all layers but the final layer. This means the neural network can be built simply with a sequence of linear matrix multiplications followed by a pass through the activation function. Once the neural network is trained, we will have a set of weights. This is where we will start in this tutorial. \n", "\n", "During online evaluation of the waveform, we perform the reverse process:\n", "\n", "\\begin{equation}\n", "f(p,e)\\xrightarrow{\\mathit{eval}}\\alpha_i\\in\\mathbb{R}^{198}\\xrightarrow{\\mathit{project}}A_{lmn}\\in\\mathbb{C}^{3843}\n", "\\end{equation}\n", "\n", "This method has pros and cons:\n", "\n", "Pros:\n", "\n", "* This is more of a global fit, rather than individual fits to given modes. This generally means storage of less information in memory. If we do individual fits of every mode, the memory necessary to store this information would scale badly with mode content. We are currently working in Schwarzschild Eccentric. 
As we go to generic Kerr, we expect the number of modes to increase by a factor of $\\sim10$.\n", "\n", "* Due to its global-fit nature, this method is generally faster to evaluate than individual interpolants. \n", "\n", "* Since it is just a neural network and a linear projection, this method is very well suited to GPUs.\n", "\n", "* We expect this method, or methods similar to it, to scale better with dimensionality as we move towards the end goal of generic Kerr orbits. \n", "\n", "Cons:\n", "\n", "* For extremely quiet modes, this method can be less accurate. However, as these modes are quiet, this does not result in a significant loss of accuracy in the final waveform. Caution must be taken when using this method to analyze individual mode amplitudes. There is an approximate floor in the amplitude values at $\\sim10^{-5}$. For reference, the loudest modes at a given $p$ and $e$ value are usually $\\sim0.1-1.0$. (These amplitudes are not scaled for distance.) \n", "\n", "* Training these neural networks can be more of an art than a science. It takes a lot of trial and error to get this right. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Constructing the neural network from trained weights" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "\n", "# prepare to load the weights\n", "\n", "path_to_few_dir = romannet.__file__[:-25]\n", "\n", "weight_file = fp = \"SchwarzschildEccentricInput.hdf5\"\n", "check_for_file_download(fp, path_to_few_dir)\n", "\n", "weights = []\n", "bias = []\n", "dim1 = []\n", "dim2 = []\n", "\n", "# get highest layer number\n", "num_layers = 0\n", "\n", "# extract all necessary information from the file\n", "with h5py.File(path_to_few_dir + \"few/files/\" + weight_file, \"r\") as fp:\n", "\n", "    # get basic information\n", "    num_teuk_modes = fp.attrs[\"num_teuk_modes\"]\n", "    transform_factor = fp.attrs[\"transform_factor\"]\n", "    break_index = fp.attrs[\"break_index\"]\n", "\n", "    # determine layer arrangement\n", "    for key, value in fp.items():\n", "        if key == \"reduced_basis\":\n", "            continue\n", "\n", "        layer_num = int(key[1:])\n", "\n", "        if layer_num > num_layers:\n", "            num_layers = layer_num\n", "\n", "    # get weights and bias\n", "    for i in range(1, num_layers + 1):\n", "        temp = {}\n", "        for let in [\"w\", \"b\"]:\n", "            mat = fp.get(let + str(i))[:]\n", "            temp[let] = np.asarray(mat)\n", "\n", "        weights.append(temp[\"w\"])\n", "        bias.append(temp[\"b\"])\n", "        dim1.append(temp[\"w\"].shape[0])\n", "        dim2.append(temp[\"w\"].shape[1])\n", "\n", "    # get the post network transform matrix\n", "    transform_matrix = np.asarray(fp[\"reduced_basis\"])\n", "\n", "# activation function\n", "# we use a factor of 0.2 for negative values\n", "def LeakyReLU(x):\n", "    out = (x >= 0.0) * x + (x < 0.0) * 0.2 * x\n", "    return out\n", "\n", "# build the neural network\n", "def RomanNet(p, e):\n", "\n", "    p = np.atleast_1d(p)\n", "    e = np.atleast_1d(e)\n", "\n", "    # convert from the p coordinate to a special y coordinate\n", "    # see the documentation for more details\n", "    y = p_to_y(p, e)\n", "\n", "    # prepare input\n", "    x = np.array([y, e])\n", "\n", "    # basic fully connected network\n", "    for layer_i in range(num_layers):\n", "\n", "        # linear transformation\n", "        x = np.dot(weights[layer_i].T, x) + bias[layer_i][:, np.newaxis]\n", "\n", "        # do not want to activate the last layer\n", "        if layer_i < num_layers - 1:\n", "            # non-linear activation\n", "            x = LeakyReLU(x)\n", "\n", "    # separate real and imaginary\n",
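"    # (the network output stacks the real parts of the reduced-order coefficients first\n", "    #  and the imaginary parts second, so break_index marks where the imaginary half begins)\n",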
imaginary\n", " x = x[:break_index] + 1j * x[break_index:]\n", " \n", " # project back to amplitude basis\n", " out = np.dot(transform_matrix.T, x)/transform_factor\n", " return out.T\n", " \n", " " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# test it\n", "p_test = np.array([11.0, 10.0, 10.0])\n", "e_test = np.array([0.2, 0.3, 0.1])\n", "\n", "RomanNet(p_test, e_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Produce amplitudes associated with our trajectories" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# generate amplitudes with roman net\n", "amps = RomanNet(p, e)\n", "\n", "# check against actual code\n", "RomanNetTrue = RomanAmplitude()\n", "amps_check_1 = RomanNetTrue(p, e)\n", "\n", "assert np.allclose(amps, amps_check_1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Check our RomanNet global fit against accurate values" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# get accurate values \n", "# each mode is fitted with a bicubic spline\n", "BicubicAmps = Interp2DAmplitude()\n", "amps_check_2 = BicubicAmps(p, e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Print vectors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\"romannet:\", amps_check_1[0][0:10], \"\\nBicubic Spline:\", amps_check_2[0][0:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see if we take a quick examination of our results (a small subset) we see the problem with the global fit. It cannot handle modes with a very small amplitude. Therefore, you might naively think that this method might not work." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Compare via the cosine between the vectors" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "cos = (np.dot(amps_check_2.conj()[0].T, amps_check_1[0]) /\n", " np.sqrt(np.dot(amps_check_2.conj()[0], amps_check_2[0]) \n", " * np.dot(amps_check_1.conj()[0], amps_check_1[0]))).real\n", "\n", "print('Cos:', cos)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now see that if we compare the results as a whole, the results match very well. If we look at individual modes that have high power, they are likely to strongly match when the bicubic spline is compared to the RomanNet method. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## GPU-accelerated Waveform Build: $\\sum_{lmn}$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Quick primer on GPUs\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "GPUs run code in parallel in a configuration of grids, blocks, and threads.\n", "\n", "* **Threads** are the the actual software units that run the code. Threads run independently of one another. Threads are referenced in code using `threadIdx.x`.\n", "* **Blocks** are an array of threads. This array can be 1, 2, or 3 dimensions. In more than one dimension, thread indices are referenced using `x`, `y`, and `z` (e.g. `threadIdx.z`). Blocks, similar to threads, are referenced using `blockIdx.x`, `blockIdx.y`, or `blockIdx.z`. The size of a block, i.e. the number of threads along a given dimension is given as `blockDim.x`. **Note**: In applications I have worked on, I rarely ever use more than 1 dimension of threads. 
\n", "* **Grids** are an array of blocks. This array can be 1, 2, or 3 dimensions. There is no reference for grids as they efffectively represent the entire GPU kernel. The size of a grid, i.e. the number of blocks along a dimension, is determined with `gridDim.x`. \n", "\n", "Organizing your code properly into blocks and threads is a key component of maximizing your efficiency. \n", "\n", "Another equally important aspect to the maximization of efficiency is the proper use of the GPU memory structures. There are three main GPU memory structures to consider (there are a few more but they are for more specific uses). These are global memory, shared memory, and local memory.\n", "\n", "* **Global memory** is contained on the host (or off-chip). Global memory is accessible by every block and thread in the entire grid. It contains many GBs of RAM. (The GPUs I use currently, which are Tesla V100s, have ~16GB. The new A100s have 40-80GB.) Because this memory is contained off-chip, it is slow to access. A key to accessing global memory efficiently is to use so-called \"memory coalescence.\" This effectively means that neighboring threads access neighboring addresses in memory. This allows the compiler to make up to 32 memory reads at the same time, rather than 32 separate memory reads. We will see this in a simple case below. \n", "\n", "* **Local memory** is also contained on the host. Local memory consists of any arrays allocated within the kernel specific to each thread. Therefore, this memory is only accessible by the thread it is created on. Usually, around 512 KB are available for local memory for each thread. Since, this memory is off-chip, it is also slow to access. Local memory is, however, always accessed in a memory coalescing manner. \n", "\n", "* **Shared memory** is different. Shared memory is located on-chip. It is, therefore, much faster to read from (~100x faster). Shared memory is accessible by all the threads on a given block. The catch is that only ~48 KB are available for shared memory. Therefore, leveraging shared memory effectively is key to the efficiency of GPU code. \n", "\n", "To sum up, there are two main points to consider when beginning to program on GPUs:\n", "\n", "* Layout your grid effectively for your given problem.\n", "* Use memory effectively: leverage the availability of shared memory and read from global memory in a coalescing fashion. \n", "\n", "Some parting thoughts on GPUs:\n", "\n", "* For maximal efficiency and stability, I usually code in C++/CUDA so that everything is precompiled and tested. \n", "* There are python libraries that leverage the power of GPUs. Check out numba, CuPy, PyTorch, Tensorflow, PyCUDA. \n", "* Lately, I have really focused on writing CPU/GPU agnostic code. This means that the source code is ~99% the same between the two. On the python side I usually sub in CuPy for Numpy. We will see some basic pointers on this below. In C++, I use short compiler directives to make minimal changes.\n", "* Generally speaking it is optimal to store all quantities in a 1 dimensional array when working with GPUs. Let's say you have a two dimensional array that has dimensions (dim1, dim2) and is referenced with (i, j). You can turn this into a 1D array that references each value with (i * dim2 + j). " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Why are GPUs important for EMRI Waveforms?\n", "\n", "The waveform summation is the key bottleneck. This is an operation that is uniquely suited to GPUs. 
You can see the improvement below." ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "### Basic example coded in Python\n", "\n", "We are going to write code in Python that reflects what we would do on the GPU. It will obviously still run on the CPU, but it will give you a chance to see how to write some basic GPU code and understand how it works. **Make sure to read the comments in the code.**\n", "\n", "We will do two examples. " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Multiply two arrays\n", "\n", "Here we will multiply two arrays as we would on the GPU. " ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# set up our GPU quantities that would come in if we were actually on GPUs\n", "\n", "NUM_THREADS = 64 # needs to be a multiple of 32\n", "blockDim = NUM_THREADS # blockDim.x" ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# this initial piece is a basic example of CPU/GPU agnostic code in Python\n", "try:\n", "    import cupy as xp\n", "\n", "except ModuleNotFoundError:\n", "    import numpy as xp\n", "\n", "# __global__\n", "def multiply_arrays(array_out, array1, array2, n):\n", "    \"\"\"\n", "    // what this would look like in CUDA\n", "\n", "    // if (threadIdx.x + blockDim.x * blockIdx.x >= n) return;\n", "    for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n; i += blockDim.x * gridDim.x)\n", "    {\n", "        array_out[i] = array1[i] * array2[i];\n", "    }\n", "    \"\"\"\n", "    # The loops here are to simulate the GPU.\n", "    # In reality the GPU will run all threads and blocks in parallel\n", "\n", "    # begin simulation\n", "    for block in range(num_blocks):\n", "        for thread in range(NUM_THREADS):\n", "            # end simulation\n", "\n", "            # get the overall index in the grid\n", "            # based on the thread and block\n", "            i = thread + block * blockDim\n", "\n", "            # since the GPU runs threads in parallel\n", "            # we need to make sure the GPU does not index a value\n", "            # on the last block that goes over the array length\n", "            if i >= n:\n", "                continue\n", "\n", "            # use our index to get values out of the array\n", "            # notice this simple statement uses memory coalescence\n", "            # neighboring threads will have consecutive indices\n", "            # therefore, they will access consecutive addresses in each array\n", "            array_out[i] = array1[i] * array2[i]\n", "\n", "\n", "length = int(2 ** 14)\n", "\n", "# initialize arrays\n", "array1 = xp.random.rand(length)\n", "array2 = xp.random.rand(length)\n", "\n", "# prepare output array\n", "array_out = xp.zeros_like(array1)\n", "\n", "# get the number of blocks\n", "# the number of blocks multiplied by the number of threads per block\n", "# must be greater than the length of the array\n", "num_blocks = int(np.ceil((length + NUM_THREADS - 1)/NUM_THREADS))\n", "\n", "# this would actually be called as\n", "\"\"\"\n", "multiply_arrays<<<num_blocks, NUM_THREADS>>>(array_out, array1, array2, length);\n", "cudaDeviceSynchronize();\n", "\"\"\"\n", "\n", "multiply_arrays(array_out, array1, array2, length)\n", "\n", "# confirm it\n", "assert np.allclose(array_out, array1 * array2)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "#### Linear Interpolation to scale up an array " ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "Our second example is a bit more complicated. This way we can see how to use shared memory. 
\n", "\n", "Here we will scale up the size of an array using linear interpolation. The key here is that we need every block to read in the original array to shared memory. From there, we perform the interpolation. This allows us to achieve memory coalescence with all global reads and ensure that all reads when actually interpolating are done from shared memory. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# in CUDA, when you statically allocate shared arrays,\n", "# you cannot use a variable length\n", "# therefore, in C++/CUDA you need to declare a max value up front\n", "\n", "#define MAX_INPUT 1000\n", "\n", "def linear_interpolation(array_out, array_in, n_out, n_in, dx_out, dx_in):\n", " \n", " # we want to read in to shared memory\n", " # want to do this with memory coalescence\n", " # since this is only on a specific block\n", " # we need to use only the threads on this block\n", " \n", " \"\"\"\n", " // the below simulated code block in CUDA would look like this:\n", " \n", " // in CUDA, you have to declare shared arrays\n", " \n", " __shared__ double shared_array_in[MAX_INPUT];\n", " \n", " \n", " for (int i = threadIdx.x; i < n_in; i += blockDim.x)\n", " {\n", " // memory coalescence only needed from array_in (global memory)\n", " shared_array_in[i] = array_in[i];\n", " }\n", " \n", " // we need to make sure the threads all finish this operation before we move on\n", " // Therefore, we force the threads to synchronize\n", " \n", " __syncthreads();\n", " \n", " \"\"\"\n", " \n", " \n", " # we will just do the following to simulate in python easily\n", " shared_array_in = xp.zeros(n_in)\n", " \n", " for thread in range(NUM_THREADS):\n", " i = thread \n", " \n", " # this while statement simulates the above commented loop\n", " while (i < n_in):\n", " # here is the memory coalesced reads\n", " shared_array_in[i] = array_in[i]\n", " \n", " # just for simulation\n", " i += blockDim\n", " \n", " \n", " \"\"\"\n", " // this is what the below would really like it in CUDA\n", " // if (threadIdx.x + blockDim.x * blockIdx.x >= n_out) return;\n", " for (int i = threadIdx.x + blockDim.x * blockIdx.x; i < n_out; i += blockDim.x * gridDim.x)\n", " {\n", " // get the new out value assuming equal spacing\n", " double x_new = dx_out * i;\n", " \n", " // get index of the point in the original array below the new point\n", " int ind_in = (int) (x_new / dx_in);\n", " \n", " // get the below x value\n", " double x_old = ind_in * dx_in;\n", "\n", " // slope of segment\n", " double m = (shared_array_in[ind_in + 1] - shared_array_in[ind_in])/dx_in;\n", "\n", " // interpolate\n", " double new_value = m * (x_new - x_old) + shared_array_in[ind_in];\n", " array_out[i] = new_value;\n", " }\n", " \n", " \"\"\"\n", " # The loops here are to simulate the GPU. 
\n", " # In reality the GPU will run all threads and blocks in parallel\n", " \n", " # begin simulation\n", " for block in range(num_blocks):\n", " for thread in range(NUM_THREADS):\n", " # end simulation\n", " \n", " # get the overall index in the grid\n", " # based on the thread and block\n", " i = thread + block * blockDim\n", " \n", " # since the GPU runs threads in parallel\n", " # we need to make sure the GPU does not index a value\n", " # on the last block that goes over the array length\n", " if i >= n_out:\n", " continue\n", " \n", " x_new = dx_out * i\n", " \n", " ind_in = int(x_new / dx_in)\n", " x_old = ind_in * dx_in\n", " \n", " m = (shared_array_in[ind_in + 1] - shared_array_in[ind_in])/dx_in\n", " \n", " new_value = m * (x_new - x_old) + shared_array_in[ind_in]\n", " array_out[i] = new_value\n", " \n", "\n", "# original length\n", "length_in = 100\n", "\n", "# prepare input arrays\n", "x_in = xp.arange(length_in)\n", "y_in = (x_in ** 2).astype(xp.float64)\n", "\n", "# set length out\n", "length_out = int(2 ** 14)\n", "\n", "# setup the new x_values\n", "x_new = xp.linspace(x_in[0], x_in[-1], length_out + 1)[:-1]\n", "\n", "# change in original x values\n", "dx_in = 1.0\n", "\n", "# we will actually only use the spacing to find new points\n", "dx_out = x_new[1] - x_new[0]\n", " \n", "y_out = xp.zeros(length_out)\n", "\n", "num_blocks = int(np.ceil((length_out + NUM_THREADS -1)/NUM_THREADS))\n", "\n", "# this would actually be called as\n", "# linear_interpolation<<>>(y_out, y_in, length_out, length_in, dx_out, dx_in);\n", "# cudaDeviceSynchronize();\n", "linear_interpolation(y_out, y_in, length_out, length_in, dx_out, dx_in)\n", "\n", "# confirm it\n", "plt.plot(x_new, y_out, lw=6, label='out')\n", "plt.plot(x_in, y_in, '.', label='in')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Comment on CPU/GPU Agnostic code in C++/CUDA and Python" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Making CPU/GPU agnostic code consists of 3 main parts: \n", "\n", "* Sub CuPy for NumPy\n", "* Compiler directives in C++/CUDA\n", "* Easy transition from Python to C++ through an augmented Cython process\n", "\n", "Above, we saw an example of how to deal with the Python side. The Cython functionality is effectively a decorator function that gets the pointer of a Numpy or Cupy array and sends that into the C++ code. Below is a basic example of how to use basic compiler directives to make your code more CPU/GPU agnostic.\n", "\n", "**Note**: I generally argue that we should build codes for GPUs and then adapt them to CPUs, not the other way around which is the typical direction. In my experience, the CPU codes adapted from GPU codes are just as fast or within a small percentage of the speed of CPU designed codes. And, generally, it is much harder to optimize going from CPU->GPU rather than GPU->CPU. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Future Plans\n", "\n", "* Build fast trajectories in Kerr under the NIT framework\n", "* Generate amplitudes in Kerr regime\n", "* Further analyze how to determine best methods for mode content inclusion\n", "\n", "We need your help! If anyone is interested in working on these types of issues, please let myself or any of my collaborators (listed above) know!" 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.8" } }, "nbformat": 4, "nbformat_minor": 4 }