---
title: "Chapter 07: Kernels --- Writing and Using OpenCL Kernel Files"
author: "Kjell Nygren"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Chapter 07: Kernels --- Writing and Using OpenCL Kernel Files}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## What is a kernel?

An OpenCL **kernel** is the entry point that executes on the GPU. It is
declared with the `__kernel` qualifier and is compiled at runtime (by the
OpenCL driver) from a source string assembled by the host program. Each
invocation of the kernel processes one work-item --- typically one element of
an output array --- in parallel with all other work-items.

In `nmathopencl`, kernels live in `inst/cl/src/`. Each file defines exactly
one `__kernel void` function. The file name follows the pattern
`<function>_kernel.cl`; for example, `dnorm_kernel.cl` implements the
vectorized density computation for the normal distribution.

## Anatomy of a simple kernel

```c
// dnorm_kernel.cl
// Vectorized wrapper kernel for the public Mathlib dnorm interface.

#pragma OPENCL EXTENSION cl_khr_fp64 : enable

__kernel void dnorm_kernel(
    __global const double* x,
    const double mu,
    const double sigma,
    const int give_log,
    __global double* out,
    const int n
) {
    int i = get_global_id(0);
    if (i >= n) return;

    out[i] = dnorm4(x[i], mu, sigma, give_log);
}
```

Key elements:

| Element | Purpose |
|---------|---------|
| `#pragma OPENCL EXTENSION cl_khr_fp64 : enable` | Activates double-precision arithmetic (required for all `nmathopencl` kernels) |
| `__kernel void` | Marks this function as a GPU entry point |
| `__global const double* x` | Input buffer in device (global) memory |
| `const double mu` | Scalar parameter passed by value |
| `__global double* out` | Output buffer in device (global) memory |
| `const int n` | Work-item count guard |
| `get_global_id(0)` | Returns this work-item's index in the first dimension |
| `dnorm4(...)` | Calls the nmath library function, defined in the concatenated library source |

The `if (i >= n) return;` guard handles the case where the global work size
is rounded up to a multiple of the local work-group size, leaving excess
work-items that must not write out-of-bounds.

## Standard argument layout

All kernels in `inst/cl/src/` follow a consistent argument layout designed
to work with `openclPort::opencl_dbl_scalar_kernel_runner`:

```
kernel(double arg0, double arg1, ..., double argN,
       __global double* out, int n_out)
```

Scalar `double` parameters come first (in any order needed by the
underlying nmath function), followed by the output buffer and its length.
Input **vectors** are also represented as `__global const double*` and
precede the scalar arguments when present.

## The `*_ex_kernel.cl` variants

Some distributions expose a secondary interface. The `bessel_*_ex_kernel.cl`
files, for example, provide a version that returns both the function value
and a sign/exponent component simultaneously. These exist to mirror the
extended interfaces in nmath (`bessel_i` vs `bessel_i` with `expo`
argument) and are distinguished by the `_ex_` infix.

## Kernel file index

The ~130 kernel files in `inst/cl/src/` cover every function exposed by
the `nmathopencl` R API:

### Density kernels

`dbeta_kernel.cl`, `dbinom_kernel.cl`, `dbinom_raw_kernel.cl`,
`dcauchy_kernel.cl`, `dchisq_kernel.cl`, `dexp_kernel.cl`, `df_kernel.cl`,
`dgamma_kernel.cl`, `dgeom_kernel.cl`, `dhyper_kernel.cl`,
`dlnorm_kernel.cl`, `dlogis_kernel.cl`, `dnbeta_kernel.cl`,
`dnbinom_kernel.cl`, `dnbinom_mu_kernel.cl`, `dnchisq_kernel.cl`,
`dnf_kernel.cl`, `dnorm_kernel.cl`, `dnt_kernel.cl`, `dpois_kernel.cl`,
`dpois_raw_kernel.cl`, `dsignrank_kernel.cl`, `dt_kernel.cl`,
`dunif_kernel.cl`, `dweibull_kernel.cl`, `dwilcox_kernel.cl`

### CDF kernels

`pbeta_kernel.cl`, `pbinom_kernel.cl`, `pcauchy_kernel.cl`,
`pchisq_kernel.cl`, `pexp_kernel.cl`, `pf_kernel.cl`, `pgamma_kernel.cl`,
`pgeom_kernel.cl`, `phyper_kernel.cl`, `plnorm_kernel.cl`,
`plogis_kernel.cl`, `pnbeta_kernel.cl`, `pnbinom_kernel.cl`,
`pnbinom_mu_kernel.cl`, `pnchisq_kernel.cl`, `pnf_kernel.cl`,
`pnorm_kernel.cl`, `pnt_kernel.cl`, `ppois_kernel.cl`, `psignrank_kernel.cl`,
`pt_kernel.cl`, `ptukey_kernel.cl`, `punif_kernel.cl`, `pweibull_kernel.cl`,
`pwilcox_kernel.cl`

### Quantile kernels

`qbeta_kernel.cl`, `qbinom_kernel.cl`, `qcauchy_kernel.cl`,
`qchisq_kernel.cl`, `qexp_kernel.cl`, `qf_kernel.cl`, `qgamma_kernel.cl`,
`qgeom_kernel.cl`, `qhyper_kernel.cl`, `qlnorm_kernel.cl`,
`qlogis_kernel.cl`, `qnbeta_kernel.cl`, `qnbinom_kernel.cl`,
`qnbinom_mu_kernel.cl`, `qnchisq_kernel.cl`, `qnf_kernel.cl`,
`qnorm_kernel.cl`, `qnt_kernel.cl`, `qpois_kernel.cl`, `qsignrank_kernel.cl`,
`qt_kernel.cl`, `qtukey_kernel.cl`, `qunif_kernel.cl`, `qweibull_kernel.cl`,
`qwilcox_kernel.cl`

### Random-variate kernels

`rbeta_kernel.cl`, `rbinom_kernel.cl`, `rcauchy_kernel.cl`,
`rchisq_kernel.cl`, `rexp_kernel.cl`, `rf_kernel.cl`, `rgamma_kernel.cl`,
`rgeom_kernel.cl`, `rhyper_kernel.cl`, `rlnorm_kernel.cl`,
`rlogis_kernel.cl`, `rmultinom_kernel.cl`, `rnbinom_kernel.cl`,
`rnbinom_mu_kernel.cl`, `rnchisq_kernel.cl`, `rnorm_kernel.cl`,
`rpois_kernel.cl`, `rsignrank_kernel.cl`, `rt_kernel.cl`, `runif_kernel.cl`,
`rweibull_kernel.cl`, `rwilcox_kernel.cl`

### Special function kernels

`beta_special_kernel.cl`, `choose_special_kernel.cl`,
`lbeta_special_kernel.cl`, `lchoose_special_kernel.cl`,
`gammafn_kernel.cl`, `lgammafn_kernel.cl`, `lgammafn_sign_kernel.cl`,
`lgamma1p_kernel.cl`, `digamma_kernel.cl`, `trigamma_kernel.cl`,
`psigamma_kernel.cl`, `pentagamma_kernel.cl`, `tetragamma_kernel.cl`,
`dpsifn_kernel.cl`

### Math support kernels

`fmax2_kernel.cl`, `fmin2_kernel.cl`, `fprec_kernel.cl`, `fround_kernel.cl`,
`fsign_kernel.cl`, `ftrunc_kernel.cl`, `imax2_kernel.cl`, `imin2_kernel.cl`,
`sign_kernel.cl`, `r_pow_kernel.cl`, `r_pow_di_kernel.cl`,
`pow1p_kernel.cl`, `log1pmx_kernel.cl`, `log1pexp_kernel.cl`,
`log1mexp_kernel.cl`, `logspace_add_kernel.cl`, `logspace_sub_kernel.cl`,
`logspace_sum_kernel.cl`

### RNG core kernels

`norm_rand_kernel.cl`, `unif_rand_kernel.cl`, `exp_rand_kernel.cl`,
`r_unif_index_kernel.cl`

### Utility kernels

`r_check_user_interrupt_kernel.cl`, `r_check_stack_kernel.cl`

### Bessel kernels (standard and extended)

`bessel_i_kernel.cl`, `bessel_i_ex_kernel.cl`,
`bessel_j_kernel.cl`, `bessel_j_ex_kernel.cl`,
`bessel_k_kernel.cl`, `bessel_k_ex_kernel.cl`,
`bessel_y_kernel.cl`, `bessel_y_ex_kernel.cl`

## Program assembly

Kernels are **never compiled in isolation**. The host-side runner assembles
a complete OpenCL program string by concatenating:

1. The nmath library source (via `opencltools::load_kernel_library("nmath", package = "nmathopencl")`)
2. The kernel file (via `opencltools::load_kernel_source("src/<name>_kernel.cl", package = "nmathopencl")`)

This concatenated string is then passed to `clBuildProgram`. Because
`opencltools::load_kernel_library()` performs topological sorting, the library functions
are always defined before the kernel refers to them. Chapter 08 covers this
assembly process in detail.

## Writing a custom kernel

To write a kernel that calls an `nmathopencl` nmath function:

1. Identify the nmath function you need (e.g., `dgamma`).
2. Check its `@provides` tag in the corresponding `.cl` file to confirm
   the exact C signature.
3. Write a `__kernel void` function following the standard layout.
4. Load the nmath library subset your function needs using the
   `@all_depends` metadata as a guide (see Chapter 08).
5. Append your kernel source after the library source in the assembled
   program string.

Chapter 10 works through this process end-to-end for the `glmbayes`
GLM log-posterior kernels.
