---
title: "Getting Started with WMAP"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{WMAP}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 6,
  fig.height = 4
)
library(WMAP)
```

# Introduction

WMAP (Weighted Multi-site Analysis Package) integrates evidence from multiple observational studies with different group compositions using balancing weights.

# Installation

```{r install, eval=FALSE}
# Install from CRAN
install.packages("WMAP")

# Load package
library(WMAP)
```

# Quick Start Example

## Load Demo Data

```{r load-data}
# Load demonstration dataset
data(demo)

# The demo data contains:
# S - Study/site indicators (7 hospitals)
# Z - Group indicators (2 groups: IDC vs ILC)
# X - Patient covariates (clinical variables)
# Y - Outcomes (8 gene expression measurements)
# naturalGroupProp - Known population group proportions
```

## Run Analysis

```{r basic-analysis}
# Estimate causal effects
result <- causal.estimate(
  S = S,                           # Study indicators
  Z = Z,                           # Group indicators
  X = X,                           # Covariates
  Y = Y,                           # Outcomes
  B = 20,                          # Bootstrap samples (use 100+ in practice)
  method = "IC",                   # Weighting method
  naturalGroupProp = naturalGroupProp,
  seed = 123                       # For reproducibility
)
```

## View Results

```{r results}
# Print summary
summary(result)

# Visualize bootstrap ESS distribution
plot(result)
```

# Understanding the Output

**Effective Sample Size (ESS):** The **Percent ESS** indicates how much information remains after weighting:

**Mean Differences:** Shows estimated differences between Group 1 and Group 2 with 95% confidence intervals.

**SD Ratios:** Ratios of standard deviations between groups.

# Advanced Features

## Choosing a Weighting Method

WMAP provides three methods:

- **IC (Integrative Combined)**
- **IGO (Integrative Generalized Overlap)**
- **FLEXOR (Flexible Overlap)**

```{r methods, eval=FALSE}
# Compare methods
result_ic <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp)
result_igo <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp)
result_flexor <- causal.estimate(S, Z, X, Y, B = 100, method = "FLEXOR", naturalGroupProp)

# Compare ESS
result_ic$percentESS
result_igo$percentESS
result_flexor$percentESS
```

## Outcome Model

By default, `causal.estimate` uses observed outcomes directly (`outcome_model = FALSE`). Set `outcome_model = TRUE` to fit a random forest on X, S, Z and substitute predicted outcomes for observed Y in all moment calculations (mean, SD, median).

```{r outcome-model, eval=FALSE}
# Default: use observed outcomes directly
result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC",
                          naturalGroupProp = naturalGroupProp)

# RF predictions substituted for observed Y
result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC",
                          naturalGroupProp = naturalGroupProp,
                          outcome_model = TRUE)
```

When `outcome_model = TRUE`, a random forest is fit per outcome per bootstrap iteration, which increases runtime.

### Cross-fitting

By default (`crossfit_folds = NULL`), the random forest is fit on the full sample. Set `crossfit_folds` to an integer >= 2 for K-fold cross-fitting:

```{r crossfit, eval=FALSE}
# Enable 5-fold cross-fitting
result <- causal.estimate(S, Z, X, Y, B = 100, method = "IC",
                          naturalGroupProp = naturalGroupProp,
                          outcome_model = TRUE,
                          crossfit_folds = 5)
```

## Bootstrap Options

### Adjust Bootstrap Sample Size

```{r bootstrap-size, eval=FALSE}
# Minimum for stable CIs
result <- causal.estimate(S, Z, X, Y, B = 50, method = "IC", naturalGroupProp)

# Recommended for publication
result <- causal.estimate(S, Z, X, Y, B = 200, method = "IC", naturalGroupProp)

# High precision
result <- causal.estimate(S, Z, X, Y, B = 500, method = "IC", naturalGroupProp)
```

**Trade-off:** More bootstrap samples = better precision but longer computation time

### Fast Point Estimates (No Bootstrap)

Skip bootstrap for quick exploration:

```{r no-bootstrap, eval=FALSE}
# Point estimates only (no confidence intervals)
result <- causal.estimate(S, Z, X, Y, B = 0, method = "IC", naturalGroupProp)
summary(result)  # Shows point estimates without CIs
```

### Parallel Bootstrap

Speed up computation using multiple cores:

```{r parallel, eval=FALSE}
# Install required packages first
install.packages(c("future", "future.apply"))

# Define n_cores based on your system (e.g., 4)
result <- causal.estimate(
  S, Z, X, Y,
  B = 100,
  method = "IC",
  naturalGroupProp = naturalGroupProp,
  parallel = TRUE,
  n_cores = 4
)
```

The parallelization strategy depends on the environment. On Mac/Linux, `multicore` is used: workers are created by duplicating the current R process, so data is already in memory and no copying is required. On Windows, `multisession` is used: each worker is a fresh R process that receives a copy of the data, which has higher startup overhead. RStudio disables `multicore` regardless of platform, so `multisession` will be used when running inside RStudio. For best performance on Mac/Linux, run your script from the terminal using `Rscript` or an interactive R session.


# Diagnostic Warnings

WMAP automatically checks for common problems and warns you:

## Low ESS Warning

```{r low-ess-demo, eval=FALSE}
# Example warning:
# Warning message:
# Low effective sample size: 3.1% (threshold: 5%).
```

## Extreme Weights Warning

```{r extreme-weights-demo, eval=FALSE}
# Example warning:
# Warning message:
# Extreme weights: max/mean ratio = 45.2 (threshold: 20).
```

## Positivity Violations

```{r positivity-demo, eval=FALSE}
# Example warning:
# Warning message:
# Sparse study-group cells: 2 cell(s) with < 5 observations.

# Example warning:
# Warning message:
# Near-zero propensity scores: 6.3% of subjects truncated.
```

# Common Workflows

## Standard Analysis Workflow

```{r standard-workflow, eval=FALSE}
# 1. Load and check data
data(demo)
table(S, Z)  # Check all cells have observations
sum(is.na(cbind(S, Z, X, Y)))  # Check for missing values

# 2. Run analysis
result <- causal.estimate(
  S, Z, X, Y,
  B = 100,
  method = "FLEXOR",
  naturalGroupProp = naturalGroupProp,
  seed = 123
)

# 3. Check diagnostics
summary(result)
plot(result)
```

## Method Comparison Workflow

```{r comparison-workflow, eval=FALSE}
# Compare all three methods
set.seed(123)

results_ic <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp)
results_igo <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp)
results_flexor <- causal.estimate(S, Z, X, Y, B = 100, method = "FLEXOR", naturalGroupProp)

# Compare ESS
cat("IC ESS:", results_ic$percentESS, "%\n")
cat("IGO ESS:", results_igo$percentESS, "%\n")
cat("FLEXOR ESS:", results_flexor$percentESS, "%\n")

# Compare estimates
summary(results_ic)
summary(results_igo)
summary(results_flexor)
```

## Sensitivity Analysis Workflow

```{r sensitivity-workflow, eval=FALSE}
# Test sensitivity to bootstrap sample size
result_b50 <- causal.estimate(S, Z, X, Y, B = 50, method = "IGO", naturalGroupProp, seed = 123)
result_b100 <- causal.estimate(S, Z, X, Y, B = 100, method = "IGO", naturalGroupProp, seed = 123)
result_b200 <- causal.estimate(S, Z, X, Y, B = 200, method = "IGO", naturalGroupProp, seed = 123)

# Compare CI widths
summary(result_b50)
summary(result_b100)
summary(result_b200)
```

## Working with Weights Only

Extract balancing weights without running full causal analysis:

```{r weights-workflow, eval=FALSE}
# Get weights
weights <- balancing.weights(
  S, Z, X,
  method = "IC",
  naturalGroupProp = naturalGroupProp
)

# Examine weight distribution
summary(weights)

# Access weight vector for custom analysis
head(weights$wt.v)
```

# Best Practices

**Always set a seed for reproducibility:**

```{r reproducibility, eval=FALSE}
result1 <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp, seed = 123)
result2 <- causal.estimate(S, Z, X, Y, B = 100, method = "IC", naturalGroupProp, seed = 123)
```

**Check data quality before analysis:**

```{r data-quality, eval=FALSE}
# Check for missing values
sum(is.na(cbind(S, Z, X, Y)))

# Ensure all study-group combinations have observations
table(S, Z)
```

# Getting Help

```{r help, eval=FALSE}
# Function documentation
?causal.estimate
?balancing.weights

# Package citation
citation("WMAP")
```