---
title: "Getting Started with bibnets"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started with bibnets}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r setup}
library(bibnets)
```

## Introduction

`bibnets` constructs bibliometric networks from scholarly metadata. It imports
the export formats of the major bibliographic databases, converts them
internally to a common tabular representation, and projects that representation
into networks through a single function per network type. The package covers
the standard constructions and adds several that are less commonly available:
position-based attention weighting, aggregation of entities into higher-level
networks, a range of counting and similarity weights, and temporal
construction over time windows.

### Data import

`bibnets` reads Scopus, Web of Science, OpenAlex, Lens.org, Dimensions,
Crossref, BibTeX, and RIS exports. `read_biblio()` detects the format from
file content and dispatches to the corresponding reader; all readers return an
identical schema, so records from different databases can be combined without
manual reconciliation. Multi-valued fields (authors, references, and keywords)
are parsed into list-columns. A plain data frame with one row per paper can
also be passed straight to a builder by naming the relevant column and its
delimiter, so a reader is optional.

### Network builders

Dedicated builders construct co-authorship, co-citation, bibliographic
coupling, keyword co-occurrence, direct citation, and historiograph networks; a
generic builder covers other projections. The builders share one interface and
return the same edge-list structure, so the network is selected by the function
name and, where a builder supports several projections, by its `type` argument.

### Weighting and aggregation

Counting methods determine each publication's contribution to an edge. They
range from full and fractional counting to position-aware schemes (harmonic,
geometric, golden-ratio, first, last, and first–last). Five similarity
measures (cosine, association strength, Jaccard, inclusion, and equivalence)
rescale the projected weights.

Attention weighting assigns each author a positional weight that sums to one
across the byline, so a publication's credit is distributed by byline position
rather than equally and a large author list does not dominate the network.
Aggregation pools the references or members of a group to construct
collaboration and coupling networks among countries, institutions, or sources
rather than individuals. Temporal construction applies any builder over fixed,
sliding, or cumulative windows, and a disparity-filter backbone retains the
edges that are significant relative to each node's strength.

### Implementation

The incidence matrix is stored as a sparse `dgCMatrix` and projected with
`crossprod()` or `tcrossprod()`; edges are extracted without forming a dense
node-by-node matrix, so memory scales with the number of non-zero
co-occurrences rather than with the square of the number of nodes. The package
imports only Matrix, stats, and utils.
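
As a minimal sketch of that projection, assuming nothing about the package
internals beyond what is stated above, the chunk below builds a sparse
paper-by-author incidence matrix for four toy bylines and reads the edge list
off the sparse cross-product. The object names (`toy_bylines`,
`toy_incidence`, `toy_edges`) are illustrative and not part of `bibnets`.

```{r sketch-sparse-projection}
library(Matrix)

# Four toy papers; each element is one byline (illustrative data only).
toy_bylines <- list(
  c("Smith J", "Doe A", "Lee K"),
  c("Smith J", "Lee K"),
  c("Doe A", "Lee K"),
  c("Smith J", "Doe A")
)
toy_nodes <- sort(unique(unlist(toy_bylines)))

# Sparse paper-by-author incidence matrix (rows = papers, columns = authors).
toy_incidence <- sparseMatrix(
  i = rep(seq_along(toy_bylines), lengths(toy_bylines)),
  j = match(unlist(toy_bylines), toy_nodes),
  x = 1,
  dims = c(length(toy_bylines), length(toy_nodes)),
  dimnames = list(NULL, toy_nodes)
)

# Author-by-author co-occurrence; the result stays sparse.
toy_co <- crossprod(toy_incidence)

# Read the stored non-zero entries, drop the diagonal, keep each pair once.
nz <- summary(toy_co)
nz <- nz[nz$i != nz$j, ]
lo <- pmin(nz$i, nz$j)
hi <- pmax(nz$i, nz$j)
once <- !duplicated(cbind(lo, hi))
toy_edges <- data.frame(
  from  = toy_nodes[lo[once]],
  to    = toy_nodes[hi[once]],
  count = nz$x[once]
)
toy_edges
```

Only the non-zero cells of the cross-product are visited, which is the
property the paragraph above relies on.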

### Export formats

Constructed networks are exported to igraph, tidygraph, cograph, Gephi,
GraphML, and sparse-matrix representations.

### Output

Every builder returns a **`bibnets_network`**: a tidy data frame with four
columns —

- `from`, `to` — the two endpoints of an edge,
- `count` — the raw binary co-occurrence count for that pair,
- `weight` — the analytical weight after counting and optional similarity
  normalization.

With `counting = "full"` and `similarity = "none"`, `weight` equals
`count`. They diverge once any other counting method or a similarity measure
is applied.

The builders, at a glance:

| Function | Nodes | An edge means |
|---|---|---|
| `author_network()` | authors | co-authorship, author coupling, or co-citation |
| `reference_network()` | cited references | two references cited together |
| `document_network()` | documents | shared references, shared citers, or direct citation |
| `keyword_network()` | keywords | two keywords appear together |
| `source_network()` | journals | sources share references or are co-cited |
| `country_network()` | countries | countries collaborate or share references |
| `institution_network()` | institutions | institutions collaborate or share references |
| `conetwork()` | any field | entities co-occur, or share values of another field |
| `local_citations()` | documents | within-corpus citation counts |
| `historiograph()` | documents | directed citation history among top-cited papers |
| `temporal_network()` | any builder's nodes | the same network over time windows |

## Quick start

You do **not** need a special reader. Any data frame with one row per paper
works — point a builder at the column that holds the entity and tell it the
delimiter:

```{r quick-df}
papers <- data.frame(
  `Author Names` = c("Smith J, Doe A, Lee K", "Smith J, Lee K",
                     "Doe A, Lee K", "Smith J, Doe A"),
  check.names = FALSE
)

author_network(papers, authors = "Author Names", sep = ",")
```

If your data is a scholarly export instead, read it first — the format is
detected from the file content — then build with the defaults:

```{r quick-reader, eval = FALSE}
data    <- read_biblio("scopus.csv")
authors <- author_network(data, type = "collaboration")
```

Either way the result is the same four-column edge list, ready to inspect,
prune, or export.

## Reading your own data

### Scholarly exports

`read_biblio()` accepts a file, a folder, or several files, and detects
Scopus, Web of Science, OpenAlex, BibTeX, RIS, Lens.org, and Dimensions
from the content:

```{r read-files, eval = FALSE}
data <- read_biblio("export.csv")
data <- read_biblio("folder_with_exports/")
data <- read_biblio(c("part_1.csv", "part_2.csv"))
```

The format-specific readers can also be called directly
(`read_scopus()`, `read_wos()`, `read_openalex_csv()`, `read_dimensions()`,
`read_lens()`, `read_bibtex()`, `read_ris()`).

### A custom CSV

For a CSV that matches no known export, map each source column onto a
standard field **by name** — `authors`, `keywords`, `references`,
`countries`, `affiliations`, or `journal`. Naming any of them reads the
file as a generic CSV, so you do not pass `format` yourself:

```{r read-generic, eval = FALSE}
data <- read_biblio(
  "custom.csv",
  id       = "paper_id",
  authors  = "Author Names",
  keywords = "Tags",
  sep      = ","
)
```

Each mapped column is split on `sep` into the standard list-column, so
afterwards every builder works with its defaults.
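
The splitting step can be pictured with base R alone. This is a sketch of the
idea, not the reader's actual code; `toy_raw` and its columns are made up for
illustration, and trimming whitespace after the split is an assumption.

```{r sketch-split-column}
# A toy table with one delimited string per paper (illustrative data only).
toy_raw <- data.frame(
  paper_id       = c("p1", "p2"),
  `Author Names` = c("Smith J, Doe A, Lee K", "Smith J"),
  check.names    = FALSE
)

# Split on the separator and trim surrounding whitespace from each piece.
toy_raw$authors <- lapply(
  strsplit(toy_raw[["Author Names"]], ",", fixed = TRUE),
  trimws
)

toy_raw$authors
lengths(toy_raw$authors)
```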

### A plain data frame, directly

As the quick start showed, you can skip the reader entirely and let the
builder split a column for you. The same column arguments are available on
every builder:

```{r read-direct, eval = FALSE}
author_network(my_df, authors = "Author Names", sep = ",")
keyword_network(my_df, keywords = "Tags",       sep = ",")
```

The work identifier is the `id` column. You need not supply one: when no
`id` column is present each row is treated as one document; pass
`id = "paper_id"` to use a differently-named column. Surrounding quotes are
stripped by default (`strip_quotes = TRUE`), and in a coupling network the
references column takes its own `references_sep`. The companion
`vignette("reading-data")` covers every reader and these options in full.

### The standard schema

Readers return a common set of columns:

```{r schema}
data(scopus_quantum_cloud)
sc <- scopus_quantum_cloud
names(sc)[1:12]
```

The columns that matter for network construction are `id`, the list-columns
`authors` / `references` / `keywords`, and `year` (used by
`temporal_network()`). Source-specific extras such as `countries`,
`affiliations`, and `keywords_plus` are kept when available.

## Datasets used here

```{r data}
data(biblio_data)
data(learning_analytics)

small <- biblio_data            # tiny, synthetic
oa    <- learning_analytics     # 1,508 OpenAlex records on learning analytics

c(small = nrow(small), scopus = nrow(sc), openalex = nrow(oa))
```

## Author collaboration

Two authors are linked when they appear on the same paper:

```{r author-basic}
authors <- author_network(oa, type = "collaboration")
head(authors, 5)
summary(authors)
```

Use `min_occur` to drop rare authors before projection:

```{r author-minoccur}
nrow(author_network(oa, type = "collaboration"))
nrow(author_network(oa, type = "collaboration", min_occur = 2))
```
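
Conceptually, the filter removes infrequent entities from every byline before
any pairs are formed. The sketch below assumes that frequency means the number
of papers an author appears on, which may differ from how `min_occur` is
defined internally; `toy_papers` is illustrative data.

```{r sketch-min-occur}
toy_papers <- list(
  c("Smith J", "Doe A", "Lee K"),
  c("Smith J", "Lee K"),
  c("Doe A", "Nguyen T")
)

# Number of papers each author appears on (assumed definition of frequency).
paper_freq <- table(unlist(lapply(toy_papers, unique)))

# Keep authors seen on at least two papers, then filter every byline.
frequent <- names(paper_freq)[paper_freq >= 2]
toy_filtered <- lapply(toy_papers, intersect, frequent)
toy_filtered
```

The third paper is left with a single author, so it can no longer contribute
an edge.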

### Counting methods

`counting` controls how much each paper contributes to an edge:

```{r counting}
head(author_network(small, type = "collaboration", counting = "full"), 3)
head(author_network(small, type = "collaboration", counting = "fractional"), 3)
head(author_network(small, type = "collaboration", counting = "harmonic"), 3)
head(author_network(small, type = "collaboration", counting = "first_last"), 3)
```

The available methods differ in how they weight the rows or positions before
projection:

| Method | What it does | Trade-off | When to use |
|---|---|---|---|
| `"full"` | Leaves the binary incidence matrix unchanged; for positional author weights, every listed entity receives weight 1. | Large teams or long lists create many full-strength pairs. | Use for raw event counts where every observed co-occurrence should count equally. |
| `"fractional"` | For symmetric networks, each row contributes `1 / (n - 1)` to pairs when `n > 1`; for coupling it uses `1 / n`; positional use gives each entity `1 / n`. | Reduces large-list dominance but treats all positions equally. | Use when each paper or reference list should have limited influence and position is not meaningful. |
| `"paper"` | For symmetric networks, each paper's pair budget is scaled by `2 / (n * (n - 1))`; for coupling it uses `1 / n`. | Normalizes at the paper level, so very large and very small papers can contribute comparable total pair mass. | Use when publications, rather than individual author/entity pairs, should be the main unit of contribution. |
| `"strength"` | Multiplies entity columns by the square root of inverse document frequency, `sqrt(log(n_works / entity_frequency))`; row-size scaling for coupling is deferred to projection. | Downweights ubiquitous entities and emphasizes rarer shared entities; values are less like direct counts. | Use for coupling or profile similarity where common references, keywords, or entities should carry less evidence. |
| `"harmonic"` | Uses positional weights proportional to `1 / position`, normalized to sum to one. | Strongly favors early positions while still giving every later position some credit. | Use when author order matters and early authorship should dominate without excluding later authors. |
| `"arithmetic"` | Uses a linear decline from first to last, proportional to `n - position + 1`, normalized. | Gives a gentler first-author advantage than geometric methods. | Use when byline order matters but credit should decrease steadily rather than sharply. |
| `"geometric"` | Uses weights proportional to `0.5^(position - 1)`, normalized. | Concentrates credit heavily at the front of the byline. | Use when the first few positions are expected to carry most of the contribution. |
| `"adaptive_geometric"` | Uses a geometric sequence normalized so the first-to-last weight ratio equals `n` (`2/3`, `1/3` for two authors). | Adapts the steepness to team size, making long bylines more front-loaded. | Use when first-author emphasis should increase with the number of authors. |
| `"golden"` | Uses golden-ratio decay, proportional to `phi^(n - position)`, normalized. | More front-loaded than arithmetic but less abrupt than fixed halving. | Use as a moderate positional decay when author order matters but geometric halving is too strong. |
| `"first"` | Gives weight 1 to the first position and 0 to all others. | Ignores all non-first contributors. | Use for strict first-author analyses. |
| `"last"` | Gives weight 1 to the last position and 0 to all others. | Ignores all non-last contributors. | Use where last authorship represents the analytical role of interest, such as senior or PI credit. |
| `"first_last"` | With two authors, assigns `0.5` and `0.5`; otherwise gives first and last authors an elevated weight and middle authors a baseline weight, all normalized. | Highlights both endpoints while still retaining middle-author credit. | Use in fields where first and last positions have distinct credit or leadership meanings. |
| `"position_weighted"` | Uses the supplied `position_weights` vector, extending the last value to longer bylines, then normalizes. | Puts the burden of choosing defensible weights on the analyst. | Use when you have field-specific or study-specific positional weights. |
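
The positional formulas in the table can be written out directly. The helper
below is a hypothetical illustration (`position_weights_sketch()` is not a
`bibnets` function); the `adaptive_geometric` branch derives its common ratio
from the stated first-to-last ratio of `n`, which is an inference from the
table rather than a confirmed implementation detail.

```{r sketch-position-weights}
# Hypothetical helper: normalized positional weights for a byline of length n,
# following the formulas in the table above.
position_weights_sketch <- function(n, method) {
  pos <- seq_len(n)
  raw <- switch(
    method,
    harmonic           = 1 / pos,
    arithmetic         = n - pos + 1,
    geometric          = 0.5^(pos - 1),
    # Common ratio chosen so that first weight / last weight equals n.
    adaptive_geometric = if (n > 1) n^(-(pos - 1) / (n - 1)) else 1,
    golden             = ((1 + sqrt(5)) / 2)^(n - pos),
    stop("unknown method: ", method)
  )
  raw / sum(raw)
}

round(position_weights_sketch(3, "harmonic"), 3)
round(position_weights_sketch(3, "arithmetic"), 3)
position_weights_sketch(2, "adaptive_geometric")
```

Every method returns weights that sum to one, so a paper's total credit is the
same regardless of which scheme distributes it.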

### Attention weights

Standard co-authorship networks treat every byline position as equivalent: a
first author who conceived and drove the work is weighted identically to a
fifteenth contributor who provided a single instrument reading. On
hyper-authored papers this produces dense, low-information ties that drown out
the focused two- or three-author collaborations that often indicate the
closest intellectual kinship. Attention weighting in `bibnets` is designed to
correct this.

The name is an analogy to the attention mechanism in large language models.
Just as a transformer assigns a normalized probability distribution across the
tokens in a sequence, `bibnets` assigns each author on a paper a positional
weight, and the weights sum to one across the full byline. A fifty-author
paper therefore contributes exactly one unit of connection budget, the same as
a two-author paper, and the distribution of that budget reflects authorship
conventions: `"lead"` concentrates weight on the first author, `"last"` on the
senior or PI position, `"proximity"` rewards the central authors, and
`"circular"` rewards both ends jointly.

The weights are a fixed positional prior, not learned content-based attention.
To activate them, pass `attention = "lead"` (or any of the three alternatives)
to any of the author, keyword, country, or institution network functions.

| `attention` | Weight vector | Scholarly assumption | When it fits |
|---|---|---|---|
| `"lead"` | Quadratic drop from the first position: the first position has raw weight `n^2`, inner positions decline as the byline advances, and the last position has raw weight `1`, then all weights are normalized. | The lead author is the main intellectual driver. | Use in first-author-oriented fields or questions about lead contribution. |
| `"last"` | Quadratic rise to the last position: the first position has raw weight `1`, inner positions rise across the byline, and the last position has raw weight `n^2`, then all weights are normalized. | The last author represents senior, supervisory, or PI contribution. | Use in disciplines where last authorship marks lab leadership or supervision. |
| `"proximity"` | Pyramid profile using `min(position, n + 1 - position)`: first and last positions have raw weight `1`, inner positions increase toward the middle, and central positions are highest. | Central byline positions deserve the most attention. | Use when the question treats middle-position contributors as the focal group. |
| `"circular"` | Edge profile using `max(position, n + 1 - position)`: first and last positions have the largest raw weight, while inner positions decline toward the center. | Both ends of the byline are prominent. | Use where lead and senior positions jointly matter more than middle positions. |

`attention` applies a smooth positional profile instead of a named counting
scheme (available for author, keyword, country, and institution networks):

```{r attention}
head(author_network(small, attention = "lead"), 3)
```
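
To make the four profiles concrete, the sketch below computes them for a given
byline length. `attention_sketch()` is a hypothetical helper, not a `bibnets`
function. The `proximity` and `circular` formulas are taken from the table;
for `lead` and `last` the table fixes only the end values (`n^2` and `1`), so
the squared-rank shape used for the inner positions is an assumption.

```{r sketch-attention}
# Hypothetical helper: normalized attention profile for a byline of length n.
attention_sketch <- function(n, profile) {
  pos <- seq_len(n)
  raw <- switch(
    profile,
    lead      = (n - pos + 1)^2,          # assumed inner shape: n^2 down to 1
    last      = pos^2,                    # assumed inner shape: 1 up to n^2
    proximity = pmin(pos, n + 1 - pos),   # pyramid, highest in the middle
    circular  = pmax(pos, n + 1 - pos),   # highest at both ends
    stop("unknown profile: ", profile)
  )
  raw / sum(raw)
}

round(attention_sketch(5, "lead"), 3)
round(attention_sketch(5, "proximity"), 3)
round(attention_sketch(5, "circular"), 3)
```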

## Reference co-citation

Two references are linked when a paper cites both:

```{r cocitation}
refs <- reference_network(sc, min_occur = 2)
head(refs, 5)
```

A similarity measure offsets the advantage of very frequently cited works:

```{r cocitation-cosine}
head(reference_network(sc, min_occur = 2, similarity = "cosine"), 3)
```

## Document coupling and citation

Coupling links two documents that share cited references:

```{r coupling}
head(document_network(sc, type = "coupling", similarity = "cosine"), 5)
```

Direct citation is directed — `from` cites `to` — and only within the
corpus (the cited work must also be a row in the data):

```{r citation}
head(document_network(sc, type = "citation"), 5)
```
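
The within-corpus restriction amounts to keeping only those cited items whose
identifier is also a row of the data. A base-R sketch, with made-up
identifiers and assuming references are already expressed as document IDs:

```{r sketch-direct-citation}
# Three toy documents; "x9" is a cited work outside the corpus.
toy_docs <- data.frame(id = c("d1", "d2", "d3"))
toy_docs$references <- list(c("d2", "x9"), "d3", character(0))

# One row per (citing, cited) pair.
citing <- rep(toy_docs$id, lengths(toy_docs$references))
cited  <- unlist(toy_docs$references)

# Keep only citations whose target is itself a document in the data.
in_corpus <- cited %in% toy_docs$id
toy_citations <- data.frame(from = citing[in_corpus], to = cited[in_corpus])
toy_citations
```

The citation to `x9` is dropped because no row carries that identifier.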

## Keyword co-occurrence

```{r keywords}
kw <- keyword_network(sc, min_occur = 2)
head(kw, 5)
```

Labels are trimmed and upper-cased during construction, so
`machine learning`, `Machine Learning`, and ` MACHINE LEARNING ` are one
node. Association strength is a common choice for co-occurrence maps:

```{r keywords-assoc}
head(keyword_network(sc, min_occur = 2, similarity = "association"), 3)
```
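
The label cleaning described above is equivalent in effect to trimming and
upper-casing each string. A base-R illustration (the package may implement it
differently):

```{r sketch-label-cleaning}
toy_labels <- c("machine learning", "Machine Learning", " MACHINE LEARNING ")

# Trim surrounding whitespace, then convert to upper case.
clean_labels <- toupper(trimws(toy_labels))
unique(clean_labels)
```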

## Countries, institutions, and sources

```{r geo}
head(country_network(oa, counting = "fractional"), 5)
head(institution_network(oa, counting = "fractional", min_occur = 2), 5)
head(source_network(sc, type = "coupling", min_occur = 2), 5)
```

For coupling networks, `min_occur` is applied to the aggregated entity
before the network is built.

## Generic co-networks

`conetwork()` covers projections without a dedicated wrapper. One field
links entities that co-occur; a second field (`by`) links them through a
shared value:

```{r conetwork}
head(conetwork(sc, "keywords", min_occur = 2), 3)
head(conetwork(sc, "authors", by = "keywords", min_occur = 2), 3)
```

The second result links authors through shared keywords — a thematic
similarity network, not a co-authorship one.
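
One way to picture the `by` projection is as an author-by-keyword profile
matrix multiplied by its own transpose. The sketch below uses made-up data and
a binary profile; how `conetwork()` weights repeated keywords is not shown
here and may differ.

```{r sketch-conetwork-by}
# Toy documents: A and B co-author; C and D write alone (illustrative only).
toy_records <- list(
  list(authors = c("A", "B"), keywords = c("K1", "K2")),
  list(authors = "C",         keywords = c("K2", "K3")),
  list(authors = "D",         keywords = "K4")
)

# Every (author, keyword) combination within each document.
author_keyword <- do.call(rbind, lapply(toy_records, function(d) {
  expand.grid(author = d$authors, keyword = d$keywords,
              stringsAsFactors = FALSE)
}))

# Binary author-by-keyword profile, then authors linked by shared keywords.
profile <- (unclass(table(author_keyword$author, author_keyword$keyword)) > 0) * 1
shared_keywords <- tcrossprod(profile)
shared_keywords
```

Authors A and C never share a paper, yet they are linked through `K2`.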

## Normalization

The same raw counts support different similarity scores; only `weight`
changes, `count` does not:

```{r normalize}
none <- keyword_network(sc, min_occur = 2, similarity = "none")
cos  <- keyword_network(sc, min_occur = 2, similarity = "cosine")
head(none[, c("from", "to", "weight", "count")], 3)
head(cos[,  c("from", "to", "weight", "count")], 3)
```

`normalize()` uses the diagonal of the projected matrix as each node's total
occurrence count:

| Similarity | Denominator | Meaning | When to use |
|---|---|---|---|
| `"none"` | No denominator; the projected matrix is returned as raw weighted co-occurrence, with the diagonal removed by the network builder unless self-loops are requested. | `weight` stays on the same scale as the counted projection. | Use when absolute co-occurrence or counted edge strength is the quantity of interest. |
| `"cosine"` | Square root of the product of the two node totals. | Symmetric size correction; pairs are high when their overlap is large relative to both nodes' frequencies. | Use as a general-purpose correction for very frequent nodes while preserving a familiar similarity scale. |
| `"association"` | Product of the two node totals. | Symmetric association-strength normalization; strongly penalizes pairs involving very frequent nodes. | Use for co-occurrence maps where you want rare, unexpectedly tight pairings to stand out. |
| `"jaccard"` | Sum of the two node totals minus their observed edge value. | Symmetric overlap over a union-like total. | Use when the edge should represent shared occurrence as a share of the two nodes' combined footprint. |
| `"inclusion"` | The smaller of the two node totals. | Symmetric containment-oriented score; it reaches high values when the smaller node mostly appears with the larger one. | Use when subset or specialization relationships are more important than balanced overlap. |
| `"equivalence"` | Product of the two node totals, with the edge value squared before division. | The square of the cosine score, so weak or occasional overlap is penalized more strongly. | Use when following equivalence-index conventions or when only consistently paired nodes should remain strong. |
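
The denominators in the table can be checked by hand on a two-node example.
`similarity_sketch()` is a hypothetical helper written for this vignette, not
the package's `normalize()`; it works on a small dense matrix for readability,
whereas the package operates on sparse matrices.

```{r sketch-similarity}
# Hypothetical helper: apply one similarity measure to a projected matrix
# whose diagonal holds each node's total occurrence count.
similarity_sketch <- function(co, method) {
  d <- diag(co)
  switch(
    method,
    none        = co,
    cosine      = co / sqrt(outer(d, d)),
    association = co / outer(d, d),
    jaccard     = co / (outer(d, d, "+") - co),
    inclusion   = co / outer(d, d, pmin),
    equivalence = co^2 / outer(d, d),
    stop("unknown method: ", method)
  )
}

# Node A occurs 4 times, node B twice, and they co-occur twice.
toy_projected <- matrix(c(4, 2, 2, 2), nrow = 2,
                        dimnames = list(c("A", "B"), c("A", "B")))

measures <- c("cosine", "association", "jaccard", "inclusion", "equivalence")
sapply(measures, function(m) similarity_sketch(toy_projected, m)["A", "B"])
```

Because B never appears without A, inclusion reaches its maximum while the
other measures stay below one.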

## Reducing large networks

```{r reduce}
edges <- author_network(oa, type = "collaboration")
c(all        = nrow(edges),
  threshold  = nrow(prune(edges, threshold = 2)),
  top_n      = nrow(prune(edges, top_n = 5)),
  top_nodes  = nrow(filter_top(edges, n = 50)))
```

- `prune(threshold = x)` — absolute edge-weight cutoff.
- `prune(top_n = k)` — keep each node's strongest edges.
- `filter_top(n = k)` — keep edges among the most-connected nodes.

`backbone()` applies the disparity filter, which keeps edges that are strong
relative to a node's local strength distribution — not a global cutoff:

```{r backbone}
bb <- backbone(edges, alpha = 0.05)
nrow(bb)
```
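
For intuition, the disparity filter can be sketched in a few lines of base R.
For an edge of weight `w` at a node with strength `s` and degree `k`, the null
model gives the probability `(1 - w / s)^(k - 1)` of seeing a share at least
that large. `disparity_sketch()` is a hypothetical helper; keeping an edge
when either endpoint finds it significant is one common convention and is
assumed here, so results need not match `backbone()` exactly.

```{r sketch-disparity}
# Hypothetical helper: disparity filter on an undirected weighted edge list.
disparity_sketch <- function(edge_list, alpha = 0.05) {
  from <- as.character(edge_list$from)
  to   <- as.character(edge_list$to)
  w    <- edge_list$weight

  ends     <- c(from, to)
  strength <- tapply(c(w, w), ends, sum)      # total weight at each node
  degree   <- tapply(c(w, w), ends, length)   # number of incident edges

  # Null-model probability of each edge's share, seen from the given endpoint.
  p_at <- function(node) {
    as.numeric((1 - w / strength[node])^(degree[node] - 1))
  }

  # Keep an edge when it is significant from either endpoint (assumed rule).
  edge_list[pmin(p_at(from), p_at(to)) < alpha, , drop = FALSE]
}

# Toy network: A-B is far stronger than anything else around A.
toy_net <- data.frame(
  from   = c("A", "A", "A", "A", "B"),
  to     = c("B", "C", "D", "E", "C"),
  weight = c(10, 1, 1, 1, 1)
)
disparity_sketch(toy_net, alpha = 0.05)
```

A global cutoff of, say, `weight >= 2` would give the same answer on this toy
network, but the filter judges each edge against its own node's distribution,
which matters once node strengths vary widely.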

## Temporal networks

`temporal_network()` runs any builder over time windows (fixed, sliding, or
cumulative):

```{r temporal}
tn <- temporal_network(oa, author_network, "collaboration", window = 3)
names(tn)
```

Each window's edge list carries a `window` column. Windows with fewer than
two records, or no surviving edges, are dropped; a builder error inside a
window becomes a warning labelled with that window.
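
The three window types differ only in how start and end years are laid out.
The helper below is a hypothetical illustration of one plausible layout;
`windows_sketch()`, its arguments, and its boundary rules are assumptions and
are not taken from `temporal_network()`.

```{r sketch-windows}
# Hypothetical helper: start and end years for each window type.
windows_sketch <- function(years, width,
                           mode = c("fixed", "sliding", "cumulative")) {
  mode  <- match.arg(mode)
  first <- min(years)
  last  <- max(years)
  if (mode == "sliding") {
    # One window per start year, each `width` years long.
    start <- seq(first, max(first, last - width + 1))
    end   <- pmin(start + width - 1, last)
  } else {
    # Non-overlapping blocks; cumulative windows all start at the first year.
    blocks <- seq(first, last, by = width)
    end    <- pmin(blocks + width - 1, last)
    start  <- if (mode == "fixed") blocks else rep(first, length(blocks))
  }
  data.frame(start = start, end = end)
}

windows_sketch(2015:2020, width = 3, mode = "fixed")
windows_sketch(2015:2020, width = 3, mode = "sliding")
windows_sketch(2015:2020, width = 3, mode = "cumulative")
```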

## Local citations and historiographs

`local_citations()` counts how often each document is cited by others in the
same corpus; `historiograph()` builds the directed citation graph among the
top-cited documents:

```{r historiograph}
head(local_citations(sc), 5)

h <- historiograph(sc, n = 10)
h$nodes
head(h$edges, 5)
```

Both require reference strings or IDs to match document IDs in the data; if
the cited works are external, local counts stay low.

## Author-name normalization

`parse_names()` reorders and splits author names (it recognizes
`Last, First`, `SURNAME Initials`, and `First Last`). Because node identity
is fixed when a network is built, normalize *before* building so that two
spellings of one author merge:

```{r parse-names}
parse_names(c("Saqr, Mohammed", "WANG Y", "Mohammed Saqr"))
```

See `vignette("parsing-author-names")` for the full treatment.

## Exporting

The edge list is already usable; converters cover the common targets:

```{r export}
edges <- keyword_network(sc, min_occur = 2)

m <- to_matrix(edges)            # sparse adjacency matrix
m[1:4, 1:4]

gephi <- to_gephi(edges)         # Gephi node/edge tables
head(gephi$edges, 3)

cat(substr(to_graphml(edges), 1, 200))   # GraphML, no XML dependency
```

`to_igraph()`, `to_tbl_graph()`, and `to_cograph()` are available when their
(suggested) packages are installed.

## Reading a `bibnets_network`

The object records how it was built, as attributes:

```{r attrs}
edges <- author_network(oa, type = "collaboration", counting = "harmonic")
c(type     = attr(edges, "network_type"),
  counting = attr(edges, "counting"),
  sim      = attr(edges, "similarity"))

summary(edges)
```

`print()` reports the network type, node and edge counts, and the counting
and similarity methods — so a saved edge list always says how it was made.

## References

The methodology implemented in `bibnets` is described in:

López-Pernas, S., Saqr, M., & Apiola, M. (2023). Scientometrics: A Concise
Introduction and a Detailed Methodology for Mapping the Scientific Field of
Computing Education Research. In M. Apiola, S. López-Pernas, & M. Saqr (Eds.),
*Past, Present and Future of Computing Education Research: A Global
Perspective* (pp. 79–99). Springer Nature Switzerland AG.
<https://doi.org/10.1007/978-3-031-25336-2_5>

Saqr, M., López-Pernas, S., Conde, M. Á., & Hernández-García, Á. (2024). Social
Network Analysis: A primer, a guide and a tutorial in R. In M. Saqr & S.
López-Pernas (Eds.), *Learning Analytics Methods and Tutorials: A Practical
Guide Using R* (pp. 491–518). Springer, Cham.
<https://doi.org/10.1007/978-3-031-54464-4_15>
