---
title: "Getting Started: OFH Synthetic Cohort Generation"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Getting Started: OFH Synthetic Cohort Generation}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
```

## Overview

This vignette shows how to generate synthetic cohort datasets for method development before using real health data.

The package-style API supports:

- configurable cohort size
- reproducible generation via seed
- optional ICD-10 / OPCS4 / BNF code restrictions
- configurable dataset coverage, record density, and field-level generation probabilities
- control over whether to save CSVs and/or return R objects

## 1. Load the package

```{r load-api, eval = FALSE}
library(ofhsyn)
```

## 2. Generate a basic cohort

```{r basic-run, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123
)

names(out)
```

This returns a named list of data frames and writes CSVs to an output folder in your current working directory.

To return objects only (without writing CSV files):

```{r objects-only, eval = FALSE}
out_objects_only <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  save_csv = FALSE,
  return_objects = TRUE
)
```

If you run this interactively, the generated data frames are also available in your R environment (for example `questionnaire_data`, `clinic_measurements_data`, `nhse_inpat_data`).

## 3. Restrict to specific code lists

```{r code-lists, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10 = c(
    I210 = "STEMI of anterolateral wall",
    I500 = "Congestive heart failure"
  ),
  opcs4 = c(
    K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_codes = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)
```

You can also provide code files:

- ICD10/OPCS4 files must include both `code` and `description`
- For ICD10/OPCS4: use CSV (`code,description`) or tab-separated TXT (`code` and `description` separated by a tab)
- For BNF: use CSV with `BNFCode`, `BNFName`, `Formulation` (optional `Strength`)

```{r code-files, eval = FALSE}
out <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  icd10_file = "icd10_codes.txt",
  opcs4_file = "opcs4_codes.txt",
  bnf_codes_file = "bnf_medications.csv"
)
```

## 4. Configure dataset generation probabilities

```{r probabilities, eval = FALSE}
out_custom <- generate_ofh_cohort(
  n = 1000,
  seed = 123,
  proportions = list(
    nhse_outpat = 0.25,
    nhse_inpat = 0.20,
    nhse_ed = 0.30,
    nhse_primcare_meds = 0.75
  ),
  record_multipliers = list(
    nhse_outpat = 1.2,
    nhse_inpat = 1.1,
    nhse_ed = 1.3
  ),
  code_config = list(
    nhse_outpat_data = list(diag_4_02_missing_prob = 0.70),
    nhse_inpat_data = list(single_diag_prob = 0.85)
  )
)
```

## 5. Use the OOP interface directly

```{r oop-run, eval = FALSE}
syn <- OFHCohortSynthesizer$new(project_root = ".", seed = 123)

syn$set_code_pools(
  icd10 = c(I210 = "STEMI of anterolateral wall"),
  opcs4 = c(
    K401 = "Percutaneous transluminal balloon angioplasty of coronary artery"
  ),
  bnf_meds = data.frame(
    BNFCode = c("0212000B0", "0601023A0"),
    BNFName = c("Atorvastatin 20 mg tablets", "Metformin 500 mg tablets"),
    Formulation = c("tablets", "tablets"),
    Strength = c("20 mg", "500 mg"),
    stringsAsFactors = FALSE
  )
)

out <- syn$run_all(n = 800)
```

## 6. Practical tips for researchers

- Start with small `n` (for example, 200 to 1000) while developing.
- Fix `seed` for reproducibility during method testing.
- Check row counts and `pid` linkage assumptions in your analysis scripts.
- Expand code lists as your phenotype definitions evolve.

## 7. Notes

- Some datasets are intentional subsets of the full cohort.
- Questionnaire output includes a small v1 proportion by design.
- Primary care meds include prescribed-but-not-dispensed rows.
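
The row-count and `pid` linkage check suggested in the practical tips could look something like the sketch below. It uses only base R and small toy data frames so that it runs without the package. The helper `check_pid_linkage()` is hypothetical (it is not part of `ofhsyn`), and two things are assumptions rather than documented behaviour: that every generated data frame carries a `pid` column, and that `questionnaire_data` is a sensible reference for the full cohort.

```{r linkage-check, eval = FALSE}
# Toy stand-ins for the generator output. Only a shared `pid` column is
# assumed; the real column layout is not described in this vignette.
out_toy <- list(
  questionnaire_data = data.frame(pid = 1:10),
  nhse_inpat_data = data.frame(pid = c(2, 2, 5, 9))
)

# Hypothetical helper (not part of ofhsyn): for each dataset, report the
# number of rows, the number of distinct pids, and how many of those pids
# are absent from the reference dataset.
check_pid_linkage <- function(datasets, reference = "questionnaire_data") {
  ref_pids <- unique(datasets[[reference]]$pid)
  rows <- lapply(names(datasets), function(nm) {
    pids <- unique(datasets[[nm]]$pid)
    data.frame(
      dataset = nm,
      n_rows = nrow(datasets[[nm]]),
      n_pids = length(pids),
      n_unlinked = sum(!pids %in% ref_pids),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

linkage <- check_pid_linkage(out_toy)
print(linkage)

# Stop early if any dataset contains pids missing from the reference.
stopifnot(all(linkage$n_unlinked == 0))
```

With real output, passing the list returned by `generate_ofh_cohort()` in place of `out_toy` should work as long as the assumptions above hold. Because some datasets are intentional subsets of the cohort, `n_pids` below the cohort size is expected, whereas a non-zero `n_unlinked` would be worth investigating.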