Explorative Analysis#

Uses the prostate time series dataset provided by the SurvSet package.

%pip install 'njab[all]' openpyxl

import logging

Hide code cell output

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Requirement already satisfied: njab[all] in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (0.1.2.dev2+g00a48f214)
Requirement already satisfied: omegaconf in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (2.3.0)
Requirement already satisfied: numpy in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (2.2.6)
Requirement already satisfied: pandas in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (2.3.2)
Requirement already satisfied: scikit-learn<1.7,>=1.4 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (1.6.1)
Requirement already satisfied: statsmodels in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (0.14.5)
Requirement already satisfied: matplotlib in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (3.10.6)
Requirement already satisfied: mrmr_selection in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (0.2.8)
Requirement already satisfied: pingouin in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (0.5.5)
Requirement already satisfied: seaborn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (0.13.2)
Requirement already satisfied: lifelines in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab[all]) (0.30.0)
Requirement already satisfied: scipy>=1.6.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from scikit-learn<1.7,>=1.4->njab[all]) (1.15.3)
Requirement already satisfied: joblib>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from scikit-learn<1.7,>=1.4->njab[all]) (1.5.2)
Requirement already satisfied: threadpoolctl>=3.1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from scikit-learn<1.7,>=1.4->njab[all]) (3.6.0)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Requirement already satisfied: autograd>=1.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab[all]) (1.8.0)
Requirement already satisfied: autograd-gamma>=0.3 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab[all]) (0.5.0)
Requirement already satisfied: formulaic>=0.2.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab[all]) (1.2.0)
Requirement already satisfied: interface-meta>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab[all]) (1.3.0)
Requirement already satisfied: narwhals>=1.17 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab[all]) (2.5.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab[all]) (4.15.0)
Requirement already satisfied: wrapt>=1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab[all]) (1.17.3)
Requirement already satisfied: contourpy>=1.0.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (1.3.2)
Requirement already satisfied: cycler>=0.10 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (4.60.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (1.4.9)
Requirement already satisfied: packaging>=20.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (25.0)
Requirement already satisfied: pillow>=8 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (11.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (3.2.4)
Requirement already satisfied: python-dateutil>=2.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab[all]) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab[all]) (2025.2)
Requirement already satisfied: tzdata>=2022.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab[all]) (2025.2)
Requirement already satisfied: six>=1.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from python-dateutil>=2.7->matplotlib->njab[all]) (1.17.0)
Requirement already satisfied: category-encoders in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr_selection->njab[all]) (2.8.1)
Requirement already satisfied: jinja2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr_selection->njab[all]) (3.1.6)
Requirement already satisfied: tqdm in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr_selection->njab[all]) (4.67.1)
Requirement already satisfied: polars>=0.12.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr_selection->njab[all]) (1.33.1)
Requirement already satisfied: patsy>=0.5.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from category-encoders->mrmr_selection->njab[all]) (1.0.1)
Requirement already satisfied: MarkupSafe>=2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from jinja2->mrmr_selection->njab[all]) (3.0.2)
Requirement already satisfied: antlr4-python3-runtime==4.9.* in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab[all]) (4.9.3)
Requirement already satisfied: PyYAML>=5.1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab[all]) (6.0.2)
Requirement already satisfied: pandas-flavor in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab[all]) (0.7.0)
Requirement already satisfied: tabulate in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab[all]) (0.9.0)
Requirement already satisfied: xarray in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas-flavor->pingouin->njab[all]) (2025.6.1)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
?25l
   ━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━━ 1/2 [openpyxl]
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 [openpyxl]
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
Note: you may need to restart the kernel to use updated packages.

Hide code cell content

from functools import partial
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn
import sklearn
from IPython.display import display
from lifelines.plotting import add_at_risk_counts

import njab
import njab.plotting
from njab.plotting.km import compare_km_curves, log_rank_test

njab.pandas.set_pandas_options()
pd.options.display.min_rows = 10
njab.plotting.set_font_sizes("x-small")
seaborn.set_style("whitegrid")
plt.rcParams["figure.figsize"] = [4.0, 4.0]

Set parameters#

TARGET = "event"
TIME_KM = "time"
FOLDER = "prostate"
CLINIC = "https://raw.githubusercontent.com/RasmussenLab/njab/main/docs/tutorial/data/prostate/prostate.csv"
val_ids: str = ""  # List of comma separated values or filepath
#
# list or string of csv, eg. "var1,var2"
clinic_cont = ["age"]
# list or string of csv, eg. "var1,var2"
clinic_binary = ["male", "AD"]
# List of comma separated values or filepath
da_covar = "num_age,num_wt"

Hide code cell source

print(f"Time To Event: {TIME_KM}")
print(f"event (target) variable: {TARGET}")
Time To Event: time
event (target) variable: event

Hide code cell content

FOLDER = Path(FOLDER)
FOLDER.mkdir(exist_ok=True, parents=True)
FOLDER
PosixPath('prostate')

Inspect the data:

Hide code cell source

clinic = pd.read_csv(CLINIC, index_col=0).dropna(how="any")
clinic.columns.name = "feat_name"  # ! check needs to be implemented
cols_clinic = njab.pandas.get_colums_accessor(clinic)
clinic = clinic.astype(
    {
        var: "int"
        for var in [
            "event",
            "time",
            "num_age",
            "num_wt",
            "num_sbp",
            "num_dbp",
            "num_sz",
            "num_sg",
            "num_sdate",
            "fac_stage",
        ]
    }
)
clinic
feat_name event time fac_stage fac_rx fac_pf fac_hx fac_ekg fac_bm num_age num_wt num_sbp num_dbp num_hg num_sz num_sg num_ap num_sdate
pid
0 0 72 3 0.2 mg estrogen normal activity 0 heart strain 0 75 76 15 9 13.799 2 8 0.300 2,778
2 1 40 3 5.0 mg estrogen normal activity 1 heart strain 0 69 102 14 8 13.398 3 9 0.300 2,933
3 1 20 3 0.2 mg estrogen in bed < 50% daytime 1 benign 0 75 94 14 7 17.598 4 8 0.900 2,999
4 0 65 3 placebo normal activity 0 normal 0 67 99 17 10 13.398 34 8 0.500 3,002
5 1 24 3 0.2 mg estrogen normal activity 0 normal 0 71 98 19 10 15.100 10 11 0.600 3,086
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
496 1 5 4 5.0 mg estrogen normal activity 1 normal 0 73 100 19 10 16.797 2 11 2.900 3,433
497 1 0 3 1.0 mg estrogen normal activity 0 normal 0 78 108 11 6 15.799 9 13 0.600 3,297
498 1 41 3 0.2 mg estrogen normal activity 1 heart strain 0 78 127 16 10 15.799 5 9 0.500 3,349
499 0 53 3 1.0 mg estrogen normal activity 0 heart block or conduction def 0 77 93 17 10 16.000 11 9 0.800 3,357
500 1 19 4 5.0 mg estrogen normal activity 1 heart strain 0 82 96 12 6 12.398 32 13 8.398 3,311

475 rows × 17 columns

Descriptive statistics of non-numeric variables:

clinic.describe(include="object")
feat_name fac_rx fac_pf fac_ekg
count 475 475 475
unique 4 4 7
top 5.0 mg estrogen normal activity normal
freq 121 428 161

Set the binary variables and convert them to categories:

vars_binary = ["fac_hx", "fac_bm"]
clinic[vars_binary] = clinic[vars_binary].astype("category")

Covariates to adjust for:

Hide code cell source

check_isin_clinic = partial(njab.pandas.col_isin_df, df=clinic)
covar = check_isin_clinic(da_covar)
covar
['num_age', 'num_wt']

Set continous variables

vars_cont = [
    "num_age",
    "num_wt",
    "num_sbp",
    "num_dbp",
    "num_hg",
    "num_sz",
    "num_sg",
    "num_ap",
    "num_sdate",
    "fac_stage",
]

Collect outputs#

in an excel file:

Hide code cell source

fname = FOLDER / "1_differential_analysis.xlsx"
files_out = {fname.name: fname}
writer = pd.ExcelWriter(fname)
print(f"Output will be written to: {fname}")
Output will be written to: prostate/1_differential_analysis.xlsx

Differences between groups defined by target#

Perform an uncontrolled t-test for continous and binomal test for binary variables. Besides the test results, summary statistics of both groups are provided.

First, set the binary target as a boolean mask:

happend = clinic[TARGET].astype(bool)

Univariate t-test for continous variables#

Hide code cell source

ana_differential = njab.stats.groups_comparision.diff_analysis(
    clinic[vars_cont],
    happend,
    event_names=(TARGET, "no event"),
)
ana_differential = ana_differential.sort_values(("ttest", "p-val"))
ana_differential.to_excel(writer, sheet_name="clinic continous", float_format="%.4f")
ana_differential
event no event ttest
count mean std count mean std alternative p-val cohen-d
variable
num_sz 338.000 15.660 13.071 137.000 10.898 9.066 two-sided 0.000 0.395
num_sg 338.000 10.506 2.034 137.000 9.796 1.899 two-sided 0.000 0.356
num_hg 338.000 13.255 2.027 137.000 13.825 1.638 two-sided 0.002 0.297
num_age 338.000 72.080 6.956 137.000 70.263 6.682 two-sided 0.008 0.264
fac_stage 338.000 3.462 0.499 137.000 3.336 0.474 two-sided 0.010 0.256
num_wt 338.000 98.107 13.735 137.000 101.255 12.073 two-sided 0.014 0.237
num_sdate 338.000 3,028.766 215.335 137.000 3,075.511 235.183 two-sided 0.046 0.211
num_sbp 338.000 14.426 2.398 137.000 14.255 2.515 two-sided 0.498 0.070
num_dbp 338.000 8.136 1.410 137.000 8.212 1.602 two-sided 0.630 0.052
num_ap 338.000 13.207 64.096 137.000 11.007 63.434 two-sided 0.733 0.034

Binomal test for binary variables:#

Hide code cell source

diff_binomial = []
for var in vars_binary:
    if len(clinic[var].cat.categories) == 2:
        diff_binomial.append(
            njab.stats.groups_comparision.binomtest(
                clinic[var], happend, event_names=(TARGET, "no-event")
            )
        )
    else:
        logging.warning(
            f"Non-binary variable: {var} with {len(clinic[var].cat.categories)} categories"
        )

diff_binomial = pd.concat(diff_binomial).sort_values(("binomial test", "pvalue"))
diff_binomial.to_excel(writer, sheet_name="clinic binary", float_format="%.4f")
with pd.option_context("display.max_rows", len(diff_binomial)):
    display(diff_binomial)
event no-event binomial test
count p count p k n alternative statistic pvalue proportion_estimate
variable
fac_hx 338.000 0.500 137.000 0.277 169 338 two-sided 0.500 0.000 0.500
fac_bm 338.000 0.198 137.000 0.073 67 338 two-sided 0.198 0.000 0.198

Analaysis of covariance (ANCOVA)#

Next, we select continous variables controlling for covariates.

First the summary statistics for the target and covariates:

Hide code cell source

clinic_ancova = [TARGET, *covar]
clinic_ancova = clinic[clinic_ancova].copy()
clinic_ancova.describe(include="all")
feat_name event num_age num_wt
count 475.000 475.000 475.000
mean 0.712 71.556 99.015
std 0.454 6.920 13.341
min 0.000 48.000 69.000
25% 0.000 70.000 90.000
50% 1.000 73.000 98.000
75% 1.000 76.000 107.000
max 1.000 89.000 152.000

Discard all rows with a missing features (if present):

Hide code cell source

clinic_ancova = clinic_ancova.dropna()
categorical_columns = clinic_ancova.columns[clinic_ancova.dtypes == "category"]
print("Available covariates: " ", ".join(categorical_columns.to_list()))
for categorical_column in categorical_columns:
    # only works if no NA and only binary variables!
    clinic_ancova[categorical_column] = clinic_ancova[categorical_column].cat.codes

desc_ancova = clinic_ancova.describe()
desc_ancova.to_excel(writer, sheet_name="covars", float_format="%.4f")
desc_ancova

feat_name event num_age num_wt
count 475.000 475.000 475.000
mean 0.712 71.556 99.015
std 0.454 6.920 13.341
min 0.000 48.000 69.000
25% 0.000 70.000 90.000
50% 1.000 73.000 98.000
75% 1.000 76.000 107.000
max 1.000 89.000 152.000

Remove non-varying variables (if present):

Hide code cell source

if (desc_ancova.loc["std"] < 0.001).sum():
    non_varying = desc_ancova.loc["std"] < 0.001
    non_varying = non_varying[non_varying].index
    print("Non varying columns: ", ", ".join(non_varying))
    clinic_ancova = clinic_ancova.drop(non_varying, axis=1)
    for col in non_varying:
        covar.remove(col)

Run ANCOVA:

Hide code cell source

ancova = njab.stats.ancova.AncovaOnlyTarget(
    df_proteomics=clinic[vars_cont].drop(covar, axis=1),
    df_clinic=clinic_ancova,
    target=TARGET,
    covar=covar,
    value_name="",
)
ancova = ancova.ancova().sort_values("p-unc")
ancova = ancova.loc[:, "p-unc":]
ancova.columns = pd.MultiIndex.from_product(
    [["ancova"], ancova.columns], names=("test", "var")
)
ancova.to_excel(writer, sheet_name="olink controlled", float_format="%.4f")
ancova.head(20)
test ancova
var p-unc np2 -Log10 pvalue qvalue rejected
feat_name Source
num_sz event 0.000 0.030 3.802 0.001 True
num_sg event 0.000 0.025 3.304 0.002 True
fac_stage event 0.013 0.013 1.883 0.035 True
num_sdate event 0.024 0.011 1.621 0.039 True
num_hg event 0.025 0.011 1.608 0.039 True
num_sbp event 0.356 0.002 0.448 0.475 False
num_ap event 0.721 0.000 0.142 0.824 False
num_dbp event 0.873 0.000 0.059 0.873 False

Hide code cell source

writer.close()

Kaplan-Meier (KM) plot for top markers#

Cutoff is defined using a univariate logistic regression

\[ \ln \frac{p}{1-p} = \beta_0 + \beta_1 \cdot x \]

the default cutoff p=0.5 corresponds to a feature value of:

\[ x = - \frac{\beta_0}{\beta_1} \]

Optional: The cutoff could be adapted to the prevalence of the target.

List of markers with significant difference between groups as defined by ANCOVA:

Hide code cell source

rejected = ancova.query("`('ancova', 'rejected')` == True")
rejected
test ancova
var p-unc np2 -Log10 pvalue qvalue rejected
feat_name Source
num_sz event 0.000 0.030 3.802 0.001 True
num_sg event 0.000 0.025 3.304 0.002 True
fac_stage event 0.013 0.013 1.883 0.035 True
num_sdate event 0.024 0.011 1.621 0.039 True
num_hg event 0.025 0.011 1.608 0.039 True

Settings for plots

class_weight = "balanced"
y_km = clinic[TARGET]
time_km = clinic[TIME_KM]
compare_km_curves = partial(
    compare_km_curves,
    time=time_km,
    y=y_km,
    xlim=(0, 80),
    xlabel="time passed",
    ylabel=f"rate {y_km.name}",
)
log_rank_test = partial(
    log_rank_test,
    time=time_km,
    y=y_km,
)
TOP_N = 2  # None = all

Plot KM curves for TOP_N (here two) makers with log-rank test p-value:

Hide code cell source

for marker, _ in rejected.index[:TOP_N]:  # first case done above currently
    fig, ax = plt.subplots()
    class_weight = "balanced"
    # class_weight=None
    model = sklearn.linear_model.LogisticRegression(class_weight=class_weight)
    model = model.fit(X=clinic[marker].to_frame(), y=happend)
    print(
        f"Intercept {float(model.intercept_.squeeze()):5.3f}, coef.: {float(model.coef_.squeeze()):5.3f}"
    )
    ! could be adapted based on proportion of target (for imbalanced data):
    # offset = np.log(p/(1-p))
    offset = np.log(0.5 / (1 - 0.5))  # ! standard cutoff of probability of 0.5
    cutoff = offset - float(model.intercept_.squeeze()) / float(model.coef_.squeeze())
    direction = ">" if model.coef_ > 0 else "<"
    print(f"Custom cutoff defined by Logistic regressor for {marker:>10}: {cutoff:.3f}")
    pred = njab.sklearn.scoring.get_pred(model, clinic[marker].to_frame())
    ax, kmf_0, kmf_1 = compare_km_curves(pred=pred)
    res = log_rank_test(mask=pred)
    ax.set_title(
        f"KM curve for {TARGET.lower()}"
        f" and marker {marker} \n"
        f"(cutoff{direction}{cutoff:.2f}, log-rank-test p={res.p_value:.3f})"
    )
    ax.legend(
        [
            f"KP pred=0 (N={(~pred).sum()})",
            "95% CI (pred=0)",
            f"KP pred=1 (N={pred.sum()})",
            "95% CI (pred=1)",
        ]
    )
    fname = FOLDER / f"KM_plot_{marker}.pdf"
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)

    # add counts
    add_at_risk_counts(kmf_0, kmf_1, ax=ax)
    fname = FOLDER / f"KM_plot_{marker}_w_counts.pdf"
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)
Intercept -0.512, coef.: 0.039
/usr/bin/sh: 1: Syntax error: "(" unexpected
Custom cutoff defined by Logistic regressor for     num_sz: 13.031
Intercept -1.862, coef.: 0.184
/usr/bin/sh: 1: Syntax error: "(" unexpected
Custom cutoff defined by Logistic regressor for     num_sg: 10.142
../_images/98c0956044d33a7619f1eadee1a20caa7df345489c8775def4ce13b07be8130a.png ../_images/56117052f726e0c26e0c0a0af81890b0b6aab93006d8290c8db1c781da68dd3c.png