Explorative Analysis#

Uses the prostate time series dataset provided by the SurvSet package.

First, install the dependencies:

%pip install njab openpyxl



Set parameters#

TARGET = 'event'
TIME_KM = 'time'
FOLDER = 'prostate'
CLINIC = 'https://raw.githubusercontent.com/ErikinBC/SurvSet/main/SurvSet/_datagen/output/prostate.csv'
val_ids: str = ''  # List of comma separated values or filepath
#
# list or string of csv, eg. "var1,var2"
clinic_cont = ['age']
# list or string of csv, eg. "var1,var2"
clinic_binary = ['male', 'AD']
# List of comma separated values or filepath
da_covar = 'num_age,num_wt'

Time To Event: time
event (target) variable: event

Inspect the data:

feat_name	event	time	num_age	num_wt	num_sbp	num_dbp	num_hg	num_sz	num_sg	num_ap	num_sdate	fac_stage	fac_rx	fac_pf	fac_hx	fac_ekg	fac_bm
pid
0	0	72	75	76	15	9	13.799	2	8	0.300	2,778	3	0.2 mg estrogen	normal activity	N	heart strain	N
2	1	40	69	102	14	8	13.398	3	9	0.300	2,933	3	5.0 mg estrogen	normal activity	Y	heart strain	N
3	1	20	75	94	14	7	17.598	4	8	0.900	2,999	3	0.2 mg estrogen	in bed < 50% daytime	Y	benign	N
4	0	65	67	99	17	10	13.398	34	8	0.500	3,002	3	placebo	normal activity	N	normal	N
5	1	24	71	98	19	10	15.100	10	11	0.600	3,086	3	0.2 mg estrogen	normal activity	N	normal	N
6	1	46	75	100	14	10	13.000	13	9	0.800	3,099	3	placebo	normal activity	N	benign	N
7	0	62	73	114	17	11	12.600	3	9	0.600	3,100	3	placebo	normal activity	Y	heart strain	N
8	0	61	60	110	12	8	14.600	4	10	0.700	3,134	3	1.0 mg estrogen	normal activity	N	normal	N
9	0	60	78	107	13	8	13.000	21	6	0.400	3,140	3	1.0 mg estrogen	normal activity	Y	old MI	N
10	0	60	77	89	15	8	15.600	3	8	0.600	3,141	3	1.0 mg estrogen	normal activity	N	normal	N
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
491	0	51	76	114	30	18	11.600	6	9	0.500	3,440	3	1.0 mg estrogen	normal activity	Y	heart strain	N
492	0	51	68	98	14	8	13.398	6	9	0.700	3,440	3	5.0 mg estrogen	normal activity	N	normal	N
493	1	14	64	100	14	7	14.000	23	11	0.600	3,395	4	0.2 mg estrogen	in bed < 50% daytime	N	normal	N
494	1	27	75	102	16	8	14.199	5	11	29.598	3,430	4	0.2 mg estrogen	normal activity	N	normal	N
495	0	51	60	123	14	10	14.500	6	8	0.500	3,429	4	1.0 mg estrogen	normal activity	Y	heart strain	Y
496	1	5	73	100	19	10	16.797	2	11	2.900	3,433	4	5.0 mg estrogen	normal activity	Y	normal	N
497	1	0	78	108	11	6	15.799	9	13	0.600	3,297	3	1.0 mg estrogen	normal activity	N	normal	N
498	1	41	78	127	16	10	15.799	5	9	0.500	3,349	3	0.2 mg estrogen	normal activity	Y	heart strain	N
499	0	53	77	93	17	10	16.000	11	9	0.800	3,357	3	1.0 mg estrogen	normal activity	N	heart block or conduction def	N
500	1	19	82	96	12	6	12.398	32	13	8.398	3,311	4	5.0 mg estrogen	normal activity	Y	heart strain	N

483 rows × 17 columns

Descriptive statistics of non-numeric variables:

clinic.describe(include='object')

feat_name	fac_rx	fac_pf	fac_hx	fac_ekg	fac_bm
count	483	483	483	483	483
unique	4	4	2	8	2
top	placebo	normal activity	N	normal	N
freq	122	435	276	161	404

Set the binary variables and convert them to categories:

vars_binary = ['fac_hx', 'fac_bm']
clinic[vars_binary] = clinic[vars_binary].astype('category')

Covariates to adjust for:

['num_age', 'num_wt']
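The covariates were given as the comma-separated string `'num_age,num_wt'` and appear above as a list. A minimal sketch of such a conversion follows; the helper name `as_list` is hypothetical and not part of njab, and the file path option mentioned in the parameter comments is not handled here.

```python
def as_list(value):
    """Turn a comma-separated string into a list of names; pass lists through."""
    if isinstance(value, str):
        # drop empty entries, e.g. from a trailing comma or an empty string
        return [item.strip() for item in value.split(",") if item.strip()]
    return list(value)


da_covar = as_list("num_age,num_wt")
print(da_covar)
```

An empty string yields an empty list, so the same helper would also cover a default such as `val_ids = ''`.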

Set the continuous variables:

vars_cont = [
    'num_age', 'num_wt', 'num_sbp', 'num_dbp', 'num_hg', 'num_sz', 'num_sg',
    'num_ap', 'num_sdate', 'fac_stage'
]

Collect outputs#

Outputs are collected in an Excel file:

Output will be written to: prostate/1_differential_analysis.xlsx

Differences between groups defined by target#

Perform an uncontrolled t-test for continuous variables and a binomial test for binary variables. Besides the test results, summary statistics of both groups are provided.

First, set the binary target as a boolean mask:

happend = clinic[TARGET].astype(bool)

Univariate t-test for continuous variables#

	event			no event			ttest
	count	mean	std	count	mean	std	alternative	p-val	cohen-d
variable
num_sz	344.000	15.779	13.124	139.000	10.986	9.033	two-sided	0.000	0.396
num_sg	344.000	10.503	2.026	139.000	9.791	1.894	two-sided	0.000	0.358
num_hg	344.000	13.259	2.020	139.000	13.826	1.629	two-sided	0.001	0.296
num_wt	344.000	98.023	13.783	139.000	101.259	12.003	two-sided	0.011	0.243
fac_stage	344.000	3.459	0.499	139.000	3.338	0.475	two-sided	0.013	0.246
num_age	344.000	72.029	7.004	139.000	70.424	6.775	two-sided	0.020	0.231
num_sdate	344.000	3,027.677	214.218	139.000	3,072.885	234.628	two-sided	0.051	0.205
num_sbp	344.000	14.410	2.419	139.000	14.252	2.497	two-sided	0.526	0.065
num_dbp	344.000	8.131	1.407	139.000	8.201	1.593	two-sided	0.649	0.048
num_ap	344.000	13.046	63.550	139.000	11.124	63.021	two-sided	0.762	0.030
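As a rough cross-check, the `num_sz` row can be recomputed from the rounded summary statistics in the table. This is only a sketch: it uses SciPy rather than the routine that produced the table, and the choice of Welch's unequal-variance test (`equal_var=False`) is an assumption. Cohen's d is computed with the pooled standard deviation.

```python
from math import sqrt

from scipy.stats import ttest_ind_from_stats

# Rounded summary statistics for num_sz, copied from the table above
n_event, mean_event, std_event = 344, 15.779, 13.124
n_no_event, mean_no_event, std_no_event = 139, 10.986, 9.033

# Two-sided t-test from summary statistics (Welch variant is an assumption)
result = ttest_ind_from_stats(mean_event, std_event, n_event,
                              mean_no_event, std_no_event, n_no_event,
                              equal_var=False)

# Cohen's d: absolute mean difference divided by the pooled standard deviation
pooled_std = sqrt(((n_event - 1) * std_event**2 +
                   (n_no_event - 1) * std_no_event**2) /
                  (n_event + n_no_event - 2))
cohen_d = abs(mean_event - mean_no_event) / pooled_std

print(f"p-val: {result.pvalue:.3f}, cohen-d: {cohen_d:.3f}")
```

The effect size obtained this way agrees with the `cohen-d` column for `num_sz` up to rounding.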

Binomial test for binary variables#

	event		no-event		binomial test
	count	p	count	p	k	n	alternative	statistic	pvalue	proportion_estimate
variable
fac_hx	344.000	0.491	139.000	0.273	169	344	two-sided	0.491	0.000	0.491
fac_bm	344.000	0.198	139.000	0.079	68	344	two-sided	0.198	0.000	0.198
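Reading the `fac_hx` row, the test appears to compare the k=169 positives among the n=344 patients with an event against the proportion observed in the no-event group (0.273). The sketch below reproduces that comparison with `scipy.stats.binomtest`; using the rounded no-event proportion as the null hypothesis is an assumption about how the table was produced.

```python
from scipy.stats import binomtest

# Values for fac_hx, copied from the table above
k, n = 169, 344          # positives and group size in the event group
p_reference = 0.273      # rounded proportion in the no-event group (assumed null)

result = binomtest(k=k, n=n, p=p_reference, alternative="two-sided")
proportion = k / n       # observed proportion in the event group

print(f"proportion: {proportion:.3f}, p-value: {result.pvalue:.3g}")
```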

Analysis of covariance (ANCOVA)#

Next, we select continuous variables while controlling for covariates.

First the summary statistics for the target and covariates:

feat_name	event	num_age	num_wt
count	483.000	483.000	483.000
mean	0.712	71.567	98.954
std	0.453	6.970	13.364
min	0.000	48.000	69.000
25%	0.000	70.000	90.000
50%	1.000	73.000	98.000
75%	1.000	76.000	107.000
max	1.000	89.000	152.000

Discard all rows with missing features (if present):

feat_name	event	num_age	num_wt
count	483.000	483.000	483.000
mean	0.712	71.567	98.954
std	0.453	6.970	13.364
min	0.000	48.000	69.000
25%	0.000	70.000	90.000
50%	1.000	73.000	98.000
75%	1.000	76.000	107.000
max	1.000	89.000	152.000

Remove non-varying variables (if present):

Run ANCOVA:

	test	ancova
	var	p-unc	np2	-Log10 pvalue	qvalue	rejected
feat_name	Source
num_sz	event	0.000	0.030	3.885	0.001	True
num_sg	event	0.000	0.025	3.353	0.002	True
fac_stage	event	0.017	0.012	1.776	0.045	True
num_hg	event	0.025	0.010	1.608	0.049	True
num_sdate	event	0.031	0.010	1.514	0.049	True
num_sbp	event	0.348	0.002	0.459	0.464	False
num_ap	event	0.772	0.000	0.113	0.839	False
num_dbp	event	0.839	0.000	0.076	0.839	False
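To illustrate what the `p-unc` and `np2` (partial eta squared) columns measure, here is a hand-rolled sketch for a single binary group and numeric covariates. It is not the implementation behind the table above: it compares a linear model with and without the group indicator using NumPy least squares, and it runs on synthetic data. The function name `ancova_group_effect` and all simulated values are hypothetical.

```python
import numpy as np
from scipy import stats


def ancova_group_effect(y, group, covars):
    """F-test for a binary group effect on y, adjusted for covariates."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    covars = np.asarray(covars, dtype=float).reshape(n, -1)
    group = np.asarray(group, dtype=float).reshape(n, 1)
    x_reduced = np.hstack([np.ones((n, 1)), covars])  # intercept + covariates
    x_full = np.hstack([x_reduced, group])            # ... + group indicator

    def rss(x):
        beta, *_ = np.linalg.lstsq(x, y, rcond=None)
        resid = y - x @ beta
        return float(resid @ resid)

    rss_reduced, rss_full = rss(x_reduced), rss(x_full)
    ss_effect = rss_reduced - rss_full          # variation explained by group
    df_resid = n - x_full.shape[1]
    f_stat = ss_effect / (rss_full / df_resid)
    p_unc = float(stats.f.sf(f_stat, 1, df_resid))
    np2 = ss_effect / (ss_effect + rss_full)    # partial eta squared
    return f_stat, p_unc, np2


# Synthetic data (assumption): the marker depends on age and on the group
rng = np.random.default_rng(0)
age = rng.normal(70, 7, size=200)
event = rng.integers(0, 2, size=200)
marker = 0.1 * age + 2.0 * event + rng.normal(0, 1, size=200)

f_stat, p_unc, np2 = ancova_group_effect(marker, event, age)
print(f"F={f_stat:.1f}, p-unc={p_unc:.3g}, np2={np2:.3f}")
```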

Kaplan-Meier (KM) plot for top markers#

The cutoff is defined using a univariate logistic regression:

\[ \ln \frac{p}{1-p} = \beta_0 + \beta_1 \cdot x \]

Since \(\ln(0.5 / (1 - 0.5)) = 0\), the default cutoff p=0.5 corresponds to a feature value of:

\[ x = - \frac{\beta_0}{\beta_1} \]

Optional: The cutoff could be adapted to the prevalence of the target.
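The relation between a probability cutoff and a feature cutoff can be sketched directly from the equations above. The coefficients are the rounded values this notebook prints for `num_sz` in the loop output; the helper names are hypothetical, and 0.712 is the share of events reported in the ANCOVA summary statistics.

```python
import numpy as np


def logit(p):
    """Log-odds of a probability p."""
    return np.log(p / (1 - p))


def cutoff_for_probability(intercept, coef, p=0.5):
    """Feature value x at which the model predicts probability p."""
    return (logit(p) - intercept) / coef


def predicted_probability(x, intercept, coef):
    """Inverse of the logit link: p = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1 / (1 + np.exp(-(intercept + coef * x)))


# Rounded coefficients as printed for num_sz (inputs are approximate)
beta_0, beta_1 = -0.517, 0.039

x_default = cutoff_for_probability(beta_0, beta_1)              # p = 0.5
x_prevalence = cutoff_for_probability(beta_0, beta_1, p=0.712)  # share of events

print(f"cutoff at p=0.5:   {x_default:.2f}")
print(f"cutoff at p=0.712: {x_prevalence:.2f}")
```

With the rounded coefficients the default cutoff comes out near 13.26 rather than the printed 13.135, because the printed intercept and coefficient are themselves rounded to three decimals.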

List of markers with significant difference between groups as defined by ANCOVA:

rejected = ancova.query("`('ancova', 'rejected')` == True")
rejected

	test	ancova
	var	p-unc	np2	-Log10 pvalue	qvalue	rejected
feat_name	Source
num_sz	event	0.000	0.030	3.885	0.001	True
num_sg	event	0.000	0.025	3.353	0.002	True
fac_stage	event	0.017	0.012	1.776	0.045	True
num_hg	event	0.025	0.010	1.608	0.049	True
num_sdate	event	0.031	0.010	1.514	0.049	True
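The `qvalue` column is consistent with a Benjamini-Hochberg correction of the uncorrected p-values at a 5% false discovery rate. The sketch below is a plain NumPy version under that assumption, not the routine used to build the table; the p-values are reconstructed from the rounded `-Log10 pvalue` column of the full ANCOVA table, so results match only up to rounding.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, starting from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.clip(scaled, 0, 1)
    return q


markers = ['num_sz', 'num_sg', 'fac_stage', 'num_hg',
           'num_sdate', 'num_sbp', 'num_ap', 'num_dbp']
neg_log10_p = np.array([3.885, 3.353, 1.776, 1.608, 1.514, 0.459, 0.113, 0.076])
p_unc = 10.0 ** -neg_log10_p

qvalues = benjamini_hochberg(p_unc)
rejected = qvalues < 0.05

for name, q, rej in zip(markers, qvalues, rejected):
    print(f"{name:>10}: q={q:.3f} rejected={rej}")
```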

Settings for plots

class_weight = 'balanced'
y_km = clinic[TARGET]
time_km = clinic[TIME_KM]
compare_km_curves = partial(compare_km_curves,
                            time=time_km,
                            y=y_km,
                            xlim=(0, 80),
                            xlabel='time passed',
                            ylabel=f'rate {y_km.name}')
log_rank_test = partial(
    log_rank_test,
    time=time_km,
    y=y_km,
)
TOP_N = 2  # None = all
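The settings above use `functools.partial` to pre-bind the arguments shared by every plot, so that later calls only need to pass `pred` or `mask`. A minimal illustration of that pattern, with a hypothetical stand-in function instead of the real plotting helper:

```python
from functools import partial


def describe_curve(pred, time, y, xlabel="time"):
    """Hypothetical stand-in for a plotting helper with shared arguments."""
    return (f"{len(pred)} predictions, {len(time)} times, "
            f"{len(y)} outcomes, xlabel={xlabel!r}")


time = [5, 8, 12]
y = [1, 0, 1]

# Bind the shared arguments once; the name now refers to the partial object
describe_curve = partial(describe_curve, time=time, y=y, xlabel="time passed")

summary = describe_curve(pred=[True, False, True])
print(summary)
```

Rebinding the original name, as the notebook does, keeps call sites short, at the cost of hiding the unbound function for the rest of the session.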

Plot KM curves for the TOP_N (here two) markers with the log-rank test p-value:

for marker, _ in rejected.index[:TOP_N]:  # first case done above currently
    fig, ax = plt.subplots()
    class_weight = 'balanced'
    # class_weight=None
    model = sklearn.linear_model.LogisticRegression(class_weight=class_weight)
    model = model.fit(X=clinic[marker].to_frame(), y=happend)
    # extract scalars from the fitted arrays (shapes (1,) and (1, 1))
    intercept = model.intercept_.item()
    coef = model.coef_.item()
    print(f"Intercept {intercept:5.3f}, coef.: {coef:5.3f}")
    # offset = np.log(p/(1-p)) # ! could be adapted based on proportion of target (for imbalanced data)
    offset = np.log(0.5 / (1 - 0.5))  # ! standard cutoff of probability of 0.5
    # solve offset = intercept + coef * x for x
    cutoff = (offset - intercept) / coef
    direction = '>' if coef > 0 else '<'
    print(
        f"Custom cutoff defined by Logistic regressor for {marker:>10}: {cutoff:.3f}"
    )
    pred = njab.sklearn.scoring.get_pred(model, clinic[marker].to_frame())
    ax, kmf_0, kmf_1 = compare_km_curves(pred=pred)
    res = log_rank_test(mask=pred)
    ax.set_title(
        f'KM curve for {TARGET.lower()}'
        f' and marker {marker} \n'
        f'(cutoff{direction}{cutoff:.2f}, log-rank-test p={res.p_value:.3f})')
    ax.legend([
        f"KP pred=0 (N={(~pred).sum()})", '95% CI (pred=0)',
        f"KP pred=1 (N={pred.sum()})", '95% CI (pred=1)'
    ])
    fname = FOLDER / f'KM_plot_{marker}.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)

    # add counts
    add_at_risk_counts(kmf_0, kmf_1, ax=ax)
    fname = FOLDER / f'KM_plot_{marker}_w_counts.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)

Intercept -0.517, coef.: 0.039
Custom cutoff defined by Logistic regressor for     num_sz: 13.135
Intercept -1.877, coef.: 0.185
Custom cutoff defined by Logistic regressor for     num_sg: 10.139

Figure: KM curve for event and marker num_sz.

Figure: KM curve for event and marker num_sg.
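For readers unfamiliar with the estimator behind these curves, the Kaplan-Meier survival function can be written in a few lines. This is a hand-rolled sketch on toy data, not the implementation used for the plots above, and it omits confidence intervals and at-risk counts.

```python
import numpy as np


def kaplan_meier(time, event):
    """Return event times and the Kaplan-Meier survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])  # sorted times with at least one event
    survival = []
    s = 1.0
    for t in event_times:
        at_risk = (time >= t).sum()             # still under observation at t
        events_at_t = ((time == t) & event).sum()
        s *= 1.0 - events_at_t / at_risk        # product-limit update
        survival.append(s)
    return event_times, np.array(survival)


# Toy data (assumption): event=0 marks a censored observation
time = [5, 8, 8, 12, 15, 20]
event = [1, 1, 0, 1, 0, 1]

times, surv = kaplan_meier(time, event)
for t, s in zip(times, surv):
    print(f"t={t:>4.0f}  S(t)={s:.3f}")
```

Censored observations never trigger a drop in the curve, but they do shrink the at-risk set for later event times.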
