Explorative Analysis#
Uses the prostate
time series dataset provided by the SurvSet
package.
SurvSet package First install the dependencies:
%pip install njab openpyxl
Show code cell output
Requirement already satisfied: njab in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (0.0.6)
Collecting openpyxl
Downloading openpyxl-3.1.2-py2.py3-none-any.whl.metadata (2.5 kB)
Requirement already satisfied: omegaconf in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (2.3.0)
Requirement already satisfied: lifelines in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.28.0)
Requirement already satisfied: numpy in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (1.26.4)
Requirement already satisfied: pandas in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (2.2.2)
Requirement already satisfied: scikit-learn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (1.4.2)
Requirement already satisfied: statsmodels in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.14.2)
Requirement already satisfied: umap-learn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.5.6)
Requirement already satisfied: matplotlib in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (3.8.4)
Requirement already satisfied: mrmr-selection in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.2.8)
Requirement already satisfied: pingouin in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.5.4)
Requirement already satisfied: seaborn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.13.2)
Collecting et-xmlfile (from openpyxl)
Downloading et_xmlfile-1.1.0-py3-none-any.whl.metadata (1.8 kB)
Requirement already satisfied: scipy>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.13.0)
Requirement already satisfied: autograd>=1.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.6.2)
Requirement already satisfied: autograd-gamma>=0.3 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (0.5.0)
Requirement already satisfied: formulaic>=0.2.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.0.1)
Requirement already satisfied: contourpy>=1.0.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (1.4.5)
Requirement already satisfied: packaging>=20.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (24.0)
Requirement already satisfied: pillow>=8 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab) (2024.1)
Requirement already satisfied: category-encoders in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (2.6.3)
Requirement already satisfied: jinja2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (3.1.4)
Requirement already satisfied: tqdm in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (4.66.4)
Requirement already satisfied: joblib in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (1.4.2)
Requirement already satisfied: polars>=0.12.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (0.20.25)
Requirement already satisfied: antlr4-python3-runtime==4.9.* in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab) (4.9.3)
Requirement already satisfied: PyYAML>=5.1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab) (6.0.1)
Requirement already satisfied: pandas-flavor in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab) (0.6.0)
Requirement already satisfied: tabulate in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab) (0.9.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from scikit-learn->njab) (3.5.0)
Requirement already satisfied: patsy>=0.5.6 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from statsmodels->njab) (0.5.6)
Requirement already satisfied: numba>=0.51.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from umap-learn->njab) (0.59.1)
Requirement already satisfied: pynndescent>=0.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from umap-learn->njab) (0.5.12)
Requirement already satisfied: future>=0.15.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from autograd>=1.5->lifelines->njab) (1.0.0)
Requirement already satisfied: interface-meta>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (1.3.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (4.11.0)
Requirement already satisfied: wrapt>=1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (1.16.0)
Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from numba>=0.51.2->umap-learn->njab) (0.42.0)
Requirement already satisfied: six in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from patsy>=0.5.6->statsmodels->njab) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from jinja2->mrmr-selection->njab) (2.1.5)
Requirement already satisfied: xarray in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas-flavor->pingouin->njab) (2024.5.0)
Downloading openpyxl-3.1.2-py2.py3-none-any.whl (249 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 250.0/250.0 kB 4.6 MB/s eta 0:00:00
?25hDownloading et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-1.1.0 openpyxl-3.1.2
Note: you may need to restart the kernel to use updated packages.
Show code cell source
from functools import partial
from pathlib import Path
import logging
from IPython.display import display
import numpy as np
import pandas as pd
import sklearn
import seaborn
from lifelines.plotting import add_at_risk_counts
import matplotlib.pyplot as plt
from njab.plotting.km import compare_km_curves, log_rank_test
import njab
import njab.plotting
njab.pandas.set_pandas_options()
njab.plotting.set_font_sizes('x-small')
seaborn.set_style("whitegrid")
plt.rcParams['figure.figsize'] = [4.0, 4.0]
Set parameters#
TARGET = 'event'
TIME_KM = 'time'
FOLDER = 'prostate'
CLINIC = 'https://raw.githubusercontent.com/ErikinBC/SurvSet/main/SurvSet/_datagen/output/prostate.csv'
val_ids: str = '' # List of comma separated values or filepath
#
# list or string of csv, eg. "var1,var2"
clinic_cont = ['age']
# list or string of csv, eg. "var1,var2"
clinic_binary = ['male', 'AD']
# List of comma separated values or filepath
da_covar = 'num_age,num_wt'
Show code cell source
print(f"Time To Event: {TIME_KM}")
print(f"event (target) variable: {TARGET}")
Time To Event: time
event (target) variable: event
Show code cell content
FOLDER = Path(FOLDER)
FOLDER.mkdir(exist_ok=True, parents=True)
FOLDER
PosixPath('prostate')
Inspect the data:
Show code cell source
clinic = pd.read_csv(CLINIC, index_col=0).dropna(how='any')
clinic.columns.name = 'feat_name' # ! check needs to be implemented
cols_clinic = njab.pandas.get_colums_accessor(clinic)
clinic = clinic.astype({var: 'int'
for var in ['event',
'time',
'num_age',
'num_wt',
'num_sbp',
'num_dbp',
'num_sz',
'num_sg',
'num_sdate',
'fac_stage']}
)
clinic
feat_name | event | time | num_age | num_wt | num_sbp | num_dbp | num_hg | num_sz | num_sg | num_ap | num_sdate | fac_stage | fac_rx | fac_pf | fac_hx | fac_ekg | fac_bm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
pid | |||||||||||||||||
0 | 0 | 72 | 75 | 76 | 15 | 9 | 13.799 | 2 | 8 | 0.300 | 2,778 | 3 | 0.2 mg estrogen | normal activity | N | heart strain | N |
2 | 1 | 40 | 69 | 102 | 14 | 8 | 13.398 | 3 | 9 | 0.300 | 2,933 | 3 | 5.0 mg estrogen | normal activity | Y | heart strain | N |
3 | 1 | 20 | 75 | 94 | 14 | 7 | 17.598 | 4 | 8 | 0.900 | 2,999 | 3 | 0.2 mg estrogen | in bed < 50% daytime | Y | benign | N |
4 | 0 | 65 | 67 | 99 | 17 | 10 | 13.398 | 34 | 8 | 0.500 | 3,002 | 3 | placebo | normal activity | N | normal | N |
5 | 1 | 24 | 71 | 98 | 19 | 10 | 15.100 | 10 | 11 | 0.600 | 3,086 | 3 | 0.2 mg estrogen | normal activity | N | normal | N |
6 | 1 | 46 | 75 | 100 | 14 | 10 | 13.000 | 13 | 9 | 0.800 | 3,099 | 3 | placebo | normal activity | N | benign | N |
7 | 0 | 62 | 73 | 114 | 17 | 11 | 12.600 | 3 | 9 | 0.600 | 3,100 | 3 | placebo | normal activity | Y | heart strain | N |
8 | 0 | 61 | 60 | 110 | 12 | 8 | 14.600 | 4 | 10 | 0.700 | 3,134 | 3 | 1.0 mg estrogen | normal activity | N | normal | N |
9 | 0 | 60 | 78 | 107 | 13 | 8 | 13.000 | 21 | 6 | 0.400 | 3,140 | 3 | 1.0 mg estrogen | normal activity | Y | old MI | N |
10 | 0 | 60 | 77 | 89 | 15 | 8 | 15.600 | 3 | 8 | 0.600 | 3,141 | 3 | 1.0 mg estrogen | normal activity | N | normal | N |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
491 | 0 | 51 | 76 | 114 | 30 | 18 | 11.600 | 6 | 9 | 0.500 | 3,440 | 3 | 1.0 mg estrogen | normal activity | Y | heart strain | N |
492 | 0 | 51 | 68 | 98 | 14 | 8 | 13.398 | 6 | 9 | 0.700 | 3,440 | 3 | 5.0 mg estrogen | normal activity | N | normal | N |
493 | 1 | 14 | 64 | 100 | 14 | 7 | 14.000 | 23 | 11 | 0.600 | 3,395 | 4 | 0.2 mg estrogen | in bed < 50% daytime | N | normal | N |
494 | 1 | 27 | 75 | 102 | 16 | 8 | 14.199 | 5 | 11 | 29.598 | 3,430 | 4 | 0.2 mg estrogen | normal activity | N | normal | N |
495 | 0 | 51 | 60 | 123 | 14 | 10 | 14.500 | 6 | 8 | 0.500 | 3,429 | 4 | 1.0 mg estrogen | normal activity | Y | heart strain | Y |
496 | 1 | 5 | 73 | 100 | 19 | 10 | 16.797 | 2 | 11 | 2.900 | 3,433 | 4 | 5.0 mg estrogen | normal activity | Y | normal | N |
497 | 1 | 0 | 78 | 108 | 11 | 6 | 15.799 | 9 | 13 | 0.600 | 3,297 | 3 | 1.0 mg estrogen | normal activity | N | normal | N |
498 | 1 | 41 | 78 | 127 | 16 | 10 | 15.799 | 5 | 9 | 0.500 | 3,349 | 3 | 0.2 mg estrogen | normal activity | Y | heart strain | N |
499 | 0 | 53 | 77 | 93 | 17 | 10 | 16.000 | 11 | 9 | 0.800 | 3,357 | 3 | 1.0 mg estrogen | normal activity | N | heart block or conduction def | N |
500 | 1 | 19 | 82 | 96 | 12 | 6 | 12.398 | 32 | 13 | 8.398 | 3,311 | 4 | 5.0 mg estrogen | normal activity | Y | heart strain | N |
483 rows × 17 columns
Descriptive statistics of non-numeric variables:
clinic.describe(include='object')
feat_name | fac_rx | fac_pf | fac_hx | fac_ekg | fac_bm |
---|---|---|---|---|---|
count | 483 | 483 | 483 | 483 | 483 |
unique | 4 | 4 | 2 | 8 | 2 |
top | placebo | normal activity | N | normal | N |
freq | 122 | 435 | 276 | 161 | 404 |
Set the binary variables and convert them to categories:
vars_binary = ['fac_hx', 'fac_bm']
clinic[vars_binary] = clinic[vars_binary].astype('category')
Covariates to adjust for:
Show code cell source
check_isin_clinic = partial(njab.pandas.col_isin_df, df=clinic)
covar = check_isin_clinic(da_covar)
covar
['num_age', 'num_wt']
Set continous variables
vars_cont = [
'num_age', 'num_wt', 'num_sbp', 'num_dbp', 'num_hg', 'num_sz', 'num_sg',
'num_ap', 'num_sdate', 'fac_stage'
]
Collect outputs#
in an excel file:
Show code cell source
fname = FOLDER / '1_differential_analysis.xlsx'
files_out = {fname.name: fname}
writer = pd.ExcelWriter(fname)
print(f"Output will be written to: {fname}")
Output will be written to: prostate/1_differential_analysis.xlsx
Differences between groups defined by target#
Perform an uncontrolled t-test for continous and binomal test for binary variables. Besides the test results, summary statistics of both groups are provided.
First, set the binary target as a boolean mask:
happend = clinic[TARGET].astype(bool)
Univariate t-test for continous variables#
Show code cell source
ana_differential = njab.stats.groups_comparision.diff_analysis(
clinic[vars_cont],
happend,
event_names=(TARGET, 'no event'),
)
ana_differential = ana_differential.sort_values(('ttest', 'p-val'))
ana_differential.to_excel(writer, "clinic continous", float_format='%.4f')
ana_differential
event | no event | ttest | |||||||
---|---|---|---|---|---|---|---|---|---|
count | mean | std | count | mean | std | alternative | p-val | cohen-d | |
variable | |||||||||
num_sz | 344.000 | 15.779 | 13.124 | 139.000 | 10.986 | 9.033 | two-sided | 0.000 | 0.396 |
num_sg | 344.000 | 10.503 | 2.026 | 139.000 | 9.791 | 1.894 | two-sided | 0.000 | 0.358 |
num_hg | 344.000 | 13.259 | 2.020 | 139.000 | 13.826 | 1.629 | two-sided | 0.001 | 0.296 |
num_wt | 344.000 | 98.023 | 13.783 | 139.000 | 101.259 | 12.003 | two-sided | 0.011 | 0.243 |
fac_stage | 344.000 | 3.459 | 0.499 | 139.000 | 3.338 | 0.475 | two-sided | 0.013 | 0.246 |
num_age | 344.000 | 72.029 | 7.004 | 139.000 | 70.424 | 6.775 | two-sided | 0.020 | 0.231 |
num_sdate | 344.000 | 3,027.677 | 214.218 | 139.000 | 3,072.885 | 234.628 | two-sided | 0.051 | 0.205 |
num_sbp | 344.000 | 14.410 | 2.419 | 139.000 | 14.252 | 2.497 | two-sided | 0.526 | 0.065 |
num_dbp | 344.000 | 8.131 | 1.407 | 139.000 | 8.201 | 1.593 | two-sided | 0.649 | 0.048 |
num_ap | 344.000 | 13.046 | 63.550 | 139.000 | 11.124 | 63.021 | two-sided | 0.762 | 0.030 |
Binomal test for binary variables:#
Show code cell source
diff_binomial = []
for var in vars_binary:
if len(clinic[var].cat.categories) == 2:
diff_binomial.append(
njab.stats.groups_comparision.binomtest(clinic[var],
happend,
event_names=(TARGET,
'no-event')))
else:
logging.warning(
f"Non-binary variable: {var} with {len(clinic[var].cat.categories)} categories"
)
diff_binomial = pd.concat(diff_binomial).sort_values(
('binomial test', 'pvalue'))
diff_binomial.to_excel(writer, 'clinic binary', float_format='%.4f')
with pd.option_context('display.max_rows', len(diff_binomial)):
display(diff_binomial)
event | no-event | binomial test | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
count | p | count | p | k | n | alternative | statistic | pvalue | proportion_estimate | |
variable | ||||||||||
fac_hx | 344.000 | 0.491 | 139.000 | 0.273 | 169 | 344 | two-sided | 0.491 | 0.000 | 0.491 |
fac_bm | 344.000 | 0.198 | 139.000 | 0.079 | 68 | 344 | two-sided | 0.198 | 0.000 | 0.198 |
Analaysis of covariance (ANCOVA)#
Next, we select continous variables controlling for covariates.
First the summary statistics for the target and covariates:
Show code cell source
clinic_ancova = [TARGET, *covar]
clinic_ancova = clinic[clinic_ancova].copy()
clinic_ancova.describe(include='all')
feat_name | event | num_age | num_wt |
---|---|---|---|
count | 483.000 | 483.000 | 483.000 |
mean | 0.712 | 71.567 | 98.954 |
std | 0.453 | 6.970 | 13.364 |
min | 0.000 | 48.000 | 69.000 |
25% | 0.000 | 70.000 | 90.000 |
50% | 1.000 | 73.000 | 98.000 |
75% | 1.000 | 76.000 | 107.000 |
max | 1.000 | 89.000 | 152.000 |
Discard all rows with a missing features (if present):
Show code cell source
clinic_ancova = clinic_ancova.dropna(
)
categorical_columns = clinic_ancova.columns[clinic_ancova.dtypes == 'category']
print("Available covariates: " ", ".join(categorical_columns.to_list()))
for categorical_column in categorical_columns:
# only works if no NA and only binary variables!
clinic_ancova[categorical_column] = clinic_ancova[
categorical_column].cat.codes
desc_ancova = clinic_ancova.describe()
desc_ancova.to_excel(writer, "covars", float_format='%.4f')
desc_ancova
feat_name | event | num_age | num_wt |
---|---|---|---|
count | 483.000 | 483.000 | 483.000 |
mean | 0.712 | 71.567 | 98.954 |
std | 0.453 | 6.970 | 13.364 |
min | 0.000 | 48.000 | 69.000 |
25% | 0.000 | 70.000 | 90.000 |
50% | 1.000 | 73.000 | 98.000 |
75% | 1.000 | 76.000 | 107.000 |
max | 1.000 | 89.000 | 152.000 |
Remove non-varying variables (if present):
Show code cell source
if (desc_ancova.loc['std'] < 0.001).sum():
non_varying = desc_ancova.loc['std'] < 0.001
non_varying = non_varying[non_varying].index
print("Non varying columns: ", ', '.join(non_varying))
clinic_ancova = clinic_ancova.drop(non_varying, axis=1)
for col in non_varying:
covar.remove(col)
Run ANCOVA:
Show code cell source
ancova = njab.stats.ancova.AncovaOnlyTarget(
df_proteomics=clinic[vars_cont].drop(covar, axis=1),
df_clinic=clinic_ancova,
target=TARGET,
covar=covar,
value_name='')
ancova = ancova.ancova().sort_values('p-unc')
ancova = ancova.loc[:, "p-unc":]
ancova.columns = pd.MultiIndex.from_product([['ancova'], ancova.columns],
names=('test', 'var'))
ancova.to_excel(writer, "olink controlled", float_format='%.4f')
ancova.head(20)
test | ancova | |||||
---|---|---|---|---|---|---|
var | p-unc | np2 | -Log10 pvalue | qvalue | rejected | |
feat_name | Source | |||||
num_sz | event | 0.000 | 0.030 | 3.885 | 0.001 | True |
num_sg | event | 0.000 | 0.025 | 3.353 | 0.002 | True |
fac_stage | event | 0.017 | 0.012 | 1.776 | 0.045 | True |
num_hg | event | 0.025 | 0.010 | 1.608 | 0.049 | True |
num_sdate | event | 0.031 | 0.010 | 1.514 | 0.049 | True |
num_sbp | event | 0.348 | 0.002 | 0.459 | 0.464 | False |
num_ap | event | 0.772 | 0.000 | 0.113 | 0.839 | False |
num_dbp | event | 0.839 | 0.000 | 0.076 | 0.839 | False |
Show code cell source
writer.close()
Kaplan-Meier (KM) plot for top markers#
Cutoff is defined using a univariate logistic regression
the default cutoff p=0.5
corresponds to a feature value of:
Optional: The cutoff could be adapted to the prevalence of the target.
List of markers with significant difference between groups as defined by ANCOVA:
rejected = ancova.query("`('ancova', 'rejected')` == True")
rejected
test | ancova | |||||
---|---|---|---|---|---|---|
var | p-unc | np2 | -Log10 pvalue | qvalue | rejected | |
feat_name | Source | |||||
num_sz | event | 0.000 | 0.030 | 3.885 | 0.001 | True |
num_sg | event | 0.000 | 0.025 | 3.353 | 0.002 | True |
fac_stage | event | 0.017 | 0.012 | 1.776 | 0.045 | True |
num_hg | event | 0.025 | 0.010 | 1.608 | 0.049 | True |
num_sdate | event | 0.031 | 0.010 | 1.514 | 0.049 | True |
Settings for plots
class_weight = 'balanced'
y_km = clinic[TARGET]
time_km = clinic[TIME_KM]
compare_km_curves = partial(compare_km_curves,
time=time_km,
y=y_km,
xlim=(0, 80),
xlabel='time passed',
ylabel=f'rate {y_km.name}')
log_rank_test = partial(
log_rank_test,
time=time_km,
y=y_km,
)
TOP_N = 2 # None = all
Plot KM curves for TOP_N
(here two) makers with log-rank test p-value:
for marker, _ in rejected.index[:TOP_N]: # first case done above currently
fig, ax = plt.subplots()
class_weight = 'balanced'
# class_weight=None
model = sklearn.linear_model.LogisticRegression(class_weight=class_weight)
model = model.fit(X=clinic[marker].to_frame(), y=happend)
print(
f"Intercept {float(model.intercept_):5.3f}, coef.: {float(model.coef_):5.3f}"
)
# offset = np.log(p/(1-p)) # ! could be adapted based on proportion of target (for imbalanced data)
offset = np.log(0.5 / (1 - 0.5)) # ! standard cutoff of probability of 0.5
cutoff = offset - float(model.intercept_) / float(model.coef_)
direction = '>' if model.coef_ > 0 else '<'
print(
f"Custom cutoff defined by Logistic regressor for {marker:>10}: {cutoff:.3f}"
)
pred = njab.sklearn.scoring.get_pred(model, clinic[marker].to_frame())
ax, kmf_0, kmf_1 = compare_km_curves(pred=pred)
res = log_rank_test(mask=pred)
ax.set_title(
f'KM curve for {TARGET.lower()}'
f' and marker {marker} \n'
f'(cutoff{direction}{cutoff:.2f}, log-rank-test p={res.p_value:.3f})')
ax.legend([
f"KP pred=0 (N={(~pred).sum()})", '95% CI (pred=0)',
f"KP pred=1 (N={pred.sum()})", '95% CI (pred=1)'
])
fname = FOLDER / f'KM_plot_{marker}.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
# add counts
add_at_risk_counts(kmf_0, kmf_1, ax=ax)
fname = FOLDER / f'KM_plot_{marker}_w_counts.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
Intercept -0.517, coef.: 0.039
Custom cutoff defined by Logistic regressor for num_sz: 13.135
Intercept -1.877, coef.: 0.185
Custom cutoff defined by Logistic regressor for num_sg: 10.139