Logistic regression model#

Procedure:

  1. Load and prepare the clinical and proteomics data

  2. Explore the data with PCA and UMAP

  3. Select features for a logistic regression model using cross-validation on the training split

  4. Retrain the best model configuration and evaluate it on the test split

Example: Alzheimer's mass spectrometry-based proteomics dataset

Predict Alzheimer's disease based on proteomics measurements.

# Setup colab installation
# You need to restart the runtime after running this cell
%pip install njab heatmapz openpyxl plotly
import itertools
import logging
from pathlib import Path
from typing import Optional

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn
import sklearn
import sklearn.impute
import statsmodels.api as sm
import umap
from heatmap import corrplot
from IPython.display import display
from sklearn.metrics import log_loss, make_scorer

import njab.sklearn
from njab.plotting.metrics import plot_auc, plot_prc
from njab.sklearn import StandardScaler
from njab.sklearn import pca as njab_pca
from njab.sklearn.scoring import (ConfusionMatrix,
                                  get_lr_multiplicative_decomposition,
                                  get_pred, get_score,
                                  get_target_count_per_bin)
from njab.sklearn.types import Splits

logger = logging.getLogger('njab')
logger.setLevel(logging.INFO)

njab.pandas.set_pandas_options()
pd.options.display.min_rows = 10
pd.options.display.max_columns = 20
njab.plotting.set_font_sizes('x-small')
seaborn.set_style("whitegrid")

njab.plotting.set_font_sizes(8)

Set parameters#

CLINIC: str = 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/clinic_ml.csv'  # clinical data
fname_omics: str = 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/proteome.csv'  # omics data
TARGET: str = 'AD'  # target column in CLINIC dataset (binary)
TARGET_LABEL: Optional[str] = None  # optional: rename target variable
n_features_max: int = 5
freq_cutoff: float = 0.5  # feature completeness cutoff: keep features observed in at least this fraction of samples
VAL_IDS: str = ''  # optional comma-separated list of test sample IDs
VAL_IDS_query: str = ''
weights: bool = True
FOLDER = 'alzheimer'
model_name = 'all'

Setup#

Load data#

clinic = pd.read_csv(CLINIC, index_col=0).convert_dtypes()
cols_clinic = njab.pandas.get_colums_accessor(clinic)
omics = pd.read_csv(fname_omics, index_col=0)

Data shapes

omics.shape, clinic.shape
((210, 1542), (210, 6))

See how common the omics features are and remove features below the chosen frequency cutoff:

ax = omics.notna().sum().sort_values().plot(rot=45)
M_before = omics.shape[1]
omics = omics.dropna(thresh=int(len(omics) * freq_cutoff), axis=1)
M_after = omics.shape[1]
msg = (
    f"Removed {M_before-M_after} features with more than {freq_cutoff*100}% missing values."
    f"\nRemaining features: {M_after} (of {M_before})")
print(msg)
# keep a map of all proteins in protein group, but only display first protein
# proteins are unique to protein groups
pg_map = {k: k.split(";")[0] for k in omics.columns}
omics = omics.rename(columns=pg_map)
# log2 transform raw intensity data:
omics = np.log2(omics + 1)
omics
Removed 248 features with more than 50.0% missing values.
Remaining features: 1294 (of 1542)
A0A024QZX5 A0A024R0T9 A0A024R3W6 A0A024R644 A0A075B6H9 A0A075B6I0 A0A075B6I1 A0A075B6I6 A0A075B6I9 A0A075B6J9 ... Q9Y653 Q9Y696 Q9Y6C2 Q9Y6N6 Q9Y6N7 Q9Y6R7 Q9Y6X5 Q9Y6Y8 Q9Y6Y9 S4R3U6
Sample ID
Sample_000 15.912 16.852 15.571 16.481 20.246 16.764 17.584 16.988 20.054 NaN ... 16.012 15.178 NaN 15.050 16.842 19.863 NaN 19.563 12.838 12.805
Sample_001 15.936 16.874 15.519 16.387 19.941 18.786 17.144 NaN 19.067 16.188 ... 15.528 15.576 NaN 14.833 16.597 20.299 15.556 19.386 13.970 12.443
Sample_002 16.112 14.523 15.935 16.416 19.251 16.832 15.671 17.012 18.569 NaN ... 15.229 14.728 13.757 15.118 17.440 19.598 15.735 20.447 12.637 12.505
Sample_003 16.107 17.032 15.802 16.979 19.628 17.852 18.877 14.182 18.985 13.438 ... 15.495 14.590 14.682 15.140 17.356 19.429 NaN 20.216 12.627 12.445
Sample_004 15.603 15.331 15.375 16.679 20.450 18.682 17.081 14.140 19.686 14.495 ... 14.757 15.094 14.048 15.256 17.075 19.582 15.328 19.867 13.145 12.235
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Sample_205 15.682 16.886 14.910 16.482 17.705 17.039 NaN 16.413 19.102 16.064 ... 15.236 15.684 14.236 15.415 17.551 17.922 16.340 19.928 12.930 11.803
Sample_206 15.798 17.554 15.600 15.938 18.155 18.152 16.503 16.860 18.538 15.288 ... 15.422 16.106 NaN 15.345 17.084 18.708 14.249 19.433 NaN NaN
Sample_207 15.740 16.877 15.469 16.898 18.636 17.950 16.321 16.401 18.849 17.580 ... 15.808 16.098 14.403 15.715 16.586 18.725 16.138 19.599 13.637 11.174
Sample_208 15.477 16.779 14.995 16.132 14.908 17.530 NaN 16.119 18.368 15.202 ... 15.157 16.712 NaN 14.640 16.533 19.411 15.807 19.545 13.216 NaN
Sample_209 15.727 17.261 15.175 16.235 17.893 17.744 16.371 15.780 18.806 16.532 ... 15.237 15.652 15.211 14.205 16.749 19.275 15.732 19.577 11.043 11.792

210 rows × 1294 columns

Clinical data#

View clinical data

clinic
Kiel Magdeburg Sweden male age AD
Sample ID
Sample_000 0 0 1 0 71 0
Sample_001 0 0 1 1 77 1
Sample_002 0 0 1 1 75 1
Sample_003 0 0 1 0 72 1
Sample_004 0 0 1 0 63 1
... ... ... ... ... ... ...
Sample_205 0 0 0 0 69 1
Sample_206 0 0 0 1 73 0
Sample_207 0 0 0 0 71 0
Sample_208 0 0 0 1 83 0
Sample_209 0 0 0 0 63 0

210 rows × 6 columns

Target#

Tabulate target and check for missing values

njab.pandas.value_counts_with_margins(clinic[TARGET])
counts prop
AD
0 122 0.581
1 88 0.419
target_counts = clinic[TARGET].value_counts()

if target_counts.sum() < len(clinic):
    print("Target has missing values."
          f" Can only use {target_counts.sum()} of {len(clinic)} samples.")
    mask = clinic[TARGET].notna()
    clinic, omics = clinic.loc[mask], omics.loc[mask]
if TARGET_LABEL is None:
    TARGET_LABEL = TARGET
y = clinic[TARGET].rename(TARGET_LABEL).astype(int)
clinic_for_ml = clinic.drop(TARGET, axis=1)

Test IDs#

Select some test samples:

olink_val, clinic_val = None, None
if not VAL_IDS:
    if VAL_IDS_query:
        logging.warning(f"Querying index using: {VAL_IDS_query}")
        VAL_IDS = clinic.filter(like=VAL_IDS_query, axis=0).index.to_list()
        logging.warning(f"Found {len(VAL_IDS)} Test-IDs")
    else:
        logging.warning("Create train and test split.")
        _, VAL_IDS = sklearn.model_selection.train_test_split(
            clinic.index,
            test_size=0.2,
            random_state=123,
            stratify=clinic[TARGET])
        VAL_IDS = list(VAL_IDS)
elif isinstance(VAL_IDS, str):
    VAL_IDS = VAL_IDS.split(",")
else:
    raise ValueError("Provide IDs in csv format as str: 'ID1,ID2'")
VAL_IDS
WARNING:root:Create train and test split.
['Sample_127',
 'Sample_164',
 'Sample_175',
 'Sample_048',
 'Sample_159',
 'Sample_141',
 'Sample_174',
 'Sample_145',
 'Sample_090',
 'Sample_191',
 'Sample_038',
 'Sample_009',
 'Sample_112',
 'Sample_096',
 'Sample_146',
 'Sample_135',
 'Sample_142',
 'Sample_205',
 'Sample_186',
 'Sample_095',
 'Sample_085',
 'Sample_011',
 'Sample_156',
 'Sample_153',
 'Sample_124',
 'Sample_194',
 'Sample_061',
 'Sample_079',
 'Sample_149',
 'Sample_179',
 'Sample_197',
 'Sample_125',
 'Sample_133',
 'Sample_099',
 'Sample_067',
 'Sample_202',
 'Sample_010',
 'Sample_171',
 'Sample_018',
 'Sample_060',
 'Sample_185',
 'Sample_173']

Data Splits#

Separate train and test split
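
The split code below uses a combined feature matrix `X`. A minimal sketch of its construction, assuming the clinical covariates are simply joined to the omics features (consistent with the 5 + 1,294 = 1,299 columns reported further below):

X = clinic_for_ml.join(omics)
X.shape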

TRAIN_LABEL = 'train'
TEST_LABEL = 'test'
if VAL_IDS:
    diff = pd.Index(VAL_IDS)
    VAL_IDS = X.index.intersection(VAL_IDS)
    if len(VAL_IDS) < len(diff):
        logging.warning("Some requested validation IDs are not in the data: "
                        ",".join(str(x) for x in diff.difference(VAL_IDS)))
    X_val = X.loc[VAL_IDS]
    X = X.drop(VAL_IDS)

    use_val_split = True

    y_val = y.loc[VAL_IDS]
    y = y.drop(VAL_IDS)

Output folder#

FOLDER = Path(FOLDER)
FOLDER.mkdir(exist_ok=True, parents=True)
print(f"Output folder: {FOLDER}")
Output folder: alzheimer

Outputs#

Save outputs to an Excel file:

# out
files_out = {}
fname = FOLDER / 'log_reg.xlsx'
files_out[fname.stem] = fname
writer = pd.ExcelWriter(fname)
print(f"Excel-file for tables: {fname}")
Excel-file for tables: alzheimer/log_reg.xlsx

Collect test predictions#

predictions = y_val.to_frame('true')

Fill missing values with training median#

feat_w_missings = X.isna().sum()
feat_w_missings = feat_w_missings.loc[feat_w_missings > 0]
feat_w_missings
age          10
A0A024QZX5   11
A0A024R0T9    2
A0A024R3W6   23
A0A024R644    1
             ..
Q9Y6N6        6
Q9Y6N7       10
Q9Y6X5       22
Q9Y6Y9       70
S4R3U6       65
Length: 894, dtype: int64
row_w_missing = X.isna().sum(axis=1).astype(bool)
col_w_missing = X.isna().sum(axis=0).astype(bool)
X.loc[row_w_missing, col_w_missing]
age A0A024QZX5 A0A024R0T9 A0A024R3W6 A0A024R644 A0A075B6H9 A0A075B6I0 A0A075B6I1 A0A075B6I6 A0A075B6J9 ... Q9Y5I4 Q9Y617 Q9Y653 Q9Y696 Q9Y6C2 Q9Y6N6 Q9Y6N7 Q9Y6X5 Q9Y6Y9 S4R3U6
Sample ID
Sample_000 71 15.912 16.852 15.571 16.481 20.246 16.764 17.584 16.988 NaN ... 17.187 16.859 16.012 15.178 NaN 15.050 16.842 NaN 12.838 12.805
Sample_001 77 15.936 16.874 15.519 16.387 19.941 18.786 17.144 NaN 16.188 ... 17.447 16.799 15.528 15.576 NaN 14.833 16.597 15.556 13.970 12.443
Sample_002 75 16.112 14.523 15.935 16.416 19.251 16.832 15.671 17.012 NaN ... 17.410 16.288 15.229 14.728 13.757 15.118 17.440 15.735 12.637 12.505
Sample_003 72 16.107 17.032 15.802 16.979 19.628 17.852 18.877 14.182 13.438 ... 17.545 17.075 15.495 14.590 14.682 15.140 17.356 NaN 12.627 12.445
Sample_004 63 15.603 15.331 15.375 16.679 20.450 18.682 17.081 14.140 14.495 ... 17.297 16.736 14.757 15.094 14.048 15.256 17.075 15.328 13.145 12.235
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Sample_204 <NA> NaN 17.279 15.287 16.513 20.183 19.674 20.251 16.334 19.778 ... 15.874 15.465 15.668 15.915 14.204 15.025 NaN 15.012 12.288 10.564
Sample_206 73 15.798 17.554 15.600 15.938 18.155 18.152 16.503 16.860 15.288 ... 17.109 15.035 15.422 16.106 NaN 15.345 17.084 14.249 NaN NaN
Sample_207 71 15.740 16.877 15.469 16.898 18.636 17.950 16.321 16.401 17.580 ... 16.938 16.283 15.808 16.098 14.403 15.715 16.586 16.138 13.637 11.174
Sample_208 83 15.477 16.779 14.995 16.132 14.908 17.530 NaN 16.119 15.202 ... 17.155 15.920 15.157 16.712 NaN 14.640 16.533 15.807 13.216 NaN
Sample_209 63 15.727 17.261 15.175 16.235 17.893 17.744 16.371 15.780 16.532 ... 16.776 15.713 15.237 15.652 15.211 14.205 16.749 15.732 11.043 11.792

168 rows × 894 columns

Impute using median of training data

median_imputer = sklearn.impute.SimpleImputer(strategy='median')

X = njab.sklearn.transform_DataFrame(X, median_imputer.fit_transform)
X_val = njab.sklearn.transform_DataFrame(X_val, median_imputer.transform)
assert X.isna().sum().sum() == 0
X.shape, X_val.shape
((168, 1299), (42, 1299))

Principal Components#

on the standard-normalized training data:
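
Each feature is scaled as $z = (x - \mu_{\text{train}}) / \sigma_{\text{train}}$, with mean and standard deviation estimated on the training split only; the fitted scaler is reused for the test split further below.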

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

PCs, pca = njab_pca.run_pca(X_scaled, n_components=None)
files_out["var_explained_by_PCs.pdf"] = FOLDER / "var_explained_by_PCs.pdf"
ax = njab_pca.plot_explained_variance(pca)
ax.locator_params(axis='x', integer=True)
njab.plotting.savefig(ax.get_figure(), files_out["var_explained_by_PCs.pdf"])
X_scaled.shape
INFO:njab.plotting:Saved Figures to alzheimer/var_explained_by_PCs.pdf
(168, 1299)

Plot the first five PCs, with the binary target label annotating each sample:

files_out['scatter_first_5PCs.pdf'] = FOLDER / 'scatter_first_5PCs.pdf'

fig, axes = plt.subplots(5, 2, figsize=(6, 8), layout='constrained')
PCs.columns = [s.replace("principal component", "PC") for s in PCs.columns]
PCs = PCs.join(y.astype('category'))
up_to = min(PCs.shape[-1], 5)
# https://github.com/matplotlib/matplotlib/issues/25538
# colab: old pandas version and too new matplotlib version (2023-11-6)
for (i, j), ax in zip(itertools.combinations(range(up_to), 2), axes.flatten()):
    PCs.plot.scatter(i, j, c=TARGET_LABEL, cmap='Paired', ax=ax)
_ = PCs.pop(TARGET_LABEL)
njab.plotting.savefig(fig, files_out['scatter_first_5PCs.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/scatter_first_5PCs.pdf

UMAP#

of training data:

reducer = umap.UMAP()
embedding = reducer.fit_transform(X_scaled)

files_out['umap.pdf'] = FOLDER / 'umap.pdf'

embedding = pd.DataFrame(embedding,
                         index=X_scaled.index,
                         columns=['UMAP 1',
                                  'UMAP 2']).join(y.astype('category'))
ax = embedding.plot.scatter('UMAP 1', 'UMAP 2', c=TARGET_LABEL, cmap='Paired')
njab.plotting.savefig(ax.get_figure(), files_out['umap.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/umap.pdf

Baseline Model - Logistic Regression#

Based on the parameters set above, optionally use balanced class weights:

if weights:
    weights = 'balanced'
    cutoff = 0.5
else:
    cutoff = None
    weights = None
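
With `class_weight='balanced'`, scikit-learn weights samples inversely proportionally to their class frequencies, i.e. `n_samples / (n_classes * class_count)`. A small worked sketch using the target counts tabulated above:

n_samples, n_classes = 210, 2
class_counts = {0: 122, 1: 88}
{c: round(n_samples / (n_classes * n), 3) for c, n in class_counts.items()}
# {0: 0.861, 1: 1.193}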

Logistic Regression#

Procedure:

  1. Select the best set of features from the entire feature set using cross-validation on the training split

  2. Retrain the best model configuration on the entire training split and evaluate it on the test split

Define splits and models:

splits = Splits(X_train=X_scaled,
                X_test=scaler.transform(X_val),
                y_train=y,
                y_test=y_val)
model = sklearn.linear_model.LogisticRegression(penalty='l2',
                                                class_weight=weights)
scoring = [
    'precision', 'recall', 'f1', 'balanced_accuracy', 'roc_auc',
    'average_precision'
]
scoring = {k: k for k in scoring}
# do not average log loss for AIC and BIC calculations
scoring['log_loss'] = make_scorer(log_loss,
                                  greater_is_better=True,
                                  normalize=False)
cv_feat = njab.sklearn.find_n_best_features(
    X=splits.X_train,
    y=splits.y_train,
    model=model,
    name=TARGET_LABEL,
    groups=splits.y_train,
    n_features_max=n_features_max,
    scoring=scoring,
    return_train_score=True,
    # fit_params=dict(sample_weight=weights)
)
cv_feat = cv_feat.drop('test_case',
                       axis=1).groupby('n_features').agg(['mean', 'std'])
cv_feat
100%|██████████| 1/1 [00:00<00:00, 139.28it/s]
100%|██████████| 2/2 [00:00<00:00,  6.83it/s]
100%|██████████| 3/3 [00:00<00:00,  5.28it/s]
100%|██████████| 4/4 [00:00<00:00,  4.69it/s]
100%|██████████| 5/5 [00:01<00:00,  4.42it/s]
fit_time score_time test_precision train_precision test_recall ... test_average_precision train_average_precision test_log_loss train_log_loss n_observations
mean std mean std mean std mean std mean std ... mean std mean std mean std mean std mean std
n_features
1 0.003 0.001 0.014 0.001 0.715 0.098 0.712 0.023 0.704 0.108 ... 0.751 0.091 0.720 0.028 294.837 86.679 1,168.535 82.092 168.000 0.000
2 0.003 0.000 0.014 0.000 0.687 0.066 0.687 0.021 0.713 0.099 ... 0.765 0.078 0.755 0.023 311.417 60.820 1,234.856 84.235 168.000 0.000
3 0.003 0.000 0.014 0.000 0.724 0.073 0.729 0.019 0.796 0.103 ... 0.801 0.087 0.795 0.024 262.398 61.376 1,004.897 78.645 168.000 0.000
4 0.003 0.000 0.014 0.000 0.767 0.078 0.780 0.020 0.814 0.091 ... 0.848 0.080 0.857 0.022 223.471 62.641 805.936 66.644 168.000 0.000
5 0.003 0.000 0.014 0.000 0.754 0.086 0.774 0.023 0.783 0.101 ... 0.848 0.078 0.862 0.021 245.818 62.510 845.584 92.744 168.000 0.000

5 rows × 34 columns

Add AIC and BIC for model selection
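
For reference, with $k$ selected features, $n$ observations and maximized likelihood $\hat{L}$, the standard definitions are

$$\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad \mathrm{BIC} = k\ln n - 2\ln\hat{L}.$$

The cell below plugs the summed log loss from the cross-validation results into these formulas and negates both criteria, so that larger values indicate a better model.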

# AIC vs BIC on train and test data with bigger is better
IC_criteria = pd.DataFrame()
N_split = {
    'train': round(len(splits.X_train) * 0.8),
    'test': round(len(splits.X_train) * 0.2)
}

for _split in ('train', 'test'):

    IC_criteria[(f'{_split}_neg_AIC',
                 'mean')] = -(2 * cv_feat.index.to_series() -
                              2 * cv_feat[(f'{_split}_log_loss', 'mean')])
    IC_criteria[(
        f'{_split}_neg_BIC',
        'mean')] = -(cv_feat.index.to_series() * np.log(N_split[_split]) -
                     2 * cv_feat[(f'{_split}_log_loss', 'mean')])
IC_criteria.columns = pd.MultiIndex.from_tuples(IC_criteria.columns)
IC_criteria
train_neg_AIC train_neg_BIC test_neg_AIC test_neg_BIC
mean mean mean mean
n_features
1 2,335.070 2,332.173 587.674 586.148
2 2,465.711 2,459.915 618.834 615.782
3 2,003.794 1,995.101 518.796 514.217
4 1,603.872 1,592.281 438.941 432.836
5 1,681.168 1,666.679 481.635 474.004

All cross-validation metrics:

cv_feat = cv_feat.join(IC_criteria)
cv_feat = cv_feat.filter(regex="train|test", axis=1).style.highlight_max(
    axis=0, subset=pd.IndexSlice[:, pd.IndexSlice[:, 'mean']])
cv_feat
  test_precision train_precision test_recall train_recall test_f1 train_f1 test_balanced_accuracy train_balanced_accuracy test_roc_auc train_roc_auc test_average_precision train_average_precision test_log_loss train_log_loss train_neg_AIC train_neg_BIC test_neg_AIC test_neg_BIC
  mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean std mean mean mean mean
n_features                                                                
1 0.715090 0.097785 0.712124 0.022788 0.704286 0.107999 0.707500 0.019391 0.705759 0.088131 0.709702 0.019351 0.748748 0.073459 0.751437 0.016989 0.799259 0.065666 0.798173 0.016412 0.751101 0.091072 0.720311 0.028466 294.837085 86.679293 1,168.535243 82.092116 2,335.070486 2,332.172646 587.674169 586.147809
2 0.687396 0.066357 0.686706 0.020643 0.712857 0.099447 0.714286 0.024733 0.696015 0.064919 0.700109 0.020758 0.738534 0.053016 0.740687 0.017950 0.807192 0.055527 0.813695 0.013622 0.765328 0.078350 0.754689 0.022982 311.417165 60.820362 1,234.855565 84.234509 2,465.711130 2,459.915451 618.834331 615.781610
3 0.723886 0.073140 0.728746 0.019180 0.795714 0.103066 0.800357 0.023846 0.752519 0.059155 0.762733 0.018592 0.784752 0.052193 0.793655 0.016794 0.845158 0.058247 0.853656 0.013804 0.800901 0.086731 0.794904 0.023501 262.397797 61.375810 1,004.897056 78.644710 2,003.794113 1,995.100594 518.795593 514.216512
4 0.767039 0.078051 0.780211 0.020441 0.814286 0.091268 0.837143 0.018621 0.785609 0.061656 0.807484 0.015171 0.815038 0.053025 0.834130 0.013553 0.879639 0.056085 0.896144 0.013018 0.847946 0.080071 0.856597 0.022011 223.470651 62.641424 805.936090 66.644467 1,603.872180 1,592.280820 438.941302 432.835860
5 0.753626 0.085873 0.773605 0.022748 0.782857 0.100974 0.821786 0.026631 0.761591 0.063434 0.796848 0.022512 0.794929 0.052105 0.824919 0.019919 0.878444 0.055807 0.900023 0.012607 0.848134 0.077870 0.862431 0.021129 245.817716 62.510078 845.584109 92.743733 1,681.168217 1,666.679018 481.635432 474.003630

Save:

cv_feat.to_excel(writer, 'CV', float_format='%.3f')
cv_feat = cv_feat.data

Optimal number of features to use based on cross-validation by metric:

mask = cv_feat.columns.levels[0].str[:4] == 'test'
scores_cols = cv_feat.columns.levels[0][mask]
n_feat_best = cv_feat.loc[:, pd.IndexSlice[scores_cols, 'mean']].idxmax()
n_feat_best.name = 'best'
n_feat_best.to_excel(writer, 'n_feat_best')
n_feat_best
test_average_precision  mean   5
test_balanced_accuracy  mean   4
test_f1                 mean   4
test_log_loss           mean   2
test_neg_AIC            mean   2
test_neg_BIC            mean   2
test_precision          mean   4
test_recall             mean   4
test_roc_auc            mean   4
Name: best, dtype: int64

Retrain the model with the best number of features according to the selected metric:

results_model = njab.sklearn.run_model(
    model=model,
    splits=splits,
    n_feat_to_select=n_feat_best.loc['test_roc_auc', 'mean'],
)
results_model.name = model_name
100%|██████████| 4/4 [00:00<00:00,  4.61it/s]

Receiver Operating Characteristic (ROC) curve of the final model#

ax = plot_auc(results_model,
              label_train=TRAIN_LABEL,
              label_test=TEST_LABEL,
              figsize=(4, 2))
files_out['ROAUC'] = FOLDER / 'plot_roauc.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['ROAUC'])
INFO:njab.plotting:Saved Figures to alzheimer/plot_roauc.pdf

Precision-Recall Curve for final model#

ax = plot_prc(results_model,
              label_train=TRAIN_LABEL,
              label_test=TEST_LABEL,
              figsize=(4, 2))
files_out['PRAUC'] = FOLDER / 'plot_prauc.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['PRAUC'])
INFO:njab.plotting:Saved Figures to alzheimer/plot_prauc.pdf

Coefficients (without std. errors)#

pd.DataFrame({
    'coef': results_model.model.coef_.flatten(),
    'name': results_model.model.feature_names_in_
})
coef name
0 1.094 P63104
1 -0.604 P09486
2 -0.791 A0A0B4J2B5
3 1.129 P10636-2
results_model.model.intercept_
array([-0.25165406])

Selected Features#

des_selected_feat = splits.X_train[results_model.selected_features].describe()
des_selected_feat.to_excel(writer, 'sel_feat', float_format='%.3f')
des_selected_feat
P63104 P09486 A0A0B4J2B5 P10636-2
count 168.000 168.000 168.000 168.000
mean 0.000 0.000 0.000 -0.000
std 1.003 1.003 1.003 1.003
min -2.527 -2.578 -2.756 -3.959
25% -0.721 -0.648 -0.660 -0.483
50% -0.108 0.101 -0.060 -0.021
75% 0.756 0.540 0.693 0.549
max 2.321 2.559 2.466 2.806

Heatmap of correlations#

fig = plt.figure(figsize=(6, 6))
files_out['corr_plot_train.pdf'] = FOLDER / 'corr_plot_train.pdf'
_ = corrplot(X[results_model.selected_features].join(y).corr(), size_scale=300)
njab.plotting.savefig(fig, files_out['corr_plot_train.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/corr_plot_train.pdf

Plot training data scores#
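
The score is the model's predicted probability for the positive class. `get_score` presumably wraps `predict_proba`; a minimal equivalent sketch (an assumption about the njab helper, not its verified implementation):

score_sketch = pd.Series(
    results_model.model.predict_proba(
        splits.X_train[results_model.selected_features])[:, 1],  # P(AD = 1)
    index=splits.X_train.index)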

N_BINS = 20
score = get_score(clf=results_model.model,
                  X=splits.X_train[results_model.selected_features],
                  pos=1)
ax = score.hist(bins=N_BINS)
files_out['hist_score_train.pdf'] = FOLDER / 'hist_score_train.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_train.pdf'])
pred_bins = get_target_count_per_bin(score, y, n_bins=N_BINS)
ax = pred_bins.plot(kind='bar', ylabel='count')
files_out[
    'hist_score_train_target.pdf'] = FOLDER / 'hist_score_train_target.pdf'
njab.plotting.savefig(ax.get_figure(),
                      files_out['hist_score_train_target.pdf'])
# pred_bins
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_train.pdf
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_train_target.pdf

Test data scores#

score_val = get_score(clf=results_model.model,
                      X=splits.X_test[results_model.selected_features],
                      pos=1)
predictions['score'] = score_val
ax = score_val.hist(bins=N_BINS)  # list(x/N_BINS for x in range(0,N_BINS)))
ax.set_ylabel('count')
ax.set_xlim(0, 1)
files_out['hist_score_test.pdf'] = FOLDER / 'hist_score_test.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_test.pdf'])
pred_bins_val = get_target_count_per_bin(score_val, y_val, n_bins=N_BINS)
ax = pred_bins_val.plot(kind='bar', ylabel='count')
ax.locator_params(axis='y', integer=True)
files_out['hist_score_test_target.pdf'] = FOLDER / 'hist_score_test_target.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_test_target.pdf'])
# pred_bins_val
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_test.pdf
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_test_target.pdf

Performance evaluations#

Check whether the cutoff can be adapted to maximize the F1 score, which balances precision and recall:
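
The F1 score is the harmonic mean of precision and recall:

$$F_1 = \frac{2\,P\,R}{P + R}$$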

prc = pd.DataFrame(results_model.train.prc,
                   index='precision recall cutoffs'.split())
prc
0 1 2 3 4 5 6 7 8 9 ... 159 160 161 162 163 164 165 166 167 168
precision 0.417 0.419 0.422 0.424 0.427 0.429 0.432 0.435 0.438 0.440 ... 0.889 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
recall 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000 ... 0.114 0.114 0.100 0.086 0.071 0.057 0.043 0.029 0.014 0.000
cutoffs 0.000 0.004 0.009 0.010 0.011 0.011 0.014 0.023 0.028 0.032 ... 0.962 0.970 0.976 0.980 0.983 0.987 0.989 0.993 0.995 NaN

3 rows × 169 columns

prc.loc['f1_score'] = 2 * (prc.loc['precision'] * prc.loc['recall']) / (
    prc.loc['precision'] + prc.loc['recall'])
f1_max = prc[prc.loc['f1_score'].argmax()]
f1_max
precision   0.741
recall      0.900
cutoffs     0.407
f1_score    0.542
Name: 83, dtype: float64

Set the cutoff to the F1-optimal value:

cutoff = float(f1_max.loc['cutoffs'])
cutoff
0.4065495537928745
y_pred_val = njab.sklearn.scoring.get_custom_pred(
    clf=results_model.model,
    X=splits.X_test[results_model.selected_features],
    cutoff=cutoff)
predictions[model_name] = y_pred_val
predictions['dead'] = y_val
_ = ConfusionMatrix(y_val, y_pred_val).as_dataframe()
_.columns = pd.MultiIndex.from_tuples([(t[0] + f" - {cutoff:.3f}", t[1])
                                       for t in _.columns])
_.to_excel(writer, "CM_test_cutoff_adapted")
_
pred - 0.407
0 1
true
0 15 9
1 5 13
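
From this matrix, the test-split precision at the adapted cutoff is 13 / (13 + 9) ≈ 0.59 and the recall is 13 / (13 + 5) ≈ 0.72.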
y_pred_val = get_pred(clf=results_model.model,
                      X=splits.X_test[results_model.selected_features])
predictions[model_name] = y_pred_val
predictions['dead'] = y_val
_ = ConfusionMatrix(y_val, y_pred_val).as_dataframe()
_.columns = pd.MultiIndex.from_tuples([(t[0] + f" - {0.5}", t[1])
                                       for t in _.columns])
_.to_excel(writer, "CM_test_cutoff_0.5")
_
pred - 0.5
0 1
true
0 18 6
1 5 13

Multiplicative decomposition#

Decompose the model's odds into multiplicative components for both splits:
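
For logistic regression the odds factorize multiplicatively, which is presumably what the helper tabulates, one factor per column:

$$\text{odds} = e^{\beta_0}\prod_i e^{\beta_i x_i}, \qquad p = \frac{\text{odds}}{1 + \text{odds}}$$

Consistent with this, the `intercept` column below is $e^{-0.252} \approx 0.778$ for every sample.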

components = get_lr_multiplicative_decomposition(results=results_model,
                                                 X=splits.X_train,
                                                 prob=score,
                                                 y=y)
components.to_excel(writer, 'decomp_multiplicative_train')
components.to_excel(writer,
                    'decomp_multiplicative_train_view',
                    float_format='%.5f')
components.head(10)
P63104 P09486 A0A0B4J2B5 P10636-2 intercept odds prob AD
Sample ID
Sample_072 6.090 1.741 3.492 7.470 0.778 215.059 0.995 1
Sample_086 7.186 1.516 2.847 5.633 0.778 135.787 0.993 1
Sample_008 5.849 2.858 0.905 7.853 0.778 92.394 0.989 1
Sample_098 6.629 1.024 1.683 8.376 0.778 74.405 0.987 1
Sample_071 3.086 1.321 1.870 9.578 0.778 56.749 0.983 1
Sample_121 1.522 1.886 2.117 10.424 0.778 49.249 0.980 1
Sample_169 1.458 1.956 3.327 5.523 0.778 40.749 0.976 1
Sample_092 2.613 1.021 1.957 7.900 0.778 32.072 0.970 1
Sample_066 6.490 0.439 1.312 8.607 0.778 25.028 0.962 0
Sample_097 3.567 2.228 3.210 1.148 0.778 22.762 0.958 1
components_test = get_lr_multiplicative_decomposition(results=results_model,
                                                      X=splits.X_test,
                                                      prob=score_val,
                                                      y=y_val)
components_test.to_excel(writer, 'decomp_multiplicative_test')
components_test.to_excel(writer,
                         'decomp_multiplicative_test_view',
                         float_format='%.5f')
components_test.head(10)
P63104 P09486 A0A0B4J2B5 P10636-2 intercept odds prob AD
Sample ID
Sample_018 4.648 1.233 2.615 10.706 0.778 124.817 0.992 1
Sample_159 4.942 1.430 2.243 4.833 0.778 59.588 0.983 1
Sample_205 2.733 0.603 3.614 9.700 0.778 44.958 0.978 1
Sample_099 2.494 1.938 2.383 4.474 0.778 40.066 0.976 1
Sample_011 7.611 0.938 0.949 6.289 0.778 33.133 0.971 1
Sample_085 2.161 1.095 1.856 5.614 0.778 19.173 0.950 1
Sample_112 7.952 3.326 0.722 0.833 0.778 12.369 0.925 0
Sample_079 1.910 1.316 1.454 4.112 0.778 11.689 0.921 1
Sample_010 5.122 0.528 0.878 1.996 0.778 3.686 0.787 1
Sample_133 2.978 0.822 1.741 1.034 0.778 3.427 0.774 1

Plot TP, TN, FP and FN on UMAP and PCA embeddings#

UMAP#

reducer = umap.UMAP(random_state=42)
# UMAP embeddings are questionable for a single feature,
# so only embed if two or more features were selected
M_sel = len(results_model.selected_features)
if M_sel > 1:
    embedding = reducer.fit_transform(
        X_scaled[results_model.selected_features])

    embedding = pd.DataFrame(embedding,
                             index=X_scaled.index,
                             columns=['UMAP dimension 1', 'UMAP dimension 2'
                                      ]).join(y.astype('category'))
    display(embedding.head(3))
else:
    embedding = None
UMAP dimension 1 UMAP dimension 2 AD
Sample ID
Sample_000 6.307 7.813 0
Sample_001 7.008 7.516 1
Sample_002 6.090 6.873 1

Annotate using target variable and predictions:

predictions['label'] = predictions.apply(
    lambda x: njab.sklearn.scoring.get_label_binary_classification(
        x['true'], x[model_name]),
    axis=1)
mask = predictions[['true', model_name]].sum(axis=1).astype(bool)
predictions.loc[mask].sort_values('score', ascending=False)
true score all dead label
Sample ID
Sample_018 1 0.992 1 1 TP
Sample_159 1 0.983 1 1 TP
Sample_205 1 0.978 1 1 TP
Sample_099 1 0.976 1 1 TP
Sample_011 1 0.971 1 1 TP
Sample_085 1 0.950 1 1 TP
Sample_112 0 0.925 1 0 FP
Sample_079 1 0.921 1 1 TP
Sample_010 1 0.787 1 1 TP
Sample_133 1 0.774 1 1 TP
Sample_127 0 0.758 1 0 FP
Sample_090 1 0.728 1 1 TP
Sample_156 1 0.727 1 1 TP
Sample_145 0 0.674 1 0 FP
Sample_186 0 0.637 1 0 FP
Sample_067 0 0.583 1 0 FP
Sample_060 1 0.553 1 1 TP
Sample_197 0 0.535 1 0 FP
Sample_146 1 0.516 1 1 TP
Sample_009 1 0.292 0 1 FN
Sample_194 1 0.237 0 1 FN
Sample_142 1 0.195 0 1 FN
Sample_171 1 0.140 0 1 FN
Sample_125 1 0.119 0 1 FN
X_val_scaled = scaler.transform(X_val)
if embedding is not None:
    embedding_val = pd.DataFrame(
        reducer.transform(X_val_scaled[results_model.selected_features]),
        index=X_val_scaled.index,
        columns=['UMAP dimension 1', 'UMAP dimension 2'])
    embedding_val.sample(3)
pred_train = (
    y.to_frame('true')
    # .join(get_score(clf=results_model.model, X=splits.X_train[results_model.selected_features], pos=1))
    .join(score.rename('score')).join(
        get_pred(results_model.model, splits.X_train[
            results_model.selected_features]).rename(model_name)))
pred_train['label'] = pred_train.apply(
    lambda x: njab.sklearn.scoring.get_label_binary_classification(
        x['true'], x[model_name]),
    axis=1)
pred_train.sample(5)
true score all label
Sample ID
Sample_204 0 0.048 0 TN
Sample_020 1 0.934 1 TP
Sample_163 0 0.014 0 TN
Sample_181 0 0.094 0 TN
Sample_000 0 0.943 1 FP
colors = seaborn.color_palette(n_colors=4)
colors
if embedding is not None:
    fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
    for _embedding, ax, _title, _model_pred_label in zip(
        [embedding, embedding_val], axes, [TRAIN_LABEL, TEST_LABEL],
        [pred_train['label'], predictions['label']]):  # noqa: E129
        ax = seaborn.scatterplot(
            x=_embedding.iloc[:, 0],
            y=_embedding.iloc[:, 1],
            hue=_model_pred_label,
            hue_order=['TN', 'TP', 'FN', 'FP'],
            palette=[colors[0], colors[2], colors[1], colors[3]],
            ax=ax)
        ax.set_title(_title)

    # files_out['pred_pca_labeled'] = FOLDER / 'pred_pca_labeled.pdf'
    # njab.plotting.savefig(fig, files_out['pred_pca_labeled'])

    files_out['umap_sel_feat.pdf'] = FOLDER / 'umap_sel_feat.pdf'
    njab.plotting.savefig(ax.get_figure(), files_out['umap_sel_feat.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/umap_sel_feat.pdf

Interactive UMAP plot#

Not displayed in online documentation

if embedding is not None:
    embedding = embedding.join(X[results_model.selected_features])
    embedding_val = embedding_val.join(X_val[results_model.selected_features])
    embedding['label'], embedding_val['label'] = pred_train[
        'label'], predictions['label']
    embedding['group'], embedding_val['group'] = TRAIN_LABEL, TEST_LABEL
    combined_embeddings = pd.concat([embedding, embedding_val])
    combined_embeddings.index.name = 'ID'
if embedding is not None:
    cols = combined_embeddings.columns

    TEMPLATE = 'none'
    defaults = dict(width=800, height=400, template=TEMPLATE)

    fig = px.scatter(combined_embeddings.round(3).reset_index(),
                     x=cols[0],
                     y=cols[1],
                     color='label',
                     facet_col='group',
                     hover_data=['ID'] + results_model.selected_features,
                     **defaults)
    fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))

    fname = FOLDER / 'umap_sel_feat.html'
    files_out[fname.name] = fname
    fig.write_html(fname)
    print(fname)
    display(fig)
alzheimer/umap_sel_feat.html

PCA#

PCs_train, pca = njab_pca.run_pca(X_scaled[results_model.selected_features],
                                  n_components=None)
ax = njab_pca.plot_explained_variance(pca)
ax.locator_params(axis='x', integer=True)

fname = FOLDER / "feat_sel_PCA_var_explained_by_PCs.pdf"
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/feat_sel_PCA_var_explained_by_PCs.pdf

Applied to the test split:

PCs_val = pca.transform(X_val_scaled[results_model.selected_features])
PCs_val = pd.DataFrame(PCs_val,
                       index=X_val_scaled.index,
                       columns=PCs_train.columns)
PCs_val
principal component 1 (36.36 %) principal component 2 (25.19 %) principal component 3 (24.95 %) principal component 4 (13.50 %)
Sample ID
Sample_009 -0.065 -0.509 -1.214 0.950
Sample_010 -1.181 0.502 -1.419 0.297
Sample_011 -2.363 0.173 -0.701 0.049
Sample_018 -2.564 1.044 0.453 -0.327
Sample_038 -0.173 -0.937 0.055 0.153
... ... ... ... ...
Sample_191 2.711 1.479 2.641 1.431
Sample_194 0.370 -0.036 -0.266 -0.796
Sample_197 -0.095 0.222 0.214 0.031
Sample_202 1.135 1.658 0.188 -0.318
Sample_205 -1.932 1.956 -0.139 -0.808

42 rows × 4 columns

if M_sel > 1:
    fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharex=True, sharey=True)
    for _embedding, ax, _title, _model_pred_label in zip(
        [PCs_train, PCs_val], axes, [TRAIN_LABEL, TEST_LABEL],
        [pred_train['label'], predictions['label']]):  # noqa: E129
        ax = seaborn.scatterplot(
            x=_embedding.iloc[:, 0],
            y=_embedding.iloc[:, 1],
            hue=_model_pred_label,
            hue_order=['TN', 'TP', 'FN', 'FP'],
            palette=[colors[0], colors[2], colors[1], colors[3]],
            ax=ax)
        ax.set_title(_title)

    fname = FOLDER / 'pca_sel_feat.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/pca_sel_feat.pdf
if M_sel > 1:
    max_rows = min(3, len(results_model.selected_features))
    fig, axes = plt.subplots(max_rows,
                             2,
                             figsize=(6, 8),
                             sharex=False,
                             sharey=False,
                             layout='constrained')

    for axes_col, (_embedding, _title, _model_pred_label) in enumerate(
            zip([PCs_train, PCs_val], [TRAIN_LABEL, TEST_LABEL],
                [pred_train['label'], predictions['label']])):
        _row = 0
        axes[_row, axes_col].set_title(_title)
        for (i, j) in itertools.combinations(range(max_rows), 2):
            ax = seaborn.scatterplot(
                x=_embedding.iloc[:, i],
                y=_embedding.iloc[:, j],
                hue=_model_pred_label,
                hue_order=['TN', 'TP', 'FN', 'FP'],
                palette=[colors[0], colors[2], colors[1], colors[3]],
                ax=axes[_row, axes_col])
            _row += 1

    fname = FOLDER / f'pca_sel_feat_up_to_{max_rows}.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/pca_sel_feat_up_to_3.pdf

Features#

  • top 3 selected features on scaled data, plotted pairwise (scatter)

  • or a single unscaled feature (swarmplot)

if M_sel > 1:
    max_rows = min(3, len(results_model.selected_features))
    fig, axes = plt.subplots(max_rows,
                             2,
                             figsize=(6, 8),
                             sharex=False,
                             sharey=False,
                             layout='constrained')

    for axes_col, (_embedding, _title, _model_pred_label) in enumerate(
            zip([
                X_scaled[results_model.selected_features],
                X_val_scaled[results_model.selected_features]
            ], [TRAIN_LABEL, TEST_LABEL],
                [pred_train['label'], predictions['label']])):
        _row = 0
        axes[_row, axes_col].set_title(_title)
        for (i, j) in itertools.combinations(range(max_rows), 2):
            ax = seaborn.scatterplot(
                x=_embedding.iloc[:, i],
                y=_embedding.iloc[:, j],
                hue=_model_pred_label,
                hue_order=['TN', 'TP', 'FN', 'FP'],
                palette=[colors[0], colors[2], colors[1], colors[3]],
                ax=axes[_row, axes_col])
            _row += 1

    fname = FOLDER / f'sel_feat_up_to_{max_rows}.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)
else:
    fig, axes = plt.subplots(1, 1, figsize=(6, 2), layout='constrained')
    single_feature = results_model.selected_features[0]
    data = pd.concat([
        X[single_feature].to_frame().join(
            pred_train['label']).assign(group=TRAIN_LABEL),
        X_val[single_feature].to_frame().join(
            predictions['label']).assign(group=TEST_LABEL)
    ])
    ax = seaborn.swarmplot(data=data,
                           x='group',
                           y=single_feature,
                           hue='label',
                           ax=axes)
    fname = FOLDER / f'sel_feat_{single_feature}.pdf'
    files_out[fname.name] = fname
    njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/sel_feat_up_to_3.pdf

Save annotation of errors for manual analysis#

The annotated predictions are saved to the Excel file.

X[results_model.selected_features].join(pred_train).to_excel(
    writer, sheet_name='pred_train_annotated', float_format="%.3f")
X_val[results_model.selected_features].join(predictions).to_excel(
    writer, sheet_name='pred_test_annotated', float_format="%.3f")

Outputs#

writer.close()
files_out
{'log_reg': PosixPath('alzheimer/log_reg.xlsx'),
 'var_explained_by_PCs.pdf': PosixPath('alzheimer/var_explained_by_PCs.pdf'),
 'scatter_first_5PCs.pdf': PosixPath('alzheimer/scatter_first_5PCs.pdf'),
 'umap.pdf': PosixPath('alzheimer/umap.pdf'),
 'ROAUC': PosixPath('alzheimer/plot_roauc.pdf'),
 'PRAUC': PosixPath('alzheimer/plot_prauc.pdf'),
 'corr_plot_train.pdf': PosixPath('alzheimer/corr_plot_train.pdf'),
 'hist_score_train.pdf': PosixPath('alzheimer/hist_score_train.pdf'),
 'hist_score_train_target.pdf': PosixPath('alzheimer/hist_score_train_target.pdf'),
 'hist_score_test.pdf': PosixPath('alzheimer/hist_score_test.pdf'),
 'hist_score_test_target.pdf': PosixPath('alzheimer/hist_score_test_target.pdf'),
 'umap_sel_feat.pdf': PosixPath('alzheimer/umap_sel_feat.pdf'),
 'umap_sel_feat.html': PosixPath('alzheimer/umap_sel_feat.html'),
 'feat_sel_PCA_var_explained_by_PCs.pdf': PosixPath('alzheimer/feat_sel_PCA_var_explained_by_PCs.pdf'),
 'pca_sel_feat.pdf': PosixPath('alzheimer/pca_sel_feat.pdf'),
 'pca_sel_feat_up_to_3.pdf': PosixPath('alzheimer/pca_sel_feat_up_to_3.pdf'),
 'sel_feat_up_to_3.pdf': PosixPath('alzheimer/sel_feat_up_to_3.pdf')}