Logistic regression model#
Procedure:
Example: Alzheimers mass spectrometry-based proteomics dataset
Predict Alzheimer disease based on proteomics measurements.
# Setup colab installation
# You need to restart the runtime after running this cell
%pip install njab heatmapz openpyxl plotly
Show code cell output
Requirement already satisfied: njab in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (0.0.6)
Collecting heatmapz
Downloading heatmapz-0.0.4-py3-none-any.whl.metadata (4.7 kB)
Requirement already satisfied: openpyxl in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (3.1.2)
Collecting plotly
Downloading plotly-5.22.0-py3-none-any.whl.metadata (7.1 kB)
Requirement already satisfied: omegaconf in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (2.3.0)
Requirement already satisfied: lifelines in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.28.0)
Requirement already satisfied: numpy in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (1.26.4)
Requirement already satisfied: pandas in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (2.2.2)
Requirement already satisfied: scikit-learn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (1.4.2)
Requirement already satisfied: statsmodels in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.14.2)
Requirement already satisfied: umap-learn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.5.6)
Requirement already satisfied: matplotlib in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (3.8.4)
Requirement already satisfied: mrmr-selection in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.2.8)
Requirement already satisfied: pingouin in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.5.4)
Requirement already satisfied: seaborn in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from njab) (0.13.2)
Requirement already satisfied: et-xmlfile in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from openpyxl) (1.1.0)
Collecting tenacity>=6.2.0 (from plotly)
Downloading tenacity-8.3.0-py3-none-any.whl.metadata (1.2 kB)
Requirement already satisfied: packaging in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from plotly) (24.0)
Requirement already satisfied: contourpy>=1.0.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (1.2.1)
Requirement already satisfied: cycler>=0.10 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (4.51.0)
Requirement already satisfied: kiwisolver>=1.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (1.4.5)
Requirement already satisfied: pillow>=8 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (10.3.0)
Requirement already satisfied: pyparsing>=2.3.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (3.1.2)
Requirement already satisfied: python-dateutil>=2.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from matplotlib->njab) (2.9.0.post0)
Requirement already satisfied: scipy>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.13.0)
Requirement already satisfied: autograd>=1.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.6.2)
Requirement already satisfied: autograd-gamma>=0.3 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (0.5.0)
Requirement already satisfied: formulaic>=0.2.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from lifelines->njab) (1.0.1)
Requirement already satisfied: pytz>=2020.1 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas->njab) (2024.1)
Requirement already satisfied: category-encoders in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (2.6.3)
Requirement already satisfied: jinja2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (3.1.4)
Requirement already satisfied: tqdm in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (4.66.4)
Requirement already satisfied: joblib in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (1.4.2)
Requirement already satisfied: polars>=0.12.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from mrmr-selection->njab) (0.20.25)
Requirement already satisfied: antlr4-python3-runtime==4.9.* in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab) (4.9.3)
Requirement already satisfied: PyYAML>=5.1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from omegaconf->njab) (6.0.1)
Requirement already satisfied: pandas-flavor in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab) (0.6.0)
Requirement already satisfied: tabulate in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pingouin->njab) (0.9.0)
Requirement already satisfied: threadpoolctl>=2.0.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from scikit-learn->njab) (3.5.0)
Requirement already satisfied: patsy>=0.5.6 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from statsmodels->njab) (0.5.6)
Requirement already satisfied: numba>=0.51.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from umap-learn->njab) (0.59.1)
Requirement already satisfied: pynndescent>=0.5 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from umap-learn->njab) (0.5.12)
Requirement already satisfied: future>=0.15.2 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from autograd>=1.5->lifelines->njab) (1.0.0)
Requirement already satisfied: interface-meta>=1.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (1.3.0)
Requirement already satisfied: typing-extensions>=4.2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (4.11.0)
Requirement already satisfied: wrapt>=1.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from formulaic>=0.2.2->lifelines->njab) (1.16.0)
Requirement already satisfied: llvmlite<0.43,>=0.42.0dev0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from numba>=0.51.2->umap-learn->njab) (0.42.0)
Requirement already satisfied: six in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from patsy>=0.5.6->statsmodels->njab) (1.16.0)
Requirement already satisfied: MarkupSafe>=2.0 in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from jinja2->mrmr-selection->njab) (2.1.5)
Requirement already satisfied: xarray in /home/docs/checkouts/readthedocs.org/user_builds/njab/envs/latest/lib/python3.10/site-packages (from pandas-flavor->pingouin->njab) (2024.5.0)
Downloading heatmapz-0.0.4-py3-none-any.whl (5.8 kB)
Downloading plotly-5.22.0-py3-none-any.whl (16.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.4/16.4 MB 70.8 MB/s eta 0:00:00
?25hDownloading tenacity-8.3.0-py3-none-any.whl (25 kB)
Installing collected packages: tenacity, plotly, heatmapz
Successfully installed heatmapz-0.0.4 plotly-5.22.0 tenacity-8.3.0
Note: you may need to restart the kernel to use updated packages.
Show code cell source
import itertools
import logging
from pathlib import Path
from typing import Optional
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn
import sklearn
import sklearn.impute
import statsmodels.api as sm
import umap
from heatmap import corrplot
from IPython.display import display
from sklearn.metrics import log_loss, make_scorer
import njab.sklearn
from njab.plotting.metrics import plot_auc, plot_prc
from njab.sklearn import StandardScaler
from njab.sklearn import pca as njab_pca
from njab.sklearn.scoring import (ConfusionMatrix,
get_lr_multiplicative_decomposition,
get_pred, get_score,
get_target_count_per_bin)
from njab.sklearn.types import Splits
logger = logging.getLogger('njab')
logger.setLevel(logging.INFO)
njab.pandas.set_pandas_options()
pd.options.display.min_rows = 10
pd.options.display.max_columns = 20
njab.plotting.set_font_sizes('x-small')
seaborn.set_style("whitegrid")
njab.plotting.set_font_sizes(8)
Set parameters#
CLINIC: str = 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/clinic_ml.csv' # clincial data
fname_omics: str = 'https://raw.githubusercontent.com/RasmussenLab/njab/HEAD/docs/tutorial/data/alzheimer/proteome.csv' # omics data
TARGET: str = 'AD' # target column in CLINIC dataset (binary)
TARGET_LABEL: Optional[str] = None # optional: rename target variable
n_features_max: int = 5
freq_cutoff: float = 0.5 # Omics cutoff for sample completeness
VAL_IDS: str = '' #
VAL_IDS_query: str = ''
weights: bool = True
FOLDER = 'alzheimer'
model_name = 'all'
Setup#
Load data#
clinic = pd.read_csv(CLINIC, index_col=0).convert_dtypes()
cols_clinic = njab.pandas.get_colums_accessor(clinic)
omics = pd.read_csv(fname_omics, index_col=0)
Data shapes
omics.shape, clinic.shape
((210, 1542), (210, 6))
See how common omics features are and remove feature below choosen frequency cutoff
ax = omics.notna().sum().sort_values().plot(rot=45)
Show code cell source
M_before = omics.shape[1]
omics = omics.dropna(thresh=int(len(omics) * freq_cutoff), axis=1)
M_after = omics.shape[1]
msg = (
f"Removed {M_before-M_after} features with more than {freq_cutoff*100}% missing values."
f"\nRemaining features: {M_after} (of {M_before})")
print(msg)
# keep a map of all proteins in protein group, but only display first protein
# proteins are unique to protein groups
pg_map = {k: k.split(";")[0] for k in omics.columns}
omics = omics.rename(columns=pg_map)
# log2 transform raw intensity data:
omics = np.log2(omics + 1)
omics
Removed 248 features with more than 50.0% missing values.
Remaining features: 1294 (of 1542)
A0A024QZX5 | A0A024R0T9 | A0A024R3W6 | A0A024R644 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6I9 | A0A075B6J9 | ... | Q9Y653 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8 | Q9Y6Y9 | S4R3U6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | |||||||||||||||||||||
Sample_000 | 15.912 | 16.852 | 15.571 | 16.481 | 20.246 | 16.764 | 17.584 | 16.988 | 20.054 | NaN | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | 19.863 | NaN | 19.563 | 12.838 | 12.805 |
Sample_001 | 15.936 | 16.874 | 15.519 | 16.387 | 19.941 | 18.786 | 17.144 | NaN | 19.067 | 16.188 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.443 |
Sample_002 | 16.112 | 14.523 | 15.935 | 16.416 | 19.251 | 16.832 | 15.671 | 17.012 | 18.569 | NaN | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.637 | 12.505 |
Sample_003 | 16.107 | 17.032 | 15.802 | 16.979 | 19.628 | 17.852 | 18.877 | 14.182 | 18.985 | 13.438 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | 12.627 | 12.445 |
Sample_004 | 15.603 | 15.331 | 15.375 | 16.679 | 20.450 | 18.682 | 17.081 | 14.140 | 19.686 | 14.495 | ... | 14.757 | 15.094 | 14.048 | 15.256 | 17.075 | 19.582 | 15.328 | 19.867 | 13.145 | 12.235 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Sample_205 | 15.682 | 16.886 | 14.910 | 16.482 | 17.705 | 17.039 | NaN | 16.413 | 19.102 | 16.064 | ... | 15.236 | 15.684 | 14.236 | 15.415 | 17.551 | 17.922 | 16.340 | 19.928 | 12.930 | 11.803 |
Sample_206 | 15.798 | 17.554 | 15.600 | 15.938 | 18.155 | 18.152 | 16.503 | 16.860 | 18.538 | 15.288 | ... | 15.422 | 16.106 | NaN | 15.345 | 17.084 | 18.708 | 14.249 | 19.433 | NaN | NaN |
Sample_207 | 15.740 | 16.877 | 15.469 | 16.898 | 18.636 | 17.950 | 16.321 | 16.401 | 18.849 | 17.580 | ... | 15.808 | 16.098 | 14.403 | 15.715 | 16.586 | 18.725 | 16.138 | 19.599 | 13.637 | 11.174 |
Sample_208 | 15.477 | 16.779 | 14.995 | 16.132 | 14.908 | 17.530 | NaN | 16.119 | 18.368 | 15.202 | ... | 15.157 | 16.712 | NaN | 14.640 | 16.533 | 19.411 | 15.807 | 19.545 | 13.216 | NaN |
Sample_209 | 15.727 | 17.261 | 15.175 | 16.235 | 17.893 | 17.744 | 16.371 | 15.780 | 18.806 | 16.532 | ... | 15.237 | 15.652 | 15.211 | 14.205 | 16.749 | 19.275 | 15.732 | 19.577 | 11.043 | 11.792 |
210 rows × 1294 columns
Clinical data#
View clinical data
clinic
Kiel | Magdeburg | Sweden | male | age | AD | |
---|---|---|---|---|---|---|
Sample ID | ||||||
Sample_000 | 0 | 0 | 1 | 0 | 71 | 0 |
Sample_001 | 0 | 0 | 1 | 1 | 77 | 1 |
Sample_002 | 0 | 0 | 1 | 1 | 75 | 1 |
Sample_003 | 0 | 0 | 1 | 0 | 72 | 1 |
Sample_004 | 0 | 0 | 1 | 0 | 63 | 1 |
... | ... | ... | ... | ... | ... | ... |
Sample_205 | 0 | 0 | 0 | 0 | 69 | 1 |
Sample_206 | 0 | 0 | 0 | 1 | 73 | 0 |
Sample_207 | 0 | 0 | 0 | 0 | 71 | 0 |
Sample_208 | 0 | 0 | 0 | 1 | 83 | 0 |
Sample_209 | 0 | 0 | 0 | 0 | 63 | 0 |
210 rows × 6 columns
Target#
Tabulate target and check for missing values
njab.pandas.value_counts_with_margins(clinic[TARGET])
counts | prop | |
---|---|---|
AD | ||
0 | 122 | 0.581 |
1 | 88 | 0.419 |
Show code cell source
target_counts = clinic[TARGET].value_counts()
if target_counts.sum() < len(clinic):
print("Target has missing values."
f" Can only use {target_counts.sum()} of {len(clinic)} samples.")
mask = clinic[TARGET].notna()
clinic, omics = clinic.loc[mask], omics.loc[mask]
if TARGET_LABEL is None:
TARGET_LABEL = TARGET
y = clinic[TARGET].rename(TARGET_LABEL).astype(int)
clinic_for_ml = clinic.drop(TARGET, axis=1)
Test IDs#
Select some test samples:
Show code cell source
olink_val, clinic_val = None, None
if not VAL_IDS:
if VAL_IDS_query:
logging.warning(f"Querying index using: {VAL_IDS_query}")
VAL_IDS = clinic.filter(like=VAL_IDS_query, axis=0).index.to_list()
logging.warning(f"Found {len(VAL_IDS)} Test-IDs")
else:
logging.warning("Create train and test split.")
_, VAL_IDS = sklearn.model_selection.train_test_split(
clinic.index,
test_size=0.2,
random_state=123,
stratify=clinic[TARGET])
VAL_IDS = list(VAL_IDS)
elif isinstance(VAL_IDS, str):
VAL_IDS = VAL_IDS.split(",")
else:
raise ValueError("Provide IDs in csv format as str: 'ID1,ID2'")
VAL_IDS
WARNING:root:Create train and test split.
['Sample_127',
'Sample_164',
'Sample_175',
'Sample_048',
'Sample_159',
'Sample_141',
'Sample_174',
'Sample_145',
'Sample_090',
'Sample_191',
'Sample_038',
'Sample_009',
'Sample_112',
'Sample_096',
'Sample_146',
'Sample_135',
'Sample_142',
'Sample_205',
'Sample_186',
'Sample_095',
'Sample_085',
'Sample_011',
'Sample_156',
'Sample_153',
'Sample_124',
'Sample_194',
'Sample_061',
'Sample_079',
'Sample_149',
'Sample_179',
'Sample_197',
'Sample_125',
'Sample_133',
'Sample_099',
'Sample_067',
'Sample_202',
'Sample_010',
'Sample_171',
'Sample_018',
'Sample_060',
'Sample_185',
'Sample_173']
Combine clinical and olink data#
# in case you need to subselect
feat_to_consider = clinic_for_ml.columns.to_list()
feat_to_consider += omics.columns.to_list()
feat_to_consider
Show code cell output
['Kiel',
'Magdeburg',
'Sweden',
'male',
'age',
'A0A024QZX5',
'A0A024R0T9',
'A0A024R3W6',
'A0A024R644',
'A0A075B6H9',
'A0A075B6I0',
'A0A075B6I1',
'A0A075B6I6',
'A0A075B6I9',
'A0A075B6J9',
'A0A075B6K4',
'A0A075B6K5',
'A0A075B6P5',
'A0A075B6Q5',
'A0A075B6R2',
'A0A075B6R9',
'A0A075B6S2',
'A0A075B6S5',
'A0A075B6S9',
'A0A075B7D0',
'A0A087WSV8',
'A0A087WSY5',
'A0A087WSY6',
'A0A087WSZ0',
'A0A087WTA1',
'A0A087WTA8',
'A0A087WTK0',
'A0A087WTP3',
'A0A087WTT8',
'A0A087WTY6',
'A0A087WU43',
'A0A087WUM0',
'A0A087WUT8',
'A0A087WV17',
'A0A087WVV2',
'A0A087WW87',
'A0A087WWA5',
'A0A087WWF1',
'A0A087WWT2',
'A0A087WX80',
'A0A087WXE9',
'A0A087WXM8',
'A0A087WXW9',
'A0A087WYK9',
'A0A087WYL5',
'A0A087WZ82',
'A0A087WZM2',
'A0A087WZR4',
'A0A087X089',
'A0A087X0D5',
'A0A087X0K0',
'A0A087X0M8',
'A0A087X0S5',
'A0A087X106',
'A0A087X117',
'A0A087X136',
'A0A087X152',
'A0A087X1G7',
'A0A087X1J7',
'A0A087X1T7',
'A0A087X1V2',
'A0A087X240',
'A0A087X253',
'A0A0A0MQU6',
'A0A0A0MQY7',
'A0A0A0MR25',
'A0A0A0MRJ6',
'A0A0A0MRJ7',
'A0A0A0MRZ8',
'A0A0A0MS09',
'A0A0A0MS15',
'A0A0A0MS20',
'A0A0A0MS52',
'A0A0A0MSC4',
'A0A0A0MSQ0',
'A0A0A0MSX3',
'A0A0A0MT26',
'A0A0A0MT32',
'A0A0A0MT36',
'A0A0A0MT66',
'A0A0A0MT69',
'A0A0A0MT71',
'A0A0A0MTC8',
'A0A0A0MTH0',
'A0A0A0MTI5',
'A0A0A0MTP9',
'A0A0A0MTQ4',
'A0A0A0MTS2',
'A0A0A6YY99',
'A0A0A6YYA0',
'A0A0A6YYG9',
'A0A0A6YYJ2',
'A0A0B4J1V0',
'A0A0B4J1V2',
'A0A0B4J1V6',
'A0A0B4J1Y8',
'A0A0B4J1Y9',
'A0A0B4J1Z1',
'A0A0B4J231',
'A0A0B4J259',
'A0A0B4J2B5',
'A0A0B4J2C3',
'A0A0B4J2D9',
'A0A0C4DFP6',
'A0A0C4DFZ2',
'A0A0C4DG90',
'A0A0C4DGV8',
'A0A0C4DGY8',
'A0A0C4DH07',
'A0A0C4DH24',
'A0A0C4DH25',
'A0A0C4DH33',
'A0A0C4DH34',
'A0A0C4DH35',
'A0A0C4DH36',
'A0A0C4DH38',
'A0A0C4DH67',
'A0A0C4DH73',
'A0A0D9SEM5',
'A0A0D9SEP4',
'A0A0D9SF30',
'A0A0D9SFP6',
'A0A0G2JHA9',
'A0A0G2JHN4',
'A0A0G2JIW1',
'A0A0G2JLB3',
'A0A0G2JLQ0',
'A0A0G2JLV7',
'A0A0G2JMC9',
'A0A0G2JPR0',
'A0A0G2JQ91',
'A0A0G2JRN3',
'A0A0G2JRQ6',
'A0A0G2JSC0',
'A0A0J9YW40',
'A0A0J9YX35',
'A0A0J9YXX1',
'A0A0J9YY99',
'A0A0U1RQC5',
'A0A0U1RQV3',
'A0A0U1RR20',
'A0A0U1RR32',
'A0A140T8Z9',
'A0A140T913',
'A0A140T962',
'A0A140T971',
'A0A140TA33',
'A0A182DWH7',
'A0A182DWI4',
'A0A1B0GTM3',
'A0A1B0GUU9',
'A0A1B0GV23',
'A0A1B0GVB9',
'A0A1B0GXF2',
'A0A1W2PQ11',
'A0A1W2PQB1',
'A0A1W2PR05',
'A0A1W2PRS4',
'A0A1X7SBT7',
'A0A286YEY1',
'A0A286YEY5',
'A0A286YFE3',
'A0A286YFJ8',
'A0AVL1',
'A0M8Q6',
'A1L4H1',
'A2A2D0',
'A2A2V1',
'A2A3C1',
'A2NJV5',
'A5D6W6',
'A6NC48',
'A6NCE7',
'A6NCT7',
'A6NFX8',
'A6NG10',
'A6NGN9',
'A6NL88',
'A6NLU5',
'A6NNI4',
'A6XMH3',
'A6XND0',
'A8MU39',
'A8MZH3',
'A9UJN9',
'B0QY80',
'B0QYF8',
'B0YIW2',
'B1AHL2',
'B1AJR6',
'B1AKZ9',
'B1AMW1',
'B1AN15',
'B1AN99',
'B1AP13',
'B3KTY4',
'B4DGQ7',
'B4DHN5',
'B4DKD2',
'B4DPQ0',
'B4DYV8',
'B4E1Z4',
'B4E3Q4',
'B5MBX2',
'B5MCX6',
'B7Z2R4',
'B7Z4G8',
'B7ZKJ8',
'B7ZKR5',
'B7ZLJ8',
'B7ZM79',
'B8ZZ19',
'B8ZZE5',
'C9IY66',
'C9J0J0',
'C9J1I0',
'C9J2H1',
'C9J4G9',
'C9J6G4',
'C9J712',
'C9J8T4',
'C9JE82',
'C9JF17',
'C9JFR7',
'C9JHR8',
'C9JIZ6',
'C9JKT8',
'C9JP35',
'C9JPD0',
'C9JYY6',
'D3DWX8',
'D3YTI2',
'D6R934',
'D6R938',
'D6R956',
'D6R960',
'D6RAR4',
'D6RCE0',
'D6RD17',
'D6REA1',
'D6RER2',
'D6RF86',
'D6RGG3',
'D6RGW2',
'D6RH31',
'D6RJG0',
'E5RGY1',
'E5RHU3',
'E5RIP7',
'E5RJR5',
'E7EMB3',
'E7EMS2',
'E7EN28',
'E7EN65',
'E7END6',
'E7EPS8',
'E7EPV7',
'E7EQ64',
'E7EQB2',
'E7EQM8',
'E7ESX1',
'E7ETH0',
'E7EU04',
'E7EUF1',
'E7EWD3',
'E7EX60',
'E9PC84',
'E9PCD7',
'E9PEK4',
'E9PEX6',
'E9PF59',
'E9PFD7',
'E9PG71',
'E9PGA6',
'E9PGC5',
'E9PHN6',
'E9PHY0',
'E9PIM6',
'E9PJP6',
'E9PK25',
'E9PKE3',
'E9PKP3',
'E9PL83',
'E9PLK3',
'E9PMI0',
'E9PNW4',
'F5GY55',
'F5GY80',
'F5GY87',
'F5GZK2',
'F5GZN3',
'F5GZS6',
'F5H1U3',
'F5H265',
'F5H2B5',
'F5H3H7',
'F5H423',
'F5H5D3',
'F6S8M0',
'F6SYF8',
'F6SYP7',
'F6VDH7',
'F8VVB6',
'F8W1B7',
'F8W703',
'F8W785',
'F8W876',
'F8W8W7',
'F8W9L4',
'F8WAD8',
'F8WD41',
'F8WDW9',
'F8WE04',
'G3V0E5',
'G3V150',
'G3V164',
'G3V1Q7',
'G3V2U7',
'G3V2W1',
'G3V3A0',
'G3V3X5',
'G3V4U0',
'G3V533',
'G3XAI2',
'G3XAK1',
'G3XAP6',
'G5E968',
'G5E9G7',
'H0Y4H1',
'H0Y512',
'H0Y5E4',
'H0YAC1',
'H0YDE5',
'H0YDJ3',
'H0YEX9',
'H0YI30',
'H0YLF3',
'H0YMB1',
'H0YNK6',
'H3BLU2',
'H3BM42',
'H3BMA1',
'H3BP20',
'H3BPK2',
'H3BPK3',
'H3BRV9',
'H3BUV8',
'H7BY64',
'H7BYX6',
'H7BZJ3',
'H7BZT7',
'H7C0V4',
'H7C2F2',
'H9KV31',
'I3L0A0',
'I3L0A1',
'I3L0N3',
'I3L145',
'I3L397',
'I3L3E6',
'I3L3J8',
'I3L3R5',
'J3KNA1',
'J3KNC5',
'J3KNE3',
'J3KNF6',
'J3KNP4',
'J3KNV4',
'J3KQ18',
'J3KQG3',
'J3KRI5',
'J3KSN0',
'J3QQR8',
'J3QS03',
'K4DIA0',
'K7EKE8',
'K7EKL3',
'K7ELW0',
'K7ENA4',
'K7ENE5',
'K7ER15',
'K7ERG9',
'K7ERI9',
'K7ES70',
'M0QXF7',
'M0QZI8',
'M0R009',
'M0R1Q1',
'O00187',
'O00241',
'O00264',
'O00339',
'O00391',
'O00451',
'O00462',
'O00468-6',
'O00533',
'O00754',
'O14498',
'O14594',
'O14672',
'O14773',
'O14791',
'O14793',
'O14917',
'O15031',
'O15041',
'O15197',
'O15204',
'O15230',
'O15240',
'O15335',
'O15354',
'O15466',
'O43157',
'O43173',
'O43291',
'O43300',
'O43405',
'O43505',
'O43529',
'O43827',
'O43852',
'O43854',
'O43916',
'O60242',
'O60243',
'O60245',
'O60279',
'O60476',
'O60568',
'O60814',
'O60883',
'O60888',
'O75051',
'O75063',
'O75071',
'O75083',
'O75084',
'O75173',
'O75326',
'O75368',
'O75460',
'O75493',
'O75509',
'O75636',
'O75711',
'O75752',
'O75781',
'O75874',
'O75882',
'O76061',
'O76070',
'O94760',
'O94769',
'O94772',
'O94856',
'O94910',
'O94919',
'O94933',
'O94985',
'O94985-2',
'O94991',
'O95158',
'O95206',
'O95428',
'O95445',
'O95450',
'O95467',
'O95479',
'O95967',
'O95998',
'P00338',
'P00390',
'P00441',
'P00450',
'P00488',
'P00491',
'P00492',
'P00505',
'P00533',
'P00558',
'P00568',
'P00734',
'P00738',
'P00739',
'P00740',
'P00742',
'P00747',
'P00748',
'P01008',
'P01009',
'P01011',
'P01019',
'P01023',
'P01024',
'P01031',
'P01033',
'P01034',
'P01036',
'P01042',
'P01042-2',
'P01210',
'P01213',
'P01258',
'P01282',
'P01303',
'P01344',
'P01593',
'P01597',
'P01599',
'P01602',
'P01619',
'P01624',
'P01700',
'P01701',
'P01704',
'P01714',
'P01717',
'P01721',
'P01742',
'P01743',
'P01768',
'P01780',
'P01814',
'P01817',
'P01833',
'P01834',
'P01857',
'P01859',
'P01860',
'P02100',
'P02452',
'P02458',
'P02461',
'P02533',
'P02538',
'P02647',
'P02649',
'P02652',
'P02671',
'P02675',
'P02679',
'P02741',
'P02743',
'P02745',
'P02747',
'P02748',
'P02749',
'P02750',
'P02751',
'P02753',
'P02760',
'P02763',
'P02765',
'P02766',
'P02768',
'P02774',
'P02787',
'P02790',
'P02792',
'P02794',
'P03950',
'P04003',
'P04004',
'P04066',
'P04075',
'P04080',
'P04114',
'P04179',
'P04180',
'P04196',
'P04217-2',
'P04264',
'P04271',
'P04275',
'P04406',
'P05023',
'P05026',
'P05060',
'P05067',
'P05109',
'P05121',
'P05154',
'P05155',
'P05160',
'P05362',
'P05408-2',
'P05413',
'P05452',
'P05534',
'P05543',
'P05546',
'P05937',
'P05997',
'P06276',
'P06307',
'P06310',
'P06312',
'P06331',
'P06396',
'P06396-2',
'P06681',
'P06702',
'P06727',
'P06732',
'P06733',
'P07093',
'P07093-2',
'P07195',
'P07225',
'P07237',
'P07357',
'P07360',
'P07585',
'P07686',
'P07711',
'P07737',
'P07858',
'P07900',
'P07996',
'P07998',
'P08185',
'P08253',
'P08294',
'P08493',
'P08571',
'P08572',
'P08603',
'P08670',
'P08697',
'P08758',
'P09104',
'P09172',
'P09211',
'P09382',
'P09417',
'P09486',
'P09603',
'P09619',
'P09871',
'P09960',
'P09972',
'P0C0L4',
'P0C0L5',
'P0C6S8',
'P0DJI8',
'P0DMQ5',
'P0DOY3',
'P10124',
'P10153',
'P10253',
'P10451',
'P10451-3',
'P10451-5',
'P10586',
'P10599',
'P10619',
'P10636-2',
'P10643',
'P10644',
'P10645',
'P10721',
'P10909',
'P10909-3',
'P11021',
'P11047',
'P11279',
'P11362',
'P11362-21',
'P11597',
'P11717',
'P12107',
'P12110',
'P12111',
'P12277',
'P12318',
'P12955',
'P13473',
'P13489',
'P13521',
'P13591',
'P13611',
'P13645',
'P13647',
'P13667',
'P13671',
'P13796',
'P14138',
'P14151',
'P14174',
'P14209',
'P14314-2',
'P14543',
'P14618',
'P14618-2',
'P14621',
'P14625',
'P15086',
'P15144',
'P15151-2',
'P15169',
'P15291',
'P15509',
'P15531',
'P16035',
'P16083',
'P16152',
'P16519',
'P16870',
'P16930',
'P17050',
'P17174',
'P17302',
'P17405',
'P17677',
'P17900',
'P17931',
'P18065',
'P18428',
'P18669',
'P19021',
'P19022',
'P19320',
'P19367',
'P19652',
'P19823',
'P19827',
'P19961',
'P20366',
'P20382',
'P20618',
'P20742',
'P20774',
'P20827',
'P20916',
'P20933',
'P21333',
'P21579',
'P21810',
'P22303',
'P22304',
'P22676',
'P22692',
'P22748',
'P22792',
'P22891',
'P22897',
'P23083',
'P23142',
'P23284',
'P23435',
'P23468',
'P23468-2',
'P23470',
'P23471',
'P23515',
'P23560',
'P23582',
'P24592',
'P24593',
'P25189',
'P25311',
'P25774',
'P26038',
'P26447',
'P26572',
'P26992',
'P27169',
'P27797',
'P27824',
'P29120',
'P29144',
'P29401',
'P29622',
'P29966',
'P30041',
'P30043',
'P30086',
'P30101',
'P30530',
'P30740',
'P31146',
'P31150',
'P31321',
'P31937',
'P31944',
'P31946',
'P32004',
'P32119',
'P33151',
'P33908',
'P34059',
'P34096',
'P35052',
'P35241',
'P35442',
'P35443',
'P35527',
'P35542',
'P35555',
'P35590',
'P35749',
'P35754',
'P35858',
'P35908',
'P36222',
'P36871',
'P36955',
'P37173',
'P37802',
'P37837',
'P39060',
'P40121',
'P40189',
'P40925',
'P40926',
'P41217',
'P41222',
'P42785',
'P43121',
'P43234',
'P43251',
'P43652',
'P45877',
'P47972',
'P48163',
'P48539',
'P48637',
'P48723',
'P48740',
'P48740-2',
'P48745',
'P49006',
'P49257',
'P49641',
'P50395',
'P51693-2',
'P51884',
'P51888',
'P52758',
'P52797-2',
'P52799',
'P52848',
'P53634',
'P54289',
'P54289-2',
'P54802',
'P54803',
'P55058',
'P55087',
'P55268',
'P55283',
'P55287',
'P55290',
'P55291',
'P55774',
'P56159',
'P56817',
'P58401',
'P60174',
'P60709',
'P61088',
'P61278',
'P61769',
'P61981',
'P62258',
'P62736',
'P62805',
'P62937',
'P62942',
'P62993',
'P63104',
'P63167',
'P68036',
'P68104',
'P68371',
'P68871',
'P69905',
'P78310',
'P78324',
'P78417',
'P78509',
'P78539',
'P80108',
'P80188',
'P80723',
'P80748',
'P84157-2',
'P98095',
'P98160',
'P98172',
'Q00973',
'Q01459',
'Q01469',
'Q01995',
'Q02487',
'Q02809',
'Q02818',
'Q03167',
'Q03591',
'Q04721',
'Q04760',
'Q05707',
'Q05996',
'Q06033',
'Q06481',
'Q06828',
'Q06830',
'Q07507',
'Q07954',
'Q08174-2',
'Q08380',
'Q08554',
'Q08629',
'Q08ET2',
'Q09327',
'Q09328',
'Q0P6D2',
'Q10469',
'Q10471',
'Q10472',
'Q12794',
'Q12797',
'Q12841',
'Q12860',
'Q12866',
'Q12907',
'Q13217',
'Q13228',
'Q13231',
'Q13332',
'Q13332-5',
'Q13433',
'Q13444',
'Q13508',
'Q13554',
'Q13641',
'Q13740',
'Q13790',
'Q13867',
'Q14019',
'Q14112',
'Q14118',
'Q14126',
'Q14152',
'Q14165',
'Q14257',
'Q14314',
'Q14315',
'Q14393',
'Q14515',
'Q14520',
'Q14571',
'Q14574',
'Q14624',
'Q14696',
'Q14697',
'Q14766',
'Q14894',
'Q14956',
'Q14982',
'Q14C87',
'Q14DG7',
'Q15084',
'Q15113',
'Q15121',
'Q15223',
'Q15256',
'Q15293',
'Q15375',
'Q15459',
'Q15465',
'Q15582',
'Q15768',
'Q15782',
'Q15818',
'Q15828',
'Q15848',
'Q15904',
'Q16270',
'Q16394',
'Q16568',
'Q16581',
'Q16610',
'Q16620',
'Q16658',
'Q16674',
'Q16706',
'Q16769',
'Q16799',
'Q16849',
'Q16853',
'Q24JP5',
'Q2TAL6',
...]
View data for training
Show code cell source
X = clinic_for_ml.join(omics)[feat_to_consider]
X
Kiel | Magdeburg | Sweden | male | age | A0A024QZX5 | A0A024R0T9 | A0A024R3W6 | A0A024R644 | A0A075B6H9 | ... | Q9Y653 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7 | Q9Y6R7 | Q9Y6X5 | Q9Y6Y8 | Q9Y6Y9 | S4R3U6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | |||||||||||||||||||||
Sample_000 | 0 | 0 | 1 | 0 | 71 | 15.912 | 16.852 | 15.571 | 16.481 | 20.246 | ... | 16.012 | 15.178 | NaN | 15.050 | 16.842 | 19.863 | NaN | 19.563 | 12.838 | 12.805 |
Sample_001 | 0 | 0 | 1 | 1 | 77 | 15.936 | 16.874 | 15.519 | 16.387 | 19.941 | ... | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 20.299 | 15.556 | 19.386 | 13.970 | 12.443 |
Sample_002 | 0 | 0 | 1 | 1 | 75 | 16.112 | 14.523 | 15.935 | 16.416 | 19.251 | ... | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 19.598 | 15.735 | 20.447 | 12.637 | 12.505 |
Sample_003 | 0 | 0 | 1 | 0 | 72 | 16.107 | 17.032 | 15.802 | 16.979 | 19.628 | ... | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | 19.429 | NaN | 20.216 | 12.627 | 12.445 |
Sample_004 | 0 | 0 | 1 | 0 | 63 | 15.603 | 15.331 | 15.375 | 16.679 | 20.450 | ... | 14.757 | 15.094 | 14.048 | 15.256 | 17.075 | 19.582 | 15.328 | 19.867 | 13.145 | 12.235 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Sample_205 | 0 | 0 | 0 | 0 | 69 | 15.682 | 16.886 | 14.910 | 16.482 | 17.705 | ... | 15.236 | 15.684 | 14.236 | 15.415 | 17.551 | 17.922 | 16.340 | 19.928 | 12.930 | 11.803 |
Sample_206 | 0 | 0 | 0 | 1 | 73 | 15.798 | 17.554 | 15.600 | 15.938 | 18.155 | ... | 15.422 | 16.106 | NaN | 15.345 | 17.084 | 18.708 | 14.249 | 19.433 | NaN | NaN |
Sample_207 | 0 | 0 | 0 | 0 | 71 | 15.740 | 16.877 | 15.469 | 16.898 | 18.636 | ... | 15.808 | 16.098 | 14.403 | 15.715 | 16.586 | 18.725 | 16.138 | 19.599 | 13.637 | 11.174 |
Sample_208 | 0 | 0 | 0 | 1 | 83 | 15.477 | 16.779 | 14.995 | 16.132 | 14.908 | ... | 15.157 | 16.712 | NaN | 14.640 | 16.533 | 19.411 | 15.807 | 19.545 | 13.216 | NaN |
Sample_209 | 0 | 0 | 0 | 0 | 63 | 15.727 | 17.261 | 15.175 | 16.235 | 17.893 | ... | 15.237 | 15.652 | 15.211 | 14.205 | 16.749 | 19.275 | 15.732 | 19.577 | 11.043 | 11.792 |
210 rows × 1299 columns
Data Splits#
Separate train and test split
Show code cell source
TRAIN_LABEL = 'train'
TEST_LABEL = 'test'
if VAL_IDS:
diff = pd.Index(VAL_IDS)
VAL_IDS = X.index.intersection(VAL_IDS)
if len(diff) < len(VAL_IDS):
logging.warning("Some requested validation IDs are not in the data: "
",".join(str(x) for x in diff.difference(VAL_IDS)))
X_val = X.loc[VAL_IDS]
X = X.drop(VAL_IDS)
use_val_split = True
y_val = y.loc[VAL_IDS]
y = y.drop(VAL_IDS)
Output folder#
Show code cell source
FOLDER = Path(FOLDER)
FOLDER.mkdir(exist_ok=True, parents=True)
print(f"Output folder: {FOLDER}")
Output folder: alzheimer
Outputs#
Save outputs to excel file:
Show code cell source
# out
files_out = {}
fname = FOLDER / 'log_reg.xlsx'
files_out[fname.stem] = fname
writer = pd.ExcelWriter(fname)
print(f"Excel-file for tables: {fname}")
Excel-file for tables: alzheimer/log_reg.xlsx
Collect test predictions#
predictions = y_val.to_frame('true')
Fill missing values with training median#
Show code cell source
feat_w_missings = X.isna().sum()
feat_w_missings = feat_w_missings.loc[feat_w_missings > 0]
feat_w_missings
age 10
A0A024QZX5 11
A0A024R0T9 2
A0A024R3W6 23
A0A024R644 1
..
Q9Y6N6 6
Q9Y6N7 10
Q9Y6X5 22
Q9Y6Y9 70
S4R3U6 65
Length: 894, dtype: int64
Show code cell source
row_w_missing = X.isna().sum(axis=1).astype(bool)
col_w_missing = X.isna().sum(axis=0).astype(bool)
X.loc[row_w_missing, col_w_missing]
age | A0A024QZX5 | A0A024R0T9 | A0A024R3W6 | A0A024R644 | A0A075B6H9 | A0A075B6I0 | A0A075B6I1 | A0A075B6I6 | A0A075B6J9 | ... | Q9Y5I4 | Q9Y617 | Q9Y653 | Q9Y696 | Q9Y6C2 | Q9Y6N6 | Q9Y6N7 | Q9Y6X5 | Q9Y6Y9 | S4R3U6 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample ID | |||||||||||||||||||||
Sample_000 | 71 | 15.912 | 16.852 | 15.571 | 16.481 | 20.246 | 16.764 | 17.584 | 16.988 | NaN | ... | 17.187 | 16.859 | 16.012 | 15.178 | NaN | 15.050 | 16.842 | NaN | 12.838 | 12.805 |
Sample_001 | 77 | 15.936 | 16.874 | 15.519 | 16.387 | 19.941 | 18.786 | 17.144 | NaN | 16.188 | ... | 17.447 | 16.799 | 15.528 | 15.576 | NaN | 14.833 | 16.597 | 15.556 | 13.970 | 12.443 |
Sample_002 | 75 | 16.112 | 14.523 | 15.935 | 16.416 | 19.251 | 16.832 | 15.671 | 17.012 | NaN | ... | 17.410 | 16.288 | 15.229 | 14.728 | 13.757 | 15.118 | 17.440 | 15.735 | 12.637 | 12.505 |
Sample_003 | 72 | 16.107 | 17.032 | 15.802 | 16.979 | 19.628 | 17.852 | 18.877 | 14.182 | 13.438 | ... | 17.545 | 17.075 | 15.495 | 14.590 | 14.682 | 15.140 | 17.356 | NaN | 12.627 | 12.445 |
Sample_004 | 63 | 15.603 | 15.331 | 15.375 | 16.679 | 20.450 | 18.682 | 17.081 | 14.140 | 14.495 | ... | 17.297 | 16.736 | 14.757 | 15.094 | 14.048 | 15.256 | 17.075 | 15.328 | 13.145 | 12.235 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Sample_204 | <NA> | NaN | 17.279 | 15.287 | 16.513 | 20.183 | 19.674 | 20.251 | 16.334 | 19.778 | ... | 15.874 | 15.465 | 15.668 | 15.915 | 14.204 | 15.025 | NaN | 15.012 | 12.288 | 10.564 |
Sample_206 | 73 | 15.798 | 17.554 | 15.600 | 15.938 | 18.155 | 18.152 | 16.503 | 16.860 | 15.288 | ... | 17.109 | 15.035 | 15.422 | 16.106 | NaN | 15.345 | 17.084 | 14.249 | NaN | NaN |
Sample_207 | 71 | 15.740 | 16.877 | 15.469 | 16.898 | 18.636 | 17.950 | 16.321 | 16.401 | 17.580 | ... | 16.938 | 16.283 | 15.808 | 16.098 | 14.403 | 15.715 | 16.586 | 16.138 | 13.637 | 11.174 |
Sample_208 | 83 | 15.477 | 16.779 | 14.995 | 16.132 | 14.908 | 17.530 | NaN | 16.119 | 15.202 | ... | 17.155 | 15.920 | 15.157 | 16.712 | NaN | 14.640 | 16.533 | 15.807 | 13.216 | NaN |
Sample_209 | 63 | 15.727 | 17.261 | 15.175 | 16.235 | 17.893 | 17.744 | 16.371 | 15.780 | 16.532 | ... | 16.776 | 15.713 | 15.237 | 15.652 | 15.211 | 14.205 | 16.749 | 15.732 | 11.043 | 11.792 |
168 rows × 894 columns
Impute using median of training data
median_imputer = sklearn.impute.SimpleImputer(strategy='median')
X = njab.sklearn.transform_DataFrame(X, median_imputer.fit_transform)
X_val = njab.sklearn.transform_DataFrame(X_val, median_imputer.transform)
assert X.isna().sum().sum() == 0
X.shape, X_val.shape
((168, 1299), (42, 1299))
Principal Components#
on standard normalized training data:
Show code cell source
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
PCs, pca = njab_pca.run_pca(X_scaled, n_components=None)
files_out["var_explained_by_PCs.pdf"] = FOLDER / "var_explained_by_PCs.pdf"
ax = njab_pca.plot_explained_variance(pca)
ax.locator_params(axis='x', integer=True)
njab.plotting.savefig(ax.get_figure(), files_out["var_explained_by_PCs.pdf"])
X_scaled.shape
INFO:njab.plotting:Saved Figures to alzheimer/var_explained_by_PCs.pdf
(168, 1299)
Plot first 5 PCs with binary target label annotating each sample::
Show code cell source
files_out['scatter_first_5PCs.pdf'] = FOLDER / 'scatter_first_5PCs.pdf'
fig, axes = plt.subplots(5, 2, figsize=(6, 8), layout='constrained')
PCs.columns = [s.replace("principal component", "PC") for s in PCs.columns]
PCs = PCs.join(y.astype('category'))
up_to = min(PCs.shape[-1], 5)
# https://github.com/matplotlib/matplotlib/issues/25538
# colab: old pandas version and too new matplotlib version (2023-11-6)
for (i, j), ax in zip(itertools.combinations(range(up_to), 2), axes.flatten()):
PCs.plot.scatter(i, j, c=TARGET_LABEL, cmap='Paired', ax=ax)
_ = PCs.pop(TARGET_LABEL)
njab.plotting.savefig(fig, files_out['scatter_first_5PCs.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/scatter_first_5PCs.pdf
UMAP#
of training data:
Show code cell source
reducer = umap.UMAP()
embedding = reducer.fit_transform(X_scaled)
files_out['umap.pdf'] = FOLDER / 'umap.pdf'
embedding = pd.DataFrame(embedding,
index=X_scaled.index,
columns=['UMAP 1',
'UMAP 2']).join(y.astype('category'))
ax = embedding.plot.scatter('UMAP 1', 'UMAP 2', c=TARGET_LABEL, cmap='Paired')
njab.plotting.savefig(ax.get_figure(), files_out['umap.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/umap.pdf
Baseline Model - Logistic Regression#
Based on parameters, use weighting:
if weights:
weights = 'balanced'
cutoff = 0.5
else:
cutoff = None
weights = None
Logistic Regression#
Procedure:
Select best set of features from entire feature set selected using CV on train split
Retrain best model configuration using entire train split and evalute on test split
Define splits and models:
splits = Splits(X_train=X_scaled,
X_test=scaler.transform(X_val),
y_train=y,
y_test=y_val)
model = sklearn.linear_model.LogisticRegression(penalty='l2',
class_weight=weights)
Show code cell source
scoring = [
'precision', 'recall', 'f1', 'balanced_accuracy', 'roc_auc',
'average_precision'
]
scoring = {k: k for k in scoring}
# do not average log loss for AIC and BIC calculations
scoring['log_loss'] = make_scorer(log_loss,
greater_is_better=True,
normalize=False)
cv_feat = njab.sklearn.find_n_best_features(
X=splits.X_train,
y=splits.y_train,
model=model,
name=TARGET_LABEL,
groups=splits.y_train,
n_features_max=n_features_max,
scoring=scoring,
return_train_score=True,
# fit_params=dict(sample_weight=weights)
)
cv_feat = cv_feat.drop('test_case',
axis=1).groupby('n_features').agg(['mean', 'std'])
cv_feat
100%|██████████| 1/1 [00:00<00:00, 139.28it/s]
100%|██████████| 2/2 [00:00<00:00, 6.83it/s]
100%|██████████| 3/3 [00:00<00:00, 5.28it/s]
100%|██████████| 4/4 [00:00<00:00, 4.69it/s]
100%|██████████| 5/5 [00:01<00:00, 4.42it/s]
fit_time | score_time | test_precision | train_precision | test_recall | ... | test_average_precision | train_average_precision | test_log_loss | train_log_loss | n_observations | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | mean | std | ... | mean | std | mean | std | mean | std | mean | std | mean | std | |
n_features | |||||||||||||||||||||
1 | 0.003 | 0.001 | 0.014 | 0.001 | 0.715 | 0.098 | 0.712 | 0.023 | 0.704 | 0.108 | ... | 0.751 | 0.091 | 0.720 | 0.028 | 294.837 | 86.679 | 1,168.535 | 82.092 | 168.000 | 0.000 |
2 | 0.003 | 0.000 | 0.014 | 0.000 | 0.687 | 0.066 | 0.687 | 0.021 | 0.713 | 0.099 | ... | 0.765 | 0.078 | 0.755 | 0.023 | 311.417 | 60.820 | 1,234.856 | 84.235 | 168.000 | 0.000 |
3 | 0.003 | 0.000 | 0.014 | 0.000 | 0.724 | 0.073 | 0.729 | 0.019 | 0.796 | 0.103 | ... | 0.801 | 0.087 | 0.795 | 0.024 | 262.398 | 61.376 | 1,004.897 | 78.645 | 168.000 | 0.000 |
4 | 0.003 | 0.000 | 0.014 | 0.000 | 0.767 | 0.078 | 0.780 | 0.020 | 0.814 | 0.091 | ... | 0.848 | 0.080 | 0.857 | 0.022 | 223.471 | 62.641 | 805.936 | 66.644 | 168.000 | 0.000 |
5 | 0.003 | 0.000 | 0.014 | 0.000 | 0.754 | 0.086 | 0.774 | 0.023 | 0.783 | 0.101 | ... | 0.848 | 0.078 | 0.862 | 0.021 | 245.818 | 62.510 | 845.584 | 92.744 | 168.000 | 0.000 |
5 rows × 34 columns
Add AIC and BIC for model selection
Show code cell source
# AIC vs BIC on train and test data with bigger is better
IC_criteria = pd.DataFrame()
N_split = {
'train': round(len(splits.X_train) * 0.8),
'test': round(len(splits.X_train) * 0.2)
}
for _split in ('train', 'test'):
IC_criteria[(f'{_split}_neg_AIC',
'mean')] = -(2 * cv_feat.index.to_series() -
2 * cv_feat[(f'{_split}_log_loss', 'mean')])
IC_criteria[(
f'{_split}_neg_BIC',
'mean')] = -(cv_feat.index.to_series() * np.log(N_split[_split]) -
2 * cv_feat[(f'{_split}_log_loss', 'mean')])
IC_criteria.columns = pd.MultiIndex.from_tuples(IC_criteria.columns)
IC_criteria
train_neg_AIC | train_neg_BIC | test_neg_AIC | test_neg_BIC | |
---|---|---|---|---|
mean | mean | mean | mean | |
n_features | ||||
1 | 2,335.070 | 2,332.173 | 587.674 | 586.148 |
2 | 2,465.711 | 2,459.915 | 618.834 | 615.782 |
3 | 2,003.794 | 1,995.101 | 518.796 | 514.217 |
4 | 1,603.872 | 1,592.281 | 438.941 | 432.836 |
5 | 1,681.168 | 1,666.679 | 481.635 | 474.004 |
All cross-validation metrics:
cv_feat = cv_feat.join(IC_criteria)
cv_feat = cv_feat.filter(regex="train|test", axis=1).style.highlight_max(
axis=0, subset=pd.IndexSlice[:, pd.IndexSlice[:, 'mean']])
cv_feat
test_precision | train_precision | test_recall | train_recall | test_f1 | train_f1 | test_balanced_accuracy | train_balanced_accuracy | test_roc_auc | train_roc_auc | test_average_precision | train_average_precision | test_log_loss | train_log_loss | train_neg_AIC | train_neg_BIC | test_neg_AIC | test_neg_BIC | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | std | mean | mean | mean | mean | |
n_features | ||||||||||||||||||||||||||||||||
1 | 0.715090 | 0.097785 | 0.712124 | 0.022788 | 0.704286 | 0.107999 | 0.707500 | 0.019391 | 0.705759 | 0.088131 | 0.709702 | 0.019351 | 0.748748 | 0.073459 | 0.751437 | 0.016989 | 0.799259 | 0.065666 | 0.798173 | 0.016412 | 0.751101 | 0.091072 | 0.720311 | 0.028466 | 294.837085 | 86.679293 | 1,168.535243 | 82.092116 | 2,335.070486 | 2,332.172646 | 587.674169 | 586.147809 |
2 | 0.687396 | 0.066357 | 0.686706 | 0.020643 | 0.712857 | 0.099447 | 0.714286 | 0.024733 | 0.696015 | 0.064919 | 0.700109 | 0.020758 | 0.738534 | 0.053016 | 0.740687 | 0.017950 | 0.807192 | 0.055527 | 0.813695 | 0.013622 | 0.765328 | 0.078350 | 0.754689 | 0.022982 | 311.417165 | 60.820362 | 1,234.855565 | 84.234509 | 2,465.711130 | 2,459.915451 | 618.834331 | 615.781610 |
3 | 0.723886 | 0.073140 | 0.728746 | 0.019180 | 0.795714 | 0.103066 | 0.800357 | 0.023846 | 0.752519 | 0.059155 | 0.762733 | 0.018592 | 0.784752 | 0.052193 | 0.793655 | 0.016794 | 0.845158 | 0.058247 | 0.853656 | 0.013804 | 0.800901 | 0.086731 | 0.794904 | 0.023501 | 262.397797 | 61.375810 | 1,004.897056 | 78.644710 | 2,003.794113 | 1,995.100594 | 518.795593 | 514.216512 |
4 | 0.767039 | 0.078051 | 0.780211 | 0.020441 | 0.814286 | 0.091268 | 0.837143 | 0.018621 | 0.785609 | 0.061656 | 0.807484 | 0.015171 | 0.815038 | 0.053025 | 0.834130 | 0.013553 | 0.879639 | 0.056085 | 0.896144 | 0.013018 | 0.847946 | 0.080071 | 0.856597 | 0.022011 | 223.470651 | 62.641424 | 805.936090 | 66.644467 | 1,603.872180 | 1,592.280820 | 438.941302 | 432.835860 |
5 | 0.753626 | 0.085873 | 0.773605 | 0.022748 | 0.782857 | 0.100974 | 0.821786 | 0.026631 | 0.761591 | 0.063434 | 0.796848 | 0.022512 | 0.794929 | 0.052105 | 0.824919 | 0.019919 | 0.878444 | 0.055807 | 0.900023 | 0.012607 | 0.848134 | 0.077870 | 0.862431 | 0.021129 | 245.817716 | 62.510078 | 845.584109 | 92.743733 | 1,681.168217 | 1,666.679018 | 481.635432 | 474.003630 |
Save:
cv_feat.to_excel(writer, 'CV', float_format='%.3f')
cv_feat = cv_feat.data
Optimal number of features to use based on cross-validation by metric:
Show code cell source
mask = cv_feat.columns.levels[0].str[:4] == 'test'
scores_cols = cv_feat.columns.levels[0][mask]
n_feat_best = cv_feat.loc[:, pd.IndexSlice[scores_cols, 'mean']].idxmax()
n_feat_best.name = 'best'
n_feat_best.to_excel(writer, 'n_feat_best')
n_feat_best
test_average_precision mean 5
test_balanced_accuracy mean 4
test_f1 mean 4
test_log_loss mean 2
test_neg_AIC mean 2
test_neg_BIC mean 2
test_precision mean 4
test_recall mean 4
test_roc_auc mean 4
Name: best, dtype: int64
Retrain model with best number of features by selected metric::
results_model = njab.sklearn.run_model(
model=model,
splits=splits,
n_feat_to_select=n_feat_best.loc['test_roc_auc', 'mean'],
)
results_model.name = model_name
100%|██████████| 4/4 [00:00<00:00, 4.61it/s]
Receiver Operating Curve of final model#
Show code cell source
ax = plot_auc(results_model,
label_train=TRAIN_LABEL,
label_test=TEST_LABEL,
figsize=(4, 2))
files_out['ROAUC'] = FOLDER / 'plot_roauc.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['ROAUC'])
INFO:njab.plotting:Saved Figures to alzheimer/plot_roauc.pdf
Precision-Recall Curve for final model#
Show code cell source
ax = plot_prc(results_model,
label_train=TRAIN_LABEL,
label_test=TEST_LABEL,
figsize=(4, 2))
files_out['PRAUC'] = FOLDER / 'plot_prauc.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['PRAUC'])
INFO:njab.plotting:Saved Figures to alzheimer/plot_prauc.pdf
Coefficients with/out std. errors#
Show code cell source
pd.DataFrame({
'coef': results_model.model.coef_.flatten(),
'name': results_model.model.feature_names_in_
})
coef | name | |
---|---|---|
0 | 1.094 | P63104 |
1 | -0.604 | P09486 |
2 | -0.791 | A0A0B4J2B5 |
3 | 1.129 | P10636-2 |
results_model.model.intercept_
array([-0.25165406])
Selected Features#
Show code cell source
des_selected_feat = splits.X_train[results_model.selected_features].describe()
des_selected_feat.to_excel(writer, 'sel_feat', float_format='%.3f')
des_selected_feat
P63104 | P09486 | A0A0B4J2B5 | P10636-2 | |
---|---|---|---|---|
count | 168.000 | 168.000 | 168.000 | 168.000 |
mean | 0.000 | 0.000 | 0.000 | -0.000 |
std | 1.003 | 1.003 | 1.003 | 1.003 |
min | -2.527 | -2.578 | -2.756 | -3.959 |
25% | -0.721 | -0.648 | -0.660 | -0.483 |
50% | -0.108 | 0.101 | -0.060 | -0.021 |
75% | 0.756 | 0.540 | 0.693 | 0.549 |
max | 2.321 | 2.559 | 2.466 | 2.806 |
Heatmap of correlations#
Show code cell source
fig = plt.figure(figsize=(6, 6))
files_out['corr_plot_train.pdf'] = FOLDER / 'corr_plot_train.pdf'
_ = corrplot(X[results_model.selected_features].join(y).corr(), size_scale=300)
njab.plotting.savefig(fig, files_out['corr_plot_train.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/corr_plot_train.pdf
Plot training data scores#
Show code cell source
N_BINS = 20
score = get_score(clf=results_model.model,
X=splits.X_train[results_model.selected_features],
pos=1)
ax = score.hist(bins=N_BINS)
files_out['hist_score_train.pdf'] = FOLDER / 'hist_score_train.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_train.pdf'])
pred_bins = get_target_count_per_bin(score, y, n_bins=N_BINS)
ax = pred_bins.plot(kind='bar', ylabel='count')
files_out[
'hist_score_train_target.pdf'] = FOLDER / 'hist_score_train_target.pdf'
njab.plotting.savefig(ax.get_figure(),
files_out['hist_score_train_target.pdf'])
# pred_bins
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_train.pdf
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_train_target.pdf
Test data scores#
Show code cell source
score_val = get_score(clf=results_model.model,
X=splits.X_test[results_model.selected_features],
pos=1)
predictions['score'] = score_val
ax = score_val.hist(bins=N_BINS) # list(x/N_BINS for x in range(0,N_BINS)))
ax.set_ylabel('count')
ax.set_xlim(0, 1)
files_out['hist_score_test.pdf'] = FOLDER / 'hist_score_test.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_test.pdf'])
pred_bins_val = get_target_count_per_bin(score_val, y_val, n_bins=N_BINS)
ax = pred_bins_val.plot(kind='bar', ylabel='count')
ax.locator_params(axis='y', integer=True)
files_out['hist_score_test_target.pdf'] = FOLDER / 'hist_score_test_target.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['hist_score_test_target.pdf'])
# pred_bins_val
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_test.pdf
INFO:njab.plotting:Saved Figures to alzheimer/hist_score_test_target.pdf
Performance evaluations#
Check if the cutoff can be adapted to maximize the F1 score between precision and recall:
Show code cell source
prc = pd.DataFrame(results_model.train.prc,
index='precision recall cutoffs'.split())
prc
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 159 | 160 | 161 | 162 | 163 | 164 | 165 | 166 | 167 | 168 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
precision | 0.417 | 0.419 | 0.422 | 0.424 | 0.427 | 0.429 | 0.432 | 0.435 | 0.438 | 0.440 | ... | 0.889 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
recall | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | ... | 0.114 | 0.114 | 0.100 | 0.086 | 0.071 | 0.057 | 0.043 | 0.029 | 0.014 | 0.000 |
cutoffs | 0.000 | 0.004 | 0.009 | 0.010 | 0.011 | 0.011 | 0.014 | 0.023 | 0.028 | 0.032 | ... | 0.962 | 0.970 | 0.976 | 0.980 | 0.983 | 0.987 | 0.989 | 0.993 | 0.995 | NaN |
3 rows × 169 columns
Show code cell source
prc.loc['f1_score'] = 2 * (prc.loc['precision'] * prc.loc['recall']) / (
1 / prc.loc['precision'] + 1 / prc.loc['recall'])
f1_max = prc[prc.loc['f1_score'].argmax()]
f1_max
precision 0.741
recall 0.900
cutoffs 0.407
f1_score 0.542
Name: 83, dtype: float64
Cutoff set
Show code cell source
cutoff = float(f1_max.loc['cutoffs'])
cutoff
0.4065495537928745
Show code cell source
y_pred_val = njab.sklearn.scoring.get_custom_pred(
clf=results_model.model,
X=splits.X_test[results_model.selected_features],
cutoff=cutoff)
predictions[model_name] = y_pred_val
predictions['dead'] = y_val
_ = ConfusionMatrix(y_val, y_pred_val).as_dataframe()
_.columns = pd.MultiIndex.from_tuples([(t[0] + f" - {cutoff:.3f}", t[1])
for t in _.columns])
_.to_excel(writer, "CM_test_cutoff_adapted")
_
pred - 0.407 | ||
---|---|---|
0 | 1 | |
true | ||
0 | 15 | 9 |
1 | 5 | 13 |
Show code cell source
y_pred_val = get_pred(clf=results_model.model,
X=splits.X_test[results_model.selected_features])
predictions[model_name] = y_pred_val
predictions['dead'] = y_val
_ = ConfusionMatrix(y_val, y_pred_val).as_dataframe()
_.columns = pd.MultiIndex.from_tuples([(t[0] + f" - {0.5}", t[1])
for t in _.columns])
_.to_excel(writer, "CM_test_cutoff_0.5")
_
pred - 0.5 | ||
---|---|---|
0 | 1 | |
true | ||
0 | 18 | 6 |
1 | 5 | 13 |
Multiplicative decompositon#
Decompose the model into its components for both splits:
Show code cell source
components = get_lr_multiplicative_decomposition(results=results_model,
X=splits.X_train,
prob=score,
y=y)
components.to_excel(writer, 'decomp_multiplicative_train')
components.to_excel(writer,
'decomp_multiplicative_train_view',
float_format='%.5f')
components.head(10)
P63104 | P09486 | A0A0B4J2B5 | P10636-2 | intercept | odds | prob | AD | |
---|---|---|---|---|---|---|---|---|
Sample ID | ||||||||
Sample_072 | 6.090 | 1.741 | 3.492 | 7.470 | 0.778 | 215.059 | 0.995 | 1 |
Sample_086 | 7.186 | 1.516 | 2.847 | 5.633 | 0.778 | 135.787 | 0.993 | 1 |
Sample_008 | 5.849 | 2.858 | 0.905 | 7.853 | 0.778 | 92.394 | 0.989 | 1 |
Sample_098 | 6.629 | 1.024 | 1.683 | 8.376 | 0.778 | 74.405 | 0.987 | 1 |
Sample_071 | 3.086 | 1.321 | 1.870 | 9.578 | 0.778 | 56.749 | 0.983 | 1 |
Sample_121 | 1.522 | 1.886 | 2.117 | 10.424 | 0.778 | 49.249 | 0.980 | 1 |
Sample_169 | 1.458 | 1.956 | 3.327 | 5.523 | 0.778 | 40.749 | 0.976 | 1 |
Sample_092 | 2.613 | 1.021 | 1.957 | 7.900 | 0.778 | 32.072 | 0.970 | 1 |
Sample_066 | 6.490 | 0.439 | 1.312 | 8.607 | 0.778 | 25.028 | 0.962 | 0 |
Sample_097 | 3.567 | 2.228 | 3.210 | 1.148 | 0.778 | 22.762 | 0.958 | 1 |
Show code cell source
components_test = get_lr_multiplicative_decomposition(results=results_model,
X=splits.X_test,
prob=score_val,
y=y_val)
components_test.to_excel(writer, 'decomp_multiplicative_test')
components_test.to_excel(writer,
'decomp_multiplicative_test_view',
float_format='%.5f')
components_test.head(10)
P63104 | P09486 | A0A0B4J2B5 | P10636-2 | intercept | odds | prob | AD | |
---|---|---|---|---|---|---|---|---|
Sample ID | ||||||||
Sample_018 | 4.648 | 1.233 | 2.615 | 10.706 | 0.778 | 124.817 | 0.992 | 1 |
Sample_159 | 4.942 | 1.430 | 2.243 | 4.833 | 0.778 | 59.588 | 0.983 | 1 |
Sample_205 | 2.733 | 0.603 | 3.614 | 9.700 | 0.778 | 44.958 | 0.978 | 1 |
Sample_099 | 2.494 | 1.938 | 2.383 | 4.474 | 0.778 | 40.066 | 0.976 | 1 |
Sample_011 | 7.611 | 0.938 | 0.949 | 6.289 | 0.778 | 33.133 | 0.971 | 1 |
Sample_085 | 2.161 | 1.095 | 1.856 | 5.614 | 0.778 | 19.173 | 0.950 | 1 |
Sample_112 | 7.952 | 3.326 | 0.722 | 0.833 | 0.778 | 12.369 | 0.925 | 0 |
Sample_079 | 1.910 | 1.316 | 1.454 | 4.112 | 0.778 | 11.689 | 0.921 | 1 |
Sample_010 | 5.122 | 0.528 | 0.878 | 1.996 | 0.778 | 3.686 | 0.787 | 1 |
Sample_133 | 2.978 | 0.822 | 1.741 | 1.034 | 0.778 | 3.427 | 0.774 | 1 |
Plot TP, TN, FP and FN on PCA plot#
UMAP#
Show code cell source
reducer = umap.UMAP(random_state=42)
# bug: how does UMAP works with only one feature?
# make sure to have two or more features?
M_sel = len(results_model.selected_features)
if M_sel > 1:
embedding = reducer.fit_transform(
X_scaled[results_model.selected_features])
embedding = pd.DataFrame(embedding,
index=X_scaled.index,
columns=['UMAP dimension 1', 'UMAP dimension 2'
]).join(y.astype('category'))
display(embedding.head(3))
else:
embedding = None
UMAP dimension 1 | UMAP dimension 2 | AD | |
---|---|---|---|
Sample ID | |||
Sample_000 | 6.307 | 7.813 | 0 |
Sample_001 | 7.008 | 7.516 | 1 |
Sample_002 | 6.090 | 6.873 | 1 |
Annotate using target variable and predictions:
Show code cell source
predictions['label'] = predictions.apply(
lambda x: njab.sklearn.scoring.get_label_binary_classification(
x['true'], x[model_name]),
axis=1)
mask = predictions[['true', model_name]].sum(axis=1).astype(bool)
predictions.loc[mask].sort_values('score', ascending=False)
true | score | all | dead | label | |
---|---|---|---|---|---|
Sample ID | |||||
Sample_018 | 1 | 0.992 | 1 | 1 | TP |
Sample_159 | 1 | 0.983 | 1 | 1 | TP |
Sample_205 | 1 | 0.978 | 1 | 1 | TP |
Sample_099 | 1 | 0.976 | 1 | 1 | TP |
Sample_011 | 1 | 0.971 | 1 | 1 | TP |
Sample_085 | 1 | 0.950 | 1 | 1 | TP |
Sample_112 | 0 | 0.925 | 1 | 0 | FP |
Sample_079 | 1 | 0.921 | 1 | 1 | TP |
Sample_010 | 1 | 0.787 | 1 | 1 | TP |
Sample_133 | 1 | 0.774 | 1 | 1 | TP |
Sample_127 | 0 | 0.758 | 1 | 0 | FP |
Sample_090 | 1 | 0.728 | 1 | 1 | TP |
Sample_156 | 1 | 0.727 | 1 | 1 | TP |
Sample_145 | 0 | 0.674 | 1 | 0 | FP |
Sample_186 | 0 | 0.637 | 1 | 0 | FP |
Sample_067 | 0 | 0.583 | 1 | 0 | FP |
Sample_060 | 1 | 0.553 | 1 | 1 | TP |
Sample_197 | 0 | 0.535 | 1 | 0 | FP |
Sample_146 | 1 | 0.516 | 1 | 1 | TP |
Sample_009 | 1 | 0.292 | 0 | 1 | FN |
Sample_194 | 1 | 0.237 | 0 | 1 | FN |
Sample_142 | 1 | 0.195 | 0 | 1 | FN |
Sample_171 | 1 | 0.140 | 0 | 1 | FN |
Sample_125 | 1 | 0.119 | 0 | 1 | FN |
Show code cell source
X_val_scaled = scaler.transform(X_val)
if embedding is not None:
embedding_val = pd.DataFrame(
reducer.transform(X_val_scaled[results_model.selected_features]),
index=X_val_scaled.index,
columns=['UMAP dimension 1', 'UMAP dimension 2'])
embedding_val.sample(3)
Show code cell source
pred_train = (
y.to_frame('true')
# .join(get_score(clf=results_model.model, X=splits.X_train[results_model.selected_features], pos=1))
.join(score.rename('score')).join(
get_pred(results_model.model, splits.X_train[
results_model.selected_features]).rename(model_name)))
pred_train['label'] = pred_train.apply(
lambda x: njab.sklearn.scoring.get_label_binary_classification(
x['true'], x[model_name]),
axis=1)
pred_train.sample(5)
true | score | all | label | |
---|---|---|---|---|
Sample ID | ||||
Sample_204 | 0 | 0.048 | 0 | TN |
Sample_020 | 1 | 0.934 | 1 | TP |
Sample_163 | 0 | 0.014 | 0 | TN |
Sample_181 | 0 | 0.094 | 0 | TN |
Sample_000 | 0 | 0.943 | 1 | FP |
Show code cell content
colors = seaborn.color_palette(n_colors=4)
colors
Show code cell source
if embedding is not None:
fig, axes = plt.subplots(1, 2, figsize=(8, 4), sharex=True, sharey=True)
for _embedding, ax, _title, _model_pred_label in zip(
[embedding, embedding_val], axes, [TRAIN_LABEL, TEST_LABEL],
[pred_train['label'], predictions['label']]): # noqa: E129
ax = seaborn.scatterplot(
x=_embedding.iloc[:, 0],
y=_embedding.iloc[:, 1],
hue=_model_pred_label,
hue_order=['TN', 'TP', 'FN', 'FP'],
palette=[colors[0], colors[2], colors[1], colors[3]],
ax=ax)
ax.set_title(_title)
# files_out['pred_pca_labeled'] = FOLDER / 'pred_pca_labeled.pdf'
# njab.plotting.savefig(fig, files_out['pred_pca_labeled'])
files_out['umap_sel_feat.pdf'] = FOLDER / 'umap_sel_feat.pdf'
njab.plotting.savefig(ax.get_figure(), files_out['umap_sel_feat.pdf'])
INFO:njab.plotting:Saved Figures to alzheimer/umap_sel_feat.pdf
Interactive UMAP plot#
Not displayed in online documentation
Show code cell source
if embedding is not None:
embedding = embedding.join(X[results_model.selected_features])
embedding_val = embedding_val.join(X_val[results_model.selected_features])
embedding['label'], embedding_val['label'] = pred_train[
'label'], predictions['label']
embedding['group'], embedding_val['group'] = TRAIN_LABEL, TEST_LABEL
combined_embeddings = pd.concat([embedding, embedding_val])
combined_embeddings.index.name = 'ID'
Show code cell source
if embedding is not None:
cols = combined_embeddings.columns
TEMPLATE = 'none'
defaults = dict(width=800, height=400, template=TEMPLATE)
fig = px.scatter(combined_embeddings.round(3).reset_index(),
x=cols[0],
y=cols[1],
color='label',
facet_col='group',
hover_data=['ID'] + results_model.selected_features,
**defaults)
fig.for_each_annotation(lambda a: a.update(text=a.text.split("=")[1]))
fname = FOLDER / 'umap_sel_feat.html'
files_out[fname.name] = fname
fig.write_html(fname)
print(fname)
display(fig)
alzheimer/umap_sel_feat.html
PCA#
Show code cell source
PCs_train, pca = njab_pca.run_pca(X_scaled[results_model.selected_features],
n_components=None)
ax = njab_pca.plot_explained_variance(pca)
ax.locator_params(axis='x', integer=True)
fname = FOLDER / "feat_sel_PCA_var_explained_by_PCs.pdf"
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/feat_sel_PCA_var_explained_by_PCs.pdf
Applied to the test split:
Show code cell source
PCs_val = pca.transform(X_val_scaled[results_model.selected_features])
PCs_val = pd.DataFrame(PCs_val,
index=X_val_scaled.index,
columns=PCs_train.columns)
PCs_val
principal component 1 (36.36 %) | principal component 2 (25.19 %) | principal component 3 (24.95 %) | principal component 4 (13.50 %) | |
---|---|---|---|---|
Sample ID | ||||
Sample_009 | -0.065 | -0.509 | -1.214 | 0.950 |
Sample_010 | -1.181 | 0.502 | -1.419 | 0.297 |
Sample_011 | -2.363 | 0.173 | -0.701 | 0.049 |
Sample_018 | -2.564 | 1.044 | 0.453 | -0.327 |
Sample_038 | -0.173 | -0.937 | 0.055 | 0.153 |
... | ... | ... | ... | ... |
Sample_191 | 2.711 | 1.479 | 2.641 | 1.431 |
Sample_194 | 0.370 | -0.036 | -0.266 | -0.796 |
Sample_197 | -0.095 | 0.222 | 0.214 | 0.031 |
Sample_202 | 1.135 | 1.658 | 0.188 | -0.318 |
Sample_205 | -1.932 | 1.956 | -0.139 | -0.808 |
42 rows × 4 columns
Show code cell source
if M_sel > 1:
fig, axes = plt.subplots(1, 2, figsize=(6, 3), sharex=True, sharey=True)
for _embedding, ax, _title, _model_pred_label in zip(
[PCs_train, PCs_val], axes, [TRAIN_LABEL, TEST_LABEL],
[pred_train['label'], predictions['label']]): # noqa: E129
ax = seaborn.scatterplot(
x=_embedding.iloc[:, 0],
y=_embedding.iloc[:, 1],
hue=_model_pred_label,
hue_order=['TN', 'TP', 'FN', 'FP'],
palette=[colors[0], colors[2], colors[1], colors[3]],
ax=ax)
ax.set_title(_title)
fname = FOLDER / 'pca_sel_feat.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/pca_sel_feat.pdf
Show code cell source
if M_sel > 1:
max_rows = min(3, len(results_model.selected_features))
fig, axes = plt.subplots(max_rows,
2,
figsize=(6, 8),
sharex=False,
sharey=False,
layout='constrained')
for axes_col, (_embedding, _title, _model_pred_label) in enumerate(
zip([PCs_train, PCs_val], [TRAIN_LABEL, TEST_LABEL],
[pred_train['label'], predictions['label']])):
_row = 0
axes[_row, axes_col].set_title(_title)
for (i, j) in itertools.combinations(range(max_rows), 2):
ax = seaborn.scatterplot(
x=_embedding.iloc[:, i],
y=_embedding.iloc[:, j],
hue=_model_pred_label,
hue_order=['TN', 'TP', 'FN', 'FP'],
palette=[colors[0], colors[2], colors[1], colors[3]],
ax=axes[_row, axes_col])
_row += 1
fname = FOLDER / f'pca_sel_feat_up_to_{max_rows}.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/pca_sel_feat_up_to_3.pdf
Features#
top 3 scaled n_features_max (scatter)
or unscalled single features (swarmplot)
Show code cell source
if M_sel > 1:
max_rows = min(3, len(results_model.selected_features))
fig, axes = plt.subplots(max_rows,
2,
figsize=(6, 8),
sharex=False,
sharey=False,
layout='constrained')
for axes_col, (_embedding, _title, _model_pred_label) in enumerate(
zip([
X_scaled[results_model.selected_features],
X_val_scaled[results_model.selected_features]
], [TRAIN_LABEL, TEST_LABEL],
[pred_train['label'], predictions['label']])):
_row = 0
axes[_row, axes_col].set_title(_title)
for (i, j) in itertools.combinations(range(max_rows), 2):
ax = seaborn.scatterplot(
x=_embedding.iloc[:, i],
y=_embedding.iloc[:, j],
hue=_model_pred_label,
hue_order=['TN', 'TP', 'FN', 'FP'],
palette=[colors[0], colors[2], colors[1], colors[3]],
ax=axes[_row, axes_col])
_row += 1
fname = FOLDER / f'sel_feat_up_to_{max_rows}.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
else:
fig, axes = plt.subplots(1, 1, figsize=(6, 2), layout='constrained')
single_feature = results_model.selected_features[0]
data = pd.concat([
X[single_feature].to_frame().join(
pred_train['label']).assign(group=TRAIN_LABEL),
X_val[single_feature].to_frame().join(
predictions['label']).assign(group=TEST_LABEL)
])
ax = seaborn.swarmplot(data=data,
x='group',
y=single_feature,
hue='label',
ax=axes)
fname = FOLDER / f'sel_feat_{single_feature}.pdf'
files_out[fname.name] = fname
njab.plotting.savefig(ax.get_figure(), fname)
INFO:njab.plotting:Saved Figures to alzheimer/sel_feat_up_to_3.pdf
Savee annotation of errors for manuel analysis#
Saved to excel table.
X[results_model.selected_features].join(pred_train).to_excel(
writer, sheet_name='pred_train_annotated', float_format="%.3f")
X_val[results_model.selected_features].join(predictions).to_excel(
writer, sheet_name='pred_test_annotated', float_format="%.3f")
Outputs#
writer.close()
files_out
{'log_reg': PosixPath('alzheimer/log_reg.xlsx'),
'var_explained_by_PCs.pdf': PosixPath('alzheimer/var_explained_by_PCs.pdf'),
'scatter_first_5PCs.pdf': PosixPath('alzheimer/scatter_first_5PCs.pdf'),
'umap.pdf': PosixPath('alzheimer/umap.pdf'),
'ROAUC': PosixPath('alzheimer/plot_roauc.pdf'),
'PRAUC': PosixPath('alzheimer/plot_prauc.pdf'),
'corr_plot_train.pdf': PosixPath('alzheimer/corr_plot_train.pdf'),
'hist_score_train.pdf': PosixPath('alzheimer/hist_score_train.pdf'),
'hist_score_train_target.pdf': PosixPath('alzheimer/hist_score_train_target.pdf'),
'hist_score_test.pdf': PosixPath('alzheimer/hist_score_test.pdf'),
'hist_score_test_target.pdf': PosixPath('alzheimer/hist_score_test_target.pdf'),
'umap_sel_feat.pdf': PosixPath('alzheimer/umap_sel_feat.pdf'),
'umap_sel_feat.html': PosixPath('alzheimer/umap_sel_feat.html'),
'feat_sel_PCA_var_explained_by_PCs.pdf': PosixPath('alzheimer/feat_sel_PCA_var_explained_by_PCs.pdf'),
'pca_sel_feat.pdf': PosixPath('alzheimer/pca_sel_feat.pdf'),
'pca_sel_feat_up_to_3.pdf': PosixPath('alzheimer/pca_sel_feat_up_to_3.pdf'),
'sel_feat_up_to_3.pdf': PosixPath('alzheimer/sel_feat_up_to_3.pdf')}