The predictive power of Case Detection Rate (CDR) in country-specific mortality outcomes for patients with a dual burden of Tuberculosis (TB) and Human Immunodeficiency Virus (HIV)

Introduction

The dataset, Tuberculosis Burden by Country, focuses on the global burden of TB across various countries, with an emphasis on its intersection with HIV. The data provides a detailed overview of country-specific TB burden indicators, which are critical for understanding the public health challenges related to TB and its co-morbidities, including HIV.

Before any cleaning, the dataset contains 5120 rows. The following columns are relevant to the analysis:

  1. Country or territory name
    • The name of the country.
  2. Year
    • The reporting year.
  3. Population
    • The total population of the country.
  4. Case Detection Rate (CDR)
    • The proportion of new and relapsed TB cases detected and notified per year, expressed as a percentage.
    • Independent variable of interest, representing the effectiveness of a country’s TB diagnostic and reporting systems.
  5. Estimated number of deaths from TB (all forms, excluding HIV)
    • Estimated number of TB deaths, excluding deaths in HIV-positive people (absolute count).
    • Used to calculate Mortality-Incidence Ratio for only TB (MIR_TB).
  6. Estimated number of deaths from TB in people who are HIV-positive
    • Estimated number of TB deaths among HIV-positive people (absolute count).
    • Used to calculate Mortality-Incidence Ratio for dual burden of HIV and TB (MIR_TB_HIV).
  7. Estimated number of incident cases (all forms)
    • Estimated number of new TB cases (absolute count).
    • Used to calculate Mortality-Incidence Ratio for only TB (MIR_TB).
  8. Estimated incidence of TB cases who are HIV-positive
    • Estimated number of new TB cases among HIV-positive individuals (absolute count).
    • Used to calculate Mortality-Incidence Ratio for dual burden of HIV and TB (MIR_TB_HIV).
  9. Estimated prevalence of TB (all forms)
    • Estimated number of TB cases (new and existing), as an absolute count.

Data Cleaning and Exploratory Data Analysis

Calculation of target variables

MIR_TB and MIR_TB_HIV are the target variables. Each is the estimated number of deaths divided by the corresponding estimated number of incident cases. They are used to analyze the predictive power of Case Detection Rate (CDR) for country-specific mortality outcomes in patients with a dual burden of Human Immunodeficiency Virus (HIV) and Tuberculosis (TB).

| Country or territory name | Year | Estimated total population number | Estimated prevalence of TB (all forms) | Estimated number of deaths from TB (all forms, excluding HIV) | Estimated number of deaths from TB in people who are HIV-positive | Estimated number of incident cases (all forms) | Estimated HIV in incident TB (percent) | Estimated incidence of TB cases who are HIV-positive | Case detection rate (all forms), percent | MIR_TB | MIR_TB_HIV |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Afghanistan | 2013 | 30551674 | 100000 | 13000 | 82 | 58000 | 0.34 | 200 | 53 | 0.224138 | 0.41 |
| Algeria | 2013 | 39208194 | 49000 | 5100 | 35 | 32000 | 0.37 | 120 | 66 | 0.159375 | 0.291667 |
| Angola | 2013 | 21471618 | 91000 | 6900 | 1600 | 69000 | 11 | 7500 | 85 | 0.1 | 0.213333 |
| Argentina | 2013 | 41446246 | 13000 | 570 | 44 | 10000 | 2.7 | 270 | 89 | 0.057 | 0.162963 |
| Armenia | 2013 | 2976566 | 2000 | 170 | 12 | 1500 | 4.5 | 66 | 95 | 0.113333 | 0.181818 |
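
The MIR values in the sample rows are consistent with dividing each death estimate by its matching incidence estimate. The following is a minimal sketch of that calculation, not necessarily the exact code used for this report; the two-row frame is copied from the sample rows purely for illustration.

```python
import pandas as pd

# Two sample rows copied from the table above (illustration only).
df = pd.DataFrame({
    "Country or territory name": ["Afghanistan", "Angola"],
    "Estimated number of deaths from TB (all forms, excluding HIV)": [13000, 6900],
    "Estimated number of deaths from TB in people who are HIV-positive": [82, 1600],
    "Estimated number of incident cases (all forms)": [58000, 69000],
    "Estimated incidence of TB cases who are HIV-positive": [200, 7500],
})

# MIR for TB alone: deaths (excluding HIV) per incident TB case.
df["MIR_TB"] = (
    df["Estimated number of deaths from TB (all forms, excluding HIV)"]
    / df["Estimated number of incident cases (all forms)"]
)

# MIR for the dual burden: deaths among HIV-positive TB patients
# per incident HIV-positive TB case.
df["MIR_TB_HIV"] = (
    df["Estimated number of deaths from TB in people who are HIV-positive"]
    / df["Estimated incidence of TB cases who are HIV-positive"]
)

print(df[["Country or territory name", "MIR_TB", "MIR_TB_HIV"]].round(6))
```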

MIR_TB vs MIR_TB_HIV

An MIR closer to 1 means that a larger share of incident cases ends in death. The plots below show that MIRs for the dual burden of TB and HIV are noticeably higher and more variable than MIRs for TB alone. This suggests that, regardless of a country's TB incidence, health systems are less effective at treating patients who carry both TB and HIV.

CDR vs MIRs

The scatterplot below shows MIR falling as the case detection rate rises, with a much steeper decline for the dual burden of TB and HIV than for TB alone. Dual-burden cases are therefore likely to have a large impact on the model, given the apparent correlation between the case detection rate and mortality outcomes.
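
One way to put a number on the relationship between CDR and the two MIRs is a correlation coefficient. The sketch below computes Pearson correlations on the five sample rows shown earlier. Five rows are far too few to draw conclusions from, so this only illustrates the calculation; the short column names `CDR`, `MIR_TB` and `MIR_TB_HIV` in the demo frame are chosen here for readability.

```python
import pandas as pd

# Five sample rows copied from the table in the previous section (illustration only).
sample = pd.DataFrame({
    "CDR": [53, 66, 85, 89, 95],
    "MIR_TB": [0.224138, 0.159375, 0.1, 0.057, 0.113333],
    "MIR_TB_HIV": [0.41, 0.291667, 0.213333, 0.162963, 0.181818],
})

# Pearson correlation of CDR with each mortality-incidence ratio.
corr_tb = sample["CDR"].corr(sample["MIR_TB"])
corr_tb_hiv = sample["CDR"].corr(sample["MIR_TB_HIV"])

print(f"CDR vs MIR_TB:     {corr_tb:.3f}")
print(f"CDR vs MIR_TB_HIV: {corr_tb_hiv:.3f}")
```

A negative coefficient means that higher detection rates go together with lower mortality-incidence ratios in the rows supplied.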

Detection to Prevalence Ratio

To provide more insight, I calculated detection-to-prevalence ratios (DPRs), defined as the case detection rate divided by the estimated TB prevalence, and aggregated them by region to compare MIRs and DPRs across regions.

# Detection-to-prevalence ratio: CDR (percent) divided by estimated prevalence
df['DPR'] = df['Case detection rate (all forms), percent'] / df['Estimated prevalence of TB (all forms)']

# Aggregate by region: incidence and DPR are summed, the MIRs are averaged
grouped_table = df.groupby('Region').agg(
    Total_Incidence=('Estimated number of incident cases (all forms)', 'sum'),
    Sum_DPR=('DPR', 'sum'),
    Mean_TB_MIR=('MIR_TB', 'mean'),
    Mean_TB_HIV_MIR=('MIR_TB_HIV', 'mean')
).reset_index()

| Region | Total_Incidence | Sum_DPR | Mean_TB_MIR | Mean_TB_HIV_MIR |
|---|---|---|---|---|
| AFR | 2.57845e+06 | 0.76766 | 0.129574 | 0.306569 |
| AMR | 285057 | 4.47835 | 0.0670003 | 0.196754 |
| EMR | 713290 | 0.226031 | 0.133612 | 0.300686 |
| EUR | 350150 | 2.1533 | 0.0684857 | 0.148143 |
| SEA | 3.357e+06 | 0.00630488 | 0.121748 | 0.246023 |
| WPR | 1.60893e+06 | 0.345882 | 0.0853927 | 0.170309 |

Framing a Prediction Problem

The goal of this analysis is to evaluate the predictive power of CDR in determining country-specific mortality outcomes for patients experiencing a dual burden of HIV and TB.

Problem Type

Features for Prediction

Evaluation Metrics

Baseline Model

To prepare the data for logistic regression, I binned each MIR into quartiles. Quantile-based bins hold roughly equal numbers of observations, so no class is under- or over-represented.

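from sklearn.preprocessing import KBinsDiscretizer

# Four quantile bins give quartiles; ravel() flattens to 1-D and +1 shifts labels from 0-3 to 1-4
quartile_discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['MIR_TB_quartile'] = quartile_discretizer.fit_transform(df[['MIR_TB']]).astype(int).ravel() + 1
df['MIR_TB_HIV_quartile'] = quartile_discretizer.fit_transform(df[['MIR_TB_HIV']]).astype(int).ravel() + 1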
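
To illustrate why quantile binning keeps the classes balanced, the sketch below compares the `quantile` and `uniform` strategies of `KBinsDiscretizer` on synthetic right-skewed values. The data and the helper `bin_counts` are made up for this example and are not part of the analysis.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Synthetic right-skewed values standing in for an MIR column (not real data).
rng = np.random.default_rng(0)
values = rng.lognormal(mean=-2.0, sigma=0.6, size=1000).reshape(-1, 1)

def bin_counts(strategy):
    """Return how many observations fall into each of 4 bins."""
    binner = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy=strategy)
    labels = binner.fit_transform(values).astype(int).ravel()
    return np.bincount(labels, minlength=4)

quantile_counts = bin_counts("quantile")  # equal-frequency bins
uniform_counts = bin_counts("uniform")    # equal-width bins

print("quantile:", quantile_counts)
print("uniform: ", uniform_counts)
```

With skewed data, equal-width bins crowd most observations into the lowest bins, while quantile bins stay close to one quarter of the data each.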

Defining the Logistic Regression model

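from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression

# Route numeric and categorical columns to their own transformers
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features)
    ]
)

# Multinomial (softmax) logistic regression over the four quartile classes
logistic_model = LogisticRegression(
    multi_class="multinomial",
    solver="lbfgs",
    max_iter=500
)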
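
The preprocessor and model above can be combined into a single scikit-learn `Pipeline`. The sketch below shows one possible assembly. The transformer choices (`StandardScaler`, `OneHotEncoder`), the shortened column names and the toy data are all assumptions made for this example; the step name `logistic` is chosen to match the `logistic__` prefix used later in the grid search. `multi_class` is left at its default here, which in recent scikit-learn versions fits a multinomial model with the `lbfgs` solver.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed definitions (hypothetical names, not taken from the report).
numeric_features = ["population", "prevalence", "cdr"]
categorical_features = ["country"]
numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ]
)

# Preprocessing and classifier chained into one estimator.
pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("logistic", LogisticRegression(solver="lbfgs", max_iter=500)),
    ]
)

# Tiny made-up frame: two countries, four years each, quartile labels 1-4.
toy = pd.DataFrame({
    "population": [1e6, 1.1e6, 1.2e6, 1.3e6, 5e7, 5.1e7, 5.2e7, 5.3e7],
    "prevalence": [2000, 1800, 1500, 1200, 90000, 85000, 80000, 70000],
    "cdr": [50, 60, 70, 80, 55, 65, 75, 85],
    "country": ["A"] * 4 + ["B"] * 4,
    "quartile": [4, 3, 2, 1, 4, 3, 2, 1],
})
X_toy = toy[numeric_features + categorical_features]

pipeline.fit(X_toy, toy["quartile"])
predictions = pipeline.predict(X_toy)
probabilities = pipeline.predict_proba(X_toy)  # one column per quartile class
```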

MIR TB

X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df['MIR_TB_quartile']  # 1-D target, as scikit-learn classifiers expect

MIR TB-HIV

X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df['MIR_TB_HIV_quartile']  # 1-D target, as scikit-learn classifiers expect

Final Model

The visualization above shows that total population and TB prevalence are extremely right-skewed. Because a log transformation compresses the long right tail, I applied it to both columns. CDR is skewed in the opposite direction, so I used a quantile transformer to map it to a uniform distribution. These transformations reduce skew and make the model less sensitive to outliers.

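import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, QuantileTransformer, StandardScaler

log_transformer = FunctionTransformer(np.log1p, validate=True)
quantile_transformer = QuantileTransformer(output_distribution='uniform', random_state=42)

# ColumnTransformer applies its transformers in parallel and concatenates the outputs,
# so "scaler" standardizes the raw numeric_features alongside the log and quantile columns
numeric_transformer = ColumnTransformer(
    transformers=[
        ("log", log_transformer, ['Estimated total population number', 'Estimated prevalence of TB (all forms)']),
        ("quantile", quantile_transformer, ['Case detection rate (all forms), percent']),
        ("scaler", StandardScaler(), numeric_features)
    ]
)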
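
`ColumnTransformer` applies each listed transformer to its own columns and concatenates the results, so in the block above the `scaler` entry standardizes the raw `numeric_features` rather than the log- or quantile-transformed values. If the intent is to standardize the log-transformed columns instead, the steps can be chained inside a `Pipeline`. The following is a sketch of that alternative under this assumption, not the configuration used in this report; `log_then_scale`, `chained_transformer` and the demo frame are names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, QuantileTransformer, StandardScaler

skewed_cols = ["Estimated total population number", "Estimated prevalence of TB (all forms)"]
cdr_col = ["Case detection rate (all forms), percent"]

# Log-transform, then standardize, the right-skewed columns.
log_then_scale = Pipeline(
    steps=[
        ("log", FunctionTransformer(np.log1p)),
        ("scaler", StandardScaler()),
    ]
)

chained_transformer = ColumnTransformer(
    transformers=[
        ("log_scaled", log_then_scale, skewed_cols),
        # n_quantiles is lowered only because the demo frame has five rows.
        ("quantile",
         QuantileTransformer(n_quantiles=5, output_distribution="uniform", random_state=42),
         cdr_col),
    ]
)

# Five sample rows copied from the table earlier in this report.
demo = pd.DataFrame({
    skewed_cols[0]: [30551674, 39208194, 21471618, 41446246, 2976566],
    skewed_cols[1]: [100000, 49000, 91000, 13000, 2000],
    cdr_col[0]: [53, 66, 85, 89, 95],
})

transformed = chained_transformer.fit_transform(demo)
print(transformed.shape)  # one output column per input column
```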

Modeling Algorithm

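# Split by solver: lbfgs accepts only l2; liblinear accepts l1/l2 but only one-vs-rest
common = {"logistic__C": [0.01, 0.1, 1, 10, 100], "logistic__max_iter": [100, 200, 500]}
param_grid = [
    {**common, "logistic__solver": ["lbfgs"], "logistic__penalty": ["l2"],
     "logistic__multi_class": ["ovr", "multinomial"]},
    {**common, "logistic__solver": ["liblinear"], "logistic__penalty": ["l1", "l2"],
     "logistic__multi_class": ["ovr"]},
]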

# Built-in scorer name: one-vs-rest ROC AUC, macro-averaged over the classes
auc_scorer = "roc_auc_ovr"

grid_search = GridSearchCV(
    estimator=pipeline,
    param_grid=param_grid,
    cv=2,  
    scoring=auc_scorer, 
    verbose=1,  
    n_jobs=-1 
)
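
For readers who want to see the search run end to end, the sketch below fits a reduced grid on synthetic four-class data and reads back the best setting. The data, the simplified pipeline and the names `demo_pipeline` and `demo_search` are assumptions made for this example; only the `logistic` step name, `cv=2` and the one-vs-rest AUC scoring mirror the setup above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the quartile labels (not real data).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

demo_pipeline = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("logistic", LogisticRegression(max_iter=500)),
    ]
)

demo_search = GridSearchCV(
    estimator=demo_pipeline,
    param_grid={"logistic__C": [0.1, 1, 10]},
    cv=2,
    scoring="roc_auc_ovr",  # one-vs-rest ROC AUC, macro-averaged
)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)
print(round(demo_search.best_score_, 3))
```

`best_params_` holds the winning combination and `best_score_` the mean cross-validated AUC for it; `cv_results_` contains the scores for every candidate.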

New MIR TB

X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df['MIR_TB_quartile']  # 1-D target, as scikit-learn classifiers expect

New MIR TB-HIV

X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df['MIR_TB_HIV_quartile']  # 1-D target, as scikit-learn classifiers expect

Key Findings

Conclusion

This study demonstrates that incorporating Case Detection Rate (CDR) and applying feature engineering techniques such as log and quantile transformations substantially improve the predictive power of logistic regression models for mortality outcomes in TB and TB-HIV patients. The findings underscore the critical role of early detection systems in reducing mortality and provide actionable insights for public health policies aimed at addressing the dual burden of TB and HIV globally.