The dataset, Tuberculosis Burden by Country, focuses on the global burden of TB across various countries, with an emphasis on its intersection with HIV. The data provides a detailed overview of country-specific TB burden indicators, which are critical for understanding the public health challenges related to TB and its co-morbidities, including HIV.
Before any cleaning, the dataset contains 5120 rows. The following columns are relevant to the analysis:
ffill
) for each country. This approach assumes that the latest available data is the closest estimate for subsequent missing years and avoids potential biases from averaging.MIR_TB = df['Estimated number of deaths from TB (all forms, excluding HIV)'] / df['Estimated number of incident cases (all forms)']
MIR_TB_HIV = df['Estimated number of deaths from TB in people who are HIV-positive'] / df['Estimated incidence of TB cases who are HIV-positive']
These variables will be employed as predictor variables to help analyze the predictive power of Case Detection Rate (CDR) in country-specific mortality outcomes for patients with a dual burden of Human Immunodeficiency Viruses (HIV) and Tuberculosis (TB).
Country or territory name | Year | Estimated total population number | Estimated prevalence of TB (all forms) | Estimated number of deaths from TB (all forms, excluding HIV) | Estimated number of deaths from TB in people who are HIV-positive | Estimated number of incident cases (all forms) | Estimated HIV in incident TB (percent) | Estimated incidence of TB cases who are HIV-positive | Case detection rate (all forms), percent | MIR_TB | MIR_TB_HIV |
---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan | 2013 | 30551674 | 100000 | 13000 | 82 | 58000 | 0.34 | 200 | 53 | 0.224138 | 0.41 |
Algeria | 2013 | 39208194 | 49000 | 5100 | 35 | 32000 | 0.37 | 120 | 66 | 0.159375 | 0.291667 |
Angola | 2013 | 21471618 | 91000 | 6900 | 1600 | 69000 | 11 | 7500 | 85 | 0.1 | 0.213333 |
Argentina | 2013 | 41446246 | 13000 | 570 | 44 | 10000 | 2.7 | 270 | 89 | 0.057 | 0.162963 |
Armenia | 2013 | 2976566 | 2000 | 170 | 12 | 1500 | 4.5 | 66 | 95 | 0.113333 | 0.181818 |
The closer the MIR is to 1, the higher the mortality rate. From the plots below, we can observe that in cases of a dual burden of TB and HIV, MIRs are significantly higher and exhibit more variation compared to MIRs for TB alone. This suggests that, regardless of whether a country has a higher incidence of TB, its healthcare system is less effective at treating patients who are burdened by both TB and HIV.
Through the scatterplot below, we observe that the rate of decrease in the case detection rate is much higher in countries with a dual burden of TB and HIV. This suggests that the presence of dual burden cases will have a large impact on our model, given the strong correlation detected between the case detection rate and mortality outcomes.
To provide some more insight, I calculated detection-to-prevalence ratios (DPRs) and grouped them by region to understand the relationships between MIRs and DPRs amongst different regions.
df['DPR'] = df['Case detection rate (all forms), percent'] / df['Estimated prevalence of TB (all forms)']
grouped_table = df.groupby(['Region']).agg(
Total_Incidence=('Estimated number of incident cases (all forms)', 'sum'),
Mean_DPR = ('DPR', 'sum'),
Mean_TB_MIR=('MIR_TB', 'mean'),
Mean_TB_HIV_MIR=('MIR_TB_HIV', 'mean')
).reset_index()
Region | Total_Incidence | Mean_DPR | Mean_TB_MIR | Mean_TB_HIV_MIR |
---|---|---|---|---|
AFR | 2.57845e+06 | 0.76766 | 0.129574 | 0.306569 |
AMR | 285057 | 4.47835 | 0.0670003 | 0.196754 |
EMR | 713290 | 0.226031 | 0.133612 | 0.300686 |
EUR | 350150 | 2.1533 | 0.0684857 | 0.148143 |
SEA | 3.357e+06 | 0.00630488 | 0.121748 | 0.246023 |
WPR | 1.60893e+06 | 0.345882 | 0.0853927 | 0.170309 |
The goal of this analysis is to evaluate the predictive power of CDR in determining country-specific mortality outcomes for patients experiencing a dual burden of Human HIV and TB.
MIR_TB_quartile
) and TB-HIV MIR (MIR_TB_HIV_quartile
), derived from the distribution of mortality-to-incidence ratios for TB only and TB-HIV patients. These variables were chosen because they capture the gradient of mortality risk across countries, enabling nuanced predictions rather than a binary outcome.To prepare the data for performing the logistic regression, I binned the MIRs by percentiles so as not to under-assign or over-assign values to specific bins.
quartile_discretizer = KBinsDiscretizer(n_bins=4, encode='ordinal', strategy='quantile')
df['MIR_TB_quartile'] = quartile_discretizer.fit_transform(df[['MIR_TB']]).astype(int) + 1
df['MIR_TB_HIV_quartile'] = quartile_discretizer.fit_transform(df[['MIR_TB_HIV']]).astype(int) + 1
Defining the Logistic Regression model
preprocessor = ColumnTransformer(
transformers=[
("num", numeric_transformer, numeric_features),
("cat", categorical_transformer, categorical_features)
]
)
logistic_model = LogisticRegression(
multi_class="multinomial",
solver="lbfgs",
max_iter=500
)
X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df[['MIR_TB_quartile']]
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.333 | 0.8 | 0.471 | 0.8 | 0.652 | 0.608 |
2 | 0.429 | 0.429 | 0.429 | 0.429 | 0.81 | 0.608 |
3 | 0 | 0 | 0 | 0 | 0.95 | 0.608 |
4 | 0.625 | 0.625 | 0.625 | 0.625 | 0.85 | 0.608 |
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.19 | 0.8 | 0.308 | 0.8 | 0.261 | 0.566 |
2 | 0.667 | 0.286 | 0.4 | 0.286 | 0.952 | 0.566 |
3 | 0 | 0 | 0 | 0 | 0.95 | 0.566 |
4 | 0 | 0 | 0 | 0 | 0.85 | 0.566 |
X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df[['MIR_TB_HIV_quartile']]
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.538 | 0.875 | 0.667 | 0.875 | 0.7 | 0.856 |
2 | 1 | 0.375 | 0.545 | 0.375 | 1 | 0.856 |
3 | 0.6 | 0.6 | 0.6 | 0.6 | 0.913 | 0.856 |
4 | 0.857 | 0.857 | 0.857 | 0.857 | 0.952 | 0.856 |
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.389 | 0.875 | 0.538 | 0.875 | 0.45 | 0.693 |
2 | 0.333 | 0.125 | 0.182 | 0.125 | 0.9 | 0.693 |
3 | 0 | 0 | 0 | 0 | 1 | 0.693 |
4 | 0.571 | 0.571 | 0.571 | 0.571 | 0.857 | 0.693 |
Through the above visualization, we can see that the total population and TB prevalence are extremely right-skewed. Since log transformations efficiently normalize right-skewed data, I have used them to transform those columns. Since the CDR is skewed in the other direction, I have used a quantile transformer to transform it into a uniform distribution. These transformations will ensure our data is less skewed and more robust to outliers.
log_transformer = FunctionTransformer(np.log1p, validate=True)
quantile_transformer = QuantileTransformer(output_distribution='uniform', random_state=42)
numeric_transformer = ColumnTransformer(
transformers=[
("log", log_transformer, ['Estimated total population number', 'Estimated prevalence of TB (all forms)']),
("quantile", quantile_transformer, ['Case detection rate (all forms), percent']),
("scaler", StandardScaler(), numeric_features)
]
)
param_grid = {
"logistic__C": [0.01, 0.1, 1, 10, 100],
"logistic__solver": ["lbfgs", "liblinear"],
"logistic__penalty": ["l1", "l2"],
"logistic__multi_class": ["ovr", "multinomial"],
"logistic__max_iter": [100, 200, 500]
}
auc_scorer = make_scorer(roc_auc_score, needs_proba=True, multi_class="ovr", average="macro")
grid_search = GridSearchCV(
estimator=pipeline,
param_grid=param_grid,
cv=2,
scoring=auc_scorer,
verbose=1,
n_jobs=-1
)
X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df[['MIR_TB_quartile']]
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.286 | 0.8 | 0.421 | 0.8 | 0.565 | 0.686 |
2 | 0.75 | 0.429 | 0.545 | 0.429 | 0.952 | 0.686 |
3 | 0 | 0 | 0 | 0 | 1 | 0.686 |
4 | 0.6 | 0.75 | 0.667 | 0.75 | 0.8 | 0.686 |
Best performing hyperparameters:
{'logistic__C': 0.1, 'logistic__max_iter': 200, 'logistic__multi_class': 'ovr', 'logistic__penalty': 'l1', 'logistic__solver': 'liblinear'}
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.286 | 0.8 | 0.421 | 0.8 | 0.565 | 0.593 |
2 | 0 | 0 | 0 | 0 | 1 | 0.593 |
3 | 0 | 0 | 0 | 0 | 1 | 0.593 |
4 | 0.429 | 0.75 | 0.545 | 0.75 | 0.6 | 0.593 |
Best performing hyperparameters:
{'logistic__C': 0.1, 'logistic__max_iter': 100, 'logistic__multi_class': 'ovr', 'logistic__penalty': 'l1', 'logistic__solver': 'liblinear'}
X = df[['Country or territory name', 'Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']]
y = df[['MIR_TB_HIV_quartile']]
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)', 'Case detection rate (all forms), percent']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.75 | 0.75 | 0.75 | 0.75 | 0.9 | 0.874 |
2 | 0.6 | 0.75 | 0.667 | 0.75 | 0.8 | 0.874 |
3 | 0.5 | 0.4 | 0.444 | 0.4 | 0.913 | 0.874 |
4 | 1 | 0.857 | 0.923 | 0.857 | 1 | 0.874 |
Best performing hyperparameters:
{'logistic__C': 100, 'logistic__max_iter': 500, 'logistic__multi_class': 'multinomial', 'logistic__penalty': 'l2', 'logistic__solver': 'lbfgs'}
numeric_features = ['Estimated total population number', 'Estimated prevalence of TB (all forms)']
categorical_features = ['Country or territory name']
Class | Precision | Recall | F1-Score | Sensitivity | Specificity | AUC |
---|---|---|---|---|---|---|
1 | 0.5 | 0.875 | 0.636 | 0.875 | 0.65 | 0.749 |
2 | 0 | 0 | 0 | 0 | 0.9 | 0.749 |
3 | 0 | 0 | 0 | 0 | 1 | 0.749 |
4 | 0.5 | 0.857 | 0.632 | 0.857 | 0.714 | 0.749 |
Best performing hyperparameters:
{'logistic__C': 100, 'logistic__max_iter': 100, 'logistic__multi_class': 'multinomial', 'logistic__penalty': 'l2', 'logistic__solver': 'lbfgs'}
This study demonstrates that incorporating Case Detection Rate (CDR) and applying feature engineering techniques like log and quantile transformations substantially improve the predictive power of logistic regression models for mortality outcomes in TB and TB-HIV patients. The findings underscore the critical role of early detection systems in reducing mortality and provide actionable insights for public health policies aimed at addressing TB and HIV’s dual burden globally.