Introduction¶

The data mining goal of this project is to “Develop a cognitive system able to predict a boolean variable based on the probability that the patient suffers from kidney complication, given blood’s data, personal measurement and the drugs he or she is taking”. I will focus here on the processing of raw textual data, related to the medicines that the patient is taking. The dataframe size is [957 x 10]. Every instance is a patient, and the variables are the drugs that he or she is currently taking, from a minimum of 1 to maximum of 9. This dataset presents many NA values, since not every patient takes 9 drugs. The goal of this script is to create a dataframe containing binary values, based on wheter the patients take the drugs or not. In the final dataframe every instance represents a patient, and the following binary values represent the status of each prescription. There will be a variable for each drug, and the value for each patient is 0 if he or she doesn't take the drug, otherwise 1.

In order to do so, I need to create a header containing unique values of drugs. Two names whose similarity is higher than 79% will be grouped as a single variable. To achieve this, I use the library difflib. The variables in the dataset, shown in a fragment in Fig. XX, present many inconsistencies that make the DataFrame not suitable for a correct modeling.

An example of the original dataset is shown here:

import numpy as np
import pandas as pd
import csv
import difflib

#Import datasets
data = pd.read_csv("drugs.csv")
data.head()

As one can see, the main problems are:

Some variables are lower case, some are upper case.
Dosis are often expressed in different ways (1u/d, 1 u/d, 1G, ..)
Different spaces and tabs among names
Typos in writing the names (Paracetamol - Paracetamon)
Null Values
Not-alphabetic characters in drug names ( --, /, \, ..)

At this stage, 2637 unique nominal variables are presented in the dataset. The goal of the Data Cleaning is to decrease the number of unique variables, reducing the inconsistencies. In order to solve these issues, a Python script has been developed, following these steps:

All the names are capitalized.
A clear pattern, composed by [Name + doses] was identified. In order to consider only the name of the drug, every string containing number is split and only the first part is taken into account. The names still contain non-alphanumerical characters.
Specific non-alphanumerical characters, such as " ", "--", are deleted.

#Select variables
data = data.iloc[:,1:10]
data_cap = pd.DataFrame()

#Clean the drug names
for i in range(0,9):
    data_cap[i] = data[data.columns[i]].str.upper()
    data_cap[i] = data_cap[i].str.split("[0-9]")
    data_cap[i] = data_cap[i].str[0] 
    data_cap[i] = data_cap[i].str.split(" ")
    data_cap[i] = data_cap[i].str[0] 
data_cap = data_cap.fillna("0")
data_cap.head()

After this operation was performed, the number of unique variables reduced from 2637 to 1030. Dealing with typos is a challenging task, since there is not a standard procedure which can solve all the typos at once. A possible solution is to group all the values taking into account the similarity in their strings.

A Python script using the function SequenceMatcher() in the library difflib was developed. All the values with an index of similarities higher than 80% were grouped and considered as a single entity.

An example of index of similarities computed is: value1 = “PARACETAMOL” value2 = “PARACETAMON” Index of similarities (value1, value2) = 90.9%

Reminder: the goal is to obtain an array of unique values, representing all the drugs present in the original dataset.

data_cap2 = pd.DataFrame()

#Create a signle array containung all the drug names cleaned
data_cap2 = data_cap[0].append(data_cap[1]).append(data_cap[2]).append(data_cap[3]).append(data_cap[4]).append(data_cap[5]).append(data_cap[6]).append(data_cap[7]).append(data_cap[8])
data_cap2 = data_cap2.str.replace(' ','') 
data_cap2 = data_cap2.str.replace('-','')
data_cap2 = data_cap2.str.replace('_','')
data_cap2 = data_cap2.str.replace('.','')
data_cap2 = data_cap2.unique()
name_df = pd.DataFrame(data_cap2)

#Delete null values
name_df = name_df.dropna()
name_df = name_df.sort_values(0,ascending=True)

#Keep only the names of the drugs which are less than 79% similar. All the names more similar than 79% are grouped in a signle variable
final = name_df
for i in range(1,len(name_df[0])):
    a = name_df.iloc[i][0]
    for j in range(i+1,len(name_df[0])):
        b = name_df.iloc[j][0]
        seq = difflib.SequenceMatcher(None,a,b)
        d = seq.ratio()*100
        if d > 79:
            final = final[name_df[0].str.contains(b) == False]

C:\Users\mauro\Anaconda3\lib\site-packages\ipykernel_launcher.py:25: UserWarning: Boolean Series key will be reindexed to match DataFrame index.

final.head(7)

final.tail(7)

final.shape

(748, 1)

With this operation I created a list of 747 unique variables of drugs, from the initial 2637. A drawback of this computation is that different drugs with very similar names are going to be grouped as single variables. This introduces an error in the modeling, but a quick manual check suggests that this situation would be rare compared to the high number of spelling errors present in the dataset. For this reason, the percentage of error caused by this last step justified.

In total, the whole operation reduced the different values in the dataset by 72%, solving the main issues faced in the data format.

The current variables are 747, which is a number too high to be efficiently processed by machine learning algorithms. To reduce this number, I calculated the frequency of the drugs, finding out that almost 70% of the drugs appear less than 5 times. These variables are not going to give significant improvements in the efficiency, and only the 15 most frequent drugs will be considered as valid variables.

This array is used as header for the final dataframe that will be used for modeling. In the final dataframe, a binary value 0-1 is attributed under the columns of every drug for each patient: 0 if the patient is not taking that drug, 1 if the patient is taking that drug. Once again, all the variables need to be cleaned and associated to the column whose name is at least 79% similar. The header is written as _filename.csv

import chardet

#The goal of this script is to create a dataframe containing values 0-1, based on if the patients take or doesn't take the drugs.
#Every instance represents a patient, and the following binary values represents the status of the prescription.
#The categorical values of the drugs are cleaned, and every drug is associated to a drug whose name is similar (at least 79% similarity)

#Detect encoding format
with open('final_name.csv', 'rb') as f:
    result = chardet.detect(f.read()) 
header = pd.read_csv("final_name.csv",encoding=result['encoding'])
header.columns = [0]


#Fill null values with 0
final = pd.DataFrame(index=range(0,data_cap.shape[0]), columns=header[0])
final[:]=0
header_t=header.transpose()
data_cap=data_cap.fillna(value='0')

#   Associate Drugs to the unique value sin the header (which contains unique values, created in the script header.py),
#   when their similarity is higher than 79%

for i in range(0,data_cap.shape[0]):
    for j in range(0,data_cap.shape[1]):
        a = data_cap.iloc[i,j]
        for q in range(0,header.shape[0]):
            b = header.iloc[q][0]
            seq = difflib.SequenceMatcher(None,a,b)
            d = seq.ratio()*100
            if d > 79:
                pos = header_t.columns[(header_t.values==str(b)).any(0)].tolist()
                final.iloc[i][pos]=1

final.insert(0, "patient_id", final.index + 1)

Finally, the dataset is the desired one. The value of the drug is 1 only if the patient is taking that drig, otherwise it's 0. The next step, out of scope, is to reduce the number of variables number, calculate their frequency. Almost 70% of the drugs appear less than 5 times, and they are not going to give significant improvements in the efficiency. For this reason, only the 15 most frequent drugs will be considered as valid variables. Then, this dataset is joined with physiological parameters of the patient, and a machine learning model will predict if according to the prescription the patient will suffer from kidney disease.

final.head(10)

	patient_id	drug1	drug2	drug3	drug4	drug5	drug6	drug7	drug8	drug9
0	1	Miflonide	Foradil Aerolizer	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	2	paracetamol	rituximab	dumirox	daomil	tevetens	diemil	NaN	NaN	NaN
2	3	Cyclophosphamide	DIGOXINA 0.25	TENORMIN 50	NITROPLAST 5	RENITEC 20	water distiled 40	NaN	NaN	NaN
3	4	SERETIDE 50 500 MCG	prednisone 20 --	paracetamol 1G	TERBASMIN 500 MCG	IDEOS	ZARATOR 10 --	ACTONEL 35 --	NaN	NaN
4	5	paracetamol 1u	Digoxina 1u/d	Novartis 1u/d	thyle 2u/d	prednisone 1u/d	Cyclophosphamide segúnc	NaN	NaN	NaN

	0	1	2	3	4	5	6
0	MIFLONIDE	FORADIL	0	0	0	0	0
1	PARACETAMOL	RITUXIMAB	DUMIROX	DAOMIL	TEVETENS	DIEMIL	0
2	CYCLOPHOSPHAMIDE	DIGOXINA	TENORMIN	NITROPLAST	RENITEC	WATER	0
3	SERETIDE	PREDNISONE	PARACETAMOL	TERBASMIN	IDEOS	ZARATOR	ACTONEL
4	PARACETAMOL	DIGOXINA	NOVARTIS	THYLE	PREDNISONE	CYCLOPHOSPHAMIDE	0

	0
335
5	0
369	A
551	AAS
600	ABILITY
49	ACABEL
77	ACALKA

	0
762	ZOCAAR
708	ZOCOOR
921	ZOLAPEX
147	ZOLPIDEM
211	ZYLORIC
283	ZYPREXA
198	ZYTRAM

	patient_id	ACETAMINOPHEN	...	ZOCOOR
0	1	0	...	0
1	2	0	...	0
2	3	0	...	0
3	4	0	...	0
4	5	0	...	0
5	6	0	...	0
6	7	0	...	0
7	8	0	...	1
8	9	1	...	0
9	10	0	...	0

Raw data processing for Diseases Prediction

Introduction¶

Header¶

	patient_id	ACETAMINOPHEN	...	ZOCOOR
0	1	0	...	0
1	2	0	...	0
2	3	0	...	0
3	4	0	...	0
4	5	0	...	0
5	6	0	...	0
6	7	0	...	0
7	8	0	...	1
8	9	1	...	0
9	10	0	...	0

	patient_id	ACETAMINOPHEN	...	ZOCOOR
0	1	0	...	0
1	2	0	...	0
2	3	0	...	0
3	4	0	...	0
4	5	0	...	0
5	6	0	...	0
6	7	0	...	0
7	8	0	...	1
8	9	1	...	0
9	10	0	...	0

	patient_id	ACETAMINOPHEN	...	ZOCOOR
0	1	0	...	0
1	2	0	...	0
2	3	0	...	0
3	4	0	...	0
4	5	0	...	0
5	6	0	...	0
6	7	0	...	0
7	8	0	...	1
8	9	1	...	0
9	10	0	...	0