The data mining goal of this project is to “Develop a cognitive system able to predict a boolean variable based on the probability that the patient suffers from kidney complication, given blood’s data, personal measurement and the drugs he or she is taking”. I will focus here on the processing of raw textual data, related to the medicines that the patient is taking. The dataframe size is [957 x 10]. Every instance is a patient, and the variables are the drugs that he or she is currently taking, from a minimum of 1 to maximum of 9. This dataset presents many NA values, since not every patient takes 9 drugs. The goal of this script is to create a dataframe containing binary values, based on wheter the patients take the drugs or not. In the final dataframe every instance represents a patient, and the following binary values represent the status of each prescription. There will be a variable for each drug, and the value for each patient is 0 if he or she doesn't take the drug, otherwise 1.
In order to do so, I need to create a header containing unique values of drugs. Two names whose similarity is higher than 79% will be grouped as a single variable. To achieve this, I use the library difflib. The variables in the dataset, shown in a fragment in Fig. XX, present many inconsistencies that make the DataFrame not suitable for a correct modeling.
An example of the original dataset is shown here:
import numpy as np import pandas as pd import csv import difflib #Import datasets data = pd.read_csv("drugs.csv") data.head()
|2||3||Cyclophosphamide||DIGOXINA 0.25||TENORMIN 50||NITROPLAST 5||RENITEC 20||water distiled 40||NaN||NaN||NaN|
|3||4||SERETIDE 50 500 MCG||prednisone 20 --||paracetamol 1G||TERBASMIN 500 MCG||IDEOS||ZARATOR 10 --||ACTONEL 35 --||NaN||NaN|
|4||5||paracetamol 1u||Digoxina 1u/d||Novartis 1u/d||thyle 2u/d||prednisone 1u/d||Cyclophosphamide segúnc||NaN||NaN||NaN|
As one can see, the main problems are:
At this stage, 2637 unique nominal variables are presented in the dataset. The goal of the Data Cleaning is to decrease the number of unique variables, reducing the inconsistencies. In order to solve these issues, a Python script has been developed, following these steps:
#Select variables data = data.iloc[:,1:10] data_cap = pd.DataFrame() #Clean the drug names for i in range(0,9): data_cap[i] = data[data.columns[i]].str.upper() data_cap[i] = data_cap[i].str.split("[0-9]") data_cap[i] = data_cap[i].str data_cap[i] = data_cap[i].str.split(" ") data_cap[i] = data_cap[i].str data_cap = data_cap.fillna("0") data_cap.head()
After this operation was performed, the number of unique variables reduced from 2637 to 1030. Dealing with typos is a challenging task, since there is not a standard procedure which can solve all the typos at once. A possible solution is to group all the values taking into account the similarity in their strings.
A Python script using the function SequenceMatcher() in the library difflib was developed. All the values with an index of similarities higher than 80% were grouped and considered as a single entity.
An example of index of similarities computed is: value1 = “PARACETAMOL” value2 = “PARACETAMON” Index of similarities (value1, value2) = 90.9%
Reminder: the goal is to obtain an array of unique values, representing all the drugs present in the original dataset.
data_cap2 = pd.DataFrame() #Create a signle array containung all the drug names cleaned data_cap2 = data_cap.append(data_cap).append(data_cap).append(data_cap).append(data_cap).append(data_cap).append(data_cap).append(data_cap).append(data_cap) data_cap2 = data_cap2.str.replace(' ','') data_cap2 = data_cap2.str.replace('-','') data_cap2 = data_cap2.str.replace('_','') data_cap2 = data_cap2.str.replace('.','') data_cap2 = data_cap2.unique() name_df = pd.DataFrame(data_cap2) #Delete null values name_df = name_df.dropna() name_df = name_df.sort_values(0,ascending=True) #Keep only the names of the drugs which are less than 79% similar. All the names more similar than 79% are grouped in a signle variable final = name_df for i in range(1,len(name_df)): a = name_df.iloc[i] for j in range(i+1,len(name_df)): b = name_df.iloc[j] seq = difflib.SequenceMatcher(None,a,b) d = seq.ratio()*100 if d > 79: final = final[name_df.str.contains(b) == False]
C:\Users\mauro\Anaconda3\lib\site-packages\ipykernel_launcher.py:25: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
With this operation I created a list of 747 unique variables of drugs, from the initial 2637. A drawback of this computation is that different drugs with very similar names are going to be grouped as single variables. This introduces an error in the modeling, but a quick manual check suggests that this situation would be rare compared to the high number of spelling errors present in the dataset. For this reason, the percentage of error caused by this last step justified.
In total, the whole operation reduced the different values in the dataset by 72%, solving the main issues faced in the data format.
The current variables are 747, which is a number too high to be efficiently processed by machine learning algorithms. To reduce this number, I calculated the frequency of the drugs, finding out that almost 70% of the drugs appear less than 5 times. These variables are not going to give significant improvements in the efficiency, and only the 15 most frequent drugs will be considered as valid variables.
This array is used as header for the final dataframe that will be used for modeling. In the final dataframe, a binary value 0-1 is attributed under the columns of every drug for each patient: 0 if the patient is not taking that drug, 1 if the patient is taking that drug. Once again, all the variables need to be cleaned and associated to the column whose name is at least 79% similar. The header is written as _filename.csv
import chardet #The goal of this script is to create a dataframe containing values 0-1, based on if the patients take or doesn't take the drugs. #Every instance represents a patient, and the following binary values represents the status of the prescription. #The categorical values of the drugs are cleaned, and every drug is associated to a drug whose name is similar (at least 79% similarity) #Detect encoding format with open('final_name.csv', 'rb') as f: result = chardet.detect(f.read()) header = pd.read_csv("final_name.csv",encoding=result['encoding']) header.columns =  #Fill null values with 0 final = pd.DataFrame(index=range(0,data_cap.shape), columns=header) final[:]=0 header_t=header.transpose() data_cap=data_cap.fillna(value='0') # Associate Drugs to the unique value sin the header (which contains unique values, created in the script header.py), # when their similarity is higher than 79% for i in range(0,data_cap.shape): for j in range(0,data_cap.shape): a = data_cap.iloc[i,j] for q in range(0,header.shape): b = header.iloc[q] seq = difflib.SequenceMatcher(None,a,b) d = seq.ratio()*100 if d > 79: pos = header_t.columns[(header_t.values==str(b)).any(0)].tolist() final.iloc[i][pos]=1
final.insert(0, "patient_id", final.index + 1)
Finally, the dataset is the desired one. The value of the drug is 1 only if the patient is taking that drig, otherwise it's 0. The next step, out of scope, is to reduce the number of variables number, calculate their frequency. Almost 70% of the drugs appear less than 5 times, and they are not going to give significant improvements in the efficiency. For this reason, only the 15 most frequent drugs will be considered as valid variables. Then, this dataset is joined with physiological parameters of the patient, and a machine learning model will predict if according to the prescription the patient will suffer from kidney disease.
10 rows × 747 columns