1.2 Raw data processing for Diseases Prediction

Raw data processing for Diseases Prediction



Introduction

The data mining goal of this project is to “Develop a cognitive system able to predict a boolean variable based on the probability that the patient suffers from kidney complication, given blood’s data, personal measurement and the drugs he or she is taking”. I will focus here on the processing of raw textual data, related to the medicines that the patient is taking. The dataframe size is [957 x 10]. Every instance is a patient, and the variables are the drugs that he or she is currently taking, from a minimum of 1 to maximum of 9. This dataset presents many NA values, since not every patient takes 9 drugs. The goal of this script is to create a dataframe containing binary values, based on wheter the patients take the drugs or not. In the final dataframe every instance represents a patient, and the following binary values represent the status of each prescription. There will be a variable for each drug, and the value for each patient is 0 if he or she doesn't take the drug, otherwise 1.

In order to do so, I need to create a header containing unique values of drugs. Two names whose similarity is higher than 79% will be grouped as a single variable. To achieve this, I use the library difflib. The variables in the dataset, shown in a fragment in Fig. XX, present many inconsistencies that make the DataFrame not suitable for a correct modeling.

An example of the original dataset is shown here:

In [25]:
import numpy as np
import pandas as pd
import csv
import difflib

#Import datasets
data = pd.read_csv("drugs.csv")
data.head()
Out[25]:
patient_id drug1 drug2 drug3 drug4 drug5 drug6 drug7 drug8 drug9
0 1 Miflonide Foradil Aerolizer NaN NaN NaN NaN NaN NaN NaN
1 2 paracetamol rituximab dumirox daomil tevetens diemil NaN NaN NaN
2 3 Cyclophosphamide DIGOXINA 0.25 TENORMIN 50 NITROPLAST 5 RENITEC 20 water distiled 40 NaN NaN NaN
3 4 SERETIDE 50 500 MCG prednisone 20 -- paracetamol 1G TERBASMIN 500 MCG IDEOS ZARATOR 10 -- ACTONEL 35 -- NaN NaN
4 5 paracetamol 1u Digoxina 1u/d Novartis 1u/d thyle 2u/d prednisone 1u/d Cyclophosphamide segúnc NaN NaN NaN

As one can see, the main problems are:

  • Some variables are lower case, some are upper case.
  • Dosis are often expressed in different ways (1u/d, 1 u/d, 1G, ..)
  • Different spaces and tabs among names
  • Typos in writing the names (Paracetamol - Paracetamon)
  • Null Values
  • Not-alphabetic characters in drug names ( --, /, \, ..)

At this stage, 2637 unique nominal variables are presented in the dataset. The goal of the Data Cleaning is to decrease the number of unique variables, reducing the inconsistencies. In order to solve these issues, a Python script has been developed, following these steps:

  • All the names are capitalized.
  • A clear pattern, composed by [Name + doses] was identified. In order to consider only the name of the drug, every string containing number is split and only the first part is taken into account. The names still contain non-alphanumerical characters.
  • Specific non-alphanumerical characters, such as " ", "--", are deleted.
In [26]:
#Select variables
data = data.iloc[:,1:10]
data_cap = pd.DataFrame()

#Clean the drug names
for i in range(0,9):
    data_cap[i] = data[data.columns[i]].str.upper()
    data_cap[i] = data_cap[i].str.split("[0-9]")
    data_cap[i] = data_cap[i].str[0] 
    data_cap[i] = data_cap[i].str.split(" ")
    data_cap[i] = data_cap[i].str[0] 
data_cap = data_cap.fillna("0")
data_cap.head()
Out[26]:
0 1 2 3 4 5 6 7 8
0 MIFLONIDE FORADIL 0 0 0 0 0 0 0
1 PARACETAMOL RITUXIMAB DUMIROX DAOMIL TEVETENS DIEMIL 0 0 0
2 CYCLOPHOSPHAMIDE DIGOXINA TENORMIN NITROPLAST RENITEC WATER 0 0 0
3 SERETIDE PREDNISONE PARACETAMOL TERBASMIN IDEOS ZARATOR ACTONEL 0 0
4 PARACETAMOL DIGOXINA NOVARTIS THYLE PREDNISONE CYCLOPHOSPHAMIDE 0 0 0

After this operation was performed, the number of unique variables reduced from 2637 to 1030. Dealing with typos is a challenging task, since there is not a standard procedure which can solve all the typos at once. A possible solution is to group all the values taking into account the similarity in their strings.

A Python script using the function SequenceMatcher() in the library difflib was developed. All the values with an index of similarities higher than 80% were grouped and considered as a single entity.

An example of index of similarities computed is: value1 = “PARACETAMOL” value2 = “PARACETAMON” Index of similarities (value1, value2) = 90.9%

Reminder: the goal is to obtain an array of unique values, representing all the drugs present in the original dataset.

In [18]:
data_cap2 = pd.DataFrame()

#Create a signle array containung all the drug names cleaned
data_cap2 = data_cap[0].append(data_cap[1]).append(data_cap[2]).append(data_cap[3]).append(data_cap[4]).append(data_cap[5]).append(data_cap[6]).append(data_cap[7]).append(data_cap[8])
data_cap2 = data_cap2.str.replace(' ','') 
data_cap2 = data_cap2.str.replace('-','')
data_cap2 = data_cap2.str.replace('_','')
data_cap2 = data_cap2.str.replace('.','')
data_cap2 = data_cap2.unique()
name_df = pd.DataFrame(data_cap2)

#Delete null values
name_df = name_df.dropna()
name_df = name_df.sort_values(0,ascending=True)

#Keep only the names of the drugs which are less than 79% similar. All the names more similar than 79% are grouped in a signle variable
final = name_df
for i in range(1,len(name_df[0])):
    a = name_df.iloc[i][0]
    for j in range(i+1,len(name_df[0])):
        b = name_df.iloc[j][0]
        seq = difflib.SequenceMatcher(None,a,b)
        d = seq.ratio()*100
        if d > 79:
            final = final[name_df[0].str.contains(b) == False]
C:\Users\mauro\Anaconda3\lib\site-packages\ipykernel_launcher.py:25: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
In [19]:
final.head(7)
Out[19]:
0
335
5 0
369 A
551 AAS
600 ABILITY
49 ACABEL
77 ACALKA
In [20]:
final.tail(7)
Out[20]:
0
762 ZOCAAR
708 ZOCOOR
921 ZOLAPEX
147 ZOLPIDEM
211 ZYLORIC
283 ZYPREXA
198 ZYTRAM
In [21]:
final.shape
Out[21]:
(748, 1)

With this operation I created a list of 747 unique variables of drugs, from the initial 2637. A drawback of this computation is that different drugs with very similar names are going to be grouped as single variables. This introduces an error in the modeling, but a quick manual check suggests that this situation would be rare compared to the high number of spelling errors present in the dataset. For this reason, the percentage of error caused by this last step justified.

In total, the whole operation reduced the different values in the dataset by 72%, solving the main issues faced in the data format.

The current variables are 747, which is a number too high to be efficiently processed by machine learning algorithms. To reduce this number, I calculated the frequency of the drugs, finding out that almost 70% of the drugs appear less than 5 times. These variables are not going to give significant improvements in the efficiency, and only the 15 most frequent drugs will be considered as valid variables.

This array is used as header for the final dataframe that will be used for modeling. In the final dataframe, a binary value 0-1 is attributed under the columns of every drug for each patient: 0 if the patient is not taking that drug, 1 if the patient is taking that drug. Once again, all the variables need to be cleaned and associated to the column whose name is at least 79% similar. The header is written as _filename.csv

In [27]:
import chardet

#The goal of this script is to create a dataframe containing values 0-1, based on if the patients take or doesn't take the drugs.
#Every instance represents a patient, and the following binary values represents the status of the prescription.
#The categorical values of the drugs are cleaned, and every drug is associated to a drug whose name is similar (at least 79% similarity)

#Detect encoding format
with open('final_name.csv', 'rb') as f:
    result = chardet.detect(f.read()) 
header = pd.read_csv("final_name.csv",encoding=result['encoding'])
header.columns = [0]


#Fill null values with 0
final = pd.DataFrame(index=range(0,data_cap.shape[0]), columns=header[0])
final[:]=0
header_t=header.transpose()
data_cap=data_cap.fillna(value='0')

#   Associate Drugs to the unique value sin the header (which contains unique values, created in the script header.py),
#   when their similarity is higher than 79%

for i in range(0,data_cap.shape[0]):
    for j in range(0,data_cap.shape[1]):
        a = data_cap.iloc[i,j]
        for q in range(0,header.shape[0]):
            b = header.iloc[q][0]
            seq = difflib.SequenceMatcher(None,a,b)
            d = seq.ratio()*100
            if d > 79:
                pos = header_t.columns[(header_t.values==str(b)).any(0)].tolist()
                final.iloc[i][pos]=1
In [28]:
final.insert(0, "patient_id", final.index + 1)

Finally, the dataset is the desired one. The value of the drug is 1 only if the patient is taking that drig, otherwise it's 0. The next step, out of scope, is to reduce the number of variables number, calculate their frequency. Almost 70% of the drugs appear less than 5 times, and they are not going to give significant improvements in the efficiency. For this reason, only the 15 most frequent drugs will be considered as valid variables. Then, this dataset is joined with physiological parameters of the patient, and a machine learning model will predict if according to the prescription the patient will suffer from kidney disease.

In [29]:
final.head(10)
Out[29]:
patient_id A AAS ABILITY ACABEL ACALKA ACARBOSA ACEDIUR ACETAMINOPHEN ACETENSIL ... ZESTRIL ZIGURIT ZIMBALTA ZOCAAR ZOCOOR ZOLAPEX ZOLPIDEM ZYLORIC ZYPREXA ZYTRAM
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 2 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 4 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 5 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 6 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 7 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 8 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
9 10 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

10 rows × 747 columns