Data Cleaning

Introduction

For data cleaning section, beside solving the common issues like missing values, duplicate contents, format problem, since we have ‘Complaints’ row with plain texts values, we also processed texts in this section.

We will generate two csv files in this section.

One is a basically processed file, simply deal with formatting issues and drop useless columns. What’s quite important is that we recategorized product column. The product column used to be not balance. There are some value appear very frequently and some value appear only less than 10 times. We combined product into different category. We used this dataset for EDA since it contains the complete data.
Another is a processed file mainly focusing on text data. We did much text cleaning, processing, and analysis in this part. One of the key step is that we did sentiment analysis with VADAR and extract the negative score and compound score from the result to form different columns. We also rescaled negative score using min-max scaling. We used this dataset for later supervised and unsupervised learning since the data in this dataset is well-prepared.

import pandas as pd
import re
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

import ssl
ssl._create_default_https_context = ssl._create_unverified_context
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/qing0304/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!

True

Overview our raw dataset

customer_complaints = pd.read_csv("../../data/raw-data/customer_complaints.csv")
customer_complaints.head(3)

	Complaint_ID	Date Received	Timely	Submitted via	Product	Sub-product	Issue	Sub-issue	Has Narrative	Complaint	State	ZIP Code	Tags	Company	Company Response	Company Public Response
0	7640401	2023-08-27T12:00:00-05:00	Yes	Web	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Personal information incorrect	False	NaN	DC	20020	NaN	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Closed with explanation	Company has responded to the consumer and the ...
1	7640321	2023-08-26T12:00:00-05:00	Yes	Web	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	DC	20011	NaN	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Closed with non-monetary relief	Company has responded to the consumer and the ...
2	7640280	2023-08-27T12:00:00-05:00	Yes	Web	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	DC	20002	NaN	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Closed with explanation	Company has responded to the consumer and the ...

customer_complaints.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13377 entries, 0 to 13376
Data columns (total 16 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Complaint_ID             13377 non-null  int64 
 1   Date Received            13377 non-null  object
 2   Timely                   13377 non-null  object
 3   Submitted via            13377 non-null  object
 4   Product                  13377 non-null  object
 5   Sub-product              13350 non-null  object
 6   Issue                    13377 non-null  object
 7   Sub-issue                11968 non-null  object
 8   Has Narrative            13377 non-null  bool  
 9   Complaint                6002 non-null   object
 10  State                    13377 non-null  object
 11  ZIP Code                 13373 non-null  object
 12  Tags                     1306 non-null   object
 13  Company                  13377 non-null  object
 14  Company Response         13377 non-null  object
 15  Company Public Response  7092 non-null   object
dtypes: bool(1), int64(1), object(14)
memory usage: 1.5+ MB

1. General Cleaning

Initially, we can observe that most data in our dataset are categorical data, like product, sub-product, issue, sub-issue, tags, and so on. At the first glance, we noticed that the format of the date is not correct, and there are useless columns like State (we chose to use data from DC when we use API to obtain our data), and ZIP Code, so firstly, we will deal with these problems in following steps:

Basic processing

Formatting date
Remove useless columns

Recategorize categorical data
Reorder columns and save the modifies dataframe

1.1 Basic Processing

# Reformat Date values
customer_complaints['Date'] = pd.to_datetime(customer_complaints['Date Received']).dt.date

# Remove useless columns
customer_complaints = customer_complaints.drop(columns = ['State', 'ZIP Code', 'Date Received'])

1.2 Dive into categorical data ‘Product’

We mainly used ‘Product’ column in later analysis, but we noticed that products had been categorized fragmented and the value counts are so not balanced, so we decided to recategorize the product categories into 6 more general categories and save these categories in a new column.

customer_complaints['Product'].value_counts()

Product
Credit reporting, credit repair services, or other personal consumer reports    7798
Debt collection                                                                 1248
Credit card or prepaid card                                                     1129
Checking or savings account                                                     1125
Mortgage                                                                         873
Student loan                                                                     406
Money transfer, virtual currency, or money service                               313
Vehicle loan or lease                                                            237
Payday loan, title loan, or personal loan                                        117
Credit reporting or other personal consumer reports                               90
Credit card                                                                       15
Credit reporting                                                                  15
Bank account or service                                                            7
Payday loan, title loan, personal loan, or advance loan                            2
Payday loan                                                                        1
Consumer Loan                                                                      1
Name: count, dtype: int64

def categorize_product(product):
    if "Credit reporting" in product or "Credit repair" in product:
        return "Credit reporting"
    elif "Loan" in product or "Mortgage" in product or "Vehicle loan" in product or "Student loan" in product or "Payday loan" in product:
        return "Loan"
    elif "Credit card" in product or "Prepaid card" in product:
        return "Credit card"
    elif "Checking" in product or "Savings account" in product or "Bank account" in product:
        return "Banking"
    elif "Money transfer" in product or "Virtual currency" in product:
        return "Money transfer"
    elif "Debt collection" in product:
        return "Debt collection"
    else:
        return "Others"

customer_complaints['Category'] = customer_complaints['Product'].apply(categorize_product)
customer_complaints['Category'].value_counts()

Category
Credit reporting    7903
Loan                1637
Debt collection     1248
Credit card         1144
Banking             1132
Money transfer       313
Name: count, dtype: int64

1.3 Reorder columns and save to file

# Reorder columns
new_column_order = ['Complaint_ID', 'Tags', 'Date', 'Timely', 'Submitted via', 'Company', 'Category', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Has Narrative', 'Complaint', 'Company Response', 'Company Public Response']
customer_complaints = customer_complaints[new_column_order]

# Save modified complete data frame to a csv file
customer_complaints.to_csv("../../data/processed-data/customer_complaints.csv", index = False)
customer_complaints.head(3)

	Complaint_ID	Tags	Date	Timely	Submitted via	Company	Category	Product	Sub-product	Issue	Sub-issue	Has Narrative	Complaint	Company Response	Company Public Response
0	7640401	NaN	2023-08-27	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Personal information incorrect	False	NaN	Closed with explanation	Company has responded to the consumer and the ...
1	7640321	NaN	2023-08-26	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	Closed with non-monetary relief	Company has responded to the consumer and the ...
2	7640280	NaN	2023-08-27	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	Closed with explanation	Company has responded to the consumer and the ...

In this first part of data cleaning, we didn’t do much changes and maintained almost all data without removing any rows. We mainly use this dataframe to do the general EDA.

2. Text Cleaning

In this section of data cleaning, we deal with the text column “Complaint” in following steps: 1. General cleaning - Remove rows whose ‘Has Narrative’ column has False value, which means we build a new dataframe with those entities that has complaints narrrative. - Remove useless columns. We remove ‘Has Narrative’ column and removed ‘Submitted via’ column if this column has same value in each row.

Dive into Texts

We noticed that some complaints contain amounts of money. We extract the largest amounts in complaints and store them in a new column.
Remove all puntuations, numbers, extra blank spaces from texts. And we noticed that there are texts containing encrypted information presented as continuous ‘X’. We removed continuous ‘X’ also and transform the texts to lower case. We stored the cleaned texts in a new column.
We also count the length of the cleaned texts and store the lengths in a new column.

Text Analysis

Since users edited the complaint texts, the texts contains emotions. Therefore, we did a sentiment analysis with VADAR and saved the negative value and compound score of the results to new columns.
Analyze negative-score column and proceed.

Save to file

2.1 General Text Cleaning

# Build a new dataframe with rows containing complaint texts
df_has_narrative = customer_complaints[customer_complaints['Has Narrative'] == True].reset_index(drop = True)

# Remove useless columns
df_has_narrative = df_has_narrative.drop(columns = ['Submitted via', 'Has Narrative']) if len(df_has_narrative['Submitted via'].unique()) == 1 else df_has_narrative.drop(columns = ['Has Narrative'])
df_has_narrative.head(3)

	Complaint_ID	Tags	Date	Timely	Company	Category	Product	Sub-product	Issue	Sub-issue	Complaint	Company Response	Company Public Response
0	7485989	NaN	2023-08-31	Yes	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	NaN
1	7484469	NaN	2023-08-31	Yes	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	Company has responded to the consumer and the ...
2	7484234	NaN	2023-08-31	Yes	Experian Information Solutions Inc.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with explanation	Company has responded to the consumer and the ...

2.2 Dive into Texts

# Extract the largest amounts in complaints
def get_largest_amount(text):
    amounts = re.findall(r'\{\$(\d+(\.\d{1,2})?)\}', text)
    if len(amounts) > 0:
        amounts = [float(amount[0]) for amount in amounts]
        largest_amount = max(amounts) if amounts else None
        return largest_amount
    return 0

df_has_narrative['largest_amount'] = [get_largest_amount(text) for text in df_has_narrative['Complaint']]

# Remove punctuations, numbers, meaningless contents, extra blank spaces, and transform text to lower case
def preprocess_text(text):
    text = text.replace('-', ' ')
    text = text.replace('_', ' ')
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'X+', '', text)
    text = text.lower()
    text = re.sub(r'\s+', ' ', text).strip()
    return text

df_has_narrative['cleaned_complaints'] = [preprocess_text(complaint) for complaint in df_has_narrative['Complaint']]

# Store cleaned texts' length
df_has_narrative['Clean Complaint Length'] = [len(text) for text in df_has_narrative['cleaned_complaints']]
df_has_narrative.head(3)

	Complaint_ID	Tags	Date	Timely	Company	Category	Product	Sub-product	Issue	Sub-issue	Complaint	Company Response	Company Public Response	cleaned_complaints	Clean Complaint Length
0	7485989	NaN	2023-08-31	Yes	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	NaN	on i disputed an account via notarized affidav...	1524
1	7484469	NaN	2023-08-31	Yes	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	Company has responded to the consumer and the ...	on i disputed an account via notarized affidav...	1535
2	7484234	NaN	2023-08-31	Yes	Experian Information Solutions Inc.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with explanation	Company has responded to the consumer and the ...	on i disputed an account via notarized affidav...	1535

2.3 Simple Text Analysis - Sentiment Analysis

# Do sentiment analysis and store the compound and negative scores in new columns
sentiment_score = []
negative_score = []
sia = SentimentIntensityAnalyzer()

for text in df_has_narrative['cleaned_complaints']:
    score = sia.polarity_scores(text)
    sentiment_score.append(score['compound'])
    negative_score.append(score['neg'])

df_has_narrative['sentiment_score'] = sentiment_score
df_has_narrative['negative-score'] = negative_score
df_has_narrative['negative-score'].describe()

count    6002.000000
mean        0.088225
std         0.062562
min         0.000000
25%         0.047000
50%         0.077000
75%         0.119000
max         0.747000
Name: negative-score, dtype: float64

# Drop negative score outliers
Q1 = df_has_narrative['negative-score'].quantile(0.25)
Q3 = df_has_narrative['negative-score'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers = df_has_narrative[(df_has_narrative['negative-score'] >= lower_bound) & (df_has_narrative['negative-score'] <= upper_bound)]
df_no_outliers['negative-score'].describe()

count    5798.000000
mean        0.081460
std         0.050669
min         0.000000
25%         0.046000
50%         0.075000
75%         0.114000
max         0.226000
Name: negative-score, dtype: float64

# Use min-max scaling to rescale negative score
y = df_no_outliers['negative-score']
df_no_outliers['negative-score']  = (y - min(y)) / (max(y) - min(y))
df_no_outliers.head(3)

/var/folders/s7/7ttf9mj16ms92j4k34q3lghr0000gn/T/ipykernel_26917/2217140667.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

	Complaint_ID	Tags	Date	Timely	Company	Category	Product	Sub-product	Issue	Sub-issue	Complaint	Company Response	Company Public Response	cleaned_complaints	Clean Complaint Length	sentiment_score	negative-score
0	7485989	NaN	2023-08-31	Yes	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	NaN	on i disputed an account via notarized affidav...	1524	-0.9753	0.553097
1	7484469	NaN	2023-08-31	Yes	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	Company has responded to the consumer and the ...	on i disputed an account via notarized affidav...	1535	-0.9753	0.548673
2	7484234	NaN	2023-08-31	Yes	Experian Information Solutions Inc.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with explanation	Company has responded to the consumer and the ...	on i disputed an account via notarized affidav...	1535	-0.9753	0.548673

Since we noticed that the distribution of negative-score is not balance, we used min-max scaling to rescaled the negative scores to in range of 0 to 1.

Save to file

df_no_outliers.to_csv("../../data/processed-data/complaints.csv", index = False)

This is the complete processed dataframe. We saved this dataframe to file and this will be used in later supervised and unsupervised learning.