For data cleaning section, beside solving the common issues like missing values, duplicate contents, format problem, since we have ‘Complaints’ row with plain texts values, we also processed texts in this section.
We will generate two csv files in this section.
One is a basically processed file, simply deal with formatting issues and drop useless columns. What’s quite important is that we recategorized product column. The product column used to be not balance. There are some value appear very frequently and some value appear only less than 10 times. We combined product into different category. We used this dataset for EDA since it contains the complete data.
Another is a processed file mainly focusing on text data. We did much text cleaning, processing, and analysis in this part. One of the key step is that we did sentiment analysis with VADAR and extract the negative score and compound score from the result to form different columns. We also rescaled negative score using min-max scaling. We used this dataset for later supervised and unsupervised learning since the data in this dataset is well-prepared.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13377 entries, 0 to 13376
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Complaint_ID 13377 non-null int64
1 Date Received 13377 non-null object
2 Timely 13377 non-null object
3 Submitted via 13377 non-null object
4 Product 13377 non-null object
5 Sub-product 13350 non-null object
6 Issue 13377 non-null object
7 Sub-issue 11968 non-null object
8 Has Narrative 13377 non-null bool
9 Complaint 6002 non-null object
10 State 13377 non-null object
11 ZIP Code 13373 non-null object
12 Tags 1306 non-null object
13 Company 13377 non-null object
14 Company Response 13377 non-null object
15 Company Public Response 7092 non-null object
dtypes: bool(1), int64(1), object(14)
memory usage: 1.5+ MB
1. General Cleaning
Initially, we can observe that most data in our dataset are categorical data, like product, sub-product, issue, sub-issue, tags, and so on. At the first glance, we noticed that the format of the date is not correct, and there are useless columns like State (we chose to use data from DC when we use API to obtain our data), and ZIP Code, so firstly, we will deal with these problems in following steps:
We mainly used ‘Product’ column in later analysis, but we noticed that products had been categorized fragmented and the value counts are so not balanced, so we decided to recategorize the product categories into 6 more general categories and save these categories in a new column.
customer_complaints['Product'].value_counts()
Product
Credit reporting, credit repair services, or other personal consumer reports 7798
Debt collection 1248
Credit card or prepaid card 1129
Checking or savings account 1125
Mortgage 873
Student loan 406
Money transfer, virtual currency, or money service 313
Vehicle loan or lease 237
Payday loan, title loan, or personal loan 117
Credit reporting or other personal consumer reports 90
Credit card 15
Credit reporting 15
Bank account or service 7
Payday loan, title loan, personal loan, or advance loan 2
Payday loan 1
Consumer Loan 1
Name: count, dtype: int64
# Reorder columnsnew_column_order = ['Complaint_ID', 'Tags', 'Date', 'Timely', 'Submitted via', 'Company', 'Category', 'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Has Narrative', 'Complaint', 'Company Response', 'Company Public Response']customer_complaints = customer_complaints[new_column_order]# Save modified complete data frame to a csv filecustomer_complaints.to_csv("../../data/processed-data/customer_complaints.csv", index =False)customer_complaints.head(3)
Complaint_ID
Tags
Date
Timely
Submitted via
Company
Category
Product
Sub-product
Issue
Sub-issue
Has Narrative
Complaint
Company Response
Company Public Response
0
7640401
NaN
2023-08-27
Yes
Web
TRANSUNION INTERMEDIATE HOLDINGS, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Personal information incorrect
False
NaN
Closed with explanation
Company has responded to the consumer and the ...
1
7640321
NaN
2023-08-26
Yes
Web
TRANSUNION INTERMEDIATE HOLDINGS, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
False
NaN
Closed with non-monetary relief
Company has responded to the consumer and the ...
2
7640280
NaN
2023-08-27
Yes
Web
TRANSUNION INTERMEDIATE HOLDINGS, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
False
NaN
Closed with explanation
Company has responded to the consumer and the ...
In this first part of data cleaning, we didn’t do much changes and maintained almost all data without removing any rows. We mainly use this dataframe to do the general EDA.
2. Text Cleaning
In this section of data cleaning, we deal with the text column “Complaint” in following steps: 1. General cleaning - Remove rows whose ‘Has Narrative’ column has False value, which means we build a new dataframe with those entities that has complaints narrrative. - Remove useless columns. We remove ‘Has Narrative’ column and removed ‘Submitted via’ column if this column has same value in each row.
Dive into Texts
We noticed that some complaints contain amounts of money. We extract the largest amounts in complaints and store them in a new column.
Remove all puntuations, numbers, extra blank spaces from texts. And we noticed that there are texts containing encrypted information presented as continuous ‘X’. We removed continuous ‘X’ also and transform the texts to lower case. We stored the cleaned texts in a new column.
We also count the length of the cleaned texts and store the lengths in a new column.
Text Analysis
Since users edited the complaint texts, the texts contains emotions. Therefore, we did a sentiment analysis with VADAR and saved the negative value and compound score of the results to new columns.
# Extract the largest amounts in complaintsdef get_largest_amount(text): amounts = re.findall(r'\{\$(\d+(\.\d{1,2})?)\}', text)iflen(amounts) >0: amounts = [float(amount[0]) for amount in amounts] largest_amount =max(amounts) if amounts elseNonereturn largest_amountreturn0df_has_narrative['largest_amount'] = [get_largest_amount(text) for text in df_has_narrative['Complaint']]
# Remove punctuations, numbers, meaningless contents, extra blank spaces, and transform text to lower casedef preprocess_text(text): text = text.replace('-', ' ') text = text.replace('_', ' ') text = re.sub(r'[^\w\s]', '', text) text = re.sub(r'\d+', '', text) text = re.sub(r'X+', '', text) text = text.lower() text = re.sub(r'\s+', ' ', text).strip()return textdf_has_narrative['cleaned_complaints'] = [preprocess_text(complaint) for complaint in df_has_narrative['Complaint']]
# Store cleaned texts' lengthdf_has_narrative['Clean Complaint Length'] = [len(text) for text in df_has_narrative['cleaned_complaints']]df_has_narrative.head(3)
Complaint_ID
Tags
Date
Timely
Company
Category
Product
Sub-product
Issue
Sub-issue
Complaint
Company Response
Company Public Response
largest_amount
cleaned_complaints
Clean Complaint Length
0
7485989
NaN
2023-08-31
Yes
EQUIFAX, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with non-monetary relief
NaN
0.0
on i disputed an account via notarized affidav...
1524
1
7484469
NaN
2023-08-31
Yes
TRANSUNION INTERMEDIATE HOLDINGS, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with non-monetary relief
Company has responded to the consumer and the ...
0.0
on i disputed an account via notarized affidav...
1535
2
7484234
NaN
2023-08-31
Yes
Experian Information Solutions Inc.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with explanation
Company has responded to the consumer and the ...
0.0
on i disputed an account via notarized affidav...
1535
2.3 Simple Text Analysis - Sentiment Analysis
# Do sentiment analysis and store the compound and negative scores in new columnssentiment_score = []negative_score = []sia = SentimentIntensityAnalyzer()for text in df_has_narrative['cleaned_complaints']: score = sia.polarity_scores(text) sentiment_score.append(score['compound']) negative_score.append(score['neg'])df_has_narrative['sentiment_score'] = sentiment_scoredf_has_narrative['negative-score'] = negative_scoredf_has_narrative['negative-score'].describe()
count 6002.000000
mean 0.088225
std 0.062562
min 0.000000
25% 0.047000
50% 0.077000
75% 0.119000
max 0.747000
Name: negative-score, dtype: float64
count 5798.000000
mean 0.081460
std 0.050669
min 0.000000
25% 0.046000
50% 0.075000
75% 0.114000
max 0.226000
Name: negative-score, dtype: float64
# Use min-max scaling to rescale negative scorey = df_no_outliers['negative-score']df_no_outliers['negative-score'] = (y -min(y)) / (max(y) -min(y))df_no_outliers.head(3)
/var/folders/s7/7ttf9mj16ms92j4k34q3lghr0000gn/T/ipykernel_26917/2217140667.py:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
Complaint_ID
Tags
Date
Timely
Company
Category
Product
Sub-product
Issue
Sub-issue
Complaint
Company Response
Company Public Response
largest_amount
cleaned_complaints
Clean Complaint Length
sentiment_score
negative-score
0
7485989
NaN
2023-08-31
Yes
EQUIFAX, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with non-monetary relief
NaN
0.0
on i disputed an account via notarized affidav...
1524
-0.9753
0.553097
1
7484469
NaN
2023-08-31
Yes
TRANSUNION INTERMEDIATE HOLDINGS, INC.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with non-monetary relief
Company has responded to the consumer and the ...
0.0
on i disputed an account via notarized affidav...
1535
-0.9753
0.548673
2
7484234
NaN
2023-08-31
Yes
Experian Information Solutions Inc.
Credit reporting
Credit reporting or other personal consumer re...
Credit reporting
Incorrect information on your report
Information belongs to someone else
On XX/XX/XXXX I disputed an account via notari...
Closed with explanation
Company has responded to the consumer and the ...
0.0
on i disputed an account via notarized affidav...
1535
-0.9753
0.548673
Since we noticed that the distribution of negative-score is not balance, we used min-max scaling to rescaled the negative scores to in range of 0 to 1.
Save to file
df_no_outliers.to_csv("../../data/processed-data/complaints.csv", index =False)
This is the complete processed dataframe. We saved this dataframe to file and this will be used in later supervised and unsupervised learning.