Exploratory Data Analysis

With the data cleaned and prepared, the next logical step was to explore the dataset in depth through Exploratory Data Analysis (EDA). This phase allowed for a closer look at the patterns, trends, and relationships within the data, providing the groundwork for understanding consumer complaints and company responses. By visualizing and summarizing the key variables, EDA helped uncover insights into the types of complaints, the companies involved, and the timeliness of responses. This analysis set the stage for more detailed investigations and informed subsequent steps in the project.

Import packages and data

# Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objects as go
import plotly.express as px

# Load the dataset
df = pd.read_csv("../../data/processed-data/customer_complaints.csv")
df.head()

	Complaint_ID	Tags	Date	Timely	Submitted via	Company	Category	Product	Sub-product	Issue	Sub-issue	Has Narrative	Complaint	Company Response	Company Public Response
0	7640401	NaN	2023-08-27	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Personal information incorrect	False	NaN	Closed with explanation	Company has responded to the consumer and the ...
1	7640321	NaN	2023-08-26	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	Closed with non-monetary relief	Company has responded to the consumer and the ...
2	7640280	NaN	2023-08-27	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	Closed with explanation	Company has responded to the consumer and the ...
3	7639311	NaN	2023-08-26	Yes	Web	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Improper use of your report	Credit inquiries on your report that you don't...	False	NaN	Closed with non-monetary relief	Company has responded to the consumer and the ...
4	7615026	NaN	2023-08-26	Yes	Web	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	False	NaN	Closed with non-monetary relief	NaN

# Set figure parameters
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 10
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 12

Overview of the Dataset

print("----------------------", "GENERAL INFORMATION:", "----------------------", sep="\n")
print("number of rows:", df.shape[0])
print("number of columns:", df.shape[1])
print(df.columns)

----------------------
GENERAL INFORMATION:
----------------------
number of rows: 13377
number of columns: 15
Index(['Complaint_ID', 'Tags', 'Date', 'Timely', 'Submitted via', 'Company',
       'Category', 'Product', 'Sub-product', 'Issue', 'Sub-issue',
       'Has Narrative', 'Complaint', 'Company Response',
       'Company Public Response'],
      dtype='object')

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13377 entries, 0 to 13376
Data columns (total 15 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Complaint_ID             13377 non-null  int64 
 1   Tags                     1306 non-null   object
 2   Date                     13377 non-null  object
 3   Timely                   13377 non-null  object
 4   Submitted via            13377 non-null  object
 5   Company                  13377 non-null  object
 6   Category                 13377 non-null  object
 7   Product                  13377 non-null  object
 8   Sub-product              13350 non-null  object
 9   Issue                    13377 non-null  object
 10  Sub-issue                11968 non-null  object
 11  Has Narrative            13377 non-null  bool  
 12  Complaint                6002 non-null   object
 13  Company Response         13377 non-null  object
 14  Company Public Response  7092 non-null   object
dtypes: bool(1), int64(1), object(13)
memory usage: 1.4+ MB
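
The non-null counts above already hint at how sparse some columns are: Tags is filled for 1,306 of 13,377 rows (roughly 10%) and Complaint for 6,002 (roughly 45%). A minimal sketch of how those shares could be computed directly is shown below. It runs on a small toy frame with made-up values rather than the real file, and `missing_share` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the complaints table (values are illustrative only).
toy = pd.DataFrame({
    "Complaint_ID": [1, 2, 3, 4],
    "Tags": [None, "Servicemember", None, None],
    "Complaint": ["text", None, None, "text"],
})

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

shares = missing_share(toy)
print(shares)
```

Calling `missing_share(df)` on the loaded dataset would give the same kind of summary for all 15 columns.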

Distribution of Complaints by Product

We analyze the distribution of complaints across different product categories (e.g., Credit reporting, Debt collection). This will give us insight into which products are most frequently complained about.

df['Product_short'] = df['Product'].apply(lambda x: x.split(',')[0].split(' or')[0])

sns.countplot(data=df, x='Product_short', order=df['Product_short'].value_counts().index)
plt.title('Distribution of Complaints by Product')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Product')
plt.ylabel('Number of Complaints')
plt.show()

As shown in the plot, complaints about credit reporting significantly outnumber those for other financial products. This suggests that issues related to credit reporting are more common, or more frustrating to consumers, than issues with other financial products. Credit reporting errors can have serious consequences, such as lowering consumers’ credit scores and limiting their ability to secure loans, which may explain the higher volume of complaints. The trend also reflects the complexity and importance of accurate credit reporting in the overall financial system.
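
Because the x-axis labels come from the shortened `Product_short` column, it is worth checking what the shortening rule does to a few product names. The sketch below wraps the same rule as the lambda above in a hypothetical function, `shorten_product`. Apart from "Checking or savings account" and "Debt collection", the example strings are invented for illustration.

```python
def shorten_product(name: str) -> str:
    """Same rule as the Product_short lambda: keep the text before the
    first comma, then the text before the first ' or'."""
    return name.split(',')[0].split(' or')[0]

examples = [
    "Checking or savings account",            # appears in the data preview
    "Debt collection",                        # no comma or ' or': unchanged
    "Product A, product B, or product C",     # hypothetical comma-separated name
    "Mortgage origination",                   # hypothetical: ' or' starts a longer word
]

shortened = [shorten_product(name) for name in examples]
for before, after in zip(examples, shortened):
    print(f"{before!r} -> {after!r}")
```

The rule keeps the labels compact, but it also truncates "Checking or savings account" to "Checking" and would cut any name in which " or" begins a longer word, so the plotted labels are abbreviations rather than official product names.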

Most Common Complaint Issues

We analyze the most common issues within the complaints to gain a deeper understanding of specific problems faced by consumers, which can also help companies prioritize improvements in areas that are most impactful for their customers. Given that there are 95 issue types, we will focus on the top 10 most frequent issues, which will make the plot clearer and more focused.

top_issues = df['Issue'].value_counts().head(10)
simplified_labels = [
    "Credit report investigation issues" if label == "Problem with a credit reporting company's investigation into an existing problem" 
    else label for label in top_issues.index
]

sns.barplot(x=top_issues.index, y=top_issues.values)
plt.title('Top 10 Issues in Complaints')
plt.xlabel('Issue')
plt.ylabel('Number of Complaints')
plt.xticks(ticks=range(len(top_issues.index)), rotation=45, ha='right', labels=simplified_labels)
plt.show()

The top complaints are primarily related to credit reporting, which is consistent with our previous observations. The most frequent issue is “Incorrect information on your report,” followed by “Problem with a credit reporting company’s investigation into an existing problem.” These concerns highlight the significant role credit report accuracy plays in consumer complaints. Other notable issues include improper use of reports, account management problems, and debt collection disputes. These results suggest that consumers are most concerned with the accuracy and security of their financial information, particularly with regard to credit reports and debt-related matters.
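
A bar chart of the top 10 does not show how much of the dataset those 10 issues cover out of the 95 issue types. One way to quantify that is sketched below with a hypothetical helper, `top_n_share`, applied to a toy series whose labels are placeholders.

```python
import pandas as pd

# Placeholder issue labels; the real column is df['Issue'].
issues = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E"], name="Issue")

def top_n_share(values: pd.Series, n: int) -> float:
    """Fraction of non-missing rows that fall into the n most frequent values."""
    counts = values.value_counts()          # sorted by frequency, NaN excluded
    return counts.head(n).sum() / counts.sum()

share_top2 = top_n_share(issues, 2)
print(f"Top 2 issues cover {share_top2:.1%} of rows")
```

On the real data, `top_n_share(df['Issue'], 10)` would indicate whether the top 10 issues account for most complaints or only a modest slice of them.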

Timeliness of Responses

This analysis looks at how promptly companies respond to consumer complaints. The ‘Timely’ column indicates whether the response was timely (‘Yes’ or ‘No’).

df['Timely'].value_counts(normalize=True)

Timely
Yes    0.99088
No     0.00912
Name: proportion, dtype: float64

# Visualizing the timely response distribution
sns.countplot(x='Timely', data=df)
plt.title('Timely Responses to Complaints')
plt.xlabel('Timely Response')
plt.ylabel('Number of Complaints')
plt.show()

The result shows that more than 99% of complaints received a timely response, so fewer than 1% were answered late. This suggests that the companies in this dataset generally respond to consumer complaints within the expected timeframe. The Timely flag only records whether a response arrived on time, so it speaks to the speed of the complaint handling process rather than to the quality or outcome of the response.
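
The proportions are easier to interpret as counts. The short calculation below converts them back, using only the two figures printed earlier (13,377 rows and an untimely share of 0.00912). Since the printed share is rounded, the result is the implied count rather than a value read from the data.

```python
total_rows = 13377         # df.shape[0], from the dataset overview
share_untimely = 0.00912   # 'No' share from value_counts(normalize=True)

# Recover the implied number of untimely responses from the rounded share.
untimely = round(total_rows * share_untimely)
timely = total_rows - untimely

print("untimely:", untimely)
print("timely:", timely)
print("check:", round(untimely / total_rows, 5))
```

This gives about 122 untimely responses against 13,255 timely ones, which is the same count that `df['Timely'].value_counts()` without `normalize=True` should report.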

Number of Complaints Over Time

We explore how the number of complaints changes over time. The Date column is parsed into a new Date Received datetime column, which is then used for the time-based analysis.

df['Date Received'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date Received'].dt.year
complaints_over_time = df['Year'].value_counts().sort_index()

sns.lineplot(x=complaints_over_time.index, y=complaints_over_time.values)
plt.title('Complaints Received Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Complaints')
plt.show()

The plot above shows complaints received per year from 2017 to 2023. From 2017 to 2021, the number of complaints increased steadily, which could reflect growing consumer awareness of their rights and of the CFPB complaint platform. Between 2021 and 2022 the increase was much sharper. This could be linked to financial difficulties caused by the COVID-19 pandemic, such as unemployment, loan deferrals, and debt problems. From 2022 to 2023, growth slowed significantly, which may indicate that the financial situation stabilized for many consumers. The overall upward trend is consistent with rising awareness of the CFPB and a growing willingness among consumers to report issues.
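
Two quick checks can support this reading of the line plot: the year-over-year percentage change, and whether the last year in the data is complete. The data previews show dates in late August 2023, which suggests, but does not prove, that 2023 may be a partial year. The sketch below uses invented dates, and the variable names are illustrative.

```python
import pandas as pd

# Invented dates standing in for df['Date Received'].
dates = pd.to_datetime(pd.Series([
    "2021-03-01", "2021-07-15",
    "2022-01-10", "2022-05-02", "2022-09-30", "2022-12-24",
    "2023-02-14", "2023-06-01", "2023-08-31",
]))

# Complaints per year, in chronological order.
per_year = dates.dt.year.value_counts().sort_index()

# Relative change from one year to the next (first year has no predecessor).
yoy_change = per_year.pct_change()

# If the latest date is not 31 December, the final year is incomplete.
last_date = dates.max()
final_year_is_partial = (last_date.month, last_date.day) != (12, 31)

print(per_year)
print(yoy_change)
print("final year partial:", final_year_is_partial)
```

If the final year turns out to be partial, its count understates the annual volume, and an apparent slowdown in that year should be interpreted with that in mind.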

Building on the trend observed over the years, further analysis can be conducted to understand how complaints vary across different months. By examining monthly complaint patterns, we can identify if there are any seasonal fluctuations.

df['Month'] = df['Date Received'].dt.month
complaints_over_month = df['Month'].value_counts().sort_index()

sns.lineplot(x=complaints_over_month.index, y=complaints_over_month.values)
plt.title('Complaints Received by Month')
plt.xlabel('Month')
plt.ylabel('Number of Complaints')
plt.show()

The plot above shows a seasonal pattern in complaint submissions, with a peak in the middle of the year and dips at the start and end. One possible explanation is that consumers engage with their finances more actively during the middle months, for example in the wake of tax filings or during summer financial planning. At the start and end of the year, many consumers may be less focused on financial issues because of the holiday season. Both explanations are tentative, since the plot pools every year in the dataset into a single count per month.
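
Because the monthly counts combine all years, a year-by-month table is a useful cross-check: it shows whether the mid-year peak appears within individual years and whether any year covers only some months. The sketch below builds such a table with `pd.crosstab` on invented dates.

```python
import pandas as pd

# Invented dates standing in for df['Date Received'].
dates = pd.to_datetime(pd.Series([
    "2022-01-05", "2022-06-10", "2022-06-20", "2022-12-01",
    "2023-01-15", "2023-06-03",
]))

years = dates.dt.year.rename("Year")
months = dates.dt.month.rename("Month")

# Rows are years, columns are months, cells are complaint counts.
by_year_month = pd.crosstab(years, months)

# Number of distinct months with at least one complaint, per year.
months_covered = (by_year_month > 0).sum(axis=1)

print(by_year_month)
print(months_covered)
```

A year with fewer covered months contributes nothing to the missing months, which on its own can lower the pooled counts for those months.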

Companies with Most Complaints

We analyze the companies that receive the most complaints from consumers. This can help identify companies with a poor reputation or those frequently involved in consumer issues.

top_companies = df['Company'].value_counts().head(10)

sns.barplot(x=top_companies.index, y=top_companies.values)
plt.title('Top 10 Companies with the Most Complaints')
plt.xlabel('Company')
plt.ylabel('Number of Complaints')
plt.xticks(rotation=45, ha='right')
plt.show()

Distribution of Complaints by Tags

The following pie chart illustrates the distribution of consumer complaints by their associated tags. Tags provide additional context about the complaints, such as whether they involve servicemembers or older consumers. Only 1,306 of the 13,377 complaints carry a tag, so the chart describes that subset. By analyzing the tag distribution, we can identify patterns in the types of consumers most affected by financial issues.

tag_counts = df['Tags'].value_counts()

plt.pie(tag_counts.values, labels=tag_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Complaints by Tags')
plt.show()

From the pie chart, servicemembers account for a significant proportion of the tagged complaints. Older Americans also form a notable share, while a smaller percentage of complaints involve individuals who belong to both categories. This distribution emphasizes the need for targeted policies to address the financial challenges faced by these groups, ensuring that financial services cater effectively to their unique needs.
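
The percentages in the pie chart are relative to tagged complaints only, because `value_counts()` drops missing values by default. The sketch below contrasts that with shares computed over all rows. The tag strings and the "No tag" label are illustrative assumptions.

```python
import pandas as pd

# Illustrative tag values; None marks an untagged complaint.
tags = pd.Series([
    "Servicemember", None, "Older American", None, None,
    "Older American, Servicemember", None, "Servicemember",
], name="Tags")

# Shares among tagged complaints only (NaN dropped, as in the pie chart).
tagged_only = tags.value_counts(normalize=True)

# Shares among all complaints, with untagged rows as their own group.
with_untagged = tags.fillna("No tag").value_counts(normalize=True)

print(tagged_only)
print(with_untagged)
```

In this toy example "Servicemember" makes up half of the tagged complaints but only a quarter of all complaints, which is the kind of gap to keep in mind when reading the chart.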

Overview of the Narrative Complaints

Next, we focused on analyzing the cleaned text data, specifically the complaint narratives and their associated sentiment scores. These narratives provide detailed descriptions of the issues consumers faced, offering valuable context to the structured data.

Of the 13,377 rows in the full dataset, 6,002 have a non-null Complaint field, and the processed complaints table used below contains 5,798 rows with narratives. Narratives are only available when consumers explicitly consent to share their descriptions publicly. For the analysis below, we used the complaints table, which includes only rows with narrative complaints that were published with consumer consent.

By narrowing the scope to these records, we could examine not only the specific issues raised but also the sentiment associated with each narrative. This approach helps reveal the emotional tone of the complaints and provides additional insight into the consumer experience.

df_has_narrative = pd.read_csv("../../data/processed-data/complaints.csv")
df_has_narrative.head()

	Complaint_ID	Tags	Date	Timely	Company	Category	Product	Sub-product	Issue	Sub-issue	Complaint	Company Response	Company Public Response	largest_amount	cleaned_complaints	Clean Complaint Length	sentiment_score	negative-score
0	7485989	NaN	2023-08-31	Yes	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	NaN	0.0	on i disputed an account via notarized affidav...	1524	-0.9753	0.553097
1	7484469	NaN	2023-08-31	Yes	TRANSUNION INTERMEDIATE HOLDINGS, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with non-monetary relief	Company has responded to the consumer and the ...	0.0	on i disputed an account via notarized affidav...	1535	-0.9753	0.548673
2	7484234	NaN	2023-08-31	Yes	Experian Information Solutions Inc.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Incorrect information on your report	Information belongs to someone else	On XX/XX/XXXX I disputed an account via notari...	Closed with explanation	Company has responded to the consumer and the ...	0.0	on i disputed an account via notarized affidav...	1535	-0.9753	0.548673
3	7475961	NaN	2023-08-30	Yes	JPMORGAN CHASE & CO.	Banking	Checking or savings account	Checking account	Problem with a lender or other company chargin...	Transaction was not authorized	On XX/XX/XXXX, an unauthorized wire for {$5400...	Closed with explanation	NaN	5400.0	on an unauthorized wire for was sent on my cha...	648	-0.9100	0.566372
4	7474987	NaN	2023-08-30	Yes	EQUIFAX, INC.	Credit reporting	Credit reporting or other personal consumer re...	Credit reporting	Improper use of your report	Reporting company used your report improperly	My credit reports are inaccurate. These inaccu...	Closed with non-monetary relief	NaN	0.0	my credit reports are inaccurate these inaccur...	268	0.2732	0.601770

print("----------------------", "GENERAL INFORMATION:", "----------------------", sep="\n")
print("number of rows:", df_has_narrative.shape[0])
print("number of columns:", df_has_narrative.shape[1])
print(df_has_narrative.columns)

----------------------
GENERAL INFORMATION:
----------------------
number of rows: 5798
number of columns: 18
Index(['Complaint_ID', 'Tags', 'Date', 'Timely', 'Company', 'Category',
       'Product', 'Sub-product', 'Issue', 'Sub-issue', 'Complaint',
       'Company Response', 'Company Public Response', 'largest_amount',
       'cleaned_complaints', 'Clean Complaint Length', 'sentiment_score',
       'negative-score'],
      dtype='object')

df_has_narrative.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5798 entries, 0 to 5797
Data columns (total 18 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Complaint_ID             5798 non-null   int64  
 1   Tags                     740 non-null    object 
 2   Date                     5798 non-null   object 
 3   Timely                   5798 non-null   object 
 4   Company                  5798 non-null   object 
 5   Category                 5798 non-null   object 
 6   Product                  5798 non-null   object 
 7   Sub-product              5785 non-null   object 
 8   Issue                    5798 non-null   object 
 9   Sub-issue                5033 non-null   object 
 10  Complaint                5798 non-null   object 
 11  Company Response         5798 non-null   object 
 12  Company Public Response  2931 non-null   object 
 13  largest_amount           5798 non-null   float64
 14  cleaned_complaints       5798 non-null   object 
 15  Clean Complaint Length   5798 non-null   int64  
 16  sentiment_score          5798 non-null   float64
 17  negative-score           5798 non-null   float64
dtypes: float64(3), int64(2), object(13)
memory usage: 815.5+ KB
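
The full dataset reports 6,002 non-null Complaint entries, while this table has 5,798 rows. A minimal sketch of how the two tables could be reconciled by `Complaint_ID` follows. It uses toy frames, and `full` and `narr` are stand-ins for `df` and `df_has_narrative`.

```python
import pandas as pd

# Toy stand-ins: 'full' mimics df, 'narr' mimics df_has_narrative.
full = pd.DataFrame({
    "Complaint_ID": [1, 2, 3, 4, 5],
    "Complaint": ["a", None, "b", "c", None],
})
narr = pd.DataFrame({"Complaint_ID": [1, 3]})

# Every narrative row should trace back to the full dataset.
all_ids_known = narr["Complaint_ID"].isin(full["Complaint_ID"]).all()

# Rows that have narrative text in 'full' but are absent from 'narr'.
has_text = full["Complaint"].notna()
in_narr = full["Complaint_ID"].isin(narr["Complaint_ID"])
dropped_ids = full.loc[has_text & ~in_narr, "Complaint_ID"].tolist()

print("all narrative IDs found in full data:", all_ids_known)
print("IDs with text but not in narrative table:", dropped_ids)
```

Inspecting the dropped rows on the real data would show which narratives did not make it into the processed table.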

Word Cloud of Complaints

To gain insights into the common themes and issues raised by consumers, we generated a word cloud from the complaint narratives. Larger words in the word cloud indicate higher frequencies, reflecting the primary concerns expressed by consumers.

def generate_word_cloud(text_data):
    text = " ".join(text_data)

    wordcloud = WordCloud(width=800, height=400, background_color='white', colormap='viridis', stopwords=STOPWORDS).generate(text)
    
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title("Word Cloud of Complaints", fontsize=16)
    plt.show()

generate_word_cloud(df_has_narrative['cleaned_complaints'])

Terms like “account”, “credit”, “report” and “payment” appear prominently, suggesting that issues related to credit reporting, financial transactions, credit management, and loan servicing are major areas of concern. This visualization provides an intuitive overview of the narrative complaints, serving as a starting point for deeper text analysis.
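
A word cloud conveys frequency only through font size, so a plain frequency table is a useful companion. The sketch below counts terms with `collections.Counter`. The stopword set and sample sentences are made up for illustration, and `top_terms` is a hypothetical helper. Counts may differ from the cloud, because WordCloud applies its own tokenization and, by default, also counts two-word collocations.

```python
from collections import Counter

# Tiny illustrative stopword set; the notebook uses wordcloud.STOPWORDS instead.
SMALL_STOPWORDS = {"i", "my", "the", "is", "was", "an", "a", "and"}

def top_terms(texts, n=3, stopwords=SMALL_STOPWORDS):
    """Return the n most frequent whitespace-separated terms, skipping stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(word for word in text.split() if word not in stopwords)
    return counts.most_common(n)

# Made-up sentences in the style of the cleaned narratives.
sample = [
    "my credit report is wrong",
    "the credit account was closed",
    "credit report shows an account i never opened",
]

terms = top_terms(sample, n=3)
print(terms)
```

On the real data the call would be `top_terms(df_has_narrative['cleaned_complaints'], n=20)`, ideally with the same stopword list that is passed to `WordCloud`.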

Sentiment Score of Complaints for Top 10 Companies

To explore the sentiment of consumer complaints across different companies, we created a violin plot that visualizes the distribution of sentiment scores for complaints associated with the top 10 companies. The sentiment scores reflect the emotional tone of each complaint narrative, where higher scores typically indicate more positive sentiments and lower scores represent more negative sentiments.

The top 10 companies were selected based on the frequency of complaints they received. For each company, we added a trace to the plot that shows the distribution of sentiment scores, with box plots and mean lines to highlight key statistical measures. This allows us to compare how the sentiment of complaints varies across companies and identify any potential patterns or outliers.

fig = go.Figure()  

top_10_companies = df['Company'].value_counts().head(10).index.to_list()

for company in top_10_companies:
    fig.add_trace(go.Violin(x=df_has_narrative['Company'][df_has_narrative['Company'] == company],
                            y=df_has_narrative['sentiment_score'][df_has_narrative['Company'] == company],
                            name=company, box_visible=True, meanline_visible=True))  
fig.update_layout(title_text='Sentiment Score of Complaints for Top 10 Companies', template='plotly_white')

fig.show()
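
Violin plots are easier to compare alongside a small table of per-company statistics. The sketch below computes the number of narratives, the median score, and the share of negative scores per company on toy data. The company names and scores are invented, and treating scores below zero as negative is an assumption based on the signed values visible in the preview.

```python
import pandas as pd

# Invented scores; the real columns live in df_has_narrative.
toy = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B"],
    "sentiment_score": [-0.9, -0.5, 0.2, -0.8, 0.4],
})

summary = toy.groupby("Company")["sentiment_score"].agg(
    n="count",                                   # narratives per company
    median="median",                             # typical score
    share_negative=lambda s: (s < 0).mean(),     # fraction of scores below zero
)

print(summary)
```

The `n` column matters when reading the violins, since a company with few narratives can produce a distribution shape that is not very reliable. The same aggregation could be applied to the `negative-score` column.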

Negative Score of Complaints for Top 10 Companies

A violin plot was created to visualize the distribution of negative sentiment scores for complaints associated with the top 10 companies. Negative sentiment scores represent the extent of dissatisfaction or negative emotions in the complaint narratives. Each company’s complaints are displayed as separate traces, showing the distribution of negative scores.

fig = go.Figure()  

for company in top_10_companies:
    fig.add_trace(go.Violin(x=df_has_narrative['Company'][df_has_narrative['Company'] == company],
                            y=df_has_narrative['negative-score'][df_has_narrative['Company'] == company],
                            name=company, box_visible=True, meanline_visible=True))  
fig.update_layout(title_text='Negative Score of Complaints for Top 10 Companies', template='plotly_white')  
fig.show()

Sentiment Score vs. Complaint Length for Top 10 Companies

A scatter plot was generated to examine the relationship between the sentiment score and the length of the complaint narratives for the top 10 companies. The sentiment score reflects the overall emotional tone of the complaint, while the complaint length indicates the amount of detail provided in the narrative. Each point on the plot represents a complaint and is color-coded by company, allowing us to observe trends and differences across the top 10 companies. This analysis helps identify whether there is a correlation between the length of a complaint and its sentiment.

# Keep only narratives from the top 10 companies
df_top_10_companies = df_has_narrative[df_has_narrative['Company'].isin(top_10_companies)]

fig = px.scatter(df_top_10_companies, x='sentiment_score', y='Clean Complaint Length', color='Company', template="plotly_white")
fig.update_layout(title_text='Sentiment Score vs. Complaint Length for Top 10 Companies')
fig.show()
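
To put a number on the relationship in the scatter plot, a correlation coefficient can be computed alongside it. The sketch below computes Pearson's correlation and a rank-based (Spearman-style) correlation, obtained here by correlating the ranks of the two columns. The values are invented and chosen to be perfectly monotone, so they say nothing about the real data.

```python
import pandas as pd

# Invented values; the real columns live in df_has_narrative.
toy = pd.DataFrame({
    "sentiment_score": [0.5, 0.1, -0.3, -0.7, -0.9],
    "Clean Complaint Length": [100, 250, 400, 900, 1500],
})

score = toy["sentiment_score"]
length = toy["Clean Complaint Length"]

# Pearson: strength of the linear relationship.
pearson = score.corr(length)

# Spearman-style: Pearson correlation of the ranks, robust to skewed lengths.
spearman = score.rank().corr(length.rank())

print(f"Pearson:  {pearson:.3f}")
print(f"Spearman: {spearman:.3f}")
```

Complaint lengths tend to be right-skewed, so the rank-based value is usually the safer summary. Computing both per company, for example with `groupby('Company')`, would show whether any relationship holds within companies or only in the pooled data.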

Summary

In the EDA phase, we examined the overall consumer complaint dataset and cleaned narratives to uncover patterns, trends, and relationships. We began by providing an overview of the dataset, ensuring an understanding of its structure and key variables.

We then analyzed the distribution of complaints by product, identifying the most common financial products involved in consumer complaints. This was followed by an exploration of the most frequently reported issues, offering insights into the primary concerns consumers have with financial services. We also assessed the timeliness of responses, highlighting how quickly companies address complaints.

The number of complaints over time was examined to observe trends and fluctuations, revealing how complaint volumes changed over the study period. Additionally, we identified the companies receiving the most complaints, providing a deeper understanding of which firms were most frequently involved in consumer disputes.

We explored the distribution of complaints by tags to gain insight into specific consumer groups, such as older Americans or servicemembers, who might be affected by particular issues. We also focused on the narrative complaints, analyzing the sentiment scores associated with them to gauge the emotional tone of consumer feedback.

To further explore the sentiment, we visualized the sentiment scores of complaints for the top 10 companies and examined the negative sentiment scores, helping to identify which companies were associated with higher levels of dissatisfaction. Finally, we analyzed the relationship between sentiment scores and complaint length to understand whether longer complaints tend to have more negative or positive sentiment.

Overall, the EDA provided valuable insights into the dataset, setting the stage for more in-depth analysis and actionable conclusions regarding consumer complaints and company responses.