With the data cleaned and prepared, the next logical step was to explore the dataset in depth through Exploratory Data Analysis (EDA). This phase allowed for a closer look at the patterns, trends, and relationships within the data, providing the groundwork for understanding consumer complaints and company responses. By visualizing and summarizing the key variables, EDA helped uncover insights into the types of complaints, the companies involved, and the timeliness of responses. This analysis set the stage for more detailed investigations and informed subsequent steps in the project.
Import packages and data
# Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
import plotly.graph_objects as go
import plotly.express as px

# Load the dataset
df = pd.read_csv("../../data/processed-data/customer_complaints.csv")
df.head()
We analyze the distribution of complaints across different product categories (e.g., Credit reporting, Debt collection). This will give us insight into which products are most frequently complained about.
df['Product_short'] = df['Product'].apply(lambda x: x.split(',')[0].split(' or')[0])

sns.countplot(data=df, x='Product_short', order=df['Product_short'].value_counts().index)
plt.title('Distribution of Complaints by Product')
plt.xticks(rotation=45, ha='right')
plt.xlabel('Product')
plt.ylabel('Number of Complaints')
plt.show()
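The label-shortening rule used above keeps the text before the first comma and then the text before the first " or". The short sketch below applies the same rule to a few sample product names; the sample strings and the helper name `shorten_product` are illustrative assumptions, not values read from the dataset.

```python
def shorten_product(name: str) -> str:
    """Keep the text before the first comma, then before the first ' or'."""
    return name.split(',')[0].split(' or')[0]


# Illustrative product names (assumed for this sketch, not read from the data)
examples = [
    "Credit reporting, credit repair services, or other personal consumer reports",
    "Credit card or prepaid card",
    "Debt collection",
]

short_labels = [shorten_product(name) for name in examples]
for original, short in zip(examples, short_labels):
    print(f"{short!r} <- {original!r}")
```

Because the rule splits on the literal substring " or", a category whose first part contains a word starting with "or" after a space would also be cut there, so it may be worth checking the unique shortened labels once before plotting.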
As shown in the plot, complaints about credit reporting significantly outnumber those for other financial products. This indicates that issues related to credit reporting are more common or more frustrating to consumers compared to other financial products. Credit reporting errors can have serious consequences, such as affecting consumers’ credit scores and their ability to secure loans, which may explain the higher volume of complaints. This trend also reflects the complexity and importance of accurate credit reporting in the overall financial system.
Most Complaint Issues
We analyze the most common issues within the complaints to gain a deeper understanding of specific problems faced by consumers, which can also help companies prioritize improvements in areas that are most impactful for their customers. Given that there are 95 issue types, we will focus on the top 10 most frequent issues, which will make the plot clearer and more focused.
top_issues = df['Issue'].value_counts().head(10)
simplified_labels = ["Credit report investigation issues"
                     if label == "Problem with a credit reporting company's investigation into an existing problem"
                     else label
                     for label in top_issues.index]

sns.barplot(x=top_issues.index, y=top_issues.values)
plt.title('Top 10 Issues in Complaints')
plt.xlabel('Issue')
plt.ylabel('Number of Complaints')
plt.xticks(ticks=range(len(top_issues.index)), rotation=45, ha='right', labels=simplified_labels)
plt.show()
The top complaints are primarily related to credit reporting, which is consistent with the product distribution above. The most frequent issue is “Incorrect information on your report,” followed by “Problem with a credit reporting company’s investigation into an existing problem.” These concerns highlight the significant role credit report accuracy plays in consumer complaints. Other notable issues include improper use of reports, account management problems, and debt collection disputes. These results suggest that consumers are most concerned with the accuracy and security of their financial information, particularly with regard to credit reports and debt-related matters.
Timeliness of Responses
This analysis looks at how promptly companies respond to consumer complaints. The ‘Timely’ column indicates whether the response was timely (‘Yes’ or ‘No’).
df['Timely'].value_counts(normalize=True)
Timely
Yes 0.99088
No 0.00912
Name: proportion, dtype: float64
# Visualizing the timely response distribution
sns.countplot(x='Timely', data=df)
plt.title('Timely Responses to Complaints')
plt.xlabel('Timely Response')
plt.ylabel('Number of Complaints')
plt.show()
The result shows that more than 99% of the complaints received a timely response, meaning a response within the expected timeframe. This suggests that the companies handling these complaints are generally prompt in addressing consumer issues and that the complaint management system operates efficiently.
Number of Complaints Over Time
We explore how the number of complaints changes over time, using the Date Received column to perform a time-based analysis.
df['Date Received'] = pd.to_datetime(df['Date'])
df['Year'] = df['Date Received'].dt.year
complaints_over_time = df['Year'].value_counts().sort_index()

sns.lineplot(x=complaints_over_time.index, y=complaints_over_time.values)
plt.title('Complaints Received Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Complaints')
plt.show()
The plot above shows the trend of complaints received from 2017 to 2023. From 2017 to 2021, the number of complaints increased steadily, which could be due to growing awareness among consumers about their rights and the availability of the CFPB complaint platform. Between 2021 and 2022, the increase became much sharper. This could be linked to the financial difficulties caused by the COVID-19 pandemic, which led to issues like unemployment, loan deferrals, and debt problems. From 2022 to 2023, by contrast, the growth in complaints slowed down significantly, which may indicate that the financial situation stabilized for many consumers. The overall upward trend is consistent with rising awareness of the CFPB and a growing willingness of consumers to report issues.
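To put numbers on phrases such as "sharp increase" and "slowed down", the yearly counts can be turned into year-over-year growth rates. The sketch below uses made-up counts purely for illustration; with the real data, the `complaints_over_time` series from the analysis would take the place of the hypothetical one defined here.

```python
import pandas as pd

# Hypothetical yearly counts, for illustration only (not the dataset's values)
complaints_over_time = pd.Series(
    [100, 120, 150, 300, 330],
    index=[2019, 2020, 2021, 2022, 2023],
    name="complaints",
)

# Year-over-year growth as a percentage of the previous year's count;
# the first year has no predecessor, so its value is NaN
yoy_growth = complaints_over_time.pct_change().mul(100).round(1)
print(yoy_growth)
```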
Building on the trend observed over the years, further analysis can be conducted to understand how complaints vary across different months. By examining monthly complaint patterns, we can identify if there are any seasonal fluctuations.
df['Month'] = df['Date Received'].dt.month
complaints_over_month = df['Month'].value_counts().sort_index()

sns.lineplot(x=complaints_over_month.index, y=complaints_over_month.values)
plt.title('Complaints Received Over Month')
plt.xlabel('Month')
plt.ylabel('Number of Complaints')
plt.show()
The plot above shows a seasonal pattern in complaint submissions, with a peak during the middle of the year and a dip at the start and end. One potential reason is that consumers tend to engage with their finances more actively during the middle months, due to tax filings or summer financial planning. On the other hand, during the start and end of the year, many consumers might be less focused on financial issues due to the holiday season.
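Because the monthly totals pool every year in the dataset, a year with incomplete coverage could influence the shape of the curve. One possible cross-check, sketched here with made-up dates rather than the project's data, is to tabulate complaints by month and year so that each year's pattern can be read separately. The frame `df_demo` is a hypothetical stand-in for the real `df`.

```python
import pandas as pd

# Made-up dates for illustration; the real analysis would use df['Date Received']
dates = pd.to_datetime(
    ["2021-01-15", "2021-06-03", "2021-06-20", "2022-06-11", "2022-12-01"]
)
df_demo = pd.DataFrame({"Date Received": dates})
df_demo["Year"] = df_demo["Date Received"].dt.year
df_demo["Month"] = df_demo["Date Received"].dt.month

# Rows are months, columns are years, cells are complaint counts
monthly_by_year = pd.crosstab(df_demo["Month"], df_demo["Year"])
print(monthly_by_year)
```

If the mid-year peak appears in most columns of such a table, the seasonal reading is better supported than if it is driven by a single year.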
Companies with Most Complaints
We analyze the companies that receive the most complaints from consumers. This can help identify companies with a poor reputation or those frequently involved in consumer issues.
top_companies = df['Company'].value_counts().head(10)

sns.barplot(x=top_companies.index, y=top_companies.values)
plt.title('Top 10 Companies with the Most Complaints')
plt.xlabel('Company')
plt.ylabel('Number of Complaints')
plt.xticks(rotation=45, ha='right')
plt.show()
Distribution of Complaints by Tags
The following pie chart illustrates the distribution of consumer complaints by their associated tags. Tags provide additional context or characteristics about the complaints, such as whether they involve servicemembers or older consumers. By analyzing the tag distribution, we can identify patterns in the types of consumers most affected by financial issues.
tag_counts = df['Tags'].value_counts()

plt.pie(tag_counts.values, labels=tag_counts.index, autopct='%1.1f%%', startangle=140)
plt.title('Distribution of Complaints by Tags')
plt.show()
From the pie chart, it is evident that servicemembers represent a significant proportion of the tagged complaints. Older Americans also form a notable share, while a smaller percentage of complaints involve individuals who belong to both categories. This distribution emphasizes the need for targeted policies to address the financial challenges faced by these groups, ensuring that financial services cater effectively to their unique needs.
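When reading these percentages, it helps to remember that `value_counts()` excludes missing values by default, so the pie chart describes tagged complaints only. The sketch below assumes that untagged complaints are stored as missing values (an assumption about the processed file) and uses made-up tag values to show how the tagged share can be reported alongside the chart.

```python
import pandas as pd

# Made-up tag values; None stands for a complaint without a tag (assumed encoding)
tags = pd.Series([
    "Servicemember", None, "Older American", None,
    None, "Older American, Servicemember", "Servicemember", None,
])

tagged_share = tags.notna().mean()                 # fraction of complaints with a tag
tag_counts = tags.value_counts()                   # missing values excluded (default)
tag_counts_all = tags.value_counts(dropna=False)   # missing values counted as a group

print(f"Share of complaints with a tag: {tagged_share:.0%}")
print(tag_counts_all)
```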
Overview of the Narrative Complaints
Next, we focused on analyzing the cleaned text data, specifically the complaint narratives and their associated sentiment scores. These narratives provide detailed descriptions of the issues consumers faced, offering valuable context to the structured data.
Out of the entire dataset, which contains 13,377 rows, only 5,798 rows include narrative complaints. These narratives are only available when consumers explicitly consent to share their descriptions publicly. For the analysis below, we used the complaints table, which includes only the rows with narrative complaints that were published with consumer consent.
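A subset like this can be obtained by keeping only the rows whose narrative field is present and non-empty. The sketch below is one possible way to do it, not necessarily how the complaints table was built; the column name `Consumer complaint narrative` and the toy rows are assumptions made for illustration.

```python
import pandas as pd

# Toy rows; the narrative column name is an assumption about the data layout
df_demo = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Consumer complaint narrative": [
        "I was charged twice.", None, "   ", "Report shows wrong balance.",
    ],
})

narrative = df_demo["Consumer complaint narrative"]
# Keep rows where the narrative exists and is not just whitespace
has_text = narrative.notna() & narrative.fillna("").str.strip().ne("")
df_has_narrative_demo = df_demo[has_text]

print(f"{len(df_has_narrative_demo)} of {len(df_demo)} rows include a narrative")
```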
By narrowing the scope to these records, we could examine not only the specific issues raised but also the sentiment associated with each narrative. This approach helps reveal the emotional tone of the complaints and provides additional insight into the consumer experience.
To gain insights into the common themes and issues raised by consumers, we generated a word cloud from the complaint narratives. Larger words in the word cloud indicate higher frequencies, reflecting the primary concerns expressed by consumers.
Terms like “account”, “credit”, “report” and “payment” appear prominently, suggesting that issues related to credit reporting, financial transactions, credit management, and loan servicing are major areas of concern. This visualization provides an intuitive overview of the narrative complaints, serving as a starting point for deeper text analysis.
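The word sizes in a word cloud are driven by term frequencies, so the same information can be inspected as a plain frequency table. The sketch below counts words with the standard library only; the three narratives, the tiny stopword set, and the helper `word_frequencies` are all illustrative assumptions rather than the project's actual text pipeline.

```python
import re
from collections import Counter

# Made-up narratives for illustration only
narratives = [
    "The credit report shows an account that is not mine.",
    "My payment was not applied to the account.",
    "I disputed the credit report and the account is still listed.",
]

# A deliberately small stopword set; a real run would use a fuller list
stopwords = {"the", "my", "an", "a", "is", "was", "to", "and", "i", "that"}


def word_frequencies(texts, stopwords):
    """Count lowercase word tokens across texts, skipping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts


freqs = word_frequencies(narratives, stopwords)
print(freqs.most_common(5))
```

A frequency mapping of this kind can also be handed to the `wordcloud` package through `WordCloud.generate_from_frequencies`, which makes it easy to check that the picture and the table agree.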
Sentiment Score of Complaints for Top 10 Companies
To explore the sentiment of consumer complaints across different companies, we created a violin plot that visualizes the distribution of sentiment scores for complaints associated with the top 10 companies. The sentiment scores reflect the emotional tone of each complaint narrative, where higher scores typically indicate more positive sentiments and lower scores represent more negative sentiments.
The top 10 companies were selected based on the frequency of complaints they received. For each company, we added a trace to the plot that shows the distribution of sentiment scores, with box plots and mean lines to highlight key statistical measures. This allows us to compare how the sentiment of complaints varies across companies and identify any potential patterns or outliers.
fig = go.Figure()
top_10_companies = df['Company'].value_counts().head(10).index.to_list()

for company in top_10_companies:
    fig.add_trace(go.Violin(x=df_has_narrative['Company'][df_has_narrative['Company'] == company],
                            y=df_has_narrative['sentiment_score'][df_has_narrative['Company'] == company],
                            name=company,
                            box_visible=True,
                            meanline_visible=True))

fig.update_layout(title_text='Sentiment Score of Complaints for Top 10 Companies', template='plotly_white')
fig.show()
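A violin plot is easier to compare across companies when it is paired with a small numeric summary. The sketch below computes the count, median, and mean of the sentiment score per company on made-up values; the frame `df_demo` is a hypothetical stand-in for `df_has_narrative`, and only the column names `Company` and `sentiment_score` are taken from the code above.

```python
import pandas as pd

# Made-up scores for illustration; stands in for df_has_narrative
df_demo = pd.DataFrame({
    "Company": ["A", "A", "A", "B", "B", "C"],
    "sentiment_score": [-0.8, -0.2, 0.4, -0.6, -0.4, 0.1],
})

# Per-company count, median, and mean, ordered from most to least negative median
summary = (
    df_demo.groupby("Company")["sentiment_score"]
    .agg(["count", "median", "mean"])
    .sort_values("median")
)
print(summary)
```

Reporting the count next to the median matters here because a company with only a handful of narratives can show an extreme distribution by chance.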
Negative Score of Complaints for Top 10 Companies
A violin plot was created to visualize the distribution of negative sentiment scores for complaints associated with the top 10 companies. Negative sentiment scores represent the extent of dissatisfaction or negative emotions in the complaint narratives. Each company’s complaints are displayed as separate traces, showing the distribution of negative scores.
fig = go.Figure()

for company in top_10_companies:
    fig.add_trace(go.Violin(x=df_has_narrative['Company'][df_has_narrative['Company'] == company],
                            y=df_has_narrative['negative-score'][df_has_narrative['Company'] == company],
                            name=company,
                            box_visible=True,
                            meanline_visible=True))

fig.update_layout(title_text='Negative Score of Complaints for Top 10 Companies', template='plotly_white')
fig.show()
Sentiment Score vs. Complaint Length for Top 10 Companies
A scatter plot was generated to examine the relationship between the sentiment score and the length of the complaint narratives for the top 10 companies. The sentiment score reflects the overall emotional tone of the complaint, while the complaint length indicates the amount of detail provided in the narrative. Each point on the plot represents a complaint and is color-coded by company, allowing us to observe trends and differences across the top 10 companies. This analysis helps identify whether there is a correlation between the length of a complaint and its sentiment.
df_top_10_companies = df_has_narrative[df_has_narrative['Company'].isin(top_10_companies)]

fig = px.scatter(df_top_10_companies, x='sentiment_score', y='Clean Complaint Length', color='Company', template="plotly_white")
fig.update_layout(title_text='Sentiment Compound against Complaint Length of Complaints for Top 10 Companies')
fig.show()
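The scatter plot gives a visual impression of the relationship; a correlation coefficient can back it up with a number. The sketch below computes the Pearson correlation and a rank-based (Spearman) correlation on made-up values. The frame `df_demo` is a hypothetical stand-in for the top-10 subset, and the Spearman value is obtained as the Pearson correlation of the ranks so that no extra dependency is needed.

```python
import pandas as pd

# Made-up values for illustration; column names follow the plot above
df_demo = pd.DataFrame({
    "sentiment_score": [-0.9, -0.5, -0.1, 0.3, 0.6],
    "Clean Complaint Length": [1200, 800, 500, 450, 100],
})

score = df_demo["sentiment_score"]
length = df_demo["Clean Complaint Length"]

# Pearson measures linear association
pearson = score.corr(length)
# Spearman measures monotonic association: Pearson correlation of the ranks
spearman = score.rank().corr(length.rank())

print(f"Pearson: {pearson:.2f}, Spearman: {spearman:.2f}")
```

In this toy example longer complaints have lower scores, so both coefficients are negative; whether the same holds for the real narratives is something only the actual data can show.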
Summary
In the EDA phase, we examined the overall consumer complaint dataset and cleaned narratives to uncover patterns, trends, and relationships. We began by providing an overview of the dataset, ensuring an understanding of its structure and key variables.
We then analyzed the distribution of complaints by product, identifying the most common financial products involved in consumer complaints. This was followed by an exploration of the most frequently reported issues, offering insights into the primary concerns consumers have with financial services. We also assessed the timeliness of responses, highlighting how quickly companies address complaints.
The number of complaints over time was examined to observe trends and fluctuations, revealing how complaint volumes changed over the study period. Additionally, we identified the companies receiving the most complaints, providing a deeper understanding of which firms were most frequently involved in consumer disputes.
We explored the distribution of complaints by tags to gain insight into specific consumer groups, such as older Americans or servicemembers, who might be affected by particular issues. We also focused on the narrative complaints, analyzing the sentiment scores associated with them to gauge the emotional tone of consumer feedback.
To further explore the sentiment, we visualized the sentiment scores of complaints for the top 10 companies and examined the negative sentiment scores, helping to identify which companies were associated with higher levels of dissatisfaction. Finally, we analyzed the relationship between sentiment scores and complaint length to understand whether longer complaints tend to have more negative or positive sentiment.
Overall, the EDA provided valuable insights into the dataset, setting the stage for more in-depth analysis and actionable conclusions regarding consumer complaints and company responses.