Written By: Hana Gabrielle Bidon
Photo by Luke Chesser on Unsplash
Before creating data visualizations, it is important to ask yourself what problems you can tackle with data science. While data science can be a valuable tool for analyzing topics such as carbon emissions in first-world countries, it is not always the best tool for solving a problem. While creating data visualizations, consider the context of the dataset(s) you are working with. When working with datasets involving humans, be aware that data collection is neither objective nor neutral. Instead, collected data reflects human biases, which in turn reflect societal norms.
Recently, Lensa AI has faced backlash for oversexualizing women and undressing them without their consent. Women of color have reported that Lensa AI lightens their skin tone and alters their features to fit Eurocentric beauty standards. Unsurprisingly, this is not an issue for men and masculine-presenting folks. According to Prisma Labs, the company behind Lensa AI, this is mainly because the model was trained on unfiltered data from the internet, which is oversaturated with sexually objectified depictions of women. If someone were to collect and visualize users’ satisfaction ratings for Lensa AI, women and people of color would likely report lower satisfaction with the app than men and white people. This is one example of how misogyny and racism seep into how women and people of color are portrayed in the media.
Additionally, consider the limitations in your dataset(s). Say you want to visualize carbon emissions in first-world countries. A potential limitation is that this scope excludes countries that are developed nations but are not considered first-world countries. With these considerations in mind, you can go through a process I created to help you make insightful data visualizations. Note that this process is nonlinear.
Before you create your data visualization(s), first ask yourself what you want to analyze using data science. A plethora of publicly available datasets exists for you to explore. If you are stuck on what you want to visualize, take a look at some examples for inspiration.
Select a topic or issue you would like to analyze. Here are some topics and issues you can potentially delve into for starters.
Carbon Emissions in the United States
Results of the 2022 FIFA World Cup
Mental Health of Students during the COVID-19 Pandemic
COVID-19 Tweets
And many more
After picking a single dataset or multiple datasets you want to use, observe what features are within those datasets.
As an example, let’s analyze this dataset: Global Trends in Mental Health Disorder (Kaggle). It contains data about the prevalence of mental illnesses, including anxiety disorders, depression, bipolar disorder, eating disorders, schizophrenia, and substance use disorders (e.g., drugs, alcohol). A vast majority of the data were captured from 1870 to 2019. Furthermore, the dataset is saved as a CSV file, so how you load it depends on the programming language or tool you use.
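To make the idea of observing features concrete, here is a minimal sketch that reads CSV text with Python's standard csv module and lists the column names. The sample rows and column names below are made up for illustration and are not taken from the Kaggle dataset; load_rows is a hypothetical helper name.

```python
import csv
import io

# Made-up sample standing in for a real CSV file. The column names and
# values are illustrative assumptions, not the Kaggle dataset's contents.
SAMPLE_CSV = """Entity,Year,Depression (%),Eating disorders (%)
Country A,1990,3.2,0.10
Country A,1991,3.3,0.11
Country B,1990,,0.20
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_rows(SAMPLE_CSV)
features = list(rows[0].keys())  # column names come from the header line

print("Features:", features)
print("Number of rows:", len(rows))
print("First row:", rows[0])
```

For a file on disk, the same reader can be given a file object opened with open(path, newline="", encoding="utf-8") instead of the in-memory text. Note that csv.DictReader returns every value as a string, so numeric columns still need converting before analysis.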
Now, identify the limitations of your dataset(s), and explain why those limitations might exist. Consider the context of how your data was collected, whether you or someone else collected the data.
Looking at the dataset previously mentioned, one limitation is that the data was collected from 1870 to 2019, which does not consider how the COVID-19 pandemic impacted people’s mental health. Another limitation is that there are thousands of missing values for schizophrenia, anxiety disorders, bipolar disorder, drug use disorders, and depression. Eating disorders is the only category with more than 50% valid data, at 92% valid data points. Furthermore, this dataset does not cover all mental disorders; for example, it omits personality disorders and dissociative disorders. On a similar note, it does not break down the specific disorders within anxiety disorders, schizophrenia spectrum disorders, and eating disorders.
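One way to spot this kind of limitation is to measure how much of each column is actually filled in. The sketch below assumes a missing value appears as an empty cell; the sample data and column names are made up, and valid_share is a hypothetical helper name.

```python
import csv
import io

# Made-up sample data; blank cells stand for missing values.
SAMPLE_CSV = """Entity,Year,Depression (%),Eating disorders (%)
Country A,1990,3.2,0.10
Country A,1991,,0.11
Country B,1990,,0.20
Country B,1991,4.1,
"""


def valid_share(rows, column):
    """Return the fraction of rows whose value in `column` is not blank."""
    filled = sum(1 for row in rows if (row[column] or "").strip())
    return filled / len(rows)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
shares = {column: valid_share(rows, column) for column in rows[0]}

for column, share in shares.items():
    print(f"{column}: {share:.0%} valid")
```

Real datasets may mark missing values differently (for example with "NA" or a placeholder number), so the blank-cell check here is an assumption to adapt.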
Then, ask yourself what you would like to further investigate with the dataset(s) at hand.
For Global Trends in Mental Health Disorder (Kaggle), the dataset owner provided some questions to explore:
What types of mental illnesses do people have around the world?
How many people in each country suffer from mental disorders?
Which age groups are more susceptible to depression?
Based on the limitations of this dataset and other datasets you may work with in the future, you can decide which questions you want to explore further with data visualizations. If you do not have enough prior knowledge of a topic, it is best to learn more before creating your visualizations.
Resources to learn more about mental health and mental illnesses:
Next, preprocess your dataset so that the computer can easily interpret its features.
No matter which data preprocessing tool(s) you use, it is vital to remove the following values to improve the overall data quality:
Missing and duplicate data values
Outliers
Here are some of the libraries and tools to process and manipulate data: 7 Best Tools and Libraries for Data processing and manipulation
Learn more about data preprocessing: Data Preprocessing in Machine Learning [Steps & Techniques]
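As a rough illustration of the cleanup described above, the sketch below drops rows with missing values, then exact duplicates, then outliers. The records are made up, the helper names are hypothetical, and the 1.5 x IQR rule used to flag outliers is one common convention chosen here as an assumption rather than a rule from this article.

```python
import statistics

# Made-up (country, year, depression %) records for illustration only.
records = [
    ("Country A", 1990, 3.2),
    ("Country A", 1990, 3.2),   # exact duplicate of the row above
    ("Country B", 1990, None),  # missing value
    ("Country C", 1990, 3.0),
    ("Country D", 1990, 3.4),
    ("Country E", 1990, 3.6),
    ("Country F", 1990, 3.8),
    ("Country G", 1990, 4.0),
    ("Country H", 1990, 4.2),
    ("Country I", 1990, 45.0),  # implausibly high, likely an entry error
]


def drop_missing(rows):
    """Keep only rows in which every field has a value."""
    return [row for row in rows if all(value is not None for value in row)]


def drop_duplicates(rows):
    """Keep the first occurrence of each row, preserving order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


def drop_outliers(rows, index, k=1.5):
    """Drop rows whose value at `index` lies more than k * IQR outside Q1..Q3."""
    values = [row[index] for row in rows]
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [row for row in rows if low <= row[index] <= high]


cleaned = drop_outliers(drop_duplicates(drop_missing(records)), index=2)

for row in cleaned:
    print(row)
```

Before dropping an outlier in a real dataset, it is worth checking whether it is an error or a genuine extreme value, since removing real observations can hide exactly the pattern you want to visualize.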
Once you have preprocessed your dataset, it is time to decide what kind of data visualization(s) you would like to create based on the question(s) you are going to explore. For example, if you want to show the percentage of each mental health disorder around the world from 1990 to 2017, a bar graph with one bar per mental illness is a good fit. If you want to compare the rate of depression across countries, it is better to visualize it on a map.
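To show what goes into the bar graph example, the sketch below averages prevalence per disorder over a range of years and prints a rough text version of the bars. The observations are made-up numbers, and bar_heights and text_bar_chart are hypothetical helper names; in practice you would pass the same heights to a plotting library to draw the actual chart.

```python
from collections import defaultdict
from statistics import mean

# Made-up (disorder, year, prevalence %) observations for illustration only.
observations = [
    ("Depression", 1990, 3.0),
    ("Depression", 2017, 4.0),
    ("Anxiety disorders", 1990, 3.5),
    ("Anxiety disorders", 2017, 4.5),
    ("Bipolar disorder", 1990, 0.5),
    ("Bipolar disorder", 2017, 1.0),
    ("Depression", 2019, 9.0),  # outside the chosen year range, so ignored
]


def bar_heights(rows, start_year, end_year):
    """Average prevalence per disorder over the years start_year..end_year."""
    grouped = defaultdict(list)
    for disorder, year, value in rows:
        if start_year <= year <= end_year:
            grouped[disorder].append(value)
    return {disorder: mean(values) for disorder, values in grouped.items()}


def text_bar_chart(heights, scale=10):
    """Render one text bar per label, tallest first."""
    width = max(len(label) for label in heights)
    lines = []
    for label, value in sorted(heights.items(), key=lambda item: -item[1]):
        bar = "#" * round(value * scale)
        lines.append(f"{label:<{width}} | {bar} {value:.2f}%")
    return "\n".join(lines)


heights = bar_heights(observations, 1990, 2017)
chart = text_bar_chart(heights)
print(chart)
```

Averaging across years is only one way to summarize the data for a bar graph; picking a single year or plotting one group of bars per year are equally valid choices, depending on the question.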
Create the data visualizations you want to make for the question(s) you want to explore. Depending on the type of graphs you plan on making, look at tutorials for creating data visualizations.
Types of Data Visualizations:
Tools to help you analyze data and create visualizations:
Python:
R:
JavaScript d3 library:
SQL:
Excel:
Tableau:
After completing this step, it is imperative that you communicate your data visualization(s) in an accessible way to your audience. For example, write a caption and alternative text for each data visualization you make. Furthermore, document the limitations of your dataset(s) and data visualization(s) as needed. If you want to take it a step further, write a report, present your findings in a PowerPoint, or write an article.
To write more about the limitations of your findings within your dataset(s), I highly recommend reading Data Feminism by Catherine D’Ignazio and Lauren F. Klein. In particular, it would be helpful to read these chapters from the book:
Making insightful data visualizations is a nonlinear process, so take your time and make sure the visualizations you create are meaningful. By exploring your dataset(s) and asking questions worth investigating, you can use various tools and techniques to communicate your findings to your audience in an accessible way.