Obtaining and Cleaning the Data

Section 5.3 Obtaining and Cleaning the Data

Here, we will find reputable sources for data on anti-trans bills and get the data into a form we can analyze.

In this section, I will...

Provide reasons for why data needs to be cleaned.
Learn how to access, merge, and clean my own data sets.
Explain and execute the data cleaning process for analysis.

Subsection 5.3.1 Importing and Merging the Data

Exercises Exercises

Finding Data Online.

Our first goal is to find data. Note that any data found online may change web addresses, stop updating, go offline, or become unreliable over time so downloading the dataset as a .csv (comma-separated values) or .xlsx (Microsoft Excel Open XML Format Spreadsheet) file could save you time and stress in the future. Your instructor may give you alternative sources of data if these are no longer reliable.

1.

Go to the website tracktranslegislation.com ¹⁰², which in turn obtains its data from LegiScan ¹⁰³, and browse around. What are two things that you notice about the site? Two things that surprise you? Two things that aren’t on the site that you wonder about?

2.

This data is up-to-date from 2021 through April 21, 2023. In order to track bills prior to 2021, we use data from the ACLU’s “Past Legislation Affecting LGBT Rights Across the Country” pages for 2018, 2019, and 2020 ¹⁰⁴.

Go to the ACLU "Past Legislation Affecting LGBT Rights Across the Country 2020"¹⁰⁵ webpage and skim that page. What similarities do you notice with the data found on the Track Trans Legislation website? What differences do you notice? Click on the "View 2019" and "View 2018 Session bills" links and do the same for those years.

3.

We’d like to get a sense of how anti-trans legislation in the U.S. changed over time, so we’re going to try to merge the Track Trans Legislation (TTL) data with the ACLU data. Since the ACLU data has different bill "Status" categories than TTL, we’ll need to figure out how to classify each ACLU bill into one of the TTL categories.

Use the Terminology ¹⁰⁶ page on the TTL website to answer the following question. Which of the TTL categories would you classify "Referred to committee" into? "Hearing scheduled"? "Withdrawn"? You may want to click on the bill numbers on the ACLU site to see how the website LegiScan, a constantly-updated bill tracker, classifies each bill.

Note that the 2020 ACLU page was last updated on March 20, 2020, since many state legislatures were suspended or closed during the first year of the COVID-19 pandemic; the ACLU page promised to “update the tracker as major new developments occur[red] .” This data has distinct variable names and organization, so we also modify the ACLU data to match the Track Trans Legislation data as closely as possible.

For example, bills that were withdrawn, not passed by the end of a given legislative session, explicitly listed as “Dead”, or were recommended against by a committee and did not proceed in the legislature were relabeled “Dead” (at least for that year). The exception is when the bill description is specifically listed as “hearing scheduled”, “referred to committee”, carried over from another year, or otherwise makes clear that the bill is still under consideration, in which case “Introduced” or “Crossed Over” (depending on whether the bill had been passed by at least one chamber) is used.

We only include bills in categories tracked by both data sources; this leaves out, for example, bills preventing localities from passing anti-discrimination ordinances within a state. We use a broad reading of the “religious freedom” category to include bills that allow for people with “sincerely-held religious beliefs” in that state to challenge nondiscrimination laws, discriminate against LGBTQ+ people, refuse to provide healthcare to LGBTQ+ people, refuse to provide adoption services to LGBTQ+ people; discriminate against married LGBTQ+ people, and receive funding for discriminatory student groups at public universities, among others.

Subsection 5.3.2 Cleaning the Data

Cleaning is a process that involves fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. For more information, check out Tableau’s Guide to Data Cleaning ¹⁰⁷. After importing and merging data sets, it is integral to clean the data before working or manipulating the data and producing any conclusions. While the cleaning process can be done manually, it is much more efficient to use a software program such as Microsoft Excel or a programming language such as R when working with multiple data sets or large quantities of data.

Here are the steps to clean your data:

1. Remove duplicate or irrelevant information 2. Fix naming conventions, typos, or any other structural errors that do not make the data uniform 3. Filter unwanted data such as outliers 4. Handle missing data by dropping observations, inputting missing values, or altering the data 5. Validate and ensure that the cleaning process was completed correctly before beginning the coding or visualizing process

First, the 2018-2020 ACLU datasets code state names by their two-letter abbreviations (e.g., “AZ” instead of “Arizona”), while Track Trans Legislation uses full names. So we use a program to convert full names to abbreviations in the whole dataset. We have done this cleaning and it can be found in ADD GITHUB LINK HERE! You can see the code we used here: ADD THE CODE FOR STUDENTS THAT ARE CURIOUS!!!! We also note that, for example, the 2021 dataset includes some bills passed in January 2022, so we eliminate duplicate bills.

This decreases the number of bills in our dataset from 927 to 893. Next, note that the bills whose status is labeled Introduced* by TTL are those that failed to meet their state’s “crossover deadline”, the date by which a bill must pass out of the chamber in which it was introduced and to the other chamber (e.g. State House vs. Senate). According to the site Track Trans Legislation ¹⁰⁸, a bill that is not passed in its initial chamber by the crossover deadline “faces high procedural hurdles in order to move forward.” Thus, we wish to classify these bills (at least for the current session) as “Dead/Failed”.

Moreover, one bill’s status is listed as “Posted”, Kentucky’s HB132 in 2020. LegiScan research ¹⁰⁹ reveals that this bill died in committee, so we update its status to Dead/Failed.

Let’s walk through cleaning the most recent anti-trans legislation data from Track Trans Legislation ¹¹⁰

1. Let’s look at the data from all states, all bill types, and all statuses (i.e. you should not be filtering anything in this case). Click on the “Export” button and download the data as a CSV file.

2. We now need to open the CSV file in Excel. To do this, take the following steps: Open Excel. Select File → Open…→ Choose the CSV file. The Excel sheet should populate with the data.

3. To get a preview of what the data looks like, go to the Home tab within Excel and press Analyze Data. A side panel should appear on the right of the screen. Always be sure to check these insights present the information clearly.

Excel file downloaded from Track Trans Legislation displaying the "Analyzing Data" button and corresponding window.

4. Remove any duplicates. To remove duplicates: Data → Remove Duplicates, ensure that all columns with duplicates are checked off so they are removed.

Excel heading with the "Data" tab and "Remove Duplicates" button highlighted in a red box.

5. Now, let’s make sure the formatting is consistent. Let’s try lowercasing all of the state names. First, create a new column to the side of the states column. Next, in this new column, enter “=LOWER( )” function into the first cell. In the parentheses, write X:X, replacing the Xs with the letter of the original column and hit enter. This will populate the new column with all of the state names in lowercase letters:

Columns E and F from Excel file showing the before and after results of using the "=LOWER()" function

Note: If you don’t want the formulas in the resultant cells, you just want the new lowercased versions of the names as if they had been hand-typed, you can select the names, press Control-C to copy them, navigate to Paste → Paste Values, and paste them into a new column or on top of the column you had previously made. Doing this will allow you to delete the original column without affecting the new, lowercase column. We can also capitalize all letters in the cells we select by using “=UPPER( )” or use “=PROPER( )” to reset the capitalization so the first letter is the only one capitalized.

Drop down from the "Paste" with "Paste Values" highlighted in green.

6. Next, we will find entries for the state of “Oklahoma” and replace them with the state abbreviation “OK” using the Find and Replace function. To do this, the following steps apply: First, navigate to Home → Editing → Find & Select. Next, Replace → Find what: “Oklahoma” and Replace with: “OK” → Replace All. Note, the alert will notify you of how many cells were altered and you should make sure that all planned changes were made.

Excel heading with the "Home" tab and the "Find & Select" button highlighted in red boxes.

The "Find & Replace" pop-up window with "Find what:" and "Replace with:" dropdowns, the "Replace All" button, and alert message highlighted in red boxes.

7. Finally, we can use the TRIM( ) function to eliminate excessive spaces. For example, if in our data set we have “ Arizona” (note the extra space before the A) and we want “Arizona,” we first need to create a new column adjacent to the state column. Next, in this new column, enter “=TRIM( )” into the first cell. In the parentheses, write X:X, replacing the Xs with the letter of the original state column and hit enter. This will populate the new column with all of the state names without excessive spaces.

Note: The TRIM( ) function will not remove any spaces between two words in a state name (“North Carolina” to “NorthCarolina”), it will just remove excess spaces at the beginning and end of the state name.

Subsection 5.3.3 Exercise

Instructions: Open the Oklahoma Excel file ¹¹¹ and click “view raw” to download the file.

You are part of a data-cleaning team focused on this data set. Your job is to clean the provided data set as follows: tracktranslegislation.com ¹¹²

Delete all conditional formatting (i.e. capitalization)
Replace all mentions of “Alabama” with “Oklahoma”
Ensure that everything is spelled correctly (use spell check)
Remove all duplicate bills
Create a new column called “Bill Description 2.” Use the TRIM() function to remove excess spaces in the “Bill Description” column.
Your boss wants a new “State2” column where all of the state names are completely capitalized (i.e. “OKLAHOMA”). Create a new “State2” column using the UPPER( ) function.

Note: If you want to know how to clean data using the R language, check out Chapter 5 on Data Structures ¹¹³ in RCookbook.

tracktranslegislation.com

legiscan.com

www.aclu.org/past-legislation-affecting-lgbt-rights-across-country-2018

www.aclu.org/past-legislation-affecting-lgbt-rights-across-country-2020

www.tracktranslegislation.com/terminology

www.tableau.com/learn/articles/what-is-data-cleaning

www.tracktranslegislation.com/

legiscan.com/KY/bill/HB132/2020

www.tracktranslegislation.com/

github.com/BatesDCS23-24/FemSTEM2/blob/b680dd6723a2a4097ad27e978de7d686d9deeb02/Excel%20files/5.3%20Exercise%201.xlsx