Data Wrangling in Python

Faiza Khalid
12.08.2021
|
4112
Data wrangling in Python

These days we hear a lot about data and data forms the basis of the digital world. The human race is producing data every second with a button click and a web search. By the end of 2020, we have generated 44 zettabytes of data! One zettabyte is equal to a billion terabytes. All this data has revolutionized science and technology. This data also forms the core of advanced machine learning models. However, all of this data in raw form is not of much use to us and we cannot infer much from it unless it is in a specific format.

The process of converting raw data into a clean format to deduce useful information is called data wrangling. It is also called the data preprocessing stage. In this stage, the garbage and missing values in the data are handled. Hence, it is also called the data cleaning phase. In this article, we will discuss in detail data wrangling. We will see how we can prepare a dataset for actual processing and analysis in Python language.

Data Problem

Let’s start with an example. Suppose we want to sell a car. We have to find the right price for it. The price should be neither too low nor too high, given the specifications of the car. Now, a data analyst will think, are there other cars similar to my car? How much do they sell for? What are their prices given their specifications? The first thing a data analyst looks for while solving this problem is data. To continue with this example problem, we will use an open-source dataset of used cars by Jeffrey C. Schlimmer. We can download this dataset from here. There are 26 columns in total. Our target column is 26th, the price. We want to know the price of a car given the other columns as predictors.

Data Wrangling Methods

Data understanding:

The first step in the analysis of data is its acquisition and loading. We will use the Pandas library in Python to load and read our dataset file. Pandas is a widely used library for data manipulation. It uses a dataframe data structure to store data. The dataset can be saved locally or is available online, as in our case. Read_csv method takes the path. It considers the first row in the dataset as the header of the file. If this is not the case, as in ours, we can set it as none. The df is our dataframe.

    pandas as pd
    DatasetPath =  "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/auto.csv"
    df = pd.read_csv(DatasetPath, header = None)

We can print the top and bottom five rows using the head and tail functions.

    print(df.head(5))
    print(df.tail(5))

Now, we can set header names using the documentation provided here for easy analysis of our data. Otherwise, Pandas assigns numbers to each column.

    headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style","drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type","num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower","peak-rpm","city-mpg","highway-mpg","price"]
    df.columns = headers

The describe method gives a summary of the whole dataset.

    print(df.describe())

It returns some useful statistical insights of data. The mean, standard deviation, minimum and maximum values of columns, etc. We can also get the datatypes of each column through,

    print(df.dtypes)

It gives an overall idea of the nature of data.

Handling Missing Values

Each row in our dataset is a data point or a record. A record consists of column values. A record can have missing column values. These missing values show as a blank, question mark, 0, or a NaN. It is important to handle such values in the initial stages to ensure smooth processing. Different methods help to handle such missing values.

The first approach is to retrieve the unknown or missing values from the source and fill them in the data set. The second approach is to skip records with the missing values. Dropping data is dependent on the amount of data and missing values in it. If there are not many missing values, we can consider dropping the tables. The third approach is entering values in blank places considering the other data values. This approach includes a chance of introducing inaccurate values, since we are guessing in place of an original data value. Lastly, we can use the method of average values introduction. We take the mean value in the case of quantitative data and add it in place of a missing value. If the column is categorical or qualitative, we can use the most appeared value in place of the missing column value and this value is called a mode of the column.

Let’s apply these methods to our data.

    df.dropna(subset=["peak-rpm"], axis=0, inplace=True)

dropna deletes the rows with the missing values in the specified column. Axis=0 indicates to delete rows, and axis=1 is to delete the whole column. The True Inplace argument makes changes to the dataframe directly.

We will replace the missing values in the city-mpg column with the average value of the city-mpg. Use replace() function.

    avg_city_mpg = df["city-mpg"].astype(float).mean(axis=0)
    df["city-mpg"].replace(np.nan,avg_city_mpg,inplace=True)

Data Formatting

If the data under study is collected from multiple sources, there are chances of data inconsistency. Consistent data helps with accurate data analysis. Changing different formats into one consistent format for meaningful data processing is data formatting.

We can use dtype() to check the data type of every column. Then we can change the data type to a consistent one wherever needed. Astype() function changes datatype from one into another.

Data Normalization

Data normalization is an important concept for data analysis. Normalization ensures that all of our variables have the same effect on the model irrespective of data values. For example, our predictor variables are price and area of the house. The price has a lower range of values whereas the area variable has large values. Area values, if used as is, can influence our model when making predictions, although, they are not more important than the price variable. However, if we normalize price and area to a range between 0 and 1, they will have the same effect.

Some ways to normalize the data values are simple feature scaling, minimum-maximum method, z-score, etc.

Dividing each value in the column with the maximum column value is feature scaling.

\( xn= \frac {x}{xmax} \)

The Min-Max method keeps all values in the range of 0 and 1. It takes a difference of each value from the minimum column value and then divides it with the column range.

\( xn= \frac {x - xmin}{xmax - xmin} \)

Standard score or z-score subtracts the mean value of the column from each column value. Then divides with the standard deviation of the column. This method keeps the values between 3 and -3.

\( xn=\frac {x-µ} {σ} \)

Here, &micro is the mean and σ is the standard deviation of the feature column.

We can apply these normalization methods in Python as,

    df["width"] = df["width"]/df["width"].max()
    df["width"] = (df["width"]-df["width"].min())/(df["width"].max()-df["width"].min())
    df["width"] = (df["width"]-df["width"].mean())/df["width"].std()

Data Binning

Binning refers to the grouping of data values. For example, we can group the age values as 0 to 18, 19 to 29, etc. The difficulty of a task can be grouped as easy, medium, and hard. Such grouping of column values makes data analysis easier. Let’s bin price range in our dataset.

The Linspace() function will give us equally spaced four numbers in our price column. The cut () function will divide the price column into the given number. First, we handle the missing and null values in the price column.

    df.replace("?", np.nan, inplace = True)
    df.dropna(subset=["price"], axis=0, inplace=True)
    df["price"] = df["price"].astype("int")
    df.reset_index(drop=True, inplace=True)
    
    bins = np.linspace(min(df["price"]),max(df["price"]),4)
    bin_names = ["Low","Medium","High"]
    df["price.binned"]=pd.cut(df["price"], bins, labels=bin_names, include_lowest = True)

Here, np is the numpy library.

We can also visualize this price range with a histogram.

    import matplotlib as plt
    from matplotlib import pyplot
    
    plt.pyplot.hist(df["price"])
    plt.pyplot.xlabel("price")
    plt.pyplot.ylabel("cars count")
    plt.pyplot.title("price groups")
    plt.pyplot.show()

Conversion of Categorical Data

Lastly, we will see how to convert qualitative or categorical data into quantitative data. In data analysis, we use statistical methods, and these methods usually work with quantitative data. Dummies() function in Pandas can help us here.

For example, we want to change the engine type column into straight, inline, and flat engine types. 0 and 1 values will indicate the car’s respective engine type.

    dummy_variable = pd.get_dummies(df["engine-type"])
    dummy_variable.rename(columns={'engine-type-available':'straight', 'engine-type-available':'inline', 'engine-type-available':'flat'}, inplace=True)

Then we merge the dataframe and the dummy variables and drop the original column that is now quantized.

    df = pd.concat([df, dummy_variable], axis=1)
    df.drop("engine-type", axis = 1, inplace=True)

So that’s it for data wrangling in Python and we are now familiar with an important prerequisite of data analysis and data processing.

More stories

implementation of MATLAB to C/C++ conversion and MEX-functions and simulation method Monte Carlo
MATLAB TO C/C++ Part2

Dive into the Monte Carlo simulation method. Read about specific circumstances for convenient implementation, real implementation of MEX-functions simulated by MATLAB.

Read more
translation of MATLAB algorithms to C++ via the MEX-function
From MATLAB to C/C++ Part 1

Look at some of the benefits and disadvantages of using MATLAB, the difference between a compiler and interpreter and how to use a MEX-function within the MATLAB environment to automatically translate algorithms.

Read more
GPS in Android as a Library Module
GPS in Android as a Library Module Published on Jitpack via Github. Part 2

Part 2 of how to create a GPS library model in android and add Java class to implement GPS, which includes methods to test created library in app.

Read more
create a GPS library model in android
GPS in Android as a Library Module Published on Jitpack via Github. Part 1

How to create a GPS library model in android and add Java class to implement GPS.

Read more
Kubernetes to expand cloud possibilities
Kubernetes - The Ideal Cloud Platform Choice

Google’s Kubernetes cloud-based app deployment platform as an optimal solution for developers and businesses due to the framework’s versatile architecture and features like scalability, immutability and separation of concern.

Read more
How to use Regular Expressions in Python

Learn how to work with regular expressions in Python to solve pattern-matching problems for which operators may not be enough.

Read more

Algo uses cookies for personalization and other purposes. Learn more about cookies use. Algo supports the Digital Advertising Alliance principles. By interacting with this site, you agree to our use of cookies.