Cara menggunakan python automated data cleaning

Data preparation is time-consuming. You've collected data from multiple sources, in multiple formats, and now you're wondering: how do I put all the pieces together? How do I decide what's important? And how do I present my results in a way that other people can understand?

Turbo Prep is designed to make data preparation easier. It provides a user interface where your data is always visible front and center, where you can make changes step-by-step and instantly see the results, with a wide range of supporting functions to prepare your data for for model-building or presentation.

In the background, while you prepare your data, Turbo Prep builds a RapidMiner process. You can save that process and apply it later to similar data sets, so you don't have to do the same job twice.

Turbo Prep won't solve all your problems. If you want to use your data to make predictions and understand the results, see Auto Model. But you can't make predictions with worthless or inconsistent data. Turbo Prep will help you to put all the pieces together, to eliminate worthless data, to transform the remaining data into a consistent and useful format, and to present the results, once you have achieved a better understanding of your data.

Turbo Prep's supporting functions are divided into five broad categories:

  • - these functions help you to create useful subsets of your data (Filter, Range, Sample, Remove) or to modify the data in individual columns (Replace).
  • - these functions help you with missing values, duplicates, normalization and binning. Low quality data, of the kind discussed in , can be removed automatically via Auto Cleansing.
  • - these functions help you generate new data columns from existing data columns. The large variety of logical and mathematical operators is especially useful for feature engineering and for more complex data transformations.
  • - these functions simplify the task of creating summary tables (pivot tables) from your data.
  • - these functions help you to combine two or more data sets (Join).

Within Turbo Prep, click on the 🛈 symbol to learn more about any of these categories.

Once you're done preparing the data, you can take , including:

  • Model - Pass your data to Auto Model to help you build a model!
  • Charts - Display your data using a a variety of charts.
  • Process - Save your data preparation steps as a RapidMiner process, for later reuse.
  • History - Examine the history of your data preparation, roll back to an earlier step, and make changes.
  • Export - Save your data to a file, or save it in a RapidMiner repository.

Within RapidMiner Studio, Turbo Prep appears as a view, next to the Design view, the Results view, and Auto Model.

Example: Presenting results in a summary table

In what follows, we'll apply Turbo Prep to the Titanic data set. Note that it is not our intention to prepare the data for model-building. The issues with cleaning the Titanic data set are discussed in ; Turbo Prep can clean the data in the same way via Cleansing > Auto Cleansing.

See also the video introduction to data cleansing with Turbo Prep.

Our purpose is to create a single data table that captures the essential factors relevant to survival. We assume that the results from the documentation are known. In particular, we know that survival on the Titanic depends on the following factors:

  • Sex
  • Passenger Class
  • Age

Auto Model makes it possible to study these factors in the context of an . Now we want to use Turbo Prep to present our results in a .

Because Sex dominates the other factors in determining survival on the Titanic, and because we want to understand the role of less important factors such as Passenger Class and Age, we will split the data into two parts, male and female, and study each of the parts separately, before recombining the data at the end.

The goal is to create a table of the following form, including both male and female passengers.

Survival rate of female passengers on the Titanic

Age1st class2nd class3rd class0-90.01.00.5110-191.00.920.5520-290.960.860.4630-390.970.900.4240-491.00.910.2550-590.950.8360-690.870.01.070-791.0

To get started, choose the

10 * floor([Age] / 10)
1 view by pressing the button at the top of RapidMiner Studio.

Cara menggunakan python automated data cleaning

Load Data

After starting Turbo Prep, the first step is to select a data set from one of your repositories.

  1. Click

    10 * floor([Age] / 10)
    
    2.

  2. Select the Titanic data set from the repository, under

    10 * floor([Age] / 10)
    
    3 >
    10 * floor([Age] / 10)
    
    4. (If your data isn't in a repository, select
    10 * floor([Age] / 10)
    
    5 at the top of the screen.)

  3. Click

    10 * floor([Age] / 10)
    
    2 once again. Your data set now appears on the left side of the screen.

Notice that Titanic data set, once loaded, has a context (right-click) menu, with numerous options.

Cara menggunakan python automated data cleaning

You can, for example, choose

10 * floor([Age] / 10)
7 with chart style
10 * floor([Age] / 10)
8, and plot "Survived" as a function of "Sex" to see the difference between Male and Female survival rates:

Cara menggunakan python automated data cleaning

Press

10 * floor([Age] / 10)
9 to leave the Chart view and return to the Data view.

Generate

At the top of the Data view, select the category called Filter0. The purpose of functions in this category is to generate new data columns based on the existing ones. For example, the Titanic data set includes two columns called "No of Siblings or Spouses Onboard" and "No of Parents or Children Onboard". If you were engineering new features, you might add these two columns to generate a new column called "No of Relatives Onboard".

Generate numeric values for "Survived"

In the current analysis, we are examining survival rates, and therefore it will be useful to generate a new column based on "Survived", but with numeric values 1 and 0 instead of "Yes" and "No", so that we can more easily calculate averages and other statistics. We give the new column a name ("survived_value") and build a function in the formula editor to convert "Yes" to 1 and "No" to 0. Notice that column names from the list on the left can be dragged into the formula editor, and that function documentation is available on the right.

if([Survived]=="Yes",1,0)

Click on Filter1 to see the resulting column, and Filter2 to save the result. The "Survived" column is now redundant, and could be deleted (Filter3 > Remove), but it's not necessary.

Cara menggunakan python automated data cleaning

Note: A similar result can be achieved via the function Filter3 > Replace (with "Yes" replaced by 1 and "No" replaced by 0), in combination with the function Filter3 > Filter8 ("Change to number"). In this case, no new data column is created, and the original data column "Survived" is converted from categorical to numerical.

Generate bins for the "Age" data

We said before that we want to understand the impact of "Passenger Class" and "Age" on survival. To make the data more suitable for a summary table, we will put the passengers into age groups -- ages 0-9, 10-19, 20-29, etc. To do so, we again click on Filter0, give a column name ("age_category"), and build a function in the formula editor. Notice that floor is a rounding function, so any number in the range 20-29 is rounded down to 20.

10 * floor([Age] / 10)

Click on Filter1 to see the resulting column, and Filter2 to save the result.

Cara menggunakan python automated data cleaning

Note: A similar result can be achieved via the function Cleansing > Range3, by creating 8 equal-width bins for the data in the "Age" column, since the range spans 0-80 years. In this case, no new data column is created, and the original data column "Age" is converted from numerical to categorical, with values {range1, range2,... range8}.

Copy data

We want to make two copies of the Titanic data set and call them "Titanic_male" and "Titanic_female".

  1. From the Data view, right-click on the Titanic data set, and choose Range4 from the menu.

  2. Right-click the copy, and choose Range5 from the menu. Call the copy "Titanic_female".

  3. Repeat steps (1) and (2) to create "Titanic_male"

Three identical data sets are now displayed in the Data view. To create male and female data sets, we need to transform the data.

Cara menggunakan python automated data cleaning

Transform

At the top of the Data view, select the category called Filter3. For the data set called "Titanic_female", our purpose is to keep all the data related to female passengers, discarding all the data related to male passengers.

  1. Click on the data column named "Sex".

  2. Select Filter from the list of functions on the left. Choose "equals" for the relationship, and "Female" for the value. Click Range8.

  3. Click Range9.

Once the transformation of "Titanic_female" is complete, repeat the operation for "Titanic_male", using the value "Male" in the Filter function.

Cara menggunakan python automated data cleaning

Pivot

See also the video introduction to data pivoting with Turbo Prep.

At the top of the Data view, select the category called Sample0. A pivot table is a summary data table. Usually the rows and columns are composed of categories from your original data set, while the individual cells contain numeric data, typically in the form of a sum (e.g., "Total Sales") or an average (e.g., "Survival Rate") for all the data points belonging to those categories.

With Turbo Prep, creating a pivot table is easy: drag a column name from the left, and drop it onto one of the three boxes:

  • Sample1 - The data categories you choose here will become rows in your pivot table.
  • Sample2 - The data categories you choose here will become columns in your pivot table.
  • Sample3 - The numeric data you choose here will typically be summed or averaged.

Take the following steps for each of the data sets "Titanic_female" and "Titanic_male":

  1. Drag "survived_value" into Sample3. This first version of the pivot table is composed of a single value, the survival rate for all females (males).

  2. Drag "Passenger Class" into Sample2. The pivot table now has 3 cells, with a survival rate for each passenger class.

  3. Drag "age_category" into Sample1. The pivot table now includes the survival rate for females (males) in each of several categories, sorted according to age (rows) and class (columns).

In our example, the the survival rate is calculated by taking the average of "survived_value" for each cell in the pivot table, but notice that you can right-click on "survived_value" and choose a different statistic, such as "sum" (to get the number of passengers that survived) or "count" (to get the total number of passengers).

When you're done creating the pivot table, click Sample7.

Cara menggunakan python automated data cleaning

Results

Examining the two pivot tables for "Titanic_female" and "Titanic_male", we can draw some conclusions:

  • Among female passengers, those in first and second class had a significantly better chance of survival than those in third class (90% vs 50%).

  • Among male passengers, those in first class had a significantly better chance of survival than those in second or third class (35% vs 15%).

  • Male passengers in third class were actually more likely to survive than those in second class, unless they were small children.

  • Passengers above the age of 40 were less likely to survive than younger passengers, with the exception of females in first and second class.

The survival rate for female passengers was given in the .

Survival rate of male passengers on the Titanic

Age1st class2nd class3rd class0-91.01.00.3710-190.420.060.0820-290.440.090.1930-390.410.090.1740-490.320.050.0650-590.280.00.060-690.070.160.070-790.00.00.080-891.0

Merge

See also the video introduction to merging data with Turbo Prep.

We now want to merge the two pivot tables, "Titanic_female" and "Titanic_male". Unfortunately, the two pivot tables have a nearly identical structure, and only the name of the data set makes it clear which data is male and which is female. To avoid losing important information, we rename (Filter3 > Range5) the 3 passenger classes to {female1, female2, female3} in "Titanic_female" and to {male1, male2, male3} in "Titanic_male".

Then we create a new pivot table called "Titanic_merged".

  1. Right-click "Titanic_female" and choose Range4.

  2. Right-click the copy and choose Range5. Call the new data set "Titanic_merged".

At the top of the Data view, select the category called Remove2. The idea of a "join" is that each row of data can be identified by a unique "key"; when two rows in the two data sets have the same key, their data is combined. A complication occurs when a key occurs in one data set, but not the other. Then you have to decide whether that data should be included in the combined table -- or not.

MethodInclude the data in the combined table...Inner Join...only if the key appears in both data setsLeft Join...only if the key appears in the first data setRight Join...only if the key appears in the second data setOuter Join...if the key appears in either data set (all data)

In our example, the "join keys" are the values of "age_category", but "Titanic_male" includes a passenger who is 80+ years of age, while there is no such passenger in "Titanic_female". With an inner join or a left join, we will lose this data; to include it, we must choose a right join or an outer join. The surest way to include all data is to use an outer join.

  • Merge With - "Titanic_male", since our starting point was "Titanic_female"
  • Merge Type - "Outer join", so no data is lost
  • Join Keys - "age_category"

Click on Remove3. Notice that "Titanic_merged" includes rows where the "age_category" is missing. You can delete them by clicking on "age_category", followed by Filter3 > Filter > "is not missing".

Cara menggunakan python automated data cleaning

Additional actions (⋯)

What's left to do? We've succeeded in generating a pivot table for survival rate on the Titanic, measuring the impact of "Sex", "Passenger Class", and "Age", with the results now contained in a single table.

The additional actions menu (⋯) on the top right of the Data view gives some hints.

Cara menggunakan python automated data cleaning

Export

You can save the final pivot table to a file or to a RapidMiner repository. The available file formats include Excel (.xlsx), CSV (.csv), and Qlik (.qvx).

History

You can examine the history of your data preparation, roll back to an earlier step, and make changes.

Cara menggunakan python automated data cleaning

Model

It's not relevant in our current example, but if a new version of the data set were generated once weekly, we could generate a weekly summary table by saving our work as a RapidMiner process, then feeding the new data sets to that process.

3 metode cara langkah langkah data cleaning?

Cara melakukan data cleaning.
Mendeteksi error. Langkah awal yang harus dilakukan adalah memantau notifikasi error atau corrupt. ... .
2. Hapus duplikat data atau data yang tidak perlu. ... .
Perbaiki kesalahan struktur. ... .
4. Filter outlier yang tidak diinginkan. ... .
Tangani data yang hilang. ... .
6. Validasi dan lakukan QA..

Apakah fungsi data cleaning dalam python?

Data cleansing atau data cleaning merupakan suatu proses mendeteksi dan memperbaiki (atau menghapus) suatu record yang 'corrupt' atau tidak akurat berdasarkan sebuah record set, tabel, atau database.

Apa yang harus dilakukan dalam proses data cleaning?

Langkah-langkah utama pembersihan data, meliputi memodifikasi dan menghapus bidang data yang salah dan tidak lengkap, mengidentifikasi dan menghapus informasi duplikat dan data yang tidak terkait, serta mengoreksi format, nilai yang hilang, dan kesalahan ejaan.

Jelaskan apa yang dimaksud dengan cleaning data?

Data cleansing adalah suatu proses mendeteksi dan memperbaiki (atau menghapus) data set, tabel, dan database yang korup atau tidak akurat.