Data preparation is time-consuming. You've collected data from multiple sources, in multiple formats, and now you're wondering: how do I put all the pieces together? How do I decide what's important? And how do I present my results in a way that other people can understand?
Turbo Prep is designed to make data preparation easier. It provides a user interface where your data is always visible front and center, where you can make changes step-by-step and instantly see the results, with a wide range of supporting functions to prepare your data for model-building or presentation.

In the background, while you prepare your data, Turbo Prep builds a RapidMiner process. You can save that process and apply it later to similar data sets, so you don't have to do the same job twice.

Turbo Prep won't solve all your problems. If you want to use your data to make predictions and understand the results, see Auto Model. But you can't make predictions with worthless or inconsistent data. Turbo Prep will help you to put all the pieces together, to eliminate worthless data, to transform the remaining data into a consistent and useful format, and to present the results, once you have achieved a better understanding of your data.

Turbo Prep's supporting functions are divided into five broad categories:

- Transform
- Cleanse
- Generate
- Pivot
- Merge
Within Turbo Prep, click on the 🛈 symbol to learn more about any of these categories.

Once you're done preparing the data, you can take additional actions, including:
Within RapidMiner Studio, Turbo Prep appears as a view, next to the Design view, the Results view, and Auto Model.

Example: Presenting results in a summary table

In what follows, we'll apply Turbo Prep to the Titanic data set. Note that it is not our intention to prepare the data for model-building. The issues with cleaning the Titanic data set are discussed elsewhere in the documentation, and Turbo Prep can clean the data in the same way. See also the video introduction to data cleansing with Turbo Prep.

Our purpose is to create a single data table that captures the essential factors relevant to survival. We assume that the results from the documentation are known. In particular, we know that survival on the Titanic depends on the following factors:
Auto Model makes it possible to study these factors in detail. Now we want to use Turbo Prep to present our results in a summary table.

Because Sex dominates the other factors in determining survival on the Titanic, and because we want to understand the role of less important factors such as Passenger Class and Age, we will split the data into two parts, male and female, and study each of the parts separately, before recombining the data at the end. The goal is to create a table of the following form, including both male and female passengers.

Survival rate of female passengers on the Titanic

Age      1st class   2nd class   3rd class
0-9      0.0         1.0         0.51
10-19    1.0         0.92        0.55
20-29    0.96        0.86        0.46
30-39    0.97        0.90        0.42
40-49    1.0         0.91        0.25
50-59    0.95        0.83
60-69    0.87        0.0         1.0
70-79    1.0

To get started, choose the Turbo Prep view by pressing the button at the top of RapidMiner Studio.

Load Data

After starting Turbo Prep, the first step is to select a data set from one of your repositories.
Notice that the Titanic data set, once loaded, has a context (right-click) menu with numerous options. You can, for example, open a chart from this menu, pick a chart style, and plot "Survived" as a function of "Sex" to see the difference between Male and Female survival rates. When you are done, leave the Chart view and return to the Data view.

Generate

At the top of the Data view, select the category called Generate.

Generate numeric values for "Survived"

In the current analysis, we are examining survival rates, so it is useful to generate a new column based on "Survived", but with the numeric values 1 and 0 instead of "Yes" and "No". That makes it easier to calculate averages and other statistics. We give the new column a name ("survived_value") and build a function in the formula editor to convert "Yes" to 1 and "No" to 0. Notice that column names from the list on the left can be dragged into the formula editor, and that function documentation is available on the right.

    if([Survived]=="Yes",1,0)

Then apply the formula to create the new column.

Note: A similar result can be achieved with one of Turbo Prep's built-in functions.

Generate bins for the "Age" data

We said before that we want to understand the impact of "Passenger Class" and "Age" on survival. To make the data more suitable for a summary table, we will put the passengers into age groups (ages 0-9, 10-19, 20-29, etc.). To do so, we again generate a new column ("age_category"), this time with the following formula, which maps each age to the lower bound of its ten-year group:

    10 * floor([Age] / 10)

Then apply the formula to create the new column.

Note: A similar result can be achieved with one of Turbo Prep's built-in functions.

Copy data

We want to make two copies of the Titanic data set and call them "Titanic_male" and "Titanic_female".
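For readers who want to check the two Generate formulas outside of Turbo Prep, here is a minimal sketch in plain Python. It is not how Turbo Prep works internally; the sample rows are invented for illustration, and the sketch assumes every row has a numeric "Age" (rows with missing ages would have to be handled first).

```python
import math

# Hypothetical sample rows; the real Titanic data set has more rows and columns.
passengers = [
    {"Sex": "Female", "Age": 29.0, "Survived": "Yes"},
    {"Sex": "Male", "Age": 0.92, "Survived": "Yes"},
    {"Sex": "Male", "Age": 30.0, "Survived": "No"},
    {"Sex": "Female", "Age": 63.0, "Survived": "Yes"},
]


def survived_value(row):
    """Python counterpart of if([Survived]=="Yes",1,0)."""
    return 1 if row["Survived"] == "Yes" else 0


def age_category(row):
    """Python counterpart of 10 * floor([Age] / 10).

    Returns the lower bound of the ten-year age group (0, 10, 20, ...).
    Assumes row["Age"] is a number, not a missing value.
    """
    return 10 * math.floor(row["Age"] / 10)


# Add both generated columns to every row.
for row in passengers:
    row["survived_value"] = survived_value(row)
    row["age_category"] = age_category(row)

for row in passengers:
    print(row["Age"], row["Survived"], row["survived_value"], row["age_category"])
```

With numeric 1/0 values, the mean of "survived_value" over any group of rows is that group's survival rate, which is what the pivot step relies on later.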
Three identical data sets are now displayed in the Data view. To create male and female data sets, we need to transform the data.

Transform

At the top of the Data view, select the category called Transform.
Once the transformation of "Titanic_female" is complete, repeat the operation for "Titanic_male", using the value "Male" in the Filter function.

Pivot

See also the video introduction to data pivoting with Turbo Prep.

At the top of the Data view, select the category called Pivot. With Turbo Prep, creating a pivot table is easy: drag a column name from the left, and drop it onto one of the three boxes:
Take the following steps for each of the data sets "Titanic_female" and "Titanic_male":
In our example, the survival rate is calculated by taking the average of "survived_value" for each cell in the pivot table, but notice that you can right-click on "survived_value" and choose a different statistic, such as "sum" (to get the number of passengers that survived) or "count" (to get the total number of passengers). When you're done creating the pivot table, apply the change.

Results

Examining the two pivot tables for "Titanic_female" and "Titanic_male", we can draw some conclusions:
The survival rate for female passengers was given in the table above.

Survival rate of male passengers on the Titanic

Age      1st class   2nd class   3rd class
0-9      1.0         1.0         0.37
10-19    0.42        0.06        0.08
20-29    0.44        0.09        0.19
30-39    0.41        0.09        0.17
40-49    0.32        0.05        0.06
50-59    0.28        0.0         0.0
60-69    0.07        0.16        0.0
70-79    0.0         0.0         0.0
80-89    1.0

Merge

See also the video introduction to merging data with Turbo Prep.

We now want to merge the two pivot tables, "Titanic_female" and "Titanic_male". Unfortunately, the two pivot tables have a nearly identical structure, and only the name of the data set makes it clear which data is male and which is female. To avoid losing that information in the merged table, we first rename the columns. Then we create a new pivot table called "Titanic_merged".
At the top of the Data view, select the category called Merge.

In our example, the "join keys" are the values of "age_category", but "Titanic_male" includes a passenger who is 80+ years of age, while there is no such passenger in "Titanic_female". With an inner join or a left join, we will lose this data; to include it, we must choose a right join or an outer join. The surest way to include all data is to use an outer join.
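The pivot, rename, and merge steps can also be sketched in plain Python to make the effect of the join type concrete. This is an illustration only, not Turbo Prep's implementation: the sample rows, the class labels "First" and "Third", the helper names `pivot_mean` and `join`, and the "female_"/"male_" column prefixes are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical rows that already carry the generated columns
# "survived_value" and "age_category".
rows = [
    {"Sex": "Female", "Passenger Class": "First", "age_category": 70, "survived_value": 1},
    {"Sex": "Male", "Passenger Class": "First", "age_category": 70, "survived_value": 0},
    {"Sex": "Male", "Passenger Class": "First", "age_category": 70, "survived_value": 0},
    {"Sex": "Male", "Passenger Class": "First", "age_category": 80, "survived_value": 1},
    {"Sex": "Female", "Passenger Class": "Third", "age_category": 20, "survived_value": 1},
    {"Sex": "Female", "Passenger Class": "Third", "age_category": 20, "survived_value": 0},
    {"Sex": "Male", "Passenger Class": "Third", "age_category": 20, "survived_value": 0},
]


def pivot_mean(rows, sex, prefix):
    """Filter by Sex, then average survived_value per (age_category, class).

    Column names get a prefix so that female and male columns remain
    distinguishable after the merge (the "rename" step).
    """
    cells = defaultdict(list)
    for r in rows:
        if r["Sex"] == sex:
            cells[(r["age_category"], r["Passenger Class"])].append(r["survived_value"])
    table = defaultdict(dict)
    for (age, cls), values in cells.items():
        table[age][f"{prefix}_{cls}"] = sum(values) / len(values)
    return dict(table)


def join(left, right, how="outer"):
    """Join two {join_key: {column: value}} tables on their keys."""
    if how == "inner":
        keys = [k for k in left if k in right]
    elif how == "left":
        keys = list(left)
    elif how == "right":
        keys = list(right)
    elif how == "outer":
        keys = list(left) + [k for k in right if k not in left]
    else:
        raise ValueError(f"unknown join type: {how}")
    # Cells missing on one side are simply absent from the merged row.
    return {k: {**left.get(k, {}), **right.get(k, {})} for k in keys}


female = pivot_mean(rows, "Female", "female")
male = pivot_mean(rows, "Male", "male")

for how in ("inner", "left", "right", "outer"):
    print(how, sorted(join(female, male, how)))
```

With "female" as the left table, the inner and left joins drop the age group 80 that exists only in the male data, while the right and outer joins keep it. Replacing `sum(values) / len(values)` with `sum(values)` or `len(values)` would give the "sum" and "count" statistics instead of the average.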
Then apply the merge.

Additional actions (⋯)

What's left to do? We've succeeded in generating a pivot table for survival rate on the Titanic, measuring the impact of "Sex", "Passenger Class", and "Age", with the results now contained in a single table. The additional actions menu (⋯) on the top right of the Data view gives some hints.

Export

You can save the final pivot table to a file or to a RapidMiner repository. The available file formats include Excel (.xlsx), CSV (.csv), and Qlik (.qvx).

History

You can examine the history of your data preparation, roll back to an earlier step, and make changes.

Model

It's not relevant in our current example, but if a new version of the data set were generated once weekly, we could generate a weekly summary table by saving our work as a RapidMiner process, then feeding the new data sets to that process.