Data preparation is a key task in any Machine Learning workflow, but it’s often one of the most challenging and time-consuming parts. BigML’s upcoming release brings new data transformation features that make it faster and easier than ever before to get your data ready for Machine Learning.

These features significantly expand the data preparation options that BigML already provides, such as missing value treatment, categorical value encoding, date-time field expansion, and NLP techniques for your text fields.

All the new data transformation features can be classified into two groups:

SQL queries: The capability of writing SQL queries to create new datasets opens up a practically unlimited range of transformations to prepare your data for Machine Learning. Although the ability to freely write SQL statements will be an API-only feature for now, we are bringing some common transformations to the Dashboard for users who prefer to transform their data in a few clicks: aggregate instances, join datasets, and merge datasets. The idea is to add more options to the Dashboard on an ongoing basis; for example, the ability to order instances and remove duplicates. Please send an e-mail to roadmap@bigml.com if you have any particular request.
Feature engineering: A new sliding windows feature and significant improvements to the Flatline Editor enable more ways to easily create fields for your datasets.

Aggregating Instances

The aggregating instances option in BigML allows you to group the rows of a dataset by a given field.

For example, imagine you have customer data stored in a dataset where each purchase is a different row. If you want to use this dataset to train models that analyze customers’ purchase behavior, you need a dataset where each row is a customer instead of a purchase. This is the case for the dataset in the image below, where we can aggregate the instances by the field “customerID” to get one row per unique customer. The remaining fields need aggregation functions in order to be added to the new dataset, such as the total purchases per customer (“Count_customerID”), the total units purchased (“Sum_Quantity”), the first purchase date (“Min_Date”), or the average unit price paid per customer (“Avg_UnitPrice”).

aggregation-example

You can easily do this on the BigML Dashboard by following these steps:

Click the “Aggregate instances” option from the dataset configuration menu:

aggregate-instances

Select “customerID” as the grouping field:

select-grouping-field

Configure the aggregation operations for the fields you want to include in the final dataset. For example, in the image below we are including the count of rows per customer and the total amount of units purchased:

define-aggregation-operations

When you have all the operations defined, click on the “Aggregate instances” button:

aggregate-instances-cta

This will create a new dataset containing a customer per row and the columns that you defined using the aggregation functions described above. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.
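An aggregation like this boils down to a GROUP BY query. As a rough sketch, and not the exact SQL that BigML generates, the snippet below runs such a query with Python’s built-in sqlite3 module. The table name “purchases” and the sample rows are made up for illustration; the output column names mirror the ones used in the example above.

```python
import sqlite3

# Hypothetical purchase rows: (customerID, Quantity, UnitPrice, Date)
purchases = [
    ("C1", 2, 10.0, "2018-01-05"),
    ("C1", 1, 20.0, "2018-02-10"),
    ("C2", 5, 4.0, "2018-01-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases "
    "(customerID TEXT, Quantity INTEGER, UnitPrice REAL, Date TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", purchases)

# One output row per customer, one aggregation function per remaining field
aggregated = conn.execute(
    """
    SELECT customerID,
           COUNT(customerID) AS Count_customerID,
           SUM(Quantity)     AS Sum_Quantity,
           MIN(Date)         AS Min_Date,
           AVG(UnitPrice)    AS Avg_UnitPrice
    FROM purchases
    GROUP BY customerID
    ORDER BY customerID
    """
).fetchall()

for row in aggregated:
    print(row)
```

With these three purchases, customer C1 ends up with a count of 2, 3 units in total, a first purchase on 2018-01-05, and an average unit price of 15.0.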

Joining Datasets

BigML allows you to join several datasets to combine their fields and instances based on one or more related fields between them. This is very useful when your data is scattered in two or more datasets.

For example, imagine we want to predict employee performance and we have two different sources of data: a dataset containing employee data (employee name, salary, age, etc.) and another dataset containing department data (department name, budget, etc.). If we want to include the department data as an additional predictor for our employee analysis, we can use a field common to both datasets (department_id) to add the department characteristics to the employee dataset (see image below).

You can easily do this on the BigML Dashboard by following these steps:

Click the “Join datasets” option from the dataset configuration menu:

join-datasets-option

Then select the type of join: left join if you want to get all the instances from the current (left) dataset and the matched instances from the selected (right) dataset; or inner join if you want to get only the instances that have matching values in both datasets. In this case, we are selecting the left join because we want all the employees regardless of whether they have a matching department. Next, select the dataset you want to make the join with:

Select one or more pairs of joining fields to match the instances of both datasets. In this example, we select the department_id to make the match:

Decide which fields of the selected dataset (the departments dataset in our case) you want to include in the final joined dataset:

Optionally, you can filter the joined dataset by selecting fields from the current or the selected dataset and setting up different filtering conditions. Then go ahead and click the “Join datasets” button.

This will create a dataset that will contain the matched instances and fields from both datasets. From this dataset, you can also see the SQL query under the hood by clicking the option highlighted in the image below.

join-sql
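To make the difference between the two join types concrete, here is a minimal sketch using Python’s built-in sqlite3 module. It is not the query BigML produces; the table layouts and sample rows are assumptions for illustration, with department_id as the joining field as in the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE employees (name TEXT, salary REAL, department_id INTEGER);
    CREATE TABLE departments (department_id INTEGER, department_name TEXT, budget REAL);
    INSERT INTO employees VALUES
        ('Ann', 50000, 1), ('Bob', 42000, 2), ('Eve', 61000, NULL);
    INSERT INTO departments VALUES
        (1, 'Sales', 100000), (2, 'Support', 80000);
    """
)

# Same query for both join types; only the JOIN keyword changes
JOIN_QUERY = """
    SELECT e.name, e.salary, d.department_name, d.budget
    FROM employees AS e
    {join_type} JOIN departments AS d
        ON e.department_id = d.department_id
    ORDER BY e.name
"""

# Left join: every employee, with empty department fields when there is no match
left_rows = conn.execute(JOIN_QUERY.format(join_type="LEFT")).fetchall()

# Inner join: only employees that have a matching department
inner_rows = conn.execute(JOIN_QUERY.format(join_type="INNER")).fetchall()

print(len(left_rows), len(inner_rows))
```

Here the left join keeps Eve, who has no department, with empty department fields, while the inner join drops her.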

Merging Datasets

The merging datasets option in BigML allows you to include the instances of several datasets in one dataset.

This functionality can be very useful when you use multiple sources of data. For example, imagine we have employee data in two different datasets and we want to merge them into one dataset.

You can easily do this on the BigML Dashboard by following these steps:

Click the “Merge datasets” option from the dataset configuration menu:

Select the datasets you want to merge. The datasets should have the same fields so the instances of one dataset can be added after the instances of the other dataset. You can select up to 32 datasets to merge. You can sample each of the datasets selected for the merge by configuring the typical BigML sampling parameters like the percentage rate, replacement, out-of-bag, and seed parameters.

select-merge-dataset

Click on the “Merge datasets” option:

This will create a dataset that will contain the instances from the merged datasets. From this dataset, you can also see the merging information by clicking the option highlighted in the image below.
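Conceptually, a merge appends the rows of one dataset after the rows of another. The plain Python sketch below illustrates that idea under simplifying assumptions: the function name merge_datasets is hypothetical, datasets are lists of dictionaries, and only a sampling rate and seed are modeled (no replacement or out-of-bag options). It is not how BigML implements the operation.

```python
import random

MAX_DATASETS = 32  # limit mentioned for the Dashboard merge


def merge_datasets(datasets, sample_rates=None, seed=None):
    """Append the rows of several datasets that share the same fields.

    datasets: list of datasets, each a list of dict rows.
    sample_rates: optional list of rates in (0, 1], one per dataset.
    seed: optional seed that makes the sampling repeatable.
    """
    if len(datasets) > MAX_DATASETS:
        raise ValueError("at most 32 datasets can be merged")
    if sample_rates is None:
        sample_rates = [1.0] * len(datasets)

    rng = random.Random(seed)
    expected_fields = None
    merged = []
    for rows, rate in zip(datasets, sample_rates):
        # All rows, across all datasets, must expose the same fields
        for row in rows:
            fields = set(row)
            if expected_fields is None:
                expected_fields = fields
            elif fields != expected_fields:
                raise ValueError("datasets must have the same fields")
        # Sample without replacement, keeping the original row order
        keep_count = round(len(rows) * rate)
        kept = sorted(rng.sample(range(len(rows)), keep_count))
        merged.extend(rows[i] for i in kept)
    return merged


employees_a = [{"name": "Ann", "age": 34}, {"name": "Bob", "age": 41}]
employees_b = [{"name": "Eve", "age": 29}]

merged = merge_datasets([employees_a, employees_b])
print(len(merged))  # 3
```

Passing sample_rates=[0.5, 1.0] together with a seed would keep half of the first dataset and all of the second, and the same seed always selects the same rows.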

Feature Engineering

Feature engineering, i.e., the creation of new features that can be better predictors for your models, is one of the most important tasks in Machine Learning because it is usually the biggest source of model improvement. That’s why we also focused our efforts on bringing sliding windows to the BigML Dashboard and improving the Flatline Editor.

Sliding windows

Creating new features using sliding windows is one of the most common feature engineering techniques in Machine Learning. It is usually applied to frame time series data using previous data points as new input fields to predict the next time data points.

For example, imagine we have one year of sales data to predict sales. As domain experts, we know that past sales can be key predictors of today’s sales. Therefore, we can use our objective field “sales” to create additional input fields that contain past data. We can create any number of such fields: last day sales, the average of last week sales, the difference between last month and this month sales, etc. In the image below, we are creating a new predictor that calculates the average sales of the last two days (see the field in green “avgSales_L2D”).

sliding-windows

This can easily be done on the BigML Dashboard by following these steps:

Click the “Add fields” option from the dataset configuration menu:

sliding-window-option

Select the mean from the Sliding windows operations in the selector:

sliding-window-operation

Select the field you want to apply the operation to, a window start of -2, and a window end of -1. The window start and end define the first and last instances to be considered for the calculation: negative values are previous instances, positive values are next instances, and zero is the current instance. Then click on the “Create dataset” button.

sliding-window-start-end

This will create a dataset with a new field that contains the average sales of the last two days and can be used as a new predictor.
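The window start and end are easiest to understand as offsets relative to the current row. The following plain Python sketch computes a sliding window mean with a start of -2 and an end of -1 on a few made-up sales values. The function name sliding_window_mean is hypothetical, and leaving the result empty for rows without a complete window is an assumption of this sketch, not necessarily what BigML does.

```python
def sliding_window_mean(values, start, end):
    """Mean of values[i + start] .. values[i + end] for every position i.

    Offsets are relative to the current instance: negative values are
    previous instances, positive values are next instances, zero is the
    current instance.
    """
    if start > end:
        raise ValueError("start must not be greater than end")
    result = []
    for i in range(len(values)):
        first = i + start
        last = i + end
        if first < 0 or last >= len(values):
            # Assumption: rows without a complete window get no value
            result.append(None)
        else:
            window = values[first:last + 1]
            result.append(sum(window) / len(window))
    return result


sales = [10, 20, 30, 40]  # made-up daily sales
avg_sales_l2d = sliding_window_mean(sales, start=-2, end=-1)
print(avg_sales_l2d)  # [None, None, 15.0, 25.0]
```

The third day gets the mean of the first two days (15.0), and the fourth day the mean of days two and three (25.0). The current day’s sales never enter the window, which is what makes the new field usable as a predictor.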

Flatline Editor Improvements

The Flatline editor allows you to easily create new fields for your dataset by using BigML’s domain-specific language, Flatline. You can access the editor by selecting the “Add fields” option from the dataset configuration menu, then selecting the Flatline formula operation and clicking on the editor icon (see image below).

You can see that the dataset preview now includes a table view where you can easily see a sample of your instances.

When you write a formula and want to view its result, the preview only shows the fields involved in the formula. That way you can easily check whether your formula is being calculated correctly. For example, in the image below you can see only two fields in the preview: the one used in the formula as input (the field “duration”) and the new field that results from the formula (if the duration of the movie is higher than 100 minutes it is classified as “long”, otherwise it is “short”). You can also change this view to show all the dataset fields again using the green switcher on top of the table preview.

formula-preview-flatline
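The logic of that example formula is a simple conditional. In Flatline it would look roughly like (if (> (f "duration") 100) "long" "short"), although the exact formula shown in the image may differ. The Python sketch below reproduces the same rule and builds a preview restricted to the fields involved; the new field name “length” and the sample durations are assumptions for illustration.

```python
def classify_duration(duration, threshold=100):
    """Return "long" for movies over `threshold` minutes, otherwise "short"."""
    return "long" if duration > threshold else "short"


# Made-up instances; real rows would carry many more fields
movies = [{"duration": 95}, {"duration": 100}, {"duration": 142}]

# Preview with only the fields involved: the input field and the new field
preview = [
    {"duration": m["duration"], "length": classify_duration(m["duration"])}
    for m in movies
]

for row in preview:
    print(row)
```

A movie of exactly 100 minutes is labeled “short” here, because the rule requires the duration to be strictly higher than 100.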

Want to know more about Data Transformations?

Stay tuned for the next blog post to learn how to perform data transformations with SQL via the BigML API. If you have any questions or you would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.

Data Transformations with the BigML Dashboard: Get your Machine Learning-Ready Data in a Few Clicks
