Automating Data Transformations with WhizzML and the BigML Python Bindings

This is the fifth post in our series of six about BigML's new release: Data Transformations. This time we are focusing on the Data Preparation step, prior to any Machine Learning project, to create a new and improved dataset from an existing one.

[CRISP-DM diagram]

The Data Preparation phase is key to achieving good performance with your predictive models. It is also where a wide variety of operations take place, because raw data is usually not ready, or lacks the fields we need, to build a Machine Learning model. Aware of this, BigML introduced Flatline in 2014, a DSL designed specifically for data transformations. Over the years, Flatline has grown and increased the number of operations it can perform. Now, in this release, we have improved its sliding window operations and added the ability to use a subset of SQL instructions, which adds a new range of transformations, such as joins, aggregations, or adding rows to an existing dataset.

In this blog post, we will learn step-by-step how to automate these data transformations programmatically using WhizzML, BigML's Domain Specific Language for Machine Learning automation, and the official Python Bindings.

Adding Rows: Merging Datasets

When you want to add data to an existing dataset that is already on the platform, you can use the following code. This is useful, for example, when data are collected in periods or when the same kind of data comes from different sources.

;; creates a dataset merging two existing datasets
(define merged-dataset
  (create-dataset {"origin_datasets" ["dataset/5bca3fb3421aa94735000003"
                                      "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets
api.create_dataset(["dataset/5bca3fb3421aa94735000003",
                    "dataset/5bcbd2b5421aa9560d000000"])

As we saw in previous posts, the BigML API is mostly asynchronous, which means that the execution returns the ID of the new dataset before the dataset itself is finished. That is, after the code snippet executes, the analysis of the fields and their summaries is still running. You can use the `create-and-wait-dataset` instruction to make sure the datasets are finally merged:

;; creates a dataset from two existing datasets and
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets" ["dataset/5bca3fb3421aa94735000003"
                                               "dataset/5bcbd2b5421aa9560d000000"]}))

The equivalent code in Python is:

# merge all the rows of two datasets and store the ID of the
# new dataset in merged_dataset variable
merged_dataset = api.create_dataset(["dataset/5bca3fb3421aa94735000003",
                                     "dataset/5bcbd2b5421aa9560d000000"])
api.ok(merged_dataset)

When merging datasets, there are several parameters you can set, which you can find in the Multidatasets section of the API documentation. For instance, we can now create a merged dataset with WhizzML that sets the sample rate for each origin dataset, following the same pattern we used in the first example.

;; creates a dataset from two existing datasets
;; setting the percentage of sample in each one
;; once it's completed its ID is saved in merged-dataset variable
(define merged-dataset
  (create-and-wait-dataset {"origin_datasets" ["dataset/5bca3fb3421aa94735000003"
                                               "dataset/5bcbd2b5421aa9560d000000"]
                            "sample_rates" {"dataset/5bca3fb3421aa94735000003" 0.6
                                            "dataset/5bcbd2b5421aa9560d000000" 0.8}}))

The equivalent code in Python is:

# creates a merged dataset specifying the rate for each
# of the origin datasets
merged_dataset = api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"],
    {"sample_rates": {"dataset/5bca3fb3421aa94735000003": 0.6,
                      "dataset/5bcbd2b5421aa9560d000000": 0.8}})
api.ok(merged_dataset)

Denormalizing Data: Join Datasets

Data is commonly stored in relational databases, following the normal forms paradigm to avoid redundancies. Nevertheless, for Machine Learning workflows, data needs to be denormalized.
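As a minimal local illustration of what denormalization means (using Python's built-in sqlite3 rather than the BigML API; the table and field names here are our own invention), joining a fact table against its lookup table flattens the two normalized tables into one:

```python
import sqlite3

# Two normalized tables: employees reference departments by ID.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE departments (dept_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE employees (emp_id INTEGER, dept_id INTEGER)")
conn.executemany("INSERT INTO departments VALUES (?, ?)",
                 [(1, "Sales"), (2, "Engineering")])
conn.executemany("INSERT INTO employees VALUES (?, ?)",
                 [(10, 1), (11, 2), (12, 2)])

# Denormalize: one row per employee with the department name inlined.
rows = conn.execute(
    "SELECT e.emp_id, d.name FROM employees e "
    "LEFT JOIN departments d ON e.dept_id = d.dept_id "
    "ORDER BY e.emp_id"
).fetchall()
print(rows)  # [(10, 'Sales'), (11, 'Engineering'), (12, 'Engineering')]
```

The resulting flat table is the shape a Machine Learning model expects: one instance per row, with all relevant fields present.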

BigML now allows you to perform this process in the cloud as part of your workflow, codified in WhizzML or with the Python Bindings. For this transformation, we can use Structured Query Language (SQL) expressions. See below how it works, assuming we have two different datasets in BigML that we want to put together, both sharing an `employee_id` field whose field ID is `000000`:

;; creates a joined dataset composed of two datasets
(define joined_dataset
  (create-dataset {"origin_datasets" ["dataset/5bca3fb3421aa94735000003"
                                      "dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names" {"dataset/5bca3fb3421aa94735000003" "A"
                                           "dataset/5bcbd2b5421aa9560d000000" "B"}
                   "sql_query" "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"}))

The equivalent code in Python is:

# creates a joined dataset composed of two datasets
api.create_dataset(
    ["dataset/5bca3fb3421aa94735000003",
     "dataset/5bcbd2b5421aa9560d000000"],
    {"origin_dataset_names": {"dataset/5bca3fb3421aa94735000003": "A",
                              "dataset/5bcbd2b5421aa9560d000000": "B"},
     "sql_query": "SELECT A.* FROM A LEFT JOIN B ON A.`000000` = B.`000000`"})

Aggregating Instances

The use of SQL opens the possibility of performing a large number of operations on your data, such as selections, value transformations, and row grouping, among others. For instance, in some situations we need to collect statistics from the data, creating groups around the values of a specific field. This transformation is commonly known as aggregation, and the SQL keyword for it is `GROUP BY`. See below how to use it in WhizzML, assuming we are managing a dataset with some company data where field `000001` is the department and field `000005` is the employee ID.

;; creates a new dataset aggregating the instances
;; of the original one by the field 000001
(define aggregated_dataset
  (create-dataset {"origin_datasets" ["dataset/5bcbd2b5421aa9560d000000"]
                   "origin_dataset_names" {"dataset/5bcbd2b5421aa9560d000000" "DS"}
                   "sql_query" "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"}))

The equivalent code in Python is:

# creates a new dataset aggregating the instances
# of the original one by the field 000001
api.create_dataset(
    ["dataset/5bcbd2b5421aa9560d000000"],
    {"origin_dataset_names": {"dataset/5bcbd2b5421aa9560d000000": "DS"},
     "sql_query": "SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`"})
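To see what such an aggregation query computes, here is a local sketch with Python's built-in sqlite3, where readable column names stand in for the BigML field IDs (`department` for `000001`, `employee_id` for `000005`); the sample data is our own:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds (department TEXT, employee_id INTEGER)")
conn.executemany("INSERT INTO ds VALUES (?, ?)",
                 [("Sales", 10), ("Sales", 11), ("Engineering", 12)])

# One output row per department with its employee count, mirroring:
#   SELECT `000001`, count(`000005`) FROM DS GROUP BY `000001`
rows = conn.execute(
    "SELECT department, count(employee_id) FROM ds "
    "GROUP BY department ORDER BY department"
).fetchall()
print(rows)  # [('Engineering', 1), ('Sales', 2)]
```

The aggregated dataset has one instance per group, which is why it is usually much smaller than the original one.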

It is possible to use field names in the queries, but field IDs are preferred to avoid ambiguities. It is also possible to define aliases for the new fields using the keyword `AS` after the operation, following the SQL syntax. Note that with SQL you can also perform operations far more complex than the ones we have demonstrated in this post.
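As an illustration of aliasing with `AS`, the aggregation request could name its output columns; the query string and aliases below are our own example, not from the original post:

```python
# Hypothetical create_dataset arguments using AS to alias the new fields;
# the origin dataset ID is the one used throughout this post.
origin = "dataset/5bcbd2b5421aa9560d000000"
args = {
    "origin_dataset_names": {origin: "DS"},
    "sql_query": ("SELECT `000001` AS department, "
                  "count(`000005`) AS headcount "
                  "FROM DS GROUP BY `000001`"),
}
# api.create_dataset([origin], args)  # run with valid BigML credentials
print(args["sql_query"])
```

With aliases, the fields of the resulting dataset get the names `department` and `headcount` instead of auto-generated ones.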

Want to know more about Data Transformations?

If you have any questions, or would like to learn more about how Data Transformations work, please visit the release page. It includes a series of blog posts, the BigML Dashboard and API documentation, the webinar slideshow, as well as the full webinar recording.
