Predicting Air Pollution in Madrid

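Air pollution is a huge problem in big cities, where health issues and traffic restrictions are continuously increasing. The concentration of Nitrogen Dioxide (NO2) is commonly used to determine the level of pollution. In Madrid, Spain, several stations located in different parts of the city constantly collect NO2 levels. My colleague Jaime Boscá and I applied BigML to see whether we could accurately predict air pollution in Madrid.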

Air Pollution Map Spain
NO2 view from the European satellite Sentinel-5P (photo: ESA)

A set of alerts based on NO2 levels has been defined (shown in the table below) to monitor and avoid high pollution levels.

alert_states
Madrid government air pollution alerts

These alerts trigger measures that enforce traffic restrictions for Madrid citizens. The main problem is that these NO2 levels are usually reached at the end of the day, while the traffic restrictions take effect the next day. The affected population therefore has only a few hours to rearrange their means of transport for the following day, which has drawn considerable criticism of the local government. Predicting such alerts would help warn the population in advance so they have more time to reschedule their transportation plans.

paris_restrictions
Traffic restrictions due to air pollution are common in big European cities such as Madrid or Paris (photo: AFP)

Is it possible to predict which days will have pollution alerts?

Our goal is to predict a pollution alert (YES/NO) in advance by 1, 4, and 7 days. A pollution alert means that one of the previous alert levels has been reached.

Data collection

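To tackle the air pollution problem in Madrid, we used three main sources of data about the city:

  • Air quality data: collected over several years and available for multiple air measurement stations that record NO2 levels hourly.
  • Weather data: information available daily about temperature, rain, and wind.
  • Historical traffic data: detailed traffic load information is available online for the main streets and highways around Madrid.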

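The data used was collected from 2013 to 2017. To simplify the problem, we restricted our analysis to zone 1 (shown below), since it covers most of the Madrid urban area and is the zone with the largest number of air stations.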

stations_map
Air and weather stations in Madrid zone 1

Data transformation

Weather information and pollution alert statuses are available daily. That’s why the data has been represented with daily granularity: each sample (or instance) provides information for a given day. Aggregated weather and air quality information is therefore included as additional features per day.
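A minimal sketch of this kind of daily aggregation, assuming the hourly readings arrive as (timestamp, NO2) pairs. The function name, the output keys, and the sample values are illustrative assumptions, not the code or data used in the project:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_daily(hourly_readings):
    """Collapse (timestamp, no2) hourly readings into one record per day."""
    by_day = defaultdict(list)
    for timestamp, no2 in hourly_readings:
        by_day[timestamp.date()].append(no2)
    # One row per calendar day, in chronological order.
    return {
        day: {"no2_avg": mean(values), "no2_max": max(values)}
        for day, values in sorted(by_day.items())
    }


# Made-up readings in µg/m3, for illustration only.
readings = [
    (datetime(2017, 1, 1, 8), 40.0),
    (datetime(2017, 1, 1, 20), 60.0),
    (datetime(2017, 1, 2, 8), 30.0),
]
daily = aggregate_daily(readings)
```

Each entry of `daily` then corresponds to one instance of the dataset, to which the weather and traffic columns for the same day could be joined.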

We also considered traffic conditions and traffic predictions in our model. To include traffic predictions, we used another model that predicts Madrid traffic, implemented in BigML using features such as weekdays and holidays. The evaluation results have been promising, allowing us to use BigML traffic batch prediction results as features in our model for predicting air pollution. In the same way, temperature predictions were also modeled and used as features.

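Predicting air pollution is a challenge. How many days in advance can we get acceptable predictions? We tried three different predictions: 1, 4, and 7 days in advance. Each prediction uses a different time window of the same features.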
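One common way to set up such horizons, shown here only as a sketch, is to pair each day with the alert flag observed a fixed number of days later. The helper name `make_target` and the sample flags are hypothetical, and the original project may have built its time windows differently:

```python
def make_target(alert_by_day, horizon):
    """Pair each day's index with the alert flag observed `horizon` days later.

    alert_by_day is a chronologically ordered list of booleans, one per day.
    The last `horizon` days have no known future label and are dropped.
    """
    return [
        (day_index, alert_by_day[day_index + horizon])
        for day_index in range(len(alert_by_day) - horizon)
    ]


# Made-up daily alert flags, for illustration only.
alerts = [False, False, True, False, False, False, False, True, False]
targets = {horizon: make_target(alerts, horizon) for horizon in (1, 4, 7)}
```

The longer the horizon, the fewer labeled days remain at the end of the series, which is one practical reason the 7-day problem has less data to learn from.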

Feature engineering

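Most datasets can be enriched with extra features derived from the existing data. In our case, we can use time-based information, such as values from previous days or the number of days since an event occurred. We used the following features:

  • NO2 average and maximum values.
  • Maximum, minimum and average temperatures.
  • Rain, wind and traffic information.
  • Traffic predictions.
  • Days since the last alert.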
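Two of these time-based features can be sketched as follows. This is an illustrative assumption about how such features might be derived from daily series, with hypothetical function names and made-up values, not the project's actual feature code:

```python
def days_since_last_alert(alerts):
    """For each day, count the days elapsed since the most recent alert.

    Days before the first alert get None because no alert has been seen yet.
    """
    result = []
    last_alert_index = None
    for index, is_alert in enumerate(alerts):
        if is_alert:
            last_alert_index = index
        result.append(None if last_alert_index is None else index - last_alert_index)
    return result


def rolling_max(values, window):
    """Maximum over the current day and the (window - 1) days before it."""
    return [
        max(values[max(0, i - window + 1): i + 1])
        for i in range(len(values))
    ]


# Made-up daily values, for illustration only.
alerts = [False, True, False, False, True, False]
no2_max = [80, 150, 120, 90, 160, 70]

since_alert = days_since_last_alert(alerts)
no2_max_3d = rolling_max(no2_max, window=3)
```

Both helpers only look at the current and earlier days, so the resulting columns do not leak information from the future into the features.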

Dataset

The dataset we used, including all the features, is available in the BigML Gallery.

Data exploration

The colored table mentioned previously shows the 3 air pollution alert levels defined in Madrid: “prior notice”, “notice” and “alert”. Within the five years of available data, only “prior notice” and “notice” alerts occurred; the red “alert” level was never reached. The distribution of pollution alerts is also highly imbalanced, since alerts are fortunately rare: fewer than 100 “notice” and “prior notice” states were observed in total.

That’s why we decided to group the alerts and create a boolean objective field that predicts whether a pollution alert will be raised.

From our analysis, we can see that NO2 levels are directly related to air pollution (shown in the visualization below). We can also see that significant rain and wind have an impact on NO2 levels.

data_exploration
NO2, traffic load, wind and rainfall visualization

In the graph above, the maximum total daily precipitation is shown in blue and the maximum wind gust speed in orange. Traffic load is shown in green, while the average NO2 level is shown in grey. In general, high wind speeds and abundant precipitation seem to correlate with lower NO2 levels, and low traffic loads also seem to correlate with lower NO2 levels.

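The BigML scatterplots below support this correlation. The following chart shows the correlation between whether there was more than 15 mm of rainfall in the previous 3 days and the average NO2 level. We can observe that all the cases with more than 15 mm of rainfall correspond to NO2 levels below 55 µg/m3.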

Rain_Impact
Rainfall over 15 mm in 3 days vs. average NO2 level

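The next chart shows the correlation between the average maximum wind speed over the previous 3 days and NO2 levels. When the average maximum wind speed exceeds 20 km/h, NO2 stays below 50 µg/m3.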

wind_impact
3 days wind maximum speed correlation to average NO2 level

Modeling

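Predictive modeling involves evaluating models and comparing results in order to select the appropriate algorithm and its specific parameters. Initially, we tried the different algorithms available in BigML that are suitable for classification (models, logistic regressions, ensembles and deepnets). Ensembles gave the best results (see the full model comparison in the Evaluation section below). Using the WhizzML script SMACdown, we could automatically test the possible parameter settings for the ensembles.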

Modeling_Strategy
Modeling strategy

Evaluation

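Initially, the dataset was split chronologically: data from 2013 to 2016 was used for training, and the 2017 data was used as the test set for evaluation. The evaluation criterion was based on the area under the ROC curve (AUC), which graphically represents the trade-off between recall and specificity in classification problems. Since our dataset is highly imbalanced (very few days with alerts compared to days without alerts), we needed to balance our model by applying a probability threshold. The optimal threshold was set to try to minimize False Negatives (days predicted as having no alert that actually have one) without penalizing False Positives too much (days predicted as having alerts, but they don’t actually have an alert). We have compared all the available models using the BigML comparison tool to ensure we chose the best performing model.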
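The evaluation steps described above (chronological split, AUC, and a probability threshold) can be illustrated with a small standard-library sketch. BigML computes these metrics itself; the function names, labels, and scores below are made-up assumptions used only to show the mechanics:

```python
def chronological_split(rows, first_test_year):
    """Split rows (dicts with a 'year' key) so the test set is strictly later."""
    train = [row for row in rows if row["year"] < first_test_year]
    test = [row for row in rows if row["year"] >= first_test_year]
    return train, test


def roc_auc(labels, scores):
    """Probability that a random alert day outscores a random non-alert day."""
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    # Ties between a positive and a negative score count as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def confusion_at(labels, scores, threshold):
    """Count (tp, fp, fn, tn) when an alert is predicted for score >= threshold."""
    tp = fp = fn = tn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y:
            tp += 1
        elif predicted:
            fp += 1
        elif y:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Made-up rows, labels (1 = alert) and model probabilities, for illustration only.
rows = [{"year": year} for year in (2013, 2014, 2015, 2016, 2017)]
train, test = chronological_split(rows, first_test_year=2017)

labels = [0, 0, 1, 0, 1, 0]
scores = [0.10, 0.30, 0.35, 0.40, 0.80, 0.05]
auc = roc_auc(labels, scores)
low = confusion_at(labels, scores, threshold=0.27)
high = confusion_at(labels, scores, threshold=0.50)
```

In this toy data, lowering the threshold from 0.50 to 0.27 removes the single false negative at the cost of two false positives, which is the kind of trade-off the threshold selection has to weigh.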

Below, we can find the field importance chart for the ensemble used in the evaluation. The most important field is the number of stations with NO2 measurements over 150 µg/m3 the day before, followed by the NO2 average range the day before, and the NO2 maximum range over the 5 previous days. Traffic prediction, rainfall, and wind representative fields also appear in the top 10.

field_importance
BigML field importance chart: 1 day prediction ensemble

We can see in the figure below the different evaluations for predictions 1 day in advance. The boosted ensemble with 300 iterations (represented in orange below) obtained the highest ROC AUC (0.8781).

evaluations_comparison
BigML evaluations comparison tool: 1 day prediction

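Once we have selected the best model by looking at the AUC metric, we need to look at the recall and precision of a given model to select the optimal threshold to start making predictions. Recall is the number of true positives divided by the number of positive instances, while precision is the number of true positives divided by the number of positive predictions. The figure below shows the BigML evaluation for the 1 day prediction with the probability threshold set to 27%. We can see how the model predicted 14 out of 19 actual alerts, resulting in a 73.68% recall. It also predicted alerts on 16 other days that had no alert, which means a precision of 46.67%.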
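The recall and precision figures quoted above can be checked from the counts in the text. In this short sketch, the 5 false negatives are derived as 19 actual alerts minus 14 detected ones, and the function names are illustrative:

```python
def recall(true_positives, false_negatives):
    """Share of actual alert days that the model flagged."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives, false_positives):
    """Share of predicted alert days that really were alert days."""
    return true_positives / (true_positives + false_positives)


# Counts from the 1 day prediction at the 27% probability threshold.
true_positives = 14
false_negatives = 19 - 14  # actual alerts the model missed
false_positives = 16       # days flagged that had no alert

recall_pct = round(100 * recall(true_positives, false_negatives), 2)
precision_pct = round(100 * precision(true_positives, false_positives), 2)
```

Here `recall_pct` is 14/19 and `precision_pct` is 14/30, expressed as percentages.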

BigML ensemble evaluation: 1 day prediction

The chart below shows the recall and precision for the three predictions performed: 1, 4, and 7 days in advance.

results_graph
Precision and recall results

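As expected, the more days in advance we try to predict, the lower the performance. However, even predicting pollution alerts one day in advance already benefits citizens in their daily lives, since they are currently warned only a few hours ahead.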

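Taking this use case one step further, predicting pollution levels in advance could even enable us to reduce high pollution levels, one city at a time. Machine learning insights are not meant to be simply additional information about our world; they are meant to be put to full use to improve people’s lives, in our businesses, in society, and beyond.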
