Common Pitfalls In Machine Learning Projects
Published: 2019-06-29

In a recent presentation, Ben Hamner described the common pitfalls in machine learning projects that he and his colleagues have observed during competitions on Kaggle.

The talk was titled “Machine Learning Gremlins”.

In this post we take a look at the pitfalls from Ben’s talk, what they look like and how to avoid them.

Machine Learning Process

Early in the talk, Ben presented a snap-shot of the process for working a machine learning problem end-to-end.

Figure: Machine Learning Process (taken from “Machine Learning Gremlins” by Ben Hamner)

This snapshot included 9 steps, as follows:

  1. Start with a business problem
  2. Source data
  3. Split data
  4. Select an evaluation metric
  5. Perform feature extraction
  6. Model Training
  7. Feature Selection
  8. Model Selection
  9. Production System

He commented that the process is iterative rather than linear.

He also commented that each step in this process can go wrong, derailing the whole project.

Discriminating Dogs and Cats

Ben presented a case study problem for building an automatic cat door that can let the cat in and keep the dog out. This was an instructive example as it touched on a number of key problems in working a data problem.

Figure: Discriminating Dogs and Cats (taken from “Machine Learning Gremlins” by Ben Hamner)

Sample Size

The first great takeaway from this example was that he studied the accuracy of the model against training sample size and showed that more samples correlated with greater accuracy.

He then added more data until accuracy leveled off. This was a great example of how easy it can be to get an idea of the sensitivity of your system to sample size and adjust accordingly.
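The idea of tracking accuracy against sample size can be sketched in a few lines. This is a minimal illustration, not the setup used in the talk: the synthetic one-dimensional data, the nearest-centroid classifier, and all function names (`make_samples`, `fit_centroids`, `learning_curve`) are assumptions chosen to keep the example self-contained.

```python
import random


def make_samples(n, rng):
    """Synthetic 1-D data (an assumption): class 0 ~ N(0, 1), class 1 ~ N(2, 1).

    Labels alternate, so any prefix of length >= 2 contains both classes.
    """
    samples = []
    for i in range(n):
        label = i % 2
        samples.append((rng.gauss(2.0 * label, 1.0), label))
    return samples


def fit_centroids(train):
    """Nearest-centroid model: store the mean feature value of each class."""
    centroids = {}
    for label in (0, 1):
        values = [x for x, y in train if y == label]
        centroids[label] = sum(values) / len(values)
    return centroids


def predict(centroids, x):
    """Predict the class whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def accuracy(centroids, data):
    correct = sum(1 for x, y in data if predict(centroids, x) == y)
    return correct / len(data)


def learning_curve(sizes, seed=0):
    """Train on growing prefixes of one pool, score on a fixed held-out set."""
    rng = random.Random(seed)
    test = make_samples(2000, rng)       # held out, never used for training
    pool = make_samples(max(sizes), rng)
    curve = []
    for n in sizes:
        centroids = fit_centroids(pool[:n])
        curve.append((n, accuracy(centroids, test)))
    return curve


if __name__ == "__main__":
    for n, acc in learning_curve([4, 16, 64, 256, 1024]):
        print(f"{n:5d} training samples -> accuracy {acc:.3f}")
```

Printing the curve for a handful of sizes is usually enough to see where accuracy starts to flatten, which suggests when collecting more data stops paying off.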

Wrong Problem

The second great takeaway from this example was that the system failed: it let in every cat in the neighborhood, not just the owner's.

It was a clever example highlighting the importance of understanding the constraints of the problem that needs to be solved, rather than the problem that you want to solve.

Pitfalls In Machine Learning Projects

Ben went on to discuss four common pitfalls when working on machine learning problems.

Although these problems are common, he points out that they can be identified and addressed relatively easily.

Figure: Overfitting (taken from “Machine Learning Gremlins” by Ben Hamner)

  • Data Leakage: The problem of making use of data in the model to which a production system would not have access. This is particularly common in time series problems. It can also happen with data like system IDs that may indicate a class label. Run a model and take a careful look at the attributes that contribute to its success. Sanity check them and consider whether they make sense. (See the paper referenced in the talk.)
  • Overfitting: Modeling the training data so closely that the model also fits the noise in it. The result is a poor ability to generalize. This becomes more of a problem in higher dimensions with more complex class boundaries.
  • Data Sampling and Splitting: Related to data leakage, you need to be very careful that the train/test/validation sets are indeed independent samples. Much thought and work is required for time series problems to ensure that you can replay data to the system chronologically and validate model accuracy.
  • Data Quality: Check the consistency of your data. Ben gave an example of flight data where some aircraft were landing before taking off. Inconsistent, duplicate, and corrupt data need to be identified and explicitly handled. Such data can directly hurt the modeling problem and the ability of a model to generalize.
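Two of the checks above are easy to sketch in code: a consistency check in the spirit of the flight example, and a chronological split for time-ordered data. This is only an illustration under stated assumptions; the record layout and the names `took_off`, `landed`, `find_inconsistent_flights`, and `chronological_split` are hypothetical and do not come from the talk.

```python
from datetime import datetime


def find_inconsistent_flights(flights):
    """Return flights whose landing time is not after their takeoff time."""
    return [f for f in flights if f["landed"] <= f["took_off"]]


def chronological_split(records, time_key, train_fraction=0.8):
    """Sort by time and cut once, so no test record precedes a training record."""
    ordered = sorted(records, key=lambda r: r[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical records; the field names are assumptions for illustration.
flights = [
    {"id": "A1", "took_off": datetime(2014, 1, 3, 9), "landed": datetime(2014, 1, 3, 11)},
    {"id": "B2", "took_off": datetime(2014, 1, 1, 14), "landed": datetime(2014, 1, 1, 12)},
    {"id": "C3", "took_off": datetime(2014, 1, 2, 8), "landed": datetime(2014, 1, 2, 10)},
    {"id": "D4", "took_off": datetime(2014, 1, 5, 7), "landed": datetime(2014, 1, 5, 9)},
    {"id": "E5", "took_off": datetime(2014, 1, 4, 6), "landed": datetime(2014, 1, 4, 8)},
]

# Handle bad records explicitly before modeling rather than silently keeping them.
bad = find_inconsistent_flights(flights)
bad_ids = {f["id"] for f in bad}
clean = [f for f in flights if f["id"] not in bad_ids]

# Split by time instead of at random, so the model is evaluated only on the "future".
train, test = chronological_split(clean, "took_off", train_fraction=0.75)

if __name__ == "__main__":
    print("inconsistent:", [f["id"] for f in bad])
    print("train:", [f["id"] for f in train])
    print("test:", [f["id"] for f in test])
```

A random shuffle before splitting would mix later records into the training set, which is one of the ways leakage creeps into time series problems.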

Summary

Ben’s talk “Machine Learning Gremlins” is quick and practical.

You will get a useful crash course in the common pitfalls we are all susceptible to when working on a data problem.

Reposted from: http://iurum.baihongyu.com/
