spam filter

[2019-1-14, 2018-11-5]

Table: user_actions

ds (STRING) | user_id (BIGINT) |post_id (BIGINT) |action (STRING) | extra (STRING)

'2018-07-01'| 1209283021 | 329482048384792 | 'view' |

'2018-07-01'| 1209283021 | 329482048384792 | 'like' |

'2018-07-01'| 1938409273 | 349573908750923 | 'reaction' | 'LOVE'

'2018-07-01'| 1209283021 | 329482048384792 | 'comment' | 'Such nice Raybans'

'2018-07-01'| 1238472931 | 329482048384792 | 'report' | 'SPAM'

'2018-07-01'| 1298349287 | 328472938472087 | 'report' | 'NUDITY'

'2018-07-01'| 1238712388 | 329482048384792 | 'reshare' | 'I wanted to share with you all'

Table: reviewer_removals

ds (STRING) | reviewer_id (BIGINT) | post_id (BIGINT) |

'2018-07-01'| 3894729384729078 | 329482048384792 |

'2018-07-01'| 8477594743909585 | 388573002873499 |

Q1: How many posts were reported yesterday for each report Reason?

SELECT extra, count(post_id)
FROM user_actions
WHERE ds = CAST(SUBDATE(curdate(), 1) AS STRING) and action = 'report'
GROUP BY extra

Q2: What percent of daily content that users view on Facebook is actually Spam?

SELECT ds, avg(case when (action = 'report' and extra = 'SPAM') or r.post_id IS NOT NULL then 1 else 0 end)
FROM user_actions AS a
    LEFT JOIN reviewer_removals as r on a.post_id = r.post_id
GROUP BY ds
ORDER BY ds

Q3: How to find the user who abuses this spam system?

SELECT user_id, count(a.post_id) AS num_report, count(r.post_id) AS num_real_spam
FROM user_action AS a LEFT JOIN reviewer_removals AS r on a.post_id = r.post_id
WHERE action = 'report' and extra = 'SPAM'
GROUP BY user_id

Q3: Facebook has decided to be proactive about SPAM, instead of merely reactive. We decide to address the SPAM problem through a Machine Learning solution predicting whether a given post is Indeed SPAM. We want to use the predictions in order to downrank/deprioritize suspected SPAM from news feed. Q3的问题是如何来评估这个machine leaning有没有用.

PRODUCT:

Q1. Facebook用machine learning 建了一个model来rank content以达到filter spam的目的,需要关注什么metrics来评价这个model

Q2. 在用ab testing的时候发现用了新的spam model之后revenue下降了。面试官确定了首先这个model不会touch到ads,就是说ads不会被filter out。并且DAU/WAU/MAU和time spend没有变化,也就是说user方面没有变化。那么可能的原因是什么。

然后我问面试官revenue主要来自什么,面试官说是click ads。我说那么ads click的revenue主要可以break down成#user x CTR/CTP x price/click. 这个情况下只可能有变化的是CTR,也就是说因为用来新的model以后,这个平台的整体content质量更高了,那么user就更喜欢花更多时间去explore这些content,那么点击广告的时间就相对来说哦变少了,revenue也下降了。面试官说是这样的,采用新的model之后用户可能会花更多时间去看video之类的,那么用在ads上的时间就变少了。

Last updated