CS246: Mining Massive Data Sets

sivanarayanagaddam
02:01
Hello Everybody. Nice to meet you
Jerry Huang
02:22
Hi : )
Wai Tong Chung
10:03
do we get a list of music played per week?
Jerry Huang
11:25
I think I can share the playlist.
Sivanarayana(Siva) Gaddam
13:03
BTW, this is Earth Month
Sivanarayana(Siva) Gaddam
13:43
I'd be curious to know about big data workloads that are not earth-friendly
Sheikh Abdur Raheem Ali
14:59
Good afternoon guys
Sheikh Abdur Raheem Ali
17:58
This is what makes "suggestions" for people to cc when you're adding recipients to an email
Sheikh Abdur Raheem Ali
19:33
Wow, congratulations!
Sivanarayana(Siva) Gaddam
19:37
Awesome. Well deserved
Sheikh Abdur Raheem Ali
20:46
https://awards.acm.org/about/2020-turing Link if anyone wants to read more
Sheikh Abdur Raheem Ali
22:08
"People who buy masks and hand sanitizer are also more likely to buy toilet paper"
Sheikh Abdur Raheem Ali
26:57
So these sets could be singletons, pairs, triplets, and so on with decreasing support as you keep expanding the set
Jiaxuan You
27:22
Yes
Sheikh Abdur Raheem Ali
28:50
what if all baskets contain a certain j?
Jerry Huang
29:28
We will discuss this in later slides!
Chen, Chuanqi
30:24
Is confidence(I -> J) like a conditional probability?
Sheikh Abdur Raheem Ali
30:30
why take the absolute value?
Jiaxuan You
30:45
@Chuanqi. Yes
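To make the analogy concrete: conf(I -> J) = support(I ∪ J) / support(I), which estimates Pr[J | I]. A minimal Python sketch, assuming each basket is a set of item IDs (the names here are illustrative, not from the lecture code):

    def support(itemset, baskets):
        """Number of baskets that contain every item in `itemset`."""
        return sum(1 for b in baskets if itemset <= b)

    def confidence(I, J, baskets):
        """conf(I -> J) = support(I union J) / support(I), estimating Pr[J | I]."""
        return support(I | J, baskets) / support(I, baskets)

    baskets = [{"milk", "bread"}, {"milk", "beer", "diapers"}, {"milk", "bread", "beer"}]
    print(confidence({"milk"}, {"beer"}, baskets))  # 2/3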
Michael Sun
30:45
Is there a way to store this more efficiently than O(2^|items|)?
Euirim Choi
30:53
Kind of like basket-oriented TFIDF
Miles Zoltak
31:05
what does Pr[j] represent?
Tianyu Du
31:16
prior of item J
Miles Zoltak
31:23
<3
Andy Jin
31:24
@Miles - proportion of baskets that contain j
Euirim Choi
32:15
Why absolute value? Are low values interesting?
wilson nguyen
33:42
is support of an association rule = min(support(left), support(right))?
Sheikh Abdur Raheem Ali
34:15
So if we have a frequent itemset, remove one or more elements and put it on the other side to create the association rules
Jerry Huang
35:37
@wilson, yes
Jerry Huang
35:44
no
Jerry Huang
36:09
it is the support of the set (left union right)
Jiaxuan You
36:24
@sheikh @euirim: That’s a great question. My intuition is that we measure the “surprise” as interest. So whether the difference is + or -, it doesn’t matter
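A small sketch of the interest measure, under the same assumptions as the confidence sketch above: interest(I -> J) = conf(I -> J) - Pr[J], and the absolute value flags surprise in either direction:

    def interest(I, J, baskets):
        """interest(I -> J) = conf(I -> J) - Pr[J]; a large |interest| means
        knowing I changes the likelihood of J a lot, in either direction."""
        pr_j = support(J, baskets) / len(baskets)
        return confidence(I, J, baskets) - pr_j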
Akram Sbaih
37:30
The data used for this recommendation depends on historical data, which depends on the previous recommendation policy… doesn’t this prevent us from discovering new, better policies? How do we marginalize over the previous policy?
Miles Zoltak
38:32
do we have to sum the supports of the supersets? so like for A we would have to sum support of AB, AC, and ABC?
Akram Sbaih
38:36
Also, subtracting probabilities seems awkward… what’s the intuition for it? (For the interestingness measure)
Sheikh Abdur Raheem Ali
40:55
So the data could be: 158, -1 (boundary, next basket), 527
Jerry Huang
41:36
@Miles support(A) is the number of times A appears across all baskets
Jiaxuan You
41:37
@Akram That’s a great research question. It’s the general question of how to generalize a trained model to unseen data. In this lecture we will focus on the algorithm for finding frequent itemsets.
Jiaxuan You
43:31
@Akram I agree. From my point of view, this metric is heuristic-driven so that it can be computed efficiently. You will see later that an absolute-value threshold can greatly speed up our algorithm.
Akram Sbaih
45:42
Can we take a small random subset to find interesting pairs and then count only them on the entire dataset?
Jerry Huang
46:50
I think this is a similar idea to one of the algorithms which will be introduced at the end of today’s lecture.
Jiaxuan You
47:06
Yes. Starting from slide 47
Akram Sbaih
48:50
Are there Python libraries for a triangular matrix that don’t waste memory on the other triangle?
Caleb Chiam
48:58
Does approach 2 use a 3d matrix? (Slightly confused about the data structure used)
Andy Jin
49:22
I think approach 2 uses a hash table with key (i, j) and value c?
Peter Robert Boennighausen
49:25
Approach 2 only uses a dictionary (i.e., a hash map)
Caleb Chiam
49:32
Ah ty
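For reference, a minimal sketch of approach 2 as described above: count only pairs that actually occur, keyed by the tuple (i, j) with i < j (reusing the illustrative `baskets` from the earlier sketch):

    from collections import Counter
    from itertools import combinations

    pair_counts = Counter()
    for b in baskets:
        # sorted() fixes an order so each pair is counted under one key (i, j), i < j
        pair_counts.update(combinations(sorted(b), 2))

Memory here is proportional to the number of distinct pairs that appear, not to n^2.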
Peter Robert Boennighausen
50:11
Is using a hash table like approach 2 basically equivalent to using a sparse matrix data structure from scientific computing libraries?
Peter Robert Boennighausen
50:31
Since sparsity seems to be the crossover point between using approach 1 and 2
Michael Sun
50:53
So k passes if max itemset size k?
Sheikh Abdur Raheem Ali
51:23
so we stop at the maximal set
Michael Sun
51:47
Mhm, I think so. Then you can parallelize each sweep
Jiaxuan You
52:00
@Akram. Good question. I think sparse matrix libraries could help
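If no library is at hand, a sketch of approach 1's triangular layout: store counts for pairs i < j in a flat array of length n(n-1)/2 (0-based item IDs assumed; the index formula matches the slides up to 0- vs 1-based indexing):

    n = 1000                            # number of distinct items (illustrative)
    counts = [0] * (n * (n - 1) // 2)   # one slot per pair i < j

    def pair_index(i, j):
        """Map a pair (i, j) with i < j to its slot in the flat array."""
        assert 0 <= i < j < n
        return i * (2 * n - i - 1) // 2 + (j - i - 1)

    counts[pair_index(3, 17)] += 1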
Jerry Huang
53:10
@Caleb it looks like a 2D matrix with the enumerated items as the row and column indices
Euirim Choi
53:33
So under this system infrequently purchased items don’t have recommendations?
Jiaxuan You
54:38
@Peter Great question. I think you can think of it that way. There may be some special design choices in sparse matrices compared to dictionaries
Jiaxuan You
55:20
@Michael Yes. k passes if max size is k
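Since the pass structure keeps coming up, here is a compact in-memory sketch of A-Priori (one pass over the baskets per itemset size k; an illustration, not the disk-based version from the slides):

    from collections import Counter
    from itertools import combinations

    def apriori(baskets, s):
        """All itemsets with support >= s; one pass per itemset size k."""
        counts = Counter(frozenset([i]) for b in baskets for i in b)
        frequent = {x for x, c in counts.items() if c >= s}
        all_frequent, k = set(frequent), 2
        while frequent:
            counts = Counter()
            for b in baskets:                        # pass k over the data
                for combo in combinations(sorted(b), k):
                    cand = frozenset(combo)
                    # monotonicity: count cand only if all (k-1)-subsets are frequent
                    if all(cand - {i} in frequent for i in cand):
                        counts[cand] += 1
            frequent = {x for x, c in counts.items() if c >= s}
            all_frequent |= frequent
            k += 1
        return all_frequent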
Jiaxuan You
56:19
@Euirim. Yes, that’s our setting, described on slide 9
Michael Sun
58:05
Can this be interpreted as a DAG (each itemset is a node)?
jameson weng
58:55
Is this guaranteed to use less memory in the worst case? For example, if many items are frequent
Sheikh Abdur Raheem Ali
58:58
{b,j} is not frequent
Andy Jin
59:56
@Jameson - in the theoretical worst case, I suppose not, but in practice I assume if too many items are “frequent” one should adjust the support threshold
Jiaxuan You
01:00:04
@Michael. What are you referring to?
Michael Sun
01:01:54
@Jiaxuan like counting itemsets, {a} -> {a, b}, is it a DAG?
Sivanarayana(Siva) Gaddam
01:02:26
Makes sense
Sheikh Abdur Raheem Ali
01:02:32
no. of buckets always <= no. of baskets, yeah?
Sivanarayana(Siva) Gaddam
01:02:45
No, I was thinking about hashing
Sheikh Abdur Raheem Ali
01:03:25
and the count of a bucket is always >= the count of any pair that hashes to it
Jiaxuan You
01:03:28
@Michael Good idea. I think you can understand it that way
Sivanarayana(Siva) Gaddam
01:04:31
Can a bucket be replaced by a counting Bloom filter?
Sivanarayana(Siva) Gaddam
01:04:48
For further minimizing space
Rafael Esteves
01:05:00
what does it mean for a bucket to be frequent
Jiaxuan You
01:05:01
Good idea. Yes I think so.
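For what it's worth, a counting Bloom filter used for pair counts this way is essentially a count-min sketch: each pair updates d slots and the estimate is the minimum, which can only over-count (consistent with the false-positive point raised below). A rough sketch with illustrative sizes:

    import hashlib

    D, W = 4, 10_000                     # depth and width, illustrative
    table = [[0] * W for _ in range(D)]

    def _slots(pair):
        for d in range(D):
            h = hashlib.md5(f"{d}:{pair}".encode()).hexdigest()
            yield d, int(h, 16) % W

    def add(pair):
        for d, w in _slots(pair):
            table[d][w] += 1

    def estimate(pair):                  # never under-counts, may over-count
        return min(table[d][w] for d, w in _slots(pair))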
danny schwartz
01:05:04
So this only helps when A-Priori would otherwise run out of memory maintaining counts in pass 2
Jiaxuan You
01:05:30
@Rafael. it means the count stored in that bucket is above the threshold we set
Jiaxuan You
01:06:22
@Danny. That’s definitely one scenario where it helps. It can reduce the computational cost of the 2nd pass as well
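To tie the bucket discussion together, a sketch of PCY's first pass: count single items and hash every pair in every basket into a small bucket array; a bucket is frequent if its count is at least the support threshold s (names and sizes illustrative):

    from collections import Counter
    from itertools import combinations

    NUM_BUCKETS = 10_000                 # in practice, as many as fit in memory

    def pcy_pass1(baskets, s):
        item_counts, buckets = Counter(), [0] * NUM_BUCKETS
        for b in baskets:
            item_counts.update(b)
            for pair in combinations(sorted(b), 2):
                buckets[hash(pair) % NUM_BUCKETS] += 1
        frequent_items = {i for i, c in item_counts.items() if c >= s}
        bitmap = [c >= s for c in buckets]  # pass 2 counts (i, j) only if i and j
        return frequent_items, bitmap       # are frequent AND their bucket bit is set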
danny schwartz
01:06:31
Can you also subtract the pair count after processing a pair to make the buckets more informative?
Sivanarayana(Siva) Gaddam
01:06:48
Is it safe to say that if we use a counting Bloom filter, we can solve the problem with constant memory?
Sivanarayana(Siva) Gaddam
01:08:20
Maybe proving the accuracy of a counting-Bloom-filter-based algorithm would be interesting
Peter Robert Boennighausen
01:08:27
What is “the problem” in this case? @Siva
Sivanarayana(Siva) Gaddam
01:09:04
Bloom filters suffer from false positives, which may lead to inaccurate counts if we are not careful
Sheikh Abdur Raheem Ali
01:09:34
@Akram this is the random sampling you were talking about earlier :D
Akram Sbaih
01:09:53
:)
Sivanarayana(Siva) Gaddam
01:10:05
BTW, another interesting question: how can we improve disk I/O in step 1?
Sivanarayana(Siva) Gaddam
01:10:18
That looks like a bottleneck
Jiaxuan You
01:10:43
@danny. I think it’s a smart idea. But decrementing a bucket may be challenging
Jiaxuan You
01:11:13
@Siva. I think that’s a good research question to ask
Andy Jin
01:12:48
Why is the monotonicity idea true for SON? Couldn’t an itemset’s frequency count be split across multiple subsets, so it is not frequent in any subset by itself?
danny schwartz
01:13:01
@Andy no, pigeonhole principle
Peter Robert Boennighausen
01:13:11
That’s why we have no guarantee that we find all of them
Michael Sun
01:13:17
Is there work on lower bounds for the minimal number of subsets we need to guarantee we don’t miss too many itemsets?
Jiaxuan You
01:13:34
@Siva great question. I think the algorithms we introduce after slide 47 can reduce disk I/O by a lot
Sivanarayana(Siva) Gaddam
01:13:53
I’m curious to know
Sivanarayana(Siva) Gaddam
01:14:05
How can we keep kernel happy? :-)
Jiaxuan You
01:16:07
@Andy. I think here, when deciding “frequent” in a subset, we reduce the threshold by dividing it across the subsets, just like slide 49
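To make the scaled threshold concrete, a sketch of SON's two stages, reusing the `apriori` sketch from earlier (chunks are assumed to fit in memory; names illustrative):

    def son(chunks, s):
        p = len(chunks)
        # Stage 1: candidate = frequent in some chunk at the scaled threshold s/p.
        # Pigeonhole: an itemset with total count >= s must reach s/p in some chunk,
        # so no truly frequent itemset is missed by the candidate set.
        candidates = set()
        for chunk in chunks:
            candidates |= apriori(chunk, s / p)
        # Stage 2: one full pass to count each candidate exactly.
        return {c for c in candidates
                if sum(1 for chunk in chunks for b in chunk if c <= b) >= s}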
Andy Jin
01:16:39
Ah got it, thanks!
Sheikh Abdur Raheem Ali
01:16:49
http://stephanemoore.com/pdf/toivonen.pdf
Andy Jin
01:17:04
Sorry if I missed this, but how do we determine what the negative border is for an itemset? Is it a trial-and-error process?
Caleb Chiam
01:19:26
Could you explain again why the SON algorithm doesn’t necessarily find *all* frequent itemsets? (Doesn’t the monotonicity idea guarantee that any frequent itemset will be identified, because said itemset will be frequent in at least one subset of the baskets?)
Jerry Huang
01:21:38
Hi, there was a question about why we take the absolute value of the interest of an association rule on slide 12; maybe we could discuss this together?
vrinda vasavada
01:21:44
might have missed this but in practice how often do we find something in the negative border that is frequent (and end up having to restart)? Is it more likely based on certain dataset characteristics?
Caleb Chiam
01:23:42
wait, so it sounds like the SON algorithm will also find all frequent itemsets if we configure the threshold correctly. Or am I confused?
Michael Sun
01:24:34
is it like a recall vs. precision tradeoff when playing with the threshold?
Caleb Chiam
01:24:36
“However, even with SON algorithm we still don’t know whether we found all frequent itemsets” <— isn’t this contradicted then? Or at least it needs to be further qualified?
Michael Sun
01:25:42
^ I think if you set the support threshold to 0… you find everything?
Caleb Chiam
01:26:48
I mean the non-trivial (and useful) thresholds would work as well I think, as described by Jure earlier
Chen, Chuanqi
01:27:02
Frequent items might differ depending on a community’s geographic location and cultural differences.
Andy Jin
01:27:24
Could we briefly go over how negative borders are found?
Andy Jin
01:27:30
For Toivonen
Akram Sbaih
01:28:19
In the proof of the last slide: what if no subsets of T are frequent in the sample? (Let’s take the extreme: the sample is very small and no items in T are in the sample)
Sheikh Abdur Raheem Ali
01:29:45
A simpler case: {A, B} is in the negative border if it is not frequent in the sample but {A} and {B} are frequent
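Building on that example, a brute-force sketch of the negative border: every itemset that is not frequent in the sample but whose immediate subsets all are (the empty set counts as frequent, so infrequent singletons are in the border; fine for small examples only):

    from itertools import combinations

    def negative_border(freq_in_sample, items):
        """freq_in_sample: frozensets found frequent in the sample."""
        freq = set(freq_in_sample) | {frozenset()}
        max_k = max((len(f) for f in freq_in_sample), default=0) + 1
        border = set()
        for k in range(1, max_k + 1):
            for combo in combinations(sorted(items), k):
                cand = frozenset(combo)
                if cand not in freq and all(cand - {i} in freq for i in cand):
                    border.add(cand)
        return border

    # Sheikh's example: {A} and {B} frequent, {A, B} not -> {A, B} is in the border.
    print(negative_border({frozenset("A"), frozenset("B")}, {"A", "B"}))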
steve gan
01:33:37
Is it usually the case that Toivonen’s algorithm finds all frequent itemsets within a few attempts?
Havin
01:33:54
thank you
Bhagirath Mehta
01:33:59
Thanks!