
CS246: Mining Massive Data Sets - Shared screen with speaker view
Rachael Liu Wang
05:59
I like the soundtrack!
Sheikh Abdur Raheem Ali
06:22
Yeah, nice orchestra
Jerry Huang
06:28
Can everyone hear the music? ~~~
jc
06:36
Yes, epic start
Sheikh Abdur Raheem Ali
06:51
Basically feels like being put on hold
Jerry Huang
07:07
CS246 plans to play music 15 mins before the lecture this quarter
jake silberg
07:43
Love it
Sheikh Abdur Raheem Ali
07:59
Is that to make it so that if you upload the recorded lecture it'll be immediately copystriked?
Rachael Liu Wang
08:21
It's like we are about to go on an epic adventure...
Sheikh Abdur Raheem Ali
09:22
What's the question policy during lectures BTW? Do we unmute, or put them in the chat, or save them for Piazza?
Natasha Sharp (she/her)
09:39
Please put all your questions in the chat
Rachael Liu Wang
12:00
so does class start at 12:30? or 12:15
Jerry Huang
12:13
12:30!
Sheikh Abdur Raheem Ali
12:13
I think it starts at 12:30 today
Sheikh Abdur Raheem Ali
12:25
but 12:15 from the next lecture onwards
Sheikh Abdur Raheem Ali
12:33
is that right?
Jerry Huang
12:51
I will start playing some music at 12:15 :)
alexandra porter
12:57
The lecture will always start at 12:30, but the zoom room will start at 12:15
Natasha Sharp (she/her)
12:59
Lecture will begin at 12:30, but we will be playing music starting at 12:15 PM PT to greet everyone. We will have TAs monitoring questions as they come up in the Zoom chat. But Prof. Jure Leskovec will lecture for 60 minutes, then have a
Natasha Sharp (she/her)
13:16
20 minute Q&A session after his lecture
Sheikh Abdur Raheem Ali
13:33
Ooh that sounds pretty unique, thanks
Sheikh Abdur Raheem Ali
17:26
I'm on https://web.stanford.edu/class/cs246/ and the link to the "Office of Accessible Education (OAE)" seems to be dead
alexandra porter
17:50
Thanks for catching that, I’ll find the new one and update it.
Sheikh Abdur Raheem Ali
19:40
Also, it says CS246 is (Winter, 3-4 Units, homework, final, no project) . But on the "Course Info" tab we only have Colabs and HWs under Coursework.
alexandra porter
20:50
Ah yes, that is also outdated. There will only be the Colabs and HW, with grading as described under Course Info.
Sheikh Abdur Raheem Ali
23:28
We've lost your slides
Sheikh Abdur Raheem Ali
27:29
Would Distributed Systems be a pre-requisite for understanding material in CS246?
trey connelly
30:03
@Sheikh No; for the parts we cover that touch on distributed systems (e.g. MapReduce), we'll explain assuming you haven't had a distributed systems course
trey connelly
31:50
Hi everyone! =D
Sheikh Abdur Raheem Ali
34:26
Do song recommendations follow a FIFO queue, do we use a first past the post voting system, or is it all up to the DJ's discretion?
alexandra porter
35:04
We’ll leave it to Jerry’s discretion.
Brian MacDonald Powell
36:52
I think the slides posted on the website are from last year. Would it be possible to upload the updated version? Thanks!
alexandra porter
37:29
I’ll check, I thought I posted the new one.
trey connelly
37:40
https://web.stanford.edu/class/cs246/slides/01-intro.pdf should link to the current slides
alexandra porter
38:02
Yes, and the [slides] box on the schedule goes there as well.
Brian MacDonald Powell
38:05
Might be a cache issue on my end. Thank you!
alexandra porter
38:36
If you see a green bar at the top of the cs246 page, you have the old site cached, it should be orange for this year.
Brian MacDonald Powell
39:44
Perfect, it was a cache issue and it’s fixed now. Thank you
Sarthak Kanodia
39:56
Would the textbook be used heavily during the course or is it supposed to be a reference text?
Han Wu
40:17
Is there a 246h this year?
trey connelly
40:59
CS246H is not offered this year
Sheikh Abdur Raheem Ali
42:49
I understand that for the math/proof parts of assignments the preferred format for submission is LaTeX documents compiled to PDF. What's the best practice for submitting code?
trey connelly
42:56
@Sarthak The textbook serves as an additional reference/review for the lecture material (the "suggested readings" in the schedule on the website link to chapters in the textbook that relate to that day's lesson)
Sheikh Abdur Raheem Ali
44:03
Do we have automated test suites to help verify our code before we submit it?
alexandra porter
44:36
I believe the homeworks have questions that you will write code to solve, but the actual answer will be checked in your written submission.
Rafael Esteves
45:18
how do the homeworks compare to one another? is a specific homework easier or harder than the rest?
Saksham Gakhar
45:40
Can I only use a part of a single late period? I can see a scenario where I might only need 1 extra day rather than the entire 4 day period from Friday to Monday
Sophia Ying Wang
46:31
so there is no final, correct ?
Julius Stener
46:32
So is the colab 10 due a week after the quarter is ended?
alexandra porter
46:53
There is no final. The 10 colabs include Colab 0, so there is no Colab 10.
trey connelly
46:59
For colabs, you just submit the answers on Gradescope; for homeworks we'll have a code submission section where you can upload your code. We'll put more instructions in the homework assignment document.
Hikaru Hotta
48:16
What methods are being used to track contributions during lectures and discussion sections?
Rachael Liu Wang
48:53
for the larger homeworks, can we have pset partners?
Emily You
49:38
When are the discussion sections?
alexandra porter
50:04
You may work on homework in study groups, but each student must submit an independent solution (see Course Info page on the website)
trey connelly
50:49
recitation sessions will not be held live this quarter; we'll release the recordings from last year for you to watch later this week and next
Tracy Cai
51:10
Where can we find the tutorial video for colab 0?
alexandra porter
51:44
The link on the slides for this lecture should take you to the recitation video on Canvas.
Tracy Cai
51:52
Thanks!
Cortney Curtis Weintz
51:58
Could one of the TAs elaborate on how necessary something like CS145 would be to this class?
Sarthak Kanodia
53:02
Have the contents of previously offered 246H been merged into the course content itself for this iteration? If not, what would be the best way to get a grasp of that material along with the course?
trey connelly
53:05
@Cortney Not very necessary; I took this class without CS145 and had no issue there. We'll cover in class the level of database knowledge we need.
Andy Jin
53:40
Just curious, are groups for CS341 formed before the class or during the class?
Andy Jin
53:53
(i.e. do we have to go in with a group)?
alexandra porter
54:40
@Andy I don’t know, I can find out and let you know.
Andy Jin
54:55
Great, thanks Alex!
Rafael Esteves
55:27
how do the homeworks compare to one another? is a specific homework easier or harder than the rest?
Sheikh Abdur Raheem Ali
57:35
I'm in industry, we constantly have to fight fires with service outages, he's totally right
trey connelly
58:25
@Rafael You'll have two weeks to do each one, and we will try to make them roughly equal in length. We cover a lot of different concepts in the course, so the homeworks generally cover different material rather than increasingly harder versions of the same content. So which one is harder depends on how well you grasp each topic.
sivanarayanagaddam
01:00:30
Is it safe to assume all big data workloads never modify source data?
wilson nguyen
01:00:56
Is there a gradescope code?
Sarthak Kanodia
01:01:36
what do we mean by racks?
trey connelly
01:02:42
@wilson Gradescope is linked to canvas, so if you're on the canvas page you should be on gradescope already. Also the gradescope button on the homepage should link there, but currently it appears to link to the homepage again, which we'll fix.
Hikaru Hotta
01:02:48
What exactly is a node?
Abubakar Abid
01:03:04
What's the benefit of splitting into chunks? And why are they so small?
Meg Richey
01:03:32
Sorry if I missed this but if C is for chunk, what is D?
Sheikh Abdur Raheem Ali
01:03:36
This sounds like a virtual memory design where the physical segments allocated may be fragmented (but with replication and distribution).
Dian Huang
01:03:37
If one fails, other can still recover the data. So you want to split into chunk.
June Chan
01:04:00
What’s the difference between the client library and master nodes? Can’t we look up locations directly with the master nodes?
trey connelly
01:04:19
@Meg C and D are two different files; C_x is chunk x of file C
danny schwartz
01:04:24
You could, but programmers are going to get agitated having to do that explicitly @June
June Chan
01:05:43
@danny, thanks. better not to annoy them. So client library is really just an interface here.
Rafael Esteves
01:08:24
what is the output of reduce?
Joe Zhang
01:08:26
Is MapReduce (the original) still commonly used or has it been subsumed by newer and better systems?
sivanarayanagaddam
01:08:35
What kinds of computations are hard to fit into the MapReduce programming model?
danny schwartz
01:08:47
@Rafael it depends, it could be a sum, a maximum, a lot of things
Sheikh Abdur Raheem Ali
01:08:50
Can chunks be freely moved around in the input, or are some chunks "pinned" to their location?
Sifan Ye
01:09:15
What are the keys in this context?
Miles Zoltak
01:09:30
could 2 keys potentially map to the same value? or is it really just dependent on the context at hand
Andy Jin
01:09:51
Which of these steps happen in memory vs. on disk?
trey connelly
01:10:07
@Sifan The keys and values can be arbitrary things. The keys just have to be group-able, i.e. hashable or sortable
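The "hashable or sortable" requirement is what lets a MapReduce system route all values for one key to the same reducer. A minimal hash-partitioning sketch in plain Python (illustrative names, not the course's framework):

```python
def partition(key, num_reducers):
    # Any deterministic hash works: the same key always lands on the same reducer.
    return hash(key) % num_reducers

def shuffle(pairs, num_reducers):
    # Route each (key, value) pair to its reducer's bucket.
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

pairs = [("cat", 1), ("dog", 1), ("cat", 1), ("bird", 1)]
buckets = shuffle(pairs, num_reducers=3)
# Both ("cat", 1) pairs are guaranteed to end up in the same bucket.
```

The key never needs to be a string or a number; anything hashable (or sortable, for sort-based shuffles) can be grouped this way.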
sivanarayanagaddam
01:10:16
What if Map task fails?
June Chan
01:10:36
Does the “M” sign here represent identical Map Task applied to different chunks?
Michael Sun
01:10:54
Is there a memory limit on the number of (k,v) pairs each mapper node outputs?
trey connelly
01:11:01
Often you'll have a "controller" of some kind that monitors all the tasks, and reassigns failed ones. (you actually code this yourself in CS110)
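The controller-reassigns-failed-tasks idea can be sketched as a simple retry loop. This is a toy single-process model with a hypothetical flaky worker, not CS110's assignment or MapReduce's actual scheduler:

```python
import random

def run_task(task_id):
    # Hypothetical flaky worker: fails ~30% of the time.
    if random.random() < 0.3:
        raise RuntimeError(f"worker died on task {task_id}")
    return f"result-{task_id}"

def controller(task_ids, max_retries=10):
    # Monitor all tasks; a failed task goes back to idle and is reassigned.
    results = {}
    pending = list(task_ids)
    budget = max_retries * len(pending)
    while pending and budget > 0:
        task = pending.pop(0)
        try:
            results[task] = run_task(task)
        except RuntimeError:
            pending.append(task)  # mark idle again -> picked up later
            budget -= 1
    return results

results = controller(range(5))
```

A real controller would also handle stragglers (speculative re-execution) and worker heartbeats, but the reassignment loop is the core of it.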
Sifan Ye
01:11:13
Can we have a simple example of possible map and reduce functions?
trey connelly
01:11:18
@June The "M" is the map task, yes
Emily You
01:11:24
What's the difference between the output and the grouped values after group by key?
trey connelly
01:11:37
@Sifan stay tuned; we're about to see an example
trey connelly
01:13:01
@Michael Each mapper could theoretically output any amount of data; if you want e.g. all the outputs to be held in memory at once, you could shrink the chunk size.
trey connelly
01:13:33
@Emily it's really the same data, just rearranged into a format to be useful for the reducer
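For anyone reading this asynchronously, the whole map → group-by-key → reduce pipeline discussed above can be sketched on one machine in a few lines of plain Python (an illustration of the data flow, not a distributed implementation):

```python
from collections import defaultdict

def map_fn(chunk):
    # Emit (word, 1) for every word in this chunk of the input.
    for word in chunk.split():
        yield (word, 1)

def group_by_key(pairs):
    # The "shuffle": same data, rearranged so each key carries all its values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_fn(key, values):
    # Aggregate the value list for one key; for word count, a sum.
    return (key, sum(values))

chunks = ["the quick brown fox", "the lazy dog the end"]
pairs = [kv for chunk in chunks for kv in map_fn(chunk)]
counts = dict(reduce_fn(k, vs) for k, vs in group_by_key(pairs).items())
# counts["the"] == 3
```

In a real system each chunk's map runs on a different worker and the grouping happens over the network, but the three stages are exactly these.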
Sifan Ye
01:13:53
So the values could be a list of values for other applications? (say nearest n neighbors)
Ken Yang
01:15:07
Is it part of the master node's responsibility to
Andy Jin
01:15:20
What optimizations are done to ensure the load assigned to each map task is roughly equal to avoid bottlenecks? (I assume that we cannot begin reducing until all mapping is completed?)
Ken Yang
01:15:36
Split the input and deliver it to all Map nodes? Would that be a bottleneck based on the network bandwidth of the node?
Abubakar Abid
01:15:37
Is there a reason you are not emitting (word, num_occurrences_of_word) in the map() function?
Hikaru Hotta
01:16:18
Does spark come with load balancing?
trey connelly
01:16:55
@Andy Often you'd have many small chunks and repeatedly send tasks serially to each of the parallel workers. So if one task is taking a long time, the other workers get sent the remaining tasks.
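That many-small-chunks scheme is essentially a shared work queue: workers pull the next task as soon as they finish one, so a slow task on one worker doesn't stall the rest. A toy threaded sketch (illustrative only; real systems dispatch over the network):

```python
import queue
import threading

def dispatch(tasks, num_workers, work_fn):
    # Fill a thread-safe queue with many small tasks.
    q = queue.Queue()
    for t in tasks:
        q.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = q.get_nowait()  # pull the next task, if any remain
            except queue.Empty:
                return
            r = work_fn(t)
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results

squares = dispatch(range(10), num_workers=4, work_fn=lambda x: x * x)
```

The load balancing falls out for free: fast workers simply take more tasks from the queue.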
Sheikh Abdur Raheem Ali
01:17:30
Do we usually cache the output of map tasks so that if we have to re-run a computation we only have to process the parts of the dataset that were changed?
Peter Robert Boennighausen
01:17:39
Why do completed map tasks need to be reset to idle?
danny schwartz
01:17:52
@Peter to reuse the hardware
trey connelly
01:18:19
@Abubakar That would still give you the same output, but it takes longer to count the occurrences of each word than to just output each word directly, and you'd still have to do the rest of the reducing part since each mapper only gets a small chunk of the input.
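Both mapper outputs do yield the same final counts; emitting pre-counted pairs per chunk is essentially what MapReduce calls a combiner (fewer pairs shipped to reducers, at the cost of extra work in the mapper). A toy comparison in plain Python (illustrative, not the course's API):

```python
from collections import Counter

chunk = "to be or not to be"

# Option 1: emit (word, 1) for every occurrence (simple; more pairs).
raw_pairs = [(w, 1) for w in chunk.split()]

# Option 2: pre-count within the chunk first (a "combiner"; fewer pairs,
# but reducers must still sum counts across chunks).
combined_pairs = list(Counter(chunk.split()).items())

def reduce_counts(pairs):
    # The reducer sums counts per word, whichever form it receives.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

assert reduce_counts(raw_pairs) == reduce_counts(combined_pairs)
```

Which option wins in practice depends on whether network transfer or mapper CPU is the bottleneck.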
Ken Yang
01:19:47
Is it the master node that splits the very original input? Wouldn't that be a bottleneck or SPOF in the system?
trey connelly
01:20:20
@sheikh we write to filesystem between map and reduce steps, which essentially caches the outputs (you'd have to do some bookkeeping too)
trey connelly
01:21:09
@Ken Often your input is already in chunks, and the master node can "split" the input by just giving a worker a start/end index, without having to do any I/O itself.
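The split-without-I/O point: the master only computes byte ranges and hands them out; each worker later opens the data and reads just its own range. A minimal sketch (illustrative):

```python
def split_ranges(total_bytes, chunk_size):
    # The master computes (start, end) byte ranges only -- no data is read.
    ranges = []
    start = 0
    while start < total_bytes:
        end = min(start + chunk_size, total_bytes)
        ranges.append((start, end))
        start = end
    return ranges

ranges = split_ranges(total_bytes=1000, chunk_size=300)
# -> [(0, 300), (300, 600), (600, 900), (900, 1000)]
```

Since the ranges are a few integers per chunk, handing them out is cheap even for huge inputs, which is why the master isn't an I/O bottleneck here.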
sivanarayanagaddam
01:22:01
How does it handle failures?
Andy Jin
01:22:05
@Ken re: SPOF, I believe the master node can also be replicated (from an earlier slide)
Sifan Ye
01:22:15
Is it possible to switch to zoom seminar format where Q&A threads are kept separately for easier reading?
Sifan Ye
01:23:10
chat would also be nicely separate
alexandra porter
01:23:15
@Sifan I’ll ask Natasha about that
Sarthak Kanodia
01:24:59
+1 to @Sifan’s request. Even without using the seminar format, Zoom meetings can have Q&A enabled, which will make navigating questions/answers easier for both the students and the teaching team.
T
01:26:40
Will the recording capture both Q&A and chats for those viewing the recordings asynchronously?
Andy Jin
01:27:25
What exactly is the difference between Spark vs. Hadoop? Is Spark a more complex data processing engine based on RDDs, while Hadoop uses MR and offers file storage via HDFS?
trey connelly
01:28:07
Zoom recordings don't record Q&A; I'm not sure about chat. We'll look into that along with the Q&A/webinar question
Sheikh Abdur Raheem Ali
01:29:13
Totally unbiased opinion: MS Teams handles Q&A for live events a lot better
Ken Yang
01:31:00
@Andy, my take is the diff should be between MapReduce and Spark. Hadoop is more like the framework that consists of the workflow system + DFS, Hive, MR, etc.
Andy Jin
01:31:58
Makes sense, thanks Ken!
danny schwartz
01:32:28
For this specific problem (n-grams), it seems like the chunk boundaries would be problematic
trey connelly
01:33:31
@danny true; you'd have to special-case that a little bit to handle the small number of cases that cross boundaries
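One common way to special-case the boundary problem: give each chunk n-1 extra tokens of lookahead from the next chunk, and have each mapper emit only the n-grams that start inside its own chunk, so boundary-crossing n-grams are counted exactly once. A sketch under those assumptions (illustrative, single-machine):

```python
def chunk_ngrams(tokens, chunk_size, n):
    # Each chunk reads n-1 tokens past its end, but only emits n-grams
    # that *start* within the chunk: nothing dropped, nothing double-counted.
    ngrams = []
    for start in range(0, len(tokens), chunk_size):
        extended = tokens[start : start + chunk_size + n - 1]
        for i in range(min(chunk_size, len(extended) - n + 1)):
            ngrams.append(tuple(extended[i : i + n]))
    return ngrams

tokens = "a b c d e f".split()
bigrams = chunk_ngrams(tokens, chunk_size=3, n=2)
# Same bigrams as processing the whole sequence at once:
# [('a','b'), ('b','c'), ('c','d'), ('d','e'), ('e','f')]
```

The small overlap per chunk is cheap as long as n is much smaller than the chunk size.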
giray ogut
01:34:36
What about Spark jobs that would boil down to a single MapReduce job? Would they run faster on Spark than on MapReduce even though they work on MapReduce?
Natasha Sharp (she/her)
01:34:48
@T, the recording should capture the lecture as well as the chat/Q&A. You should be able to find the recording later today on Canvas in the Panopto Course Videos section. Since lecture is over an hour, it can take some time to upload though. So thank you for your patience!
Alex Wang
01:34:57
Jure what's your setup to get such professional video
Sho
01:35:03
is there a difference between 3 and 4 credits for grad students?
Sheikh Abdur Raheem Ali
01:35:06
You mentioned Spark uses more memory than MapReduce. Obviously this is highly workload dependent, but could you give us an estimate?
Ken Yang
01:35:49
Logistic q: do we use Campuswire at all? I missed the early half of the lecture
Sarthak Kanodia
01:35:57
Have the contents of previously offered 246H been merged into the course content itself for this iteration? If not, what would be the best way to get a grasp of that material along with the course?
trey connelly
01:36:16
@Ken we use piazza; no campuswire
Ken Yang
01:36:22
Thx!
Chuanqi Chen
01:36:22
Will we discuss GNNs and GCNs at large scale?
Andy Jin
01:36:38
Is HDFS compatible with replication schemes like RAID? Or are there other specific ways of replication that it uses?
Sheikh Abdur Raheem Ali
01:36:56
Thanks, that makes sense
June Chan
01:37:30
So in operation, RDD is always on, waiting for query to provide answer?
sivanarayanagaddam
01:37:40
You highlighted that Spark performs lazy evaluation, as opposed to MapReduce. Does Spark actually fetch data from a local/remote source only at evaluation time?
giray ogut
01:38:10
What about Spark jobs that would boil down to a single MapReduce job? Would they run faster on Spark than on MapReduce even though they can be reduced to MapReduce?
Victoria Magdalena Dax
01:40:14
Is there a good textbook for folks who are new to servers and data storage / DFS?
June Chan
01:42:03
Thanks!
Joe Zhang
01:42:09
Thank you!
Andy Jin
01:42:10
Thank you!
Mingyan Zhao
01:42:11
Thank you
Bhagirath Mehta
01:42:11
Thanks!
Andrea Vallebueno
01:42:13
Thank you!
Christopher Wolff
01:42:14
ty!
sivanarayanagaddam
01:42:18
Thank you
Bryan Zhu
01:42:21
Thank you!
danny schwartz
01:42:22
Thank you
Ken Yang
01:42:23
TY!
Sifan Ye
01:42:24
thank you!
Sheikh Abdur Raheem Ali
01:42:25
Thank you!! Have a nice day!
Sasankh Munukutla
01:42:26
Thanks!!!
Will Geoghegan
01:42:27
Thanks!
Jonathan Michael Gomes Selman
01:42:31
Thank you