Sampling for Big Data

author: Nick Duffield, Department of Electrical and Computer Engineering, Texas A&M University
author: Graham Cormode, Department of Computer Science, University of Warwick
published: Oct. 8, 2014,   recorded: August 2014,   views: 6134
Categories

Slides

Related content

Report a problem or upload files

If you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
Lecture popularity: You need to login to cast your vote.
  Delicious Bibliography

 Watch videos:   (click on thumbnail to launch)

Watch Part 1
Part 1 1:01:47
!NOW PLAYING
Watch Part 2
Part 2 37:57
!NOW PLAYING
Watch Part 3
Part 3 1:15:32
!NOW PLAYING

Description

One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault tolerant storage architectures, parallel and graphical computation models such as MapReduce, Pregel and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries, and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state-of-the-art in sampling goes far beyond simple uniform selection of elements, to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data. Applications are discussed to sampling network traffic and social networks.

Link this page

Would you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !

Write your own review or comment:

make sure you have javascript enabled or clear this field: