Big Data Fest: Data, tools & practice

by
Genoveva Vargas Solar, Senior Scientist, CNRS
Genoveva.Vargas@imag.fr
Regina Motz, Prof. Universidad de la República
regina.motz@gmail.com
Javier Alfonso Espinosa Oviedo, Postdoctoral fellow, LAFMIA
javiera.espinosa@imag.fr
Technical support

Plácido Antonio de Souza Neto, Postdoctoral fellow, LIG
Juan Carlos Castrejón, PhD student, University of Grenoble, LIG-LAFMIA

2014

CONTENT

1. Big Data Challenges [pdf]

1.1. Volume: Huge data collections

1.2. Velocity: Continuous on-line data streams

1.3. Variety: Big data models

2. Applications and tools

2.1. Data replication and sharding [pdf]

2.1.1. NoSQL systems: experience with CouchDB [pdf] CouchWrapup [pdf]

2.2. Life cycle I: sanitizing experience with Pig [pdf]

2.3. Life cycle II: data gathering techniques: Web scraping, data services, crowdsourcing [pdf]

2.3.1. Open data / data journalism [see examples in the project definition]

3. Big Data Processing Platforms

3.1. Parallel processing for analytics: Hadoop platforms [pdf]

3.2. Some elements of data analytics [pdf]

3.3. Big Data Management Systems [see slides section 1 & the references section]

4. Big and smart data applications: examples

4.1. Elections [pdf]

4.2. Other applications [pdf]

HANDS ON

  • NoSQL data stores: expressing queries using MapReduce
  1. Downloading CouchDB: http://couchdb.apache.org
    1. Building a document database using CouchDB [Ex-1] [Ex1-answers]
    2. Querying a document database [Ex-2] [answers on request]
  •  Data sanitation with Pig
  1. Installing Pig
    1. Hortonworks [pdf]
    2. Testing your installation: [data] [PigScript]
  2. Dealing with network behavior data collections [pdf] [data: distributed in class, ask for it!]
  •  Data analytics with Hadoop
  1. Environment: Hadoop on Hortonworks
  2. Counting words and other summarization challenges [AllData]
    1. Counting words: first approach [pdf] [WordCount Example]
    2. Counting with some optimizations using combiners: understanding some principles of the MapReduce model [pdf] [MapReduce-book-final] [code examples]
  3. Some interesting MapReduce patterns: see the challenges section [patterns reference]
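
To make the combiner idea in the word-count exercise concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API: it only simulates the map, combine, shuffle and reduce steps in one process, and every function name in it (map_words, combine, shuffle, reduce_counts, word_count) is a hypothetical name chosen for illustration. Each inner list of lines stands for one input split handled by one mapper.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-aggregate the counts produced by a single mapper."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle step: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: sum the partial counts for one word."""
    return word, sum(counts)


def word_count(splits, use_combiner=True):
    """Run the simulated job.

    Returns the final counts and the number of pairs that had to be
    shuffled, which is the quantity the combiner is meant to reduce.
    """
    shuffled_pairs = []
    for split in splits:  # one simulated mapper per split
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled_pairs.extend(pairs)
    grouped = shuffle(shuffled_pairs)
    totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
    return totals, len(shuffled_pairs)
```

Calling word_count on the same splits with use_combiner set to True and then False should give identical totals but a smaller number of shuffled pairs in the first case. The combiner is safe here because addition is associative and commutative; this is an assumption to check for other summarizations (a median, for instance, cannot be combined this way).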

CHALLENGES

challenge: “a test of one’s abilities or resources in a demanding but stimulating undertaking”, The free English dictionary

  • CH-1: Polyglot meets Xperanto [here]
  • CH-2: More intensive summarization: choose one of the following
    1. Median and standard deviation [pdf]
    2. Inverted index summarizations [pdf]
  • CH-3: Filtering patterns: choose one of the following
    1. Bloom filter [pdf]
    2. Top ten [pdf]
  • CH-4: Join patterns: choose two of the following
    1. Reduce-side join, classic and with Bloom filter [pdf]
    2. Replicated join [pdf]
    3. Composite join [pdf]
    4. Cartesian product [pdf]
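
As a starting point for the join patterns in CH-4, the classic reduce-side join can be sketched as follows. This is an in-process Python simulation under stated assumptions, not Hadoop code and not the solution expected for the challenge: the function names (map_tagged, reduce_join, reduce_side_join), the "L"/"R" tags and the dictionary record format are all hypothetical. The sketch performs an inner join only.

```python
from collections import defaultdict


def map_tagged(records, tag, key_field):
    """Map step: emit (join_key, (tag, record)) so that the reducer
    can tell which data set each record came from."""
    return [(record[key_field], (tag, record)) for record in records]


def reduce_join(tagged_values):
    """Reduce step for one key: pair every left record with every
    right record that shares the key (inner join)."""
    left = [record for tag, record in tagged_values if tag == "L"]
    right = [record for tag, record in tagged_values if tag == "R"]
    return [{**l, **r} for l in left for r in right]


def reduce_side_join(left, right, key_field):
    """Simulate the whole job: map both inputs, group by key, reduce."""
    grouped = defaultdict(list)
    emitted = map_tagged(left, "L", key_field) + map_tagged(right, "R", key_field)
    for key, value in emitted:  # simulated shuffle
        grouped[key].append(value)
    joined = []
    for key in sorted(grouped):  # one simulated reducer call per key
        joined.extend(reduce_join(grouped[key]))
    return joined
```

Every record of both inputs travels through the shuffle, including records whose key has no match on the other side. That cost is what the Bloom filter variant of the challenge tries to reduce, by discarding on the map side the records whose key is definitely absent from the other data set.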
