1. Big Data Challenges [pdf]
1.1. Volume: Huge data collections
1.2. Velocity: Continuous on-line data streams
1.3. Variety: Big data models
2. Applications and tools
2.1. Data replication and sharding [pdf]
2.1.1. NoSQL systems: experience with CouchDB [pdf] CouchWrapup [pdf]
2.2. Life cycle I: data sanitization, experience with Pig [pdf]
2.3. Life cycle II: data gathering techniques: web scraping, data services, crowdsourcing [pdf]
2.3.1. Open data / data journalism [see examples in the project definition]
3. Big Data Processing Platforms
3.1. Parallel processing for analytics: Hadoop platforms [pdf]
3.2. Some elements of data analytics [pdf]
3.3. Big Data Management Systems [see slides section 1 & the references section]
4. Big and smart data applications: examples
4.1. Elections [pdf]
4.2. Other applications [pdf]
HANDS ON
- NoSQL data stores: expressing queries using MapReduce
- Downloading Couch: http://couchdb.apache.org
- Building a document database: using CouchDB [Ex-1] [Ex1-answers]
- Querying a document database [Ex-2] [answers on request]
- Data sanitization with Pig
- Installing Pig
- Dealing with network behavior data collections [pdf] data [distributed in class; ask for it!]
- Data analytics with Hadoop
- Environment: Hadoop on Hortonworks
- Counting words and other summarization challenges [AllData]
- Counting words: first approach [pdf] [WordCount Example]
- Counting with some optimizations using combiners: understanding some principles of the MapReduce model [pdf] [MapReduce-book-final] [code examples]
- Some interesting MapReduce patterns: see the challenges section [patterns reference]
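The word-count exercises above can be previewed with a small in-memory sketch. It is not the course's Hadoop code: the function names (`map_words`, `combine`, `shuffle`, `reduce_counts`, `word_count`) and the sample input are hypothetical, and the whole pipeline runs in a single Python process. It only illustrates the principle behind combiners: each input split pre-sums its own counts, so fewer pairs reach the shuffle step, while the final counts stay the same.

```python
from collections import Counter, defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: pre-sum the counts produced by one input split."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Shuffle phase: group all emitted counts by word."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_counts(grouped):
    """Reduce phase: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(splits, use_combiner=True):
    """Run the simulated job; return (counts, number of pairs shuffled)."""
    shuffled = []
    for split in splits:  # each split stands in for one mapper's input
        pairs = [pair for line in split for pair in map_words(line)]
        if use_combiner:
            pairs = combine(pairs)
        shuffled.extend(pairs)
    return reduce_counts(shuffle(shuffled)), len(shuffled)


if __name__ == "__main__":
    # Hypothetical sample input: two splits of one line each.
    sample = [["big data big"], ["data big"]]
    print(word_count(sample, use_combiner=False))
    print(word_count(sample, use_combiner=True))
```

Summing is associative and commutative, which is why the same logic can serve as both combiner and reducer here; an aggregate such as a mean would need a different intermediate value (for example, a sum and a count).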
CHALLENGES
Challenge: “a test of one’s abilities or resources in a demanding but stimulating undertaking”, The free English dictionary
- CH-1: Polyglot meets Xperanto [here]
- CH-2: More intensive summarization: choose one of the following
- CH-3: Filtering patterns: choose one of the following
- CH-4: Join patterns: choose two of the following