[WS24/25] Big Data: Technologies, Methods, Concepts
Lecturer:
Prof. Dr. Andreas Harth
Details:
Lecture with exercises
ECTS credits: 5
Language: English
Module no.: 85765
Time and location: see campo
Prerequisites
There are no formal prerequisites. Basic knowledge of databases and web technologies is helpful.
Contents
Big Data refers to datasets that are too large or too complex to handle with traditional data management and processing systems. The course presents an overview of methods and technologies for storing and processing Big Data.
The goal of the course is to provide a solid foundation in the traditional design aspects of Distributed Computing and Distributed Databases, and to show how they have influenced modern developments in cloud computing, including distributed data storage (e.g., NoSQL storage techniques) and data processing abstractions (e.g., MapReduce/Hadoop, Pregel/Giraph).
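As a small illustration of the MapReduce abstraction named above, the following Python sketch counts words in a single process. It is not Hadoop code and is not taken from the course material: the function names (map_phase, shuffle, reduce_phase, word_count) and the sample documents are hypothetical, and a real framework would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values collected for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce sequentially (hypothetical driver)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical sample input: each string stands for one input split.
counts = word_count(["big data is big", "data is distributed"])
print(counts)
```

The map and reduce functions only see one document or one key at a time, which is what allows a framework to distribute them across a cluster.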
Course Objectives
The course teaches the fundamentals of Big Data, including real-world use cases, as well as current technical challenges and opportunities with Big Data. Students will learn about the foundational algorithms used in large-scale distributed systems. Further, students will learn how to make use of available technologies to store, process and integrate Big Data on cloud infrastructures and to perform data analytics tasks. The hands-on sessions include setting up a cloud environment, and querying and visualizing a large dataset.
Learning Goals:
- Understand why parallel processing and distributed storage are key to handling massive data
- Learn about the different types of Distributed Systems
- Learn the basics of distributed communication and modern distributed (cloud) computation abstractions, including MapReduce and Pregel (as used by Google et al.)
- Learn the fundamentals of Distributed Databases, including the trade-offs between fault tolerance, scalability, performance and economy
- Understand the different types of guarantees a distributed database can make, and their formal limitations
- Cover the taxonomy of current NoSQL stores commonly used for large-scale data management in cluster/cloud computing environments
- Compare and contrast the strengths and weaknesses of the different data models employed by these stores
- Learn about the different query languages employed by different stores
Literature:
- A. S. Tanenbaum, M. Van Steen. Distributed Systems: Principles and Paradigms (2nd Edition). Prentice Hall, 2006.
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, G. Czajkowski. Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146.
- K. Hwang, J. Dongarra, G. C. Fox. Distributed and Cloud Computing: From Parallel Processing to the Internet of Things (1st Edition). Morgan Kaufmann, 2011.
- M. T. Özsu, P. Valduriez. Principles of Distributed Database Systems. Springer, 2011.
- T. White. Hadoop: The Definitive Guide. O’Reilly, 2012.
- P. J. Sadalage, M. Fowler. NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence. Addison-Wesley Professional, 2012.
- J. Leskovec, A. Rajaraman, J. Ullman. Mining of Massive Datasets.
- A. Doan, A. Halevy, Z. Ives. Principles of Data Integration. Morgan Kaufmann, 2012.
Join the StudOn instance: https://www.studon.fau.de/crs5831215_join.html