Big Data Management and Analysis in Linux
Vrije Universiteit Amsterdam
Amsterdam, The Netherlands
Area of Study
Computer Programming, Information Sciences
Taught In English
Recommended U.S. Semester Credits: 3
Recommended U.S. Quarter Units: 4
Hours & Credits
The growing availability of extremely large datasets requires scientists and analysts to use powerful supercomputers or computer clusters to store, manage, and analyze these data. These clusters typically run on Linux, which requires some programming skills and insights into suitable software packages. Our course will introduce you to programming in a Linux environment, teach you how to efficiently manage very large datasets (e.g. using sed, awk, and grep commands) and create simple shell scripts to analyze your data (e.g. using a Linux version of the freely available statistics program R). You will also learn how to visualize your data and results in customized plots and figures. These skills are extremely valuable for scientists from all disciplines as well as for business practitioners (e.g. consultants or financial analysts) who are planning to work with big data.
The course consists of three-hour lectures in the morning, followed by two hours of supervised computer tutorials in the afternoon. Both the lectures and tutorials will be held in a computer room. The lectures will be interactive, with short examples that allow students to apply the introduced concepts. In the tutorials, students will get more hands-on training in a supervised environment with exercises covering the day’s topics, and they will have the opportunity to work on the assignments. The computer room will stay open to students for self-study after the tutorials.
Students are not required to bring their own laptops, but they are allowed to do so if they wish to work on their own computers.
By the end of this course, the student should understand and feel comfortable with:
- Basic Linux programming
- The Unix philosophy and environment; files, processes, pipes, filters and basic utilities
- Login and logout procedures
- File transfer between systems
- Text file manipulation with sed, awk, cut, paste, cat, etc.
- Basic text editing using the vim editor
- Automation through functions, control structures and shell scripts
- Version control with Git
- Working with R through the Unix command line
- Plotting in R
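To give a flavor of the pipes-and-filters style listed above, here is a minimal sketch of a shell pipeline. It is only an illustration, not course material: the sample records, the column layout, and the variable names are all hypothetical.

```shell
# Hypothetical sample data: comma-separated records (name,group,value).
data='alice,control,4.0
bob,treated,6.5
carol,treated,7.5'

# grep keeps only the "treated" rows, cut extracts the third column,
# and awk sums the values and prints their mean.
mean_treated=$(printf '%s\n' "$data" \
  | grep ',treated,' \
  | cut -d',' -f3 \
  | awk '{ sum += $1 } END { print sum / NR }')

# sed edits a text stream in place: here every comma becomes a semicolon.
semicolon_sep=$(printf '%s\n' "$data" | sed 's/,/;/g')

echo "Mean of treated group: $mean_treated"
printf '%s\n' "$semicolon_sep"
```

Each tool does one small job and passes its output to the next through a pipe. Because the data is processed as a stream, the same pattern carries over from a three-line example to files too large to open in an editor.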
Interactive seminars, practicals
TYPE OF ASSESSMENT
Visit to the SURFsara computer facilities at Amsterdam Science Park.
Scientists and data analysts from all disciplines, as well as business practitioners (e.g. consultants or financial analysts) who are planning to work with big data. If you have doubts about your eligibility for the course, please let us know. Our courses are multi-disciplinary and therefore are open to students with a wide variety of backgrounds.
The course will be fairly technical and includes many computer tutorials. There are no entry requirements other than a willingness to learn about programming in Linux, but a decent background in statistics, mathematics, and programming is an advantage.