How To Learn Hadoop For Free
2nd April 2017
The "big data" technology landscape is changing really, really fast. One consequence of this is that it's hard to find good training resources, since they become outdated so quickly. I wanted to get some baseline comfort with a variety of technologies in the Hadoop ecosystem but found my options for thorough, guided education somewhat lacking. I eventually settled on MapR's free training courses. Each one is like a miniature version of an online course (most require only a few hours of time). They include interactive video content, quizzes, and various labs to complete using the MapR sandbox. There's a fairly wide range of courses and the content is very professional.
Below is a brief synopsis of the courses they offer. They are completely free to try out - just follow the link above, create an account, and register for the course you're interested in. In addition, I put all of the content for the courses I worked through (including labs with example code) in a GitHub repo.
Note that due to the fast pace of change that I alluded to earlier (and MapR's vested interest in staying current), the course catalog will likely evolve over time. It's possible that this post will become outdated fairly quickly, although I'll try to revisit it periodically to make sure the guidance is still relevant. I should also note that there are snippets of content throughout the training that are specific to the MapR platform; however, more than 90% of it is platform-agnostic.
This list is not exhaustive. It only includes the courses that I spent time working on. Feel free to visit the landing page for a complete list of courses.
Hadoop Essentials
These are short, introductory courses that present a very high-level overview of the Hadoop ecosystem.
ESS 100 - Introduction to Big Data
ESS 101 - Apache Hadoop Essentials
ESS 102 - MapR Converged Data Platform Essentials
MapReduce
MapReduce is how it all got started, and it is still used quite a bit. MapReduce is a programming model for distributing work over very large data sets across a cluster of machines. The name comes from the two principal steps involved in the process - map (filtering, sorting, etc.) and reduce (summary operations).
DEV301 - Developing Hadoop Applications
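To make the map and reduce steps a little more concrete, here is a minimal word-count sketch in plain Python. It is not Hadoop's Java API and it runs on a single machine. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, made up for illustration. The grouping step in the middle stands in for the work a real framework does between the two phases.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group intermediate values by key (done by the framework in a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: summarize all values seen for one key (here, a sum)."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in sorted(grouped.items()))


if __name__ == "__main__":
    docs = ["big data is big", "data moves fast"]
    print(word_count(docs))
    # {'big': 2, 'data': 2, 'fast': 1, 'is': 1, 'moves': 1}
```

In a real cluster each call to the map function could run on a different machine against a different chunk of the input, which is where the model gets its scalability.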
HBase
HBase is an open-source, non-relational, distributed column-store database written in Java. HBase is very widely used as an alternative to relational databases for certain types of applications where scale is an issue.
DEV320 - HBase Data Model and Architecture
DEV325 - HBase Schema Design
DEV330 - Developing HBase Applications: Basics
DEV335 - Developing HBase Applications: Advanced
DEV340 - HBase Bulk Loading, Performance, and Security
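If the "column-store" idea feels abstract, one common way to picture an HBase table is as a sorted map of maps: row key, then column (family plus qualifier), then timestamp, then value. The toy class below is a rough in-memory sketch of that picture in plain Python. It is not the HBase client API, and the `TinyTable` class and its method names are hypothetical.

```python
from collections import defaultdict


class TinyTable:
    """Toy model: row key -> "family:qualifier" -> {timestamp: value}."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, value, timestamp):
        """Store one versioned cell; older versions are kept alongside it."""
        self.rows[row_key][column][timestamp] = value

    def get(self, row_key, column):
        """Return the most recent version of a cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order, start inclusive and stop exclusive."""
        for row_key in sorted(self.rows):
            if start_row <= row_key < stop_row:
                yield row_key


if __name__ == "__main__":
    table = TinyTable()
    table.put("user#001", "info:email", "old@example.com", timestamp=1)
    table.put("user#001", "info:email", "new@example.com", timestamp=2)
    table.put("user#002", "info:email", "b@example.com", timestamp=1)
    print(table.get("user#001", "info:email"))      # new@example.com
    print(list(table.scan("user#001", "user#002")))  # ['user#001']
```

Because rows are kept sorted by key, range scans are cheap, which is a big part of why row key design gets so much attention in the schema design course.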
Spark
Spark is an open-source cluster-computing framework that can run on Hadoop. I've written about Spark in the past. Suffice it to say that it is a very exciting (and very popular) framework.
DEV360 - Spark Essentials
DEV361 - Build and Monitor Spark Applications
DEV362 - Create Data Pipeline Using Spark
Drill
Drill is an open-source framework for querying semi-structured and unstructured data at scale using SQL-like syntax. I haven't seen a lot of interest in this outside of the MapR distribution but it's a mature technology that has a lot of potential.
DA410 - Drill Essentials
DA415 - Drill Architecture
Hive
Hive is a data warehousing infrastructure built on top of Hadoop that provides the capability to query data in the Hadoop Distributed File System (HDFS) using SQL-like syntax. There's some conceptual overlap between Hive, HBase, and Drill that requires some background and context to understand. The relevant courses do a good job of clarifying these relationships.
Pig
Pig is a high-level programming language and framework for doing ETL (extract, transform, and load) tasks with data. I'm not sure how much Pig is used anymore with newer technologies like Spark offering similar capabilities, but I think there is still a use case for it.