Curious Insight


Technology, software, data science, machine learning, entrepreneurship, investing, and various other topics


How To Learn Hadoop For Free

2nd April 2017

The "big data" technology landscape is changing very quickly. One consequence is that good training resources are hard to find, since they become outdated so fast. I wanted to get some baseline comfort with a variety of technologies in the Hadoop ecosystem but found my options for thorough, guided education somewhat lacking. I eventually settled on MapR's free training courses. Each one is like a miniature version of an online course (most require only a few hours). They include interactive video content, quizzes, and various labs to complete using the MapR sandbox. There's a fairly wide range of courses and the content is very professional.

Below is a brief synopsis of the courses they offer. They are completely free to try out - just follow the link above, create an account, and register for the course you're interested in. In addition, I put all of the content for the courses I worked through (including labs with example code) in a GitHub repo.

Note that due to the rapid pace of change that I alluded to earlier (and MapR's vested interest in staying current), the course catalog will likely evolve over time. It's possible that this post will become outdated fairly quickly, although I'll try to revisit it periodically to make sure the guidance is still relevant. I should also note that there are snippets of content throughout the training that are specific to the MapR platform; however, more than 90% of it is platform-agnostic.

This list is not exhaustive. It only includes the courses that I spent time working on. Feel free to visit the landing page for a complete list of courses.

Hadoop Essentials

These are short, introductory courses that present a very high-level overview of the Hadoop ecosystem.

ESS 100 - Introduction to Big Data

ESS 101 - Apache Hadoop Essentials

ESS 102 - MapR Converged Data Platform Essentials

MapReduce

MapReduce is how it all got started, and it is still used quite a bit. MapReduce is a programming model for distributing work over very large data sets across a cluster of machines. The name comes from the two principal steps involved in the process - map (filtering, sorting, etc.) and reduce (summary operations).
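To make the two steps concrete, here is a minimal sketch of a word count in plain Python. It is not taken from the MapR course material and does not use Hadoop at all: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration, and everything runs in a single process instead of being spread across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key.

    In a real cluster the framework does this between map and reduce;
    here it is simulated with an in-memory dictionary (an assumption
    made for illustration only).
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: summarize all values for one key (here, by summing)."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce over a list of text records."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The point of the split is that each map call only looks at one record and each reduce call only looks at one key, which is what lets a real framework run many of them in parallel on different machines.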

DEV301 - Developing Hadoop Applications

HBase

HBase is an open-source, non-relational, distributed column-store database written in Java. HBase is very widely used as an alternative to relational databases for certain types of applications where scale is an issue.
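If "column-store" sounds abstract, a rough mental model is a sorted map from row key to column family and qualifier to timestamped values. The toy class below is only a sketch of that idea in plain Python. It is not the HBase client API, and ToyTable and its methods are hypothetical names invented for this illustration.

```python
class ToyTable:
    """In-memory stand-in for the HBase data model (illustration only)."""

    def __init__(self):
        # row key -> {(family, qualifier): [(timestamp, value), ...]}
        self._rows = {}

    def put(self, row, family, qualifier, value, ts):
        """Store a new version of one cell, newest version first."""
        cells = self._rows.setdefault(row, {})
        versions = cells.setdefault((family, qualifier), [])
        versions.append((ts, value))
        versions.sort(key=lambda version: version[0], reverse=True)

    def get(self, row, family, qualifier):
        """Return the newest value of a cell, or None if it does not exist."""
        versions = self._rows.get(row, {}).get((family, qualifier))
        return versions[0][1] if versions else None

    def scan(self, start_row, stop_row):
        """Yield rows in sorted key order; start is inclusive, stop is exclusive."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                latest = {
                    f"{family}:{qualifier}": versions[0][1]
                    for (family, qualifier), versions in self._rows[row].items()
                }
                yield row, latest


table = ToyTable()
table.put("user#001", "info", "email", "old@example.com", ts=1)
table.put("user#001", "info", "email", "new@example.com", ts=2)
table.put("user#002", "info", "email", "two@example.com", ts=1)
table.put("user#003", "info", "email", "three@example.com", ts=1)

latest_email = table.get("user#001", "info", "email")
scanned_rows = [row for row, _ in table.scan("user#001", "user#003")]
print(latest_email, scanned_rows)
```

Rows do not need to share the same columns, and lookups are driven by the row key, which is a big part of why schema design gets its own course.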

DEV320 - HBase Data Model and Architecture

DEV325 - HBase Schema Design

DEV330 - Developing HBase Applications: Basics

DEV335 - Developing HBase Applications: Advanced

DEV340 - HBase Bulk Loading, Performance, and Security

Spark

Spark is an open-source cluster-computing framework that can run on Hadoop, although it also works with other cluster managers or on its own. I've written about Spark in the past. Suffice it to say that it is a very exciting (and very popular) framework.

DEV360 - Spark Essentials

DEV361 - Build and Monitor Spark Applications

DEV362 - Create Data Pipeline Using Spark

Drill

Drill is an open-source framework for querying semi-structured and unstructured data at scale using SQL-like syntax. I haven't seen a lot of interest in this outside of the MapR distribution, but it's a mature technology that has a lot of potential.

DA410 - Drill Essentials

DA415 - Drill Architecture

Hive

Hive is a data warehousing infrastructure built on top of Hadoop that provides the capability to query data in the Hadoop Distributed File System (HDFS) using SQL-like syntax. There's some conceptual overlap between Hive, HBase, and Drill, and it takes some background and context to understand. The relevant courses do a good job of clarifying these relationships.

DA440 - Hive Essentials

Pig

Pig is a high-level programming language and framework for doing ETL (extract, transform, and load) tasks with data. I'm not sure how much Pig is used anymore with newer technologies like Spark offering similar capabilities, but I think there is still a use case for it.

DA450 - Pig Essentials




Big Data, Data Science, Software Development

Data scientist, engineer, author, investor, entrepreneur