Learning Big Data

From Traxel Wiki
Revision as of 20:36, 27 September 2021 by RobertBushman (talk | contribs)


Registry of Open Data on AWS[1]

Q: There are a lot of open datasets to play with here.

A: Yeah - you can learn Spark and maybe help cure cancer at the same time. 🙂

Q: Could you also please help me with the most relevant skills I need to focus on for a Data Engineering career path?

A: Sure - the open-source tools come from the Hadoop ecosystem. Here's a Coursera course[2] that will give you an introduction.

The main components we use are Spark[3], Parquet[4], and Hive[5].

We use the AWS implementations of Spark, EMR[6] and Glue[7], and the AWS implementation of the Hive metastore, the Glue Data Catalog[8].

Then we store the data in S3[9].

The most challenging thing to learn is how to optimize the performance of joins between billion-row datasets. When you can spot the small change in a query that takes a job run from hours to minutes, you'll know you're starting to see the problem.

You'll want to look into things like window functions, broadcast joins, and partitioning strategies.

Also look at the effects of things like sort order in stored data, and of coalesce, repartition, and part-file sizes.

Those things should get you far enough down the path that you'll see lots of other roads you can explore. From there you can choose your own adventure according to how you'd like to develop your career.