Ian Hummel

https://themodernlife.github.io/ · 7 posts · history since 2014 · active

6 Jan 2016 17 min read

Running Apache Flink on Amazon Elastic MapReduce

I really love Amazon EMR. Over the years it’s grown from being “Hadoop on-demand” to a full-fledged cluster management system for running OSS big-data apps (Hadoop MR of course, but also Spark, Hue, Hive, Pig, Oozie and more). While Hadoop out of the box supports reading from S3, EMR has a proprietary implementation called EMRFS that has some nice features.…

scala hadoop hdfs scalding flink

20 Dec 2015 8 min read

Running Scalding jobs on Apache Flink

My previous post showed a very simple Scalding workflow. Apache Flink is a real-time streaming framework that’s very promising. It also supports running Cascading workflows with very little modification. Surely there must be some way to run a Scalding job on top of Flink? Turns out… YES! In a nutshell: here are the high-level things we need to solve…

scala hadoop hdfs scalding flink

20 Dec 2015 1 min read

Getting started with Scalding

I’ve been using Scalding for the last few years and really love how simple it makes writing scalable data processing jobs. I think many of the issues beginners have with Scalding relate to project setup. I hope this post simplifies things for people so they can get started with less hassle. Building your project with SBT: The official getting started guide…

scala hadoop hdfs scalding

28 Sept 2014 1 min read

Getting the current filename with Spark and HDFS

It’s occasionally useful when writing map/reduce jobs to get a hold of the current filename that’s being processed. There are a few ways to do this, depending on the version of Spark that you’re using. Spark 1.1.0 introduced a new method on HadoopRDD that makes this super easy:
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
    import org.apache.spark.rdd.HadoopRDD
    // Create the text…

scala spark hadoop hdfs

7 Aug 2014 9 min read

Strange bedfellows: how a web-tier validation framework enables strongly typed, big data pipelines

The other day I was talking with a colleague about data validation and the Play web framework came up. Play has a nice API for validating HTML form and JSON submissions. This works great when you’re processing small amounts of data from the web-tier of your application. But could that same tech benefit a Big Data team working on a…

scala validation play spark

8 Jan 2014 1 min read

Scala Unicode Arrows in IntelliJ IDEA

Several of my colleagues love IntelliJ for coding in Scala. I was pretty happy with Sublime Text 2 (and still use it for Ruby/Python/Shell/whatever) but the lack of code completion was really starting to affect my productivity. I spent way too much time looping through the edit/compile/fix typo cycle. Before I could switch though, I really wanted my fancy arrows…

scala intellij

2 Jan 2014 4 min read

Making Your Local Hadoop more like AWS Elastic MapReduce

At MediaMath we’re big users of Elastic MapReduce. EMR’s incredible flexibility makes it a great fit for our analytics jobs. An extremely important best practice for any analytics project is to ensure your local dev and test environments match your production environment as much as possible. This eliminates the nasty surprise of launching a job that takes hours only to…

emr hadoop