I was a fellow in the September 2015 batch in NYC. It's a seven-week program focused on building core data engineering concepts and skills. In the first half, we spent three weeks on our projects to get hands-on experience working with data engineering software.
For my project I compared stream processing frameworks to better understand their scalability concerns and to help choose among the popular options. First, I set up two pipelines on clusters of 4 nodes each, keeping them as minimal as possible to focus on the essence of each framework. Then I recorded timestamps at both the beginning and the end of each pipeline to take fine-grained measurements of throughput and latency through each stream. These timestamps were saved to a separate Cassandra cluster.
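To make the measurement idea concrete, here is a minimal sketch of how throughput and latency could be derived from begin/end timestamps. It assumes each message yields one `(start_ts, end_ts)` pair in seconds; the `summarize` function and the record layout are hypothetical, not the code used in the project, and reading the pairs back from Cassandra is left out.

```python
from statistics import mean


def summarize(records):
    """Summarize latency and throughput for one pipeline run.

    records: list of (start_ts, end_ts) pairs in seconds, one per message
    (an assumed layout, not the project's actual schema).
    """
    if not records:
        raise ValueError("no records to summarize")

    # Latency of a message is the time between entering and leaving the pipeline.
    latencies = [end - start for start, end in records]

    # Wall-clock window of the run: first message in to last message out.
    window = max(end for _, end in records) - min(start for start, _ in records)

    return {
        "count": len(records),
        "mean_latency_s": mean(latencies),
        "max_latency_s": max(latencies),
        # Throughput is messages completed per second over the whole window.
        "throughput_per_s": len(records) / window if window > 0 else float("inf"),
    }


if __name__ == "__main__":
    sample = [(0.0, 0.5), (1.0, 1.5), (2.0, 4.0)]
    print(summarize(sample))
```

Running the same summary over both pipelines gives directly comparable numbers, provided the clocks on the nodes that write the two timestamps are synchronized.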
I was particularly interested in streaming applications because last year Kanye West's Red October sneakers sold out in 10 minutes on Nike.com. In a separate 10-minute interval, the resale price of a pair climbed from the retail price of $245 to $7,500.
As an example application of stream processing, I used collectible sneaker sales data from the eBay API to simulate a trend-watching dashboard that maps sales across the country. Parsing the natural-language listings into simple buckets required a lot of cleanup, so to keep the demonstration simple I compared two broad categories of collectible shoes: Air Jordans and Yeezys. Each category contains 5-20 varieties and multiple shoe sizes. The bottom right corner shows a running average for each category, updated as each map marker is added.
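The bucketing and the live average can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword patterns in `BUCKETS`, the `categorize` helper, and the `RunningAverage` class are hypothetical, and the averaged value is assumed to be the sale price, which the dashboard description does not specify.

```python
import re

# Hypothetical keyword patterns; the actual cleanup rules were more involved.
BUCKETS = {
    "Air Jordan": re.compile(r"\bjordans?\b", re.IGNORECASE),
    "Yeezy": re.compile(r"\byeezys?\b", re.IGNORECASE),
}


def categorize(title):
    """Return the first bucket whose pattern matches the listing title, else None."""
    for name, pattern in BUCKETS.items():
        if pattern.search(title):
            return name
    return None


class RunningAverage:
    """Incrementally updated mean, so no history of past values is kept."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def add(self, value):
        self.count += 1
        # Shift the mean toward the new value by 1/count of the difference.
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    averages = {name: RunningAverage() for name in BUCKETS}
    # Made-up listings for illustration only, not real sales data.
    listings = [
        ("Nike Air Jordan 11 Retro size 10", 100.0),
        ("Adidas Yeezy Boost 350 size 9", 200.0),
        ("Air Jordan 4 size 11", 300.0),
    ]
    for title, price in listings:
        bucket = categorize(title)
        if bucket is not None:
            print(bucket, averages[bucket].add(price))
```

Updating the mean incrementally suits a stream, since each new sale adjusts the displayed figure in constant time without rescanning earlier sales.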