WHUG #33: Apache Gobblin and TouK Nussknacker

When:
6 March 2018, 18:00 – 20:00 (UTC+01:00)
Where:
Stefana Banacha 2
02 Warsaw

MIMUW faculty building (room 5440)

We are happy to invite you to the 33rd meetup of WHUG. We will be pleased to host a team from Allegro (Paweł Leszczyński, Rafał Kluszczyński, and Michał Ferliński) and Maciek Próchniak from TouK Nussknacker. Details are below.

Title: Apache Gobblin – Swiss Army knife for data ingestion and lifecycle management

Abstract:
Ingesting data from Kafka onto HDFS may look simple at first glance. Once you scale up to hundreds of Kafka topics and need the data to be available instantly via Hive, it gets tough as the complexity of the system grows.

• How to fetch data from several Kafka topics with different sizes?
• How to avoid small files on HDFS?
• How to handle data retention?

You encounter plenty of issues like the ones above, and you may end up with a separate tool for each of them. You do not have to, though, because one tool can solve them all. In this talk we present our transition from Camus to Gobblin, describe the issues we encountered, and show how they can be solved within Apache Gobblin.

Speakers:

Rafał Kluszczyński
Data platform engineer at Allegro, one of the largest e-commerce and technology companies in Central and Eastern Europe, since 2013. Member of the team responsible for the data platform, which stores millions of messages every day. From 2009 to 2015 he was involved in providing the Polish scientific community with an IT platform (PLGrid) that has enabled e-science research in various fields. Interested in all kinds of computing and distributed systems, as well as massive data processing and analysis methods.

Michał Ferliński
Data platform engineer @ Allegro. Genuine JVM developer who takes care of data pipelines on a petabyte-scale Hadoop cluster. Loves understanding problems in detail, solving them, and pushing patches to open source projects when applicable.

Paweł Leszczyński
Paweł holds a PhD in distributed databases. He has over 9 years of technical experience at Allegro, a platform responsible for the majority of the Polish e-commerce market. For the past 3 years he has been a Big Data professional working as Hadoop Project Manager. The Big Data ecosystem at Allegro comprises two Hadoop clusters in data centers with more than 4 petabytes of data, the Presto SQL engine, and a Kafka ingestion process. He is passionate about optimizing data processing, formerly with Hive and currently with Spark.

Title: TouK Nussknacker – creating streaming jobs with GUI

Abstract:
Two years ago TouK introduced Flink at one of the biggest Polish telecoms, in the domain of real-time marketing and fraud detection. One of the most significant obstacles to adoption was the lack of programming skills at our client: the users were supposed to be analysts and business people. Therefore, we developed an open source solution, TouK Nussknacker, which allows users to design processes in a GUI by drawing diagrams. We believe it can make stream processing with Flink more accessible in many use cases, especially in companies that don’t have their own development teams. During the talk I’m going to describe the architecture of our platform, how it is used by our clients, and what the plans for the future are. I’ll also share our experience of when a GUI works well and when it’s better to develop jobs with the standard API. If time permits, I’ll also show a quick demo of our solution.

Speaker: Maciek Próchniak
More than 10 years of experience, currently mostly as an architect and (lead) developer, though my roles vary from analysis to deployment. My main fields of interest are integration, functional programming, and stream processing systems. I also like to give talks at conferences: Confitura, JEEConf, and VoxxedDays, just to name a few. Currently leader of the TouK Nussknacker project.

https://www.meetup.com/warsaw-hug/events/247797926/