
In the previous articles (1)(2), we started analyzing the individual features of Adaptive Query Execution introduced in Spark 3.0. In particular, we analyzed “dynamically coalescing shuffle partitions” and “dynamically switching join strategies”. Last but not least, let’s analyze what will probably be the most-awaited and appreciated feature:

Dynamically optimizing skew joins

To understand exactly what it is, let’s take a brief step back to the theory, remembering that in Spark a DataFrame is an abstraction over the concept of an RDD, which is in turn a logical abstraction of a dataset that can be processed in a distributed way thanks to…
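In practice, skew-join handling is switched on through AQE configuration. A minimal sketch of a session with the feature enabled (the application name is made up for the example; the config keys are the Spark 3.0 ones, and the values shown are the documented defaults, not tuning recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-skew-join-demo")  # hypothetical app name
    # AQE must be on for any of its sub-features to kick in
    .config("spark.sql.adaptive.enabled", "true")
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    # A partition is treated as skewed when it is larger than
    # skewedPartitionFactor times the median partition size AND larger
    # than skewedPartitionThresholdInBytes (defaults shown).
    .config("spark.sql.adaptive.skewJoin.skewedPartitionFactor", "5")
    .config("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
    .getOrCreate()
)
```

With these settings, a skewed partition is split into smaller sub-partitions at runtime and the matching side of the join is duplicated accordingly, instead of one straggler task dominating the stage.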


In the previous article, we started analyzing the individual features of Adaptive Query Execution introduced in Spark 3.0. In particular, the first feature analyzed was “dynamically coalescing shuffle partitions”. Let’s get on with our road test.
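As a quick reminder, the coalescing feature recapped above is likewise configuration-driven; a minimal sketch (config keys as in Spark 3.0; the 64MB advisory size is the documented default, not a recommendation):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-coalesce-demo")  # hypothetical app name
    .config("spark.sql.adaptive.enabled", "true")
    # Merge small post-shuffle partitions at runtime...
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # ...aiming for partitions of roughly this size (default shown).
    .config("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
    .getOrCreate()
)
```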

Dynamically switching join strategies

The second optimization implemented in AQE is the runtime switching of the DataFrame join strategy.

Let’s start with the fact that Spark supports a variety of join types (inner, outer, left, etc.). The execution engine supports several implementations that can run them, each of which has advantages and disadvantages in terms of performance and resource utilization (memory in the…
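For instance, when one side of the join turns out to be small, Spark can broadcast it to every executor and perform a hash join, avoiding a shuffle of the large side. A toy, single-process sketch of the idea (illustrative names, not the Spark API):

```python
def broadcast_hash_join(large_rows, small_rows, key):
    """Toy broadcast hash join: build a hash table on the small side once
    (the 'broadcast'), then stream the large side through it."""
    hash_table = {}
    for row in small_rows:
        hash_table.setdefault(row[key], []).append(row)
    joined = []
    for row in large_rows:  # the large side is scanned once, never shuffled
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})
    return joined

orders = [{"user_id": 1, "amount": 30}, {"user_id": 2, "amount": 15}]
users = [{"user_id": 1, "name": "ada"}, {"user_id": 2, "name": "bob"}]
result = broadcast_hash_join(orders, users, "user_id")
```

This is exactly the trade-off AQE exploits at runtime: a broadcast hash join is cheap when the build side fits in memory, while a shuffle-based join is safer when both sides are large.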


Apache Spark is a distributed data processing framework suitable for any Big Data context thanks to its features. Despite being a relatively recent product (first released as open source under the BSD license in 2010, it was later donated to the Apache Foundation), on June 18th its third major version was released, introducing several new features, including Adaptive Query Execution (AQE), which we are about to discuss in this article.

A bit of history

Spark was born in 2009, before being donated to the community, within the academic context of AMPLab (a curiosity: AMP is the acronym for Algorithms, Machine…

Apache Spark is the most widely used in-memory parallel distributed processing framework in the field of Big Data advanced analytics. The main reasons for its success are the simplicity of its API and its rich feature set, ranging from querying the data lake with SQL to the distributed training of complex Machine Learning models using the most popular algorithms.

Despite the simplicity of its API, however, one of the problems developers most frequently encounter, as with most distributed systems, is the creation of a development environment where you…

In the previous post I showed, in a simple way, how to create a REST API from a Machine Learning model built in Python using the scikit-learn framework. In particular, the model, exported in pickle format, was wrapped in an HTTP REST service implemented with Flask.

Exposing the model over HTTP has the undeniable advantage of making it easy to integrate with other applications. …
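The article wraps the pickled model with Flask; as a dependency-free sketch of the same pattern, here is the idea using only the standard library’s `http.server` and a stand-in model (the `MeanModel` class and the endpoint shape are hypothetical, not the article’s code):

```python
import json
import pickle
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class MeanModel:
    """Stand-in for a pickled scikit-learn estimator (hypothetical)."""
    def predict(self, rows):
        return [sum(r) / len(r) for r in rows]

# Round-trip through pickle, as the article does with the real model.
model = pickle.loads(pickle.dumps(MeanModel()))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Body: a JSON list of feature rows, e.g. [[1, 2, 3], [4, 6]]
        length = int(self.headers["Content-Length"])
        rows = json.loads(self.rfile.read(length))
        body = json.dumps({"predictions": model.predict(rows)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

# Port 0 lets the OS pick a free port; serve in a background thread.
server = HTTPServer(("127.0.0.1", 0), PredictHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
```

A client would then POST feature rows as JSON and read back the predictions; with Flask the handler body is essentially the same, just expressed as a route function.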

The Python + Jupyter combination is by now almost a de facto standard for the development of Machine (or Deep) Learning models by Data Scientists. Google provides, completely free of charge, a development environment called Colaboratory based on the stack above, even offering a container equipped with a GPU/TPU that can considerably speed up the training of neural networks.

In this tutorial we will show, through an easily reproducible example, how to expose a Machine Learning model, written in Python with the hugely popular scikit-learn framework, in the form of…

Managing small files on HDFS: analysis of the problem and best practices

Hadoop is today the de facto standard Big Data platform in the enterprise world. In particular HDFS, the Hadoop module that implements the distributed storage layer, is the most widespread solution for storing the files that make up the so-called “data lake”. In this article we will analyze one of the most frequent and insidious “antipatterns” that can be found where this technology has been used incorrectly: the storage of small files.
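To see why small files hurt, a back-of-the-envelope calculation helps: the NameNode keeps every file and every block as an object in its heap, at roughly 150 bytes each (a widely quoted rule of thumb, not an exact measure). A sketch, with illustrative numbers:

```python
# Rule of thumb: ~150 bytes of NameNode heap per file/block object.
BYTES_PER_OBJECT = 150

def namenode_heap(num_files, blocks_per_file=1):
    """Rough NameNode heap (bytes) needed to track num_files files:
    one inode object per file plus one object per block."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT

# 100 million 1 MB files (each fits in a single 128 MB block)...
small_files = namenode_heap(100_000_000)          # ~30 GB of heap
# ...versus the same data packed into 128 MB files (1 block each).
packed = namenode_heap(100_000_000 // 128)        # a few hundred MB
```

The data volume is identical in both cases, yet the small-file layout needs about 128 times more NameNode memory, and every MapReduce/Spark task scheduled per tiny file adds further overhead.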


HDFS, the Hadoop distributed file system, is designed to manage large…

Mario Cartia

Old school developer, veteran system administrator, technology lover and jazz piano player.
