First real day

Data Engineering
Published: September 3, 2024

After two weeks of holiday/spin-up and a day of ONT training, I am finally getting down to what I want to achieve in this sabbatical.

My goals for today are:

  1. Start a deep dive into the technologies Ray and Temporal. If I only look at one of these, I will be satisfied.
  2. Write up my reasons for doing a sabbatical from the various notes I have spread around.

Ray

Ray is a Python distributed computing framework that is positioning itself for ML model training. It has a high profile: it is mentioned frequently on The Data Exchange Podcast, and AWS are offering Ray clusters as an alternative to Spark in Glue.

The docs lean heavily on ML examples, but under the buzzwords this looks like a traditional actor-based distributed framework in the same style as Akka. It offers task parallelism and actors (stateful tasks). It also has features more specifically relevant to data-intensive workloads in the ray.data package.
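
To make this concrete, here is a minimal sketch of those two primitives using the core Ray API (a toy example of my own, not taken from the docs):

```python
import ray

ray.init()

# A stateless remote task: calling .remote() schedules it on the cluster
# and immediately returns an ObjectRef (a future).
@ray.remote
def square(x):
    return x * x

# A stateful actor: a class whose instance lives in its own worker process.
@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

futures = [square.remote(i) for i in range(4)]
print(ray.get(futures))  # [0, 1, 4, 9]

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
```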

The pitch for Ray often compares it to Spark and highlights how it enables heterogeneous compute (e.g. mixed CPU/GPU workloads); see, for example, this blog post. Although Ray portrays mixed workloads as a novel ML problem, it is no surprise to bioinformaticians and scientists in general. We have been managing workflows with a mixture of resource requirements for decades: as well as GPU tasks, some steps in an analysis will require one large task with access to huge amounts of RAM, while the next step may be embarrassingly parallel and can therefore be split into many small tasks with lower RAM requirements.
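
Ray expresses this heterogeneity with per-task resource requests. A hedged sketch (the step names assemble/basecall/align are hypothetical bioinformatics stages, not a real pipeline):

```python
import ray

# One big, RAM-hungry step: request many CPUs and 64 GiB of memory (bytes).
@ray.remote(num_cpus=8, memory=64 * 1024**3)
def assemble(reads):
    ...

# A GPU step: Ray will only place this on a node with a free GPU.
@ray.remote(num_gpus=1)
def basecall(signal_chunk):
    ...

# An embarrassingly parallel step: many small, cheap tasks.
@ray.remote(num_cpus=1)
def align(chunk):
    ...
```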

Further details of ray.data streaming are discussed in this blog post. It introduces the concept of bulk synchronous parallel execution to describe systems such as Spark and contrasts it with the pipelined execution supported by Ray. The “Discussion/Optimizations” section briefly describes what Ray does to achieve this streaming; a sketch of such a pipeline follows the list below.

  1. Memory stability. Ray achieves this by “only scheduling new tasks if it would keep the streaming execution under configured resources limits”. Interestingly, it has to predict what the resource consumption of a task will be, which will not be possible in general.

  2. Data locality. Ray will preferentially schedule tasks on the nodes where their input data is being produced.

  3. Fault tolerance. Ray has a retry mechanism on failure, so it can recompute crashed tasks in a similar way to Spark.
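
As a rough sketch of what such a pipelined ray.data job looks like (the transforms and S3 paths are hypothetical), the GPU stage can start consuming batches as soon as the CPU stage produces them, rather than materialising the whole intermediate dataset first:

```python
import ray

def preprocess(batch):  # hypothetical CPU-bound transform
    return batch

def infer(batch):  # hypothetical GPU-bound transform
    return batch

ds = ray.data.read_parquet("s3://my-bucket/input/")  # hypothetical path
ds = ds.map_batches(preprocess)            # CPU stage
ds = ds.map_batches(infer, num_gpus=1)     # GPU stage, pipelined with the above
ds.write_parquet("s3://my-bucket/output/")
```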

Ray Data Internals

From this page.

Ray Data is a distributed data framework, similar to Spark’s RDD concept (though perhaps closer to a Spark DataFrame). Within a Python program, a Dataset object is an array of pointers to PyArrow Tables in the Ray core object store. As such, the data is partitioned into these tables, which are distributed across the cluster, are transparently copied between nodes as necessary, and will spill to disk if the cluster runs out of RAM.

I.e. Ray offers a similar data abstraction to Spark, where an apparently single dataset can reside across multiple nodes and can therefore be bigger than a single node’s RAM.
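
A small illustration of that abstraction (toy example):

```python
import ray

# A Dataset is a list of references to blocks (PyArrow Tables) held in the
# distributed object store; the driver only holds the ObjectRefs.
ds = ray.data.range(1_000_000)
print(ds.num_blocks())

# Repartitioning changes how rows are split into blocks across the cluster.
ds = ds.repartition(200)
```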

Ray Core fault tolerance

To paraphrase the “Fault Tolerance” section of the docs: Ray will expose application exceptions and retry framework exceptions (e.g. out of memory, crashed worker). It recommends checkpointing the state of actors but does not offer an automatic way to do this (unlike Akka or Temporal, which use Event Sourcing).
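
The retry behaviour is configurable per task. A minimal sketch using the standard @ray.remote options:

```python
import ray

# System failures (e.g. a crashed worker or lost node) are retried up to
# max_retries times. Application-level exceptions are raised straight to
# the caller unless retry_exceptions is enabled.
@ray.remote(max_retries=3, retry_exceptions=False)
def flaky_step():
    ...
```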

Object fault tolerance

This is interesting:

A Ray object has both data (the value returned when calling ray.get) and metadata (e.g., the location of the value). Data is stored in the Ray object store while the metadata is stored at the object’s owner. The owner of an object is the worker process that creates the original ObjectRef […] Ray can automatically recover from data loss but not owner failure.

Ray will re-execute tasks if necessary to try to reconstruct a lost object. Intuitively, this would not work for actors because their internal state has changed. Indeed, the docs say actor tasks will not be retried by default, but retries can be enabled (presumably once you have implemented checkpointing correctly).
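
A sketch of that opt-in, using the actor options max_restarts and max_task_retries (the checkpoint-restore logic is left as a hypothetical comment, since Ray does not provide it for you):

```python
import ray

# max_restarts lets Ray restart the actor process after a crash, and
# max_task_retries > 0 re-runs actor tasks once it has restarted.
# Restoring consistent internal state is your responsibility.
@ray.remote(max_restarts=2, max_task_retries=3)
class StatefulWorker:
    def __init__(self):
        self.state = {}  # you would reload your own checkpoint here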

Ray cannot recover from owner failure.

Temporal

I did some reading about Temporal but didn’t dig too deep today.

TOGAF

I want to evaluate the “state of the art” of Enterprise Architecture, so I have signed up to the Open Group and hope to read as much of their open-access material as possible. I have downloaded “Why does Enterprise Architecture Matter”. It contains a highly structured description of what Enterprise Architecture is, with lots of terms in bold. It then has a table with some assertions of why it matters. To my mind these are not particularly well backed up with reasoning. This excerpt, from a table on page 10 listing benefits to the Business, made me smile:

It [Enterprise Architecture] ensures that the business can be early adopters of new innovations and is not held back by IT.

In my experience, Enterprise Architecture can be a brake on innovation and early adoption of technologies, because any adoption has to be shown to align with the strategy and to be the “right choice” among the available options.

There are many other claims in the table that are worth exploring. I notice the document was written in 2008. I’ve now also downloaded a 90-day evaluation copy of the “TOGAF Fundamental Content PDF”.