CalRobert 22 minutes ago [-]
Data engineering was software engineering from the very beginning. Then a bunch of business analysts who didn't know anything about writing software got jealous and said that if you knew SQL/dbt you were a data engineer. I've had to explain too many times that yes, indeed, I can set up a CI/CD pipeline, stand up Kafka, or deploy Dagster on ECS, to the point where I think I need to change my title just so it isn't cheapened.
sdairs 3 minutes ago [-]
I think even before dbt turned DE into "just write SQL & YAML", there was an appreciable difference between DE and SE. There were definitely some DEs writing a lot of Java/Scala if they were in Spark-heavy companies, but in my experience DEs were doing a lot more platform engineering (similar to what you suggest), SQL, and point-and-click work, just because that was the nature of the tooling. I wasn't really seeing many DEs spending a lot of time in an IDE.

But I think what's interesting from the post is looking at SEs adopting data infra into their workflow, as opposed to DEs writing more software.

getnormality 5 minutes ago [-]
It's not hard to do data engineering to the standards of software engineering, and many people do it already, provided that

1. You use a real programming language that supports all the abstractions software engineers rely on, not (just) SQL.

2. The data is not too big, so the feedback cycle is not too horrendously slow.

#2 can't ever be fully solved, but testing a data pipeline on randomly subsampled data can help a lot in my experience.
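One way to do that subsampling is a minimal, deterministic sketch like the following: instead of `random.sample`, hash a stable key so the same rows land in the test sample on every run (the 5% rate, the `user_id` key, and the event shape are illustrative assumptions, not from the comment):

```python
import hashlib

def in_sample(key: str, percent: int) -> bool:
    """Deterministically keep ~percent% of rows, bucketed by key.

    Hashing the key (rather than calling random()) means the same rows
    land in the sample on every pipeline run, so failures on the
    subsampled test data are reproducible.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent

# Keep ~5% of events -- always the same ~5% for a given set of user ids.
events = [{"user_id": f"u{i}", "value": i} for i in range(1000)]
sample = [e for e in events if in_sample(e["user_id"], 5)]
```

Keying the hash on an entity id (rather than the row) also keeps all rows for a sampled entity together, which matters when the pipeline joins or aggregates per entity.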

giantg2 27 minutes ago [-]
I've never really seen the distinction between data and software engineering. It's more like front-end vs back-end. If you're a data engineer and it's all no-code tooling, then you're just an analyst or something.
zurfer 39 minutes ago [-]
Maybe. On one hand you have something like dbt or Moosestack. On the other, analytics and data pipelining still involve a lot of no-code tooling, and I doubt that will go away. That said, I would love to learn more about how other people use coding agents for DE tasks.
rawgabbit 25 minutes ago [-]
In Snowflake, I am now writing Python stored procedures that make REST API calls to things like the Datadog REST API and dump the JSON into a Snowflake table. I then unpack the JSON and transform it into a normalized table. So far it works reasonably well. This is possible using Snowflake's external access feature. https://docs.snowflake.com/en/developer-guide/external-netwo...
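The "unpack the JSON" step might look like the plain-Python sketch below, which flattens a nested response object into a one-column-per-field row. The payload shape and field names are hypothetical, not real Datadog output; in Snowflake this logic would sit inside the stored procedure's handler (or be done in SQL with LATERAL FLATTEN instead):

```python
from typing import Any

def flatten(record: dict, parent: str = "", sep: str = "_") -> dict:
    """Flatten one nested JSON object into a flat row,
    e.g. {"host": {"name": "web-1"}} -> {"host_name": "web-1"}.
    Lists are kept as leaf values (they could map to a VARIANT column)."""
    row: dict[str, Any] = {}
    for key, value in record.items():
        column = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            row.update(flatten(value, column, sep))
        else:
            row[column] = value
    return row

# Hypothetical monitoring-style payload; field names are illustrative only.
payload = {"host": {"name": "web-1", "tags": ["prod"]}, "cpu": 0.42}
row = flatten(payload)
# row == {"host_name": "web-1", "host_tags": ["prod"], "cpu": 0.42}
```

Doing the reshaping in the Python handler (rather than leaving raw VARIANT columns) keeps the landing table strictly typed, at the cost of breaking if the upstream API adds nested fields.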
zamalek 18 minutes ago [-]
One thing I've seen through my more recent exposure to experienced data engineers is the lack of rigor around repeatability (CI/CD, IaC, etc.). There's a lot of doing things in notebooks and calling that production-ready. Databricks has git integration (GitHub only, from what I can tell), but that's just checking out and committing directly to trunk; if it's in git then we have an SDLC, right? It's fucking nuts.

Anyone have workflows or tooling that are highly compatible with the entrenched notebook approach and are easy to adopt? I want to prevent these people from learning well-trodden lessons the hard way.

RobinL 13 minutes ago [-]
I think this may be a Databricks thing? From what I've seen, there's a gap between data engineers forced to use Databricks and everyone else: at least as it's used in practice, Databricks seems to result in a mess of notebooks with poor dependency and version management.
zamalek 2 minutes ago [-]
Interesting. Databricks has been my first exposure to DE at scale, and it does seem to solve many problems (even though it sounds like it's causing some). So what does everyone else do? Run Spark etc. themselves?
esafak 16 minutes ago [-]
For CI, try Dagger. It's code-based and runs locally too, so you can write tests.