
Data in 2025 – The Year Ahead

Self-Service BI (again)?

2024 saw the usual “innovations”, takeovers and tirades, and most of them were about Data Engineering (especially the Data Lakes vs Warehouses controversy), not Data Analytics. In fact the BI tools, and Power BI in particular, have been pretty quiet on announcements and development. Meanwhile, more self-service BI tools are becoming prominent, like Sigma, ThoughtSpot and Sisense. Sigma especially is rocketing, but at ~£20k per year for minimum licensing (ok, you can probably negotiate privately to get it down to £10k for half the 200 users) it’s a big buy-in before you even see ROI from developed analytics (based on proper data modelling, not ad-hoc queries). Yes, the idea is that you spend up-front on the tooling and the self-service aspect provides the cost saving on developers. Nice idea, although many of us know where self-service BI went last time (one wrong assumption has a thousand cuts). But agreed, it could well be used to keep development costs down whilst increasing user engagement and insight.

Back to DE not DA

And that is one of the key reasons there’s now so much more emphasis on Data Engineering, not Analytics – it’s crucial that organisations have a Single Version of Truth (SVOT). At least, until it gets mangled by self-service users – but even then it can be debugged much more easily. These days, it’s less acceptable to have a BI reporting system fed from a myriad of sources like multiple individual spreadsheets, CSVs, mini databases (still another MS Access?), “master” tables, uploads and so on. Let alone all the unstructured data (emails, PDFs and other text). The recent idea has been to dump everything into a Data Lake (a Lakehouse being, roughly, a Data Lake with warehouse-style governance layered on top), where at least it’s all accessible in one place. Apache Iceberg and Delta Lake are the key tech here (Apache Hudi keeping up), adding ACID transactions, ie. proper database-style modification control, to the lake. There you can groom the data once (add metadata, apply transformations etc) instead of each BI “report” doing that job, or worse, relying on “schema-on-read” to assign datatypes and relationships on the fly (quality not guaranteed!). Data Catalogs (Unity, Hive Metastore, Microsoft Purview etc) have been one of the more successful metadata management methods.
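To make the “groom it once” point concrete, here’s a minimal sketch of that idea using PySpark with Delta Lake. Everything specific is an assumption for illustration – the /lake/… paths, the sales columns and the local session config are ours, not a recipe from any particular platform:

```python
# A minimal Bronze -> Silver grooming sketch using PySpark + Delta Lake.
# Assumptions: the delta-spark package is installed; the /lake/... paths
# and sales columns are hypothetical stand-ins for your own lake layout.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (
    SparkSession.builder
    .appName("bronze-to-silver")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Bronze: raw CSVs dumped as-is into the lake.
raw = spark.read.option("header", True).csv("/lake/bronze/sales/*.csv")

# Groom once: deduplicate and fix datatypes here, so downstream reports
# don't each re-do (or skip!) this job via schema-on-read.
silver = (
    raw.dropDuplicates(["order_id"])
       .withColumn("amount", col("amount").cast("decimal(18,2)"))
)

# Delta provides the ACID commit: readers never see a half-written table.
silver.write.format("delta").mode("overwrite").save("/lake/silver/sales")
```

Every consumer then reads the one groomed Silver table, rather than re-interpreting the raw CSVs each time – that’s the SVOT in practice.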

Medallion Architecture – Same Old!

The past year saw a surge of interest in the supposedly new “Medallion Architecture”, ie. Bronze -> Silver -> Gold levels of data management. But that’s been exposed as little more than a rebrand of the Ingest -> Staging -> Final pattern we’ve all been using for decades. It’s a distraction; as ever, the real focus should be on Git-based version control and CI/CD (Continuous Integration/Continuous Deployment).
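On that CI/CD point: the pipelines worth version-controlling are the ones with automated checks that run on every commit. Here’s a minimal sketch of the kind of pytest data-quality gate a CI job might run before promoting Silver data to Gold – the table path, columns and rules are all hypothetical:

```python
# A hypothetical data-quality gate for CI (run via `pytest` on each commit).
# Assumes pandas (with pyarrow) and a Parquet file at the path shown.
import pandas as pd

SILVER_PATH = "/lake/silver/sales.parquet"  # hypothetical location

def test_no_duplicate_orders():
    df = pd.read_parquet(SILVER_PATH)
    assert df["order_id"].is_unique, "duplicate order_ids in Silver layer"

def test_amounts_are_populated():
    df = pd.read_parquet(SILVER_PATH)
    assert df["amount"].notna().all(), "null amounts should be fixed in Silver"
```

Whatever you call the layers, it’s checks like these, versioned in Git and run automatically, that actually keep the data trustworthy.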

Data Fabric

… was 2024’s buzzword. It’s mainly relevant for Terabyte-sized data estates. What does it add to the usual infrastructure? “Deciphering Data Architectures” (2024) by James Serra, published by O’Reilly, gives more detail:

  • Data Access policies at a wider level;
  • Data Catalog and Lineage;
  • Master Data Management;
  • Data Virtualisation (logical views rather than shifting physical storage of data – see the sketch after this list);
  • Real Time Processing;
  • APIs (allowing querying without knowing data locations);
  • Services (re-usable code blocks etc);
  • Products (Services sold to 3rd Parties).
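To illustrate the Data Virtualisation item above, here’s a minimal sketch using DuckDB – our choice for illustration, not something Serra prescribes, and the paths and columns are made up:

```python
# A minimal data-virtualisation sketch: the "sales" view is just a logical
# definition; the Parquet files are queried in place, never copied or moved.
# DuckDB is an assumption for illustration; paths/columns are hypothetical.
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE VIEW sales AS
    SELECT * FROM read_parquet('/lake/silver/sales/*.parquet')
""")

# Consumers query the logical view without knowing (or caring) where the
# physical data lives.
print(con.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").df())
```

The same idea scales up to the Fabric-style APIs in the list: expose a logical layer for querying, and keep the physical data wherever it already sits.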

Er, Microsoft Fabric?

Microsoft have joined the bandwagon with their “Fabric” product SKUs (F2, F64 etc) in order to provide SaaS (Software as a Service) rather than the usual PaaS (Platform as a Service). But many see it currently as just a rebundling of the already-failed Synapse branding together with a mishmash of Azure services such as Data Factory. And to be honest, we’re inclined to agree at present. We can make it all work together, but that involves raising support tickets to fix fundamental issues (eg. Data Factory couldn’t read Nulls from Excel files… until they fixed it 5 weeks after our ticket!) and frankly, none of us has time for that kind of base testing. Also, dependent services such as Generation 2 Storage (ADLS Gen2) rack up surprisingly high monthly costs even when not fully utilised (whereas Snowflake reverts to zero charges when not used, with more reasonable small incremental charges), so we’ll keep an eye on it – but preferably leave it until 2026.

Do it now – BAU!

So it’s Business As Usual… cleaning and consolidating data (and ensuring it’s collected appropriately in the first place) in preparation for analytics, which are still best done in Power BI on cost vs functionality. But let’s hope we get more love for the analytics side: cheaper and/or more SQL-based alternatives to Power BI that may be easier to implement without complex and sometimes brittle DAX, or better use of recent years’ advances in Lakehouses and the Modern Data Warehouse – combining good ol’ Relational Databases with Data Lakes. Not to mention Semantic Models for better analytics… oh, and then we can get back to ML and AI!

Happy new year!

TickboxPhil
https://tickboxanalytics.com
