Explore Vaisala’s SAP Datasphere and SAC implementation: overcoming integration challenges, enabling real-time analytics, and building a robust data-driven architecture.
In the Vaisala case study on pioneering the future of data and analytics in the SAP cloud, we promised to follow up with a technical, architecture-focused blog post on Vaisala’s SAP Datasphere and SAC implementation.
To recap, Vaisala – a global leader in measurement instruments and intelligence for climate action, headquartered in Finland – embarked on a transformative journey to rebuild its data and analytics ecosystem.
This strategic initiative was implemented in partnership with Scandic Fusion and aimed to leverage the combined capabilities of SAP’s cloud-based tools: SAP Analytics Cloud (SAC) and SAP Datasphere. The collaboration marked a significant leap towards harnessing the power of cloud technology to foster data-driven decision-making across the organization.
The Beginnings: Define and Design phase
At the start of the project, SAP Datasphere was a brand-new offering in the data analytics field from SAP. Fun fact: we started the project when it was still called SAP Data Warehouse Cloud (DWC). There were no real-world best practices to learn from, and documentation was sparse. Therefore, a crucial decision was made to begin with a conceptualization phase called Define and Design.
During the Define and Design phase, assumptions about SAP Datasphere were tested to see how they held up in Vaisala’s real-world data analytics scenario. Proof-of-concept tasks carried out together with the customer helped establish the core architectural principles on which the implementation project was based. In many cases this work was done together with SAP value assurance specialists, and many architectural decisions were made with their stamp of approval.
Some of the observations established during this phase still hold true in the summer of 2024:
- SAP Datasphere was missing native integration options for some of the source systems Vaisala is using. For example, Salesforce integration is available only as an Open Connector (an additional component at a separate price) and offers only basic integration options that were insufficient for Vaisala’s needs. Therefore, a custom solution using Python scripting was created that loads data incrementally and in parallel (see the sketch after this list).
- SAP Datasphere is an open platform: it gives third-party tools access to the underlying HANA database for data ingestion and allows data consumption from other tools. That access lets us use Open SQL schemas as part of the data staging process when loading data from different APIs – for example, for the Salesforce loads mentioned above.
- There is a lack of orchestration features when organizing ETL data flows. For example, Task Chain scheduling can be done only on the Space level.
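To make the custom ingestion pattern more concrete, below is a minimal Python sketch of an incremental, parallel Salesforce load into staging tables in an Open SQL schema. All hostnames, credentials, object and table names, and the watermark handling are hypothetical and heavily simplified compared to the production solution.

```python
# Minimal sketch: incremental, parallel Salesforce extraction into an Open SQL schema.
# All names, credentials and the watermark handling are illustrative only.
from concurrent.futures import ThreadPoolExecutor

from hdbcli import dbapi                   # SAP HANA client used to reach the Open SQL schema
from simple_salesforce import Salesforce   # Salesforce REST API client

SF_OBJECTS = ["Account", "Opportunity", "Case"]   # objects replicated to Datasphere


def load_object(sf_object: str) -> int:
    """Load rows changed since the last run of one Salesforce object into staging."""
    sf = Salesforce(username="integration@example.com", password="***", security_token="***")
    hana = dbapi.connect(address="<datasphere-host>", port=443,
                         user="SALESFORCE#TECH", password="***", encrypt=True)
    cur = hana.cursor()

    # Incremental extraction: the watermark is kept as an ISO-8601 UTC string in the
    # Open SQL schema, so it can be embedded directly in the SOQL filter below.
    cur.execute('SELECT LAST_LOAD_TS FROM "SALESFORCE"."LOAD_WATERMARK" WHERE OBJECT_NAME = ?',
                (sf_object,))
    last_load_ts = cur.fetchone()[0]

    records = sf.query_all(
        f"SELECT Id, Name, LastModifiedDate FROM {sf_object} "
        f"WHERE LastModifiedDate > {last_load_ts}"
    )["records"]

    rows = [(r["Id"], r["Name"], r["LastModifiedDate"]) for r in records]
    if rows:
        # UPSERT keeps the staging table in sync without truncating it first.
        cur.executemany(
            f'UPSERT "SALESFORCE"."SF_{sf_object.upper()}" (ID, NAME, LAST_MODIFIED) '
            "VALUES (?, ?, ?) WITH PRIMARY KEY",
            rows,
        )
    hana.commit()
    hana.close()
    return len(rows)


# Parallel loading: one worker per Salesforce object.
with ThreadPoolExecutor(max_workers=len(SF_OBJECTS)) as pool:
    for obj, count in zip(SF_OBJECTS, pool.map(load_object, SF_OBJECTS)):
        print(f"{obj}: {count} changed rows staged")
```

Orchestration of these jobs, retries, and logging are handled by the complementary ETL tool described in the design principles below.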
Based on these observations and other information gathered in this phase, key design principles were established:
- A complementary third-party ETL tool is used alongside Datasphere’s native ETL to overcome the missing orchestration features and to ingest data from various API sources. We chose Pentaho since it is open source and has proven itself in many DWH projects in the past. It is an on-premise tool and was installed on an Azure virtual machine to be close to SAP Datasphere.
- Data integration with S/4HANA is done through CDS views with the help of the Data Provisioning Agent. Since not all CDS views are available for extraction, custom CDS views are created in the S/4HANA system to cover the missing extraction capabilities. The same approach is taken when delta capture for certain CDS views is not available out of the box.
- For ad hoc access to legacy data, SAP HANA Data Lake is used as the storage layer. Pentaho handled the one-time migration of multiple terabytes of data from the existing on-premise Oracle legacy DWH to HANA Data Lake (see the sketch after this list).
- Staging data is replicated to Datasphere in a near real-time manner whenever possible. To achieve good reporting performance, the dimension and fact layer is persisted. For real-time reporting needs, dedicated views are built on top of the staging tables to provide real-time access to S/4HANA data.
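The legacy migration itself was implemented in Pentaho, but the underlying pattern is a plain chunked table copy. Purely as an illustration of that pattern, here is a minimal Python sketch, assuming the legacy Oracle DWH is reachable with python-oracledb and the target data lake table is exposed to the Datasphere HANA database (for example as a virtual table) so it can be written through hdbcli; all table names, credentials, and the chunk size are hypothetical.

```python
# Illustrative chunked copy of one legacy table towards HANA Data Lake.
# The actual project used Pentaho for this; names and connection details are hypothetical.
import oracledb            # python-oracledb client for the legacy Oracle DWH
from hdbcli import dbapi   # SAP HANA client (target table assumed to be exposed via HANA)

CHUNK_SIZE = 50_000        # rows per round trip; tune to memory and network capacity

src = oracledb.connect(user="legacy_ro", password="***", dsn="legacy-dwh-host/LEGACYPDB")
tgt = dbapi.connect(address="<datasphere-host>", port=443,
                    user="MIGRATION#TECH", password="***", encrypt=True)
src_cur = src.cursor()
tgt_cur = tgt.cursor()

# Stream the legacy fact table in chunks instead of loading it all into memory.
src_cur.execute("SELECT ORDER_ID, ORDER_DATE, AMOUNT FROM LEGACY_DWH.F_SALES")
while True:
    rows = src_cur.fetchmany(CHUNK_SIZE)
    if not rows:
        break
    tgt_cur.executemany(
        'INSERT INTO "MIGRATION"."F_SALES_LEGACY" (ORDER_ID, ORDER_DATE, AMOUNT) VALUES (?, ?, ?)',
        rows,
    )
    tgt.commit()           # commit per chunk so a restart does not have to redo everything

src.close()
tgt.close()
```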
You can see the summary of the overall architecture in the diagram below:
Let us give them some space!
Space management is crucial in designing any SAP Datasphere solution. Spaces can be used to split objects logically, allow granular resource allocation, and manage security access to different objects. Also, some Datasphere functions can be assigned to only one Space.
For the cleanliness of the solution, it was decided that each source system’s raw data should reside in its own Space. Common data models should live in a single restricted Space where Data Access Controls are also applied. Semantic layer models should likewise be split into Spaces so that access rights can be assigned properly. In the end, Spaces were created for:
- Each source system
- MAIN data modeling/transformation Space
- Semantic layer modeling Spaces distributed by business function
- Space for configuration of Data Access Controls
- Data auditing Space
- Configuration Space
- Space for real-time artefacts
- Ad hoc sandbox Spaces
Having separate Spaces for different business functions brought another benefit when the need for custom user-defined mappings arose. Vaisala power users already had access to a Space where they could utilize Datasphere Local Tables, entering the needed data mappings straight into Datasphere. That allowed them to see the mapped data immediately in the same tool, and any adjustments could be made without waiting for a data reload. Datasphere itself also takes care of auditing the entered data. The same approach is used for maintaining the Data Access Control combinations of dimension values and AD groups for the row-level security setup.
Another benefit of having a separate Space for each source system is that it complements the built-in data lineage functionality in SAP Datasphere. Since lineage works across Spaces, it can be viewed not only at the object level but also with an additional layer of logical grouping. Together with thorough object naming rules, this makes the already state-of-the-art lineage solution in SAP Datasphere even more practical and usable:
Need for Speed: Approach to Real-Time Analytics
Nowadays customers are trying to minimize the time it takes to move the newest data from their operational systems to their analytical platform, preferably down to a near real-time delay. That was the case with Vaisala as well. Although the previous data platform could not provide this and real-time reports were not in the original project scope, requirements started to pop up to see data in SAP Datasphere as soon as it appears in S/4HANA.
The chosen all-SAP data platform already provides some real-time analytical capabilities straight out of the box: SAP Analytics Cloud can connect directly to S/4HANA CDS views. Although this covers some simple operational needs, it does not cover scenarios where data from multiple business processes must be combined.
Data replication and transformation
Before we take a deep dive into real-time analytics, let us take a step back and look at the data replication and transformation set-up for the overall common data model that serves as the basis for all existing reports. In theory, SAP Datasphere can provide federated access to source systems, and transformations can be defined as views that calculate results in real time. While this approach may work in some simple scenarios, in Vaisala’s case the data volumes, the complexity of the transformation logic, and the need for fast dashboard performance required data to be persisted at least at some stage. Fast forward: data is persisted both in the staging layer and in the transformation layer of the common data model. What drove this decision?
The recommended approach to load data from S/4HANA into Datasphere is through CDS views. This means connecting to the application layer instead of the database. The connectors used in this approach are not really meant for federating data: push-down functionality is extremely limited, so most transformations and filtering can only happen once all the data has been retrieved into SAP Datasphere. The decision was therefore taken to persist data from remote tables in SAP Datasphere. This relieved the source system of extensive querying and significantly improved performance in SAP Datasphere. Some of the CDS views even provide real-time replication out of the box, so we could have the data both in real time and persisted in SAP Datasphere as physical tables for better performance.
At the start of the project, the semantic layer was built with Business Builder artifacts, such as perspectives and consumption models, which are now deprecated. Internally, a perspective was built as one complex view containing many fact tables joined with all possible dimension tables, all serving a single report. This obviously created performance issues, which is why SAP later superseded it with the Analytic Model. Given those performance issues, and considering that most of the transformation logic is complex, it was decided to materialize the dimension and fact tables as well.
Real-time models
Now let us go back to the real-time scenarios. At this point, data for many tables is replicated to the staging layer in a near real-time manner. With the help of ABAP development in S/4HANA, most CDS views can be made available for real-time replication to SAP Datasphere.
Since the existing dimensions are persisted, we could not reuse them for real-time artefacts. Instead, we decided to create a separate Space containing only the small subset of objects required for real-time reporting. The star schema was replaced with fact views that already contain all the measures and attributes needed for a specific reporting case. This way we minimize the number of unnecessary transformations executed when real-time reports are accessed.
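In the solution these flat views are modeled directly inside the dedicated real-time Space in Datasphere; purely to illustrate their shape, the sketch below deploys an equivalent SQL view through an Open SQL schema using hdbcli. All table, column, and schema names are hypothetical.

```python
# Illustrative only: the flattened shape of a real-time fact view.
# In the project such views are modeled inside the real-time Space in Datasphere;
# here an equivalent SELECT is deployed to an Open SQL schema just to show the idea.
from hdbcli import dbapi

REALTIME_ORDERS_VIEW = """
CREATE OR REPLACE VIEW "REALTIME"."V_OPEN_ORDERS_RT" AS
SELECT
    so."SALES_ORDER_ID",
    so."ORDER_DATE",
    so."NET_AMOUNT",                          -- measures taken straight from staging
    cust."CUSTOMER_NAME",                     -- attributes joined in directly,
    mat."MATERIAL_DESCRIPTION"                -- instead of via persisted dimensions
FROM "STAGING"."SALES_ORDER_RT"   AS so       -- staging tables replicated in near real time
LEFT JOIN "STAGING"."CUSTOMER_RT" AS cust ON cust."CUSTOMER_ID" = so."CUSTOMER_ID"
LEFT JOIN "STAGING"."MATERIAL_RT" AS mat  ON mat."MATERIAL_ID"  = so."MATERIAL_ID"
"""

conn = dbapi.connect(address="<datasphere-host>", port=443,
                     user="REALTIME#TECH", password="***", encrypt=True)
cursor = conn.cursor()
cursor.execute(REALTIME_ORDERS_VIEW)   # only a handful of such views serve the real-time reports
conn.commit()
conn.close()
```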
This approach has been working well: data is available in SAC reports within seconds of appearing in S/4HANA.
Back to the future: what if we started today?
Datasphere has changed a lot since it was introduced in 2021. In hindsight, the decisions taken at the start of the project proved themselves well for the functionality available at the time. However, some things have since been deprecated by SAP, and new functionality has been introduced. So, there are a couple of things we would do differently now.
- Data replication from S/4HANA would be done with the new Replication Flows wherever possible. They do not need the Data Provisioning Agent, they allow data to be loaded in parallel, and the 1-minute interval for delta capture loads makes near real-time replication possible.
- Using Replication Flows also unlocks another new SAP Datasphere feature – Delta Capture for Local Tables. With this function enabled, delta loading could in some cases be used not only in the staging layer but in the transformation layer as well.
- Building on the two previous points, another new feature, Transformation Flows, would be used to transform and materialize data in the dimension and fact tables. Transformation Flows can make use of the delta capture functionality and persist data incrementally where possible.
- Business Builder modeling would not be used: some of its objects are officially deprecated, and for the rest we do not see any benefits. Instead, the new Analytic Model should be used in all scenarios. It simply provides much easier development and better performance.
Key Takeaways
"It has been an absolute pleasure to embark on the SAP Datasphere journey in cooperation with Scandic Fusion. Over the past two years, we have learned a lot together, and I can now confidently say that SAP Datasphere, in combination with SAC, offers a mature data analytics solution. Vaisala as an organization has gained greater control and understanding of the data modeling and dashboard-building processes. Thanks to the thorough technical knowledge transfer, my team is fully equipped to manage support and further development post go-live.
Joel Friman, Business Solution Manager, Data and Analytics at Vaisala
- Take the time to think about the final architecture. A full-fledged data analytics platform built around SAP Datasphere will probably require a complementary ETL tool as well.
- Do not be afraid to use Spaces to logically organize your data artefacts. It helps build a clean solution and empowers the already top-tier lineage solution available in SAP Datasphere.
- Near real-time reporting is possible with SAP Datasphere and S/4HANA. For performance reasons it might not suit very complex analytical cases, but for operational needs that require extra transformations, a proper data model can make this magic happen.
- SAP Datasphere is a rapidly evolving product. During the past couple of years, Replication Flows were introduced for replicating data to Datasphere, and semantic layer modeling changed completely when the Analytic Model was introduced. Be ready to adapt and follow the SAP roadmap closely!