Data Makes The World Go Round

Behind the scenes in the Kaltura Server, a lot of data is being created. This data is collected and aggregated into Kaltura’s Data Warehouse  (DWH). The DWH contains all the information displayed in the  Analytics tab in the Kaltura Management Console (KMC), and more (some accessible through the API, and some only through direct querying). In this post I’ll shortly describe what data is stored in the DWH and how to get to it. In future posts I’ll dive into the technology powering the DWH, because it is different from the technology powering the rest of the Kaltura Server.

Star Schema
The DWH schema is based upon the star schema design pattern. According to this design pattern, we have several fact tables, containing granular data about events that occurred on the Kaltura Server, and dimension tables, containing data about entities in the Kaltura Server’s database. When you install the Kaltura Server, several schemas are created in your databases. The Kaltura Server uses the kaltura schema and the DWH uses the kalturadw schema (and some internal data staging tables in other schemas).
Dimension Tables
Once a day (and in the upcoming Eagle release, once an hour), metadata about new entities on the Kaltura Server are copied to the DWH. These entities include partners, entries, assets and more. In the DWH schema (kalturadw), you can find these entities in tables starting with dwh_dim. For example, entries will be in dwh_dim_entries, and partners will be in dwh_dim_partners.
Fact Tables
As I mentioned before, fact tables, starting with – you guessed it – dwh_fact, contain granular data created by users using the Kaltura Server. Adhering to the start schema design pattern, we connect data in the fact table to data in the dimension tables. Entry_id in the dwh_fact_events table will refer to an entry_id in dwh_dim_entries.
Let’s look at a few examples of data that is stored in these tables:

  • A partner adds a new video or converts an existing video. At the end of the process, a new video file was created. The information about the new file is then used to compute the amount of storage used by this partner in dwh_fact_entry_sizes.
  • A user is watching a video using the KDP. The KDP sends a series of events back to the Kaltura Server. These events are registered in the log created by the Apache server, and are then collected by the DWH. These events are parsed and logged in a table named dwh_fact_events. If you query this table, you will be able to see every event that was successfully received by the Kaltura Server. These events are the basis to most of Kaltura’s analytics.
  • Bandwidth logs are also collected, and we can then drill down into the log of bandwidth consumption all the way down to a specific file (video, image, conf, etc.). These bandwidth consumption events are stored in  dwh_fact_bandwidth_usage.

Aggregation Tables
Although the data is stored on a very granular level in the fact tables, in practice, it is very inefficient to query these tables often. For this reason we have another set of tables, used to store aggregated data which are much smaller and result in faster response times. Each such aggregation table corresponds with a view or query done by the Analytics tab in the KMC.
While the KMC refreshes the data presented in the Analytics tab of the KMC on a daily basis, we store the aggregations on an hourly basis. We do this in order to provide the KMC the ability to show the information correctly for different time zones. I will probably write a bit more about this issue in a later post. In any case, this is the reason why aggregated information can be found in tables with the dwh_hourly prefix.
The aggregation tables all look the same, except for the leading field in the aggregation. In this case, the entry id:

CREATE TABLE kalturadw.'dwh_hourly_events_entry' (
	'partner_id' INT DEFAULT NULL,
	'date_id' INT DEFAULT NULL,
	'hour_id' INT DEFAULT NULL,
	'entry_id' VARCHAR(20) DEFAULT NULL,
	'sum_time_viewed' DECIMAL(20,3) DEFAULT NULL,
	'count_time_viewed' INT DEFAULT NULL,
	'count_plays' INT DEFAULT NULL,
	'count_loads' INT DEFAULT NULL,
	/* Many more fields; ads (pre/post/overlay), drop-off metrics, etc. */
	PRIMARY KEY 'partner_id' ('partner_id','date_id','hour_id','entry_id'),
	KEY 'entry_id' ('entry_id')

Note that dwh_hourly_events_entry contains the entry_id column. Other aggregation tables contain different columns according to the aggregation field. So, dwh_hourly_events_domain contains the domain_id column instead of the entry_id column, but is otherwise identical, and the same is correct for the other aggregation tables. Some of the information in these tables is still unavailable through the KMC. If all you want to see is how many times an entry was loaded and played, the KMC is good enough, but if you need information about ads, for example, you will have to look at the preroll, midroll, postroll and overlay columns in the DWH.
In future posts I’ll write up some useful queries that you can use to get a bird’s eye-view of the total usage in the Kaltura Server.
That’s all folks… for now
This post is long enough, but there’s still so much to tell. If you have any questions or requests for information, leave a comment below.

Let's Get Going