Ali data warehouse practice sharing

Abstract
A data warehouse is a strategic collection of all types of data that supports decision making at every level of an organization. It is a single data store created for analytical reporting and decision support, serving organizations that need business intelligence to guide business process improvement and to monitor time, cost, quality, and control.


Live sharing video at https://v.qq.com/iframe/player.html?vid=v0547ee0whs&width=670&height=376.875&auto=0
The Basics 

 


Building a data warehouse mainly requires modeling skills, along with gradually accumulating reusable dimension tables. Beyond that, one has to keep mining the data to improve the model.
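
To make this concrete, here is a minimal sketch of attaching dimension attributes to a fact table, assuming pandas DataFrames stand in for warehouse tables; the table and column names are invented for illustration and are not real Alibaba tables.

import pandas as pd

# Hypothetical dimension table accumulated over time: one row per product.
dim_product = pd.DataFrame({
    "product_id": [101, 102, 103],
    "category":   ["loan", "insurance", "fund"],
    "risk_level": ["low", "medium", "high"],
})

# Hypothetical fact table: one row per user action on a product.
fact_clicks = pd.DataFrame({
    "user_id":    [1, 1, 2],
    "product_id": [101, 103, 102],
    "click_ts":   pd.to_datetime(["2017-05-01", "2017-05-02", "2017-05-02"]),
})

# Redundantly attach the dimension attributes to the fact table so that
# downstream analysis does not have to repeat the join.
wide_fact = fact_clicks.merge(dim_product, on="product_id", how="left")
print(wide_fact)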

 




A few things that should be done
After getting the activity data, it is important to load it into the data warehouse and do the following (a small sketch of the first two items follows the list):


User or entity identifiers need to be unified.

PC and wireless (mobile) data need to be bridged.

Fact data needs to be bridged around entities and across business lines.

Important fact tables should carry redundant dimension attributes.

User profiling or customer profiling needs to be built.
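
Below is a minimal sketch of the first two items, unifying user identifiers and bridging PC and wireless logs; the ID-mapping table, column names, and values are hypothetical and only serve the example.

import pandas as pd

# Hypothetical ID-mapping table: every known identifier points to one unified user ID.
id_mapping = pd.DataFrame({
    "raw_id":     ["cookie_abc", "imei_123", "cookie_def"],
    "unified_id": ["u_001",      "u_001",    "u_002"],
})

# PC logs are keyed by browser cookie, wireless logs by device ID.
pc_logs  = pd.DataFrame({"raw_id": ["cookie_abc", "cookie_def"], "page": ["/loan", "/fund"]})
app_logs = pd.DataFrame({"raw_id": ["imei_123"], "page": ["/loan/apply"]})

# Bridge the two channels: stack the events and tag each one with the unified user ID.
events = pd.concat([pc_logs, app_logs], ignore_index=True)
events = events.merge(id_mapping, on="raw_id", how="left")
print(events[["unified_id", "page"]])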


What's the market value



At present we have accumulated a large amount of user data, which is a very valuable resource, and we have already built some applications on top of it.

After building a data warehouse, we can use the results of user analysis for personalized recommendation, targeted marketing, risk control, and so on.

The market value of the data warehouse lies in building the data mart layer driven by demand scenarios, and in the vertical construction of the individual marts.

The data mart layer digs deep into the value of the data and needs to support rapid trial and error.



Take the backend big-data operation of Ali Financial as an example. We import all user-related data from the relational databases into MaxCompute, and we also record users' operation logs, such as which websites they have logged into, which products they have browsed, and what preferences they show; other data may come from other systems. We run summary analysis over all of this data and finally export the results to the business systems, together with a statistical service. This way, when a user applies for a credit loan, we can quickly determine whether the user meets the credit requirements and proceed to approval.
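
As a purely illustrative sketch of that last step, the exported summary could be checked against simple credit rules like the ones below; the feature names and thresholds are made up and are not Ali Financial's actual criteria.

import pandas as pd

# Hypothetical per-user summary produced by the warehouse.
user_summary = pd.DataFrame({
    "user_id":             ["u_001", "u_002"],
    "login_days_90d":      [62, 5],
    "pages_viewed_90d":    [480, 12],
    "repay_on_time_ratio": [0.99, 0.70],
})

def passes_precheck(row):
    # Toy rule: active users with a good repayment history pass the pre-check.
    return (row["login_days_90d"] >= 30
            and row["pages_viewed_90d"] >= 100
            and row["repay_on_time_ratio"] >= 0.95)

user_summary["credit_precheck"] = user_summary.apply(passes_precheck, axis=1)
print(user_summary[["user_id", "credit_precheck"]])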


As shown in the figure above, DataWorks relies heavily on MaxCompute. On the right are some of the basic components that Aliyun currently provides. The IDE part is visual: for example, workflows and application scheduling can be designed and configured in the workflow designer.

We provide a web-based code editor that supports MR, SQL, and so on. There is also a code debugger where written code can be debugged directly. With the code repository it is possible to keep several versions and preview previously saved ones.

Scheduling is divided into two parts: resource scheduling and workflow scheduling. Workflow scheduling is closely tied to the workflows described above: once a workflow has been designed in the workflow designer, the underlying scheduler executes the tasks in that order. Resource scheduling is related to the resources of the underlying gateway cluster.
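
To make "execute the tasks in the designed order" concrete, here is a minimal sketch of dependency-ordered execution using Python's standard graphlib (Python 3.9+); the workflow and task names are hypothetical, and this is not the DataWorks scheduler itself.

from graphlib import TopologicalSorter

# Hypothetical workflow: each task lists the tasks it depends on.
workflow = {
    "sync_user_data":   [],
    "sync_log_data":    [],
    "build_fact_table": ["sync_user_data", "sync_log_data"],
    "build_report":     ["build_fact_table"],
}

# Execute tasks in an order that respects the declared dependencies.
for task in TopologicalSorter(workflow).static_order():
    print("running", task)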

Data governance is mainly task monitoring and data quality.
Big Data Development Core Process 

First analyze the requirement and then design the workflow, for example when each task runs and which business data it depends on. After the workflow design is completed, data collection and data synchronization are performed. Next is data development; we provide a web IDE that supports SQL, MR, SHELL and PYTHON. We then provide smoke-test scenarios, and once testing passes the tasks are released to production, where they are scheduled automatically every day with data quality monitoring. Once all of the above steps are completed, we can flow the data back into the business systems' databases or use tools like QuickBI and DataV for page presentation.
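
As a rough illustration, the tasks released to production by this process could be described by a configuration like the one below; the structure, field names, and quality check are hypothetical, not the actual DataWorks format.

# Hypothetical description of the daily-scheduled tasks produced by this process.
daily_tasks = [
    {"name": "ods_sync_orders",  "type": "data_sync", "depends_on": [],                   "schedule": "00:30"},
    {"name": "dwd_clean_orders", "type": "sql",       "depends_on": ["ods_sync_orders"],  "schedule": "01:00"},
    {"name": "ads_daily_report", "type": "sql",       "depends_on": ["dwd_clean_orders"], "schedule": "02:00"},
    {"name": "export_to_biz_db", "type": "data_sync", "depends_on": ["ads_daily_report"], "schedule": "03:00"},
]

for task in daily_tasks:
    print(task["schedule"], task["name"])

# A data-quality monitor might simply assert basic expectations after each run.
def check_quality(row_count, min_expected):
    assert row_count >= min_expected, "row count below expectation, raise an alert"

check_quality(row_count=120_000, min_expected=100_000)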

We design tasks offline, and at 12 o'clock every day the designed tasks are turned into instance snapshots for that day's run. Our task-dependency capabilities are also currently among the most advanced in the industry.

The most common requirement nowadays is reporting: daily reports every day, weekly reports every week, and monthly reports every month. To save resources, the daily report data can be reused directly to produce the weekly or monthly reports.
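
A minimal sketch of that reuse, assuming the daily report is a pandas DataFrame with one row per day; the column names are invented for the example.

import pandas as pd

# Hypothetical daily report: one row per day with an order count and revenue.
daily = pd.DataFrame({
    "ds":      pd.date_range("2017-05-01", periods=14, freq="D"),
    "orders":  range(100, 114),
    "revenue": [1000.0 + 10 * i for i in range(14)],
})

# The weekly and monthly reports are just re-aggregations of the daily data,
# so there is no need to recompute them from the raw logs.
weekly  = daily.resample("W-SUN", on="ds").sum()
monthly = daily.resample("MS",    on="ds").sum()
print(weekly)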


The online system requires that the data has flowed back to the business system by 6:00 every day, at which point the business system starts using it.

As shown in the diagram above, suppose there are two tasks, D and E, both of which depend on B, and B in turn depends on A. Task D has a runtime of 1.5 hours and E has a runtime of 2 hours, so for both to finish by 6:00, B's task must complete by 4:00 each day; B's typical runtime is 2 hours, so A must finish no later than 2:00 each day. A's own runtime is only 10 minutes, so if we find at 1:00 that A's task has failed, we can calculate how much margin is left and manually supervise and troubleshoot. As long as the manual intervention happens before 1:50, A can still finish by 2:00, which ensures that tasks D and E are produced on time by 6:00.
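
A minimal sketch of that backward deadline calculation, assuming the dependency chain and runtimes described above and treating the 6:00 deadline as 06:00 for concreteness; it only reproduces the 4:00, 2:00, and 1:50 figures.

from datetime import datetime, timedelta

sla = datetime(2017, 5, 1, 6, 0)   # D and E must be finished by 6:00

runtimes = {
    "A": timedelta(minutes=10),
    "B": timedelta(hours=2),
    "D": timedelta(hours=1, minutes=30),
    "E": timedelta(hours=2),
}

# Work backwards from the SLA: B must finish early enough for the slower of D and E,
# and A must finish early enough for B to start.
b_deadline     = sla - max(runtimes["D"], runtimes["E"])   # 4:00
a_deadline     = b_deadline - runtimes["B"]                # 2:00
a_latest_start = a_deadline - runtimes["A"]                # 1:50, last moment for manual intervention

print(b_deadline.time(), a_deadline.time(), a_latest_start.time())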

Summarizing

 


As shown in the figure, MaxCompute is the "heart" of the little figure in the diagram, and all the running tasks live inside MaxCompute. Scheduling is the "brain" of the data architecture. The "eyes" are data monitoring, which on the data architecture platform is still "near-sighted" because it has not yet been officially launched. Data integration is like the two "hands", constantly carrying data in from other places. The underlying development environment and the operation and maintenance center are like the two "legs" that let the entire data architecture platform go farther. And data quality is like a "health check center", that is, the monitoring of data quality.

That's all I have to say today, thanks for listening!
