Intellicus Enterprise Reporting and Business Insights 18.1

Advanced Steps


Merging two sources using Join/Union

Intellicus Query Objects support two ways of merging data sets.

  1. Join – includes Equi and Outer joins
  2. Union – union of an equal or unequal number of columns
Join

This step takes two inputs. As data passes through this step, the data from both inputs is joined based on the properties set for this step.


Figure 12: Join Step

The Join step properties are:

| Property | Values | Comments |
| --- | --- | --- |
| Select All Fields | Select/De-select | Select = all fields from both sources will be available in the output of this step. Duplicate field names are automatically renamed (suffixed with an underscore and a number); for example, if there is a common field named "Status" in multiple joined data sources, the names become Status, Status_1, Status_2, and so on. De-select = you can choose which fields should be available in the output of this step, by selecting the check box adjacent to the field names of both inputs |
| Join Type | Select from: Inner Join, Left Outer, Right Outer, Full Outer | Select the type of join from the available list |
| Join Conditions | Field Name, Operator, Field Name | Forms the join key |
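Conceptually, the Join step behaves like a relational join of its two inputs. As a minimal illustration (the data frames and the join key CustomerID below are hypothetical, not part of Intellicus), base R's merge() reproduces all four join types:

```r
# Hypothetical sample inputs; "CustomerID" is an assumed join key.
orders    <- data.frame(CustomerID = c(1, 2, 3), Amount = c(100, 250, 75))
customers <- data.frame(CustomerID = c(2, 3, 4), Name = c("Ann", "Bob", "Cy"))

inner <- merge(orders, customers, by = "CustomerID")                # Inner Join
left  <- merge(orders, customers, by = "CustomerID", all.x = TRUE)  # Left Outer
right <- merge(orders, customers, by = "CustomerID", all.y = TRUE)  # Right Outer
full  <- merge(orders, customers, by = "CustomerID", all = TRUE)    # Full Outer
```

Here the Join Condition corresponds to CustomerID = CustomerID with the equality operator; rows with no match on the outer side are filled with NA.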
Union

The Union step takes two or more inputs. The data passing through this step is appended together to form a single output.

Generally, the data inputs selected for Union have the same structure, i.e. the same columns. However, this step supports exceptions and allows some inputs to have extra columns. You can decide whether fields coming from only some of the inputs should be carried forward to the output.

During the union process you can opt for sorted output, or you can prefer unsorted appending.

Figure 13: Union Step

The properties of the Union step are:

| Property | Values | Comments |
| --- | --- | --- |
| Union Type | Unsorted, Sorted | Whether the output of this step should be sorted or not |
| Sort Priority | Number 1-3 | This property is at the field level and is set below the field name. Set the number 1, 2 or 3 on the fields that should form the sort key |
| Sort Criteria | Ascending, Descending | Sorting criteria, as either ascending or descending order |
| Remove Duplicate Rows | Select/De-select | Select = get distinct rows from the Union step |
| Column | Enter | The selected field name populates automatically. You can change the caption of the resultant field |

You can perform the following actions on columns:

| Action | Button | Comments |
| --- | --- | --- |
| Add Column | add button | Adds a new column |
| Rename Column | rename button | Applies the typed name to the selected column |
| Delete Column | delete button | Deletes the selected column |
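As a rough analogue of a union over inputs with unequal columns (the sample data frames below are hypothetical), missing columns can be padded with NA before appending:

```r
# Hypothetical inputs; b carries an extra "Region" column.
a <- data.frame(ID = 1:2, Name = c("Ann", "Bob"))
b <- data.frame(ID = 3:4, Name = c("Cy", "Di"), Region = c("East", "West"))

# Pad each input with the columns it lacks, then append (union).
all_cols <- union(names(a), names(b))
for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
combined <- rbind(a[all_cols], b[all_cols])

# Sorted output with duplicates removed, mirroring the
# Union Type and Remove Duplicate Rows options.
result <- unique(combined[order(combined$ID), ])
```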

Query Objects with Dynamic Fields

Dynamic Fields Step

The Dynamic Fields step gives you the flexibility of adding or removing fields of a Query Object at run time based on your meta information.

Using this step, you can:

  1. Add dynamic fields by pivoting data from the input Data Source
  2. Dynamically fetch metadata for field properties

This step takes one input Data Source.


Figure 14: Dynamic Fields Step

Dynamic Metadata

This step takes an SQL or other Data Source that defines the metadata of the dynamic result set at run time.

This SQL will be fired just before fetching the input Data Source.

The Field Attribute Mapping section takes each field from the metadata result set and maps it to Query Object field properties.

The important mappings are Field ID, Field Name, Caption and Data Type.
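For instance, the metadata result set might carry one row per dynamic field. The sketch below is a hypothetical illustration; the column names are assumptions, not Intellicus's required names:

```r
# Hypothetical metadata result set: one row per dynamic field.
meta <- data.frame(
  FIELD_ID  = c("SALES_Q1", "SALES_Q2"),
  CAPTION   = c("Sales Q1", "Sales Q2"),
  DATA_TYPE = c("NUMBER", "NUMBER")
)
# Field Attribute Mapping would bind FIELD_ID to Field ID/Name,
# CAPTION to Caption, and DATA_TYPE to Data Type, so two numeric
# fields appear in the Query Object at run time.
```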

Pivoting

Pivoting allows you to convert highly normalized, Name Value paired data into flattened tabular data.

Pivot Columns: specifies which column has the field ID and which column has the value.

Select Grouping: specifies the grouping fields; when grouped on these, the normalized data converts to a flat table.
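To make the pivot concrete, here is a sketch with hypothetical data, using base R's reshape() rather than Intellicus's own engine:

```r
# Hypothetical normalized Name-Value data.
nv <- data.frame(
  OrderID = c(1, 1, 2, 2),                      # grouping field
  Field   = c("Qty", "Price", "Qty", "Price"),  # field-ID column
  Value   = c(3, 9.99, 5, 4.50)                 # value column
)

# Pivot: group on OrderID; spread Field into columns carrying Value.
flat <- reshape(nv, idvar = "OrderID", timevar = "Field",
                direction = "wide")
# flat now has columns: OrderID, Value.Qty, Value.Price
```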

Custom Step

External Task

The External Task step allows you to call standard and custom third-party processes from within an Intellicus Query Object. Some tasks may run in-proc, and some may be sent for remote execution by appropriate bridge components.

There are a few pre-configured external tasks; more external task types can be configured.

Pre-configured external task types:

| Task Type | Comments |
| --- | --- |
| Custom MapReduce Job for Hadoop | If you have custom-written scripts, you can use this option to execute such jobs |
| Pig Job | Jobs created using Pig scripts can be executed using this option |
| Custom Row Processor | You can create your own logic to transform your data using JAR files and provide it here |
| Hive Job | Hive jobs can be executed, and results retrieved for analysis |

Data Science Engine Step

To perform the Data Science Engine step, you must have a Data Science connection to one of the Data Science Engines. Connecting to a Data Science Engine is described in "WorkingwithDatabaseConnections.pdf".

The Data Science Engine step takes two inputs. This step helps you transmit your data to Data Science Engines to perform machine learning and modelling. You can add the Data Science Engine step before or after any other transformation step.

Adding the Data Science Engine step at the Query Object level helps when predictions on your data add new variables and columns to tables. For example, in a market basket analysis, the clusters that form may require new columns and variables in the table. This can be achieved during data preparation, and hence such algorithms need to be defined at the Query Object level.

You can also perform data cleansing and other Data Science Engine related transformation tasks by creating a script at the Query Object level.

Data Science Engines train on your data to produce predictions. While transforming your data, you can input Training as well as Prediction data based on the conditions below.

  • If you have separate data to train and predict, you need to add data for training as well as prediction.
  • If you want training and prediction on the same data, only one data source can be added.
  • If you already have a trained model in your script, you need not add training data.

Figure 15: Data Science Engine Step

Properties

| Property | Values | Comments |
| --- | --- | --- |
| Data Source Engine | R Job | Select the Data Science Engine you want to use |
| Script | Sample Script | View the Data Science Engine script you have created |
| Edit | Type yourself | Click the Edit button to create a Data Science Engine script or edit an already created one |

When you click the Edit button, the script editor box opens. Here you can view the fields in your script and write an R script for the relevant fields. You can also verify your script to check that it is error-free.

Guidelines for writing R Script
  1. The script needs to have sections for Training and Prediction. These sections should start with #, and the placeholders should be surrounded by <%%> for Intellicus to be able to parse and understand the modularization. For example, #<% TRAINING.SECTION %>
  2. The first line of the Training and Prediction script should read the CSV, and the last line of the Prediction script should write it. The argument passed in the reading section should be <% Stepname.data %>. Example: read.csv('<% Train.data %>')
  3. Previous step data should be referred to as 'StepName.data'. For example, if in the transformation area you created the step as Train, the input must be 'Train.data'.
  4. The model created is saved by default as 'myModel'. This name is mandatory for the model you create, as it is referred to while communicating with Data Science Engines.
  5. Training will only happen if the training script is provided; otherwise it is assumed that a trained model is used.
  6. If a trained model is used, it is mandatory for the user to provide a prediction script.

Once you have added a script, you can click the Verify button to check that it is appropriately written. Click OK. You can further click Save or Save As to save your Query Object for use in reporting.

An example script for your reference is given below:

```r
#<%TRAINING.SECTION%>
trainingDataset = read.csv('<%Train.Data%>')
library(randomForest)
myModel = randomForest(x = trainingDataset[1:15], y = trainingDataset$TEMP, ntree = 500)

#<%PREDICTION.SECTION%>
predictionDataset = read.csv('<%Predict.Data%>')
y_pred = predict(myModel, data.frame(predictionDataset[1:15]))
predictionDataset$ExpectedTemp <- y_pred
write.csv(predictionDataset, file = '<%Predict.Data%>')
```

Data Extraction and Forwarding Agent (DEFA) Step

Data Extraction and Forwarding Agent (or DEFA) collects identified data at predefined intervals from a data source and forwards it to a data collector component (Intellicus Report Server), which can transform and load this data. It also gives you the ease of receiving only incremental data.

DEFA step in Query Object

A DEFA step is defined for receiving data from one or more DEFA agents.

To define a DEFA step, you need a snapshot of the data that can be received by this step. This sample data is used to understand the structure of the data that will be sent by the DEFA agent(s) configured for this step.

You need to specify a connection for metadata in order to design a query; this helps DEFA know which data to fetch. The data can come from RDBMS, JSON and file-based (TXT, CSV) data sources.

Figure 16: DEFA Step

The properties to be defined for DEFA are listed in the table below.

| Property | Values | Comments |
| --- | --- | --- |
| Connection for Metadata | Select from list | Connection to be used to design the query for metadata. Select from the connections already created on the Configure > Databases tab |
| Incremental Column | Select from list | Identity column for DEFA to extract incremental data |
| Schedule Data Extraction Start at | Select from list | Start time for DEFA to fetch (extract) data |
| Data Extraction Interval | Type yourself (as hr:min:sec) | Time interval after which DEFA should fetch data |
| Data Forward Interval | Type yourself (as hr:min:sec) | Time interval after which DEFA should send (forward) data to the Report Server |
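To illustrate what an incremental column does, here is a sketch of the general idea only, not DEFA's internal implementation; the table name EVENTS, the column EVENT_ID and the RSQLite connection are all assumptions. Each extraction fetches only rows beyond the last value already seen:

```r
library(DBI)

# Hypothetical source connection; RSQLite stands in for any RDBMS.
con <- dbConnect(RSQLite::SQLite(), "source.db")

# "EVENT_ID" is an assumed identity (incremental) column on "EVENTS".
last_seen <- 0
fetch_increment <- function() {
  rows <- dbGetQuery(con,
    "SELECT * FROM EVENTS WHERE EVENT_ID > ? ORDER BY EVENT_ID",
    params = list(last_seen))
  if (nrow(rows) > 0) last_seen <<- max(rows$EVENT_ID)
  rows  # only the new rows would be forwarded to the Report Server
}
```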
Field Properties

| Property | Values | Comments |
| --- | --- | --- |
| Data Type | CHAR, NUMBER, DATE, BINARY | Select the data type of the incoming data |
| Data Format | Format string | Specify the format of the incoming data. This is useful only if Date or IP Address type data arrives in CHAR fields but needs to be converted to Date and Number types for further use |
| Sort Priority | Number 0-N | If the data is sorted on multiple fields, specify the sort priority number of this field. The primary sort field should have the lowest number |
| Sort Criteria | Ascending/Descending | Specify sorting in either ascending or descending order |
| Length | Type yourself | Length of the field, for the CHAR data type |
| Scale | Type yourself | Scale, or the number of digits after the decimal point |
| Database Time Zone | Select time zone from the list | Specify the time zone in which the incoming date data is stored. This is useful only if date-time data needs to be converted to another time zone based on reporting requirements. For example, when incoming GMT data needs to show EDT (or any requested time zone) values in a report, specify that the incoming data is GMT. The output format is generally specified in the Format step or in user preferences |
| Locale | Select from the list | Select the language/country in which the incoming date data is stored |
| Qualified Name | Type yourself | This name helps in using the field name in all SQL clauses such as WHERE and ORDER BY, or in resolving field-name ambiguity when the same field comes from two tables or expressions |
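To illustrate what Data Format and Database Time Zone accomplish, here is a sketch in R using strptime-style format codes; Intellicus format strings have their own syntax, and the sample values are hypothetical:

```r
# Date data arriving in a CHAR field, parsed using a format string,
# with the Database Time Zone declared as GMT.
raw    <- c("28-06-2020 13:45:00", "29-06-2020 09:10:30")
parsed <- as.POSIXct(raw, format = "%d-%m-%Y %H:%M:%S", tz = "GMT")

# At report time the same instants can be rendered in a requested
# time zone (e.g. EDT), mirroring the Database Time Zone behaviour.
format(parsed, tz = "America/New_York", usetz = TRUE)
```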

Load Step

The Load step under a Query Object loads the incoming data (after extraction and transformation) into a target system (file, relational database, HDFS, NoSQL/columnar system, etc.).

You need to select a target Intellicus connection (already created under Configure > Databases) into which data should be loaded. One important configuration is unchecking the 'Read Only' setting of the target system; if it is not unchecked, the connection will not show in the list of target systems into which the extracted data can be loaded.

There are target-connection-specific properties required to load data into the system.

You can fetch a list of existing tables/files from the connection.

Figure 17: Load Step

The properties of the Load step (when the target system is an RDBMS) are:

| Property | Values | Comments |
| --- | --- | --- |
| Target Connection | Select from list | Select a target connection into which data should be loaded |

Table Properties

| Property | Values | Comments |
| --- | --- | --- |
| Target Table Name | Select from list | Select the target table name using the selected Target Connection |
| If Table Exists | Select from list: Append Data, Delete Existing Data, Skip Load, Drop and Create New Table, Truncate Existing Data | Action to be performed in case the table exists. Append Data: append data at the end. Delete Existing Data: delete data in the existing table. Skip Load: perform no action on the table and skip the Load step. Drop and Create New Table: drop the table and create a new table with the same name and a new schema. Truncate Existing Data: delete data in the existing table, but an Admin user can roll back this data |
| Suffix TimeStamp | Check/Uncheck | Check = suffix a timestamp, in the chosen format, to the table name. Select the format from the available list |

Data Properties

| Property | Values | Comments |
| --- | --- | --- |
| If Record Exists | Select from list: Insert Only, Update Only, Update or Insert | Action to be performed in case the record exists. Insert Only: insert the new record. Update Only: update the existing record with the new one. Update or Insert: update an existing record with the new record, else insert the new record |
| Ignore Empty Rows | Check/Uncheck | Check = ignore empty rows in the table while loading |

Error Handling

| Property | Values | Comments |
| --- | --- | --- |
| Error Threshold | Type yourself | Specify the count of errors after which the loading process should stop. Any positive number means stop processing after that many + 1 errors, i.e. if the value is 2, processing stops upon the 3rd error |

Column Properties

| Property | Values | Comments |
| --- | --- | --- |
| Fetch Schema | Click the option | Fetches the schema of the existing table. You can see the field details of the data to be loaded under the Column Properties section |
| Remove All Columns | Click the option | Removes all columns in the target table |
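As an illustration of the 'Update or Insert' behaviour, here is a general RDBMS upsert sketch, not Intellicus's loader; the SALES table, its ID key and the RSQLite connection are assumptions:

```r
library(DBI)

# Hypothetical target; recent SQLite versions support ON CONFLICT upserts.
con <- dbConnect(RSQLite::SQLite(), "target.db")
dbExecute(con,
  "CREATE TABLE IF NOT EXISTS SALES (ID INTEGER PRIMARY KEY, AMOUNT REAL)")

# Update or Insert: update the existing record, else insert the new one.
dbExecute(con,
  "INSERT INTO SALES (ID, AMOUNT) VALUES (?, ?)
   ON CONFLICT(ID) DO UPDATE SET AMOUNT = excluded.AMOUNT",
  params = list(42, 199.5))
```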

You can run and schedule Query Objects of the Load type from the Explorer.

A Load step can also be followed by a Data Source step, further followed by a Format step, to be used for generating reports.

Note: To know about loading into file, HDFS or NoSQL (columnar) systems, please refer to WorkingwithLoadStep.doc