Merging two sources using Join/Union
Intellicus Query Objects support two types of merging of data sets.
- Join – includes Equi and Outer joins
- Union – union of inputs with an equal or unequal number of columns
The Join step takes two inputs. As data passes through this step, the data from both inputs is joined based on the properties set for this step.
Figure 12: Join Step
The Join step properties are:
| Property | Values | Comments |
|---|---|---|
| Select All Fields | Select/De-select | Select: all fields from both sources are available in the output of this step. Duplicate field names are renamed automatically (suffixed with an underscore and a number); for example, if a field named "Status" occurs in multiple joined data sources, the names become Status, Status_1, Status_2, and so on. De-select: you choose which fields should be available in the output by selecting the check box adjacent to the field names of both inputs |
| Join Type | Select from list | Select the type of join from the available list |
| Join Conditions | Field Name Operator Field Name | Forms the join key |
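The join behavior described above can be sketched as follows. This is illustrative Python over lists of dictionaries, not Intellicus code; the function names are invented for the example, and the renaming of duplicate fields mirrors the Status/Status_1 rule from the table.

```python
def rename_duplicates(left_fields, right_fields):
    """Rename right-side fields that collide with left-side names,
    mirroring the Status -> Status_1, Status_2 rule."""
    taken = set(left_fields)
    mapping = {}
    for f in right_fields:
        new, n = f, 0
        while new in taken:
            n += 1
            new = f"{f}_{n}"
        taken.add(new)
        mapping[f] = new
    return mapping


def join(left, right, key, join_type="equi"):
    """Join two lists of dict-rows on `key`.

    "equi" keeps only rows with a match in both inputs; "left outer"
    also keeps unmatched left rows, filling right-side fields with None.
    """
    left_fields = list(left[0]) if left else []
    right_fields = [f for f in (right[0] if right else {}) if f != key]
    mapping = rename_duplicates(left_fields, right_fields)
    out = []
    for lrow in left:
        matches = [r for r in right if r[key] == lrow[key]]
        if matches:
            for rrow in matches:
                merged = dict(lrow)
                for f, new in mapping.items():
                    merged[new] = rrow[f]
                out.append(merged)
        elif join_type == "left outer":
            merged = dict(lrow)
            merged.update({new: None for new in mapping.values()})
            out.append(merged)
    return out
```

With two inputs that both carry a Status field, the equi join output has Status and Status_1; switching to "left outer" additionally keeps left rows that have no match.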
The Union step takes two or more inputs. The data passing through this step is appended to form a single output.
Generally, the inputs selected for Union have the same structure, i.e. the same columns. However, this step allows some inputs to have extra columns, and you can decide whether fields that come from only some inputs are carried forward to the output.
During the union process you can choose sorted output or prefer unsorted appending.
Figure 13: Union Step
The properties of the Union step are:

| Property | Values | Comments |
|---|---|---|
| Sorted | Check/Uncheck | Whether the output of this step shall be sorted or not |
| Sort Priority | Number 1-3 | A field-level property, set below the field name. Assign 1, 2 or 3 to the fields that should form the sort key |
| Sort Order | Ascending/Descending | Sorting criteria, as either ascending or descending order |
| Remove Duplicate Rows | Select/De-select | Select: get distinct rows from the Union step |
| Column | Enter | The selected field name populates automatically; you can change the caption of the resultant field |

You can perform the following actions on columns:

| Action | Purpose |
|---|---|
| Add Column | Adds a new column |
| Rename Column | Applies the typed name to the selected column |
| Delete Column | Deletes the selected column |
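The union semantics described above (unequal columns, optional deduplication, prioritized sort keys) can be sketched as follows. This is an illustrative Python sketch, not Intellicus internals; the parameter names are invented for the example.

```python
def union(inputs, carry_extra=True, remove_duplicates=False, sort_keys=None):
    """Append rows (lists of dict-rows) from several inputs into one output.

    carry_extra=True keeps columns present in only some inputs (missing
    values become None); False keeps only columns common to all inputs.
    sort_keys is an ordered list of (field, descending) pairs, i.e. the
    sort priority 1, 2, 3 described above.
    """
    all_fields, common = [], None
    for rows in inputs:
        fields = list(rows[0]) if rows else []
        all_fields += [f for f in fields if f not in all_fields]
        common = set(fields) if common is None else common & set(fields)
    fields = all_fields if carry_extra else [f for f in all_fields if f in common]
    # Append all rows, projecting each onto the chosen field list
    out = [{f: row.get(f) for f in fields} for rows in inputs for row in rows]
    if remove_duplicates:
        seen, distinct = set(), []
        for row in out:
            key = tuple(row.items())
            if key not in seen:
                seen.add(key)
                distinct.append(row)
        out = distinct
    if sort_keys:
        # Apply lowest-priority key first; Python's stable sort preserves it
        for field, descending in reversed(sort_keys):
            out.sort(key=lambda r: r[field], reverse=descending)
    return out
```

With `carry_extra=True`, a column such as `zone` that exists in only one input appears in every output row, holding None where the source input lacked it.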
Query Objects with Dynamic Fields
Dynamic Fields Step
The Dynamic Fields step gives you the flexibility to add or remove fields of a Query Object at run time, based on your meta information.
Using this step, you can:
- Add dynamic fields by Pivoting data from input Data Source
- Dynamically fetch metadata for field properties
This step takes one input Data Source.
Figure 14: Dynamic Fields Step
This step takes an SQL or other Data Source that defines the metadata of the dynamic result set at run time.
This SQL will be fired just before fetching the input Data Source.
The Field Attribute Mapping section takes each field from the metadata result set and maps it to Query Object field properties.
The important mappings are Field ID, Field Name, Caption and Data Type.
Pivoting allows you to convert highly normalized, name-value paired data into flattened tabular data.
Pivot Columns: specifies which column holds the field ID and which column holds the value.
Select Grouping: specifies the grouping fields; when the data is grouped on these, the normalized data converts to a flat table.
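The pivoting operation above can be sketched as follows (illustrative Python, not Intellicus internals; the column names in the example are invented): each distinct group key becomes one flat row, and each name-value pair becomes a column on that row.

```python
def pivot(rows, group_by, name_col, value_col):
    """Convert normalized name/value pairs into one flat row per group key."""
    out = {}
    for row in rows:
        key = tuple(row[g] for g in group_by)
        # Start the flat row with the grouping fields, then attach each
        # name/value pair as a dynamic column
        rec = out.setdefault(key, {g: row[g] for g in group_by})
        rec[row[name_col]] = row[value_col]
    return list(out.values())
```

For example, three normalized rows keyed on an employee ID collapse into one wide row per employee, with Name and Dept as dynamic columns.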
External Task Step
The External Task step allows you to call standard and custom third-party processes from within an Intellicus Query Object. Some tasks run in-proc, while others may be sent for remote execution by appropriate bridge components.
A few external task types come pre-configured; more can be configured.
Pre-configured external task types:
| Task Type | Description |
|---|---|
| Custom MapReduce Job for Hadoop | If you have custom-written scripts, you can use this option to execute such jobs |
| Pig Job | Jobs created using Pig scripts can be executed using this option |
| Custom Row Processor | You can create your own logic to transform your data using JAR files and provide it here |
| Hive Job | Hive jobs can be executed, and results retrieved for analysis |
Data Science Engine Step
To perform the Data Science Engine step, you must have a Data Science connection to one of the Data Science Engines. Connecting to a Data Science Engine is described in "WorkingwithDatabaseConnections.pdf".
The Data Science Engine step takes two inputs. This step lets you send your data to Data Science Engines to perform machine learning and modelling. You can add a Data Science Engine step before or after any other transformation step.
Adding the Data Science Engine step at the Query Object level helps when predictions on your data add new variables and columns to tables. For example, in a market basket analysis, the clusters that form may require new columns and variables in the table. This can be achieved during data preparation, and hence such algorithms need to be defined at the Query Object level.
You can also perform Data Cleansing and other Data Science engine related transformation tasks by creating script at Query Object level.
Data Science Engines train on your data to produce predictions. You can supply training as well as prediction data based on the conditions below:
- If you have separate data to train and predict, you need to add data for training as well as prediction.
- If you want training and prediction on the same data, only one data source can be added.
- If you already have a trained model in your script, you need not add training data.
Figure 15: Data Science Engine Step
| Property | Values | Comments |
|---|---|---|
| Data Science Engine | R Job | Select the Data Science Engine you want to use |
| Script | Sample Script | Shows the Data Science Engine script you have created |
| Edit | Type Yourself | Click the Edit button to create a Data Science Engine script or edit an existing one. When you click Edit, the script editor box opens; here you can view the fields in your script, write R script for the relevant fields, and verify that your script is error-free |
Guidelines for writing R Script
- The script needs sections for Training and Prediction. These sections should start with #, and the placeholders should be surrounded by <% %> for Intellicus to be able to parse and understand the modularization. For example, #<% TRAINING.SECTION %>
- The first line of the Training and Prediction script should read the CSV, and the last line of the Prediction script should write it. The argument passed in the reading section should be <% Stepname.data %>. Example: read.csv('<% Train.data %>')
- Previous step data should be referred to as 'StepName.data'. For example, if in the transformation area you created the step as Train, the input must be 'Train.data'.
- The model you create must be saved as 'myModel'. This name is mandatory, as it is referred to while communicating with Data Science Engines.
- Training happens only if a training script is provided; otherwise it is assumed that a trained model is used.
- If a trained model is used, you must provide a prediction script.
Once you have added a script, click the Verify button to check that it is written appropriately. Click OK. You can then click Save or Save As to save your Query Object for use in reporting.
An example script for your reference is given below:
trainingDataset = read.csv('<%Train.Data%>')
myModel = randomForest(x = trainingDataset[1:15], y = trainingDataset$TEMP, ntree = 500)
predictionDataset = read.csv('<%Predict.Data%>')
y_pred = predict(myModel, data.frame(predictionDataset[1:15]))
predictionDataset$ExpectedTemp <- y_pred
write.csv(predictionDataset, file='<%Predict.Data%>')
Data Extraction and Forwarding Agent (DEFA) Step
Data Extraction and Forwarding Agent (or DEFA) collects identified data at predefined intervals from a data source and forwards it to a data collector component (Intellicus Report Server), which can transform and load this data. It also gives you the ease of receiving only incremental data.
DEFA step in Query Object
A DEFA step will be defined for receiving data from one or more DEFA agents.
For defining a DEFA step, you need a snapshot of the data that can be received by this step. This sample data is used to understand the structure of the data that will be sent by the DEFA agent(s) configured for this step.
You need to specify a connection for metadata for designing a query; this helps DEFA know which data to fetch. The data can come from RDBMS, JSON, and file-based (TXT, CSV) data sources.
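The incremental-extraction behavior described above can be sketched as follows. This is illustrative Python, not the actual DEFA agent; the function and state names are invented for the example. The agent remembers the highest value of the incremental column it has forwarded and fetches only newer rows on each pass.

```python
def extract_incremental(fetch_rows, state, incremental_col):
    """One extraction pass: return only rows whose incremental-column value
    is greater than the highest value already forwarded, then remember it."""
    last = state.get("last")
    new = [r for r in fetch_rows() if last is None or r[incremental_col] > last]
    if new:
        # Advance the watermark so the next pass skips these rows
        state["last"] = max(r[incremental_col] for r in new)
    return new
```

A real agent would run this on the configured Data Extraction Interval and forward the accumulated rows on the Data Forward Interval.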
Figure 16: DEFA Step
The properties to be defined for DEFA are listed in the table below.
| Property | Values | Comments |
|---|---|---|
| Connection for Metadata | Select from list | Connection to be used to design the query for metadata. Select from connections already created on the Configure > Databases tab |
| Incremental Column | Select from list | Identity column for DEFA to extract incremental data |
| Schedule – Data Extraction Start at | Select from list | Start time for DEFA to fetch (extract) data |
| Schedule – Data Extraction Interval | Type yourself (as hr:min:sec) | Time interval after which DEFA should fetch data |
| Schedule – Data Forward Interval | Type yourself (as hr:min:sec) | Time interval after which DEFA should send (forward) data to the Report Server |
| Data Type | Select from list | Select the data type of the incoming data |
| Data Format | Format String | Specify the format of the incoming data. This is useful only if Date or IP Address type data arrives in CHAR fields but needs to be converted to Date and Number types for further use |
| Sort Priority | Number 0-N | If the data is sorted on multiple fields, specify the sort priority number of this field. The primary sort field should have the lowest number |
| Sort Order | Ascending/Descending | Specify the sort as either ascending or descending order |
| Length | Type yourself | Length of the field for the Char data type |
| Scale | Type yourself | Scale, i.e. the number of digits after the decimal point |
| Database Time Zone | Select time zone from the list | Specify the time zone in which the incoming date data is stored. This is useful only if date-time data needs to be converted to another time zone based on reporting requirements. For example, when incoming GMT data must show EDT (or any requested time zone) values in a report, specify that the incoming data is GMT. The output format is generally specified in the Format step or in user preferences |
| Locale | Select from the list | Select the language/country in which the incoming date data is stored |
| Qualified Name | Type yourself | This name helps in using the field name in SQL clauses such as WHERE and ORDER BY, or in resolving field-name ambiguity when the same field comes from two tables or expressions |
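The Data Format and Database Time Zone properties together describe a conversion like the following sketch (Python standard library only; the GMT-to-EDT offsets are fixed for illustration, and the function name is invented): a date arriving in a CHAR field is parsed with the configured format string, tagged with the database time zone, and converted to the reporting time zone.

```python
from datetime import datetime, timezone, timedelta

def convert_char_date(raw, format_string, db_tz, report_tz):
    """Parse a CHAR field with the configured format string, attach the
    database time zone, and convert to the requested reporting time zone."""
    parsed = datetime.strptime(raw, format_string)
    return parsed.replace(tzinfo=db_tz).astimezone(report_tz)

GMT = timezone.utc
EDT = timezone(timedelta(hours=-4), "EDT")  # fixed offset, for illustration only
converted = convert_char_date("2024-07-01 10:30:00", "%Y-%m-%d %H:%M:%S", GMT, EDT)
```

A 10:30 GMT value thus renders as 06:30 in the requested EDT reporting time zone.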
Load Step
The Load step of a Query Object loads the incoming data (after extraction and transformation) into a target system (file, relational database, HDFS, NoSQL/columnar system, etc.).
You need to select a target Intellicus connection (already created under Configure > Databases) into which data should be loaded. An important configuration is to uncheck the 'Read Only' setting of the target connection; if it is not unchecked, the connection does not appear in the list of target systems into which extracted data can be loaded.
There are target-connection-specific properties required to load data into the system.
You can fetch the list of existing tables/files from the connection.
Figure 17: Load Step
The properties of the Load step (when the target system is an RDBMS) are:

| Property | Values | Comments |
|---|---|---|
| Target Connection | Select from list | Select a target connection into which data should be loaded |
| Target Table Name | Select from list | Select the target table name using the selected Target Connection |
| If Table Exists | Select from list: Append Data, Delete Existing Data, Drop and Create New Table, Truncate Existing Data, No Action | Action to be performed if the table exists: append data at the end; delete data in the existing table (an Admin user can roll back this data); drop the table and create a new table with the same name and new schema; truncate (delete) data in the existing table; or perform no action on the table and skip the Load step |
| Suffix TimeStamp | Check/Uncheck | Check: suffix a timestamp, in the format chosen from the available list, to the table name |
| Data Properties – If Record Exists | Select from list: Insert, Update, Update or Insert | Action to be performed if the record exists: insert the new record; update the existing record with the new one; or update the existing record if present, else insert the new record |
| Ignore Empty Rows | Check/Uncheck | Check: ignores empty rows in the table while loading |
| Error Handling – Error Threshold | Type yourself | Count of errors after which the loading process should stop. A positive value N means processing stops on the (N+1)th error; for example, a value of 2 stops processing upon the 3rd error |
| Column Properties – Fetch Schema | Click the option | Fetches the schema of the existing table. You can see the field details of the data to be loaded under the Column Properties section |
| Column Properties – Remove All Columns | Click the option | Removes all columns in the target table |
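The If Record Exists and Error Threshold behavior can be sketched as follows. This is illustrative Python against an in-memory dict standing in for the target table, not Intellicus internals; the function signature is invented for the example.

```python
def load(rows, table, key, if_record_exists="update or insert", error_threshold=1):
    """Load dict-rows into `table` (a dict keyed on `key`).

    A row missing the key counts as an error; processing stops once errors
    exceed the threshold (a threshold of 2 means stop upon the 3rd error).
    """
    errors = 0
    for row in rows:
        try:
            k = row[key]
        except KeyError:
            errors += 1
            if errors > error_threshold:
                raise RuntimeError(f"load stopped after {errors} errors")
            continue
        exists = k in table
        # Apply the configured record policy: insert-only, update-only,
        # or upsert (update if present, else insert)
        if (exists and if_record_exists in ("update", "update or insert")) or \
           (not exists and if_record_exists in ("insert", "update or insert")):
            table[k] = row
    return table
```

With the default upsert policy, a second row carrying an existing key replaces the earlier record rather than duplicating it.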
You can run and schedule Query Objects of the Load type from under Explorer.
A Load step can also be followed by a Data Source step and then a Format step, for use in generating reports.
Note: To know about loading into file, HDFS, or NoSQL (columnar) systems, please refer to WorkingwithLoadStep.doc