To use the Data Science Engine step, you must have a Data Science connection to at least one Data Science Engine. Refer to the Working with Database Connections document for details on connecting to a Data Science Engine.
The Data Science Engine step takes up to two inputs: training data and prediction data. This step sends your data to a Data Science Engine for machine learning and modelling. You can add the Data Science Engine step before or after any other transformation step.
Adding the Data Science Engine step at the Query Object level is useful when predictions on your data add new variables and columns to your tables. For example, in a market basket analysis, the clusters that form may require new columns and variables in the table. Because these are created during data preparation, such algorithms need to be defined at the Query Object level.
You can also perform data cleansing and other Data Science Engine related transformation tasks by creating a script at the Query Object level.
Data Science Engines train on your data to produce predictions. While transforming your data, you can supply training data, prediction data, or both, depending on the following conditions:
- If you have separate data for training and prediction, add a data source for training as well as one for prediction.
- If you want to train and predict on the same data, add only one data source.
- If you already have a trained model in your script, you do not need to add training data.
Figure 17: Data Science Engine Step
|Data Source Engine||R Job||Here you can select the Data Science Engine you want to use|
|Script||Sample Script||Here you can see the Data Science Engine script you have created|
|Edit||Type Yourself||Click the Edit button to create a Data Science Engine script or edit an existing one|
When you click the Edit button, the script editor box opens. Here you can view the fields available to your script and write an R script for the relevant fields. You can also verify your script to check that it is error-free.
Guidelines for writing R Script
- The script must have separate sections for Training and Prediction. Each section starts with a # followed by a placeholder enclosed in <% %>, which lets Intellicus parse the script into its sections. For example, #<% TRAINING.SECTION %>
- The first line of both the Training and the Prediction script should read the CSV, and the last line of the Prediction script should write it. The argument passed to the read call should be <% StepName.data %>. For example, read.csv('<% Train.data %>')
- Refer to the data from a previous step as 'StepName.data'. For example, if you named the step Train in the transformation area, the input must be 'Train.data'.
- The model is saved as 'myModel' by default. This name is mandatory for the model you create, because it is referred to when communicating with Data Science Engines.
- Training happens only if a training script is provided; otherwise, it is assumed that a trained model is being used.
- If a trained model is used, you must provide a prediction script.
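The guidelines above could be laid out as the following skeleton. Treat it as a sketch rather than a definitive template: the PREDICTION.SECTION marker name is an assumption modelled on the TRAINING.SECTION example (only the training marker is shown in these guidelines), Train and Predict are hypothetical step names, and base R's lm with a TEMP column stands in for whatever algorithm and target you actually use. The placeholders are resolved by Intellicus, so the script is not meant to run as-is in a plain R session.

```r
#<% TRAINING.SECTION %>
trainingDataset <- read.csv('<% Train.data %>')     # first line: read the training step's CSV
myModel <- lm(TEMP ~ ., data = trainingDataset)     # the model must be named myModel

# NOTE: the marker name below is an assumption mirroring TRAINING.SECTION.
#<% PREDICTION.SECTION %>
predictionDataset <- read.csv('<% Predict.data %>') # first line: read the prediction step's CSV
predictionDataset$ExpectedTemp <- predict(myModel, newdata = predictionDataset)
write.csv(predictionDataset, file = '<% Predict.data %>')  # last line: write the result
```

If a trained model is already available, the training section would be left out and only the prediction section supplied.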
Once you have added a script, click the Verify button to check that it is written correctly, then click OK. You can then click Save or Save As to save your query object for use in reporting.
An example script for your reference is given below:
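library(randomForest)  # provides randomForest()
trainingDataset = read.csv('<% Train.data %>')  # data from the step named Train
myModel = randomForest(x = trainingDataset[1:15], y = trainingDataset$TEMP, ntree = 500)  # model must be named myModel
predictionDataset = read.csv('<% Predict.data %>')  # data from the step named Predict
y_pred = predict(myModel, data.frame(predictionDataset[1:15]))
predictionDataset$ExpectedTemp <- y_pred  # add the predictions as a new column
write.csv(predictionDataset, file = '<% Predict.data %>')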
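
Before pasting a script into the editor, it can help to dry-run the same read, train, predict, write flow in a plain R session. The sketch below is not Intellicus functionality. It substitutes temporary CSV files for the step data placeholders, uses made-up data with hypothetical columns HUMIDITY and TEMP, and swaps in base R's lm for randomForest so that no extra package is needed.

```r
# Made-up data standing in for the Train and Predict steps (hypothetical columns).
set.seed(42)
train <- data.frame(HUMIDITY = runif(50, 20, 90))
train$TEMP <- 35 - 0.2 * train$HUMIDITY + rnorm(50, sd = 0.5)
predict_input <- data.frame(HUMIDITY = c(30, 60, 80))

# Temporary CSV files play the role of the step data placeholders.
train_path <- tempfile(fileext = ".csv")
predict_path <- tempfile(fileext = ".csv")
write.csv(train, train_path, row.names = FALSE)
write.csv(predict_input, predict_path, row.names = FALSE)

# Training section: read the CSV first, then fit a model named myModel.
trainingDataset <- read.csv(train_path)
myModel <- lm(TEMP ~ HUMIDITY, data = trainingDataset)

# Prediction section: read the CSV first, add the predicted column, write last.
predictionDataset <- read.csv(predict_path)
predictionDataset$ExpectedTemp <- predict(myModel, newdata = predictionDataset)
write.csv(predictionDataset, file = predict_path, row.names = FALSE)

# Read the file back to inspect what the prediction section wrote.
result <- read.csv(predict_path)
print(result)
```

Once the logic behaves as expected, the temporary paths can be replaced with the step placeholders and the data-generation lines dropped.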