Settings
miniBRS uses Apache Airflow for workflow management and monitoring. Apache Airflow is an open-source platform to programmatically author, schedule and monitor workflows. It provides a rich UI for monitoring your DAGs. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Before moving forward, it's important to know a few concepts specific to Apache Airflow. We are not going to get into the detailed working of Apache Airflow; that is beyond the scope of this document. We shall detail a few basic concepts necessary to know before using miniBRS. If you want to go in-depth on Airflow, please check out their documentation pages; you can find the link in References [1].
Concepts
- DAG: In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies. A DAG is defined in a Python script, which represents the DAG's structure (tasks and their dependencies) as code.
For example, a simple DAG could consist of three tasks: A, B, and C. It could say that A has to run successfully before B can run, but C can run anytime. It could say that task A times out after 5 minutes, and B can be restarted up to 5 times in case it fails. It might also say that the workflow will run every night at 10 pm, but shouldn't start until a certain date. In this way, a DAG describes how you want to carry out your workflow.
Example DAG
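To make this concrete, here is a minimal sketch of the three-task example above, written against Airflow's Python API. The dag_id and the use of DummyOperator are illustrative, not part of miniBRS:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Run every night at 10 pm, but not before the given start_date.
dag = DAG(
    dag_id="example_dag",
    schedule_interval="0 22 * * *",
    start_date=datetime(2020, 4, 1),
)

# A times out after 5 minutes; B can be retried up to 5 times if it fails.
a = DummyOperator(task_id="A", execution_timeout=timedelta(minutes=5), dag=dag)
b = DummyOperator(task_id="B", retries=5, dag=dag)
c = DummyOperator(task_id="C", dag=dag)  # no dependency: C can run anytime

a >> b  # A has to run successfully before B can run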
- Tasks: Tasks represent the actual pieces of work that need to be executed so that the workflow can progress and ultimately finish in its entirety. Each task is an independent, idempotent job; by idempotent we mean that for given inputs the task should always finish its execution giving the same output every time. Tasks are represented by Operators in a DAG. Each Operator instantiation represents an independent task execution. A task goes through various stages from start to completion; in the Airflow UI (graph and tree views), these stages are displayed with a colour representing each stage.
- Connections: The information needed to connect to external systems is stored in the Airflow meta database and can be managed in the UI (Menu -> Admin -> Connections). A conn_id is defined there, with hostname/login/password/schema information attached to it. Airflow pipelines retrieve centrally-managed connection information by specifying the relevant conn_id.
- Variables: Variables are a generic way to store and retrieve arbitrary content or settings as a simple key-value store within Airflow. Variables can be listed, created, updated and deleted from the UI (Admin -> Variables).
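For instance, here is a minimal sketch of how a DAG script can read and write Variables programmatically, equivalent to editing them under Admin -> Variables in the UI (the key and values here are placeholders):

from airflow.models import Variable

Variable.set("my_key", "my_value")                      # create or update a variable
value = Variable.get("my_key")                          # retrieve it as a string
config = Variable.get("config", deserialize_json=True)  # parse a JSON-valued variable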
How to Use
miniBRS makes your workflow creation and management easy and customizable. Airflow at the backend handles scheduling and monitoring of workflows, whereas miniBRS puts configurable workflows at your disposal. miniBRS provides you with a set of tested workflows, maintained by the community, that help you get your data backed up to a storage platform or seeded to your data store. The current version of miniBRS supports the ServiceNow platform, and new SaaS platforms are being added in new releases.
Using miniBRS is a matter of a few clicks; it makes your workflow creation painless. In this section, we shall describe how you can configure your workflows. Make sure you check out the How It Works document, where a use case for the ServiceNow incident table is described. Without any further ado, let's get started.
DAGs
A couple of DAGs come bundled with miniBRS. These are:
Static DAGs
- dag_cleanup: This dag is used to free up storage space. Execution of DAGs leads to the generation of multiple temporary files, variables and logs; this DAG helps in getting them off the system. It runs on a @daily basis and executes three tasks: remove_recovery_dags, remove_tmp_files & purge_logs. miniBRS provides you with an option to recover your failed DAG executions. Each failed dag is recovered by creating another dag specific to the recovery purpose. Once the recovery is completed, recovery dags are removed by dag_cleanup.
Disclaimer: The default functioning of dag_cleanup gives you 24 hours to have your failed dag execution recovered. Please check the Logs section for more information about the recovery process.
- dag_generator: This dag is used to generate your workflows using the config variable. It is hidden from the UI, but you can check its presence in the dags folder.
Dynamic DAGs
miniBRS is capable of generating DAGs on the go. You can specify the configuration for your workflow in the config variable, and miniBRS will start generating dags based on the configured values. A dag is spawned for each table entry in the config variable. The generated dag is named the same as the table name in the config, and each dag comprises a set of tasks: start, fetch_record_count, count_exceeds_threshold, count_within_threshold, count_is_zero, send_dat_to_submission and end. Such a table-named workflow is responsible for fetching your data from a SaaS platform such as ServiceNow; a sketch of how these tasks might be wired together follows below.
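For intuition, here is a hypothetical sketch of how the tasks of one such generated dag could be wired together using a branching operator. This is illustrative only; the actual generator code is internal to miniBRS, and the record count and threshold below are placeholders:

from datetime import datetime
from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import BranchPythonOperator

dag = DAG(dag_id="incident", schedule_interval="@hourly", start_date=datetime(2020, 4, 11))

def route():
    count = 0  # placeholder: the real task would query ServiceNow for the record count
    if count == 0:
        return "count_is_zero"
    return "count_within_threshold" if count <= 1000 else "count_exceeds_threshold"

start = DummyOperator(task_id="start", dag=dag)
fetch = BranchPythonOperator(task_id="fetch_record_count", python_callable=route, dag=dag)
exceeds = DummyOperator(task_id="count_exceeds_threshold", dag=dag)
within = DummyOperator(task_id="count_within_threshold", dag=dag)
zero = DummyOperator(task_id="count_is_zero", dag=dag)
send = DummyOperator(task_id="send_dat_to_submission", dag=dag)
end = DummyOperator(task_id="end", dag=dag, trigger_rule="none_failed")

start >> fetch >> [exceeds, within, zero]
within >> send >> end
exceeds >> end
zero >> end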
Besides these table-named dags, you will also see some dags whose names start with r_; these are the recovery dags. Each recovery dag's name starts with r_, followed by the table name, and ends with the DateTime of the failed execution.
This is a snapshot of the DAGs UI.
- Graph View: the generated ServiceNow DAG's dependencies and their current status for a specific run.
Connections
Before running any DAG, make sure you specify the requisite connections to the external services required for the functioning of workflows.
The connections can be found via the Admin tab in the navigation bar. Click on the Admin tab and select Connections from the drop-down menu; you will be redirected to the Airflow connections page. Airflow allows you to store your external connection details in the meta database via this page. A few connections are of specific importance to miniBRS, and you as a user have to make sure to configure these connections based on your specific needs. Let's have a look at a few such connections.
servicenow_default:
This is the connection entry in the meta database which will hold your ServiceNow instance credentials, i.e. your ServiceNow instance URL and login credentials. If you edit this connection by clicking on the edit connection icon, you will see a form with fields like Conn Id, Conn Type, Host etc. Please do not change the Conn Id value. Add your ServiceNow instance URL to the Host field of the form; the URL must have https prepended, e.g. if your instance is dev1234.service-now.com, save it as https://dev1234.service-now.com in the Host field of the form. Also, add your ServiceNow username to the Login field and the password to the Password field of the form.
The absence of this connection id from the meta-database raises ServiceNowConnectionNotFoundException
sftp_default:
If you want to ingest your SaaS data to an SFTP account, you can add the SFTP connection details in the sftp_default connection entry. Add the SFTP host IP in the Host field, the SFTP username in the Login field and the SFTP user password in the Password field of the form; nothing else needs to be changed.
The absence of this connection id from the meta-database raises SFTPConnectionNotFoundException
s3_default:
If you want to ingest your SaaS data to an Amazon S3 account, you need to have an access_key_id and a secret_key_id for your S3 storage. Add the access_key_id to the Login field and the secret_key_id to the Password field of the s3_default connection. The default S3 region is ap-south-1 and the default bucket name is mini-brs. To change the default region and bucket name, you need to add them to the Extra field of the form. The Extra field takes a JSON string with region-name and bucket-name as attributes; specify your region name and bucket name correspondingly.
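For example, to use a different region and bucket, the Extra field could hold a JSON string like the following (the region and bucket values are placeholders):

{
"region-name": "eu-west-1",
"bucket-name": "my-backup-bucket"
}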
The absence of this connection id from the meta-database raises S3ConnectionNotFoundException
dropbox_default:
miniBRS provides you with an option to ingest your SaaS data to Dropbox. For this you need to generate an access_token for your Dropbox account and add that access_token to the Password field of the connection. To generate an access_token for your account, please check out Reference [2].
The absence of this connection id from the meta-database raises DropboxConnectionNotFoundException
google_drive_default:
miniBRS provides you with an option to ingest your SaaS data to Google Drive. For this you need to generate a client id, client secret, access_token and refresh_token for your Google Drive account; to generate an access_token for your account, please check out Reference [3]. You need to add the client id to the Login field, the client secret to the Password field, and the access_token and refresh_token as a JSON structure to the Extra field of the google_drive_default connection id:
{
"access_token": "<YOUR-ACCESS-TOKEN_HERE>",
"scope": "https://www.googleapis.com/auth/drive",
"token_type": "Bearer",
"expires_in": 3599,
"refresh_token": "<YOUR-REFRESH-TOKEN_HERE>"
}
The absence of this connection id from the meta-database raises GoogleDriveConnectionNotFoundException
mysql_default:
This is the default connection used to store MySQL database credentials. If you want to use MySQL as a storage platform, you can store your MySQL database credentials in this connection. Add your database hostname in the Host field, the database name in the Schema field, the username in the Login field, the password in the Password field and the port in the Port field of the connection form.
The absence of this connection id from the meta-database raises MySQLConnectionNotFoundException
postgres_default:
This is the default connection used to store Postgres database credentials. If you want to use Postgres as a storage platform, you can store your Postgres database credentials in this connection.
The absence of this connection id from the meta-database raises PostgreSQLConnectionNotFoundException
mssql_default:
This is the default connection used to store Microsoft SQL Server database credentials. If you want to use SQL Server as a storage platform, you can store your database credentials in this connection.
The absence of this connection id from the meta-database raises MSSQLConnectionNotFoundException
Important Note:
You do not need to provide every connection detail defined above; you only need to provide the connection details for the storage you want to use. Please note that you must provide the connection details for the storage which you have set in the 'storage_type' attribute of the 'config' variable. If you have placed 'storage_type: "sftp"' in the 'config' variable, you must provide the values for the 'sftp_default' connection.
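For reference, this is also how the workflows themselves consume these entries: a pipeline looks up a stored connection by its conn_id through Airflow's hooks. A minimal sketch, using sftp_default as the example:

from airflow.hooks.base_hook import BaseHook

# Fetch the centrally-managed connection from the Airflow meta database.
conn = BaseHook.get_connection("sftp_default")
print(conn.host, conn.login)  # the values entered in the connection form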
Variables
miniBRS uses Variables as a single point to configure workflows. Once you have installed miniBRS and added the ServiceNow and storage connection details to their respective connection ids, you can configure your workflows via configuration variables. Variables can be set via the Airflow UI: click on Admin in the navbar and select Variables. You will see a list of variables. At any specific moment there could be many variables generated by the system, but among all those variables three are of core importance. Let's go through these variables one by one:
config:
The config variable is an important key-value pair in the meta-database; it provides you with a mechanism to generate workflows dynamically. It uses the JSON format to store values. You can create and delete DAGs using this variable. We shall define the working of each attribute now.
{
"tables": ["incident","problem","sc_request"],
"start_date": "1day",
"frequency": "hourly",
"threshold": 1000,
"export_format": "xml",
"storage_type": "dropbox",
"email": ""
}
Attributes
- tables: An array where you can add the ServiceNow table names, as comma-separated values, from which you want to ingest data to the storage. Please ensure the values inside the array are valid ServiceNow table names. For each table name in this array, a DAG will be generated with the same name as the entry. e.g. If you add incident to the list, you will see a DAG named incident in the DAG UI. If you remove the incident entry from the list, the corresponding dag for the entry will be deleted, and so will the tasks associated with it.
- start_date: This attribute is used to set the start_date of the DAG. The start_date of a DAG is an important setting: it provides you with a way to get historical data from your ServiceNow instance. It takes values of the format xday or xdays, where x is an integer value which specifies how many days ago this DAG should start fetching data from the source. e.g. If you append the incident value to the tables attribute at 2020-04-13 13:00:00 UTC and set start_date to 2day, the generated incident dag will get data from 2020-04-11 00:00:00 UTC onwards. start_date along with frequency will help you in getting historical data.
- frequency: Refers to the schedule interval of the workflow. It can take values such as half-hourly, hourly, daily etc. frequency states the periodicity of the DAG, i.e. how often this workflow should be scheduled. e.g. If you append the incident value to the tables attribute at 2020-04-13 13:00:00 UTC, set start_date to 2day and frequency to hourly, the generated incident dag will get data from 2020-04-11 00:00:00 UTC to 2020-04-13 13:00:00 UTC on an hourly basis, i.e. it will run the incident dag 61 times (48 times + 13 times), and each time it will get one hour's worth of data: first from 2020-04-11 00:00:00 UTC to 2020-04-11 01:00:00 UTC, then from 2020-04-11 01:00:00 UTC to 2020-04-11 02:00:00 UTC, and so on. frequency together with start_date helps you in getting historical data from your SaaS platform.
- threshold: threshold is used to specify the maximum number of records fetched from the ServiceNow instance per run. By default it is placed at its maximum value of 1000; placing a greater value is not going to do any good. If the number of data records for a specific run exceeds the threshold, no data will be fetched for that period. In that case, try changing the frequency of your workflow to a lower value.
- export_type: export_type is used to specify the format of the data to be stored in the storage; the default is xml. Currently, we only support the xml format.
- storage_type: storage_type is used to specify the type of storage to be used for ingesting data. Currently, miniBRS supports Amazon S3, Dropbox, Google Drive, SFTP, MySQL, Postgres and SQL Server. The credentials for these storages are to be stored in Airflow Connections under their specific default connection ids. This attribute takes values such as "sftp", "s3", "dropbox", "googledrive", "mysql", "postgres", "mssql".
- email: If you have configured SMTP server details during installation, or have manually set them in the airflow.cfg file, then you can specify the email address here to which failure alerts should be sent.
The absence of this variable from the meta-database raises ConfigVariableNotFoundException
The other two variables, dag_creation_dates and r_config, are meant for internal usage; their presence is necessary for the normal functioning of miniBRS.
Notifications
miniBRS provides failure alert notifications via email; we use Airflow's built-in failure alerting mechanism to alert users about workflow failures. Installation of miniBRS includes configuring an SMTP server for email alerting. If you chose to install miniBRS via the installer script, you might have been prompted for the email configuration setup. If you have configured the SMTP server correctly, you will receive alerts about failed dags at the registered email address which you have provided in the config variable.
In case you haven't configured an email client at the time of installation, you can do it now. To receive email notifications you must have an email server or an email provider, e.g. Gmail, Outlook etc. Here we will walk you through setting up a Gmail account as an email client; a similar process can be used for your specific provider.
In order to use Gmail for sending notifications, you need to generate an app_password for your email account. If you already have an app_password, you can proceed; otherwise check out the References [3] link to generate an app_password for a Gmail account. An app_password is used so that you don't have to use your original password or 2-factor authentication.
- Open the airflow.cfg file from the miniBRS project folder:
~$ nano airflow.cfg
- Search the file for the [smtp] section. It should look somewhat like this; set the corresponding values for the attributes and save the file:
[smtp]
smtp_host = smtp.gmail.com             # Place your SMTP host here
smtp_starttls = True                   # Keep this as True only
smtp_ssl = False                       # Keep this as False only
smtp_user = YOUR_EMAIL_ADDRESS         # Enter your email address
smtp_password = 16_DIGIT_APP_PASSWORD  # Enter your app password
smtp_port = 587                        # This is the default SMTP port
smtp_mail_from = YOUR_EMAIL_ADDRESS    # Enter your email address
- That's it, you have an email client configured. Now restart the Airflow webserver and scheduler; a sketch of the commands is shown below.
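For example, assuming Airflow was started directly from the command line (and not as a system service), something like the following would restart both components; adapt this to however your installation runs Airflow:

~$ pkill -f "airflow webserver"; pkill -f "airflow scheduler"
~$ airflow webserver -D
~$ airflow scheduler -D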
- To receive an email notification, you need to tell miniBRS where to send notifications. This is done via the config variable: set the email attribute of the config variable to the recipient email address which is to be notified.
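For example, using the sample config from above with a placeholder recipient address:

{
"tables": ["incident","problem","sc_request"],
"start_date": "1day",
"frequency": "hourly",
"threshold": 1000,
"export_format": "xml",
"storage_type": "dropbox",
"email": "alerts@example.com"
}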