# NLQ Integration

This is the backend for the NLQ (Natural Language Query) demo. The code includes Python scripts that process user input by searching for existing insights with similar names and by creating new insights based on the input. It also includes an API server (using Flask) that lets GoodData.UI call the scripts.

The app is meant to be called from a front-end GD.UI app for GoodData.CN (front-end available here: https://github.com/gooddata/ui-sdk-examples/tree/master/nlq). 

## Pre-requisites
* This app was built using Python 3 (use 3.7 or above), GoodData.CN 1.5.0, and the GoodData Python SDK (https://gooddata-sdk.readthedocs.io/en/latest/services.html#catalog-service).
* The front-end GD.UI app requires yarn.

## App Description
* The GD.UI app typically runs on https://localhost:8443/ (see: https://github.com/gooddata/ui-sdk-examples/tree/master/nlq).
* This app connects to a GoodData.CN instance with imported layout [gd-cn-layout.json](gd-cn-layout.json). The GoodData.CN instance typically runs on http://localhost:3000/ and is run via a Docker container.
* The Python API server typically runs on http://localhost:5000/.

The host and ports can be adjusted as necessary. 

## API Description
The API server has three endpoints: /search, /create, and /delete. 

* /search uses the nlp_search function in nlp.py to search existing insights. It takes the user's input as a string in the URL, with words separated by + signs. It fuzzy-matches the input against existing insight names (using the Levenshtein ratio) and also compares it with synonyms. The endpoint returns the matching insights with their IDs and names. The JSON schema for the response is: {'dropdown_list': titles, 'title2id': title2id, 'title2url': title2url}
* /create takes the user's input as a string in the URL, with words separated by + signs. It uses the create_insight_definition function in nlp.py to create a new insight via the GoodData.CN API. The user input is read and processed according to pre-defined rules (described in more detail below). The endpoint returns the newly created insight's ID and title (JSON schema: {'title': user_input, 'id': insight_id}). The insight title is the user input and the insight ID is an MD5 hash of the input text.
* /delete uses the user input to delete an existing insight. The input must match the ID of the insight that you are trying to delete.
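
The fuzzy match behind /search can be pictured with the minimal sketch below. It is not the code from nlp.py: the helper names (`levenshtein_distance`, `similarity`, `search_titles`), the 0.6 threshold, and the sample titles are assumptions made for illustration. The score is a simple normalised edit distance, which may differ from the exact ratio formula of the Levenshtein library the app uses, and synonyms are left out.

```python
from typing import List


def levenshtein_distance(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def search_titles(user_input: str, titles: List[str], threshold: float = 0.6) -> List[str]:
    """Return the titles that are similar enough to the input, best match first."""
    # The endpoint receives words joined by "+", so turn them back into spaces.
    query = user_input.replace("+", " ").lower()
    scored = [(similarity(query, title.lower()), title) for title in titles]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for score, title in scored if score >= threshold]


matches = search_titles(
    "revenue+by+customer+state",
    ["Top 10 SKUs", "Revenue by customer state"],
)
print(matches)
```

Because the score compares whole strings, two inputs that share a long prefix such as "Show me" score high even when the rest differs, which is the false-positive behaviour listed under Known Bugs and Limitations.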

## Set up
In constants.py, update the WORKSPACE_ID to the workspace you are using. If necessary, also update the host and token.  

## Description of user input
It is not recommended to include special characters in the user input.

When creating a new insight, the script follows the simple rules described below:
* Start the sentence with "What is", "What are", "Show me", "Show my", "Compute the", "Compute my", or nothing.
* Follow with the metric or fact you want to see. If using a fact, you can also enter the aggregation type: sum, average, max, min, and median are supported, and some synonyms also work. The metric or fact name should match as closely as possible what it is called in the LDM/metrics (we initially planned on supporting synonyms, but the performance was too poor, so it needs to be an exact or close match).
* Optionally add "by" followed by one or two attribute names to slice the metric by. Anything without a slice defaults to a headline report.
* Optionally add "where" followed by an attribute name to filter by, then "is", "are", "equals", "=", or "equal", followed by the value(s) you want to filter the attribute by. Each value must be an exact match. If you want more than one value, separate them with "and" or "or".
* Anywhere in the sentence, you can put the chart type you want to see ("bar", "donut", "line", "pie", "headline", or "table") followed by "chart", "graph", or "report". If no chart type is specified, it defaults to table, or to headline if no slice-by attributes are included.

Samples:
* Total revenue by customer state
* Show me revenue by customer state as bar graph
* Show me revenue by product category as pie chart
* Show me headline chart for total revenue
* What's revenue by product name as pie chart
* What is revenue by product name
* Show me revenue by product name where customer state is California
* Show me revenue by quarter where customer state is California line chart
* What is revenue by customer region and customer state
* Show me total revenue
* What is total revenue
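
The rules above can be sketched as a small parser. This is an illustration under stated assumptions, not the logic in nlp.py: `parse_query` and its return keys are hypothetical names, the handling of "as" before the chart type and of a leftover leading "for" is inferred from the samples, and the sketch skips aggregation types and matching against the LDM catalog. It does not handle every sample (for example, "What's" is not treated as a leading phrase).

```python
import re

LEAD_PHRASES = ("what is", "what are", "show me", "show my", "compute the", "compute my")
# Chart type followed by "chart", "graph", or "report"; an optional "as" is an assumption.
CHART_RE = re.compile(r"\b(?:as\s+)?(bar|donut|line|pie|headline|table)\s+(?:chart|graph|report)\b")
# "where <attribute> is|are|equals|equal|= <values>" at the end of the sentence.
FILTER_RE = re.compile(r"\bwhere\s+(.+?)\s+(?:is|are|equals|equal|=)\s+(.+)$")


def parse_query(text):
    """Split a sentence into measure, slice attributes, filter, and chart type."""
    query = " ".join(text.lower().split())

    # The chart type may appear anywhere, so pull it out first.
    chart = None
    match = CHART_RE.search(query)
    if match:
        chart = match.group(1)
        query = " ".join((query[:match.start()] + " " + query[match.end():]).split())

    # Drop the optional leading phrase.
    for phrase in LEAD_PHRASES:
        if query.startswith(phrase + " "):
            query = query[len(phrase) + 1:]
            break
    # Assumption: a leading "for" can remain after "Show me headline chart for ...".
    query = re.sub(r"^for\s+", "", query)

    # Optional filter: attribute name plus values separated by "and" / "or".
    filter_attribute, filter_values = None, []
    match = FILTER_RE.search(query)
    if match:
        filter_attribute = match.group(1)
        filter_values = [v.title() for v in re.split(r"\s+(?:and|or)\s+", match.group(2))]
        query = query[:match.start()].strip()

    # Optional slice: everything after the first " by ".
    measure, _, slice_part = query.partition(" by ")
    attributes = re.split(r"\s+and\s+", slice_part) if slice_part else []

    if chart is None:
        chart = "table" if attributes else "headline"

    return {
        "measure": measure,
        "attributes": attributes,
        "filter_attribute": filter_attribute,
        "filter_values": filter_values,
        "chart": chart,
    }


parsed = parse_query("Show me revenue by quarter where customer state is California line chart")
print(parsed)
```

The returned measure and attribute strings would still need to be matched against the names in the LDM before an insight can be built.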

Capitalization does not matter for user input. The script converts everything to lower case when matching against existing insights and when matching facts, metrics, and attributes to create new insights. For attribute values, the script currently applies title case to all values (this needs improvement for a production-ready solution).
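
To see why title-casing every value is fragile, here is a small sketch. It assumes the normalisation is equivalent to Python's built-in `str.title()` applied to the lower-cased input; `normalise_filter_value` is a hypothetical helper, not a function from the script.

```python
def normalise_filter_value(value: str) -> str:
    """Lower-case the value, then capitalise the first letter of every word."""
    return value.lower().title()


for raw in ["new jersey", "district of columbia", "usa", "mcdonald's"]:
    print(f"{raw!r} -> {normalise_filter_value(raw)!r}")
```

Values such as "New Jersey" come out as expected, but acronyms ("Usa"), small words ("District Of Columbia"), and apostrophes ("Mcdonald'S") do not, so any value stored in the data with different casing will not match.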

Note that you must type the exact filter values. The app currently does not check whether a filter value is a valid option, so it does not prevent the user from filtering on a value that does not exist in the dataset. If a user enters "where customer state is Germany" and Germany is not a valid option, the app still adds this filter value to the visualization and shows an unexpected result. The app also cannot expand abbreviations to their full spelling: if a user enters CA instead of California, it passes CA as the filter value and shows an unexpected visualization.

## Technical Notes
* constants.py stores the metadata for GoodData.CN and app.py. Be sure to update the workspace ID and socket ports. Currently the app can only attach to one workspace.
* insight_template.json, attribute_template.json, and measure_template.json are the templates used to build the insight metadata for the API call. Depending on which chart the app is going to create, it fills out the templates via create_insight_meta(), add_measure(), add_attribute(), add_sort(), and add_filter(). If you want to extend create_insight_meta() to visualization types that are not currently supported, add a branch to its if-else statement and reuse the add_*() functions. Note that add_sort() and add_filter() may not be fully functional; be sure to work on those functions if you rely on them.
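
When filling a template more than once, it helps to copy it first. The sketch below shows the idea with a heavily simplified, hypothetical stand-in for attribute_template.json; the dictionary shape, the `add_attribute_sketch` helper, and the sample identifiers are assumptions and do not reproduce the real templates or the real add_attribute() signature.

```python
import copy

# Hypothetical, simplified stand-in for the contents of attribute_template.json.
ATTRIBUTE_TEMPLATE = {
    "attribute": {
        "localIdentifier": "",
        "displayForm": {"identifier": {"id": "", "type": "label"}},
    }
}


def add_attribute_sketch(bucket_items, label_id, local_identifier):
    """Append a filled-in copy of the template to a bucket's item list."""
    # Deep-copy so each attribute gets its own nested dictionaries.
    # Mutating the shared template instead would make every entry in the
    # list point at the same object and show the last values filled in.
    item = copy.deepcopy(ATTRIBUTE_TEMPLATE)
    item["attribute"]["localIdentifier"] = local_identifier
    item["attribute"]["displayForm"]["identifier"]["id"] = label_id
    bucket_items.append(item)
    return item


items = []
add_attribute_sketch(items, "customer.region", "a1")
add_attribute_sketch(items, "customer.state", "a2")
print([entry["attribute"]["localIdentifier"] for entry in items])
```

Reusing one mutable template object without copying is one plausible way to end up with duplicated attributes in the metadata, which is among the suspected causes of the 500 errors listed under Known Bugs and Limitations.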

## Known Bugs and Limitations
* If the user enters a long sentence, the fuzzy matching against existing insights is very slow. For example, "What is revenue where customer state is texas or california or new jersey or arizona" takes around 10 minutes to run. Performance improves vastly if synonyms are removed from the fuzzy matching.
* Fuzzy matching on the full user input creates false positives when the user enters similar input to create new insights (for example, if the user creates several insights starting with "Show me...", then enters another search that starts with "Show me..." it will match many of the existing insights even if all/most other words are different). This can be fixed/improved by improving the way the fuzzy matching works, removing "Show me"/"What is"/etc. from the input before trying to match, and removing those phrases from the input before creating insight names. 
* Fuzzy matching has many limitations (for example, word order and string length matter).
* The filter by attribute values must be exact matches to the data.
* Currently the app can create line charts, bar charts, tables, pie charts, donut charts, and headlines. If you want to support other chart types available in AD, extend the code block that fills in the metadata in create_insight_meta() in create_insight.py.
* There is little to no error handling. Be careful with your input!
  * There is no logic in the script to ensure that the user input can create a valid insight. For example, it does not check whether the fact can actually be sliced by the attribute provided.
  * As mentioned, the app does not convert abbreviations to their full spelling or vice versa, which leads to wrong filter values being added.
* If the user does not specify a visualization, the app creates a table (or a headline if no slice-by attribute is provided), which is consistent with the user experience in AD.
* API endpoint responses need improvement. Currently, we use os.system() to make API calls to GoodData.CN. We may have to switch to an HTTP client such as requests if we need to read the API response.
* We are using an MD5 hash of the user input text as the new insight IDs. If the user creates an insight and then tries to create another insight using the exact same input text, it will fail due to the insight ID already existing. We can improve this by adding some amount of random characters to the hash. 
* Currently the app can only handle one or two slice by attributes, and only one filter by attribute. It can only handle one or two metrics/facts. There are some fact aggregations that it does not currently support. 
  * This includes RUNSUM, RUNAVG, etc. 
  * The app does not currently support COUNT
* In the LDM, users can create attribute and fact names with *any* value. There are many examples that would break the script. This also applies for metrics. For example, if the user has a dataset with an attribute column called "show my revenue", we would not be able to match it. This seems like a weird thing to name an attribute, but there may be metrics with names that will not be able to be processed by this script as well such as "revenue by category". 
* There may be attributes in the LDM that have the exact same name but are part of different datasets, such as an attribute named "Category" in a product dataset and another attribute named "Category" in a customer dataset. There is currently no way in the script to handle this case.
* The script retrieves the datasets, their facts and attributes, and the metrics only once. If anything is changed in the LDM, or a fact, attribute, or metric is created or deleted, the script will not see the change until you restart the app. This should not be a major issue for the demo. It can be fixed by retrieving the catalog within the create_insight_definition function, so that it is retrieved every time the user wants to create a new insight.
* Previously, there were 500 errors when calling /create. They were potentially the result of one of the following causes:
  * Duplicated attributes in the metadata. This originated from reusing the attribute metadata template, which caused duplicated attributes to be filled into the metadata.
  * A wrong localIdentifier in attribute_dict['localIdentifier'] under add_attribute(). Verify the right localIdentifier by comparing with an insight created by AD, and pass the right localIdentifier in add_attribute().
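
The insight ID collision described above can be illustrated with a short sketch. It assumes the ID is the hex digest of the UTF-8 encoded input text; whether the script normalises the text before hashing is not known here. `insight_id` and `unique_insight_id` are hypothetical helpers, and the underscore separator and suffix length are arbitrary choices.

```python
import hashlib
import secrets


def insight_id(user_input: str) -> str:
    """Deterministic ID: the same input text always yields the same ID."""
    return hashlib.md5(user_input.encode("utf-8")).hexdigest()


def unique_insight_id(user_input: str, suffix_bytes: int = 4) -> str:
    """Same hash plus a random hex suffix, so repeated inputs get distinct IDs."""
    return f"{insight_id(user_input)}_{secrets.token_hex(suffix_bytes)}"


text = "Show me total revenue"
print(insight_id(text) == insight_id(text))                # True: second create collides
print(unique_insight_id(text) == unique_insight_id(text))  # almost certainly False
```

A random suffix removes the collision, at the cost that the ID can no longer be recomputed from the input text alone, so callers such as /delete would need the ID returned by /create.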
