Endpoint configuration

Before you deploy an endpoint, you must tell it which corpora it will search and what kind of corpora they are. This is done in the configuration files located in the /config directory, one per corpus. Remember to remove the test configuration files before publishing an endpoint.

Configuration files

You need one configuration file per corpus. All files are in JSON and must have the .json extension. Before you publish or reload the server, make sure all files in /config are valid JSON, e.g. using JSONLint.
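
As an alternative to an online validator, the check can be scripted. The following is a minimal sketch using only the Python standard library; the function name find_invalid_configs is hypothetical and not part of the endpoint code, and the directory path passed to it is up to you.

```python
import json
from pathlib import Path


def find_invalid_configs(config_dir):
    """Return {file name: error message} for every .json file that fails to parse."""
    errors = {}
    for path in sorted(Path(config_dir).glob("*.json")):
        try:
            with open(path, encoding="utf-8") as f:
                json.load(f)
        except (json.JSONDecodeError, UnicodeDecodeError) as e:
            # Both exceptions mean the file is not usable as JSON.
            errors[path.name] = str(e)
    return errors


# Usage (the path is an assumption; point it at your own /config directory):
#     for name, message in find_invalid_configs("config").items():
#         print(f"{name}: {message}")
```

An empty result means every .json file in the directory parsed successfully. It does not check that the parameters themselves make sense.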

The name of each JSON file, excluding the extension, serves as the ID of the corpus it describes. This ID is used in the request URLs. A request to a corpus whose ID is CORPUS_ID must be sent to http[s]://BASE_URL_OF_THE_ENDPOINT/fcs-endpoint/CORPUS_ID/.
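
To illustrate the naming rule, here is a small sketch that derives the request URL from a configuration file name. The helper corpus_request_url, the host example.org and the file name my_corpus.json are placeholders chosen for this example.

```python
from pathlib import Path


def corpus_request_url(base_url, config_file):
    """Build the request URL for the corpus described by config_file."""
    corpus_id = Path(config_file).stem  # file name without the .json extension
    return f"{base_url.rstrip('/')}/fcs-endpoint/{corpus_id}/"


print(corpus_request_url("https://example.org", "config/my_corpus.json"))
# https://example.org/fcs-endpoint/my_corpus/
```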

A configuration file for a corpus is a dictionary (a JSON object) of parameters. All possible parameters are listed below.

List of parameters

Basic info about the corpus

  • platform (string; required) – name of the corpus platform. Possible values are annis, tsakorpus and litterae.

  • resource_base_url (string; required) – base URL of the online corpus the endpoint is going to communicate with. For Tsakorpus corpora, do not include the final /search part.
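
A configuration file with only the two required parameters might look like the JSON text embedded in the sketch below. The URL is a placeholder, and check_basic_info is a hypothetical helper written for this example. It is not a function of the endpoint, and it only checks what is stated in the list above.

```python
import json

# Hypothetical content of config/my_corpus.json; the URL is a placeholder.
MINIMAL_CONFIG = """
{
    "platform": "tsakorpus",
    "resource_base_url": "https://example.org/my_corpus"
}
"""

ALLOWED_PLATFORMS = {"annis", "tsakorpus", "litterae"}


def check_basic_info(config):
    """Return a list of problems found in the two required parameters."""
    problems = []
    for key in ("platform", "resource_base_url"):
        value = config.get(key)
        if not isinstance(value, str) or not value:
            problems.append(f"{key} is required and must be a non-empty string")
    platform = config.get("platform")
    if isinstance(platform, str) and platform and platform not in ALLOWED_PLATFORMS:
        problems.append(f"unknown platform: {platform}")
    base_url = config.get("resource_base_url")
    if platform == "tsakorpus" and isinstance(base_url, str):
        if base_url.rstrip("/").endswith("/search"):
            problems.append("resource_base_url should not include the final /search part")
    return problems


print(check_basic_info(json.loads(MINIMAL_CONFIG)))  # []
```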

Basic info about the endpoint

These values are sometimes sent back to the FCS Aggregator together with the search results.

  • url_path (string) – URL of this endpoint.

  • transport_protocol (string) – transport protocol (http or https) used by this endpoint.

  • host (string) – host name of this endpoint (without the protocol).

  • port (string) – port number of this endpoint.

Capabilities and settings of the endpoint

  • basic_search_capability (Boolean; defaults to True) – whether basic search is possible.

  • advanced_search_capability (Boolean; defaults to False) – whether advanced search (with a CQL-like query language) is possible.

  • hits_supported (Boolean; defaults to True) – whether the simple dataview (hits; only includes the text) is available.

  • adv_supported (Boolean; defaults to False) – whether the advanced dataview (adv; includes some annotation) is available.

  • max_hits (integer; defaults to 10) – maximum number of hits the endpoint will send to the aggregator.

  • search_lang_id (string) – ID of the language / layer to search in (for platforms and corpora that support multiple languages or multiple text layers, e.g. parallel translations).
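
The sketch below shows how the documented defaults combine with values from a configuration file. The dictionary merge illustrates the effect only and is not necessarily how the endpoint applies defaults internally; the fragment is hypothetical. Note that inside a JSON file Boolean values are written in lowercase (true, false).

```python
import json

# Defaults as documented in the list above.
DEFAULTS = {
    "basic_search_capability": True,
    "advanced_search_capability": False,
    "hits_supported": True,
    "adv_supported": False,
    "max_hits": 10,
}

# Hypothetical fragment of a corpus configuration file.
CONFIG_FRAGMENT = """
{
    "advanced_search_capability": true,
    "adv_supported": true,
    "max_hits": 20
}
"""

# Values from the file override the defaults; everything else keeps its default.
settings = {**DEFAULTS, **json.loads(CONFIG_FRAGMENT)}
print(settings["basic_search_capability"], settings["max_hits"])  # True 20
```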

Corpus metadata

The following values may be sent to the Aggregator when it sends an explain request, i.e. asks the endpoint to tell it more about the corpora it covers.

  • titles (list of dictionaries) – title(s) of the resource (use multiple dictionaries if they are in multiple languages). Each title is described by a dictionary with three keys: content (the title itself), lang (an ISO 639 code of the language of the title) and primary (Boolean; optional; marks this version of the title as primary).

  • descriptions (list of dictionaries) – description(s) of the resource; works the same way as titles.

  • authors (list of dictionaries) – author(s) of the resource; works the same way as titles.
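
As an illustration of the structure described above, here is a hypothetical metadata fragment together with a small helper that picks the primary entry. All titles, descriptions and names are invented. Falling back to the first entry when none is marked as primary is an assumption of this sketch, not documented endpoint behavior.

```python
import json

# Hypothetical metadata fragment of a corpus configuration file.
METADATA = """
{
    "titles": [
        {"content": "Example Corpus", "lang": "en", "primary": true},
        {"content": "Beispielkorpus", "lang": "de"}
    ],
    "descriptions": [
        {"content": "A small corpus used for illustration.", "lang": "en"}
    ],
    "authors": [
        {"content": "Jane Doe", "lang": "en"}
    ]
}
"""


def primary_entry(entries):
    """Return the entry marked as primary, or the first entry if none is marked."""
    for entry in entries:
        if entry.get("primary"):
            return entry
    return entries[0] if entries else None


metadata = json.loads(METADATA)
print(primary_entry(metadata["titles"])["content"])  # Example Corpus
```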

Query translation

POS tags are required to be in the UD standard, per FCS specifications. If a corpus only has a non-UD morphological annotation, you can use this workaround.

  • pos_convert (list of rules, each rule is a list with exactly two items) – rules that convert a corpus-specific morphological annotation string into a UD tag. Each rule contains two strings. The first string is a regex that is applied to the tag sequence from the corpus. The second is the UD tag that is sent to the Aggregator instead if there is a match. Rules are applied in the order of their appearance.

  • pos_convert_reverse (dictionary) – rules that convert UD tags from a query to corpus-specific tags or expressions. Keys are UD tags, values are expressions they have to be replaced with.
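
The two parameters can be pictured with the following sketch. The rules and corpus tags are invented, and the functions to_ud and from_ud are hypothetical. Three details are assumptions of this example rather than documented behavior: the regex is applied with re.search, the first matching rule wins, and a UD tag without a reverse rule is passed through unchanged. In the JSON file itself, each backslash in a regex has to be doubled (e.g. "^N\\b").

```python
import re

# Hypothetical pos_convert rules: [regex over the corpus tag string, UD tag].
POS_CONVERT = [
    [r"^V\b", "VERB"],
    [r"^N\b", "NOUN"],
    [r"^(A|ADJ)\b", "ADJ"],
]

# Hypothetical pos_convert_reverse: UD tag -> corpus-specific expression.
POS_CONVERT_REVERSE = {"VERB": "V", "NOUN": "N", "ADJ": "(A|ADJ)"}


def to_ud(corpus_tags, rules=POS_CONVERT):
    """Return the UD tag of the first rule whose regex matches, else None."""
    for pattern, ud_tag in rules:
        if re.search(pattern, corpus_tags):
            return ud_tag
    return None


def from_ud(ud_tag, mapping=POS_CONVERT_REVERSE):
    """Return the corpus-specific expression for a UD tag (unchanged if unmapped)."""
    return mapping.get(ud_tag, ud_tag)


print(to_ud("N,sg,nom"))  # NOUN
print(from_ud("ADJ"))     # (A|ADJ)
```

Because the order of the rules matters, more specific patterns are best placed before more general ones.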

ANNIS tier configuration

There are two parameters that define how tier/layer names in the search query should map to the layer names in ANNIS annotations, and how those should map to what is returned to the client.

  • tier_convert_reverse (dictionary) – tells the endpoint which tier names in the query should be mapped to differently named tiers in ANNIS. For example, if it contains a key-value pair "lemma": "GlossType" and the query is lemma="CAN2B", then the value CAN2B will be searched for in the tiers named GlossType (possibly with a :: prefix, e.g. PersonA::GlossType). By default, text is mapped to tok and all the rest is left as is. An ANNIS tier indicated as an equivalent of text here is treated as a token-level tier.

  • tier_convert (dictionary) – tells the endpoint which tiers from ANNIS should end up in the response, and (possibly) what they should be called in the output XML (<Layer id="layer_name">). Tiers not listed in this dictionary, apart from the token tier, will be disregarded. If you want to have a tier in the output, but do not want to rename it, just use identical key and value for it.
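
The two mappings can be sketched as follows. The function names and the sample tier values are hypothetical. The sketch assumes that keys of tier_convert are ANNIS tier names and values are the layer names used in the output. It leaves out both the handling of :: prefixes and the token tier, which is always kept.

```python
# Hypothetical values of the two parameters.
TIER_CONVERT_REVERSE = {"lemma": "GlossType"}
TIER_CONVERT = {"GlossType": "lemma", "pos": "pos"}  # identical key and value: no renaming


def annis_tier_for_query(query_tier, tier_convert_reverse=TIER_CONVERT_REVERSE):
    """Map a tier name from the query to the tier name searched in ANNIS."""
    if query_tier in tier_convert_reverse:
        return tier_convert_reverse[query_tier]
    # Default behavior described above: text -> tok, everything else unchanged.
    return "tok" if query_tier == "text" else query_tier


def output_layers(annis_tiers, tier_convert=TIER_CONVERT):
    """Keep only the tiers listed in tier_convert, renamed for the output."""
    return {
        tier_convert[name]: value
        for name, value in annis_tiers.items()
        if name in tier_convert
    }


print(annis_tier_for_query("lemma"))  # GlossType
print(annis_tier_for_query("text"))   # tok
print(output_layers({"GlossType": "CAN2B", "pos": "V", "comment": "x"}))
# {'lemma': 'CAN2B', 'pos': 'V'}
```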