SlopeQ syntax

Spokes uses the SlopeQ 2 query syntax. The examples below are customized to show how the SlopeQ syntax can be used for searching the Polish written data sets we provide through Spokes. For practical reasons the number of examples illustrating each query in this presentation is very limited. However, a link to a page with all the results is given for each query.

Surface queries

This is the simplest type of query. Type the word in the search box and click the “search” button. The results are presented as a KWIC (Key Words In Context) list, with the number of occurrences of the lexical item displayed above it.

maszt

The same method can be used to find sequences of two or more items:

na zdrowie

Base form queries

These queries can be used to find different grammatical forms of a given word. They can be used in the case of verbs, nouns, and adjectives, as all these parts of speech tend to have numerous different forms in Polish.

The format of the query is as follows: an opening angle bracket, “lemma=”, the base form of the word sought, and a closing angle bracket. For instance, <lemma=potwór> will fetch all forms of “potwór”, including potwór, potwora, potworów, potworem, etc.

<lemma=potwór>

The base forms are: the infinitive for verbs, the nominative singular for nouns (except for pluralia tantum), and the nominative singular masculine for adjectives.

Base form queries can be combined with surface queries: <lemma=widzieć> problemy will fetch phrases such as widzę problemy, widzimy problemy, widzieli problemy, etc.

<lemma=widzieć> problemy

It is possible to combine two or more base form queries, for instance: <lemma=jeździć> <lemma=samochód>. The KWIC list for this query includes items such as jeździmy samochodami, jadę samochodem, jeżdżą samochodem, etc.

<lemma=jeździć> <lemma=samochód>
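Queries of this shape are easy to assemble programmatically. The following Python sketch only builds the query strings described above; the helper names `lemma` and `build_query` are hypothetical and are not part of Spokes or SlopeQ.

```python
def lemma(base_form: str) -> str:
    """Wrap a base form in the <lemma=...> element described above."""
    return f"<lemma={base_form}>"


def build_query(*elements: str) -> str:
    """Join surface words and lemma elements into one space-separated query."""
    return " ".join(elements)


# A base form query combined with a surface word.
print(build_query(lemma("widzieć"), "problemy"))
# Two base form queries in sequence.
print(build_query(lemma("jeździć"), lemma("samochód")))
```

The resulting strings can be pasted into the search box unchanged.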

Operators

Alternative

The pipe symbol “|” represents an alternative between two or more words, e.g. “kupować|sprzedawać|remontować samochód|samochody” will fetch all the examples of “kupować”, “sprzedawać” or “remontować” followed by either “samochód” or “samochody” from the corpus.

kupować|sprzedawać|remontować samochód|samochody

Note: This feature only works for single words. The words can be lemmatized, e.g. “<lemma=stracić>|<lemma=zgubić> portfel”, but the concordancer cannot treat multi-word strings as alternatives. For instance, “idzie zima|idzie wiosna” will NOT fetch the examples of “idzie zima” and “idzie wiosna”. Instead, the concordancer will look for “idzie zima wiosna” or “idzie idzie wiosna”, and neither of those phrases can be found in the corpus.
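The behaviour described in the note follows if the query is first split into positions at spaces and only then split into alternatives at “|”. The Python sketch below models that reading. It is an illustration under that assumption, not the concordancer's actual parser, and the function names are hypothetical.

```python
from itertools import product


def parse_alternatives(query):
    """Split a query into positions at whitespace, then each position on '|'."""
    return [position.split("|") for position in query.split()]


def expand(query):
    """List every word sequence that the alternatives in the query stand for."""
    return [" ".join(choice) for choice in product(*parse_alternatives(query))]


# The pipe binds single words, so this query has three positions, not two phrases.
print(parse_alternatives("idzie zima|idzie wiosna"))
print(expand("idzie zima|idzie wiosna"))
```

Under this model the second position offers “zima” or “idzie”, which is why the query stands for “idzie zima wiosna” and “idzie idzie wiosna” rather than for the two intended phrases.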

Slop factor

The slop factor lets users set the maximum number of words that can appear between the elements of a multi-word query (also referred to as “intervening words”). For instance, a slop factor of “1” for “wielki człowiek” will return “wielki człowiek”, as well as “wielki mały człowiek” or “wielki jest człowiek”, since one lexical item is allowed to appear between “wielki” and “człowiek”. To set the slop factor, place your query in parentheses and specify the value, preceded by the equals sign, e.g. (wielki człowiek)=1.

(wielki człowiek)=1

Setting a higher slop value is recommended for words that may be located further away from one another in the sentence. For instance, (ciebie kocham)=2 returns results such as “ciebie też bardzo kocham”, “Ciebie w nich kocham” or “ciebie i nadal kocham”.

(ciebie kocham)=2
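One way to read the slop factor is as an upper limit on the number of tokens lying between the query words when they occur in the given order. The Python sketch below models that reading only. It assumes whitespace tokenization and case-insensitive matching, the function names are hypothetical, and the real engine may count positions differently.

```python
def min_intervening(tokens, terms):
    """Smallest number of tokens between in-order matches of `terms`, or None."""
    best = None
    for start, token in enumerate(tokens):
        if token != terms[0]:
            continue
        pos = start
        try:
            # Taking the earliest occurrence of each later term keeps the span minimal.
            for term in terms[1:]:
                pos = tokens.index(term, pos + 1)
        except ValueError:
            continue  # a later term does not occur after this starting point
        gap = (pos - start + 1) - len(terms)
        if best is None or gap < best:
            best = gap
    return best


def matches_with_slop(sentence, query, slop):
    """True if the query words occur in order with at most `slop` tokens between them."""
    gap = min_intervening(sentence.lower().split(), query.lower().split())
    return gap is not None and gap <= slop


print(matches_with_slop("wielki jest człowiek", "wielki człowiek", 1))
print(matches_with_slop("ciebie też bardzo kocham", "ciebie kocham", 2))
```

In this model “wielki jest człowiek” satisfies (wielki człowiek)=1, while a sentence with two words between “wielki” and “człowiek” would need a slop factor of at least 2.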

Note: Remember that the slop factor is the total number of the intervening words. Thus, in the case of queries which are longer than two elements, one should take into account all the possible positions of a word within the string. For instance, in order to retrieve the Polish proverb “Nosił wilk razy kilka, ponieśli i wilka”, using the following string of words: “nosił razy ponieśli”, one should set the slop factor to “4”, since there are three intervening words and one comma that need to be added. Please note that punctuation marks count as words.

(nosił razy ponieśli)=4

Slop factor with relaxed order

This operator combines the regular slop factor with the option of matching the query words in a different order from the one typed in the search box. In the query, the tilde “~” is used instead of the equals sign.

(problem jest)~2
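The relaxed-order variant can be pictured as a window search: the query words may appear in any order, as long as they fit into a stretch of text no longer than the number of query words plus the slop value. The Python sketch below is a simplified model under that assumption. The function name is hypothetical, and the actual engine may charge extra for reordering or count positions differently.

```python
from collections import Counter


def matches_relaxed(sentence, query, slop):
    """True if all query words fit, in any order, in a window of len(query) + slop tokens."""
    tokens = sentence.lower().split()
    terms = query.lower().split()
    needed = Counter(terms)
    max_len = len(terms) + slop
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + len(terms), last_end + 1):
            window = Counter(tokens[start:end])
            if all(window[term] >= count for term, count in needed.items()):
                return True
    return False


print(matches_relaxed("jest to duży problem", "problem jest", 2))
print(matches_relaxed("jest to duży problem", "problem jest", 1))
```

Here the first call succeeds because two words separate “jest” and “problem”, and the second fails because a slop value of 1 leaves room for only one.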

Regex queries

Queries of this type make use of special symbols and quantifiers. Each query is a formula describing a whole set of possible strings of characters (words, sequences of words). The results are all occurrences in the data of strings that fit the formula.

Full stop

A full stop “.” is a wild card: it stands for any character. A full stop used within a word replaces a single letter. In the case of “zaka.ę”, it may be “ż” or “ł”.

zaka.ę

Plus

A plus “+” is a quantifier: the preceding character can appear one or more times.

wan+a

Asterisk

An asterisk “*” is another quantifier: the preceding character can appear zero or more times. Thus, “pan*a” will fetch both “pana” (one “n”) and “panna” (two). It would also fetch a form with no “n” at all, such as “paa”, if one occurred in the data.

pan*a

Combinations

The aforementioned symbols can be used directly with ordinary characters, but they are most useful when the wild card “.” is combined with one of the quantifiers.

Some examples:

“.+” means that at least one character, or any longer sequence of characters, must appear in this part of the query.

t.+m

“.*” is used when either nothing or any sequence of characters may appear after the sequence typed in by the user:

tren.*
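The wild card and the quantifiers can be tried out offline with any standard regex engine. The sketch below uses Python's `re` module as a stand-in and assumes that a pattern has to match a whole word, which is what the examples above suggest. The SlopeQ engine itself may differ in details, and the word list is made up for illustration.

```python
import re


def matching_words(pattern, words):
    """Return the words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [word for word in words if compiled.fullmatch(word)]


# Made-up sample words, not corpus data.
words = ["zakażę", "zakałę", "wana", "wanna", "waa", "pana", "panna",
         "tam", "team", "tm", "tren", "trend", "trening"]

for pattern in ["zaka.ę", "wan+a", "pan*a", "t.+m", "tren.*"]:
    print(pattern, matching_words(pattern, words))
```

Note that “t.+m” rejects “tm”, because “.+” requires at least one character, while “tren.*” accepts the bare “tren”, because “.*” also allows nothing.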

slopeq_for_nkjp.1434270980.txt.gz · Last modified: 2015/06/14 10:36 by mmolenda