PELCRA for NKJP 2

This page gives an overview of the PELCRA for NKJP 2 search engine, which was developed at the University of Łódź as part of the CLARIN-PL infrastructure. The search engine uses the SlopeQ query syntax. For practical reasons, only a few examples are shown for each query on this page; however, a link to a page with all the results is given for each query.

Word form queries

This is the simplest type of query. Type the word into the search box and click the “search” button. The results are presented as a KWIC (Key Word in Context) list, with the number of sentences matching the query displayed above it.

maszt

The same method might be used to find sequences of two or more items:

na zdrowie

Base form queries

These queries can be used to find different grammatical forms of a given word. They can be used in the case of verbs, nouns, and adjectives, as all these parts of speech tend to have numerous different forms in Polish.

The format of the query is as follows: an opening angle bracket, “lemma=”, the base form of the word sought, and a closing angle bracket. For instance, <lemma=potwór> will fetch all forms of “potwór”, including potwór, potwora, potworów, potworem, etc.

<lemma=potwór>

The base forms are: the infinitive for verbs, the nominative singular for nouns (except for pluralia tantum), and the nominative singular masculine for adjectives.

Base form queries can be combined with surface queries: <lemma=widzieć> problemy will fetch phrases such as widzę problemy, widzimy problemy, widzieli problemy, etc.

<lemma=widzieć> problemy

It is possible to combine two or more base form queries, for instance: <lemma=jeździć> <lemma=samochód>. The KWIC list for this query includes items such as jeździmy samochodami, jadę samochodem, jeżdżą samochodem, etc.

<lemma=jeździć> <lemma=samochód>

Operators

Alternative

The pipe symbol “|” expresses an alternative between two or more words. For example, “kupować|sprzedawać|remontować samochód|samochody” will fetch all corpus examples of “kupować”, “sprzedawać” or “remontować” followed by either the singular or the plural nominative form of the noun “samochód”.

kupować|sprzedawać|remontować samochód|samochody

Note: This feature only works for single words. The words can be lemmatized, e.g. “<lemma=stracić>|<lemma=zgubić> portfel”, but the concordancer cannot treat multi-word strings as alternatives. For instance, “idzie zima|idzie wiosna” will NOT fetch the examples of “idzie zima” and “idzie wiosna”. The pipe only joins the two words adjacent to it, so the concordancer will instead look for “idzie zima wiosna” or “idzie idzie wiosna”. Neither of those phrases can be found in the corpus.
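The behaviour described in the note can be illustrated with a short Python sketch. It is not the engine's code: the function name `expand_alternatives` is hypothetical, and the sketch only assumes what the note states, namely that a query is split into positions at whitespace and that “|” separates alternatives within a single position. Negation (“|!”) is not handled here.

```python
from itertools import product


def expand_alternatives(query: str) -> list[str]:
    """Expand a query in which "|" separates single-word alternatives.

    Illustrative only: assumes positions are separated by whitespace and
    that "|" never spans more than one position.
    """
    # Each whitespace-separated position holds one or more alternatives.
    positions = [position.split("|") for position in query.split()]
    # Every combination of one alternative per position is a candidate phrase.
    return [" ".join(words) for words in product(*positions)]


# The pipe binds only "zima" and the second "idzie", not the two-word phrases.
print(expand_alternatives("idzie zima|idzie wiosna"))
# ['idzie zima wiosna', 'idzie idzie wiosna']

# Three verbs times two noun forms gives six candidate phrases.
print(len(expand_alternatives("kupować|sprzedawać|remontować samochód|samochody")))
# 6
```

Lemmatized terms such as <lemma=stracić>|<lemma=zgubić> contain no spaces, so the same splitting applies to them.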

Slop factor

The slop factor lets users set the maximum number of words that can appear between the elements of a multi-word query (also referred to as “intervening words”). For instance, with the slop factor set to 1, the query “wielki człowiek” will return “wielki człowiek” as well as “wielki mały człowiek” or “wielki jest człowiek”, since one lexical item is allowed to appear between “wielki” and “człowiek”. To set the slop factor, choose the value using the slider located below the search bar. In this document the position of the slider is indicated by an expression of the form (Slop factor = 1), (Slop factor = 2), and so on.

wielki człowiek (Slop factor = 1)

Setting a higher slop value is recommended for words that may be located further apart in the sentence. For instance, “ciebie kocham” with the slop factor set to 2 returns results such as “ciebie też bardzo kocham”, “Ciebie w nich kocham” or “ciebie i nadal kocham”.

ciebie kocham (Slop factor = 2)

Note: Remember that the slop factor is the total number of intervening words across the whole query, and that punctuation marks count as words. Thus, for queries longer than two elements, one should add up the gaps between all the neighbouring elements. For instance, to retrieve the Polish proverb “Nosił wilk razy kilka, ponieśli i wilka” using the string “nosił razy ponieśli”, count the items between the query elements: “wilk” between “nosił” and “razy”, then “kilka” and the comma between “razy” and “ponieśli”. That gives three intervening items, so the slop factor must be at least 3; the example uses 4, which also admits the match because the slop factor is an upper bound.

nosił razy ponieśli (Slop factor = 4)
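The counting rule used in this section (every word or punctuation mark lying between the query elements counts once) can be illustrated with a short Python sketch. The tokenizer and the function name `min_ordered_slop` are assumptions made for illustration; the engine's own tokenization may differ in detail.

```python
import re
from typing import Optional


def tokenize(sentence: str) -> list[str]:
    # Words and punctuation marks become separate, lowercased tokens.
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def min_ordered_slop(query: str, sentence: str) -> Optional[int]:
    """Smallest number of intervening tokens for an in-order match.

    Returns None when the query words do not occur in the given order.
    """
    terms = query.lower().split()
    tokens = tokenize(sentence)
    if not terms:
        return None
    best = None
    for start, token in enumerate(tokens):
        if token != terms[0]:
            continue
        pos = start
        found = True
        for term in terms[1:]:
            try:
                # Earliest later occurrence keeps the span as short as possible.
                pos = tokens.index(term, pos + 1)
            except ValueError:
                found = False
                break
        if found:
            intervening = (pos - start + 1) - len(terms)
            if best is None or intervening < best:
                best = intervening
    return best


print(min_ordered_slop("wielki człowiek", "wielki jest człowiek"))    # 1
print(min_ordered_slop("ciebie kocham", "ciebie też bardzo kocham"))  # 2
print(min_ordered_slop("ciebie kocham", "Ciebie w nich kocham"))      # 2
```

A sentence is returned for a query when this count does not exceed the value chosen on the slider.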

Slop factor with relaxed order

This operator combines the regular slop factor with the option to match the words typed in the search box in any order. In NKJP 2, relaxed-order queries require unchecking the “order” box. In this wiki, relaxed order is indicated by the expression (“order” unchecked).

problem jest (Slop factor = 2) ("order" unchecked)

Unchecking the “order” box with the slop factor set to 0 gives relaxed word order without any intervening words.

ciebie na ("order" unchecked)

The slop factor can be combined with other functionalities. For instance, the query <lemma=widzieć> kogo with (Slop factor = 2) and (“order” unchecked) will fetch examples of every form of the verb “widzieć” combined with “kogo”, appearing in any order and separated by at most two intervening words.

<lemma=widzieć> kogo (Slop factor = 2) ("order" unchecked)
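The interaction between the slop factor and the “order” box can be sketched in Python as follows. This is a brute-force illustration under assumptions, not the engine's algorithm: the function `matches` and the two sample sentences are hypothetical, and lemma matching is left out so that only surface forms are compared.

```python
from itertools import product


def matches(terms: list[str], tokens: list[str], slop: int = 0, ordered: bool = True) -> bool:
    """Check whether all terms occur within the allowed slop.

    `ordered=False` corresponds to unchecking the "order" box.
    """
    if not terms:
        return False
    # Candidate token positions for each query term.
    candidates = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    for positions in product(*candidates):
        if len(set(positions)) != len(positions):
            continue  # one token cannot satisfy two query terms
        if ordered and list(positions) != sorted(positions):
            continue  # terms must appear in the order they were typed
        intervening = max(positions) - min(positions) + 1 - len(terms)
        if intervening <= slop:
            return True
    return False


# Hypothetical sentences, already split into tokens.
sentence_a = "na ciebie".split()
sentence_b = "jest tu pewien problem".split()

print(matches(["ciebie", "na"], sentence_a, slop=0, ordered=True))      # False
print(matches(["ciebie", "na"], sentence_a, slop=0, ordered=False))     # True
print(matches(["problem", "jest"], sentence_b, slop=2, ordered=False))  # True
print(matches(["problem", "jest"], sentence_b, slop=1, ordered=False))  # False
```

Trying every combination of positions is slow for long sentences, but it keeps the two conditions (order and number of intervening tokens) easy to read.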

Negation

This operator excludes specified variants of query terms from the results. Consequently, it must be combined with query types that produce variation in the results. Negation is marked by a pipe sign followed by an exclamation mark, “|!”, which is to be read as “but not”. The example shows how it is used with a base form query: the specified form of the word is excluded from the results.

<lemma=prosić>|!proszę
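The effect of “but not” on a base form query can be sketched in Python. The mini-lexicon `LEMMAS` and the function `lemma_but_not` are hypothetical stand-ins; the real engine relies on the corpus's own morphological annotation.

```python
# Hypothetical mini-lexicon mapping surface forms to their base forms.
LEMMAS = {
    "proszę": "prosić",
    "prosi": "prosić",
    "prosimy": "prosić",
    "prosili": "prosić",
    "dziękuję": "dziękować",
}


def lemma_but_not(lemma: str, excluded: set[str], tokens: list[str]) -> list[str]:
    """Keep tokens whose base form is `lemma`, except the excluded forms."""
    return [
        token
        for token in tokens
        if LEMMAS.get(token) == lemma and token not in excluded
    ]


tokens = ["proszę", "prosimy", "dziękuję", "prosili", "proszę", "prosi"]

# Counterpart of the query <lemma=prosić>|!proszę
print(lemma_but_not("prosić", {"proszę"}, tokens))
# ['prosimy', 'prosili', 'prosi']
```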

Regex queries

Queries of this type make use of special symbols and quantifiers. Each query is a pattern describing a whole set of possible character strings (words, sequences of words). The results are all occurrences in the data of strings that match the pattern.

Full stop

A full stop “.” is a wildcard: it stands for any single character. A full stop used within a word replaces exactly one letter. In the case of “zaka.ę”, it may be “ż” or “ł”.

zaka.ę

Plus

A plus “+” is a quantifier: the preceding character can appear one or more times.

wan+a

Asterisk

An asterisk “*” is another quantifier: the preceding character can appear zero or more times. Thus, “pan*a” will fetch both “pana” (one “n”) and “panna” (two “n”s). With zero occurrences of “n” the pattern describes the string “paa”, which would also be matched if it occurred in the corpus.

pan*a
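The three symbols introduced so far behave the same way in Python's `re` module, which is used here only as a stand-in for the search engine. The sketch assumes that a pattern has to cover a whole word, hence `re.fullmatch`; the helper `matching_words` and the word list (which includes a few non-words to show the boundaries) are made up for illustration.

```python
import re


def matching_words(pattern: str, words: list[str]) -> list[str]:
    # fullmatch: the pattern must describe the entire word, not a substring.
    return [word for word in words if re.fullmatch(pattern, word)]


words = ["zakażę", "zakałę", "zakaz", "wana", "wanna", "waa", "pana", "panna", "paa"]

print(matching_words("zaka.ę", words))  # ['zakażę', 'zakałę']
print(matching_words("wan+a", words))   # ['wana', 'wanna']
print(matching_words("pan*a", words))   # ['pana', 'panna', 'paa']
```

The only difference between the last two patterns is the lower bound: “+” requires at least one “n”, while “*” also accepts none.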

Combinations

The aforementioned symbols can be used directly with ordinary characters, but the most fruitful use in queries is to combine the wildcard “.” with one of the quantifiers.

Some examples:

“.+” means that any character or sequence of characters (at least one character) may appear in this part of the query:

t.+m

“.*” is used when either nothing or any sequence of characters may appear after the sequence typed in by the user:

tren.*

Using more than one dot allows the user to stand in for a set number of characters:

b....s
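The combined patterns above can be tried out in Python, again treating the `re` module as an approximation of the engine and assuming whole-word matching. The candidate words and the helper `full_matches` are chosen for illustration only and are not taken from the corpus results.

```python
import re

# Pattern -> a few hand-picked candidate words (illustrative only).
examples = {
    "t.+m": ["tam", "tłum", "tm"],
    "tren.*": ["tren", "trening", "teren"],
    "b....s": ["biznes", "bonus"],
}


def full_matches(pattern: str, words: list[str]) -> list[str]:
    # Keep only the words that the pattern covers from start to end.
    return [word for word in words if re.fullmatch(pattern, word)]


for pattern, words in examples.items():
    print(pattern, full_matches(pattern, words))
# t.+m ['tam', 'tłum']
# tren.* ['tren', 'trening']
# b....s ['biznes']
```

“tm” is rejected because “.+” needs at least one character between “t” and “m”, while “bonus” is rejected because four dots require exactly four characters between “b” and “s”.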

slopeq_for_nkjp.txt · Last modified: 2017/02/03 00:17 by mmolenda