
TüBa-D/Z lemmatizer available in WebLicht


The "TüBa-D/Z lemmatizer", a syntax-based lemmatizer for German developed in the context of the TüBa-D/Z treebank, is now available as a WebLicht component to noncommercial users from the entire CLARIN community. By integrating morphological and syntactic information, the tool produces lemmas and morphological tags that are often richer or more precise than the output of a surface-based model. Using frequency heuristics, the tool also automatically classifies prefix verbs as separable or inseparable and heuristically completes truncated words. In the WebLicht version of the TüBa-D/Z lemmatizer, the syntactic information is provided by the Berkeley parser and an internally developed grammar model.


 


1st CLARIN-D Doktorandentage - Corpora

Venue: Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart
Location: Universität Stuttgart, Institut für Maschinelle Sprachverarbeitung (IMS), Forschungszentrum Informatik (FZI), Pfaffenwaldring 5b, 70569 Stuttgart, Seminarraum 1, V 5.01 (on the ground floor)
Date: 25th-26th March 2013
Target group: PhD students, young researchers.

Information about the event at: http://fr46.uni-saarland.de/lsteich/ClarindDS2013

Programme of the event


Day 1 - Mon 25/3/2013

13:30 - 17:00 Corpus query with CQP (Hannah Kermes, UdS):

  • token-based queries,
  • regular expressions,
  • simple "statistical" analysis (grouping)

ca. 15:00 - 15:30 Break

Day 2 - Tue 26/3/2013

09:00 - 12:30 Syntactic annotation and corpus search with CLARIN-D resources (Heike Zinsmeister, IMS):

  • Syntactic annotation in WebLicht (parts of speech, constituency, topological fields, grammatical functions)
  • Search on syntactically annotated corpora (reference corpora and WebLicht output)
  • Visualization and simple statistical analysis of search results

ca. 10:30 - 11:00 Break

12:30 - 13:30 Lunch break

13:30 - 17:00 Statistical analysis with R (Marilisa Amoia, UdS):

  • exploratory data analysis
  • hypothesis testing

ca. 15:00 - 15:30 Break

 


Workshop Exploring data from language documentation

Dates: 10.05.2013 - 11.05.2013
Location: ZAS Berlin
Languages: English, German
Webpage: http://www.zas.gwz-berlin.de/workshop_edla.html
Organizers: Felix Rau, Kilu von Prince

Description

Language documentation has produced a large number of extensive
spoken language corpora. These corpora consist of time-aligned and
annotated audio and video recordings of endangered and often lesser
known languages. The typological diversity and the variety of these
data pose new and interesting technological and methodological
challenges.

Moreover, in the last ten years, a considerable infrastructure has
been developed to create and archive larger corpora of time-aligned
and annotated primary data. This infrastructure involves digital
archives such as the TLA at the MPI in Nijmegen and tools such as
ELAN, Toolbox, FLEx, Praat and Transcriber.

But to unlock the full potential of spoken language corpora,
researchers often face unique challenges: depending on the properties
of the documented language, the primary research questions, and the
nature of the workflow, the tools listed above might not fully
correspond to the researchers' needs. Also, in studies working with
data from different documentation projects, it may be difficult to
integrate a variety of formats and standards.

This workshop, which is organized by the CLARIN-D project (F-AG3),
invites experts from language documentation and linguistic typology
as well as language technology and corpus linguistics to present and
discuss the problems posed by the analysis of typologically diverse
spoken language corpora, possible solutions to them, and relevant
practices and technologies from related fields.

Invited speakers:

Ciprian Gerstenberger (Uni Tromsø)
Roland Meyer (Uni Regensburg)
Nick Thieberger (Uni Melbourne/PARADISEC)
Taras Zakharko (Uni Zürich)


 


CLARIN Standards Guide

The CLARIN Standards Guide provides information on standards, guidelines and standard-promulgating organizations that deal with language technology resources such as text corpora, lexica, and language databases. It was created at IDS Mannheim. A thorough description can be found on this web site, or you can visit the Standards Guide's own site.