Corpora
Below we present the corpora held by the CLARIN-D centers.
A corpus is a machine-readable collection of naturally-occurring language text, chosen to characterize a state or variety of a language.
A multimodal digitized corpus is a computer-based collection of language and communication-related material drawing on more than one sensory modality or on more than one production modality.
You can choose between the CLARIN-D centers:
- Bavarian Archive for Speech Signals, Munich
- Digital Dictionary of the German Language
- English Language and Translation Science, Saarbrücken
- The Hamburg Centre for Language Corpora
- The Institute of German Language, Mannheim
- Institute of Computer Science, Department of Natural Language Processing, Leipzig
- Institute of Natural Language Processing, Stuttgart
- Department of General and Computational Linguistics, Tübingen
Bavarian Archive for Speech Signals, Munich
List of BAS spoken language corpora which are available in CLARIN-D; online corpora are linked to the BAS repository:
- ALC
This corpus contains recordings of 162 speakers in both sober and intoxicated states.
- AsiCa
The AsiCa corpus documents the southern Italian dialect Calabrese. The main aims in building the corpus were the analysis of syntactic structures and their geolinguistic mapping in the form of interactive, web-based cartography. The corpus consists of audio recordings of some sixty speakers of Calabrese, half of whom have experience of migration to Germany, while the other half have almost always stayed in Calabria. The results of the syntactic analysis (maps and text) can be viewed on the project's website.
- HEMPEL
Hempels Sofa is a collection of more than 3900 spontaneous speech items recorded as extra material during the German SpeechDat-II project in a relaxed atmosphere. Speakers were asked to report what they had been doing during the last hour. This resulted in quite natural, colloquial speech, sometimes with a marked regional accent.
- PD1
The corpus contains 21587 recorded read speech utterances of 201 different speakers. Each speaker read a subcorpus of 450 different sentence equivalents (including alphanumericals and two shorter passages of prose text). The speakers were recorded at four different sites in Germany (Universities of Kiel, Bonn, Bochum, and Munich). The language is German.
- PD2
The corpus contains 3200 recorded read speech utterances of 16 different speakers, 6 women and 10 men. Each speaker read a corpus of 200 different sentences from a train query task. They were recorded at three different sites in Germany (Universities of Kiel, Bonn, and Munich). The language is German.
- SHC
The SMARTWEB UMTS data collection was created within the publicly funded German SmartWeb project in the years 2004 - 2006. It comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer World Cup in 2006. The recordings include field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as mobile recordings performed on a motorbike (one speaker, SmartWeb Motorbike Corpus SMC).
- SIGNUM
The SIGNUM Database contains both isolated and continuous utterances of 25 German signers of different sexes and ages. For quick random access to individual frames, each video clip is stored as a sequence of images. The vocabulary comprises 450 basic signs in German Sign Language (DGS) representing different word types. The SIGNUM Database was created within the framework of a DFG-funded research project at the Institute of Man-Machine-Interaction, located at the RWTH Aachen in Germany.
- SK Home
This corpus contains multimodal recordings of 65 actors who use the SmartKom system. SmartKom Home is intended as an intelligent communication assistant for the private environment. Naive users were asked to test a 'prototype' for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within a period of 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the users only knew that the system is able to understand speech, gestures and even facial expressions and should communicate more or less like a human.
- SK Mobil
This corpus contains multimodal recordings of 73 actors who use the SmartKom system. SmartKom Mobil is a portable PDA equipped with a net link and additional intelligent communication devices. Naive users were asked to test a 'prototype' for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within a period of 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the users only knew that the system is able to understand speech and gestures and should communicate more or less like a human. Experiments were not performed in the field but rather in a studio-like environment. Background noise was played back artificially, and the users did not carry the PDA in their hand but rather used a much smaller version of the SIVIT projection plane (to simulate a PDA display) and a pen as a pointing device. Speakers spoke into a headset microphone.
- SK Public
This corpus contains multimodal recordings of 86 actors who use the SmartKom system. SmartKom Public is comparable to a traditional public phone booth but equipped with additional intelligent communication devices. Naive users were asked to test a 'prototype' for a market study, not knowing that the system was in fact controlled by two human operators. They were asked to solve two tasks within a period of 4.5 min while they were left alone with the system. Instructions were kept to a minimum; the users only knew that the system is able to understand speech, gestures and even facial expressions and should communicate more or less like a human.
- SVC
The SMARTWEB UMTS data collection was created within the publicly funded German SmartWeb project in the years 2004 - 2006. It comprises a collection of user queries to a naturally spoken Web interface with the main focus on the soccer World Cup in 2006. The recordings include field recordings using a hand-held UMTS device (one person, SmartWeb Handheld Corpus SHC), field recordings with video capture of the primary speaker and a secondary speaker (SmartWeb Video Corpus SVC) as well as mobile recordings performed on a BMW motorbike (one speaker, SmartWeb Motorbike Corpus SMC).
- TAXI
This is the TAXI dialog database created in June 2001 in collaboration with the DFKI, Saarbrücken. TAXI contains 86 recorded dialogues between a cab dispatcher and a client recorded over public phone lines (network and GSM). The dispatcher always spoke German, while the clients always spoke English.
- VM1
The Verbmobil (VM) dialog database is a collection of German, American and Japanese dialog recordings in the appointment scheduling task. The data were collected during the first phase (1993 - 1996) of the German VM project funded by the German Ministry of Science and Technology (BMBF). For detailed information about this project please refer to the official VM Web Server (http://www.dfki.uni-sb.de/verbmobil/). 885 speakers participated in 1422 recordings. The total corpus amounts to 9 GB of data containing 23750 conversational turns distributed on 15 CD-Rs.
- VM2
Verbmobil 2 contains the speech of 401 speakers participating in 810 recordings. The emotionally tagged recordings are not part of this edition but are collected in the corpus 'BAS VMEmo'. The total VM2 corpus amounts to 17.6 GB of data containing 58961 conversational turns distributed on 39 CD-Rs. VM2 contains dialogs in German, English, Japanese and mixed language pairs (partly with interpreter). The domains are appointment scheduling, travel planning and leisure time planning.
- VMEmo
This database contains speech signals of dialogues in which a subject was recorded during a conversation via a spontaneous speech translation system. The response of the system was designed to invoke emotions (e.g. anger) in the subjects. It is part of the larger Verbmobil 2 speech data collection.
- ZIPTEL
The ZipTel telephone speech database contains recordings of people applying for a SpeechDat prompt sheet via telephone. The calls were recorded by a telephone server; callers were asked to provide name, address and telephone number. The ZipTel telephone speech database consists of 1957 recording sessions with a total of 7746 signal files. For privacy reasons, only a subset of the recorded signal files is included in the database: street names (z2), ZIP codes (z3), city names (z4) and telephone numbers (z5).
Berlin-Brandenburg Academy of Sciences and Humanities (BBAW), Berlin
Digital Dictionary of the German Language (German: DWDS)
The DWDS contains a range of contemporary German language corpora: a core corpus of 100 million tokens which covers the 20th century and which is balanced over the decades of this century and over several text genres, and an additional opportunistic complementary corpus (including e.g. newspaper corpora). The total size of these corpora is approximately 2.5 billion tokens. However, copyright restrictions and individual contracts limit access and usage of all these text collections. We must therefore follow a policy which specifies a variety of access rights and restrictions with respect to these collections. Usage conditions for individual text collections can be provided on request: dwds@bbaw.de.
Reference corpora
The core corpus of the 20th century consists of 100 million tokens and is balanced over the decades of this century and over several text genres: belles-lettres, newspaper, scientific and functional texts.
- Size: ca. 100 million tokens (120 million tokens incl. punctuation and numerical signs) in 79,830 documents.
- XML annotation in accordance with the Guidelines of the TEI.
- Most of the texts are subject to IPR restrictions. However, concordance lines for most of the texts can be inspected through our website. Free registration on the website extends the number of searchable documents.
The core corpus of the 21st century is still being compiled; parts of it (belles-lettres and newspaper texts) can be inspected through our website. It will be constructed in the same way as the core corpus of the 20th century.
The "Juilland-D" corpus is a German corpus covering the period from 1920 to 1939, balanced over these decades and over several text genres. The structure of the corpus follows the principles which have been laid down and applied to other languages by Alphonse Juilland. Size: 500,000 tokens in 392 documents.
The DDR-Korpus (texts from the former German Democratic Republic) consists of 9 million tokens in 1150 texts from the period between 1949 and 1990. These texts were either published in the GDR or composed by authors living in the GDR and published in the "Bundesrepublik Deutschland" (Federal Republic of Germany).
The C4-Korpus is a joint effort of the BBAW, the Austrian Academy of Sciences, the University of Basel and the Free University of Bozen-Bolzano. The corpus currently consists of 20 million tokens of High German, 4.1 million tokens of German texts from Austria, 20 million tokens of German texts from Switzerland and 1.7 million tokens of German texts from South Tyrol. It can be queried with the DDC search engine developed at the BBAW, e.g. through the DWDS website and through a portal offered by the University of Basel.
Newspaper corpora
The Berliner Zeitung corpus consists of the complete set of articles which were published online between January 1994 and December 2005. Size: 252 million tokens in 869,000 articles.
The Tagesspiegel corpus consists of the complete set of articles which were published online between 1996 and 2005. Size: 170 million tokens in 350,000 articles.
The Potsdamer Neueste Nachrichten corpus consists of the complete set of articles which were published between 2003 and June 2005. Size: 15 million tokens in 42,000 articles.
The corpus of Die ZEIT consists of all weekly issues of the period between 1946 and 2009. Size: 460 million tokens.
The BILD corpus consists of all articles from the period between 2 May 1997 and 29 April 2006. Size: 121 million tokens in 550,000 articles.
The corpus of Die WELT consists of all articles from the period between 1 March 1999 and 29 April 2006. Size: 240 million tokens in 600,000 articles.
The Süddeutsche Zeitung corpus contains all articles from the period between 1 January 1994 and 31 December 2004. Size: 453 million tokens in 1,100,000 articles.
The use of the last three corpora is subject to stricter licence restrictions. Currently they can also be used for internal research and development. A small portion of the texts is available through our "word profile" and "good examples" applications.
Specialized corpora
The corpus of Jewish periodicals is the result of a cooperation with the Compactmemory project. The corpus is based on eight complete periodicals from the period between 1887 and 1938. Size: 26 million tokens on ca. 50,000 pages.
The "Berliner Wendekorpus" consists of 77 oral history interviews from the period of German reunification, in which citizens of East and West Berlin describe their personal experiences of the time. The project was supported by the German Research Foundation (DFG) and the Freie Universität Berlin and was directed by Norbert Dittmar. The corpus consists of 250,000 tokens.
The Korpus gesprochene Sprache contains transcripts of speeches, protocols of parliamentary sessions, and interviews spanning the whole 20th century. Size: 2.5 million tokens.
German Text Archive (German: DTA)
The corpora of the German Text Archive (Deutsches Textarchiv, DTA) contain printed works in the German language dating from the 17th to the 19th century. The DTA core corpus is balanced with regard to text types, disciplines, and dates of origin. This way, the DTA may serve as a basis for a reference corpus of historical New High German. These text sources are published on the internet as digital facsimiles and as XML-annotated transcriptions along with comprehensive bibliographic metadata. The annotation consistently follows the well-documented DTA "base format" (DTABf), a strict TEI P5 subset developed for the representation of (historical) written corpora. The DTABf is recommended as the best-practice format for (historical) written corpora in the CLARIN-D User Guide. The electronic full texts are enriched with linguistic information gained through automated tokenization, lemmatization and part-of-speech tagging; historical spellings are mapped to their modernized equivalents.
The corpus is being expanded continuously. As of March 2014, it contains digitized and structurally as well as linguistically annotated full texts of 1304 volumes with 419,284 pages (ca. 100 million tokens).
Via the module DTA Extensions (DTAE) and in the context of a CLARIN-D curation project, the DTA is continually extended with specific textual resources and corpora that originate from external projects. These external resources, which come in various mark-up formats, are converted to meet the DTA guidelines (DTA base format) and are linguistically analysed. As of March 2014, the DTAE corpora contain 587 texts (85,535 pages).
All DTA texts and their corresponding image sources are accessible via the quality assurance platform DTAQ, and, after quality control, via the DTA website. In DTAQ users are able to check different instances of all DTA corpus texts (XML, HTML, version with linguistic analyses), add remarks or suggest corrections via a ticketing system and can correct errors using a web-based inline editor.
All DTA texts can be downloaded or harvested via OAI-PMH in different formats and may be used under a Creative Commons license.
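As a rough illustration of what harvesting via OAI-PMH involves, the following Python sketch builds a `ListRecords` request URL and extracts record identifiers and the resumption token from a response. The verb, the `metadataPrefix` and `resumptionToken` arguments and the XML namespace come from the OAI-PMH 2.0 protocol; the base URL, the metadata prefix chosen and the sample response are placeholders invented for this example and do not describe the actual DTA endpoint.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical endpoint: replace with the repository's real base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    A resumption token, when present, replaces all other arguments
    except the verb, as the protocol requires.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (record identifiers, resumption token or None)."""
    root = ET.fromstring(xml_text)
    identifiers = [
        element.text
        for element in root.findall(".//oai:header/oai:identifier", OAI_NS)
    ]
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=OAI_NS)
    return identifiers, (token or None)


# Invented sample response, shortened to the parts the parser reads.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:text-1</identifier></header></record>
    <record><header><identifier>oai:example.org:text-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    print(list_records_url(BASE_URL))
    ids, token = parse_list_records(SAMPLE_RESPONSE)
    print(ids, token)
    print(list_records_url(BASE_URL, resumption_token=token))
```

A real harvester would fetch each URL, parse the response, and repeat with the returned token until no token is left.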
The DTA also offers services for the creation, maintenance, and quality management of German diachronic language corpora of the above-mentioned period.
dlexDB
Research in psycholinguistics, experimental psychology and cognitive sciences has long moved from investigating mere word frequency effects to more complex variables and linguistic features. Such norms are often difficult to collect without proper text corpora and linguistic tools. dlexDB provides a wide range of information on words, their lemmas and categories, word n-grams and sublexical units (e.g. syllables and character n-grams). The data is available for use with a browser and through a webservice.
English Language and Translation Science, Saarbrücken
List of corpora, treebanks and databases that will be available in CLARIN-D as planned at the time of writing (May 2012):
Version 2 of the NEGRA corpus consists of 355,096 tokens (20,602 sentences) of German newspaper text, taken from the Frankfurter Rundschau as contained in the CD "Multilingual Corpus 1" of the European Corpus Initiative. It is based on approx. 60,000 tokens that were tagged for part-of-speech at the Institut für Maschinelle Sprachverarbeitung, Stuttgart. This corpus was extended, tagged with part-of-speech information and completely annotated with syntactic structures. The corpus was created in the projects NEGRA (DFG Sonderforschungsbereich 378, Projekt C3) and LINC (Universität des Saarlandes) in Saarbrücken.
The SALSA corpus is based on the TIGER corpus. The TIGER corpus (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes (cf. the TIGER corpus website). SALSA provides an additional annotation layer to the TIGER corpus: FrameNet semantic roles.
The CroCo corpus is a bidirectional corpus of German (GO) and English (EO) texts from 8 registers (popular-scientific texts, tourism leaflets, prepared speeches, political essays on economics, fictional texts, corporate communication, instruction manuals, websites) with the respective English (ETrans) and German (GTrans) translation. The corpus is annotated with lemma, POS, morphological information, phrasal chunks and grammatical functions. The parallel corpora (EO-GTrans and GO-ETrans) are aligned on different levels: word, chunk, clause and sentence.
The Darmstadt Corpus of Scientific Texts (DaSciTex) contains full English scientific journal articles compiled from 23 sources covering nine scientific domains. The corpus has a three-way partition: (1) a center discipline (computer science); (2) four 'pure' contact disciplines (linguistics, biology, mechanical engineering, electrical engineering); and (3) four corresponding 'mixed' disciplines (computational linguistics, bio-informatics, computer-aided design, micro-electronics). The corpus comes in two versions: a small manually checked corpus (approx. one million words) and a large corpus (17 million words).
The Saarbrücker Stimmdatenbank is a collection of voice recordings from more than 2000 persons. Recordings are classified according to healthy and pathological voice profiles on the basis of the acoustic and electroglottographic signals. The speech signal and the EGG signal have been stored in separate files. Any comments about the recordings are contained in an associated text file. The material can be queried through a web search interface, and the selected audio files can be exported.
The corpus GENIE - spoken Niedersorbisch/Wendisch (Lower Sorbian/Wendish) gives access to the spoken variants of this language via databases. It presents, in acoustic form, one of only two autochthonous Slavic minority languages in Germany, which is spoken natively by fewer than 10,000 people in Lusatia (Lausitz). It includes 350 audio files with over 62 hours of selected Niedersorbisch speech recordings from different sources and eras in mp3 and wav formats. It provides detailed information for each recording, which makes it possible to search for recordings with specific characteristics.
The GRUG Parallel Treebank is a set of four monolingual treebanks (German, Georgian, Russian and Ukrainian) and four parallel treebanks (German-Georgian, German-Russian, German-Ukrainian, Georgian-Ukrainian). The corpus is manually annotated with POS, morphological and syntactic information following the TIGER guidelines, and the outcome is provided in TIGER-XML format. The monolingual treebanks can be explored using either the TIGERSearch software or SALTO, whereas the parallel treebanks can be browsed with the Stockholm TreeAligner.
The SaCoCo "Saarbrücker Cookbook Corpus" is a diachronic corpus of cooking recipes containing a historical and a contemporary subcorpus. The historical subcorpus spans 200 years (1569-1729) and includes 430 recipes from 14 cook books written in German (approx. 45,000 tokens). The core of the recipe corpus was compiled as part of a PhD project in the field of Translation Studies (cf. Wurm, 2007). The recipes of the contemporary subcorpus were collected from Internet sources spanning a five-year period (2007-2012); the selection criteria were comparability to the historical subcorpus with respect to register, and geographic origin (recipes from Germany). The contemporary subcorpus contains 1,500 recipes and approx. 500,000 tokens.
The Hamburg Centre for Language Corpora
The Hamburg Centre for Language Corpora provides multilingual language acquisition, language attrition and sociolinguistic corpora, mainly from the Collaborative Research Centre 538 “Multilingualism”. The spoken language data amounts to approx. 2000 hours of audio or video recordings and 6 million transcribed words. The corpora that fulfill our quality requirements will be successively integrated into the CLARIN-D infrastructure. So far, six spoken multilingual corpora are available (more will follow soon):
The Hamburg Map Task Corpus (HAMATAC) complements the corpus “Deutsch heute” (IDS) with L2 German data from 24 adult learners of German with varying L1s and L2 proficiency levels. It provides orthographic transcription (simplified HIAT), automatic and partly manually corrected POS and lemma annotations (TreeTagger, STTS), and manual disfluency annotations. The maps used for the tasks are available.
Size: 24 communications, 3:17 hours, 21,400 words.
The "Dolmetschen im Krankenhaus (DiK)" corpus is based on doctor-patient communication between German doctors or medical staff and Turkish- or Portuguese-speaking patients, interpreted by non-professional interpreters (medical staff or patients' relatives). The corpus also includes comparable data: monolingual doctor-patient communication from Germany, Turkey and Portugal. HIAT transcription and German translation.
Size: 91 communications, 23:01 hours, 165,700 words.
The corpus Hamburg Adult Bilingual Language (HABLA) contains semi-spontaneous interviews with bilingual adult speakers who acquired their languages (German and either French or Italian) either simultaneously (2L1) or successively (L2). The L2 speakers, with German as either L1 or L2, were recorded using their L2; the 2L1 speakers were recorded twice, using both their languages. CHAT transcription, detailed speaker metadata on language acquisition and use.
Size: 169 communications, 79:08 hours, 737,800 words.
The Hamburg Corpus of Argentinean Spanish (HaCASpa) comprises read, elicited, semi-spontaneous and spontaneous data from two varieties of Argentinean Spanish, focussing on intonation. Fifty speakers participate in five tasks; in addition, there are recordings of map tasks and interviews with further speakers, partly with video. The speakers are divided into two generations and two regions (Buenos Aires (Porteño Spanish) or Neuquén/Comahue). Orthographic transcription with references to the elicitation materials used in the tasks, which have also been included in the corpus.
Size: 259 communications, 18:24 hours, 141,300 words.
The Hamburg Corpus of Polish in Germany (HamCoPoliG) covers (semi-)spontaneous Polish data from Polish-German bilinguals: either Polish immigrants in Germany with German as L2 or so-called heritage speakers born in Germany with Polish as a family/community language. The corpus also includes control group recordings with speakers of Polish without German contact. Orthographic transcription, very detailed questionnaires on language acquisition and knowledge, and grammaticality judgement tests are also available.
Size: 359 communications, 37:50 hours, 294,700 words.
The EXMARaLDA Demokorpus is used for demonstrations of the EXMARaLDA system and contains short audio and video recordings in eleven different languages. Orthographic transcription (simplified HIAT), German translation, example metadata.
Size: 20 communications, 15,400 words.
Institute of German Language, Mannheim
COSMAS II (Corpus Search, Management and Analysis System) is a full-text database for linguistic research on corpora of the Institut für Deutsche Sprache (IDS). It provides access to the steadily growing German Reference Corpus (DeReKo, over 4 billion words from newspapers, fictional, non-fictional and specialized works from Germany, Austria and Switzerland, from 1772 to the present) and other written language corpora of the IDS.
Institute of Computer Science, Department of Natural Language Processing, Leipzig
The project Deutscher Wortschatz aims at documenting the usage of the German language. The content of the Wortschatz portal can be characterized as a collection. Since 1999, texts from newspaper portals, Wikipedia and other sources have been automatically collected and split into single sentences. Multiple language-independent and mostly statistical methods are used to calculate data such as word frequency, frequency class, and sentence and direct-neighbour co-occurrences. In addition to the German portal, which concentrates on the German language, an international portal provides access to monolingual lexicons that contain the typical Wortschatz data in over 90 different languages.
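The statistics named above can be illustrated with a small Python sketch over a list of sentences. It is not the project's implementation: the whitespace tokenizer is a simplification, raw counts stand in for any statistical significance measure, and the frequency class is computed under one common definition that is assumed here (the floor of the base-2 logarithm of the ratio between the most frequent word's count and the word's own count).

```python
import math
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    """Simplified tokenizer: lowercase and split on whitespace."""
    return sentence.lower().split()


def word_frequencies(sentences):
    """Count how often each word form occurs in the collection."""
    return Counter(token for s in sentences for token in tokenize(s))


def frequency_class(freqs, word):
    """Assumed definition: floor(log2(count of top word / count of word))."""
    top = max(freqs.values())
    return math.floor(math.log2(top / freqs[word]))


def neighbour_cooccurrences(sentences):
    """Count ordered pairs of directly adjacent words."""
    pairs = Counter()
    for s in sentences:
        tokens = tokenize(s)
        pairs.update(zip(tokens, tokens[1:]))
    return pairs


def sentence_cooccurrences(sentences):
    """Count unordered pairs of distinct words sharing a sentence."""
    pairs = Counter()
    for s in sentences:
        types = sorted(set(tokenize(s)))
        pairs.update(combinations(types, 2))
    return pairs


# Invented toy data for illustration only.
SENTENCES = [
    "der hund bellt",
    "der hund ruht",
    "der mond scheint",
    "der wind weht",
]

if __name__ == "__main__":
    freqs = word_frequencies(SENTENCES)
    print(freqs.most_common(3))
    print(frequency_class(freqs, "mond"))
    print(neighbour_cooccurrences(SENTENCES).most_common(1))
```

Raw pair counts favour frequent words, so a production system would normally rank pairs with a significance measure instead.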
Words and phrases mentioned in selected newspapers, the "words of the day", are extracted on a daily basis. The relevance of a word is calculated by comparing its frequency during a limited observation period to its long-term moving average. The archive for German contains data from April 2002 to the present. Norwegian data has been available since March 2006.
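The comparison of a short observation period against a long-term moving average can be sketched as a simple ratio. The text does not give the portal's actual formula, so the window lengths, the additive smoothing and the function names below are assumptions made for illustration.

```python
from statistics import mean


def relevance(daily_counts, window=1, history=30, smoothing=1.0):
    """Ratio of recent frequency to the preceding moving average.

    daily_counts: per-day frequencies of one word, oldest first.
    window:       number of most recent days forming the observation period.
    history:      number of days before that period used as the baseline.
    smoothing:    added to both sides so rare words do not dominate.
    """
    if len(daily_counts) < window + history:
        raise ValueError("not enough days of data")
    observed = mean(daily_counts[-window:])
    baseline = mean(daily_counts[-(window + history):-window])
    return (observed + smoothing) / (baseline + smoothing)


def words_of_the_day(counts_by_word, top_n=3, **kwargs):
    """Rank words by relevance, highest first."""
    scores = {w: relevance(c, **kwargs) for w, c in counts_by_word.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Invented toy data: 30 days of history plus today's count.
COUNTS = {
    "wahl": [2] * 30 + [10],   # sudden spike
    "und": [50] * 31,          # frequent but stable
    "haus": [4] * 30 + [3],    # slightly below its average
}

if __name__ == "__main__":
    for word in words_of_the_day(COUNTS):
        print(word, round(relevance(COUNTS[word]), 2))
```

A stable high-frequency word scores about 1, while a word that suddenly appears more often than usual scores well above 1, which is the effect the description is after.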
Institute of Natural Language Processing, Stuttgart
The TIGER Treebank (Version 2.1) consists of approx. 900,000 tokens (50,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau. The corpus was semi-automatically POS-tagged and annotated with syntactic structure. Moreover, it contains morphological and lemma information for terminal nodes.
The TIGER Treebank is delivered in two treebank formats: Negra export format (text format) and TIGER-XML format (XML-based format). Both versions of the corpus can be processed by the treebank query tool TIGERSearch, which has also been developed within the TIGER project (Saarbrücken, Stuttgart, Potsdam).
In addition to the TIGER Treebank proper, several resources derived from it are available. These are the TiGer Dependency Bank, which is a dependency-based gold standard for (hand-crafted) German parsers for the TIGER Corpus sentences 8,001 through 10,000, the TIGER 700 RMRS Bank, the TIGER data sets for the CoNLL-X shared task and dependency triple representations for (almost) the entire treebank, which, like the TiGer DB structures, are intended for evaluation purposes.
DIRNDL -- (D)iscourse (I)nformation (R)adio (N)ews (D)atabase for (L)inguistic Analysis -- is a corpus resource based on hourly broadcast German radio news. The textual version of the news is annotated with syntactic information. On top of this, the syntactic phrases are labeled with information status categories (given-new information). The speech version is prosodically annotated, i.e. with pitch accents and prosodic phrase boundaries. As the textual and the speech version slightly deviate from each other due to slips of the tongue, fillers and minor modifications, a (semi-automatic) linking of the two versions was carried out and the results were stored inside the database. With the help of these newly established links, all annotation layers can be accessed for exploring the relations between prosody, syntax and information status.
The Huge German Corpus (HGC) is a collection of German texts (newspaper, law texts) of about 204 million tokens including punctuation in 12.2 million sentences (about 180 million "real" words). The corpus was automatically segmented into sentences. Furthermore, it was lemmatized and part-of-speech tagged by the TreeTagger (Schmid 1994) using the STTS tagset (Schiller et al. 1999). The corpus is partly based on data taken from the European Corpus Initiative Multilingual Corpus I (ECI/MCI).
SdeWaC is based on the deWaC web corpus of the WaCky Initiative. SdeWaC contains parsable sentences from deWaC documents of the .de domain. SdeWaC is limited to the sentence context. The sentences were sorted, and sentence duplicates within the same domain name were removed. In addition, some heuristics based on (Quasthoff et al. 2006, "Corpus Portal for Search in Monolingual Corpora") have been applied. To extract parsable sentences, the FSPar dependency parser was applied. SdeWaC-v3 is made available by the WaCky Initiative and comes in two formats:
- one sentence per line
- one token per line including part-of-speech and lemma annotation (Tokenizer and TreeTagger by H. Schmid)
In both formats, additional metadata encodes the domain-name and an "error-rate" of the parser.
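A token-per-line file of this kind can be read with a few lines of Python. The exact SdeWaC layout is not specified here, so this sketch assumes three tab-separated columns in the order token, part-of-speech tag, lemma, with blank lines between sentences; the metadata for domain name and parser error rate is not modelled, and the sample lines are invented.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    form: str
    pos: str
    lemma: str


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens.

    Assumed layout: "form<TAB>pos<TAB>lemma" per line, and a blank
    line marking the end of each sentence.
    """
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if sentence:
                yield sentence
                sentence = []
            continue
        form, pos, lemma = line.split("\t")
        sentence.append(Token(form, pos, lemma))
    if sentence:
        yield sentence


# Invented sample using STTS-style tags.
SAMPLE_LINES = [
    "Der\tART\tdie",
    "Hund\tNN\tHund",
    "bellt\tVVFIN\tbellen",
    ".\t$.\t.",
    "",
    "Er\tPPER\ter",
    "ruht\tVVFIN\truhen",
    ".\t$.\t.",
]

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE_LINES):
        print(" ".join(f"{t.form}/{t.pos}" for t in sent))
```

Because the reader is a generator, it can be pointed at an open file handle and will process a large corpus one sentence at a time.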
Department of General and Computational Linguistics, Tübingen
TüBa-D/Z
The Tübingen Treebank of Written German (TüBa-D/Z) is a collection of manually annotated newspaper texts from the German daily newspaper "die tageszeitung." The Seminar für Sprachwissenschaft is actively maintaining this treebank and it has grown with each new release since 2003. The current version contains 75,408 sentences with 1,365,642 tokens.
All tokens are fully annotated for:
- inflectional morphology
- lemmas
- syntactic dependency and constituency
- grammatical functions
- named entities
- anaphora and coreference relations
The TüBa-D/Z syntactic annotation scheme builds on assumptions shared by most major syntactic theories and synthesizes different approaches to German grammar, including both constituency and dependency ideas. It marks four varieties of syntactic relations: token-level relations, phrasal constituency, topological fields (i.e. the German "V2" approach), and clausal relations.
TüPP-D/Z
The Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) is a collection of articles from the German daily newspaper "die tageszeitung" which have been automatically annotated for clause structure and topological fields and shallowly parsed (i.e. chunked). It has also been automatically annotated for parts of speech and morphological classes, including marking possible classes in ambiguous cases.
The current release of the TüPP-D/Z is drawn from the 1999 release of "die tageszeitung" archives for scientific research, and includes materials from 1986 to 1999, amounting to more than 200 million text tokens.
Tübingen-Verbmobil Treebanks of Speech
Tübingen manages and distributes three treebanks of transcribed speech originally commissioned for the Verbmobil project (1993-2000).
The Tübingen Treebank of Spoken German (TüBa-D/S) is a selection of manually transcribed and annotated sentences from spontaneous spoken dialogues in German. It comprises approximately 38,000 sentences with roughly 360,000 words.
The Tübingen Treebank of Spoken English (TüBa-E/S) is a selection of manually transcribed and annotated sentences from spontaneous spoken dialogues in English. It comprises approximately 30,000 sentences with roughly 310,000 words.
The Tübingen Treebank of Spoken Japanese (TüBa-J/S) is a selection of manually transcribed and annotated sentences from spontaneous spoken dialogues in Japanese. It comprises approximately 18,000 sentences with roughly 160,000 words.