The DWDS corpus: A reference corpus for the German language of the 20th century

1. Core corpus

1.1. 100 million running words

1.2. a reference corpus for the German language of the 20th century

1.3. the texts were obtained from publishing houses or donated by contributors

2. Opportunistic corpus

2.1. 900 million text words

2.2. consisting essentially of newspaper sources from the last 15 years

2.3. copyright clearance has been obtained from major publishing houses, enabling DWDS users to access the works of important literary and scientific authors

4. DWDS project

4.1. Motivations

4.1.1. there is no dictionary of the German language that represents the lexicon of the entire 20th century; confronting the texts of the past indicates the presence of a barrier to the full understanding of German

4.1.2. order its words by lexical categories, types of syntactic constructions, and lexical fields

4.1.3. a balanced corpus of German; the challenge here is to filter out interesting words and word senses from a large mass of data

4.2. The current DWDS Kerncorpus (version 0.95) consists of 79,322 documents. The corpus comprises 100,600,993 tokens and 2,224,542 types
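
The counts above allow two quick derived measures: the type/token ratio and the mean document length. The following is a minimal sketch of that arithmetic; the variable names are illustrative, and only the three counts come from the text.

```python
# Counts as reported for the DWDS Kerncorpus, version 0.95.
documents = 79_322
tokens = 100_600_993
types = 2_224_542

# Type/token ratio: share of distinct word forms among all running words.
type_token_ratio = types / tokens

# Mean document length in tokens.
tokens_per_document = tokens / documents

print(f"type/token ratio: {type_token_ratio:.4f}")
print(f"tokens per document: {tokens_per_document:.0f}")
```

This gives a type/token ratio of roughly 0.022 and a mean document length of about 1,268 tokens.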

4.3. To serve as the empirical basis of a large monolingual dictionary of the 20th/21st century: to offer more subtle linguistic descriptions of the semantics and syntagmatics of lexical items

4.4. The DWDS Kerncorpus is publicly available from the project's website

5. Design of the DWDS

5.1. why was the period between 1900 and 2000 chosen?

5.1.1. the 20th century does not constitute a clear-cut historical time period

5.2. why is the corpus restricted to only five genres?

5.2.1. we were guided by the practical consideration that fewer genre distinctions make the daily corpus work easier

5.2.2. it is a balanced corpus of German texts of the 20th century with five genres: journalism (27%), literary texts (26%), scientific literature (22%), other nonfiction (20%), and transcripts of spoken language (5%)
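
As a quick consistency check on the genre distribution above, the sketch below verifies that the five shares sum to 100% and estimates the size of each genre. It assumes that the percentages are measured in tokens rather than in documents, which the outline does not state; the token total is the one reported for the Kerncorpus version 0.95.

```python
# Genre shares in percent, as listed above.
genre_share = {
    "journalism": 27,
    "literary texts": 26,
    "scientific literature": 22,
    "other nonfiction": 20,
    "transcripts of spoken language": 5,
}

# The five shares should cover the whole corpus.
assert sum(genre_share.values()) == 100

# Assumption: the shares refer to tokens, not to documents.
total_tokens = 100_600_993
approx_tokens = {
    genre: round(total_tokens * pct / 100)
    for genre, pct in genre_share.items()
}

for genre, n in approx_tokens.items():
    print(f"{genre}: ~{n:,} tokens")
```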

6. The compilation of the DWDS Kerncorpus

6.1. Text selection

6.2. Copyright issues

6.2.1. a committee of prominent public personalities was formed to convince authors and copyright owners and to lend authority to the project's negotiations with publishing houses

6.2.2. at least the DWDS Kerncorpus should also be publicly available as a resource

6.3. Digitization

6.3.1. availability of the texts in electronic format: 60% of the selected texts were available in electronic format, including texts purchased as CD-ROMs; 40 million tokens had to be digitized from the printed format

6.3.2. methods: Optical Character Recognition (OCR) and manual transcription

6.4. Structural Annotation

6.4.1. the structural annotation of the texts follows the TEI guidelines; the following information was encoded: page breaks, footnotes, titles, chapters, paragraphs, as well as prefaces and epilogues
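
To illustrate what such a structural annotation might look like, the sketch below builds a small TEI-style tree with Python's standard `xml.etree.ElementTree`. The element names (`front`, `body`, `back`, `div`, `head`, `p`, `note`, `pb`) are generic TEI elements chosen for illustration; the outline does not show the project's actual schema, so the exact tags and attribute values are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: generic TEI element names, not the project's actual schema.
text = ET.Element("text")

front = ET.SubElement(text, "front")                 # preface material
preface = ET.SubElement(front, "div", type="preface")
ET.SubElement(preface, "p").text = "Preface text ..."

body = ET.SubElement(text, "body")
chapter = ET.SubElement(body, "div", type="chapter")
ET.SubElement(chapter, "head").text = "Chapter title"
para = ET.SubElement(chapter, "p")
para.text = "First paragraph"
note = ET.SubElement(para, "note", place="foot")     # footnote
note.text = "Footnote text"
ET.SubElement(chapter, "pb", n="2")                  # page break

back = ET.SubElement(text, "back")                   # epilogue material
epilogue = ET.SubElement(back, "div", type="epilogue")
ET.SubElement(epilogue, "p").text = "Epilogue text ..."

xml_string = ET.tostring(text, encoding="unicode")
print(xml_string)
```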

6.4.2. the orthography is normalized; the original orthography is stored as an attribute and is preserved for token search and for the display of the KWIC lines
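
The idea of keeping both spellings on each token can be sketched as follows. The `Token` class, the `kwic` function, and the example sentence (with the historical spelling "Thür" for "Tür") are hypothetical illustrations, not the project's data model: the search can run on either form, while the KWIC line always shows the original spelling.

```python
from dataclasses import dataclass


@dataclass
class Token:
    norm: str  # normalized orthography
    orig: str  # original orthography, kept alongside as an attribute


# Hypothetical sentence with one historical spelling.
tokens = [
    Token("Die", "Die"),
    Token("Tür", "Thür"),
    Token("war", "war"),
    Token("offen", "offen"),
]


def kwic(tokens, query, field="norm", width=2):
    """Search on `field` ("norm" or "orig"); display the original spelling."""
    lines = []
    for i, tok in enumerate(tokens):
        if getattr(tok, field) == query:
            left = " ".join(t.orig for t in tokens[max(0, i - width):i])
            right = " ".join(t.orig for t in tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok.orig}] {right}")
    return lines


print(kwic(tokens, "Tür"))                 # search on the normalized form
print(kwic(tokens, "Thür", field="orig"))  # search on the original form
```

Both calls find the same token, and both display it in its original spelling.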

6.4.3. information was encoded: first date of publication, copyright status, genre, and bibliographic reference

6.5. Linguistic annotation

6.5.1. all the texts have been annotated using the TAGH morphology, which is based on a stem lexicon with allomorphs and a concatenative mechanism for inflection, derivation and composition; the recognition rate of TAGH is more than 99% for modern newspaper texts and about 98.3% for the DWDS Kerncorpus; the lower rate is due to unrecognized named entities, non-standard regional variants, typing errors, foreign words, non-standard abbreviations and words spelled according to the historical orthography of before 1902
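
The general idea of a stem lexicon with allomorphs plus concatenation can be illustrated with a toy recognizer. This is not TAGH and does not reflect its implementation: the lexicon entries, the endings and the function names are hypothetical, derivation is left out, and the sketch overgenerates (it would also accept a form such as "Hauser").

```python
# Toy stem lexicon (hypothetical entries): lemma -> stem allomorphs, lowercased.
STEMS = {
    "Haus": ("haus", "häus"),
    "Tür": ("tür",),
}

# Toy inflectional endings; "" stands for the bare stem.
ENDINGS = {"", "es", "er", "ern", "en"}


def segment(rest):
    """Yield (lemmas, ending) pairs that spell out `rest` by concatenation."""
    for lemma, allomorphs in STEMS.items():
        for stem in allomorphs:
            if not rest.startswith(stem):
                continue
            tail = rest[len(stem):]
            if tail in ENDINGS:
                # Inflection: stem allomorph followed by an ending.
                yield (lemma,), tail
            for lemmas, ending in segment(tail):
                # Composition: this stem followed by further stems.
                yield (lemma,) + lemmas, ending


def analyses(word):
    """Return all toy analyses of `word`; an empty list means unrecognized."""
    return list(segment(word.lower()))


print(analyses("Häuser"))
print(analyses("Haustüren"))
print(analyses("Xylophon"))
```

A word with no analysis counts as unrecognized, which is the quantity the recognition rate above measures.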

7. The copyright status corresponds to the copyright acknowledgement agreed to between the publishing houses and the DWDS project