The DWDS corpus, A. GEYKEN (2006)

1. 6. Copyright issues & opportunities

1.1. Goal : convince authors and copyright owners to collaborate in the compilation

1.2. committee formed of prominent public personalities

1.2.1. advised the project

1.2.2. supported the project's negotiations with publishing houses

1.3. Several procedures to protect copyrighted texts

1.3.1. only selected samples, not the entire text

1.3.2. a flexible mechanism to display query results with variable context windows

1.3.3. password protection for copyrighted texts

1.3.4. Anonymization of named entities

1.4. negotiations with publishers are slow and time-consuming

1.5. obtained the permission of 15 publishing houses

1.5.1. Ex : Suhrkamp, a famous publishing house
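
The variable context windows of 1.3.2 can be pictured as a keyword-in-context display whose width depends on the copyright status of the source text. A minimal sketch, assuming whitespace tokenization; the function name `kwic`, the status labels and the window sizes are illustrative and not taken from the DWDS system:

```python
def kwic(tokens, keyword, window):
    """Return (left, hit, right) triples with at most `window` tokens of context per side."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits

# Hypothetical window sizes: a narrower window for copyrighted texts.
WINDOW_BY_STATUS = {"free": 5, "copyrighted": 2}

text = "der Hund lief schnell über die lange Straße nach Hause".split()
for status, window in WINDOW_BY_STATUS.items():
    print(status, kwic(text, "die", window))
```

The same query thus shows more or less of the surrounding text, depending on what the copyright holder has agreed to.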

2. 1. Introduction

2.1. DWDS project

2.1.1. a balanced core corpus ("Kerncorpus")

2.1.2. an opportunistic supplementary corpus ("Ergänzungscorpus")

2.2. 3 main motivations

2.2.1. no dictionary of the German language offers a satisfactory representation of the lexicon of the entire 20th century; Ex : Grimm's or the Duden dictionary : language of the first half of the century

2.2.2. the alphabetical order of traditional print dictionaries is a disadvantage for the study of lexical fields

2.2.3. current dictionaries are not based on a balanced corpus of German; they rely either on manual excerption of words and word uses from texts (Ex : Grimm's dictionary), which requires a large number of persons and whose excerptors may inadvertently overlook important words or word senses, or on a mixture of manual excerption and automatically compiled electronic opportunistic corpora (Ex : Duden dictionary)

3. 2. The need for a new corpus

3.1. in 1999, no satisfactory corpora of 20th century German existed

3.1.1. LIMAS Corpus : a first-generation corpus created in 1973; followed the model of the Brown corpus; 500 text samples of 2000 tokens each from the year 1964; 20 different text genres; considered a balanced corpus; 1 million tokens and 100,000 types; too small to constitute the text basis for a large monolingual dictionary

3.1.2. IDS Corpus : created by the Institut für deutsche Sprache (IDS) in Mannheim; 2 billion written tokens; 4400 hours of recordings of spoken language; focus mainly on recent newspaper texts; many kinds of text underrepresented; very few texts from the first half of the 20th century; not chronologically balanced

3.1.3. resources provided by computational linguists & new dictionaries, mostly based on newspapers; Ex : Negra, TüBa-D/Z, LexiView

3.1.4. general-language corpora of German created on the basis of Web resources (Sharoff 2004); Ex : German Internet Corpus; they do not obviate the need for a reference corpus

4. 3. Corpus design requirements

4.1. Corpora are generally multi-purpose

4.2. Purposes of DWDS corpus

4.2.1. serve as the empirical basis of a large monolingual dictionary of the 20th/21st century

4.2.2. offer more subtle linguistic descriptions of lexical items (concerning semantics and syntagmatics)

4.3. representativeness in a statistical sense cannot be obtained for corpora

4.3.1. causes : difficulties associated with defining the underlying population

4.3.2. use of the modest notion of "balance"

4.4. 3 desiderata for the DWDS Kerncorpus

4.4.1. DWDS has to be balanced with respect to text types

4.4.2. DWDS must be large enough for its purpose

4.4.3. DWDS must contain a considerable amount of influential and important literature

4.4.4. (Sinclair 1994) : these properties characterize a reference corpus

5. 4. Design of the DWDS corpus

5.1. DWDS Kerncorpus

5.1.1. constructed at the Berlin-Brandenburg Academy of Sciences (BBAW)

5.1.2. between 2000 and 2003

5.1.3. a reference (balanced) corpus of the 20th century German language

5.1.4. 100 million tokens

5.1.5. equally distributed over time from 1900 to 2000 (sometimes before & after)

5.1.6. roughly equally distributed over five genres (exception : spoken language, 5%)

5.1.7. corpora of written and spoken language

5.1.8. satisfies the 3 criteria (desiderata)

5.1.9. why was the period between 1900 and 2000 chosen? to have a clearly marked time period

5.1.10. why is the corpus restricted to only five genres? fewer genre distinctions make the daily corpus work easier

5.1.11. compilation proceeds in four steps; cf. 5., 6., 7., 8., 9., 10.

5.2. DWDS Ergänzungscorpus

5.2.1. a much larger corpus from electronic versions of daily and weekly newspapers of the 1990s; Ex : Frankfurter Allgemeine Zeitung (1994–2000)

5.2.2. opportunistic corpus

5.2.3. 900 million tokens gathered in two million articles.

5.2.4. used in the Wolfgang-Paul-Preis project

6. 5. Text selection

6.1. Prose, verse and drama (26% of the DWDS Kerncorpus)

6.1.1. for every year between 1900 and 1999 : three longer prose works selected, 2 "classical" literary works and 1 work of light fiction

6.1.2. The basis for the selection of plays : 'Reclams Schauspielführer' (Stuttgart 1996)

6.1.3. The basis for the selection of poetry : Karl-Otto Conrady 'Das große deutsche Gedichtbuch' (Frankfurt 1977)

6.1.4. Selection process : provisional list based on official lists; members of the Academy of Sciences asked to comment on that list

6.2. Newspapers (27%)

6.2.1. selected from more than 50 different national and regional newspapers and magazines; Ex : Berliner Tagesspiegel (1945-2000)

6.2.2. samples reporting on specific events

6.2.3. large samples of Keesing's "Archiv der Gegenwart"

6.3. Science (22%)

6.3.1. 100 members of the Academy of Sciences asked to list four works for each decade

6.3.2. selected for each year : on average one important scientific monograph and four articles from scientific journals

6.4. Other nonfiction (20%)

6.4.1. self-help literature : car repair manuals, cookbooks, ...

6.4.2. texts rarely considered in lexicography : user manuals, prescription drug information, theater and concert programs, ...; considerable influence on the present-day language

6.4.3. large samples of legal texts taken from the two collections "Schönfelder" and "Sartorius"

6.5. Transcriptions of spoken language (5%)

6.5.1. 200 samples of radio interviews from the period before 1945 transcribed

6.5.2. Period after 1945 : transcriptions of German and Austrian parliamentary debates, TV debates & radio features; Ex : Deutschlandfunk (German Radio)

6.6. Texts from Austria and Switzerland

6.6.1. deliberately underrepresented
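
The genre percentages listed above translate directly into token targets for a corpus of a given size. A minimal sketch, assuming a budget of 100 million tokens as stated for the Kerncorpus; the function name `token_targets` is illustrative:

```python
# Genre shares of the Kerncorpus as listed above (percent).
GENRE_SHARES = {
    "prose, verse and drama": 26,
    "newspapers": 27,
    "science": 22,
    "other nonfiction": 20,
    "spoken language": 5,
}

def token_targets(total_tokens, shares=GENRE_SHARES):
    """Split a token budget across genres in proportion to their percentage shares."""
    assert sum(shares.values()) == 100, "shares must add up to 100%"
    return {genre: total_tokens * pct // 100 for genre, pct in shares.items()}

targets = token_targets(100_000_000)
for genre, n in targets.items():
    print(f"{genre:25s} {n:>12,d}")
```

The five shares add up to 100%, so the per-genre targets cover the whole token budget.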

7. 7. Digitization

7.1. 60% of the selected texts already available in electronic format

7.2. Two different methods for full text digitization

7.2.1. Optical character recognition (OCR) : the preferred alternative because of its cost-effectiveness; a recognition rate of 95% to 99% is often considered acceptable; issue : for lexicographic purposes the error rate needs to be very low

7.2.2. manual transcription : more expensive; the mark-up can already be done during transcription, which facilitates the XML conversion

7.3. importance of pre-editing

7.3.1. document selection

7.3.2. quality control of the input text

7.3.3. mark-up of difficult parts of the document

7.4. files in UTF-8 format with XML mark-up

7.4.1. transformed into the final XML format (adhering to TEI)
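
Why a 95% to 99% recognition rate is a problem for lexicography becomes clearer with a rough count. A small sketch, assuming the rate is measured per character and a page holds about 2000 characters (both are assumptions made for illustration, not figures from the paper):

```python
def expected_errors(n_chars, recognition_rate):
    """Expected number of misrecognized characters at a given per-character recognition rate."""
    return n_chars * (1.0 - recognition_rate)

CHARS_PER_PAGE = 2000  # assumed page size, for illustration only

for rate in (0.95, 0.99):
    errors = expected_errors(CHARS_PER_PAGE, rate)
    print(f"rate {rate:.0%}: about {errors:.0f} errors per page")
```

Even at the upper end of the range, every page would still contain a number of wrong characters, each of which can produce a spurious word form.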

8. 8. Structural Annotation

8.1. follows the TEI guidelines

8.1.1. (Text Encoding Initiative)

8.2. compromise between the depth of mark-up and the available budget

8.3. Much information is encoded

8.3.1. page-breaks, footnotes, titles, chapters, paragraphs, prefaces and epilogues

8.3.2. lines, characters' names & stage directions for plays

8.3.3. utterances and speakers' names encoded for interviews

8.3.4. first date of publication, copyright status, genre, and bibliographic reference.

8.3.5. column breaks and line breaks not annotated except in poems

8.3.6. The copyright status

9. 9. Linguistic Annotation

9.1. DWDS annotated using the TAGH morphology

9.1.1. system for automatic morphological analysis of German word forms

9.1.2. based on a stem lexicon with allomorphs and a concatenative mechanism

9.1.3. lemmatisation

9.1.4. Recognition rate : 98.3% for the DWDS Kerncorpus

9.2. when a word form has several analyses, the one with fewer segmentations is preferred

9.3. texts annotated with lexical categories (according to the STTS tagset)
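
The preference for fewer segmentations (9.2) can be sketched as a simple selection among competing analyses. The segmentations below are hypothetical and are not actual TAGH output; the function name `prefer_fewest_segments` is likewise illustrative:

```python
# Hypothetical competing analyses of one word form; each analysis is a list of segments.
# These segmentations are illustrative and are not actual TAGH output.
ANALYSES = {
    "Abteilungen": [
        ["Abt", "eil", "ung", "en"],
        ["Abteilung", "en"],
    ],
}

def prefer_fewest_segments(analyses):
    """Pick the analysis with the fewest segments (the first one wins on ties)."""
    return min(analyses, key=len)

best = prefer_fewest_segments(ANALYSES["Abteilungen"])
print("+".join(best))
```

The heuristic favours analyses built from longer, lexicalized stems over analyses that chop the word into many short pieces.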

10. 10. Sampling the DWDS Kerncorpus

10.1. balanced text corpus of about 100,000,000 tokens extracted with a sampling procedure

10.1.1. extracted from documents with 254,293,835 tokens

10.2. Sampling process

10.2.1. respects the constraints for the DWDS Kerncorpus

10.2.2. calculates the expectation value according to the text type distribution

10.3. Version 0.95 of the DWDS Kerncorpus

10.3.1. 79,322 documents

10.3.2. 100,600,993 tokens

10.3.3. 2,224,542 types
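
The sampling step can be pictured as drawing documents per text type until the expected token count for that type is reached. This is a minimal sketch under stated assumptions, not the procedure actually used for the Kerncorpus: the function `sample_by_genre`, the toy pool and its two-genre split are all hypothetical. The last lines only compute ratios from the figures reported above.

```python
import random

def sample_by_genre(pool, shares_pct, target_total, seed=0):
    """Randomly draw documents per genre until each genre's expected token count is reached.

    pool:       list of (doc_id, genre, n_tokens) tuples
    shares_pct: genre -> share of target_total in percent
    """
    rng = random.Random(seed)
    selected = []
    for genre, pct in shares_pct.items():
        expected = target_total * pct // 100   # expectation value for this genre
        docs = [doc for doc in pool if doc[1] == genre]
        rng.shuffle(docs)
        drawn = 0
        for doc in docs:
            if drawn >= expected:
                break
            selected.append(doc)
            drawn += doc[2]
    return selected

# Toy pool: 200 documents of 1000 tokens each, alternating between two genres.
pool = [(i, "newspapers" if i % 2 else "science", 1000) for i in range(200)]
sample = sample_by_genre(pool, {"newspapers": 60, "science": 40}, target_total=50_000)
print(len(sample), sum(doc[2] for doc in sample))

# Ratios implied by the figures reported for version 0.95.
print(f"sampled fraction of source tokens: {100_600_993 / 254_293_835:.3f}")
print(f"type/token ratio: {2_224_542 / 100_600_993:.4f}")
print(f"mean tokens per document: {100_600_993 / 79_322:.0f}")
```

A real procedure would also have to balance the sample over time, which this sketch leaves out.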

11. 11. Querying the DWDS Kerncorpus

11.1. DWDS Kerncorpus is publicly available from the project's website

11.2. linguistic search engine DDC (Dialing DWDS Concordancer)

11.2.1. extracts metadata from the XML/TEI header and indexes the text

11.3. Input queries

11.3.1. word forms, lexical categories, lemmas, thesaurus elements

11.3.2. Boolean AND, OR, NOT searches supported

11.3.3. interval searches : NEAR, FOLLOWED_BY

11.3.4. regular expressions

11.3.5. collocations

11.4. Some issues

11.4.1. access is sometimes restricted by copyright protection

11.4.2. 29% of the Kerncorpus is available for internal use only
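
The Boolean and interval searches listed under 11.3 can be illustrated on a tokenized sentence. The sketch below shows the idea only; it does not reproduce DDC's query syntax or implementation, and the function names `positions`, `near` and `followed_by` are hypothetical:

```python
def positions(tokens, word):
    """Indices at which `word` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == word]

def near(tokens, a, b, max_dist):
    """True if `a` and `b` occur within `max_dist` tokens of each other, in either order."""
    return any(abs(i - j) <= max_dist
               for i in positions(tokens, a) for j in positions(tokens, b))

def followed_by(tokens, a, b, max_dist):
    """True if `b` occurs after `a`, at most `max_dist` tokens later."""
    return any(0 < j - i <= max_dist
               for i in positions(tokens, a) for j in positions(tokens, b))

sentence = "das Wörterbuch beschreibt den Wortschatz des 20. Jahrhunderts".split()
print(near(sentence, "Wortschatz", "Wörterbuch", 3))
print(followed_by(sentence, "Wortschatz", "Wörterbuch", 3))
print("Wortschatz" in sentence and "Zeitung" not in sentence)  # Boolean AND NOT
```

The NEAR-style check ignores word order, while the FOLLOWED_BY-style check requires the second word to come after the first.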

12. 12. Conclusion and further work

12.1. DWDS

12.1.1. first reference corpus for the German language of the 20th century

12.1.2. balanced & equally distributed over all periods of the 20th century

12.1.3. lemmatized and part-of-speech tagged

12.1.4. enabling linguistic queries

12.1.5. web-based query interface with integrated tools for computing collocations

12.1.6. source for various language-based work

12.1.7. resource for psychological and psycholinguistic research

12.1.8. freely available part + copyright-protected part

12.2. Further work

12.2.1. enlarge the opportunistic corpus : a 100-million-word corpus is not sufficient for exploring certain phenomena

12.2.2. comparison of the DWDS Kerncorpus with balanced web-based corpora

13. Giovani MARCUZZI L3 SDL