The DWDS corpus, A. GEYKEN (2006)

1. 6. Copyright issues & opportunities

1.1. Goal : convince authors and copyright owners to collaborate in the compilation

1.2. committee formed of prominent public personalities

1.2.1. advised the project

1.2.2. project's negotiations with publishing houses

1.3. Several procedures to reassure copyright holders

1.3.1. only selected samples, not the entire text

1.3.2. a flexible mechanism to display query results with variable context windows

1.3.3. password protection for copyrighted texts

1.3.4. Anonymization of named entities

1.4. negotiations with publishers are slow and time-consuming

1.5. obtained the permission of 15 publishing houses

1.5.1. Ex : Suhrkamp, a famous publishing house
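
The "variable context windows" procedure listed above can be pictured as a keyword-in-context (KWIC) display whose width depends on the copyright status of the text. The following is only a minimal sketch: the function names and the window sizes (10 and 3 tokens) are assumptions made for illustration and are not taken from the DWDS project.

```python
def kwic(tokens, keyword, window):
    """Return (left context, hit, right context) for every match of `keyword`.

    `window` is the number of tokens shown on each side of the hit.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            hits.append((" ".join(left), tok, " ".join(right)))
    return hits


def context_window(copyrighted, free_window=10, restricted_window=3):
    """Hypothetical rule: copyrighted texts get a narrower context window."""
    return restricted_window if copyrighted else free_window


tokens = "Der alte Mann ging langsam über die Brücke nach Hause".split()
hits = kwic(tokens, "über", context_window(copyrighted=True))
for left, hit, right in hits:
    print(f"{left} [{hit}] {right}")
```

With a narrow window, a query still returns usable evidence for lexicographic work while exposing only a short excerpt of the protected text.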

2. 5. Text selection

2.1. Prose, verse and drama (26% of the DWDS Kerncorpus)

2.1.1. every year between 1900 and 1999 : three longer prose works selected

2.1.1.1. 2 "classical" literary works and 1 of light fiction

2.1.2. The basis for the selection of plays : 'Reclams Schauspielführer' (Stuttgart 1996)

2.1.3. The basis for the selection of poetry : Karl-Otto Conrady 'Das große deutsche Gedichtbuch' (Frankfurt 1977)

2.1.4. Selection process

2.1.4.1. provisional list based on official lists

2.1.4.2. Members of the Academy of Sciences asked to comment on that list

2.2. Newspapers (27%)

2.2.1. selected from more than 50 different national and regional newspapers and magazines

2.2.1.1. Ex : Berliner Tagesspiegel (1945-2000)

2.2.2. samples reporting on specific events

2.2.3. large samples of Keesing's "Archiv der Gegenwart"

2.3. Science (22%)

2.3.1. 100 members of the Academy of Sciences asked to list four works for each decade

2.3.2. selected for each year

2.3.2.1. on average one important scientific monograph

2.3.2.2. on average four articles from scientific journals

2.4. Other nonfiction (20%)

2.4.1. self-help literature

2.4.1.1. car repair manuals, cookbooks, ...

2.4.2. texts rarely considered in lexicography

2.4.2.1. user manuals, prescription drug information, theater and concert programs, ...

2.4.2.2. considerable influence on the present-day language

2.4.3. large samples of legal texts taken from the two collections "Schönfelder" and "Sartorius"

2.5. Transcriptions of spoken language (5%)

2.5.1. 200 samples of radio interviews from the period before 1945 transcribed

2.5.2. Period after 1945 : transcriptions of German and Austrian parliamentary debates, TV debates & radio features

2.5.2.1. Ex. : Deutschlandfunk (German Radio)

2.6. Texts from Austria and Switzerland

2.6.1. deliberately underrepresented
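
The genre percentages above can be turned into rough token budgets. The sketch below assumes a target of 100 million tokens and, purely for illustration, an even split of each genre over ten decades; the per-decade split is an assumption and not a figure from the source.

```python
# Genre shares in percent, as listed in the outline above.
GENRE_SHARES = {
    "prose, verse and drama": 26,
    "newspapers": 27,
    "science": 22,
    "other nonfiction": 20,
    "spoken language": 5,
}
TARGET_TOKENS = 100_000_000  # approximate size of the Kerncorpus


def token_targets(shares, total):
    """Convert percentage shares into absolute token budgets."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    return {genre: total * pct // 100 for genre, pct in shares.items()}


targets = token_targets(GENRE_SHARES, TARGET_TOKENS)
# Assumed even split over ten decades (illustration only).
per_decade = {genre: n // 10 for genre, n in targets.items()}

for genre, n in targets.items():
    print(f"{genre:<24} {n:>12,} tokens, {per_decade[genre]:>10,} per decade")
```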

3. 7. Digitization

3.1. 60% of the selected texts already available in electronic format

3.2. Two different methods for full text digitization

3.2.1. Optical character recognition (OCR)

3.2.1.1. the preferred alternative because of its cost-effectiveness

3.2.1.2. recognition rate of 95% to 99% acceptable

3.2.1.3. issues for lexicographic purposes

3.2.1.3.1. the error rate needs to be very low

3.2.2. manual transcription

3.2.2.1. more expensive

3.2.2.2. the mark-up can already be done during transcription

3.2.2.2.1. facilitates the XML conversion

3.3. importance of pre-editing

3.3.1. document selection

3.3.2. quality control of the input text

3.3.3. mark-up of difficult parts of the document

3.4. files in UTF-8 format with XML mark-up

3.4.1. transformed into the final xml format (adhering to TEI)
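
To see why an OCR recognition rate of 95% to 99% can still be a problem for lexicography, it helps to translate the rate into expected errors. The sketch below assumes that the rate is measured per character and that errors are independent; neither assumption is stated in the source, and the 99.9% row is added only for comparison.

```python
def expected_errors(recognition_rate, n_chars):
    """Expected number of misrecognized characters in a text of n_chars."""
    return round(n_chars * (1 - recognition_rate))


def word_intact_probability(recognition_rate, word_length):
    """Probability that every character of a word is recognized correctly,
    assuming independent per-character errors."""
    return recognition_rate ** word_length


PAGE_CHARS = 2000  # assumed size of a printed page, for illustration

for rate in (0.95, 0.99, 0.999):
    errors = expected_errors(rate, PAGE_CHARS)
    intact = word_intact_probability(rate, 10)
    print(f"rate {rate:.3f}: about {errors} errors per page, "
          f"10-letter word intact with p = {intact:.3f}")
```

Under these assumptions, a sizeable share of longer word forms would contain at least one wrong character at the lower end of the range, which would distort type counts and lemma lookups.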

4. 8. Structural Annotation

4.1. follows the TEI guidelines

4.1.1. (Text Encoding Initiative)

4.2. compromise between the depth of mark-up and the available budget

4.3. Much information was encoded

4.3.1. page-breaks, footnotes, titles, chapters, paragraphs, prefaces and epilogues

4.3.2. lines, characters' names, & stage directions, for plays

4.3.3. utterances and speakers' names encoded for interviews

4.3.4. first date of publication, copyright status, genre, and bibliographic reference.

4.3.5. column breaks and line breaks not annotated

4.3.5.1. except in poems

4.3.6. The copyright status
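
The kind of mark-up listed above can be illustrated with a small TEI-style fragment and a few lines of parsing code. This is a hedged sketch: the element names (`teiHeader`, `titleStmt`, `availability`, `sp`, `speaker`, `stage`, `l`) exist in the TEI guidelines, but the document itself, the omission of the TEI namespace, and the exact header layout are assumptions and do not reproduce the DWDS schema.

```python
import xml.etree.ElementTree as ET

# Invented sample document; not taken from the DWDS corpus.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispielstück</title><author>N. N.</author></titleStmt>
      <publicationStmt>
        <date>1925</date>
        <availability status="restricted"/>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <sp>
        <speaker>ANNA</speaker>
        <stage>tritt ein</stage>
        <l>Guten Abend.</l>
      </sp>
    </body>
  </text>
</TEI>"""

root = ET.fromstring(SAMPLE)
header = root.find("teiHeader/fileDesc")

# Metadata of the kind mentioned in the outline: date, copyright status, title.
metadata = {
    "title": header.findtext("titleStmt/title"),
    "year": int(header.findtext("publicationStmt/date")),
    "copyright": header.find("publicationStmt/availability").get("status"),
}

# Structural units for plays: speakers, stage directions and verse lines.
speakers = [sp.findtext("speaker") for sp in root.iter("sp")]
stage_directions = [el.text for el in root.iter("stage")]
lines = [el.text for el in root.iter("l")]

print(metadata, speakers, stage_directions, lines)
```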

5. 9. Linguistic Annotation

5.1. DWDS annotated using the TAGH morphology

5.1.1. system for automatic morphological analysis of German word forms

5.1.2. based on a stem lexicon with allomorphs and a concatenative mechanism

5.1.3. lemmatisation

5.1.4. Recognition rate : 98.3% for the DWDS Kerncorpus

5.2. analysis with fewer segmentations is preferred

5.3. texts annotated with lexical categories (according to the STTS tagset)
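
The preference for analyses with fewer segmentations can be sketched as a simple selection rule. The candidate analyses below are hand-written for illustration and are not output of the TAGH system; how TAGH breaks ties or weights analyses is not covered by the source.

```python
def preferred_analysis(analyses):
    """Pick the analysis with the fewest segments.

    On ties, `min` returns the first candidate in the list.
    """
    return min(analyses, key=len)


# Two hypothetical segmentations of the word form "Staubecken".
candidates = [
    ["Staub", "Ecke", "n"],  # "dust corners"
    ["Stau", "Becken"],      # "reservoir"
]

best = preferred_analysis(candidates)
print("|".join(best))
```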

6. 10. Sampling the DWDS Kerncorpus

6.1. balanced text corpus of about 100,000,000 tokens extracted with a sampling procedure

6.1.1. extracted from documents with 254,293,835 tokens

6.2. Sampling process

6.2.1. respects the constraints for the DWDS Kerncorpus

6.2.2. calculates the expectation value according to the text type distribution

6.3. Version 0.95 of the DWDS Kerncorpus

6.3.1. 79,322 documents

6.3.2. 100,600,993 tokens

6.3.3. 2,224,542 types
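
The sampling step can be pictured as drawing documents at random until each text type reaches its expected token count. The sketch below is an assumed, simplified procedure (greedy selection after a seeded shuffle); the source only states that the sampling respects the corpus constraints and computes an expectation value from the text type distribution, so every name and design choice here is hypothetical.

```python
import random

# Figures from the outline: sampled tokens versus tokens in the source documents.
SAMPLING_FRACTION = 100_600_993 / 254_293_835


def sample_corpus(documents, shares, total_tokens, seed=0):
    """Greedy stratified sampling.

    documents: list of (doc_id, genre, n_tokens) tuples
    shares:    genre -> fraction of the target corpus (fractions sum to 1)
    Returns the chosen document ids and the tokens collected per genre.
    """
    rng = random.Random(seed)
    expected = {genre: total_tokens * share for genre, share in shares.items()}
    picked = {genre: 0 for genre in shares}
    sample = []
    pool = list(documents)
    rng.shuffle(pool)
    for doc_id, genre, n_tokens in pool:
        # Accept a document only if it does not overshoot the genre budget.
        if picked[genre] + n_tokens <= expected[genre]:
            sample.append(doc_id)
            picked[genre] += n_tokens
    return sample, picked


# Toy pool: 20 documents of 100 tokens in each of two genres.
docs = [(f"{g}-{i}", g, 100) for g in ("news", "science") for i in range(20)]
sample, picked = sample_corpus(docs, {"news": 0.5, "science": 0.5}, 1000)
print(f"sampling fraction: {SAMPLING_FRACTION:.4f}")
print(len(sample), picked)
```

A production procedure would also have to balance the sample over time and handle documents of very different lengths, which this toy version ignores.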

7. 11. Querying the DWDS Kerncorpus

7.1. DWDS Kerncorpus is publicly available from the project's website

7.2. linguistic search engine DDC (Dialing/DWDS Concordancer)

7.2.1. extracts metadata from the XML/TEI header and indexes the text

7.3. Input queries

7.3.1. word forms, lexical categories, lemmas, thesaurus elements

7.3.2. Boolean AND, OR, NOT searches supported

7.3.3. interval searches : NEAR, FOLLOWED_BY

7.3.4. regular expressions

7.3.5. collocations

7.4. Some issues

7.4.1. access sometimes restricted by copyright protection

7.4.2. 29% of the Kerncorpus available for internal use only
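
The interval searches NEAR and FOLLOWED_BY mentioned above can be illustrated on a plain token list. This is a toy sketch, not DDC's query syntax or implementation: the function names, the distance convention (difference between token positions) and the default distance are assumptions.

```python
def positions(tokens, word):
    """Indices at which `word` occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == word]


def near(tokens, a, b, max_dist):
    """True if `a` and `b` occur within `max_dist` positions, in either order."""
    return any(
        0 < abs(i - j) <= max_dist
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def followed_by(tokens, a, b, max_dist=1):
    """True if `b` occurs after `a`, at most `max_dist` positions later."""
    return any(
        0 < j - i <= max_dist
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


tokens = "der Hund jagt die Katze".split()
print(near(tokens, "Katze", "Hund", 3))         # order does not matter
print(followed_by(tokens, "Hund", "Katze", 3))  # order matters
print(followed_by(tokens, "Katze", "Hund", 3))
```

A real concordancer would run such tests against an inverted index rather than scanning each text, but the ordering and distance logic is the same idea.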

8. 1. Introduction

8.1. DWDS project

8.1.1. a balanced core corpus ("Kerncorpus")

8.1.2. an opportunistic supplementary corpus ("Ergänzungscorpus")

8.2. 3 main motivations

8.2.1. no dictionary of the German language offers a satisfactory representation of the lexicon of the entire 20th century.

8.2.1.1. Grimms' or Duden Dictionary : language of the first half of the century

8.2.2. alphabetical order of traditional print dictionaries

8.2.2.1. a disadvantage for the study of lexical fields

8.2.3. current dictionaries are not based on a balanced corpus of German

8.2.3.1. manual excerption of words and word uses from texts

8.2.3.1.1. Ex : Grimm's dictionary

8.2.3.1.2. requires a large number of persons

8.2.3.1.3. excerptors may inadvertently overlook important words or word senses

8.2.3.2. mixture of manual excerption and automatically compiled electronic opportunistic corpora

8.2.3.2.1. Ex : Duden dictionary

9. 2. The need for a new corpus

9.1. in 1999, no satisfactory corpora of 20th century German existed

9.1.1. LIMAS Corpus : a first-generation corpus

9.1.1.1. created in 1973

9.1.1.2. followed the model of the Brown corpus

9.1.1.3. 500 text samples of 2000 tokens each from the year 1964

9.1.1.4. 20 different text genres

9.1.1.5. considered a balanced corpus

9.1.1.6. 1 million tokens and 100,000 types

9.1.1.6.1. too small to constitute the text basis for a large monolingual dictionary.

9.1.2. IDS Corpus

9.1.2.1. created by Institut für deutsche Sprache (IDS) in Mannheim

9.1.2.2. 2 billion written tokens

9.1.2.3. 4400 hours of recordings of spoken language

9.1.2.4. focus mainly on recent newspaper texts

9.1.2.5. many kinds of text underrepresented in this corpus

9.1.2.6. very few texts from the first half of the 20th century

9.1.2.6.1. not chronologically balanced

9.1.3. resources provided by computational linguists & new dictionary mostly based on newspapers

9.1.3.1. Negra

9.1.3.2. TüBa-D/Z

9.1.3.3. LexiView

9.1.4. general-language corpora of German created on the basis of Web resources

9.1.4.1. Sharoff 2004

9.1.4.2. German Internet Corpus

9.1.4.3. do not obviate the need for a reference corpus

10. 3. Corpus design requirements

10.1. Corpora are generally multi-purpose

10.2. Purposes of DWDS corpus

10.2.1. serve as the empirical basis of a large monolingual dictionary of the 20th/21st century

10.2.2. offer more subtle linguistic descriptions of lexical items

10.2.2.1. (concerning semantics and syntagmatics)

10.3. representativeness in a statistical sense cannot be obtained for corpora

10.3.1. causes : difficulties associated with defining the underlying population

10.3.2. use of the modest notion of "balance"

10.4. 3 desiderata for the DWDS Kerncorpus

10.4.1. DWDS has to be balanced with respect to text types

10.4.2. DWDS must be large enough for its purpose

10.4.3. DWDS must contain a considerable amount of influential and important literature

10.4.4. (Sinclair 1994) : these properties characterize a reference corpus

11. 4. Design of the DWDS corpus

11.1. DWDS Kerncorpus

11.1.1. constructed at the Berlin-Brandenburg Academy of Sciences (BBAW)

11.1.2. between 2000 and 2003

11.1.3. a reference (balanced) corpus of the 20th century German language

11.1.4. 100 million tokens

11.1.5. equally distributed over time

11.1.5.1. from 1900 to 2000 (sometimes before & after)

11.1.6. equally distributed over five genres

11.1.7. corpora of written and spoken language

11.1.8. satisfies the 3 criteria (desiderata)

11.1.9. why was the period between 1900 and 2000 chosen?

11.1.9.1. to have a clearly marked time period

11.1.10. why is the corpus restricted to only five genres?

11.1.10.1. fewer genre distinctions make the daily corpus work easier

11.1.11. compilation proceeding in four steps

11.1.11.1. Cf. 5. 6. 7. 8. 9. 10.

11.2. DWDS Ergänzungscorpus

11.2.1. a much larger corpus from electronic versions of daily and weekly newspapers of the 1990s

11.2.1.1. Ex : Frankfurter Allgemeine Zeitung (1994–2000)

11.2.2. opportunistic corpus

11.2.3. 900 million tokens gathered in two million articles.

11.2.4. used in the Wolfgang-Paul-Preis project

12. 12. Conclusion and further work

12.1. DWDS

12.1.1. first reference corpus for the German language of the 20th century

12.1.2. balanced & equally distributed over all periods of the 20th century

12.1.3. lemmatized and part-of-speech tagged

12.1.4. enabling linguistic queries

12.1.5. web based query interface

12.1.5.1. collocations computation tools integrated

12.1.6. source for various language-based work

12.1.7. resource for psychological and psycholinguistic research

12.1.8. freely available part + copyright-protected part

12.2. Further work

12.2.1. enlarge the opportunistic corpus

12.2.1.1. 100 million word corpus not sufficient for exploring certain phenomena

12.2.2. comparison of the DWDS Kerncorpus with balanced web-based corpora

13. Giovani MARCUZZI L3 SDL