1. 6. Copyright issues & opportunities
1.1. Goal : convince authors and copyright owners to collaborate in the compilation
1.2. committee formed of prominent public personalities
1.2.1. advised the project
1.2.2. project's negotiations with publishing houses
1.3. Several procedures to protect copyrighted material
1.3.1. only selected samples, not the entire text
1.3.2. a flexible mechanism to display query results with variable context windows
1.3.3. password protection for copyrighted texts
1.3.4. Anonymization of named entities
1.4. negotiations with publishers are slow and time-consuming
1.5. obtained the permission of 15 publishing houses
1.5.1. Ex : Suhrkamp, a famous publishing house
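The variable context windows mentioned under 1.3.2 can be illustrated with a small keyword-in-context routine. This is a minimal sketch, not the project's implementation: the function name `kwic`, the status labels and the window sizes in `CONTEXT_WINDOW` are all assumptions made for illustration.

```python
from typing import Dict, List


def kwic(tokens: List[str], keyword: str, window: int) -> List[str]:
    """Return one keyword-in-context line per hit, with at most
    `window` tokens of context on each side of the keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            lines.append(" ".join(left + [f"[{tok}]"] + right))
    return lines


# Hypothetical policy: copyrighted texts get a narrower window.
CONTEXT_WINDOW: Dict[str, int] = {"free": 10, "restricted": 3}

if __name__ == "__main__":
    text = "Die Verhandlungen mit den Verlagen sind langwierig".split()
    for line in kwic(text, "Verlagen", CONTEXT_WINDOW["restricted"]):
        print(line)
```

Tying the window size to a per-document copyright status is one plausible way to show hits without exposing the full text.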
2. 5. Text selection
2.1. Prose, verse and drama (26% of the DWDS Kerncorpus)
2.1.1. every year between 1900 and 1999 : three longer prose works selected
2.1.1.1. 2 "classical" literary works and 1 of light fiction
2.1.2. The basis for the selection of plays : 'Reclams Schauspielführer' (Stuttgart 1996)
2.1.3. The basis for the selection of poetry : Karl-Otto Conrady 'Das große deutsche Gedichtbuch' (Frankfurt 1977)
2.1.4. Selection process
2.1.4.1. provisional list based on official lists
2.1.4.2. Members of the Academy of Sciences asked to comment on that list
2.2. Newspapers (27%)
2.2.1. selected from more than 50 different national and regional newspapers and magazines
2.2.1.1. Ex : Berliner Tagesspiegel (1945-2000)
2.2.2. samples reporting on specific events
2.2.3. large samples of Keesing's "Archiv der Gegenwart"
2.3. Science (22%)
2.3.1. 100 members of the Academy of Sciences asked to list four works for each decade
2.3.2. selected for each year
2.3.2.1. on average one important scientific monograph
2.3.2.2. on average four articles from scientific journals
2.4. Other nonfiction (20%)
2.4.1. self-help literature
2.4.1.1. car repair manuals, cookbooks, ...
2.4.2. texts rarely considered in lexicography
2.4.2.1. user manuals, prescription drug information, theater and concert programs, ...
2.4.2.2. considerable influence on the present-day language
2.4.3. large samples of legal texts taken from the two collections "Schönfelder" and "Sartorius"
2.5. Transcriptions of spoken language (5%)
2.5.1. 200 samples of radio interviews from the period before 1945 transcribed
2.5.2. Period after 1945 : transcriptions of German and Austrian parliamentary debates, TV debates & radio features
2.5.2.1. Ex. : Deutschlandfunk (German Radio)
2.6. Texts from Austria and Switzerland
2.6.1. deliberately underrepresented
3. 7. Digitization
3.1. 60% of the selected texts already available in electronic format
3.2. Two different methods for full text digitization
3.2.1. Optical character recognition (OCR)
3.2.1.1. the preferred alternative because of its cost-effectiveness
3.2.1.2. recognition rate of 95% to 99% acceptable
3.2.1.3. issues for lexicographic purposes
3.2.1.3.1. the error rate needs to be very low
3.2.2. manual transcription
3.2.2.1. more expensive
3.2.2.2. the mark-up can already be done during transcription
3.2.2.2.1. facilitate the XML conversion
3.3. importance of pre-editing
3.3.1. document selection
3.3.2. quality control of the input text
3.3.3. mark-up of difficult parts of the document
3.4. files in UTF-8 format with XML mark-up
3.4.1. transformed into the final xml format (adhering to TEI)
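To see why a character recognition rate of 95% to 99% can still be a problem for lexicographic work, the following sketch estimates how often a whole word survives OCR intact. It assumes, purely for illustration, that character errors are independent and that words have a fixed length; neither assumption comes from the source.

```python
def word_accuracy(char_accuracy: float, word_length: int) -> float:
    """Probability that every character of a word is recognized correctly,
    assuming independent character errors (an illustrative simplification)."""
    return char_accuracy ** word_length


def expected_word_errors(char_accuracy: float, word_length: int, n_words: int) -> float:
    """Expected number of words that contain at least one OCR error."""
    return n_words * (1.0 - word_accuracy(char_accuracy, word_length))


if __name__ == "__main__":
    for rate in (0.95, 0.99, 0.999):
        acc = word_accuracy(rate, word_length=6)
        print(f"char accuracy {rate:.3f} -> 6-letter word accuracy {acc:.3f}")
```

Under these assumptions, a 95% character rate leaves roughly 26% of six-letter words with at least one error, and a 99% rate still leaves about 6%.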
4. 8. Structural Annotation
4.1. follows the TEI guidelines
4.1.1. (Text Encoding Initiative)
4.2. compromise between the depth of mark-up and the available budget
4.3. Much information is encoded
4.3.1. page-breaks, footnotes, titles, chapters, paragraphs, prefaces and epilogues
4.3.2. lines, characters' names, and stage directions for plays
4.3.3. utterances and speakers' names encoded for interviews
4.3.4. first date of publication, copyright status, genre, and bibliographic reference.
4.3.5. column breaks and line breaks not annotated
4.3.5.1. except in poems
4.3.6. The copyright status
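As a rough illustration of how header metadata and structural mark-up of this kind can be read back from a TEI-style file, the sketch below parses a tiny invented document with Python's standard library. The element names are real TEI elements, but the sample is simplified, omits the TEI namespace, and is not validated against any schema. The actual DWDS header layout is not given in the source.

```python
import xml.etree.ElementTree as ET

# Invented, simplified sample; not an actual DWDS document.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Beispieltext</title></titleStmt>
      <publicationStmt>
        <date>1927</date>
        <availability status="restricted"/>
      </publicationStmt>
    </fileDesc>
  </teiHeader>
  <text><body><p>Erster Absatz.<pb n="2"/>Zweiter Teil.</p></body></text>
</TEI>"""


def read_metadata(xml_string: str) -> dict:
    """Collect a few header fields and count page breaks in the body."""
    root = ET.fromstring(xml_string)
    header = root.find("teiHeader")
    availability = header.find("fileDesc/publicationStmt/availability")
    return {
        "title": header.findtext("fileDesc/titleStmt/title"),
        "date": header.findtext("fileDesc/publicationStmt/date"),
        "availability": availability.get("status"),
        "page_breaks": len(root.findall(".//pb")),
    }


if __name__ == "__main__":
    print(read_metadata(SAMPLE))
```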
5. 9. Linguistic Annotation
5.1. DWDS annotated using the TAGH morphology
5.1.1. system for automatic morphological analysis of German word forms
5.1.2. based on a stem lexicon with allomorphs and a concatenative mechanism
5.1.3. lemmatisation
5.1.4. Recognition rate : 98.3% for the DWDS Kerncorpus
5.2. analysis with fewer segmentations is preferred
5.3. texts annotated with lexical categories (according to the STTS tagset)
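The preference for analyses with fewer segmentations (5.2) can be sketched as a simple selection rule. This is a toy illustration, not the TAGH algorithm: the candidate analyses below are invented, and ties are resolved by keeping the first candidate, which is an assumption.

```python
from typing import List, Tuple

# An analysis is (word form, list of segments). Examples are invented.
Analysis = Tuple[str, List[str]]


def prefer_fewest_segments(analyses: List[Analysis]) -> Analysis:
    """Pick the analysis with the fewest segments; the first one wins on ties."""
    if not analyses:
        raise ValueError("no analyses to choose from")
    return min(analyses, key=lambda a: len(a[1]))


if __name__ == "__main__":
    candidates = [
        ("Staubecken", ["Staub", "Ecke", "n"]),  # "dust corners"
        ("Staubecken", ["Stau", "Becken"]),      # "reservoir"
    ]
    print(prefer_fewest_segments(candidates))
```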
6. 10. Sampling the DWDS Kerncorpus
6.1. balanced text corpus of about 100,000,000 tokens extracted with a sampling procedure
6.1.1. extracted from documents with 254,293,835 tokens
6.2. Sampling process
6.2.1. respects the constraints for the DWDS Kerncorpus
6.2.2. calculates the expectation value according to the text type distribution
6.3. Version 0.95 of the DWDS Kerncorpus
6.3.1. 79,322 documents
6.3.2. 100,600,993 tokens
6.3.3. 2,224,542 types
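The arithmetic behind the expectation values can be sketched as follows. The genre percentages are those listed under "5. Text selection" and the token counts are those given above; how the actual sampling procedure enforces its constraints is not described in the source, so this only shows the target sizes per text type and the overall sampling fraction.

```python
from typing import Dict

TARGET_TOKENS = 100_000_000

# Genre shares in percent, as listed in the text selection section.
GENRE_PERCENT: Dict[str, int] = {
    "prose, verse and drama": 26,
    "newspapers": 27,
    "science": 22,
    "other nonfiction": 20,
    "spoken language": 5,
}


def expected_tokens(target: int, percent: Dict[str, int]) -> Dict[str, int]:
    """Expected number of tokens per genre for a given target corpus size."""
    if sum(percent.values()) != 100:
        raise ValueError("genre shares must sum to 100%")
    return {genre: target * p // 100 for genre, p in percent.items()}


if __name__ == "__main__":
    for genre, n in expected_tokens(TARGET_TOKENS, GENRE_PERCENT).items():
        print(f"{genre:25s} {n:>12,d}")
    # Share of the source documents that ended up in version 0.95.
    print(f"sampling fraction: {100_600_993 / 254_293_835:.3f}")
```

By these figures, a little under 40% of the available tokens were retained in version 0.95.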
7. 11. Querying the DWDS Kerncorpus
7.1. DWDS Kerncorpus is publicly available from the project's website
7.2. linguistic search engine DDC (Dialing/DWDS Concordancer)
7.2.1. extracts metadata from the XML/TEI header and indexes the text
7.3. Input queries
7.3.1. word forms, lexical categories, lemmas, thesaurus elements
7.3.2. Boolean AND, OR, NOT searches supported
7.3.3. interval searches : NEAR, FOLLOWED_BY
7.3.4. regular expressions
7.3.5. collocations
7.4. Some issues
7.4.1. access is sometimes restricted by copyright protection
7.4.2. 29% of the Kerncorpus is available for internal use only
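The interval searches listed under 7.3.3 can be illustrated with a toy matcher over a tokenized sentence. This is not DDC's query syntax or implementation. The semantics assumed here are that NEAR means "within a maximum token distance, in either order" and that FOLLOWED_BY means "the second word occurs after the first within that distance"; the function names are hypothetical.

```python
from typing import List


def positions(tokens: List[str], word: str) -> List[int]:
    """Indices of all case-insensitive occurrences of `word`."""
    return [i for i, t in enumerate(tokens) if t.lower() == word.lower()]


def near(tokens: List[str], a: str, b: str, max_distance: int) -> bool:
    """True if `a` and `b` occur within `max_distance` tokens, in either order."""
    return any(
        i != j and abs(i - j) <= max_distance
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


def followed_by(tokens: List[str], a: str, b: str, max_distance: int) -> bool:
    """True if `b` occurs after `a`, at most `max_distance` tokens later."""
    return any(
        0 < j - i <= max_distance
        for i in positions(tokens, a)
        for j in positions(tokens, b)
    )


if __name__ == "__main__":
    sentence = "Der Hund jagt die Katze".split()
    print(near(sentence, "Katze", "Hund", 3))
    print(followed_by(sentence, "Katze", "Hund", 3))
```

A real concordancer would answer such queries from a positional index rather than by scanning every sentence; the scan here only keeps the example short.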
8. 1. Introduction
8.1. DWDS project
8.1.1. a balanced core corpus ("Kerncorpus")
8.1.2. an opportunistic supplementary corpus ("Ergänzungscorpus")
8.2. 3 main motivations
8.2.1. no dictionary of the German language offers a satisfactory representation of the lexicon of the entire 20th century.
8.2.1.1. Grimms' or Duden Dictionary : language of the first half of the century
8.2.2. alphabetical order of traditional print dictionaries
8.2.2.1. disadvantageous for the study of lexical fields
8.2.3. current dictionaries are not based on a balanced corpus of German
8.2.3.1. manual excerption of words and word uses from texts
8.2.3.1.1. Ex : Grimms' dictionary
8.2.3.1.2. requires a large number of persons
8.2.3.1.3. excerptors may inadvertently overlook important words or word senses
8.2.3.2. mixture of manual excerption and automatically compiled electronic opportunistic corpora
8.2.3.2.1. Ex : Duden dictionary
9. 2. The need for a new corpus
9.1. in 1999, no satisfactory corpora of 20th century German existed
9.1.1. LIMAS Corpus : a first-generation corpus
9.1.1.1. created in 1973
9.1.1.2. followed the model of the Brown corpus
9.1.1.3. 500 text samples of 2000 tokens each from the year 1964
9.1.1.4. 20 different text genres
9.1.1.5. considered a balanced corpus
9.1.1.6. 1 million tokens and 100,000 types
9.1.1.6.1. too small to constitute the text basis for a large monolingual dictionary.
9.1.2. IDS Corpus
9.1.2.1. created by Institut für deutsche Sprache (IDS) in Mannheim
9.1.2.2. 2 billion written tokens
9.1.2.3. 4400 hours of recordings of spoken language
9.1.2.4. focus mainly on recent newspaper texts
9.1.2.5. many kinds of text are underrepresented in this corpus
9.1.2.6. very few texts from the first half of the 20th century
9.1.2.6.1. not chronologically balanced
9.1.3. resources provided by computational linguists and newer dictionary projects, mostly based on newspapers
9.1.3.1. Negra
9.1.3.2. TüBa-D/Z
9.1.3.3. LexiView
9.1.4. general-language corpora of German created on the basis of Web resources
9.1.4.1. Sharoff 2004
9.1.4.2. German Internet Corpus
9.1.4.3. these do not obviate the need for a reference corpus
10. 3. Corpus design requirements
10.1. Corpora are generally multi-purpose
10.2. Purposes of DWDS corpus
10.2.1. serve as the empirical basis of a large monolingual dictionary of the 20th/21st century
10.2.2. offer more subtle linguistic descriptions of lexical items
10.2.2.1. (concerning semantics and syntagmatics)
10.3. representativeness in a statistical sense cannot be obtained for corpora
10.3.1. causes : difficulties associated with defining the underlying population
10.3.2. use of the modest notion of "balance"
10.4. 3 desiderata for the DWDS Kerncorpus
10.4.1. DWDS has to be balanced with respect to text types
10.4.2. DWDS must be large enough for its purpose
10.4.3. DWDS must contain a considerable amount of influential and important literature
10.4.4. (Sinclair 1994) : these properties characterize a reference corpus
11. 4. Design of the DWDS corpus
11.1. DWDS Kerncorpus
11.1.1. constructed at the Berlin-Brandenburg Academy of Sciences (BBAW)
11.1.2. between 2000 and 2003
11.1.3. a reference (balanced) corpus of the 20th century German language
11.1.4. 100 million tokens
11.1.5. equally distributed over time
11.1.5.1. from 1900 to 2000 (sometimes before & after)
11.1.6. equally distributed over five genres
11.1.7. corpora of written and spoken language
11.1.8. satisfies the 3 criteria (desiderata)
11.1.9. why was the period between 1900 and 2000 chosen?
11.1.9.1. to have a clearly marked time period
11.1.10. why is the corpus restricted to only five genres?
11.1.10.1. fewer genre distinctions make the daily corpus work easier
11.1.11. compilation proceeding in four steps
11.1.11.1. Cf. 5. 6. 7. 8. 9. 10.
11.2. DWDS Ergänzungscorpus
11.2.1. a much larger corpus from electronic versions of daily and weekly newspapers of the 1990s
11.2.1.1. Ex : Frankfurter Allgemeine Zeitung (1994–2000)
11.2.2. opportunistic corpus
11.2.3. 900 million tokens gathered in two million articles.
11.2.4. used in the Wolfgang-Paul-Preis project
12. 12. Conclusion and further work
12.1. DWDS
12.1.1. first reference corpus for the German language of the 20th century
12.1.2. balanced & equally distributed over all periods of the 20th century
12.1.3. lemmatized and part-of-speech tagged
12.1.4. enabling linguistic queries
12.1.5. web based query interface
12.1.5.1. collocations computation tools integrated
12.1.6. source for various language-based work
12.1.7. resource for psychological and psycholinguistic research
12.1.8. a freely available part and a copyright-restricted part
12.2. Further work
12.2.1. enlarge the opportunistic corpus
12.2.1.1. 100 million word corpus not sufficient for exploring certain phenomena
12.2.2. compare the DWDS Kerncorpus with balanced web-based corpora