The DWDS corpus: A reference corpus for the German language of the 20th century Section 7: ...

COMBRE Alexia

Mind Map: The DWDS corpus: A reference corpus for the German language of the 20th century Section 7: Digitization

1. Pre-editing

1.1. image scanners

1.2. software

1.2.1. basic operations

1.2.1.1. copy

1.2.1.2. paste

1.2.1.3. text insertion

1.3. steps

1.3.1. document selection

1.3.1.1. significant

1.3.2. quality control of the input text

1.3.3. markup of difficult passages

2. file production

2.1. UTF-8 format with XML markup

2.2. validation against a DTD

2.2.1. varies by text genre

2.3. adherence to the Text Encoding Initiative (TEI) guidelines
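
The file-production checks in node 2 could be sketched as follows. This is only a minimal illustration using the Python standard library, not the DWDS project's actual tooling: it verifies that a file is valid UTF-8 and well-formed XML, which are preconditions for DTD validation. Full validation against a DTD needs a validating parser, which the standard library does not provide. The function name `check_corpus_file` and the element names `text` and `p` are hypothetical.

```python
import xml.etree.ElementTree as ET


def check_corpus_file(raw: bytes) -> list[str]:
    """Return a list of problems; an empty list means both basic checks passed.

    Hypothetical helper: it checks UTF-8 encoding and XML well-formedness only,
    not validity against a DTD.
    """
    problems = []
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"not valid UTF-8: {exc}")
        return problems  # no point parsing bytes that cannot be decoded
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append(f"not well-formed XML: {exc}")
    return problems


# Illustrative inputs (invented for this sketch, not taken from the corpus).
good = "<text><p>Grüße aus Berlin</p></text>".encode("utf-8")
mismatched = b"<text><p>unclosed</text>"
latin1 = "<text><p>Grüße</p></text>".encode("latin-1")

for name, data in [("good", good), ("mismatched", mismatched), ("latin1", latin1)]:
    print(name, check_corpus_file(data))
```

Running the encoding check first keeps the two failure modes apart: a Latin-1 file is reported as an encoding problem rather than as a confusing parse error.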

3. Encoding

3.1. done by humans

3.1.1. without specific knowledge of

3.1.1.1. the content of the texts

3.1.1.2. XML

3.2. by native speakers

4. COMBRE Alexia, L3 SDL

5. methods

5.1. optical character recognition (OCR)

5.1.1. low cost / efficient

5.1.2. recognition rate of 95% to 99%

5.1.2.1. acceptable

5.1.3. correction for index creation

5.1.3.1. names

5.1.3.2. dates

5.1.3.3. events

5.1.4. lexicographic purpose

5.1.4.1. all words

5.1.4.1.1. keywords

5.1.4.2. low error rate

5.1.5. XML conversion

5.1.5.1. manual effort

5.1.6. DWDS Kerncorpus

5.1.6.1. small samples

5.1.6.1.1. text diversity

5.2. manual transcription

5.2.1. high cost

5.2.2. no more than 5 errors per 10,000 characters

5.2.3. overcomes OCR-related problems

5.2.4. simultaneous encoding
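
To put the two methods in node 5 on the same scale, the figures can be converted into errors per 10,000 characters. The short calculation below is a sketch that assumes the OCR recognition rate is measured per character; the function names are hypothetical.

```python
CHARS = 10_000  # reference length used for the manual transcription limit


def errors_per_10k(recognition_rate_percent: float) -> float:
    """Expected number of wrong characters in 10,000 at a given recognition rate."""
    return round((100 - recognition_rate_percent) / 100 * CHARS, 2)


def accuracy_percent(errors_in_10k: float) -> float:
    """Accuracy implied by a given number of errors per 10,000 characters."""
    return round(100 - errors_in_10k / CHARS * 100, 4)


# OCR: 95% to 99% recognition rate (assumed to be per character).
print(errors_per_10k(95))   # 500.0 errors per 10,000 characters
print(errors_per_10k(99))   # 100.0 errors per 10,000 characters

# Manual transcription: at most 5 errors per 10,000 characters.
print(accuracy_percent(5))  # 99.95 percent accuracy
```

Under that assumption, even the best OCR figure leaves about twenty times more errors than the manual transcription limit, which is consistent with the map marking a low error rate as a requirement for lexicographic use.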

6. initial text selection

6.1. not taken into account

6.1.1. copyright status

6.1.2. availability of texts in electronic format

6.2. Kerncorpus

6.2.1. 60% of texts in electronic format

6.2.1.1. CD-ROMs

6.2.1.2. publishing houses

6.2.1.3. data conversion

6.2.1.3.1. structured format

6.2.2. 40 million tokens

6.2.2.1. digitized from paper
