NLPA UNIX Tools

1. command line

1.1. documentation

1.1.1. man <command>

1.1.2. info <command>

1.1.3. V7 manual pages

1.2. basic

1.2.1. bash interactive shell and scripting; programming constructs: pipes, for, while, read, redirection

1.2.2. cat concatenate files: cat file1 file2 ... > output

1.2.3. tac reverse the lines in a file

1.2.4. wc count characters, lines, words in a file

1.2.5. file determine file type
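To make the wc entry concrete, here is a minimal Python sketch of the same counts. It reads one line at a time, so memory use does not grow with file size. The function name `count_stream` is invented for this illustration, and it counts characters rather than bytes, so it can differ from `wc -c` on multi-byte text.

```python
import io

def count_stream(lines):
    """Return (line_count, word_count, char_count) for an iterable of text lines."""
    n_lines = n_words = n_chars = 0
    for line in lines:
        n_lines += line.count("\n")   # like wc, count newline characters
        n_words += len(line.split())  # whitespace-separated tokens
        n_chars += len(line)          # characters, not bytes (assumption of this sketch)
    return n_lines, n_words, n_chars
```

Any iterable of lines works as input, for example an open text file or `io.StringIO("some text\n")`.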

1.3. file level commands

1.3.1. rm remove files

1.3.2. ln create links

1.3.3. cp copy files

1.3.4. mv move / rename files

1.3.5. chmod change access mode

1.3.6. chown change ownership

1.3.7. mkdir create directories

1.4. searching

1.4.1. fgrep Aho-Corasick string matching

1.4.2. grep regular expressions translated into NFAs (nondeterministic finite automata)

1.4.3. egrep extended regular expressions translated into finite automata

1.4.4. agrep variants of Levenshtein distance algorithm
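The edit distance behind approximate matching can be sketched in a few lines of Python. This is the textbook dynamic-programming version, not the faster technique a real tool would use. The helper names `levenshtein` and `fuzzy_grep` are invented for this illustration, and `fuzzy_grep` compares whole words, whereas agrep searches for approximate matches anywhere within a line.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]

def fuzzy_grep(pattern, words, max_errors=1):
    """Return the words within max_errors edits of pattern (hypothetical helper)."""
    return [w for w in words if levenshtein(pattern, w) <= max_errors]
```

For example, `fuzzy_grep("color", ["colour", "cooler"])` keeps "colour" (one insertion) and drops "cooler" (two edits).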

1.5. file comparison

1.5.1. diff Hunt-McIlroy algorithm (based on the longest common subsequence)
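The Hunt-McIlroy algorithm is built around the longest common subsequence (LCS) of the two files' lines. The Python sketch below uses the plain quadratic dynamic-programming table for the LCS, not Hunt-McIlroy itself, so it only illustrates the idea and is far too slow for large files. The names `lcs_table` and `simple_diff` are invented for this illustration.

```python
def lcs_table(a, b):
    """table[i][j] is the length of the LCS of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table

def simple_diff(a, b):
    """Yield (' ', line), ('-', line), or ('+', line) for two lists of lines."""
    table = lcs_table(a, b)
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            yield (" ", a[i])  # line is common to both files
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            yield ("-", a[i])  # line only in the first file
            i += 1
        else:
            yield ("+", b[j])  # line only in the second file
            j += 1
    for line in a[i:]:
        yield ("-", line)
    for line in b[j:]:
        yield ("+", line)
```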

1.6. database-like

1.6.1. sort merge sort, with special attention to external storage

1.6.2. cut cut out fields or character columns from each line

1.6.3. paste combine corresponding lines from files (line by line)

1.6.4. join join input files on a common field; input files must be sorted on that field

1.6.5. uniq report or omit repeated lines
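Two ideas from this group can be sketched in Python: a merge sort that spills sorted runs to external storage, and a join that depends on both inputs being sorted. This is only a rough illustration, not how sort or join are implemented. The names `external_sort` and `merge_join` and the `run_size` parameter are invented here. The sketch orders lines by code point rather than by locale, and the join assumes each key occurs at most once per input.

```python
import heapq
import itertools
import tempfile

def external_sort(lines, run_size=100_000):
    """Yield lines in sorted order, sorting at most run_size lines in memory at a time."""
    runs = []
    lines = iter(lines)
    while True:
        # Normalize line endings so that lines stay separate inside a run file.
        chunk = sorted(l.rstrip("\n") + "\n" for l in itertools.islice(lines, run_size))
        if not chunk:
            break
        run = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        run.writelines(chunk)  # one sorted run on external storage
        run.seek(0)
        runs.append(run)
    try:
        yield from heapq.merge(*runs)  # k-way merge, reading the runs lazily
    finally:
        for run in runs:
            run.close()

def merge_join(left, right):
    """Join two iterables of (key, value) pairs that are already sorted by key.

    Assumes unique keys within each input (a simplification of this sketch).
    """
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] == r[0]:
            yield (l[0], l[1], r[1])
            l, r = next(left, None), next(right, None)
        elif l[0] < r[0]:
            l = next(left, None)   # left key has no partner, skip it
        else:
            r = next(right, None)  # right key has no partner, skip it
```

Because both inputs are sorted, the join needs only a single pass over each and keeps one record per input in memory. That is why the inputs must be sorted first.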

1.7. complex modifications

1.7.1. sed stream editor, performs edits on potentially very large files

1.7.2. awk simple scripting language, loosely an ancestor of Perl; still useful for one-liners

1.7.3. gzip, bzip2, ... various compression utilities; note that compression doesn't just save storage, it often speeds up I/O-bound processing as well, because less data has to be read from disk

1.7.4. (Python scripting)
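As a hedged sketch of how these pieces combine in a Python script, the function below streams a gzip-compressed text file, applies a sed-style regular-expression substitution to each line, and writes compressed output. It never holds the whole file in memory. The name `gz_substitute` and its parameters are invented for this illustration, and the input is assumed to be UTF-8.

```python
import gzip
import re

def gz_substitute(src_path, dst_path, pattern, replacement):
    """Copy gzip file src_path to gzip file dst_path, substituting per line.

    Returns the total number of substitutions made.
    """
    regex = re.compile(pattern)
    changed = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:  # decompresses incrementally, one line at a time
            new_line, n = regex.subn(replacement, line)
            changed += n
            dst.write(new_line)
    return changed
```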

1.8. collections of files

1.8.1. xargs

1.8.2. find

1.8.3. cpio

1.8.4. tar

1.9. scripting

1.9.1. Python

1.9.2. Perl

1.9.3. Ruby

1.10. other tools

1.10.1. lex / yacc lexer and parser generators; similar tools also exist for Java, Python, etc.

1.10.2. libdb fast key-value store

1.10.3. sqlite fast, server-less relational database management

1.10.4. mercurial distributed version control

2. data formats

2.1. text files

2.1.1. ASCII

2.1.2. Unicode

2.1.3. other text formats

2.1.4. multiple fields per line, tab or comma separated; used by relational operators etc.

2.2. multimedia and scientific

2.2.1. standard image, audio, video formats

2.2.2. array data in HDF5 format

2.2.3. Python object dumps

2.3. databases

2.3.1. Berkeley DB databases (key-value stores)

2.3.2. sqlite databases
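A small Python sketch of using an sqlite database from a script, here as a word-count table. It uses only the standard `sqlite3` module. The table name `counts` and the helpers `build_counts` and `top_words` are invented for this illustration.

```python
import sqlite3

def build_counts(words, db_path=":memory:"):
    """Store word frequencies in an sqlite table and return the open connection."""
    conn = sqlite3.connect(db_path)  # ":memory:" keeps the database in RAM
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts (word TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for w in words:
        # Create the row on first sight, then increment it.
        conn.execute("INSERT OR IGNORE INTO counts (word, n) VALUES (?, 0)", (w,))
        conn.execute("UPDATE counts SET n = n + 1 WHERE word = ?", (w,))
    conn.commit()
    return conn

def top_words(conn, k):
    """Return the k most frequent words as (word, n) tuples, ties broken alphabetically."""
    query = "SELECT word, n FROM counts ORDER BY n DESC, word LIMIT ?"
    return conn.execute(query, (k,)).fetchall()
```

Passing a file name instead of `":memory:"` keeps the table on disk, so later runs can query it without recounting.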

2.4. formats indicated by...

2.4.1. extension

2.4.2. magic number
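Detection by magic number means comparing the first bytes of a file against known signatures, which is roughly what the file command does with a much larger table. The Python sketch below covers only a handful of formats. The names `MAGIC_NUMBERS`, `sniff_format`, and `sniff_file` are invented for this illustration, and it only checks at offset 0.

```python
# Signatures found at the very start of a file: (leading bytes, format name).
MAGIC_NUMBERS = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\x89HDF\r\n\x1a\n", "HDF5"),
    (b"SQLite format 3\x00", "SQLite database"),
    (b"%PDF-", "PDF"),
]

def sniff_format(header: bytes) -> str:
    """Guess a format name from the leading bytes of a file."""
    for magic, name in MAGIC_NUMBERS:
        if header.startswith(magic):
            return name
    return "unknown"

def sniff_file(path):
    """Read just enough of the file to compare against every signature."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))  # the longest signature above is 16 bytes
```

Unlike a file extension, the magic number survives renaming, so it is the more reliable of the two indicators when they disagree.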

3. background

3.1. for a lot of work in NLP, it's useful to know the standard UNIX commands

3.2. many of these have highly optimized implementations and can process data larger than memory

3.3. modern versions generally handle Unicode correctly, given a suitable (e.g. UTF-8) locale