NLPA UNIX Tools

1. command line

1.1. documentation

1.1.1. man <command>

1.1.2. info <command>

1.1.3. V7 manual pages

1.2. basic

1.2.1. bash

1.2.1.1. interactive shell and scripting

1.2.1.2. programming constructs

1.2.1.2.1. pipes

1.2.1.2.2. for

1.2.1.2.3. while

1.2.1.2.4. read

1.2.1.2.5. redirection
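
The constructs listed above can be combined in a few lines. The following is a minimal sketch, not a canonical idiom; the file names `corpus.txt` and `counts.txt` and the sample text are made up for illustration.

```shell
#!/usr/bin/env bash
# Work in a throwaway directory so nothing outside it is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Redirection: write three lines into a (hypothetical) corpus file.
printf 'the cat\nthe dog\na bird\n' > corpus.txt

# for: iterate over a fixed list of words.
for word in the a; do
    # Pipe: grep -w selects lines containing the whole word, wc -l counts them.
    # tr strips the padding some wc implementations add.
    count=$(grep -w "$word" corpus.txt | wc -l | tr -d ' ')
    echo "$word $count"
done > counts.txt    # redirection applies to the whole loop

# while + read: consume a file line by line.
n=0
while IFS= read -r line; do
    n=$((n + 1))
done < corpus.txt
echo "lines read: $n"
```

Redirecting into the loop with `< corpus.txt` (rather than piping into `while`) keeps the loop in the current shell, so the counter `n` is still set afterwards.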

1.2.2. cat

1.2.2.1. concatenate files

1.2.2.2. cat file1 file2 ... > output

1.2.3. tac

1.2.3.1. reverse the lines in a file

1.2.4. wc

1.2.4.1. count characters, lines, words in a file

1.2.5. file

1.2.5.1. determine file type
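
As a small illustration of `cat`, `tac`, and `wc`, here is a sketch on made-up files (`part1.txt`, `part2.txt`). It assumes `tac` is installed, which is the case with GNU coreutils but not on every UNIX.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'one\ntwo\n' > part1.txt
printf 'three\n' > part2.txt

# cat: concatenate files in the order given.
cat part1.txt part2.txt > all.txt

# tac: print the lines last to first.
tac all.txt > reversed.txt

# wc: -l counts lines, -w words, -c bytes.
# Reading from stdin keeps the file name out of the output.
lines=$(wc -l < all.txt | tr -d ' ')
words=$(wc -w < all.txt | tr -d ' ')
bytes=$(wc -c < all.txt | tr -d ' ')
echo "$lines lines, $words words, $bytes bytes"
```

Note that `wc -c` counts bytes; for multi-byte text, `wc -m` counts characters instead.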

1.3. file level commands

1.3.1. rm

1.3.1.1. remove files

1.3.2. ln

1.3.2.1. create links

1.3.3. cp

1.3.3.1. copy files

1.3.4. mv

1.3.4.1. move / rename files

1.3.5. chmod

1.3.5.1. change access mode

1.3.6. chown

1.3.6.1. change ownership

1.3.7. mkdir

1.3.7.1. create directories
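
The file-level commands above might be exercised as follows. This is only a sketch in a temporary directory with hypothetical file names; `chown` is left out because changing ownership usually requires elevated privileges.

```shell
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p data/raw                # -p also creates missing parent directories
echo 'hello' > data/raw/a.txt

cp data/raw/a.txt data/b.txt     # copy
mv data/b.txt data/c.txt         # rename (a move within the same directory)
ln -s raw/a.txt data/link.txt    # symbolic link, resolved relative to data/
chmod 600 data/c.txt             # owner may read and write, nobody else
rm data/raw/a.txt                # remove; the symbolic link is now dangling
```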

1.4. searching

1.4.1. fgrep

1.4.1.1. fixed-string search using Aho-Corasick string matching

1.4.2. grep

1.4.2.1. regular expressions translated into NFAs (nondeterministic finite automata)

1.4.3. egrep

1.4.3.1. extended regular expressions translated into DFAs (deterministic finite automata)

1.4.4. agrep

1.4.4.1. variants of the Levenshtein distance algorithm
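
The difference between the three grep flavours shows up on a tiny word list. The sketch below uses `grep -F` and `grep -E`, the option spellings of `fgrep` and `egrep` (recent GNU grep releases warn that the old command names are obsolescent). `agrep` is not shown because it is often not installed. The file `words.txt` is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'a.c\nabc\ncolor\ncolour\n' > words.txt

# Fixed string (fgrep): the dot is literal, so only "a.c" matches.
grep -F 'a.c' words.txt > fixed.txt

# Basic regular expression (grep): the dot matches any single character.
grep 'a.c' words.txt > basic.txt

# Extended regular expression (egrep): "?" makes the preceding "u" optional.
grep -E 'colou?r' words.txt > extended.txt
```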

1.5. file comparison

1.5.1. diff

1.5.1.1. Hunt-McIlroy
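
A short sketch of `diff` on two made-up files. The output describes the edits that turn the first file into the second: lines starting with `<` exist only in the first file, lines starting with `>` only in the second.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'alpha\nbeta\ngamma\n' > old.txt
printf 'alpha\ngamma\ndelta\n' > new.txt

# diff exits with status 1 when the files differ, so guard it with "|| true"
# in scripts that abort on any non-zero status.
diff old.txt new.txt > changes.txt || true
cat changes.txt
```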

1.6. database-like

1.6.1. sort

1.6.1.1. merge sort, with special attention to external storage

1.6.2. cut

1.6.2.1. cut out fields or character ranges from each line

1.6.3. paste

1.6.3.1. combine corresponding lines from files (line by line)

1.6.4. join

1.6.4.1. join input files based on fields

1.6.4.2. input files must be sorted on the join fields

1.6.5. uniq

1.6.5.1. report or omit repeated lines
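
These tools compose into relational-style pipelines. The following sketch builds a word-frequency table and joins it with a part-of-speech table; all file names, the sample text, and the tags are invented for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir"
export LC_ALL=C    # make sort and join agree on one collation order
tab=$(printf '\t')

# Word frequencies: one token per line, sort so duplicates are adjacent,
# count runs with uniq -c, then order by descending count and by word.
printf 'the cat saw the dog\nthe dog ran\n' > text.txt
tr -s ' ' '\n' < text.txt | sort | uniq -c | sort -k1,1nr -k2,2 |
    awk '{ print $2 "\t" $1 }' > freq.tsv

# cut: keep single tab-separated fields.
cut -f1 freq.tsv > words.txt
cut -f2 freq.tsv > counts.txt

# paste: glue corresponding lines of two files together, tab-separated.
paste counts.txt words.txt > swapped.tsv

# join: relational join on the first field; both inputs must be sorted on it.
printf 'cat\tNOUN\ndog\tNOUN\nran\tVERB\n' > tags.tsv    # already sorted
sort -k1,1 freq.tsv > freq.sorted.tsv
join -t "$tab" freq.sorted.tsv tags.tsv > joined.tsv
```

Words without an entry in `tags.tsv` are dropped by the join, as in an inner join in a relational database.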

1.7. complex modifications

1.7.1. sed

1.7.1.1. stream editor, performs edits on potentially very large files

1.7.2. awk

1.7.2.1. simple scripting language, loosely an ancestor of Perl; still useful for one-liners

1.7.3. gzip, bzip2, ...

1.7.3.1. various compression utilities

1.7.3.2. note that compression doesn't just save storage; it often speeds up processing as well, because less data has to be read from disk

1.7.4. (Python scripting)
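
A few one-liners in the spirit of this section, as a sketch on a made-up file `people.txt`. The last step shows compressed data being processed through a pipe without writing an uncompressed copy to disk.

```shell
workdir=$(mktemp -d)
cd "$workdir"
printf 'Mr. Smith 42\nMs. Jones 17\nMr. Brown 8\n' > people.txt

# sed: apply a substitution to each line as it streams past;
# the file is never loaded into memory as a whole.
sed 's/^Mr\./Mister/' people.txt > expanded.txt

# awk: fields are split on whitespace; sum the third column.
total=$(awk '{ sum += $3 } END { print sum }' people.txt)
echo "total: $total"

# gzip: compress, then search the decompressed stream directly.
gzip -c people.txt > people.txt.gz
matches=$(gzip -dc people.txt.gz | grep -c '^Mr\.')
echo "lines starting with Mr.: $matches"
```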

1.8. collections of files

1.8.1. xargs

1.8.2. find

1.8.3. cpio

1.8.4. tar
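
For collections of files, `find` typically selects the files and `xargs` hands them to another command in batches. The sketch below uses an invented directory layout and assumes `find -print0` and `xargs -0` are available (they are in GNU and BSD tools). `cpio` is not shown.

```shell
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p corpus/en corpus/de
printf 'hello world\n' > corpus/en/a.txt
printf 'hallo welt\n'  > corpus/de/b.txt
printf 'ignored\n'     > corpus/notes.log

# find selects files by type and name; -print0 with xargs -0 keeps file
# names containing spaces or newlines intact. xargs passes the names to
# cat in as few invocations as possible.
txt_lines=$(find corpus -type f -name '*.txt' -print0 | xargs -0 cat | wc -l | tr -d ' ')
echo "lines in .txt files: $txt_lines"

# tar bundles a directory tree into one archive; -t lists its contents.
tar -cf corpus.tar corpus
archived=$(tar -tf corpus.tar | grep -c '\.txt$')
echo ".txt files in archive: $archived"
```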

1.9. scripting

1.9.1. Python

1.9.2. Perl

1.9.3. Ruby

1.10. other tools

1.10.1. lex / yacc

1.10.1.1. lexer and parser generators; similar tools also exist for Java, Python, etc.

1.10.2. libdb

1.10.2.1. fast key-value store

1.10.3. sqlite

1.10.3.1. fast, server-less relational database management

1.10.4. mercurial

1.10.4.1. distributed version control

2. data formats

2.1. text files

2.1.1. ASCII

2.1.2. Unicode

2.1.3. other text formats

2.1.4. multiple fields per line

2.1.4.1. tab or comma separated

2.1.4.2. used by relational operators etc.

2.2. multimedia and scientific

2.2.1. standard image, audio, video formats

2.2.2. array data in HDF5 format

2.2.3. Python object dumps

2.3. databases

2.3.1. Berkeley DB databases (key value stores)

2.3.2. sqlite databases

2.4. formats indicated by...

2.4.1. extension

2.4.2. magic number
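
An extension is only a naming convention, while a magic number is part of the data itself. As a sketch, the following creates gzip data under a misleading (made-up) name and inspects its first two bytes, which for gzip are `1f 8b`.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create gzip data but give it a misleading .txt extension.
printf 'hello\n' | gzip -c > misleading.txt

# Dump the first two bytes as hex: -An drops the offset column,
# -tx1 prints single bytes in hex, -N2 stops after two bytes.
magic=$(od -An -tx1 -N2 misleading.txt | tr -d ' \n')
echo "first two bytes: $magic"
```

The `file` command listed earlier performs this kind of check against a database of known signatures, so it does not have to rely on the extension.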

3. background

3.1. for a lot of work in NLP, it's useful to know the standard UNIX commands

3.2. many of these have highly optimized implementations and can process data larger than memory

3.3. modern versions handle Unicode correctly

3.4. http://cm.bell-labs.com/7thEdMan/bswv7.html