Linux bash console commands for data science and CSV data analysis

Linux, Unix, and macOS are marvelous tools for processing textual data very quickly. Quite commonly, you want to analyze data stored in one or several CSV files. Let's review the most useful commands to do this.

Most useful terminal commands for data processing on Linux / macOS

Very often you want to have a look at a CSV file, but the columns are not aligned. To solve this, use the column command:
column -t -s',' myfile

The before/after example below shows how much easier it is to read well-aligned data (using public data from the opencellid database):
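As a minimal sketch of the effect, the snippet below builds a tiny comma-separated file and prints it before and after alignment. The file name sample.csv and its rows are invented stand-ins for illustration, not actual opencellid records.

```shell
# Invented sample rows (not real opencellid records).
printf '%s\n' \
  'radio,mcc,net,cell' \
  'GSM,262,2,12345' \
  'LTE,310,410,7' > sample.csv

# Before: raw CSV, the columns do not line up.
cat sample.csv

# After: -s',' splits on commas, -t pads every column to a common width.
column -t -s',' sample.csv
```

On some implementations of column, empty fields (two adjacent commas) may be merged, which shifts the columns, so it is worth checking the output on files with missing values.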

Bash for loops are super useful whenever you have data spread over several files. A quick example to get all .dat files from a range of folders stored on Amazon S3:

for ((day=12; day<=28; day++)); do
  s3cmd get "s3://xxx/bucket/incoming/201606$day/*.dat"
done
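The loop above covers days 12 to 28, which all have two digits. If the range included single-digit days, the folder names would need a leading zero, assuming the folders follow a YYYYMMDD pattern as the 201606 prefix suggests. A small sketch of one way to handle this (list_day_folders is a hypothetical helper that only prints the names, so it can be checked before running any download):

```shell
# Hypothetical helper: print one folder name per day of June 2016.
list_day_folders() {
  local first=$1 last=$2 day
  for ((day = first; day <= last; day++)); do
    # %02d pads single-digit days with a leading zero (8 -> 08).
    printf '201606%02d\n' "$day"
  done
}

list_day_folders 8 11
```

Each printed name can then be substituted into the s3cmd path in place of 201606$day.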

To know the number of lines in the file:
wc -l myfile
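wc -l prints the file name after the count. When only the number is needed, for example to store it in a variable, reading the file from standard input is one option. A small sketch, using an invented three-line file and assuming its first line is a header:

```shell
# Invented sample file: one header line plus two data rows.
printf '%s\n' 'id|name' '1|a' '2|b' > sample.txt

# Reading from stdin makes wc print the count without the file name.
total=$(wc -l < sample.txt)

# tail -n +2 starts at line 2, so the header is not counted.
rows=$(tail -n +2 sample.txt | wc -l)

echo "total lines: $total, data rows: $rows"
```

Note that wc -l counts newline characters, so a last line that does not end with a newline is not included in the count.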

To preview the first 17 lines of the file:
head -n 17 myfile

To extract columns 2 to 4 and 7 from a file where the separator character is a pipe "|":
cut -d"|" -f2-4,7 myfile
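To make the field selection concrete, here is a minimal sketch on a single invented line with eight pipe-separated fields:

```shell
# Invented pipe-separated line with eight fields.
line='a|b|c|d|e|f|g|h'

# -f2-4,7 keeps fields 2, 3, 4 and 7; the separator is kept in the output.
selected=$(printf '%s\n' "$line" | cut -d'|' -f2-4,7)
echo "$selected"   # prints: b|c|d|g
```

Keep in mind that cut splits on every occurrence of the separator and does not understand quoting, so a quoted field that itself contains the separator will be split in two.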

To count the number of occurrences of every value present in column 3:
cut -d"|" -f3 myfile | sort | uniq -c

  1. In the previous command, do not forget the "sort" command! "uniq" only collapses adjacent duplicate lines, so without sorting first, the same value can show up several times with partial counts.
  2. GNU "sort" can use multiple processors in parallel, which is handy if the file is very large. To use this feature, pass the appropriate option, e.g., "sort --parallel=8".
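The effect of the missing "sort" is easy to see on a small file. The sketch below uses an invented file, sample_cells.txt, and adds a final "sort -rn" (an addition to the command above) to rank the values by frequency:

```shell
# Invented sample: column 3 holds the value we want to count.
printf '%s\n' '1|x|GSM' '2|y|LTE' '3|z|GSM' \
              '4|w|UMTS' '5|v|GSM' '6|u|LTE' > sample_cells.txt

# Without sort: no two identical values are adjacent here,
# so uniq -c reports six groups with a count of 1 each.
cut -d'|' -f3 sample_cells.txt | uniq -c

# With sort: identical values become adjacent and are counted together.
# The trailing sort -rn orders the result by count, highest first.
cut -d'|' -f3 sample_cells.txt | sort | uniq -c | sort -rn
```

The second pipeline lists GSM first with a count of 3, then LTE with 2, then UMTS with 1.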

Most useful Hadoop commands

In our Hadoop-driven world, chances are that Linux commands won't be enough and you will need to interact with HDFS and other Hadoop components too. Fortunately, this is pretty easy and the commands are very similar to standard terminal commands:

hdfs dfs -ls /myfolder
hdfs dfs -cat /myfolder/myfile.csv
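Since "hdfs dfs -cat" writes the file content to standard output, it should combine with the local commands seen earlier through a pipe. A rough sketch, assuming the hdfs client is installed and configured and that the file is comma-separated without quoted fields (count_hdfs_column is a hypothetical helper name):

```shell
# Hypothetical helper: count the values of one column of a CSV file on HDFS.
# Assumes a working hdfs client and a comma-separated file without quoted fields.
count_hdfs_column() {
  local path=$1 col=$2
  # Stream the file from HDFS, keep one column, then count and rank its values.
  hdfs dfs -cat "$path" | cut -d',' -f"$col" | sort | uniq -c | sort -rn
}
```

For example, "count_hdfs_column /myfolder/myfile.csv 3" would count the values found in column 3. The whole file is streamed to the local machine, so for very large files a job running inside the cluster is probably a better fit.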

