Thursday, May 19, 2016

Oracle Linux - remove duplicate lines with awk

Sometimes you want to clean data quickly by removing all duplicate lines from a file. For example, raw output from a system that is "dumped" on your Linux file system may need to be cleaned before you use it as input to another system. You could write some fancy code to do so, but you can also use a very simple and straightforward solution: awk in the bash shell on Oracle Linux.

In the example below we have a file containing duplicate lines, called rawdata.txt, and we want to produce a clean file called cleandata.txt. The following awk command reads rawdata.txt and writes the clean data to cleandata.txt:

awk '!seen[$0]++' rawdata.txt > cleandata.txt
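
To see why this works, here is a minimal sketch using a small made-up rawdata.txt (the sample lines are only an illustration, not real data). awk evaluates the expression !seen[$0]++ for every line: seen is an associative array keyed on the whole line ($0). The first time a line appears its counter is still 0, so the negation is true and awk prints the line. On every later occurrence the counter is non-zero and the line is skipped. The order of the first occurrences is kept. The sketch redirects with > so that cleandata.txt is overwritten on each run; >> would append to an existing file instead.

```shell
# Create a small sample file (hypothetical data, for illustration only).
printf '%s\n' alpha beta alpha gamma beta > rawdata.txt

# Keep only the first occurrence of each line, preserving the original order.
awk '!seen[$0]++' rawdata.txt > cleandata.txt

# Show the result.
cat cleandata.txt
# alpha
# beta
# gamma
```

Note that awk keeps every unique line in memory, so for very large files with mostly unique lines the memory use grows with the file.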

The command itself is a quick and dirty solution; most likely you will want to use it as part of a wider script that cleans your data in a more sophisticated manner.
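
As a rough sketch of what such a wider script could look like, the function below does a little normalisation before removing duplicates. The function name clean_data and the chosen cleaning steps (dropping DOS carriage returns, trailing whitespace and empty lines) are assumptions made for this example; what counts as "clean" depends entirely on your own data.

```shell
# clean_data: hypothetical helper for illustration.
# Usage: clean_data <input file> <output file>
clean_data() {
    awk '
        { sub(/\r$/, "") }        # drop a trailing carriage return (DOS line endings)
        { sub(/[ \t]+$/, "") }    # drop trailing spaces and tabs
        $0 == "" { next }         # skip lines that are now empty
        !seen[$0]++               # print only the first occurrence of each line
    ' "$1" > "$2"
}

# Example run on made-up data: trailing spaces, a DOS line ending and a blank line.
printf 'a  \nb\r\n\na\nb\n' > rawdata.txt
clean_data rawdata.txt cleandata.txt
cat cleandata.txt
# a
# b
```

Because the normalisation happens before the seen[$0] lookup, lines that differ only in trailing whitespace or line endings are treated as duplicates.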
