How to: Compare two CSV files

Have you ever been faced with the task of comparing two datasets and seeing what is different? Or perhaps you need to compare in order to come up with a summary of what is the same between the two...Either way, you of course want to avoid at all costs having to manually complete this exercise - especially if you have thousands or tens of thousands of lines to go through!

Why might you even face such a task?

Well, you may have two files of what should be the same data, but perhaps one was changed/updated and you need to track or record those changes - whether that be insertion of new data, deletion of data or just changes to the data.

Perhaps you have disparate lists of contact data from various agencies within your company, and you need to pare down to only the data which overlaps (or is the same).

Maybe you have the daunting task of merging two files with overlapping data together, keeping a careful eye on what data is overwritten vs. what should be ignored.

Whatever the reason, your task requires a systematic and accurate method for comparing the two datasets. There are many options for doing this, and will vary greatly depending on your comfort level with coding, or working with formulas in Excel, or even "outsourcing" your task to an app!

To start, if you're more comfortable with working in Excel and you don't mind a bit of effort, you can always start with some basic formulas.

For example, if you have a specific column that you want to check across the two datasets (name, for example) to see where there is overlap, you can use the COUNTIF formula.

Simply create a new column in the first file, and enter the COUNTIF formula. It might go something like this, if your "name" was in column A:

=COUNTIF('[example2.csv]Sheet1'!A:A, A2)

In your second file then, you'd create the same formula, but reversed to reference the first file:

=COUNTIF('[example1.csv]'Sheet1!A:A, A2)

Now you have both files containing a column with a notation for any values that appear in the other file. How? If you see a number in the COUNTIF column, that means the value was found when it was searched in the other file. Conversely, if there is a "0" there, that means the value is unique, not found in the other file. So, you can use this to weed out either overlapping or unique values.

If you wish to merge, for example, the unique values from one file to the next, you can then just sort one file for all of the rows with a "0" result, and copy them over to your next file.
If you're looking for a more automated process (or just don't want to mess with Excel), you can utilize the Notepad plugin options. It's as simple as using the default text editor app on your computer to open the CSV files, and enable the plugin to run the compare process. Those plugins are Compare and Beyond Compare. This is a helpful overview of how the Compare plugin can be enabled and used.

Other (non-plugin) 3rd-party options are endless. Just a few examples include:

And if you're really savvy, then you can even create a python script to run the same comparison exercise!

Choose the method that is both the right fit for your comfort level, and for the size of the dataset you are working with. But no matter what, don't waste your efforts doing one-by-one comparison by hand!