Institute for Language Sciences Labs

How-tos

How to work with CSV files

Last updated on 10 March 2020 by Iris Mulders

If you have suggestions on how to improve this document, or find mistakes, please send them to ilslabs@nulluu.nl

CSV stands for ‘comma separated values’. A CSV file is usually a text file with the extension ‘.csv’ and is used because it is useful to pass a lot of structured data to a computer program, while also being relatively easy to read for humans.

How CSV files work

A CSV file always contains a table, which is put in plain text. Imagine you have the following table:

Number First Name Last Name Points
1 Eve Jackson 94
2 John Doe 80
3 Adam Johnson 67
4 Jill Smith 50

Here we have columns, rows, a header row (Number, First name, Last Name and Points) and values (or fields). For example, the values of the second row are 2, John, Doe and 80. Each row defines a group of information (which is called a record) and each column indicates what the denoted information in the value for that column means.
The header row is optional in a CSV. Often, the code that reads the CSV knows what each column means and the header row is usually just used to indicate the meaning to people. When you use a header row, the program might interpret the header row as data in the CSV, however. If you are editing a CSV file, make sure to always leave the number of rows before the actual data starts the same. It is best not to change any of that at all.

In a CSV, no lines are drawn to show where a field ends. Instead, new line characters are used to separate the rows from each other. This just means you have to press the enter key to start a new row. The fields within a row are split with a separation character (also called delimiter). Usually, this is a comma, but when the fields themselves contain commas, a semicolon (‘;’) is often used. Since many of our experiments deal with written texts, ZEP uses the semicolon as the separation character.

When we use a semicolon as the separation character, the table above would look like this in a CSV file:

Number; First name; Last name; Points
1; Eve; Jackson; 94
2; John; Doe; 80
3; Adam; Johnson; 67
4; Jill; Smith; 50

When a different separation character is used, all the semicolons in the above example can be replaced by that separation character.

In some cases, a lot of special characters are used by a field. In that case, you can use quotation marks around the value of that field. Beware, however, that not all computer programs can handle those correctly. So if you’re doubting and editing a pre-existing file, make sure to only use the quotation marks in columns where they are used already.

Editing a CSV file

 

It is always a good practice to make a backup of the original file you are about to edit. Computers are not very forgiving, and every small mistake you introduce in the formatting of the file will most like result in a parse error, meaning the computer can’t interpret the file

To edit a CSV, the most foolproof way is to use a text editor like Notepad++ (on Windows) or gedit (on Linux). Do not use Word. Editing the CSV in a text editor gives you more control than any other method, because you will see all the characters in your file.

It can be rather difficult, though, to see the structure of your CSV file properly in a text editor. If you need to see your data in columns to get a better grip on the file’s structure, the easiest is to open it in LibreOffice (on the Linux computers in the lab).

To open a file in LibreOffice Calc, follow these steps:

  1. Double click the CSV file, or click the CSV file with your right mouse button and select Open with LibreOffice Calc from the menu.
  2. This should open LibreOffice Calc and show you a dialog window with the title ‘Text import’
  3. Make sure you use the following settings:
    • Character set: Leave at the pre-selected value
    • Language: Default
    • From row: 1
    • Select Separated by in the separator options and then make sure only the separator character that is used in the CSV file is checked. If you are unsure what separation character is used, use the preview window below the settings to inspect the data. Can you see which items are supposed to be fields? What character is no part of the actual data, but is splitting the fields from each other? That is your separation character.
    • In Other options, uncheck each of the options
    • In Fields, you can see a preview of the file. Here you can check if separators work correctly. If this is not the case, the columns will not be split neatly. In that case, you probably selected the wrong separation character. Try another character until the data looks like a nice table.
  4. Once you’re sure the preview looks alright and you used the settings above, click the Ok button in the top right corner to open the file, after which you can edit the CSV file in LibreOffice Calc.

Once you opened the file, you should see a normal spreadsheet layout consisting of rows and columns. They should look like the table that the CSV encodes. From here, you can delete rows for records you want gone, or add rows below. But keep the following things in mind:

  • Do not change the rows that do not contain records, such as the header. The number of rows, and usually the values of those rows, should remain exactly the same
  • Some programs that have to read the CSV contain extra restrictions on what types of values can appear in a column. For example, some may only contain numbers, or single characters, or only certain words. If you do not know these restrictions, study the CSV to see what the restrictions might be.

Once you are satisfied with your changes, you can click ‘save as’, choose ‘csv’ as the file format, and in ‘filter settings’, indicate that you want to use the semicolon as the separator (for use with ZEP). To see if the changes are alright, execute the task for which you want to use the CSV file. For example, if you’re using the file with ZEP, run the ZEP script to see if any errors occur. Should that be the case, re-open the file and try to find the error you are pointed to.

On windows, you could technically edit the CSV with Microsoft Excel, but this will probably not make you happy. If the CSV uses any other separation character than a comma, the default settings for opening the file will mess your CSV up. If you feel comfortable enough with your understanding of CSV’s, you could open the CSV file in a non-rich text editor such as notepad or notepad++ (never use Microsoft Word or other rich text editors for CSV’s). Otherwise, easiest is just to switch to Linux.

Creating a CSV file (advanced)

To create a CSV file, you have to understand what the file is going to be used for. If you want to use it with ZEP, do you know the restrictions? If you want to use it with another program, do you know how that program will interpret the CSV file? The possible variations here are too large to enumerate, so this is something you will have to figure out yourself, or you have to ask the lab support in Room 0.09 for help.

Once you know the structure, the easiest way is to open LibreOffice Calc or Microsoft Excel, just fill in the data in a table, according to the restrictions you have found before. When you are done, you can click ‘save as’ in the menu and select CSV (or comma separated value). There might be a few different options, but the simplest is usually the best. After that, you will be confronted with a settings window, in which you have to state whether or not to use quotation marks around values and which separation character to use. If you are unsure which settings to use, go back to checking the restrictions on your file and the “How a CSV works” paragraph above. If no restrictions are placed on which setting to use, any setting is alright.