
Assignment 2, CSC202, Spring 2024

In this assignment, we’re going to refine our skills with linked lists by reading in a large CSV file, and then filtering it in a variety of different ways.

This file comes from Our World in Data, and it contains information about the CO2-equivalent emissions of all of the world’s countries, across a variety of sectors, from 1990 through 2020. Using this data, we can answer a whole bunch of interesting questions.

1 Getting Started

To get started on this assignment, use this GitHub Classroom invitation link. Clone the repository as you did for the first assignment. Make a small change and push, just to make sure that you can. I will probably check early on that you have created this repo and pushed at least one small commit.

2 Requirements

For each function required in this assignment, you should follow the design recipe, as described in this document. Specifically, each function should come with a purpose statement, types for both parameters and return type, and a full set of tests.
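As a small illustration of what that looks like in practice, here is a sketch of one function written in that style. The function cents_to_dollars and its test class are made up for this example and are not part of the assignment.

```python
import unittest


# Convert a price in whole cents to a price in dollars.
def cents_to_dollars(cents: int) -> float:
    return cents / 100


class TestCentsToDollars(unittest.TestCase):
    def test_typical(self) -> None:
        self.assertAlmostEqual(cents_to_dollars(7899), 78.99)

    def test_zero(self) -> None:
        self.assertAlmostEqual(cents_to_dollars(0), 0.0)
```

The comment is the purpose statement, the annotations give the parameter and return types, and the test class can be run with python3 -m unittest.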

3 A note about automated testing

We will be using some automated testing in order to check the correctness of your code. For this reason, it’s important that names be spelled as specified, and that fields and parameters occur in the order specified in the text.

You should use an @dataclass(frozen=True) decorator to define all classes created in this assignment.

4 Background: What’s a CSV file?

The term CSV is used to refer to a loose standard of text file formatting where each line of text represents one row of a table, and the field values are separated by commas. (The abbreviation CSV stands for "Comma Separated Values".) The first line typically contains column names. So, for instance, a CSV file of store inventory at a magic store might look like this:

  item,price,number in stock
  cauldron,47000,16
  broom,7899,10
  wand,1426,150

... indicating for instance that there are 10 brooms in stock, and that each one goes for $78.99.
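To make the row-and-field structure concrete, here is a minimal sketch that pulls the broom row apart by hand. It splits on commas directly, which is fine for this tiny example but ignores details such as quoted fields that a real CSV reader handles; the variable names are just for illustration.

```python
inventory_text = """item,price,number in stock
cauldron,47000,16
broom,7899,10
wand,1426,150
"""

# Each line of text is one row; the commas separate the fields of that row.
lines = inventory_text.splitlines()
header = lines[0].split(",")
broom = lines[2].split(",")

# Every field starts out as a string, so numbers must be converted explicitly.
broom_price_dollars = int(broom[1]) / 100
broom_count = int(broom[2])
print(broom_count, "brooms at", broom_price_dollars, "dollars each")
```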

4.1 Reading CSV files

Most languages have built-in library support for reading CSV files, and Python is no different.

Specifically, there’s a csv.reader function that returns an iterator over the lines in the file, producing each one in turn as a list of strings.

Here’s a piece of example code. I will discuss the new things in this code in the sections below.

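  import csv

  # Column names expected on the first line of the inventory file
  # (taken from the magic store example above).
  expected_labels = ["item", "price", "number in stock"]

  # Return the total number of items in stock, summed over all rows.
  def total_item_count(filename: str) -> int:
      with open(filename, newline="") as csvfile:
          reader = csv.reader(csvfile)
          topline = next(reader)
          if not (topline == expected_labels):
              raise ValueError("unexpected first line: got: {}".format(topline))
          item_count = 0
          for line in reader:
              # int, not float, so that the result matches the declared return type
              item_count = item_count + int(line[2])
          return item_count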

4.2 with

There are a couple of new things here. First, the with form binds the name csvfile for the duration of the block. The special thing about with is that it uses a "context manager", which in this case will take care of closing the file after the block is done. If you want to know more, you can take a look at the docs.
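Roughly speaking, the with form behaves like a try/finally that closes the file for you. The following sketch checks that the file really is closed once the block ends. It writes a throwaway file in a temporary directory so that it can run anywhere, and the file name demo.csv is made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "demo.csv")

    # Write a tiny file; it is closed automatically at the end of the block.
    with open(path, "w", newline="") as outfile:
        outfile.write("item,price,number in stock\n")
    closed_after_write = outfile.closed

    # Read it back; again, no explicit close() call is needed.
    with open(path, newline="") as csvfile:
        first_line = csvfile.readline()
        closed_inside_block = csvfile.closed
    closed_after_block = csvfile.closed

print(closed_inside_block, closed_after_block)
```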

4.3 iterators

Next, the csv reader is an iterator. We will be talking more about iterators later, but for now the important thing is to know that iterators play nicely with the for form; using an iterator in a for like this causes the body block of the for to be evaluated once for each element that the iterator produces (that is, each line of the CSV file). Since the for form provides no natural way to accumulate the results, we need to use local mutation to construct a useful value. In this case, we’re mutating the item_count variable. In your case, you’ll instead be building a list. Note that the list is likely to "come out backward", in the sense that the first line of the file will correspond to the last element in the list. This is totally fine.
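Here is a sketch of that accumulation pattern, using a linked list of strings. The StrNode class is a hypothetical stand-in for whatever data definition you write; the point is only to show next, the for form, and why the result comes out backward.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StrNode:
    first: str
    rest: Optional["StrNode"]


# An iterator hands out its elements one at a time.
words = iter(["header", "a", "b", "c"])
skipped = next(words)        # consumes "header", like the first CSV line

# Local mutation: 'result' is rebound on every trip through the loop.
result: Optional[StrNode] = None
for word in words:
    result = StrNode(word, result)   # the newest element goes on the front
```

After the loop, the first element of result is "c" and the last is "a", because each new node is placed in front of the ones built earlier.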

4.4 raise

Another thing going on here is the use of raise to signal errors, and the use of format to format the error string. The pattern that you see here is a typical one; unless something catches the error, the raise statement will halt the execution of the program and show the given error to the user. The argument to raise must be an exception, that is, something derived from BaseException. Honestly, you don’t have to worry about that; you can just use ValueError, as done in this example.
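The following sketch shows the same pattern in isolation, along with one way to observe the error without halting the program. The function checked_count is made up for this example.

```python
# Return the given count unchanged, or signal an error if it is negative.
def checked_count(count: int) -> int:
    if count < 0:
        raise ValueError("expected a non-negative count, got: {}".format(count))
    return count


# A raised error can be caught with try/except, which is handy in tests.
try:
    checked_count(-5)
    message = "no error"
except ValueError as err:
    message = str(err)

print(message)
```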

4.5 format

There are a bunch of ways of formatting string outputs in Python, and this is one of them. Specifically, Python strings have a format method, which can be used to embed printed representations of values into strings, which is often useful. Here’s another example:

  print("the result of adding {} and {} is approximately {}.".format(3,4,19))

... which helpfully informs you that adding 3 and 4 produces 19. Or something like that.
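For a slightly more practical sketch, the placeholders can also carry a format specification. Here, {:.2f} asks for two digits after the decimal point; the values come from the broom row of the magic store example, and the variable names are just for illustration.

```python
price_in_cents = 7899
in_stock = 10

# {} uses the default representation; {:.2f} rounds to two decimal places.
message = "{} brooms in stock at ${:.2f} each".format(in_stock, price_in_cents / 100)
print(message)   # prints: 10 brooms in stock at $78.99 each
```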

5 The Input Data

I’ve extracted a slice of the full dataset for the purposes of this assignment, but there’s still plenty of data there. The original dataset contains about 10 different sectors of greenhouse gas emissions, and I’ve trimmed this to just three: the "electricity and heat", "energy", and "total emissions excluding land use change and forestry" sectors. Each one has both total and per-capita numbers.

There are some gaps in the data, as you’ll see. For instance, you’ll see on line 126 that Andorra is missing "electricity and heat" data entirely. Missing data is indicated by empty strings (so when several empty strings appear next to each other in a CSV row, they look like a run of commas with nothing between them, e.g. ",,,").
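One way to cope with these gaps is to turn an empty string into None before converting to a number. This is only a sketch: the helper name parse_optional_float and the sample line are made up, and the assignment does not prescribe how missing values should be represented.

```python
import csv
import io
from typing import Optional


# Convert a CSV field to a float, or to None when the field is empty.
def parse_optional_float(field: str) -> Optional[float]:
    if field == "":
        return None
    return float(field)


# A made-up line with three missing fields in a row.
fields = next(csv.reader(io.StringIO("Somewhere,1990,,,,4.5\n")))
values = [parse_optional_float(f) for f in fields[2:]]
print(values)
```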

The goal of this assignment is to create a linked list of Row records, and to allow database-like filtering of these rows, extracting (for instance) all of the records from Algeria, every record where the per-capita electricity and heat consumption is higher than 0.7, or every record from years before 2002.

6 Setting the Recursion limit

Python has a default recursion limit of 1,000, which is very low for recursive functions over long linked lists. You can adjust this using sys.setrecursionlimit(), which requires importing sys. I suggest setting it to 10,000 for this assignment. In other words, right after your import declarations (including import sys), you probably want to include

sys.setrecursionlimit(10000)
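To see why the limit matters, consider a recursive function over a linked list with a few thousand elements. The sketch below uses a hypothetical IntNode class and count function; with the default limit of 1,000, the final call would fail with a RecursionError.

```python
import sys
from dataclasses import dataclass
from typing import Optional

sys.setrecursionlimit(10000)


@dataclass(frozen=True)
class IntNode:
    first: int
    rest: Optional["IntNode"]


# Count the elements of a linked list, using one recursive call per element.
def count(lst: Optional[IntNode]) -> int:
    if lst is None:
        return 0
    return 1 + count(lst.rest)


# Build a 3,000-element list with a loop, then measure it recursively.
numbers: Optional[IntNode] = None
for i in range(3000):
    numbers = IntNode(i, numbers)

print(count(numbers))
```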

7 Data Definitions

As we’ve discussed, the first step in writing programs is generally to figure out how to represent the data internally. In this case, you should create a linked list, as we’ve been doing in lab 2, and each element of the linked list should correspond to a single row of the CSV file.

Write a data definition for this data, and put it at the top of your file.
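For comparison, here is what such a data definition might look like for the magic store inventory from the background section. The names InventoryRow, InvNode, and InvList are hypothetical, and the fields of your own row class must follow the columns of the emissions file instead.

```python
from dataclasses import dataclass
from typing import Optional


# One row of the inventory table.
@dataclass(frozen=True)
class InventoryRow:
    item: str
    price: int       # in cents
    in_stock: int


# A non-empty linked list: one row, followed by the rest of the list.
@dataclass(frozen=True)
class InvNode:
    first: InventoryRow
    rest: "InvList"


# An InvList is either None (the empty list) or an InvNode.
InvList = Optional[InvNode]

shelf: InvList = InvNode(InventoryRow("broom", 7899, 10),
                         InvNode(InventoryRow("wand", 1426, 150), None))
```

Because the classes are frozen, an attempt to assign to a field such as shelf.first.price raises an error instead of silently changing the data.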

8 Reading in the Data

The next step is to read the data from the file. The code above should be helpful to you.

Develop the read_csv_lines function, which accepts the name of a file in the local directory and returns a linked list of Row objects.

For the purposes of testing, it will probably be useful to extract the first four or five lines of the test file into a separate test CSV file.

I recommend creating a helper function that translates a list of strings into a row object; this helper function will be easier to test, and simplify the process of debugging the enclosing function.
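The shape of such a helper, shown here for the hypothetical magic store inventory rather than the emissions data, might look like the following sketch. The names InventoryRow and strings_to_inventory_row are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InventoryRow:
    item: str
    price: int       # in cents
    in_stock: int


# Translate one line of the inventory CSV, given as a list of strings,
# into an InventoryRow.
def strings_to_inventory_row(fields: list[str]) -> InventoryRow:
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got: {}".format(fields))
    return InventoryRow(fields[0], int(fields[1]), int(fields[2]))


print(strings_to_inventory_row(["broom", "7899", "10"]))
```

Because the helper takes a plain list of strings, it can be tested without creating any file.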

9 Counting the data

You should develop the listlen function, which accepts a linked list of Row objects and returns its length.

10 Filtering the Data

There are three ways we want to be able to filter the data:
  • Return all of the rows where a given field is less than a specified value,

  • return all of the rows where a given field is equal to a specified value, and

  • return all of the rows where a given field is greater than a specified value.

In order to make this possible, you should write a single filter function. It should accept a linked list of rows, a field name, a comparison type, and a comparison value.

The field name should be chosen from the strings appearing in the first line of the csv file. That is, the field name should be one of the following:

The comparison type should be one of the following: "less_than", "equal", or "greater_than".

Not all fields should be comparable using all the different comparison types.

Specifically, the "country" field should only be comparable using the "equal" comparison type, and the numerical measurement fields of CO2 emissions should only be comparable using the "less_than" and "greater_than" comparison types.

Also, note that there are a bunch of different opportunities for abstraction here. That is, certain ways of structuring the code can avoid lengthy repetition. If you can see good ways to design helper functions in order to reduce code duplication, that’s great! If not: that’s fine too!
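To illustrate one possible kind of abstraction, here is a sketch of the same idea applied to the hypothetical magic store inventory, not to the emissions data. All of the names here (InventoryRow, InvNode, compares, keep_rows) are made up, and the rule about which comparisons are allowed is a simplified stand-in for the rules described above.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class InventoryRow:
    item: str
    price: int
    in_stock: int


@dataclass(frozen=True)
class InvNode:
    first: InventoryRow
    rest: Optional["InvNode"]


# Decide whether 'actual' stands in the given relation to 'target'.
def compares(actual: Union[str, int], comparison: str,
             target: Union[str, int]) -> bool:
    if comparison == "equal":
        return actual == target
    if isinstance(actual, str) or isinstance(target, str):
        raise ValueError("{} is not allowed for strings".format(comparison))
    if comparison == "less_than":
        return actual < target
    if comparison == "greater_than":
        return actual > target
    raise ValueError("unknown comparison: {}".format(comparison))


# Keep only the rows whose named field passes the comparison.
def keep_rows(lst: Optional[InvNode], field: str, comparison: str,
              target: Union[str, int]) -> Optional[InvNode]:
    if lst is None:
        return None
    rest = keep_rows(lst.rest, field, comparison, target)
    if compares(getattr(lst.first, field), comparison, target):
        return InvNode(lst.first, rest)
    return rest


cauldron = InventoryRow("cauldron", 47000, 16)
broom = InventoryRow("broom", 7899, 10)
wand = InventoryRow("wand", 1426, 150)
shelf = InvNode(cauldron, InvNode(broom, InvNode(wand, None)))

cheap = keep_rows(shelf, "price", "less_than", 10000)
wands = keep_rows(shelf, "item", "equal", "wand")
```

Putting the comparison logic in one helper means the list traversal is written only once, whichever field and comparison type are requested.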

11 Some Questions

Using these functions, we should be able to answer some questions. For each of the following questions, write a function called answer_<n> that uses your functions and data to answer the corresponding question:

  1. How many countries are listed in this dataset? Your function should return the answer.

  2. What years are represented in this dataset for Mexico? Your function should return a list of all of the rows associated with Mexico.

  3. What countries have higher per-capita total energy consumption (excluding lucf) than the United States in 1990? Your function should return the row for each of these countries.

  4. What countries have higher per-capita total energy consumption (excluding lucf) than the United States in 2020? Your function should return the row for each of these countries.

  5. What is the population of Luxembourg in 2014? Your function should return the approximate population (in people, not in millions of people). You should infer this by using division, and comparing the per-capita and total figures.

  6. What is the increase in total energy-and-heat usage in China, from 1990 to 2020? Your function should return a multiplier. So, for instance, the number 1.4 would mean that the energy use had increased by 40%.

  7. If this rate of growth continues, what will China’s energy-and-heat usage be in 2070? Your function should return this number.
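Questions 5 through 7 are arithmetic on values pulled out of the rows. The sketch below shows the shape of that arithmetic using invented placeholder numbers, not values from the dataset. It assumes that the total and per-capita figures are in the same unit, and it reads "this rate of growth" as compound growth (the same multiplier applied every 30 years). Both are assumptions that you should check against the data and the wording of the question.

```python
# Invented placeholder values; the real ones come from your row records.
total_emissions = 12_000_000.0    # emissions for a whole country
per_capita_emissions = 20.0       # emissions per person, assumed same unit

# total = per_capita * population, so population = total / per_capita.
population = total_emissions / per_capita_emissions

usage_1990 = 500.0
usage_2020 = 700.0

# A multiplier of 1.4 means a 40% increase over the 30 years.
multiplier = usage_2020 / usage_1990

# 2070 is 50 years after 2020, which is 50/30 of a 30-year period.
usage_2070 = usage_2020 * multiplier ** (50 / 30)
```

If the totals in the file turn out to be in millions of a unit while the per-capita figures are not, the division needs a corresponding scale factor.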

12 Design Recipe

For each function required in this assignment, you should follow the design recipe, as described in this document. Specifically, each function should come with a purpose statement, types for both parameters and return type, and a full set of tests (unless the specification states that no tests are required for a particular function).

13 A note about automated testing

You should use an @dataclass(frozen=True) decorator to define all classes created in this assignment.

14 Import restrictions

In order to make it possible to test and analyze your code, it’s important that we be able to run it in a consistent environment.

Specifically, we will run your code with Python 3.12.3, in an environment that includes only the standard packages along with mypy. Please don’t import packages other than dataclasses, typing, unittest, and math (plus csv and sys, which this assignment itself uses), and of course other parts of your own source code. If I’ve left something important off this list, let me know!

15 Handin instructions

Submit this assignment by pushing to the repository created for you by GitHub Classroom.

16 Automated Testing

We will be using some automated testing in order to check the correctness of your code. For this reason, it’s important that names be spelled as specified, and that fields and parameters occur in the order specified in the text.

Please note that the repository also contains a file called "basic-tests.py", and these tests are run automatically when you push to GitHub. These tests don’t verify that your code is correct, they simply verify that you have defined the right functions, and that they have the right number of arguments, and that they return the right kind of thing. These tests exist because it’s very sad when you are supposed to develop a function named (say) "success" and you accidentally name it (say) "succcess" and then all of the tests fail.

You can run these tests yourself, using

python3 basic-tests.py

... or by running this file in your favorite IDE.

These tests are also run for you when you push to GitHub, and you should see either a green check next to your submission if the tests passed, or a red X if they didn’t pass.

In general, files that don’t pass the basic tests may receive a score of zero on the assignment.

Please note that editing the "basic-tests.py" file is a terrible idea. The whole point of "basic-tests.py" is to ensure that the post-submission tests run correctly; editing this file is like using chewing gum to stick the needle of your thermometer to 80 degrees. It won’t actually make our tests pass; it will just obscure the real signal.

16.1 Checking that you Submitted

If you’re not familiar with the overly complex nature of git, it’s very very possible to do all the work and commit it locally but fail to push it to GitHub Classroom, in which case it will appear to us that you’ve done no work at all, which is very sad.

In order to check that your work is checked in, use the GitHub web interface to ensure that the version of the code that appears on GitHub is the most updated version, and that it passes all of the basic tests.