TDM 10200: Project 3 — 2024

Motivation: Learning about Big Data. When working with large data sets, it is important to know how we can use control flow to find our information, a little bit at a time, without reading in all of the files at once. Control flow is the order that your code runs.

Scope: Python, Control Flow, if statements, for loops

Dataset(s)

/anvil/projects/tdm/data/noaa

Readings and Resources

  • Make sure to read about, and use the template found here, and the important information about projects submissions here.

  • Please review this Examples Book content about Control Flow

Questions

Question 1 (2 pts)

  1. Explore the files in the provided data set directory. Find out how many years are included in the data set. Briefly describe the contents of the files.

  2. Import the library pandas as pd, and import Path from pathlib.

  3. Create a list named myfiles, to hold Path objects from 1880.csv to 1883.csv in the data set folder using list comprehension. You can start with the following sample code (below), but you need to modify this for loop, to use list comprehension.

    Following is the sample code that will return a "Path" object for the file 1750.csv.

    Path("/anvil/projects/tdm/data/noaa/1750.csv")

    You can start with a for loop, to create a list of Path objects, as follows, BUT we want you to modify this example, to use list comprehension.

    myfiles=[]
    for year in range (1880, 1884):
        file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
        myfiles.append(file)
    print(myfiles)

Question 2 (2 pts)

  1. Calculate how many records are in the file 1880.csv. (Each line is one record.)

    The following is the sample code to calculate records in one sample file object named file:

    with open(file,"r") as f:
        mycount = 0
        for line in f:
            mycount += 1
    print(f'There are {mycount} records in the file called {file}')

    There are 370779 records in the file called /anvil/projects/tdm/data/noaa/1880.csv

  2. Calculate how many records there are (altogether) in the 4 files from 1880.csv to 1883.csv. Use the list myfiles that you created in Question 1. Your output should give the total number of records altogether, so it should say something like:

There are [put your number of records here] records in the 4 files altogether.

  • You may use a for loop to iterate over the myfiles object, like this:

    for file in myfiles:
        ...# body of the for loop

Question 3 (2 pts)

  1. Run the following statement, to read in the first file from the list myfiles into a DataFrame using myDF = pd.read_csv(myfiles[0]). Display the column names for myDF. Look at the head and tail of myDF. Do you see anything unexpected?

  2. Please modify your work from Question 3a, to correct the problem that you found. What are the column names now? Hint: Using The `header=None argument will be useful.

  3. Now let us add these 7 column names: id, date, element_code, value, mflag, qflag, sflag, and obstime to the data frame. You can do this using: pd.read_csv(myfiles[0],names = ["id","date","element_code","value","mflag","qflag","sflag","obstime"])

  4. Make a list called mydataframes (of length 4) that contains 4 data frames, one for each year, from 1880.csv to 1883.csv. Starting with the sample code (above) for reading in the first file, modify our example, so that you have a "for" loop that reads in all 4 files. Test your work with a for loop that displays the column names of each of the four data frames in mydataframes. You can show the column names of myDF using myDF.columns.

Question 4 (2 pts)

Let’s look at the column element_code. Use a loop to solve the following questions for all 4 DataFrames:

  1. Print out the (unique) elements of the column element_code (i.e., show each element just one time).

  2. Find the number of times that SNOW occurs in the element_code column.

  • The method unique() will be useful to calculate unique values.

  • You may use different methods to find the number of times that SNOW occurs, for instance, len(), value_counts(), sum(), etc.

Question 5 (2 pts)

Now let us practice using the chunksize feature for big data. You may refer to this document, to get more information about chunksize.

  1. Try to run the following 2 programs, to find the number of times that SNOW occurs in the element_code column, from the year 1880 to year 1883. Explain your understanding of chunksize.

Pre-work for the programs:

import pandas as pd
from pathlib import Path
myfiles=[]
for year in range (1880, 1884):
    file= Path(f'/anvil/projects/tdm/data/noaa/{year}.csv')
    myfiles.append(file)

Version 1 of the program:

count = 0
for file in myfiles:
    for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000):
        count += len(myDF[myDF['element_code'] == 'SNOW'])

print(count)

Version 2 of the program:

count = 0
for file in myfiles:
    for myDF in pd.read_csv(file,names=["id","date","element_code","value","mflag","qflag","sflag","obstime"],chunksize =10000):
        for index, row in myDF.iterrows():
            if row['element_code'] == 'SNOW':
                count += 1

print(count)

Project 03 Assignment Checklist

  • Jupyter Lab notebook with your code, comments and output for the assignment

    • firstname-lastname-project03.ipynb.

  • Python file with code and comments for the assignment

    • firstname-lastname-project03.py

  • Submit files through Gradescope

Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted.

In addition, please review our submission guidelines before submitting your project.