Lab 12: Needles and Haystacks
Overview
Enron was an energy trading company. In October 2001, the Enron
scandal became public knowledge. Two ingredients of the scandal were
crucial:
- Incriminating documentation was systematically destroyed.
- Fraudulent practices were employed to artificially drive up energy
prices.
Some key evidence employed by prosecutors was obtained through the
seizure of the archived email messages on the company’s servers. These
emails were released publicly in the aftermath of the scandal. This
email archive is frequently employed by
computer scientists for experimenting with ideas for large-scale text
processing. In this lab, we will write Python code to enable us to
explore this archive. In the process, you might be able to identify some
of the key incriminating emails that it contains.
File System Structure
The files on a computer are organized in a hierarchy. You are likely
accustomed to navigating this hierarchy using the metaphor of the
folder. Each folder contains files, some of which might be other
folders. By placing folders within folders, we can create a hierarchical
layout of all of our data. Because a folder can be placed within another
folder, we call this a recursive data structure. To navigate a
recursive data structure, the most effective technique is to write
recursive functions, that is, functions that call themselves.
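To make the idea concrete before touching the real file system, here is a minimal sketch that models folders as nested Python dictionaries. The function name all_names and the sample data are hypothetical and exist only for illustration; a dictionary value stands for a subfolder and a string value stands for a regular file.

```python
from typing import Any, Dict, List


def all_names(folder: Dict[str, Any], prefix: str = "") -> List[str]:
    """Return the path of every regular file in a nested-dictionary 'folder'."""
    names: List[str] = []
    for name, value in folder.items():
        full = prefix + name
        if isinstance(value, dict):
            # A dictionary plays the role of a subfolder: recurse into it.
            names.extend(all_names(value, full + "/"))
        else:
            # Anything else plays the role of a regular file.
            names.append(full)
    return names


# Hypothetical sample data: one subfolder with two files, plus one top-level file.
example = {"inbox": {"1.": "hello", "2.": "meeting"}, "notes.txt": "todo"}
print(all_names(example))
```

The function calls itself only when it meets a subfolder, so the recursion stops on its own once it reaches folders that contain nothing but regular files.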
Step 1: Exploring File System Functions
Create a new Python project for lab 12. Download the Enron email
repository, unzip the compressed archive, and place it in your project
directory. (Note: This is not the full repository. We have edited it
down to make it easier for you to analyze.)
Then download the Python file named filefuncs.py and also place it in
your project directory. At the top of the file is the line import os.
This imports the os module, which provides functions for accessing the
computer’s operating system.
We are specifically interested in the following functions:
- os.listdir(): This function returns a list of the names of all files
in a specified folder.
- os.path.isdir(): This function returns True if a given path is
a folder, and False otherwise.
Using os.listdir(), we can examine each file in a given folder. If it
is a regular file, we can process it however we like. If it is another
folder, which we determine with os.path.isdir(), we can use recursion
to process the files within it.
The following short function is included in the file and demonstrates how
these two functions are to be used:
def files_and_folders(path: str):
    i: int = 0
    files: List[str] = os.listdir(path)
    for f in files:
        child: str = path + os.sep + f
        if os.path.isdir(child):
            print(child + " is a folder")
        else:
            print(child + " is a regular file")
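One detail is worth seeing in isolation: os.listdir() returns bare names, not full paths, which is why the function above builds child before calling os.path.isdir(). The following sketch shows this on a throwaway folder created with tempfile; the folder and file names are made up for the demonstration.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Build a tiny hypothetical layout: one subfolder and one regular file.
    os.mkdir(os.path.join(root, "inbox"))
    with open(os.path.join(root, "notes.txt"), "w") as handle:
        handle.write("hello")

    # os.listdir gives names only, in no guaranteed order, so we sort them.
    names = sorted(os.listdir(root))

    # Each name must be joined with its parent folder before testing it.
    kinds = {name: os.path.isdir(os.path.join(root, name)) for name in names}

print(names)
print(kinds)
```

Calling os.path.isdir() on the bare name would look for it in the current working directory rather than in the folder being listed.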
Step 1.1
Run your file in the console, and then type the following lines in the
console to see what it does:
files_and_folders('.')
files_and_folders('enron')
files_and_folders('enron/townsend-j')
files_and_folders('enron/townsend-j/to_do')
Write down your answers to the following questions in a comment above
the definition of the files_and_folders() function.
- What is the output of each command?
- What is the meaning of this output?
Step 1.2
Next, add these two lines to the else clause:
contents = open(child).read()
print(contents)
Run your file in the console, then run the following lines again:
files_and_folders('.')
files_and_folders('enron/townsend-j/to_do')
Answer the following questions in comments above the function.
- What does it do differently?
- What is the effect of open(child).read()?
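When you read many files in a loop, it can help to close each file explicitly and to decide what should happen when a file contains bytes that are not valid text. The sketch below is one possible approach, not part of filefuncs.py: the helper name read_text is hypothetical, and the choice of UTF-8 with errors="replace" is an assumption about how you want to treat undecodable bytes.

```python
import os
import tempfile


def read_text(path: str) -> str:
    """Return the entire contents of a file as one string."""
    # The with statement closes the file as soon as the block ends.
    # errors="replace" (an assumption) substitutes a placeholder character
    # for undecodable bytes instead of raising UnicodeDecodeError.
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()


with tempfile.TemporaryDirectory() as root:
    sample = os.path.join(root, "sample.txt")
    # Write one byte (0xE9) that is not valid UTF-8 on its own.
    with open(sample, "wb") as handle:
        handle.write(b"caf\xe9 plan")
    text = read_text(sample)

print(repr(text))
```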
Step 2: Finding the depth
The Enron emails are organized using a system of folders. Each folder
corresponds to a specific person. Each person might have other folders
to better organize their emails.
In this step, we will write a function find_depth to find out how deep
a particular folder goes. This function is outlined as follows:
def find_depth(path: str) -> int:
    # set an integer variable representing the deepest path to zero
    # for each file in the current folder
        # If it is a folder
            # Call find_depth recursively on that folder, storing the result in a variable.
            # If the returned value, plus 1, is greater than our deepest path so far,
            # set our deepest path to be the returned value plus 1. Otherwise, do nothing.
    # return the deepest path
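One way the outline above might be filled in is sketched below. It is a sketch under the outline's own convention, in which a folder with no subfolders has depth 0, and it is checked against a small temporary folder tree rather than the Enron archive; the folder names a, b, c, and d are made up.

```python
import os
import tempfile


def find_depth(path: str) -> int:
    deepest = 0
    for name in os.listdir(path):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # The subfolder itself adds one level to whatever lies below it.
            depth = find_depth(child) + 1
            if depth > deepest:
                deepest = depth
    return deepest


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: root/a/b/c (three levels) and root/d (one level).
    os.makedirs(os.path.join(root, "a", "b", "c"))
    os.makedirs(os.path.join(root, "d"))
    depth = find_depth(root)

print(depth)
```

Regular files never change the result, so the loop simply ignores them.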
Here’s an example of using this function. (You should get the same
answer when you test it.)
Step 3: Counting all the files
In this step, we will write a function file_count that counts the total
number of files stored in a folder, including those in its subfolders.
The outline of this function will be very similar to the
outline of the previous function. The difference is that instead of
finding the maximum depth, we want to know the total number of files.
So, inside our loop, if a given file is a folder, we add to our total
the result of the recursive call. If not, we add 1 (to count that as a
normal file).
def file_count(path: str) -> int:
    # set an integer variable representing the total number of files to zero
    # for each file in the current folder
        # If it is a folder
            # Call file_count recursively on that folder, adding the result to our total
        # Otherwise
            # Add 1 to our total
    # return the total
Here’s an example of using this function. (You should get the same
answer when you test it.)
>>> file_count('enron')
56838
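Before running a count over tens of thousands of files, it can be reassuring to check the logic on a folder small enough to count by hand. The sketch below is one possible implementation of the outline, exercised on a temporary folder with made-up names; note that folders themselves are not counted, only the regular files inside them.

```python
import os
import tempfile


def file_count(path: str) -> int:
    total = 0
    for name in os.listdir(path):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # A folder contributes however many files it contains.
            total += file_count(child)
        else:
            # A regular file contributes exactly one.
            total += 1
    return total


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: three files spread over nested folders,
    # plus one empty folder that should contribute nothing.
    os.makedirs(os.path.join(root, "a", "b"))
    os.mkdir(os.path.join(root, "empty"))
    for rel in ["x.txt", os.path.join("a", "y.txt"), os.path.join("a", "b", "z.txt")]:
        with open(os.path.join(root, rel), "w") as handle:
            handle.write("sample")
    count = file_count(root)

print(count)
```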
Step 4: Finding target strings
To detect clues of wrongdoing, we can write a function all_files_with
to visit every file and see if it contains a given target string. In
this step, our
function will return a list containing the name of every file that
contains a target word. We can then use this list to decide which files
to manually inspect.
def all_files_with(path: str, target_string: str) -> List[str]:
    # set a variable representing all the filenames containing target_string to the empty list
    # for each file in the current folder
        # If it is a folder
            # Call all_files_with on that folder, adding every element from the list it returns
            # to our own list.
        # Otherwise
            # Open the file and read it into a string.
            # If the string contains our target_string, add the filename to our list.
    # return our list of filenames
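The outline above could be realized along the following lines. This is a sketch, not the required solution: reading with UTF-8 and errors="replace" is an assumption made so that a stray undecodable byte does not stop the search, and the match is case-sensitive because the outline does not say otherwise. The test folder and its file names are made up.

```python
import os
import tempfile
from typing import List


def all_files_with(path: str, target_string: str) -> List[str]:
    matches: List[str] = []
    for name in os.listdir(path):
        child = os.path.join(path, name)
        if os.path.isdir(child):
            # extend adds each element of the returned list to our own list.
            matches.extend(all_files_with(child, target_string))
        else:
            # Assumption: treat every file as UTF-8 text, replacing bad bytes.
            with open(child, encoding="utf-8", errors="replace") as handle:
                contents = handle.read()
            if target_string in contents:
                matches.append(child)
    return matches


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "a"))
    samples = {
        "one.txt": "a needle here",
        os.path.join("a", "two.txt"): "another needle",
        os.path.join("a", "three.txt"): "only hay",
    }
    for rel, text in samples.items():
        with open(os.path.join(root, rel), "w", encoding="utf-8") as handle:
            handle.write(text)
    # Sort relative paths because os.listdir makes no promise about order.
    found = sorted(os.path.relpath(p, root) for p in all_files_with(root, "needle"))

print(found)
```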
Step 5: Finding evidence
At this point, we have the tools to try to find a needle in
the haystack. Here are some text strings you might consider using to
find incriminating emails:
- irregularities
- fear of litigation
- witholding capacity
- interruptible transmission
- found liable
- market interference
- special purpose entity
- possession of tapes
Warning: You are about to read the private emails
of Enron employees. If you search for other strings besides the
ones listed above, it is likely you will find questionable and objectionable content,
as the employees routinely used their work email for personal use. You are not
required to search beyond the strings listed above.
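If you want to try several of the listed strings in one pass, a sketch like the following may save time. It is an illustration rather than part of the assignment: the function name search_many is hypothetical, it uses os.walk from the standard library in place of hand-written recursion so that it stands alone, and it assumes a case-insensitive match is acceptable.

```python
import os
import tempfile
from typing import Dict, List


def search_many(path: str, targets: List[str]) -> Dict[str, List[str]]:
    """Map each target string to the files under path that contain it."""
    hits: Dict[str, List[str]] = {target: [] for target in targets}
    # os.walk visits path and every folder beneath it.
    for folder, _subfolders, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(folder, name)
            with open(full, encoding="utf-8", errors="replace") as handle:
                text = handle.read().lower()
            for target in targets:
                # Assumption: ignore case when matching.
                if target.lower() in text:
                    hits[target].append(full)
    return hits


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sent"))
    with open(os.path.join(root, "sent", "1."), "w", encoding="utf-8") as handle:
        handle.write("We were Found Liable last week.")
    with open(os.path.join(root, "2."), "w", encoding="utf-8") as handle:
        handle.write("Lunch on Friday?")
    results = search_many(root, ["found liable", "irregularities"])
    summary = {
        target: sorted(os.path.relpath(p, root) for p in paths)
        for target, paths in results.items()
    }

print(summary)
```

Each file is read once no matter how many strings you search for, which matters when the archive holds tens of thousands of messages.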
Step 6: Assessing evidence
Identify a collection of filenames of emails that you think might
contain evidence of wrongdoing.
- What aspects of those emails might provide evidence?
- Are there any ambiguities?
- In what ways is the evidence incomplete or inconclusive?
Provide answers in your Evaluation Document.
What to turn in
- filefuncs.py
- Evaluation Document for step 6