STAT218Lab11Open

932 days ago by pub

Sage Lab 11: Nonparametic Methods with New Zealand Earthquake Statistics

STAT 218 2009 S2: Computational Methods in Statistics

©2009 Jennifer Harlow, Dominic Lee and Raazesh Sainudiin.
Creative Commons Attribution-Noncommercial-Share Alike 3.0

Introduction

This is an interactive Sage worksheet from the Sage Notebook GUI.

This lab is split into two parts.   In the first part we will work through, as a class, the things that you need to know.   This should take half of the lab time.  In the second half of the lab you will continue at your own pace and we will be there to help with anything you want to ask about.   You can spend this time going back over the things we covered in the first half to make sure that you understand them properly.   If you are completely happy with the first part we have provided some more optional material for you to work on in the lab and on your own.

I. The essentials

The functions for obtaining and manipulating earthquakes data in this section are based on ideas in 5.2.5 in Miller & Ranum, Python: Programming in Context (2009). The source for New Zealand earthquake data  is http://magma.geonet.org.nz/resources/quakesearch/.

We first define a function that can get our data for us into a friendly array from a text file containing it.

import pylab def getData(filename, headerlines=0, sep=None): '''Open the file filename and return a pylab.array of the contents. Param filename is the filename to get the data from When used in the SAGE notebook interface, the data file must be uploaded to the worksheet Param headerlines is the number of headerlines (omitted from parsing, defaults to 0) Param sep is the separator for data within each line (defaults to None) getData only deals with 2-d data. If used with a space separated txt file, getData will reject any line with less than or more than the expected number of elements. This is because the pylab.array cannot be a ragged array.''' try: # use try ... except ... to check that we can open the file myFile = open(DATA+filename) # open the file - myFile is now a file object expElements = None # a variable to hold the expected number of elements in each line while(headerlines>0): # while is another way of looping, it loops until the condition headerlines>0 is false headers = myFile.readline() # read and discard each headerline headerlines = headerlines - 1 # decrement headerlines if headerlines == 0: # if we are on the last headerline expElements = len(headers.split(sep)) # assume last headerline column titles, count them tempList = [] # start with an empty list for line in myFile: # this for loop reads each line of the file one by one rlist = line.split(sep) # split is a method that allows you to split a string up into bits if expElements == None: # if the expElements is not set yet expElements = len(rlist) # set it based on elements in the first line if len(rlist) == expElements: # check the line has the expected number of elements tempList.append(rlist) # if so, add to the retlist else: print 'Omitted line\n', line, '\nwhich does not match column headers/first line' myFile.close() # don't have to here but good practice retArray = pylab.array(tempList) # make the list of lists into an array return retArray # the function returns array it has created except IOError, e: # error handling used if we could not open the file print 'Error opening file. Error is ', e print 'Check file attached.' return # end of getData(...) print 'We have now defined a function getData(...)\n\ which we can use to read data from a file into an array' 
       
We have now defined a function getData(...)
which we can use to read data from a file into an array

Let us assign the comma-separated-variable files named NZearthquakes1Jul08to6Aug08.csv and NZearthquakes1Jul09to6Aug09.csv that contain NZ earthquakes in the time intervals [1/July/2008 , 6/August/2008] and [1/July/2009 , 6/August/2009] to the string variables myFilename2008 and myFilename2009, respectively.

myFilename2008 = 'NZearthquakes1Jul08to6Aug08.csv' # 2008 data myFilename2009 = 'NZearthquakes1Jul09to6Aug09.csv' # 2009 data 
       

Now we assign the array returned by getData with appropriate arguments to myData2008 and myData2009.

myData2008 = getData(myFilename2008, headerlines=1, sep=',') myData2009 = getData(myFilename2009, headerlines=1, sep=',') 
       

What is exactly in the arrays myData2008 and myData2009 just returned by getData? Let's find out, shall we?

myData2008 
       
array([['2931322', '-37.88412', '177.87352', ..., '37.77383',
'2.421',
        '63.9307\r\n'],
       ['2931325', '-37.26902', '176.51445', ..., '38.14913',
'3.34',
        '220.2202\r\n'],
       ['2931332', '-39.47651', '175.69447', ..., '59.8815',
'1.839',
        '17.9686\r\n'],
       ..., 
       ['2949711', '-40.7494', '174.62494', ..., '40.23068',
'2.217',
        '12\r\n'],
       ['2949714', '-38.51364', '176.21767', ..., '36.03949',
'2.96',
        '171.0209\r\n'],
       ['2949715', '-41.38802', '172.34175', ..., '25.78879',
'2.932',
        '5\r\n']], 
      dtype='|S10')
myData2009 
       
array([['3117514', '-44.99823', '168.53069', ..., '12.58069',
'1.863',
        '5\r\n'],
       ['3117515', '-47.38385', '165.79785', ..., '19.09013',
'2.744',
        '33\r\n'],
       ['3117537', '-40.38247', '176.09354', ..., '37.45829',
'2.553',
        '33.809\r\n'],
       ..., 
       ['3134964', '-40.78112', '174.41324', ..., '11.8528',
'4.322',
        '50.269\r\n'],
       ['3134966', '-39.41886', '175.83319', ..., '48.71931',
'2.272',
        '53.3487\r\n'],
       ['3134981', '-39.78736', '176.8123', ..., '53.7128', '2.241',
        '45.7776\r\n']], 
      dtype='|S10')
type(myData2009) 
       
<type 'numpy.ndarray'>
myData2009[0] 
       
array(['3117514', '-44.99823', '168.53069', '2157734', '5569493',
'2009',
       '7', '1', '0', '59', '12.58069', '1.863', '5\r\n'], 
      dtype='|S10')
myData2009[0,0] 
       
'3117514'
type(myData2009[0,0]) 
       
<type 'numpy.string_'>
 
       
 
       

It is necessary to do error checks on raw data. If you expect data in the real world to be pre-checked for you then you are in the wrong profession. Without careful error checks you cannot analyse data. We need two functions safeFloat and safeInt. See the docstrings for these functions now.

def safeFloat(obj): '''make a float out of a given object if possible. the float(...) function only works with strings containing characters that can be part of a float, such as '123.45', so we have to check that the string we pass in can be made into a float, otherwise we will get an error when we try make a float out of a alpha-character string such as "AK". We use exception handling to do this ''' try: retval = float(obj) except (ValueError, TypeError), diag: retval = str(diag) return retval # end of safeFloat print 'We have now defined a function safeFloat(...)\n\ which we can use to try to turn strings into floats' 
       
We have now defined a function safeFloat(...)
which we can use to try to turn strings into floats
def safeInt(obj): '''make an Int out of a given object if possible. the int(...) function only works with strings containing characters that can be part of an int, such as '123', so we have to check that the string we pass in can be made into an int, otherwise we will get an error when we try make a float out of a alpha-character string such as "AK". We use exception handling to do this ''' try: retval = int(obj) except (ValueError, TypeError), diag: retval = str(diag) return retval # end of safeInt print 'We have now defined a function safeInt(...)\n\ which we can use to try to turn strings into floats' 
       
We have now defined a function safeInt(...)
which we can use to try to turn strings into floats

Let us next define a function that will return the magnitudes of the earthquakes from our data array. See the docstring of the function makeMagList now.

def makeMagList(myArray): '''Return a list of magnitudes from an array of earthquake data. Param myArray is a pylab.array of data formated as strings. Magnitudes are assumed to be in the 12 column of the file ''' try: # use try ... except ... to make sure we can do what we want to do magStrs = myArray[:,11] # take the magnitude column maglist = [] # initialise to an empty array lineno = 0 # to keep track of line numbers for mag in magStrs: lineno = lineno + 1 floatMag = safeFloat(mag) if isinstance(floatMag, float): # check we could turn the magnitude into a float maglist.append(floatMag) # append the magnitude at the end of the list else: # omit and report on any lines with magnitudes we can't turn into floats print 'Ignored row' , lineno, ' error diagnosis ', floatMag return maglist # the function returns the list it has created to the process which called the function except TypeError, e: # error handling for type incompatibilities print 'Error: Error is ', e return # end of makeMagsList(...) print 'We have now defined a function makeMagList(...)\n\ which we can use to process some earthquake data' 
       
We have now defined a function makeMagList(...)
which we can use to process some earthquake data

Let us assign the list of magnitudes returned by makeMagList(myData2008) to listMags2008 and by makeMagList(myData2009) to listMags2009.

listMags2008 = makeMagList(myData2008) # use our makeMagList function to make a list of magnitudes for 2008 data 
       
Ignored row 48  error diagnosis  empty string for float()
Ignored row 148  error diagnosis  empty string for float()
Ignored row 150  error diagnosis  empty string for float()
Ignored row 275  error diagnosis  empty string for float()
Ignored row 291  error diagnosis  empty string for float()
Ignored row 305  error diagnosis  empty string for float()
Ignored row 311  error diagnosis  empty string for float()
Ignored row 346  error diagnosis  empty string for float()
Ignored row 348  error diagnosis  empty string for float()
Ignored row 352  error diagnosis  empty string for float()
Ignored row 425  error diagnosis  empty string for float()
Ignored row 457  error diagnosis  empty string for float()
Ignored row 517  error diagnosis  empty string for float()
Ignored row 548  error diagnosis  empty string for float()
Ignored row 597  error diagnosis  empty string for float()
Ignored row 602  error diagnosis  empty string for float()
Ignored row 604  error diagnosis  empty string for float()
Ignored row 779  error diagnosis  empty string for float()
Ignored row 797  error diagnosis  empty string for float()
Ignored row 950  error diagnosis  empty string for float()
Ignored row 1048  error diagnosis  empty string for float()
Ignored row 1073  error diagnosis  empty string for float()
Ignored row 1094  error diagnosis  empty string for float()
Ignored row 1296  error diagnosis  empty string for float()
Ignored row 1366  error diagnosis  empty string for float()
Ignored row 1496  error diagnosis  empty string for float()
Ignored row 1506  error diagnosis  empty string for float()
Ignored row 1519  error diagnosis  empty string for float()
Ignored row 1532  error diagnosis  empty string for float()
Ignored row 1536  error diagnosis  empty string for float()
Ignored row 1627  error diagnosis  empty string for float()
Ignored row 1707  error diagnosis  empty string for float()
Ignored row 1712  error diagnosis  empty string for float()
Ignored row 1718  error diagnosis  empty string for float()
Ignored row 1769  error diagnosis  empty string for float()
Ignored row 1771  error diagnosis  empty string for float()
Ignored row 1777  error diagnosis  empty string for float()
Ignored row 1797  error diagnosis  empty string for float()
Ignored row 1818  error diagnosis  empty string for float()
Ignored row 1850  error diagnosis  empty string for float()
Ignored row 1871  error diagnosis  empty string for float()
Ignored row 1909  error diagnosis  empty string for float()
Ignored row 2066  error diagnosis  empty string for float()
listMags2009 = makeMagList(myData2009) # use our makeMagList function to make a list of magnitudes for 2009 data 
       
Ignored row 45  error diagnosis  empty string for float()
Ignored row 79  error diagnosis  empty string for float()
Ignored row 127  error diagnosis  empty string for float()
Ignored row 230  error diagnosis  empty string for float()
Ignored row 442  error diagnosis  empty string for float()
Ignored row 570  error diagnosis  empty string for float()
Ignored row 607  error diagnosis  empty string for float()
Ignored row 638  error diagnosis  empty string for float()
Ignored row 720  error diagnosis  empty string for float()
Ignored row 741  error diagnosis  empty string for float()
Ignored row 801  error diagnosis  empty string for float()
Ignored row 825  error diagnosis  empty string for float()
Ignored row 888  error diagnosis  empty string for float()
Ignored row 921  error diagnosis  empty string for float()
Ignored row 1281  error diagnosis  empty string for float()
Ignored row 1299  error diagnosis  empty string for float()
Ignored row 1708  error diagnosis  empty string for float()

Let us visualise the frequency of magnitudes in our data from 2008 (try 2009 too) over ten intervals using a histogram.

pylab.clf() # clear current figure n, bins, patches = pylab.hist(listMags2008, 10) # make the histogram (don't have to have n, bins, patches = ...) pylab.xlabel('magnitude') # use pyplot methods to set labels, titles etc similar to as in matlab pylab.ylabel('count') pylab.title('Count histogram for earthquake data') pylab.savefig('myHist') # seem to need to have this to be able to actually display the figure pylab.show() # and finally show it 
       

Note that this choice of ten interval bins over the range of magnitudes was an arbitrary choice. It is true that the relative frequency of the magnitudes over any interval is a consistent point estimate of the probability of an earthquake having a magnitude in that interval under an IID model of course. But, the shape of the histogram bar heights are not automatically a good estimate of the underlying density. This is because the histogram is isnsitive to the number of interval bins that are being used to construct it. This problem is called smoothing. So, without the right number of bins we may be under-smoothing or over-smoothing the histogram and therefore unable to get a good nonparametric point estimate of the underlying probability density function. Let us visualise the frequency of magnitudes in our data from 2009 (try 2008 too) over a range of bin numbers using to interactively appreciate the over/under-smoothing problem.

@interact def _(numOfBins=(1,(1..100))): pylab.clf() # clear current figure n, bins, patches = pylab.hist(listMags2009, numOfBins) # make the histogram (don't have to have n, bins, patches = ...) pylab.xlabel('magnitude') # use pyplot methods to set labels, titles etc similar to as in matlab pylab.ylabel('count') pylab.title('Count histogram for earthquake data') pylab.savefig('myHist') # seem to need to have this to be able to actually display the figure pylab.show() # and finally show it 
       
listMags2009Sorted = listMags2009[:] listMags2009Sorted.sort() 
       
len(listMags2009Sorted) 
       
1703
type(listMags2009[0]) 
       
<type 'float'>
len(Set(listMags2009)) 
       
1301
def makeEmfDict(dataList): '''Make a empirical mass function from a list of floating point data and return it as a dictionary''' n = len(dataList) # sample size emfDict = {} # start with an empty dictionary for the empirical mass function for x in dataList: if x in emfDict: # the sample x already exists as a key emfDict[x] = emfDict[x] + 1/n # add 1/n to the count else: # the sample x does not exist as a key value emfDict[x] = 1/n # add a new key-value pair for this new sample x, frequency 1/n return emfDict # return the dictionary created 
       
points(makeEmfDict(listMags2009).items()) 
       
points(makeEmfDict(listMags2008).items()) 
       
def edfPlot(dataList, colour): '''Function to plot the empirical distribution function for one dimensional floating-point data. dataList is a list of data to make into an edf plot. colour is colour to use for the plot returns a plot object.''' emfDict = makeEmfDict(dataList) dictItems = emfDict.items() dictItems.sort() # sort the items dataset = [di[0] for di in dictItems] relFreqs = [di[1] for di in dictItems] import pylab cumFreqs = pylab.cumsum(relFreqs) numCumFreqPairs = zip(dataset, cumFreqs) # make a list of tuples using zip plotEDF = points(numCumFreqPairs, rgbcolor = colour, faceted = false) return plotEDF 
       
edf2008and9 = edfPlot(listMags2008, "blue") edf2008and9 += edfPlot(listMags2009, "red") show(edf2008and9) 
       
 
       
 
       
myData2009 
       
# date and time are in columns index 5 to 10 
       
 
       
def makeQuakeTimes(myArray, minLat, maxLat, minLng, maxLng): '''Return a list earthquake times as a Unix time number for earthquakes occurring between specified latitudes and longitudes. Param myArray is a pylab.array of the data as strings Param minLat, maxLat are the minimum and maximum latitudes to include in the results Param minLng, maxLng are the minimum and maximum longitudes to include in the results Latitudes are expected to be in the second column of the array Longitudes are expected to be in the third column of the array Date and time elements are expected to be in the 6th to 11th columns of the array makeQuakeTimes returns a list occurrence times between for earthquakes a Unix time number Unix time starts at 1.1.1970; the Unix time number counts seconds since 1.1.1970.''' import datetime import time try: # use try ... except ... to make sure we can do what we want to do indexes = range(myArray.shape[0]) # using number of rows in array times = [] # empty list for i in indexes: myTime = [] # empty list for time elements for row j = 5 allOkay = true lat = safeFloat(myArray[i,1]) # get the latitude and longitude for checking lng = safeFloat(myArray[i,2]) if (isinstance(lat, float) & isinstance(lng, float)): # if we got floats for lat and longitude if (lat >= minLat) & (lat <= maxLat) & (lng >= minLng) & (lng <= maxLng): while (j < 10) and allOkay: # deal with year, month, day, hour, minute for this row intTm = safeInt(myArray[i,j]) if isinstance(intTm, int) and allOkay: # check we could turn time element into an int myTime.append(intTm) j+=1 # increment j else: allOkay = false # this means that we will stop the loop for this row # deal with seconds and microseconds # j should be 10 if we have got through year, month, day, hour, minute okay if j==10: seconds = safeFloat(myArray[i,j]) if isinstance(seconds, float): # check we could turn time element into a float myTime.append(int(seconds)) microseconds = int((seconds - int(seconds))*1000000) myTime.append(microseconds) else: allOkay = false if allOkay: yr, mth, dy, hr, mn, sec, msec = tuple(myTime) tm = datetime.datetime(yr, mth, dy, hr, mn, sec, msec) times.append(time.mktime(tm.timetuple())) else: # omit and report on lines outside lat or longitude range print 'Omitted row with latitude', lat, 'longititude', lng else: # omit and report on any lines with latitude or longitude we can't turn into floats print 'Ignored row' , (i+1), ' error diagnosis lat ', lat, 'long ', lng return times # the function returns the list of lists it has created except TypeError, e: # error handling for type incompatibilities print 'Error: Error is ', e return # end of makeQuakeTimes(...) print 'We have now defined a function makeQuakeTimes(...)\n\ which we can use to process some earthquake data' 
       
We have now defined a function makeQuakeTimes(...)
which we can use to process some earthquake data
 
       
def interQuakeTimes(myArray, minLat, maxLat, minLng, maxLng): '''Return a list inter-earthquake times in seconds for earthquakes occurring between specified latitudes and longitudes. Param myArray is a pylab.array of the data as strings Param minLat, maxLat are the minimum and maximum latitudes to include in the results Param minLng, maxLng are the minimum and maximum longitudes to include in the results Latitudes are expected to be in the second column of the array Longitudes are expected to be in the third column of the array Date and time elements are expected to be in the 6th to 11th columns of the array interQuakeTimes returns a list of inter-quake times in seonds.''' quakeTimes = makeQuakeTimes(myArray, minLat, maxLat, minLng, maxLng) retList = [] if len(quakeTimes) > 1: retList = [quakeTimes[i]-quakeTimes[i-1] for i in range(1,len(quakeTimes),1)] return retList 
       
interQuakes2009 = interQuakeTimes(myData2009, -50, -30, 150, 200) 
       
Omitted row with latitude -37.53203 longititude -179.80537
Omitted row with latitude -36.52448 longititude -179.99582
pylab.array(interQuakes2009).mean()/(60*60) # mean inter-earthquake times in hours 
       
0.49965718630686601
pylab.array(interQuakes2009).std()/(60*60) # std of inter-earthquake times in hours 
       
0.95531164430209203
pylab.clf() # clear current figure n, bins, patches = pylab.hist(interQuakes2009, 50) # make the histogram (don't have to have n, bins, patches = ...) pylab.xlabel('inter-quake time(seconds)') # use pyplot methods to set labels, titles etc similar to as in matlab pylab.ylabel('count') pylab.title('Count histogram for inter-quake times 1 July - 6 August 2009') pylab.savefig('myHist') # seem to need to have this to be able to actually display the figure pylab.show() # and finally show it 
       
 
       
# repeat for 2008 data myFilename2008 = 'NZearthquakes1Jul08to6Aug08.csv' # 2008 data myData2008 = getData(myFilename2008, headerlines=1, sep=',') interQuakes2008 = interQuakeTimes(myData2008, -50, -30, 150, 200) 
       
Omitted row with latitude -37.22483 longititude -179.88422
Omitted row with latitude -36.28675 longititude -179.25929
Omitted row with latitude -37.11357 longititude -179.92047
Omitted row with latitude -37.52361 longititude -179.95947
pylab.array(interQuakes2008).mean()/(60*60) # mean inter-earthquake times in hours 
       
0.42090492010757791
pylab.array(interQuakes2008).std()/(60*60) # std inter-earthquake times in hours 
       
0.49867108150408451
pylab.clf() # clear current figure n, bins, patches = pylab.hist(interQuakes2008, 50) # make the histogram (don't have to have n, bins, patches = ...) pylab.xlabel('inter-quake time(seconds)') # use pyplot methods to set labels, titles etc similar to as in matlab pylab.ylabel('count') pylab.title('Count histogram for inter-quake times 1 July - 6 August 2008') pylab.savefig('myHist') # seem to need to have this to be able to actually display the figure pylab.show() # and finally show it 
       

II. Non-essentials

Try to visualise the locations of earthquake epicenters between the two years. Recall the procedures from previous lab. The source for New Zealand earthquake data  is http://magma.geonet.org.nz/resources/quakesearch/.

def makeCoordList(myArray, minLat, maxLat, minLng, maxLng): '''Return a list of lists of earthquake latitude and longitude data and magnitudes. Param myArray is a pylab.array of the data as strings Param minLat, maxLat are the minimum and maximum latitudes to include in the results Param minLng, maxLng are the minimum and maximum longitudes to include in the results Latitudes are expected to be in the second column of the array Longitudes are expected to be in the third column of the array Magnitudes are expected to be in the 12th column of the array makeCoordList returns a list of three list: [0] is the first list in the list, the list of latitudes [1] is the second list in the list, the list of longitudes [2] is the third list in the list, the list of magnitudes associated with the latitudes and longitudes ([0][i],[1][i]) would form the tuple of coordinates for the (i+1) valid row in myArray All the component lists are of the same length: if a row in myArray is missing either latitude or longitude it is excluded entirely if a row in myArray has latitude and longitude but no magnitude, [2] will have None in that position ''' try: # use try ... except ... to make sure we can do what we want to do latStrs = myArray[:,1] # take the latitude column as strings lngStrs = myArray[:,2] # take the longitude column as strings magStrs = myArray[:,11] # take the mags column as strings latlist = [] # start with some empty lists lnglist = [] maglist = [] n = len(latStrs) # count how many rows we have indexes = range(n) # make a list of indexes into the array rows for i in indexes: # for loop over indexes lat = safeFloat(latStrs[i]) # index into each list of strings lng = safeFloat(lngStrs[i]) mag = safeFloat(magStrs[i]) if (isinstance(lat, float) & isinstance(lng, float)): # if we got floats for lat and longitude if (lat >= minLat) & (lat <= maxLat) & (lng >= minLng) & (lng <= maxLng): latlist.append(lat) # append a lat at the end of the list lnglist.append(lng) # append a lng at the end of the list if isinstance(mag, float): # append the mag if there is one, else None maglist.append(mag) else: maglist.append(None) else: # omit and report on lines outside lat or longitude range print 'Omitted row with latitude', lat, 'longititude', lng else: # omit and report on any lines with latitude or longitude we can't turn into floats print 'Ignored row' , (i+1), ' error diagnosis lat ', lat, 'long ', lng coordlist = [latlist,lnglist,maglist] # make a list of the three lists return coordlist # the function returns the list of lists it has created except TypeError, e: # error handling for type incompatibilities print 'Error: Error is ', e return # end of makeCoordList(...) print 'We have now defined a function makeCoordList(...)\n\ which we can use to process some earthquake data' 
       
listCoords = makeCoordList(myData2009, -50, -30, 150, 200) # use our makeCoordList function to make a list of coordinates (and magnitudes) 
       
clist = [] # make an empty list to hold colours depending on the magnitudes for m in listCoords[2]: # go through the third list in listCoords, ie the magnitudes if m == None: clist.append('k') # 'k' is black elif m > 6: clist.append('r') # 'r' is red elif m > 4: clist.append('m') # 'm' is magenta elif m > 2: clist.append('b') # 'b' is blue else: clist.append('y') # 'y' is yellow 
       
pylab.clf() # clear current figure pylab.scatter(listCoords[0],listCoords[1],s=10,c=clist) # make a scatter plot pylab.xlabel('latitude') # use pyplot methods to set labels, titles etc similar to as in matlab pylab.ylabel('longitude') pylab.title('Earthquake locations') pylab.savefig('myScatter') # seem to need to have this to be able to actually display the figure pylab.show() # and finally show it