'''
A fairly thorough walkthrough of a Python Project

Created on Aug 11, 2018

@author: Brett Paufler
(c) Copyright Brett Paufler

The first two lines of the signature above
    are added automatically by my IDE:
    PyDev in Eclipse.
The last line is easier for me to add by hand
    as I never did figure out
    how to do it in my IDE.
I added it, because I'm a bit paranoid
    and more than a bit protective
    of my 'Creations'.

Those indents are likely hard to understand.
    Sorry.
    Sometimes they indicate sub-thoughts.
    Sometimes it's simply sentence continuation.

Also of note, having poor eyesight:
    I work 12 inches from the monitor
    and use a 24 point font.
    So, my working line-width is 55 characters.

Anyway, enough useless preambles.

This script's (and it is a script) sole purpose
    is to generate graphs for my website.
As such, it is a throw-away script,
    never to be used again.
But since I don't like throwing things away,
    I'll add tons of extraneous commentary,
    so anyone who wants to
    can play along at home.

####################################################
53 so, I was wrong
####################################################

Calorie Counter
Throwaway Script

Converts data from a raw text file
    into a few key stats,
    and a series of graphs.

####################################################
'''

#NOTE: a '#' Denotes a comment
#As does a triple quotation ('''), per above
#Comments are ignored by computers
#But not by our Glorious Robotic Overlords

#NOTE:
#    NEVER INSULT OUR GLORIOUS ROBOTIC OVERLORDS!

#Any-the-way, having been programming for a while
#    I have certain 'solutions' already in place.
#Like, the use of directories for input & output
#    which the script does not create
#    as it assumes these directories already exist.
dir_in = './/input//'
dir_out = './/output'

#But for the dir_in
#    I'll only ever need the one file of raw data
#    as has been posted to the website
#    where you found this script.
file_in = './/input//calorie-counter-second.txt'

#Most of the '#' comments
#    refer to the code that follows them directly
#    So, in the next,
#    I'm talking about the 'with' statement.
#The first line creates a file handle called f.
#The second line reads the file
#    using the handle and
#    storing the contents in the variable raw_text.
with open(file_in, 'r') as f:
    raw_text = f.read()

#Um, did the above work?
#Now might be a good time to check...
print raw_text

#Typically,
#Having checked,
#    I find the print is no longer required
#    and so, I comment the prints out.
#    But I am not going to bother, here.

#The following converts the raw text into a list
#    based on passed delimiter: namely '\n\n'.
#    I always get '/' and '\' confused.
#    '\n\n' is a double return (double new line).
data = raw_text.split('\n\n')

#More checks
print data
print len(data) #33, two too many, I'd say

#In an IDE, the code is easier to understand.
#For instance, comments are RED.
#    And therefore, easier to ignore.
#    Odd how red is easier to ignore.

#This is a list comprehension
data = [item for item in data if not
        (item == '' or
         'Calorie' in item or
         'Brett' in item)]

'''
Note the triple quotes (above).
This is a multi-line comment.
    But if you want to get really technical,
    I am sure it is a string
    that has not been assigned to a variable;
    and so, is not accessible to the program.
    OK.
    Fair enough.
    This text is just tossed away by the parser.

Anyway...

Below is the same list comprehension as above
    with each command on a separate line

The computer will compute each list comprehension.
    But as both are saved into the same variable:
        'data'
    the second result will replace the first.
    i.e. It doesn't matter
        that THIS command set is written twice.
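
A small aside, and only a sketch:
    sample_text and keep are names made up here,
    and the sample is invented, not the real file.
It shows the split on the double newline,
    and that running the same filter twice
    changes nothing.

```python
# Hypothetical sample shaped like the raw file:
# a title block, then one block per day,
# with blank lines between the blocks.
sample_text = ('Calorie Counter\nBrett\n\n'
               '\n31\n9-380/11\n\n'
               '\n30\n8-500/20\n\n')

def keep(item):
    # Same test as the list comprehension in the script.
    return not (item == '' or
                'Calorie' in item or
                'Brett' in item)

blocks = sample_text.split('\n\n')

# Filtering once, then filtering the result again.
once = [item for item in blocks if keep(item)]
twice = [item for item in once if keep(item)]
```

The second pass has nothing left to remove,
    so once and twice hold the same items.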
'''
data = [item for item    #for every item in data
        in data          #in the list named data
        if               #keep elements if
        not              #the following is not true
        (
            item == ''            #removes empty items
            or 'Calorie' in item  #removes matching items
            or 'Brett' in item    #removes matching items
        )    #closes the logical condition
        ]    #ends the list comprehension

#Yeah, it's confusing... at first.

#And I'll print it just to make sure.
print data #['\n31\n9-380/11\n10-12... gobbledegook
print len(data) #31, the desired result

'''
Now, I simply have the raw data, split by date.

The above print starts like this:

['\n31\n9-380/11\n10-1260/53\n13-750/38\n15-

It looks confusing... and it is.

Still, keep in mind '\n' simply means 'new line'.
And '[' (in this case) means the start of a list.
'''

#This will output human readable data
#    which matches the formatting of the text file
for d in data:
    print d

'''
One item of Print Output from above,
    which matches the input from file
    is as follows:

31
9-380/11
10-1260/53
13-750/38
15-100/4
17-550/24
19-1000/40
21-300/0

The first number on its own line (31)
    corresponds to the day of the month.

For each line that follows:
    TT-CCC/PP
        TT: time
        CCC: calories
        PP: protein

Some folks like Regular Expressions.
    But they are COMPLICATED.
It's way easier to 'mung' the data
    just sort of waddle one's way through it.

But before I do that...
Well, typically, I would make a class.
    But doing so would remove
    any linearity of thought from the code.
So far, this has been straight-line programming.
    But classes require going back and forth.

So, instead, I'll use a Named Tuple:
    A Class substitute
    that is simpler to understand
    because it isn't as powerful as a Class.
'''

#The following imports 'namedtuple'
#    which makes 'namedtuple' a useable command
#    or more accurately,
#    initializes 'namedtuple'
#        as a variable
#        which holds a function
#Gads, but this stuff gets complicated fast
from collections import namedtuple

'''
That still is, probably, unclear,
    so let me try to do better.
As a language, Python only has 30 odd commands.
But as a language,
    it comes with 300+ built in libraries
    which include some 10,000 commands.

from collections import namedtuple
    makes one (just one)
    of those commands usable

These built in commands
    or functions, if you prefer
    are where Python's power comes from.

Python: Batteries Included
    300+ built in libraries
    10,000+ built in (i.e. pre-written) commands
        so call them shortcuts
    thousands of third party libraries
        addressing almost every problem set
        known to man

Python is ridiculously easy to understand!
But don't kid yourself
    into thinking you know Python!
It would be like saying you know History.
    Sure, you know part...
    But only ever part.
'''

#Creation of a simple 'day' class
Day = namedtuple('Day', 'date records')

#Let's make sure it is working
test_day = Day(1, [])
print Day      #<class '__main__.Day'>
print test_day #Day(date=1, records=[])

#Creating another namedtuple
Record = namedtuple('Record',
                    'time calories protien')
test_record = Record(1, 222, 2)
print Record   #<class '__main__.Record'>
print test_record
#Record(time=1, calories=222, protien=2)

'''
Day: will store the records for each day
Record: will hold each data entry

So, the data structure will look something like
    a list of Days,
    containing a list of Records

[Day(Record, Record),
 Day(Record, Record, Record),
 Day(Record, Record)...]

And to access data,
    I will loop through the constructs.

But first, I need to get the data
    into the constructs.
And to do this, I will mung the data...
    or do something called data munging...
    so, I'll just sort of mung it, shall I?
'''

print data #got to remember where we are

#Create an empty list
#It's the output of the upcoming loop
#As in, it's the only thing
#    from this side of the loop
#    that matters on the other side of the loop.
days = []
for d in data: #loop through the lot

    r = d.split('\n') #note the single newline
    print r #make sure it looks good
    #['', '29', '8-380/11', '10-300/25'...
    #being about what I expected

    #The first item in the list is ''
    #Which is garbage, so remove
    r = r[1:]

    #The second item, which is now the first
    #    is the date, so save to a variable
    #    and delete from the list
    date = r[0]
    r = r[1:]

    #Check! Check! And Double Check!
    print r

    #It's time to create the Day object.
    #If this were real code, I'd do it a step back
    #But there is nothing to be gained
    #    in a throwaway script by doing so
    this_day = Day(int(date), [])
    print this_day #Day(date=1, records=[])

    #The variable names in the loop below,
    #    SUCK!
    #In a real program, I'd use longer ones
    #But whatever.

    #Looping over r, calling each piece of it x
    for x in r:

        #First, I want to make sure it looks right
        print x #20-150/8
        x = x.split('-')
        print x #['20', '150/8']

        #This is a sort of head tail split.
        #    Common in Lisp & Haskell.
        time, x = int(x[0]), x[1]

        #Extraction of the remaining two values
        #Note, in all cases int(str)
        #    converts from string to integer
        x = x.split('/')
        calories, protien = int(x[0]), int(x[1])

        #Create the record from extracted values
        t_record = Record(time, calories, protien)

        #Append the temporary record
        #    to the current Day object.
        this_day.records.append(t_record)

    #Being back an indent level
    #    indicates an end to the inner loop
    #And the only thing left to do
    #    in the outer loop
    #    is adding the current day
    #    onto the list of days
    days.append(this_day)

#And Voila!
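
'''
For comparison, and only as a sketch:
    the same munging folded into one function.
parse_day and sample_day are names made up
    for this aside; the script itself
    keeps the straight-line loop above.
Empty strings are filtered out here
    instead of being sliced off the front,
    which is an assumption
    about what the junk looks like.

```python
from collections import namedtuple

# Same shapes as the namedtuples in the script
# (the 'protien' spelling is kept to match it).
Day = namedtuple('Day', 'date records')
Record = namedtuple('Record', 'time calories protien')

def parse_day(block):
    # Hypothetical helper: one block of raw text in,
    # one Day out. Expects lines shaped like TT-CCC/PP.
    lines = [ln for ln in block.split('\n') if ln]
    day = Day(int(lines[0]), [])
    for line in lines[1:]:
        time, rest = line.split('-')
        calories, protien = rest.split('/')
        day.records.append(
            Record(int(time), int(calories), int(protien)))
    return day

# A shortened, made-up block in the same format.
sample_day = parse_day('\n31\n9-380/11\n10-1260/53')
```

With real code, a function like that
    is probably where I would end up.
'''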
#The raw data is now a list of Day objects
#    which in turn contain record objects

#Check the output
#I am not showing the errors
#It is important to check, so there are no errors
#Or at least, there are no errors carried forward
print days #[Day(date=31, records=[Record(time=9
print len(days) #31, the expected length


###################################################
#
#   OUTPUT:
#       GRAPHS & SUCH
#
#   from here on out,
#       only the list of days matters
#
###################################################


###################################################
#
#   DAILY AVERAGE
#
#   a simple stat
#       average calories per day
#
###################################################

'''
In a list comprehension,
    nested loops are written in the same order
    as if the comprehension
    were simply nested loops.

So, the comprehension that follows is similar to:

calorie_data = []
for d in days:
    for r in d.records:
        calorie_data.append(r.calories)

But I find a list comprehension simpler to read.
'''

##################
#CALORIES BY MONTH
##################
all_calorie_data = [r.calories
                    for d in days
                    for r in d.records]

#Always with the checking...
print all_calorie_data #[380, 1260, 750, 100...

#I hope the names are self-descriptive.
cal_month_total = sum(all_calorie_data)
cal_daily_ave = cal_month_total / 31.0
cal_daily_min = min(all_calorie_data)
cal_daily_max = max(all_calorie_data)

print 'cal_month_total ', cal_month_total
print 'cal_daily_ave ', cal_daily_ave
print 'cal_daily_min ', cal_daily_min
print 'cal_daily_max ', cal_daily_max

'''
Printed output:

cal_month_total 102226
cal_daily_ave 3297.61290323
cal_daily_min 15
cal_daily_max 2000

This looks about right...
Oh, wait!
It's way off!
See!
Constant checking!

These two are off:
    cal_daily_min
    cal_daily_max

I never just ate 15 calories in a single day!
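
What went wrong, as a sketch with made-up numbers
    (toy_days is invented for illustration):
    min() and max() over the flat list
    look at single meals, not whole days.

```python
# Two toy days, each a list of per-meal calories.
toy_days = [[380, 15, 2000], [900, 700]]

# Flattened: every meal from every day in one list.
flat = [cal for day in toy_days for cal in day]

# Per-day totals: one number per day.
per_day = [sum(day) for day in toy_days]

smallest_meal, largest_meal = min(flat), max(flat)
smallest_day, largest_day = min(per_day), max(per_day)
```

In the toy data, the smallest meal is 15 calories,
    while the smallest day is 1600.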
'''

################
#CALORIES BY DAY
################

'''
This is actually, where a Class
    becomes easier to work with;
    but becomes harder to explain,
    because the code as read, becomes non-linear
'''
cal_daily_list = []
for d in days:
    cal = 0
    for r in d.records:
        cal += r.calories
    cal_daily_list.append(cal)

cal_daily_min = min(cal_daily_list)
cal_daily_max = max(cal_daily_list)

print 'cal_daily_list ', cal_daily_list
print 'cal_daily_min ', cal_daily_min
print 'cal_daily_max ', cal_daily_max

'''
Printed output:

cal_daily_list [4340, 3810, 3110...
cal_daily_min 1600
cal_daily_max 4360

This looks more reasonable.
But in truth, for a script like this
    that's the limit of my error checking
'''

day_one_error = 915 + 1000 + 530 + 500 + 750 + 150
print 'Day 1: CAL: ', day_one_error

'''
Print Output:

Day 1: CAL: 3845

And this does not match the first item on my list.
    However, it does match the last item.

Typically, I would reverse the list...
    at the top of the script.
    But I'll do that here.
    After all, what could possibly go wrong?
'''

#Reversed lists:
#    The last shall be first.
#    And the first shall be last.
days = list(reversed(days))
cal_daily_list = list(reversed(cal_daily_list))

#Now, it is time for some graphs


########################
# GRAPH: CALORIES BY DAY
########################

import matplotlib.pyplot as plt
'''
matplotlib.pyplot is ALWAYS imported as plt.
    I mean, you don't have to.
    But everybody else IS going to.
    So, why are you wasting effort
    fighting the flow.
'''

dates = range(1, 32, 1)

#And this is where I discovered
#    I had to convert the reversed iterators (above)
#    back into lists
print dates #[1, 2, 3, 4, 5, 6,
print cal_daily_list #[3845, 3230, 4360...

#I like consistency
#    so I set the image size to a constant.
plt.figure(figsize=(10,5))

#Two lines, two plots
plt.plot(dates, cal_daily_list,
         color='black', label='daily')
plt.plot(dates, [cal_daily_ave] * 31,
         color='red', label='average')

#Fine tuning
plt.legend(loc='lower right')
plt.title('Calories per Day')
plt.ylabel('Calories')
plt.xlabel('December, 2017')
plt.ylim(0, 5000)
plt.xlim(1, 31)

#I like printing the save name to screen,
#    as it lets me know where the program is,
#    as it is running.
#    Some programs hang a long time
#        during image creation
#    So, it's nice to know
#        that is what they are doing.
save_name = './output/calories_per_day.png'
print save_name
plt.savefig(save_name)

#show() is good for debugging
#    annoying for production
#plt.show()

#Close 'er down
plt.close()

'''
Did you notice the dip in the middle of the graph?
I noticed the dip in the middle.
'''


################################
# GRAPH: CALORIES BY TIME OF DAY
################################

#A zero initialized list with 24 items.
#    i.e. a list of zeroes, 24 items long.
cal_by_hours = [0] * 24
print cal_by_hours #[0, 0, 0, 0, 0...
print len(cal_by_hours) #24

#But using a Numpy array
#    will likely save me time.
#Because that's how the cool kids do it
import numpy as np
cal_by_hours = np.zeros(24, dtype=np.uint32)
print cal_by_hours #[0 0 0 0 0 0 0 0 0 0...

#Now, add the relevant data to the array
for d in days:
    for r in d.records:
        cal_by_hours[r.time] += r.calories

print cal_by_hours
#[    0     0     0     0     0     0     0
#  1280  4855  4702 10815  5205  7745 10449
#  6360  4145  6400  5510  8050  7170  8665
#  5325  5550     0]

print cal_by_hours.sum() #102226
#Note, this is the same value as cal_month_total
#    So, things are looking... consistent

#A hallmark of functional programming
#    is changing the variable name
#    whenever a variable's value is changed
#
#    Or in other words:
#        If x points to y,
#        x ALWAYS points to y!
#    So, for ALL PRACTICAL PURPOSES
#        x IS y!
#
#Um, this script doesn't come close
#    to being an example of Functional Programming
#
#Anyhow, this sort of renaming happens a lot
#    in Functional Programming
cal_by_hours_ave = cal_by_hours / 31.0
print cal_by_hours_ave
print sum(cal_by_hours_ave * 31)

'''
Print Output:

[   0.    0.    0.  ...   0.
   41.29032258  156.61290323...
  167.90322581  249.83870968  337.06451613...
102226.0

The sum is consistent, so that is good.
'''

#This is an edited copy / paste of previous code
plt.figure(figsize=(10,5))
plt.plot(range(0, 24, 1),
         cal_by_hours_ave,
         color='blue')
plt.title('Average Calories per Hour')
plt.ylabel('Calories')
plt.xlabel('24 Hour Clock')
plt.ylim(0, 400)
plt.xlim(0, 23)
save_name = './output/calories_per_hour.png'
print save_name
plt.savefig(save_name)
#plt.show()
plt.close()

#But I don't like the jagged lines


################################
# GRAPH: CALORIES BY TIME OF DAY
#    SMOOTH - SMOOTH - SMOOTH
################################

#Interpolating data to increase the data points
#    making the graph smoother.
#    Or in other words, MAGIC!
#Note: y holds the hours (the horizontal axis)
#    and x holds the smoothed calories.
from scipy.interpolate import spline
y = np.linspace(start=0, stop=23, num=1000)
x = spline(range(0, 24, 1), cal_by_hours_ave, y)

#Another copy paste
#Some folks might say to make a function,
#    but it simply is not worth the trouble,
#    as there are too many specialized arguments.
plt.figure(figsize=(10,5))
plt.plot(y, x, color='blue')
plt.title('Average Calories per Hour')
plt.ylabel('Calories')
plt.xlabel('24 Hour Clock')
plt.ylim(0, 400)
plt.xlim(0, 23)
save_name = './output/calories_per_hour_smooth.png'
print save_name
plt.savefig(save_name)
#plt.show()
plt.close()

#And as I know my data wasn't that accurate
#    I'm going to take that into account, as well,
#    in the next graph.
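
'''
What 'increasing the data points' means,
    as a sketch in plain Python.
This is straight-line interpolation,
    NOT the spline used above,
    so the result has corners where a spline curves.
upsample and dense are names made up for this aside.
Also, as far as I know, spline() was deprecated
    and later dropped from scipy.interpolate,
    so the import above may fail on a newer install.

```python
def upsample(values, num):
    # Hypothetical helper: linear interpolation
    # across evenly spaced samples.
    # Assumes num >= 2 and at least two values.
    last = len(values) - 1
    out = []
    for i in range(num):
        # Position of the new point on the old axis.
        pos = last * i / float(num - 1)
        # Index of the old point to the left.
        lo = min(int(pos), last - 1)
        # How far along toward the old point on the right.
        frac = pos - lo
        out.append(values[lo] * (1 - frac) +
                   values[lo + 1] * frac)
    return out

# Three made-up points stretched into five.
dense = upsample([0.0, 10.0, 20.0], 5)
```

The original three values are still in there;
    the new ones are filled in between them.
'''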
################################
# GRAPH: CALORIES BY TIME OF DAY
#    CONVOLUTION - SMOOTH
#
#    Basically, I'm going to blur
#    the data
################################

#For complex calls
#    I like to pass each argument
#    on a separate line
time_ave = np.convolve(
    cal_by_hours_ave, #data passed in
    [.33333] * 3,     #the filter to use
    mode='full'       #will require clean up
    )

#A print here,
#    as I really need to see the output.
#Also, I had a bug, later
#    so I needed to find this print output
#    thus, giving it an easy to find label.
print 'Time Average' #Label for debugging
print time_ave
print time_ave.shape #(26L,), an array 26 long

'''
Print Output:

[   0.    0.    0.  ...   0.
   13.76330323   65.96708226...
  222.81497613  255.53507903  251.59963452...
  ...59.67682258    0.    ]
(26L,)

Both the first and last data points are zero,
    so I can cut them off
    without losing any meaningful data.
Yippie!
Easy Peasy!
'''

#Removes the first and last array entries.
time_ave = time_ave[1:-1]
print time_ave.shape #(24L,), as expected
print time_ave.sum() #3297.5799271
print cal_by_hours_ave.sum() #3297.61290323

'''
So, there were two bits of error correction, here.

First, in the above, I had originally used
    [.33, .33, .33]
    in the convolve, but changed it to
    [.33333, .33333, .33333]
    because I wasn't getting the accuracy I wanted.

    time_ave.sum() at .33 = 3264.63677419
    time_ave.sum() at .33333 = 3297.5799271

    My target number being:
    cal_by_hours_ave.sum() = 3297.61290323
        so, that is spot on

Secondly, before all of this matched up,
    I had divided by an integer way back here...

    Originally,
        cal_by_hours_ave = cal_by_hours / 31
    Changed to,
        cal_by_hours_ave = cal_by_hours / 31.0

Now, I had intended to divide by an integer,
    knowing full well this would chomp the numbers.
    Um, chomp... truncate... lose the decimals.
What I didn't figure was the loss of precision,
    and the fact that I would care
    about that precision a little bit later
    while error checking.
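
The same precision story, as a sketch
    in plain Python with made-up numbers.
blur3 and hours are names invented for this aside;
    blur3 only imitates np.convolve
    for a window three wide,
    with the first and last entries trimmed off.

```python
def blur3(values, weight):
    # Hypothetical stand-in for a 3-wide 'full'
    # convolution followed by trimming one entry
    # off each end, so the length is unchanged.
    padded = [0.0] + list(values) + [0.0]
    return [weight * (padded[i - 1] +
                      padded[i] +
                      padded[i + 1])
            for i in range(1, len(padded) - 1)]

# Made-up hourly data with zeroes at both ends.
hours = [0.0, 300.0, 0.0, 0.0, 600.0, 0.0]

rough = blur3(hours, 0.33)     # weights sum to 0.99
exact = blur3(hours, 1.0 / 3)  # weights sum to 1

# How much of the total each version loses.
lost_rough = sum(hours) - sum(rough)
lost_exact = sum(hours) - sum(exact)
```

With .33, one percent of the total goes missing.
With 1.0 / 3, the total survives
    give or take floating point noise.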
'''

print 'It Is All Good' #Another debug label
print time_ave.sum() #3297.5799271
print 31 * 3297.5799271 #102224.97774
print 102226 - 102224.97774 #1.02226, close enough

'''
And in this error checking
    if there is anything annoying
    it is the endless cascade.
From the point of the error to here,
    I had to recopy every number,
    you know, for the write up.

Anyhow, we now have time rounded numbers...

The value of any caloric intake
    has been spread equally over
        the recorded hour
        the previous hour
        the next hour

So,
    0 3 0
Becomes,
    1 1 1

The intent is to take into account
    experimenter error.
I know the guy who recorded his caloric intake,
    he's just not that precise.
'''

#Making a graph!
#This is all repeated code.
y = np.linspace(start=0, stop=23, num=1000)
x = spline(range(0, 24, 1), time_ave, y)
plt.figure(figsize=(10,5))
plt.plot(y, x, color='blue')
plt.title('Time Averaged\n Calories per Hour')
plt.ylabel('Calories')
plt.xlabel('24 Hour Clock')
plt.ylim(0, 400)
plt.xlim(0, 23)
save_name = './output/calories_per_hour_blur.png'
print save_name
plt.savefig(save_name)
#plt.show()
plt.close()

#I could blur that graph further
#    but I am not going to
#Two humps,
#    showing when/how I eat,
#    that be the relevant facts.


###################
# AVERAGE MEAL SIZE
###################

#I think this previously created variable
#    is what I want
print all_calorie_data #[380, 1260, 750, 100,
print len(all_calorie_data) #209
print sum(all_calorie_data) #102226

#Yep, that is it!
#Both values are integers,
#    so the division drops the decimals.
average_meal_size = (sum(all_calorie_data) /
                     len(all_calorie_data))
print average_meal_size #489, which includes snacks


###################################################
#
#   GRAPH
#   BIN COUNT
#   AVERAGE MEAL SIZE
#
###################################################

#This should, actually, be ridiculously easy.
#Instead of plot
#    I use hist
#    and pyplot does all the heavy lifting for me.
plt.figure(figsize=(10,5))
plt.hist(
    x=all_calorie_data,
    bins=8,
    color='blue')
plt.ylim(0, 80)
plt.xlim(0, 2000)
plt.title('Average Meal Size')
plt.ylabel('Number in December')
plt.xlabel('Calories')
save_name = './output/calories_meal_size_bar.png'
print save_name
plt.savefig(save_name)
#plt.show()
plt.close()

'''
As graphed
    first bar indicates roughly 0-250 calories
    second bar indicates roughly 250-500 calories
    and so on.
'''

'''
Gads!
I'm trying to make a better graph,
    same basic concept as this last,
    only smoother.
And I've sunk... two hours into it,
    getting nowhere.
So, here I go again.

I mean, no one else will see that work.
    I've already deleted it.
    Endless circles going nowhere.
So, it's there.
    And it happens to me all the time,
    as I push the edges of what I can do.

Anyhow, enough of trying to use the built ins.
    It's time to roll my own.
    And live with whatever results.

Also, I should mention, I tried using:
    Seaborn
        but I never could load a graph
    Interpolate
        but that wasn't working either

So, I basically blurred the results
    smoothing the data points
    using what I assume is a Gaussian Blur

The main point being,
    the preceding has been
    a fairly accurate representation of
    stream-of-consciousness coding:
        code -> comment
        code -> comment
    while what follows is more like:
        code -> code -> code
        comment
'''

###################################################
#
#   AVERAGE MEAL SIZE
#   SMOOTH LINE GRAPH
#
###################################################

#Creates an array of zeroes, 2000 entries long
#    indexed from 0 to 1999.
bags = np.zeros(
    2000,
    dtype=np.float64)
print bags #[ 0. 0. 0. ..., 0. 0. 0.]
#This looks good

#Inserting calorie data into the array
#    using a-1, so the 2000 calorie meal fits
#    (index a-1 holds the meals of a calories)
#    1.0 / 209.0 normalizes the results
#        into a fraction of all 209 meals
for a in all_calorie_data:
    bags[a-1] += 1.0 / 209.0
print len(bags), bags #2000 [ 0. 0....
#This, too, is as expected
#But a better test would be
print bags.sum() #1.0, which is perfect

#Because of the loop to follow
#    a function was imperative
#    for my mental health.
def smooth_normalization(bag, f=5, n=50):
    '''Smooths the input data

    bag: 1 dimension array to be smoothed
    f = half of the window size for convolution
    n = number of convolutions

    if f is the grit of the sandpaper
    n is the number of times to sand
    '''

    #ff is the true window size
    #    the window size must resolve
    #    to an odd integer
    #        well, must is a tricky word
    #        even numbers may introduce subtle bugs
    #        down the line
    ff = 1 + (f * 2)

    #Conversion of window size into an array
    #    [1.0 / 3] = [0.33]
    #    [0.33] * 3 = [0.33, 0.33, 0.33]
    v = [1.0 / ff] * ff

    #The main loop of this function
    #The '_' means I don't care about this value.
    for _ in range(n):

        #The smoothing function
        bag = np.convolve(bag, v, mode='full')

        #Since 0 == -0 and 0 is a valid input
        #    this operation is eliminated for zero
        #NOTE: negative values would throw an error
        #    well, hopefully, they would.
        #    But, hey!
        #    Script!
        #    Darn the torpedoes!
        if f:
            bag = bag[f:-f]

    #The smoothed array is returned
    return bag
#Dedenting a level marks the function's end.

#After I coded this
#    I knew I wanted to see a variety of results
#    So, these values might work well?
try_these = [
    (0,1), (0,5), (0,50), (0,100),
    (1,1), (1,5), (1,50), (1,100),
    (5,1), (5,5), (5,50), (5,100),
    (10,1), (10,5), (10,50), (10,100),
    (25,1), (25,5), (25,50), (25,100)
    ]

#For loop fires for each tuple pair
for f,n in try_these:

    #The function is called
    #    Having f=f is not required,
    #    but it is conceptually easier.
    bag = smooth_normalization(bags, f=f, n=n)

    #This many prints in a loop
    #    become unusable after a while
    #    filling the screen with garbage
    #print len(bag), bag
    #print bag[:f]
    #print bag[-f:]
    #print bag.sum()

    #Mostly a repeat of the previous pyplot calls.
    plt.figure(figsize=(10,5))
    plt.plot(bag, color='red')
    #plt.ylim(0, 1.0)
    plt.xlim(0, 2000)
    plt.title('Relative Distribution by Meal Size')
    plt.ylabel('Relative Frequency')
    plt.xlabel('Calories Consumed')

    #Changed save_name to sn,
    #    so string line would fit on screen.
    #I used to always call save_name 'sN',
    #    but I tend to be wordier these days.
    sn = './output/calories_perc_%02d_%03d.png' % (
        f, n)
    print sn
    plt.savefig(sn)
    #plt.show()
    plt.close()

#And that's about it!

print 'FINAL OUTPUT FOR WRITEUP' #Self Documenting
print cal_month_total #102226
print cal_daily_ave #3297.61290323
print min(cal_daily_list) #1600
print max(cal_daily_list) #4360
print len(all_calorie_data) #209
print average_meal_size #489

'''
In the middle of a project,
    I often get carried away
    or forget to do certain things.
In this one, I completely forgot about protein.
    I mean, I recorded my presumed protein
    for a month,
    so I am going to do something with that.

But this script is getting a bit long in the tooth.
    So, I'll start another file.

And believe it or not,
    to make things interesting,
    I think I will make a class for that one.

So, like I said:
    if the write up for this was
        code -> comment
        code -> comment
        code -> comment
        code -> comment

    the write up for the next
    is most decidedly going to be:
        code -> code -> code
        code -> code -> code
        code -> code -> code
        comment -> comment -> comment

Until then:

2018-08-12
(c) Brett Paufler
'''