Week 7
Monday, July 8, 2024
Due to an emergency family matter, I was not able to accomplish much today.
Progress:
- Wrote code to read in all partial log files to a single data frame [!NOTE] Since there are ~96 million logs (~19GB), this strategy is infeasible for low/mid memory systems.
Tuesday, July 9, 2024
I have a lot of ideas for what I can do, but I also keep thinking up obstacles. One major limitation is the lack of memory on my desktop (~16 GiB). I could use my beefier laptop to actually run the code, but it makes testing tedious to rely on that. On the other hand, it is annoying to try to optimize for memory usage while still making sure I can capture all of the necessary data in my analysis. It would be ideal to just load all of the data into a single data frame, manipulate it, and call it a day.
Progress:
- Reviewed previous paper to see what already has been done in terms of statistical analysis and visualization
- 10 minutes to pandas tutorial completion
- Read
df.read_csv()
docs to figure out way to do “sliding window” strategy for reading across multiple logs (chunksize=x
+TextReader
) [!IMPORTANT] Don’t worry about this until you can meaningfully process a single file - Select data for a single user and load into data frame
Wednesday, July 10, 2024
I got quite sick today; my luck is pretty atrocious this week. I did still manage to get some meaningful preparatory work done.
Progress:
- Learn about python typing annotations
- Learn about python data classes
- Learn idiomatic pandas usage
Thursday, July 11, 2024
Big implementation day today. Got caught up on some silly pandas usage, but getting the hang of it – very handy, overall. This big data stuff is fun; it forces you to think about logical problems in a resource-conscious way.
Progress:
- Record aggregate statistics potentially useful in supplementary failure analysis
-
Create algorithm to detect and analyze user-specific failure sequences [!NOTE] Analysis not yet implemented, only detection.
- Write pseudocode to capture high-level logic
- Implement using pandas
groupby
API (split raw df into user groups, apply finding function, analyze when found)
Week 7 Summary
Mostly just tooling and config this week. I suspect that I may have a bit more of this to do next week since I am relatively new to python development in my new text editor. Overall, good progress – nothing crazy productive, but also not too slow (barring extraneous circumstances).