darpan-e6/log-processing-utils

LEO Logs Processing Pipeline

Scripts (run in order)

1. extract_log_from_parquet.py

Reads parquet files from each subdirectory in logs/ and merges them into a single restored.log per directory. Extracts the value column from each .snappy.parquet file in sorted order.
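A minimal sketch of this step for a single directory, assuming pandas (with pyarrow) does the reading and that each entry of the value column is one log line. The function names `read_value_column` and `restore_log` are illustrative, not taken from the script.

```python
from pathlib import Path
from typing import Callable, Iterable


def read_value_column(parquet_path: Path) -> list[str]:
    """Return the `value` column of one parquet file as a list of strings."""
    # Imported lazily so the merge logic below can be exercised without pandas.
    import pandas as pd

    frame = pd.read_parquet(parquet_path, columns=["value"])
    return frame["value"].astype(str).tolist()


def restore_log(
    directory: Path,
    read_values: Callable[[Path], Iterable[str]] = read_value_column,
) -> Path:
    """Merge every *.snappy.parquet file in `directory` into restored.log."""
    out_path = directory / "restored.log"
    with out_path.open("w", encoding="utf-8") as out:
        # Sorting by file name keeps the parts in a deterministic order.
        for parquet_path in sorted(directory.glob("*.snappy.parquet")):
            for value in read_values(parquet_path):
                # Assumption: one value per log line; normalise the line ending.
                out.write(value.rstrip("\n") + "\n")
    return out_path
```

Calling `restore_log(Path("logs/some_subdir"))` would write `logs/some_subdir/restored.log`; the subdirectory name here is a placeholder.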

2. separate_logs_by_query_id.py

Parses each restored.log, extracts QueryId from every line, and writes each query's logs to a separate file in separated_logs/. Continuation lines (without QueryId) are assigned to the last encountered QueryId. Verifies losslessness at the end.
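The continuation-line rule can be sketched as follows. The log format is not shown here, so the `QueryId=<id>` pattern, the `_no_query_id` bucket for lines seen before any QueryId, and the function name are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical pattern: adjust to the real log format.
QUERY_ID_RE = re.compile(r"QueryId[=:]\s*([\w-]+)")


def split_by_query_id(lines):
    """Group log lines by QueryId, attaching continuation lines to the last id seen."""
    groups = defaultdict(list)
    current = None
    total = 0
    for line in lines:
        total += 1
        match = QUERY_ID_RE.search(line)
        if match:
            current = match.group(1)
        # Lines before the first QueryId go to a placeholder bucket (assumption).
        key = current if current is not None else "_no_query_id"
        groups[key].append(line)

    # Losslessness check: every input line must land in exactly one group.
    written = sum(len(group) for group in groups.values())
    if written != total:
        raise ValueError(f"lost lines: read {total}, grouped {written}")
    return dict(groups)
```

Each group could then be written to its own file under separated_logs/, one file per key.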

3. group_logs_by_prefix.py

Moves separated log files into subdirectories by QueryId prefix (part before the first -, e.g. Q381368). Verifies the exact same set of files exists before and after moving.
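A sketch of the move-and-verify idea, assuming each separated file is named after its QueryId and carries an extension (for example `Q381368-1.log`, a hypothetical name). The function name `group_by_prefix` is illustrative.

```python
import shutil
from pathlib import Path


def group_by_prefix(separated_dir: Path) -> dict:
    """Move each file into a subdirectory named after its QueryId prefix."""
    files = [p for p in separated_dir.iterdir() if p.is_file()]
    before = sorted(p.name for p in files)

    moved = {}
    for path in files:
        # Prefix is the part of the file stem before the first "-".
        prefix = path.stem.split("-", 1)[0]
        target_dir = separated_dir / prefix
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
        moved[path.name] = prefix

    # Verify that exactly the same set of file names exists after the move.
    after = sorted(p.name for p in separated_dir.rglob("*") if p.is_file())
    if before != after:
        raise RuntimeError("file set changed while grouping by prefix")
    return moved
```

The verification compares file names only; comparing sizes or checksums as well would be a stricter variant.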

4. fix_csv_quotes.py

Fixes CSV quoting in logs/*.csv: removes the wrapping " around the query field and restores each doubled "" to a single ". Uses Python's csv module, which handles the unescaping automatically on read.
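The read-side behaviour can be seen in a small example. The column names and the sample row are made up for illustration; only the standard csv module is used.

```python
import csv
import io


def unescape_rows(csv_text: str) -> list:
    """Parse CSV text; csv.reader drops wrapping quotes and turns "" into "."""
    return list(csv.reader(io.StringIO(csv_text)))


# Hypothetical sample: a quoted query field containing doubled quotes and a comma.
raw = 'query_id,query\nQ1-a,"SELECT ""name"" FROM t WHERE x IN (1,2)"\n'
rows = unescape_rows(raw)
print(rows[1][1])  # SELECT "name" FROM t WHERE x IN (1,2)
```

Because the field is quoted, the comma inside the IN list stays part of the query instead of splitting the row.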

5. extract_query_templates.py

Extracts query templates from the workload CSV for Parametric Query Optimisation (PQO), using AST-based analysis with sqlglot. Replaces literals with ?, collapses IN lists, strips ORDER BY/LIMIT, and groups queries by structural template.

Utility Scripts (run anytime)

count_unique_query_ids.py

Counts unique QueryId values across all restored.log files and prints frequency stats.
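A sketch of the counting step using collections.Counter. As the log format is not shown here, the `QueryId=<id>` pattern and the function name are assumptions.

```python
import re
from collections import Counter

# Hypothetical pattern: adjust to the real log format.
QUERY_ID_RE = re.compile(r"QueryId[=:]\s*([\w-]+)")


def count_query_ids(lines) -> Counter:
    """Count how many log lines mention each QueryId."""
    counts = Counter()
    for line in lines:
        match = QUERY_ID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "t1 QueryId=Q381368-1 start",
    "t2 QueryId=Q381369-7 start",
    "t3 QueryId=Q381368-1 end",
]
counts = count_query_ids(sample)
print(len(counts), counts.most_common(1))
```

`len(counts)` gives the number of unique ids, and `most_common` gives the frequency ranking.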

structure_of_parquet.py

Inspects the schema/structure of parquet files.

Dependencies

pip install pandas pyarrow sqlglot

About

A collection of scripts for processing logs and SQL queries for Parametric Query Optimisation (PQO).
