Reads parquet files from each subdirectory in logs/ and merges them into a single restored.log per directory.
Extracts the value column from each .snappy.parquet file in sorted order.
Parses each restored.log, extracts QueryId from every line, and writes each query's logs to a separate file in separated_logs/.
Continuation lines (those without a QueryId) are assigned to the most recently seen QueryId. At the end, verifies that the split is lossless.
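
The splitting rule above can be sketched as follows. This is a minimal illustration, not the script itself: the line format `QueryId=<token>`, the regex, the `_no_query_id` bucket for lines that appear before any id, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

# Assumed format: the id appears in a line as "QueryId=<token>".
# Adjust the pattern to match the real restored.log lines.
QUERY_ID_RE = re.compile(r"QueryId=(\S+)")


def split_by_query_id(lines):
    """Group log lines by QueryId, keeping continuation lines with the last id seen."""
    groups = defaultdict(list)
    current = None
    for line in lines:
        match = QUERY_ID_RE.search(line)
        if match:
            current = match.group(1)
        # Hypothetical bucket for lines that precede the first QueryId.
        key = current if current is not None else "_no_query_id"
        groups[key].append(line)
    return dict(groups)


def is_lossless(lines, groups):
    """True if the groups contain exactly the input lines, each the same number of times."""
    regrouped = Counter(line for group in groups.values() for line in group)
    return regrouped == Counter(lines)
```

Comparing multisets of lines rather than only line counts also catches a line that was altered or written twice while another was dropped.
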
Moves separated log files into subdirectories named after the QueryId prefix (the part before the first "-", e.g. Q381368).
Verifies the exact same set of files exists before and after moving.
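
A minimal sketch of the move-and-verify step, assuming file names of the form `<prefix>-<rest>` (for example `Q381368-1.log`, a made-up name). The function name `move_by_prefix` is hypothetical, and files without a "-" are left in place here, which may differ from the real script.

```python
import shutil
from pathlib import Path


def move_by_prefix(root: Path) -> None:
    """Move each file in root into root/<prefix>/, then check that no file was lost or added."""
    files = [p for p in root.iterdir() if p.is_file()]
    before = sorted(p.name for p in files)

    for path in files:
        if "-" not in path.name:
            continue  # assumption: leave files without a prefix separator untouched
        prefix = path.name.split("-", 1)[0]
        target_dir = root / prefix
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))

    # Compare file names before and after; directories are ignored.
    after = sorted(p.name for p in root.rglob("*") if p.is_file())
    if before != after:
        raise RuntimeError("file set changed while moving")
```
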
Fixes CSV quoting in logs/*.csv: removes the wrapping " characters around the query field and restores each "" back to ".
Uses Python's csv module which handles the unescaping automatically on read.
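
To illustrate the read side only: `csv.reader` strips the wrapping quotes and turns each doubled `""` back into a single `"` while parsing. The helper name `read_unquoted` and the sample data are made up for this sketch; how the real script writes the result back is not shown.

```python
import csv
import io


def read_unquoted(text: str) -> list[list[str]]:
    """Parse CSV text; quoted fields come back unwrapped with "" collapsed to "."""
    # newline="" lets the csv module handle line endings itself.
    return list(csv.reader(io.StringIO(text, newline="")))


raw = 'id,query\r\n1,"SELECT * FROM t WHERE name = ""bob"""\r\n'
rows = read_unquoted(raw)
print(rows[1][1])
```
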
Uses AST-based analysis (sqlglot) to extract query templates from the workload CSV for Parametric Query Optimisation.
Replaces literals with ?, collapses IN lists, strips ORDER BY/LIMIT, and groups queries by structural template.
Counts unique QueryId values across all restored.log files and prints frequency stats.
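
A small sketch of the counting step, again assuming the id appears as `QueryId=<token>` in a line (the real format may differ). The function name and the choice of statistics are illustrative.

```python
import re
from collections import Counter

# Assumed format; adjust to the real restored.log lines.
QUERY_ID_RE = re.compile(r"QueryId=(\S+)")


def count_query_ids(lines) -> Counter:
    """Count how many lines mention each QueryId; lines without an id are skipped."""
    counts = Counter()
    for line in lines:
        match = QUERY_ID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def print_stats(counts: Counter, top: int = 5) -> None:
    """Print the number of unique ids and the most frequent ones."""
    print(f"unique QueryIds: {len(counts)}")
    for query_id, n in counts.most_common(top):
        print(f"{query_id}\t{n}")
```

To cover several files, the same `Counter` can be updated once per restored.log before calling `print_stats`.
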
Inspects the schema/structure of parquet files.
pip install pandas pyarrow sqlglot