LEO Logs Processing Pipeline

Scripts (run in order)

1. `extract_log_from_parquet.py`

Reads the parquet files in each subdirectory of `logs/` and merges them into a single `restored.log` per directory. The `value` column of each `.snappy.parquet` file is extracted, with the files processed in sorted order.
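The merge step can be sketched roughly as follows. This is a minimal illustration and not the script itself: the function names, the assumption that sorting is by file name, and the assumption that every row of `value` holds one log line as a non-null string are all hypothetical.

```python
from pathlib import Path


def sorted_parquet_files(directory):
    """Return the .snappy.parquet files in `directory`, sorted by file name."""
    return sorted(Path(directory).glob("*.snappy.parquet"), key=lambda p: p.name)


def restore_log(directory, out_name="restored.log"):
    """Concatenate the `value` column of every parquet file into one log file."""
    # pyarrow is imported here so the listing helper above has no dependencies.
    import pyarrow.parquet as pq

    out_path = Path(directory) / out_name
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted_parquet_files(directory):
            table = pq.read_table(path, columns=["value"])
            for value in table.column("value").to_pylist():
                # Assumption: each row is one log line without a trailing newline.
                out.write(f"{value}\n")
    return out_path
```

Calling `restore_log("logs/some_subdirectory")` (a placeholder path) would write `restored.log` next to the parquet files it was built from.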

2. `separate_logs_by_query_id.py`

Parses each `restored.log`, extracts the `QueryId` from every line, and writes each query's lines to a separate file in `separated_logs/`. Continuation lines (lines without a `QueryId`) are assigned to the most recently seen `QueryId`. At the end, the script verifies that the split is lossless, i.e. that no log lines were dropped.
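The continuation-line rule and the lossless check can be illustrated with a small in-memory sketch. The log line format is not documented here, so the regular expression below (`QueryId=<token>` or `QueryId: <token>`) and the `_unassigned` bucket are assumptions made only for the example.

```python
import re
from collections import defaultdict

# Assumption: the id appears as "QueryId=<token>" or "QueryId: <token>".
QUERY_ID_RE = re.compile(r"QueryId[=:]\s*([\w-]+)")


def split_by_query_id(lines):
    """Group a list of log lines by QueryId; continuation lines inherit the previous id."""
    groups = defaultdict(list)
    current = "_unassigned"  # hypothetical bucket for lines seen before any id
    for line in lines:
        match = QUERY_ID_RE.search(line)
        if match:
            current = match.group(1)
        groups[current].append(line)
    # Lossless check: every input line landed in exactly one group.
    assert sum(len(g) for g in groups.values()) == len(lines)
    return dict(groups)
```

A stack trace that follows a line tagged with one `QueryId` therefore stays with that query, even if lines from other queries appear before and after it.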

3. `group_logs_by_prefix.py`

Moves the separated log files into subdirectories named after the `QueryId` prefix (the part before the first `-`, e.g. `Q381368`). Verifies that exactly the same set of files exists before and after the move.
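A rough sketch of the grouping and the before/after check is shown below. It assumes the directory starts out flat and that files are named `<QueryId>.log`; both the naming scheme and the function name are hypothetical.

```python
import shutil
from pathlib import Path


def group_by_prefix(root):
    """Move each file in `root` into a subdirectory named after its id prefix."""
    root = Path(root)
    files = [p for p in root.iterdir() if p.is_file()]
    before = {p.name for p in files}
    for path in files:
        # "Q381368-abc.log" -> "Q381368"; names without "-" use the whole stem.
        prefix = path.stem.split("-", 1)[0]
        target_dir = root / prefix
        target_dir.mkdir(exist_ok=True)
        shutil.move(str(path), str(target_dir / path.name))
    # Verify that the set of file names is unchanged after the move.
    after = {p.name for p in root.rglob("*") if p.is_file()}
    if before != after:
        raise RuntimeError("file set changed while grouping")
    return after
```

Comparing sets of file names only detects missing or extra files; it does not compare file contents.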

4. `fix_csv_quotes.py`

Fixes CSV quoting in `logs/*.csv`: removes the wrapping `"` around the query field and restores each `""` to `"`. Uses Python's `csv` module, which handles the unescaping automatically on read.
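The unescaping that the `csv` module performs on read can be seen in a tiny example. The two-column row below is made up for illustration; the real column layout of the workload CSV is not specified here.

```python
import csv
import io

# One made-up row in which the query field is wrapped in quotes
# and its inner quotes are doubled.
raw = 'Q1,"SELECT ""name"" FROM t WHERE x = \'a,b\'"\n'

rows = list(csv.reader(io.StringIO(raw)))
query_id, query = rows[0]
print(query)  # SELECT "name" FROM t WHERE x = 'a,b'
```

The reader strips the wrapping quotes, turns each doubled quote back into a single one, and keeps the comma inside the quoted field as part of the query.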

5. `extract_query_templates.py`

Extracts query templates from the workload CSV for Parametric Query Optimisation, using AST-based analysis (`sqlglot`). Replaces literals with `?`, collapses `IN` lists, strips `ORDER BY`/`LIMIT`, and groups queries by structural template.

Utility Scripts (run anytime)

`count_unique_query_ids.py`

Counts unique `QueryId` values across all `restored.log` files and prints frequency stats.
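A frequency count of this kind can be sketched with `collections.Counter`. As before, the `QueryId=<token>` line format and the function name are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: the id appears as "QueryId=<token>" or "QueryId: <token>".
QUERY_ID_RE = re.compile(r"QueryId[=:]\s*([\w-]+)")


def count_query_ids(lines):
    """Count how many lines mention each QueryId."""
    counts = Counter()
    for line in lines:
        match = QUERY_ID_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

`len(counts)` then gives the number of unique ids, and `counts.most_common(10)` lists the ten most frequent ones.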

`structure_of_parquet.py`

Inspects the schema/structure of parquet files.

Dependencies

pip install pandas pyarrow sqlglot
