Wikipedia statistics and shortest path finder
- Clone the repository
git clone https://github.com/Gaareth/WikiStats.git --recurse-submodules - Copy
env.exampleto.envand edit the variables as needed - Run
docker compose --profile prod up -d --buildto build and start the services - (Optional) Create systemd service for task-scheduling, e.g., for the celery worker and for celery beat
- Run
./trigger-update.sh
- web-server: Astro webserver
- task-scheduling: Celery, Redis, RabbitMQ scheduler wikipedia dump processing
- wiki-stats-rs: Rust services for processing wikipedia dumps and a blazingly fast webserver calculating shortest paths
- igraph: Graph experiements
- Properly handle wikiprefixes: currently alot of stuff depends on wikiprefixes being two chars, and ending with "wiki". Also their url might be different
- More tests, especially benchmarking
- Download command
- Code refactor
The dumps may not always fully reflect the current links online, especially for websites that are frequently updated.
https://it.wikipedia.org/wiki/Nati_il_21_aprile Misses e.g. a link to https://it.wikipedia.org/wiki/Wesllem, as the it was removed shortly before the dump was made. And then later readded.
Demonstration:
grep -R --color=always -o -E '.{0,20}Wesllem.{0,8}' itwiki-20251001-linktarget.sql
_John'),(1020575,0,'Wesllem'),(1094
=> linktarget id of Wesllem is 1020575.
Does pagelinks.sql contain any links from Nati_il_21_aprile (pageid=1276495) to this linktarget_id?
gareth@manjaro-kde ~/dev/WikiStats/wiki-folder/20251001/downloads main*
❯ grep -o -E '\(1276495,[0-9]+,1020575\)' itwiki-20251001-pagelinks.sql
gareth@manjaro-kde ~/dev/WikiStats/wiki-folder/20251001/downloads main* 6s
❯
Does not seem like it?
Validating on ukwiki 20251001: Всесвітня премія фентезі за найкращу повість -> Премія Урсули К. Ле Ґуїн missing from db
This link is also not in the dump, but Премія Урсули К. Ле Ґуїн is on the page in a navbox at the bottom. Which might be the reason its not in the dump?
Also the latest update was the 26th october 2024, and its not frequently updated.
The dump only contains with Ґуїн in the name: Всесвітня премія фентезі за найкращу повість -> Урсула_Ле_Ґуїн