This Java application navigates the web, indexes pages, and extracts targeted content. Built with the Jsoup HTML parsing library and managed with Maven, the crawler follows links to a depth of 2, retrieving page titles, links, and text, and saving them to a file.
- Web Crawling: Initiates from a seed URL and explores linked pages up to a depth of 2.
- Content Extraction: Parses and extracts titles, hyperlinks, and textual content from web pages.
- Data Storage: Saves the extracted information into a structured file for further analysis or processing.
- Robots.txt Compliance: Respects web scraping policies by adhering to the `robots.txt` directives of each site.
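The content-extraction step can be sketched with Jsoup as below. The class and method names are illustrative (not the project's own code), and the example parses an inline HTML string in place of a fetched page so it is self-contained:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// Hypothetical helper showing Jsoup-based extraction of titles, links, and text.
public class PageExtractor {

    // Page title from raw HTML.
    public static String title(String html) {
        return Jsoup.parse(html).title();
    }

    // All href values of anchor tags.
    public static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        for (Element a : Jsoup.parse(html).select("a[href]")) {
            out.add(a.attr("href"));
        }
        return out;
    }

    // Visible body text with markup stripped.
    public static String text(String html) {
        return Jsoup.parse(html).body().text();
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example</title></head>"
                + "<body><p>Hello crawler.</p>"
                + "<a href=\"https://example.com/a\">A</a></body></html>";
        System.out.println(title(html));
        System.out.println(links(html));
        System.out.println(text(html));
    }
}
```

In the real crawler the HTML would come from `Jsoup.connect(url).get()` instead of a string, but the selection logic is the same.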
- Initialization: Begins with a seed URL added to the frontier queue.
- Fetching: Retrieves the HTML content of the URL.
- Parsing: Uses Jsoup to parse the HTML and extract links and desired content.
- Duplication Check: Verifies if the URL or its content has been previously crawled to avoid redundancy.
- Compliance Verification: Checks the site's `robots.txt` file to ensure adherence to crawling policies.
- Iteration: Adds new, uncrawled, and compliant URLs to the frontier queue, repeating the process up to the specified depth.
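The frontier loop described above can be sketched as follows. The fetch-and-parse step is injected as a function so the traversal logic stands on its own and needs no network access; the real crawler would plug in a Jsoup-backed fetcher, and a `robots.txt` check would filter URLs before they are enqueued. Names here are illustrative, not the project's:

```java
import java.util.ArrayDeque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

// Breadth-first frontier sketch with a pluggable link fetcher.
public class Frontier {

    private static final class Node {
        final String url;
        final int depth;
        Node(String url, int depth) { this.url = url; this.depth = depth; }
    }

    // Crawl from seedUrl up to maxDepth links away, returning the URLs
    // visited in discovery order.
    public static Set<String> crawl(String seedUrl, int maxDepth,
                                    Function<String, List<String>> fetchLinks) {
        Set<String> visited = new LinkedHashSet<>();
        Queue<Node> frontier = new ArrayDeque<>();
        frontier.add(new Node(seedUrl, 0));          // initialization
        while (!frontier.isEmpty()) {
            Node current = frontier.poll();
            if (!visited.add(current.url)) continue; // duplication check
            if (current.depth >= maxDepth) continue; // depth limit reached
            for (String next : fetchLinks.apply(current.url)) { // fetch + parse
                if (!visited.contains(next)) {
                    frontier.add(new Node(next, current.depth + 1)); // iteration
                }
            }
        }
        return visited;
    }
}
```

With `maxDepth = 2`, pages two links away from the seed are visited but their outgoing links are not followed.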
- Java Development Kit (JDK): Ensure JDK 8 or higher is installed.
- Maven: For project dependency management.
- Jsoup Library: Included as a Maven dependency.
- Clone the Repository: `git clone https://github.com/KELVI23/Java-Web-Crawler.git`
- Navigate to the Project Directory: `cd Java-Web-Crawler`
- Build the Project with Maven: `mvn clean install`
- Run the Application:
  - Execute the `Main` class to start the web crawling process.
  - Monitor the console output for progress and results.
- Seed URL: Modify the `seedUrl` variable in the `Main` class to change the starting point of the crawl.
- Crawling Depth: Adjust the `maxDepth` variable to set the desired depth of link traversal.
- Output File: Specify the destination file for extracted data in the `outputFilePath` variable.
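Taken together, the settings might look like the sketch below. The variable names come from the list above, but the seed URL, depth, and file path values are placeholders, and the project's actual declarations in `Main` may differ:

```java
// Illustrative configuration block; values are placeholders.
public class Main {
    static String seedUrl = "https://example.com";  // starting point of the crawl
    static int maxDepth = 2;                        // link-traversal depth
    static String outputFilePath = "output.txt";    // destination for extracted data
}
```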
- Ethical Crawling: Always ensure compliance with each website's `robots.txt` directives and terms of service.
- Performance Considerations: Be mindful of the load imposed on servers; implement appropriate delays between requests if necessary.
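As a sketch of the delay suggestion above — this helper is an assumption, not something the project ships — a minimal per-request throttle could look like:

```java
// Enforces a fixed minimum gap between consecutive requests.
public class PoliteDelay {
    private final long delayMillis;
    private long lastRequestAt = 0;

    public PoliteDelay(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    // Block until at least delayMillis have elapsed since the previous call.
    public synchronized void await() {
        long waitFor = lastRequestAt + delayMillis - System.currentTimeMillis();
        if (waitFor > 0) {
            try {
                Thread.sleep(waitFor);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            }
        }
        lastRequestAt = System.currentTimeMillis();
    }
}
```

Calling `await()` before each fetch caps the request rate at one per `delayMillis`, which keeps the crawler from hammering a single host.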
- Data Accuracy: The quality of extracted data depends on the structure of the target web pages and may require adjustments to parsing logic.
This project is open-source. Feel free to modify and use it according to your needs.
For issues, contributions, or further information, please refer to the GitHub repository.