Description
crawl4ai version
0.8.0
Expected Behavior
A deep crawl and a simple crawl of the same web page should produce consistent results.
Current Behavior
When a deep crawl is run from the root page https://www.simon.com.cn/case, subpages such as https://www.simon.com.cn/case/index_2.html consistently fail to crawl with the error: 'Error: Failed on navigating ACS-GOTO: Page.goto: Timeout 60000ms exceeded.' However, a simple crawl of the same subpage succeeds in about 1 second.
The code for the deep crawl of the root page is as follows:
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    URLPatternFilter,
    DomainFilter,
    ContentTypeFilter
)

async def main():
    filter_chain = FilterChain([
        DomainFilter(
            allowed_domains=["www.simon.com.cn"],
        ),
        URLPatternFilter(
            patterns=["*/case/*"],
            reverse=False
        ),
        ContentTypeFilter(allowed_types=["text/html"])
    ])
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=3,
            include_external=False,
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://www.simon.com.cn/case", config=config)
        print(f"Crawled {len(results)} pages in total")

if __name__ == "__main__":
    asyncio.run(main())
The log is as follows:
[INIT].... → Crawl4AI 0.8.0
[FETCH]... ↓ https://www.simon.com.cn/case | ✓ | ⏱: 6.71s
[SCRAPE].. ◆ https://www.simon.com.cn/case | ✓ | ⏱: 0.02s
[COMPLETE] ● https://www.simon.com.cn/case | ✓ | ⏱: 6.74s
[ERROR]... × https://www.simon.com.cn/case/70/191.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/70/191.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1103.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1103.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_2.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_2.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_6.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_6.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_5.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_5.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1033.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1033.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1034.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1034.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_3.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_3.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1036.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1036.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_7.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_7.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_4.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_4.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1032.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1032.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index_8.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index_8.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/index.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/index.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
[ERROR]... × https://www.simon.com.cn/case/69/1035.html | Error: Unexpected error in _crawl_web at line 718 in
_crawl_web (../miniforge3/envs/web08/lib/python3.11/site-packages/crawl4ai/async_crawler_strategy.py):
Error: Failed on navigating ACS-GOTO:
Page.goto: Timeout 60000ms exceeded.
Call log:
- navigating to "https://www.simon.com.cn/case/69/1035.html", waiting until "domcontentloaded"
Code context:
713 tag="GOTO",
714 params={"url": url},
715 )
716 response = None
717 else:
718 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}")
719
720 # ──────────────────────────────────────────────────────────────
721 # Walk the redirect chain. Playwright returns only the last
722 # hop, so we trace the `request.redirected_from` links until the
723 # first response that differs from the final one and surface its
Crawled 16 pages in total
URL: https://www.simon.com.cn/case
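Note that the total of 16 printed above appears to count the 15 failed subpages together with the one successful root page. A minimal sketch of how the returned list could be split into successes and failures is shown below. It assumes each result object exposes `url`, `success`, and `error_message` attributes; the helper name `summarize_results` and the stand-in objects are hypothetical and only there so the sketch runs without a browser or network.

```python
from types import SimpleNamespace
from typing import Iterable


def summarize_results(results: Iterable[object]) -> dict:
    """Split crawl results into successful URLs and (url, error) pairs."""
    ok, failed = [], []
    for r in results:
        # Assumption: each result has `url`, `success`, and `error_message`.
        if getattr(r, "success", False):
            ok.append(r.url)
        else:
            failed.append((r.url, getattr(r, "error_message", None) or ""))
    return {"ok": ok, "failed": failed}


# Stand-in objects (hypothetical) that mimic one success and one timeout.
demo = [
    SimpleNamespace(
        url="https://www.simon.com.cn/case",
        success=True,
        error_message=None,
    ),
    SimpleNamespace(
        url="https://www.simon.com.cn/case/index_2.html",
        success=False,
        error_message="Page.goto: Timeout 60000ms exceeded.",
    ),
]

summary = summarize_results(demo)
print(f"{len(summary['ok'])} succeeded, {len(summary['failed'])} failed")
for url, err in summary["failed"]:
    print(f"  FAILED {url}: {err}")
```

In the deep-crawl script, the same helper could be called on `results` in place of the bare `len(results)` print, so that the report distinguishes pages that were actually crawled from pages that timed out.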
Running either the deep crawl above or a simple crawl directly against one of the subpages works:
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig

async def main():
    browser_config = BrowserConfig()
    run_config = CrawlerRunConfig(
        verbose=True,
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.simon.com.cn/case/index_2.html",
            config=run_config
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
The log is as follows:
[INIT].... → Crawl4AI 0.8.0
[FETCH]... ↓ https://www.simon.com.cn/case/index_2.html | ✓ | ⏱: 1.10s
[SCRAPE].. ◆ https://www.simon.com.cn/case/index_2.html | ✓ | ⏱: 0.03s
[COMPLETE] ● https://www.simon.com.cn/case/index_2.html | ✓ | ⏱: 1.14s
[](https://www.simon.com.cn/) [EN](https://www.simon.com.cn/en/) [](javascript:;)
[首页](https://www.simon.com.cn/index.html)
[关于](javascript:;)
[关于我们](https://www.simon.com.cn/about-us/index.html) [集团官网](https://www.simonelectric.com/) [亚太官网](https://www.simon-apac.com/)
[产品](javascript:;)
[开关插座](https://www.simon.com.cn/switch&socket/) [灯具照明](https://www.simon.com.cn/lighting/) [环境电器](https://www.simon.com.cn/environment/) [低压电器](https://www.simon.com.cn/breaker/) [智能控制](https://www.simon.com.cn/control/)
[Simon IoT](https://www.simon.com.cn/iot/index.html)
[解决方案](https://www.simon.com.cn/solutions/)
[项目案例](https://www.simon.com.cn/case/index.html)
[最新资讯](https://www.simon.com.cn/news/index.html)
[加入西蒙](https://www.simon.com.cn/join/index.html)
[](https://www.simon.com.cn/)
[关于我们](https://www.simon.com.cn/about-us/index.html) | [集团官网](https://www.simonelectric.com/) | [亚太官网](https://www.simon-apac.com/) | [最新资讯](https://www.simon.com.cn/news/index.html) | [加入西蒙](https://www.simon.com.cn/join/index.html)| [联系我们](https://www.simon.com.cn/contact/index.html)|
[ 新媒体矩阵  ](javascript::)
[EN](https://www.simon.com.cn/en/)
* [开关插座](https://www.simon.com.cn/switch&socket/)
* [灯具照明](https://www.simon.com.cn/lighting/)
* [环境电器](https://www.simon.com.cn/environment/)
* [低压电器](https://www.simon.com.cn/breaker/)
* [智能控制](https://www.simon.com.cn/control/)
* [解决方案](https://www.simon.com.cn/solutions/)
* [Simon IoT](https://www.simon.com.cn/iot/index.html)
* [项目案例](https://www.simon.com.cn/case/index.html)
[搜索](javascript:;)
[  ](https://www.simon.com.cn/case/70/191.html)
#### 项目应用介绍
* [  **成都抖音集团总部大楼** ](https://www.simon.com.cn/case/69/1031.html)
* [  **中山大学深圳校区** ](https://www.simon.com.cn/case/69/998.html)
* [  **香港中文大学深圳校区** ](https://www.simon.com.cn/case/69/997.html)
* [  **深圳沙井海岸城** ](https://www.simon.com.cn/case/69/996.html)
* [  **深圳梅里人家** ](https://www.simon.com.cn/case/69/995.html)
* [  **深圳第二儿童医院** ](https://www.simon.com.cn/case/69/994.html)
[](https://www.simon.com.cn/case/index.html) [1](https://www.simon.com.cn/case/index.html)[2](https://www.simon.com.cn/case/index_2.html)[3](https://www.simon.com.cn/case/index_3.html)[4](https://www.simon.com.cn/case/index_4.html)[5](https://www.simon.com.cn/case/index_5.html)[6](https://www.simon.com.cn/case/index_6.html)[7](https://www.simon.com.cn/case/index_7.html)[8](https://www.simon.com.cn/case/index_8.html) [](https://www.simon.com.cn/case/index_3.html)
#### 案例分享

#### 地产战略合作客户

#### 酒店管理公司合作客户

#### 更多成功案例...
[阳光城集团](javascript:;)[时代地产](javascript:;)[建业集团](javascript:;)[力高置业](javascript:;)[苏宁置业](javascript:;)[荣盛地产](javascript:;)[祥生地产](javascript:;)[融信地产](javascript:;)[万华地产](javascript:;)[中建东孚地产](javascript:;)[金融街](javascript:;)[金隅嘉业](javascript:;)[协信地产](javascript:;)[东原地产](javascript:;)[海尔地产](javascript:;)[泛海置业(武汉CBD)](javascript:;)[大华地产](javascript:;)[花样年地产](javascript:;)[龙光控股](javascript:;)[和昌地产](javascript:;)[当代置业](javascript:;)[SOHO中国](javascript:;)[华侨城](javascript:;)[景瑞地产](javascript:;)[苏宁置业](javascript:;)[雨润集团](javascript:;)[海伦堡地产](javascript:;)[中骏地产](javascript:;)[奥园地产](javascript:;)[新华联集团](javascript:;)[泰康之家](javascript:;)[恒盛地产](javascript:;)[银亿地产](javascript:;)[德信地产](javascript:;)[远大集团](javascript:;)[三盛地产](javascript:;)[蓝润地产](javascript:;)[珠江投资](javascript:;)[……](javascript:;)
[下载中心](javascript:;)
[手册下载](https://www.simon.com.cn/channels/110.html) [视频下载](https://www.simon.com.cn/channels/111.html)
[公司声明](javascript:;)
[商标公告](https://www.simon.com.cn/channels/91.html)
[企业文化](javascript:;)
[社会责任](https://www.simon.com.cn/about-us/73.html) [遇见西蒙](https://www.simon.com.cn/channels/100.html)
[服务与支持](javascript:;)
[防伪查询](http://q.simon.com.cn/query1.html) [门店查询](https://www.simon.com.cn/shops/index.html) [合作伙伴入口](https://c9.simon.com.cn/Desktop/login)
[联系西蒙](javascript:;)
[上海管理中心](https://www.simon.com.cn/contact/106.html) [海安生产基地](https://www.simon.com.cn/contact/107.html) [全国办事处](https://www.simon.com.cn/contact/index.html)
[苏ICP备10121996号](https://beian.miit.gov.cn/#/Integrated/index) 版权所有:西蒙电气(中国)有限公司
[  联系我们 ](https://www.simon.com.cn/contact/index.html) [  帮助与咨询 ](https://www.simon.com.cn/service/index.html)
描述您的问题…
The following parameters were also tried in the deep crawl, without success:
semaphore_count=1,
delay_before_return_html=2,
capture_network_requests=False,
page_timeout=200000,  # 200 s (the value is in milliseconds)
Is this reproducible?
Yes
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
Linux
Python version
3.11.11
Browser
Chrome
Browser version
No response
Error logs & Screenshots (if applicable)
No response