INFO:scrapy.utils.log:Scrapy 2.11.2 started (bot: scrapybot) INFO:scrapy.utils.log:Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.11.2 (main, Aug 26 2024, 07:20:54) [GCC 12.2.0], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Linux-5.10.104-linuxkit-x86_64-with-glibc2.36 INFO:scrapy.addons:Enabled addons: [] WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/utils/request.py:254: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy. See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation. 
return cls(crawler) DEBUG:scrapy.utils.log:Using reactor: twisted.internet.epollreactor.EPollReactor INFO:scrapy.middleware:Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.memusage.MemoryUsage', 'scrapy.extensions.logstats.LogStats'] INFO:scrapy.crawler:Overridden settings: {'DUPEFILTER_CLASS': 'src.custom_dupefilter.CustomDupeFilter', 'LOG_ENABLED': '1', 'LOG_LEVEL': 'ERROR', 'TELNETCONSOLE_ENABLED': False, 'USER_AGENT': 'Typesense DocSearch Scraper (Bot; ' 'https://typesense.org/docs/guide/docsearch.html)'} INFO:scrapy.middleware:Enabled downloader middlewares: ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats', 'src.custom_downloader_middleware.CustomDownloaderMiddleware'] INFO:scrapy.middleware:Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware'] INFO:scrapy.middleware:Enabled item pipelines: [] INFO:scrapy.core.engine:Spider opened WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/dupefilters.py:100: ScrapyDeprecationWarning: RFPDupeFilter subclasses must either modify their overridden 
'__init__' method and 'from_settings' class method to support a 'fingerprinter' parameter, or reimplement the 'from_crawler' class method. warn( WARNING:py.warnings:/home/seleuser/.local/share/virtualenvs/seleuser-AdYDHarm/lib/python3.11/site-packages/scrapy/dupefilters.py:59: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy. See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation. fingerprinter or RequestFingerprinter() INFO:scrapy.extensions.logstats:Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) Getting http://host.docker.internal/sitemap.xml from selenium DEBUG:selenium.webdriver.remote.remote_connection:POST http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/url {'url': 'http://host.docker.internal/sitemap.xml'} DEBUG:urllib3.connectionpool:http://localhost:34067 "POST /session/5b6e73f794d22b0b21535f65fdcc140e/url HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/source {} DEBUG:urllib3.connectionpool:http://localhost:34067 "GET /session/5b6e73f794d22b0b21535f65fdcc140e/source HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"\u003Chtml lang=\"en\" data-theme=\"light\" dir=\"ltr\" data-rh=\"lang,dir,class\" 
class=\"plugin-native plugin-id-default\">\u003Chead>\n \u003Cmeta charset=\"utf-8\">\n \u003Cmeta name=\"generator\" content=\"Docusaurus\">\n \u003Ctitle>Page Not Found | DSK Documentation\u003C/title>\n \u003Clink rel=\"alternate\" type=\"application/rss+xml\" href=\"/blog/rss.xml\" title=\"DSK Documentation RSS Feed\">\n\u003Clink rel=\"alternate\" type=\"application/atom+xml\" href=\"/blog/atom.xml\" title=\"DSK Documentation Atom Feed\">\n\n\n\n\n\u003Clink rel=\"search\" type=\"application/opensearchdescription+xml\" title=\"DSK Documentation\" href=\"/opensearch.xml\">\n \u003Cscript defer=\"\" src=\"/runtime~main.js\">\u003C/script>\u003Cscript defer=\"\" src=\"/main.js\">\u003C/script>\u003Clink href=\"/styles.css\" rel=\"stylesheet\">\n \u003Clink rel=\"icon\" href=\"/img/favicon.ico\" data-rh=\"true\">\u003Cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" data-rh=\"true\">\u003Clink rel=\"canonical\" href=\"https://your-docusaurus-site.example.com/sitemap.xml\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/sitemap.xml\" hreflang=\"en\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/sitemap.xml\" hreflang=\"x-default\" data-rh=\"true\">\u003Cmeta name=\"twitter:card\" content=\"summary_large_image\" data-rh=\"true\">\u003Cmeta property=\"og:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta name=\"twitter:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta property=\"og:url\" content=\"https://your-docusaurus-site.example.com/sitemap.xml\" data-rh=\"true\">\u003Cmeta property=\"og:locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta name=\"docsearch:language\" 
content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docsearch:docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta property=\"og:title\" content=\"Page Not Found | DSK Documentation\" data-rh=\"true\">\u003C/head>\n \u003Cbody class=\"navigation-with-keyboard\" data-rh=\"class\">\n \u003Cscript>\n(function() {\n var defaultMode = 'light';\n var respectPrefersColorScheme = false;\n\n function setDataThemeAttribute(theme) {\n document.documentElement.setAttribute('data-theme', theme);\n }\n\n function getQueryStringTheme() {\n try {\n return new URLSearchParams(window.location.search).get('docusaurus-theme')\n } catch (e) {\n }\n }\n\n function getStoredTheme() {\n try {\n return window['localStorage'].getItem('theme');\n } catch (err) {\n }\n }\n\n var initialTheme = getQueryStringTheme() || getStoredTheme();\n if (initialTheme !== null) {\n setDataThemeAttribute(initialTheme);\n } else {\n if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: dark)').matches\n ) {\n setDataThemeAttribute('dark');\n } else if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: light)').matches\n ) {\n setDataThemeAttribute('light');\n } else {\n setDataThemeAttribute(defaultMode === 'dark' ? 
'dark' : 'light');\n }\n }\n })();\n\n(function() {\n try {\n const entries = new URLSearchParams(window.location.search).entries();\n for (var [searchKey, value] of entries) {\n if (searchKey.startsWith('docusaurus-data-')) {\n var key = searchKey.replace('docusaurus-data-',\"data-\")\n document.documentElement.setAttribute(key, value);\n }\n }\n } catch(e) {}\n})();\n\n\n \u003C/script>\n \u003Cdiv id=\"__docusaurus\">\u003C/div>\n \n \n \n\n\u003C/body>\u003C/html>"} | headers=HTTPHeaderDict({'Content-Length': '4062', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/url {} DEBUG:urllib3.connectionpool:http://localhost:34067 "GET /session/5b6e73f794d22b0b21535f65fdcc140e/url HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"http://host.docker.internal/sitemap.xml"} | headers=HTTPHeaderDict({'Content-Length': '51', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:scrapy.core.engine:Crawled (200) (referer: None) Getting http://host.docker.internal from selenium DEBUG:selenium.webdriver.remote.remote_connection:POST http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/url {'url': 'http://host.docker.internal'} DEBUG:urllib3.connectionpool:http://localhost:34067 "POST /session/5b6e73f794d22b0b21535f65fdcc140e/url HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:selenium.webdriver.remote.remote_connection:GET 
http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/source {} DEBUG:urllib3.connectionpool:http://localhost:34067 "GET /session/5b6e73f794d22b0b21535f65fdcc140e/source HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"\u003Chtml lang=\"en\" data-theme=\"light\" dir=\"ltr\" data-rh=\"lang,dir,class\" class=\"plugin-pages plugin-id-default\">\u003Chead>\n \u003Cmeta charset=\"utf-8\">\n \u003Cmeta name=\"generator\" content=\"Docusaurus\">\n \u003Ctitle>Hello from DSK Documentation | DSK Documentation\u003C/title>\n \u003Clink rel=\"alternate\" type=\"application/rss+xml\" href=\"/blog/rss.xml\" title=\"DSK Documentation RSS Feed\">\n\u003Clink rel=\"alternate\" type=\"application/atom+xml\" href=\"/blog/atom.xml\" title=\"DSK Documentation Atom Feed\">\n\n\n\n\n\u003Clink rel=\"search\" type=\"application/opensearchdescription+xml\" title=\"DSK Documentation\" href=\"/opensearch.xml\">\n \u003Cscript defer=\"\" src=\"/runtime~main.js\">\u003C/script>\u003Cscript defer=\"\" src=\"/main.js\">\u003C/script>\u003Clink href=\"/styles.css\" rel=\"stylesheet\">\n \u003Clink rel=\"icon\" href=\"/img/favicon.ico\" data-rh=\"true\">\u003Cmeta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\" data-rh=\"true\">\u003Clink rel=\"canonical\" href=\"https://your-docusaurus-site.example.com/\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/\" hreflang=\"en\" data-rh=\"true\">\u003Clink rel=\"alternate\" href=\"https://your-docusaurus-site.example.com/\" hreflang=\"x-default\" data-rh=\"true\">\u003Cmeta name=\"twitter:card\" content=\"summary_large_image\" data-rh=\"true\">\u003Cmeta property=\"og:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta name=\"twitter:image\" content=\"https://your-docusaurus-site.example.com/img/docusaurus-social-card.jpg\" data-rh=\"true\">\u003Cmeta 
property=\"og:url\" content=\"https://your-docusaurus-site.example.com/\" data-rh=\"true\">\u003Cmeta property=\"og:locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_locale\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta name=\"docsearch:language\" content=\"en\" data-rh=\"true\">\u003Cmeta name=\"docsearch:docusaurus_tag\" content=\"default\" data-rh=\"true\">\u003Cmeta property=\"og:title\" content=\"Hello from DSK Documentation | DSK Documentation\" data-rh=\"true\">\u003Cmeta name=\"description\" content=\"Description will go into a meta tag in \u003Chead />\" data-rh=\"true\">\u003Cmeta property=\"og:description\" content=\"Description will go into a meta tag in \u003Chead />\" data-rh=\"true\">\u003C/head>\n \u003Cbody class=\"navigation-with-keyboard\" data-rh=\"class\">\n \u003Cscript>\n(function() {\n var defaultMode = 'light';\n var respectPrefersColorScheme = false;\n\n function setDataThemeAttribute(theme) {\n document.documentElement.setAttribute('data-theme', theme);\n }\n\n function getQueryStringTheme() {\n try {\n return new URLSearchParams(window.location.search).get('docusaurus-theme')\n } catch (e) {\n }\n }\n\n function getStoredTheme() {\n try {\n return window['localStorage'].getItem('theme');\n } catch (err) {\n }\n }\n\n var initialTheme = getQueryStringTheme() || getStoredTheme();\n if (initialTheme !== null) {\n setDataThemeAttribute(initialTheme);\n } else {\n if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: dark)').matches\n ) {\n setDataThemeAttribute('dark');\n } else if (\n respectPrefersColorScheme &&\n window.matchMedia('(prefers-color-scheme: light)').matches\n ) {\n setDataThemeAttribute('light');\n } else {\n setDataThemeAttribute(defaultMode === 'dark' ? 
'dark' : 'light');\n }\n }\n })();\n\n(function() {\n try {\n const entries = new URLSearchParams(window.location.search).entries();\n for (var [searchKey, value] of entries) {\n if (searchKey.startsWith('docusaurus-data-')) {\n var key = searchKey.replace('docusaurus-data-',\"data-\")\n document.documentElement.setAttribute(key, value);\n }\n }\n } catch(e) {}\n})();\n\n\n \u003C/script>\n \u003Cdiv id=\"__docusaurus\">\u003C/div>\n \n \n \n\n\u003C/body>\u003C/html>"} | headers=HTTPHeaderDict({'Content-Length': '4280', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:selenium.webdriver.remote.remote_connection:GET http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e/url {} DEBUG:urllib3.connectionpool:http://localhost:34067 "GET /session/5b6e73f794d22b0b21535f65fdcc140e/url HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":"http://host.docker.internal/"} | headers=HTTPHeaderDict({'Content-Length': '40', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request DEBUG:scrapy.core.engine:Crawled (200) (referer: None) > DocSearch: http://host.docker.internal/ 0 records) INFO:scrapy.core.engine:Closing spider (finished) INFO:scrapy.statscollectors:Dumping Scrapy stats: {'downloader/request_bytes': 547, 'downloader/request_count': 2, 'downloader/request_method_count/GET': 2, 'downloader/response_bytes': 7494, 'downloader/response_count': 2, 'downloader/response_status_count/200': 2, 'elapsed_time_seconds': 0.85975, 'finish_reason': 'finished', 'finish_time': datetime.datetime(2025, 4, 7, 8, 25, 23, 550254, tzinfo=datetime.timezone.utc), 'memusage/max': 75976704, 'memusage/startup': 75976704, 'response_received_count': 2, 'scheduler/dequeued': 2, 'scheduler/dequeued/memory': 2, 'scheduler/enqueued': 2, 
'scheduler/enqueued/memory': 2, 'start_time': datetime.datetime(2025, 4, 7, 8, 25, 22, 690504, tzinfo=datetime.timezone.utc)} INFO:scrapy.core.engine:Spider closed (finished) DEBUG:selenium.webdriver.remote.remote_connection:DELETE http://localhost:34067/session/5b6e73f794d22b0b21535f65fdcc140e {} DEBUG:urllib3.connectionpool:http://localhost:34067 "DELETE /session/5b6e73f794d22b0b21535f65fdcc140e HTTP/11" 200 0 DEBUG:selenium.webdriver.remote.remote_connection:Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'}) DEBUG:selenium.webdriver.remote.remote_connection:Finished Request
Crawling issue: nbHits 0 for docusaurus