Crawl Analyzer
==============

After using the _Crawl Wizard tool_, and running a _Wordcount (Discovery)_ or a _Content extraction (Scan)_, you can check its results under the _List_ section by either downloading the `.log` file, or using the _Crawl log visualization filter_.

When you click on the filter button, a new window will pop up, where you can narrow down the results of the log.

![Open crawl log filter](/img/dashboard2/open_crawl_log_filter.png)

### Crawl status

![Crawl log filter](/img/dashboard2/crawl_log_filter.png)

**Processed:** The page was successfully added to the pages list, the crawler was able to process it.

**Skipped:** In this case, the Proxy could not process the page, because of its `content-type`, or because of the configuration of the collectible resources in _Crawl wizard_. There are four main reasons behind this status:

- _No content type, page collection is disabled_: The crawler hadn't received a content-type before, and also the collection of new HTML pages is disabled.
- _HTML page, page collection is disabled_: The content-type is text/html, but the collection of new HTML pages is disabled.
- _Not HTML page, resource collection is disabled_: The URL points to a resource (e.g. the content-type is text/javascript or text/css), but the collection of resources is disabled.
- _Content with type < content-type > is not processed during this crawl_: The content-type header designates an unprocessable type, for example if the header is proprietary (such as "application/example-script"), or it is an unsupported content-type, for example text/plain.

**Failed:** The proxy was not able to process the content, you can see below the list of the reasons behind this status:

- _Path is externalized_: The page is externalized
- _Request sending failed_: The crawler wasn't able to send a GET request
- _Content is not an HTML page_: The content is not HTML, the crawler wasn't able process it
- _Not processable_: The crawler wasn't able process the given content-type
- _Too large, size: < size of content >, limit: 1048576_: Files above 1.5 MB are not handled
- _Parsing error_: Invalid script, or HTML format for example
- _Proxy request failed_, _Response not processable_, _Response processing was aborted_,  _Skipped because processing failed_: Error during processing. Sometimes it is not quite obvious at first glance why the crawler failed, but in this case we can check our logs for you


### Response type

You can also filter the log results by content-types, for example, you can select different resource types, like CSS, JS, images etc.


### Response code

With this filter you can narrow down the log list by [HTTP status](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) codes:

**1xx (Informational):** The request was received, continuing process

**2xx (Successful):** The request was successfully received, understood, and accepted

**3xx (Redirection):** Further action needs to be taken in order to complete the request

**4xx (Client Error):** The request contains bad syntax or cannot be fulfilled

**5xx (Server Error):** The server failed to fulfill an apparently valid request  


### Regexp

Inserting here links or other texts, will be treated as strings, but you can specify `java/util/Pattern` regular expressions as well.
You can test your regex [here](http://www.regexplanet.com/advanced/java/index.html).