Icarus is the web crawler for Alex Guichet’s web research projects.
A project using this crawling data will launch in early 2023.

Traffic from Icarus originates from various global sources and can be identified by its user agent:

Mozilla/5.0 (compatible; Icarus/0.2; +https://icarus.computer)
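As a minimal sketch of how a site operator might spot this traffic in server logs, the Python below matches the product token in the user agent. The names `is_icarus` and `icarus_version` and the regular expression are illustrative assumptions based on the string above, not something Icarus publishes. A user agent can be spoofed, so a match is only a hint.

```python
import re
from typing import Optional

# Matches the "Icarus/<version>" product token, e.g. "Icarus/0.2".
# The pattern is an assumption derived from the user agent string shown above.
ICARUS_TOKEN = re.compile(r"\bIcarus/(\d+(?:\.\d+)*)")


def icarus_version(user_agent: str) -> Optional[str]:
    """Return the Icarus version found in a user agent string, or None."""
    match = ICARUS_TOKEN.search(user_agent)
    return match.group(1) if match else None


def is_icarus(user_agent: str) -> bool:
    """Return True if the user agent string carries an Icarus product token."""
    return icarus_version(user_agent) is not None
```

Matching on the product token rather than the full string means the check should keep working if the version number changes.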

Crawling Purposes

Icarus wants to discover culture, passionate people, interesting writing, primary document sources (sometimes with minimally transformative commentary on them), and anything in the weird internet at large.

  • If your content's goal is to be #1 on typical SERPs, Icarus probably won’t look at your content too kindly.
  • If you're making gifts for the internet, Icarus will be fast friends with your content.

Be a good internet citizen

Icarus may generally be doing the following:

  • finding and indexing English webpages
  • following relevant links on your pages
  • watching your RSS and JSON feeds
  • studying attributes of your webpage (does it have rich content, cookies, css, javascript, ads, tracking, etc.)
  • parsing and understanding your content with an advanced, evolving library of heuristics and AI analysis

This is all with the goal of seeking, sharing, and celebrating great under-appreciated content on the internet. Use your own voice, have your own style, tell your own story. If you’re writing good things and making beautiful sites, and your content doesn’t look like what everybody else is doing, you’re on the right track.

Prohibited Content

We will refuse to index your site, and will ban your entire domain without recourse, if it prominently features content that:

  • is predominantly hateful, hurtful, toxic, or incendiary
  • spreads disinformation
  • reduces, or advocates for the reduction of, the sum total of human happiness
  • does not advance the forward progress of humanity

Icarus automatically flags content for review that could meet these criteria, but the final flag is set by humans. If you find your domain has this flag set: are you being good to humans? Would empathetic, well-regarded humans see you, your content, and your actions as being on the right side of history? If not, you have more things to worry about than whether your content appears in an indie search engine.

Domain Penalties

Icarus does not look kindly on the following content, and subjectively penalizes your entire domain for:

  • Any bad SEO behaviors, search spamming, and link farming
  • Modifying or changing your website for different global users, like limiting access to certain countries, locking content behind paywalls, or similar. (We believe in free, open access to information for all.)
  • Hosting predominantly spammy and uninteresting content.
  • Using a CMS or tools that beget a monoculture of internet content.
  • Under-moderated or otherwise uncontrolled user-generated content.
  • Slow loading times (generally, if you limit large content or use some sort of CDN or cache layer, it’ll be ok)
  • Not listing a canonical link for a substantial amount of your pages
  • Heavy JavaScript inclusion, or content that requires JavaScript execution and significant DOM manipulation (like SPAs)

These do not prohibit your content from being indexed, but they reduce the chance that your content will breach the surface.

Unindexed Content

We are also choosing to not index content that doesn't meet our curation goals at this time:

  • Hyperlocal Content — like the website of a restaurant you enjoy down the street, and its menu and hours
  • Medical Diagnostic Information — you probably shouldn't be self-diagnosing on the internet
  • Q&As – people asking and answering questions on the internet typically fail to do so with nuance and expertise
  • Strong Adult Content — this content is conceptually ok for adults and the internet, but it's not currently in my internet research goals
  • Social Media — this is just an outright flood of content on the internet, and you're not incredibly likely to click somebody's social media posts.

Content that fits in the "unindexed" category does not penalize your domain; only the pages that feature this content are left out of the index.

Index Requests

If your content is not in Icarus's index, you'll be able to request indexing at a later time. Check back for a submission form here. (Don't send an email.)

Crawling Behavior & Controls

During crawl waves, Icarus may crawl your domain at about 1-2 requests per second.

Customizing robots.txt rules

Icarus respects standard robots.txt directives that are targeted at Icarus or all robots, including Crawl-delay, Allow, and Disallow.
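To illustrate how these directives combine, here is a minimal sketch that checks a sample robots.txt with Python's standard `urllib.robotparser`. The rules, paths, and `example.com` URLs are made-up examples, and Python's parser is only a stand-in: how Icarus itself resolves overlapping rules is not described here and may differ.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: one group aimed at Icarus, one for all robots.
ROBOTS_TXT = """\
User-agent: Icarus
Crawl-delay: 2
Allow: /blog/
Disallow: /drafts/

User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)

# Ask whether a crawler identifying as "Icarus" may fetch each URL.
blog_allowed = rules.can_fetch("Icarus", "https://example.com/blog/post")
drafts_allowed = rules.can_fetch("Icarus", "https://example.com/drafts/wip")

# Crawl-delay (in seconds) from the group that applies to "Icarus".
delay_seconds = rules.crawl_delay("Icarus")
```

Under this parser, the Icarus group permits `/blog/`, blocks `/drafts/`, and asks for a two-second pause between requests, while other robots fall back to the `*` group.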

Customizing robots HTML meta tags & response headers

Icarus reads robots meta tags in HTML documents. To specify robots rules in meta tags, put the tags in the head section of the document, like this:

<html><head>
 <meta name="robots" content="noindex"/>
 ...
 </head>
 <body>...</body>
</html>

Additionally, Icarus checks for any directive in an X-Robots-Tag header.

Icarus supports the following directives:

  • noindex: Icarus won't index this page
  • none: Icarus won't index, snippet, or follow links on the page.
  • all: Icarus provides the document for suggestions and snippets the contents so that a short description of the page can appear next to a representative image. Icarus may follow links on the page to provide more suggestions.
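The sketch below shows one way the supported directives could be read from a page and combined with an X-Robots-Tag header value. It is an illustration under stated assumptions, not Icarus's implementation: `RobotsMetaParser`, `split_directives`, and `page_policy` are hypothetical names, and treating the meta tag and the header as one merged set of directives is an assumption.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


def split_directives(value: str) -> List[str]:
    """Split a comma-separated directive string into lowercase tokens."""
    return [token.strip().lower() for token in value.split(",") if token.strip()]


class RobotsMetaParser(HTMLParser):
    """Collects directives from every <meta name="robots"> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            self.directives += split_directives(attr.get("content") or "")


def page_policy(html: str, x_robots_tag: Optional[str] = None) -> Dict[str, bool]:
    """Return whether a page may be indexed and whether its links may be followed.

    Only the currently supported directives are interpreted:
    noindex blocks indexing, none blocks indexing and following,
    and all (or no directive) leaves both permitted.
    """
    parser = RobotsMetaParser()
    parser.feed(html)
    directives = set(parser.directives)
    if x_robots_tag:
        directives.update(split_directives(x_robots_tag))
    return {
        "index": not ({"noindex", "none"} & directives),
        "follow": "none" not in directives,
    }
```

For the HTML example above, `page_policy` reports that the page should not be indexed while its links may still be followed.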

In the future, Icarus may support the following directives:

  • nosnippet: Icarus won't generate a description or web answer for the page. Any suggestions to visit this URL will only include the page's title.
  • nofollow: Icarus won't follow any links on the page.

Customizing link following

You can hint to Icarus if your links should be followed or used for ranking with nofollow and other directives:

<a href="#" rel="nofollow">Link</a>

Icarus supports the following directives:

  • nofollow: Icarus won't follow or index this link
  • ugc: Icarus will treat this link as if it has been contributed by a user on your site, and may choose to follow it if it's sufficiently related to the content on your page, but with decreased weighting. (You are still responsible for content moderation.)
  • sponsored: Icarus will treat this link as if it is advertised content, where you have received (or may receive) compensation for placing this link on your website.
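To check which hints your own pages send, a small script can list each link with its recognized rel values. This is a minimal sketch using Python's standard `html.parser`; `LinkRelParser` and `classify_links` are hypothetical names, and it only reports the hints, since how Icarus weights them is not specified beyond the descriptions above.

```python
from html.parser import HTMLParser
from typing import List, Tuple

# The rel values described in the list above.
KNOWN_REL = {"nofollow", "ugc", "sponsored"}


class LinkRelParser(HTMLParser):
    """Collects (href, recognized rel values) for every <a href=...> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, List[str]]] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if href is None:
            return
        # rel is a space-separated token list; keep only the known hints.
        rel_tokens = (attr.get("rel") or "").lower().split()
        self.links.append((href, sorted(KNOWN_REL.intersection(rel_tokens))))


def classify_links(html: str) -> List[Tuple[str, List[str]]]:
    """Return each link in the HTML paired with its recognized rel hints."""
    parser = LinkRelParser()
    parser.feed(html)
    return parser.links
```

A link with no rel attribute comes back with an empty list, which makes it easy to spot user-contributed or paid links that are missing a hint.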

Webmaster Tools

When the project goes live, webmaster tools will also be available, featuring limited insights into your domain, if crawled.

Changelog

  • 0.1 (Early September 2022): Initial crawling begins
  • 0.2 (Late September 2022): Updated crawler behavior related to robots.txt, meta tags, and links.

Contact

If you have questions or concerns, please contact me at alex@alexguichet.com