Sunday, March 15, 2015

Robots.txt

As you may know, robots.txt tells a search engine's crawler (mostly Google, in practice) which parts of a site it may and may not crawl and, in effect, index for searches. In other words, web sites mostly use it to tell Google not to crawl certain directories and make the results available.

For fun, let's see what UGA (where I work) thinks Google should not see and, thus, make available to searchers. So I went looking at a few robots.txt files. The library, for example, has this:

Disallow: /phonelist
Disallow: /events/
Disallow: /staff/facultysearches/
 
That means don't crawl anything whose path starts with /phonelist, /events/, or /staff/facultysearches/. Because, ya know, libraries. Don't want people searching.
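
If you want to check how a crawler would read those three rules, here is a minimal sketch using Python's standard urllib.robotparser. The "User-agent: *" line is my assumption (the excerpt doesn't show which crawlers the rules apply to), and the test paths are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The three library rules from above. The User-agent line is an
# assumption: the excerpt does not show which crawlers it targets.
LIBRARY_RULES = """\
User-agent: *
Disallow: /phonelist
Disallow: /events/
Disallow: /staff/facultysearches/
"""


def build_parser(rules_text):
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(LIBRARY_RULES)

# Hypothetical paths, chosen only to show how prefix matching behaves.
for path in ["/phonelist", "/phonelist-old", "/events/", "/events",
             "/staff/", "/staff/facultysearches/2015"]:
    verdict = "allowed" if parser.can_fetch("*", path) else "blocked"
    print(path, verdict)
```

Matching here is by prefix, so /phonelist with no trailing slash also blocks something like /phonelist-old, while /events/ leaves a bare /events alone.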

The UGA "research" site (research.uga.edu) doesn't want search engines, well, researching. They have this:

User-agent: *
Disallow: /*calendar
Disallow: /*events
Disallow: /*files
Disallow: /*wp-admin
Disallow: /*wp-content
Disallow: /*wp-includes
Disallow: /calendar/action~posterboard/
Disallow: /calendar/action~agenda/
Disallow: /calendar/action~oneday/
Disallow: /calendar/action~month/
Disallow: /calendar/action~week/
Disallow: /calendar/action~stream/
 

Clearly the research folks do not want Google seeing their calendar. I hope it's good.
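
Those asterisks are wildcards. Python's built-in urllib.robotparser does plain prefix matching and, as far as I know, doesn't expand them, so here is a rough sketch of the matching Google documents: * stands for any run of characters, and a trailing $ anchors the end of the path. The rule_matches helper and the sample paths are hypothetical, just for illustration.

```python
import re


def rule_matches(pattern, path):
    """Return True if a robots.txt path pattern matches a URL path.

    A sketch of wildcard matching as Google describes it:
    '*' matches any run of characters, a trailing '$' anchors the end,
    and everything else is matched as a literal prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each '*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, which gives prefix semantics.
    return re.match(regex, path) is not None


# A few of the research site's rules, against made-up paths.
print(rule_matches("/*calendar", "/calendar/action~month/"))
print(rule_matches("/*events", "/news/events/2015"))
print(rule_matches("/*wp-admin", "/about/"))
```

Under this reading, /*calendar already seems to cover every one of the /calendar/action~.../ lines, so those extra rules look redundant for crawlers that honor wildcards.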

I can do this all day, but you get the point.
