This page provides technical information for web authors and administrators of web sites that are to be included in the MyCommunityInfo search results.

How Does the MyCommunityInfo Search Engine Work?

The MyCommunityInfo search engine uses a Google search appliance operated by the City of London Technology Services Department. To provide accurate and up-to-date search results, the search appliance crawls each of the MyCommunityInfo domains early every morning. If you monitor your web site logs, you will see log entries each morning from gsa-crawler at the IP address 204.225.163.10. The crawler retrieves a new copy of each accessible page on your web site each morning.

What Happens When I Submit My Site?

The MyCommunityInfo project manager and/or a committee of MyCommunityInfo partner organizations will review the content of your web site to determine whether it is appropriate for MyCommunityInfo. If it is approved, City of London Technology Services staff will then perform a test crawl of your web site, usually within a week of your submission. If no problems are encountered during the test crawl, your site will be added to the list of domains that the search engine crawls each morning.

If your site contains a large number of pages, or if the search engine takes an excessive amount of time to crawl your site, we will include your site as an "archived site". An archived site is crawled only once each month.

How Do I Link to MyCommunityInfo?

You can include a hyperlink on your web site to MyCommunityInfo.ca, either as plain text, for example:

<A HREF="http://www.mycommunityinfo.ca">mycommunityinfo.ca</A>

or with the MyCommunityInfo logo image.

Why Can't the Search Engine Find My Pages?

The search engine crawls each site by following the conventional hyperlink <A HREF= tags that it finds in the HTML of each document.
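As a rough illustration of what "following conventional hyperlinks" means, the sketch below uses Python's standard html.parser module to collect only the links that appear in <A HREF= tags. It is not the search appliance's actual code; the class name, function name, and sample page are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects href values from conventional <a href="..."> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <A HREF=...> is matched as well as <a href=...>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return the list of link targets found in anchor tags."""
    collector = AnchorCollector()
    collector.feed(html)
    return collector.links


# Hypothetical page: one conventional link, one link that exists only in script.
SAMPLE_PAGE = """
<html><body>
<A HREF="sitemap.html">Site map</A>
<script>
  function go() { window.location = "hidden.html"; }
</script>
<span onclick="go()">Hidden page</span>
</body></html>
"""

print(extract_links(SAMPLE_PAGE))  # prints ['sitemap.html']
```

In this sketch only sitemap.html is discovered; hidden.html is reachable solely through script, so a link-following crawler of this kind never learns that it exists.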
The crawler cannot follow hyperlinks that are constructed dynamically through JavaScript (for example, in rollover hierarchical menus) or links on Macromedia Flash pages, because the crawler cannot perform mouse clicks as a human user would. This problem is not specific to the MyCommunityInfo search engine; search engine crawlers in general share the same limitation.

To overcome this problem, include a link on your main page to a "site map" page. The link to the site map must be a conventional hyperlink, and the site map page itself should contain conventional hyperlinks to the other pages on your site. If you use a Flash page as the initial page of your site, we will point the search appliance to begin crawling at the first page after your Flash intro that contains actual links to other pages.

How Do I Hide Pages on My Site from the Search Engine?

You or your web site administrator should maintain a text file named "robots.txt" in the root folder of your web site. The robots.txt file instructs search engines not to crawl specific files and folders on your site. It should look something like this:
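As a hedged illustration, the sketch below embeds a sample robots.txt and uses Python's standard urllib.robotparser module to check which paths a crawler identifying itself as gsa-crawler would be allowed to fetch. The folder and file names in the sample are assumptions chosen for the example, not requirements of MyCommunityInfo, and the helper function is hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the folder and file names are assumptions for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /_vti_bin/
Disallow: /_vti_cnf/
Disallow: /private/
Disallow: /drafts/old-page.html
"""


def is_crawlable(robots_txt, path, user_agent="gsa-crawler"):
    """Return True if the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(is_crawlable(SAMPLE_ROBOTS_TXT, "/index.html"))           # prints True
print(is_crawlable(SAMPLE_ROBOTS_TXT, "/_vti_cnf/index.html"))  # prints False
```

Running a check like this before publishing a robots.txt change is one way to confirm that the pages you want indexed remain reachable while the excluded folders are actually blocked.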
There will be a "Disallow" line for each folder or file on your web site that you do not want crawled. If you are using the FrontPage Server Extensions on your web server, your robots.txt file should disallow the _vti folders associated with the server extensions. The server extensions create cached copies of your pages in these folders, and those copies can produce confusing search results.

More information on how to construct a robots.txt file can be found at http://www.robotstxt.org/wc/exclusion.html

If you have other questions about the MyCommunityInfo search engine, contact: