WO2001075668A2 - Search systems - Google Patents

Search systems

Info

Publication number
WO2001075668A2
WO2001075668A2 PCT/GB2001/001149 GB0101149W WO0175668A2 WO 2001075668 A2 WO2001075668 A2 WO 2001075668A2 GB 0101149 W GB0101149 W GB 0101149W WO 0175668 A2 WO0175668 A2 WO 0175668A2
Authority
WO
WIPO (PCT)
Prior art keywords
search
user
data
database
web
Prior art date
Application number
PCT/GB2001/001149
Other languages
French (fr)
Other versions
WO2001075668A3 (en)
Inventor
Giles Chanot
Original Assignee
Dynamic Internet Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dynamic Internet Limited
Priority to AU40857/01A (AU4085701A)
Publication of WO2001075668A2
Publication of WO2001075668A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results

Definitions

  • This invention is generally concerned with software and systems for searching. More particularly it relates to systems for searching and cataloguing documents on networks such as the World Wide Web and to new interfaces to such systems.
  • a method for organising information is known from WO 99/06924 in which the search activity of a user is monitored and used to organise articles in a subsequent search by the same or another user who enters a similar search query.
  • US 5,748,954 refers to determining the popularity of a file according to how often a file is referenced by a computer other than the computer on which the file is stored.
  • US 5,974,455 uses a hash table and a sequential disk file to construct a search database.
  • US 5,983,218 describes a distributed (multimedia) database using a web server to select and co-ordinate information flow between database sites and user sites.
  • US 6,006,217 describes a method for providing enhanced search results in which a server retrieves a document from its home server and highlights matches to search criteria.
  • US 6,038,668 describes a networked catalogue search system in which a search engine forwards retrieved pages to an object oriented database distributed across a network of computers. A local portal retrieves pages through a web crawler.
  • US 6,078,924 uses collection agents to retrieve specific information without user intervention.
  • WO99/42935 describes a search system in which characteristic information for a search database is stored across a computer network.
  • An information collector comprises a plurality of collecting modules and user access to the system is via an interface server.
  • EP-A-0 982 672 describes an information retrieval system including a search assisting server having list data constructed using a list of identifiers for accessing information servers. In response to designation of a requested item the identifier corresponding to the requested item is searched for from the list data.
  • JP 11015856A describes a server for integrating databases including multimedia materials comprising a meta-server including a meta-database, a search agent for searching an objective database site by indexing, and an improving module for observing a response pattern from a database site corresponding to a user's enquiry and improving a calculation of a future site relation.
  • a distributed indexing/searching workshop held by the World Wide Web Consortium in May 1996, Massachusetts, USA provides background information on web spidering.
  • the web site www.webbuilde ⁇ nag.conf_ ⁇ upload/free/features/webbuilder/1999/udell/1999-07-20.asp purports to disclose an article in Web Builder Magazine of July 20, 1999 by Jon Udell which briefly refers to a distributed spidering process in which a number of software agents collect data for a search database.
  • The article invites comment on the idea of "pushing the work of spidering (but not indexing) out to ISPs and other hosts that serve large numbers of pages".
  • the present invention addresses these needs.
  • A server system for searching a network comprising: a search data store storing: a plurality of addresses of locations of objects accessible using the network; and search data including data relating to information content of at least some of the objects; a program store storing processor implementable instructions; a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- receive a search request from a user terminal; retrieve search result data from the search data store comprising one or more search result address for objects having an information content relevant to the search request; transmit the search result data to the user terminal; receive from the user terminal information relating to an object located at an address provided to the user terminal by the server system; and update the stored search data using the object-related information received from the user terminal.
  • the address provided to the user terminal by the server system may comprise one of the search result addresses or a search tax address (described below) or an address for spidering as a background task.
  • a plurality of addresses for a plurality of objects is provided to the user terminal.
  • the information relating to the object or objects at the address or addresses provided to the user terminal may comprise object content characterizing data such as a last modified date and/or checksum for a web page, or it may comprise object information content data such as indexed content data.
  • raw object data may be received from the user terminal, such as raw (i.e. unprocessed) web page data.
  • Updating the stored search data using information received from a user terminal relating to an object located at an address provided to the user terminal by the server system relieves the server system of much of the search and indexing work it would otherwise have to perform.
  • the reception of information from the user terminal is linked to use of the server system to process search requests from the user terminal which allows better use of network bandwidth and processing bandwidth as well as facilitating simplification of overall system design.
  • the search data store may reside on a single machine or may comprise a distributed data store.
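The request/report cycle described above (receive a search request, return result addresses, receive object-related information back, update the store) can be sketched roughly as follows. This is a minimal in-memory illustration, not the patented implementation: `SearchDataStore`, `handle_search`, `handle_report`, and the example URL are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class SearchDataStore:
    """Hypothetical in-memory stand-in for the search data store."""
    # address -> {"words": set of indexed words, "last_modified": str}
    objects: dict = field(default_factory=dict)

    def search(self, term):
        """Return addresses whose indexed words contain the term."""
        term = term.lower()
        return sorted(a for a, rec in self.objects.items() if term in rec["words"])

    def update(self, address, words, last_modified):
        """Replace the stored record for an address with client-reported data."""
        self.objects[address] = {
            "words": {w.lower() for w in words},
            "last_modified": last_modified,
        }


def handle_search(store, term):
    """Receive a search request and return the matching result addresses."""
    return store.search(term)


def handle_report(store, address, words, last_modified):
    """Accept object-related information from a terminal and update the store."""
    store.update(address, words, last_modified)


store = SearchDataStore()
store.update("http://example.com/a", ["spider", "index"], "2000-03-01")
results = handle_search(store, "spider")
# The terminal fetches a result address and reports what it found there.
handle_report(store, results[0], ["crawler", "index"], "2000-03-20")
```

After the report, the page no longer matches "spider" and instead matches "crawler", so the store reflects what the terminal saw without the server fetching the page itself.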
  • the instructions further comprise instructions for retrieving at least one search tax address from the search data store, transmitting this to the user terminal, and receiving back information relating to an object at the search tax address.
  • the search tax address is an address provided to the user terminal for the user terminal to process, but in general does not comprise one of the search result addresses. Thus, in effect, this additional address is a tax on the user terminal (or user) for allowing the terminal access to the search data store.
  • The search tax address or addresses may comprise an address or addresses which are to be processed by the user terminal in an on-going background spidering process or the search tax address or addresses may be provided to the user terminal in response to receipt of a search request from the user terminal on a per-search basis. In a preferred embodiment both background and per-search tax addresses are sent to the user terminal for spidering.
  • the search tax addresses are preferably selected according to a logical proximity of an object at the tax address to the user terminal. Such a logical proximity may be based upon the user terminal's IP address, or upon a proximity measure such as ping time or a count of a number of hops between the user terminal and the object at the tax address. Search tax addresses may also be selected dependent upon the network access bandwidth of the user terminal.
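One way to picture the proximity-based selection of search tax addresses is the sketch below. It assumes the server already holds a hop count and ping time for each candidate relative to the terminal; the function name `select_tax_addresses`, the tuple layout, and the `kbps_per_address` constant are illustrative assumptions, not details from the source.

```python
def select_tax_addresses(candidates, max_addresses, bandwidth_kbps,
                         kbps_per_address=28):
    """Pick the logically closest candidate addresses for one terminal.

    candidates: list of (address, hop_count, ping_ms) tuples measured
    relative to the terminal. The number of addresses returned is capped
    by the terminal's access bandwidth (kbps_per_address is an arbitrary
    illustrative constant).
    """
    budget = min(max_addresses, bandwidth_kbps // kbps_per_address)
    # Fewer hops first, with ping time as the tie-breaker.
    ranked = sorted(candidates, key=lambda c: (c[1], c[2]))
    return [address for address, _, _ in ranked[:budget]]


candidates = [
    ("http://far.example/page", 14, 180.0),
    ("http://near.example/page", 3, 20.0),
    ("http://mid.example/page", 3, 45.0),
]
chosen = select_tax_addresses(candidates, max_addresses=5, bandwidth_kbps=56)
```

Here a 56 kbps terminal is given two addresses, both three hops away, and the distant page is left for a better-placed terminal.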
  • the object information content data preferably comprises a list of words in the object and word rating data indicating the likely significance of the words to the object.
  • the server system may also be configured to receive user object preference data such as bookmark data indicating objects a user has bookmarked for access on later occasions.
  • two or more user terminals are sent the same object's address and the search data store is only updated once the result from a first user terminal has been checked against the data received from the second or further user terminals.
  • The system may also monitor users' IP addresses and/or users' traffic to detect fraud.
  • the invention also provides a search data store for the server system wherein an item of the object information content data, such as a keyword, is associated with a plurality of item location addresses for objects having an information content relevant to the item of object information content data; and wherein the item location addresses have an order corresponding to the relevance of the objects at the addresses to the item of object information content data.
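A store in which each keyword maps to addresses ordered by relevance could look something like the following sketch. The class name `KeywordIndex`, its methods, and the numeric relevance scores are assumptions made for illustration; the source does not specify how the ordering is maintained.

```python
import bisect


class KeywordIndex:
    """Maps each keyword to addresses kept in descending relevance order."""

    def __init__(self):
        self._entries = {}  # keyword -> sorted list of (-relevance, address)

    def add(self, keyword, address, relevance):
        entries = self._entries.setdefault(keyword, [])
        # Storing the negated relevance means ascending sort order puts
        # the most relevant address first.
        bisect.insort(entries, (-relevance, address))

    def lookup(self, keyword, limit=10):
        """Return up to `limit` addresses, most relevant first."""
        return [addr for _, addr in self._entries.get(keyword, [])[:limit]]


index = KeywordIndex()
index.add("spider", "http://example.com/low", 0.1)
index.add("spider", "http://example.com/high", 0.9)
index.add("spider", "http://example.com/mid", 0.5)
```

Keeping the list ordered at insertion time means a lookup is a simple slice, which suits a store that is read far more often than it is written.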
  • the invention provides a user terminal for searching a network, the user terminal comprising: a data store operable to store data to be processed; a program store storing processor implementable instructions; and a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- input a search request from a user; transmit the search request to a server system; receive search result data from the server system, the search result data comprising one or more search result address for objects having an information content relevant to the search request; retrieve from at least one address received from the server system object data for an object located at the received address; and transmit to the server system information relating to the object located at the received address derived from the retrieved object data.
  • the address received from the server system may be a search result address, a search tax address provided in response to a search request or a background search tax address, as described above with reference to the server system.
  • the search request itself may either be issued in a conventional manner using an internet or web browser, or the search request may originate from dedicated searching code running on the user terminal.
  • the object data retrieved by the terminal may comprise bibliographic data such as a last-modified date or more complete object data; the information transmitted to the server system may comprise the retrieved object data itself, for example where only bibliographic data is retrieved, or it may comprise the results of an object analysis procedure which has been executed on the user terminal.
  • the server system with which the user terminal communicates may comprise a single server or a set of interrelated servers.
  • the processor implementable instructions of one or both these systems may be provided on a data carrier or storage medium such as a hard or floppy disk, ROM or CD-ROM, or on an optical or electrical signal carrier.
  • the processor implementable instructions of the user terminal may be stored in the data store of a network server such as a web server, for example as part of a page of internet data such as a web page.
  • the invention also provides a corresponding method for searching a network using a client system, the method comprising: inputting a search request from a user; transmitting the search request to a server system; receiving search result data from the server system, the search result data comprising one or more search result address for objects having an information content relevant to the search request; retrieving from at least one address received from the server system object data for an object located at the received address; and transmitting to the server system information relating to the object located at the received address derived from the retrieved object data.
  • The invention provides a search system for a network comprising: a server coupled to the network; a plurality of user network-access means, couplable to the server via the network for providing a plurality of users with access to the network; a search database coupled to the server; an information collecting program accessible to each said user network-access means for running by said users; wherein said information collecting program is configured to, when running on a said user network access means, collect information relating to data stored at locations within the network and to pass at least a portion of the collected information to the search database; and wherein said locations are provided to the collecting program from the database in response to a search request sent by the collecting program to the server for search data from the database.
  • the search system may be part of a system providing a user's search service.
  • the network may be an Internet protocol network such as an Internet or an intranet and in what follows references to "web pages" are intended to include pages of information in internets and intranets other than the World Wide Web.
  • The user network-access means will typically be a personal computer, but network access can also be by means of a mobile telephone, Internet-enabled TV and other similar net-compliant devices.
  • the information collecting program is integrated into a web browser, for example, comprising part of an executable file of the browser.
  • The search database generally comprises data and a software interface thereto and may include associated data manipulation, processing and communication functionality.
  • URLs: Uniform Resource Locators
  • Information from the collecting program for the database could comprise a downloaded web page and/or a compressed or encrypted version thereof, or the web page after partial or full analysis for, for example, keywords and/or phrases, by the information collecting program.
  • the Internet data collected may include (but is not limited to) HTML data, XML data, DHTML data, SGML data, web page information, and audio, video, multi-media, web TV, game, file, financial and other information types.
  • the information collecting program will require some sort of "signature" to show that it can be trusted to read and/or write to a local user's hard disk and to access information on other servers. This is not, however, an essential aspect of the invention but depends, in part, on how the network is set up and the context (for example the browser type) within which the information collecting program operates.
  • the invention provides a method of updating a search system for a network, the system comprising: a server; a plurality of user network-access means, couplable to the server via the network, each for providing a user with network access; and a search database couplable to the server; the method comprising: running an information collecting program by a plurality of said users; collecting information relating to data stored within the network using the program; passing at least a portion of the information collected by said plurality of users to the search database; and updating the database using the collected information.
  • The user's access to or vote of approval for information provided by the search results is logged or registered in the database. Votes can then, for example, be counted so that the results of future searches can be presented or ranked in order of relevance as determined by users of the system.
  • Bookmarking is, in the context of an Internet search page, the marking of user-preferred pages so that they can be returned to at a later stage. More generally, bookmarking involves the storage of a location identifier, normally with some information concerning the site, page or data it locates, for example a title or description. Normally a user's bookmarks are specific to an individual user, but bookmarks can also be shared between users or within groups of users. In a preferred embodiment, when a site or web page or other network location is bookmarked this is registered as user approval for later ranking of search results, and where an access or vote counting system is implemented, additional weight can be given to bookmarked sites.
  • The invention also provides a program to, when running on a network: provide a user interface for searching the network; accept a user search request; pass a request to a search database, responsive to the user request; receive a search result having network data location information from the database; access, or request another program to access, the data location; and pass information from the data location back to the database.
  • the invention further provides a web browser application program to, when running, receive a URL from a server, at least partly download a web page at the URL, extract a portion of information from the web page, and send the information to a web searching database on the web.
  • the invention provides a web data collection system comprising a plurality of individual users each connected to the web and running a program to collect information on the contents of web pages and to report the information to a common database.
  • The invention provides a database for a network searching system comprising: a list of network resource locators; a list of search terms or term identifiers; and a list of ratings, each linked to at least one resource locator and one term or term identifier, a value of each rating being dependent upon access to or approval of a corresponding located resource by users of the searching system.
  • the invention provides a method of bookmarking resource locations in a network searching system, the system comprising a server coupled to a search database and means for remote access to the database by a plurality of users, the method comprising: providing to a user in response to a search request, search results from the database, the results being associated with corresponding resource locators; receiving from the user a request to bookmark a resource associated with a said result; storing, in the database, a corresponding resource locator coupled with user access control information for the user; whereby the resource is locatable by the user after bookmarking.
  • the invention provides a method of ranking results for a network search system, comprising: determining a first user's interest in a network resource by detecting whether the user stores the resource location for later access; and ranking a plurality of network resource locations provided as results for a search performed by another user, partly responsive to the first user's determined interest.
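Ranking that is "partly responsive" to other users' bookmarks might be sketched as a base relevance score plus a weighted bookmark count. Everything named here (`rank_results`, `base_scores`, `bookmark_weight`, the additive formula) is a hypothetical illustration; the source does not give a scoring formula.

```python
from collections import Counter


def rank_results(addresses, base_scores, bookmarks_by_user, bookmark_weight=0.5):
    """Order result addresses by base relevance plus a bookmark bonus.

    bookmarks_by_user maps a user id to the addresses that user stored
    for later access; each user counts at most once per address.
    """
    bookmark_counts = Counter(
        addr for marks in bookmarks_by_user.values() for addr in set(marks)
    )

    def score(addr):
        return base_scores.get(addr, 0.0) + bookmark_weight * bookmark_counts[addr]

    return sorted(addresses, key=score, reverse=True)


ranked = rank_results(
    ["http://a.example", "http://b.example", "http://c.example"],
    {"http://a.example": 1.0, "http://b.example": 0.9, "http://c.example": 0.2},
    {"user1": ["http://b.example"], "user2": ["http://b.example", "http://c.example"]},
)
```

In this example two bookmarks lift the second page above the first, while a single bookmark is not enough to move the third page up.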
  • the invention provides a method of providing a web user with a preview of a web page, comprising: locally caching at least part of the web page information; rewriting at least one link in the cached page to point to locally cached data; and displaying at least a part of the cached page.
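The link-rewriting step of the preview method can be illustrated with a small sketch. It assumes links appear as double-quoted `href` attributes and that a mapping from original URL to local cache path already exists; `rewrite_links` and `cache_paths` are names invented for this example, and a production version would more likely use a real HTML parser.

```python
import re

# Matches double-quoted href attributes only (a simplifying assumption).
HREF = re.compile(r'href="([^"]+)"')


def rewrite_links(html, cache_paths):
    """Point links at locally cached copies where one exists.

    cache_paths maps an original URL to the path of its cached copy;
    links to pages that are not cached are left unchanged.
    """
    def swap(match):
        local = cache_paths.get(match.group(1))
        return f'href="{local}"' if local else match.group(0)

    return HREF.sub(swap, html)


page = '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>'
preview = rewrite_links(page, {"http://example.com/a": "cache/0001.html"})
```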
  • the invention provides a user interface for a network browser or search system, comprising means to automatically download a plurality of documents or web pages, or parts thereof, indicated by displayable results provided to a user, by starting a corresponding plurality of processing tasks to be executed in parallel.
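Starting one download task per displayed result and running the tasks in parallel could be sketched with a thread pool, as below. The `prefetch` function and its `fetch` callable are assumptions introduced here so the sketch stays independent of any particular HTTP library; the demonstration uses a fake fetcher rather than real network access.

```python
from concurrent.futures import ThreadPoolExecutor


def prefetch(urls, fetch, max_workers=8):
    """Fetch every URL in parallel.

    fetch is any callable taking a URL and returning its content.
    Returns {url: content}, with None for downloads that failed.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        return dict(zip(urls, pool.map(safe_fetch, urls)))


def fake_fetch(url):
    """Stand-in for a real downloader; fails for one URL on purpose."""
    if "broken" in url:
        raise IOError("unreachable")
    return f"<html>{url}</html>"


pages = prefetch(["http://a.example", "http://broken.example"], fake_fetch)
```

A failed download is recorded as `None` rather than aborting the batch, so the remaining previews can still be shown.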
  • the invention provides a network search system comprising: means to store a search request input to the system by a user on a first occasion; and means to repeat the user's stored request automatically and to display the results of the request when the user accesses the system on a second, subsequent, occasion.
  • the invention provides a network search system comprising: a server coupled to a search database; a remote network access means including input means for a user to input a search request; means to provide an instruction from the database to the remote network access means to access and analyse information relating to a resource on the network and to report to the search database; and means to provide search results to the network access means in response to the search request, conditional upon the database receiving the report.
  • The invention provides a method for quality control of a database of search data for a network, comprising: instructing a plurality of client programs to gather information for the database from locations provided to the programs by the database; double checking a proportion of the gathered information by issuing identical or equivalent locations to two different client programs; determining whether the gathered information from the two client programs agrees to within a tolerance margin; and adjusting said proportion based on the results of said step of determining.
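The double-checking loop can be sketched as follows. The source does not say how agreement is measured or how the proportion is adjusted, so this example makes two assumptions that should be read as such: reports are compared as word sets using Jaccard distance, and the proportion moves by a fixed step. The names `agrees` and `adjust_proportion` are hypothetical.

```python
def agrees(report_a, report_b, tolerance=0.1):
    """Treat two word-list reports as agreeing if their Jaccard distance
    is within the tolerance margin (an assumed similarity measure)."""
    a, b = set(report_a), set(report_b)
    if not a and not b:
        return True
    distance = 1 - len(a & b) / len(a | b)
    return distance <= tolerance


def adjust_proportion(proportion, pairs, tolerance=0.1, step=0.05,
                      floor=0.01, ceiling=1.0):
    """Raise the double-check proportion after any disagreement, else lower it.

    pairs: list of (report_a, report_b) gathered from two different
    client programs for the same location.
    """
    if any(not agrees(a, b, tolerance) for a, b in pairs):
        return min(ceiling, proportion + step)
    return max(floor, proportion - step)


clean = adjust_proportion(0.10, [(["web", "search"], ["search", "web"])])
suspect = adjust_proportion(0.10, [(["web", "search"], ["casino", "prizes"])])
```

With these settings a clean batch halves the checking rate from 10% to 5%, while a mismatched pair raises it to 15%.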
  • The invention provides a stand-alone distributed web crawler to, when run: contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
  • the invention provides a system and method in which a signed Java applet performs a web-crawling function analysing web pages and posting the results of its web crawling operations to a system partly comprising a database.
  • Such a database-system may be built from scratch expressly for the purpose of serving the signed Java applet or it may comprise an existing system, with the potential addition of new schemes, tables, relations or other data structures which facilitate serving the applet.
  • an Internet based search engine which is installed on a server but operates in a distributed way in that it makes use of users' local PCs to update the search engine database.
  • the user accesses the search engine database from a local PC by means of a Java applet which may be downloaded from the search engine server.
  • This applet is run when a search is carried out and returns a list of web page URLs in a conventional manner.
  • the Java applet fetches the web page identified by the selected URL and checks the time stamp on the web page against the date of an entry for that URL in the search engine database.
  • the Java applet takes further action. It either sends or forwards a copy, preferably in a compressed form, of the web page data to the search engine or it strips out key words from the web page and forwards these to the search engine. In this way the search engine database is updated as users use the search engine. Effectively, the web crawler software is distributed across a large number of local users' PCs.
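The time-stamp check and the two forwarding options (a compressed copy of the page, or extracted key words) are sketched below in Python rather than as a Java applet. The function `build_update`, its parameters, and the payload layout are illustrative assumptions; the key word extraction here simply splits on whitespace and does not strip markup.

```python
import zlib
from datetime import datetime


def build_update(url, page_bytes, page_modified, indexed_at, send_keywords=False):
    """Return None if the database entry is current, else an update payload.

    page_modified: time stamp reported for the fetched page.
    indexed_at: date of the database entry for this URL, or None if absent.
    """
    if indexed_at is not None and page_modified <= indexed_at:
        return None  # entry is up to date, nothing to send
    if send_keywords:
        text = page_bytes.decode("utf-8", "replace").lower()
        return {"url": url, "kind": "keywords", "body": sorted(set(text.split()))}
    # Otherwise forward the page itself in compressed form.
    return {"url": url, "kind": "page", "body": zlib.compress(page_bytes)}


older, newer = datetime(2000, 1, 1), datetime(2000, 3, 1)
unchanged = build_update("http://example.com", b"hello web", older, newer)
changed = build_update("http://example.com", b"hello web", newer, older)
```

`unchanged` is `None` because the index entry is newer than the page, while `changed` carries a zlib-compressed copy that the server can expand with `zlib.decompress`.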
  • the Java applet is "signed", in other words, provided with a digital signature or certificate.
  • An applet which is signed in this way is "trusted" and is permitted access to other servers. This is useful as it facilitates the Java applet forwarding web page data from these other servers to the database search engine server.
  • The signature gives access to the local hard disc of the user's PC to, among other things, allow web pages downloaded from the other servers to be cached on the local hard disc for faster retrieval, previewing and viewing. Typically, when the system is first activated the user will be asked "do you trust the search engine provider?" before access to the local hard disc/other servers is confirmed.
  • Such digital signature/certification systems are provided by Verisign or other certificate authorities and use an RSA or other public key cryptography algorithm.
  • an additional signature capability is necessary for access to controlled parts of the web browser system and separate signatures may be required for NETSCAPE (Registered trade mark) and/or Microsoft Internet Explorer (Registered trade mark).
  • Other features include the provision of a scrolling list of search URL results (which is made possible by use of a Java applet) and a web page preview feature in which a reduced size version or reduced content version of a web page is displayed in a window when the user's cursor is momentarily held in position over a URL hyperlink.
  • the system effectively provides a distributed web crawler or web spider which uses a signed Java applet for network access.
  • the system can be run on some workstations and/or other hardware, and in one embodiment the applet occupies less than 100K bytes with approximately a further 1 Megabyte allocated to local disc caching of downloaded web pages.
  • the invention provides a web crawling system or applet to, when running, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
  • The purpose of such a web crawling Java applet is to crawl or spider the world wide web, that is to say, to contact a web page and then analyse its contents. Such a web page will not generally be hosted on the server from which the applet originates.
  • a Java applet is not permitted to access any server other than the server from which it originates. If the applet is signed however, that is to say, if it has been granted a digital certificate, it is permitted to access servers other than the server from which it originates. The applet contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user. The applet then downloads and proceeds to analyse the contents of that page.
  • When the applet has performed its analysis it uploads its findings, for storage and later access, to a server hosting a database system.
  • the findings may be uploaded in an encrypted form, or a compressed form, or an encrypted and compressed form.
  • An advantage of compressing the data prior to uploading it to the database system is that the time required to upload the data in a compressed form will be generally less than that required to upload the same data in an uncompressed form. Accordingly the applet's connection will be less busy and therefore the applet will have more bandwidth available for spidering.
  • the system has a graphical user interface (GUI).
  • a search term is submitted to the database system via the applet and the database system accordingly returns its findings to the applet which the applet then displays.
  • the GUI permits user interaction with the central database of a search engine.
  • the Java Applet Graphical user interface accepts from the user a word or phrase which the user wishes to submit to the search engine (Search Term Acceptance).
  • this will comprise a text box into which the user can type a search term or a voice recognition system into which the user can announce a search term.
  • the applet then submits that search term to a database system which has been specially constructed or adapted for this purpose.
  • the search term may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form.
  • After consulting its store of information relating to the search term, the database system returns its findings, or results, to the applet.
  • the findings of the database system may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form.
  • the applet decrypts or decompresses or decrypts and decompresses the data as appropriate and then presents the results to the user.
  • the database system may also download to the applet one or more URLs of web pages which it would like updated with a request that the applet contact the page represented by the URL, analyse the page, and upload its findings as described earlier.
  • The code which comprises the "web crawler" is not necessarily written in Java and therefore does not necessarily comprise an Applet. Moreover it is not necessary for the software to "crawl" in the sense of copying itself from computer to computer.
  • the method and system for crawling the web is preferably directly integrated into the code for the web browser, that is to say, the code for the web browser and the code for the crawler are in the same executable file.
  • a stand-alone distributed web crawler may comprise an executable file which when run may have only the very simplest interface consisting of a 'stop' button or other means of halting the execution of the program.
  • The executable file which comprises a browser incorporates a system and method which calls the executable file which comprises the stand-alone distributed web crawler.
  • the purpose of the stand-alone distributed web crawler is to crawl or spider the world wide web. That is to say, in the context of this description, the purpose of this crawler is to contact a web page and then analyse its contents then upload the analysis to a database-system.
  • the code which comprises the web crawler is preferably, but not necessarily, written in Java (Regd. T.M.) and does not necessarily comprise an Applet.
  • the stand-alone distributed web crawler contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user.
  • the stand-alone distributed web crawler then downloads and proceeds to analyse the contents of that page.
  • When the stand-alone distributed web crawler has performed its analysis, it uploads its findings, for storage and later access, to a server hosting a database system.
  • A system and method comprising software which accepts data from an applet as described previously and translates that data into a form, type, language or schema compatible with the form, structure or language of a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos, Euroseek (Registered Trade Marks)).
  • the data being sent from the signed Java (Regd. T.M.) applet will comprise either queries or the web page-analysis findings of the signed Java applet for inclusion in the database.
  • A system and method comprising software which accepts data from a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos, Euroseek (Registered Trade Marks)) and translates that data into a form, type, structure, style, language or schema compatible with an applet as described previously.
  • the data being sent or retrieved from the existing database to be processed by the system or method will comprise results pertaining to search queries returned in response to queries submitted to the existing database via a signed Java applet.
  • the data will typically, but not necessarily, be sent in a compressed and/or encrypted form.
  • the invention provides a database security system and method.
  • the uploaded data may be put in a holding data-structure or database in the database-system or may be placed in the database-system proper with a flag to indicate that the data has not yet been confirmed as valid.
  • This re-spidering can be repeated in the manner described and with each confirmation that the data is valid the degree of confidence that the data is represented by that URL increases, such that after a small number of re-spiderings the probability of the data being invalid is significantly reduced.
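  • By way of illustration only, the holding arrangement and repeated confirmation described above might be sketched in Python as follows. The class name `HoldingStore`, the in-memory dictionary, and the threshold of three confirmations are all assumptions made for this sketch; the description says only that a small number of re-spiderings is used.

```python
class HoldingStore:
    """Holds uploaded page data until enough re-spiderings agree with it."""

    def __init__(self, confirmations_needed=3):  # threshold is an assumption
        self.confirmations_needed = confirmations_needed
        self._records = {}  # url -> [data, confirmation count]

    def upload(self, url, data):
        record = self._records.get(url)
        if record is not None and record[0] == data:
            record[1] += 1  # a re-spidering confirmed the held data
        else:
            # New or conflicting data: hold it unconfirmed and start again.
            self._records[url] = [data, 0]

    def is_valid(self, url):
        record = self._records.get(url)
        return record is not None and record[1] >= self.confirmations_needed
```

  • In this sketch data is treated as valid only once independent uploads for the same URL have agreed the required number of times, and any conflicting upload resets the count.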
  • a search engine is able to determine which web pages are of the greatest interest to the user and is therefore able to return results ranked according to some criterion for relevance.
  • the invention provides a search system for returning results ranked according to relevance as determined by, for example, search term density and/or likelihood of user interest. There is thus also provided a method to determine the ratio of the number of appearances of a search term in a particular page to the size of that page.
  • a search term ratio or search term density of a web page can be defined as a ratio of the number of occurrences of the search term on a page to or divided by the size of that page.
  • the database system will comprise one or more tables or relations or other data structures in which each search term will be associated with the URL of each web page which contains that search term, and the search term density of that search term in that page.
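  • As a minimal sketch, the search term density defined above, and a table associating each search term with URLs and densities, might be computed as follows. Page size is measured here in words, which is an assumption (the description does not fix the unit), and the function names and dictionary layout are hypothetical.

```python
import re
from collections import defaultdict


def term_density(term, page_text):
    """Occurrences of `term` divided by page size (in words, an assumed unit)."""
    words = re.findall(r"[a-z0-9]+", page_text.lower())
    if not words:
        return 0.0
    return words.count(term.lower()) / len(words)


def build_index(pages, terms):
    """Map each term to (url, density) pairs for pages containing it."""
    index = defaultdict(list)
    for url, text in pages.items():
        for term in terms:
            density = term_density(term, text)
            if density > 0:
                index[term].append((url, density))
    # Keep each list ordered with the densest page first.
    for entries in index.values():
        entries.sort(key=lambda entry: entry[1], reverse=True)
    return dict(index)
```

  • Sorting each list by density when it is built means that a query for a single term can return its entries in order without further ranking work.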
  • One embodiment considers sentences beginning with 'What', 'Why', 'When', 'Where', 'How' or 'Who' and terminating with '?'. By compiling a directory of questions of this form associated with the URLs of the pages on which they appear, a directory of likely pages where the corresponding answers can be found is obtained.
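  • The question directory described above might be compiled along the following lines. This is a sketch only: the regular expression is a simple approximation of "a sentence beginning with one of these words and ending with '?'", and the function names are hypothetical.

```python
import re
from collections import defaultdict

# A capitalised question word, then any text up to the next '?', with no
# intervening sentence terminator.
QUESTION_RE = re.compile(r"\b(?:What|Why|When|Where|How|Who)\b[^.?!]*\?")


def extract_questions(text):
    """Return the question sentences found in a page's text."""
    return [match.group(0).strip() for match in QUESTION_RE.finditer(text)]


def build_question_directory(pages):
    """Map each question to the URLs of the pages on which it appears."""
    directory = defaultdict(list)
    for url, text in pages.items():
        for question in extract_questions(text):
            directory[question].append(url)
    return dict(directory)
```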
  • Figures 1a and 1b show a block diagram of an Internet search system according to an embodiment of an aspect of the invention.
  • Figure 2 shows a block diagram of a user's computer in an embodiment of the invention.
  • Figures 3a to 3c show a flow diagram of a user registration and background spidering process.
  • Figure 4 shows a flow diagram of a process for downloading a web page and applet from a web server
  • Figure 5 shows a flow diagram of a server process for the user registration and background spidering process of Figure 3;
  • Figure 6 shows a flow diagram of a search and spidering process on a user's computer
  • Figure 7 shows a flow diagram of a server process for the search and spidering process of Figure 6;
  • Figure 8 shows a flow diagram of a graphical user interface thread for a search process for a user's computer
  • Figure 9 shows dataflows in search and spidering processes according to an embodiment of an aspect of the present invention.
  • Figure 10 shows an exemplary graphical user interface for a search system according to an embodiment of the present invention.
  • Figure 11 shows an exemplary plurality of concurrently running program threads of the search and spidering process of Figure 6.
  • a user terminal 102 is connected to the Internet 114. Further user terminals 104, 106, and 108 are also connected to Internet 114, via LAN (local area network) 110, and Internet gateway 112. Connected to the Internet 114 are a plurality of sources of information, represented in Figure 1a by web servers 116a to e. Data for user searching and for system spidering is stored on web servers 116a to e.
  • the world-wide web which represents objects in HTML (hypertext markup language) format and transfers data via HTTP (hypertext transfer protocol).
  • the Internet also provides access to data via other protocols such as, for example, FTP (file transfer protocol) and Gopher.
  • a search and spidering system web server 118 is coupled to the Internet 114, via a firewall 117 for security.
  • the system web server 118 provides a search system (home) web page including a search applet, that is, a Java (registered trade mark) program for execution within a supporting web browser.
  • the system web server 118 is coupled to web page and applet code storage 120 within which the applet is stored as a signed jar (Java archive).
  • a digital signature authenticates the Java applet as originating from the search system service provider.
  • When the Java applet is downloaded to a user terminal, a window is displayed together with the name of the service provider and a certification authority, and the user is asked whether or not to trust content from the service provider.
  • the digital signature authenticates the origin of the Java applet as the service provider and the user is thus provided with sufficient information to enable the applet to be trusted.
  • Once the applet has been marked as trusted it is given extended permissions by the web browser which allow it to perform the functions described below, such as reporting indexed content data to the service provider.
  • Web browsers such as Microsoft Internet Explorer (registered trade mark) and Netscape Navigator (registered trade mark) automatically recognize a signed Java applet and implement such security procedures.
  • Providing a web page including a signed Java applet is the preferred implementation of the system, but in other embodiments other security arrangements may be employed.
  • the search system home web page is a static web page 122 comprising graphics and an HTML tag 124 including a URL (uniform resource locator) pointing to the Java applet in code storage 120.
  • Referring to Figure 1b, this also shows the web server 118 and code storage 120 of Figure 1a, together with a data collection server 122 and a query servicing server 124.
  • Each of servers 118, 122, and 124 has a separate URL.
  • the URL of web server 118 is accessed by a user's web browser to download the system home page; the URLs of servers 122 and 124 are accessed by the Java applet code running on the user's machine.
  • the data collection server 122 includes data collection code storage 122a and is coupled to a system data store 126.
  • the query servicing server 124 includes query serving code storage 124a and is coupled to a user data store 128, as well as to the system data store 126 for returning search results.
  • Some or all of the stored code and/or data may be stored on a removable storage medium, illustratively shown by disk 130.
  • the data collection server 122 manages data collection or spidering functions for the system and query servicing server 124 handles user queries.
  • search results are provided to a user together with a so-called "URL tax" of sites which the user's computer is to spider. For this reason query servicing server 124 is coupled to data collection server 122.
  • the system web server 118, data collection server 122, and query servicing server 124 may comprise computer programs implemented on dedicated machines or, as will be understood by the skilled person, two or more of these servers may be implemented on the same machine.
  • the system data store 126 preferably includes a list of all known URLs, although in practice at any one time the database will include URLs which are no longer in existence and will not include some new URLs.
  • the basis of such a list is obtainable from the authorities who are responsible for overseeing registration of domain names, such as Network Solutions Inc., although it may be necessary to combine lists of URLs obtained from two or more such authorities. Over time the list may be enhanced by server and user-based spidering as described later.
  • Embodiments of the system may include a subset of known URLs, for example to provide a language-based search facility, rather than attempt to include all known URLs.
  • Associated with each URL is status data including a time stamp indicating when the status data was last updated, a "date last modified" date, normally provided on web pages to indicate when the page was last modified, a checksum based on the web page data, and a web page file size.
  • the database also includes indexed content data for the web pages (also referred to as URL spidering data) as described in more detail below, and page rating data to provide one or more ratings of, for example, popularity, utility, and the like.
  • the system data store 126 may comprise, in one embodiment, of the order of 10^10 URLs and associated data stored in of the order of 1 TB of RAID (redundant array of inexpensive disks) storage.
  • the data store 126 may comprise a relational or object-orientated database, such as an Oracle or DB2 database, or it may comprise a proprietary database as described below. Data within the database is accessed by a user's search keyword although popular combinations of keywords may have their own entries. Taking into account the possibility of searching in a variety of languages, and searching for proper names and acronyms, provision for up to 10^7 keywords may be necessary.
  • each keyword has its own file comprising a list of URLs referencing that keyword.
  • This URL list is preferably ordered by default criteria so that retrieved search results are automatically provided in order of relevance.
  • the ordering of results where keywords are combined in a search term and the same URL appears under two (or more) keywords may, for example, be based upon the relative position of the URLs concerned in the two lists.
  • new indexed content is preferably inserted at an appropriate place within the relevant ordered list or lists.
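  • One way of ordering results for a two-keyword search from the relative positions of URLs in the two per-keyword lists is sketched below. Summing the two positions is an assumption chosen for simplicity (the description leaves the exact rule open), and the function name is hypothetical.

```python
def combine_ranked(list_a, list_b):
    """URLs appearing under both keywords, ordered by summed list position."""
    position_in_b = {url: i for i, url in enumerate(list_b)}
    scored = [
        (i + position_in_b[url], url)
        for i, url in enumerate(list_a)
        if url in position_in_b
    ]
    # Lower combined position means nearer the top of both lists.
    return [url for _, url in sorted(scored)]
```

  • Because each keyword list is already held in relevance order, this combination needs no access to the underlying page data.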
  • the data collection server 122 provides URLs for spidering to the Java applet running on a computer operated by a user of the search system and receives indexed content data back from the applet for storage in system data store 126.
  • aspects of this process and of system data store 126 are optimized by the system, preferably automatically. This self-optimization may be performed by the data collection code by, for example, making a small modification to a parameter and measuring any resulting change in system performance to determine whether the performance is improved or detrimentally affected by the modification.
  • Global parameters which may be modified by such a procedure include the number of keyword combinations having their own separate entry in system data store 126, the number of keyword files cached, and the length of time an unaccessed file is retained in a cache.
  • User ("client")-specific parameters include the number of URLs in each batch sent to the client for spidering, and the URLs selected for spidering, in particular their proximity to the user's URL - the user's "catchment area" - as described further below.
  • the client-specific parameters are preferably optimized separately during each session a user is logged-on, for example, to optimize use of available bandwidth to the user's (client's) computer.
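  • The self-optimization described above, in which a parameter is changed slightly and the effect on performance is measured, might be sketched as follows. The function name and the `measure` callback (which is assumed to return a score where higher is better) are hypothetical.

```python
def tune_parameter(value, step, measure):
    """Try a small change in each direction; keep whichever setting scores best."""
    best_value, best_score = value, measure(value)
    for candidate in (value + step, value - step):
        score = measure(candidate)
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value
```

  • Calling the function repeatedly, for example once per session for the batch size sent to a client, moves the parameter step by step towards a locally best setting and leaves it unchanged once neither direction improves the measurement.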
  • User data store 128 stores data relating to specific users or clients of the search system.
  • user data store 128 comprises user identification data such as a user number, a user name and password for accessing the system, a user e-mail address for marketing purposes and user (search) term data as described in more detail later with reference to the USER TERM table.
  • user data store 128 may also store a user internet address (which may be temporary or the address of a gateway).
  • the user term data includes a history of search terms frequently used by a user which can be employed, for example, to generate a news or update service and to alert a user to new websites in which they may have an interest.
  • the user data store 128 may also include a user rating, for example, a "blacklist" flag which can be used to exclude unwanted users from the system.
  • the data store also holds each user's normal IP address (this could be the IP address of a company gateway such as gateway 112 of Figure 1a), for catchment area-related searching as described later.
  • user data store 128 preferably includes BOOKMARK and FOLDER tables (described later) to store and organize a user's bookmarks.
  • the database may also store user settings data for storing users' preferences.
  • the user settings data defines the number of results returned by a search, an age cut-off for search result web pages, whether or not the user wishes to take advantage of the user search term storage facility, and if the user does request this facility, the frequency of news updates and an option for e-mail notification of updates.
  • FIG. 2 shows an example of a user's computer which, as illustrated, comprises a conventional, general purpose personal computer 200 suitably programmed.
  • Personal computer 200 comprises a pointing device 206, such as a mouse, a keyboard 208, and a display 210, all for providing a user interface.
  • An Internet interface 204 is provided for connecting the computer to Internet 114; this may comprise any conventional communications interface such as a modem or a local area network interface (which provides an indirect interface to the Internet).
  • the computer includes a processor 212 which loads and implements program code stored in permanent program memory 218, such as a hard disk drive. Data for use by program code running on the processor is stored in permanent data memory 216 (which again may comprise a hard disk drive) and a working memory 214 is provided for use by processor 212 during its operation.
  • the program code and data in memories 214, 216, and 218 may be stored on a removable storage medium, as illustrated by floppy disk 220. All the components of computer 200 are linked by computer bus 202.
  • Processor 212 loads and implements a web browser 212a such as Internet Explorer (registered trade mark) or Netscape Navigator (registered trade mark) and, optionally, an e-mail application (not shown).
  • a signed Java applet 212b also runs in computer 200.
  • applet code is also stored in working memory 214, together with a list of URLs for spidering, HTML files for web pages retrieved by the user's computer (either for indexing or, equivalently, as search results), indexed content data, and a list of search result URLs.
  • the list of search result URLs may also be stored in permanent data memory 216 together with, optionally, a list of the user's "favourite" bookmarked URL references.
  • the user's bookmarks are also stored in user data store 128 and the list of bookmarks and search results list are only updated if the user chooses to save this data locally.
  • Web browser 212a includes cryptography code to recognize the Java applet's digital signature and to display a certificate, together with a company name, offering the user a choice of whether or not to trust the service provider. If the "trust" option is accepted web browser 212a gives signed Java applet 212b extended permissions, for spidering web pages and reporting indexed content data to the service system provider. Permanent data memory 216 may store data indicating that applet code from the search system service provider is always to be trusted.
  • FIGS. 3a to 3c show a flow diagram of a user registration and background spidering process.
  • the flow chart illustrates steps performed by search/spidering applet code running on a user's personal computer 200.
  • the flow chart shows a background spidering process which runs continuously on computer 200, according to the available processing and communications bandwidth, when the user is not performing a search.
  • the process continues to run in the background during a search, although bandwidth limitations may cause the process to run slowly.
  • the process is a multi-threaded process; the flow chart shows steps in both a master (or control) thread and a spidering thread.
  • At step S300 the search system home page 122 and signed Java applet are downloaded from system web server 118 to a user terminal such as personal computer 200.
  • web page 122 includes a URL to the Java applet code, which is downloaded separately from the web page text and graphics. If the user has previously accessed the search system home page the applet code and, in some instances the web page text and graphics, may be locally cached on the user's machine. The search system may force an update of such locally stored applet code by, for example, changing the applet's file name.
  • the user's web browser 212a runs the downloaded applet 212b which, at step S304, establishes a socket connection with data collection server 122.
  • the socket comprises a bi-directional virtual connection between the applet and the data collection server.
  • the applet sends initialization data to the data collection server 122 comprising, for example, an applet version number.
  • the applet receives a list of URLs for spidering from the data collection server. Associated with each URL is a date retrieved from system data store 126 indicating the last date (and/or time) when the data in data store 126 associated with that URL was verified and/or updated.
  • Each URL also has an associated checksum, also retrieved from data store 126, calculated from the web page data pointed to by the URL.
  • the checksum is, in one embodiment, calculated using the entirety of the web page data including HTML tags, although in other embodiments data within HTML tags may be ignored.
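  • A page checksum of this kind, with the option of ignoring data within HTML tags, might be sketched as follows. The use of CRC-32 and of a regular expression to strip tags are assumptions made for illustration; the description does not name a checksum algorithm.

```python
import re
import zlib

TAG_RE = re.compile(r"<[^>]*>")  # crude tag matcher, adequate for a sketch


def page_checksum(html, ignore_tags=False):
    """CRC-32 of the page data, optionally with HTML tags removed first."""
    data = TAG_RE.sub("", html) if ignore_tags else html
    return zlib.crc32(data.encode("utf-8"))
```

  • With `ignore_tags=True` a change confined to markup leaves the checksum unchanged, so only a change in the visible text would trigger re-indexing.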
  • the applet may process each URL sequentially, downloading content from a first URL, indexing this and reporting back to the data collection server, and then processing the next URL. However, it is more efficient if the applet processes a plurality of URLs in parallel, for example, using a separate thread for each. Web pages from some URLs will download more quickly than web pages from others and a multi-threaded process facilitates making use of this. Thus, at step S310, the applet selects a first batch of URLs to be processed from the list of URLs received, for example the first ten URLs in the list, and starts a new thread for spidering each one.
  • Step S310 belongs to the master or control thread; step S312 is the first step of one of the new URL spidering threads created at step S310.
  • the control thread halts and waits.
  • At step S312 the URL spidering thread of the applet sends a URL header request to the URL it is processing, requesting header data from that URL.
  • the header data includes a "date last modified" - i.e. the date at which the web page was last updated, and web page summary data.
  • the applet receives URL header data from the URL to be processed and, at step S314, checks whether or not the header data includes a date-last-modified for the web page. If there is no date-last-modified the applet proceeds to step S318 in Figure 3b, otherwise the applet checks, at step S316, whether the date-last-modified is later than the URL date received from data collection server 122.
  • At step S338 the main control thread checks whether or not all the URLs received at step S308 have been processed. If they have not, the existing thread which has just finished processing its last URL (that is, the spidering thread of step S312 et seq.) is reassigned to a new URL to be processed (step S340) and the process then loops back to step S312. Otherwise, if all the URLs received from the data collection server have been processed, the applet requests a new list of URLs for processing from the data collection server at step S342. The main control thread then again reassigns the completed thread to a new URL and, again, the process then loops back to step S312.
  • the Java code handles signalling between the master/control thread and the URL spidering threads, enabling the control thread to detect when a spidering thread completes.
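  • The multi-threaded arrangement described above, in which a fixed batch of spidering threads is started and each thread is given a new URL as soon as it finishes, can be approximated with a worker pool. The following Python sketch is not the applet's Java code; `spider_one` stands in for the per-URL work of steps S312 onwards, and the names and the default of ten workers are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def spider_all(urls, spider_one, workers=10):
    """Process every URL using a fixed pool of reusable spidering threads."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker picks up the next URL as soon as it finishes its last.
        futures = {pool.submit(spider_one, url): url for url in urls}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results
```

  • Slow downloads then occupy only their own worker while the remaining workers continue through the list, which is the benefit the description attributes to processing URLs in parallel.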
  • At step S318 of Figure 3b, if the date the web page was last modified is later than the corresponding date in system data store 126, the applet URL spidering thread requests the full web page data from the URL, excluding any data such as graphics and included pages indicated by links within the page. Then, at step S320, the applet caches the downloaded web page in case the user should wish to preview the web page contents, as described later. This caching function is provided by the applet 212b rather than the web browser 212a.
  • At step S322 the applet calculates a checksum for the downloaded web page and, at step S324, checks whether the calculated checksum is equal to the checksum associated with the URL received at step S308 from the data collection server. If the checksums are the same the process continues at step S336 where the applet sends the URL (or a URL identifier) and the results of the date and checksum checks back to data collection server 122. The date is returned because the web page date-last-modified may have been updated without any change in the web page content. The process then continues at step S338, as described above.
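  • The date and checksum checks of the spidering thread might be summarised by the following sketch. The function name and the returned labels are hypothetical, and the outcome when the page's date is not later than the stored date (labelled "not-modified" here) is an assumption, since that branch is not spelled out above.

```python
from datetime import date


def spider_decision(header_date, stored_date, fetch_checksum, stored_checksum):
    """Decide what a spidering thread should do with one URL.

    `fetch_checksum` is a callable that downloads the page and returns its
    checksum; it is called only if the date check does not rule the page out.
    """
    if header_date is not None and header_date <= stored_date:
        return "not-modified"  # assumed outcome; no download is needed
    if fetch_checksum() == stored_checksum:
        return "report-unchanged"  # date moved on but the content is the same
    return "reindex"  # content changed: analyse and upload
```

  • Passing the download as a callable keeps the sketch faithful to the ordering described, in which the full page is requested only after the header check.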
  • If the checksums are found to differ at step S324, the applet then proceeds to analyse the web page contents and report back to the data collection server, which stores the results of the analysis in system data store 126. More particularly, the process continues at step S326 at which the applet stores links to other pages and sub-pages (frames) in the downloaded web page in working memory 214 for return to the data collection server with compressed indexed content data, as described later.
  • the applet compiles a list of all words on the web page except for HTML tags.
  • words are not restricted to dictionary words but include acronyms and, more generally, alphanumeric character strings. This is useful when searching for product numbers, specifications, invented names and the like.
  • the applet determines a word rating.
  • the word rating may be determined from one or more of word frequency, the relative font size of the word as compared with other text on the page, and the word's location, for example, whether it appears in a heading, a URL, a hypertext link, an HTML tag, or in some other location. Other conventional word rating methods may also be employed.
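  • A word rating that combines word frequency with word location might be sketched as follows using Python's standard HTML parser. The weights given to titles, headings and links are illustrative assumptions, as are the class and function names; font size is not considered in this sketch.

```python
import re
from collections import Counter
from html.parser import HTMLParser

WEIGHTS = {"title": 5, "h1": 4, "h2": 3, "a": 2}  # assumed, illustrative weights


class WordRater(HTMLParser):
    """Accumulates a rating per word: each occurrence adds its location weight."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.ratings = Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to (and including) the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        weight = max([WEIGHTS.get(t, 1) for t in self.stack] or [1])
        # Words are any alphanumeric strings, not only dictionary words.
        for word in re.findall(r"[a-z0-9]+", data.lower()):
            self.ratings[word] += weight


def rate_words(html):
    parser = WordRater()
    parser.feed(html)
    parser.close()
    return dict(parser.ratings)
```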
  • At step S334 the applet compiles compressed URL spidering data comprising URL identifying data, a current date (either from the user's personal computer 200 or, preferably, as supplied by the search system), a page checksum, the word list and word rating data for each word, a list of links from the page as stored by the applet at step S326, and a page file size.
  • the indexed content data in system data store 126 is drawn from this URL spidering data.
  • the applet compresses this URL spidering data and sends it to data collection server 122 for updating system data store 126.
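  • The URL spidering data listed above might be assembled and compressed as in the following sketch. JSON as the record format and zlib as the compressor are assumptions made for illustration, as are the field and function names; the description does not specify either.

```python
import json
import zlib


def pack_spidering_data(url, checked, checksum, word_ratings, links, size):
    """Assemble one URL's spidering record and compress it for upload."""
    record = {
        "url": url,
        "checked": checked,      # current date, ideally supplied by the server
        "checksum": checksum,
        "words": word_ratings,   # word -> rating
        "links": links,          # links found on the page
        "size": size,            # page file size in bytes
    }
    return zlib.compress(json.dumps(record).encode("utf-8"))


def unpack_spidering_data(blob):
    """Server-side inverse of pack_spidering_data."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```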
  • the URL spidering thread then halts while, at step S338, the control thread checks whether or not all URLs have been processed and, if they have not, the control thread reassigns the spidering thread to a new URL and the process begins again at step S312.
  • The function of the applet in downloading and indexing ("spidering") web page data has been described, but the applet is not restricted to downloading HTML data.
  • the applet also spiders data in Adobe (Registered Trade Mark) postscript (.pdf) format, as well as data in other formats.
  • the applet may also index content contained within multimedia documents, data files or other objects.
  • FIG. 4 shows a flow diagram of a process for downloading web page 122 and its associated applet from system web server 118.
  • web server 118 receives a request for the search system home page from web browser 212a of user's computer 200.
  • the web server then, at step S402, sends the text and graphics for the home web page to the user's browser.
  • the web browser determines, at step S404, whether or not the applet is cached in the computer's permanent memory 218 and, if the applet is cached, the process ends at step S410. If the applet is not cached (or, equivalently, if the applet's file name has been changed) at step S406 the web server 118 receives a request for the applet from the user's web browser, the applet having its own specific URL.
  • the web server then, at step S408, retrieves the applet from code storage 120 and sends the signed Java applet, as a signed JAR, to the web browser where the user is asked by the web browser whether or not to trust content (i.e. the applet) from the service provider.
  • the process then ends, again at step S410.
  • FIG. 5 shows a flow diagram for the background spidering process described with reference to Figure 3a, as implemented on data collection server 122.
  • At step S500 (which corresponds to step S304 of Figure 3a) data collection server 122 is contacted by applet 212b and a socket connection is established between a data collection server communication process thread and a background spidering thread of an applet running on user's computer system 200.
  • Each of the many user computer systems connected to the data collection server at any one time is allocated a separate socket connection and a separate process thread on the server.
  • the data collection server receives initialization data from the applet including, for example, a version number of the applet which the data collection server can use to select a data communications protocol and/or data format for communicating with the applet.
  • the data collection server receives a request for a URL list from the applet for background spidering and, at step S508, the data collection server determines the next URLs which are to be updated. This determination may be made based upon recency, popularity, proximity, or on some other basis.
  • a determination based upon recency may, for example, select for updated spidering those URLs for which the greatest time has elapsed since they were last updated or, additionally or alternatively, may include new URLs which have not been spidered.
  • a determination based upon popularity may be arranged to ensure that those URLs most frequently appearing in search results are most frequently checked and if necessary updated. Selection of URLs by proximity is described in more detail below. In some embodiments a combination of two or more of these criteria may be employed in order to determine which URLs are next to be sent to a user's computer for spidering to update the URL's records.
  • the process is entered at step S504 and, again, at step S506 the data collection server receives a request for a list of URLs to spider from the applet.
  • a list of the selected URLs is sent to the applet and, at step S512, the selected URLs are marked in system data store 126 as "pending", for example by means of a flag.
  • the "pending" flag indicates that a URL has been selected for updating but has not yet been updated and the selection (at step S508) preferably ensures that once a URL has been marked as pending it is not again selected for updating by a different user.
  • the "pending" flag has a timed expiry so that if no spidering results relating to that URL are received from a user's computer after a predetermined interval the URL is again made available for selection for spidering by the same or another user.
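  • The "pending" flag with timed expiry might be sketched as follows. The class name, the in-memory dictionary and the default one-hour timeout are assumptions; the description specifies only a predetermined interval. Times are passed in explicitly as numbers of seconds so that the behaviour is easy to follow.

```python
class PendingTracker:
    """Tracks URLs handed out for spidering so they are not handed out twice."""

    def __init__(self, timeout_seconds=3600.0):  # interval is an assumption
        self.timeout = timeout_seconds
        self._pending = {}  # url -> time at which it was marked pending

    def is_pending(self, url, now):
        since = self._pending.get(url)
        return since is not None and now - since < self.timeout

    def select(self, candidates, count, now):
        """Pick up to `count` URLs that are not pending and mark them pending."""
        chosen = [u for u in candidates if not self.is_pending(u, now)][:count]
        for url in chosen:
            self._pending[url] = now
        return chosen

    def complete(self, url):
        """Clear the flag once spidering results for the URL have arrived."""
        self._pending.pop(url, None)
```

  • A URL whose flag has expired simply becomes selectable again, so a client that disconnects without reporting does not block that URL indefinitely.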
  • the data collection server waits for spidering data to be received from an applet (corresponding to the data sent at steps S334 and S336 of Figure 3).
  • the data collection server receives URL spidering data from one of the many applets running on the plurality of users' computers which may be connected at any one time to the search system, at step S516.
  • a separate data reception process is started for each return from a system user so that in practice, at any one time, there will be a plurality of concurrent reception processes operating on data collection server 122. Such processes may be implemented in a conventional manner on data collection server 122 using Java.
  • the data collection server checks whether the received URL spidering data comprises indexed content data (corresponding to the data sent by the applet at step S334 of Figure 3) or merely bibliographic data (such as that sent at step S336 of Figure 3). If the received URL spidering data does not contain indexed content data the data collection server, at step S520, updates the bibliographic data for the relevant URL in system data store 126 with information indicating when the URL was last checked. This information is received from the applet and comprises a time stamp and, if available, a date-last-modified for the web page. The system then loops back to step S514 to wait for further spidering data from the same or another applet. Alternatively, where step S516 is implemented as a plurality of concurrent processes, the process receiving data from the applet halts or waits, although the socket connection to the applet remains open (since one process is allocated to serve each user's computer).
  • At step S524 (bibliographic) checksum data for the updated indexed content is also written into system data store 126.
  • the data collection server writes updated indexed content data into system data store based upon the word list and rating data received from the applet as described above with reference to step S334 of Figure 3. The process then again loops to step S514, waiting for further data from the user's applet.
  • a user or applet identifier such as a username, is also stored to indicate the origin of the new or updated indexed content data, to help detect and reduce the risk of fraud by, for example, unauthorised passing of data into the system data store.
  • URLs for updating by a user's computer's applet may be selected partially or completely on the basis of whether or not they are within a URL "catchment area" defining URLs of a selected or predetermined proximity to a user's effective IP address.
  • the system data store 126 stores a list of URLs for substantially every web page on the Internet to be covered by the search system. Some of these web pages are new and have never been spidered, and some may need checking for updates, for example, because they were last checked more than 24 hours previously.
  • the data collection server prioritises the URLs to be spidered according to how recently they were last checked and, starting with the least recently checked pages, URLs are sent to instances of applet 212b residing on the computers of users who are currently connected to the search system.
  • the URLs to check may be selected substantially at random, for example, for security reasons, to reduce the risk of biased or erroneous data being submitted to the database.
  • the spidering process can be made more efficient by selecting URLs a user's computer receives for spidering based upon the physical or logical connection of a user's computer to the Internet in relation to the physical or logical locations of the URLs to be spidered. More specifically, there are likely to be fewer bandwidth bottlenecks to locations on the Internet (or other network) which are close to the user's computer as compared with those which are more distant.
  • An Internet address comprises four 8-bit octets normally written in decimal notation, for example, 193.243.236.208.
  • a first portion of an Internet (IP) address defines a computer network and a second portion of the address defines a computer coupled to the network.
  • Computer networks are identified by network numbers and IP routers generally store a table of such network numbers together with corresponding IP addresses for gateways into the networks.
  • 193.243 may define the network number of a computer network.
  • an Internet address usually reflects the underlying physical structure of a computer network, at least to a degree.
  • Domain name servers translate between domains and Internet addresses typically by working down a tree from a root/top-level domain name server. The allocation of domain names is overseen by ICANN who appoint country-based domain name registrars.
  • IP addresses which have a good chance of being close to the IP address of a user's computer
  • an Internet Service Provider allocates an IP address to a user's computer when the user logs onto that ISP, the address being selected, often at random, from a range of IP addresses assigned to that Internet Service Provider.
  • the system merely has to identify a subset of URLs for which a selected number n of the candidate URL's IP address most significant bits match the corresponding most significant bits of the user's IP address. For example, if the IP address of a user's computer is 193.243.236.208 a candidate URL with an IP address of 193.243.233.128 may be considered within the user's catchment area because the first portions of these two addresses match.
  • n selected determines the catchment area of a user's computer. In a simple embodiment this could be fixed at, for example, 16 bits. In a more sophisticated example the number of bits may be selected according to the class of IP address.
  • Class A Internet addresses are reserved for large networks and use only the first octet for the network number (addresses 1 to 126); class B addresses are for standard size networks and use the first two octets for the network number; class C addresses are for small networks and use the first three octets for network numbers.
  • a subset of IP addresses based upon "proximity" may be selected using a so-called subnet mask, that is, a 32 bit number with selected (most significant) bits set to one. Where subnets are contiguous within a network each subnet can access the other subnets without passing traffic through other networks. Additional/alternative proximity determinations may be based upon Classless Inter-Domain Routing (CIDR).
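The catchment-area test can be sketched as a mask comparison on the n most significant bits. The function names below are hypothetical, and the class B and class C first-octet ranges (128 to 191 and 192 to 223) are the conventional classful ranges rather than figures taken from the text.

```python
import ipaddress


def network_bits_for_class(address):
    """Number of network bits implied by the classful address scheme."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    if 1 <= first_octet <= 126:
        return 8    # class A: first octet is the network number
    if 128 <= first_octet <= 191:
        return 16   # class B: first two octets
    if 192 <= first_octet <= 223:
        return 24   # class C: first three octets
    raise ValueError("address is not in class A, B or C")


def in_catchment_area(user_ip, candidate_ip, n_bits=16):
    """True if the n most significant bits of the two addresses match."""
    mask = (0xFFFFFFFF << (32 - n_bits)) & 0xFFFFFFFF  # 32-bit mask, top n bits set
    user = int(ipaddress.IPv4Address(user_ip))
    candidate = int(ipaddress.IPv4Address(candidate_ip))
    return (user & mask) == (candidate & mask)
```

With the fixed 16-bit setting, 193.243.233.128 falls inside the catchment area of 193.243.236.208, as in the example above. Selecting n by address class instead gives 24 bits for these addresses, and the same candidate then falls outside.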
  • a standard utility may be employed to determine the route datagrams take between two hosts and the "proximity" can be determined accordingly, for example by counting the number of hops in the route.
  • the hop count is a functionally significant measure of the distance between two computers connected to the Internet since a datagram may pass through a large number of different networks before reaching its destination, even when that destination is geographically close at hand.
  • the server gives each applet a list of URLs to ping and the applets report the ping times back to the server.
  • the server maintains a list of URLs it wants spidered together with an average ping time for each URL determined from the average of all previous applet pings to that site.
  • the server selects URLs for a particular applet depending on how important it is to spider that particular URL and the applet's ping time to that URL. So when an applet has a particularly short ping for a URL compared to the average ping time it receives an instruction from the server to spider it.
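One way the server-side bookkeeping for ping-based selection might look is sketched below. The names, the `ratio` threshold used to decide that a ping is "particularly short", and the numeric `importance` scores are all assumptions made for illustration; the text does not specify them.

```python
def record_ping(averages, url, ping_ms):
    """Fold one applet ping into the running average kept for `url`.

    `averages` maps URL -> (mean_ms, sample_count).
    """
    mean, count = averages.get(url, (0.0, 0))
    count += 1
    mean += (ping_ms - mean) / count  # incremental mean over all pings so far
    averages[url] = (mean, count)


def choose_urls_for_applet(averages, applet_pings, importance, ratio=0.5, limit=5):
    """Pick URLs this applet reaches much faster than the average applet.

    A URL qualifies when the applet's ping is at most `ratio` times the
    average ping; qualifying URLs are ordered by importance, then by ping.
    """
    candidates = []
    for url, ping in applet_pings.items():
        if url not in averages:
            continue
        mean, _ = averages[url]
        if ping <= ratio * mean:
            candidates.append((-importance.get(url, 0), ping, url))
    candidates.sort()
    return [url for _, _, url in candidates[:limit]]
```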
  • FIG. 6 shows a flow diagram of a search process implemented using an applet on a user's computer. This process operates in parallel with the URL download and spidering processes described with reference to Figure 3.
  • the user enters a search term into the applet running on the user's computer or, alternatively, a search term is selected from a historical list of previously conducted searches.
  • the search term may comprise a single keyword or a combination of keywords linked by logical operators such as "KEYWORD1 and KEYWORD2".
  • the applet sends the search request to query servicing server 124 and, at step S604, receives a list of search results and "tax" URLs back from the query servicing server.
  • Each search result in the list comprises a URL, preferably together with additional information such as a title and/or an indication of the content of the web page pointed to by the URL. This additional information may be retrieved during the distributed spidering process and stored in system data store 126 in association with its corresponding URL.
  • Both the search result URLs and the tax URLs each have an associated date and checksum, and optionally file size and language data. The search result URLs are flagged to differentiate them from the tax URLs.
  • the URLs which are stored in system data store 126 are organized in association with search term keywords and are ranked by their relevance.
  • the list of search results received by the applet is ordered by relevance and thus when the applet displays the list of search results, at step S606, these are simply displayed in the same order in which they have been provided to the applet.
  • the user may, optionally, re-order the displayed search results according to other criteria such as, for example, date.
  • At step S608 the applet identifies a first batch of URLs to begin spidering.
  • the spidering process is preferably carried out by a plurality of concurrently running URL spidering threads in a broadly similar manner to that described with reference to Figure 3.
  • the steps of Figure 6 from step S600 to step S601 are preferably steps performed by a master or control thread of the applet which, in a preferred embodiment, is a GUI thread which also manages the interface provided for a user by computer system 200.
  • some of the URL spidering threads are allocated to spidering search results and others of the threads are allocated to spidering URLs comprising the URL tax. For example where the applet creates ten URL spidering thread instances, five of these may be assigned to spidering search result URLs and five to spidering tax URLs.
  • the applet starts a new thread for each URL to be spidered.
  • the search result URLs to be spidered although selected initially by the applet are, indirectly, amendable by the user.
  • the applet detects which search results the user is viewing, for example by detecting result list scroll events, list re-ordering, and list item deletion, and controls the URL spidering threads accordingly to spider, for example, URLs being viewed, URLs in the order that they are being viewed, and to cancel spidering of deleted items.
  • the master or GUI thread effectively halts at step S610, waiting for events from the user such as the scroll events described above and, at step S612, spidering of a plurality of URLs commences (the process steps for only one of these spidering threads is illustrated in Figure 6).
  • an applet URL spidering thread requests a complete web page from the URL assigned to it for processing.
  • the spidering thread retrieves both header and text data on the web page but does not retrieve objects embedded within the page accessed via further URLs, such as sub-frames.
  • the applet URL spidering thread receives a data stream from the URL until reception is complete, when the process continues to step S618.
  • the spidering thread sends reception status data to the master GUI thread to indicate events such as "waiting for a response from the URL", "page downloading" and "time out and halt". This information may be used, for example by the applet, to optimize the balance between the number of threads assigned to search result URLs and the number assigned to tax URLs.
  • the URL spidering thread caches the downloaded web page for the GUI thread to display on request, and sends "download complete" status data to the GUI thread (step S618).
  • the GUI thread preferably displays an indication of the status of the web page download from each URL, for example as a traffic light to indicate "waiting", "downloading" and "ready".
  • no status data is provided to the GUI thread since, in general, the tax URLs are not displayed.
  • threads spidering search result URLs are given priority over threads spidering tax URLs so that, in effect, the taxation operates as a background process and has only a small impact upon the user's available bandwidth.
  • This prioritisation may be implemented straightforwardly using the Java Virtual Machine.
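The text attributes the prioritisation to the Java Virtual Machine, presumably through thread priorities. The sketch below substitutes a different and simpler technique, a shared priority queue drained by worker threads, to show the same effect: tax URLs are only taken up once no search result URLs are waiting. The function `spider_all`, the `fetch` callback and the worker count are hypothetical.

```python
import queue
import threading

SEARCH_RESULT, TAX = 0, 1  # lower number is served first


def spider_all(search_urls, tax_urls, fetch, workers=4):
    """Spider every URL, taking search result URLs before tax URLs."""
    jobs = queue.PriorityQueue()
    for seq, url in enumerate(search_urls):
        jobs.put((SEARCH_RESULT, seq, url))
    for seq, url in enumerate(tax_urls):
        jobs.put((TAX, seq, url))

    finished = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                _, _, url = jobs.get_nowait()
            except queue.Empty:
                return  # nothing left to spider
            fetch(url)
            with lock:
                finished.append(url)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return finished
```

With a single worker the completion order is exactly search results first, then tax URLs; with several workers the order within each group depends on how long each download takes.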
  • Steps S620 to S628 correspond to steps S326 to S334 of Figure 3.
  • the applet stores links on the retrieved web page pointing to other objects such as program code, graphics, other web pages, sub-frames and the like.
  • the thread compiles a list of all "words" on the page except for HTML tags and, at step S624, discards unwanted words from the list.
  • the thread determines a rating for each listed word and, at step S628, sends compressed spidering data to data collection server 122 in a corresponding way to step S334 of Figure 3.
  • At step S630 the spidering thread, which by then has processed the URL assigned to it, is reassigned to a new URL which may either be a search result URL or a tax URL.
  • the process then continues again at step S612. If the user has modified the search result list, for example as described above, event data is received from the GUI thread and one or more existing URL spidering threads are reassigned to spider new (search result) URLs, whether or not they have completed processing of the URLs initially assigned to them.
  • this shows a flow diagram of a computer program running on query servicing server 124 for providing search results to an applet running on a user's computer.
  • the query servicing server 124 receives a search request, including a search term or keyword, from an applet on a user's computer.
  • the query servicing server retrieves search result URLs from system data store 126, already ranked in the order in which they will be presented to the user. This is because, as has been described above, when indexed content data from URL spidering processes is written to system data store 126, it is written in order of relevance to an associated keyword.
  • where the search term comprises two or more keywords, search result URLs are retrieved in the manner which has already been described in connection with Figure 1.
  • the query servicing server 124 requests a list of tax URLs from data collection server 122. These tax URLs are preferably determined according to the same criteria as the background spidering URLs, as described with reference to Figure 5. In other embodiments tax URLs may be selected according to additional or different criteria from those used to select URLs for background spidering, for example to preferentially update the system data store with information relating to websites of businesses having a relationship with the search system service provider.
  • the query servicing server then sends both the search results and the tax URLs back to the applet, for display to the user, and for spidering.
  • search results and tax URLs are locally cached on the query servicing server 124 and transmitted to the user's applet in batches, to facilitate the applet's data handling and to make it easier for the search system to keep track of which URLs should be being spidered.
  • GUI graphical user interface
  • the GUI thread displays a list of search results in the order they are received from query servicing server 124.
  • the GUI thread awaits an event, for example initiated by a user.
  • events may include, for example, a modify result display event (such as a scroll event, re-order list event, or delete item event as mentioned above), a page preview event, a select item event, a bookmark item event and (not initiated by the user) a web page download status update event.
  • On receipt of a status update event from a URL spidering thread (step S804) the GUI thread, at step S806, displays updated status information for the relevant URL.
  • On receipt of a modify result display event from a user (for example, by operation of a scroll bar) at step S808, the GUI thread displays a modified list of search results and then, at step S810, sends data relating to the modify result display event to one or more URL spidering threads as necessary, for example to reassign spidering threads to process new URLs.
  • On receipt of a preview event (for example, by the user clicking on a preview region such as a URL title) at step S812, the GUI thread displays a simplified rendering of the downloaded web page, for example a text-only display in a supplementary window. If a hypertext link is selected (for example, by a user clicking on the link) the GUI thread, at step S814, opens a new browser window for displaying data from the selected URL. If the data at that URL has previously been cached by the applet, this cached version is displayed. After each event the GUI thread returns to step S802, to await the next event. As the skilled person will appreciate, the GUI thread is preferably able to process more than one event in parallel.
  • Figure 9 shows exemplary dataflows 900 for a user search process and for a background spidering process.
  • the terminal's web browser makes a URL request 902 to search system web server 118 and applet data 904 is downloaded to user terminal 102.
  • the user enters a search term into the graphical user interface provided by the applet and a query 906 comprising this search term is sent to query servicing server 124.
  • the query servicing server then returns a URL list 908 comprising search results and a URL tax to user terminal 102.
  • URL requests 910 are then issued to web servers 116a-e comprising web servers storing web pages indicated by the search results and web servers to be spidered in accordance with the URL tax.
  • the web page data 912 is then returned from these web servers to user terminal 102, where it is processed by the applet.
  • the compressed URL spidering data 914 resulting from the web page processing (comprising indexed content data) is then sent to the data collection server 122 for storing in the system data store 126.
  • Generally compressed spidering data from a plurality of web pages is reported to the data collection server, data from each page being reported by a separate thread running within the applet.
  • the data collection server 122 also provides a URL spidering list 916 to the user terminal 102.
  • This process first sends URL page header requests 918 to web servers 116a-e (which are merely exemplary of all the web servers connected to the Internet) and web page headers 920 are consequently returned to the user terminal. Then, where necessary, the background spidering process issues URL requests 922 for full web page data and these web pages 924 are then returned for processing. Compressed URL spidering data 926 is then reported to data collection server 122 in the same way as with the search and tax URL spidering process.
  • Figure 10 shows an exemplary graphical user interface 1000 for presentation to a user on personal computer 200.
  • the user interface comprises a conventional browser window 1002 within which a secondary window 1004 is provided by the applet's graphical user interface (GUI) thread.
  • This secondary window comprises a field 1006 for entering a search term and an adjacent (search) button 1008.
  • a window 1010 displays a list of search results including a field 1012 displaying a title and URL for each result.
  • a second field 1014 indicates whether or not a web page for the search result is locally cached on the user's computer and, if it is cached, on what date it was cached.
  • the field 1014 also includes an indication 1016 of the download status of a web page to indicate, for example, that the associated web page is not active, that no response has yet been received from the web server, that the page has been accessed but is not yet fully downloaded, and that the page has been fully downloaded.
  • a bookmark and relevance field indicates the likely relevance of a web page to the requested search term, and indicates whether or not the page has been bookmarked.
  • a scroll bar 1020 is provided to allow a user to scroll up and down the list of search results in a conventional manner.
  • a preview window 1022 displays a scrollable preview of the text within the web page.
  • the user must then authorise the applet to have full access to his machine's resources by clicking a Grant button on a window that appears.
  • the user is then invited to enter a query in the form of one or more keywords.
  • the applet contacts the system server which returns a list of URLs (Uniform Resource Locators i.e. web page addresses) related to the query, which the applet displays in a table.
  • the table shows the Date the page was last modified (according to the system database), the document Title, the URL (or, in other embodiments, just the domain), and a rating for the site.
  • When the applet receives a URL, it attempts to contact the site and download the page or document there, updating its status and dates columns as it does so.
  • document is used generally to include video, audio, text and multimedia files, games and other similar types of information.
  • When a page has been downloaded (several are downloaded simultaneously by utilising Java's inbuilt thread (multi-tasking) support) it can be previewed in a preview pane of rich text (i.e. colours, fonts, size, bold, italic but, in one version, with no images or audio) or alternatively an HTML frame, by hovering the mouse pointer over its entry in the list.
  • the user is able to see which ones have been contacted and cached, which ones are still pending and which ones have moved or been deleted. Once cached, the user can see the size of the page, and by moving the mouse over it is able to preview it and get an idea of what the page consists of. At this stage the user may bookmark the site for later reference, or he may choose to view the actual page.
  • Viewing is performed by clicking on the entry in the list (or on the URL, if shown). This brings up another browser window, with its address bar disabled to prevent confusing the user because the page is displayed from a local cache (i.e. a file on the hard disc) rather than a web location. Whilst viewing the full page, the user is free to follow links as in a normal browsing session, although any links followed take the user to actual web pages rather than local cache files. Having viewed the page, the user may then decide to bookmark it. Bookmarking is the "marking" of a location so that a user can return to it, for example by storing a reference to the location in a folder.
  • Bookmarking is preferably performed by clicking on the checkbox next to the entry in the list. Note that if the user has previously bookmarked this exact URL (either for this query or a different one, whether in this session or a previous session) then the entry in the list will already have its checkbox checked.
  • When the user bookmarks a particular site the act of bookmarking advantageously casts a vote or recommendation for that page.
  • There may be two forms of bookmarking, the stronger marking a persistent interest in a site, and the weaker marking an article to be read later.
  • this distinction can be automatically deduced by the applet observing when the user subsequently views a page via the bookmark from a Bookmark Viewer.
  • the system uses these votes to assign a two-fold recommendation-rating or score for the URL.
  • the first is with regard to the search terms used in the query and the second is a general vote for the page as being of high quality.
  • the results he receives are ranked by other users' bookmark-recommendations. This process happens in real-time, so a popular new site can be highly ranked very quickly.
  • the order of the pages returned for a particular query is by votes with regard to this or similar queries, but the "general quality" vote is also displayed alongside each page.
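The two-fold recommendation score can be illustrated with a small in-memory vote store. The class `VoteStore` and its method names are hypothetical, and the sketch handles only exact query terms; it does not attempt the "similar queries" matching mentioned above.

```python
from collections import defaultdict


class VoteStore:
    """Keeps per-term votes and a general quality vote for each URL."""

    def __init__(self):
        self.term_votes = defaultdict(lambda: defaultdict(int))
        self.quality_votes = defaultdict(int)

    def bookmark(self, url, search_terms):
        """Register one bookmark: a vote per search term plus a general vote."""
        for term in search_terms:
            self.term_votes[term][url] += 1
        self.quality_votes[url] += 1

    def ranked(self, term):
        """Return (url, term_votes, quality_votes), most term votes first."""
        votes = self.term_votes.get(term, {})
        ordered = sorted(votes, key=lambda url: (-votes[url], url))
        return [(url, votes[url], self.quality_votes[url]) for url in ordered]
```

Because each bookmark updates the counts immediately, a newly popular page moves up the ranking as soon as the votes arrive.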
  • the applet may have a number of other features, including, a Bookmark Viewer allowing users to reorganise their bookmarks, check for out of date ones, and view the bookmarked pages, which preferably registers a vote for the page with the system, and also indicates to the applet that this bookmark is to be more highly rated amongst the user's list.
  • the applet may also offer the user 'today's favourite query' and 'today's favourite site'.
  • this comprises a central repository of data which indexes the web. Its function is to collate information collected by its clients and to service requests from clients to access this data in an ordered fashion. Preferably, it does not perform the collection or crawling itself.
  • This server consists of three main subsystems: a database, a data collection subsystem and a query request servicing subsystem.
  • the database stores data relating to what pages exist on the web, what keywords (or more generally, terms i.e. words or phrases) are associated with these pages, and how each page is ranked according to each term. It also contains user data including their bookmarks.
  • This database may be implemented as a standard relational database, or as a custom data structure.
  • the data collection subsystem or process is the recipient of processed and compressed data prepared by the applets relating to new and updated pages. This data is incorporated into the database, replacing anything which is out of date. An important part of this process is that it is done in real time, i.e. the database is constantly kept fully up to date.
  • This subsystem is also responsible for accepting user votes for sites in terms of bookmarking. The bookmarks are inserted into the user's database entry and a vote is registered linking the site with a search term.
  • the query request servicing subsystem answers queries from users, essentially of the form "most relevant pages for <keyword(s)>", which are submitted via the applet.
  • the response consists of a list of URLs, ordered with the most relevant first, which is fed back into the applet for display to the user. Associated with this response is also one or many URL tax items (see below), which the applet must check and report back to the system.
  • This process preferably has a very high performance as volumes of requests are typically measured in millions per day. Achieving this performance is helped by the fact that entire web pages do not need to be served for each request; instead a compact list of URLs which the applet can display in a user-friendly fashion will suffice.
  • the applet is signed, which means that a digital certificate or signature has been applied to the binary file which comprises the applet such that an end user may be confident of where the applet originates from. This instructs the browser to provide the user with the option of marking the applet as trusted.
  • Java (Regd. T.M.) applies to applets.
  • Java Virtual Machines
  • an applet is only permitted (by, e.g., a browser) to make a network connection to the server from which it was downloaded. This could prevent the functionality in the system applet of going to different websites and downloading their pages.
  • By signing the applet, and prompting the user to mark the applet as trusted, these restrictions are lifted.
  • RMI Remote Method Invocation
  • socket-based and HTTP-based versions of the system protocol are implemented so that users who are able to make use of the more efficient sockets version can do so.
  • Java (Regd. T.M.) has inbuilt and efficient support for threads, that is, the ability to set multiple processes running concurrently within the same program (i.e. multi-tasking). This provides a number of advantages. To download say 10 web pages simultaneously, using threads, there is no need to manage switching between pages as packets arrive randomly across the Internet. Since the JVM automatically allocates CPU resources to each of the threads, if one is held up waiting for the network to provide data, then the CPU is freed to work on another task. In the case of the system applet, whilst it is waiting for a response from the website it is attempting to download a page from, another thread can be processing a previously downloaded page, thus optimising usage of available resources.
  • Each page being downloaded preferably also gets a thread to optimise the trade off between network and CPU. This includes URL tax pages (described below).
  • the thread that downloads the page will then, having updated the GUI, begin processing the page if necessary.
  • This thread will then pass information stating that the page has not been modified since last checked, or a fresh analysis of the page as required.
  • information is sent to update the system on a page by page basis as and when it is available.
  • This is advantageous as the applet might be terminated at any stage by the user closing his browser or moving to another page.
  • When the applet receives a list of URLs, it displays them on the GUI in the order received, which is ranked in descending order of relevance as determined by the system using its relevance and voting data. Additionally or alternatively, the results can be ranked using data from the local user only. The applet then begins to contact each of the sites in the list and to download from them. It starts at the top of the list, downloading the pages most likely to be what the user is looking for.
  • Figure 11 illustrates a plurality of concurrently running threads of the search, spidering and user interface aspects of the applet, showing different stages of page downloading and processing.
  • a single thread is responsible for downloading one web page and, if necessary, processing the web page data and transmitting the result back to the system.
  • a GUI thread 1100 is also shown.
  • a spidering thread first waits 1110 for a response from a web server then downloads 1120 the web page and finally processes and transmits 1130 indexed content data back to the search system server.
  • an exemplary slow thread 1102 and fast thread 1104 are shown together with a thread 1106 which does not receive a response from the web server and times out.
  • the exact number of threads which results in optimal performance can be determined by empirical means but it is typically of the order of 10.
  • the applet may measure its own performance to optimise the number of threads it creates. This is useful as many factors affect the optimal number, e.g. network bandwidth availability, CPU availability, JVM use and physical memory availability.
  • a further preferred feature that improves the performance from the user's point of view is that the applet monitors the user scrolling through the list of URLs to ensure that it concentrates on items currently visible in the scrolling window. For example, say ten items are visible in the list without scrolling. If the user quickly scans the first ten items, determines (perhaps by looking at the titles) that he is not interested without waiting for the preview to appear, and scrolls on to the next ten items, then the applet operates to focus on getting those items downloaded. The applet preferably continues to download the previous items in case the user decides to return to them, but places a higher priority on the currently visible items. In this way the applet is seen to keep in touch with the user and is able to present previews and cached copies of the pages with a minimal delay.
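The scroll-driven prioritisation might be sketched as a reordering of the pending download list each time the visible window changes. The function name and parameters are assumptions; the sketch keeps off-screen items in the queue, behind the visible ones, as the text prefers.

```python
def download_order(result_urls, first_visible, visible_count, done=()):
    """Order pending downloads so that currently visible results come first.

    `first_visible` is the index of the top row shown in the result list and
    `done` holds URLs that have already been downloaded.
    """
    done = set(done)
    last = first_visible + visible_count
    visible = [u for u in result_urls[first_visible:last] if u not in done]
    others = [
        u for i, u in enumerate(result_urls)
        if not (first_visible <= i < last) and u not in done
    ]
    return visible + others  # off-screen items stay queued at lower priority
```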
  • As the applet downloads pages, it stores them in a special directory on the local hard disc set aside for this purpose. This is facilitated because the applet is signed and trusted and therefore has access to system resources (such as the file system) that untrusted applets do not have (see above section on signing).
  • the cached HTML files on disc can also be used for the preview functionality (see below) to enable a full rendering of the page (utilising the browser HTML renderer) on a small scrollable pane rather than the simplified rendering performed within the applet itself.
  • the applet may provide this as a user option depending on which the user finds most helpful; this will typically be dependent on the network and CPU resources available to the given user.
  • In order to preview a page, the user simply hovers his mouse pointer over the list entry of interest. If utilising the browser's rendering engine, then the applet issues a command specifying the file to display, and whether to display it in a separate window or a particular pane. Preferably there is a pane as part of the system's home page that is set apart for this purpose; if a separate window is used then it preferably has its address bar and toolbar suppressed to save space (and to prevent confusing the user with the presence in the address bar of a local filename instead of the actual URL).
  • a Java (Regd. T.M.) pane is created within the screen area that belongs to the applet, and this is populated with a much-simplified representation of the HTML, which is purely text but which has some of the characteristics of the full HTML such as colours and font size. Again this pane may be placed in a separate window - the advantage of doing this is that the user has complete freedom in how he organises the layout.
  • the interface may provide a detach button which allows the fixed pane to become a separate window, with the option of re-attaching it.
  • the user clicks on the page title in the list, which highlights as the mouse rolls over it indicating that it can be clicked.
  • This preferably uses the same mechanism described above to display the full page, again from local cache if available; this is advantageously shown in a new browser window.
  • the user may then choose to follow links within the site, in which case un-rewritten URLs are followed taking the user to areas within the actual site and browsing may continue as usual.
  • the user is presented with the previous browser window containing the system applet, ready to continue looking for his page of interest.
  • When the applet has downloaded the text for a particular URL it begins processing it, but only if the system marked this URL as wanted for update, and if the page has a date modified later than the date the system has on record. If the page is wanted for update but has the same date modified as the system has on record, then this information is returned to the system so it knows that it does not have to check this page again for another period of time specified for updating.
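The decision described above reduces to a three-way choice, sketched here under assumed names. The returned strings are illustrative labels only, and treating a page with no recorded date as needing analysis is an assumption.

```python
from datetime import date


def action_for_page(wanted_for_update, page_modified, recorded_modified):
    """Decide what the applet does with a page it has just downloaded."""
    if not wanted_for_update:
        return "skip"               # the system did not ask for this page
    if recorded_modified is None or page_modified > recorded_modified:
        return "analyse"            # page is new or has changed: build a fresh analysis
    return "report_unchanged"       # tell the system the page need not be rechecked yet


# Example: the system's record is older than the page, so the page is analysed.
example = action_for_page(True, date(2001, 3, 1), date(2001, 2, 1))
```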
  • Those pages which have changed since last checked and which the system therefore requires a new analysis of are processed to determine the keywords present and to obtain a relevance ranking for the keywords based on relative frequency and significance within the page (e.g. large headings carry more weight). This processing is performed in a thread for each page, preferably the same thread that originally downloaded it.
  • the applet begins by building the word list for the page, that is it lists all the words that appear on the page, and assigns each word a rating based on its importance. This is determined by that word's frequency and whether it appears in the title, headings, links etc.
  • the applet has built into it (preferably in such a way that it may be dynamically updated without downloading the whole applet again) a list of words that are too common to be useful for searching except when searching by exact phrase. Preferably, there is also a list of words that are deemed inappropriate for allowing searching on. These words will also be discarded.
  • This simple message instructs the system that this page has been checked as required. The system notes this and then checks again in the period of time specified for updating.
  • the URL_ID is supplied by a server when it sends the list of URLs, and the system uses its own local clock to determine the DATE_CHECKED field for its database entry.
  • the user is identified by means of an HTTP cookie, a small piece of data which the web server sends to the browser and which the browser keeps a copy of between sessions.
  • the browser then automatically sends this back to the server whenever it visits that site again, thus enabling the server to identify the user and retrieve personal information for them.
  • regular users to register with the site. With this they choose a username and password and can then logon to the system from any computer connected to the Internet and retrieve their personal bookmarks.
  • For both types of user there is also the option of saving their bookmarks to local disc and of exporting them in a format (such as HTML) such that they can be imported into popular browsers such as Internet Explorer (Regd. T.M.) and Netscape (Regd. T.M.). Similarly, users can import bookmarks from their browsers into their system accounts.
  • bookmarks stored on the system server occupy disc space
  • the user is sent a warning email a week before this is going to happen, and if the user accesses the system within that week then the account becomes active again.
  • the email contains the user's username and password in case they have forgotten it and a link to the system that automatically logs them on, making it as simple as possible for the user to begin using the system again.
  • the email also contains an attached file, in portable HTML format, containing all the user's bookmarks so that even if the user's account is deleted, they still have a record of their bookmarks which can be used directly from the email, saved to local disc, imported into a browser or, if the user creates a new system account at some point in the future, imported back into the system.
  • In order to organise a user's often large number of bookmarks, there is preferably provided a special Bookmark Viewer or Bookmark Manager activatable by the user. This may be a pane on the side of the normal system applet view which may display contracted and expanded folder contents view formats. Bookmarks are preferably organised into Folders, which are hierarchical and can be expanded or contracted to show the next level down in the hierarchy, in the standard Windows (Regd. T.M.) tree diagram paradigm.
  • When a user bookmarks a site, it is placed in a folder based on the query that the user performed. This may be automatically made hierarchical based on multiple query terms.
  • Extra features in the form of a personal results functionality are available on subsequent visits for users who have already visited the system and performed one or more queries (see below).
  • the applet automatically selects a number of queries from the user's history, up to, for example, ten (this number is preferably user-configurable).
  • a search is then performed for each of the queries, returning only pages which are highly rated by other users and which are relatively new, e.g. less than 1 year, 1 month, 1 week or 1 day old. If there are no such sites for a particular query then nothing is shown.
  • These cut-off criteria are preferably user-configurable, for example in terms of votes cast or recency of modification.
  • a priority value is maintained for each query; this may be incremented whenever that query is run by the user.
  • a small header indicating which of the user's queries each set of pages is in response to may also be provided. Again, by hovering his mouse over the entry, a preview is shown to the user and by clicking on it the full page comes up in a separate window as normal. This effectively provides a "web magazine".
  • the queries chosen for such a Personal Results or web magazine page are preferably selected based on any or all of the importance of the query to the user which is determined by seeing how recently the user performed this query, how often the user performs the query, and how many bookmarks he has which relate to this query. It is also possible to allow the user to promote a query to a higher significance thereby guaranteeing its inclusion in the Personal Results search.
  • the new search results replace the Personal Results. However, these may be returned to at any stage (including on the user's first visit to the system) by clicking on a Personal Results button.
  • the queries that generate the Personal Results page preferably implement the usual spidering functionality of any other query, including the URL tax (see below). This is advantageous as people may use the system as their home page or often view their Personal Results without actually doing fresh queries, in which cases the system benefits from their spidering input as soon as they logon.
  • the system relies on the processing capability and bandwidth of its users, in particular, the applets running on their browsers.
  • a list of matching URLs is returned. If the system needs an update for any of these pages, it informs the applet, which returns to the system a fresh analysis of the page containing the word list and the set of URLs that the page links to.
  • each URL is assigned a unique compact ID of preferably five or more bytes. 5 bytes allows for a trillion unique URLs, but depending on the exact implementation 8 bytes may be simpler to store.
  • a field TODAYS_VOTES counts the number of times this page has been bookmarked today, so it is set to zero every 24 hours.
  • a RATING field is a score based on the number of bookmarks and other votes (i.e. people viewing the page from their bookmark viewer) which accumulates with time, although when pages are freshly analysed this number is reduced, for example by halving its value.
  • a term is used to mean a "word" (preferably with capitals and punctuation suppressed) or a set of words which are grouped together as a phrase, although preferably the order is not significant. In this context, "word" may include words in more than one language, proper nouns, and combinations of letters, numbers and other characters such as the AMD "3DNow!" trademark.
  • TERM_ID is a 4 byte integer allowing for 4 billion unique search terms.
  • TEXT is a word or phrase as a character string e.g. "clinton".
  • RATING counts the accumulated number of times this term is used for querying and TODAY is the RATING count for the last 24 hours.
  • a TERM table contains all unique words to be indexed against, and in addition it contains the most popular phrases that searches are performed against. The number of phrases stored is selected dependent on the system's resources.
  • a RATING table provides cross-references between particular pages and search terms that they are relevant to.
  • Data in a RATING field is preferably a combination of a static rating of a page with respect to a particular term, i.e. how often a particular word appears on a particular page and a dynamic rating i.e. based on bookmarked pages that users associate with a search term.
  • the static and dynamic ratings may be stored in separate fields.
  • a USER table holds important user-related information. An entry for each unique user is created the first time they visit the system site. As session tracking is performed using HTTP cookies without the need for the user to register and sign in, initially, USERNAME, PASSWORD and EMAIL fields are blank, and only a USER_ID is stored. This is what the cookie stored on the user's browser contains in order to identify the user on a return visit.
  • a USER TERM table stores queries that a particular user performs on a regular basis. Each user will have a number of entries in this table, preferably with an upper limit to prevent the table growing too large (therefore the primary key for this table is USER_ID + TERM). If the phrase or word appears in the master term table as on page 25, then a TERM_ID is referenced, thus saving space in the database; otherwise the term appears in full as a text string.
  • the priority field indicates the importance of this query; preferably it is automatically incremented whenever the user performs that query.
  • the value may be manually edited by the user to indicate when they are particularly interested in a query. This value is used when generating the automatic Personal Results page.
  • a BOOKMARK table stores the users' bookmarks.
  • the primary key for this table is USER_ID + URL_ID; this ensures that each user may only bookmark a given page once.
  • TERM indicates for which query a given page was bookmarked and this information is used for ranking pages with respect to search terms.
  • a FOLDER_ID is used by the Bookmark Viewer/Manager to organise the bookmarks hierarchically.
  • URL_ID signifies the web page address efficiently by providing a cross-reference to the URL table.
  • a FOLDER table enables bookmarks to be organised hierarchically into folders. By default when the user bookmarks a page a folder is created with the name formed by the search queries used, if not already existing for this user. The user can rename folders and create subfolders. If a folder is a subfolder of another folder then its PARENT_ID points to that folder, otherwise PARENT_ID is null, indicating that it is a top-level folder.
  • the database may be implemented on an RDBMS (Relational Database Management System) or on a proprietary or other data structure.
  • RDBMS Relational Database Management System
  • the dataset is also very large and simply structured.
  • a custom design data structure is one efficient and cost-effective solution.
  • each table consists of a file on disc. There is a portion of each table held in physical memory (i.e. cached) at all times. Only specific operations are allowed on the database; these are the ones that allow the data to be updated as information is received from the applets, and that allow queries to be performed against the database by the applets. Optimised C++ routines perform these operations on the cached portions of the tables and also keep the full disc versions up to date.
  • the interface to the system database comprises a Java (Regd T.M.) servlet, optimised for network operations, thus enabling a large number of applets to be simultaneously connected to the system.
  • Integration of the front and back-ends of the database is by implementing the C++ methods as Java Native Interface (JNI) methods, that is so that they comply with a standard interface allowing the servlet JVM to make direct method calls on them.
  • JNI Java Native Interface
  • DOS Denial Of Service
  • Denial Of Service attacks are a common problem for all Internet based services, systems and networks: hackers use the normal process of sending queries, but in such quantities that the server potentially cannot cope.
  • One way to address this potential vulnerability is to provide means to monitor the network traffic and the system server(s) itself to check that no such attacks are in progress and that the system is running smoothly and is coping with the number of interactions it is receiving. It is possible to provide means to track the trends in traffic on daily and weekly cycles, and to keep ahead of demand for the service as the system's popularity grows. Thus, if a DOS attack is launched, a sudden increase in activity beyond the normal cycles can be detected and defensive measures taken. These may involve blocking traffic from certain IP domains or addresses and if necessary closing down the system until the source of the attack has been identified and shut down. Similarly, means to monitor activity on the system site and data flowing into and out of the database can identify discrepancies that signify hacker attack, at which point appropriate measures can be taken.
  • Another potential risk is that of inaccurate information being systematically fed into the system database. For example, a person might attempt to boost the traffic to their website artificially by sending in large numbers of fictitious votes, or to reduce the traffic going to a competitor's site by sending in wrong analyses of pages.
  • a so-called dynamic page rating is obtained by counting votes that are cast whenever a user bookmarks a site, with the limit of one bookmark per user per page enforced by the database schema.
  • a solution to this problem is to provide means to monitor the IP address from which new users are originating. Many users, for example more than 10 (especially over a short space of time, for example 1 week, 1 day or 1 hour), originating from the same IP address can be detected and indicate a potential problem (especially if these users are indeed casting many votes for the same web page). Measures can then be taken to block this activity, for example by issuing all users from the same or a particular suspect IP address with the same USER_ID.
  • This technique can be termed multiple-spidering. This technique introduces a potential overhead into the system that could reduce the spidering power of the system. However, there are normally so many clients performing spidering on behalf of the system that this is unlikely to be a significant issue, and indeed provides a further guarantee of the quality of the data. It is nevertheless possible to restrict the impact of this overhead, by close monitoring of aspects of the system.
  • a small sample (say 1%) of pages may be multiple-spidered by default. If a significant number of these pages are rejected then this can be taken to indicate a problem of this kind. At this point (potentially automatically), the number of pages being multiple-spidered is gradually increased, up to the maximum of 100% so that the quality of the system's data is once more assured. At this point the malicious third party is defeated, no more contentious page analyses will be detected and the system can once again reduce the percentage of multiple-spidering back to the original low background level.
  • the product of the recommendation-rating for each word in the query is used to rank the URLs.
  • two copies of the term lists are maintained - one ordered by rank and the other ordered by URL_ID.
  • a word list associated with each URL is maintained. The best strategy for a given application can be readily determined by experiment.
  • the lists, and in particular the ordering of them, are advantageously updated whenever a query invokes it.
  • a flag on each list will indicate whether it is up to date with ordering.
  • each user-query also results in one or more 'processing-tax' URLs being sent to the user, unrelated to the user's query, but which the system wants updated.
  • a list of URLs may be sent to have their modified dates checked, and those that have changed are processed as described above.
  • the processing tax may have to be pitched as high as 33% or even 50% in order to achieve the level of spidering required but lower levels such as 15%, 10%, 5%, 2%, 1% or less can also be used. This can be tuned according to the needs of the system at a particular stage in its development as a web authority.
  • the tax rate can also be dependent upon a user's bandwidth, for example to levy a higher tax on "rich" users, i.e. those people with high speed connections.
  • Any pages which have been non-contactable for a certain number of days are preferably deleted from the list. Less highly-rated pages will thus have a shorter time-to-live than more important pages.
  • the time-to-live for a particular page is preferably determined by a function of that page's importance and the continuous amount of time it has been non-contactable.
  • the data collection and processing (web page spidering) procedures described above as being performed by the applet are also implemented in corresponding code on the data collection server (although there may be no need to establish a socket connection).
  • the data collection server may then begin to populate the database using its own spidering processes. Initially, there will be relatively few users, so the system will have time to do its own searches. Gradually however it will be able to afford to spend less time searching and will have to spend more time interacting with clients. Concomitantly as time goes on the central data collection server will need to do less and less spidering itself, instead relying on the resources of the growing numbers of users. There is thus an elegant trade-off between available CPU and bandwidth and the number of users. Furthermore as the size of the web expands exponentially in terms of number of documents so the number of potential system users spidering the web also grows exponentially.
  • Initial spidering is thus preferably done by the server itself, utilising the same optimised distributed mechanism that is used to manage the many client applets once the system site is up and running with a large set of regular users.
  • a special version of the applet is therefore provided which has no or a very limited GUI, which has the same interface to the server as the normal applet and which spends substantially all its time requesting long lists of pages that need checking and returning the results in batches.
  • This process is run on the server machine and any other available machines with permanent Internet connections.
  • One way to build up the initial list of URLs is to query DNS servers for the complete set of registered domain names, (for example, for .co.uk domains); alternatively this information may be purchased (for example from Network Solutions for top-level .com domains).
  • the present invention is also applicable to other networks such as intranets, extranets, local and wide area networks, WAIS (Wide Area Information Servers) -based networks and wireless networks.
  • WAIS Wide Area Information Servers
  • the invention is also applicable to mobile phone-accessed networks such as networks accessed by means of i-mode or WAP (Wireless Application Protocol).
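
The URL and BOOKMARK tables described in the list above can be sketched with an in-memory SQLite database. This is only an illustration under stated assumptions: the source prefers a custom data structure, and the column types, the lowercase names and the `make_db` and `add_bookmark` helpers are hypothetical. What it shows is how a composite primary key on user and URL lets the schema itself reject a second vote by the same user for the same page.

```python
import sqlite3

def make_db():
    """Create an in-memory database holding two of the described tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE url (
            url_id INTEGER PRIMARY KEY,
            address TEXT UNIQUE,
            todays_votes INTEGER DEFAULT 0,
            rating INTEGER DEFAULT 0
        );
        CREATE TABLE bookmark (
            user_id INTEGER,
            url_id INTEGER,
            term TEXT,
            folder_id INTEGER,
            PRIMARY KEY (user_id, url_id)
        );
    """)
    return db

def add_bookmark(db, user_id, url_id, term):
    """Record a bookmark; return True only if it counted as a new vote."""
    try:
        db.execute(
            "INSERT INTO bookmark (user_id, url_id, term) VALUES (?, ?, ?)",
            (user_id, url_id, term),
        )
    except sqlite3.IntegrityError:
        # The composite key rejects a repeat bookmark by the same user.
        return False
    db.execute(
        "UPDATE url SET todays_votes = todays_votes + 1, rating = rating + 1 "
        "WHERE url_id = ?",
        (url_id,),
    )
    return True
```

In this sketch a repeated call to `add_bookmark` for the same user and page leaves the page's rating unchanged, while a bookmark from a different user increments it.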

Abstract

The invention is generally related to search systems, such as systems for searching and cataloguing data on the Internet. The invention provides a server system for searching a network, the system comprising: a search data store storing: a plurality of addresses of locations of objects accessible using the network; and search data including data relating to information content of at least some of the objects; a program store storing processor implementable instructions; a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to: receive a search request from a user terminal; retrieve search result data from the search data store comprising one or more search result address for objects having an information content relevant to the search request; transmit the search result data to the user terminal; receive from the user terminal information relating to an object located at an address provided to the user terminal by the server system; and update the stored search data using the object-related information received from the user terminal.

Description

SEARCH SYSTEMS
This invention is generally concerned with software and systems for searching. More particularly it relates to systems for searching and cataloguing documents on networks such as the World Wide Web and to new interfaces to such systems.
The World Wide Web is expanding more quickly than the capacity of search engines to catalogue it and search engines are increasingly falling behind ('Accessibility of information on the web' by Steve Lawrence and C. Lee Giles, Nature, 400, 107, July 1999).
As of January 2000 the World Wide Web comprised more than one billion unique documents (http://www.inktomi.com/new/press/billion.html), and indexing new or modified web pages could take several months or longer.
A method for organising information is known from WO 99/06924 in which the search activity of a user is monitored and used to organise articles in a subsequent search by the same or another user who enters a similar search query.
US 5,748,954 refers to determining the popularity of a file according to how often a file is referenced by a computer other than the computer on which the file is stored. US 5,974,455 uses a hash table and a sequential disk file to construct a search database. US 5,983,218 describes a distributed (multimedia) database using a web server to select and co-ordinate information flow between database sites and user sites. US 6,006,217 describes a method for providing enhanced search results in which a server retrieves a document from its home server and highlights matches to search criteria. US 6,038,668 describes a networked catalogue search system in which a search engine forwards retrieved pages to an object oriented database distributed across a network of computers. A local portal retrieves pages through a web crawler. US 6,078,924 uses collection agents to retrieve specific information without user intervention. WO99/42935 describes a search system in which characteristic information for a search database is stored across a computer network. An information collector comprises a plurality of collecting modules and user access to the system is via an interface server.
EP-A-0 982 672 describes an information retrieval system including a search assisting server having list data constructed using a list of identifiers for accessing information servers. In response to designation of a requested item the identifier corresponding to the requested item is searched for from the list data. JP 11015856A describes a server for integrating databases including multimedia materials comprising a meta-server including a meta-database, a search agent for searching an objective database site by indexing, and an improving module for observing a response pattern from a database site corresponding to a user's enquiry and improving a calculation of a future site relation.
A distributed indexing/searching workshop held by the World Wide Web Consortium in May 1996, Massachusetts, USA (www.w3.org/Search/9605-Indexing-Workshop) provides background information on web spidering. The web site www.webbuildermag.com/upload/free/features/webbuilder/1999/udell/1999-07-20.asp purports to disclose an article in Web Builder Magazine of July 20, 1999 by Jon Udell which briefly refers to a distributed spidering process in which a number of software agents collect data for a search database. The article invites comment on the idea of "pushing the work of spidering (but not indexing) out to ISPs and other hosts that serve large numbers of pages".
Since the web is expanding more rapidly than the capacity of current search engines to catalogue it, a system and method is required in which inter alia the cataloguing of the web is performed more quickly than has hitherto been the case. There is also a demand for a search engine with a more comprehensive database than those of current search engines which will enable more complete results to be returned in response to a user's search query.
The present invention addresses these needs.
According to the present invention there is therefore provided a server system for searching a network, the system comprising: a search data store storing: a plurality of addresses of locations of objects accessible using the network; and search data including data relating to information content of at least some of the objects; a program store storing processor implementable instructions; a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- receive a search request from a user terminal; retrieve search result data from the search data store comprising one or more search result address for objects having an information content relevant to the search request; transmit the search result data to the user terminal; receive from the user terminal information relating to an object located at an address provided to the user terminal by the server system; and update the stored search data using the object-related information received from the user terminal.
The address provided to the user terminal by the server system may comprise one of the search result addresses or a search tax address (described below) or an address for spidering as a background task. Preferably, however, a plurality of addresses for a plurality of objects is provided to the user terminal. The information relating to the object or objects at the address or addresses provided to the user terminal may comprise object content characterizing data such as a last modified date and/or checksum for a web page, or it may comprise object information content data such as indexed content data. Alternatively, but less preferably, raw object data may be received from the user terminal, such as raw (i.e. unprocessed) web page data. Updating the stored search data using information received from a user terminal relating to an object located at an address provided to the user terminal by the server system relieves the server system of much of the search and indexing work it would otherwise have to perform. The reception of information from the user terminal is linked to use of the server system to process search requests from the user terminal which allows better use of network bandwidth and processing bandwidth as well as facilitating simplification of overall system design. As the skilled person will appreciate, the search data store may reside on a single machine or may comprise a distributed data store.
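
As an illustration of the object content characterizing data mentioned above, the following minimal Python sketch shows how a user terminal might decide from a last modified date whether a page needs re-analysis, and how it might compute a checksum to report. The function names, the return labels and the choice of SHA-256 are assumptions made for this sketch; the source does not name a checksum algorithm.

```python
import hashlib
from datetime import date

def page_checksum(page_text):
    """Checksum of the page text (SHA-256 is an assumed choice)."""
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def classify_page(wanted_for_update, recorded_modified, fetched_modified):
    """Decide what the terminal does with one address.

    Returns "ignore" if the server did not ask for an update,
    "analyse" if the page is newer than the server's record, and
    "unchanged" otherwise, so the server can postpone its next check.
    """
    if not wanted_for_update:
        return "ignore"
    if fetched_modified > recorded_modified:
        return "analyse"
    return "unchanged"
```

Only pages classified as "analyse" would go on to full keyword processing in such a scheme; for the others a short status message to the server would suffice.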
In a preferred embodiment the instructions further comprise instructions for retrieving at least one search tax address from the search data store, transmitting this to the user terminal, and receiving back information relating to an object at the search tax address. The search tax address is an address provided to the user terminal for the user terminal to process, but in general does not comprise one of the search result addresses. Thus, in effect, this additional address is a tax on the user terminal (or user) for allowing the terminal access to the search data store. The search tax address or addresses may comprise an address or addresses which are to be processed by the user terminal in an on-going background spidering process or the search tax address or addresses may be provided to the user terminal in response to receipt of a search request from the user terminal on a per-search basis. In a preferred embodiment both background and per-search tax addresses are sent to the user terminal for spidering.
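
A per-search tax of this kind might be attached to a search response as sketched below. The sketch assumes that the tax rate is expressed as a fraction of the number of result addresses and that at least one tax address is always sent; the function names and the dictionary layout of the response are hypothetical.

```python
import math

def tax_url_count(result_count, tax_rate):
    """Number of tax addresses to attach, never fewer than one."""
    return max(1, math.ceil(result_count * tax_rate))

def build_response(results, update_queue, tax_rate):
    """Combine search results with tax addresses awaiting an update.

    Addresses already present in the results are skipped, since a tax
    address in general is not one of the search result addresses.
    """
    wanted = tax_url_count(len(results), tax_rate)
    result_set = set(results)
    tax = [url for url in update_queue if url not in result_set][:wanted]
    return {"results": list(results), "tax": tax}
```

A server could vary `tax_rate` per user, for example raising it for terminals with high bandwidth connections.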
The search tax addresses are preferably selected according to a logical proximity of an object at the tax address to the user terminal. Such a logical proximity may be based upon the user terminal's IP address, or upon a proximity measure such as ping time or a count of a number of hops between the user terminal and the object at the tax address. Search tax addresses may also be selected dependent upon the network access bandwidth of the user terminal.
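
One crude way to approximate logical proximity from IP addresses alone is to count the leading bits that two IPv4 addresses share. This is an assumption made purely for illustration: the source does not specify how IP addresses are compared, and a shared prefix is only a rough proxy for ping time or hop count. The function names are hypothetical.

```python
import ipaddress

def shared_prefix_bits(ip_a, ip_b):
    """Count the leading bits two IPv4 addresses have in common (0 to 32)."""
    a = int(ipaddress.IPv4Address(ip_a))
    b = int(ipaddress.IPv4Address(ip_b))
    return 32 - (a ^ b).bit_length()

def nearest_hosts(terminal_ip, host_ips, limit):
    """Return up to `limit` hosts, longest shared prefix first."""
    return sorted(
        host_ips,
        key=lambda host: shared_prefix_bits(terminal_ip, host),
        reverse=True,
    )[:limit]
```

Tax addresses hosted on the selected machines would then be offered to that terminal in preference to more distant ones.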
The object information content data preferably comprises a list of words in the object and word rating data indicating the likely significance of the words to the object. The server system may also be configured to receive user object preference data such as bookmark data indicating objects a user has bookmarked for access on later occasions.
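
A word list with rating data might be built along the following lines. The weights given to titles and headings, the stop-word list and the `rate_words` name are all assumptions for this sketch; the source says only that frequency and position (title, headings, links and so on) contribute to a word's rating and that overly common words are discarded.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "to"}      # hypothetical list
WEIGHTS = {"title": 5, "heading": 3, "body": 1}   # hypothetical weights

def rate_words(sections):
    """Build a word list with ratings for one page.

    `sections` maps a section kind ("title", "heading" or "body") to
    its text. Capitals and punctuation are suppressed, stop words are
    dropped, and each occurrence adds the weight of its section.
    """
    ratings = Counter()
    for kind, text in sections.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in STOP_WORDS:
                ratings[word] += WEIGHTS[kind]
    return dict(ratings)
```

With these assumed weights, a word that appears once in the title and once in the body receives a rating of 6.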
To restrict the likelihood of fraud, in preferred embodiments two or more user terminals are sent the same object's address and the search data store is only updated once the result from a first user terminal has been checked against the data received from the second or further user terminals. The system may also monitor users' IP addresses and/or users' traffic to detect fraud.
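
The cross-checking of results from several user terminals could be sketched as below, together with a simple rule for adjusting the fraction of pages that are checked in this way. The agreement test on checksums, the doubling and halving steps, the 5% rejection threshold and the function names are all assumptions; the 1% background level and the 100% ceiling follow the figures given elsewhere in this document.

```python
def accept_update(reports, required=2):
    """Accept a page analysis only if enough distinct users agree.

    `reports` is a list of (user_id, checksum) pairs for one address.
    """
    users = {user_id for user_id, _ in reports}
    digests = {digest for _, digest in reports}
    return len(users) >= required and len(digests) == 1

def next_sample_rate(rate, rejected_fraction, threshold=0.05, base=0.01):
    """Raise the multiple-check rate under attack, relax it afterwards."""
    if rejected_fraction > threshold:
        return min(1.0, rate * 2)
    return max(base, rate / 2)
```

Two reports from the same user do not count as agreement in this sketch, which limits what a single malicious terminal can achieve.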
The invention also provides a search data store for the server system wherein an item of the object information content data, such as a keyword, is associated with a plurality of item location addresses for objects having an information content relevant to the item of object information content data; and wherein the item location addresses have an order corresponding to the relevance of the objects at the addresses to the item of object information content data.
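
The association between keywords and ordered address lists can be illustrated with a small in-memory index. In this sketch each term maps to a dictionary of URL identifiers and ratings, and a multi-word query is ranked by the product of the per-word ratings, as described elsewhere in this document. The dictionary layout, the tie-breaking by identifier and the `rank_urls` name are assumptions.

```python
from math import prod

def rank_urls(index, query_terms):
    """Rank URL identifiers matching every query term.

    `index` maps a term to {url_id: rating}. The score of a URL is the
    product of its ratings for each term; ties are broken by url_id.
    """
    postings = [index.get(term, {}) for term in query_terms]
    if not postings:
        return []
    common = set(postings[0]).intersection(*postings[1:])
    scores = {u: prod(p[u] for p in postings) for u in common}
    return sorted(scores, key=lambda u: (-scores[u], u))
```

A production store would keep each list pre-sorted rather than sorting per query; the sketch sorts on demand only to stay short.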
In a complementary aspect the invention provides a user terminal for searching a network, the user terminal comprising: a data store operable to store data to be processed; a program store storing processor implementable instructions; and a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- input a search request from a user; transmit the search request to a server system; receive search result data from the server system, the search result data comprising one or more search result address for objects having an information content relevant to the search request; retrieve from at least one address received from the server system object data for an object located at the received address; and transmit to the server system information relating to the object located at the received address derived from the retrieved object data.
The address received from the server system may be a search result address, a search tax address provided in response to a search request or a background search tax address, as described above with reference to the server system. The search request itself may either be issued in a conventional manner using an internet or web browser, or the search request may originate from dedicated searching code running on the user terminal. The object data retrieved by the terminal may comprise bibliographic data such as a last-modified date or more complete object data; the information transmitted to the server system may comprise the retrieved object data itself, for example where only bibliographic data is retrieved, or it may comprise the results of an object analysis procedure which has been executed on the user terminal. The server system with which the user terminal communicates may comprise a single server or a set of interrelated servers.
The processor implementable instructions of one or both these systems may be provided on a data carrier or storage medium such as a hard or floppy disk, ROM or CD-ROM, or on an optical or electrical signal carrier. The processor implementable instructions of the user terminal may be stored in the data store of a network server such as a web server, for example as part of a page of internet data such as a web page.
The invention also provides a corresponding method for searching a network using a client system, the method comprising: inputting a search request from a user; transmitting the search request to a server system; receiving search result data from the server system, the search result data comprising one or more search result address for objects having an information content relevant to the search request; retrieving from at least one address received from the server system object data for an object located at the received address; and transmitting to the server system information relating to the object located at the received address derived from the retrieved object data.
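
The steps of this method can be sketched as a single client round trip. To keep the example free of network code, the four operations are passed in as callables; `search`, `fetch`, `analyse` and `report` are hypothetical stand-ins for the terminal's communication and analysis routines and are not named in the source.

```python
def search_and_report(query, search, fetch, analyse, report):
    """Run one search and return object information to the server.

    search(query)         -> list of addresses received from the server
    fetch(address)        -> object data retrieved from that address
    analyse(object_data)  -> information derived from the object data
    report(address, info) -> transmits the information to the server
    """
    addresses = search(query)
    for address in addresses:
        object_data = fetch(address)
        report(address, analyse(object_data))
    return addresses
```

In a real terminal the retrieval and analysis would more likely run in background threads so that results are displayed to the user without delay.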
In another aspect the invention provides a search system for a network comprising: a server coupled to the network; a plurality of user network-access means, couplable to the server via the network for providing a plurality of users with access to the network; a search database coupled to the server; an information collecting program accessible to each said user network-access means for running by said users; wherein said information collecting program is configured to, when running on a said user network-access means, collect information relating to data stored at locations within the network and to pass at least a portion of the collected information to the search database; and wherein said locations are provided to the collecting program from the database in response to a search request sent by the collecting program to the server for search data from the database.
The search system may be part of a system providing a user's search service. The network may be an Internet protocol network such as an Internet or an intranet and in what follows references to "web pages" are intended to include pages of information in internets and intranets other than the World Wide Web. Typically the user network-access means will be a personal computer, but network access can also be by means of a mobile telephone, Internet enabled TV and other similar net-compliant devices. In one embodiment the information collecting program is integrated into a web browser, for example, comprising part of an executable file of the browser. The search database comprises generally data and a software interface thereto and may include associated data manipulation, processing and communication functionality.
In an Internet data locations are usually identified by URLs (Uniform Resource Locators), and in a preferred embodiment these are provided to the information collecting program from the database. However, URLs for collecting information could be obtained from another source. Information from the collecting program for the database could comprise a downloaded web page and/or a compressed or encrypted version thereof, or the web page after partial or full analysis for, for example, keywords and/or phrases, by the information collecting program. In an Internet, the Internet data collected may include (but is not limited to) HTML data, XML data, DHTML data, SGML data, web page information, and audio, video, multi-media, web TV, game, file, financial and other information types.
Often the information collecting program will require some sort of "signature" to show that it can be trusted to read and/or write to a local user's hard disk and to access information on other servers. This is not, however, an essential aspect of the invention but depends, in part, on how the network is set up and the context (for example the browser type) within which the information collecting program operates.
In a further aspect the invention provides a method of updating a search system for a network, the system comprising: a server; a plurality of user network-access means, couplable to the server via the network, each for providing a user with network access; and a search database couplable to the server; the method comprising: running an information collecting program by a plurality of said users; collecting information relating to data stored within the network using the program; passing at least a portion of the information collected by said plurality of users to the search database; and updating the database using the collected information.
In one embodiment the user's access to or vote of approval for information provided by the search results is logged or registered in the database. Votes can then, for example, be counted so that the results of future searches can be presented or ranked in order of relevance as determined by users of the system. There is preferably also provision for bookmarking, that is, in the context of an Internet search page, the marking of user-preferred pages in order that these can be returned to at a later stage. More generally bookmarking involves the storage of a location identifier, normally with some information concerning the site, page or data it locates, for example a title or description. Normally a user's bookmarks are specific to an individual user, but bookmarks can also be shared between users or within groups of users. In a preferred embodiment, when a site or web page or other network location is bookmarked this is registered as user approval for later ranking of search results, and where an access or vote counting system is implemented, additional weight can be given to bookmarked sites.
The invention also provides a program to, when running on a network: provide a user interface for searching the network; accept a user search request; pass a request to a search database, responsive to the user request; receive a search result having network data location information from the database; access, or request another program to access, the data location; and pass information from the data location back to the database.
The invention further provides a web browser application program to, when running, receive a URL from a server, at least partly download a web page at the URL, extract a portion of information from the web page, and send the information to a web searching database on the web.
In another aspect the invention provides a web data collection system comprising a plurality of individual users each connected to the web and running a program to collect information on the contents of web pages and to report the information to a common database.
In another aspect the invention provides a database for a network searching system comprising: a list of network resource locators; a list of search terms or term identifiers; and a list of ratings, each linked to at least one resource locator and one term or term identifier, a value of each rating being dependent upon access to or approval of a corresponding located resource by users of the searching system.
In another aspect the invention provides a method of bookmarking resource locations in a network searching system, the system comprising a server coupled to a search database and means for remote access to the database by a plurality of users, the method comprising: providing to a user in response to a search request, search results from the database, the results being associated with corresponding resource locators; receiving from the user a request to bookmark a resource associated with a said result; storing, in the database, a corresponding resource locator coupled with user access control information for the user; whereby the resource is locatable by the user after bookmarking.
In another aspect the invention provides a method of ranking results for a network search system, comprising: determining a first user's interest in a network resource by detecting whether the user stores the resource location for later access; and ranking a plurality of network resource locations provided as results for a search performed by another user, partly responsive to the first user's determined interest.
In another aspect the invention provides a method of providing a web user with a preview of a web page, comprising: locally caching at least part of the web page information; rewriting at least one link in the cached page to point to locally cached data; and displaying at least a part of the cached page.
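The link-rewriting step of the preview method can be illustrated with a minimal sketch. The function name `rewrite_links`, the `cache_map` dictionary (original address to local cached copy) and the regular expression are illustrative assumptions and are not taken from the specification; a practical implementation would more likely use an HTML parser.

```python
import re

# Assumption: only double-quoted src/href attributes are handled here.
LINK_ATTR = re.compile(r'\b(src|href)="([^"]*)"')


def rewrite_links(html: str, cache_map: dict) -> str:
    """Point every link whose target has been cached at its local copy."""

    def swap(match: re.Match) -> str:
        attribute, target = match.group(1), match.group(2)
        # Links with no cached copy are left pointing at the network.
        return f'{attribute}="{cache_map.get(target, target)}"'

    return LINK_ATTR.sub(swap, html)
```

The rewritten page can then be displayed from the local cache without further network requests for the cached resources.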
In another aspect the invention provides a user interface for a network browser or search system, comprising means to automatically download a plurality of documents or web pages, or parts thereof, indicated by displayable results provided to a user, by starting a corresponding plurality of processing tasks to be executed in parallel.
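By way of illustration only, starting one download task per displayed result might be sketched as follows. The names `fetch` and `prefetch_all`, the thread pool and the worker limit are assumptions made for this sketch; the specification requires only that the processing tasks execute in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url: str, timeout: float = 10.0) -> bytes:
    """Download one document; any network error propagates to the caller."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def prefetch_all(urls, fetcher=fetch, max_workers: int = 8) -> dict:
    """Download every result address concurrently and return url -> body."""
    urls = list(urls)
    if not urls:
        return {}
    workers = min(max_workers, len(urls))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields bodies in the same order as the input addresses.
        return dict(zip(urls, pool.map(fetcher, urls)))
```

Passing the download function as a parameter allows the same code to fetch whole pages or only parts of them.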
In another aspect the invention provides a network search system comprising: means to store a search request input to the system by a user on a first occasion; and means to repeat the user's stored request automatically and to display the results of the request when the user accesses the system on a second, subsequent, occasion.
In another aspect the invention provides a network search system comprising: a server coupled to a search database; a remote network access means including input means for a user to input a search request; means to provide an instruction from the database to the remote network access means to access and analyse information relating to a resource on the network and to report to the search database; and means to provide search results to the network access means in response to the search request, conditional upon the database receiving the report.
In another aspect the invention provides a method for quality control of a database of search data for a network, comprising: instructing a plurality of client programs to gather information for the database from locations provided to the programs by the database; double checking a proportion of the gathered information by issuing identical or equivalent locations to two different client programs; determining whether the gathered information from the two client programs agrees to within a tolerance margin; and adjusting said proportion based on the results of said step of determining.
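One possible reading of the determining and adjusting steps is sketched below. The representation of gathered information as keyword counts, the tolerance of 0.1, the step size and the direction of adjustment (fewer double checks after agreement, more after disagreement) are all assumptions made for illustration.

```python
def findings_agree(first: dict, second: dict, tolerance: float = 0.1) -> bool:
    """Compare two keyword -> count maps reported for the same location."""
    total = sum(first.values()) + sum(second.values())
    if total == 0:
        return True  # two empty reports agree trivially
    keys = set(first) | set(second)
    difference = sum(abs(first.get(k, 0) - second.get(k, 0)) for k in keys)
    return difference / total <= tolerance


def adjust_proportion(proportion: float, agreed: bool, step: float = 0.05,
                      low: float = 0.01, high: float = 1.0) -> float:
    """Assumed policy: check less after agreement, more after disagreement."""
    if agreed:
        return max(low, proportion - step)
    return min(high, proportion + step)
```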
In another aspect the invention provides a stand-alone distributed web crawler to, when run, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
In yet further aspects the invention provides a system and method in which a signed Java applet performs a web-crawling function analysing web pages and posting the results of its web crawling operations to a system partly comprising a database.
Such a database-system may be built from scratch expressly for the purpose of serving the signed Java applet or it may comprise an existing system, with the potential addition of new schemes, tables, relations or other data structures which facilitate serving the applet.
In the case of a pre-existing database, it may be necessary to incorporate a method of translating data being sent from the applet to the database-system into a form comprehensible to the database-system and a method of translating data being sent from the database-system to the applet into a form comprehensible to the applet. In either of these latter cases translation software may be incorporated into the applet, or incorporated into the database-system, or both.
The utilisation of a signed Java applet for web crawling also confers other advantages upon the search process.
Generally speaking, described herein is an Internet based search engine which is installed on a server but operates in a distributed way in that it makes use of users' local PCs to update the search engine database. The user accesses the search engine database from a local PC by means of a Java applet which may be downloaded from the search engine server. This applet is run when a search is carried out and returns a list of web page URLs in a conventional manner. However, when a user accesses one of the URLs identified by the search, the Java applet fetches the web page identified by the selected URL and checks the time stamp on the web page against the date of an entry for that URL in the search engine database. If the check shows that the web page fetched by the user is newer than the search engine database entry the Java applet takes further action. It either sends or forwards a copy, preferably in a compressed form, of the web page data to the search engine or it strips out key words from the web page and forwards these to the search engine. In this way the search engine database is updated as users use the search engine. Effectively, the web crawler software is distributed across a large number of local users' PCs.
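The time stamp check described above might be sketched as follows. The function names, the use of the HTTP Last-Modified header as the page time stamp and the dictionary layout of the update are assumptions; the described system performs this step in a Java applet, and Python is used here only for brevity.

```python
import zlib
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def page_is_newer(last_modified: str, entry_date: datetime) -> bool:
    """Compare a Last-Modified header with the (timezone-aware) database date."""
    page_date = parsedate_to_datetime(last_modified)
    if page_date.tzinfo is None:
        page_date = page_date.replace(tzinfo=timezone.utc)
    return page_date > entry_date


def build_update(url: str, body: bytes, last_modified: str,
                 entry_date: datetime) -> Optional[dict]:
    """Return a compressed copy of the page only when it is newer than the entry."""
    if not page_is_newer(last_modified, entry_date):
        return None  # database entry is already current; nothing to send
    return {"url": url, "last_modified": last_modified,
            "page": zlib.compress(body)}
```

The alternative mentioned in the text, stripping out key words and forwarding those instead of the page, would replace the `page` field with the extracted terms.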
Preferably the Java applet is "signed", in other words, provided with a digital signature or certificate. An applet which is signed in this way is "trusted" and is permitted access to other servers. This is useful as it facilitates the Java applet forwarding web page data from these other servers to the database search engine server. It is also preferable that the signature gives access to the local hard disc of the user's PC to, among other things, allow web pages downloaded from the other servers to be cached on the local hard disc for faster retrieval, previewing and viewing. Typically, when the system is first activated the user will be asked "do you trust the search engine provider?" before access to the local hard disc/other servers is confirmed. Such digital signature/certification systems are provided by Verisign or other certificate authorities and use an RSA or other public key cryptography algorithm. In some cases an additional signature capability is necessary for access to controlled parts of the web browser system and separate signatures may be required for NETSCAPE (Registered trade mark) and/or Microsoft Internet Explorer (Registered trade mark).
We will also describe a means for registering statistics on users' approval or use of a given site presented by the search engine in response to a search request. When a user looks at a URL voting statistics are generated, which relate to the search engine query. In a refinement when a user bookmarks a particular site extra weight is given to the user's vote for that site. Search results can thus be presented ranking by their relevance to the system's users.
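A minimal sketch of such vote registration is given below. The class name `VoteLog` and the particular weights (1 for a view, 3 for a bookmark) are assumptions; the text states only that a bookmark is given extra weight.

```python
from collections import defaultdict

VIEW_WEIGHT = 1.0      # assumed weight for looking at a URL
BOOKMARK_WEIGHT = 3.0  # assumed extra weight for bookmarking it


class VoteLog:
    """Accumulates per-query votes and ranks result URLs by them."""

    def __init__(self) -> None:
        self._scores = defaultdict(float)  # (query, url) -> score

    def record_view(self, query: str, url: str) -> None:
        self._scores[(query, url)] += VIEW_WEIGHT

    def record_bookmark(self, query: str, url: str) -> None:
        self._scores[(query, url)] += BOOKMARK_WEIGHT

    def rank(self, query: str, urls: list) -> list:
        # The sort is stable, so URLs with equal scores keep their order.
        return sorted(urls, key=lambda u: self._scores.get((query, u), 0.0),
                      reverse=True)
```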
Other features include the provision of a scrolling list of search URL results (which is made possible by use of a Java applet) and a web page preview feature in which a reduced size version or reduced content version of a web page is displayed in a window when the user's cursor is momentarily held in position over a URL hyperlink.
Thus the system effectively provides a distributed web crawler or web spider which uses a signed Java applet for network access. Advantageously the system can be run on some workstations and/or other hardware, and in one embodiment the applet occupies less than 100K bytes with approximately a further 1 Megabyte allocated to local disc caching of downloaded web pages.
In a still further aspect the invention provides a web crawling system or applet to, when running, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
The purpose of such a web crawling Java applet is to crawl or spider the world wide web. That is to say, the purpose of this applet is to contact a web page and then analyse its contents. Such a web page will not generally be hosted on the server from which the applet originates.
Ordinarily, a Java applet is not permitted to access any server other than the server from which it originates. If the applet is signed however, that is to say, if it has been granted a digital certificate, it is permitted to access servers other than the server from which it originates. The applet contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user. The applet then downloads and proceeds to analyse the contents of that page.
When the applet has performed its analysis it uploads its findings, for storage and later access, to a server hosting a database system.
The findings may be uploaded in an encrypted form, or a compressed form, or an encrypted and compressed form.
An advantage of compressing the data prior to uploading it to the database system is that the time required to upload the data in a compressed form will be generally less than that required to upload the same data in an uncompressed form. Accordingly the applet's connection will be less busy and therefore the applet will have more bandwidth available for spidering.
Preferably the system has a graphical user interface (GUI). Typically in this system a search term is submitted to the database system via the applet and the database system accordingly returns its findings to the applet which the applet then displays. The GUI permits user interaction with the central database of a search engine.
Preferably the Java Applet Graphical user interface accepts from the user a word or phrase which the user wishes to submit to the search engine (Search Term Acceptance). Typically, but not necessarily, this will comprise a text box into which the user can type a search term or a voice recognition system into which the user can announce a search term.
The applet then submits that search term to a database system which has been specially constructed or adapted for this purpose. The search term may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form. After consulting its store of information relating to the search term, the database system returns its findings, or results, to the applet. The findings of the database system may be sent in an encrypted form, or a compressed form, or an encrypted and compressed form. The applet decrypts or decompresses or decrypts and decompresses the data as appropriate and then presents the results to the user.
Typically, but not necessarily, the database system may also download to the applet one or more URLs of web pages which it would like updated with a request that the applet contact the page represented by the URL, analyse the page, and upload its findings as described earlier.
The code which comprises the "web crawler" is not necessarily written in Java and therefore does not necessarily comprise an Applet. Moreover it is not necessary for the software to "crawl" in the sense of copying itself from computer to computer. The method and system for crawling the web is preferably directly integrated into the code for the web browser, that is to say, the code for the web browser and the code for the crawler are in the same executable file.
A stand-alone distributed web crawler may comprise an executable file which when run may have only the very simplest interface consisting of a 'stop' button or other means of halting the execution of the program.
More typically, but not necessarily, the executable file which comprises a browser incorporates a system and method which calls the executable file which comprises the stand-alone distributed web crawler.
The purpose of the stand-alone distributed web crawler is to crawl or spider the world wide web. That is to say, in the context of this description, the purpose of this crawler is to contact a web page and then analyse its contents then upload the analysis to a database-system. The code which comprises the web crawler is preferably, but not necessarily, written in Java (Regd. T.M.) and does not necessarily comprise an Applet. The stand-alone distributed web crawler contacts a web page, perhaps as a result of having been passed that web page's URL by a server, or perhaps as a result of having that web page's URL input by a user. The stand-alone distributed web crawler then downloads and proceeds to analyse the contents of that page. When the stand-alone distributed web crawler has performed its analysis it uploads its findings, for storage and later access, to a server hosting a database system.
There is also envisaged a system and method comprising software which accepts data from an applet as described previously and translates that data into a form, type, language or schema compatible with the form, structure or language of a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos, Euroseek (Registered Trade Marks)).
Typically, the data being sent from the signed Java (Regd. T.M.) applet will comprise either queries or the web page-analysis findings of the signed Java applet for inclusion in the database.
There is further envisaged a system and method comprising software which accepts data from a database of an existing search engine (for example, Northern Light, Snap, Alta Vista, HotBot, Microsoft, Infoseek, Google, Yahoo, Excite, Lycos, Euroseek (Registered Trade Marks)) and translates that data into a form, type, structure, style, language or schema compatible with an applet as described previously.
Typically, the data being sent or retrieved from the existing database to be processed by the system or method will comprise results pertaining to search queries returned in response to queries submitted to the existing database via a signed Java applet.
In both cases the data will typically, but not necessarily, be sent in a compressed and/or encrypted form.
In a further aspect the invention provides a database security system and method. In the above described systems it is desirable to determine whether the data which is uploaded onto the database-system is sent by a bona fide applet of the type described earlier, and that the data the applet uploads is therefore genuine data and not data uploaded maliciously by an algorithm masquerading as a bona fide applet.
To assist in ensuring that the data which is uploaded onto the database-system is genuine, the uploaded data may be put in a holding data-structure or database in the database-system or may be placed in the database-system proper with a flag to indicate that the data has not yet been confirmed as valid.
For data not confirmed as valid, that is to say, for data which purports to represent the findings of an applet's "spidering" of a particular web page, confirmation can be obtained by re-spidering that web page one or more times with applets known to be at a location different from the location from which the initial spidering findings were received (spidering of a page, here, means simply accessing information on the page).
If the re-spidering of that same web page is undertaken by an applet at a location other than the location of the initial spidering and the findings of the re-spidering are identical or similar to the findings of the initial spidering then this will provide a degree of confirmation that the data represents the content of that web page.
Conversely, if the re-spidering of that same web page is undertaken by an applet at a location other than the location of the initial spidering and the findings of the re-spidering differ, or significantly differ, from the findings of the initial spidering then this will provide a degree of confirmation that the data does not represent the content of that web page.
This re-spidering can be repeated in the manner described and with each confirmation that the data is valid the degree of confidence that the data is represented by that URL increases, such that after a small number of re-spiderings the probability of the data being invalid is significantly reduced.
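The holding structure and repeated confirmation described above might be sketched as follows. The field names, the use of a checksum to stand for the findings and the threshold of two independent confirmations are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PendingEntry:
    """Uploaded findings held with a flag until confirmed as valid."""
    url: str
    checksum: str               # stands in for the findings of the first spidering
    origin: str                 # location the first findings were received from
    confirmations: set = field(default_factory=set)
    valid: bool = False


def record_respider(entry: PendingEntry, location: str, checksum: str,
                    required: int = 2) -> None:
    """Count a re-spidering only if it comes from a different location."""
    if location == entry.origin:
        return  # the same location gives no independent confirmation
    if checksum == entry.checksum:
        entry.confirmations.add(location)
    entry.valid = len(entry.confirmations) >= required
```

A fuller version would also record disagreeing reports, which count against the data as described above.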
It is desirable that a search engine is able to determine which web pages are of the greatest interest to the user and is therefore able to return results ranked according to some criterion for relevance.
Thus in a still further aspect the invention provides a search system for returning results ranked according to relevance as determined by, for example, search term density and/or likelihood of user interest. There is thus also provided a method to determine the ratio of the number of appearances of a search term in a particular page to the size of that page.
Since a page with only a small number of references to a search term is likely to be of less interest to the user than a page of the same size with a larger number of references to the same search term, a search term ratio or search term density of a web page can be defined as a ratio of the number of occurrences of the search term on a page to or divided by the size of that page. Other things being equal, it is preferable for a search engine to return results having a high density of references rather than a low density of references. That is, other things being equal, the greater the value for the search term density, the greater the likelihood of that page being of interest to the user with respect to that reference.
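The ratio defined above can be computed as in the following sketch, which assumes, purely for illustration, that page size is measured in words and that matching is case-insensitive; the specification does not fix the unit of page size.

```python
import re


def search_term_density(page_text: str, term: str) -> float:
    """Occurrences of the term divided by the page size in words (assumed unit)."""
    words = re.findall(r"\w+", page_text.lower())
    if not words:
        return 0.0  # an empty page contains no references at all
    return words.count(term.lower()) / len(words)
```

For example, a four-word page in which the term appears twice has a density of 0.5.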
Typically, the database system will comprise one or more tables or relations or other data structures in which each search term will be associated with the URL of each web page which contains that search term, and the search term density of that search term in that page.
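One such relation could look like the following sketch, which uses an in-memory SQLite database. The table name `term_index`, its column names and the helper functions are assumptions made for illustration.

```python
import sqlite3


def build_index(rows) -> sqlite3.Connection:
    """Create a term/URL/density relation from (term, url, density) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE term_index ("
        " term TEXT NOT NULL, url TEXT NOT NULL, density REAL NOT NULL,"
        " PRIMARY KEY (term, url))"
    )
    conn.executemany("INSERT INTO term_index VALUES (?, ?, ?)", rows)
    return conn


def urls_for(conn: sqlite3.Connection, term: str) -> list:
    """Return the URLs containing the term, highest density first."""
    cursor = conn.execute(
        "SELECT url FROM term_index WHERE term = ? ORDER BY density DESC",
        (term,),
    )
    return [row[0] for row in cursor]
```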
When a user consults the world wide web in order to discover an answer to a particular question that user will often have a particular question in mind. Questions tend to fall into categories, those that require a simple 'yes' or 'no' answer and those that require a fuller answer. This latter category of questions often commence with 'What', 'Why', 'When', 'Where', 'How' or 'Who'.
It is often the case that when a particular question appears on a web page that web page then discusses possible answers to that question. Accordingly a means is also provided of determining on which page(s) a particular question appears, which will assist the user in obtaining an answer to that question.
One embodiment considers sentences beginning with 'What', 'Why', 'When', 'Where', 'How' or 'Who' and terminating with '?'. By compiling a directory of questions of this form associated with the URLs of the pages on which they appear, a directory of likely pages where the corresponding answers can be found is obtained.
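A directory of this kind might be compiled as in the sketch below. The sentence-splitting rule and the function names are assumptions; the question words and the terminating '?' follow the embodiment described above.

```python
import re
from collections import defaultdict

QUESTION_WORDS = {"What", "Why", "When", "Where", "How", "Who"}


def extract_questions(text: str) -> list:
    """Return sentences that begin with a question word and end with '?'."""
    # Assumed rule: a sentence ends at '.', '?' or '!' followed by whitespace.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    return [s for s in sentences
            if s.endswith("?") and s.split(" ", 1)[0] in QUESTION_WORDS]


def build_directory(pages: dict) -> dict:
    """Map each question to the URLs of the pages on which it appears."""
    directory = defaultdict(list)
    for url, text in pages.items():
        for question in extract_questions(text):
            directory[question].append(url)
    return dict(directory)
```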
These and other aspects of the present invention will now be further described, by way of example only, with reference to the accompanying figures in which:-
Figures 1a and b show a block diagram of an Internet search system according to an embodiment of an aspect of the invention;
Figure 2 shows a block diagram of a user's computer in an embodiment of the invention;
Figures 3a to c show a flow diagram of a user registration and background spidering process;
Figure 4 shows a flow diagram of a process for downloading a web page and applet from a web server;
Figure 5 shows a flow diagram of a server process for the user registration and background spidering process of Figure 3;
Figure 6 shows a flow diagram of a search and spidering process on a user's computer;
Figure 7 shows a flow diagram of a server process for the search and spidering process of Figure 6;
Figure 8 shows a flow diagram of a graphical user interface thread for a search process for a user's computer;
Figure 9 shows dataflows in search and spidering processes according to an embodiment of an aspect of the present invention;
Figure 10 shows an exemplary graphical user interface for a search system according to an embodiment of the present invention; and
Figure 11 shows an exemplary plurality of concurrently running program threads of the search and spidering process of Figure 6.
Referring first to Figures 1a and b, these together show a block diagram of a search system 100 according to an embodiment of the present invention.
In Figure 1a a user terminal 102 is connected to the Internet 114. Further user terminals 104, 106, and 108 are also connected to Internet 114, via LAN (local area network) 110, and Internet gateway 112. Connected to the Internet 114 are a plurality of sources of information, represented in Figure 1a by web servers 116a to e. Data for user searching and for system spidering is stored on web servers 116a to e. The world-wide web represents objects in HTML (hypertext markup language) format and transfers data via HTTP (hypertext transfer protocol). However, the skilled person will be aware that the Internet also provides access to data via other protocols such as, for example, FTP (file transfer protocol) and Gopher. In the description of the embodiment of the invention which follows, for simplicity, reference is made to searching data on web servers, although in practice the invention is not restricted to data available via this format. A search and spidering system web server 118 is coupled to the Internet 114, via a firewall 117 for security. The system web server 118 provides a search system (home) web page including a search applet, that is, a Java (registered trade mark) program for execution within a supporting web browser. The system web server 118 is coupled to web page and applet code storage 120 within which the applet is stored as a signed jar (Java archive).
A digital signature authenticates the Java applet as originating from the search system service provider. When the Java applet is downloaded to a user terminal a window is displayed together with the name of the service provider and a certification authority and the user is asked whether or not to trust content from the service provider. The digital signature authenticates the origin of the Java applet as the service provider and the user is thus provided with sufficient information to enable the applet to be trusted. Once the applet has been marked as trusted it is given extended permissions by the web browser which allow it to perform the functions described below, such as reporting indexed content data to the service provider.
Web browsers such as Microsoft Internet Explorer (registered trade mark) and Netscape Navigator (registered trade mark) automatically recognize a signed Java applet and implement such security procedures. Providing a web page including a signed Java applet is the preferred implementation of the system, but in other embodiments other security arrangements may be employed.
The search system home web page is a static web page 122 comprising graphics and an HTML tag 124 including a URL (uniform resource locator) pointing to the Java applet in code storage 120.
Referring now to Figure 1b, this also shows the web server 118 and code storage 120 of Figure 1a, together with a data collection server 122 and a query servicing server 124. Each of servers 118, 122, and 124 has a separate URL. The URL of web server 118 is accessed by a user's web browser to download the system home page; the URLs of servers 122 and 124 are accessed by the Java applet code running on the user's machine. The data collection server 122 includes data collection code storage 122a and is coupled to a system data store 126. The query servicing server 124 includes query serving code storage 124a and is coupled to a user data store 128, as well as to the system data store 126 for returning search results. Some or all of the stored code and/or data may be stored on a removable storage medium, illustratively shown by disk 130.
Broadly speaking, the data collection server 122 manages data collection or spidering functions for the system and query servicing server 124 handles user queries. In a preferred embodiment of the system, search results are provided to a user together with a so-called "URL tax" of sites which the user's computer is to spider. For this reason query servicing server 124 is coupled to data collection server 122.
The system web server 118, data collection server 122, and query servicing server 124 may comprise computer programs implemented on dedicated machines or, as will be understood by the skilled person, two or more of these servers may be implemented on the same machine.
The system data store 126 preferably includes a list of all known URLs, although in practice at any one time the database will include URLs which are no longer in existence and will not include some new URLs. The basis of such a list is obtainable from the authorities who are responsible for overseeing registration of domain names, such as Network Solutions Inc., although it may be necessary to combine lists of URLs obtained from two or more such authorities. Over time the list may be enhanced by server and user-based spidering as described later. Embodiments of the system may include a subset of known URLs, for example to provide a language-based search facility, rather than attempt to include all known URLs. Associated with each URL is status data including a time stamp indicating when the status data was last updated, a "date last modified" date, normally provided on web pages to indicate when the page was last modified, a checksum based on the web page data, and a web page file size. The database also includes indexed content data for the web pages (also referred to as URL spidering data) as described in more detail below, and page rating data to provide one or more ratings of, for example, popularity, utility, and the like. The system data store 126 may comprise, in one embodiment, of the order of 10^10 URLs and associated data stored in of the order of 1 TB RAID (redundant array of inexpensive disks) storage.
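The status data associated with each URL might be represented as in the following sketch. The class and function names are assumptions, as is the choice of SHA-256 for the checksum; the specification says only that the checksum is based on the web page data.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class UrlStatus:
    url: str
    status_updated: datetime           # when this status data was last updated
    date_last_modified: Optional[str]  # as stated by the page, if provided
    checksum: str                      # checksum based on the web page data
    file_size: int                     # web page file size in bytes


def checksum_of(body: bytes) -> str:
    return hashlib.sha256(body).hexdigest()  # assumed checksum algorithm


def make_status(url: str, body: bytes,
                date_last_modified: Optional[str] = None) -> UrlStatus:
    return UrlStatus(url, datetime.now(timezone.utc), date_last_modified,
                     checksum_of(body), len(body))


def has_changed(status: UrlStatus, body: bytes) -> bool:
    """Cheap size comparison first, then the checksum."""
    return status.file_size != len(body) or status.checksum != checksum_of(body)
```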
The data store 126 may comprise a relational or object-orientated database, such as an Oracle or DB2 database, or it may comprise a proprietary database as described below. Data within the database is accessed by a user's search keyword although popular combinations of keywords may have their own entries. Taking into account the possibility of searching in a variety of languages, and searching for proper names and acronyms, provision for up to 107 keywords may be necessary.
In an exemplary proprietary format each keyword has its own file comprising a list of URLs referencing that keyword. This URL list is preferably ordered by default criteria so that retrieved search results are automatically provided in order of relevance. The ordering of results where keywords are combined in a search term and the same URL appears under two (or more) keywords may, for example, be based upon the relative position of the URLs concerned in the two lists. Thus when the database is updated new indexed content is preferably inserted at an appropriate place within the relevant ordered list or lists. With this proprietary format images of the files of popular keywords may be held in RAM for speed.
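The relative-position ordering described above can be sketched as follows. This is a minimal illustration rather than the proprietary format itself: the function name `combine_ranked`, the sum-of-positions scoring rule and the sample URLs are all assumptions made for the example.

```python
def combine_ranked(list_a, list_b):
    """Return URLs present in both ordered lists, best combined position first."""
    # Position of each URL in the second keyword's list (0 = most relevant).
    pos_b = {url: i for i, url in enumerate(list_b)}
    # Assumed rule: score a shared URL by the sum of its two list positions.
    scored = [(i + pos_b[url], url) for i, url in enumerate(list_a) if url in pos_b]
    return [url for _, url in sorted(scored)]


# Hypothetical per-keyword "files", each already ordered by relevance.
keyword_files = {
    "patent": ["a.example/1", "b.example/2", "c.example/3"],
    "search": ["c.example/3", "a.example/1", "d.example/4"],
}
results = combine_ranked(keyword_files["patent"], keyword_files["search"])
```

Because each per-keyword list is stored pre-ordered, a combined query needs only the list positions and no separate relevance computation.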
The data collection server 122 provides URLs for spidering to the Java applet running on a computer operated by a user of the search system and receives indexed content data back from the applet for storage in system data store 126. In one embodiment of the system aspects of this process and of system data store 126 are optimized by the system, preferably automatically. This self-optimization may be performed by the data collection code by, for example, making a small modification to a parameter and measuring any resulting change in system performance to determine whether the performance is improved or detrimentally affected by the modification.
Global parameters which may be modified by such a procedure include the number of keyword combinations having their own separate entry in system data store 126, the number of keyword files cached, and the length of time an unaccessed file is retained in a cache. User ("client")-specific parameters include the number of URLs in each batch sent to the client for spidering, and the URLs selected for spidering, in particular their proximity to the user's URL - the user's "catchment area" - as described further below. The client-specific parameters are preferably optimized separately during each session a user is logged on, for example, to optimize use of available bandwidth to the user's (client's) computer.
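The perturb-and-measure self-optimization may be pictured with the short sketch below. It assumes a single numeric parameter (here a batch size) and a performance measure where higher is better; `tune_parameter` and `simulated_throughput` are hypothetical names, and the performance curve is invented purely for illustration.

```python
def tune_parameter(value, step, measure):
    """Try one small change in each direction; keep whichever setting measures best."""
    best_value, best_score = value, measure(value)
    for candidate in (value + step, value - step):
        score = measure(candidate)
        if score > best_score:
            best_value, best_score = candidate, score
    return best_value


def simulated_throughput(batch_size):
    # Hypothetical performance curve that peaks at a batch size of 12.
    return -(batch_size - 12) ** 2


batch_size = 8
for _ in range(10):
    # Each round makes at most one small modification and keeps it only if it helps.
    batch_size = tune_parameter(batch_size, 1, simulated_throughput)
```

In a live system the measurement would come from observed performance, such as spidering results returned per unit time, rather than from a formula.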
User data store 128 stores data relating to specific users or clients of the search system. Thus in a preferred embodiment user data store 128 comprises user identification data such as a user number, a user name and password for accessing the system, a user e-mail address for marketing purposes and user (search) term data as described in more detail later with reference to the USER TERM table. Optionally the user data store 128 may also store a user internet address (which may be temporary or the address of a gateway). The user term data includes a history of search terms frequently used by a user which can be employed, for example, to generate a news or update service and to alert a user to new websites in which they may have an interest.
The user data store 128 may also include a user rating, for example, a "blacklist" flag which can be used to exclude unwanted users from the system. Preferably the data store also holds each user's normal IP address (this could be the IP address of a company gateway such as gateway 112 of Figure 1a), for catchment area-related searching as described later.
Other data stored in user data store 128 preferably includes BOOKMARK and FOLDER tables (described later) to store and organize a user's bookmarks. The database may also store user settings data for storing users' preferences. In one embodiment the user settings data defines the number of results returned by a search, an age cut-off for search result web pages, whether or not the user wishes to take advantage of the user search term storage facility, and if the user does request this facility, the frequency of news updates and an option for e-mail notification of updates.
Referring now to Figure 2, this shows an example of a user's computer which, as illustrated, comprises a conventional, general purpose personal computer 200 suitably programmed.
Personal computer 200 comprises a pointing device 206, such as a mouse, a keyboard 208, and a display 210, all for providing a user interface. An Internet interface 204 is provided for connecting the computer to Internet 114; this may comprise any conventional communications interface such as a modem or a local area network interface (which provides an indirect interface to the Internet). The computer includes a processor 212 which loads and implements program code stored in permanent program memory 218, such as a hard disk drive. Data for use by program code running on the processor is stored in permanent data memory 216 (which again may comprise a hard disk drive) and a working memory 214 is provided for use by processor 212 during its operation. The program code and data in memories 214, 216, and 218 may be stored on a removable storage medium, as illustrated by floppy disk 220. All the components of computer 200 are linked by computer bus 202.
Processor 212 loads and implements a web browser 212a such as Internet Explorer (registered trade mark) or Netscape Navigator (registered trade mark) and, optionally, an e-mail application (not shown). When computer 200 accesses the search system's home web page a signed Java applet 212b also runs in computer 200. This is either downloaded from system web server 118 or loaded from permanent program memory 218 (when the applet code has been cached by web browser 212a following an earlier access to the search system web page). In use the applet code is also stored in working memory 214, together with a list of URLs for spidering, HTML files for web pages retrieved by the user's computer (either for indexing or, equivalently, as search results), indexed content data, and a list of search result URLs. The list of search result URLs may also be stored in permanent data memory 216 together with, optionally, a list of the user's "favourite" bookmarked URL references. The user's bookmarks are also stored in user data store 128 and the list of bookmarks and search results list are only updated if the user chooses to save this data locally.
Web browser 212a includes cryptography code to recognize the Java applet's digital signature and to display a certificate, together with a company name, offering the user a choice of whether or not to trust the service provider. If the "trust" option is accepted web browser 212a gives signed Java applet 212b extended permissions, for spidering web pages and reporting indexed content data to the service system provider. Permanent data memory 216 may store data indicating that applet code from the search system service provider is always to be trusted.
Referring now to Figures 3a to 3c, these together show a flow diagram of a user registration and background spidering process. The flow chart illustrates steps performed by search/spidering applet code running on a user's personal computer 200. In particular, the flow chart shows a background spidering process which runs continuously on computer 200, according to the available processing and communications bandwidth, when the user is not performing a search. Preferably the process continues to run in the background during a search, although bandwidth limitations may cause the process to run slowly. As described in more detail below, in a preferred embodiment the process is a multi-threaded process; the flow chart shows steps in both a master (or control) thread and a spidering thread.
At step S300 the search system home page 122 and signed Java applet are downloaded from system web server 118 to a user terminal such as personal computer 200. As explained with reference to Figure 1a, web page 122 includes a URL to the Java applet code, which is downloaded separately from the web page text and graphics. If the user has previously accessed the search system home page the applet code and, in some instances, the web page text and graphics may be locally cached on the user's machine. The search system may force an update of such locally stored applet code by, for example, changing the applet's file name.
At step S302 the user's web browser 212a runs the downloaded applet 212b which, at step S304, establishes a socket connection with data collection server 122. The socket comprises a bi-directional virtual connection between the applet and the data collection server. Once the socket is established, at step S306 the applet sends initialization data to the data collection server 122 comprising, for example, an applet version number. The applet then, at step S308, receives a list of URLs for spidering from the data collection server. Associated with each URL is a date retrieved from system data store 126 indicating the last date (and/or time) when the data in data store 126 associated with that URL was verified and/or updated. Also associated with each URL is a checksum, again retrieved from data store 126, calculated from the web page data pointed to by the URL. The checksum is, in one embodiment, calculated using the entirety of the web page data including HTML tags, although in other embodiments data within HTML tags may be ignored.
The applet may process each URL sequentially, downloading content from a first URL, indexing this and reporting back to the data collection server, and then processing the next URL. However, it is more efficient if the applet processes a plurality of URLs in parallel, for example, using a separate thread for each. Web pages from some URLs will download more quickly than web pages from others and a multi-threaded process facilitates making use of this. Thus, at step S310, the applet selects a first batch of URLs to be processed from the list of URLs received, for example the first ten URLs in the list, and starts a new thread for spidering each one. The process illustrated up to step S310 is the master or control thread; step S312 is the first step of one of the new URL spidering threads created at step S310. At step S310 the control thread halts and waits. At step S312 the URL spidering thread of the applet sends a URL header request to the URL it is processing, requesting header data from that URL. The header data includes a "date last modified" - i.e. the date at which the web page was last updated, and web page summary data.
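The parallel processing of a batch of URLs might look roughly like the sketch below. It uses a worker pool, which hands each finished worker the next URL and so approximates the thread reassignment described, rather than reproducing the master/spidering thread signalling exactly. The name `spider_batch`, the default of ten workers and the stand-in `spider_one` callable are assumptions; no network access is performed here.

```python
from concurrent.futures import ThreadPoolExecutor


def spider_batch(urls, spider_one, threads=10):
    """Run spider_one(url) for every URL using up to `threads` concurrent workers."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # pool.map preserves input order, so results line up with the URLs.
        return dict(zip(urls, pool.map(spider_one, urls)))


def fake_spider(url):
    # Stand-in for "download, index and report"; a real worker would fetch the page.
    return len(url)


batch_results = spider_batch(["a.example", "bb.example", "ccc.example"], fake_spider)
```

Slow pages occupy only their own worker, so faster downloads are not held up behind them.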
The applet receives URL header data from the URL to be processed and, at step S314, checks whether or not the header data includes a date-last-modified for the web page. If there is no date-last-modified the applet proceeds to step S318 in Figure 3b; otherwise the applet checks, at step S316, whether the date-last-modified is later than the URL date received from data collection server 122.
If the date the web page was last modified is later than the URL date the thread again proceeds to step S318; otherwise the thread proceeds to step S338 of Figure 3c. At step S338 the main control thread checks whether or not all the URLs received at step S308 have been processed. If they have not the existing thread, which has just finished processing its last URL - that is the spidering thread of step S312 et seq, is reassigned to a new URL to be processed (step S340) and the process then loops back to step S312. Otherwise, if all the URLs received from the data collection server have been processed, the applet requests a new list of URLs for processing from the data collection server at step S342. The main control thread then again reassigns the completed thread to a new URL and, again, the process then loops back to step S312.
The Java code handles signalling between the master/control thread and the URL spidering threads, enabling the control thread to detect when a spidering thread completes.
Referring to step S318 of Figure 3b, if the date the web page was last modified is later than the corresponding date in system data store 126, at step S318 the applet URL spidering thread requests the full web page data from the URL, excluding any data such as graphics and included pages indicated by links within the page. Then, at step S320, the applet caches the downloaded web page in case the user should wish to preview the web page contents, as described later. This caching function is provided by the applet 212b rather than the web browser 212a.
At step S322 the applet calculates a checksum for the downloaded web page and, at step S324, checks whether the calculated checksum is equal to the checksum associated with the URL received at step S308 from the data collection server. If the checksums are the same the process continues at step S336 where the applet sends the URL (or a URL identifier) and the results of the date and checksum checks back to data collection server 122. The date is returned because the web page date-last-modified may have been updated without any change in the web page content. The process then continues at step S338, as described above.
If, at step S324, the system checksum and the checksum calculated from the web page differ, the applet then proceeds to analyse the web page contents and report back to the data collection server, which stores the results of the analysis in system data store 126. More particularly, the process continues at step S326 at which the applet stores links to other pages and sub-pages (frames) in the downloaded web page in working memory 214 for return to the data collection server with compressed indexed content data, as described later.
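The date and checksum tests of the spidering thread can be summarised in a short sketch. This is an illustration under stated assumptions: CRC-32 stands in for the checksum (the description does not name an algorithm), and `page_checksum`, `spider_decision` and the three outcome labels are hypothetical names. The `fetch_page` argument is a callable so that the page is only downloaded when the date test requires it.

```python
import zlib
from datetime import date


def page_checksum(page_bytes):
    # CRC-32 is an assumption; any checksum over the page data would do.
    return zlib.crc32(page_bytes)


def spider_decision(stored_date, stored_checksum, header_date, fetch_page):
    """Return 'skip', 'report_dates' or 'index' for one URL."""
    if header_date is not None and header_date <= stored_date:
        return "skip"            # not modified since last check: go to step S338
    page = fetch_page()          # step S318: request the full web page
    if page_checksum(page) == stored_checksum:
        return "report_dates"    # step S336: date changed but content did not
    return "index"               # step S326 onwards: analyse the page contents


sample_page = b"<html>hello</html>"
stored_on = date(2000, 3, 1)
unchanged = spider_decision(stored_on, page_checksum(sample_page),
                            date(2000, 4, 1), lambda: sample_page)
```

A page with no date-last-modified header is always downloaded, so in that case the checksum comparison alone decides whether it is re-indexed.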
Following this, at step S328, the applet compiles a list of all words on the web page except for HTML tags. Preferably such "words" are not restricted to dictionary words but include acronyms and, more generally, alphanumeric character strings. This is useful when searching for product numbers, specifications, invented names and the like.
Once the list of words has been compiled the applet, at step S330, discards unwanted words from the list. A list of these unwanted words is stored within the applet itself and comprises common English (and other language) words such as "the", "and", "&", and certain obscene and/or offensive words. For each word remaining in the list, at step S332 the applet determines a word rating. The word rating may be determined from one or more of word frequency, the relative font size of the word as compared with other text on the page, and the word's location, for example, whether it appears in a heading, a URL, a hypertext link, an HTML tag, or in some other location. Other conventional word rating methods may also be employed.
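A simplified version of the word listing and rating steps is sketched below. The stop-word set, the regular expressions, the use of the page title as the only location signal and the bonus of 5 are all assumptions for illustration; the description leaves the rating method open and mentions further signals, such as relative font size, which are omitted here.

```python
import re
from collections import Counter

# Stand-in for the unwanted-word list held within the applet.
STOP_WORDS = {"the", "and", "a", "of"}


def rate_words(html):
    """Return {word: rating} using word frequency plus an assumed title bonus."""
    title = re.search(r"<title>(.*?)</title>", html, re.I | re.S)
    title_words = set(re.findall(r"[a-z0-9]+", title.group(1).lower())) if title else set()
    text = re.sub(r"<[^>]+>", " ", html)  # discard the HTML tags themselves
    # "Words" are any alphanumeric strings, so product numbers are kept.
    words = [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    return {w: n + (5 if w in title_words else 0) for w, n in counts.items()}


ratings = rate_words(
    "<html><title>Widget X200</title><body>The widget and the X200 widget</body></html>"
)
```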
Once the rating for each word has been determined the applet, at step S334, compiles compressed URL spidering data comprising URL identifying data, a current date (either from the user's personal computer 200 or, preferably, as supplied by the search system), a page checksum, the word list and word rating data for each word, a list of links from the page as stored by the applet at step S326, and a page file size. The indexed content data in system data store 126 is drawn from this URL spidering data. At step S334 the applet compresses this URL spidering data and sends it to data collection server 122 for updating system data store 126. The URL spidering thread then halts while, at step S338, the control thread checks whether or not all URLs have been processed and, if they have not, the control thread reassigns the spidering thread to a new URL and the process begins again at step S312.
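The record assembled at step S334 could be packaged along the following lines. The field names, the JSON encoding and the use of zlib compression are assumptions chosen for the sketch; the description specifies only which items the URL spidering data comprises and that it is compressed before sending.

```python
import json
import zlib


def pack_spidering_data(url, checked, checksum, ratings, links, size):
    """Bundle the per-URL spidering data and compress it for transmission."""
    record = {
        "url": url,              # URL identifying data
        "checked": checked,      # current date, ideally supplied by the search system
        "checksum": checksum,    # page checksum
        "ratings": ratings,      # word list with a rating for each word
        "links": links,          # links found on the page (stored at step S326)
        "size": size,            # page file size in bytes
    }
    return zlib.compress(json.dumps(record).encode("utf-8"))


def unpack_spidering_data(blob):
    """Server-side inverse of pack_spidering_data."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


blob = pack_spidering_data("http://a.example/", "2000-04-01", 12345,
                           {"widget": 8}, ["http://a.example/more"], 2048)
restored = unpack_spidering_data(blob)
```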
The function of the applet in downloading and indexing ("spidering") web page data has been described but the applet is not restricted to downloading HTML data. For example, in a preferred embodiment the applet also spiders data in Adobe (Registered Trade Mark) postscript (.pdf) format, as well as data in other formats. The applet may also index content contained within multimedia documents, data files or other objects.
Referring now to Figure 4, this shows a flow diagram of a process for downloading web page 122 and its associated applet from system web server 118.
At step S400 web server 118 receives a request for the search system home page from web browser 212a of user's computer 200. The web server then, at step S402, sends the text and graphics for the home web page to the user's browser. The web browser then determines, at step S404, whether or not the applet is cached in the computer's permanent memory 218 and, if the applet is cached, the process ends at step S410. If the applet is not cached (or, equivalently, if the applet's file name has been changed) at step S406 the web server 118 receives a request for the applet from the user's web browser, the applet having its own specific URL. The web server then, at step S408, retrieves the applet from code storage 120 and sends the signed Java applet, as a signed JAR, to the web browser where the user is asked by the web browser whether or not to trust content (i.e. the applet) from the service provider. The process then ends, again at step S410.
Figure 5 shows a flow diagram for the background spidering process described with reference to Figure 3a, as implemented on data collection server 122. Thus, at step S500 (which corresponds to step S304 of Figure 3a) data collection server 122 is contacted by applet 212b and a socket connection is established between a data collection server communication process thread and a background spidering thread of an applet running on user's computer system 200. Each of the many user computer systems connected to the data collection server at any one time is allocated a separate socket connection and a separate process thread on the server.
At step S502 the data collection server receives initialization data from the applet including, for example, a version number of the applet which the data collection server can use to select a data communications protocol and/or data format for communicating with the applet. At step S506 the data collection server receives a request for a URL list from the applet for background spidering and, at step S508, the data collection server determines the next URLs which are to be updated. This determination may be made based upon recency, popularity, proximity, or on some other basis.
A determination based upon recency may, for example, select for updated spidering those URLs for which the greatest time has elapsed since they were last updated or, additionally or alternatively, may include new URLs which have not been spidered. A determination based upon popularity may be arranged to ensure that those URLs most frequently appearing in search results are most frequently checked and if necessary updated. Selection of URLs by proximity is described in more detail below. In some embodiments a combination of two or more of these criteria may be employed in order to determine which URLs are next to be sent to a user's computer for spidering to update the URL's records.
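One way of combining the recency and popularity criteria is sketched below. The record fields (`last_checked` as a numeric time stamp, `hits` as a count of search-result appearances), the function name `select_urls` and the particular tie-breaking order are assumptions; the description permits other combinations.

```python
def select_urls(records, batch_size):
    """Pick the next URLs to spider: never-spidered first, then least recently checked."""
    def priority(record):
        never_spidered = record["last_checked"] is None
        # Tuple sort: new URLs, then oldest check time, then most popular.
        return (0 if never_spidered else 1,
                record["last_checked"] or 0,
                -record["hits"])
    return [r["url"] for r in sorted(records, key=priority)[:batch_size]]


# Hypothetical entries from the system data store.
records = [
    {"url": "old.example", "last_checked": 100, "hits": 5},
    {"url": "new.example", "last_checked": None, "hits": 0},
    {"url": "recent.example", "last_checked": 900, "hits": 50},
    {"url": "old-popular.example", "last_checked": 100, "hits": 40},
]
next_batch = select_urls(records, 3)
```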
Where an applet has already established a socket connection with the data collection server and requests further URLs for spidering (as in step S342 of Figure 3c) the process is entered at step S504 and, again, at step S506 the data collection server receives a request for a list of URLs to spider from the applet.
At step S510 a list of the selected URLs is sent to the applet and, at step S512, the selected URLs are marked in system data store 126 as "pending", for example by means of a flag. The "pending" flag indicates that a URL has been selected for updating but has not yet been updated and the selection (at step S508) preferably ensures that once a URL has been marked as pending it is not again selected for updating by a different user. Preferably the "pending" flag has a timed expiry so that if no spidering results relating to that URL are received from a user's computer after a predetermined interval the URL is again made available for selection for spidering by the same or another user. This ensures that those URLs which are dispatched for spidering but which are not in fact spidered, for example because computer 200 is switched off before they are processed, may be re-selected. The "pending" flag is also cancelled once updated spidering data relating to that URL is received from a user's computer.
At step S514 the data collection server waits for spidering data to be received from an applet (corresponding to the data sent at steps S334 and S336 of Figure 3). The data collection server receives URL spidering data from one of the many applets running on the plurality of users' computers which may be connected at any one time to the search system, at step S516. A separate data reception process is started for each return from a system user so that in practice, at any one time, there will be a plurality of concurrent reception processes operating on data collection server 122. Such processes may be implemented in a conventional manner on data collection server 122 using Java. At step S518 the data collection server checks whether the received URL spidering data comprises indexed content data (corresponding to the data sent by the applet at step S334 of Figure 3) or merely bibliographic data (such as that sent at step S336 of Figure 3). If the received URL spidering data does not contain indexed content data the data collection server, at step S520, updates the bibliographic data for the relevant URL in system data store 126 with information indicating when the URL was last checked. This information is received from the applet and comprises a time stamp and, if available, a date-last-modified for the web page. The system then loops back to step S514 to wait for further spidering data from the same or another applet. Alternatively, where step S516 is implemented as a plurality of concurrent processes, the process receiving data from the applet halts or waits, although the socket connection to the applet remains open (since one process is allocated to serve each user's computer).
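The behaviour of the "pending" flag with timed expiry can be sketched as a small bookkeeping class. `PendingTracker`, its method names and the 600-second interval used in the example are hypothetical; times are passed in explicitly so that the logic is easy to follow and to test.

```python
class PendingTracker:
    """Tracks which URLs have been sent out for spidering and not yet returned."""

    def __init__(self, expiry_seconds):
        self.expiry = expiry_seconds
        self.pending = {}  # url -> time at which it was marked pending

    def is_available(self, url, now):
        # Selectable if never marked, or if the flag has timed out.
        marked = self.pending.get(url)
        return marked is None or now - marked >= self.expiry

    def mark_pending(self, url, now):
        self.pending[url] = now

    def results_received(self, url):
        # Spidering data arrived, so the flag is cancelled.
        self.pending.pop(url, None)


tracker = PendingTracker(expiry_seconds=600)
tracker.mark_pending("http://a.example/", now=0)
```

With this arrangement a URL sent to a computer that is switched off before processing it becomes selectable again once the interval has elapsed.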
If the received data was determined, at step S518, to include indexed content data the bibliographic data in system data store 126 is updated, at step S522, in a corresponding way to that described with reference to step S520. In addition, however, at step S524 (bibliographic) checksum data for the updated indexed content is also written into system data store 126. Also at step S524 the data collection server writes updated indexed content data into system data store based upon the word list and rating data received from the applet as described above with reference to step S334 of Figure 3. The process then again loops to step S514, waiting for further data from the user's applet. Preferably a user or applet identifier, such as a username, is also stored to indicate the origin of the new or updated indexed content data, to help detect and reduce the risk of fraud by, for example, unauthorised passing of data into the system data store.
Referring back to step S508 above, URLs for updating by a user's computer's applet may be selected partially or completely on the basis of whether or not they are within a URL "catchment area" defining URLs of a selected or predetermined proximity to a user's effective IP address. In a preferred embodiment the system data store 126 stores a list of URLs for substantially every web page on the Internet to be covered by the search system. Some of these web pages are new and have never been spidered, and some may need checking for updates, for example, because they were last checked more than 24 hours previously. The data collection server prioritises the URLs to be spidered according to how recently they were last checked and, starting with the least recently checked pages, URLs are sent to instances of applet 212b residing on the computers of users who are currently connected to the search system.
The URLs to check may be selected substantially at random, for example, for security reasons, to reduce the risk of biased or erroneous data being submitted to the database. However in other embodiments of the system the spidering process can be made more efficient by selecting URLs a user's computer receives for spidering based upon the physical or logical connection of a user's computer to the Internet in relation to the physical or logical locations of the URLs to be spidered. More specifically, there are likely to be fewer bandwidth bottlenecks to locations on the Internet (or other network) which are close to the user's computer as compared with those which are more distant. For example, if a user connects to the Internet via Internet Service Provider A, that user's computer's applet is likely to be able to spider websites hosted by that Internet Service Provider more easily than websites hosted by another Internet Service Provider who is physically and logically more distant. This strategy is effectively "cyber green" since it tends to reduce the level of long-distance IP traffic.
An Internet address comprises four 8-bit octets normally written in decimal notation, for example, 193.243.236.208. A first portion of an Internet (IP) address defines a computer network and a second portion of the address defines a computer coupled to the network. Computer networks are identified by network numbers and IP routers generally store a table of such network numbers together with corresponding IP addresses for gateways into the networks. Thus, in the foregoing example, 193.243 may define the network number of a computer network. In many cases it is convenient for network operators to assign sub-networks to different sets of host addresses within the network so that, for example, 193.243.1 defines a first sub-network and 193.243.2 defines a second sub-network. It can therefore be seen that an Internet address usually reflects the underlying physical structure of a computer network, at least to a degree. Domain name servers translate between domains and Internet addresses typically by working down a tree from a root/top-level domain name server. The allocation of domain names is overseen by ICANN who appoint country-based domain name registrars.
From the foregoing discussion it can be seen that one strategy to identify IP addresses which have a good chance of being close to the IP address of a user's computer is simply to truncate the IP address to identify a network number or sub-network address. Typically an Internet Service Provider allocates an IP address to a user's computer when the user logs onto that ISP, the address being selected, often at random, from a range of IP addresses assigned to that Internet Service Provider. Thus to identify or filter candidate URLs for spidering according to "proximity" to the IP address of a user's computer the system merely has to identify a subset of URLs for which a selected number n of the candidate URL's IP address most significant bits match the corresponding most significant bits of the user's IP address. For example, if the IP address of a user's computer is 193.243.236.208 a candidate URL with an IP address of 193.243.233.128 may be considered within the user's catchment area because the first portions of these two addresses match.
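The most-significant-bit comparison can be written in a few lines. The function name `in_catchment` is an assumption; the addresses are those used in the example above.

```python
import ipaddress


def in_catchment(user_ip, url_ip, n):
    """True if the n most significant bits of the two IPv4 addresses match."""
    shift = 32 - n
    user_bits = int(ipaddress.IPv4Address(user_ip)) >> shift
    url_bits = int(ipaddress.IPv4Address(url_ip)) >> shift
    return user_bits == url_bits


# 193.243 matches in both addresses, so the URL is in the catchment area for n = 16.
close_enough = in_catchment("193.243.236.208", "193.243.233.128", 16)
```

With n = 24 the same pair no longer matches, because the third octets (236 and 233) differ, which shows how the choice of n sets the size of the catchment area.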
The value of n selected determines the catchment area of a user's computer. In a simple embodiment this could be fixed at, for example, 16 bits. In a more sophisticated example the number of bits may be selected according to the class of IP address. Class A Internet addresses are reserved for large networks and use only the first octet for the network number (addresses 1 to 126); class B addresses are for standard size networks and use the first two octets for the network number; class C addresses are for small networks and use the first three octets for network numbers. Thus for class A addresses n may be small, for example n = 8; for class B addresses n may be larger, for example n = 16; and for class C addresses n may be larger still, for example n = 24. More generally, a subset of IP addresses based upon "proximity" may be selected using a so-called subnet mask, that is, a 32 bit number with selected (most significant) bits set to one. Where subsets are contiguous within a network each subset can access the other subsets without passing traffic through other networks. Additional/alternative proximity determinations may be based upon Classless Inter-Domain Routing (CIDR).
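Selecting n from the address class, and building the corresponding 32-bit mask, might be sketched as follows. The function names are assumptions; the first-octet ranges used are the conventional classful ones (1 to 126 for class A, 128 to 191 for class B, 192 to 223 for class C).

```python
def classful_prefix_bits(ip):
    """Return n (8, 16 or 24) according to the class of a dotted-decimal IPv4 address."""
    first_octet = int(ip.split(".")[0])
    if 1 <= first_octet <= 126:
        return 8    # class A: first octet is the network number
    if 128 <= first_octet <= 191:
        return 16   # class B: first two octets
    if 192 <= first_octet <= 223:
        return 24   # class C: first three octets
    raise ValueError("address is not in class A, B or C")


def prefix_mask(n):
    """32-bit mask with the n most significant bits set to one."""
    return (0xFFFFFFFF << (32 - n)) & 0xFFFFFFFF


n_for_example = classful_prefix_bits("193.243.236.208")
mask_16 = prefix_mask(16)
```

Two addresses lie in the same catchment area when both give the same value after a bitwise AND with the mask.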
In a still more sophisticated system traceroute, a standard utility, may be employed to determine the route datagrams take between two hosts and the "proximity" can be determined accordingly, for example by counting the number of hops in the route. The hop count is a functionally significant measure of the distance between two computers connected to the Internet since a datagram may pass through a large number of different networks before reaching its destination, even when that destination is geographically close at hand.
In another embodiment the server gives each applet a list of URLs to ping and the applets report the ping times back to the server. The server maintains a list of URLs it wants spidered together with an average ping time for each URL determined from the average of all previous applet pings to that site. The server then selects URLs for a particular applet depending on how important it is to spider that particular URL and the applet's ping time to that URL. So when an applet has a particularly short ping for a URL compared to the average ping time it receives an instruction from the server to spider it.
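The ping-based selection may be illustrated with the sketch below. The rule that an applet qualifies when its ping is at most half the average, the `importance` scores and the function name `urls_for_applet` are assumptions; the description says only that a particularly short ping relative to the average leads to a spidering instruction.

```python
def urls_for_applet(applet_pings, average_pings, importance, ratio=0.5):
    """Pick URLs this applet reaches much faster than average, most important first."""
    chosen = [
        url for url, ping in applet_pings.items()
        if url in average_pings and ping <= ratio * average_pings[url]
    ]
    return sorted(chosen, key=lambda url: -importance.get(url, 0))


# Hypothetical ping times in milliseconds.
applet_pings = {"a.example": 20, "b.example": 90, "c.example": 15}
average_pings = {"a.example": 100, "b.example": 100, "c.example": 40}
importance = {"a.example": 1, "c.example": 3}
to_spider = urls_for_applet(applet_pings, average_pings, importance)
```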
Referring now to Figure 6, this shows a flow diagram of a search process implemented using an applet on a user's computer. This process operates in parallel with the URL download and spidering processes described with reference to Figure 3.
At step S600 the user enters a search term into the applet running on the user's computer or, alternatively, a search term is selected from a historical list of previously conducted searches. The search term may comprise a single keyword or a combination of keywords linked by logical operators such as "KEYWORD1 and KEYWORD2".
At step S602 the applet sends the search request to query servicing server 124 and, at step S604, receives a list of search results and "tax" URLs back from the query servicing server. Each search result in the list comprises a URL, preferably together with additional information such as a title and/or an indication of the content of the web page pointed to by the URL. This additional information may be retrieved during the distributed spidering process and stored in system data store 126 in association with its corresponding URL. Both the search result URLs and the tax URLs each have an associated date and checksum, and optionally file size and language data. The search result URLs are flagged to differentiate them from the tax URLs.
As described above, the URLs which are stored in system data store 126 are organized in association with search term keywords and are ranked by their relevance. The list of search results received by the applet is ordered by relevance and thus when the applet displays the list of search results, at step S606, these are simply displayed in the same order in which they have been provided to the applet. The user may, optionally, re-order the displayed search results according to other criteria such as, for example, date.
At step S608 the applet identifies a first batch of URLs to begin spidering. The spidering process is preferably carried out by a plurality of concurrently running URL spidering threads in a broadly similar manner to that described with reference to Figure 3. Thus the steps of Figure 6 from step S600 to step S610 are preferably steps performed by a master or control thread of the applet which, in a preferred embodiment, is a GUI thread which also manages the interface provided for a user by computer system 200.
In a preferred embodiment some of the URL spidering threads are allocated to spidering search results and others of the threads are allocated to spidering URLs comprising the URL tax. For example where the applet creates ten URL spidering thread instances, five of these may be assigned to spidering search result URLs and five to spidering tax URLs. Thus, at step S610 the applet starts a new thread for each URL to be spidered. Preferably the search result URLs to be spidered, although selected initially by the applet are, indirectly, amendable by the user. In such an embodiment the applet detects which search results the user is viewing, for example by detecting result list scroll events, list re-ordering, and list item deletion, and controls the URL spidering threads accordingly to spider, for example, URLs being viewed, URLs in the order that they are being viewed, and to cancel spidering of deleted items.
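By way of illustration only, the split of spidering threads between search result URLs and tax URLs might be sketched as follows. The function name `assign_batch`, its parameters and the rule of filling spare slots with search results when no tax URLs remain are assumptions made for this sketch and are not taken from the described embodiment.

```python
from collections import deque

def assign_batch(result_urls, tax_urls, n_threads=10, result_threads=5):
    """Choose the first batch of URLs to spider, one URL per thread.

    Up to `result_threads` slots go to search result URLs and the
    remainder to tax URLs (the 5/5 split follows the example in the text).
    """
    results = deque(result_urls)
    taxes = deque(tax_urls)
    batch = []
    while len(batch) < n_threads and (results or taxes):
        n_results = sum(1 for kind, _ in batch if kind == "result")
        want_result = n_results < result_threads
        # Assumption: if no tax URLs remain, spare slots go to search results.
        if results and (want_result or not taxes):
            batch.append(("result", results.popleft()))
        else:
            batch.append(("tax", taxes.popleft()))
    return batch
```

Each entry of the returned batch pairs a kind ("result" or "tax") with a URL, so that a controlling thread could start one spidering thread per entry.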
As illustrated in Figure 6 the master or GUI thread effectively halts at step S610, waiting for events from the user such as the scroll events described above and, at step S612, spidering of a plurality of URLs commences (the process steps for only one of these spidering threads are illustrated in Figure 6).
At step S612 an applet URL spidering thread requests a complete web page from the URL assigned to it for processing. The spidering thread retrieves both header and text data on the web page but does not retrieve objects embedded within the page accessed via further URLs, such as sub-frames. At steps S614 and S616 the applet URL spidering thread receives a data stream from the URL until reception is complete, when the process continues to step S618. During data reception the spidering thread sends reception status data to the master GUI thread to indicate events such as, "waiting for a response from the URL", "page downloading" and "time out and halt". This information may be used, for example by the applet, to optimize the balance between the number of threads assigned to search result URLs and the number assigned to tax URLs.
Once reception of the web page is complete the URL spidering thread caches the downloaded web page for the GUI thread to display on request, and sends "download complete" status data to the GUI thread (step S618). The GUI thread preferably displays an indication of the status of the web page download from each URL, for example as a traffic light to indicate "waiting", "downloading" and "ready". In the case of a URL spidering thread which is spidering a tax URL preferably no status data is provided to the GUI thread since, in general, the tax URLs are not displayed.
In a preferred embodiment of the applet, threads spidering search result URLs are given priority over threads spidering tax URLs so that, in effect, the taxation operates as a background process and has only a small impact upon the user's available bandwidth. This prioritisation may be implemented straightforwardly using the Java Virtual Machine.
Steps S620 to S628 correspond to steps S326 to S334 of Figure 3. At step S620 the applet stores links on the retrieved web page pointing to other objects such as program code, graphics, other web pages, sub-frames and the like. Then, at step S622, the thread compiles a list of all "words" on the page except for HTML tags and, at step S624, discards unwanted words from the list. Then, at step S626, the thread determines a rating for each listed word and, at step S628, sends compressed spidering data to data collection server 122 in a corresponding way to step S334 of Figure 3. At step S630 the spidering thread, which by then has processed the URL assigned to it, is reassigned to a new URL which may either be a search result URL or a tax URL. The process then continues again at step S612. If the user has modified the search result list, for example as described above, event data is received from the GUI thread and one or more existing URL spidering threads are reassigned to spider new (search result) URLs, whether or not they have completed processing of the URLs initially assigned to them.
Referring now to Figure 7, this shows a flow diagram of a computer program running on query servicing server 124 for providing search results to an applet running on a user's computer. At step S700 the query servicing server 124 receives a search request, including a search term or keyword, from an applet on a user's computer. At step S702 the query servicing server retrieves search result URLs from system data store 126, already ranked in the order in which they will be presented to the user. This is because, as has been described above, when indexed content data from URL spidering processes is written to system data store 126, it is written in order of relevance to an associated keyword. Where a search term comprises two or more keywords search result URLs are retrieved in the manner which has already been described in connection with Figure 1.
At step S704 the query servicing server 124 then requests a list of tax URLs from data collection server 122. These tax URLs are preferably determined according to the same criteria as the background spidering URLs, as described with reference to Figure 5. In other embodiments tax URLs may be selected according to additional or different criteria from those used to select URLs for background spidering, for example to preferentially update the system data store with information relating to websites of businesses having a relationship with the search system service provider. At step S706 the query servicing server then sends both the search results and the tax URLs back to the applet, for display to the user, and for spidering. In some embodiments the search results and tax URLs are locally cached on the query servicing server 124 and transmitted to the user's applet in batches, to facilitate the applet's data handling and to make it easier for the search system to keep track of which URLs should be being spidered.
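A minimal sketch of how the query servicing server might package search results and tax URLs into batches is given below. The record type `UrlEntry`, the function `response_batches` and the choice of sending all search results before any tax URLs are hypothetical; only the per-URL date, checksum and search-result flag are taken from the description.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence

@dataclass(frozen=True)
class UrlEntry:
    url: str
    date: str               # date associated with the URL
    checksum: str           # checksum associated with the URL
    is_search_result: bool  # flag distinguishing results from tax URLs

def response_batches(results: Sequence[UrlEntry],
                     tax: Sequence[UrlEntry],
                     batch_size: int) -> Iterator[List[UrlEntry]]:
    """Yield the ranked results followed by the tax URLs in fixed-size batches."""
    # Assumption: results keep their relevance order and precede the tax URLs.
    combined = list(results) + list(tax)
    for start in range(0, len(combined), batch_size):
        yield combined[start:start + batch_size]
```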
Once the search results and tax URLs have been sent to the applet, the process ends at step S708.
Referring now to Figure 8, this shows elements of a graphical user interface (GUI) thread suitable for use with the process of Figure 6. At step S800 the GUI thread displays a list of search results in the order they are received from query servicing server 124. Then, at step S802, the GUI thread awaits an event, for example initiated by a user. Such events may include, for example, a modify result display event (such as a scroll event, re-order list event, or delete item event as mentioned above), a page preview event, a select item event, a bookmark item event and (not initiated by the user) a web page download status update event.
On receipt of a status update event from a URL spidering thread (step S804) the GUI thread, at step S806, displays updated status information for the relevant URL. On receipt of a modify result display event from a user (for example, by operation of a scroll bar) at step S808 the GUI thread displays a modified list of search results and then, at step S810, sends data relating to the modify result display event to one or more URL spidering threads as necessary, for example to reassign spidering threads to process new URLs.
On receipt of a preview event (for example, by the user clicking on a preview region such as a URL title) at step S812 the GUI thread displays a simplified rendering of the downloaded web page, for example a text-only display in a supplementary window. If a hypertext link is selected (for example, by a user clicking on the link) the GUI thread, at step S814, opens a new browser window for the selected URL for displaying data from the selected URL. If the data at that URL has previously been cached by the applet, so that a cached version of the data is available, this cached version is displayed. After each event the GUI thread returns to step S802, to await the next event. As the skilled person will appreciate, preferably the GUI thread is able to process more than one event in parallel.
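The event handling of Figure 8 may be pictured, purely as a sketch, as a loop that waits on a queue and dispatches each event to a handler. The names `run_gui_loop` and `handlers` and the "quit" event used to end the loop are assumptions introduced here; the sketch also handles events one at a time rather than in parallel.

```python
import queue

def run_gui_loop(events, handlers):
    """Dispatch events one at a time until a "quit" event arrives."""
    handled = []
    while True:
        kind, payload = events.get()   # blocks, like waiting at step S802
        if kind == "quit":             # hypothetical shutdown event
            return handled
        handler = handlers.get(kind)
        if handler is not None:        # events with no handler are ignored
            handled.append(handler(payload))
```

For example, a "status_update" handler would refresh the status shown for a URL, while a "modify_display" handler would redraw the list and notify the spidering threads.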
Figure 9 shows exemplary dataflows 900 for a user search process and for a background spidering process. Once the search system home web page is downloaded to user terminal 102 the terminal's web browser makes a URL request 902 to search system web server 118 and applet data 904 is downloaded to user terminal 102. The user then enters a search term into the graphical user interface provided by the applet and a query 906 comprising this search term is sent to query servicing server 124. The query servicing server then returns a URL list 908 comprising search results and a URL tax to user terminal 102. URL requests 910 are then issued to web servers 116a-e comprising web servers storing web pages indicated by the search results and web servers to be spidered in accordance with the URL tax. The web page data 912 is then returned from these web servers to user terminal 102, where it is processed by the applet. The compressed URL spidering data 914 resulting from the web page processing (comprising indexed content data) is then sent to the data collection server 122 for storing in the system data store 126. Generally compressed spidering data from a plurality of web pages is reported to the data collection server, data from each page being reported by a separate thread running within the applet.
As a background process the data collection server 122 also provides a URL spidering list 916 to the user terminal 102. This process first sends URL page header requests 918 to web servers 116a-e (which are merely exemplary of all the web servers connected to the Internet) and web page headers 920 are consequently returned to the user terminal. Then, where necessary, the background spidering process issues URL requests 922 for full web page data and these web pages 924 are then returned for processing. Compressed URL spidering data 926 is then reported to data collection server 122 in the same way as with the search and tax URL spidering process.
Figure 10 shows an exemplary graphical user interface 1000 for presentation to a user on personal computer 200. The user interface comprises a conventional browser window 1002 within which a secondary window 1004 is provided by the applet's graphical user interface (GUI) thread. This secondary window comprises a field 1006 for entering a search term and an adjacent (search) button 1008. A window 1010 displays a list of search results including a field 1012 displaying a title and URL for each result. A second field 1014 indicates whether or not a web page for the search result is locally cached on the user's computer and, if it is cached, on what date it was cached. The field 1014 also includes an indication 1016 of the download status of a web page to indicate, for example, that the associated web page is not active, that no response has yet been received from the web server, that the page has been accessed but is not yet fully downloaded, and that the page has been fully downloaded. A bookmark and relevance field indicates the likely relevance of a web page to the requested search term, and indicates whether or not the page has been bookmarked. A scroll bar 1020 is provided to allow a user to scroll up and down the list of search results in a conventional manner. A preview window 1022 displays a scrollable preview of the text within the web page.
Aspects of a second embodiment of the system, broadly similar to that described above, will now be described, starting with the applet.
When the user visits the system's home page a signed Java applet of about 100k is downloaded onto the user's machine. This figure is currently based on what is deemed an acceptable download for the majority of web users on modem dialups. Once the applet is downloaded the first time, it should remain cached as long as the user does not clear the browser's cache. Updates to the applet can be released so as to force a re-download. The bulk of the applet code can be saved to a special area on the disc which remains even if the user clears his browser's cache, if desired.
Once the applet has been downloaded, the user must then authorise the applet to have full access to his machine's resources by clicking a Grant button on a window that appears. The user is then invited to enter a query in the form of one or more keywords. The applet contacts the system server which returns a list of URLs (Uniform Resource Locators i.e. web page addresses) related to the query, which the applet displays in a table. The table shows the Date the page was last modified (according to the system database), the document Title, the URL (or, in other embodiments, just the domain), and a rating for the site.
When the applet receives a URL, it attempts to contact the site and download the page or document there, updating its status and dates columns as it does so. Here document is used generally to include video, audio, text and multimedia files, games and other similar types of information. When a page has been downloaded (several are downloaded simultaneously by utilising Java's inbuilt thread (multi-tasking) support) it can be previewed in a preview pane of rich text (i.e. colours, fonts, size, bold, italic but in one version, with no images or audio) or alternatively an HTML frame, by hovering the mouse pointer over its entry in the list.
Hence as pages are being downloaded, the user is able to see which ones have been contacted and cached, which ones are still pending and which ones have moved or been deleted. Once cached, the user can see the size of the page, and by moving the mouse over it is able to preview it and get an idea of what the page consists of. At this stage the user may bookmark the site for later reference, or he may choose to view the actual page.
Viewing is performed by clicking on the entry in the list (or on the URL, if shown). This brings up another browser window, with its address bar disabled to prevent confusing the user because the page is displayed from a local cache (i.e. a file on the hard disc) rather than a web location. Whilst viewing the full page, the user is free to follow links as in a normal browsing session, although any links followed take the user to actual web pages rather than local cache files. Having viewed the page, the user may then decide to bookmark it. Bookmarking is the "marking" of a location so that a user can return to it, for example by storing a reference to the location in a folder.
Bookmarking is preferably performed by clicking on the checkbox next to the entry in the list. Note that if the user has previously bookmarked this exact URL (either for this query or a different one, whether in this session or a previous session) then the entry in the list will already have its checkbox checked.
When the user bookmarks a particular site the act of bookmarking advantageously casts a vote or recommendation for that page. There may be two forms of bookmarking, the stronger marking a persistent interest in a site, and the weaker marking of an article to be read later. Alternatively, this distinction can be automatically deduced by the applet observing when the user subsequently views a page via the bookmark from a Bookmark Viewer.
The system uses these votes to assign a two-fold recommendation-rating or score for the URL. The first is with regard to the search terms used in the query and the second is a general vote for the page as being of high quality. Thus when a user performs a query, the results he receives are ranked by other users' bookmark-recommendations. This process happens in real-time, so a popular new site can be highly ranked very quickly. The order of the pages returned for a particular query is by votes with regard to this or similar queries, but the "general quality" vote is also displayed alongside each page. A fuller discussion of how ranking works is provided below in the context of Query Servicing.
Apart from the standard searching features outlined above, the applet may have a number of other features, including a Bookmark Viewer allowing users to reorganise their bookmarks, check for out of date ones, and view the bookmarked pages. Viewing a bookmarked page in this way preferably registers a vote for the page with the system, and also indicates to the applet that this bookmark is to be more highly rated amongst the user's list. The applet may also offer the user 'today's favourite query' and 'today's favourite site'.
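The two-fold rating can be illustrated with a small sketch in which each bookmark adds one vote per search term and one general quality vote. The class `VoteStore`, its method names and the use of simple vote counts are assumptions for illustration; the description does not specify how votes are stored or weighted.

```python
from collections import defaultdict

class VoteStore:
    """Hypothetical in-memory store of bookmark votes."""

    def __init__(self):
        self.term_votes = defaultdict(int)     # (term, url) -> votes
        self.general_votes = defaultdict(int)  # url -> votes

    def bookmark(self, url, query_terms):
        """Register one vote for `url` under each term of the user's query."""
        self.general_votes[url] += 1
        for term in query_terms:
            self.term_votes[(term, url)] += 1

    def rank(self, urls, query_terms):
        """Order URLs by term-specific votes; report the general vote alongside."""
        def term_score(url):
            return sum(self.term_votes[(t, url)] for t in query_terms)
        ordered = sorted(urls, key=term_score, reverse=True)
        return [(u, term_score(u), self.general_votes[u]) for u in ordered]
```

Because votes are counted as soon as they are registered, a newly bookmarked page moves up the ranking on the next query, in keeping with the real-time behaviour described.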
Turning now to the system server of this second embodiment, this comprises a central repository of data which indexes the web. Its function is to collate information collected by its clients and to service requests from clients to access this data in an ordered fashion. Preferably, it does not perform the collection or crawling itself.
This server consists of three main subsystems: a database, a data collection subsystem and a query request servicing subsystem.
The database stores data relating to what pages exist on the web, what keywords (or more generally, terms i.e. words or phrases) are associated with these pages, and how each page is ranked according to each term. It also contains user data including their bookmarks. This database may be implemented as a standard relational database, or as a custom data structure.
The data collection subsystem or process is the recipient of processed and compressed data prepared by the applets relating to new and updated pages. This data is incorporated into the database, replacing anything which is out of date. An important part of this process is that it is done in real time, i.e. the database is constantly kept fully up to date. This subsystem is also responsible for accepting user votes for sites in terms of bookmarking. The bookmarks are inserted into the user's database entry and a vote is registered linking the site with a search term.
The query request servicing subsystem answers queries from users, essentially of the form "most relevant pages for <keyword(s)>", which are submitted via the applet. The response consists of a list of URLs, ordered with the most relevant first, which is fed back into the applet for display to the user. Associated with this response is also one or many URL tax items (see below), which the applet must check and report back to the system. This process preferably has a very high performance as volumes of requests are typically measured in millions per day. Achieving this performance is helped by the fact that entire web pages do not need to be served for each request; instead a compact list of URLs which the applet can display in a user-friendly fashion will suffice.
Referring now to the features of the applet in more detail, the applet is signed, which means that a digital certificate or signature has been applied to the binary file which comprises the applet such that an end user may be confident of where the applet originates from. This instructs the browser to provide the user with the option of marking the applet as trusted.
To understand the implications of this procedure, it is helpful to outline the security model which Java (Regd. T.M.) applies to applets. To prevent the execution of malicious code when a user browses a website containing Java (Regd. T.M.) applets, most browsers' Java Virtual Machines (JVM) have certain restrictions which determine what an applet can and cannot do. These restrict access to the local file system and to the network at large. Normally, an applet is only permitted (by, e.g., a browser) to make a network connection to the server from which it was downloaded. This could prevent the functionality in the system applet of going to different websites and downloading their pages. However, by signing the applet, and prompting the user to mark the applet as trusted, these restrictions are lifted. The two most popular web browsers, Internet Explorer (Regd. T.M.) and Netscape (Regd. T.M.) currently employ different signing mechanisms. Therefore, to provide a signed applet which both browsers can recognise as signed it is preferable to sign it twice, once using Internet Explorer's (Regd. T.M.) scheme and once using Netscape's (Regd. T.M.).
One way for the applet to communicate with the server is to use Remote Method Invocation (RMI). Communicating directly over sockets provides an adequate alternative (RMI is itself implemented on top of sockets), although corporate or ISP firewalls generally prevent users behind them from using sockets. A further alternative uses HTTP requests, although this imposes a performance overhead as HTTP is a relatively heavyweight protocol, and is stateless.
Preferably both socket-based and HTTP-based versions of the system protocol are implemented so that users who are able to make use of the more efficient sockets version can do so.
Java (Regd. T.M.) has inbuilt and efficient support for threads, that is, the ability to set multiple processes running concurrently within the same program (i.e. multi-tasking). This provides a number of advantages. To download say 10 web pages simultaneously, using threads, there is no need to manage switching between pages as packets arrive randomly across the Internet. Since the JVM automatically allocates CPU resources to each of the threads, if one is held up waiting for the network to provide data, then the CPU is freed to work on another task. In the case of the system applet, whilst it is waiting for a response from the website it is attempting to download a page from, another thread can be processing a previously downloaded page, thus optimising usage of available resources. Moreover, from the user's point of view threads make the graphical user interface (GUI) responsive, in that whilst heavy processing is going on in the background, or a page is taking time to download, the user interface still responds interactively. Preferably, the GUI gets one thread to itself, which controls all the others, preventing the controls hanging (becoming non-responsive) when something is happening in the background.
Each page being downloaded preferably also gets a thread to optimise the trade off between network and CPU. This includes URL tax pages (described below).
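The benefit of one thread per page can be sketched with a thread pool, shown here in Python rather than Java for brevity. The function `download_all` and the caller-supplied `fetch` callable are hypothetical; any function that returns page text or raises an exception on failure or time-out would serve.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_all(urls, fetch, max_workers=10):
    """Download every URL concurrently; return (pages, failures)."""
    pages, failed = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per page: while one task waits on the network,
        # another can be downloading or processing.
        futures = {pool.submit(fetch, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                pages[url] = future.result()
            except Exception as exc:   # e.g. a time-out raised by fetch
                failed[url] = repr(exc)
    return pages, failed
```

The default of ten workers mirrors the example of downloading ten pages simultaneously; a slow or unresponsive site occupies only its own thread.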
Advantageously, the thread that downloads the page will then, having updated the GUI, begin processing the page if necessary. This thread will then pass information stating that the page has not been modified since last checked, or a fresh analysis of the page as required. In this way information is sent to update the system on a page by page basis as and when it is available. This is advantageous as the applet might be terminated at any stage by the user closing his browser or moving to another page.
When the applet receives a list of URLs, it displays them on the GUI in the order received which is ranked in descending order of relevance as determined by the system using its relevance and voting data. Additionally or alternatively, the results can be ranked using data from the local user only. The applet then begins to contact each of the sites in the list and to download from them. It starts at the top of the list, downloading the pages most likely to be what the user is looking for.
Referring now to Figure 11, this illustrates a plurality of concurrently running threads of the search, spidering and user interface aspects of the applet, showing different stages of page downloading and processing. A single thread is responsible for downloading one web page and, if necessary, processing the web page data and transmitting the result back to the system. A GUI thread 1100 is also shown. In Figure 11 a spidering thread first waits 1110 for a response from a web server then downloads 1120 the web page and finally processes and transmits 1130 indexed content data back to the search system server. In the illustration an exemplary slow thread 1102 and fast thread 1104 are shown together with a thread 1106 which does not receive a response from the web server and times out. The exact number of threads which results in optimal performance can be determined by empirical means but it is typically of the order of 10. The applet may measure its own performance to optimise the number of threads it creates. This is useful as many factors affect the optimal number, e.g. network bandwidth availability, CPU availability, JVM use and physical memory availability.
A further preferred feature that improves the performance from the user's point of view is that the applet monitors the user scrolling through the list of URLs to ensure that it concentrates on items currently visible in the scrolling window. For example, say ten items are visible in the list without scrolling, then if the user quickly scans the first ten items and determines, perhaps by looking at the titles, that he's not interested without actually waiting for the preview to appear and scrolls onto the next ten items, then the applet operates to focus on getting those items downloaded. The applet preferably continues to download the previous items in case the user decides to return to them, but places a higher priority on the currently visible items. In this way the applet is seen to keep in touch with the user and is able to present previews and cached copies of the pages with a minimal delay.
As the applet downloads pages, it stores them in a special directory on the local hard disc set aside for this purpose. This is facilitated because the applet is signed and trusted and therefore has access to system resources (such as the file system) that untrusted applets do not have (see above section on signing). Once the HTML files are on disc, the applet can cause the browser to view these files as if they were real web pages, thus allowing the pages to be viewed almost instantaneously once selected.
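One possible way of ordering downloads so that currently visible items come first is sketched below. The function `download_order` and its parameters are hypothetical, and the sketch simply recomputes the order after each scroll event instead of adjusting thread priorities.

```python
def download_order(urls, first_visible, visible_count, already_cached=()):
    """Return the URLs still to download, visible items first."""
    cached = set(already_cached)
    last = min(first_visible + visible_count, len(urls))
    visible = range(first_visible, last)
    visible_set = set(visible)
    # Visible items that are not yet cached get the highest priority.
    front = [urls[i] for i in visible if urls[i] not in cached]
    # Other items are still downloaded, in list order, at lower priority.
    rest = [u for i, u in enumerate(urls)
            if i not in visible_set and u not in cached]
    return front + rest
```

If the user scrolls from the first ten results to the next ten, calling the function again with the new `first_visible` moves those ten to the front while the earlier items remain queued behind them.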
One potential difficulty with this approach is that often web pages contain a lot of images which can be large files compared with the actual HTML file itself and therefore can cause a delay to the display of the page. Therefore it is desirable for the applet to automatically download any such associated files into the same cache directory to allow very rapid display of pages even with many large images (provided there was sufficient time for all the files to download). This also requires the applet to rewrite the master HTML page such that URLs for all associated files point towards the local cached versions. In order to avoid problems with broken images if not all images have been cached before the page is viewed, only those URLs for which a local cache file has been obtained are rewritten, the rest continue to point to the original source. This means the browser uses its normal iterative display algorithm which displays all loaded elements and then fills in other elements such as images and frame contents as and when they complete downloading.
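The selective rewriting described above might be sketched as follows. The function `rewrite_cached_sources` and the mapping `cache_map` (original URL to local cache path) are hypothetical, and for simplicity the sketch handles only double-quoted `src` attributes rather than parsing HTML fully.

```python
import re

# Matches src="..." attributes (double-quoted only, a simplifying assumption).
SRC_ATTR = re.compile(r'(src\s*=\s*")([^"]*)(")', re.IGNORECASE)

def rewrite_cached_sources(html, cache_map):
    """Point src attributes at local cache files where one exists.

    URLs with no entry in `cache_map` are left untouched, so the browser
    fetches them from the original source as usual.
    """
    def swap(match):
        original = match.group(2)
        return match.group(1) + cache_map.get(original, original) + match.group(3)
    return SRC_ATTR.sub(swap, html)
```

Leaving uncached URLs unchanged is what avoids broken images when a page is viewed before all of its associated files have been downloaded.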
Note that the cached HTML files on disc can also be used for the preview functionality (see below) to enable a full rendering of the page (utilising the browser HTML renderer) on a small scrollable pane rather than the simplified rendering performed within the applet itself. The applet may provide this as a user option depending on which the user finds most helpful; this will typically be dependent on the network and CPU resources available to the given user.
In order to preview a page, the user simply hovers his mouse pointer over the list entry of interest. If utilising the browser's rendering engine, then the applet issues a command specifying the file to display, and whether to display it in a separate window or a particular pane. Preferably there is a pane as part of the system's home page that is set apart for this purpose; if a separate window is used then it preferably has its address bar and toolbar suppressed to save space (and to prevent confusing the user with the presence in the address bar of a local filename instead of the actual URL).
If rendering the pages within the applet then a Java (Regd. T.M.) pane is created within the screen area that belongs to the applet, and this is populated with a much-simplified representation of the HTML, which is purely text but which has some of the characteristics of the full HTML such as colours and font size. Again this pane may be placed in a separate window - the advantage of doing this is that the user has complete freedom in how he organises the layout. The interface may provide a detach button which allows the fixed pane to become a separate window, with the option of re-attaching it.
When ready to view a full page, the user clicks on the page title in the list, which highlights as the mouse rolls over it indicating that it can be clicked. This preferably uses the same mechanism described above to display the full page, again from local cache if available; this is advantageously shown in a new browser window. The user may then choose to follow links within the site, in which case un-rewritten URLs are followed taking the user to areas within the actual site and browsing may continue as usual. On closing the window, the user is presented with the previous browser window containing the system applet, ready to continue looking for his page of interest.
When the applet has downloaded the text for a particular URL it begins processing it, but only if the system marked this URL as wanted for update, and if the page has a date modified later than the date the system has on record. If the page is wanted for update but has the same date modified as the system has on record, then this information is returned to the system so it knows that it doesn't have to check this page again for another period of time specified for updating.
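The decision described in this paragraph can be expressed as a small function. The names `Action` and `decide` are hypothetical, and treating a page whose modification date is earlier than the recorded date as unchanged is an assumption, since that case is not addressed in the description.

```python
from datetime import date
from enum import Enum

class Action(Enum):
    SKIP = "skip"                          # not wanted for update
    REPORT_UNCHANGED = "report_unchanged"  # tell the system nothing changed
    ANALYSE = "analyse"                    # build and send a fresh analysis

def decide(wanted_for_update: bool, page_modified: date,
           recorded_modified: date) -> Action:
    """Decide what to do with a page once its text has been downloaded."""
    if not wanted_for_update:
        return Action.SKIP
    if page_modified > recorded_modified:
        return Action.ANALYSE
    # Same date (or, by assumption, an earlier one): report as unchanged.
    return Action.REPORT_UNCHANGED
```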
Those pages which have changed since last checked and which the system therefore requires a new analysis of, are processed to determine the keywords present and to obtain a relevance ranking for the keywords based on relative frequency and significance within the page (e.g. large headings carry more weight). This processing is performed in a thread for each page, preferably the same thread that originally downloaded it.
The applet begins by building the word list for the page, that is it lists all the words that appear on the page, and assigns each word a rating based on its importance. This is determined by that word's frequency and whether it appears in the title, headings, links etc. The applet has built into it (preferably in such a way that it may be dynamically updated without downloading the whole applet again) a list of words that are too common to be useful for searching except when searching by exact phrase. Preferably, there is also a list of words that are deemed inappropriate for allowing searching on. These words will also be discarded. Thus a list is obtained containing for each non-trivial, non-offensive unique word a ranking, for example on a scale of 1-255 (1 byte) or 1-65535 (2 bytes). This provides a very compact analysis of the page as there is no duplication and all HTML tags and common words have been removed. In addition to the word list, a list of contained URLs is also returned to the system, in this way new pages are discovered and analysed.
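A minimal sketch of such a word-rating step is given below. The function `rate_words`, the stop-word list, the weights of 5, 3 and 1 for title, heading and body words, and the linear scaling to the range 1-255 are all assumptions chosen for illustration; the description states only that frequency and position influence the rating.

```python
import re
from collections import Counter

# Illustrative list only; the applet's actual list is not given in the text.
STOP_WORDS = frozenset({"the", "and", "of", "a", "to", "in", "is"})

def rate_words(title, headings, body, stop_words=STOP_WORDS, blocked=frozenset()):
    """Return {word: rating} with ratings on a 1-255 scale (one byte)."""
    weights = Counter()
    # Assumed weights: title words count 5, heading words 3, body words 1.
    for text, weight in ((title, 5), (headings, 3), (body, 1)):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word not in stop_words and word not in blocked:
                weights[word] += weight
    if not weights:
        return {}
    top = max(weights.values())
    # Scale so the most important word rates 255 and no word rates below 1.
    return {w: max(1, round(255 * s / top)) for w, s in weights.items()}
```

Each unique word appears once in the result, which is what keeps the analysis compact compared with the page itself.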
Once a page has been analysed, the resulting data is transmitted to a server for central organisation and storage. This is done as the next stage in the sequence handled by the thread that downloaded the page (see above section on threads). One embodiment of a format for this data is shown below:
URL_ID
CHANGED = 1
DATE_MODIFIED
SIZE
WORD
RANKING
WORD
RANKING
NULL TERMINATOR
URL
URL
URL
NULL TERMINATOR
This is for a fresh analysis. For a page that has been checked and found to have not been modified, the following might be sent:
URL_ID
CHANGED = 0
This simple message instructs the system that this page has been checked as required. The system notes this and then checks again in the period of time specified for updating.
The URL_ID is supplied by a server when it sends the list of URLs, and the system uses its own local clock to determine the DATE_CHECKED field for its database entry.
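By way of example only, the two messages could be serialised as shown below. The field widths (a 4-byte URL_ID, 1-byte CHANGED flag, 4-byte DATE_MODIFIED and SIZE, 1-byte rankings), the big-endian byte order and the UTF-8 null-terminated strings are assumptions; the description names the fields and their order but not their encoding.

```python
import struct

def encode_report(url_id, changed, date_modified=0, size=0, words=(), links=()):
    """Serialise a spidering report in the field order described in the text."""
    if not changed:
        # Unchanged page: only URL_ID and CHANGED = 0 are sent.
        return struct.pack(">IB", url_id, 0)
    out = bytearray(struct.pack(">IBII", url_id, 1, date_modified, size))
    for word, ranking in words:
        # Each WORD is null-terminated and followed by its one-byte RANKING.
        out += word.encode("utf-8") + b"\x00" + bytes([ranking])
    out += b"\x00"                     # null terminator ending the word list
    for link in links:
        out += link.encode("utf-8") + b"\x00"
    out += b"\x00"                     # null terminator ending the URL list
    return bytes(out)
```

Under these assumptions an unchanged-page report occupies five bytes, which illustrates how little traffic is needed to confirm that a page has been checked.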
When the user clicks the checkbox next to a search result, that page is bookmarked within the applet. This means the user can return to it at a later time for previewing or viewing. It also sends a vote to the system to rate this site with regard to the search terms the user was using. The user's bookmarks are sent to the user database on the system server so that they can be retrieved by the user at a later date.
The user is identified by means of an HTTP cookie, a small piece of data which the web server sends to the browser and which the browser keeps a copy of between sessions. The browser then automatically sends this back to the server whenever it visits that site again, thus enabling the server to identify the user and retrieve personal information for them. Thus, when the user returns to the system site he is automatically and transparently logged in without having to take any action. There is also an option for regular users to register with the site. With this they choose a username and password and can then logon to the system from any computer connected to the Internet and retrieve their personal bookmarks. For both types of user there is also the option of saving their bookmarks to local disc and of exporting them in a format (such as HTML) such that they can be imported into popular browsers such as Internet Explorer (Regd. T.M.) and Netscape (Regd. T.M.). Similarly, users can import bookmarks from their browsers into their system accounts.
As bookmarks stored on the system server occupy disc space, it is preferable for accounts which have not been accessed for a given period of time (say three months) to be deleted. The user is sent a warning email a week before this is going to happen, and if the user accesses the system within that week then the account becomes active again. The email contains the user's username and password in case they have forgotten it and a link to the system that automatically logs them on, making it as simple as possible for the user to begin using the system again. The email also contains an attached file, in portable HTML format, containing all the user's bookmarks so that even if the user's account is deleted, they still have a record of their bookmarks which can be used directly from the email, saved to local disc, imported into a browser or, if the user creates a new system account at some point in the future, imported back into the system.
In order to organise a user's often large number of bookmarks, there is preferably provided a special Bookmark Viewer or Bookmark Manager activatable by the user. This may be a pane on the side of the normal system applet view which may display contracted and expanded folder contents view formats. Bookmarks are preferably organised into Folders, which are hierarchical and can be expanded or contracted to show the next level down in the hierarchy, in the standard Windows (Regd T.M.), Tree Diagram paradigm. By default, when a user bookmarks a site, it is placed in a folder based on the query that the user performed. This may be automatically made hierarchical based on multiple query terms. Users may leave the bookmarks in the folders the applet assigns for them, or they are free to move them around at will, change folder names and create new folders. If changing a folder name, the system prompts the user to ask if he wants bookmarks for the original query terms to still go into that folder or not (if not, another folder will be created based on the query terms again). At any stage, the user can hover his mouse over his bookmarks, causing them to be previewed in the normal preview window. Clicking on them brings up a new browser window containing the bookmarked page. It is quite likely that often the user will click on a bookmark before the applet has had a chance to cache it (the only warning the applet having received being the time the user had his mouse pointer hovering over the link), in which case the browser window will open with the actual URL, not the name of a locally cached file.
Extra features in the form of a personal results functionality are available on subsequent visits for users who have already visited the system and performed one or more queries (see below). With this optional feature selected, on returning to the system site, before entering new search terms, the applet automatically selects a number of queries from the user's history, up to, for example, ten (this number is preferably user-configurable). A search is then performed for each of the queries, returning only pages which are highly rated by other users and which are relatively new, e.g. less than 1 year, 1 month, 1 week or 1 day old. If there are no such sites for a particular query then nothing is shown. This cut-off criterion is preferably user-configurable, for example in terms of votes cast or recency of modification. These results are then displayed in the same manner as normal search results or in a different format, and in the same part of the window, with the document title/URL, short extract and ranking information.
Preferably, a priority value is maintained for each query; this may be incremented whenever that query is run by the user. A small header indicating which of the user's queries each set of pages are in response to may also be provided. Again, by hovering his mouse over the entry, a preview is shown to the user and by clicking on it the full page comes up in a separate window as normal. This effectively provides a "web magazine".
The queries chosen for such a Personal Results or web magazine page, if there are more than the specified number to choose from, are preferably selected based on any or all of the importance of the query to the user which is determined by seeing how recently the user performed this query, how often the user performs the query, and how many bookmarks he has which relate to this query. It is also possible to allow the user to promote a query to a higher significance thereby guaranteeing its inclusion in the Personal Results search.
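The selection of queries for the Personal Results page might, for example, be sketched as below. The description names the three inputs (recency, frequency and number of related bookmarks) and the possibility of user promotion, but not how they are combined; the weights in `importance` and the names `QueryHistory` and `select_personal_queries` are purely hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryHistory:
    text: str
    days_since_last_run: int
    run_count: int
    bookmark_count: int
    promoted: bool = False  # set when the user promotes the query


def importance(q):
    # Hypothetical weighting of recency, frequency and related bookmarks.
    recency = 1.0 / (1 + q.days_since_last_run)
    return 10 * recency + q.run_count + 2 * q.bookmark_count


def select_personal_queries(history, limit=10):
    """Pick up to `limit` queries: promoted ones first, then by importance."""
    ranked = sorted(history, key=lambda q: (not q.promoted, -importance(q)))
    return [q.text for q in ranked[:limit]]
```

Sorting promoted queries ahead of all others is one simple way to guarantee their inclusion, provided the number of promoted queries does not exceed the limit.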
If the user proceeds to perform a new query, then the new search results replace the Personal Results. However, these may be returned to at any stage (including on the user's first visit to the system) by clicking on a Personal Results button.
The queries that generate the Personal Results page preferably implement the usual spidering functionality of any other query, including the URL tax (see below). This is advantageous as people may use the system as their home page or often view their Personal Results without actually doing fresh queries, in which cases the system benefits from their spidering input as soon as they logon.
To provide a comprehensive and up to date index of the web, the system relies on the processing capability and bandwidth of its users, in particular, the applets running on their browsers. When users perform queries against the central system database a list of matching URLs is returned. If the system needs an update for any of these pages, it informs the applet, which returns to the system a fresh analysis of the page containing the word list and the set of URLs that the page links to.
However, there are potentially many pages that should preferably be checked every day but that will not necessarily appear high enough in a search results list to be checked frequently enough. Therefore, for every set of URLs that users are interested in, preferably a number of URLs are returned purely for the purpose of spidering, that is, in order to get the applet to check, and if necessary, analyse them. This may be referred to as a "URL Tax". The ratio of spidering URLs to search result URLs will be termed a URL tax percentage. If the applet completes downloading all pages the user is interested in (or at least the next ten in the list, say) it then begins checking more URLs above and beyond the normal tax percentage. This allows efficient use of high-bandwidth users who leave the system open in their browser after finishing using it.
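As a minimal sketch of how a server might apply the URL tax, the function below picks spidering URLs in proportion to the number of search results. The function name, the rounding up and the exclusion of URLs already present in the results are assumptions; the description defines only the ratio itself.

```python
import math


def select_tax_urls(result_urls, stale_urls, tax_percent):
    """Choose spidering-only URLs to send alongside a set of search results.

    `stale_urls` is assumed to be ordered with the most overdue pages first.
    """
    wanted = math.ceil(len(result_urls) * tax_percent / 100)
    seen = set(result_urls)
    tax = []
    for url in stale_urls:
        if len(tax) >= wanted:
            break
        if url not in seen:  # results are checked anyway, so do not tax them
            tax.append(url)
            seen.add(url)
    return tax
```

For ten search results and a tax of 33%, this returns up to four additional URLs for the applet to check.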
An exemplary database scheme for the system will now be described, giving the main features and data items contained. The actual schema used does not need to follow this exact structure as this will largely depend upon the data structure used to implement the database i.e. a custom architecture or a standard relational database management system.
URL
URL ID | URL | LAST UPDATE | LAST CHECKED | RATING | TODAYS VOTES
In the URL list, each URL is assigned a unique compact ID of preferably five or more bytes. 5 bytes allows for a trillion unique URLs, but depending on the exact implementation 8 bytes may be simpler to store.
A field TODAYS VOTES counts the number of times this page has been bookmarked today, so it is set to zero every 24 hours. A RATING field is a score based on the number of bookmarks and other votes (i.e. people viewing the page from their bookmark viewer) which accumulates with time, although when pages are freshly analysed this number is reduced, for example by halving its value.
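The vote-counting behaviour of a URL entry can be illustrated with the short sketch below. The class and method names are hypothetical; halving on fresh analysis is the example reduction given above, and the final lines simply check the capacity figure quoted for a 5-byte ID.

```python
class UrlRecord:
    """One row of the URL table, reduced to its vote-related fields."""

    def __init__(self, url_id, url):
        self.url_id = url_id
        self.url = url
        self.rating = 0        # accumulates over time
        self.todays_votes = 0  # reset every 24 hours

    def cast_vote(self):
        self.rating += 1
        self.todays_votes += 1

    def reset_daily(self):
        self.todays_votes = 0

    def on_fresh_analysis(self):
        # The page content has changed, so older votes count for less.
        self.rating //= 2


# A 5-byte ID distinguishes 2**40 values, slightly more than a trillion.
URL_ID_CAPACITY = 2 ** 40
assert URL_ID_CAPACITY == 1_099_511_627_776
```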
TERM
Figure imgf000059_0001
Here, a term is used to mean a "word" (preferably with capitals and punctuation suppressed) or a set of words which are grouped together as a phrase, although preferably the order is not significant. In this context, "word" may include words in more than one language, proper nouns, and combinations of letters, numbers and other characters such as the AMD "3DNow!" trademark. TERM_ID is a 4-byte integer allowing for 4 billion unique search terms. TEXT is a word or phrase as a character string e.g. "clinton". RATING counts the accumulated number of times this term is used for querying and TODAY is the RATING count for the last 24 hours.
A TERM table contains all unique words to be indexed against, and in addition it contains the most popular phrases that searches are performed against. The number of phrases stored is selected depending on the system's resources.
RATING
Figure imgf000060_0001
A RATING table provides cross-references between particular pages and search terms that they are relevant to. Data in a RATING field is preferably a combination of a static rating of a page with respect to a particular term, i.e. how often a particular word appears on a particular page and a dynamic rating i.e. based on bookmarked pages that users associate with a search term. Depending on the exact implementation (in particular the request servicing algorithm) the static and dynamic ratings may be stored in separate fields.
There are some tables which are not related to the core dataset, rather they hold information related to users i.e. bookmarks and frequently asked queries, as described below. In the embodiment described with reference to Figure 1 these tables are held in user data store 128.
USER
Figure imgf000060_0002
A USER table holds important user-related information. An entry for each unique user is created the first time they visit the system site. As session tracking is performed using HTTP cookies without the need for the user to register and sign in, initially the USERNAME, PASSWORD and EMAIL fields are blank, and only a USER_ID is stored. This is what the cookie stored on the user's browser contains in order to identify the user on a return visit.
If and when a user decides to register with the system, he must enter a unique username as well as a password and email address.
USER TERM
Figure imgf000061_0001
A USER TERM table stores queries that a particular user performs on a regular basis. Each user will have a number of entries in this table, preferably with an upper limit to prevent the table growing too large (the primary key for this table is therefore USER_ID + TERM). If the phrase or word appears in the master TERM table described above, then a TERM_ID is referenced, thus saving space in the database; otherwise the term appears in full as a text string. The priority field indicates the importance of this query; preferably it is automatically incremented whenever the user performs that query. In addition the value may be manually edited by the user to indicate when they are particularly interested in a query. This value is used when generating the automatic Personal Results page.
BOOKMARK
Figure imgf000061_0002
A BOOKMARK table stores the users' bookmarks. The primary key for this table is USER_ID + URL_ID; this ensures that each user may only bookmark a given page once. TERM indicates for which query a given page was bookmarked and this information is used for ranking pages with respect to search terms. A FOLDER_ID is used by the Bookmark Viewer/Manager to organise the bookmarks hierarchically. URL_ID signifies the web page address efficiently by providing a cross-reference to the URL table.
FOLDER
USER ID | FOLDER ID | FOLDER NAME | PARENT ID
A FOLDER table enables bookmarks to be organised hierarchically into folders. By default, when the user bookmarks a page, a folder is created with a name formed from the search queries used, if one does not already exist for this user. The user can rename folders and create subfolders. If a folder is a subfolder of another folder then its PARENT_ID points to that folder; otherwise PARENT_ID is null, indicating that it is a top-level folder.
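The parent-pointer arrangement of the FOLDER table can be illustrated by resolving a folder's full path, as in the sketch below. The in-memory dictionary layout and the function name `folder_path` are assumptions for this example; a null PARENT_ID is represented by `None`.

```python
def folder_path(folders, folder_id):
    """Return the path of a folder, e.g. 'music/jazz'.

    `folders` maps FOLDER_ID to a (FOLDER_NAME, PARENT_ID) pair for one user.
    """
    names = []
    seen = set()
    while folder_id is not None:
        if folder_id in seen:
            raise ValueError("folder hierarchy contains a cycle")
        seen.add(folder_id)
        name, parent_id = folders[folder_id]
        names.append(name)
        folder_id = parent_id  # climb towards the top-level folder
    return "/".join(reversed(names))
```

The cycle check is a defensive addition that guards against a folder being moved, directly or indirectly, into one of its own subfolders.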
The database may be implemented on an RDBMS (Relational Database Management System) or on a proprietary or other data structure. The dataset is also very large and simply structured. A custom-designed data structure is one efficient and cost-effective solution.
In one implementation, each table consists of a file on disc. There is a portion of each table held in physical memory (i.e. cached) at all times. Only specific operations are allowed on the database; these are the ones that allow the data to be updated as information is received from the applets, and that allow queries to be performed against the database by the applets. Optimised C++ routines perform these operations on the cached portions of the tables and also keep the full disc versions up to date.
The interface to the system database comprises a Java (Regd T.M.) servlet, optimised for network operations, thus enabling a large number of applets to be simultaneously connected to the system. Integration of the front and back-ends of the database is by implementing the C++ methods as Java Native Interface (JNI) methods, that is so that they comply with a standard interface allowing the servlet JVM to make direct method calls on them. There are also servlets (conveniently, mostly written in C++) that continually sort data and ensure that the database is self-consistent.
In a preferred embodiment, virtually all the data in the system database is contributed by the client-side Java (Regd. T.M.) applets, apart from initial spidering information. It is thus desirable to ensure that the process of data collection from clients is efficient, with minimal overhead for both the client and server, and that the information incorporated into the system is accurate.
One important consideration is security, in particular the potential for hackers to create, by reverse engineering, malicious variants of the system applet. These could, for example, attempt to feed inaccurate information back to the database, or attempt to disrupt the normal operation of the system by means of a Denial Of Service (DOS) attack in which large quantities of queries or data are directed at the server in an attempt to overload it.
Denial Of Service attacks are a common problem for all Internet based services, systems and networks, hackers using the normal process of sending queries, but in such quantities that potentially the server cannot cope. One way to address this potential vulnerability is to provide means to monitor the network traffic and the system server(s) itself to check that no such attacks are in progress and that the system is running smoothly and is coping with the number of interactions it is receiving. It is possible to provide means to track the trends in traffic on daily and weekly cycles, and to keep ahead of demand for the service as the system's popularity grows. Thus, if a DOS attack is launched, a sudden increase in activity beyond the normal cycles can be detected and defensive measures taken. These may involve blocking traffic from certain IP domains or addresses and if necessary closing down the system until the source of the attack has been identified and shut down. Similarly, means to monitor activity on the system site and data flowing into and out of the database can identify discrepancies that signify hacker attack, at which point appropriate measures can be taken.
Another potential risk is that of inaccurate information being systematically fed into the system database. For example, a person might attempt to boost the traffic to their website artificially by sending in large numbers of fictitious votes, or possibly to reduce the traffic going to a competitor's site by sending wrong analyses of pages.
As described above, a so-called dynamic page rating is obtained by counting votes that are cast whenever a user bookmarks a site. It is only possible for each unique user to bookmark a page once, as enforced by the database schema. However, it would be theoretically possible to obtain many virtual USER_IDs by visiting the system site many times, each time clearing the cookie(s) from the browser, so that the system would consider the user new every time. A solution to this problem is to provide means to monitor the IP address from which new users are originating. Many users, for example more than 10 (especially over a short space of time, for example 1 week, 1 day or 1 hour), originating from the same IP address can be detected and indicate a potential problem (especially if these users are indeed casting many votes for the same web page). Measures can then be taken to block this activity, for example by issuing all users from the same or a particular suspect IP address with the same USER_ID.
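The monitoring of new users per IP address might be sketched as a sliding-window counter, as below. The class name, the default threshold of 10 and the one-day window are illustrative values drawn from the examples above; timestamps are assumed to be seconds and to arrive in non-decreasing order for each address.

```python
from collections import defaultdict, deque


class NewUserMonitor:
    """Flags IP addresses from which many new USER_IDs are being issued."""

    def __init__(self, max_new_users=10, window_seconds=86400):
        self.max_new_users = max_new_users
        self.window_seconds = window_seconds
        self._arrivals = defaultdict(deque)  # IP -> timestamps of new users

    def register(self, ip, timestamp):
        """Record a new user from `ip`; return True if the IP looks suspect."""
        arrivals = self._arrivals[ip]
        arrivals.append(timestamp)
        # Forget arrivals that have fallen out of the monitoring window.
        while arrivals and arrivals[0] <= timestamp - self.window_seconds:
            arrivals.popleft()
        return len(arrivals) > self.max_new_users
```

When `register` returns True, the server could apply the measure described above, for example issuing the same USER_ID to further visitors from that address.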
To detect malicious activity where the applet is modified to transmit inaccurate analyses of pages, it is preferred that the applet cannot request which pages it analyses; rather, it is sent commands from the central server. Therefore, as long as the server includes means to keep track of which applets have been sent which spider-request URLs, this restricts the possibility of a malicious third party sending incorrect information for any web page at will. A hacker might also or instead simply want to reduce the quality of information in the system as a whole. This risk can be alleviated by providing means to ensure that each web page is analysed by at least two different clients before being incorporated into the database. If these two clients do not agree on the analysis, then a third (or further) client is enlisted, and the two (or more) clients that agree have their data incorporated. This technique can be termed multiple-spidering. This technique introduces a potential overhead into the system that could reduce the spidering power of the system. However, there are normally so many clients performing spidering on behalf of the system that this is unlikely to be a significant issue, and indeed provides a further guarantee of the quality of the data. It is nevertheless possible to restrict the impact of this overhead, by close monitoring of aspects of the system.
Rather than multiple-spidering every single page that needs analysing, a small sample (say 1%) of pages may be multiple-spidered by default. If a significant number of these pages are rejected then this can be taken to indicate a problem of this kind. At this point (potentially automatically), the number of pages being multiple-spidered is gradually increased, up to the maximum of 100% so that the quality of the system's data is once more assured. At this point the malicious third party is defeated, no more contentious page analyses will be detected and the system can once again reduce the percentage of multiple-spidering back to the original low background level.
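A minimal sketch of multiple-spidering with an adaptive sampling rate is given below. The description says only that agreeing analyses are accepted and that the sampled fraction is raised gradually and later reduced; the doubling and halving steps, the 5% alarm threshold and both function names are assumptions. Each analysis is assumed to be reduced to a hashable value, such as a digest of the word list.

```python
from collections import Counter


def agreed_analysis(analyses, quorum=2):
    """Return the analysis reported by at least `quorum` clients, else None.

    None means a further client should be enlisted for this page.
    """
    counts = Counter(analyses)
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= quorum else None


def next_sample_rate(current, rejected, checked, alarm=0.05, floor=0.01):
    """Adjust the fraction of pages that are multiple-spidered.

    The rate doubles (up to 100%) while too many sampled pages are rejected
    and halves back towards the background level once the problem subsides.
    """
    if checked == 0:
        return current
    if rejected / checked > alarm:
        return min(1.0, current * 2)
    return max(floor, current / 2)
```

For example, two clients reporting different digests yield None, and a third report matching either of them settles the page.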
For single word or recognised-phrase queries, a list of URLs associated with that word or phrase is returned, ordered by their recommendation scores. Multiple word queries which appear in the phrase database will be treated in a similar fashion.
For multiple word queries which don't appear in the single word or phrase table, the product of the recommendation-rating for each word in the query is used to rank the URLs.
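The product rule for multiple word queries can be illustrated as follows. The in-memory index layout and the function name are hypothetical, and the sketch assumes that a URL lacking a rating for any query word is excluded, which is equivalent to treating the missing rating as zero.

```python
from math import prod


def rank_urls(query_words, index):
    """Rank URL IDs for a multi-word query by the product of per-word ratings.

    `index` maps each word to a {url_id: rating} dictionary.
    """
    lists = [index.get(word, {}) for word in query_words]
    if not lists:
        return []
    # Only URLs rated for every word in the query can have a non-zero product.
    common = set(lists[0]).intersection(*lists[1:])
    scored = {url_id: prod(ratings[url_id] for ratings in lists)
              for url_id in common}
    # Highest product first; ties broken by URL ID for a stable order.
    return sorted(scored, key=lambda url_id: (-scored[url_id], url_id))
```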
In one embodiment, two copies of the term lists are maintained - one ordered by rank and the other ordered by URL_ID. In an alternative embodiment, a word list associated with each URL is maintained. The best strategy for a given application can be readily determined by experiment.
The lists, and in particular the ordering of them, are advantageously updated whenever a query invokes them. A flag on each list will indicate whether its ordering is up to date.
To ensure that all pages are kept up to date, each user-query also results in one or more 'processing-tax' URLs being sent to the user, unrelated to the user's query, but which the system wants updated. A list of URLs may be sent to have their modified dates checked, and those that have changed are processed as described above. The processing tax may have to be pitched as high as 33% or even 50% in order to achieve the level of spidering required but lower levels such as 15%, 10%, 5%, 2%, 1% or less can also be used. This can be tuned according to the needs of the system at a particular stage in its development as a web authority. The tax rate can also be dependent upon a user's bandwidth, for example to levy a higher tax on "rich" users, i.e. those people with high speed connections.
Any pages which have been non-contactable for a certain number of days are preferably deleted from the list. Less highly-rated pages will thus have a shorter time-to-live than more important pages. In other words, the time-to-live for a particular page is preferably determined by a function of that page's importance and the continuous amount of time it has been non-contactable.
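One very simple form that such a time-to-live function could take is sketched below. The linear relationship, the constants and the function names are all assumptions; the description requires only that more important pages survive a longer period of being non-contactable.

```python
def time_to_live_days(rating, base_days=3, max_days=30):
    """Days a page may remain non-contactable before it is deleted."""
    # Hypothetical rule: one extra day of grace per 10 rating points.
    return min(max_days, base_days + rating // 10)


def should_delete(rating, days_unreachable):
    """True once a page has been unreachable for longer than its allowance."""
    return days_unreachable > time_to_live_days(rating)
```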
Preferably the data collection and processing (web page spidering) procedures described above as being performed by the applet are also implemented in corresponding code on the data collection server (although there may be no need to establish a socket connection). Following start-up the data collection server may then begin to populate the database using its own spidering processes. Initially, there will be relatively few users, so the system will have time to do its own searches. Gradually however it will be able to afford to spend less time searching and will have to spend more time interacting with clients. Concomitantly as time goes on the central data collection server will need to do less and less spidering itself, instead relying on the resources of the growing numbers of users. There is thus an elegant trade-off between available CPU and bandwidth and the number of users. Furthermore as the size of the web expands exponentially in terms of number of documents so the number of potential system users spidering the web also grows exponentially.
Initial spidering is thus preferably done by the server itself, utilising the same optimised distributed mechanism that is used to manage the many client applets once the system site is up and running with a large set of regular users. In a preferred embodiment a special version of the applet is therefore provided which has no or a very limited GUI, which has the same interface to the server as the normal applet and which spends substantially all its time requesting long lists of pages that need checking and returning the results in batches. This process is run on the server machine and any other available machines with permanent Internet connections. One way to build up the initial list of URLs is to query DNS servers for the complete set of registered domain names (for example, for .co.uk domains); alternatively this information may be purchased (for example from Network Solutions for top-level .com domains).
The foregoing embodiments of the system have been described with reference to the Internet, but the present invention is also applicable to other networks such as intranets, extranets, local and wide area networks, WAIS (Wide Area Information Servers) -based networks and wireless networks. Moreover although it is preferable to employ Internet and web-based technology this is not essential and the invention may be adapted for use with other systems in which applications are shared between machines which communicate with each other, for example over a network. Thus the invention is also applicable to mobile phone-accessed networks such as networks accessed by means of i-mode or WAP (Wireless Application Protocol).
No doubt many other effective alternative arrangements will occur to the skilled person and it should be understood that the invention is not limited to the described embodiments and encompasses modifications apparent to those skilled in the art lying within the spirit and scope of the claims appended hereto.

Claims

1. A server system for searching a network, the system comprising: a search data store storing: a plurality of addresses of locations of objects accessible using the network; and search data including data relating to information content of at least some of the objects; a program store storing processor implementable instructions; a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- receive a search request from a user terminal; retrieve search result data from the search data store comprising one or more search result address for objects having an information content relevant to the search request; transmit the search result data to the user terminal; receive from the user terminal information relating to an object located at an address provided to the user terminal by the server system; and update the stored search data using the object-related information received from the user terminal.
2. A server system as claimed in claim 1, wherein the instructions further comprise instructions for controlling the processor to:- retrieve from the search data store at least one search tax address; transmit the search tax address to the user terminal; and receive from the user terminal information relating to an object located at the search tax address.
3. A server system as claimed in claim 2, wherein the search tax address is retrieved in response to receipt of the search request; and wherein the instructions further comprise instructions for controlling the processor to: receive from the user terminal information relating to an object located at a search result address and information relating to an object located at a search tax address.
4. A server system as claimed in claim 2 or 3, wherein the instructions further comprise instructions for controlling the processor to: select a said search tax address for the user terminal according to a logical proximity of an object at the tax address to the user terminal.
5. A server system as claimed in any preceding claim, wherein the stored search data includes object content characterizing data for determining whether the stored search data requires updating; and wherein the instructions further comprise instructions for controlling the processor to: retrieve the object content characterizing data from the search data store; and transmit the object content characterizing data to the user terminal with the search result address and/or search tax address.
6. A server system as claimed in any preceding claim, wherein the information relating to an object received from the user terminal comprises object information content data for determining the relevance of the object's information content to a search request.
7. A server system as claimed in claim 6, wherein the object information content data includes a list of words in the object and word rating data indicating the likely significance of the words.
8. A server system as claimed in claim 7, wherein the instructions further comprise instructions for controlling the processor to: receive user object preference data for an object from the user terminal; and determine object rating data for an object using the user object preference data and word rating data for the object.
9. A server system as claimed in any preceding claim, wherein the instructions further comprise instructions for controlling the processor to: provide an address for the same object to two user terminals; receive information relating to the object from both user terminals; and update the stored search data conditionally upon information relating to the object having been received from both terminals.
10. A server system as claimed in any preceding claim, wherein the instructions further comprise instructions for controlling the processor to: receive and manage search requests concurrently from a plurality of user terminals.
11. A user terminal for searching a network, the user terminal comprising: a data store operable to store data to be processed; a program store storing processor implementable instructions; and a processor coupled to the data store and to the program store for implementing the stored instructions; the instructions stored in the program store comprising instructions for controlling the processor to:- input a search request from a user; transmit the search request to a server system; receive search result data from the server system, the search result data comprising one or more search result address for objects having an information content relevant to the search request; retrieve from at least one address received from the server system object data for an object located at the received address; and transmit to the server system information relating to the object located at the received address derived from the retrieved object data.
12. A user terminal as claimed in claim 11, wherein the at least one address comprises a search tax address; and wherein the instructions further comprise instructions for controlling the processor to:- receive from the server system at least one said search tax address; retrieve from the search tax address object data for an object located at the search tax address; and transmit to the server system information relating to an object located at the search tax address.
13. A user terminal as claimed in claim 12, wherein the at least one address comprises both a search tax address and a search result address; wherein the search tax address is received from the server system in association with the search result data; and wherein the instructions further comprise instructions for controlling the processor to:- retrieve and transmit to the server system information relating to an object located at the search tax address and information relating to an object located at the search result address.
14. A user terminal as claimed in claim 11 or 12, wherein the at least one address comprises a said search result address.
15. A user terminal as claimed in any one of claims 11 to 14, wherein the information relating to an object transmitted to the server system comprises object information content data; and wherein the instructions further comprise instructions for controlling the processor to:- determine said object information content data using the retrieved object data.
16. A user terminal as claimed in claim 15, wherein the object information content data includes a list of words in the object and word rating data indicating the likely significance of the words.
17. A user terminal as claimed in claim 15 or 16, wherein the instructions further comprise instructions for controlling the processor to:- receive from the server system object content characterizing data; and determine, using the object content characterizing data, whether object information content data is to be transmitted to the server system.
18. A user terminal as claimed in any one of claims 11 to 17, wherein the instructions further comprise instructions for controlling the processor to:- input user object preference data indicating a user preference for an object; and transmit the user object preference data to the server system.
19. A network server having a data store storing the processor implementable instructions of any one of claims 11 to 18.
20. A network server as claimed in claim 19, wherein the data store further stores a page of internet data including or having a pointer to the processor implementable instructions.
21. A computer program comprising the processor implementable instructions of any preceding claim.
22. A storage medium storing the computer program of claim 21.
23. A method for searching a network using a client system, the method comprising: inputting a search request from a user; transmitting the search request to a server system; receiving search result data from the server system, the search result data comprising one or more search result addresses for objects having an information content relevant to the search request; retrieving from at least one address received from the server system object data for an object located at the received address; and transmitting to the server system information relating to the object located at the received address derived from the retrieved object data.
24. A search system for a network comprising: a server coupled to the network; a plurality of user network-access means, couplable to the server via the network for providing a plurality of users with access to the network; a search database coupled to the server; an information collecting program accessible to each said user network-access means for running by said users; wherein said information collecting program is configured to, when running on a said user network access means, collect information relating to data stored at locations within the network and to pass at least a portion of the collected information to the search database; and wherein said locations are provided to the collecting program from the database in response to a search request sent by the collecting program to the server for search data from the database.
25. A search system as claimed in claim 24, wherein the network comprises an internet, and wherein the information collection program comprises code for running within an Internet browser and includes a user interface for searching the network.
26. A search system as claimed in claim 25 wherein the user interface is configured to accept input of a search term from a user and to pass a representation of the search term to the database, and wherein a said location is provided to the information collecting program in response.
27. A search system as claimed in claim 26 wherein the user interface is configured to provide a preview of a search result when a user hovers or clicks a cursor or pointer over a search result identifier.
28. A search system as claimed in any one of claims 24 to 27 wherein said information collecting program is downloadable from the server via the network for running by a said user network access means.
29. A search system as claimed in any one of claims 24 to 28 wherein the collected information comprises date and/or status information characterizing data stored at locations within the network.
30. A search system as claimed in any one of claims 24 to 29 wherein an item of collected information relates to a document or page of data and includes information selected from any or all of: a list of distinct words and/or keywords and/or headings in that document or on that page with a count of their frequency; a list of distinct phrases in that document or on that page with a count of their frequency; a number of words in that document or on that web page; a size of that document or web page; a ratio of the number of occurrences of distinct words or phrases to the size of that document or web page; a number of images in that document or on that web page; a size of an image or images in that document or on that web page; a number of audio clips in that document or on that web page; a size of an audio clip or audio clips in that document or on that web page; a number of movie clips in that document or on that web page; a size of a movie clip or movie clips in that document or on that web page; a number of hyperlinks in that document or on that web page.
31. A search system as claimed in any one of claims 24 to 30 wherein a said information collecting program comprises access enabling means to enable access to data stored within the network.
32. A search system as claimed in any one of claims 24 to 31 wherein the network is an Internet protocol network and the information relates to Internet data.
33. A search system as claimed in claim 32 wherein a said information collecting program is integrated into a web browser.
34. A method of updating a search system for a network, the system comprising: a server; a plurality of user network-access means, couplable to the server via the network, each for providing a user with network access; and a search database couplable to the server; the method comprising: running an information collecting program by a plurality of said users; collecting information relating to data stored within the network using the program; passing at least a portion of the information collected by said plurality of users to the search database; and updating the database using the collected information.
35. A method as claimed in claim 34, wherein said information collecting program comprises a user interface for the search system, the method further comprising: running said program when a said user performs a search; and collecting information from a location provided from the database in response to the search.
36. A method as claimed in claim 35, further comprising providing search results to said user, the results including a reference to said location, and wherein said information is collected from said location when the said user accesses the location.
37. A method as claimed in claim 36 further comprising providing a preview of a search result when a user hovers or clicks a cursor or pointer over a search result identifier.
38. A method as claimed in claim 36 or 37 further comprising: checking the information at the location against a corresponding entry in the database; and updating the database if a portion of the information is newer than the database entry.
39. A method as claimed in claim 38 wherein said updating comprises forwarding a representation of at least part of the information to the database.
40. A method as claimed in claim 39 wherein said forwarding provides a compressed copy of at least part of the information to the database.
41. A method as claimed in claim 39 wherein said forwarding comprises identifying key words or phrases in the information and passing a representation of these to the database.
42. A method as claimed in any of claims 24 to 41 wherein said collecting comprises checking whether said program is trusted to access data within the network.
43. A method as claimed in any one of claims 35 to 42 further comprising registering in the database an indication of a user's access or approval of information at the location provided from the database in response to the search.
44. A method as claimed in claim 43 comprising registering a user's bookmarking of the location.
45. A data processing system comprising means for carrying out the method of any one of claims 34 to 44.
46. A method as claimed in any one of claims 34 to 44, of updating a search system as claimed in any one of claims 24 to 33.
47. A search system according to claim 24, or a method according to claim 34, wherein the search system further comprises means for a user to input a search request and means to provide the user with a plurality of search results.
48. A search system or method as claimed in claim 47 wherein the search results are provided to the user together with an indication of their relative ranking, and wherein the ranking of a result is determined at least partly by other users' access and/or bookmarking of data indicated by the said result.
49. A program to, when running, on a network: provide a user interface for searching the network; accept a user search request; pass a request to a search database, responsive to the user request; receive a search result having network data location information from the database; access, or request another program to access, the data location; and pass information from the data location back to the database.
50. A web browser application program to, when running, receive a URL from a server, at least partly download a web page at the URL, extract a portion of information from the web page, and send the information to a web searching database on the web.
51. A web data collection system comprising a plurality of individual users each connected to the web and running a program to collect information on the contents of web pages and to report the information to a common database.
52. A search data store for the server system of claims 6, 7 or 8 wherein an item of the object information content data is associated with a plurality of item location addresses for objects having an information content relevant to the item of object information content data; and wherein the item location addresses have an order corresponding to the relevance of the objects at the addresses to the item of object information content data.
53. A database for a network searching system comprising: a list of network resource locators; a list of search terms or term identifiers; and a list of ratings, each linked to at least one resource locator and one term or term identifier, a value of each rating being dependent upon access to or approval of a corresponding located resource by users of the searching system.
54. A database for a network searching system as claimed in claim 53, further comprising means for a user to bookmark a resource locator by storing it in association with user identity information and wherein a said rating is dependent upon a combination of a rating of the corresponding resource with respect to at least one said search term and a rating dependent upon a frequency or count of bookmarks of the resource.
55. A method of bookmarking resource locations in a network searching system, the system comprising a server coupled to a search database and means for remote access to the database by a plurality of users, the method comprising: providing to a user in response to a search request, search results from the database, the results being associated with corresponding resource locators; receiving from the user a request to bookmark a resource associated with a said result; storing, in the database, a corresponding resource locator coupled with user access control information for the user; whereby the resource is locatable by the user after bookmarking.
56. A method of ranking results for a network search system, comprising: determining a first user's interest in a network resource by detecting whether the user stores the resource location for later access; and ranking a plurality of network resource locations provided as results for a search performed by another user, partly responsive to the first user's determined interest.
57. A method of providing a web user with a preview of a web page, comprising: locally caching at least part of the web page information; rewriting at least one link in the cached page to point to locally cached data; and displaying at least a part of the cached page.
58. A method as claimed in claim 57 wherein the displaying displays a simplified rendering of the cached page.
59. A method as claimed in claim 57 or 58 wherein the web page preview is displayed when a user hovers or clicks a cursor or pointer over a search result identifier.
60. A user interface for a network browser or search system, comprising means to automatically download a plurality of documents or web pages, or parts thereof, indicated by displayable results provided to a user, by starting a corresponding plurality of processing tasks to be executed in parallel.
61. A user interface as claimed in claim 60 further comprising means to monitor access by the user of the displayable results and to adjust in real time a relative priority of the processing tasks in response to real time visibility or invisibility of the results by the user.
62. A user interface as claimed in claim 60 or 61 further comprising means to monitor a performance measure for the user interface and to adjust the number of processing tasks to benefit performance.
63. A network search system comprising: means to store a search request input to the system by a user on a first occasion; and means to repeat the user's stored request automatically and to display the results of the request when the user accesses the system on a second, subsequent, occasion.
64. A network search system comprising: a server coupled to a search database; a remote network access means including input means for a user to input a search request; means to provide an instruction from the database to the remote network access means to access and analyse information relating to a resource on the network and to report to the search database; and means to provide search results to the network access means in response to the search request, conditional upon the database receiving the report.
65. A method for quality control of a database of search data for a network, comprising: instructing a plurality of client programs to gather information for the database from locations provided to the programs by the database; double checking a proportion of the gathered information by issuing identical or equivalent locations to two different client programs; determining whether the gathered information from the two client programs agrees to within a tolerance margin; and adjusting said proportion based on the results of said step of determining.
66. A data processing system comprising means for carrying out the method of any one of claims 55 to 59 or 65.
67. A system, method or program as claimed in any one of claims 24 to 51 in combination with the applicable features of any one of claims 53 to 65.
68. A web crawling system or applet to, when running, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
69. A web crawling system or applet as claimed in claim 68, having a graphical user interface.
70. An applet as claimed in claim 68, to additionally accept a search term from the user; submit the search term to a database system; receive results from the database system; and present the results to the user.
71. A web crawling system or applet as claimed in claim 68, 69 or 70, integrated into a web browser.
72. A stand-alone distributed web crawler to, when run, contact a web page; download that web page; analyse the contents of that web page; and send the results of its analysis to a database system.
73. A web crawler as claimed in claim 72, callable from a browser.
74. A system or applet or web-crawler as claimed in any one of claims 68 to 73, which performs one or more of the following functions: analyses a web page by constructing a list of distinct words and/or keywords and/or headings on that web page with a count of their frequency, for uploading to the database system; analyses a web page by constructing a list of distinct phrases on that page with a count of their frequency, for uploading to the database system; analyses a web page by determining the number of words on that web page, for uploading to the database system; analyses a web page by determining the size of that web page, for uploading to the database system; analyses a web page by determining the ratio of the number of occurrences of distinct words or phrases to the size of that web page, for uploading to the database system; analyses a web page by determining the number of images on that web page, for uploading to the database system; analyses a web page by determining the size of an image or images on that web page, for uploading to the database system; analyses a web page by determining the number of audio clips on that web page, for uploading to the database system; analyses a web page by determining the size of the audio clips on that web page, for uploading to the database system; analyses a web page by determining the number of movie clips on that web page, for uploading to the database system; analyses a web page by determining the size of the movie clips on that web page, for uploading to the database system; analyses a web page by determining the number of hyperlinks on that web page, for uploading to the database system; allocates one thread for running the graphical user interface; allocates one thread for each web page it is downloading; measures its own performance and determines the optimal number of threads to be created for downloading web pages.
75. A system or applet or web-crawler as claimed in any one of claims 68 to 73, which allocates one thread for each web page it is downloading.
76. A system or applet or web-crawler as claimed in claim 75, which measures its own performance and determines the optimal number of threads to be created for downloading web pages.
77. A system or applet or web-crawler as claimed in any one of claims 68 to 73, which incorporates or in which the database incorporates a text filter, in particular a filter consisting, at least in part, of a list of prohibited words such as words having pornographic or racist connotations.
78. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a method by which one or more URLs that have been returned in response to the search term(s) can be displayed to the user.
79. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a status indication method to indicate to the user whether indicated URL(s) are up to date or have not yet been contacted.
80. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a date indication method to indicate to the user the last time web page(s) indicated by the displayed URL(s) were updated.
81. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a bookmarking method to permit the user to bookmark any previously described indicated URL(s).
82. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a bookmarking indicator to indicate to the user whether a previously described indicated URL(s) has been bookmarked.
83. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a "bookmarked page has been deleted" indicator to indicate to the user that a page that the user has bookmarked is not currently available on the world wide web.
84. A system or applet or web-crawler as claimed in any one of claims 68 to 73, in which a web-crawling signed Java applet with a graphical user interface incorporates a page preview method which displays one or more parts of web page(s) indicated by the indicated URL(s).
85. A system or method which translates data from a web crawling system or applet as claimed in any one of claims 68 to 84, into a form intelligible to a predetermined database or database type.
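The page analysis recited in claims 30 and 74 (distinct words with their frequency, page size, image and hyperlink counts, and a word-to-size ratio) might be sketched as follows. This is only an illustrative sketch in Python, not the applet described in the application: the names `PageStats` and `analyse_page`, the tokenising rule, and the choice to measure page size in UTF-8 bytes are all assumptions made for the example, and the ratio shown is one possible reading of the claim wording.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class PageStats(HTMLParser):
    """Collects word frequencies and simple element counts for one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = Counter()   # distinct word -> frequency
        self.images = 0          # number of <img> elements seen
        self.hyperlinks = 0      # number of <a> elements carrying an href
        self._skip = 0           # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            self.images += 1
        elif tag == "a" and any(name == "href" for name, _ in attrs):
            self.hyperlinks += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; tokenise visible text (assumed rule).
        if not self._skip:
            self.words.update(re.findall(r"[a-z0-9']+", data.lower()))


def analyse_page(html: str) -> dict:
    """Return a report of the kind a client might upload to the search database."""
    parser = PageStats()
    parser.feed(html)
    parser.close()
    size = len(html.encode("utf-8"))      # page size in bytes (assumption)
    total = sum(parser.words.values())    # total word occurrences
    return {
        "word_counts": dict(parser.words),
        "word_total": total,
        "page_size_bytes": size,
        "words_per_byte": total / size if size else 0.0,
        "image_count": parser.images,
        "hyperlink_count": parser.hyperlinks,
    }
```

A client would call `analyse_page` on the downloaded page text and send the resulting dictionary, rather than the whole page, to the server.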
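The quality-control method of claim 65 (issuing the same location to two clients, comparing their reports within a tolerance, and adjusting the double-checked proportion) could look roughly like the sketch below. The claim does not say how agreement is measured or in which direction the proportion moves, so the relative-difference measure, the step size, the bounds and every function name here are assumptions chosen for illustration.

```python
import random


def agrees(report_a: dict, report_b: dict, tolerance: float = 0.1) -> bool:
    """True if two word-count reports differ by at most `tolerance` (relative)."""
    keys = set(report_a) | set(report_b)
    total = sum(max(report_a.get(k, 0), report_b.get(k, 0)) for k in keys)
    if total == 0:
        return True  # two empty reports trivially agree
    diff = sum(abs(report_a.get(k, 0) - report_b.get(k, 0)) for k in keys)
    return diff / total <= tolerance


def pick_for_double_check(locations, proportion, rng=random):
    """Select roughly `proportion` of the locations to issue to a second client."""
    return [loc for loc in locations if rng.random() < proportion]


def adjust_proportion(proportion, checked, disagreed,
                      step=0.05, low=0.01, high=1.0):
    """Raise the proportion after any disagreement, lower it otherwise (assumed policy)."""
    if checked == 0:
        return proportion  # nothing was checked, so there is no evidence to act on
    if disagreed > 0:
        proportion += step
    else:
        proportion -= step
    return min(high, max(low, proportion))
```

Keeping a non-zero lower bound means some double checking always continues, so a client that starts sending bad reports later can still be noticed.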
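Claims 48, 54 and 56 describe ranking that depends partly on other users' bookmarking of a resource. The claims do not give a formula, so the following is a hypothetical sketch only: the `Result` record, the additive combination, the logarithmic damping and the default weight are assumptions made for the example.

```python
import math
from typing import NamedTuple


class Result(NamedTuple):
    url: str
    term_rating: float    # relevance of the page to the search term
    bookmark_count: int   # number of users who have bookmarked the page


def combined_rating(result: Result, bookmark_weight: float = 0.5) -> float:
    """Combine term relevance with a damped bookmark count (assumed formula)."""
    # log1p keeps very popular pages from swamping term relevance entirely.
    return result.term_rating + bookmark_weight * math.log1p(result.bookmark_count)


def rank(results, bookmark_weight: float = 0.5):
    """Return the results ordered from highest to lowest combined rating."""
    return sorted(results,
                  key=lambda r: combined_rating(r, bookmark_weight),
                  reverse=True)
```

With a weight of zero the ordering falls back to term relevance alone, which makes the contribution of bookmarks easy to compare.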
PCT/GB2001/001149 2000-03-22 2001-03-15 Search systems WO2001075668A2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU40857/01A AU4085701A (en) 2000-03-22 2001-03-15 Search systems

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GBGB0006991.4A GB0006991D0 (en) 2000-03-22 2000-03-22 Search systems
GB0006991.4 2000-03-22

Publications (2)

Publication Number Publication Date
WO2001075668A2 (en) 2001-10-11
WO2001075668A3 (en) 2003-10-09

Family

ID=9888225

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2001/001149 WO2001075668A2 (en) 2000-03-22 2001-03-15 Search systems

Country Status (3)

Country Link
AU (1) AU4085701A (en)
GB (1) GB0006991D0 (en)
WO (1) WO2001075668A2 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004092980A1 (en) * 2003-04-17 2004-10-28 Nokia Corporation File upload using a browser
EP1805987A1 (en) * 2004-10-26 2007-07-11 Samsung Electronics Co., Ltd. Displaying apparatus and method for processing text information thereof
EP2084628A2 (en) * 2006-11-20 2009-08-05 Yapta, Inc. Data retrieval and price tracking for goods and services in electronic commerce
US8095534B1 (en) 2011-03-14 2012-01-10 Vizibility Inc. Selection and sharing of verified search results
US20120221546A1 (en) * 2011-02-24 2012-08-30 Rafsky Lawrence C Method and system for facilitating web content aggregation initiated by a client or server
US8429185B2 (en) 2007-02-12 2013-04-23 Microsoft Corporation Using structured data for online research
JP2013120416A (en) * 2011-12-06 2013-06-17 Canon Inc Information processing apparatus, information processing method, and computer program
US8595259B2 (en) 2007-02-12 2013-11-26 Microsoft Corporation Web data usage platform

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968824B (en) * 2018-09-30 2023-08-25 北京国双科技有限公司 Page data processing method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5694593A (en) * 1994-10-05 1997-12-02 Northeastern University Distributed computer database system and method
WO1999048028A2 (en) * 1998-03-16 1999-09-23 Globalbrain.Net Inc. Improved search engine

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CABRI G ET AL: "MOBILE-AGENT COORDINATION MODELS FOR INTERNET APPLICATIONS" COMPUTER, IEEE COMPUTER SOCIETY, LONG BEACH., CA, US, US, vol. 33, no. 2, February 2000 (2000-02), pages 82-89, XP000893912 ISSN: 0018-9162 *
MILLER R C ET AL: "SPHINX: a framework for creating personal, site-specific Web crawlers" COMPUTER NETWORKS AND ISDN SYSTEMS, NORTH HOLLAND PUBLISHING. AMSTERDAM, NL, vol. 30, no. 1-7, 1 April 1998 (1998-04-01), pages 119-130, XP004121434 ISSN: 0169-7552 *
YAMANA H ET AL: "Experiments of collecting WWW information using distributed WWW robots" PROCEEDINGS OF THE ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, XX, XX, 1998, pages 379-380, XP002142760 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004092980A1 (en) * 2003-04-17 2004-10-28 Nokia Corporation File upload using a browser
EP1805987A1 (en) * 2004-10-26 2007-07-11 Samsung Electronics Co., Ltd. Displaying apparatus and method for processing text information thereof
EP1805987A4 (en) * 2004-10-26 2010-10-27 Samsung Electronics Co Ltd Displaying apparatus and method for processing text information thereof
EP2084628A2 (en) * 2006-11-20 2009-08-05 Yapta, Inc. Data retrieval and price tracking for goods and services in electronic commerce
EP2084628A4 (en) * 2006-11-20 2011-11-02 Yapta Inc Data retrieval and price tracking for goods and services in electronic commerce
US8775563B2 (en) 2006-11-20 2014-07-08 Yapta, Inc. Dynamic overlaying of content on web pages for tracking data
US8429185B2 (en) 2007-02-12 2013-04-23 Microsoft Corporation Using structured data for online research
US8595259B2 (en) 2007-02-12 2013-11-26 Microsoft Corporation Web data usage platform
US20120221546A1 (en) * 2011-02-24 2012-08-30 Rafsky Lawrence C Method and system for facilitating web content aggregation initiated by a client or server
US8095534B1 (en) 2011-03-14 2012-01-10 Vizibility Inc. Selection and sharing of verified search results
JP2013120416A (en) * 2011-12-06 2013-06-17 Canon Inc Information processing apparatus, information processing method, and computer program

Also Published As

Publication number Publication date
AU4085701A (en) 2001-10-15
WO2001075668A3 (en) 2003-10-09
GB0006991D0 (en) 2000-05-10

Similar Documents

Publication Publication Date Title
JP4846922B2 (en) Method and system for accessing information on network
US8832085B2 (en) Method and system for updating a search engine
US6408316B1 (en) Bookmark set creation according to user selection of selected pages satisfying a search condition
US6460060B1 (en) Method and system for searching web browser history
US9703885B2 (en) Systems and methods for managing content variations in content delivery cache
US6480853B1 (en) Systems, methods and computer program products for performing internet searches utilizing bookmarks
JP3295667B2 (en) Method and system for accessing information on a network
CA2300239C (en) A content-indexing search system and method providing search results consistent with content filtering and blocking policies implemented in a blocking engine
US6714934B1 (en) Method and system for creating vertical search engines
US7552109B2 (en) System, method, and service for collaborative focused crawling of documents on a network
US7933917B2 (en) Personalized search method and system for enabling the method
US7487145B1 (en) Method and system for autocompletion using ranked results
US20020123988A1 (en) Methods and apparatus for employing usage statistics in document retrieval
US6961751B1 (en) Method, apparatus, and article of manufacture for providing enhanced bookmarking features for a heterogeneous environment
US20010037325A1 (en) Method and system for locating internet users having similar navigation patterns
JPH1091638A (en) Retrieval system
GB2406399A (en) Searching within a computer network by entering a search term and optional URI into a web browser
CN101551813A (en) Network connection apparatus, search equipment and method for collecting search engine data source
US20050125412A1 (en) Web crawling
WO2001075668A2 (en) Search systems
US8914347B2 (en) Extensible search engine
JP2004110080A (en) Computer network connection method on internet by real name, and computer network system
JPH11184862A (en) Group adaptation-type information retrieval device
KR20010076035A (en) Direct Access Internet Service System and Method
WO2005022401A1 (en) Method, device and software for querying and presenting search results

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ PL PT RO RU SD SE SG SI SK SL TJ TM TR TT TZ UA UG US UZ VN YU ZA ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
122 Ep: pct application non-entry in european phase
NENP Non-entry into the national phase

Ref country code: JP