
2007-01-20

Tencent's browser is actually much faster than Firefox

I really didn't expect this, since I had already decided to be a loyal Firefox user, but I have found that Firefox is not actually fast at loading pages, and it uses a great deal of memory. Google's synchronization is very handy, but I am still thinking about using other browsers more. QQ's Paipai shopping site is also surprisingly good: I just searched it for several fairly obscure books and found every one of them. It turns out many bookstores have opened shops on Tencent's Paipai, so its stock is very rich. Not bad at all; perhaps I should open a shop there myself, if only to sell my old books.

How to build a WebFountain: An architecture for very large-scale text analytics

by D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien

WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level "augmenters" and corpus-level "miners," and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.

This paper describes WebFountain as a platform for very large-scale text analytics applications. WebFountain processes and analyzes billions of documents and hundreds of terabytes of information by using an efficient and scalable software and hardware architecture. It has been created as an infrastructure in the text analytics marketplace.

Analysts expect this market to grow to five billion dollars by 2005. The leaders in the text analytics market provide easily installed packages that focus on document discovery within the enterprise (i.e., search and alerts) and often bring some level of analytical function. The remainder of the market is populated with smaller entrants offering niche solutions that either address a targeted business need or bring to bear some piece of the growing body of corporate and academic research on more advanced text analytic techniques.

Lower-function commercial solutions typically operate in the domain of a million documents or so, whereas higher-function offerings exist at a significantly lower scale. Such offerings focus primarily on the enterprise and secondarily on the World Wide Web through the mechanism of small-scale focused "crawls."

When large-scale exploitation of the World Wide Web is required, individuals and corporations alike turn to undifferentiated lower-function solutions such as hosted keyword search engines.1, 2 Typically, such solutions receive a small number of keywords (often one) and are unaware that the query comes from a competitive intelligence analyst, or an economics professor, or a professional baseball player.

Users with a business need to exploit the Web or large-scale enterprise collections are justifiably unsatisfied with the current state of affairs. Web-scale offerings leave professional users with the sense that there is fantastic content "out there" if only they could find it. Provocative new offerings showcase sophisticated new functions, but no vendor combines all these exciting new approaches—truly effective solutions require components drawn from diverse fields, including linguistic and statistical variants of natural language processing, machine learning, pattern recognition, graph theory, linear algebra, information extraction, and so on. The result is that corporate information technology departments must struggle to cobble together combinations of different tools, each of which is a monolithic chain of data ingestion, processing, and user interface.

This situation spurred the creation of WebFountain as an environment where the right function and data can be brought together in a scalable, modular, extensible manner to create applications with value for both business and research. The platform has been designed to encompass different approaches and paradigms and make the results of each available to the others.

A complete presentation and performance analysis of the WebFountain platform is unfortunately beyond the scope of this paper; instead, we adopt the approach taken by the book How to Build a Beowulf,3 which laid out in high-level terms a set of architectural decisions that had been used successfully to produce "Beowulf" clusters of commodity machines. We now describe the high-level design of the WebFountain system.

Requirements

The requirements for a very large-scale text analytics system that can process Web material are as follows:

  1. It must support billions of documents of many different types.
  2. It must support documents in any language.
  3. Reprocessing all documents in the system must take less than 24 hours.
  4. New documents will be added to the system at a rate of hundreds of millions per week.
  5. Some required operations will be computationally intensive.
  6. New approaches and techniques for text analytics will need to be tried on the system at any time.
  7. Since this is a service offering, many different users must be supported on the system at the same time.
  8. For economic reasons, the system must be constructed primarily with general-purpose hardware.

Related literature

The explosive growth of the Web and the difficulty of performing complex data analysis tasks on unstructured data has led to several different lines of research and development. Of these, the most prominent are the Web search engines (see, for instance, Google1 and AltaVista 2), which have been primarily designed to address the problem of "information overload." A number of interesting techniques have been suggested in this area; however, because this is not the direct focus of this paper, we omit these here. The interested reader is referred to the survey by Broder and Henzinger.4

Several authors5–9 describe relational approaches to Web analysis. In this model, data on the Web are seen as a collection of relations (for instance, the "points to" relation), each of which are realized by a function and accessed through a relational engine. This process allows a user to describe his or her query in declarative form (Structured Query Language, or SQL, typically) and leverages the machinery of SQL to execute the query. In all of these approaches, the data are fetched dynamically from the network on a lazy basis, and therefore, run-time performance is heavily penalized.

The Internet Archive,10 the Stanford WebBase project, 11 and the Compaq Computer Corporation (now Hewlett-Packard Company) SRC Web-in-a-box project12 have a different objective. The data are crawled and hosted, as is the case in Web search engines. In addition, a streaming data interface is provided that allows applications to access the data for analysis. However, the focus is not on support for general and extensible analysis.

The Grid initiative13 provides highly distributed computation in a world of multiple "virtual organizations"; the focus is, therefore, on the many issues that arise from resource sharing in such an environment. This initiative is highly relevant to the WebFountain business model, in which multiple partners interact cooperatively with the system. Architecturally, however, WebFountain is a distributed system based on local area networks rather than wide-area networks, and thus the particular models differ.

The Semantic Web14 initiative proposes approaches to make documents more accessible to automated reasoning. WebFountain annotations on documents may be seen as an internal representation of standardized markup as provided by frameworks such as the Resource Description Framework (RDF),15 upon which ontologies of markups can be built using mechanisms such as OWL 16 or DAML.17

Other research from different areas with significant overlap includes IBM's autonomic computing initiative,18, 19 which addresses issues of "self-healing" for complex environments, such as WebFountain.

System design

The main WebFountain is designed as a loosely coupled, share-nothing parallel cluster of Intel-based Linux** servers. It processes and augments articles using a variant of the blackboard system approach to machine understanding. 20,21 These augmented articles can then be queried. Additionally, aggregate statistics or other cross-document meta-data can be computed across articles, and the results can be made available to applications.

The loosely coupled nature of the cluster makes it a natural fit for a Web-service-style communication approach, for which we use a lightweight, high-speed Simple Object Access Protocol (SOAP) derivative called Vinci.22

We scale up to billions of documents by making sure that full parallelism can be achieved, and by adding a fair amount of hardware to solve the problem (currently, 256 nodes in the main cluster alone). This level of scaling is made possible because the same hardware and, in many cases, the same results of analysis are used to support multiple customers.

To support the multilingual requirement, all documents are converted to Universal Character Set transformation format 8 (UTF-8)23 upon ingestion, allowing the system to support transport, storage, indexing, and augmentation in any language. We currently have developed text analytics for Western European languages, Chinese, and Arabic, with others being developed and imported.

Ingestion is supported by a 48-node "crawler" cluster that obtains documents from the Web as well as other sources and sends them into the main processing cluster (see Figure 1).

[Figure 1]

Additional racks of SMP (symmetric multiprocessor) machines and blade servers supply the additional processing needed for more complex tasks. A well-defined application programming interface (API) allows new augmenters and miners (both described later) to be added as needed, and an overall cluster management system (also described later), backed by a number of human operators, schedules tasks to allow maximum utilization of the system.

Ingestion

The process of loading data into WebFountain is referred to as ingestion. Because ingestion of Web sources is so important to a system for large-scale unstructured data analysis, the ingestion subsystem is broken into two components. The first focuses on large-scale acquisition of Web content, for which the primary issues are the raw scale of the data and the heterogeneity of the content itself. The second focuses on acquisition of other sources, for which the primary concerns are extraction of the data itself and management of the delivery channel. We discuss these two components separately.

Acquisition of Web data. The approach taken and the hardware and software used to acquire data are indicated in this discussion.

Approach. Crawling large portions of the Web requires a system that has performance points high enough to saturate the inbound bandwidth, but that also can perform a fair amount of analysis on the pages fetched to determine where to go next. Crawling is a cycle of queuing up pages to acquire, fetching, and then examining the results and deciding where to go next. These decisions can be quite complex to make because there is a desire to maximize the value of the pages fetched. (See Figure 2.) Such feedback is required to avoid traditional problems of crawling: automatically generated or trivially changed pages, sites with session identifiers (IDs) in the uniform resource locator (URL), content type of no interest to the user population, and so forth.

[Figure 2]

We achieve the performance through share-nothing parallelism in the fetcher (as well as share-nothing processing in the cluster). The "single point" is the queue, which fortunately is quite simple: URLs to crawl are communicated to it from the various evaluators along with a priority. These pages are then pulled from it by the various fetcher instances, which are allocated work by a simple hash of the host name. Each fetcher node then maintains its own queue and selects the next URLs to crawl, based on freshness, priority, and politeness (avoiding heavily loading a Web server with multiple successive accesses to the same server).
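
As a small illustration of this work allocation, the C++ sketch below (C++ being the language the fetcher itself is written in) hashes the host part of each URL to pick one of the 48 fetcher nodes, so that all pages from a given host land on the same node and politeness can be enforced there locally. The node count comes from the text; the URLs and names are invented for the example, and this is not the actual WebFountain code.

#include <functional>
#include <iostream>
#include <string>

int main() {
    const std::size_t kFetcherNodes = 48;  // size of the fetch cluster (from the text)
    const std::string urls[] = {
        "http://www.ibm.com/research",
        "http://www.example.org/page1",
        "http://www.example.org/page2",
    };
    std::hash<std::string> hasher;
    for (const std::string& url : urls) {
        // Pull out the host part of the URL; every URL from one host
        // maps to the same fetcher node, so that node can enforce
        // politeness for the host locally.
        std::size_t start = url.find("//") + 2;
        std::size_t end = url.find('/', start);
        std::string host = url.substr(start, end - start);
        std::size_t node = hasher(host) % kFetcherNodes;
        std::cout << host << " -> fetcher node " << node << "\n";
    }
    return 0;
}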

Hardware. The processing portion of the crawler is part of the main mining and storage cluster and is thus discussed later. The queue resides on a single queue management machine (an IBM xSeries* Model x335 server, which uses a 2.4 gigahertz (GHz) Intel Xeon** processor with 4 gigabytes (GB) of random access memory, or RAM). Requested Web pages are dispatched via a hash on the site name to 48 fetcher nodes, which are responsible for throttling load on the sites being examined, obeying politeness rules, and so forth. These machines connect to the Internet at large (at the moment) through 75 Mbps of an OC3 (Optical Carrier level 3) line.

Software. The fetch cluster is coordinated through a Web service interface, with a DB2* system to hold information to support politeness policies. The fetcher itself is written in C++, as is the queue. We run multiple DNS (Domain Name System) servers and caches to reduce load on our upstream DNS providers. The queue supports priorities and has a transaction type semantic on work (preventing a failed fetch machine from resulting in data loss).
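
The priority and transaction-type semantics might be pictured roughly as in the following sketch, in which a URL handed to a fetcher is leased rather than deleted and is requeued if no acknowledgment comes back. This is an illustrative reconstruction under those two stated properties, not the real queue implementation.

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <vector>

struct Item { int priority; std::string url; };
struct Cmp {
    bool operator()(const Item& a, const Item& b) const {
        return a.priority < b.priority;   // highest priority on top
    }
};

int main() {
    std::priority_queue<Item, std::vector<Item>, Cmp> pending;
    std::map<std::string, Item> leased;   // handed out, not yet acknowledged

    pending.push({5, "http://www.example.org/a"});
    pending.push({9, "http://www.example.org/b"});

    // Lease the highest-priority URL to a fetcher node.
    Item work = pending.top();
    pending.pop();
    leased[work.url] = work;

    // If the fetcher crashes instead of acknowledging, the lease is
    // requeued, so a failed fetch machine loses no work.
    bool ack_received = false;            // pretend the fetcher died
    if (!ack_received) {
        pending.push(leased[work.url]);
        leased.erase(work.url);
    }
    std::cout << "pending after requeue: " << pending.size() << "\n";
    return 0;
}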

Acquisition of other data sources. WebFountain employs a number of other data sources as well: traditional news feeds, preprocessed bulletin boards, discussion groups, analyst reports, and a variety of both structured and unstructured customer data.

Approach. All of these sources come with their own unique delivery method, formatting, and so on. The ingestion task for all these methods consists of rationalizing the content into Extensible Markup Language (XML), which can then be loaded into storage, or the store, through an "XML fetcher." For most sources, this operation includes an attempt to either provide or begin a chain of mining that will result in the running text being available to further mining chains (the so-called DetaggedContent). In this way, many of the high-level linguistic analyses can be run on all data, regardless of source.

Hardware. As might be imagined, "other" data sources require a variety of data access approaches. Some of the data comes on CDs (compact disks), some on magnetic tape, some on removable hard drives, some via Web site, much via FTP (File Transfer Protocol) (both pull and push), some via Lotus Notes* database replication, some via e-mail, and so forth.

Each of these delivery mechanisms may imply a single machine per source, or a machine shared across a small number of sources, to accommodate the particular needs of that source. For instance, particular operating system versions are a typical requirement. However, the data volume on these sources tends to be relatively small, so in most cases a single IBM xSeries Model x335 server is sufficient.

Software. Typical data sources require a specialized "adapter." Each of these adapters is responsible for reducing the input data to XML files, usually one per "document," as described in the preceding subsection "Approach."

Data storage

The task of the WebFountain Store component is to manage entities (where an entity is a referenceable unit of information) represented as frames24 in XML files. This management entails storage, modification, and retrieval. In WebFountain, entities are typically documents (Web pages, newsgroup postings), but might also be concepts such as persons, places, or companies. Entities have two properties: a type and a set of keys, each of which is associated with a multiset of values. Interpretation of the semantics of a particular key depends on the entity type. Common to all entity types is a key called the Universal Entity Identifier (UEID), which holds the globally unique 16-byte identifier for that entity. See Figures 3 and 4.

[Figures 3 and 4]
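
A minimal C++ sketch of this entity model, with illustrative class and field names rather than the actual WebFountain frame classes, might look like this:

#include <array>
#include <iostream>
#include <map>
#include <string>

// Illustrative stand-in for a WebFountain entity: a type, a 16-byte
// UEID, and keys each holding a multiset of values (hence multimap).
struct Entity {
    std::string type;
    std::array<unsigned char, 16> ueid{};          // globally unique ID
    std::multimap<std::string, std::string> keys;  // key -> values
};

int main() {
    Entity page;
    page.type = "document";
    page.keys.insert({"URL", "http://www.example.org/"});
    page.keys.insert({"Person", "John Smith"});
    page.keys.insert({"Person", "Jane Doe"});  // a key may repeat
    std::cout << page.type << " carries " << page.keys.count("Person")
              << " Person values\n";
    return 0;
}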

Challenge. The WebFountain store must receive entities, modify them, and return parts of them as needed. The challenge is one of scale in the face of several very different access patterns that need to be supported. These access patterns can be classified as creating new entities, modifying existing ones, or reading parts of existing ones. Access for a particular client can be either a sequential pass through most or all of the data, or a sequence of random accesses.

The key problem is avoiding the overhead of a disk seek for each access of each client. Latency for a small seek followed by a read or write is dominated by the seek time, and throughput is thus limited to the low hundreds of operations per second. The use of RAID5 (redundant array of independent disks, level 5) to help with maintenance and uptime just exacerbates the problem, as each write to the array becomes three to four writes to the devices.25

The traditional approach for dealing with these types of patterns has been to cache heavily. This does not help as much as we might like as the data scale is too large and the typical access pattern too random for cache hits to result in substantial savings.

Approach. Regardless of other tricks used to address this problem, it is always desirable to have access to numerous disk heads working in parallel. If the problem can be spread over many disk heads (or arrays), and the order in which data are returned is unimportant, this spreading can result in linear increases in speed. We hash a unique part of the entity (in most cases the UEID) to determine storage locations, providing uniform distribution across all the devices, and take full advantage of share-nothing parallelism on the later mining and indexing steps.

On the disk itself, we take a compromise position of storing data together in bundles of a few hundred entities. The number of entities in the bundle is a tuning parameter that can be adjusted to match the overall workload. This approach allows sequential access when the whole data set needs to be examined, at the expense of a small penalty for each random access. Random access is achieved by hosting a UEID-to-bundle lookup using a fully dynamic B+Tree data structure on disk, where all the nonleaf nodes are kept in memory. For storage devices that handle fewer than five million UEIDs, the entire lookup tree is kept in memory for faster access.
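
The placement-and-lookup scheme just described might be sketched as follows, with std::map standing in for the on-disk B+Tree; the node count and bundle size are the tuning parameters named in the text, and the rest is invented for illustration.

#include <cstdint>
#include <iostream>
#include <map>

int main() {
    const uint64_t kNodes = 256;       // storage nodes in the cluster
    const uint64_t kBundleSize = 300;  // entities per bundle (a tuning knob)

    // A UEID -> bundle index (the paper keeps one such B+Tree per
    // storage device; an ordered map stands in for it here).
    std::map<uint64_t, uint64_t> ueid_to_bundle;
    uint64_t appended = 0;             // entities appended so far

    for (uint64_t ueid = 1000; ueid < 1010; ++ueid) {
        uint64_t node = ueid % kNodes;               // placement by hashing the UEID
        uint64_t bundle = appended++ / kBundleSize;  // fill bundles in order
        ueid_to_bundle[ueid] = bundle;
        std::cout << "UEID " << ueid << " -> node " << node
                  << ", bundle " << bundle << "\n";
    }

    // Random access costs one index probe plus one bundle read;
    // sequential access streams whole bundles and skips the index.
    std::cout << "UEID 1005 is in bundle " << ueid_to_bundle.at(1005) << "\n";
    return 0;
}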

The physical storage is fronted by a Web service interface that performs some access optimization and pooling of data access. It provides separate sequential and random access APIs and uses a different family of optimization techniques in each case.

Hardware. We are using 2560 72-gigabyte drives, arranged into 512 five-disk RAID5 arrays. These drives are hosted by IBM TotalStorage* IP Storage 200i iSCSI (Internet Small Computer System Interface) devices, and the actual storage interface is via 256 network-attached IBM xSeries Model x335 Linux servers, grouped into eight-server nodes, connected via a jumbo frame gigabit network. This works out to approximately 0.5 terabyte per "node" of formatted space, which is a suitable mix of processor and storage for the augmentation to be done later (see the next section). The storage is formatted with a ReiserFS V2 file system. We are running various Linux kernels in approximately the Version 2.4.15 to Version 2.4.20 range.

Software. The serialization format used to transfer data over the network is the same format used to store frames on disk. This helps to reduce processing overhead because little or no parsing of the data is needed. This low CPU utilization approach is important because the iSCSI storage approach does require a certain amount of computation itself, and we need to do augmentation and indexing on the nodes as well.

The storage server is multithreaded both in the client requests and disk access (via asynchronous I/O), so as to achieve deep disk queues and the corresponding increases in speed.26

For sequential access, storage reads an entire bundle (typically 8 MB or larger) to increase read size, and also performs prefetching of the next bundle when appropriate. The use of bundles, however, requires periodic reorganization to remove deleted records. Thus, an on-line reorganization thread needs to run in the background (see, for example, Sockut and Iyer27).

The last feature is a very simple skip selection function used with sequential access. Certain commonly accessed collections have a tag embedded in their directory information that allows storage to provide (for example) all the nonduplicate, nonpornographic English language pages (a common starting point for marketing research).

Performance. Table 1 shows current performance numbers in documents per second, where a document is a Web page. As the table shows, read access is nearly the same in both access patterns, but random writes are considerably more expensive.


Table 1   Performance of storage per node (documents/second)





Access       Read  Create  Modify
Sequential    440     200     350
Random        420     200     150

Future directions. Future explorations will include the trade-offs of direct-attached versus network-attached storage (NAS), testing of new NAS approaches such as RDMA (Remote Direct Memory Access), the use of hardware iSCSI controllers, the possibility of either changing the file system to ReiserFS V4 or XFS from Silicon Graphics, Inc. (SGI) or dispensing with it entirely and managing the raw device. Lastly, kernel-based asynchronous I/O is developing and may provide another considerable decrease in processor I/O overhead.

Augmentation

Augmenters are special-purpose programs that extract information from entities in storage and add new key or value pairs to these entities. Each augmenter can be thought of as a domain-specific expert. The blackboard approach is traditionally implemented as a set of experts considering a problem while sitting in front of a virtual blackboard, each adding a comment to the blackboard when that expert understands part of the problem.20,21 However, a straightforward implementation of this approach leads to inefficiencies caused by contention and excessive disk I/O, as each expert examines and augments each entity independently. WebFountain takes a slightly different approach of moving the blackboard past each expert and giving each a chance to add comments as the data stream by.28 This approach turns what was a contention problem into a pipeline, at the price of somewhat decreased interaction among the experts.

We refer to these experts by the somewhat less pretentious term "augmenters." For example, an augmenter might look at an article and extract the names of the people mentioned therein. The next augmenter might match the extracted names against a list of chief executive officers (CEOs), and a third augmenter might further annotate certain CEOs with the year in which they acquired the position.

A similar chain of augmenters might recognize syntactic structures, such as important domain-specific noun phrases or geographical entities, or might augment a page with a version of the content translated from Korean to English.

Where appropriate, these augmentations are then merged and indexed. The index query language will accept simultaneous requests for augmentations produced by different augmenters, allowing queries that might, for example, determine all pages with a reference to a European location, a new CEO, and a synonym for "layoff."

Challenge. Each augmenter is an independent process that must run against some subset of the entities in the system. The challenge is to perform these augmentations as quickly and efficiently as possible, taking advantage of ordering, batching, and so forth, given that the augmenters themselves are at times not "hardened" (production-level) code and thus must run isolated ("sandboxed"). Augmenters may have different access control restrictions, may require different subsets of the data, may exhibit dependencies on one another, and may run on different physical machines, perhaps because of operating system requirements. In this complex space, the system must determine the optimal manner in which to run ("gang together") augmenters.

Approach. The chief tool available to the optimization engine is a Foreman process that spawns and monitors a sequence of augmenters and passes data through the sequence, incurring only a single read/write operation for the entire chain (Figure 5).

[Figure 5]

We put a number of augmenters together in this pipeline until their memory and CPU requirements match the disk I/O and call the resulting set a "phase" of augmentation. The data are processed through multiple phases, depending on the kind of data. The Foreman process allows conditional branching, so for example, Arabic tokenization is only run on the entity if it is in Arabic.
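
In outline, a phase might behave like the sketch below: one pass over the entities drives the entire chain of augmenters, with a conditional branch so that the Arabic-only step runs only where it applies. The lambda-based augmenters are toy stand-ins, not the real components.

#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using Entity = std::multimap<std::string, std::string>;
using Augmenter = std::function<void(Entity&)>;

int main() {
    // One "phase": a chain of augmenters sized to match the disk I/O.
    std::vector<Augmenter> phase = {
        [](Entity& e) { e.insert({"DetaggedContent", "..."}); },
        [](Entity& e) {
            // Conditional branch: Arabic tokenization runs only on
            // entities whose Language key says Arabic.
            auto lang = e.find("Language");
            if (lang != e.end() && lang->second == "Arabic")
                e.insert({"Tokens", "(arabic tokens)"});
        },
    };

    std::vector<Entity> bundle = {
        {{"Language", "English"}},
        {{"Language", "Arabic"}},
    };
    for (Entity& entity : bundle) {          // one read per entity ...
        for (const Augmenter& aug : phase)   // ... through the whole chain
            aug(entity);
    }                                        // ... and one write back
    std::cout << "processed " << bundle.size() << " entities in one pass\n";
    return 0;
}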

The Foreman process provides "sandboxing" and monitoring by keeping limits on process memory usage and processing times and restarting processes that appear to be hung or misbehaving. Errors are logged, together with recent input data, to facilitate reproducing the error in a debugging environment. This process gives the system a high degree of robustness in the presence of possibly unreliable augmenters.

Grouping augmenters into phases, given the requirements above, is an ongoing research problem; our current groupings are determined manually.

Hardware. As noted above, augmentation occurs whenever possible on the same hardware as the storage. Sometimes "off-node" hardware will be used for particularly computationally intensive augmentation, or for augmentation that cannot be run locally for other reasons (permissions, code ownership, legacy operating system requirements, and so forth).

Software. There is no single approach to writing an augmenter, nor any single language in which the augmenter must be written. A number of libraries seek to simplify the task by allowing the augmenter author to write a function that takes an entity and returns an entity with the augmentations added. Augmenters written in primary supported languages (currently C++, Java**, and Perl) need only provide a processOnePage() method, and can then be automatically run across the distributed cluster, monitored, and restarted. Pointers to entities that cause crashes can be passed to the author.
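
A toy augmenter under the processOnePage() contract might look like the following C++ sketch; the Augmenter base class, the Entity alias, and the one-name CEO list are assumptions made for illustration, since the paper does not publish the actual interfaces.

#include <iostream>
#include <map>
#include <string>

using Entity = std::multimap<std::string, std::string>;

// Illustrative base class for the processOnePage() contract; the real
// WebFountain interfaces are not given in this paper.
class Augmenter {
public:
    virtual ~Augmenter() = default;
    // Take one entity, return it with new key/value pairs added.
    virtual Entity processOnePage(Entity e) = 0;
};

// Toy expert: annotates people who appear on a (pretend) CEO list.
class CeoTagger : public Augmenter {
public:
    Entity processOnePage(Entity e) override {
        auto range = e.equal_range("Person");
        for (auto it = range.first; it != range.second; ++it)
            if (it->second == "John Smith")       // stand-in CEO list
                e.insert({"CEO", it->second});
        return e;
    }
};

int main() {
    Entity page{{"Person", "John Smith"}, {"Person", "Jane Doe"}};
    CeoTagger tagger;
    Entity out = tagger.processOnePage(page);
    std::cout << "CEO annotations: " << out.count("CEO") << "\n";
    return 0;
}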

The Foreman process itself represents a part of the augmenter software stack. It allows augmenters to connect to a stream of new entities coming into the system, run over the whole storage, or run over the result of a query to storage without needing any changes to the augmenters themselves.

Performance. Augmenters running over the whole storage can see around 300 entities per second per node. Augmenters that are processing the results of a query can see around 100 entities per second per node. As noted earlier, there are 256 nodes on the system for a total rate of 76800 entities per second or 25600 for query processing.

Future directions. Most of the future work on augmentation will be to enhance the ease of authoring augmenters and miners, as well as to enhance the sandboxing and debugging tools to identify problems and help with their resolution.

Index

The WebFountain indexer is used to index not only text tokens from processed entities (including, but not limited to, Web pages), but also conceptual tokens generated by augmenters. The WebFountain indexer supports a number of indices, each supporting one or more different query types; Boolean, range, regular expression, and spherical queries are typical. More complex queries include graph distance (e.g., the number of links separating a person from the actor Kevin Bacon, as in the "six degrees" game), spatial queries (e.g., pages within San Jose, California), and relationship queries (e.g., people who work directly for John Smith, CEO of XYZ Company).

Challenge. Indexing in this environment presents five challenges. Indices must:

  • Build quickly
  • Build incrementally
  • Respond to queries promptly
  • Use space efficiently
  • Deal with result sets that may be many times larger than machine memory

All of these requirements must be addressed in an environment with trillions of total indexed tokens and billions of new or changed tokens every week.

Approach. We will address only the main indexer for the rest of this section. Because the main index is a fully positional index, the location of every token offset within every entity (document) is recorded, along with possible additional attributes that can be attached by augmenters to each token occurrence. The indexing approach is scalable and not limited by main memory because we adopt a sort-merge approach in which sorted runs are written to disk in one phase and final files are generated in a merge phase. To allow larger-than-memory result sets, most analytical queries return results in UEID order.29 This allows most joins in the system to be performed as merges, without buffering of one of the result sets—a great convenience when result sets may represent hundreds of billions of entities and take days to consume.
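
The UEID-order merge join described here is the classic intersection of two sorted lists; a minimal C++ sketch with made-up posting lists follows.

#include <cstdint>
#include <iostream>
#include <vector>

int main() {
    // Two posting lists, each already sorted by UEID (invented values);
    // think "pages mentioning a European location" and "pages
    // mentioning a new CEO".
    std::vector<uint64_t> a = {3, 8, 15, 42, 99};
    std::vector<uint64_t> b = {8, 15, 16, 99, 120};

    // One synchronized pass computes the intersection; neither result
    // set ever needs to be buffered in full.
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] == b[j]) {
            std::cout << "match at UEID " << a[i] << "\n";
            ++i; ++j;
        } else if (a[i] < b[j]) {
            ++i;
        } else {
            ++j;
        }
    }
    return 0;
}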

The indexer supports the WebFountain Query Language (WFQL), the language that allows processes within the system to specify declarative combinations of result sets from different parts of the system (see Figure 6). Fragments of WFQL can be pushed down into the indices themselves; for example, Boolean subtrees can be handed in their entirety to the Boolean index, limiting the amount that needs to be sent over the network. However, when results must be aggregated over multiple indices, the WebFountain Joiner is responsible for handling the resulting joins. See the section "Querying" later for more details.

[Figure 6]

Hardware. Again, because the hardware is the main cluster of nodes, indices are built where the data are stored and augmented, for both performance and convenience reasons. This siting is relaxed for some specialized indices, which can run on stand-alone IBM xSeries Model x350 machines (using 700 MHz four-way Intel Xeon processors with 4 GB of RAM) when appropriate. Currently, we migrate indexing away from the data for indices in development or when we index a relatively small set of the entities.

Software. The main indexer is implemented as a distributed, multithreaded C++ token stream indexer30 run on each of the cluster nodes. It employs an integer-compressed-posting-list format to minimize not only storage requirements, but also I/O latency on complex joins that are passed down to it.

Future directions. The more WFQL that can be pushed down to the index the better. Doing this intelligently will require better costing estimates and models for setup and transport times on various queries. This improvement in turn leads to identification of frequent queries (for caching), as well as more involved query optimization.

Miners

We begin with a description of the distinction between entity-level operations (augmentation) and corpus-level (cross-entity) operations (mining). Augmenters have a specific data model: they process each entity in isolation without requiring information from neighboring entities. As described previously, they are easily parallelizable and can provide significant programming support. Tokenization, geographic context discovery, name extraction, and full machine translation are all examples of tasks that execute one page at a time. Miners, in contrast, perform tasks such as aggregate statistics, trending, relationship extraction, and clustering. They must maintain state across multiple entities in order to do their job. Typically, such miners would begin by running an augmenter over the data, possibly just to dump certain extracted information. However, the miner would then begin aggregate processing of all the dumped information. Finally, the miner can either upload data back to storage (to the processed entity, or to another entity such as a Web site, or a corporation), or the miner can present the results of its mining as a Web service without writing back to storage.

Challenge. The primary challenge in mining is scalability. Additionally, even sorting takes a significant amount of time (measured in hours or days), and quadratic time algorithms are essentially infeasible until the data have been dramatically distilled. Furthermore, in a multibillion entity corpus, even the problem of separating the relevant information from the noise may become a large data question.

Approach. With many cross-entity techniques, a multitier approach must be used to reduce data-scale to more manageable levels. A simple example is to query all entities that match some trivial selection (such as, "Must mention at least one Middle-Eastern country") and then look at this subset (which may be less than one percent of the whole Web) for further processing.

If several "sieving" approaches can be used, a data set several orders of magnitude smaller can be considered, and many more complicated techniques are then available.
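
Schematically, such sieving might look like the sketch below: a cheap first-tier predicate shrinks the corpus, and only the survivors face a quadratic second tier. The documents and the country test are toy stand-ins.

#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> corpus = {
        "Oil prices in Kuwait rose", "Football scores",
        "Egypt trade report", "Local weather", "Jordan signs agreement",
    };

    // Tier 1: a trivial selection ("mentions at least one
    // Middle-Eastern country") cuts the corpus down by orders of
    // magnitude before anything expensive runs.
    const std::vector<std::string> countries = {"Kuwait", "Egypt", "Jordan"};
    std::vector<std::string> subset;
    for (const std::string& doc : corpus)
        for (const std::string& c : countries)
            if (doc.find(c) != std::string::npos) {
                subset.push_back(doc);
                break;
            }

    // Tier 2: a quadratic pairwise comparison, infeasible on the whole
    // corpus, is now affordable on the sieved subset.
    for (std::size_t i = 0; i < subset.size(); ++i)
        for (std::size_t j = i + 1; j < subset.size(); ++j)
            std::cout << "compare: " << subset[i] << " / " << subset[j] << "\n";
    return 0;
}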

Hardware. These cross-entity approaches generally run on a separate rack of IBM xSeries Model x350 machines that allow more computation than the storage nodes themselves. Some miners use data warehousing to move data of interest to a DB2 database for further (often OLAP, or on-line analytical processing) style investigation. Additionally, some whole-Web-graph problems are run on an Itanium** system, which has a very large amount of memory installed. In short, these mining problems can be run on whatever platform is appropriate to the task at hand.

Software. Likewise, these approaches often require specialized hardware or software. Because of the diverse nature of mining operations, little generic support can be provided by the system. In general, the author of the miner must be aware of issues of persistence and distributed processing that can be hidden in abstract superclasses for augmenters. Additionally, the task of scheduling miners requires a more involved workflow specification because the miner usually is triggered after a particular suite of augmenters complete and dump useful information. Once the miner operation is completed, there may be a final upload of the resulting data back to storage.

For example, consider a link-based classifier such as Hyperclass.31 Such a classifier would dump feature information for each entity (using an augmenter), would interact with a distributed Web service providing connectivity information (see, for instance, the Connectivity Server32) to generate neighborhood information, and would then perform an iterative cycle with a data access pattern much like power iteration to compute final classes. Once the classes for each entity have been computed, yet another augmenter would run to upload the classes for each entity back to storage.

Performance. The cross-page mining rate is somewhere between 25 thousand and 70 thousand entities per second.

Future directions. The most challenging future problems for WebFountain lie in the mining space. The techniques of data mining, graph theory, pattern recognition, natural language processing, and so forth are all amenable for reapplication to the new domain of very large-scale text analytics. In most cases, the traditional algorithms require modification to maintain efficiency; this domain therefore represents a fruitful opportunity to apply existing techniques in new ways on a timely and valuable data set.

Querying

As introduced earlier, WebFountain supports a query language that generates augmented collections of entities by combining results from various services within the platform. A query consists of selection criteria and a specification of the particular keys that should decorate the resulting entities. For notational purposes we use an SQL derivative to capture most common queries that the system can process. A typical query might be as follows:

SELECT URL, UEID, Companies 
   FROM Web
   WHERE
      Person='John Smith'
   AND Location WITHIN
      '10 miles of San Jose, CA'
The results are then returned as an enumeration of XML fragments containing the requested data.

Challenge. Recall that queries must run against terabytes of data stored on hundreds or even thousands of nodes. The example above is easy because it can be sent in parallel to all the nodes. A more complex query, requiring a more complex data flow, would be "Give me all the pages from sites where at least one page on the site is in Arabic."

A second challenge arises from the possible size of the result sets. These sets may need to be shared between multiple clients and may need to deal with clients crashing and restarting where they left off. This robustness is easier, thanks to the loose coupling, but still requires a fair amount of complexity, especially in deciding when to drop queries or fragments of result sets as "abandoned."

Finally, as in structured data queries to relational databases, efficient computation of result sets relies heavily on the optimizer. In a loosely coupled system, the cost of moving entries that will later be trimmed from one machine to another can easily dominate query execution time, resulting in situations that are not standard territory for database optimizers. Further, the system allows the dynamic introduction of engines to perform intermediate stages of query processing, and these lightweight distributed engines must be capable of specifying enough information about their performance to allow the optimizer to generate an efficient distributed query plan.

Approach. Although SQL is expressive, there are times when it is not expressive enough; therefore, the common query format is WFQL, an XML query plan, which allows more complex interactions between services to be scripted. SQL-type queries are converted to this format by a front end before processing.

After the WFQL proposal is received, the tree is optimized. Optimization includes tree rewriting, rebalancing where appropriate, and determination of subtrees that can be "pushed down" in large amounts to the leaf services (such as indices). This new WFQL plan is then executed, and the results served up to the client as an enumeration of XML fragments.

Hardware. The query engine (called a joiner) runs on its own dedicated IBM xSeries Model x350 machine. Because these queries are independent, additional instances of the joiner can easily be added to the system until the cluster is saturated.

Software. The query engine performs a three-step process: taking in the query in a variety of formats and translating it to WFQL (the front end), optimizing it (the middle end), and executing the resulting plan (the back end). Currently the middle end is fairly trivial, but see below for some discussion of how this may change.

Performance. The current joiner has a latency of around 20 milliseconds for relatively simple queries. Queries that require a resorting of the results can take much, much longer, as n log n (where n is a billion) can run to a few days.

Future directions. As noted earlier, accurate query cost estimation, possibly requiring statistics gathered during query execution, is a key requirement for optimization. Two key future directions in this area are the following: First, integration of the DB2 DataJoiner product for query rewriting, which would allow us to support more ad hoc queries without worrying about bringing down the cluster; second, introduction of multiquery optimization—for example, should a number of queries need to be run every night, could they be combined in some way to limit the number of data accesses.

Web service gateway

The scale of hardware required for WebFountain makes it infeasible to build a separate instance for each customer. Instead, WebFountain is a service offering that performs data processing in a single centralized location (or a few such locations) and then delivers results to clients from these locations. Given the existing Web service approach used internally, it is natural to leverage that decision by providing result data to clients through a Web service model as well.

Challenge. There are three primary challenges in the design of the gateway. First, and most important, access must be as simple as possible to encourage developers to write programs that make use of the platform. Second, the gateway must provide access controls, monitoring, quality of service guarantees, and other user management tasks. Third, the gateway must provide performance sufficiently high to meet the needs of users.

Approach. We deliver these results through a SOAP33 Web service gateway. For each service to be exposed externally, a WSDL34 (Web Services Description Language) description is published. Connection to the gateway is by SSL2 MAC (Secure Socket Layer Version 2 mutually authenticated certificate). Clients register with the gateway and negotiate the commands they are allowed to execute, the load they are allowed to place on the system, and the families of parameters they are authorized to specify. The gateway monitors quality of service and bandwidth for each client, based on a logging subsystem that captures queries, response times, and meta-data that arrive with each completed internal query.

Commands exposed through the WSDL need not map directly to commands inside the cluster. For example, an external query for the current crawling bandwidth might result in an internal query to all 48 crawlers.

Hardware. The gateways are set up as a set of IBM xSeries Model x330 machines (using 1.13 GHz dual Intel Pentium** III Processors with 2 GB of RAM), dual network zoned, and behind firewalls. The number of gateway machines can be trivially scaled up to meet demand, dynamically if necessary.

Software. The gateways themselves are written in the C++ and Java languages. They perform the task of authenticating the queries, logging them, translating them to the requisite xtalk queries, dispatching them to the cluster, aggregating the results, rephrasing them as SOAP, and returning them.

Performance. The current performance point is tens of queries per gateway per second. This number obviously varies considerably depending on the size of the result set being returned.

Future directions. Future work for the gateway includes supporting higher degrees of granularity on querying (resulting in a more complex set of supported queries), better performance by running similar queries together, faster deployment of new functionality through dynamic plug-ins to the gateway, and integrated load balancing, sharing, and grouping of requests.

Cluster management

Although having a large number of machines available allows us to overcome a problem of several orders of magnitude of scale, it introduces several orders of magnitude of complexity in keeping the system running. Maintaining an overview of 500 machines can be taxing, particularly coupled with the requirement that the system be resilient to some number of failed nodes.35

Nonetheless, we need to identify problems, fix them automatically when possible, call in human support otherwise, and allow institution of workarounds while the problem is being dealt with. Additionally, the cluster management subsystem is responsible for the more mundane automation tasks that surround workflow management, automatically distributing processes across nodes of the cluster, monitoring for progress, restarting as necessary, and so forth.

Challenge. The complications of running such a system include the heterogeneous nature of both hardware environment and software deployment (a variety of code versions running all at the same time). Because of requirements on turnaround time and the vicissitudes of Web data, software failures in the augmenters and miners are inevitable. The system must not rely on an error-free execution. If an augmenter fails once, the system will log the error, restart the process, and go on. If it fails several times on the same entity, the author will be notified, and the entity will be skipped. Miners, in contrast, may or may not provide mechanisms for restart. Without such a mechanism, the system will merely retry the miner and request operator assistance based on the priority of the miner task.

In a complex system the root cause of a problem is often elusive. Cascade failures are quite common, and looking at crash logs can be a time-consuming task. Although rigorous testing and debugging is done before introducing new code, by definition no test is complete until it has run on the whole Web. We always find things that do not scale as we wish in production, or that are tripped up by truly odd pages. Additionally, resource contention problems are hard to model in test and often only appear in production.

Approach. The cluster management subsystem runs a special service on each machine known as a nanny. This process forks and runs all the services needed on a machine and monitors their performance, CPU and memory usage, ends them on request or when thresholds are exceeded, and reports on their status when queried. Any major changes that the nanny undertakes (e.g., install new code or start a new service) are authenticated and logged.

A central coordinator checks with all these nannies every few seconds and creates an aggregate view of the production cluster. In addition to services, each nanny also monitors overall system status, including disk status, CPU load, memory usage, swapping, and network traffic, as appropriate. This information is logged centrally to a "cluster flight recorder," which can be replayed to find unusual performance bottlenecks.
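
On POSIX, the restart behavior of a nanny might be sketched as follows; the monitored "service" is a stand-in command, and the real nanny additionally enforces the memory and CPU thresholds and central reporting described above.

#include <iostream>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    // Sketch of a nanny's restart loop: fork a service, wait on it,
    // and restart (with logging) if it dies abnormally. "sleep 1"
    // stands in for a real WebFountain service.
    for (int attempt = 1; attempt <= 3; ++attempt) {
        pid_t pid = fork();
        if (pid == 0) {                                    // child
            execlp("sleep", "sleep", "1", (char*)nullptr);
            _exit(127);                                    // exec failed
        }
        int status = 0;
        waitpid(pid, &status, 0);                          // monitor
        if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
            std::cout << "service exited cleanly\n";
            break;
        }
        std::cout << "service failed; restart attempt " << attempt << "\n";
    }
    return 0;
}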

Simply managing hundreds of machines is a conceptual challenge as well. For the operators and technicians, nodes are grouped into sets of eight, which represent a "rack" for administrative purposes. Visual displays use this grouping to allow rapid drill down to machines and problems.

Hardware. A single IBM xSeries Model x335 machine serves as the central coordinator; one instance of the nanny runs on every main cluster node and every ingester.

Software. Surprisingly little work has been done on managing large, loosely coupled clusters (although grid research13 is beginning to look promising). As a result, the nanny or coordinator is a custom implementation in a mix of C++ and Java languages. A number of commercial Web service monitors can be used to monitor the SOAP gateway, but these monitors still need to be examined by the central coordinator to provide a uniform view.

Performance. The current cluster supports a half dozen "clients" at a time and requires 7.5 people to run. It is unclear what the scaling relationship is, but the goal is to be highly sublinear.

Future directions. Our primary goals for the future in cluster management are improved problem determination and enhanced speed of resolution, including autonomic swapping of hot spares to facilitate automatic failover. The goal is to provide better utilization of the hardware with a smaller operations staff.

Conclusion

The WebFountain system currently runs and supports both research and a set of customers who are involved in "live" use of applications hosted in the production environment. As such, the architecture has completed the first phase of its development: going live with a limited set of customers.36,37 In this paper, we have disclosed at a high level the architectural decisions we have made to complete this first phase of execution, with an eye to the rapid expected growth in both data and load (measured as number of applications, number of partners, and number of users).

We adopted the Web service model because we needed the traditional benefits of such an architecture: modularity, extensibility, loose coupling, and heterogeneity. So far, we have delivered multiple real-time services backed by 100 terabytes of data with debugging cycles measured in days and extensibility that exceeded our expectations. Although our requirements will only become more severe, we do not anticipate needing to revisit this basic architecture in order to meet them.

*Trademark or registered trademark of International Business Machines Corporation.
**Trademark or registered trademark of Linus Torvalds, Intel Corporation, or Sun Microsystems, Inc.

Cited references and notes

Accepted for publication August 12, 2003; Internet publication January 14, 2004.

WebFountain: a project on natural-language search research

Page 267 of John Battelle's The Search mentions an IBM lab project called WebFountain, which the IEEE (Institute of Electrical and Electronics Engineers) has described as an analytic search engine; it is part of IBM's enterprise search research program. The project is led by chief architect Daniel Gruhl and chief scientist Andrew Tomkins. Research on natural-language search engines may well need help from the social sciences, so this could be a research field I try to enter in the future. Sociology has a rich body of work on discourse analysis, which would let me combine my interest in sociology with my enthusiasm for Internet technology, and that narrows down my range of goals.

It turns out the serializations on Sina's book channel are incomplete

Today I read Murong Xuecun's 《伊甸樱桃》 as serialized on Sina's book channel on my phone and finished it very quickly, which struck me as strange: I had found the book in the library two days ago, and it is fairly thick, so at my reading speed I could not possibly have finished it that fast. I searched around and discovered that the Sina serialization is only an excerpt, not the full text. Evidently, relying on Sina's book channel for reading is not enough; I still have to buy the books. I have just compared The Search, which I finished yesterday, with the version serialized on Sina and found that several superb chapters were omitted, heh. From now on I must take notes when I read, to force myself to think; otherwise I finish a book with little to show for it.

Some questions humanity needs to reflect on today

By the late 20th century, the "antinomy" of civilization advancing while the crisis of survival deepens had become ever more apparent in human society. Ecological destruction, environmental pollution, resource shortages, global warming, and similar problems have arrived on the heels of progress and rapid population growth, and they gravely threaten humanity's safe survival and sustainable development. Standing at this crossroads, humanity needs to examine itself, and the Chinese nation needs to examine itself!

▲ Does humanity need to re-examine itself?
Today our "mother" Earth is loudly rebuking her human children as the most dangerous animals on the planet, its greatest gluttons, its greatest destroyers, and its greatest polluters. Is she right? If the rebuke does not wrong us, how should we reform ourselves and strive to be green "children" whose hearts are clean and whose words match their deeds, so as to wash away these dishonorable labels and do our duty to support and protect mother Earth?
In the ape house of the Bronx Zoo in New York, the largest zoo in the United States, there is a curious wooden enclosure. The side facing visitors holds a large mirror, above which is written in large letters: "The most dangerous animal in the world." Only up close do visitors realize that the mirror reflects their own image! This clever exhibit uses a mirror to tell people that they themselves are the most dangerous animal on Earth.

▲ What is humanity's fatal weakness?
Is humanity's fatal weakness not insatiable greed? The sayings "people die for wealth as birds die for food" and "heaven destroys those who do not look out for themselves" seem to portray the selfish, greedy, shortsighted side of humanity to the full. Yet humanity has never adequately recognized this fatal weakness. If we do not restrain ourselves and instead let insatiable greed swell unchecked, it will become the final grave in which humanity buries itself. Will the outcome really be so?

▲ Does Marx's warning about curbing greed still apply today?
Marx warned that a profit of 100 percent will make one bold; a profit of 200 percent will make one defy human law; a profit of 300 percent will make one risk the gallows! Does this still apply today?

▲ How many people can the Earth support? Have we already reached the point where overpopulation itself is the disaster?
Scholars differ enormously: some say the Earth can support 50 billion, some say 10 billion, some say 8 billion, and some say it would be best to bring the world's population back down to 5 billion because we have already reached the point of too many people and crowding itself has become a calamity. Is that so?

▲ Do we need to reflect seriously on the historical lesson of the criticism and persecution of the "three Ms"?
Unlimited, blind human reproduction will in the end bring a ruinous disaster down on humanity's own head. Before the 20th century, only a handful of far-sighted people recognized the paramount problem of population control and raised it. Under the historical conditions of the time, the idea not only failed to win acceptance from the majority, it was denounced by some as "heresy" and was resisted, criticized, and even persecuted; witness the criticism and persecution of Malthus, author of the most influential theory of population; of Margaret Sanger, the American woman scientist who first advocated birth control; and of Ma Yinchu, China's renowned demographer. The wrongful criticism and persecution of these "three Ms" contributed to the malignant expansion of the world's population in the 20th century, during which population growth reached an all-time peak in both rate and quantity. At the start of the 20th century the world held only 1.673 billion people; by 2004 it held 6.4 billion, a net increase of more than 4 billion in a single century and a pace of growth more than double that of the previous million-plus years.
Looking back over humanity's roughly ten thousand years of development, one can see that postponing worldwide birth control by a century, with the runaway population growth that followed, can only be counted a great blunder in human history. Is this not a great lesson humanity should take earnestly to heart?

▲ Where does unlimited human reproduction ultimately lead?
According to the United Nations, if fertility stays at its present level, with the average woman bearing more than two children, the world's population will reach 44 billion by 2100, 244 billion by 2150, and 1.34 trillion by 2300. The outcome may be what Professor Fumiko Yonezawa of Japan's Keio University described: "Desire will drown the Earth."
If human reproduction is not effectively controlled, the Earth will sooner or later become an "ocean" of people pressed shoulder to shoulder, and together with humanity's insatiable greed, this "ocean of people" and "ocean of desire" will submerge the planet!
Is this warning right?

▲ What lesson does the great monkey battle in Gabon's primeval forest hold for us?
In May 1997, a great battle broke out among the monkeys of Gabon's primeval forest. The monkey kings screamed, the males leapt from tree to tree, and then the monkeys fell upon their own kind, tearing and biting until the forest ran with blood and corpses covered the ground. When it was over, only some 30,000 of the forest's more than 50,000 monkeys were left. Why did it happen? Because humans had destroyed the natural ecological balance: loggers kept razing the primeval forest, the monkeys' living space shrank and shrank, and the troops went to war over the limited space that remained. If humanity does not restrain its own breeding and blindly multiplies on, then by the time crowding on Earth passes a thousand people per hectare, a comparable "great battle" among human groups will probably be not only unavoidable but also no mere hand-to-hand slaughter: it would be a global "high-grade" bloodbath fought with every advanced weapon available, nuclear, biological, and ecological weapons included, whose end might be Earth and humanity perishing together, the Earth perhaps becoming a second "Mars." Is this a fair inference?

▲ Now that China has the one-child policy, how should the excessive burden on children of supporting the old be solved?
Under the one-child policy, could the heavy burden of supporting the elderly be split three ways, among the children, society, and the state? Otherwise the one-child policy will be hard to sustain, and China's total population will be hard to bring back down.

▲ Will the "only child" generation send China into decline?
Some worry that most only children, having grown up spoiled and pampered, are willful and shameless, have never tasted life's hardships, think only of enjoyment and never of contribution, and are weak in independence and self-reliance, and that this generation risks sending China into decline and turning the nation's revival into a bubble. Is this worry justified?
Whether such a danger materializes depends on how families, schools, society, and the state educate this generation; people can be educated and transformed. Is that so?

▲ Where, after all, is the main base for humanity's sustainable development?
Is it the Earth, or some star beyond it?
So far we have found no second Earth in the universe fit for human settlement, and even if one were found, could it relieve the crisis of an overcrowded Earth?
Humanity's dream of migration will be unattainable for a long time to come, and even if migration were possible it would not relieve the overcrowding of this planet. The Earth remains the main base for humanity's sustainable development. Is this view right?

▲ Does humanity's present strategy of pursuing the far while neglecting the near require some adjustment?
If the main base on which humanity depends for survival is still the Earth, then should humanity not somewhat adjust its present far-over-near development strategy, invest more in protecting the Earth, and raise mother Earth's capacity to nurture us in every dimension? Should we do so?

▲ Could reclaiming the Earth's deserts be given the same priority as developing the Moon?
The world's desertified area today is roughly equal to the total surface area of the Moon (36.5 million square kilometers). Using high technology to transform and use the deserts well, and letting people return to them, would be far more economical, far simpler, and far less difficult than the long trek to transform and use the Moon and Mars; it is workable, and there are plenty of successful models to draw on. In fine-tuning the far-over-near strategy, the first step should be to raise the reclamation of the Earth's deserts to the same level of importance as the exploration and development of the Moon. A well-reclaimed desert might be more comfortable than a settlement on the Moon. Is this view right?

▲ Where is the root of the difficulty in achieving sustainable development?
The idea of sustainable development was formally put forward more than a decade ago, and people's awareness of it is slowly growing, yet putting it into practice is very hard. Does the root lie in the words selfishness, greed, and shortsightedness?
Humanity's ingrained selfishness, greed, and shortsightedness show themselves not only in individuals but in groups. If we fail to recognize this fatal weakness of ours and let it swell unchecked, then population control, environmental protection, a "new global partnership," saving the Earth, the abolition of war, peaceful coexistence, and sustained survival and development will all be beyond reach, and in the end we will surely push ourselves into the abyss of destruction. Is that so?

▲ Does development have a dual nature?
Development, like all things, has two sides. Some say that only sustainable, scientific, civilized, rational, peaceful, restrained development, development that does not damage the foundations of human survival, makes sense, and that development which does is senseless. If so, how should we regulate our development behavior, thoroughly change the traditional view of development, learn to distinguish these two kinds of development, and keep to the right road?
Ever since humanity appeared on Earth it has treated its own development as a heaven-sent duty and made the maximal satisfaction of people's ever-growing material and cultural needs its sacred goal, as if development were a "bottomless pit." Today, with humanity facing multiple crises of survival, does this notion need to change?
Should the "bottomless pit" of development have an inviolable floor? If it has none, is the prospect ever-greater civilization and progress, or a final fall into a bottomless abyss of destruction?
Today, when we put people first and are building a thrifty society, could "satisfy to the maximum" be adjusted to "satisfy reasonably and with restraint"?
On the great march of building itself up, humanity inevitably enlarges its territory, encroaches on and remakes nature, and narrows the living space of plants and animals. How, then, is humanity to live in harmony with nature? Where does the boundary lie, and how is the balance to be kept?

▲ Does Marx's warning about development still apply now?
Marx warned that if civilization develops spontaneously rather than consciously, what it leaves behind is a desert.
Does this warning still apply today?

▲ Does Engels's warning about development apply today?
Engels likewise warned: "...Let us not, however, flatter ourselves overmuch on account of our human victories over nature. For each such victory nature takes its revenge on us. Each victory, it is true, in the first place brings about the results we expected, but in the second and third places it has quite different, unforeseen effects which only too often cancel the first."
Because of human greed and shortsightedness, many of the first-step "triumphs" of our development have indeed been canceled by unforeseen second- and third-step consequences. Is that not so?

▲ Natural disasters today grow ever more frequent and severe; does humanity bear responsibility of its own?
When improper human activity upsets the natural ecological balance, it can induce certain natural disasters to strike more often, and once disaster strikes, it magnifies the losses the disaster causes. Global warming, for example, has aggravated the frequency and severity of all kinds of natural disasters. In the great earthquake and tsunami of December 26, 2004 in Indonesia, the man-made destruction of the coral reefs and mangroves that form a natural barrier against tsunami waves "aided the scourge" and magnified the losses: more than 160,000 dead, two million newly poor, and property losses beyond counting. A train running along the Sri Lankan coast was hurled from its tracks in an instant by the onrushing sea; all of its roughly 1,700 passengers and crew perished. Hence some called 2004 the year of nature's severest "counterattack" and revenge against humanity.
Natural disasters today are ever more frequent and severe, and humanity bears an inescapable responsibility for this. Is that so?

▲ Do we need to vigorously promote the scientific "three outlooks"?
Development, wealth, and consumption are like conjoined triplets, linked inseparably.
The "scientific outlook on development" put forward by the Party Central Committee is timely and correct, and will give a great push toward setting China soundly on the track of sustainable development. But the push might be stronger still if a scientific outlook on development, a scientific outlook on wealth, and a scientific outlook on consumption were advanced together, three in one, since developing, getting rich, and consuming form a single interconnected whole. We should therefore, at the proper time, vigorously promote the scientific "three outlooks" throughout society and, with them at the core, gradually build a complete policy system for sustainable development. May we?

▲ Under the overall demand of sustainable development, how should personal happiness and wealth be pursued, and how should society's moral outlook be transformed?
The greatest threat to the sustainable development of humanity and of the Chinese nation is the shortage and exhaustion of resources. In such circumstances, pursuing personal happiness and wealth by fair means or foul may work against sustainability; our outlooks on happiness, wealth, and morality should therefore adjust and transform as the times change. Is that so?
The developed countries of the West have begun to reflect on their lifestyle of single-minded enjoyment: rising living standards have not brought a rising quality of life, and signs of human decline have appeared instead. This sounds an alarm for developing countries pondering how to develop.

▲ How much, in the end, is "enough"?! What would it take to be "satisfied"?!
The contradiction between the finiteness of the Earth's resources and the boundlessness of human reproduction and human greed is the greatest contradiction facing sustainable development. Yet humanity's appetite grows and grows: having eaten five horses, it wants six sheep besides! If nothing is ever "enough" and no one is ever "satisfied," the Earth will shed its last drop of "blood" for humanity, and sustainable development will end as a bubble! Is that so?
Marx warned that if human needs linger at the level of material enjoyment, vicious consumption and vicious exploitation will follow, destroying the environment and destroying humanity itself.
Mahatma Gandhi, the father of India, also said: "The Earth provides enough to satisfy every man's need, but not every man's greed."
Every citizen of the Earth should reflect on this and give an answer.

▲ To build a thrifty society, does the Chinese nation need a revolution in its "habits of living"?
The Chinese nation has many traditional virtues worth carrying forward, but in its habits of dress, food, housing, and travel it also has no few outworn customs, most conspicuous at the table, which work against building a thrifty society and need to be rooted out.
Humanity has already drifted, hardly noticing, into the whirlpool of pleasure-seeking high consumption stirred up by modern industrial society: fine clothes, rich food, grand cars, luxurious houses, endlessly and immoderately squandering natural resources and social wealth. Need we blindly envy and chase after this?
China has not yet reached the level of the developed countries, yet its extravagant consumption is startling; in some respects it goes beyond that of the developed countries, unlike certain Western European nations that, though rich, do not squander their limited resources for survival.
Sweden is one of the world's developed countries, with per-capita national income above US$25,000 in 2001, yet its people never squander their resources or social wealth; thrift prevails throughout society, materials are recycled superbly, and it has been called "the country where everything is used."

▲ Will humanity end up burying its own future in mutual chasing and contending, attacking and defending, suspecting and doubting?
Some say, full of foreboding, that humanity will finally throw away its own future in this endless chase and struggle, attack and defense, suspicion and doubt. Is that so?

▲ Will China's clean drinking water face exhaustion within 30 years?
Niu Maosheng, while serving as Minister of Water Resources, once warned that unless swift action is taken to save water and control pollution, China's clean drinking water will run out within 30 years. Is that so?
Water is the lifeblood of the Chinese nation. The twin crises of water shortage and water pollution are like the two ends of a garrote that seems to be tightening around the nation's neck. How are we to loosen the cord closing so tightly upon us?

▲ What, in fact, is the main cause of today's deserts?
Some say deserts form mainly by nature, yet the data say otherwise: "Of the 215 desert origins identified worldwide, 87 percent arose from irrational human behavior such as the destruction of vegetation; of the desertified land added in China since 1949, 95 percent was caused by human activity." These figures bear out the warnings that "where humanity walks across the land, its footprints leave desert" and that "tillage proceeds spontaneously, and wasteland follows hard behind." Are these two warnings right?

▲ Some warn that if the desert's advance is not effectively checked, the capital Beijing risks being buried by sand in 300 years. Is this demagogic scaremongering?
The desert's leading edge is now only some 70 kilometers from Tiananmen. In 2000 this author joined the tree-planting campaign in the battle of people against sand to defend the capital, saw with his own eyes the menace of the advancing desert, and came away deeply worried. Beijing, the ancient capital the Chinese nation has labored over for a thousand years, may in 300 years face the danger of burial by sand. Can it really come to that?

▲ In the history of human civilization, what great lessons deserve to be remembered?
Why do the civilizations of ancient Egypt, of Babylon, and of China's Loulan sleep beneath the desert? Why has humanity again and again restaged the "black blizzard" disasters that follow reckless plowing? Why did the Chinese nation's "mother river" turn into the most destructive river in today's world? Why did the "Great Leap Forward" we launched in the 1950s not only cost more than ten million lives but end in a great retreat of the national economy? ... Should these lessons be learned?

▲ Is the Yangtze following in the Yellow River's footsteps?
In 1999 the author of The Earth's Warning cautioned that "the Yangtze is following in the Yellow River's footsteps." In 2004, after field investigation, the National Committee of the CPPCC and the China Development Institute issued similar warnings. Will the Yangtze really become a second "Yellow River"? And if it does, what disasters will it bring upon the Chinese nation?

▲ If science and technology are the primary productive force, is their other edge the primary destructive force?
Science and technology, like everything else, are a double-edged sword. How should humanity develop them so as to seize the benefit and avoid the harm?

▲ Who is the real culprit behind the Earth's warming climate?
Carbon dioxide and the other warming gases are lodging an appeal before humanity's court of facts: the real culprit behind the warming climate is not they but humanity itself, and they petition the court for a just verdict. Are they right?
Observation shows that with global warming, glaciers great and small all over the Earth are melting and retreating to varying degrees, and new disasters of melting are showing themselves ever more plainly around the world. Their intensification is like a club swung straight at humanity's head. Is the danger truly so great?

▲ Is the Earth humanity's private kingdom?
The Earth is a biological "united nations" of humans, animals, plants, and microorganisms; it is not humanity's private kingdom, and human survival and development cannot do without the other living things. When, in seeking the materials of its own survival and opening new bases and environments for living, humanity "kills the hen for the eggs" or blindly exterminates other creatures, it acts not cleverly but stupidly. The extinction of wild plants and animals is no "great victory" for humanity but a danger signal of its own march toward the abyss. If they are not protected, the day the other living things are all gone may be the day humanity itself perishes! Is this warning alarmist?

▲ Is humanity manufacturing the sixth great extinction?
Some in the academic world say that in the Earth's history life has already passed through five great extinctions from natural causes, and that humanity is now manufacturing a sixth by its own hand. Is that so?

▲ Is the creeping depletion of oxygen a more fearful threat than global warming?
The oxygen in the Earth's atmosphere is not an inexhaustible resource either. Since humanity appeared, the share of oxygen in the atmosphere has fallen from about 30 percent to 21 percent, a loss of nearly one third. By the measurements of international environmental scientists, in the 120 years from 1860 to 1980 the atmosphere lost a further 500 billion tons of oxygen, roughly five ten-thousandths of the total. The figure is not large, but what is unsettling is that as population and energy consumption grow and forest area shrinks, the oxygen content of the atmosphere will fall further. Research by Freeman Dyson, professor of natural science at the Institute for Advanced Study in Princeton, indicates that burning one ton of fossil fuel consumes 2.67 tons of oxygen. A Russian biologist estimates that if fossil-fuel use keeps growing 5 percent a year, atmospheric oxygen will fall another 25 to 30 percent within the next 160 years. Experts calculate that a hectare of evergreen forest generates about 273.75 tons of oxygen a year, while the world is losing forest at a rate of 15 to 20 million hectares a year, which means the forests' power to give off oxygen is weakening.
Ocean pollution, for its part, both raises the consumption of oxygen (the sea's self-cleaning of one liter of oil pollution consumes 400,000 liters of dissolved oxygen) and damages the capacity of marine plant cells and algae to make oxygen. After analyzing great numbers of satellite photographs, American scientists found that from the early 1980s to the mid-1990s the growth rate of North Atlantic phytoplankton fell by 7 percent, and in the North Pacific and the Antarctic seas by 9 and 10 percent respectively. Half of the oxygen we breathe is made by communities of marine phytoplankton.
More fearful still, the vast holes torn in the damaged ozone layer are letting great quantities of atmospheric oxygen leak away into outer space. In 2003, Dr. Ian van Wyk, the noted South African geophysicist and chief researcher of a survey project on the upper stratosphere, said that the ozone layer is like a natural barrier: it blocks the harmful radiation in the sun's ultraviolet rays, shielding the Earth's living things from injury, and it also holds the oxygen life requires within the Earth's atmosphere. Without it, "we would be dead." While the huge ozone holes let harmful ultraviolet radiation pour into the atmosphere, they also let the atmosphere's oxygen leak out through them. Satellite photographs of the ozone holes analyzed by the expert group he heads show clearly that billions of oxygen molecules escape through the holes into space every minute. Unless the ozone holes are mended, the oxygen of the Earth's atmosphere will keep leaking away.
It seems that if humanity does not examine and correct its improper development behavior, it will in the end eat its own bitter fruit. The crisis of oxygen "shortage" is more fearful than the threat of global warming; the difference is that the threat of warming is more immediate and everyone can sense it, whereas the threat of oxygen "shortage" seems so far to go unfelt. Is this view right?

▲ Does humanity need to return to the countryside?
Judging from present trends in urban growth, the human world is turning into an urban world at unprecedented speed: the population living in and around cities will exceed the rural population for the first time early in the 21st century, and by 2015 two thirds of the world's people are expected to live in cities. But while the modern city gleams as a "pearl" of human civilization and progress, it has also sown many seeds of destruction. Human history offers precedents: many famous ancient cities long ago went to sleep beneath desert and sea, and some of today's cities are in danger of following them. However modern people may yearn for the great city, ecological and social crises assail it wave upon wave: environmental pollution, acute shortage of fresh water, the garbage nuisance, the "heat island" and "turbid island" effects, land subsidence and urban collapse, energy shortfalls, traffic congestion, crowded housing, spreading drug abuse and trafficking, infectious disease, weak resilience, widening disparities, and growing poverty and homelessness all press upon the modern city until it can hardly breathe, and the more urbanized we become, the heavier the pressure. Experts predict that, with fresh water desperately scarce and surface and ground water both polluted, some big cities of the industrialized countries may in the next century become unfit for human habitation!
Even now, as rural people pour into the cities, the rich of Western cities have begun to flow back to the countryside, a trend expected to grow still more marked in the 21st century once urbanization peaks. As the information highway is built and extended and the gap between town and country slowly narrows, conditions will greatly favor a measured dispersal of population, and the broad countryside with its blue sky, clear water, and green trees will be the paradise to which city dwellers "return." Will it be so?
Should China's present thinking, strategy, and practice of urbanization be somewhat adjusted, with the emphasis shifted to building up the many small and medium towns of the countryside, rather than pouring ever more investment into piling up great cities?

▲ Might the 21st century produce a second "Hitler" who forces a still more terrible Third World War on the peace-loving peoples of the world?
Those who raise this question say that rulers who scheme to dominate the world and dream of leading it have always forced war upon the people. Such events have occurred more than once in human history, from Germany's Hitler to France's Napoleon... Is this worry justified?

▲ How should humanity set the great course of its future survival and development?
Is curbing greed and restraining desire, relying on high technology to develop the circular economy, and building a restrained and civilized society the best choice for humanity's continued survival and development on Earth?
Are sustainable, scientific, civilized, rational, peaceful, and restrained development the golden road humanity should walk in seeking survival and development on this planet?
Which road to take demands humanity's deep and urgent reflection, and the Chinese nation's deep and urgent reflection! This humble citizen appeals to all humanity and to the Chinese nation to stand a little higher, look a little farther, and reflect on and answer, each in their own heart, the questions above that bear on the destiny of humanity and of the Chinese nation.

U.S. graduate school rankings by discipline

From the U.S. News rankings of graduate programs:

Business: Top 50 Business Schools (80 business schools ranked)
Law: Top Law Schools (all 190 law schools ranked)
Medicine: Top 50 Medical Schools for Research and Top 50 for Primary Care (69 research and 68 primary-care schools ranked)
Engineering: Top 50 Engineering Schools (95 schools ranked)
Education: Top 50 Education Programs (76 programs ranked)
The Sciences: Top Programs in the Sciences
Library and Information Studies: Top Library & Information Studies Programs
Social Sciences & Humanities: Top Programs in Social Sciences & Humanities
Health: Top Health Programs
Public Affairs: Top Public Affairs Programs
Fine Arts: Top Fine Arts Programs