The development of e-Science (cyberScience, Grid, etc.) is starting to become a reality with formalised data resources, services on demand, domain-specific search engines, digital repositories, etc. Increasingly, STM information will be contained in compound XML documents representing scientific communication (articles, theses, repository entries, etc.). In physical sciences such as chemistry, materials science, engineering, physics and earth sciences, these "datuments" [1] normally contain hypertext, graphics, tables, graphs and numerical data, mathematical objects and relationships. In addition, they may contain domain-specific content such as chemical formulae and reactions, thermodynamic and mechanical properties, and electric, magnetic and optical properties. Among the domain-specific languages, CML (Chemical Markup Language) is the oldest and broadest, and is now being actively used for publishing by the Royal Society of Chemistry (Project Prospect [2]), which gives an idea of what chemistry in datuments can look like. CML has had to develop the domain-specific objects (molecules, atoms, bonds, spectra, crystallography, etc.) and the relationships between them. However, due to the text-based nature of early XML, it has also had to design and implement domain-independent infrastructure which can support much of physical science. Originally called STMML [3], it supports data types (float, integer, complex, etc.), data structures (arrays, lists, matrices, etc.), geometrical concepts (points, planes, lines, etc.) and scientific units of measurement. In addition, CML bases much of its flexibility on user-created dictionaries (ontologies) which are hyperlinked from objects in the datuments. It is now clear that the domain-independent parts of CML (and by extension some other markup languages in physical science) are loosely isomorphic with approaches in MathML and OMDOC.
If a synthesis can be found, then CML can happily forget about the "non-chemistry", knowing that the mathematical and physical science community has a general way forward. In easiest-first order, the following are suggested: (1) Mathematical variables and equations in chemical documents. An obvious challenge is that the variables represent types, often physical quantities (but also chemical objects such as atomTypes). This would be one of the first areas to explore with publishers. (2) Graphs and tables. A high proportion of graphs are functions of one or more dependent variables against one or more independent variables, currently supported by >. (3) Dictionaries. The CML dictionaries and OMDOC content dictionaries seem fairly similar in approach. (4) Mathematical relationships. A large area of physical science is based on theoretically and experimentally validated relationships which have been proved over many years (e.g. Maxwell's relations in thermodynamics). Often a quantity can be most easily determined by measuring different ones and transforming them. However, most transformations are currently hidden in procedural, non-portable code, and it would be an exciting challenge to create a self-consistent declarative model of parts of physical science. It would be very exciting to have a discovery engine which could, on demand, decide which quantities were deducible from which (with similarity to theorem proving). A major challenge for distributed mathematics and science is discovery through search engines. These currently work on "free text" and are optimised to recognise strings. In a few cases domain-specific canonicalisations can be used (e.g. our Google Inchi [4] transforms a molecular graph into a string which is recognised by search engines). However, most cases require mathematical operations (arithmetic, transformations, subgraph-matching, etc.). How - and where - can these be performed?
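The deducibility question raised in item (4) can be illustrated with a minimal sketch. This is not a method taken from CML, MathML or OMDOC. It assumes that each declarative relation can be solved for any one of its quantities once all the others are known, which is a deliberate simplification. The relation names, the quantity names and the `deducible` function are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class Relation(NamedTuple):
    """A declarative relation linking a set of named quantities.

    Assumption (illustrative only): knowing all but one of the
    quantities is enough to determine the remaining one.
    """
    name: str
    quantities: FrozenSet[str]


def deducible(known: Iterable[str], relations: Iterable[Relation]) -> Set[str]:
    """Return every quantity reachable from `known` by forward chaining."""
    derived = set(known)
    relations = list(relations)
    changed = True
    while changed:
        changed = False
        for rel in relations:
            missing = rel.quantities - derived
            # Exactly one unknown: the relation can supply it.
            if len(missing) == 1:
                derived |= missing
                changed = True
    return derived


# Hypothetical relation set; only the variable structure is recorded,
# not the equations themselves.
RELATIONS = [
    Relation("ideal_gas_law",
             frozenset({"pressure", "volume", "amount", "temperature"})),
    Relation("density_definition",
             frozenset({"density", "mass", "volume"})),
    Relation("molar_mass_definition",
             frozenset({"molar_mass", "mass", "amount"})),
]

if __name__ == "__main__":
    measured = {"pressure", "temperature", "amount", "mass"}
    print(sorted(deducible(measured, RELATIONS)))
```

Under these assumptions, measuring pressure, temperature, amount and mass is enough to reach volume, and from there density and molar mass. A realistic engine would also need units, domains of validity and the actual equations, which this sketch leaves out.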
A new generation of domain-independent and domain-specific indexing and searching tools needs to be developed. Recently CML has had to evolve a grammar to support fuzzy c