ABSTRACT:
Any organization has an explicit or implicit code of ethics that its members are expected to follow. We expect merchants to be honest, policemen to be honorable and librarians to be informative. There are at least three ethical systems that are commonly used to administer different types of resources. Defending against loss is best achieved by Guardian ethics, which include the use of force and deception.
Nanotechnology presents a wide range of problems and opportunities: not just diverse issues, but different kinds of issues. Many of these issues have arisen already with older technologies and institutions. Some of the issues are new, and even the old issues take on new urgency when they occur in new combinations. Nanotech will make most existing products quite a bit more powerful and flexible; it will probably also allow the creation of new products and even new ways of manufacturing and distributing products.
The promise of nanotech is material abundance and rapid improvement of technology at low cost and high convenience. The threat of nanotech is the potential for developing and fabricating dangerous weapons, drugs and other undesirables covertly or in large quantity. To minimize the threat while maximizing the benefit will require the co-operation of many organizations of several distinct types.
1. INTRODUCTION:
In the world of information, digital technologies have made copying fast, cheap, and perfect, quite independent of the cost or complexity of the content. What if the same were to happen in the world of matter? The production cost of a ton of terabyte RAM chips would be about the same as the production cost of steel. Design costs would matter; production costs would not.
At the last turn of the century, the average person would have had a hard time trying to understand how cars and airplanes worked, and computers and nuclear bombs existed only in theory. By the next turn of the century, we may have submicroscopic, self-replicating robots; machine people; the end of disease; even immortality.
Hard to imagine? Not for the new breed of scientists who say that the 21st century could see all these science fiction dreams come true. This is because of molecular nanotechnology, a hybrid of chemistry and engineering that would let us manufacture anything with atomic precision. In fact, scientists claim that even within the next 50 years, this new technology will change the world in ways we can barely begin to imagine today.
2. HOW NANO TECHNOLOGY WILL CHANGE THE WORLD:
(a). First Bricks Then The Building :
Before nanotechnology can become anything other than a very impressive computer simulation, nanotechnologists must invent an assembler, a few-atoms-large nanomachine that can custom-build matter.
Engineers at Cornell and Stanford, as well as at Zyvex (the self-described "first molecular nanotechnology development company"), are working to create such assemblers right now.
The first products will most likely be super strong nanoscale building materials, such as the Bucky tubes. Bucky tubes are chicken-wire-shaped tubes made from geodesic dome-shaped carbon molecules. These tubes are essentially nanometer-sized graphite fibers, and their strength is 100 to 150 times that of steel at less than one-fourth the weight. With Bucky tubes we can build super roller coasters that drop you from 14,000 feet or we could take tram rides through the Himalayas.
The key to manufacturing with assemblers on a large scale is self-replication. One nano-sized robot making wood one nano-sized piece at a time would be painfully slow. But if these assemblers could replicate themselves, we could have trillions of assemblers all manufacturing in unison. Then there would be no limit to the kinds of things we could create. "Not only will our manufacturing process be transformed, but our concept of labor. Consumer goods will become plentiful, inexpensive, smart, and durable".
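The arithmetic behind self-replication can be sketched in a few lines. This is only an illustration under stated assumptions: every assembler copies itself exactly once per cycle, none fail, and "trillions" is taken to mean one trillion. The function name is hypothetical.

```python
def doublings_needed(target_count: int, start_count: int = 1) -> int:
    """Count replication cycles until start_count assemblers reach target_count.

    Assumes each assembler builds exactly one copy of itself per cycle,
    so the population doubles every cycle.
    """
    cycles = 0
    count = start_count
    while count < target_count:
        count *= 2
        cycles += 1
    return cycles


ONE_TRILLION = 10**12  # assumption: "trillions" read as one trillion

if __name__ == "__main__":
    cycles = doublings_needed(ONE_TRILLION)
    print(f"Cycles to reach one trillion assemblers: {cycles}")
```

Under these assumptions a single assembler reaches a trillion copies in a few dozen cycles, which is why replication rather than the speed of one machine sets the pace.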
(b). The Ways That Molecular Nanotechnology Could Change Our Lives:
(b.1). Manufacturing and Industry:
Nanotechnology will render the traditional manufacturing process obsolete. For example, we’d no longer have a steel mill outfitted with enormous, expensive machinery, running on fossil fuels and employing hundreds of human workers; instead we'd have a nanofactory with trillions of nanobots synthesizing steel, molecule by molecule.
Bill Spence believes that all industry would disappear except software engineering and design. We’d simply design, engineer, and do a molecular model of any product we wanted, and then software could tell a nanobot how to make it.
(b.2).Use of Natural Resources:
Rather than clear-cutting forests to make paper, we'd have assemblers synthesizing paper. Rather than using oil for energy, we'd have molecule-sized solar cells mixed into road pavement a few hundred... Famines would be obliterated, as food could be synthesized easily and cheaply with a microwave-sized nanobox that pulls the raw materials (mostly carbon) from the air or the soil. And by using nanobots as cleaning machines that break down pollutants, we would be able to counteract the damage we've done to the earth since the industrial revolution.
(b.3).Medicine:
Nanotechnology could also mean the end of disease as we know it. If you caught a cold or contracted AIDS, you'd just drink a teaspoon of liquid that contained an army of molecule-sized nanobots programmed to enter your body's cells and fight viruses. If a genetic disease ran in your family, you'd ingest nanobots that would burrow into your DNA and repair the defective gene. Even traditional plastic surgery would be eliminated, as medical nanobots could change your eye color, alter the shape of your nose, or even give you a complete sex change without surgery.
3. WHAT NEW OBJECTS WILL APPEAR BECAUSE OF NANOTECHNOLOGY? :
Perhaps the big story -- with mature nanotechnology, any object can morph into any other imaginable object... truly a concept requiring personal exposure to fully understand the significance and possibilities, but to get a grip on the idea, consider this:
The age of digital matter -- multi-purpose, programmable machines, change the software, and something completely different happens.
Fractal Robots are programmable machines that can do unlimited tasks in the physical world, the world of matter. Load the right software and the same "machines" can take out the garbage, paint your car, or construct an office building and later, wash that building's windows. In large groups, these devices behave like what may be termed macro-sized (hold in your hand) "nanobots", possessing and performing many of the desirable features of mature nanomachines (as described in Drexler's Engines of Creation, Unbounding the Future, Nanosystems, etc.). This is the beginning of "Digital Matter".
These Robots look like "Rubik's Cubes" that can "slide" over each other on command, changing and moving in any overall shape desired for a particular task. These cubes communicate with each other and share power through simple internal induction coils, and have batteries, a small computer and various kinds of internal magnetic and electric inductive motors (depending on size) used to move over other cubes. When sufficiently miniaturized (below 0.1mm) and fabricated using photolithography methods, cubes can also be programmed to assemble other cubes of smaller or larger size. This "self-assembly" is an important feature that will drop cost dramatically. The point is: if you have enough cubes of small enough dimension, they can slide over each other, or "morph", into any object with just about any function one can imagine and program for. Cubes of sufficiently miniaturized size could be programmed to behave like the "T-2" Terminator Robot in the Arnold Schwarzenegger movie, or a lawn chair... Just about any animate or inanimate object.
Fractal Shape Shifting Robots have been in prototype for the last two years, and this form of "digital matter" is expected to hit the commercial scene very soon. In the near future, if you gaze out your window and see something vaguely resembling an amoeba constructing an office building, you'll know what "IT" is.
This is not to say individual purpose objects will not be desirable... Back to cotton -- although Cubes could mimic the exact appearance of a fuzzy down comforter (a blanket), if made out of cubes, it would be heavy and not have the same thermal properties. Although through a heroic engineering effort, such a "blanket" could be made to insulate and pipe gasses like a comforter and even "levitate" slightly to mimic the weight and mass, why bother when the real thing can be manufactured atom by atom, on site, at about a meter a second (depending on thermal considerations)?
Also, "single purpose" components of larger machines will be built to take advantage of the fantastic structural properties of Diamondoid-Buckytube composites for such things as thin, super strong aircraft parts. Today, using the theoretical properties of such materials, we can design an efficient, quiet, super safe personal vertical takeoff air car. This vehicle of science fiction is probably science future.
4. WHICH INDUSTRIES SHOULD DISAPPEAR BECAUSE OF NANOTECHNOLOGY? :
Everything but software (everything will run on software), general engineering as it relates to this new power over matter, and the entertainment industry. Unfortunately, there will still be insurance salesmen and lawyers, although not in my solar orbiting city state. If, as Drexler suggests, we can pave streets with self-assembling solar cells, I would tend to avoid energy stocks. Mature nanites could mine any material from the earth, landfills or asteroids at very low cost and in great abundance.
The mineral business is about to change. Traditional manufacturing will not be able to compete with assembler technology and what happens to all those jobs and the financial markets is a big, big issue that needs to be addressed now.
We will have a lot of obsolete mental baggage and programming to throw out of our heads... Traditional pursuits of money will need to be reevaluated when a personal assembler can manufacture a fleet of Porsches that run circles around today’s models.
As Drexler so intuitively points out, the best thing to do is to get the whole world's society educated and understanding what will and can happen with this technology. This will help people make the transition and keep mental and financial meltdowns to a minimum.
5. WHICH NEW INDUSTRIES SHOULD APPEAR BECAUSE OF NANOTECHNOLOGY? :
Future generations are laughing as they read these words…
Laughing at the utter inadequacy and closed imagination of this writing... So consider this a comically inadequate list. However, if they are laughing, I am satisfied and at peace, as this means we made it through the transition (although I fear it shall not be the last).
Mega engineering for space habitation and transport in the Solar System will have a serious future. People will be surprised at how fast space develops, because right now, a very bright core of nano-space enthusiasts has engineering plans, awaiting the arrival of the molecular assembler. People like Forrest Bishop have wonderful plans for space transport and development, capable of being implemented in surprisingly short time frames. This is artificial life, programmed to "grow" faster than natural systems.
An explosion in the arts and service industries is to be expected when no fields need to be plowed for our daily bread, similar to the explosion when agriculture became mechanized and efficient and the sons and daughters of farmers migrated to cities. This explosion will be exponentially greater. Leisure time, much more leisure time, more diversions...
WHAT PROFESSIONS SHOULD DISAPPEAR BECAUSE OF NANOTECHNOLOGY?
Ditch digger, tugboat captain: most professions where humans are now used as "smart brawn", or as "the best available computer", including jet fighter pilot, truck driver, surgeon, pyramid builder, steel worker, gold miner... not that there will not be people doing these jobs, just for fun. Charming libation vendors have a good future, until the A.I. We are just on the verge of finding out how frequent and varied novel situations can be.
6. NEW ENTERTAINMENT / EXPERIENCES WHICH WILL BE POSSIBLE WITH NANOTECHNOLOGY:
Perhaps the definition of life and entertainment will become blurred, but as I have previously noted, you can have a LOT of fun with Utility Fog and a super internet. In the near term, how about designing a "roller coaster" that self-assembles (traditional construction costs are not a consideration) and is made of super materials 80-100 times as strong as, and much lighter than, steel? That first drop can be made from 14,000 feet! The ride can last until you need the skin replaced on your face. How about a tram ride through the Himalayas?
Amateur underwater archeologists could map and recover ancient treasures from the Mediterranean in personal subs bristling with sensors. Dinosaur hunters could send down microscopic probes into the Earth searching for new fossil fields, then release nanomachines to meticulously unearth finds. Zero G sports are yet to be defined. These are simple examples written by a mind stuck in this contemporary world view. The possibilities are as numerous as moves in 3-D chess.
The Foresight Institute suggests the question is now not if the technology can be developed, but when. I agree. It is a function of the general awareness of the concept in society. The media is picking up Drexler’s ideas ever more quickly now. Two American companies are known to be engineering several "magical" assembler-dependent products right now, in anticipation of the arrival of the assembler. Who knows how many black government projects around the world may have hundreds of millions in funding? The military understands Drexler's ideas and what a weapons boon nanotechnology will be.
Keep in mind, nanotechnology is not the ultimate, nor the end of technology… Is picotechnology (trillionth of a meter) next? If so, this technology would deal with “matter” on a scale 1000 times smaller and emanate from deep inside the quantum realm... What does this mean? Power and understanding over space-time to engineer superluminal (faster than light) flight? Perhaps. If so, this would probably represent only the tip of this quantum weirdness iceberg. Picotechnology may be developed with enhanced intelligence made available through nanotechnology.
7. PROBLEMS WITH CURRENT NANOTECHNOLOGY RESEARCH IDEAS:
Energy Requirements:
One of the big problems not fully appreciated with current ideas in nanotechnology research is the energy requirement for synthesizing bulk materials and big molecules. If you wanted to build concrete atom by atom, for example, you would have to seriously ask whether it is better made from the usual ingredients of concrete, which are found in reasonable abundance, or from atoms. If we start with atoms, then every chemical bond in concrete must be synthesized bond by bond, using chemical steps that would at best use several times that bond energy to achieve the desired effect. The result is an energy requirement to synthesize concrete that is way beyond the energy required to make concrete from existing ingredients. For this reason, bulk materials will never be synthesized using nanotechnology methods. Nanotechnology contributions would be limited to making simple precursors, if that is energetically feasible, and low cost enzymes that speed up various chemical reactions.
(A).Cross Bonding:
In trying to synthesize very large molecules, like DNA, the problems of cross bonding and of reactive intermediates bonding unfavorably with other molecules pose a huge risk to making perfect molecules. The work of enzymes overcomes most of these difficulties. However, enzymes have to be developed that co-exist with other enzymes and other chemicals. In nature, this is achieved through millions of years of evolution where the right chemicals have been found to do the right job through natural selection pressures. Beyond that, compartmentalization is used where chemicals cannot co-exist through their design. The compartmentalization also requires various molecules to transport materials through membranes separating the compartments. All these operations require a huge diversity of chemicals that have to be researched and perfected so that they can co-exist with the previous set of chemicals.
(B).Time Restrictions:
Perfecting such systems requires an unreasonable amount of effort on the part of a nanotechnologist to search out all combinations. It requires considerable effort even now to research just one chemical in all its glorious working detail, let alone combinations of chemicals in a system.
(C).Wholesale Mistakes:
Nano-technologists hope to side-step many of the issues by using something the equivalent of a robot arm to perform molecular level assembly. Certainly for mass manufacturing, this is a wholesale mistake as can be proved when energy considerations are taken into account.
(D).Energy Consumption:
For one thing, a robot arm that picks up a precursor and attaches it precisely to a growing molecule is particularly energy inefficient. You have to pick up the precursor from one place and place it in another, which requires HUGE amounts of energy in relation to the actual work accomplished.
(F).Biological Systems & Energy Conservation:
In biological systems, the currency for energy is the energy carried by ATP (adenosine triphosphate). Every time an action is required, a molecule of ATP is usually involved: energy is absorbed from the ATP, which is then recycled. It’s common for biochemists to cite reactions in terms of the number of ATP molecules consumed per reaction. So some chemicals require 1 ATP to accomplish their reactions, while others, including very large molecules, require hundreds to thousands of ATP molecules to accomplish all their tasks. To move a ribosome 3 base pairs while it’s attached to DNA requires huge numbers of ATP molecules to be consumed. But a lot of that energy is recovered when the final protein it makes is broken down and recycled, which means that overall, the process of reading DNA and making macromolecules is fairly energy efficient.
Compare that with a scenario in which a robot arm with dimensions approaching a fraction of a micron is used to synthesize molecules. Every time the arm swings around to pick up a chemical and place it at the right place to synthesize an exotic chemical, it spends billions of ATP energy equivalents in doing mechanical work. As the robot arm requires computers and sensors to make it work, we are now counting into trillions of ATP energy equivalents to make one chemical bond in the newly synthesized product. There is no getting away from this reality of the total energy cost in making new materials from scratch. Nanotechnology using this type of universal assembler is clearly nonsense and doomed to failure in all but a handful of cases where small quantities of exotic chemicals are involved.
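To get a feel for the scale of that claim, the comparison can be put into rough numbers. The sketch below is a back-of-the-envelope estimate, not a measurement. It assumes about 30.5 kJ/mol for ATP hydrolysis under standard conditions, reads "trillions of ATP energy equivalents" as exactly one trillion, and uses 350 kJ/mol as an assumed order-of-magnitude covalent bond energy. All names are hypothetical.

```python
AVOGADRO = 6.022e23        # molecules per mole
ATP_KJ_PER_MOL = 30.5      # approx. standard free energy of ATP hydrolysis (assumption)
ATP_PER_BOND = 1e12        # "trillions of ATP energy equivalents", read as one trillion
BOND_KJ_PER_MOL = 350.0    # assumed typical covalent bond energy, order of magnitude only


def joules_per_atp() -> float:
    """Energy released by hydrolysing a single ATP molecule, in joules."""
    return ATP_KJ_PER_MOL * 1000.0 / AVOGADRO


def robot_arm_kj_per_mol_of_bonds(atp_per_bond: float = ATP_PER_BOND) -> float:
    """Energy the robot arm would spend to form one mole of bonds, in kJ."""
    return atp_per_bond * ATP_KJ_PER_MOL


def overhead_factor() -> float:
    """Ratio of energy spent by the arm to energy stored in the bonds formed."""
    return robot_arm_kj_per_mol_of_bonds() / BOND_KJ_PER_MOL


if __name__ == "__main__":
    print(f"Energy per ATP: {joules_per_atp():.2e} J")
    print(f"Robot arm cost: {robot_arm_kj_per_mol_of_bonds():.2e} kJ per mole of bonds")
    print(f"Overhead versus bond energy: about {overhead_factor():.1e} times")
```

If the text's figure of trillions of ATP equivalents per bond is accepted, the overhead comes out many orders of magnitude above the energy actually stored in the bond, which is the point the argument rests on.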
(G).Lack Of Self Repair:
Another feature of biological systems that is not fully appreciated is the self-repair built in at all levels, from repairing damaged DNA code to destroying molecules so they can be re-manufactured for re-use. Small machines need self-repair at all levels to cope with the high breakage rates found at the smaller scales. Nanotechnologists cannot even begin to address the question right now because they don't have any nanotechnology machines ready for this work to be carried out!
8. WHEN WILL NANOTECHNOLOGY ARRIVE?
“Arrive” is broadly defined as the first “universal assembler” that has the ability to build with atoms anything one’s software defines. A universal assembler may look like a microwave oven, connected to a raw atomic feed stock, like carbon black, O2, or sulfur powder.
Most people now understand that it will take a long, disciplined effort, and that it will not be an accidental discovery. Even so, they seem to believe that shortly after getting the first nanotech manipulators, we’ll get many of the nanotech miracles. But probably the first things we get with nanotech will be cute publicity demos that may not even be visible to the naked eye.
On this view, it would take over a decade after serious nanotechnology research gets underway to create the first nanotech robotic arm, and about another decade to create the first self-replicating nanofactory.
9. CONCLUSION:
Humanity will be faced with powerful, accelerated social revolutions as a result of nanotechnology. In the near future, a team of scientists will succeed in constructing the first nano-sized robot capable of self-replication. Consumer goods will become plentiful, inexpensive, smart, and durable. Medicine will take a quantum leap forward. Space travel and colonization will become safe and affordable. For these and other reasons, global lifestyles will change radically and human behavior will be drastically affected.
Sunday, April 25, 2010
information security and advantages
Contents:
Abstract
1. Elements of Networking Security: Orange Book Security Levels and Firewalls
2. Elements of Networking Security: Passwords
3. Elements of Networking Security: Encryption, Authentication, and Integrity
4. Developing a Site Security Policy
5. Violation Response
Abstract:
Internet security is the practice of protecting and preserving private resources and information on the Internet.
Computer and network security are challenging topics among executives and managers of computer corporations. Even discussing security policies may seem to create a potential liability. As a result, enterprise management teams are often not aware of the many advances and innovations in Internet and intranet security technology. Without this knowledge, corporations are not able to take full advantage of the benefits and capabilities of the network. Together, network security and a well-implemented security policy can provide a highly secure solution. Employees can then confidently use secure data transmission channels and reduce or eliminate less secure methods, such as photocopying proprietary information, sending purchase orders and other sensitive financial information by fax, and placing orders by phone.
1. Elements of Networking Security: Orange Book Security Levels and Firewalls
While this paper will provide a basic understanding of the need for a site security policy and factors to consider in creating a security policy, it will not outline one policy that will fit every company. The reason for this is simple—security is very subjective. Every business has a different threshold of well-being, different assets, a different culture, and a different technology infrastructure. Every business has different requirements for storing, sending, and communicating information in electronic form. Just as a business evolves in changing market conditions, a site security policy must adapt to meet changing technology conditions. This tutorial is based on a publicly available document, request for comment (RFC) 1244.
There are many strong tools available for securing a computer network. By themselves, the software applications and hardware products that secure a business’ computer network do not comprise a security policy, yet they are essential elements in the creation of site security. While these technologies are not the focus of this paper, a basic understanding of them will facilitate the creation of a site security policy.
Tools to protect your enterprise network have been evolving for the last two decades, roughly the same amount of time that people have been trying to break into computer networks. These tools can protect a computer network at many levels, and a well-guarded enterprise deploys many different types of security technologies. The most obvious element of security is often times the most easily overlooked: physical security—namely, controlling access to the most sensitive components in your computer network, such as a network administration station or the server room. No amount of planning or expensive equipment will keep your network secure if unauthorized personnel can have access to central administration consoles. Even if a user does not have evil intent, an untrained user may unknowingly provide unauthorized outside access or override certain protective configurations.
The next level of computer security is operating system security (OSS). The U.S. Department of Defense (DOD) established general guidelines for operating system security, and other countries around the world (as well as other federal organizations) have set their standards as well. In the past few years, certified (tested and approved) secure OSS has been introduced in commercial operating systems like UNIX® and Microsoft Windows NT. These are at the C2 level, which provides discretionary access control (file and directory read and write permissions) as well as auditing and authentication controls.
Orange Book Security Levels
The DOD has defined seven levels of computer OSS in the Trusted Computer System Evaluation Criteria, otherwise known as the Orange Book. The levels are used to evaluate protection for hardware, software, and stored information. The system is additive: higher ratings include the functionality of the levels below. The definition centers around access control, authentication, auditing, and levels of trust. D1 is the lowest rating and states that the system is insecure. A D1 rating is never awarded because this is essentially no security at all. C1 is the lowest level that provides actual security. The system has file and directory read and write controls and authentication through user login. However, root is considered an insecure function and auditing (system logging) is not available. C2 features an auditing function to record all security-related events and provides stronger protection on key system files, such as the password file.
A B-rated system supports multilevel security (such as secret and top secret) and mandatory access control, which means that a user cannot change permissions on files or directories. B2 requires that every object and file be labeled according to its security level and that these labels change dynamically depending on what is being used. B3 extends security levels down into the system hardware; for example, terminals can only connect through trusted cable paths and specialized system hardware to ensure that there is no unauthorized access. A1 is the highest level of security validated through the Orange Book. The design must be mathematically verified; all hardware and software must have been protected during shipment to prevent tampering. A word of caution on secure operating systems must be mentioned: the features and capabilities require significant amounts of central processing unit (CPU) processing power and disk space. In low-end servers, enabling the security features may seriously affect the number of users a server can support.
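Because the ratings are additive, they can be modeled as an ordered scale on which a system satisfies a requirement whenever its rating is at least as high. The sketch below is an illustrative assumption rather than anything defined by the Orange Book itself. The class and function names are hypothetical, and B1 is included as the entry-level B rating even though the text above does not name it separately.

```python
from enum import IntEnum


class OrangeBookLevel(IntEnum):
    """The seven ratings, ordered from least to most trusted."""
    D1 = 0  # minimal protection, effectively insecure
    C1 = 1  # login authentication, file and directory permissions
    C2 = 2  # adds auditing and stronger protection of key files
    B1 = 3  # entry-level B rating (assumed here; not named in the text)
    B2 = 4  # every object labeled with its security level
    B3 = 5  # security extended into the hardware
    A1 = 6  # mathematically verified design


def meets(system_level: OrangeBookLevel, required_level: OrangeBookLevel) -> bool:
    """A system meets a requirement if it is rated at or above that level."""
    return system_level >= required_level


if __name__ == "__main__":
    print(meets(OrangeBookLevel.B2, OrangeBookLevel.C2))  # a B2 system covers C2 needs
    print(meets(OrangeBookLevel.C1, OrangeBookLevel.C2))  # a C1 system lacks auditing
```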
Firewalls
While in theory firewalls allow only authorized communications between the internal and external networks, new ways are always being developed to compromise these systems. However, properly implemented, they are very effective at keeping out unauthorized users and stopping unwanted activities on an internal network. Firewall systems protect and facilitate your network at a number of levels. They allow e-mail and other applications, such as file transfer protocol (FTP) and remote login as desired, to take place while otherwise limiting access to the internal network. Firewall systems provide an authorization mechanism that assures that only specified users or applications can gain access through the firewall. They typically provide a logging and alerting feature, which tracks designated usage and signals at specified events. These systems offer address translation, which masks the actual name and address of any machine communicating through the firewall. For example, all messages for anyone in the technical support department would have his/her address translated to techsupp@company.com, effectively hiding the name of an actual user and network address. Firewall system providers are adding new functionality, such as encryption and virtual private network (VPN) capabilities.
Firewall systems can also be deployed within an enterprise network to compartmentalize different servers and networks, in effect controlling access within the network. For example, an enterprise may want to separate the accounting and payroll server from the rest of the network and only allow certain individuals to access the information. Unfortunately, all firewall systems have some performance degradation. As a system is busy checking or rerouting data communications packets, they do not flow through the system as efficiently as they would if the firewall system were not in place.
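The filtering behavior described above, where named services are let through and everything else is limited, is often expressed as an ordered rule list with a default-deny fallback. The following is a minimal sketch of that idea, not the configuration of any real firewall product. The rule fields, the payroll subnet 10.0.5.0/24 and port 5432 are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    source: str   # CIDR block the traffic must originate from
    port: int     # destination port the rule applies to
    allow: bool   # True to permit, False to block


# Hypothetical rule set: mail and FTP are open to everyone, while the
# payroll server port is reachable only from the payroll subnet.
RULES = [
    Rule("0.0.0.0/0", 25, True),      # e-mail (SMTP)
    Rule("0.0.0.0/0", 21, True),      # file transfer protocol (FTP)
    Rule("10.0.5.0/24", 5432, True),  # payroll subnet to payroll server
]


def is_allowed(src_ip: str, dst_port: int, rules=RULES) -> bool:
    """Return the decision of the first matching rule; deny if none match."""
    addr = ip_address(src_ip)
    for rule in rules:
        if rule.port == dst_port and addr in ip_network(rule.source):
            return rule.allow
    return False  # default deny


if __name__ == "__main__":
    print(is_allowed("203.0.113.7", 25))    # outside host sending mail
    print(is_allowed("203.0.113.7", 5432))  # outside host probing payroll
    print(is_allowed("10.0.5.20", 5432))    # payroll clerk's workstation
```

Checking every packet against such a list is also where the performance cost mentioned above comes from: each packet pays for the rule lookup before it is forwarded.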
2. Elements of Networking Security: Passwords
Password Mechanisms
Passwords are a way to identify and authenticate users as they access the computer system. Unfortunately, there are a number of ways in which a password can be compromised. For example, someone wanting to gain access can listen for a username and password as an authorized user gains access over a public network. In addition, a potential intruder can mount an attack on the access gateway, entering an entire dictionary of words (or license plates or any other list) against a password field. Users may loan their password to a co-worker or inadvertently leave out a list of system passwords. Fortunately, there are password technologies and tools to help make your network more secure. Useful in ad hoc remote access situations, one-time password generation assumes that a password will be compromised. Before the user leaves the internal network, a list of passwords that will work only one time against a given username is generated. When the user logs into the system remotely, a password is used once and then is no longer valid.
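The one-time password list can be sketched as follows. This is a simplified illustration for a single username, not a description of any particular product. The function and class names are hypothetical, and a real deployment would also need per-user storage, expiry, and protection of the stored digests.

```python
import hashlib
import secrets


def generate_passwords(count: int = 10) -> list:
    """Create a list of random passwords to hand to the departing user."""
    return [secrets.token_urlsafe(8) for _ in range(count)]


class OneTimeVerifier:
    """Server-side record of which one-time passwords are still unused."""

    def __init__(self, passwords):
        # Store only digests so the server never keeps the plain passwords.
        self._unused = {self._digest(p) for p in passwords}

    @staticmethod
    def _digest(password: str) -> str:
        return hashlib.sha256(password.encode("utf-8")).hexdigest()

    def verify(self, password: str) -> bool:
        """Accept a password once, then invalidate it."""
        digest = self._digest(password)
        if digest in self._unused:
            self._unused.remove(digest)
            return True
        return False


if __name__ == "__main__":
    issued = generate_passwords(3)
    verifier = OneTimeVerifier(issued)
    print(verifier.verify(issued[0]))  # first use is accepted
    print(verifier.verify(issued[0]))  # replaying the same password fails
```

An eavesdropper who captures a password on a public network gains nothing, because the password is already spent by the time it has been observed.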
Password Aging and Policy Enforcement
Password aging is a feature that requires users to create new passwords every so often. Good password policy dictates that passwords must be a minimum number of characters and a mix of letters and numbers. Smart cards provide extremely secure password protection. Unique passwords, based on a challenge-response scheme, are created on a small credit-card device. The password is then entered as part of the log-on process and validated against a password server, which logs all access to the system. As might be expected, these systems can be expensive to implement.
Single sign-on overcomes what can only be the ultimate irony in system security: as a user gains more passwords, these passwords become less secure, not more, and the system opens itself up for unauthorized access. Many enterprise computer networks are designed to require users to have different passwords to access different parts of the system. As users acquire more passwords—some people have more than 50—they cannot help but write them down or create easy-to-remember passwords. A single sign-on system is essentially a centralized access control list which determines who is authorized to access different areas of the computer network and a mechanism for providing the expected password. A user need only remember a single password to sign onto the system.
Good password procedures include the following:
• Do not use your login name in any form (as is, reversed, capitalized, doubled, etc.).
• Do not use your first, middle, or last name in any form or use your spouse’s or children’s names.
• Do not use other information easily obtained about you. This includes license plate numbers, telephone numbers, social security numbers, the make of your automobile, the name of the street you live on, etc.
• Do not use a password of all digits or all the same letter.
• Do not use a word contained in English or foreign language dictionaries, spelling lists, or other lists of words.
• Do not use a password shorter than six characters.
• Do use a password with mixed-case alphabetics.
• Do use a password with non-alphabetic characters (digits or punctuation).
• Do use a password that is easy to remember, so you don’t have to write it down.
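Several of the rules listed above can be checked mechanically when a user chooses a password. The following is a rough sketch only: the function name, the tiny built-in word list, and the wording of the messages are hypothetical, and a real checker would use a much larger dictionary.

```python
def password_problems(password, login, word_list=("password", "secret", "welcome")):
    """Return a list of rule violations; an empty list means no rule was broken."""
    problems = []
    lowered = password.lower()
    login_l = login.lower()
    # Login name as is, capitalized, reversed, or doubled
    if lowered in {login_l, login_l[::-1], login_l * 2}:
        problems.append("based on login name")
    if len(password) < 6:
        problems.append("shorter than six characters")
    if password.isdigit():
        problems.append("all digits")
    if len(set(password)) == 1:
        problems.append("all the same character")
    if lowered in word_list:
        problems.append("dictionary word")
    if not (any(c.islower() for c in password) and any(c.isupper() for c in password)):
        problems.append("no mixed case")
    if not any(not c.isalpha() for c in password):
        problems.append("no non-alphabetic character")
    return problems


weak = password_problems("jsmith", "jsmith")
strong = password_problems("Blue7!kite", "jsmith")
```

Rules about personal information (names, license plates, street names) cannot be checked this way without access to that information, so they remain a matter of user education.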
3. Elements of Networking Security: Encryption, Authentication, and Integrity
A firewall system is a hardware/software configuration that sits at the perimeter between a company's network and the Internet, controlling access into and out of the network. Encryption, by contrast, protects the data itself and can be understood as follows:
• the coding of data through an algorithm or transform table into apparently unintelligible garbage
• used both on data stored on a server and on data communicated through a network
• a method of ensuring privacy of data and that only intended users may view the information
There are many forms of encryption, but only the most popular forms will be discussed in this tutorial. The Data Encryption Standard (DES) has been endorsed by the National Institute of Standards and Technology (NIST) since the mid-1970s and is the most readily available encryption standard. One major drawback with DES is that it is subject to U.S. export control; programs that deploy DES technology are generally not available for export from the United States. Rivest, Shamir, and Adleman (RSA) encryption is a public-key encryption system; it is patented technology in the United States and thus is not available there without a license. However, the fundamental RSA algorithm was published before the patent filing, so RSA encryption may be used in Europe and Asia without a royalty. RSA encryption is growing in popularity and is considered quite secure from brute force attacks. An emerging encryption mechanism is Pretty Good Privacy (PGP), which allows users to encrypt information stored on their system as well as to send and receive encrypted e-mail. PGP also provides tools and utilities for creating, certifying, and managing keys. PGP should not be confused with Privacy Enhanced Mail (PEM), a protocol standard.
Encryption mechanisms rely on keys or passwords. The longer the key, the more difficult the encryption is to break. DES relies on a 56-bit key length, and some mechanisms have keys that are hundreds of bits long. There are two kinds of encryption mechanisms in use: private key and public key. Private-key encryption uses the same key to encode and decode the data. Public-key encryption uses one key to encode the data and another to decode the data. The name public key comes from a unique property of this type of encryption mechanism: one of the keys can be public without compromising the privacy of the message or the other key. In practice, a trusted recipient, perhaps a remote office network gateway, usually keeps the private key and uses it to decode data as it comes from the main office. VPNs employ encryption to provide secure transmissions over public networks such as the Internet.
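The public-key property described above (one key encodes, a different key decodes) can be seen in a toy RSA calculation. This is a teaching sketch with deliberately tiny primes chosen for illustration; it omits padding and every other safeguard a real implementation needs and must not be used to protect data.

```python
# Textbook RSA with tiny primes; real keys are hundreds of bits long.
p, q = 61, 53                  # secret primes (illustrative values)
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)
e = 17                         # public exponent, shared with anyone
d = pow(e, -1, phi)            # private exponent, kept by the recipient (Python 3.8+)


def encrypt(m):
    """Anyone holding the public key (e, n) can encode."""
    return pow(m, e, n)


def decrypt(c):
    """Only the holder of the private key d can decode."""
    return pow(c, d, n)


message = 65                   # a message must be a number smaller than n
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

Publishing `e` and `n` does not reveal `d` unless an attacker can factor `n`, which is why practical keys use primes far too large to factor by brute force.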
Authentication and Integrity
Authentication is simply making sure users are who they say they are. When using resources or sending messages in a large private network, not to mention the Internet, authentication is of the utmost importance. Integrity is knowing that the data sent has not been altered along the way. Of course, a message modified in any way would be highly suspect and should be completely discounted. Message integrity is maintained with digital signatures. A digital signature is a block of data at the end of a message that attests to the authenticity of the file. If any change is made to the file, the signature will not verify. Digital signatures thus perform both an authentication and a message integrity function. Digital signature functionality is available in PGP and when using RSA encryption. Kerberos is an add-on system that can be used with any existing network. Kerberos validates a user through its authentication system and uses DES when communicating sensitive information, such as passwords, in an open network. In addition, Kerberos sessions have a limited lifespan, requiring users to log in again after a predetermined length of time and preventing would-be intruders from replaying a captured session to gain unauthorized entry.
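The integrity check described above can be sketched with a keyed hash appended to a message. Note the substitution: this example uses an HMAC with a shared secret key, which is a simpler relative of a digital signature and not the public-key signature scheme found in PGP or RSA. The key and message values are made up for illustration.

```python
import hashlib
import hmac


def make_tag(message: bytes, key: bytes) -> bytes:
    """Compute the block of data that travels with the message."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(message: bytes, tag: bytes, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(make_tag(message, key), tag)


shared_key = b"example-shared-secret"          # hypothetical key
original = b"Transfer 100 units to account 42"
tag = make_tag(original, shared_key)

untouched_ok = verify(original, tag, shared_key)
tampered_ok = verify(b"Transfer 900 units to account 42", tag, shared_key)
```

Changing even a single character of the message produces a different tag, so the altered message fails verification.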
4. Developing a Site Security Policy
The first rule of network site security is easily stated: that which is not expressly permitted is prohibited. A security policy should deny access to all network resources and then add back access on a specific basis. Implemented in this way, a site security policy will not allow any inadvertent actions or procedures. The goal in developing an official site policy on computer security is to define the organization's expectations for proper computer and network use and to define procedures to prevent and respond to security incidents. In order to do this, specific aspects of the organization must be considered and agreed upon by the policy-making group. For example, a military base may have very different security concerns from those of a university. Even departments within the same organization will have different requirements.
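The rule "that which is not expressly permitted is prohibited" translates directly into a default-deny access check. The sketch below is illustrative only; the user names, resource names, and table layout are hypothetical.

```python
# Everything is denied unless an entry here grants it explicitly.
PERMISSIONS = {
    ("alice", "payroll-server"): {"read", "write"},
    ("bob", "payroll-server"): {"read"},
}


def is_permitted(user, resource, action):
    """Return True only when the (user, resource) pair lists the action."""
    return action in PERMISSIONS.get((user, resource), frozenset())


alice_can_write = is_permitted("alice", "payroll-server", "write")
bob_can_write = is_permitted("bob", "payroll-server", "write")
stranger_can_read = is_permitted("mallory", "payroll-server", "read")
```

Because an unknown user or resource falls through to an empty set, forgetting to add an entry results in a refusal rather than in accidental access.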
It is important to consider who will make the network site security policy. Policy creation must be a joint effort by a representative group of decision-makers, technical personnel, and day-to-day users from different levels within the organization. Decision-makers must have the power to enforce the policy; technical personnel will advise on the ramifications of the policy; and day-to-day users will have a say in how usable the policy is. A site security policy that is unusable, unimplementable, or unenforceable is worthless.
Developing a security policy comprises identifying the organizational assets, identifying the threats, assessing the risk, implementing the tools and technologies available to meet the risks, and developing a usage policy. In addition, an auditing procedure must be created that reviews network and server usage on a timely basis. A response plan should also be in place before any violation or breakdown occurs. Finally, the policy should be communicated to everyone who uses the computer network, whether employee or contractor, and should be reviewed on a regular basis.
Identifying the Organizational Assets
The first step in creating a site security policy is creating a list of all the things that must be protected. The list must be easily and regularly updated, as most organizations add and subtract equipment all the time. Items to be considered include the following:
• hardware—CPUs, boards, keyboards, terminals, workstations, personal computers, printers, disk drives, communication lines, terminal servers, routers
• software—source programs, object programs, utilities, diagnostic programs, operating systems, communication programs
• data—during execution, stored on-line, archived off-line, backups, audit logs, databases, in transit over communication media
• documentation—on programs, hardware, systems, and local administrative procedures
Assessing the Risk
While there is a great deal of publicity about intruders on computer networks, most surveys show that the loss from people within the organization is significantly greater. Risk analysis involves determining what must be protected, from what it must be protected, and how to protect it. Possible risks to your network include the following:
• unauthorized access
• unavailable service, corruption of data, or a slowdown due to a virus
• disclosure of sensitive information, especially that which gives someone else a particular advantage, or theft of information such as credit card information
Once the list has been assembled, a scheme for weighing the risk against the importance of the resource should be developed. This will allow the site policy makers to determine how much effort should be spent protecting the resource. Some security experts advocate the proactive use of the very tools that hackers use in order to find system weaknesses. By discovering weaknesses before the fact, a site can take protective action to fend off certain attacks. Perhaps the most famous of these tools is the Security Administrator Tool for Analyzing Networks (SATAN), which is publicly available on many WWW sites.
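One simple scheme for weighing risk against importance is to score each asset on both scales and multiply the two. The sketch below assumes 1-to-5 scales and a made-up inventory; the text does not prescribe this or any other particular formula.

```python
def rank_by_exposure(entries):
    """Order assets by likelihood * importance, highest exposure first.

    Each entry is (asset, likelihood, importance) on an assumed 1-5 scale.
    """
    scored = [(likelihood * importance, asset)
              for asset, likelihood, importance in entries]
    return [asset for _score, asset in sorted(scored, reverse=True)]


# Hypothetical inventory for illustration
inventory = [
    ("public web server", 4, 2),   # score 8
    ("customer database", 3, 5),   # score 15
    ("lobby printer", 1, 1),       # score 1
]
ranking = rank_by_exposure(inventory)
```

The resulting order suggests where protective effort should be spent first; the scores themselves are only as good as the estimates behind them.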
Auditing and Review
To help determine whether a security policy has been violated, take advantage of the tools that are included in computers and networks. Most operating systems store numerous bits of information in log files. Examination of these log files on a regular basis is often the first line of defense in detecting unauthorized use of the system. Compare lists of currently logged-in users with past login histories. Most users typically log in and out at roughly the same time each day. An account logged in outside its normal hours may be in use by an intruder.
In addition, accounting records can be used to determine usage patterns for the system; unusual accounting records may indicate unauthorized use of the system. System logging facilities, such as the UNIX "syslog" utility, should be checked for unusual error messages from system software. For example, a large number of failed login attempts in a short period of time may indicate someone trying to guess passwords. Operating system commands that list currently executing processes can be used to detect users running programs they are not authorized to use, as well as to detect unauthorized programs that have been started by an intruder. By running various monitoring commands at different times throughout the day, a company makes it harder for intruders to predict when they can be detected. It would be exceptionally fortunate for an administrator to catch a violator in the act the first time, but regular review of log files gives a very good chance of establishing procedures that identify the violator at a later date.
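The check for "a large number of failed login attempts in a short period of time" can be sketched as a sliding-window count. The event format (timestamp in seconds, account name, success flag) and the thresholds are assumptions for illustration; real log formats vary by system and would need to be parsed first.

```python
from collections import defaultdict, deque


def flag_password_guessing(events, max_failures=5, window_seconds=60):
    """Return accounts with at least max_failures failed logins inside the window."""
    recent = defaultdict(deque)   # account -> timestamps of recent failures
    flagged = set()
    for timestamp, account, succeeded in sorted(events):
        if succeeded:
            continue
        window = recent[account]
        window.append(timestamp)
        # Drop failures that are older than the window
        while timestamp - window[0] > window_seconds:
            window.popleft()
        if len(window) >= max_failures:
            flagged.add(account)
    return flagged


# Hypothetical events: (seconds since start, account, login succeeded?)
sample_events = [
    (0, "root", False), (10, "root", False), (20, "root", False),
    (30, "root", False), (40, "root", False),
    (5, "jsmith", False), (4000, "jsmith", False), (4010, "jsmith", True),
]
suspects = flag_password_guessing(sample_events)
```

Two mistyped passwords an hour apart do not trigger the check, while five failures within a minute do.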
5. Violation Response
Planning responses for different violation scenarios well in advance—without the burden of an actual event—is good practice. Not only must companies define actions based on the type of violation, but it is also important to have solutions ready based on the anticipated kind of user violating the computer security policy.
Answers to the following questions should be a part of a company's site security plan:
• What outside agencies should be contacted, and who should contact them?
• Who may talk to the press?
• When do you contact law enforcement and investigative agencies?
• If a connection is made from a remote site, is the system manager authorized to contact that site?
• What are our responsibilities to our neighbors and other Internet sites?
Whenever a site suffers an incident that may compromise computer security, the strategies for reacting may be influenced by two opposing pressures.
If management fears that the site is sufficiently vulnerable, it may choose a protect and proceed strategy. The primary goals of this approach are to protect and preserve the site facilities and to provide normalcy for its users as quickly as possible. Attempts will be made to interfere with the intruder's processes, prevent further access, and begin immediate damage assessment and recovery. This process may involve shutting down the facilities, closing off access to the network, or other drastic measures. The drawback is that unless the intruders are identified, they may come back into the site via a different path or may attack another site.
The alternative approach, pursue and prosecute, adopts the opposite philosophy and goals. The primary goal is to allow intruders to continue their activities at the site until the site can identify the responsible persons. Law enforcement agencies and prosecutors endorse this approach. The drawback is that the agencies cannot exempt a site from possible lawsuits by users whose systems and data are damaged in the meantime. Prosecution is not the only outcome possible if the intruder is identified. If the culprit is an employee or a student, the organization may choose to take disciplinary action. Site management must carefully consider potential approaches to this issue before the problem occurs. The strategy adopted might depend upon each circumstance, or there may be a global policy that mandates one approach in all circumstances. The following are checklists to help a site determine which of the two strategies to adopt.
Protect and Proceed
• if assets are not well protected
• if continued penetration could result in great financial risk
• if there is no possibility or willingness to prosecute
• if user base is unknown
• if users are unsophisticated and their work is vulnerable
• if the site is vulnerable to lawsuits from users, e.g., if their resources are undermined
Pursue and Prosecute
• if assets and systems are well protected
• if good backups are available
• if the risk to the assets is outweighed by the disruption caused by the present and potential future penetrations
• if this is a concentrated attack occurring with great frequency and intensity
• if the site has a natural attraction to intruders and consequently regularly attracts intruders
• if the site is willing to incur the financial (or other) risk to assets by allowing the perpetrator to continue
• if intruder access can be controlled
• if the monitoring tools are sufficiently well developed to make the pursuit worthwhile
• if the support staff is sufficiently clever and knowledgeable about the operating system, related utilities, and systems to make the pursuit worthwhile
• if management is willing to prosecute
• if the system administrators know what kind of evidence would lead to prosecution
• if there is established contact with knowledgeable law enforcement
• if there is a site representative versed in the relevant legal issues
• if the site is prepared for possible legal action from its own users if their data or systems become compromised during the pursuit
Capturing Lessons Learned
Once you believe that a system has been restored to a safe state, it is still possible that holes and even traps could be lurking. In the follow-up stage, the system should be monitored for items that may have been missed during the clean-up stage. It would be prudent to utilize some of the tools mentioned as a start. Remember that these tools do not replace continual system monitoring and good systems administration procedures. A security log can be most valuable during this phase of removing vulnerabilities. There are two considerations here. The first is to keep logs of the procedures that have been used to make the system secure again. This should include command procedures (e.g., shell scripts) that can be run on a periodic basis to recheck the security. Second, keep logs of important system events. These can be referenced when trying to determine the extent of the damage of a given incident.
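A command procedure that rechecks security on a periodic basis might, for instance, look for files that any user can modify. The sketch below assumes a POSIX system and is written in Python rather than as a shell script; the function name and the demonstration files are hypothetical, and a real recheck would cover many more conditions.

```python
import os
import stat
import tempfile


def world_writable_files(root):
    """Return the paths under root whose permission bits allow writes by anyone."""
    findings = []
    for directory, _subdirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(directory, name)
            if os.stat(path).st_mode & stat.S_IWOTH:
                findings.append(path)
    return sorted(findings)


# Demonstration on a throwaway directory with one safe and one risky file
with tempfile.TemporaryDirectory() as demo_root:
    for name, mode in (("safe.cfg", 0o600), ("risky.cfg", 0o666)):
        path = os.path.join(demo_root, name)
        with open(path, "w") as handle:
            handle.write("demo\n")
        os.chmod(path, mode)
    demo_findings = [os.path.basename(p) for p in world_writable_files(demo_root)]
```

Saving the output of each run alongside the date gives the security log a record against which later runs can be compared.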
After an incident, it is prudent to write a report describing the incident, method of discovery, correction procedure, monitoring procedure, and a summary of lessons learned. This will help develop a clear understanding of the problem. Remember that it is difficult to learn from an incident if you do not understand the source.
Abstract
1. Elements of Networking Security: Orange Book Security Levels and Firewalls
2. Elements of Networking Security: Passwords
3. Elements of Networking Security: Encryption, Authentication, and Integrity
4. Developing a Site Security Policy
5. Violation Response
Abstract:
Internet security is the practice of protecting and preserving private resources and information on the Internet.
Computer and network security are challenging topics among executives and managers of computer corporations. Even discussing security policies may seem to create a potential liability. As a result, enterprise management teams are often not aware of the many advances and innovations in Internet and intranet security technology. Without this knowledge, corporations are not able to take full advantage of the benefits and capabilities of the network. Together, network security and a well-implemented security policy can provide a highly secure solution. Employees can then confidently use secure data transmission channels and reduce or eliminate less secure methods, such as photocopying proprietary information, sending purchase orders and other sensitive financial information by fax, and placing orders by phone.
1. Elements of Networking Security: Orange Book Security Levels and Firewalls
While this paper will provide a basic understanding of the need for a site security policy and factors to consider in creating a security policy, it will not outline one policy that will fit every company. The reason for this is simple: security is very subjective. Every business has a different threshold of well-being, different assets, a different culture, and a different technology infrastructure. Every business has different requirements for storing, sending, and communicating information in electronic form. Just as a business evolves in changing market conditions, a site security policy must adapt to meet changing technology conditions. This tutorial is based on a publicly available document, Request for Comments (RFC) 1244.
There are many strong tools available for securing a computer network. By themselves, the software applications and hardware products that secure a business’ computer network do not comprise a security policy, yet they are essential elements in the creation of site security. While these technologies are not the focus of this paper, a basic understanding of them will facilitate the creation of a site security policy.
Tools to protect your enterprise network have been evolving for the last two decades, roughly the same amount of time that people have been trying to break into computer networks. These tools can protect a computer network at many levels, and a well-guarded enterprise deploys many different types of security technologies. The most obvious element of security is often the most easily overlooked: physical security, namely controlling access to the most sensitive components in your computer network, such as a network administration station or the server room. No amount of planning or expensive equipment will keep your network secure if unauthorized personnel can gain access to central administration consoles. Even without evil intent, an untrained user may unknowingly provide unauthorized outside access or override certain protective configurations.
The next level of computer security is operating system security (OSS). The U.S. Department of Defense (DOD) established general guidelines for operating system security, and other countries around the world (as well as other federal organizations) have set their own standards as well. In the past few years, certified (tested and approved) secure OSS has been introduced in commercial operating systems like UNIX® and Microsoft Windows NT. These are at the C2 level, which provides discretionary access control (file and directory read and write permissions) as well as auditing and authentication controls.
Orange Book Security Levels
The DOD has defined seven levels of computer OSS in the Trusted Computer System Evaluation Criteria, otherwise known as the Orange Book. The levels are used to evaluate protection for hardware, software, and stored information. The system is additive: higher ratings include the functionality of the levels below. The definition centers around access control, authentication, auditing, and levels of trust. D1 is the lowest rating and states that the system is insecure. A D1 rating is never awarded because it represents essentially no security at all. C1 is the lowest level of actual security. The system has file and directory read and write controls and authentication through user login. However, root is considered an insecure function and auditing (system logging) is not available. C2 features an auditing function to record all security-related events and provides stronger protection on key system files, such as the password file.
A B1-rated system supports multilevel security, such as secret and top secret, and mandatory access control, which states that a user cannot change permissions on files or directories. B2 requires that every object and file be labeled according to its security level and that these labels change dynamically depending on what is being used. B3 extends security levels down into the system hardware; for example, terminals can only connect through trusted cable paths and specialized system hardware to ensure that there is no unauthorized access. A1 is the highest level of security validated through the Orange Book. The design must be mathematically verified, and all hardware and software must have been protected during shipment to prevent tampering. A word of caution on secure operating systems is in order: the features and capabilities require significant amounts of central processing unit (CPU) power and disk space. In low-end servers, enabling the security features may seriously reduce the number of users a server can support.
Firewalls
While in theory firewalls allow only authorized communications between the internal and external networks, new ways are always being developed to compromise these systems. However, properly implemented, they are very effective at keeping out unauthorized users and stopping unwanted activities on an internal network. Firewall systems protect and facilitate your network at a number of levels. They allow e-mail and other applications, such as file transfer protocol (FTP) and remote login, to take place as desired while otherwise limiting access to the internal network. Firewall systems provide an authorization mechanism that ensures that only specified users or applications can gain access through the firewall. They typically provide a logging and alerting feature, which tracks designated usage and raises an alert at specified events. These systems offer address translation, which masks the actual name and address of any machine communicating through the firewall. For example, all messages from anyone in the technical support department would have the sender's address translated to techsupp@company.com, effectively hiding the name of the actual user and the network address. Firewall system providers are adding new functionality, such as encryption and virtual private network (VPN) capabilities.
Firewall systems can also be deployed within an enterprise network to compartmentalize different servers and networks, in effect controlling access within the network. For example, an enterprise may want to separate the accounting and payroll server from the rest of the network and only allow certain individuals to access the information. Unfortunately, all firewall systems impose some performance degradation: while the firewall is busy checking or rerouting data packets, they do not flow as efficiently as they would if the firewall were not in place.
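The behavior described in this section (let designated services through, block everything else, and log what happened) can be sketched as a first-match rule list. This is a conceptual illustration only; the rule fields, service names, and host names are hypothetical, and real firewalls match on addresses, ports, and protocol state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    action: str                    # "allow" or "deny"
    service: str                   # e.g. "smtp", "ftp"
    source: Optional[str] = None   # None matches any source


# Hypothetical policy: e-mail from anywhere, FTP only from one partner site
RULES = [
    Rule("allow", "smtp"),
    Rule("allow", "ftp", source="partner.example"),
]


def decide(service, source, rules=RULES, log=None):
    """Apply the first matching rule; traffic that matches no rule is denied."""
    decision = "deny"
    for rule in rules:
        if rule.service == service and rule.source in (None, source):
            decision = rule.action
            break
    if log is not None:
        log.append((source, service, decision))   # simple audit trail
    return decision


audit_log = []
mail_decision = decide("smtp", "anyone.example", log=audit_log)
ftp_stranger = decide("ftp", "stranger.example", log=audit_log)
ftp_partner = decide("ftp", "partner.example", log=audit_log)
```

Every packet is compared against the rules before it is forwarded, which is also where the performance cost mentioned above comes from.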
2. Elements of Networking Security: Passwords
Password Mechanisms
Passwords are a way to identify and authenticate users as they access the computer system. Unfortunately, there are a number of ways in which a password can be compromised. For Example, someone wanting to gain access can listen for a username password as an authorized user gains access over a public network. In addition, a potential intruder can mount an attack on the access gateway, entering an entire dictionary of words (or license plates or any other list) against a password field. Users may loan their password to a co-worker or inadvertently leave out a list of system passwords. Fortunately, there are password technologies and tools to help make your network more secure. Useful in ad hoc remote access situations, one-time password generation assumes that a password will be compromised. Before leaving the internal network, a list of passwords that will work only one time against a given username is generated. When logging into the system remotely, a password is used once and then will no longer be valid.
Password Aging and Policy Enforcement
Password aging is a feature that requires users to create new passwords every so often. Good password policy dictates that passwords must be a minimum number of characters and a mix of letters and numbers. Smart cards provide extremely secure password protection. Unique passwords, based on a challenge-response scheme, are created on a small credit-card device. The password is then entered as part of the log-on process and validated against a password server, which logs all access to the system. As might be expected, these systems can be expensive to implement.
Single sign-on overcomes what can only be the ultimate irony in system security: as a user gains more passwords, these passwords become less secure, not more, and the system opens itself up for unauthorized access. Many enterprise computer networks are designed to require users to have different passwords to access different parts of the system. As users acquire more passwords—some people have more than 50—they cannot help but write them down or create easy-to-remember passwords. A single sign-on system is essentially a centralized access control list which determines who is authorized to access different areas of the computer network and a mechanism for providing the expected password. A user need only remember a single password to sign onto the system.
Good password procedures include the following:
• Do not use your login name in any form (as is, reversed, capitalized, doubled, etc.).
• Do not use your first, middle, or last name in any form or use your spouse’s or children’s names.
• Do not use other information easily obtained about you. This includes license plate numbers, telephone numbers, social security numbers, the make of your automobile, the name of the street you live on, etc.
• Do not use a password of all digits or all the same letter.
• Do not use a word contained in English or foreign language dictionaries, spelling lists, or other lists of words.
• Do not use a password shorter than six characters.
• Do use a password with mixed-case alphabetics.
• Do use a password with non-alphabetic characters (digits or punctuation).
• Do use a password that is easy to remember, so you don’t have to write it down.
3. Elements of Networking Security: Encryption, Authentication, and Integrity
A firewall system is a hardware/software configuration that sits at perimeter between a company's network and the Internet, controlling access into and out of the network. Encryption can be understood as follows:
• the coding of data through an algorithm or transform table into apparently unintelligible garbage
• used on both data stored on a server or as data is communicated through a network
• a method of ensuring privacy of data and that only intended users may view the information
There are many forms of encryption, but only the most popular forms will be discussed in this tutorial. The digital encryption standard (DES) has been endorsed by the National Institute of Standards and Technology (NIST) since 1975 and is the most readily available encryption standard. One major drawback with DES is that it is subject to U. S. export control; programs that deploy DES technology are generally not available for export from the United States. Rivest, Shamir, and Adleman (RSA) encryption is a public-key encryption system, is patented technology in the United States, and thus is not available without a license. However, the fundamental DES algorithm was published before the patent filing, and RSA encryption may be used in Europe and Asia without a royalty. RSA encryption is growing in popularity and is considered quite secure from brute force attacks. An emerging encryption mechanism is pretty good privacy (PGP), which allows users to encrypt information stored on their system as well as to send and receive encrypted e-mail. PGP also provides tools and utilities for creating, certifying, and managing keys. PGP should not be confused with privacy enhanced mail (PEM), a protocol standard.
Encryption mechanisms rely on keys or passwords. The longer the password, the more difficult the encryption is to break. DES relies on a 56-bit key length, and some mechanisms have keys that are hundreds of bits long. There are two kinds of encryption mechanisms used—private key and public key. Private-key encryption uses the same key to encode and decode the data. Public-key encryption uses one key to encode the data and another to decode the data. The name public key comes from a unique property of this type of encryption mechanism—namely, one of the keys can be public without compromising the privacy of the message or the other key. In fact, usually a trusted recipient, perhaps a remote office network gateway, keeps a private key to decode data as it comes from the main office. VPNs employ encryption to provide secure transmissions over public networks such as the Internet.
Authentication and Integrity
Authentication is simply making sure users are who they say they are. When using resources or sending messages in a large private network, not to mention the Internet, authentication is of the utmost importance. Integrity is knowing that the data sent has not been altered along the way. Of course, a message modified in any way would be highly suspect and should be completely discounted. Message integrity is maintained with digital signatures. A digital signature is a block of data at the end of a message that attests to the authenticity of the file. If any change is made to the file, the signature will not verify. Digital signatures perform both an authentication and message integrity function. Digital signature functionality is available in PGP and when using RSA encryption. Kerberos is an add-on system that can be used with any existing network. Kerberos validates a user through its authentication system and uses DES when communicating sensitive information—such as passwords—in an open network. In addition, Kerberos sessions have a limited lifespan, requiring users to login after a predetermined length of time and disallowing would-be intruders to replay a captured session and thus gain unauthorized entry.
4. Developing a Site Security Policy
The first rule of network site security is easily stated: that which is not expressly permitted is prohibited. A security policy should deny access to all network resources and then add back access on a specific basis. Implemented in this way, a site security policy will not allow any inadvertent actions or procedures. The goal in developing an official site policy on computer security is to define the organization's expectations for proper computer and network use and to define procedures to prevent and respond to security incidents. In order to do this, specific aspects of the organization must be considered and agreed upon by the policy-making group. For example, a military base may have very different security concerns from those of a university. Even departments within the same organization will have different requirements.
It is important to consider who will make the network site security policy. Policy creation must be a joint effort by a representative group of decision-makers, technical personnel, and day-to-day users from different levels within the organization. Decision-makers must have the power to enforce the policy; technical personnel will advise on the ramifications of the policy; and day-to-day users will have a say in how usable the policy is. A site security policy that is unusable, unimplementable, or unenforceable is worthless.
Developing a security policy comprises identifying the organizational assets, identifying the threats, assessing the risk, implementing the tools and technologies available to meet the risks, and developing a usage policy. In addition, an auditing procedure must be created that reviews network and server usage on a timely basis. A response should be in place before any violation or breakdown occurs as well. Finally, the policy should be communicated to everyone who uses the computer network, whether employee or contractor, and should be reviewed on a regular basis.
Identifying the Organizational Assets
The first step in creating a site security policy is creating a list of all the things that must be protected. The list must be easily and regularly updated, as most organizations add and subtract equipment all the time. Items to be considered include the following:
• hardware—CPUs, boards, keyboards, terminals, workstations, personal computers, printers, disk drives, communication lines, terminal servers, routers
• software—source programs, object programs, utilities, diagnostic programs, operating systems, communication programs
• data—during execution, stored on-line, archived off-line, backups, audit logs, databases, in transit over communication media
• documentation—on programs, hardware, systems, and local administrative procedures
Assessing the Risk
While there is a great deal of publicity about intruders on computer networks, most surveys show that the loss from people within the organization is significantly greater. Risk analysis involves determining what must be protected, from what it must be protected, and how to protect it. Possible risks to your network include the following:
• unauthorized access
• unavailable service, corruption of data, or a slowdown due to a virus
• disclosure of sensitive information, especially that which gives someone else a particular advantage, or theft of information such as credit card information
Once the list has been assembled, a scheme for weighing the risk against the importance of the resource should be developed. This will allow the site policy makers to determine how much effort should be spent protecting the resource. Some security experts advocate the proactive use of the very tools that hackers use in order to find system weaknesses. By discovering weaknesses before the fact, protective action can be implemented to fend off certain attacks. Perhaps the most famous of these tools is the Security Administrator Tool for Analyzing Networks (SATAN), which is publicly available on many WWW sites.
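One possible scheme for weighing risk against the importance of a resource is a simple score. The sketch below assumes a 1 to 5 scale for both importance and likelihood and multiplies them; the scale, the multiplication rule, and the sample inventory are all assumptions, not something the policy literature prescribes.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    importance: int  # assumed scale: 1 (low) to 5 (critical)
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)

    @property
    def priority(self) -> int:
        """Higher scores deserve more protection effort."""
        return self.importance * self.likelihood


def rank_resources(resources):
    """Return resources ordered from highest to lowest priority."""
    return sorted(resources, key=lambda r: r.priority, reverse=True)


inventory = [
    Resource("public web server", importance=3, likelihood=5),
    Resource("customer database", importance=5, likelihood=4),
    Resource("office printer", importance=1, likelihood=2),
]
ranked = rank_resources(inventory)
```

With these sample numbers the customer database ranks first, so it would receive the most protection effort, and the office printer ranks last.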
Auditing and Review
To help determine if there is a violation of a security policy, take advantage of the tools that are included in computers and networks. Most operating systems store numerous bits of information in log files. Examination of these log files on a regular basis is often the first line of defense in detecting unauthorized use of the system. Compare lists of currently logged-in users with past login histories. Most users typically log in and out at roughly the same time each day. An account that is logged in outside its normal hours may be in use by an intruder.
In addition, accounting records can be used to determine usage patterns for the system; unusual accounting records may indicate unauthorized use of the system. System logging facilities, such as the UNIX "syslog" utility, should be checked for unusual error messages from system software. For example, a large number of failed login attempts in a short period of time may indicate someone trying to guess passwords. Operating system commands that list currently executing processes can be used to detect users running programs they are not authorized to use, as well as to detect unauthorized programs that have been started by an intruder. By running various monitoring commands at different times throughout the day, a company makes it harder for intruders to predict when they can be detected. While it may be exceptionally fortuitous that an administrator would catch a violator in their first act, by reviewing log files there is a very good chance for setting up procedures to identify them at a later date.
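The failed-login check mentioned above can be sketched as follows. The input is a list of already parsed `(timestamp, account, succeeded)` tuples, which is a simplified stand-in for real log records; the threshold of five failures within one minute is an assumed value that a site would tune for itself.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_guessing_bursts(events, threshold=5, window=timedelta(minutes=1)):
    """Return accounts with at least `threshold` failed logins inside `window`.

    `events` is an iterable of (timestamp, account, succeeded) tuples.
    """
    failures = defaultdict(list)
    for when, account, succeeded in events:
        if not succeeded:
            failures[account].append(when)

    suspicious = set()
    for account, times in failures.items():
        times.sort()
        # Check each run of `threshold` consecutive failures against the window.
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                suspicious.add(account)
                break
    return suspicious


start = datetime(2024, 1, 1, 3, 0, 0)
# Six failures five seconds apart: looks like password guessing.
sample = [(start + timedelta(seconds=5 * i), "root", False) for i in range(6)]
# Six failures spread over several hours: probably ordinary typos.
sample += [(start + timedelta(hours=i), "carol", False) for i in range(6)]
sample.append((start, "dave", True))
```

Running `find_guessing_bursts(sample)` flags only the `root` account. A check like this can be run at varying times of day, in line with the advice above about making detection hard to predict.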
5. Violation Response
Planning responses for different violation scenarios well in advance—without the burden of an actual event—is good practice. Not only must companies define actions based on the type of violation, but it is also important to have solutions ready based on the anticipated kind of user violating the computer security policy.
Answers to the following questions should be a part of a company's site security plan:
• What outside agencies should be contacted, and who should contact them?
• Who may talk to the press?
• When do you contact law enforcement and investigative agencies?
• If a connection is made from a remote site, is the system manager authorized to contact that site?
• What are our responsibilities to our neighbors and other Internet sites?
Whenever a site suffers an incident that may compromise computer security, the strategies for reacting may be influenced by two opposing pressures.
If management fears that the site is sufficiently vulnerable, it may choose a protect and proceed strategy. The primary goals of this approach are to protect and preserve the site facilities and to provide normalcy for its users as quickly as possible. Attempts will be made to interfere with the intruder's processes, prevent further access, and begin immediate damage assessment and recovery. This process may involve shutting down the facilities, closing off access to the network, or other drastic measures. The drawback is that unless the intruders are identified, they may come back into the site via a different path or may attack another site.
The alternate approach, pursue and prosecute, adopts the opposite philosophy and goals. The primary goal is to allow intruders to continue their activities at the site until the site can identify the responsible persons. Law enforcement agencies and prosecutors endorse this approach. The drawback is that the agencies cannot exempt a site from possible user lawsuits if damage is done to their systems and data. Prosecution is not the only outcome possible if the intruder is identified. If the culprit is an employee or a student, the organization may choose to take disciplinary actions. Site management must carefully consider potential approaches to this issue before the problem occurs. The strategy adopted might depend upon each circumstance or there may be a global policy that mandates one approach in all circumstances. The following are checklists to help a site determine which of the two strategies to adopt.
Protect and Proceed
• if assets are not well protected
• if continued penetration could result in great financial risk
• if there is no possibility or willingness to prosecute
• if user base is unknown
• if users are unsophisticated and their work is vulnerable
• if the site is vulnerable to lawsuits from users, e.g., if their resources are undermined
Pursue and Prosecute
• if assets and systems are well protected
• if good backups are available
• if the risk to the assets is outweighed by the disruption caused by the present and potential future penetrations
• if this is a concentrated attack occurring with great frequency and intensity
• if the site has a natural attraction to intruders and consequently regularly attracts intruders
• if the site is willing to incur the financial (or other) risk to assets by allowing the perpetrator to continue
• if intruder access can be controlled
• if the monitoring tools are sufficiently well developed to make the pursuit worthwhile
• if the support staff is sufficiently clever and knowledgeable about the operating system, related utilities, and systems to make the pursuit worthwhile
• if management is willing to prosecute
• if the system administrators know what kind of evidence would lead to prosecution
• if there is established contact with knowledgeable law enforcement
• if there is a site representative versed in the relevant legal issues
• if the site is prepared for possible legal action from its own users if their data or systems become compromised during the pursuit
Capturing Lessons Learned
Once you believe that a system has been restored to a safe state, it is still possible that holes and even traps could be lurking. In the follow-up stage, the system should be monitored for items that may have been missed during the clean-up stage. It would be prudent to utilize some of the tools mentioned as a start. Remember that these tools do not replace continual system monitoring and good systems administration procedures. A security log can be most valuable during this phase of removing vulnerabilities. There are two considerations here. The first is to keep logs of the procedures that have been used to make the system secure again. This should include command procedures (e.g., shell scripts) that can be run on a periodic basis to recheck the security. Second, keep logs of important system events. These can be referenced when trying to determine the extent of the damage of a given incident.
After an incident, it is prudent to write a report describing the incident, method of discovery, correction procedure, monitoring procedure, and a summary of lessons learned. This will help develop a clear understanding of the problem. Remember that it is difficult to learn from an incident if you do not understand the source.
Information security
ABSTRACT :
Information security means protecting information and information systems from unauthorized access, use, disclosure, disruption, modification, or destruction.
Key concepts
For over twenty years information security has held that confidentiality, integrity and availability (known as the CIA Triad) are the core principles of information security.
Integrity
In information security, integrity means that data cannot be created, changed, or deleted without authorization. It also means that data stored in one part of a database system is in agreement with related data stored in another part of the database system (or in another system). For example, a loss of integrity can occur when a database system is not properly shut down before maintenance is performed or when the database server suddenly loses electrical power. A loss of integrity occurs when an employee accidentally, or with malicious intent, deletes important data files. A loss of integrity can occur if a computer virus is released onto the computer. A loss of integrity can also occur when an on-line shopper is able to change the price of the product they are purchasing.
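The on-line shopper example can be made concrete with a small sketch of a server-side check. The catalog contents, the prices in cents, and the `validate_order` helper are hypothetical; the point is only that the total submitted by the client is compared against the authoritative price held by the server.

```python
# Hypothetical authoritative catalog held on the server (prices in cents).
CATALOG = {"widget": 1999, "gadget": 4950}


def validate_order(item: str, quantity: int, client_total: int) -> bool:
    """Accept the order only if the client's total agrees with the catalog."""
    if item not in CATALOG or quantity <= 0:
        return False
    return client_total == CATALOG[item] * quantity
```

An order for two widgets with a total of 3998 is accepted, while the same order with a client-edited total of 2 is rejected, so the stored price and the charged price stay in agreement.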
Risk management
A comprehensive treatment of the topic of risk management is beyond the scope of this article. We will, however, provide a useful definition of risk management, outline a commonly used process for risk management, and define some basic terminology.
Risk is the likelihood that something bad will happen that causes harm to an informational asset (or the loss of the asset). A vulnerability is a weakness that could be used to endanger or cause harm to an informational asset. A threat is anything (man made or act of nature) that has the potential to cause harm.
In broad terms the risk management process consists of:
1. Identify assets and estimate their value. Include: people, buildings, hardware, software, data (electronic, print, other), supplies.
2. Conduct a threat assessment. Include: acts of nature, acts of war, accidents, malicious acts originating from inside or outside the organization.
3. Conduct a vulnerability assessment, and for each vulnerability, calculate the probability that it will be exploited. Evaluate policies, procedures, standards, training, physical security, quality control, technical security.
4. Calculate the impact that each threat would have on each asset. Use qualitative analysis or quantitative analysis.
5. Identify, select and implement appropriate controls. Provide a proportional response. Consider productivity, cost effectiveness, and value of the asset.
6. Evaluate the effectiveness of the control measures. Ensure the controls provide the required cost-effective protection without discernible loss of productivity.
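Steps 3 to 5 can be illustrated with a small quantitative sketch. It assumes that risk is expressed as probability multiplied by impact and that a control is proportional when it costs less than the loss it avoids; both rules, the function names, and the figures used below are assumptions for illustration rather than a prescribed method.

```python
def expected_loss(probability: float, impact: float) -> float:
    """Steps 3 and 4: combine exploitation probability with impact."""
    return probability * impact


def control_is_proportional(probability, impact, residual_probability, control_cost):
    """Step 5: treat a control as cost effective if it costs less than it saves.

    `residual_probability` is the assumed probability of exploitation
    after the control is in place.
    """
    before = expected_loss(probability, impact)
    after = expected_loss(residual_probability, impact)
    return control_cost < before - after
```

For example, with a probability of 0.5, an impact of 10000, and a residual probability of 0.25, the avoided loss is 2500, so a control costing 1000 passes the check and one costing 4000 does not.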
Controls
When management chooses to mitigate a risk, it will do so by implementing one or more of three different types of controls.
Administrative
Administrative controls (also called procedural controls) consist of approved written policies, procedures, standards and guidelines. Administrative controls form the framework for running the business and managing people. They inform people on how the business is to be run and how day to day operations are to be conducted.
Logical
Logical controls (also called technical controls) use software and data to monitor and control access to information and computing systems. For example: passwords, network and host based firewalls, network intrusion detection systems, access control lists, and data encryption are logical controls.
Physical
Physical controls monitor and control the environment of the work place and computing facilities. They also monitor and control access to and from such facilities. For example: doors, locks, heating and air conditioning, smoke and fire alarms, fire suppression systems, cameras, barricades, fencing, security guards, cable locks, etc. Separating the network and work place into functional areas is also a physical control.
Security classification for information :
An important aspect of information security and risk management is recognizing the value of information and defining appropriate procedures and protection requirements for the information. Not all information is equal and so not all information requires the same degree of protection.
Conclusion
Information security is the ongoing process of exercising due care and due diligence to protect information, and information systems, from unauthorized access, use, disclosure, destruction, modification, or disruption. The never-ending process of information security involves ongoing training, assessment, protection, monitoring and detection, incident response and repair, documentation, and review.
SOFTWARE TESTING METHODOLOGIES
ABSTRACT:
This paper describes different techniques for testing software. It explicitly addresses the idea that testability is as important as testing itself, not just by saying that testability is a desirable goal, but by showing how to achieve it. Software testing is the process used to measure the quality of developed computer software. It is not just about finding errors and rectifying them, but also about underlining client requirements and checking that those requirements are met by the software solution or application. It is the most important functional phase in the SDLC (Software Development Life Cycle), as it exposes the mistakes, flaws and errors in the developed software. Until these errors, technically termed ‘bugs’, are rectified, software development is not considered complete. Hence, software testing is an important parameter for assuring the quality of the software product. We discuss when to start and when to stop testing, how errors or bugs arise and are rectified, and how software testing is carried out through teamwork.
INTRODUCTION:
Testing is a process used to help identify the correctness, completeness and quality of developed computer software. Even so, testing can never completely establish the correctness of computer software.
There are many approaches to software testing, but effective testing of complex products is essentially a process of investigation, not merely a matter of creating and following rote procedure. One definition of testing is "the process of questioning a product in order to evaluate it", where the "questions" are things the tester tries to do with the product, and the product answers with its behavior in reaction to the probing of the tester. Although most of the intellectual processes of testing are nearly identical to those of review or inspection, the word testing is taken to mean the dynamic analysis of the product, that is, putting the product through its paces. The quality of an application can and normally does vary widely from system to system, but some of the common quality attributes include reliability, stability, portability, maintainability and usability. Testing helps in verifying and validating that the software works as intended, using both static and dynamic methodologies to test the application. Because of the fallibility of its human designers and its own abstract, complex nature, software development must be accompanied by quality assurance activities. It is not unusual for developers to spend 40% of the total project time on testing. For life-critical software (e.g. flight control, reactor monitoring), testing can cost 3 to 5 times as much as all other activities combined. The destructive nature of testing requires that developers discard preconceived notions of the correctness of the software they have developed. The importance of software testing and its impact on software should not be underestimated. Software testing is a fundamental component of software quality assurance and represents a review of specification, design and coding.
Software Testing Fundamentals:
Testing is the one step in the software process that can be seen by the developer as destructive instead of constructive. Software engineers are typically constructive people and testing requires them to overcome preconceived concepts of correctness and deal with conflicts when errors are identified.
Testing objectives include
1. Testing is a process of executing a program with the intent of finding an error.
2. A good test case is one that has a high probability of finding an as yet undiscovered error.
3. A successful test is one that uncovers an as yet undiscovered error.
Testing should systematically uncover different classes of errors in a minimum amount of time and with a minimum amount of effort. A secondary benefit of testing is that it demonstrates that the software appears to be working as stated in the specifications. The data collected through testing can also provide an indication of the software's reliability and quality. However, testing cannot show the absence of defects; it can only show that software defects are present.
When Should Testing Be Started?
Testing early in the life cycle reduces the number of errors. Test deliverables are associated with every phase of development. The goal of a software tester is to find bugs, find them as early as possible, and make sure they are fixed.
The number one cause of software bugs is the specification. There are several reasons why specifications are the largest bug producer.
In many instances a spec simply isn’t written. Other reasons may be that the spec isn’t thorough enough, it’s constantly changing, or it’s not communicated well to the entire team. Planning software is vitally important. If it’s not done correctly, bugs will be created.
The next largest source of bugs is the design. That’s where the programmers lay out the plan for their software. Compare it to an architect creating the blueprint for a building. Bugs occur here for the same reasons they occur in the specification: it’s rushed, changed, or not well communicated.
Coding errors may be more familiar to you if you are a programmer. Typically these can be traced to software complexity, poor documentation, schedule pressure or just plain dumb mistakes. It’s important to note that many bugs that appear on the surface to be programming errors can really be traced to the specification. The other category is the catch-all for what is left. Some bugs turn out to be false positives, conditions that were thought to be bugs but really weren’t. There may be duplicate bugs, multiple ones that resulted from the same root cause. Some bugs can be traced to testing errors.
When should we Stop Testing?
This can be difficult to determine. Many modern software applications are so complex, and run in such an interdependent environment, that complete testing can never be done. "When to stop testing" is one of the most difficult questions for a test engineer. Common factors in deciding when to stop are:
• Deadlines (release deadlines, testing deadlines.)
• Test cases completed with certain percentages passed
• Test budget depleted
• Coverage of code/functionality/requirements reaches a specified point
• The rate at which Bugs can be found is too small
• Beta or Alpha Testing period ends
• The risk in the project is under acceptable limit.
Practically, we feel that the decision to stop testing is based on the level of risk acceptable to management. As testing is a never-ending process, we can never assume that 100% testing has been done; we can only minimize the risk of shipping the product to the client with X testing done. The risk can be measured by risk analysis, but for a project of short duration, low budget, or low resources, risk can be deduced simply by:
• Measuring Test Coverage.
• Number of test cycles.
• Number of high priority bugs.
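The three simple indicators above can be combined into a stop-testing check. The sketch below is only an illustration: the threshold values (80% coverage, three test cycles, zero open high-priority bugs) and the `ready_to_stop` name are assumptions that each project would set according to the risk its management accepts.

```python
def ready_to_stop(coverage, cycles_run, open_high_priority_bugs,
                  min_coverage=0.8, min_cycles=3, max_high_priority=0):
    """Return True when all three simple risk indicators are within limits.

    coverage                -- measured test coverage as a fraction (0.0 to 1.0)
    cycles_run              -- number of completed test cycles
    open_high_priority_bugs -- high priority bugs that are still open
    """
    return (coverage >= min_coverage
            and cycles_run >= min_cycles
            and open_high_priority_bugs <= max_high_priority)
```

For instance, `ready_to_stop(0.85, 3, 0)` is true, while the same project with two open high-priority bugs is not ready to stop.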
Concepts for Application Test Management:
Testing should be proactive, following the V-model. Test execution can be a manual process or an automated process. It is possible to plan the start date for testing, but it is not possible to accurately plan the end date; testing ends through risk assessment. A fool with a tool is still a fool. Testing is not a diagnosis process; it is a triage process. Testing is expensive, but not testing can be more expensive.
How Do Software Defects Arise?
The International Software Testing Qualifications Board says that software faults occur through the following process:
A human being can make an error (mistake), which produces a defect (fault, bug) in the code, in software or a system, or in a document. If a defect in code is executed, the system will fail to do what it should do (or do something it shouldn’t), causing a failure. Defects in software, systems or documents may result in failures, but not all defects do so.
A fault can also turn into a failure when the environment is changed. Examples of these changes in environment include the software being run on a new hardware platform, alterations in source data or interacting with different software.
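The chain from error to defect to failure, and the fact that a defect does not always produce a failure, can be shown with a deliberately faulty function. The `average` and `run` functions are hypothetical examples written for this illustration.

```python
def average(values):
    """Return the arithmetic mean.

    Latent defect: the programmer forgot that `values` may be empty.
    """
    return sum(values) / len(values)


def run(values):
    """Execute `average` and report whether the execution ended in a failure."""
    try:
        return ("ok", average(values))
    except ZeroDivisionError:
        return ("failure", None)
```

With the input `[2, 4, 6]` the defect is present but never triggered, and the result is correct. Only a change in the source data, here an empty list, turns the defect into a visible failure.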
Inability to find all faults:
A problem with software testing is that testing all combinations of inputs and preconditions is not feasible when testing anything other than a simple product. This means that the number of defects in a software product can be very large and defects that occur infrequently are difficult to find in testing.
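A quick calculation shows why testing every combination is not feasible. The input form below is a made-up example; the only point is that the number of combinations is the product of the number of possible values for each input.

```python
import math


def exhaustive_test_count(values_per_input):
    """Number of test cases needed to try every combination of input values."""
    return math.prod(values_per_input)


# Hypothetical form: three 8-option dropdowns, one checkbox,
# and one 16-bit numeric field.
fields = [8, 8, 8, 2, 2 ** 16]
total = exhaustive_test_count(fields)
```

Even this small form already needs 67,108,864 test cases for exhaustive coverage, before any preconditions or sequences of operations are taken into account.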
When Is Testing Carried Out?
A common practice of software testing is that it is performed by an independent group of testers after the functionality is developed but before it is shipped to the customer. This practice often results in the testing phase being used as project buffer to compensate for project delays, thereby compromising the time devoted to testing. Another practice is to start software testing at the same moment the project starts and it is a continuous process until the project finishes.
Another common practice is for test suites to be developed during technical support escalation procedures. Such tests are then maintained in regression testing suites to ensure that future updates to the software don't repeat any of the known mistakes.
Measuring Software Testing:
Usually, quality is constrained to such topics as correctness, completeness, and security, but it can also include more technical requirements as described under the ISO standard ISO 9126, such as capability, reliability, efficiency, portability, maintainability, compatibility, and usability.
Testing is a process of technical investigation, performed on behalf of stakeholders, that is intended to reveal quality-related information about the product with respect to the context in which it is intended to operate. This includes, but is not limited to, the process of executing a program or application with the intent of finding errors. Quality is not an absolute; it is value to some person. With that in mind, testing can never completely establish the correctness of arbitrary computer software; testing furnishes a criticism or comparison of the state and behavior of the product against a specification. An important point is that software testing should be distinguished from the separate discipline of Software Quality Assurance (SQA), which encompasses all business process areas, not just testing. Today, software has grown in complexity and size. A software product is developed according to the System Requirement Specification. Every software product has a target audience. For example, video game software has an audience completely different from that of banking software. Therefore, when an organization invests large sums in making a software product, it must ensure that the product is acceptable to its end users or target audience. This is where software testing comes into play. Software testing is not merely finding defects or bugs in the software; it is a dedicated discipline of evaluating the quality of the software.
There are four testing steps:
1. Select what has to be measured
Code is tested for correctness with respect to:
• requirements
• architecture
• detailed design
2. Decide how the testing is done for each level of testing
• code inspection
• black-box, white-box, or grey-box testing
• integration testing strategy (big bang, bottom up, top down, sandwich)
3. Develop test cases
A test case is a set of test data or situations that will be used to exercise the unit (code, module, system) being tested or to measure the attribute of interest.
4. Create the test oracle
An oracle contains the predicted results for a set of test cases, i.e., the expected output for each test. The test oracle has to be written down before the actual testing takes place. This is the difficult step.
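Steps 3 and 4 can be sketched with a small unit under test and an oracle that is written down before any test is run. The triangle classifier, the chosen test cases, and the `run_against_oracle` helper are hypothetical examples, not part of any particular testing tool.

```python
def classify_triangle(a, b, c):
    """Unit under test: classify a triangle by its side lengths."""
    if a + b <= c or a + c <= b or b + c <= a:
        return "invalid"
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"


# The oracle: expected output for each test case, fixed before testing starts.
ORACLE = {
    (3, 3, 3): "equilateral",
    (3, 3, 5): "isosceles",
    (3, 4, 5): "scalene",
    (1, 2, 3): "invalid",
}


def run_against_oracle(unit, oracle):
    """Return the test cases whose actual output differs from the oracle."""
    return [case for case, expected in oracle.items() if unit(*case) != expected]
```

An empty result from `run_against_oracle(classify_triangle, ORACLE)` means every actual output matched the prediction. The hard part remains choosing the expected values correctly, since a wrong oracle entry hides or invents defects.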
White box, black box, and grey box testing:
White box and black box testing are terms used to describe the point of view that a test engineer takes when designing test cases.
Black box testing treats the software as a black-box without any understanding as to how the internals behave. It aims to test the functionality according to the requirements. Thus, the tester inputs data and only sees the output from the test object. This level of testing usually requires thorough test cases to be provided to the tester who then can simply verify that for a given input, the output value (or behavior), is the same as the expected value specified in the test case.
White box testing, however, is when the tester has access to the internal data structures, code, and algorithms. For this reason, unit testing and debugging can be classified as white-box testing; it usually requires writing code, or at a minimum stepping through it, and thus requires more knowledge of the product than the black-box tester has. If the software under test is an interface or API of any sort, white-box testing is almost always required. In recent years the term grey box testing has come into common usage. This involves having access to internal data structures and algorithms for purposes of designing the test cases, but testing at the user, or black-box, level. Manipulating input data and formatting output do not qualify as grey-box because the input and output are clearly outside of the black box we are calling the software under test. This is particularly important when conducting integration testing between two modules of code written by two different developers, where only the interfaces are exposed for test.
Grey box testing could be used in the context of testing a client-server environment when the tester has control over the input, inspects the value in a SQL database and the output value, and then compares all three (the input, the SQL value, and the output) to determine whether the data was corrupted on database insertion or retrieval.
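The three-way comparison described above can be sketched with an in-memory SQLite database standing in for the server side. The table layout and the `save_name`, `load_name`, and `grey_box_check` functions are hypothetical; a real client-server system would have separate write and read paths with more logic between them.

```python
import sqlite3


def save_name(conn, name):
    """Application write path: store a customer name and return its row id."""
    cur = conn.execute("INSERT INTO customers (name) VALUES (?)", (name,))
    conn.commit()
    return cur.lastrowid


def load_name(conn, row_id):
    """Application read path: fetch the name through the public interface."""
    row = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (row_id,)
    ).fetchone()
    return row[0] if row else None


def grey_box_check(conn, value):
    """Compare the input, the value stored in the database, and the output."""
    row_id = save_name(conn, value)
    # Grey-box step: the tester looks directly inside the database.
    stored = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (row_id,)
    ).fetchone()[0]
    output = load_name(conn, row_id)
    return value == stored, stored == output


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
```

The first element of the result tells whether the data survived insertion and the second whether it survived retrieval, so a `(True, False)` result would point to the read path rather than the write path.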
Software Testing Life Cycle:
Requirements Analysis: Testing should begin in the requirements phase of the software development life cycle. During the design phase, testers work with developers in determining what aspects of a design are testable and with what parameters those tests work.
Test Planning: Test Strategy, Test Plan(s), Test Bed creation. Many activities are carried out during testing, so a plan is needed.
Test Development: Test Procedures, Test Scenarios, Test Cases, and Test Scripts to use in testing software.
Test Execution: Testers execute the software based on the plans and tests and report any errors found to the development team.
Test Reporting: Once testing is completed, testers generate metrics and make final reports on their test effort and whether or not the software tested is ready for release.
Retesting the Defects: Not all errors or defects reported must be fixed by the software development team. Some may be caused by errors in configuring the test software to match the development or production environment. Some defects can be handled by a workaround in the production environment. Others might be deferred to future releases of the software, or the deficiency might be accepted by the business user. Still other defects may be rejected by the development team, with due reason, if it deems rejection justified.
What should the Test Team do?
Programmer Management
Strong Change Management
Strict Configuration Control
Proactive Scope Creep Management
Inclusion in the decision-making process
What are the Test Team Deliverables?
Test Plans
Test Script Planner
Test Scripts
Test Execution Results
Defect Reports
CONCLUSION:
Software testing accounts for a large percentage of effort in the software development process, but we have only recently begun to understand the subtleties of systematic planning, execution and control. For an IT organization, developing a software system that meets the business needs of clients is always a challenge. The company needs to ensure that the software system that gets delivered to their clients is free from bugs or defects and achieves the demands as per client requirements. But this can only be ensured by following rigorous software testing and quality assurance procedures.
Software testing is a process without which the Software Development Life Cycle (SDLC) stands incomplete. It is the process that identifies the correctness, completeness and quality of the software developed during the SDLC. Software bugs and improperly tested code cost millions in damages, and millions more in time and money to fix the defects. Organizations try to develop software applications that cause the fewest possible surprises to the user; in short, they should be bug free. New paradigms of software testing are being adopted and used in the process of software development.
As a result, the software testing field has emerged from the shadows of the global IT space and has claimed its rightful place in the IT market. Gone are the days when software testing was considered a poor cousin of software development. In this paper, we have discussed software testing techniques, emerging trends in this arena, and new software development paradigms.
The ways in which testing can be done are broadly classified as Manual Testing and Automated Testing.
Manual testing of the software happens in several phases. Self-testing, which is done by the developers themselves or by small development teams, should be restricted to the build cycle itself and should be done while software development is in the production stage. Bugs can be corrected and verified easily when the team performs well together. Hence there should be a good understanding among the team members, so that the software is tested successfully. Team work brings the following benefits:
Team Work encourages team and organisational learning.
Team Work focuses team efforts towards the respective goals (i.e., the intent of finding errors).
Team Work increases the motivation and accountability of individual employees, who share knowledge among themselves.
Team Work encourages continuous improvement.
Team Work provides adequate feedback, thus allowing situational awareness, capability assessment, problem diagnosis, intervention and remediation.
This paper describes different techniques for testing software. It addresses the idea that testability is as important as testing itself, not just by saying that testability is a desirable goal, but by showing how to achieve it. Software testing is the process used to measure the quality of developed computer software. It is not just about finding and rectifying errors, but also about underlining client requirements and checking that the software solution or application meets them. It is the most important functional phase in the SDLC (Software Development Life Cycle), as it exposes the mistakes, flaws and errors in the developed software. Until these errors, technically termed ‘bugs’, are rectified, software development is not considered complete. Hence, software testing becomes an important parameter for assuring the quality of the software product. We discuss when to start and when to stop testing, how errors or bugs arise and are rectified, and how software testing is carried out with the help of team work.
INTRODUCTION:
Testing is a process used to help identify the correctness, completeness and quality of developed computer software. Even so, testing can never completely establish the correctness of computer software.
There are many approaches to software testing, but effective testing of complex products is essentially a process of investigation, not merely a matter of creating and following rote procedure. One definition of testing is "the process of questioning a product in order to evaluate it", where the "questions" are things the tester tries to do with the product, and the product answers with its behavior in reaction to the probing of the tester. Although most of the intellectual processes of testing are nearly identical to those of review or inspection, the word testing is taken to mean the dynamic analysis of the product, that is, putting the product through its paces. The quality of an application can and normally does vary widely from system to system, but common quality attributes include reliability, stability, portability, maintainability and usability. Testing helps in verifying and validating that the software works as intended, using both static and dynamic methodologies.
Because of the fallibility of its human designers and its own abstract, complex nature, software development must be accompanied by quality assurance activities. It is not unusual for developers to spend 40% of the total project time on testing. For life-critical software (e.g. flight control, reactor monitoring), testing can cost 3 to 5 times as much as all other activities combined. The destructive nature of testing requires that developers discard preconceived notions of the correctness of the software they have developed. The importance of software testing and its impact on software should not be underestimated. Software testing is a fundamental component of software quality assurance and represents a review of specification, design and coding.
Software Testing Fundamentals:
Testing is the one step in the software process that can be seen by the developer as destructive instead of constructive. Software engineers are typically constructive people and testing requires them to overcome preconceived concepts of correctness and deal with conflicts when errors are identified.
Testing objectives include:
1. Testing is a process of executing a program with the intent of finding an error.
2. A good test case is one that has a high probability of finding an as yet undiscovered error.
3. A successful test is one that uncovers an as yet undiscovered error.
Testing should systematically uncover different classes of errors in a minimum amount of time and with a minimum amount of effort. A secondary benefit of testing is that it demonstrates that the software appears to be working as stated in the specifications. The data collected through testing can also provide an indication of the software's reliability and quality. However, testing cannot show the absence of defects; it can only show that defects are present.
When should Testing be started?
Testing early in the life cycle reduces errors. Test deliverables are associated with every phase of development. The goal of the software tester is to find bugs, find them as early as possible, and make sure they are fixed.
The number one cause of software bugs is the specification. There are several reasons why specifications are the largest bug producer.
In many instances a spec simply isn’t written. Other reasons may be that the spec isn’t thorough enough, that it’s constantly changing, or that it’s not communicated well to the entire team. Planning software is vitally important; if it’s not done correctly, bugs will be created.
The next largest source of bugs is the design. That’s where the programmers lay out the plan for their software. Compare it to an architect creating the blueprint for a building. Bugs occur here for the same reasons they occur in the specification: the design is rushed, changed, or not well communicated.
Coding errors may be more familiar to you if you are a programmer. Typically these can be traced to software complexity, poor documentation, schedule pressure or just plain dumb mistakes. It’s important to note that many bugs that appear on the surface to be programming errors can really be traced to the specification. The final category is the catch-all for what is left. Some bugs turn out to be false positives, conditions that were thought to be bugs but really weren’t. There may be duplicate bugs, multiple reports that result from the same root cause. Some bugs can be traced to testing errors.
When should we Stop Testing?
This can be difficult to determine. Many modern software applications are so complex, and run in such an interdependent environment, that complete testing can never be done. "When to stop testing" is one of the most difficult questions for a test engineer. Common factors in deciding when to stop are:
• Deadlines (release deadlines, testing deadlines.)
• Test cases completed with certain percentages passed
• Test budget depleted
• Coverage of code/functionality/requirements reaches a specified point
• The rate at which bugs are found becomes too low
• Beta or Alpha Testing period ends
• The risk in the project is under the acceptable limit.
In practice, the decision to stop testing is based on the level of risk acceptable to management. As testing is a never-ending process, we can never assume that 100% testing has been done; we can only minimize the risk of shipping the product to the client with a given amount of testing done. The risk can be measured by risk analysis, but for projects of short duration, low budget or limited resources, the risk can be estimated simply by:
• Measuring Test Coverage.
• Number of test cycles.
• Number of high priority bugs.
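A simple exit check built on these three measures might look like the sketch below. The threshold values (80% coverage, three test cycles, no open high priority bugs) and the names RiskSnapshot and ready_to_stop are assumptions chosen only for illustration; a real project would set its own limits with management.

```python
from dataclasses import dataclass

@dataclass
class RiskSnapshot:
    coverage: float          # fraction of code or requirements covered, 0.0 to 1.0
    cycles_run: int          # number of completed test cycles
    open_high_priority: int  # unresolved high priority bugs

def ready_to_stop(status, min_coverage=0.8, min_cycles=3, max_high_priority=0):
    """Return True only when every (assumed) exit threshold is met."""
    return (status.coverage >= min_coverage
            and status.cycles_run >= min_cycles
            and status.open_high_priority <= max_high_priority)

print(ready_to_stop(RiskSnapshot(coverage=0.85, cycles_run=3, open_high_priority=0)))
print(ready_to_stop(RiskSnapshot(coverage=0.85, cycles_run=3, open_high_priority=2)))
```

The second call reports that testing should continue, because two high priority bugs are still open even though coverage and cycle count are acceptable.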
Concepts for Application Test Management:
• Testing should be pro-active, following the V-model.
• Test execution can be a manual process.
• Test execution can be an automated process.
• It is possible to plan the start date for testing.
• It is not possible to accurately plan the end date of testing.
• Ending testing is through risk assessment.
• A fool with a tool is still a fool.
• Testing is not a diagnosis process. Testing is a triage process.
• Testing is expensive. Not testing can be more expensive.
How do Software Defects Arise?
The International Software Testing Qualifications Board says that software faults occur through the following process:
A human being can make an error (mistake), which produces a defect (fault, bug) in the code, in software or a system, or in a document. If a defect in code is executed, the system will fail to do what it should do (or do something it shouldn’t), causing a failure. Defects in software, systems or documents may result in failures, but not all defects do so.
A fault can also turn into a failure when the environment is changed. Examples of these changes in environment include the software being run on a new hardware platform, alterations in source data or interacting with different software.
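The distinction between a defect and a failure can be shown with a small example. The function average below is hypothetical; its defect (no handling of an empty list) is always present in the code, but it causes a failure only when an input that executes the faulty path is supplied.

```python
def average(values):
    # Defect: the empty list is not handled, so the division below
    # fails only when that particular input is supplied.
    return sum(values) / len(values)

print(average([2, 4, 6]))  # the defect is present but not triggered

try:
    average([])            # the defect is executed and causes a failure
except ZeroDivisionError as exc:
    print("failure:", exc)
```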
Inability to find all faults:
A problem with software testing is that testing all combinations of inputs and preconditions is not feasible when testing anything other than a simple product. This means that the number of defects in a software product can be very large and defects that occur infrequently are difficult to find in testing.
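A rough calculation shows why exhaustive testing is infeasible. The sketch below assumes a routine with two 32-bit integer inputs and a test rate of one billion tests per second; both figures are assumptions made only for the sake of the arithmetic.

```python
def exhaustive_test_years(bits_per_input, num_inputs, tests_per_second):
    """Return (number of input combinations, years needed to run them all)."""
    combinations = 2 ** (bits_per_input * num_inputs)
    seconds = combinations / tests_per_second
    return combinations, seconds / (365.25 * 24 * 3600)

combos, years = exhaustive_test_years(32, 2, 1_000_000_000)
print(combos)  # 18446744073709551616 combinations
print(years)   # several hundred years at the assumed rate
```

Even for this very small interface, covering every combination would take centuries, before any preconditions or environmental states are taken into account.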
When is Testing Carried Out?
A common practice is for software testing to be performed by an independent group of testers after the functionality is developed but before it is shipped to the customer. This practice often results in the testing phase being used as a project buffer to compensate for project delays, thereby compromising the time devoted to testing. Another practice is to start software testing at the same moment the project starts and to continue it until the project finishes.
Another common practice is for test suites to be developed during technical support escalation procedures. Such tests are then maintained in regression testing suites to ensure that future updates to the software don't repeat any of the known mistakes.
Measuring Software Testing:
Usually, quality is constrained to such topics as correctness, completeness and security, but it can also include more technical requirements as described under the ISO standard ISO 9126, such as capability, reliability, efficiency, portability, maintainability, compatibility, and usability.
Testing is a process of technical investigation, performed on behalf of stakeholders, that is intended to reveal quality-related information about the product with respect to the context in which it is intended to operate. This includes, but is not limited to, the process of executing a program or application with the intent of finding errors. Quality is not an absolute; it is value to some person. With that in mind, testing can never completely establish the correctness of arbitrary computer software; instead, it furnishes a criticism or comparison of the state and behavior of the product against a specification. An important point is that software testing should be distinguished from the separate discipline of Software Quality Assurance (SQA), which encompasses all business process areas, not just testing.
Today, software has grown in complexity and size. A software product is developed according to the System Requirement Specification, and every software product has a target audience. For example, video game software has an audience completely different from that of banking software. Therefore, when an organization invests large sums in making a software product, it must ensure that the product is acceptable to its end users or target audience. This is where software testing comes into play. Software testing is not merely finding defects or bugs in the software; it is a dedicated discipline of evaluating the quality of the software.
There are 4 Testing Steps:
1. Select what has to be measured
Code tested for correctness with respect to:
Requirements
Architecture
Detailed Design
2. Decide how the testing is done for each level of testing
Code inspection
Black-box, white box, grey box.
Select integration testing strategy (big bang, bottom up, top down, sandwich)
3. Develop test cases
A test case is a set of test data or situations that will be used to exercise the unit (code, module, system) being tested or to probe the attribute being measured.
4. Create the test oracle
An oracle contains the predicted results for a set of test cases, i.e., the expected output for each test.
The test oracle has to be written down before the actual testing takes place.
This is the difficult step.
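Steps 3 and 4 can be illustrated with a small sketch. The unit under test, classify_triangle, and the contents of the oracle are hypothetical examples; the point is only that the expected outputs are fixed in advance and the actual outputs are then compared against them.

```python
def classify_triangle(a, b, c):
    # Hypothetical unit under test.
    if a + b <= c or a + c <= b or b + c <= a:
        return "invalid"
    if a == b == c:
        return "equilateral"
    if a == b or b == c or a == c:
        return "isosceles"
    return "scalene"

# Test oracle: the expected output for each test case, written down
# before any test is executed.
ORACLE = {
    (3, 3, 3): "equilateral",
    (3, 3, 5): "isosceles",
    (3, 4, 5): "scalene",
    (1, 2, 3): "invalid",
}

def run_tests(unit, oracle):
    """Return (case, expected, actual) for every case that disagrees with the oracle."""
    return [(case, expected, unit(*case))
            for case, expected in oracle.items()
            if unit(*case) != expected]

print(run_tests(classify_triangle, ORACLE))  # an empty list means every case passed
```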
Simple Statistical Algorithm for Biological Sequence Compression
Abstract:
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments
show that our algorithm outperforms existing compressors on typical DNA and protein
sequence datasets while maintaining a practical running time.
1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),
cytosine (C), guanine (G) and thymine (T). In its double-helix form, two complementary
strands are joined by hydrogen bonds joining A with T and C with G. The
reverse complement of a DNA sequence is also considered when comparing DNA
sequences. Certain regions in a DNA sequence are translated to proteins, which control
the development of organisms. The alphabet of protein sequences consists of 20
amino acids, each of which is determined by a triplet of nucleotides called a codon.
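The reverse complement and the grouping of nucleotides into codons can be illustrated with a few lines of Python. This is a minimal sketch, assuming sequences are plain strings over the letters A, C, G and T; the function names are chosen here for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def codons(seq):
    """Split a sequence into consecutive nucleotide triplets (codons)."""
    usable = len(seq) - len(seq) % 3  # ignore an incomplete trailing triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]

print(reverse_complement("ATGC"))  # GCAT
print(codons("ATGGCCTAA"))         # ['ATG', 'GCC', 'TAA']
```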
The amount of DNA sequenced from organisms is increasing rapidly. Compression
of biological sequences is useful, not primarily for managing the genome database,
but for modelling and learning about sequences. Work by Stern et al. [21] recognizes
the importance of mutual compressibility for discovering patterns of interest from
genomes. Chen et al. [6] and Powell et al. [19] show that compressibility is a good
measurement of relatedness between sequences and can be effectively used in sequence
alignment and evolutionary tree construction.
Since DNA is the “instruction of life”, it is expected that DNA sequences are not
random and should be compressible. Some DNA sequences are highly repetitive. It is
estimated that 55% of the human genome is repeat DNA. A repeat subsequence is a
copy of a previous subsequence in the genome, either forward or reverse complement.
Most DNA repeats are not exact as nucleotides can be changed, inserted or deleted.
As an example, the ALU family are repeats of length about 300 bases, and any one
is only about 87% similar to a consensus sequence.
Interestingly, most general purpose text compression algorithms fail to compress
DNA to below the naive 2 bits per symbol. That is because DNA regularities are
different from those in text and are rarely modelled by those compressors. A number
of special purpose compression algorithms for DNA have been developed recently.
Most of these search for repeat subsequences and encode them by reference to a
previous instance. As a DNA subsequence could be (approximately) repeated many
times, using information from many of those repeat positions is expected to give
better compression ratios.
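The naive 2 bits per symbol baseline mentioned above simply assigns a fixed two-bit code to each of the four nucleotides. A minimal sketch follows; the particular code assignment (A=00, C=01, G=10, T=11) is an arbitrary choice made for illustration.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Encode a DNA string at 2 bits per symbol into a single integer."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value, len(seq)

def unpack(value, length):
    """Invert pack(), given the original sequence length."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 3])
        value >>= 2
    return "".join(reversed(bases))

packed, n = pack("ACGT")
print(packed, n)          # 27 4
print(unpack(packed, n))  # ACGT
```

Any compressor that cannot beat this fixed-length code on DNA is, in effect, failing to model the sequence at all.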
In this paper, we present the expert model (XM) and an algorithm for biological
sequence compression. Our compressor encodes each symbol by estimating the
probability based on information obtained from previous symbols. If the symbol is
part of a repeat, the information from one or more previous occurrences is used.
Once the symbol’s probability distribution is determined, it is encoded by a primary
compression algorithm such as arithmetic coding.
This paper is organized as follows. Section 2 reviews current research on biological
compression. Our expert model is described in section 3 and experimental results are
presented in section 4. Finally, section 5 concludes our work.
2. Background
Most compression algorithms fall into one of two categories, namely substitutional
compression and statistical compression. Those in the former class replace a long
repeated subsequence by a pointer to an earlier instance of the subsequence or to
an entry in a dictionary. Examples of this category are the popular Lempel-Ziv
compression algorithms [25, 26] and their variants. As DNA sequences are known to
be highly repetitive, a substitutional scheme is a natural approach to take. Indeed,
most DNA compressors to date are in this category.
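The substitutional idea can be sketched as follows. This is a deliberately naive, quadratic-time illustration in the spirit of Lempel-Ziv, not the algorithm of any compressor cited here; the token format and the minimum copy length of 4 are assumptions.

```python
def lz_encode(seq, min_len=4):
    """Greedy substitution: emit ('copy', offset, length) or ('lit', symbol)."""
    tokens, i = [], 0
    while i < len(seq):
        best_len, best_off = 0, 0
        for start in range(i):  # try every earlier position as a match source
            length = 0
            while (i + length < len(seq)
                   and seq[start + length] == seq[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - start
        if best_len >= min_len:
            tokens.append(("copy", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", seq[i]))
            i += 1
    return tokens

def lz_decode(tokens):
    """Rebuild the sequence; copies are done symbol by symbol so they may overlap."""
    out = []
    for token in tokens:
        if token[0] == "lit":
            out.append(token[1])
        else:
            _, offset, length = token
            for _ in range(length):
                out.append(out[-offset])
    return "".join(out)

tokens = lz_encode("ACGTACGTACGT")
print(tokens)
print(lz_decode(tokens))
```

Here the last eight symbols are replaced by a single pointer to the earlier instance, which is the saving a substitutional compressor relies on.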
On the other hand, a statistical compression encoder such as prediction by partial
match (PPM) [8] predicts the probability distribution of each symbol. Statistical
compression algorithms depend on assumptions about how the sequence is generated
to calculate the distribution. These assumptions are said to be the model of the
sequence. If the model gives a high probability to the actual value of the next symbol,
good compression is obtained. A model that produces good compression makes good
predictions and is a good description of the data.
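The link between prediction and compression can be made concrete: a symbol predicted with probability p costs about -log2(p) bits under an ideal arithmetic coder. The sketch below measures this cost for an adaptive order-2 context model with an add-one prior; the model is an assumption chosen for illustration and is not the model of PPM or of any compressor discussed here.

```python
import math
from collections import defaultdict

def code_length_bits(seq, order=2, alphabet="ACGT"):
    """Total ideal code length, in bits, under an adaptive order-k model."""
    # Each context starts with a count of 1 per symbol (add-one prior).
    counts = defaultdict(lambda: {s: 1 for s in alphabet})
    total = 0.0
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        p = table[symbol] / sum(table.values())
        total += -math.log2(p)  # cost an arithmetic coder would approach
        table[symbol] += 1      # a decoder can make the identical update
    return total

repetitive = "ACGT" * 50
print(code_length_bits(repetitive) / len(repetitive))  # bits per symbol
```

On this highly regular input the model quickly learns which symbol follows each context, so the average cost falls well below 2 bits per symbol.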
The earliest special purpose DNA compression algorithm found in the literature is
BioCompress developed by Grumbach and Tahi [11]. BioCompress detects an exact
repeat in DNA using an automaton, and uses Fibonacci coding to encode the length
and position of its previous location. If a subsequence is not a repeat, it is encoded
by the naive 2 bits per symbol technique. The improved version, BioCompress-2
[12] uses a Markov model of order 2 to encode non-repeat regions. The Cfact DNA
compressor developed by Rivals et al. [20] also searches for the longest exact repeats
but is a two-pass algorithm. It builds the suffix tree of the sequence in the first
pass, and does the actual encoding in the second pass. Regions not repeated are also
encoded by 2 bits per symbol. The Off-line approach by Apostolico and Lonardi
[3] iteratively selects repeated substrings for which encoding would gain maximum
compression.
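Fibonacci coding, mentioned above for encoding repeat lengths and positions, represents a positive integer as a sum of non-consecutive Fibonacci numbers and terminates the codeword with an extra 1, so that every codeword ends in "11". The following is a sketch of the standard code; how BioCompress applies it in detail is not shown here.

```python
def fibonacci_encode(n):
    """Fibonacci code of a positive integer as a bit string ending in '11'."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number greater than n
    bits = ["0"] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):  # greedy, largest term first
        if fibs[i] <= n:
            bits[i] = "1"
            n -= fibs[i]
    return "".join(bits) + "1"

def fibonacci_decode(code):
    """Invert fibonacci_encode for a single codeword."""
    fibs, total = [1, 2], 0
    for i, bit in enumerate(code[:-1]):  # the final '1' is only a terminator
        while len(fibs) <= i:
            fibs.append(fibs[-1] + fibs[-2])
        if bit == "1":
            total += fibs[i]
    return total

for n in (1, 2, 3, 4, 11):
    print(n, fibonacci_encode(n))
```

Small values receive short codewords, which suits repeat lengths and offsets whose typical magnitude is not known in advance.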
A similar substitution approach is used in GenCompress by Chen et al. [6] except
that approximate repeats are exploited. An inexact repeat subsequence is encoded by
a pair of integers, as for BioCompress-2, and a list of edit operations for mutations,
insertions and deletions. Since almost all repeats in DNA are approximate, GenCompress
obtains better compression ratios than BioCompress-2 and Cfact. The same
compression technique is used in the DNACompress algorithm by Chen et al. [7],
which finds significant inexact repeats in one pass and encodes these in another pass.
Most other compression algorithms employ similar techniques to GenCompress to
encode approximate repeats. They differ only in the encoding of non-repeat regions
and in detecting repeats. The CTW+LZ algorithm developed by Matsumoto et al.
[16] encodes significantly long repeats by the substitution method, and encodes short
repeats and non repeat areas by context tree weighting [23]. At the cost of time
complexity, DNAPack by Behzadi and Fessant [4] employs a dynamic programming
approach to find repeats. Non-repeat regions are encoded by the best choice from an
order 2 Markov model, context tree weighting, and naive 2 bits per symbol methods.
Several DNA compression algorithms combine substitution and statistical styles.
An inexact repeat is encoded using (i) a pointer to a previous occurrence and (ii) the
probabilities of symbols being copied, changed, inserted or deleted. In the MNL
algorithm by Tabus et al. [22] and its improvement, GeMNL by Korodi and Tabus
[14], the DNA sequence is split into fixed size blocks. To encode a block, the algorithm
searches the history for a regressor, which is a subsequence having the minimum
Hamming distance from the current block, and represents it by a pointer to the
match as well as a bit mask for the differences between the block and the regressor.
The bit mask is encoded using a probability distribution estimated by the normalized
maximum likelihood of similarity between the regressor and the block.
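The regressor search described above can be sketched as a brute-force scan. This is only an illustration of the minimum Hamming distance idea; the function names are hypothetical, and the normalized maximum likelihood coding of the bit mask used by MNL and GeMNL is not reproduced.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_regressor(history, block):
    """Return (position, bit mask) of the closest match for block in history."""
    n = len(block)
    if len(history) < n:
        raise ValueError("history is shorter than the block")
    best_pos = min(range(len(history) - n + 1),
                   key=lambda p: hamming(history[p:p + n], block))
    match = history[best_pos:best_pos + n]
    mask = [int(x != y) for x, y in zip(match, block)]  # 1 marks a difference
    return best_pos, mask

print(find_regressor("ACGTTTGACCA", "GACG"))
```

The block is then described by the pointer to the regressor, the mask of differing positions, and the actual symbols at those positions.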
Probably the only two pure statistical DNA compressors published so far are CDNA
by Loewenstern and Yianilos [15] and ARM by Allison et al. [2]. In the former algorithm,
the probability distribution of each symbol is obtained by approximate partial
matches from history. Each approximate match is with a previous subsequence having
a small Hamming distance to the context preceding the symbol to be encoded. Predictions
are combined using a set of weights, which are learnt adaptively. The latter
ARM algorithm forms the probability of a subsequence by summing the probabilities
over all explanations of how the subsequence is generated. Both these approaches
yield significantly better compression ratios than those in the substitutional class and
can also produce information content sequences. CDNA has many parameters which
do not have biological interpretations. Both are very computationally intensive.
The expert model presented in this paper is a statistical algorithm. The encoder
maintains a panel of experts and combines them for prediction but a much simpler
and computationally cheaper mechanism is used than in those above. The framework
allows any kind of expert to be used, though we report here only experts obtained from
statistics and repetitiveness of sequences. Weights for expert combination are based
on expert performance. Our compressor is found to be superior to other compression
algorithms published to date, and its speed is practical. The algorithm is capable of biological
knowledge discovery based on per element information content sequences [10]. This
is a purpose of our compressibility research.
3. Algorithm description
As a statistical method, our XM algorithm compresses each symbol by forming
the probability distribution for the symbol and then using a primary compression
scheme to code it. The probability distribution at a position is based on symbols
seen previously. Correspondingly, the decoder, also having seen all previous decoded
symbols, is able to compute the identical probability distribution and can recover the
symbol at the position.
In order to form the probability distribution of a symbol, the algorithm maintains a
set of experts, whose predictions of the symbol are combined into a single probability
distribution. An expert is any entity that can provide a probability distribution at a
position. Expert opinions about a symbol are blended to give a combined prediction
for the symbol.
The statistics of symbols may change over the sequence. One expert may perform
well on some regions, but could give bad advice on others. A symbol is likely to
have similar statistical properties to its surrounding context, particularly the context
preceding the symbol. The reliability of an expert is evaluated from its recent
predictions. A reliable expert has high weight for combination while an unreliable
one has little influence on the final prediction or may be ignored.
3.1. Type of experts
An expert can be anything that provides a reasonably good probability distribution
for a position in the sequence. A simple expert can be a Markov model (Markov
expert). An order-k Markov expert gives the probability of a symbol in a position
given k preceding symbols. Initially, the Markov expert does not have any prior
knowledge of the sequence and thus gives a uniform distribution to a symbol. The
probability distribution adapts as the encoding proceeds. Essentially, the Markov
expert provides the background statistical distribution of symbols over the sequence.
Here we use an order-2 Markov expert for DNA, and order-1 for protein.
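A minimal sketch of such an adaptive order-k Markov expert is given below. The class name, the method names and the add-one smoothing are assumptions made for illustration; the text only states that the expert starts from a uniform distribution and adapts as encoding proceeds.

```python
from collections import defaultdict


class MarkovExpert:
    """Adaptive order-k Markov model over a fixed alphabet.

    Counts start at zero and add-one smoothing (an assumption) makes the
    first prediction in any context uniform.
    """

    def __init__(self, order: int = 2, alphabet: str = "ACGT"):
        self.order = order
        self.alphabet = alphabet
        self.counts = defaultdict(lambda: {s: 0 for s in alphabet})

    def _context(self, history: str) -> str:
        # history[-0:] would return the whole string, so order 0 is special-cased
        return history[-self.order:] if self.order else ""

    def predict(self, history: str) -> dict:
        """Probability of each symbol given the last `order` symbols."""
        c = self.counts[self._context(history)]
        total = sum(c.values()) + len(self.alphabet)
        return {s: (c[s] + 1) / total for s in self.alphabet}

    def update(self, history: str, symbol: str) -> None:
        """Record that `symbol` followed the current context."""
        self.counts[self._context(history)][symbol] += 1
```

The encoder and the decoder would both call `predict` before a symbol is coded and `update` afterwards, so that they stay in step.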
Different areas of a DNA sequence may have differing functions and thus may have
different symbol distributions. Another type of expert is the context Markov expert,
whose probability distribution is not based on the entire history of the sequence but
on a limited preceding context. In other words, the context Markov expert bases its
prediction on the local statistics. The context Markov expert currently used by XM
is order-1 with a context of 512 previous symbols.
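The difference from the plain Markov expert is only in which part of the history is counted. The sketch below recounts the window on every call, which is simple but slow; the function name and the add-one smoothing are assumptions, and a practical implementation would update counts incrementally as symbols enter and leave the window.

```python
def context_markov_predict(history: str, window: int = 512,
                           alphabet: str = "ACGT") -> dict:
    """Order-1 prediction from the last `window` symbols of `history` only."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = history[-window:]
    counts = {s: 1 for s in alphabet}  # add-one smoothing (assumed)
    if recent:
        prev = recent[-1]
        # count what followed each earlier occurrence of the previous symbol
        for a, b in zip(recent, recent[1:]):
            if a == prev:
                counts[b] += 1
    total = sum(counts.values())
    return {s: c / total for s, c in counts.items()}
```

With a long window the expert sees the local pattern, while with a window too short to contain any transition it falls back to a uniform distribution.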
The compressibility of biological sequences comes mainly from repeated subsequences.
Therefore, it is important to include experts that make use of this feature.
XM employs a copy expert that considers the next symbol to be part of a copied
region from a particular offset. A copy expert with offset f suggests that the symbol
at position i is likely to be the same as the symbol at position i − f.
A copy expert does not blindly give a high probability to its suggested symbol. It
uses an adaptive code [5], over some recent history, for correct/incorrect predictions.
The copy expert gives a probability to its predicted symbol of:
\[ p = \frac{r + 1}{w + 2} \qquad (1) \]
where w is the window size over which the expert reviews its performance and r is
the number of correct predictions the expert has made within that window. The remaining probability,
1 − p, is distributed evenly to the other symbols in the alphabet.
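As a worked illustration of equation 1, the following sketch builds the full distribution a copy expert would report. The function name and argument layout are hypothetical.

```python
def copy_expert_distribution(predicted: str, r: int, w: int,
                             alphabet: str = "ACGT") -> dict:
    """Distribution from equation 1: p = (r + 1) / (w + 2) for the
    predicted symbol, with 1 - p shared evenly among the other symbols."""
    if not 0 <= r <= w:
        raise ValueError("r must lie between 0 and w")
    p = (r + 1) / (w + 2)
    rest = (1 - p) / (len(alphabet) - 1)
    return {s: (p if s == predicted else rest) for s in alphabet}
```

With w = 20, an expert that was right 19 times gives its symbol probability 20/22, while one that was never right gives it only 1/22, so a poor copy expert is penalised rather than trusted blindly.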
For complementary reverse repeats, a similar reverse expert is used. This works
exactly the same as the copy expert, except that it suggests the complementary
symbol to the one from the earlier instance and it proceeds in the reverse direction.
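The two kinds of suggestion differ only in how the pointer into the history moves and in whether the symbol is complemented. A small sketch, with hypothetical function names and a pointer convention chosen for illustration:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def copy_suggestion(seq: str, i: int, offset: int) -> str:
    """Symbol a copy expert with the given offset suggests for position i."""
    if offset <= 0 or i - offset < 0:
        raise IndexError("offset must point into the history")
    return seq[i - offset]


def reverse_suggestion(seq: str, pointer: int, steps: int) -> str:
    """Symbol a reverse expert suggests after `steps` predictions.

    `pointer` is the history position used for the first prediction; the
    expert then walks backwards and complements each symbol it reads.
    """
    if pointer - steps < 0:
        raise IndexError("reverse expert ran off the start of the sequence")
    return COMPLEMENT[seq[pointer - steps]]
```

In the sequence AACGCGTT, the last four symbols are the reverse complement of the first four, so a reverse expert whose pointer starts at position 3 suggests C, G, T, T for positions 4 to 7.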
3.2. Proposing experts
At position i of the sequence, there are O(i) possible copy and reverse experts.
This is too many to combine efficiently and anyway most would be ignored. To be
efficient, the algorithm must use at most a small number of copy and reverse experts
at any one time. We currently employ a simple hashing technique to propose likely
experts. Every position is stored in a hash table with the hash key composed of h
symbols preceding the position. If there is an opening for a new expert at any point,
the hash table is consulted.
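The hashing step can be sketched with an ordinary dictionary keyed on the h preceding symbols. The generator below is an illustration under that assumption; its name is hypothetical, and it proposes every earlier position with the same key, whereas the algorithm consults the table only when there is room for a new expert.

```python
from collections import defaultdict


def propose_copy_offsets(seq: str, h: int):
    """Yield (i, offsets) for each position i >= h.

    An offset f is proposed when the h symbols preceding position i also
    preceded position i - f earlier in the sequence.
    """
    index = defaultdict(list)  # key of h symbols -> positions that followed it
    for i in range(h, len(seq)):
        key = seq[i - h:i]
        yield i, [i - j for j in reversed(index[key])]  # most recent first
        index[key].append(i)
```

For "ACGTACGT" with h = 3, the key ACG seen before position 7 was also seen before position 3, so a copy expert with offset 4 is proposed at position 7.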
3.3. Combining expert predictions
The core part of our XM algorithm is the evaluation and combination of expert
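predictions. Suppose a panel of experts E is available to the encoder. Expert $\theta_k$ gives
the prediction $P(x_{n+1} \mid \theta_k, x_{1..n})$ of symbol $x_{n+1}$ based on its observations of the preceding
n symbols. A sensible way to combine experts’ predictions is based on Bayesian
averaging:
\[
\begin{aligned}
P(x_{n+1} \mid x_{1..n}) &= \sum_{k \in E} P(x_{n+1} \mid \theta_k, x_{1..n})\, w_{\theta_k,n} \\
&= \sum_{k \in E} P(x_{n+1} \mid \theta_k, x_{1..n})\, P(\theta_k \mid x_{1..n})
\end{aligned}
\qquad (2)
\]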
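In other words, the weight $w_{\theta_k,n}$ of expert $\theta_k$ for encoding $x_{n+1}$ is the posterior
probability $P(\theta_k \mid x_{1..n})$ of $\theta_k$ after encoding n symbols. $w_{\theta_k,n}$ can be estimated by
Bayes’s theorem:
\[
w_{\theta_k,n} = P(\theta_k \mid x_{1..n})
= \frac{\prod_{i=1}^{n} P(x_i \mid \theta_k, x_{1..i-1})\, P(\theta_k)}{\prod_{i=1}^{n} P(x_i \mid x_{1..i-1})}
\qquad (3)
\]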
If we assume that every expert has the same prior probability $P(\theta_k)$, then normalizing
equation 3 by a common factor M we have:
\[ w_{\theta_k,n} = \frac{1}{M} \prod_{i=1}^{n} P(x_i \mid \theta_k, x_{1..i-1}) \qquad (4) \]
The normalization factor M in fact does not matter, as equation 2 can be
renormalized so that $\sum_{x_{n+1}} P(x_{n+1} \mid x_{1..n}) = 1$. Take the negative log of equation 4 and
ignore the constant term:
\[ -\log_2 w_{\theta_k,n} \sim -\sum_{i=1}^{n} \log_2 P(x_i \mid \theta_k, x_{1..i-1}) \qquad (5) \]
Since $-\log_2 P(x_i \mid \theta_k, x_{1..i-1})$ is the cost of encoding symbol $x_i$ by expert $\theta_k$, the right
hand side of equation 5 is the length of the encoding of subsequence $x_{1..n}$ by expert $\theta_k$.
As we want to evaluate experts on a recent history of size w, only the message length
of encoding symbols $x_{n-w+1..n}$ is used to determine the weights of experts. We find that
the algorithm works best when the negative log (base 2) of the expert weight varies as three
times the average code length over a window of size w = 20:
\[
-\log_2 w_{\theta_k,n} \sim -\frac{3}{w} \sum_{i=n-w+1}^{n} \log_2 P(x_i \mid \theta_k, x_{1..i-1})
= 3\,\mathrm{AveMsgLen}(x_{n-w+1..n} \mid \theta_k)
\qquad (6)
\]
or
\[ w_{\theta_k,n} \propto 2^{-3\,\mathrm{AveMsgLen}(x_{n-w+1..n} \mid \theta_k)} \qquad (7) \]
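The weighting and blending steps can be sketched as below. The function names are hypothetical; the factor 3 and the window of 20 symbols come from the text, and the caller is assumed to pass, for each expert, the probabilities it assigned to the symbols that actually occurred in the window.

```python
import math


def expert_weight(recent_probs, scale: float = 3.0) -> float:
    """Unnormalised weight 2^(-scale * average code length in bits).

    `recent_probs` holds the probabilities the expert gave to the symbols
    that actually occurred in the recent window (w = 20 in the text).
    """
    avg_len = -sum(math.log2(p) for p in recent_probs) / len(recent_probs)
    return 2.0 ** (-scale * avg_len)


def blend(predictions, weights) -> dict:
    """Weighted average of expert distributions, renormalised to sum to 1."""
    total = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, predictions)) / total
            for s in predictions[0]}
```

An expert that spent 1 bit per symbol over the window gets weight 2^-3, and one that spent 2 bits gets 2^-6, so the first has eight times the influence of the second on the blended prediction.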
Suppose there are three hypotheses about how a symbol is generated: by the
distribution of the species genome; by the distribution of the current subsequence;
or by repeating from an earlier subsequence. We therefore entertain three experts
for these hypotheses: (i) a Markov expert for the species genome distribution, (ii) a
context Markov expert for the local distribution, and (iii) a repeat expert, which is
the combination of any available copy and reverse experts, for the third hypothesis.
The experts’ predictions are blended as in equations 2 and 7.
If a symbol is part of a significant repeat, the copy or reverse expert of that repeat
must predict significantly better than a general prediction such as that from the
Markov expert. We therefore define a listen threshold, T, to determine the reliability
of a copy or reverse expert. A copy or reverse expert is considered reliable if its
average code word length is smaller than $C_{mk} - T$ bits, where $C_{mk}$ is the average code
word length of the Markov expert. T is a parameter of the algorithm.
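The reliability test amounts to a single comparison per expert. In the sketch below the function name and the dictionary of per-expert average code lengths are assumptions for illustration.

```python
def reliable_experts(avg_lengths: dict, markov_avg_len: float,
                     threshold: float = 0.5) -> list:
    """Names of copy or reverse experts whose recent average code length
    beats the Markov expert by more than `threshold` bits."""
    return [name for name, length in avg_lengths.items()
            if length < markov_avg_len - threshold]
```

For example, if the Markov expert averages 1.9 bits per symbol and T = 0.5, a copy expert averaging 0.3 bits is kept, while a reverse expert averaging 1.7 bits is not.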
The algorithm can be used as an entropy estimator or a compressor for biological
sequences. The information content of every single symbol is estimated by the negative
log of its probability. To compress the sequence, we use arithmetic coding [24]
to code each symbol based on the probability distribution combined from experts.
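As a small illustration of the entropy estimate, the sketch below converts the probabilities assigned to the symbols that actually occurred into an information content sequence and a bits-per-symbol figure. The function names are hypothetical, and the arithmetic coder itself is not shown.

```python
import math


def information_content(probabilities) -> list:
    """Per-symbol information content in bits: -log2 of each probability."""
    return [-math.log2(p) for p in probabilities]


def bits_per_symbol(probabilities) -> float:
    """Average information content, an estimate of the compressed size
    per symbol before arithmetic coding overheads."""
    info = information_content(probabilities)
    return sum(info) / len(info)
```

A symbol predicted with probability 0.5 costs 1 bit, one predicted with probability 0.25 costs 2 bits, and a symbol predicted with certainty costs nothing.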
4. Experimental results
We implemented the encoder and decoder of XM in Java and ran experiments on
a workstation with a Pentium IV 2.4 GHz CPU and 1 GB of RAM, using the Sun Java
run-time environment 1.5. The compression results are calculated from the size of real
encoded files. Note that the figures for actual compression and information content
agree to four decimal places. The small difference between the information
content computation and the actual compression is due to rounding in arithmetic
coding and padding the last byte of the encoded files.
For comparison, we applied our algorithm to a standard dataset of DNA sequences
that has been used in most other DNA compression publications. The dataset
contains 11 sequences including two chloroplast genomes (CHMPXX and CHNTXX),
five human genes (HUMDYSTROP, HUMGHCSA, HUMHBB, HUMHDABCD
and HUMHPRTB), two mitochondria genomes (MPOMTCG and MTPACG)
and genomes of two viruses (HEHCMVCG and VACCG). For DNA compression, we
use a hash key of length 11 and a listen threshold of 0.5 bits.

Table 1. Comparison of DNA compression.
Table 1 compares the compression results, in bits per symbol (bps), of XM with those
of other DNA compressors on the dataset. Due to space limitations, we present here
the most efficient algorithms, including BioCompress-2 (BioC) [12], GenCompress
(GenC) [6], DNACompress (DNAC) [7], DNAPack (DNAP) [4], CDNA [15] and
GeMNL [14]. Comparison with other DNA compressors can be found on the website:
ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/
The results of CDNA are reported for only 9 sequences, to a precision of two decimal places. The
GeMNL results are also reported without the sequence HUMHBB and to two decimal
places, but we were able to obtain higher precision by downloading the encoded
files from the authors’ website. We include the average compression results of each
algorithm in the last row.
XM outperforms all other algorithms on most sequences from the standard dataset.
The average compression ratio is also significantly better. For CDNA and GeMNL,
due to missing compression results for several sequences, we are unable to compute
the same average. Instead, we compare averages over only the available results.
The average compression ratio of nine sequences reported for CDNA is 1.6911 bps,
while XM’s average performance on the same set is 1.6815 bps. On the ten sequences
excluding HUMHBB, GeMNL averages 1.6980 bps, compared to XM’s 1.6883 bps.
Total time for XM to encode these 11 sequences is about 8 seconds. Decoding time
is similar since both encoder and decoder do essentially the same computation.

Figure 1. Information content of the HUMHBB sequence.
As a statistical compressor, the expert model is able to produce the information
content sequence from DNA or protein. This is important when we want to analyze
areas of interest [21, 9, 10]. For example, figure 1 shows a graph of information content
along the HUMHBB sequence. The data in the graph is smoothed with a window
size of 300 for viewing purposes. One can notice spikes in the graph corresponding
to areas of repeats in the sequence.
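The text does not say which smoothing method was used for the figure. Assuming a simple moving average, the smoothing could be sketched as follows; the function name is hypothetical.

```python
def moving_average(values, window: int = 300) -> list:
    """Mean of each run of `window` consecutive values.

    Returns len(values) - window + 1 points, or an empty list when the
    input is shorter than the window.
    """
    if window <= 0 or window > len(values):
        return []
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]  # slide the window by one
        out.append(running / window)
    return out
```

Applied to a per-symbol information content sequence, this gives one smoothed value for each window position along the sequence.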
The alphabet for proteins consists of 20 symbols and thus the base line of protein
entropy is $\log_2 20 \approx 4.322$ bps. As with DNA, most general purpose compressors
fail to compress to less than that base line. Nevill-Manning and Witten [18] designed
CP, a protein-oriented compression algorithm based on PPM. However, compression
ratios obtained by CP are only marginally better than the base line entropy. Several
other attempts such as ProtComp [13], LZ-CTW [16] and BW [1] show that protein
sequences are indeed compressible with better compression ratios.
We experimented with compressing protein using XM on a protein corpus gathered
by Nevill-Manning and Witten [18], which consists of the proteomes of four species:
Haemophilus influenzae (HI), Saccharomyces cerevisiae (SC), Methanococcus jannaschii (MJ)
and Homo sapiens (HS). As an amino acid is coded by three nucleotides, we use a shorter
hash key for protein, of length 6. The listen threshold is raised to 1.0 bit as the upper bound
on the entropy of protein is 4.322 bps instead of 2.0 bps for DNA.

Table 2. Comparison of protein compression.
Table 2 shows the compression ratios of CP, ProtComp, LZ-CTW and XM on the four protein sequences.
Note that an incorrect protein corpus that was more compressible was made available
at some point, resulting in significantly lower compression ratios being reported in
ProtComp [13] and BW [1]. We obtained the compression results of ProtComp on
the correct protein corpus from the authors’ website but were unable to do so for
BW as the authors have moved to new projects [17]. We found that our algorithm is
able to compress proteins better than CP and LZ-CTW and marginally better than
ProtComp for all sequences in the corpus.
5. Conclusion
We have presented the expert model, XM, which is simple and based on biological
principles. The associated compression algorithm is efficient and effective for both
DNA and protein sequence compression. The algorithm utilizes approximate repeats
and statistical properties of the biological sequence for compression. As a statistical
compression method, XM is able to compute the information content of every symbol
in a sequence which is useful in knowledge discovery [21, 9, 10]. Our algorithm
is shown to outperform all published DNA and protein compressors to date while
maintaining a practical running time.
References
[1] D. Adjeroh and F. Nan. On compressibility of protein sequences. DCC, pages 422–434, 2006.
[2] L. Allison, T. Edgoose, and T. I. Dix. Compression of strings with approximate repeats. ISMB,
pages 8–16, 1998.
[3] A. Apostolico and S. Lonardi. Compression of biological sequences by greedy off-line textual
substitution. DCC, pages 143–152, 2000.
[4] B. Behzadi and F. L. Fessant. DNA compression challenge revisited: A dynamic programming
[5] D. M. Boulton and C. S. Wallace. The information content of a multistate distribution. Journal of
Theoretical Biology, 23(2):269–278, 1969.
[6] X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its applications
in genome comparison. RECOMB, page 107, 2000.
[7] X. Chen, M. Li, B. Ma, and T. John. DNACompress: Fast and effective DNA sequence
compression. Bioinformatics, 18(2):1696–1698, Dec 2002.
[8] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string
matching. IEEE Trans. Comm., COM-32(4):396–402, April 1984.
[9] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Exploring long DNA
sequences by information content. Probabilistic Modeling and Machine Learning in Structural
and Systems Biology, Workshop Proc, pages 97–102, 2006.
[10] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Comparative analysis
of long DNA sequences by per element information content using different contexts. BMC
Bioinformatics, to appear, 2007.
[11] S. Grumbach and F. Tahi. Compression of DNA sequences. DCC, pages 340–350, 1993.
[12] S. Grumbach and F. Tahi. A new challenge for compression algorithms: Genetic sequences.
Inf. Process. Manage., 30(6):875–886, 1994.
[13] A. Hategan and I. Tabus. Protein is compressible. NORSIG, pages 192–195, 2004.
[14] G. Korodi and I. Tabus. An efficient normalized maximum likelihood algorithm for DNA
sequence compression. ACM Trans. Inf. Syst., 23(1):3–34, 2005.
This paper introduces a novel algorithm for biological sequence compression that
makes use of both statistical properties and repetition within sequences. A panel of
experts is maintained to estimate the probability distribution of the next symbol in
the sequence to be encoded. Expert probabilities are combined to obtain the final
distribution. The resulting information sequence provides insight for further study of
the biological sequence. Each symbol is then encoded by arithmetic coding.
Experiments show that our algorithm outperforms existing compressors on typical DNA
and protein sequence datasets while maintaining a practical running time.
1. Introduction
Modelling DNA and protein sequences is an important step in understanding biology.
Deoxyribonucleic acid (DNA) contains genetic instructions for an organism.
A DNA sequence is composed of nucleotides of four types: adenine (abbreviated A),
cytosine (C), guanine (G) and thymine (T). In its double-helix form, two complementary
strands are held together by hydrogen bonds pairing A with T and C with G. The
reverse complement of a DNA sequence is also considered when comparing DNA
sequences. Certain regions in a DNA sequence are translated to proteins, which control
the development of organisms. The alphabet of protein sequences consists of 20
amino acids, each of which is determined by a triplet of nucleotides called a codon.
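For readers less familiar with the biology, the reverse complement mentioned above can be computed as in this short sketch (the function name is an illustrative choice): each base is replaced by its pairing partner and the strand is read in the opposite direction.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base (A<->T, C<->G) and reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, so the relation between a sequence and its reverse complement is symmetric.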
The amount of DNA sequenced from organisms is increasing rapidly. Compression
of biological sequences is useful, not primarily for managing the genome database,
but for modelling and learning about sequences. Work by Stern et al. [21] recognizes
the importance of mutual compressibility for discovering patterns of interest from
genomes. Chen et al. [6] and Powell et al. [19] show that compressibility is a good
measurement of relatedness between sequences and can be effectively used in sequence
alignment and evolutionary tree construction.
Since DNA is the “instruction of life”, it is expected that DNA sequences are not
random and should be compressible. Some DNA sequences are highly repetitive. It is
estimated that 55% of the human genome is repeat DNA. A repeat subsequence is a
copy of a previous subsequence in the genome, either forward or reverse complement.
Most DNA repeats are not exact as nucleotides can be changed, inserted or deleted.
As an example, members of the Alu family are repeats about 300 bases long, and any one
is only about 87% similar to a consensus sequence.
Interestingly, most general purpose text compression algorithms fail to compress
DNA to below the naive 2 bits per symbol. That is because DNA regularities are
different from those in text and are rarely modelled by those compressors. A number
of special purpose compression algorithms for DNA have been developed recently.
Most of these search for repeat subsequences and encode them by reference to a
previous instance. As a DNA subsequence could be (approximately) repeated many
times, using information from many of those repeat positions is expected to give
better compression ratios.
In this paper, we present the expert model (XM) and an algorithm for biological
sequence compression. Our compressor encodes each symbol by estimating the
probability based on information obtained from previous symbols. If the symbol is
part of a repeat, the information from one or more previous occurrences is used.
Once the symbol’s probability distribution is determined, it is encoded by a primary
compression algorithm such as arithmetic coding.
This paper is organized as follows. Section 2 reviews current research on biological
compression. Our expert model is described in section 3 and experimental results are
presented in section 4. Finally, section 5 concludes our work.
2. Background
Most compression algorithms fall into one of two categories, namely substitutional
compression and statistical compression. Those in the former class replace a long
repeated subsequence by a pointer to an earlier instance of the subsequence or to
an entry in a dictionary. Examples of this category are the popular Lempel-Ziv
compression algorithms [25, 26] and their variants. As DNA sequences are known to
be highly repetitive, a substitutional scheme is a natural approach to take. Indeed,
most DNA compressors to date are in this category.
On the other hand, a statistical compression encoder such as prediction by partial
match (PPM) [8] predicts the probability distribution of each symbol. Statistical
compression algorithms depend on assumptions about how the sequence is generated
to calculate the distribution. These assumptions are said to be the model of the
sequence. If the model gives a high probability to the actual value of the next symbol,
good compression is obtained. A model that produces good compression makes good
predictions and is a good description of the data.
The earliest special purpose DNA compression algorithm found in the literature is
BioCompress developed by Grumbach and Tahi [11]. BioCompress detects an exact
repeat in DNA using an automaton, and uses Fibonacci coding to encode the length
and position of its previous location. If a subsequence is not a repeat, it is encoded
by the naive 2 bits per symbol technique. The improved version, BioCompress-2
[12] uses a Markov model of order 2 to encode non-repeat regions. The Cfact DNA
compressor developed by Rivals et al. [20] also searches for the longest exact repeats
but is a two-pass algorithm. It builds the suffix tree of the sequence in the first
pass, and does the actual encoding in the second pass. Regions not repeated are also
encoded by 2 bits per symbol. The Off-line approach by Apostolico and Lonardi
[3] iteratively selects repeated substrings for which encoding would gain maximum
compression.
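Fibonacci coding, which BioCompress uses for repeat lengths and positions, is a universal code for positive integers. The sketch below shows the standard construction; the function name is hypothetical and the way BioCompress maps lengths and positions onto integers is not shown.

```python
def fibonacci_code(n: int) -> str:
    """Fibonacci code word of a positive integer.

    The integer is written as a sum of non-consecutive Fibonacci numbers
    (1, 2, 3, 5, 8, ...), one bit per number from smallest to largest,
    followed by a final 1. The resulting "11" marks the end of the word.
    """
    if n < 1:
        raise ValueError("Fibonacci coding is defined for n >= 1")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()  # drop the first Fibonacci number larger than n
    bits = ["0"] * len(fibs)
    for idx in range(len(fibs) - 1, -1, -1):  # greedy, largest first
        if fibs[idx] <= n:
            bits[idx] = "1"
            n -= fibs[idx]
    return "".join(bits) + "1"
```

Small integers receive short code words, which suits repeat lengths and offsets whose sizes are not known in advance.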
A similar substitution approach is used in GenCompress by Chen et al. [6] except
that approximate repeats are exploited. An inexact repeat subsequence is encoded by
a pair of integers, as for BioCompress-2, and a list of edit operations for mutations,
insertions and deletions. Since almost all repeats in DNA are approximate, GenCompress
obtains better compression ratios than BioCompress-2 and Cfact. The same
compression technique is used in the DNACompress algorithm by Chen et al. [7],
which finds significant inexact repeats in one pass and encodes these in another pass.
Most other compression algorithms employ similar techniques to GenCompress to
encode approximate repeats. They differ only in the encoding of non-repeat regions
and in detecting repeats. The CTW+LZ algorithm developed by Matsumoto et al.
[16] encodes significantly long repeats by the substitution method, and encodes short
repeats and non repeat areas by context tree weighting [23]. At the cost of time
complexity, DNAPack Behzadi and Fessant [4] employs a dynamic programming
approach to find repeats. Non-repeat regions are encoded by the best choice from an
order 2 Markov model, context tree weighting, and naive 2 bits per symbol methods.
Several DNA compression algorithms combine substitution and statistical styles.
An inexact repeat is encoded using (i) a pointer to a previous occurrence and (ii) the
probabilities of symbols being copied, changed, inserted or deleted. In the MNL
algorithm by Tabus et al. [22] and its improvement, GeMNL by Korodi and Tabus
[14], the DNA sequence is split into fixed size blocks. To encode a block, the algorithm
searches the history for a regressor, which is a subsequence having the minimum
Hamming distance from the current block, and represents it by a pointer to the
match as well as a bit mask for the differences between the block and the regressor.
The bit mask is encoded using a probability distribution estimated by the normalized
maximum likelihood of similarity between the regressor and the block.
Probably the only two pure statistical DNA compressors published so far are CDNA
by Loewenstern and Yianilos [15] and ARM by Allison et al. [2]. In the former algorithm,
the probability distribution of each symbol is obtained by approximate partial
matches from history. Each approximate match is with a previous subsequence having
a small Hamming distance to the context preceding the symbol to be encoded. Predictions
are combined using a set of weights, which are learnt adaptively. The latter
ARM algorithm forms the probability of a subsequence by summing the probabilities
over all explanations of how the subsequence is generated. Both these approaches
yield significantly better compression ratios than those in the substitutional class and
can also produce information content sequences. CDNA has many parameters which
do not have biological interpretations. Both are very computationally intensive.
The expert model presented in this paper is a statistical algorithm. The encoder
maintains a panel of experts and combines them for prediction but a much simpler
and computationally cheaper mechanism is used than in those above. The framework
allows any kind of expert to be used, though we report here only experts obtained from
statistics and repetitivenes of sequences. Weights for expert combination are based
on expert performance. Our compressor is found to be superior to any compression
algorithms to date and its speed is practical. The algorithm is capable of biological
knowledge discovery based on per element information content sequences [10]. This
is a purpose of our compressibility research.
3. Algorithm description
As a statistical method, our XM algorithm compresses each symbol by forming
the probability distribution for the symbol and then using a primary compression
scheme to code it. The probability distribution at a position is based on symbols
seen previously. Correspondingly, the decoder, also having seen all previous decoded
symbols, is able to compute the identical probability distribution and can recover the
symbol at the position.
In order to form the probability distribution of a symbol, the algorithm maintains a
set of experts, whose predictions of the symbol are combined into a single probability
distribution. An expert is any entity that can provide a probability distribution at a
position. Expert opinions about a symbol are blended to give a combined prediction
for the symbol.
The statistics of symbols may change over the sequence. One expert may perform
well on some region, but could give bad advice on others. A symbol is likely to
have similar statistical properties to the context surrounding, particularly the context
preceding the symbol. The reliability of an expert is evaluated from its recent
predictions. A reliable expert has high weight for combination while an unreliable
one has little influence on the final prediction or may be ignored.
3.1. Type of experts
An expert can be anything that provides a reasonably good probability distribution
for a position in the sequence. A simple expert can be a Markov model (Markov
expert). An order-k Markov expert gives the probability of a symbol in a position
given k preceding symbols. Initially, the Markov expert does not have any prior
knowledge of the sequence and thus gives a uniform distribution to a symbol. The
probability distribution adapts as the encoding proceeds. Essentially, the Markov
expert provides the background statistical distribution of symbols over the sequence.
Here we use an order-2 Markov expert for DNA, and order-1 for protein.
Different areas of a DNA sequence may have differing functions and thus may have
different symbol distributions. Another type of expert is the context Markov expert,
whose probability distribution is not based on the entire history of the sequence but
on a limited preceding context. In other words, the context Markov expert bases its
prediction on the local statistics. The context Markov expert currently used by XM
is order-1 with a context of 512 previous symbols.
The compressibility of biological sequences comes mainly from repeated subsequences.
Therefore, it is important to include experts that make use of this feature.
XM employs a copy expert that considers the next symbol to be part of a copied
region from a particular offset. A copy expert with offset f suggests that the symbol
at position i is likely to be the same as the symbol at position i − f.
A copy expert does not blindly give a high probability to its suggested symbol. It
uses an adaptive code [5], over some recent history, for correct/incorrect predictions.
The copy expert gives a probability to its predicted symbol of:
p =
r + 1
w + 2
(1)
where w is the window size over which the expert reviews its performance and r is
the number of correct predictions the expert has made. The remaining probability,
1 − p, is distributed evenly to the other symbols in the alphabet.
For complementary reverse repeats, a similar reverse expert is used. This works
exactly the same as the copy expert, except that it suggests the complementary
symbol to the one from the earlier instance and it proceeds in the reverse direction.
3.2. Proposing experts
At position i of the sequence, there are O(i) possible copy and reverse experts.
This is too many to combine efficiently and anyway most would be ignored. To be
efficient, the algorithm must use at most a small number of copy and reverse experts
at any one time. We currently employ a simple hashing technique to propose likely
experts. Every position is stored in a hash table with the hash key composed of h
symbols preceding the position. If there is an opening for a new expert at any point,
the hash table is consulted.
3.3. Combining expert predictions
The core part of our XM algorithm is the evaluation and combination of expert
predictions. Suppose a panel of experts E is available to the encoder. Expert _k gives
the prediction P(xn+1_k, x1..n) of symbol xn+1 based on its observations of preceding
n symbols. A sensible way to combine experts’ predictions is based on Bayesian
averaging:
P(xn+1x1..n) =Xk2E
P(xn+1_k, x1..n)w_k,n
=Xk2E
P(xn+1_k, x1..n)P(_kx1..n)
(2)
In other words, the weight w_k,n of expert _k for encoding xn+1 is the posterior
probability P(_kx1..n) of _k after encoding n symbols. w_k,n can be estimated by
Bayes’s theorem:
w_k,n = P(_kx1..n)
= Qn
i=1 P(xi_k, x1..i−1)P(_k)
Qn
i=1 P(xix1..i−1)
(3)
If we assume that every expert has the same prior probability P(_k) then normalizing
equation 3 by a common factor M we have:
w_k,n =
1
M
n Yi=1
P(xi_k, x1..i−1) (4)
The normalization factor M, in fact does not matter as equation 2 could be again
normalized to have PP(xn+1x1..n) = 1. Take the negative log of equation 4 and
ignore the constant term:
−log2(w_k,n) _ −
n Xi=1
log2P(xi_k, x1..i−1) (5)
Since −log2P(xi_k, x1..i−1) is the cost of encoding symbol xi by expert _k, the right
hand side of equation 5 is the length of encoding of subsequence x1..n by expert _k.
As we want to evaluate experts on a recent history of size w, only the message length
of encoding symbols xn−w+1..n is used to determine weights of experts. We find that,
the algorithm works best when negative log 2 of the expert weight varies as three
times the average code length over a window of size w = 20:
−log2(w_k,n) _ −
3
w
n X i=n−w+1
log2P(xi_k, x1..i−1)
= 3AveMsgLen(xn−w+1..n_k)
(6)
or
w_k,n / 2−3AveMsgLen(xn−w+1..n_k) (7)
Suppose there are three hypotheses about how a symbol is generated: by the
distribution of the species genome; by the distribution of the current subsequence;
or by repeating from an earlier subsequence. We therefore entertain three experts
for these hypotheses: (i) a Markov expert for the species genome distribution, (ii) a
context Markov expert for the local distribution, and (iii) a repeat expert, which is
the combination of any available copy and reverse experts, for the third hypothesis.
The experts’ predictions are blended as in equations 2 and 7.
If a symbol is part of a significant repeat, the copy or reverse expert of that repeat
must predict significantly better than a general prediction such as that from the
Markov expert. We therefore define a listen threshold, T, to determine the reliability
of a copy or reverse expert. A copy or reverse expert is considered reliable if its
average code word length is smaller than Cmk −T bits where Cmk is the average code
word of the Markov expert. T is a parameter of the algorithm.
The algorithm can be used as an entropy estimator or a compressor for biological
sequences. The information content of every single symbol is estimated by the negative
log of its probability. To compress the sequence, we use arithmetic coding [24]
to code each symbol based on the probability distribution combined from experts.
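As a small illustration of the entropy-estimator use, the following Python sketch converts the probabilities that a model assigned to the symbols actually observed into per-symbol information content and an average in bits per symbol. The function names are hypothetical, and the arithmetic coder itself is not shown.

```python
import math


def information_content(probabilities):
    """Per-symbol information content, in bits: -log2 of each probability."""
    return [-math.log2(p) for p in probabilities]


def bits_per_symbol(probabilities):
    """Average information content; close to the size an arithmetic coder
    would achieve, apart from rounding and padding overhead."""
    costs = information_content(probabilities)
    return sum(costs) / len(costs)
```

A symbol predicted with probability 0.5 costs 1 bit, and one predicted with probability 0.25 costs 2 bits.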
4. Experimental results
We implemented the encoder and decoder of XM in Java and ran experiments on
a workstation with a 2.4 GHz Pentium IV CPU and 1 GB of RAM, using the Sun Java
run-time environment 1.5. The compression results are calculated from the size of real
encoded files. Note that the figures for actual compression and information content
are similar up to four decimal places. The subtle difference between the information
content computation and the actual compression is due to rounding in arithmetic
coding and padding the last byte of the encoded files.
For comparison, we applied our algorithm to a standard dataset of DNA sequences
that has been used in most other DNA compression publications. The dataset
contains 11 sequences, including two chloroplast genomes (CHMPXX and CHNTXX),
five human genes (HUMDYSTROP, HUMGHCSA, HUMHBB, HUMHDABCD
and HUMHPRTB), two mitochondrial genomes (MPOMTCG and MTPACG)
and the genomes of two viruses (HEHCMVCG and VACCG). For DNA compression, we
use a hash key of length 11 and a listen threshold of 0.5 bits.
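This section does not spell out how the hash key is used. One plausible reading, sketched below in Python as an assumption rather than as the XM implementation, is that the last 11 symbols are hashed to find earlier positions that followed the same context; each such position is a candidate starting point for a copy expert. The function name `propose_copy_starts` is hypothetical, and reverse experts are not covered.

```python
from collections import defaultdict


def propose_copy_starts(sequence, key_length=11):
    """Map each position i to the earlier positions j whose preceding
    key_length symbols equal the key_length symbols preceding i.

    A copy expert started at j would predict sequence[j] as the symbol
    at position i.
    """
    index = defaultdict(list)   # key string -> positions that followed it
    proposals = {}
    for i in range(key_length, len(sequence) + 1):
        key = sequence[i - key_length:i]
        if key in index:
            proposals[i] = list(index[key])
        index[key].append(i)
    return proposals
```

A longer key yields fewer, more specific candidates, while a shorter key proposes more candidates that must then be filtered by the listen threshold.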

Table 1. Comparison of DNA compression.
Table 1 compares the compression results, in bits per symbol (bps), of XM with those
of other DNA compressors on the dataset. Due to space limitations, we present here
only the most efficient algorithms, including BioCompress-2 (BioC) [12], GenCompress
(GenC) [6], DNACompress (DNAC) [7], DNAPack (DNAP) [4], CDNA [15] and
GeMNL [14]. Comparisons with other DNA compressors can be found on the website:
ftp://ftp.infotech.monash.edu.au/software/DNAcompress-XM/
The results of CDNA are reported for only 9 sequences, to a precision of two decimal
places. The GeMNL results are also reported without the sequence HUMHBB and to two
decimal places, but we were able to obtain higher precision by downloading the encoded
files from the author’s website. We include the average compression results of each
algorithm in the last row.
XM outperforms all other algorithms on most sequences from the standard dataset.
Its average compression ratio is also significantly better. For CDNA and GeMNL,
results are missing for several sequences, so we are unable to compute
the same average. Instead, we compare averages over only the sequences for which
results are available. The average compression ratio over the nine sequences reported
for CDNA is 1.6911 bps, while XM’s average on the same set is 1.6815 bps. On the ten
sequences excluding HUMHBB, GeMNL averages 1.6980 bps, compared to XM’s 1.6883 bps.
The total time for XM to encode these 11 sequences is about 8 seconds. Decoding time
is similar, since the encoder and decoder perform essentially the same computation.

Figure 1. Information content of the HUMHBB sequence.
As a statistical compressor, the expert model is able to produce the per-symbol
information content of a DNA or protein sequence. This is important when we want to analyze
areas of interest [21, 9, 10]. For example, Figure 1 shows a graph of information content
along the HUMHBB sequence. The data in the graph are smoothed with a window
size of 300 for viewing purposes. One can notice spikes in the graph corresponding
to areas of repeats in the sequence.
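The smoothing step can be sketched as a simple moving average in Python. The text gives only the window size, so the choice of a plain unweighted mean over each full window is an assumption, and the function name `smooth` is hypothetical.

```python
def smooth(values, window=300):
    """Moving average over every full window of the given size.

    Returns len(values) - window + 1 averages, or an empty list if the
    input is shorter than the window.
    """
    if window <= 0 or len(values) < window:
        return []
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window one step: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out
```

Applied to a per-symbol information content sequence, this yields a curve in which long runs of unusually cheap or expensive symbols stand out from the background level.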
The alphabet for proteins consists of 20 symbols, and thus the baseline of protein
entropy is $\log_2 20 \approx 4.322$ bps. As with DNA, most general-purpose compressors
fail to compress to less than that baseline. Nevill-Manning and Witten [18] designed
CP, a protein-oriented compression algorithm based on PPM. However, the compression
ratios obtained by CP are only marginally better than the baseline entropy. Several
other attempts, such as ProtComp [13], LZ-CTW [16] and BW [1], show that protein
sequences are indeed compressible, with better compression ratios.
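The baseline figure is the cost of coding each symbol with a uniform distribution over the alphabet. A short Python check, with a hypothetical helper name, reproduces it for the 20-symbol protein alphabet and the 4-symbol DNA alphabet:

```python
import math


def baseline_bits_per_symbol(alphabet_size):
    """Bits per symbol when every symbol is assumed equally likely."""
    return math.log2(alphabet_size)


protein_baseline = baseline_bits_per_symbol(20)  # about 4.322 bps
dna_baseline = baseline_bits_per_symbol(4)       # exactly 2.0 bps
```

A compressor is only exploiting structure in the sequence if its average falls below this figure.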
We experimented with compressing protein using XM on a protein corpus gathered
by Nevill-Manning and Witten [18], which consists of the proteomes of four species:
Haemophilus influenzae (HI), Saccharomyces cerevisiae (SC), Methanococcus jannaschii (MJ)
and Homo sapiens (HS). As an amino acid is coded by three nucleotides, we use a shorter
hash key for protein, of length 6. The listen threshold is raised to 1.0 bit, as the upper
bound on the entropy of protein is 4.322 bps instead of 2.0 bps for DNA.

Table 2. Comparison of protein compression.
Table 2 shows the compression ratios of CP, ProtComp, LZ-CTW and XM on the four protein sequences.
Note that an incorrect protein corpus that was more compressible was made available
at some point, resulting in significantly lower compression ratios being reported for
ProtComp [13] and BW [1]. We obtained the compression results of ProtComp on
the correct protein corpus from the author’s website but were unable to do so for
BW, as the authors have moved to new projects [17]. We found that our algorithm is
able to compress proteins better than CP and LZ-CTW, and marginally better than
ProtComp, for all sequences in the corpus.
5. Conclusion
We have presented the expert model, XM, which is simple and based on biological
principles. The associated compression algorithm is efficient and effective for both
DNA and protein sequence compression. The algorithm utilizes approximate repeats
and statistical properties of the biological sequence for compression. As a statistical
compression method, XM is able to compute the information content of every symbol
in a sequence, which is useful in knowledge discovery [21, 9, 10]. Our algorithm
is shown to outperform all published DNA and protein compressors to date while
maintaining a practical running time.
References
[1] D. Adjeroh and F. Nan. On compressibility of protein sequences. DCC, pages 422–434, 2006.
[2] L. Allison, T. Edgoose, and T. I. Dix. Compression of strings with approximate repeats. ISMB,
pages 8–16, 1998.
[3] A. Apostolico and S. Lonardi. Compression of biological sequences by greedy off-line textual
substitution. DCC, pages 143–152, 2000.
[4] B. Behzadi and F. L. Fessant. DNA compression challenge revisited: A dynamic programming
[5] D. M. Boulton and C. S. Wallace. The information content of a multistate distribution. Theoretical
Biology, 23(2):269–278, 1969.
[6] X. Chen, S. Kwong, and M. Li. A compression algorithm for DNA sequences and its applications
in genome comparison. RECOMB, page 107, 2000.
[7] X. Chen, M. Li, B. Ma, and J. Tromp. DNACompress: Fast and effective DNA sequence
compression. Bioinformatics, 18(12):1696–1698, Dec 2002.
[8] J. G. Cleary and I. H. Witten. Data compression using adaptive coding and partial string
matching. IEEE Trans. Comm., COM-32(4):396–402, April 1984.
[9] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Exploring long DNA
sequences by information content. Probabilistic Modeling and Machine Learning in Structural
and Systems Biology, Workshop Proc, pages 97–102, 2006.
[10] T. I. Dix, D. R. Powell, L. Allison, S. Jaeger, J. Bernal, and L. Stern. Comparative analysis
of long DNA sequences by per element information content using different contexts. BMC
Bioinformatics, to appear, 2007.
[11] S. Grumbach and F. Tahi. Compression of DNA sequences. DCC, pages 340–350, 1993.
[12] S. Grumbach and F. Tahi. A new challenge for compression algorithms: Genetic sequences.
Inf. Process. Manage., 30(6):875–886, 1994.
[13] A. Hategan and I. Tabus. Protein is compressible. NORSIG, pages 192–195, 2004.
[14] G. Korodi and I. Tabus. An efficient normalized maximum likelihood algorithm for DNA
sequence compression. ACM Trans. Inf. Syst., 23(1):3–34, 2005.