short texts and articles by justin

Scalability of Webcrawlers Designed in SML (Standard ML)

By Jeff Chiang and Justin Tung

Storage of Data

To consider scalable webcrawlers that handle arbitrarily large data, the abstract data types to look at should be ones with low worst-case running times for functions such as lookup and insert. Of the data structures learned so far, red-black balanced trees provide the best worst-case big-O running times for lookup and insert, both O(log n). The advantage of O(log n) running times for these key functions becomes apparent when n reaches the billions, as is the case when an indexer is storing and wading through all the data on webpages. New webpages are easily inserted into the existing scheme, and old ones can be found and updated easily. A possible implementation might be to count the length of the URL and store data based on that number. Another way could be to sort data into subtrees based on other ways of dividing the information on the web (e.g. file extensions of URLs, general topics). Storage of each page can be done via a record whose fields contain specific information about page title, language, and format (from URLs and potential file links), as well as information about the content on the page. This content, like the index of webpages, should be stored in the same format to provide optimal lookup times for words, phrases, etc.
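The record-plus-index idea above can be sketched as follows. This is only an illustration in Python, not the SML implementation the article has in mind: the field names and the UrlIndex class are assumptions, and a sorted list searched with the standard bisect module stands in for a red-black tree. Lookup is O(log n) as in the tree, but insert into a Python list is O(n), which a real balanced tree would avoid.

```python
import bisect
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    """Hypothetical record for one crawled page (field names are assumptions)."""
    url: str
    title: str = ""
    language: str = ""
    file_format: str = ""
    words: list = field(default_factory=list)


class UrlIndex:
    """Sorted index keyed by URL.

    Stand-in for a red-black tree: lookup is O(log n) via binary search,
    but insert is O(n) here because a Python list must shift elements.
    """

    def __init__(self):
        self._urls = []      # sorted list of URL keys
        self._records = []   # records, kept in the same order as _urls

    def insert(self, record):
        i = bisect.bisect_left(self._urls, record.url)
        if i < len(self._urls) and self._urls[i] == record.url:
            self._records[i] = record          # update an existing page
        else:
            self._urls.insert(i, record.url)   # add a new page
            self._records.insert(i, record)

    def lookup(self, url):
        i = bisect.bisect_left(self._urls, url)
        if i < len(self._urls) and self._urls[i] == url:
            return self._records[i]
        return None


index = UrlIndex()
index.insert(PageRecord("http://example.org/b.html", title="B", language="en"))
index.insert(PageRecord("http://example.org/a.html", title="A", language="en"))
index.insert(PageRecord("http://example.org/b.html", title="B (updated)", language="en"))
print(index.lookup("http://example.org/b.html").title)
```

Inserting the same URL twice updates the stored record rather than adding a duplicate, which matches the "found easily and updated" behaviour described above.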

Content and page title are probably the most important elements in a search on the web and should be displayed along with the URL in a results page. However, advanced searches going beyond simple boolean expressions require more complex analysis of data in the fields. Current search engines support phrase finding, negative queries, language-specific queries, and searches for similar pages or pages that link to a certain page. Additional support includes searching certain directories (images, newsgroups, audio/mp3s, downloads, etc.), dated page queries, and even reviewing the most recent queries (probably to save resources by bringing up information already searched for previously).

How to implement these 'advanced searches'?

Using the data structure setup discussed above, advanced searches really are based on good algorithms that go through the data stored in an index and pull out the important information. Phrase finding will require the search engine to parse text into groups of words and compare them. Once a page meets the phrase requirement, the URL and its corresponding information should be sent back. Negative queries work similarly to 'positive' queries, and the implementation should be a negation of the one used for regular queries. Similar pages can be searched for by keywords in page titles, text, and other data stored on the page. Finding links to a certain page should be easy using the tree setup, since a find on all URLs that have a common link can be done and the URLs returned. Subtopics and more specific areas of the web can be implemented by dividing data into subtrees during the insert into the web index, as mentioned above. Searches will then be limited to those trees when needed, thus decreasing running times by limiting which topics/data to search. Storing recent or most popular queries can be done by keeping a set of URLs in a database (balanced tree), either for reference or for near-future use. Storing the results of queries helps cut down the number of queries (many of which could be the same) that a search engine must process.
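As a rough illustration of phrase finding and negative queries, the sketch below builds a small positional index (word to page to word positions) in Python. The function names, the sample pages, and the use of plain dictionaries instead of balanced trees are all assumptions made for the example; it only shows the logic of the two query types, not a scalable implementation.

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to {url: [positions]} for a dict of url -> text."""
    index = defaultdict(dict)
    for url, text in pages.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Return URLs where the words of `phrase` occur consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return set()
    hits = set()
    for url, starts in index[words[0]].items():
        for start in starts:
            # every later word must sit exactly k positions after the first
            if all(start + k in index[w].get(url, ())
                   for k, w in enumerate(words[1:], 1)):
                hits.add(url)
                break
    return hits


def negative_search(index, include, exclude):
    """URLs containing the word `include` but not the word `exclude`."""
    return set(index.get(include, {})) - set(index.get(exclude, {}))


pages = {
    "u1": "red black trees are balanced trees",
    "u2": "black red trees",
    "u3": "balanced search with red black trees",
}
idx = build_index(pages)
print(sorted(phrase_search(idx, "red black trees")))
print(sorted(negative_search(idx, "trees", "balanced")))
```

The negative query is literally the positive query's result set with the excluded pages subtracted, which is the "negation" described above.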

Analysis of Running Times and Scalability considerations to Web Search Algorithms

Since all the webpages are stored in balanced trees and, within those trees, the nodes store content data in balanced trees, a search for a specific page (i.e. URL) can be done in O(log n) time. The good thing about balanced trees within balanced trees is that any other type of search that accesses fields in the nodes will also take at most O(log n) time. Since the amount of data on the internet is so large, worst-case running times provide a good estimate, because lower-order terms such as constants (accessing data) do not really affect running time. Given the terabytes of data to process, the frequency of crawls, and the general expectations of speed on the internet, algorithms with the smallest possible big-O are the most desirable. Fast algorithms will be able to serve the large number of people using the search engine every day in good time.
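To put rough numbers on the O(log n) claim: a red-black tree holding n keys has height at most 2 * log2(n + 1), a standard bound for this structure. The short Python sketch below (the function name is made up for this example) evaluates that bound for a few index sizes.

```python
import math


def max_red_black_height(n):
    """Upper bound on the height of a red-black tree with n keys: 2*log2(n+1)."""
    return math.floor(2 * math.log2(n + 1))


for n in (1_000, 1_000_000, 1_000_000_000):
    print(n, max_red_black_height(n))
```

Even for a billion pages the bound stays under 60 levels, so a lookup touches at most a few dozen nodes, whereas a linear scan could touch all billion entries.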

Caching Web Pages

Although the caching of every examined page may seem like an impossible task, it is indeed what Google does. Google employs a compression technique on each document that trades off speed for file size. Its 3:1 compression ratio allows moderate space-saving while still retaining adequate retrieval speed. Each document is stored with a docID, length, and URL. In addition to this central repository, a document index, lexicon, hit list, and hit list indexes are stored as well. The document index stores in each entry the current document status, a pointer into the repository, a document checksum, and various statistics. The hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. All of this information is then used to create the hit list indexes, which record the docID of a document that contains the words in the hit list. Even the queries themselves are cached, resulting in faster search times should the same query be repeated. The result of these data structures blended together is a fast and accurate return of the pages requested by users.

Frequency of Web Crawls and Changing Web Pages

In order to keep track of which pages have or have not changed, a record of the time of last access of every page needs to be maintained. Any further attempt to access the same page in less than the specified period of time is rejected and a "local copy up to date" condition is signaled. If a page has been accessed previously, the HTTP HEAD access method is used to determine the last modified date of the current remote version. If this is unchanged from the last modified date of the current local copy, then no further network traffic ensues and a "local copy up to date" condition is signaled. If the remote version has changed, then an HTTP GET access gets the new copy.
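The recrawl decision described above can be sketched as a small function. This is a minimal Python illustration under stated assumptions: the function name, its parameters, and the return labels are invented here, and fetch_remote_last_modified is a hypothetical callable that stands in for a real HTTP HEAD request so that the example needs no network access.

```python
from datetime import datetime, timedelta


def recrawl_action(now, last_access, min_interval,
                   local_last_modified, fetch_remote_last_modified):
    """Decide what to do with a previously crawled page.

    fetch_remote_last_modified is a hypothetical callable standing in for
    an HTTP HEAD request; it returns the remote Last-Modified datetime.
    """
    if now - last_access < min_interval:
        return "up-to-date"    # too soon: reject without any network traffic
    remote = fetch_remote_last_modified()
    if remote == local_last_modified:
        return "up-to-date"    # HEAD shows no change, so skip the GET
    return "GET"               # remote copy changed, fetch it again


now = datetime(2001, 5, 1, 12, 0)
stamp = datetime(2001, 4, 1)
one_day = timedelta(days=1)

print(recrawl_action(now, now - timedelta(hours=1), one_day, stamp, lambda: stamp))
print(recrawl_action(now, now - timedelta(days=2), one_day, stamp, lambda: stamp))
print(recrawl_action(now, now - timedelta(days=2), one_day, stamp,
                     lambda: datetime(2001, 4, 20)))
```

Passing the HEAD request in as a callable keeps the first check free of network traffic: when the page was accessed too recently, the callable is never invoked.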

A search engine needs to maintain information about the last 10 accesses or attempted accesses to a resource. The stored information includes the date of access, time needed for transfer, amount of data transferred, and the HTTP status code. In the internal database the status code is extended with non-standard codes to indicate various types of communication failure. If successive access attempts fail, the page is assumed to be no longer available. It is marked as such and may still be included in the database, but it is presented to the searching user with a warning. Eventually the page will be eliminated entirely from the database.
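A bounded access history like the one described can be sketched with a fixed-length queue. In this Python illustration the class and field names are assumptions, negative status values stand in for the non-standard communication-failure codes, and the threshold of three successive failures is invented for the example, since the text only says "successive".

```python
from collections import deque
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Access:
    when: datetime
    transfer_seconds: float
    bytes_transferred: int
    status: int   # HTTP status code, or a negative non-standard code for failures


class AccessHistory:
    MAX_FAILURES = 3   # assumed threshold; the text only says "successive"

    def __init__(self):
        self.entries = deque(maxlen=10)   # keeps only the last 10 accesses

    def record(self, access):
        self.entries.append(access)

    def unavailable(self):
        """True when the most recent MAX_FAILURES attempts all failed."""
        recent = list(self.entries)[-self.MAX_FAILURES:]
        return (len(recent) == self.MAX_FAILURES
                and all(a.status < 0 or a.status >= 400 for a in recent))


history = AccessHistory()
for day in range(1, 13):
    history.record(Access(datetime(2001, 5, day), 0.4, 2048, 200))
print(len(history.entries), history.unavailable())

for day in range(13, 16):
    history.record(Access(datetime(2001, 5, day), 0.0, 0, -1))
print(len(history.entries), history.unavailable())
```

A page flagged by unavailable() would be kept in the database with a warning at first, and removed only later, as described above.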

Water Conservation Presentation

By Justin Tung

Educate for the Earth 2000-2001: Presentation Sheet

Information taken from: http://www.epa.gov/ow/kids.html as well as various other net sources. Target audience is elementary school children and above.

Presentation

What is water?

Water comes in 3 forms: ice, liquid water (the type that we usually see in oceans and lakes), and vapor (the form of water in the air). All of them are clear in color.

Where does it come from?

Water comes from sources on the Earth and is located in many places:

Some of these are: oceans, rivers, lakes, the atmosphere, underground wells, and glaciers in places like Antarctica.

Why is it important?

Our bodies are around 55-60% water, and about 70% of the earth's surface is covered by water.
We need water to live and stay healthy since a large percentage of our bodies is water.
Like us, plants and animals also need water to survive. This includes all living things both on land and underwater, and of course water is especially important to sea life like coral reefs, fish, and sea plants.

What is water conservation?

It means saving and recycling our water so that we use as little water as possible.
"Water recycling is a critical element for managing our water resources. Through water conservation and water recycling, we can meet environmental needs and still have sustainable development and a viable economy."
-Felicia Marcus, Regional Administrator Water Division Region IX

Water recycling is reusing treated wastewater for beneficial purposes such as farming, business, and home processes, as well as refilling ground water supplies (ground water recharge). A common type of recycled water is water that has been reclaimed from city wastewater, or sewage.

Through the natural water cycle, the earth has recycled and reused water for millions of years. Water recycling, though, generally refers to projects that use technology to speed up these natural processes.

Numerous water recycling projects aim to improve the quality of recycled water, because at present it is usually not drinkable, though it is still useful for farming and industry.

By providing an additional source of water, water recycling can help us find ways to decrease water taken from sensitive ecosystems. Other benefits include decreasing wastewater discharges and reducing and preventing pollution. Recycled water can also be used to create or enhance wetlands and habitats.

In some cases, the reason for water recycling comes not from a water supply need, but from a need to eliminate or decrease wastewater discharge to the ocean, an estuary, or a stream.

While water recycling is a sustainable approach and can be cost-effective in the long term, the treatment of wastewater for reuse and the installation of distribution systems can be initially expensive compared to such water supply alternatives as imported water or ground water.

As water demands and environmental needs grow, water recycling will play a greater role in our overall water supply. By working together to overcome problems, water recycling, along with water conservation, can help us to conserve and manage our vital water resources to last into the future.

Activities

What do you do already at home that conserves water?

Watershed - Adopt Your Watershed - http://www.epa.gov/adopt/

This program encourages people to save and look after the nation's water resources. Through this effort, the Environmental Protection Agency challenges people in the community to join it and others who are working to protect and restore our valuable rivers, streams, wetlands, lakes, ground water, and estuaries.

What you can do at home

At home you can significantly reduce the amount of wastewater going to home systems and sewage treatment plants by conserving water - less water use means less waste
You can try using low-flow taps, low-flow shower heads, reduced-flow toilet flushing equipment, and water-saving appliances such as dish and clothes washers
Repair leaking water sources in your house
Avoid letting taps run unnecessarily, for example leaving the tap on while brushing your teeth
Wash your car only when necessary; use a bucket to save water or go to a carwash that uses water efficiently and disposes of runoff properly - runoff is a source of waste and pollution
Do not over-water your lawn or garden. Over-watering may increase leaching of fertilizers to ground water
When your lawn or garden needs watering, use slow-watering techniques such as trickle irrigation or soaker hoses. (Such devices reduce runoff and are 20- percent more effective than sprinklers.)

Cleaning water process (figure 1):

Coagulation - Coagulation removes dirt and other small things caught in the water. Chemicals are added to water to form tiny sticky balls that stick to the dirt and sink to the bottom of the water.
Sedimentation - The heavy balls settle to the bottom and the clear water moves to filtration.
Filtration - The water passes through filters, some made of layers of sand, gravel, and charcoal that help remove even smaller pieces of dirt and other things.
Disinfection - A small amount of cleaning liquid is used to kill any bacteria or small living things that may be in the water.
Storage - Water is stored in a large pool in order for disinfection to take place. The water then flows through pipes to homes and businesses in the community.

What's wrong with this picture (figure 2):

Stream erosion - the sides of the stream should be maintained - the sides are weak because people have stripped the sides of plant life which holds the soil together
Oil dumping - leads to pollution of soil and ground water, also affects local wildlife
Car leaking
Over fertilization - can kill plants, harm the soil and create runoff pollution where fertilizers enter the water system
Water waste - we want to conserve water and use as little as possible to keep our environment healthy
Litter

How to Succeed in Technical/General Liberal Arts Courses

by Justin Tung

Introduction

I decided to write this text when I reflected on my most successful courses, grade-wise, in university. I also found that these were the courses in which I gained a large understanding and appreciation of the material and retained much of what was taught. I found I studied for them all in a similar format, and there was a general study method I used unconsciously. I should also add that these particular courses were taught by excellent professors and teaching assistants and often had good supporting reading material.

The following text assumes a lecture-recitation (section) class structure, which is sometimes not the case if different types of classes (like labs, discussion sessions, etc.) are offered. Also, some of the points (maybe all of them) will sound like common sense or seem obvious to most students, but the difficult part is to attain the DISCIPLINE required to carry everything out to the end (i.e. go to all your classes, even the boring ones). If you attain this skill, you will be guaranteed a sense of accomplishment and learning, and the rewards may be unlimited. In hindsight, this so-called discipline is much easier to obtain when:

You like the subject/course content
You enjoy the professor, TAs, teachers, etc.
The presentation and learning styles suit you
You chose the course because you like it

1. Lecture

Read assignments for lecture no more than a week later so that the material is still fresh in your head. This method aids understanding of the material. The closer to lecture you read the material, the better your understanding and learning.
Some classes even encourage reading the material before the lecture that covers it. Be sure to at least attempt this, since the course organizer sometimes presumes knowledge during the lecture which may be important for understanding.

2. Section

Go over homework before a section and be prepared to ask questions, if you have any. Do not forget to pay close attention to the questions of others; often these questions reflect common weaknesses in understanding of the material.
If you do not have the time to go over (or even take a peek at) the homework, or it is not necessary to know it, make sure to take comprehensive notes.
Make sure you have at least a slight idea of the material covered in sections. In some cases, section should be a review of, and an expansion on, material you already know.

3. Essays/Projects/Problem Sets (Homeworks)

If you have questions about homework go to office hours or ask the teacher.
For projects and problem sets, you should look over the entire document first. Reviewing homework before doing it does not have to be a lengthy process and may only take a couple of minutes.
One of the problems with homework is knowing when to talk to a teacher. This timing should be determined by you. It is difficult to judge when the timing may be too early or too late, but being a little early always helps.

For essays, there are numerous guides on the internet about how to write a good essay, so I will not go into that except to say the general chronological order is: think about the problem/question, research, write a thesis/outline and organize arguments/data, set down which points go into which arguments, then write!

4. Exams

Review course material to be covered and LEARN it. I find it helpful to rewrite notes, key ideas, etc. or to write up "exam notes" which are easy to review in the few hours/minutes before the exam.
Do questions/mock essays/practice exams

Conducting searches/research and data mining on the web

How to find information and resources

This section was originally part of my links area until I realized that it was important enough for its own section. The sites below helped me to find most of my links and learn how to make and design webpages. Also, using the Internet has saved me countless hours in searching for and learning information. There is no real path to follow, but I have listed the links below in order of importance to me.

One thing to keep in mind when looking at information on the internet: do not trust or believe everything you read. A large number of common myths spread over the internet (and urban legends) are collected at www.snopes.com; you will have heard of at least a couple of them, depending on your age and experiences. In the print world, most serious information sources go through screening, editors, publishers, and finally critics, while online information can be published by anybody. If you get the information from a reliable website (e.g. a government one), though, you can rest a little more assured.

My process to find information online:

1) Usually I proceed to Google first and find the information right away. However, some information is more easily found using other search engines (e.g. specialized ones). Yahoo, for example, is best for finding financial and map information fast.

2) Of course, finding the resources and search engines is half the effort. The other half is searching smart and knowing when to move on to other pages in the search, follow links on a page, or try another approach. Ideally, you should choose a search that narrows down the number of pages you have to look at to a manageable amount.

This process might involve adding more keywords, specifying advanced search conditions, looking at different resources, etc. At other times, simple is better (e.g. typing in virginia.com to find information about Virginia, USA), as is just doing very general searches to find information quickly.

3) If you are having a hard time finding information because it is relatively specific, try the meta search engines, or try looking up sites in Google that pertain to your general subject area and continue the search within those sites.

4) In case the above steps fail, try accessing the resources below.

Google: One of the best search engines with a fast and efficient searching algorithm.
About.com: Expert guides in many subjects. About.com, like Yahoo, contains information on many things, except that the sites are managed by experts in their field who give their input about a subject and provide a set of links from which to explore. This site has easily moved up my priority list of sites to visit when searching for topical information.
LYCOS: My second choice for a keyword search engine. Sometimes it finds sites Google does not, so I guess it uses a different page index and/or searching algorithm.
YAHOO!: Excellent array of resources. It contains just about everything (current events, finance, mail, maps, etc.) and has many services. It is different from some sites in that information on Yahoo is sorted using library science methods.
usenet: Newsgroups have been a source of interaction between varied internet users and are sometimes good short- or long-term sources of information where a personalized response or interaction is required. You might consider looking through the thousands of newsgroups for the topic that interests you, or asking people there how to obtain information.

Other Resources

Guide to Research using the Internet | Encyclopedia Britannica | Searchopolis [Educational Searches] | Raging Search (AltaVista) | Merriam-Webster Dictionary
