April 20th, 2004


The Deep Web and Orkut (http://www.orkut.com)




<img src="http://www.kddresearch.org/TEUNC/Banazir/Pics/Other/deep-one-city.jpg" width=600>
<font size="-2"><i>Deep One City</i>, Copyright © 2003 <a href="http://www.flyinglab.com">Flying Lab Software</a>
(see their <a href="http://www.flyinglab.com/deltagreen/exposure.htm">Delta Green</a> artwork and their new <a href="http://en.wikipedia.org/wiki/MMORPG">MMORPG</a>, <a href="http://www.burningsea.com"><i>Pirates of the Burning Sea</i></a>!)</font>

I've been hearing a lot of buzz about the <a href="http://en.wikipedia.org/wiki/Deep_web">Deep Web</a>, <i>aka</i> the "invisible web", or the large part of the publicly accessible pages on the World Wide Web that is not indexed by search engines.

Recently <a href="http://www-faculty.cs.uiuc.edu/~kcchang/">Dr. Kevin Chen-Chuan Chang</a>, a <a href="http://www.uiuc.edu">UIUC</a>-<a href="http://www.cs.uiuc.edu">DCS</a> faculty member, received a <a href="http://www.ncsa.uiuc.edu/Divisions/DirOffice/CampusRelations/FFP/FifthSet.html">2004 Faculty Fellowship</a> from <a href="http://www.ncsa.uiuc.edu">NCSA</a> for his research on the Deep Web.

<lj-cut text="NCSA Access blurb">
The <a href="http://access.ncsa.uiuc.edu">NCSA Access Online</a> press release explains:

<blockquote><i>When the World Wide Web was young, it was relatively easy for people to find information, because everything was posted on the surface. The Web's content was shallow, and users just had to wade in to find what they needed.

Today's Web, however, is deep. Content is often locked in databases that must be queried before they will provide the answers users are looking for. A typical Web search will miss the vast majority of this information, because current Web crawlers cannot effectively query databases.

Kevin Chen-Chuan Chang, a professor in the University of Illinois Computer Science Department, aims to enable access to database information on the Web by creating a metaquery system that will be able to extract information from diverse online databases. He has been pursuing this goal in collaboration with NCSA through the Faculty Fellows Program and presented an overview of his research, "Exploring and Integrating the Deep Web: Building a Database of Databases," at a recent brown bag presentation.

The goal of Chang's research is to reveal the information locked in the deep Web by developing a metaquery system that will allow users to find and query an array of databases. He acknowledges that this is a "significant information integration problem," but says that his research so far has found promising similarities among the myriad query interfaces that are the gatekeepers of database content.</i></blockquote>

( <a href="http://access.ncsa.uiuc.edu/Stories/DeepWeb/">Source: NCSA Access Online</a> )
</lj-cut>
<lj-cut text="Chang's abstract">
<i>Exploring and Integrating the Deep Web: Building a Database of Databases</i>
Kevin Chen-Chuan Chang

<blockquote><i>This research aims at enabling access to structured information sources on the Internet. Over the past few years, the Web has deepened dramatically: a significant and increasing amount of information is hidden on the "deep" Web, behind the query interfaces of searchable databases. Because current crawlers cannot effectively query databases, such data is mostly invisible to traditional search engines, and thus remains largely hidden from users.

As the context of this proposal, we propose to build a metaquery system, to help users in finding and querying these databases uniformly with rich expressive queries. Our goal is twofold: First, to make the deep Web systematically accessible: the MetaExplorer will help users find online databases that are useful for their queries. Second, to make the deep Web uniformly usable: the MetaIntegrator will help users interact with online databases to ask queries. To open up the deep Web, this MetaQuerier faces new challenges: First, it must deal with large scale, since sources are proliferating rapidly online. Second, it must be dynamic and ad-hoc: each query will dynamically select different ad-hoc sources.</i></blockquote>

( <a href="http://www.ncsa.uiuc.edu/Divisions/DirOffice/CampusRelations/FFP/FifthSet.html#Project1">Source: NCSA Archives</a> )
</lj-cut>
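
To make the two-part idea in the abstract a bit more concrete, here is a toy sketch in Python. It is not Chang's system. The names <i>Source</i>, <i>explore</i>, and <i>integrate</i>, the rule that a source qualifies when its query form covers every attribute in the query, and the in-memory "databases" are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Record = Dict[str, str]


@dataclass
class Source:
    """A hypothetical searchable database hidden behind a query interface."""
    name: str
    attributes: Set[str]     # unified attribute names its form can answer (assumed)
    rename: Dict[str, str]   # unified attribute -> this source's own field name
    records: List[Record]    # in-memory stand-in for the real backend

    def query(self, local_query: Record) -> List[Record]:
        # Simulates submitting the form: exact match on every given field.
        return [r for r in self.records
                if all(r.get(k) == v for k, v in local_query.items())]


def explore(sources: List[Source], query: Record) -> List[Source]:
    """'MetaExplorer'-style step: keep sources whose interface covers the query."""
    return [s for s in sources if set(query) <= s.attributes]


def integrate(sources: List[Source], query: Record) -> List[Tuple[str, Record]]:
    """'MetaIntegrator'-style step: translate the query per source, merge answers."""
    results = []
    for s in explore(sources, query):
        # Rewrite unified attribute names into the source's own field names.
        local = {s.rename.get(k, k): v for k, v in query.items()}
        results.extend((s.name, r) for r in s.query(local))
    return results


# Made-up sources for illustration only.
SOURCES = [
    Source("BooksA", {"author", "title"}, {"author": "writer"},
           [{"writer": "Tolkien", "title": "The Hobbit"}]),
    Source("BooksB", {"author"}, {},
           [{"author": "Tolkien", "title": "The Silmarillion"},
            {"author": "Card", "title": "Ender's Game"}]),
    Source("Flights", {"origin", "destination"}, {},
           [{"origin": "CMI", "destination": "ORD"}]),
]

if __name__ == "__main__":
    for name, record in integrate(SOURCES, {"author": "Tolkien"}):
        print(name, record)
```

A real metaquerier would have to discover and match query interfaces automatically, which is the hard part the abstract describes; this sketch hard-codes those mappings.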

What do you all think of mining the deep web?
In particular, what do you think are potential applications (other than, say, targeted marketing and your stereotypical Big Brother scenario) of mining social networks such as weblogs and self-selected meta-communities?


On a tangentially related note, here's <a href="http://www.orkut.com/Profile.aspx?uid=338354856863666719">my account</a> on <a href="http://www.orkut.com">Orkut</a>, a Friendster-like social network operated by Google.

Here are <a href="http://www.orkut.com/ProfileC.aspx?uid=338354856863666719">my Orkut communities</a>. They are listed in the order in which I added them, in case you were wondering.
<lj-cut text="Warning: quijillions of interests">
<li><a href="http://www.orkut.com/Community.aspx?cmm=2665"><b>The Johns Hopkins University (96 members)</b></a>: "The Johns Hopkins University was the first research university in the United States. Founded in 1876, it was a whole new educational enterprise. Its aim was not only to advance students' knowledge, but also to advance human knowledge generally, through discovery and scholarship. The university's emphasis on both learning and research—and on how each complements the other—revolutionized U.S. higher education."
The Truth Shall Set You Free - JHU Motto</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=30"><b>University of Illinois (695 members)</b></a>: For graduates, faculty, students, and friends of the University of Illinois (Go Illini!)</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=5640"><b>Kansas State University (39 members)</b></a>: A community for Kansas State University alumni & current students. Go Wildcats!</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=410"><b>Firefly (596 members)</b></a>: FireFly combines WildWest and Sci-Fi. Fox cancelled it.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=250"><b>Fantasy / Sci-fi Book Club (1486 members)</b></a>: Hi All! This community is for everyone who loves reading fantasy and sci-fi books. Go ahead, admit you love them... :)
Discussions, book swapping, reviews, praise of the inherent cheesiness... you name it!
**Please read**
Before beginning a new topic, please view all the topics to make sure no one has thought of it before... :)</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=8168"><b>Genome - seq. analysis tools (65 members)</b></a>: General discussion about DNA/Prot sequence analysis tools used for similarity searches and profiling like wu/n/xBLAST.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=9047"><b>Genome Annotation (44 members)</b></a>: Discussion on genome annotation</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=961"><b>Tolkien (1419 members)</b></a>: For fans of J.R.R. Tolkien's assorted works, or derivatives thereof.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1931"><b>Star Wars (916 members)</b></a>: Fans of the Star Wars movie series.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1223"><b>eclipse (598 members)</b></a>: For users of eclipse, the premier open-source IDE, with Java as primary programming language.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=4023"><b>A.I. Programming (606 members)</b></a>: Discuss algorithms, techniques and code for artificial intelligence projects.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=24292"><b>Evolutionary Computing (37 members)</b></a>: Genetic Algorithms, Genetic Programming, Artificial Intelligence & Neural Networks, Evolving Hardware.
Discuss them all here! I believe that evolutionary computing is an exceptionally powerful technique, and can be used to solve problems that are too complex to solve algorithmically.
Share your ideas, your programs, your code or your hardware!</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=589"><b>Java (2513 members)</b></a>:
For all those interested in the Java programming language, development tools, programming techniques, the entire Java world, etc.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=15276"><b>Bioinformatics (216 members)</b></a>: Biological questions can be explored through wetlab experimental work - the traditional arena of biologists - or through modeling and simulation in virtual environments, also known as drylab research or computational biology. The latter is more generally the domain of mathematicians and algorithm researchers. Of course wetlab research is used to develop better models to describe our understanding of biology, while drylab research needs to validate its results through wetlab experimentation. Thus wetlab and drylab biology are closely related.
Bioinformatics is about improving the methods and technologies for the management and manipulation of data used by people trying to answer biological questions.
This community is about bioinformatics, rather than computational biology.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=20151"><b>Artificial Intelligence (255 members)</b></a>: Community for Artificial Intelligence and how it can be used and developed.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1486"><b>Visualization (333 members)</b></a>: For everyone interested in data visualization, UML diagrams, Euler and Venn diagrams, etc.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=3929"><b>Roleplaying Games (702 members)</b></a>: Do you role play? If so, come right in. Whether it is online RPG, pen-and-paper, table-top RP or live action ( LARP ) - whatever your flavor, this is a community made for those who enjoy this hobby as a way to spend their time.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=18266"><b>Ecology (75 members)</b></a>: A place for the discussion of ecology. Please note that this group concentrates on ecology the science, not environmentalism or conservation.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1642"><b>Nethack (480 members)</b></a>: A place to discuss Nethack and other Roguelike games. Thrill to the stories of ascensions, or stupid deaths! Marvel at its elegant and simple design!</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=11462"><b>TheOneRing.net (14 members)</b></a>: Community for all those who know each other through TheOneRing.net (TORn); the net's biggest Tolkien website - forged by and for fans of JRRT.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=848"><b>Neuroscience (963 members)</b></a>: A community for anyone who is interested in understanding how the brain works. (e.g., Cognitive Science, Philosophy of Mind).</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=141"><b>Linux (7461 members)</b></a>: For those who use and enjoy Linux.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=300"><b>LiveJournal (2173 members)</b></a>:
<i>LiveJournal! LiveJournal!
Making fun of your friends behind their back
LiveJournal! LiveJournal!
Chronicle your gerbil's heart attack</i>
(The LiveJournal song: http://www.livejournal.com/users/news/50307.html)</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=6006"><b>Star Control II (113 members)</b></a>: The best video game of all time, Star Control II.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=680"><b>Vim (1943 members)</b></a>: Vim is Vi IMproved. Accept no substitute.
"a programmer's text editor"
Main author of Vim: Bram Moolenaar</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=13122"><b>Technical Writing (65 members)</b></a>: Technical writers who want to network with others, discuss differences, help out interns and other newbies, share funny stories...all welcome. If you need to brag or flame, find any of the other tech writer sites or newsgroups.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=17040"><b>Grid Computing (81 members)</b></a>: Grid Computing...and access to the Grid Technologies</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=31967"><b>Computer Graphics & Animation (62 members)</b></a>: Community to talk about computer graphics/design and animation. Movies, short movies, personal movies, techniques, modeling and rendering programs or anything else related.
The image used for the logo was created by Kristijan Petrovic.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=34123"><b>Pattern Recognition Image Ana. (24 members)</b></a>: Pattern Recognition (PR) & Image Analysis (IA) is a community dedicated to these and related topics. PR can simply be defined as computer-based recognition of forms or shapes within an image, or of a pattern in general, such as a sound (sound recognition). Related topics within computer science include signal processing, mathematical morphology, artificial intelligence, classification, image segmentation, etc.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1545"><b>Machine Learning (349 members)</b></a>: For anyone interested in machine learning and probabilistic modelling. Bayesians especially welcome.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=290"><b>Bloggers (2656 members)</b></a>: Blogging weblogs about weblogs about weblogs about blogging about blogging blogging.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=9043"><b>Comparative Genomics (47 members)</b></a>: Discussion on comparative genomics</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=2859"><b>Angband (88 members)</b></a>: Players and developers of the Angband family of roguelike games.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=512"><b>Artificial Thinking (1065 members)</b></a>: This is about intelligence and computers. How can computers be intelligent? How can they learn to be intelligent? What is intelligence? What is consciousness?
Philosophy (Hume, Turing and Dennett) and science (statistical learning and the like) are welcome.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1997"><b>Trillian (302 members)</b></a>: For fans of the Trillian IM client for Windows.
Trillian supports AOL Instant Messenger (AIM), Yahoo Instant Messenger (YIM), MSN Messenger, Jabber, IRC, ICQ.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=2536"><b>UnrealIRCd (22 members)</b></a>: UnrealIRCd people and fans - you know who you are ;)</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=5942"><b>IRC (320 members)</b></a>: Community for users of IRC (Internet Relay Chat). Preferably mIRC users, although not mandatory!</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=1478"><b>Dilbert (914 members)</b></a>: Fans of Dilbert, the world's most productive engineer, unite</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=7895"><b>Marion Zimmer Bradley (41 members)</b></a>: For fans of the author of Mists of Avalon, the Darkover series, and a lot of other cool stuff</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=23723"><b>Orson Scott Card (43 members)</b></a>: For fans of the great SF writer Orson Scott Card and admirers of books like Ender's Game, Speaker for the Dead, The Homecoming Saga and many others.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=22317"><b>Algorithms (223 members)</b></a>: People interested in algorithms-- methods to solve problems-- and the optimization thereof.
If you think P \stackrel{?}{=} NP is an interesting problem, this community is for you.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=9278"><b>Robotic algorithms (140 members)</b></a>: Motion planning, map-making, control, architectures, multi-robot systems, and everything else: the study of robotic algorithms.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=19075"><b>LaTeX (330 members)</b></a>: \begin{comment} Just LaTeX \end{comment}</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=30174"><b>Super longevity/health (38 members)</b></a>: This is a community for people who want to extend the natural biological life span of humans through the intervention of science using techniques such as stem cell research, telomere research, gene therapy, and bio-engineering. It is also a community where you can talk about any kind of science, technology, and health and fitness techniques that may help to make super longevity a potential reality in our lifetime.
This is an offshoot of the Longevity Meme founded by Reason, an excellent web site dedicated to longevity research, grassroots political action campaigns, and essays and articles regarding the latest that science has to offer in super longevity research. http://www.longevitymeme.com/</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=15155"><b>UIUC Computer Science Dept (123 members)</b></a>: The Department of Computer Science at the University of Illinois at Urbana-Champaign is recognized throughout the world as a leader in both education and research. The department and its graduates have long been at the forefront of modern computing beginning with the creation of ILLIAC in 1952, and continuing with the creation of Mosaic, the first graphic web browser, through the most recent Internet and electronic commerce era. (Harris and Zych not allowed!)</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=4461"><b>Anne McCaffrey (51 members)</b></a>: For fans of Anne McCaffrey, Pern, Freedom, and her other worlds!
I invite you all to join #andor scifi/fantasy book chat on irc.xelium.net</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=7445"><b>Anti Spam (310 members)</b></a>: Discussion of novel anti spam techniques.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=18700"><b>Sml (30 members)</b></a>: Here is a community for the developers, admirers, advocates, and devotees of SML/NJ, SML, and other related ML languages.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=9509"><b>Human Computer Interaction (673 members)</b></a>: Where psychology, design, and technology come together to partay!
We employ the practice of boozability (usability) engineering to improve our quality of life =)
Go HCI!</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=163"><b>Functional Programming (410 members)</b></a>: For all those who are interested in the only truly pure way to program.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=2201"><b>Nanotechnology (507 members)</b></a>: Community of people interested in the science, technology, and business of nanotechnology, MEMS, and self-assembling devices in general.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=2173"><b>Foresight (160 members)</b></a>: Molecular nanotechnology and everything else nano. Researchers, entrepreneurs, engineers, students, and newbies are welcome here. Other buzzwords for the search engine: nanotech, nanoscience, nanomedicine, MEMS.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=11157"><b>Disruptive Technologies (298 members)</b></a>: This community is intended for the discussion of new and interesting disruptive technologies... such as nanotechnology, RSS, Blogs, RFIDs, LCD, OLEDs, Open Source Software, bittorrent, biotech, IT outsourcing, etc.</li>
<li><a href="http://www.orkut.com/Community.aspx?cmm=591"><b>C++ (1538 members)</b></a>: For all those interested in the C++ programming language, development tools, programming techniques, history, rationale (God knows C++ needs it), etc.</li>
</lj-cut>

<b>Edit</b>, 04:00 CDT Thu 22 Apr 2004: I added a bunch of communities (from Anne McCaffrey on down) and updated the member counts. This will be the last update in this entry.
<b>Edit</b>, 15:45 CDT Sat 24 Apr 2004: OK, I lied. Can't resist growing those JPEGs, as <lj user=masteralida> puts it. ;-)