Distinguished Lecturer Series
Kansas State University
Computing and Information Sciences Department
Title: PADS: Processing Arbitrary Data Sources
Speaker: Kathleen Fisher (AT&T Research)
Date: Tuesday, March 7, 2006
Time: 11:00 AM
Location: K-State Union 207
Vast amounts of useful data are stored and processed in ad hoc formats. Traditional databases and XML systems provide rich infrastructure for processing well-behaved data, but are of little help when dealing with ad hoc data. Examples that we face at AT&T include call detail data, web server logs, netflows capturing internet traffic, log files characterizing IP backbone resource utilization, and wire formats for legacy telecommunication billing systems. Such data may simply require processing before it can be loaded into a data management system, or it may be too large or too transient to make such loading cost effective. Typically, data consumers have no control over the format of ad hoc data. Therefore, they must invest significant effort in understanding such a data source and writing a custom parser, a process that is both tedious and error-prone. Often, the hard-won understanding of the data ends up embedded in parsing code, making both sharing the understanding and maintaining the parser difficult. Typically, such parsers are incomplete, failing to specify how to handle situations where the data does not conform to the expected format.
PADS is a declarative data description language that allows data analysts to describe both the physical layout of ad hoc data sources and semantic properties of that data. From such descriptions, the PADS compiler generates libraries and tools for manipulating the data, including parsing routines, statistical profiling tools, translation programs to produce well-behaved formats such as XML or those required for loading relational databases, and tools for running XQueries over raw PADS data sources. The descriptions are concise enough to serve as "living" documentation while flexible enough to describe most of the ASCII, binary, and Cobol formats that we have seen in practice. The generated parsing library provides for robust, application-specific error handling.
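To make the idea of description-driven parsing concrete, here is a minimal Python sketch. It is not PADS syntax and not what the PADS compiler generates. The names (Field, ParseResult, parse_record, CALL_RECORD), the pipe-separated record layout, and the sample phone numbers are all made up for illustration. It only shows the general shape: the format and its semantic constraints are written down once as data, and a generic parser reports errors per field instead of giving up on a bad record.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class Field:
    """One field of a hypothetical record description."""
    name: str
    convert: Callable[[str], Any]                  # physical layout: how to read the token
    check: Optional[Callable[[Any], bool]] = None  # semantic property of the value


@dataclass
class ParseResult:
    values: dict
    errors: list


def parse_record(line: str, fields: list, sep: str = "|") -> ParseResult:
    """Parse one line against a description, collecting errors instead of raising."""
    tokens = line.rstrip("\n").split(sep)
    values, errors = {}, []
    if len(tokens) != len(fields):
        errors.append(f"expected {len(fields)} fields, got {len(tokens)}")
    for spec, token in zip(fields, tokens):
        try:
            value = spec.convert(token)
        except ValueError:
            # Syntactic error: the token does not match the physical layout.
            values[spec.name] = None
            errors.append(f"{spec.name}: cannot parse {token!r}")
            continue
        values[spec.name] = value
        # Semantic error: the token parsed, but the constraint does not hold.
        if spec.check is not None and not spec.check(value):
            errors.append(f"{spec.name}: constraint violated by {value!r}")
    return ParseResult(values, errors)


# Hypothetical description of a call-detail-like record (invented layout).
CALL_RECORD = [
    Field("caller", str, lambda s: s.isdigit() and len(s) == 10),
    Field("callee", str, lambda s: s.isdigit() and len(s) == 10),
    Field("duration_s", int, lambda n: n >= 0),
]

good = parse_record("7855550100|7855550199|42", CALL_RECORD)
bad = parse_record("7855550100|oops|-5", CALL_RECORD)
garbled = parse_record("7855550100|7855550199|n/a", CALL_RECORD)
```

In this sketch the caller decides what to do with a record whose errors list is non-empty (drop it, repair it, or log it), which is roughly the kind of application-specific error handling the abstract mentions.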
This is joint work with Mary Fernandez, Bob Gruber, Yitzhak Mandelbaum, and David Walker.
This was actually a very good research summary and generated quite a bit of food for thought. George Strawn from the NSF, another special invited lecturer who spoke later in the afternoon (during a Graduate Council meeting I had to attend), came to this talk, as did our first faculty candidate of this week. I ought to have made attending one of Fisher's two talks a mandatory class participation item this month for my database class. I have some notes from the talk that I will post to the class mailing list later.
Edit, Sat 11 Mar 2006 - Here's the CIS Seminar List if you aren't familiar with it. I learned later from Greg Monaco in Psychology that a lot of people who are potentially interested in computational models of cognitive processes are not on our seminar list because:
- a) there haven't been that many speakers in the areas of human-computer interaction (HCI), natural language processing (NLP), cognitive science, computational neuroscience, brain theory, and cognitive psychology
- b) most of the "language processing" talks in our department are hardcore programming languages talks (the one last Friday on Example-Based Machine Translation by Violetta Cavalli-Sforza was a rare exception), and are harder for cogsci folks to grok
- c) most Psych people are unaware of our seminar website or our mailing list (and possibly of what our department actually does as research, but that's another story)
From now on, I'll be reposting links to Distinguished Lecturer talks from the seminar site here, just in case.