Monday, February 01, 2010

Pinocchio's a Real Boy Now

After many years in school, I can finally call myself a "real" linguist: I've been published!

Here's the abstract


Friday, January 15, 2010

LingMA update, 1-15-2010

Per my plan for January, today is the day I'd planned to have several EFR segmentation files ready for use in testing the project software. Here's what I've gathered:

Raw data files (sentence alignment files, XCES format)

XSLT

Segfiles

Tools

Reference

The segmentation files aren't quite in their final form yet, due to some limitations with XSLT transformations. Later today, I'll be posting the gory details of the differences on the Code section of my site, and writing a JavaScript parser to bridge them.


Thursday, January 07, 2010

LingMA update, 1-7-2010

So, as I mentioned a few weeks ago, I reset my deadline for the first cut of my project software to December 31. Recall that this software would:

  1. parse input segmentation file, determine lexical items
  2. invoke subject field detection module
  3. parse input termbases for lexical item matches

Item 1 is a simple parsing task. Easy enough to implement, but only doable once I have input files to parse in the proper format. This is a task I neglected to itemize in my initial project plan, which is unfortunate because it will take me some time. I plan to work on that over the next week, so let's say January 15 as a target date by which I'll have at least 2 or 3 input segmentation files.

Item 2 is now unnecessary, as my committee chair suggested I avoid the need to integrate a subject field detector by having the user specify subject field at processing time. One open question in my mind is how we'll represent the subject field: is it simply free-text, which is much less useful, or is it from a specific ontological system? If the latter, what ontology should we use? This seemingly simple subtask suddenly looks complex again.

Item 3 is the simplest subtask in this first incarnation of the software - all we'll need is a good XML/XPath parser to do the lookup. However, as with the segfiles, writing the parser isn't enough if there's no input data to parse. So after producing some sample segfiles with identified target terms, I'll need to hand-create a TBX file with enough entries to be useful. Setting January 22nd for this deadline is likely over-optimistic — even January 29th is probably pushing it, but that's the date I'm going to set.
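To make that lookup step concrete, here's a minimal sketch in Python (one of the two languages I'm considering). Everything in it is an assumption for illustration: the sample entries are invented, the `lookup` function and its parameters are hypothetical names, and the XML is a hand-made fragment in the general shape of a TBX file rather than a validated one.

```python
import xml.etree.ElementTree as ET

# Hand-made sample in the general shape of a TBX file.
# The entries themselves are invented for illustration only.
SAMPLE_TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="e1">
      <descrip type="subjectField">science fiction</descrip>
      <langSet xml:lang="en"><tig><term>raygun</term></tig></langSet>
      <langSet xml:lang="fr"><tig><term>pistolet laser</term></tig></langSet>
    </termEntry>
    <termEntry id="e2">
      <descrip type="subjectField">romance</descrip>
      <langSet xml:lang="en"><tig><term>suitor</term></tig></langSet>
      <langSet xml:lang="fr"><tig><term>soupirant</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def lookup(root, term, source_lang, target_lang, subject_field=None):
    """Return target-language terms from entries containing `term`.

    If subject_field is given, entries tagged with a different
    subject field are skipped.
    """
    hits = []
    for entry in root.findall(".//termEntry"):
        if subject_field is not None:
            fields = [d.text for d in entry.findall("descrip[@type='subjectField']")]
            if subject_field not in fields:
                continue
        # Collect the terms of this entry, grouped by language code.
        terms_by_lang = {}
        for lang_set in entry.findall("langSet"):
            lang = lang_set.get(XML_LANG)
            terms_by_lang.setdefault(lang, []).extend(
                t.text for t in lang_set.iter("term")
            )
        if term in terms_by_lang.get(source_lang, []):
            hits.extend(terms_by_lang.get(target_lang, []))
    return hits


root = ET.fromstring(SAMPLE_TBX)
print(lookup(root, "raygun", "en", "fr"))
```

The optional `subject_field` argument is just one way the user-specified subject field could narrow the search; whether it ends up as free text or a value from an ontology is still the open question from item 2.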

I just got an email from the BYU Linguistics dept secretary, saying that in order to graduate this semester I'd have to have my final thesis / project writeup delivered to my committee by the first week of February. At this stage in the process, I think it's pretty obvious that I won't be ready to defend by then. Soooooo... Now shooting for Summer term.


Friday, December 18, 2009

LingMA status report for 12-15-2009

It's been too long since my last school-related update, particularly since I've passed several of my self-imposed deadlines. So here goes:

3. Evaluate available methods of determining subject field of a text

Unfortunately, I haven't really found any good, open-source solutions to this problem, so after conferring with my committee chair, we've changed course slightly: now we'll specify a subject field for the text prior to processing it.

4. Based on results of milestones 1, 2, and 3, determine overall program architecture
- which programming language will it be written in?

I'm still not 100% sure about this, but am leaning towards either Visual Studio Express (hint: the Express version is free), or Python.

- which SMT package will we use?

Moses is pretty much the only game in town for what we're doing.

- which subject field detection method will we use?

See above; we'll ask the user instead.

- how will we invoke these three modules (subject detection, termbase lookup, and SMT)?
-- due date: Friday, November 20

Yeah... "oops" on that deadline. I'll be invoking Moses from the command-line, rather than trying to integrate its source code into my own project. The exact mechanics of how I do this depend on which programming language I choose.
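If I do end up in Python, calling an external decoder could look roughly like the sketch below. It assumes the decoder reads one sentence per line on standard input and writes one translation per line on standard output, which is how the Moses decoder is normally run from a shell. The function name `translate_lines` and the model path in the usage comment are hypothetical.

```python
import subprocess


def translate_lines(lines, command):
    """Pipe sentences (one per line) through an external command.

    `command` is the argument list for the decoder process; its
    standard output is returned as a list of lines.
    """
    result = subprocess.run(
        command,
        input="\n".join(lines) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the decoder fails
    )
    return result.stdout.splitlines()


# Hypothetical usage; the path to moses.ini is an assumption:
# translate_lines(["this is a test"], ["moses", "-f", "model/moses.ini"])
```

Keeping the command as a parameter means the same wrapper can be exercised with any stand-in program before a trained model is available.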

5. Find source(s) for initial input termbases
-- due date: Friday, November 27

Per my committee chair, BYU has a license to a piece of software called SynchroTerm (good thing, too: a license is $1,800!), to which I may be able to get access. It identifies and extracts source/target translation pairs from bitext input, which I can then plug into a skeleton TBX file. But, since I don't currently have access to SynchroTerm, I'll start with some hand-created entries. The input bitexts will come from a selection of movie subtitle files in the OPUS/OpenSubs corpus, probably English-French since that's a common-enough European language that OpenSubs has a lot of available content, and my committee members all have at least a passing understanding of the language (I don't, but hopefully Google Translate will help me to gist things well enough to proceed).

6. Project software 0.1 alpha
- parse input segmentation file, determine lexical items
- invoke subject field detection module
- parse input termbases for lexical item matches
-- due date: Friday, December 11

Oops, this has slipped a bit. With health, work, and family constraints on my time recently, I haven't kept up quite as well as I would have liked. My new deadline for this will be December 31.

(Assuming, of course, that my surgery on the 28th goes smoothly, and the drugs don't make me completely loopy. So that's my excuse if things slip again! ;))


Thursday, November 12, 2009

School progress: domain detection

So this is a few days late, since task 3 on my Linguistics MA project plan was due November 6th:

3. Evaluate available methods of determining subject field of a text

... but I don't actually have much to report. After several nights of searching, I haven't come up with anything substantial in the way of open-source code for subject field detection of a text. The academic papers I have read that touch on the subject (heh) all seem to gloss over the details of any domain detection step, or simply refer to proprietary solutions to the problem with no discussion of how those solutions are implemented. When I raised this issue with my graduate adviser, he suggested I "punt" -- that is, my software will just have to ask the user to specify the domain at run-time, or run without a specific domain in mind. Anyway, I have bigger fish to fry; the domain inference was never intended to be the core of my project, rather a means to the end of coupling statistical machine translation with terminology management/lookup. So now I move on to step 4:

4. Based on results of milestones 1, 2, and 3, determine overall program architecture

Also, in unrelated news: paralleling similar moves on my JavaScript and MUGEN blogs, I've moved my site navigation from the sidebar into the header. I'm also working on merging the three templates used by these blogs into a single template that can accept custom CSS for each blog to help them have a unified appearance, while allowing each to retain its unique look and feel. I've also added more links in the sidebar to my various social media profiles, if that interests you at all. I suspect for most of you it won't, but anyway, they're there.


Sunday, November 01, 2009

Accountability: MA Progress for 11-1-2009

Last time, when I posted my fall schedule for my grad school work, it was for a very particular purpose. In fact, it's the same reason I track my workouts on DailyBurn: holding myself accountable.

Accountability is a weird thing, especially on the Web. Several years ago, Sweetheart and I were involved with a pretty cool network marketing team, and this is where I first encountered the idea of accountability to someone else. Not as a job, mind you, and not as any kind of a prerequisite to anything, but it was in my friend's best interest that I succeed, because ultimately my success furthered his own. Now, I'm not going to debate the relative merits of this kind of a marketing system with anyone -- it's probably worth noting that we're no longer involved with that company, but most of the people we worked with are still our Facebook friends -- but this concept of holding myself accountable still has merit in getting me to do what I know I should be doing anyway. The difference, of course, is that a lot of you probably don't care all that much (well, except you, Mom). The point is, if there's a public record of my performance (or lack thereof), it makes me that much more likely to want to perform well.

With that prelude, here's what I've done so far:
  1. Evaluate available open-source SMT packages

     I've really only found one good, open-source option for Statistical machine translation: Moses. Everything else I've seen is either too new/experimental, or is Rule-based MT (like Apertium). Moses, on the other hand, is mature, in active use, and well documented: all strong points in its favor. Here's how it met my other evaluation criteria:

    • what programming language / API?
      It's written in C++, but may well be usable strictly from the command-line without requiring any code changes to its core source code

    • what are the license terms?
      Moses is licensed under the LGPL.

    • does it accept bitext inputs in XLIFF format?
      No, but this may not be a problem (see task 2).

  2. Evaluate file format(s) of OpenSubs corpus

     The OpenSubs project encodes its parallel texts in a different XML dialect, called XCES. However, they also provide a Perl script to convert from this format into the input format needed by Moses. So as nice as it may be to use the XLIFF format, doing so may introduce unnecessary delays into the project because I would have to write a pair of filters: OpenSubs-to-XLIFF and XLIFF-to-Moses.
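For a sense of what reading one of these XCES alignment files involves, here's a small Python sketch. The sample below is hand-written in the shape of the sentence alignment files I've seen (a `linkGrp` per document pair, with each `link` listing source and target sentence ids separated by a semicolon); the document names and the `read_links` function are assumptions for illustration, not part of any OpenSubs tooling.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the document names are made up.
SAMPLE_XCES = """<cesAlign version="1.0">
  <linkGrp targType="s" fromDoc="en/sample.xml.gz" toDoc="fr/sample.xml.gz">
    <link id="SL1" xtargets="1;1" />
    <link id="SL2" xtargets="2 3;2" />
    <link id="SL3" xtargets=";3" />
  </linkGrp>
</cesAlign>"""


def read_links(xml_text):
    """Return a list of (source_ids, target_ids) pairs, one per <link>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for link in root.iter("link"):
        # xtargets holds "source ids;target ids", each side space-separated.
        source, _, target = link.get("xtargets", "").partition(";")
        pairs.append((source.split(), target.split()))
    return pairs


for source_ids, target_ids in read_links(SAMPLE_XCES):
    print(source_ids, "->", target_ids)
```

Note that a link can pair several sentences on one side with one on the other, or leave a side empty, so anything downstream has to cope with more than tidy one-to-one alignments.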


So, that's where things stand as of tonight. This week, I'll be looking into open-source methods of text categorization and subject field determination: that is, trying to figure out programmatically if an input script is from a movie about rockets and rayguns, or Victorian romance.


Wednesday, October 21, 2009

Ling MA tasks for the fall

Many of my readers may know I've been working on a Linguistics MA for a few years now. I finished my coursework last year, and only have my project/thesis left to be done, and I've been kind of drifting a bit. Part of the reason for that has been a busy work schedule and a long commute since our move, not to mention all of the work that goes along with home ownership. But a significant portion of it also has to do with a lack of a clear through-line from here to completion. Last week, when I made this observation about myself, I resolved to do something about it. This post summarizes the results of that effort, which I have forwarded on to my graduate committee chair so I have someone to be accountable to.

  1. Evaluate available open-source SMT packages
    • what programming language / API?
    • what are the license terms?
    • does it accept bitext inputs in XLIFF format?
    • due date: Friday, October 23

  2. Evaluate file format(s) of OpenSubs corpus
    • if not XLIFF, develop a conversion / mapping / filter
    • due date: Friday, October 30

  3. Evaluate available methods of determining subject field of a text
    • what programming language / API?
    • what are the license terms?
    • due date: Friday, November 6

  4. Based on results of milestones 1, 2, and 3, determine overall program architecture
    • which programming language will it be written in?
    • which SMT package will we use?
    • which subject field detection method will we use?
    • how will we invoke these three modules (subject detection, termbase lookup, and SMT)?
    • due date: Friday, November 20

  5. Find source(s) for initial input termbases
    • due date: Friday, November 27

  6. Project software 0.1 alpha
    • parse input segmentation file, determine lexical items
    • invoke subject field detection module
    • parse input termbases for lexical item matches
    • due date: Friday, December 11

  7. Project software 0.2 alpha
    • invoke SMT module for lexical items missing from input termbases
    • due date: Friday, January 1

  8. Alpha evaluation
    • Evaluate software gaps w/ AKM
    • due date: Friday, January 15

  9. Project software 1.0 beta
    • fill in any functional gaps identified during milestone 8
    • fix any bugs identified during milestone 8
    • present the system to the rest of the committee (Drs. Lonsdale & Bush)
    • due date: TBD

Schedule for my evenings (~9pm-12am):

  • MWF: schoolwork
  • Sun: personal reading / family time
  • Tue: personal code projects / coding blog
  • Thu: church service / new comic book night
  • Sat: family time / MUGEN work

Now, to see how long I can stick to it. I intend to post at least some of my progress here in this blog, albeit in a rough form, because it'll force me to get back in the habit of writing, and will function as the rough draft for my eventual project writeup. (don't worry, though, I'll try not to make it too dry; I can always make it more academic later)


Monday, August 17, 2009

My Day at the Library of Congress (part 2)

Back in June, I wrote part 1 of my story about our visit to the Library of Congress and said that part 2 was coming the next day... Then I promptly dropped the subject. I'm thinking I need better follow-through on my writing here, eh?

So when I left off last time, my Sweetheart and the kids and I were on our way into D.C. We took the metro from Huntington station, which is always fun for the kids and mildly stressful for the grownups (but much less stressful than actually driving into / parking in the District). One quick subterranean trip later, we surfaced a few blocks south of the Library.

Following all the out-of-towners into the main visitors' lobby, we uttered the requisite oohs and aahs at its impressive architecture (which is quite impressive). After pausing a few moments to take it all in (and take a few candid photos of Sweetheart and the kids), I asked a patron where to go for, you know, the books, telling him I was trying to find a particular book for my journal submission. He directed me to a sub-basement tunnel leading to one of the other buildings, where I could sign up for my "library card". So I left Sweetheart with the kids to look at the Gutenberg Bible, Bob Hope memorabilia, Gershwin's piano, and the rest of the public exhibits, and took the elevator down to the catacombs (okay, not really, but it did kind of feel that way).

When I got to the right office, I filled out my request for access and waited my turn. Not much of a wait, actually (it was, reportedly, slower than usual: bonus!). They took my picture and gave me my official "library reader" card, then I sat down with another gentleman whose job it is to help new readers find what they're looking for, in exchange for their self-worth as a human being and grad student. When I gave him the call number and title, he kind of snorted and launched into a tirade about the uselessness of my chosen field, and Liberal Arts disciplines in general. A few times, I wanted to interject some rebuttal or counter to his points, but I bit my tongue for the sake of the mission: closing time was looming within a few hours and I might end up needing all the time I could get (I also wondered if you could get kicked out for arguing with the grumpy old librarian, and didn't want to risk it). At the time I idly wondered if goading young researchers was some form of "hazing," and it wouldn't surprise me if it is.

Anyway, after I refused to take the bait when Mister Sunshine belittled my research subject, he told me that my book - if I still really wanted to get it, and not abandon such a useless endeavor in the face of his soul-crushing logic - was in the main reading room. Cool! That's the place you always see pictures of, and one Ben Gates got the President's secret book from in National Treasure 2! Keen! So back through the catacombs I marched, and up to the research floor.

I checked my bag at security: no cameras, smartphones, or other potentially clandestine devices are permitted in the main Reading Room, although they've recently made an exception for laptops in one area of the floor. If I'd known this ahead of time, I would have brought my laptop; instead, my arsenal of research tools was stripped down to a small pad of paper and a pencil. Worse, the woman at the circulation desk told me the wait for my request could be upwards of an hour! Without email, Google Reader, and Patiences solitaire, my usual time-killing companions! Horrors! I noted with some interest (and a little concern) how dependent I've become on this little miracle of hand-held computing (it's with a liberal helping of irony that I'm writing this tonight from said smartphone).

So now I was in the Reading Room... and yes, it is as mind-blowingly big as it seems in its photos. And, like most of the government buildings in Washington, it was loaded to the gills with symbolic art, almost absurdly, overwhelmingly so.

With nothing else to do but sit and quietly wait, I began to write in my small notepad:


6/27/2009
So... Here I am in the Library of Congress's main Reading Room! It's quite an impressive place, just as amazing as it looked in "National Treasure" when Nic Cage tried to steal the President's secret book. I'm sitting directly under the huge rotunda, looking up at art celebrating our civilization's heritage: philosophy from Greece, religion from Judea, emancipation from France, administration from Rome, etc. Awe-inspiring is too small a word.


About that time, my book showed up -- in less than the predicted hour, so they were 2 for 2 on efficiency for me that day. Within 5 minutes, I'd found the quote I'm using in my paper, and had made a note of its page number and a few other relevant details. Here's the funniest part of the whole effort: I knew the answer all along: years ago, I learned from Douglas Adams the "Ultimate Answer" to the "Ultimate Question," and that was indeed the answer I'd spent this entire day seeking (and all this time blogging about it!):

The missing page number was 42.


Sunday, June 28, 2009

My Day at the Library of Congress (part 1)

Library of Congress main reading room

So yesterday afternoon I had the opportunity to go and do research for my ongoing Master's degree at the Library of Congress! This was one of those things that I never knew I wanted to do until the chance came, then I realized how unique the opportunity was, and got pretty excited about it.



This all came about due to a paper I submitted to the journal Oceanic Linguistics a while ago as part of my Linguistics MA coursework. To my surprise, it was accepted for publication! Yay me! The editor, John Lynch, sent me back the comments from his two reviewers that I needed to account for in my final revision, and I promptly got busy with work and school and family, and forgot all about it. Fast forward 14 months, to now: John emailed me, asking in the most diplomatic terms possible if he was ever, actually, going to get my final revision so he could publish it. D'oh!



So, properly chastened, I spent my nights last week working on the necessary revisions, and submitted something to him last Thursday. All was well with the world, and I could get some sleep! Yay again!



Except for one minor quibble, John was happy with this revision. But the quibble was kind of a big one: I had included a direct quote from one of my sources, a grammar text for the language I'm working with (Hiligaynon), but I neglected to include the page number. With a direct quote, that's crucial. With that one change, John informed me, he'd be able to format the article for publication. One little problem: I'm not in Provo anymore, and can't drag my little old self over to the BYU library to look the quote up!



So my options were to rewrite the paragraph without the quote (which would be doable, but it would hurt my argument to not have that author's direct input anymore), or to find another copy of the book closer at hand. When I turned to Google's academic search engine, Google Scholar, I found that there was a copy close by -- in the Library of Congress! I immediately made plans for my expedition into "the District."



To be continued tomorrow...


Monday, June 26, 2006

Moving Right Along...

I just realized that I haven't blogged about this yet, so figured it was time to do so. We've decided to move back to Utah (again). As of August 10th or thereabouts, we'll be residents of the small-but-growing town of Lehi, Utah.

The reason for the move? Well, there are several. The big two are that Mary & I have both decided to go back to school... she to massage therapy college, and me back to BYU for graduate work in computational linguistics.

So, yay us!


Friday, January 06, 2006

Happy New Year!

Well, the holidays are now behind us. We had a nice one, quiet & (mostly) restful... but Mary & I now understand that line in the old Christmas carol: "Mom and Dad can hardly wait for school to start again!" For my kids, at least, a regular routine makes all the difference.

The big thing right now is that I'm applying for Grad School at BYU. Deadline is January 15th, so we're coming down to the wire now. I've finished the writing portions, just waiting on my GRE test results and recommendation letters. Hopefully they'll come in time. =P


Thursday, September 26, 2002

Another school rant

Wow... 2 blogs in 2 days... the world must be coming to an end!

But seriously.

I wish I could just blink and make school over with already... I stayed up until 5am this morning trying to get some work done, and spent all day studying for and taking a Phonetics test that killed my wrist on my writing hand... took over 2 hours to do the whole thing, solid writing... most of it wasn't terribly difficult after all my cramming, just a restating of what was said in class, but BOY it was verbose!

Anyway, that's my life in a nutshell. Lots of school, lots of work, not much sleep. Que sera sera, I suppose.

But I still wish I could just go to sleep and wake up when it's all over.


Monday, September 23, 2002

The Home Stretch?

Been a while. I've started my last (knock on wood) semester at BYU; only 9 credits, but all 3 are those 3-credit classes where the professors actually believe that you've cleared 40+ hrs a week in your calendar specifically for them! :-P

I'm also trying to work full-time while doing this. Last week, I barely managed to pull off 20. And that was working Friday night until 5am!

Anyway, besides extreme sleep deprivation, all is well here in OO-tah. Mary's just going into her third trimester with baby Matthew Braden, and he's one __active__ little dude... kicks her all the time now, so much so that Sarah & I can see & feel it too. She's getting tired more easily than before, but that's a pretty typical symptom of 3rd trimester, so I'm not worried about chronic fatigue or anything yet.

That's about it. More news later when I think of it.


Friday, July 19, 2002

A toothless rant

Hm. No blog in 2 weeks. Sorry for those of you waiting with bated breath to read my ramblings.

I am SO fried. Getting like 4 hours of sleep every night, if I'm lucky. Some days are more productive than others, but right now, Friday afternoon, all I want to do is go lie on the beach in La Jolla, CA. Man, I miss San Diego... wish we could afford to go back... wish we could afford a LOT of things that we can't right now...

like sleep. :(
