Pinocchio's a Real Boy Now
After many years in school, I can finally call myself a "real" linguist: I've been published!
Labels: school
Buyog's web hive
Per my plan for January, today is the day I'd planned to have several EFR segmentation files ready for use in testing the project software. Here's what I've gathered:
The segmentation files aren't quite in their final form yet, due to some limitations with XSLT transformations. I'll post the gory details of the differences, along with a JavaScript parser to bridge them, later today in the Code section of my site.
Labels: school
So, as I mentioned a few weeks ago, I reset my deadline for the first cut of my project software for December 31. Recall that this software would:
- parse input segmentation file, determine lexical items
- invoke subject field detection module
- parse input termbases for lexical item matches
Item 1 is a simple parsing task. Easy enough to implement, but only doable once I have input files to parse in the proper format. This is a task I neglected to itemize in my initial project plan, which is unfortunate because it will take me some time. I plan to work on that over the next week, so let's say January 15 as a target date by which I'll have at least 2 or 3 input segmentation files.
Item 2 is now unnecessary, as my committee chair suggested I avoid the need to integrate a subject field detector by having the user specify subject field at processing time. One open question in my mind is how we'll represent the subject field: is it simply free-text, which is much less useful, or is it from a specific ontological system? If the latter, what ontology should we use? This seemingly simple subtask suddenly looks complex again.
Item 3 is the simplest subtask in this first incarnation of the software - all we'll need is a good XML/XPath parser to do the lookup. However, as with the segfiles, writing the parser isn't enough if there's no input data to parse. So after producing some sample segfiles with identified target terms, I'll need to hand-create a TBX file with enough entries to be useful. Setting January 22nd for this deadline is likely over-optimistic — even January 29th is probably pushing it, but that's the date I'm going to set.
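A minimal sketch of that termbase lookup, in Python (one of the two candidate languages mentioned in an earlier post). Everything here is an assumption for illustration: the sample entry is made up, the `termEntry` / `langSet` / `tig` / `term` nesting is the common TBX layout rather than anything confirmed for this project, and `load_termbase` and `lookup` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical hand-created sample; a real TBX file carries a header and more metadata.
SAMPLE_TBX = """<martif type="TBX" xml:lang="en">
  <text><body>
    <termEntry id="e1">
      <langSet xml:lang="en"><tig><term>spaceship</term></tig></langSet>
      <langSet xml:lang="fr"><tig><term>vaisseau spatial</term></tig></langSet>
    </termEntry>
  </body></text>
</martif>"""


def load_termbase(tbx_text, source_lang, target_lang):
    """Index the termbase: lowercased source term -> list of target terms."""
    root = ET.fromstring(tbx_text)
    index = {}
    for entry in root.iter("termEntry"):
        # Collect the terms of this entry, grouped by language code.
        terms_by_lang = {}
        for lang_set in entry.iter("langSet"):
            lang = lang_set.get(XML_LANG)
            for term in lang_set.iter("term"):
                if term.text:
                    terms_by_lang.setdefault(lang, []).append(term.text.strip())
        for source_term in terms_by_lang.get(source_lang, []):
            index.setdefault(source_term.lower(), []).extend(
                terms_by_lang.get(target_lang, [])
            )
    return index


def lookup(index, lexical_items):
    """Return only the lexical items that have a termbase match."""
    return {
        item: index[item.lower()]
        for item in lexical_items
        if item.lower() in index
    }
```

Calling `lookup(load_termbase(SAMPLE_TBX, "en", "fr"), ["Spaceship", "dog"])` keeps only the item found in the termbase. Matching is a plain case-insensitive string comparison, so inflected forms would need extra handling.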
I just got an email from the BYU Linguistics dept secretary, saying that in order to graduate this semester I'd have to have my final thesis / project writeup delivered to my committee by the first week of February. At this stage in the process, I think it's pretty obvious that I won't be ready to defend by then. Soooooo... Now shooting for Summer term.
Labels: school
It's been too long since my last school-related update, particularly since I've passed several of my self-imposed deadlines. So here goes:
3. Evaluate available methods of determining subject field of a text
Unfortunately, I haven't really found any good, open-source solutions to this problem, so after conferring with my committee chair, I've changed course slightly: now we'll specify a subject field for the text prior to processing it.
4. Based on results of milestones 1, 2, and 3, determine overall program architecture
- which programming language will it be written in?
I'm still not 100% sure about this, but am leaning towards either Visual Studio Express (hint: the Express version is free), or Python.
- which SMT package will we use?
Moses is pretty much the only game in town for what we're doing.
- which subject field detection method will we use?
See above; we'll ask the user instead.
- how will we invoke these three modules (subject detection, termbase lookup, and SMT)?
-- due date: Friday, November 20
Yeah... "oops" on that deadline. I'll be invoking Moses from the command-line, rather than trying to integrate its source code into my own project. The exact mechanics of how I do this depend on which programming language I choose.
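If Python ends up being the language, the command-line approach could look roughly like the sketch below. It assumes the Moses decoder binary reads one sentence per line on standard input and takes its `moses.ini` configuration through `-f`; the paths and the helper names `build_moses_command` and `run_decoder` are hypothetical.

```python
import subprocess
import sys


def build_moses_command(moses_bin, config_path):
    """Build the decoder command line; '-f' names the moses.ini config file."""
    return [moses_bin, "-f", config_path]


def run_decoder(command, sentences):
    """Send one sentence per line on stdin and return the output lines."""
    result = subprocess.run(
        command,
        input="\n".join(sentences) + "\n",
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # work with str rather than bytes
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()


# Stand-in "decoder" that just upper-cases its input, so the plumbing can be
# tried on a machine without Moses installed.
FAKE_DECODER = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().upper())",
]
```

With a real install, the call would be something like `run_decoder(build_moses_command("/path/to/moses", "/path/to/moses.ini"), ["a sentence to translate"])`. Keeping the command separate from the runner means the same code can be pointed at a stand-in while the real decoder is still being set up.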
5. Find source(s) for initial input termbases
-- due date: Friday, November 27
Per my committee chair, BYU has a license to a piece of software called SynchroTerm (good thing, too: a license is $1,800!), to which I may be able to get access. It identifies and extracts source/target translation pairs from bitext input, which I can then plug into a skeleton TBX file. But, since I don't currently have access to SynchroTerm, I'll start with some hand-created entries. The input bitexts will come from a selection of movie subtitle files in the OPUS/OpenSubs corpus, probably English-French: French is a common-enough European language that OpenSubs has a lot of available content for it, and my committee members all have at least a passing understanding of the language (I don't, but hopefully Google Translate will help me to gist things well enough to proceed).
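Plugging term pairs into a skeleton TBX file could be sketched as follows, whether the pairs come from SynchroTerm or are typed in by hand. This is only an illustration: `build_tbx` is a hypothetical helper, the sample pair is invented, and the output omits the header a real TBX file would need.

```python
import xml.etree.ElementTree as ET

# ElementTree writes this attribute out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tbx(pairs, source_lang="en", target_lang="fr"):
    """Turn (source term, target term) pairs into a bare-bones TBX string."""
    root = ET.Element("martif", {"type": "TBX", XML_LANG: source_lang})
    body = ET.SubElement(ET.SubElement(root, "text"), "body")
    for number, (source_term, target_term) in enumerate(pairs, start=1):
        # One termEntry per pair, with one langSet for each language.
        entry = ET.SubElement(body, "termEntry", {"id": f"entry-{number}"})
        for lang, term in ((source_lang, source_term), (target_lang, target_term)):
            lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
            tig = ET.SubElement(lang_set, "tig")
            ET.SubElement(tig, "term").text = term
    return ET.tostring(root, encoding="unicode")


# Hypothetical hand-created pair, standing in for extracted subtitle terms.
tbx_text = build_tbx([("spaceship", "vaisseau spatial")])
```

Since each pair becomes its own `termEntry`, synonyms for the same concept would end up in separate entries; merging them is left out of this sketch.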
6. Project software 0.1 alpha
- parse input segmentation file, determine lexical items
- invoke subject field detection module
- parse input termbases for lexical item matches
-- due date: Friday, December 11
Oops, this has slipped a bit. With health, work, and family constraints on my time recently, I haven't kept up quite as well as I would have liked. My new deadline for this will be December 31.
(Assuming, of course, that my surgery on the 28th goes smoothly, and the drugs don't make me completely loopy. So that's my excuse if things slip again! ;))
Labels: school
So this is a few days late, since task 3 on my Linguistics MA project plan was due November 6th:
3. Evaluate available methods of determining subject field of a text
... but I don't actually have much to report. After several nights of searching, I haven't come up with anything substantial in the way of open-source code for subject field detection of a text. The academic papers I have read that touch on the subject (heh) all seem to gloss over the details of any domain detection step, or simply refer to proprietary solutions to the problem with no discussion of how those solutions are implemented. When I raised this issue with my graduate adviser, he suggested I "punt" -- that is, my software will just have to ask the user to specify the domain at run-time, or run without a specific domain in mind. Anyway, I have bigger fish to fry; the domain inference was never intended to be the core of my project, but rather a means to the end of coupling statistical machine translation with terminology management/lookup. So now I move on to step 4:
4. Based on results of milestones 1, 2, and 3, determine overall program architecture
Also, in unrelated news: paralleling similar moves on my JavaScript and MUGEN blogs, I've moved my site navigation from the sidebar into the header. I'm also working on merging the three templates used by these blogs into a single template that can accept custom CSS for each blog, to give them a unified appearance while allowing each to retain its unique look and feel. I've also added more links in the sidebar to my various social media profiles, if that interests you at all. I suspect for most of you it won't, but anyway, they're there.
Labels: school, site updates
Last time, when I posted my fall schedule for my grad school work, it was for a very particular purpose. In fact, it's the same reason I track my workouts on DailyBurn: holding myself accountable.
Labels: personal development, school
Many of my readers may know I've been working on a Linguistics MA for a few years now. I finished my coursework last year, and only have my project/thesis left to be done, and I've been kind of drifting a bit. Part of the reason for that has been a busy work schedule and a long commute since our move, not to mention all of the work that goes along with home ownership. But a significant portion of it also has to do with a lack of a clear through-line from here to completion. Last week, when I made this observation about myself, I resolved to do something about it. This post summarizes the results of that effort, which I have forwarded on to my graduate committee chair so I have someone to be accountable to.
Schedule for my evenings (~9pm-12am):
Now, to see how long I can stick to it. I intend to post at least some of my progress here in this blog, albeit in a rough form, because it'll force me to get back in the habit of writing, and will function as the rough draft for my eventual project writeup. (don't worry, though, I'll try not to make it too dry; I can always make it more academic later)
Labels: school
Back in June, I wrote part 1 of my story about our visit to the Library of Congress and said that part 2 was coming the next day... Then I promptly dropped the subject. I'm thinking I need better follow-through on my writing here, eh?
6/27/2009
So... Here I am in the Library of Congress's main Reading Room! It's quite an impressive place, just as amazing as it looked in "National Treasure" when Nic Cage tried to steal the President's secret book. I'm sitting directly under the huge rotunda, looking up at art celebrating our civilization's heritage: philosophy from Greece, religion from Judea, emancipation from France, administration from Rome, etc. Awe-inspiring is too small a word.

So yesterday afternoon I had the opportunity to go and do research for my ongoing Master's degree at the Library of Congress! This was one of those things that I never knew I wanted to do until the chance came, then I realized how unique the opportunity was, and got pretty excited about it.
This all came about due to a paper I submitted to the journal Oceanic Linguistics a while ago as part of my Linguistics MA coursework. To my surprise, it was accepted for publication! Yay me! The editor, John Lynch, sent me back the comments from his two reviewers that I needed to account for in my final revision, and I promptly got busy with work and school and family, and forgot all about it. Fast forward 14 months, to now: John emailed me, asking in the most diplomatic terms possible if he was ever, actually, going to get my final revision so he could publish it. D'oh!
So, properly chastened, I spent my nights last week working on the necessary revisions, and submitted something to him last Thursday. All was well with the world, and I could get some sleep! Yay again!
Except for one minor quibble, John was happy with this revision. But the quibble was kind of a big one: I had included a direct quote from one of my sources, a grammar text for the language I'm working with (Hiligaynon), but I neglected to include the page number. With a direct quote, that's crucial. With that one change, John informed me, he'd be able to format the article for publication. One little problem: I'm not in Provo anymore, and can't drag my little old self over to the BYU library to look the quote up!
So my options were to rewrite the paragraph without the quote (which would be doable, but it would hurt my argument to not have that author's direct input anymore), or to find another copy of the book closer at hand. When I turned to Google's academic search engine, Google Scholar, I found that there was a copy close by -- in the Library of Congress! I immediately made plans for my expedition into "the District."
To be continued tomorrow...
Labels: personal development, school
I just realized that I haven't blogged about this yet, so figured it was time to do so. We've decided to move back to Utah (again). As of August 10th or thereabouts, we'll be residents of the small-but-growing town of Lehi, Utah.
Labels: happenings, school
Well, the holidays are now behind us. We had a nice one, quiet & (mostly) restful... but Mary & I now understand that line in the old Christmas carol: "Mom and Dad can hardly wait for school to start again!" For my kids, at least, a regular routine makes all the difference.
Wow... 2 blogs in 2 days... the world must be coming to an end!
Been a while. I've started my last (knock on wood) semester at BYU; only 9 credits, but all 3 are those 3-credit classes where the professors actually believe that you've cleared 40+ hrs a week in your calendar specifically for them! :-P
Hm. No blog in 2 weeks. Sorry for those of you waiting with bated breath to read my ramblings.