Recently, Martin transferred recordings from 108 DATs to hard disk. Unfortunately, he didn't preserve the ids on the DATs, but instead segmented the audio into tracks according to silence. This is not the same, since an id sometimes covers one and sometimes two tracks on these DATs. Thing is, the recordings were originally stored on 78s. The ids consistently describe these records, i.e. each record has one id, no matter how many sides it has.
Although this segmentation was not intended, it provides an interesting case study, since segmentation according to silence is relatively easy to automate. However, the result is worthless if one cannot associate it with the original chunks. In other words: if I cannot associate tracks with ids, the audio chunks are worthless.
So far so good. The problem is that I start with the db, which stores
- tape number (tape_no)
- id
- identifier
- ...
while the new files come only with two types of information:
- tape number (tape_no) and the
- track (track_no).
The only common field here is tape_no. As explained above, track_no is usually different from id. Unfortunately, I cannot link the two tables because there are not enough common fields.
So, I try to come up with an algorithm to do exactly this. Possible? Theoretically yes, in practice it is pretty fishy.
Since a 78 can have one side or two, there is no easy rule for getting from tracks to ids. Instead, I look them up in our database. If I find two titles, it usually means that the record has two sides, and that means I need two tracks for one id. This is fairly reliable. There are a few cases where there are no titles. And another set of examples where somebody put in translations, so that there are 4 titles. I didn't come across two titles where the 78 actually has only one side. But maybe I should check this again.
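The title rule above could be sketched roughly like this. The helper name sides_from_titles is hypothetical, and treating one title as one side is my assumption; anything else (no titles, or 4 titles from translations) is left undecided so that it gets checked by hand rather than guessed.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the number of sides from the number of titles.
# Returns 1 or 2, or undef when the count is ambiguous.
sub sides_from_titles {
    my ($title_count) = @_;
    return 1 if $title_count == 1;    # assumption: one title means one side
    return 2 if $title_count == 2;
    return;    # 0 titles or 4 titles (translations): do not guess
}

for my $n ( 0, 1, 2, 4 ) {
    my $sides = sides_from_titles($n);
    printf "%d title(s): %s\n", $n,
        defined $sides ? "$sides side(s)" : 'check manually';
}
```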
So, what I decided to do is what I call a "reverse lookup". This seems not to be an existing term in this context, judging from Google. I don't know. Perhaps it is not a good term either. You can judge yourself. What I mean is that with a normal query (lookup), I would start at either end, i.e.
- either from the db side with tape and id, looping over both and accessing every item; or the other way round
- from the file side with tape_no and track_no.
Neither works here, so I change order in the middle of the query. I start in one direction and then change direction. That's at least how I think of it. Not quite sure if this description is accurate. Rather not. More likely, I reconstruct ids which I actually do not have from the fact that I know that an id consists of either one or two consecutive tracks.
More concretely: I loop through the tapes. For each tape, I create ids. I assume that there is an id=1. For this id, I look up the db record with that tape and that id. In that record I check whether it has one or two sides. Accordingly, I associate this id with one or two tracks and continue with the next id and the next track. Just a loop with two variables.
So much for the theory, in which this works fine, but unfortunately my db data is not so reliable. Once in a while, something is missing and I don't find a tape_no and id pair. What then? I cannot be sure whether that record has one or two sides. And that means that all the id-track associations afterwards may or may not be correct anymore. Now, at this moment, I could guess, but this is not good enough. I need to rely on the information I produce, so I have to abort the whole tape and continue with the next one. (Right now, as I am still writing the script, half the tapes have something missing... :-( )
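For a single tape, the loop with two variables might look roughly like this. It is only a sketch with made-up data: %sides_for_id stands in for the db lookup, map_tracks_to_ids is a hypothetical name, and I assume here that every id has a record.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the db: number of sides per id on one tape.
my %sides_for_id = ( 1 => 2, 2 => 1, 3 => 2 );

# Walk ids and tracks in parallel: each id consumes one or two tracks.
sub map_tracks_to_ids {
    my ( $sides_for_id, $track_count ) = @_;
    my %tracks_for_id;
    my $id       = 1;
    my $track_no = 1;
    while ( $track_no <= $track_count ) {
        my $sides = $sides_for_id->{$id};
        die "no db record for id $id\n" unless defined $sides;
        $tracks_for_id{$id} = [ $track_no .. $track_no + $sides - 1 ];
        $track_no += $sides;    # jump over the tracks just used
        $id++;
    }
    return \%tracks_for_id;
}

my $map = map_tracks_to_ids( \%sides_for_id, 5 );
for my $id ( sort { $a <=> $b } keys %$map ) {
    print "id $id: tracks @{ $map->{$id} }\n";
}
```

With the made-up data, id 1 gets tracks 1 and 2, id 2 gets track 3, and id 3 gets tracks 4 and 5.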
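The abort rule could be sketched like this. Again the data is made up: %db and %track_count are hypothetical stand-ins for the db and the file list, and the real script may do this differently. The point is only that a missing tape_no and id pair drops the whole tape and gets reported.

```perl
use strict;
use warnings;

# Hypothetical db extract: $db{tape_no}{id} = number of sides.
my %db = (
    1 => { 1 => 2, 2 => 1 },
    2 => { 1 => 1, 3 => 2 },    # id 2 is missing on tape 2
);

# Hypothetical number of tracks found on disk per tape.
my %track_count = ( 1 => 3, 2 => 4 );

my ( %map, @missing );
TAPE: for my $tape_no ( sort { $a <=> $b } keys %track_count ) {
    my %tracks_for_id;
    my $id       = 1;
    my $track_no = 1;
    while ( $track_no <= $track_count{$tape_no} ) {
        my $sides = $db{$tape_no}{$id};
        if ( !defined $sides ) {
            # Unknown number of sides: everything after this is unreliable,
            # so note the pair and give up on the whole tape.
            push @missing, "tape $tape_no, id $id";
            next TAPE;
        }
        $tracks_for_id{$id} = [ $track_no .. $track_no + $sides - 1 ];
        $track_no += $sides;
        $id++;
    }
    $map{$tape_no} = \%tracks_for_id;    # only complete tapes get here
}

print "mapped tapes: @{[ sort { $a <=> $b } keys %map ]}\n";
print "missing: $_\n" for @missing;
```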
Now, before I do this, I try to fill a few gaps in the data manually.
A few comments:
- Doing this, I notice problems in my xml data. When $id = $tape_no, the info is missing, for the first 13 tapes or so. I didn't see this error so far. I guess the error is in the lvl3 (fix) conversion or somewhere else on the way. I correct it in my xml and then it should not be relevant anymore.
- I will attach the perl once it's done, but let me just put the sections relevant for libxml in here, since I will look at them again the next time I use libxml. This time I tried to use xpc, although it wasn't necessary.
...
use XML::LibXML;
...
my $xml = '/home/mengel/EMEM-78test.lvl3.220708.mpx';
...
my $parser = XML::LibXML->new();
my $tree = $parser->parse_file($xml); #tree is doc
my $root = $tree->getDocumentElement;
...
my $xpc = XML::LibXML::XPathContext->new;
$xpc->registerNs( 'mpx' => 'http://www.mpx.org/mpx' );
...
my $xpath =
"//mpx:sammlungsobjekt[mpx:andereNr[\@art = 'ID' and . = '$id'] and mpx:andereNr[\@art ='DAT-Nr.' and . = '$tape_no' ]]";
my @nodes = $xpc->findnodes( $xpath, $root );
...
if ( @nodes == 1 ) {    # exactly one record for this tape_no/id pair
    my $identNr = $nodes[0]->find('mpx:identNr');
    my @titles  = $nodes[0]->findnodes('mpx:titel[@art]');    # titles are used to guess the number of sides
}
I tried XPathContext here, although it is not necessary. Not exactly sure why. I have an idea, but this is not the topic here anyway. So far so good. I guess I will continue in a comment when I find it.
| Attachment | Size |
|---|---|
| 78er_file_map.txt | 187.23 KB |
| 78er_missing.txt | 15.51 KB |
| 78er_warnings.txt | 24.65 KB |
| 78er_file_cache.txt | 205.14 KB |
| 78er.missing_tracks.txt | 1.72 KB |
| Rename78.pl.txt | 40.74 KB |

Inconsistent file names
Since I found the gaps in the file names (see above), I thought I had better check my file name parser again and made it look for duplicates. This pointed me to another 400 or so errors. There are two alternatives: improve my file parser, or read the Excel file instead. I guess I prefer the former.
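The duplicate check could be sketched like this, assuming the parser already yields a tape_no and a track_no per file. The file names and the find_duplicates helper are made up for illustration; the real naming scheme is not shown here.

```perl
use strict;
use warnings;

# Hypothetical parser output: file name, tape_no, track_no.
my @parsed = (
    [ 'a.wav', 1, 1 ],
    [ 'b.wav', 1, 2 ],
    [ 'c.wav', 1, 2 ],    # claims the same tape and track as b.wav
);

# Group files by "tape_no/track_no" and keep the keys claimed more than once.
sub find_duplicates {
    my (@entries) = @_;
    my %files_for;
    for my $entry (@entries) {
        my ( $file, $tape_no, $track_no ) = @$entry;
        push @{ $files_for{"$tape_no/$track_no"} }, $file;
    }
    my %dups;
    for my $key ( keys %files_for ) {
        $dups{$key} = $files_for{$key} if @{ $files_for{$key} } > 1;
    }
    return \%dups;
}

my $dups = find_duplicates(@parsed);
print "$_: @{ $dups->{$_} }\n" for sort keys %$dups;
```

A duplicate key usually means the parser misread one of the names, so each hit is worth a look at the file name itself.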
Gaps in the files
Now I ran into the next problem. Some of the files on the hard disk seem to be missing. At least there are gaps in their numbering.
I don't remember very well what Martin told me. I think he said he deleted some files since they were "wrong takes". This would mean that these gaps are there intentionally and I should ignore them.
But: It is unlikely that I can rely on this info, so I should check them manually at some point!
So, in practice, I write a routine that identifies these gaps. (I still have to debug it; there seems to be an error in it now, but I cannot fix it without the files. So I am stuck here for the moment.) When I find one of these gaps, I create a dummy entry in the filecache which says "Controlled file gap".
When I loop over the filecache in the main routine, I do $track_no++ so I jump over this track.
I should not do it like this. This is too automatic. I should create a manual list with all verified gaps. This would force me to look at each and every gap.
PS: I added a list of the gaps as an attachment ("78er.missing_tracks.txt").
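A rough sketch of the manual-list idea, with made-up numbers. It is not the routine from my script: find_gaps, unverified_gaps and the "tape_no/track_no" key format are hypothetical, and I assume track numbering starts at 1. Only gaps that appear in the hand-written list are accepted; everything else is reported.

```perl
use strict;
use warnings;

# Return the track numbers missing between 1 and the highest track seen.
sub find_gaps {
    my (@track_nos) = @_;
    return () unless @track_nos;
    my %seen = map { $_ => 1 } @track_nos;
    my ($max) = sort { $b <=> $a } @track_nos;
    return grep { !$seen{$_} } 1 .. $max;
}

# Return only those gaps that are not in the manually verified list.
sub unverified_gaps {
    my ( $tape_no, $verified, @track_nos ) = @_;
    return grep { !$verified->{"$tape_no/$_"} } find_gaps(@track_nos);
}

# Hypothetical manual list of verified gaps, keyed by "tape_no/track_no".
my %verified_gap = ( '12/3' => 1 );

# Hypothetical track numbers found on disk for tape 12.
my @tracks_on_disk = ( 1, 2, 4, 5, 7 );

my @todo = unverified_gaps( 12, \%verified_gap, @tracks_on_disk );
print "tape 12, track $_: unverified gap, check manually\n" for @todo;
```

Here tracks 3 and 6 are missing, but only track 6 is reported, because 3 is on the verified list.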