Avoiding duplicates on import

I’m importing data from a CSV file, one line per record. The import method tidies up the data before creating each new record. If I find a duplicate, I ask if it’s OK to add it (they’re rare, but some duplicates are permissible).

So I have an ItemExists method, which is

C_OBJECT($sel)
$sel:=ds.items.query("date = :1 and amount = :2 and balance = :3 and desc = :4";[items]date;[items]amount;[items]balance;[items]desc)
$0:=($sel.length#0)

It turns out that the import spends about 99% of its time in this method.

Before I spend time re-inventing the wheel, is there a generally-approved and fast method of finding duplicates? I could add a hash of some sort to each record, I suppose: is there a better way?

Jeremy

Have you tried SET QUERY DESTINATION?
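
For context, that suggestion refers to the classic-language command, which makes QUERY return only a count instead of building a selection. A minimal sketch, assuming the cleaned-up values from the current CSV line are in $date, $amount, $balance and $desc (illustrative names):

C_LONGINT($found)
SET QUERY DESTINATION(Into variable;$found)  // QUERY will only count the matches
QUERY([items];[items]date=$date;*)
QUERY([items]; & ;[items]amount=$amount;*)
QUERY([items]; & ;[items]balance=$balance;*)
QUERY([items]; & ;[items]desc=$desc)
SET QUERY DESTINATION(Into current selection)  // restore normal query behaviour
$0:=($found#0)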

You could concatenate the string version of each field, separated by some character like “•”. That key could go either into an indexed field or into a sorted array to be searched (see the sketch below).
If 4D’s internal comparison stops after the first mismatch, the field with the most unique values should go first in the built key; for instance, if every description were different, desc would lead. Not sure how all that would benchmark…
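
A rough sketch of that idea, using an in-memory collection of keys rather than a sorted array ($existingKeys is an illustrative name, built once from the table before the import starts):

// build a single comparable key from the four fields, with “•” as the separator
C_TEXT($key)
$key:=String([items]date)+"•"+String([items]amount)+"•"+String([items]balance)+"•"+[items]desc

If ($existingKeys.indexOf($key)=-1)  // -1 means not found, i.e. no duplicate
	// ...create the record, then remember its key
	$existingKeys.push($key)
End if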

Thanks, but I tried that, Bertrand. It was even slower.

Jeremy

Hi Jeremy,

Before you start the import, load a collection with the relevant attributes:

$baseCollection:=ds.items.all().toCollection("date";"amount";"balance";"desc")

then you can look for a match

$found:=$baseCollection.query(…).length
If ($found=0)
	// do whatever you need, but add the line
	$baseCollection.push(New object("date";…))
End if

Lahav
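
Filled out with the four attributes already used in the original query, that approach might look something like this (a sketch; $date, $amount, $balance and $desc stand for the cleaned-up values of the current CSV line, and entity creation is omitted):

C_COLLECTION($baseCollection;$match)
$baseCollection:=ds.items.all().toCollection("date";"amount";"balance";"desc")

// for each CSV line to import:
$match:=$baseCollection.query("date = :1 and amount = :2 and balance = :3 and desc = :4";$date;$amount;$balance;$desc)
If ($match.length=0)
	// ...create and save the new record here...
	// then keep the in-memory collection in step with the table
	$baseCollection.push(New object("date";$date;"amount";$amount;"balance";$balance;"desc";$desc))
End if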

You could ask the ORDA dataClass.query() to return queryPlan and queryPath.

Maybe you can use a composite index that includes multiple fields.

That said, if you perform the query during the import itself, the search may not benefit much from an index anyway, since the index has to be updated each time you add a record.
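
For reference, requesting queryPlan and queryPath looks something like this (a sketch based on the original ItemExists query; the extra settings object is optional):

C_OBJECT($settings;$sel)
$settings:=New object("queryPlan";True;"queryPath";True)
$sel:=ds.items.query("date = :1 and amount = :2 and balance = :3 and desc = :4";[items]date;[items]amount;[items]balance;[items]desc;$settings)

// the entity selection now carries a description of how the query was executed
ALERT(JSON Stringify($sel.queryPlan))  // the planned strategy (which fields, which indexes)
ALERT(JSON Stringify($sel.queryPath))  // what was actually done, with timings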

Thank you all for the comments. I’ve now reduced the time taken for import by more than 98%. Excellent.

Jeremy

Would you mind posting the final solution?


Guy,

I used a combination of Lahav’s suggestion (thanks, Lahav) and something I should have done in the first place: careful consideration of the data to be imported, which, on reflection, was all fairly recent. I now do this:

  1. a quick preliminary scan through the data, noting the earliest date of any item to be imported;

  2. build a collection of all items already in the table dated on or after that date: $alreadyThere:=ds.items.query("date >= :1";$earliestDate).toCollection()

  3. check for duplicates only in that collection, using $alreadyThere.query(…).length, remembering to add each new item to it as well as to the table (see the sketch below).
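
Put together, the flow looks roughly like this (a sketch; it assumes the CSV lines have already been parsed into a collection $rowsToImport of objects with date, amount, balance and desc properties, and the variable names are illustrative):

C_COLLECTION($rowsToImport;$alreadyThere)
C_OBJECT($row)
C_DATE($earliestDate)

// 1. preliminary scan: earliest date among the items to be imported
$earliestDate:=$rowsToImport.orderBy("date asc")[0].date

// 2. load only the existing items dated on or after that date
$alreadyThere:=ds.items.query("date >= :1";$earliestDate).toCollection()

// 3. duplicate check against the in-memory collection, not the table
For each ($row;$rowsToImport)
	If ($alreadyThere.query("date = :1 and amount = :2 and balance = :3 and desc = :4";$row.date;$row.amount;$row.balance;$row.desc).length=0)
		// ...create and save the record (or ask about the duplicate)...
		$alreadyThere.push($row)  // keep the collection in step with the table
	End if
End for each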

It’s a good reminder to me of the old adage (I paraphrase) that you can optimise the hell out of your code but you’ll achieve an awful lot more, and probably a lot more quickly, if you check that the algorithm is optimal. Lesson learned, at least until next time.

Jeremy
