Skip to content

[BUG] deeplake 4.5.10 raises Dtype is unknown error for int * JSON, but not for JSON * JSON #3149

@zhuang-keju

Description

@zhuang-keju

Severity

P2 - Not urgent, nice to have

Current Behavior

Deeplake raises Dtype is unknown error for filter = """(-21877) >= (-1093 * f6['e3'])""".

Steps to Reproduce

import deeplake as deeplake_lib
from deeplake import types as dl_types
import numpy as np


# create dataset
path = "mem://test_collection"
try: deeplake_lib.delete(path)
except: pass
ds = deeplake_lib.create(path)

# create schema
ds.add_column("f0", dl_types.Int64())
ds.add_column("f1", dl_types.Array("float32", 1))
ds.add_column("f2", dl_types.Float32())
ds.add_column("f3", dl_types.Float32())
ds.add_column("f4", dl_types.Dict())
ds.add_column("f5", dl_types.Embedding(48, dtype='float32'))
ds.add_column("f6", dl_types.Dict())
ds.add_column("f7", dl_types.Text())
ds.add_column("f8", dl_types.Embedding(46, dtype='float32'))

# define data
data_list = [...]

# insert data
pkey_name = 'f0'
pk_to_idx = {}
if pkey_name is not None and len(ds) > 0:
    col = ds[pkey_name]
    for i, v in enumerate(col):
        pk_to_idx[v.item()] = i
for data in data_list:
    pk = data.get(pkey_name) if pkey_name is not None else None
    idx = pk_to_idx.get(pk) if pk is not None else None
    if idx is None:
        row = {k: [v] for k, v in data.items()}
        ds.append(row)
        if pk is not None:
            pk_to_idx[pk] = len(ds) - 1
    else:
        for field, v in data.items():
            if field in ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8']:
                ds[field][idx] = v
ds.commit()

# create scalar index
ds["f0"].create_index(dl_types.Inverted)
ds["f1"].create_index(dl_types.Inverted)

# search
target_vector = [0.61611, -0.62616, 0.0, -0.80932, 0.80428, 0.0, 0.73469, 0.34276, 0.17649, 0.0, 0.93962, 0.00258, 0.0, 0.64142, 0.0, 0.37028, 0.41766, 0.0, 0.0, -0.42989, -0.77392, -0.12478, -0.70494, -0.95591, 0.0, -0.33446, 0.0, -0.3029, -0.75286, 0.94385, 0.0543, 0.10103, 0.30253, 0.21931, -0.25426, -0.30362, 0.14265, -0.60801, 0.34153, 0.16569, -0.24234, -0.23225, 0.48922, -0.39635, 0.24076, -0.63206, -0.14048, -0.45444]
anns_field = 'f5'
filter = """(-21877) >= (f4['e3'] * f6['e3'])"""
limit = 2309
q_correct = f"SELECT *, COSINE_SIMILARITY(f5, ARRAY{repr(target_vector)}) AS distance_score WHERE {filter} ORDER BY COSINE_SIMILARITY(f5, ARRAY{repr(target_vector)}) DESC LIMIT 2309"
view = ds.query(q_correct)
print("first query executed successfully")

filter = """(-21877) >= (-1093 * f6['e3'])"""
q_error = f"SELECT *, COSINE_SIMILARITY(f5, ARRAY{repr(target_vector)}) AS distance_score WHERE {filter} ORDER BY COSINE_SIMILARITY(f5, ARRAY{repr(target_vector)}) DESC LIMIT 2309"
view = ds.query(q_error) # this raises the error
print("second query executed successfully") # should not be printed if error triggered.

here is the full script
deeplake_bug_trigger.py.txt

Expected/Desired Behavior

Deeplake raises Dtype is unknown error for filter = """(-21877) >= (-1093 * f6['e3'])""", but not for filter = """(-21877) >= (f4['e3'] * f6['e3'])""". f4 is also a JSON field, and some f4 doesn't have the e3 field. This means f4[e3] is more difficult to handle than -1093. If deeplake can handle the second filter of JSON * JSON, it should also be able to handle int * JSON.

Python Version

3.10

OS

22.04

IDE

No response

Packages

No response

Additional Context

No response

Possible Solution

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR (Thank you!)

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions