Preparing the MS MARCO dataset for RAG

a look at the MS MARCO dataset and how to prepare it for RAG
RAG
LLM
Metrics
Evaluation
Published

January 25, 2025

Preparing the data

This data is the MS MARCO dataset, which is a large dataset that can be used to train, test, and evaluate RAG models. I’m going to be exploring the data in polars and make it simpler to work with.

The MS MARCO (Microsoft MAchine Reading COmprehension) dataset is a large-scale collection designed to advance machine reading comprehension and question-answering research. It comprises real anonymized user queries from Bing’s search logs, each paired with human-generated answers and relevant passages extracted from web documents. This dataset is widely used for training and evaluating models in natural language understanding tasks.

Columns in the Dataset according to the documentation: - query: The user’s search query in text format. - query_id: A unique integer identifier for each query. - query_type: The classification of the query, indicating its nature or intent. - passages: A sequence containing passages related to the query. Each passage includes: - passage_text: The text content of the passage. - is_selected: A binary indicator (0 or 1) denoting whether the passage was selected as relevant to the query. - answers: Contains the human-generated answers to the query. - wellFormedAnswers: Provides well-structured answers, if available, offering a more polished response to the query.

First I’m going to load the data into a polars dataframe.

from datasets import load_dataset
import polars as pl
import os
from dotenv import load_dotenv

load_dotenv()


OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

if "OPENAI_API_KEY" in os.environ:
    print("OPENAI_API_KEY is present")
else:
    print("OPENAI_API_KEY is not present")
OPENAI_API_KEY is present

next step is to download the data set using the huggingface datasets library.

ds = load_dataset("microsoft/ms_marco", "v2.1")

it contains 3 datasets, train, validation, and test.

ds
DatasetDict({
    validation: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101093
    })
    train: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 808731
    })
    test: Dataset({
        features: ['answers', 'passages', 'query', 'query_id', 'query_type', 'wellFormedAnswers'],
        num_rows: 101092
    })
})

For the moment, I’m going to be working with a subset of the train dataset for speed and simplicity.

ds_sample = ds['train'].select(range(1000))

Convert the dataset to a polars DataFrame taking only the train as its smaller to work with

df_ms_marco = pl.DataFrame({
    'query': ds_sample['query'],
    'passages': ds_sample['passages'], 
    'answers': ds_sample['answers'],
    'query_id': ds_sample['query_id'],
    'query_type': ds_sample['query_type'],
    'wellFormedAnswers': ds_sample['wellFormedAnswers']
})

we have 1000 rows and 6 columns which matches my expectations.

df_ms_marco.shape
(1000, 6)

The schema shows the columns and their data types. Interestingly the answers and wellFormedAnswers are lists so there might be multiple answers.

df_ms_marco.schema
Schema([('query', String),
        ('passages',
         Struct({'is_selected': List(Int64), 'passage_text': List(String), 'url': List(String)})),
        ('answers', List(String)),
        ('query_id', Int64),
        ('query_type', String),
        ('wellFormedAnswers', List(String))])

Obervations from the selected columns/rows: - the querys dont look well formatted, there are some special characters that not always question marks after the query so reflect real user queries which matches real world data. - the passages column is a list of dictionaries, each dictionary contains the passage_text, is_selected, and some links. This needs further analysis to understand the data. - the answers column is a list of strings so could be more than one answer to a query. might be wise to remove entries with more than one answer to make the data easier to work with. - the wellFormedAnswers column is a list of strings so could be more than one answer to a query. Looks like most are blank, ill take a futher looks at them to see if they are useful.

df_ms_marco[:10]
shape: (10, 6)
query passages answers query_id query_type wellFormedAnswers
str struct[3] list[str] i64 str list[str]
")what was the immediate impact of the success of the manhattan project?" {[1, 0, … 0],["The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.", "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.", … "One of the main reasons Hanford was selected as a site for the Manhattan Project's B Reactor was its proximity to the Columbia River, the largest river flowing into the Pacific Ocean from the North American coast."],["http://www.pitt.edu/~sdb14/atombomb.html", "http://www.osti.gov/accomplishments/manhattan_story.html", … "https://www.atomicheritage.org/history/environmental-consequences"]} ["The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."] 1185869 "DESCRIPTION" []
"_________ justice is designed to repair the harm to victim, the community and the offender caused by… {[0, 0, … 0],["group discussions, community boards or panels with a third party, or victim and offender dialogues, and requires a skilled facilitator who also has sufficient understanding of sexual assault, domestic violence, and dating violence, as well as trauma and safety issues.", "punishment designed to repair the damage done to the victim and community by an offender's criminal act. Ex: community service, Big Brother program indeterminate sentence", … "The circle includes a wide range of participants including not only the offender and the victim but also friends and families, community members, and justice system representatives. The primary distinction between conferencing and circles is that circles do not focus exclusively on the offense and do not limit their solutions to repairing the harm between the victim and the offender."],["https://www.justice.gov/ovw/file/926101/download", "https://quizlet.com/1128245/criminal-justice-exam-1-flash-cards/", … "https://www.sciencedirect.com/science/article/pii/B978145572599100014X"]} ["Restorative justice that fosters dialogue between victim and offender has shown the highest rates of victim satisfaction and offender accountability."] 1185868 "DESCRIPTION" []
"why did stalin want control of eastern europe" {[0, 0, … 0],["Western betrayal. The concept of Western betrayal refers to the view that the United Kingdom and France failed to meet their legal, diplomatic, military and moral obligations with respect to the Czech and Polish nations during the prelude to and aftermath of the Second World War.", "The Tuvan People's Republic, was proclaimed independent in 1921 and was a satellite state of Soviet Union until its annexation in 1944 by the Soviet Union. Another early Soviet satellite state in Asia was the short-lived Far East Republic in Siberia. Post-World War II", … "Not just Eastern Europe, Stalin wanted to control the world by creating an empire based on Communism. That's how they clashed with US who intended to control the world by building an empire based on Capitalism. That's what the Cold War is about."],["https://en.wikipedia.org/wiki/Western_betrayal", "https://en.wikipedia.org/wiki/Satellite_state", … "https://answers.yahoo.com/question/index?qid=20090928204102AAGBpWx"]} ["The reasons why Stalin wanted to control Eastern Europe are Russia has historically no secure border and they wanted to set up satellite countries."] 1185854 "DESCRIPTION" []
"why do nails get rusty" {[0, 0, … 0],["what to Do If I Stepped on Rusty Nail: Prevent Getting Tetanus. What to do if you step on a nail and are afraid to get Tetanus? Tetanus vaccines are available to help the body fight off the bacteria that causes this infection. Patients as young as two months are able to receive a tetanus shot.", "Just asked! See more. 1  What is the minimum coefficient of friction ... Answer. 13 minutes ago. 2  How do you find the derivate of inverse sin ... Answer. 14 minutes ago. 3  You found the perfect pair of shorts for $24.99 ... Answer. 20 minutes ago. 4  What are the asymptote(s) and hole(s), if any, of ... Answer. 35 minutes ago.", … "Quick Answer. Nails rust in water because water allows the iron to react with any oxygen present, which forms iron oxide, known as rust. In order to cause rust quickly, there must be some impurities in the water, particularly salts, since these speed up the transfer of electrons from iron to oxygen. Continue Reading"],["http://www.healthcare-online.org/Stepped-On-Rusty-Nail.html", "https://socratic.org/questions/how-to-explain-the-law-of-conservation-of-mass-using-a-nail-rusting-in-air", … "https://www.reference.com/beauty-fashion/nails-rust-water-a757692adb7c0eb4"]} ["Nails rust in water because water allows the iron to react with any oxygen present, which forms iron oxide. Nails rust due to presence of some impurities in the water, particularly salts, which speeds up the transfer of electrons from iron to oxygen."] 1185755 "DESCRIPTION" []
"depona ab" {[0, 0, … 0],["A preview of what LinkedIn members have to say about Göran: Göran är en mycket duktig och noggrann säljare och ledare. Under görans ledning och stöd byggde vi en effektiv och stark sälj/marknads organisation för outsourcing av dokumenthanteringstjänster i stor skala. Jag jobbar gärna med Göran närsomhelst och rekommenderar gärna Göran som er nästa sälj/marknadschef.", "Depona Oy YTJ: Y-tunnus: 21527338 YTJ: Toimiala: Kirjastojen ja arkistojen toiminta (TOL: 91010) YTJ: Toimialakuvaus: (Päivitetty: 01.04.2008)", … "Connecting decision makers to a dynamic network of information, people and ideas, Bloomberg quickly and accurately delivers business and financial information, news and insight around the world."],["https://www.linkedin.com/in/goranaxelsson", "https://www.kauppalehti.fi/yritykset/yritys/depona+oy/21527338", … "https://www.bloomberg.com/profiles/companies/5175594Z:SS-depona-ab"]} ["Depona Ab is a library in Vilhelmina, Sweden."] 1184773 "DESCRIPTION" []
"is the atlanta airport the busiest in the world" {[0, 0, … 0],["While Chicago O’Hare in 2014 was briefly the world’s busiest in flight counts, Atlanta has had the most flights for the past two years. Hartsfield-Jackson had a 1.8 percent increase in flights in 2016, while Chicago O’Hare had a 0.9 percent decline.", "More than 104 million travelers passed through Atlanta's airport last year, making the Delta Air Lines hub the world's busiest for passenger traffic, according to Airports Council International. The airport handles around 2,500 arrivals and departures, according to the airport's figures.", … "Atlanta Hartsfield-Jackson International Airport, the busiest airport in the world, was hit with extended rain and high winds from Irma on Monday while many of Florida's airports remained closed as officials assess damage and try clean up from the powerful storm."],["http://www.ajc.com/travel/hartsfield-jackson-retains-title-world-busiest-airport/1QbkFE3E6DBckQMPhMtROJ/", "https://www.cnbc.com/2017/12/17/atlanta-airport-the-worlds-busiest-reports-power-outage.html", … "http://money.cnn.com/2017/09/11/news/companies/atlanta-atl-airport-irma-disruption/index.html"]} ["No Answer Present."] 1174762 "LOCATION" []
"nyu tuition cost" {[0, 1, … 0],["the cost of attending new york university is comparable to that of other selective private institutions new york university charges tuition and registration fees on a per unit basis for 2015 2016 the tuition rate is expected to be $ 1616 per unit plus registration and service feesthe estimated total tuition for the ms program is $ 51654 the board of trustees of new york university reserves the right to alter tuition and feesfinal tuition and fees for the 2015 16 academic year will be made official by april of this yearhe board of trustees of new york university reserves the right to alter tuition and fees final tuition and fees for the 2015 16 academic year will be made official by april of this year", "tuition for new york university is $ 43746 for the 2014 2015 academic year this is 73 % more expensive than the national average private non profit four year college tuition of $ 25240he net out of pocket total cost you end up paying or financing though student loans is known as the net price the reported new york university net price for in state students $ 34268 for the 2013 2014 academic year this net price includes housing and meal expenses", … "the net out of pocket total cost you end up paying or financing though student loans is known as the net price the reported new york university net price for in state students $ 34268 for the 2013 2014 academic year this net price includes housing and meal expenseshe net out of pocket total cost you end up paying or financing though student loans is known as the net price the reported new york university net price for in state students $ 34268 for the 2013 2014 academic year this net price includes housing and meal expenses"],["http://cusp.nyu.edu/tuition-and-fees/", "http://www.collegecalc.org/colleges/new-york/new-york-university/", … "http://www.collegecalc.org/colleges/new-york/new-york-university/"]} ["$43,746 for the 2014-2015 academic year."] 467556 "NUMERIC" []
"at what age do kids start to hold memories" {[0, 0, … 0],["In an effort to better understand how children form memories, the researchers asked 140 kids between the ages of 4 and 13 to describe their earliest memories and then asked them to do the same thing two years later.", "Conversely, a third of the children who were age 10 to 13 during the first interview described the same earliest memory during the second interview. More than half of the memories they recalled were the same at both interviews. The researchers are now studying why children remember certain events and not others.", … "Recent research on the development of memory has indicated that declarative, or explicit memory, may exist in infants who are even younger than two years old. For example, newborns who are less than 3 days old demonstrate a preference for their mother’s own voice."],["http://www.webmd.com/parenting/news/20110511/when-do-kids-form-their-first-memories", "http://www.webmd.com/parenting/news/20110511/when-do-kids-form-their-first-memories", … "https://en.wikipedia.org/wiki/Memory_development"]} ["Before the age of 2–4 years."] 28213 "NUMERIC" []
"average teeth brushing time" {[0, 0, … 0],["Your dentist or oral surgeon may use one of three types of anesthesia, depending on the expected complexity of the wisdom tooth extraction and your comfort level. Local anesthesia. Your dentist or oral surgeon administers local anesthesia with one or more injections near the site of each extraction.", "Your dentist or oral surgeon may use one of three types of anesthesia, depending on the expected complexity of the wisdom tooth extraction and your comfort level. Options include: Local anesthesia.", … "In fact, according to the Delta Dental survey, people who brush at least twice a day are 22 percent more likely to describe their oral health as good or better compared with those who brush less frequently. Unfortunately, 23 percent of Americans have gone two or more days without brushing their teeth in the past year."],["http://www.mayoclinic.org/tests-procedures/wisdom-tooth-extraction/basics/what-you-can-expect/PRC-20020652", "http://www.mayoclinic.org/tests-procedures/wisdom-tooth-extraction/basics/what-you-can-expect/PRC-20020652", … "https://www.deltadental.com/Public/NewsMedia/NewsReleaseDentalSurveyFindsShortcomings_201409.jsp"]} ["Americans brush for just under the two minutes on average."] 44588 "NUMERIC" []
"is funner a word?" {[0, 0, … 0],["Taken from Wiktionary: Funnest is a regular superlative of the adjective fun. However, the use of fun as an adjective is itself still often seen as informal or casual and to be avoided in formal writing, and this would apply equally to the superlative form.", "adjective, funnier, funniest. 1. providing fun; causing amusement or laughter; amusing; comical: a funny remark; a funny person. 2. attempting to amuse; facetious:", … "something that provides mirth or amusement: A picnic would be fun. 2. enjoyment or playfulness: She's full of fun. verb (used with or without object), funned, funning."],["https://english.stackexchange.com/questions/4066/is-funnest-a-word", "http://www.dictionary.com/browse/funnier", … "http://www.dictionary.com/browse/funner"]} ["Yes, funner is a word."] 410717 "DESCRIPTION" []

Lets expand passages column for a select list.. like I mentioned earlier, the query column is not well formatted, I think it will work for our purpose but I would like to get a better understanding of it.

for x in df_ms_marco["query"].to_list()[:20]:
    print(x)
)what was the immediate impact of the success of the manhattan project?
_________ justice is designed to repair the harm to victim, the community and the offender caused by the offender criminal act. question 19 options:
why did stalin want control of eastern europe
why do nails get rusty
depona ab
is the atlanta airport the busiest in the world
nyu tuition cost
at what age do kids start to hold memories
average teeth brushing time
is funner a word?
what direction does phloem flow
what is ce certified
artin chicken mcdonalds calories
is panglao island safe
what is calomel powder used for?
sum of squares of even numbers formula
bancfirst customer service number
what is happening in montreal in august?
what county is dewitt michigan in?
how to create msn email address

Lets do the same for the answers column… Some entries dont have answers, they should be removed as they wont be useful for retrieval as they dont have any answers. Some of the answers are non neglish too. I will remove these entries as well.

for x in df_ms_marco["answers"].to_list()[:20]:
    print(x)
['The immediate impact of the success of the manhattan project was the only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated.']
['Restorative justice that fosters dialogue between victim and offender has shown the highest rates of victim satisfaction and offender accountability.']
['The reasons why Stalin wanted to control Eastern Europe are Russia has historically no secure border and they wanted to set up satellite countries.']
['Nails rust in water because water allows the iron to react with any oxygen present, which forms iron oxide. Nails rust due to presence of some impurities in the water, particularly salts, which speeds up the transfer of electrons from iron to oxygen.']
['Depona Ab is a library in Vilhelmina, Sweden.']
['No Answer Present.']
['$43,746 for the 2014-2015 academic year.']
['Before the age of 2–4 years.']
['Americans brush for just under the two minutes on average.']
['Yes, funner is a word.']
['No Answer Present.']
['An abbreviation of Conformite Conformité, europeenne Européenne Meaning. european conformity.']
['No Answer Present.']
['No Answer Present.']
['No Answer Present.']
['No Answer Present.']
['The customer support phone number of BancFirst is +1(405) 948-7930, 1-855-889-9216, 1-888-200-9149.']
['No Answer Present.']
['DeWitt is located in Clinton County, Michigan, United States.']
['No Answer Present.']

We’re now looking at statistical data so we’ll be better loading the whole dataset, this takes 7 minutes to laod the dataset while the subset takes seconds. Polars is faster than pandas due to its lazy evaluation and parallel processing capabilities and im using a GPU compute instance so it should use the GPU cores.

# df_ms_marco_full = pl.DataFrame({
#     'query': ds['train']['query'],
#     'passages': ds['train']['passages'], 
#     'answers': ds['train']['answers'],
#     'query_id': ds['train']['query_id'],
#     'query_type': ds['train']['query_type'],
#     'wellFormedAnswers': ds['train']['wellFormedAnswers']
# })

Just for testing, i will save to parquet to see if it saves quicker than reading from huggingface datasets….and its 7 times faster.

#df_ms_marco_full.write_parquet("ms_marco_full.parquet")

and also much quicker to read. I will use this method to load the data in the future to save time

df_ms_marco_full = pl.read_parquet("ms_marco_full.parquet")

Lets count the answers to see what values are …. answers can have 1 or more, maybe filter out anything greater than 1 to make the data easier to work with.

df_ms_marco_full["answers"].list.len().value_counts()
shape: (6, 2)
answers count
u32 u32
3 22
4 9
5 1
1 794932
6 1
2 13766

Lots of differnt types of data to explore, I asked claude to help me understand the types and here is what it said:

  • DESCRIPTION (most common) - General descriptive queries asking for explanations, processes, or information
  • NUMERIC - Queries looking for numerical answers (dates, measurements, quantities)
  • ENTITY - Queries about specific things, organizations, or concepts
  • LOCATION - Queries about places, geography, or spatial information
  • PERSON (least common) - Queries about specific people

For example from the data:

  • PERSON query: “who wrote romeo and juliet?”
  • LOCATION query: “what is the capital of france?”
  • NUMERIC query: “how tall is mount everest?”
  • DESCRIPTION query: “why do nails get rusty?”
  • ENTITY query: “what is ce certified?”
df_ms_marco_full["query_type"].value_counts()
shape: (5, 2)
query_type count
str u32
"NUMERIC" 212856
"PERSON" 45727
"DESCRIPTION" 427514
"ENTITY" 71741
"LOCATION" 50893

The wellformed answers are answers been re-written by subject experst and should be better quality. Lets check out how many of the dataset has wellformed answers.

for row in df_ms_marco_full["wellFormedAnswers"].list.len().value_counts().sort("wellFormedAnswers").rows():
    print(row)
(0, 655006)
(1, 149282)
(2, 3761)
(3, 139)
(4, 64)
(5, 33)
(6, 27)
(7, 28)
(8, 19)
(9, 25)
(10, 16)
(11, 21)
(12, 23)
(13, 25)
(14, 14)
(15, 15)
(16, 14)
(17, 14)
(18, 17)
(19, 14)
(20, 13)
(21, 12)
(22, 12)
(23, 13)
(24, 10)
(25, 10)
(26, 9)
(27, 4)
(28, 10)
(29, 12)
(30, 9)
(31, 9)
(32, 3)
(33, 4)
(34, 7)
(35, 5)
(36, 6)
(37, 2)
(38, 2)
(39, 4)
(40, 3)
(41, 2)
(42, 2)
(43, 1)
(44, 1)
(45, 1)
(47, 4)
(48, 1)
(49, 2)
(53, 1)

It looks like there are lots of queries with wellformed answers, lets bring up a few and explore and see if they are better quality.

pl.Config.set_fmt_str_lengths(100)  # Sets maximum string length for display

df_ms_marco_full.filter(pl.col("wellFormedAnswers").list.len() > 0)
shape: (153_725, 6)
query passages answers query_id query_type wellFormedAnswers
str struct[3] list[str] i64 str list[str]
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."]
"what is a omurice omelet" {[0, 0, … 0],["In one common rendition, the rice is fried with ketchup, and extra ketchup is squeezed on top as a garnish. In another popular one, seen in the Kyoto video, the chef uses demi-glace (a rich, veal stock–based sauce) to both fry the rice and top the omelette. Japanese mayo is often also squeezed on top.", "오무라이스. 1  Omurice is a contemporary Asian dish consisting of an omelette made with fried rice. Its name derives from the combination of the English words omelette and rice. 2  A relatively simple dish, it typically calls for rice fried with ketchup, chicken and onions wrapped in a thin sheet of fried egg.", … "A cut-open omurice with ketchup. Omurice or omu-rice (オムライス, Omu-raisu) is an example of yōshoku (a Western-influenced style of Japanese cuisine) consisting of an omelette made with fried rice and usually topped with ketchup. With omu and raisu being contractions of the words omelette and rice, the name is an example of wasei-eigo."],["http://www.seriouseats.com/2016/08/how-to-make-omurice-japanese-omelette-fried-rice.html", "http://trifood.com/omurice.asp", … "https://en.wikipedia.org/wiki/Omurice"]} ["A contemporary Asian dish consisting of an omelette made with fried rice."] 693333 "DESCRIPTION" ["An omurice omelet is a contemporary Asian dish consisting of an omelette made with fried rice."]
"how much does it cost to change a jeep alternator" {[1, 0, … 0],["The average cost for a Jeep Compass Alternator Replacement is between $377 and $577. Labor costs are estimated between $123 and $156 while parts are priced between $254 and $421. Estimate does not include taxes and fees. Get a Personalized Estimate for your Jeep Compass.", "Well if a alt from autozone costs $150 any shop(and I mean any) will mark that up to $200 or or more(standard is 150% of cost) and then you got labor added on top of that and most labor charges are $80-$120 per hour depending on location.Then you got the diag fees to test the vehicle to find out what is wrong.", … "Definitely depends on the vehicle and where you live, but if you’re looking to have an auto repair shop replace the alternator, the national average on RepairPal is between $315 to $559, including quality parts and labor costs. (Alternator Replacement Cost - RepairPal Estimate)"],["https://repairpal.com/estimator/jeep/compass/alternator-replacement-cost", "http://www.jeepforum.com/forum/f292/cost-alternator-1115135/", … "https://www.quora.com/How-much-is-the-average-cost-to-replace-an-alternator"]} ["A Jeep Compass Alternator Replacement is between $377 and $577."] 315061 "NUMERIC" ["The cost for a Jeep Compass Alternator Replacement is between $377 and $577."]
"what does folic acid do" {[0, 0, … 0],["Folate and folic acid are forms of a water-soluble B vitamin. Folate occurs naturally in food, and folic acid is the synthetic form of this vitamin. Since 1998, folic acid has been added to cold cereals, flour, breads, pasta, bakery items, cookies, and crackers, as required by federal law.", "Folic acid is a type of B vitamin. It is the man-made (synthetic) form of folate that is found in supplements and added to fortified foods. Folate is a generic term for both naturally occurring folate found in foods and folic acid.", … "Women who are pregnant or might become pregnant take folic acid to prevent miscarriage and “neural tube defects,” birth defects such as spina bifida that occur when the fetus’s spine and back do not close during development. Some people use folic acid to prevent colon cancer or cervical cancer."],["http://www.webmd.com/vitamins-supplements/ingredientmono-1017-FOLIC%20ACID.aspx?activeIngredientId=1017&activeIngredientName=FOLIC%20ACID", "https://www.nlm.nih.gov/medlineplus/ency/article/002408.htm", … "http://www.webmd.com/vitamins-supplements/ingredientmono-1017-FOLIC%20ACID.aspx?activeIngredientId=1017&activeIngredientName=FOLIC%20ACID"]} ["Crucial for proper brain function and plays an important role in mental and emotional health."] 637779 "DESCRIPTION" ["Folic acid is crucial for proper brain function and plays an important role in mental and emotional health."]
"what county is seminole fl in" {[0, 1, … 0],["1 To change an address on a registration, tag, title or boat/vessel/trailer, see a Motor Vehicle Service Center above or visit www.gorenew.com.  To change an address on a driver license, see a Driver License Service Center above or visit www.gorenew.com.", "The water tower in Seminole at 113th Street and Park Boulevard. Seminole is a city in Pinellas County, Florida, United States. The population was 17,233 at the 2010 census, up from 10,890 in 2000. St. Petersburg College has a campus in the city.", … "Report FRAUD, WASTE or ABUSE. Anonymous reporting to the Clerk of Court and Comptroller is available at AlertLine, or by phone at 866-889-8808. Due to Federal regulations, the Clerk's office will no longer process passport applications effective March 31, 2010. Please contact Passport Services at 1 (877) 487-2778."],["https://www.flhsmv.gov/locations/seminole/", "https://en.wikipedia.org/wiki/Seminole,_Florida", … "http://seminoleclerk.org/"]} ["Pinellas County"] 612569 "LOCATION" ["Seminole, Florida is in Pinellas County."]
"what do mink eat" {[0, 0, … 0],["Mink tracks are nearly round with a width of 1¼ - 1¾ inches for the front feet and 2 â¼ inches long for the hind feet. The stride length of their tracks is 12-26 inches apart and both hind and forefeet prints almost touch. Skulls from mink have 34 teeth.", "They have oily guard hairs that waterproof and protect their coats. Predators such as foxes, bobcats, great horned owls and alligators are not a serious threat to mink populations. Mink may hiss, snarl, screech and/or excrete an odorous fluid from their scent glands when threatened.", … "What do they look like? Mink fur is usually dark brown with white patches on the chin, chest, and throat areas. The fur is soft and thick, with oily outer hairs that waterproof the animal's coat. The body is long and slender with short legs and a pointy, flat face. The toes are partially webbed, showing the mink's semi-aquatic nature. Body length is usually around 2 feet or 610 mm with up to half of this length being the tail."],["http://www.georgiawildlife.com/MinkFacts", "http://www.georgiawildlife.com/MinkFacts", … "http://www.biokids.umich.edu/critters/Neovison_vison/"]} ["Rodents, fish, crustaceans, amphibians, and birds."] 623976 "ENTITY" ["Mink eats rodents, fish, crustaceans, amphibians and birds."]
"weight evenflo maestro" {[0, 0, … 0],["Rated 4 out of 5 by Tonibnj3 from Great Second Seat I bought this car seat for my husband's suv. He loves the look of it inside. It is very light and easy to carry. My 2 yr old fits in it nicely and really likes his new seat.", "With sporty looks and award-winning safety features, the Maestro booster seat is a favorite among children, parents & safety advocates alike. Weight: 22 - 50 lbs.", … "Maestro Booster Car Seat. Details. The Maestro harnessed boosters are designed to be used with the five point harness for children 22-50 lbs. Once your child has reached 40 lbs. and at least 4 years of age, it is usually safe to transition into the booster version of this seat with your vehicle's seat belt."],["https://www.kohls.com/product/prd-2057313/evenflo-maestro-booster-car-seat.jsp", "http://www.evenflo.com/car-seats/maestro/us_maestro.html", … "http://www.evenflo.com/car-seats/maestro/us_maestro.html"]} ["11 pounds", "22-50 pounds"] 547249 "NUMERIC" ["The weight of Evenflo Maestro is 11 pounds."]
"what is a cortisone injection used for" {[0, 1, … 0],["What is cortisone (Cortone)? Cortisone is a steroid that prevents the release of substances in the body that cause inflammation. Cortisone is used to treat many different conditions such as allergic disorders, skin conditions, ulcerative colitis, arthritis, lupus, psoriasis, or breathing disorders. Cortisone may also be used for purposes not listed in this medication guide.", "If you have arthritis, you may have considered a cortisone shot as part of your treatment plan. These shots are not pain relievers. Cortisone is a type of steroid, a drug that lowers inflammation, which is something that can lead to less pain. Cortisone injections can be used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon. They can also treat inflammation that is widespread throughout the body, such as with allergic reactions, asthma, and rheumatoid arthritis, which affects many joints.", … "The injections usually comprise a corticosteroid medication and a local anesthetic. Often, you can receive a cortisone shot at your doctor's office. Because of potential side effects, the number of shots you can receive in one year generally is limited. July 02, 2016."],["http://www.rxlist.com/cortone-drug/patient-images-side-effects.htm", "http://www.webmd.com/arthritis/what-are-cortisone-shots", … "http://www.mayoclinic.org/tests-procedures/cortisone-shots/home/ovc-20206814"]} ["Used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon."] 679433 "DESCRIPTION" ["A cortisone injection is used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon."]
"nasa credit union routing number" {[0, 0, … 0],["Complete a Direct Dispute Form if you have information appearing on your credit report that lists NASA Federal Credit Union as the creditor and you would like to dispute it, or if information about your NASA FCU account has been reported to ChexSystems and you would like to dispute it.", "ROUTING NUMBER - 255077833 - NASA FEDERAL CREDIT UNION. The code indicating the ABA number to be used to route or send ACH items to the RFI. 0 = Institution is a Federal Reserve Bank. 1 = Send items to customer routing number. 2 = Send items to customer using new routing number field.", … "Routing Number is a nine-digit numeric code printed on the bottom of checks that is used to facilitate the electronic routing of funds (ACH transfer) from one bank account to another. It’s also referred to as RTN, Routing Transit Number or Bank Routing Number."],["https://www.nasafcu.com/contact/", "http://bank-code.net/routing-numbers/255077833-nasa-federal-credit-union", … "http://bankroutinginfo.com/FedwireRoutingNumbers/MD-BOWIE-NASA_FEDERAL_CREDIT_UNION"]} ["255077833"] 462353 "NUMERIC" ["255077833 is NASA's credit union routing number."]
"name some organisms that might live in a marine biome" {[1, 0, … 0],["ANIMALS. ANIMALS. The marine biome covers three fourths of the earth and there are hundreds of different fish to cover each square foot of it. The main kinds of animals in the sea are whales, dolphins, sharks, and seals which are some of the most popular kinds of sea mammals.ther kinds of animals and fish that live in the sea are the walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish, but this is just a small list compared to the many different kinds of fish that inhabit the marine biome. In general these fish have about five different adaptations.", "The major biomes include: Rainforest, Tundra, Taiga, Deciduous Temperate forest, Desert, Chaparral, Grassland, Freshwater, and Marine. The plant and animal species that live in each biome have special adaptations that help them survive the conditions of that biome.Many species are endemic to certain biomes, meaning they only exist in that biome and nowhere else. Some species are able to adapt to a wider variety of climatic and geographic influences.lants and microorganisms grow and reproduce during the short summers when the soil thaws for a brief time. The types of plants that can survive here include shrubs, sedges, mosses, lichens, grasses, and some flowering or herbaceous plants. There are only 48 known species of land mammals that live in the tundra biome.", … "LOCATION: The marine biome is the biggest biome in the world! It covers about 70% of the earth. It includes five main oceans: the Pacific, Atlantic, Indian, Arctic, and Southern, as well as many smaller Gulfs and Bays. Marine regions are usually very salty!arine plants live in the euphotic zone of the ocean because they need energy from the sun for photosynthesis. ANIMALS: The Earth's oceans are home to most of the planet’s biodiversity. Here we can find mollusks, fish, whales, crustaceans, bacteria, fungi, sea anemones and many other animals."],["http://ths.sps.lane.edu/biomes/marine3/marine3.html", "http://gen.uga.edu/documents/biodiversity/activities/Biomes%20who%20what%20lives.pdf", … "http://kids.nceas.ucsb.edu/biomes/marine.html"]} ["Walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish"] 461916 "ENTITY" ["Walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish are the organisms that might live in a marine biome."]

I can confirm the quality is much better than the answers. I’m going to filter the data so im only using on rows with wellformed answers as they are better quality and will be more useful for retrieval. I will also remove any rows with more than one answer as they are harder to work with.

filtering the data

df_filtered = df_ms_marco_full.filter(pl.col("wellFormedAnswers").list.len() >= 1)
print("Total rows before filter:", f"{df_ms_marco_full.shape[0]:,}")
print("Total rows after filter:", f"{df_filtered.shape[0]:,}")
print("Total rows removed:", f"{df_ms_marco_full.shape[0] - df_filtered.shape[0]:,}")
Total rows before filter: 808,731
Total rows after filter: 153,725
Total rows removed: 655,006
df_filtered = df_filtered.filter(pl.col("answers").list.len() == 1)
print("Total rows after filter:", f"{df_filtered.shape[0]:,}")
Total rows after filter: 148,959
df_filtered
shape: (148_959, 6)
query passages answers query_id query_type wellFormedAnswers
str struct[3] list[str] i64 str list[str]
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."]
"what is a omurice omelet" {[0, 0, … 0],["In one common rendition, the rice is fried with ketchup, and extra ketchup is squeezed on top as a garnish. In another popular one, seen in the Kyoto video, the chef uses demi-glace (a rich, veal stock–based sauce) to both fry the rice and top the omelette. Japanese mayo is often also squeezed on top.", "오무라이스. 1  Omurice is a contemporary Asian dish consisting of an omelette made with fried rice. Its name derives from the combination of the English words omelette and rice. 2  A relatively simple dish, it typically calls for rice fried with ketchup, chicken and onions wrapped in a thin sheet of fried egg.", … "A cut-open omurice with ketchup. Omurice or omu-rice (オムライス, Omu-raisu) is an example of yōshoku (a Western-influenced style of Japanese cuisine) consisting of an omelette made with fried rice and usually topped with ketchup. With omu and raisu being contractions of the words omelette and rice, the name is an example of wasei-eigo."],["http://www.seriouseats.com/2016/08/how-to-make-omurice-japanese-omelette-fried-rice.html", "http://trifood.com/omurice.asp", … "https://en.wikipedia.org/wiki/Omurice"]} ["A contemporary Asian dish consisting of an omelette made with fried rice."] 693333 "DESCRIPTION" ["An omurice omelet is a contemporary Asian dish consisting of an omelette made with fried rice."]
"how much does it cost to change a jeep alternator" {[1, 0, … 0],["The average cost for a Jeep Compass Alternator Replacement is between $377 and $577. Labor costs are estimated between $123 and $156 while parts are priced between $254 and $421. Estimate does not include taxes and fees. Get a Personalized Estimate for your Jeep Compass.", "Well if a alt from autozone costs $150 any shop(and I mean any) will mark that up to $200 or or more(standard is 150% of cost) and then you got labor added on top of that and most labor charges are $80-$120 per hour depending on location.Then you got the diag fees to test the vehicle to find out what is wrong.", … "Definitely depends on the vehicle and where you live, but if you’re looking to have an auto repair shop replace the alternator, the national average on RepairPal is between $315 to $559, including quality parts and labor costs. (Alternator Replacement Cost - RepairPal Estimate)"],["https://repairpal.com/estimator/jeep/compass/alternator-replacement-cost", "http://www.jeepforum.com/forum/f292/cost-alternator-1115135/", … "https://www.quora.com/How-much-is-the-average-cost-to-replace-an-alternator"]} ["A Jeep Compass Alternator Replacement is between $377 and $577."] 315061 "NUMERIC" ["The cost for a Jeep Compass Alternator Replacement is between $377 and $577."]
"what does folic acid do" {[0, 0, … 0],["Folate and folic acid are forms of a water-soluble B vitamin. Folate occurs naturally in food, and folic acid is the synthetic form of this vitamin. Since 1998, folic acid has been added to cold cereals, flour, breads, pasta, bakery items, cookies, and crackers, as required by federal law.", "Folic acid is a type of B vitamin. It is the man-made (synthetic) form of folate that is found in supplements and added to fortified foods. Folate is a generic term for both naturally occurring folate found in foods and folic acid.", … "Women who are pregnant or might become pregnant take folic acid to prevent miscarriage and “neural tube defects,” birth defects such as spina bifida that occur when the fetus’s spine and back do not close during development. Some people use folic acid to prevent colon cancer or cervical cancer."],["http://www.webmd.com/vitamins-supplements/ingredientmono-1017-FOLIC%20ACID.aspx?activeIngredientId=1017&activeIngredientName=FOLIC%20ACID", "https://www.nlm.nih.gov/medlineplus/ency/article/002408.htm", … "http://www.webmd.com/vitamins-supplements/ingredientmono-1017-FOLIC%20ACID.aspx?activeIngredientId=1017&activeIngredientName=FOLIC%20ACID"]} ["Crucial for proper brain function and plays an important role in mental and emotional health."] 637779 "DESCRIPTION" ["Folic acid is crucial for proper brain function and plays an important role in mental and emotional health."]
"what county is seminole fl in" {[0, 1, … 0],["1 To change an address on a registration, tag, title or boat/vessel/trailer, see a Motor Vehicle Service Center above or visit www.gorenew.com.  To change an address on a driver license, see a Driver License Service Center above or visit www.gorenew.com.", "The water tower in Seminole at 113th Street and Park Boulevard. Seminole is a city in Pinellas County, Florida, United States. The population was 17,233 at the 2010 census, up from 10,890 in 2000. St. Petersburg College has a campus in the city.", … "Report FRAUD, WASTE or ABUSE. Anonymous reporting to the Clerk of Court and Comptroller is available at AlertLine, or by phone at 866-889-8808. Due to Federal regulations, the Clerk's office will no longer process passport applications effective March 31, 2010. Please contact Passport Services at 1 (877) 487-2778."],["https://www.flhsmv.gov/locations/seminole/", "https://en.wikipedia.org/wiki/Seminole,_Florida", … "http://seminoleclerk.org/"]} ["Pinellas County"] 612569 "LOCATION" ["Seminole, Florida is in Pinellas County."]
"what does the acronym captcha stand for" {[0, 1, … 0],["Freebase(0.00 / 0 votes)Rate this definition: CAPTCHA. A CAPTCHA is a type of challenge-response test used in computing to determine whether or not the user is human. The term was coined in 2000 by Luis von Ahn, Manuel Blum, Nicholas J. Hopper of Carnegie Mellon University, and John Langford of IBM.", "Answer Wiki. CAPTCHA is an acronym which stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart.Most common form of CAPTCHA is an image of distorted letters. CAPTCHA Technology is based on Turing Test. Turing Test is used to test whether a Machine can think like humans.", … "reCAPTCHA is a user-dialogue system originally developed by Luis von Ahn, Ben Maurer, Colin McMillen, David Abraham and Manuel Blum at Carnegie Mellon University's main Pittsburgh campus, and acquired by Google in September 2009. Like the CAPTCHA interface, reCAPTCHA asks users to enter words seen in distorted text images onscreen."],["http://www.definitions.net/definition/CAPTCHA", "https://www.quora.com/How-does-CAPTCHA-work", … "http://www.abbreviations.com/recaptcha"]} ["Completely Automated Public Turing Test to Tell Computers and Humans Apart."] 649330 "DESCRIPTION" ["The acronym CAPTCHA stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart."]
"what do mink eat" {[0, 0, … 0],["Mink tracks are nearly round with a width of 1¼ - 1¾ inches for the front feet and 2 â¼ inches long for the hind feet. The stride length of their tracks is 12-26 inches apart and both hind and forefeet prints almost touch. Skulls from mink have 34 teeth.", "They have oily guard hairs that waterproof and protect their coats. Predators such as foxes, bobcats, great horned owls and alligators are not a serious threat to mink populations. Mink may hiss, snarl, screech and/or excrete an odorous fluid from their scent glands when threatened.", … "What do they look like? Mink fur is usually dark brown with white patches on the chin, chest, and throat areas. The fur is soft and thick, with oily outer hairs that waterproof the animal's coat. The body is long and slender with short legs and a pointy, flat face. The toes are partially webbed, showing the mink's semi-aquatic nature. Body length is usually around 2 feet or 610 mm with up to half of this length being the tail."],["http://www.georgiawildlife.com/MinkFacts", "http://www.georgiawildlife.com/MinkFacts", … "http://www.biokids.umich.edu/critters/Neovison_vison/"]} ["Rodents, fish, crustaceans, amphibians, and birds."] 623976 "ENTITY" ["Mink eats rodents, fish, crustaceans, amphibians and birds."]
"what is a cortisone injection used for" {[0, 1, … 0],["What is cortisone (Cortone)? Cortisone is a steroid that prevents the release of substances in the body that cause inflammation. Cortisone is used to treat many different conditions such as allergic disorders, skin conditions, ulcerative colitis, arthritis, lupus, psoriasis, or breathing disorders. Cortisone may also be used for purposes not listed in this medication guide.", "If you have arthritis, you may have considered a cortisone shot as part of your treatment plan. These shots are not pain relievers. Cortisone is a type of steroid, a drug that lowers inflammation, which is something that can lead to less pain. Cortisone injections can be used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon. They can also treat inflammation that is widespread throughout the body, such as with allergic reactions, asthma, and rheumatoid arthritis, which affects many joints.", … "The injections usually comprise a corticosteroid medication and a local anesthetic. Often, you can receive a cortisone shot at your doctor's office. Because of potential side effects, the number of shots you can receive in one year generally is limited. July 02, 2016."],["http://www.rxlist.com/cortone-drug/patient-images-side-effects.htm", "http://www.webmd.com/arthritis/what-are-cortisone-shots", … "http://www.mayoclinic.org/tests-procedures/cortisone-shots/home/ovc-20206814"]} ["Used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon."] 679433 "DESCRIPTION" ["A cortisone injection is used to treat inflammation of small areas of the body, such as inflammation of a specific joint or tendon."]
"nasa credit union routing number" {[0, 0, … 0],["Complete a Direct Dispute Form if you have information appearing on your credit report that lists NASA Federal Credit Union as the creditor and you would like to dispute it, or if information about your NASA FCU account has been reported to ChexSystems and you would like to dispute it.", "ROUTING NUMBER - 255077833 - NASA FEDERAL CREDIT UNION. The code indicating the ABA number to be used to route or send ACH items to the RFI. 0 = Institution is a Federal Reserve Bank. 1 = Send items to customer routing number. 2 = Send items to customer using new routing number field.", … "Routing Number is a nine-digit numeric code printed on the bottom of checks that is used to facilitate the electronic routing of funds (ACH transfer) from one bank account to another. It’s also referred to as RTN, Routing Transit Number or Bank Routing Number."],["https://www.nasafcu.com/contact/", "http://bank-code.net/routing-numbers/255077833-nasa-federal-credit-union", … "http://bankroutinginfo.com/FedwireRoutingNumbers/MD-BOWIE-NASA_FEDERAL_CREDIT_UNION"]} ["255077833"] 462353 "NUMERIC" ["255077833 is NASA's credit union routing number."]
"name some organisms that might live in a marine biome" {[1, 0, … 0],["ANIMALS. ANIMALS. The marine biome covers three fourths of the earth and there are hundreds of different fish to cover each square foot of it. The main kinds of animals in the sea are whales, dolphins, sharks, and seals which are some of the most popular kinds of sea mammals.ther kinds of animals and fish that live in the sea are the walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish, but this is just a small list compared to the many different kinds of fish that inhabit the marine biome. In general these fish have about five different adaptations.", "The major biomes include: Rainforest, Tundra, Taiga, Deciduous Temperate forest, Desert, Chaparral, Grassland, Freshwater, and Marine. The plant and animal species that live in each biome have special adaptations that help them survive the conditions of that biome.Many species are endemic to certain biomes, meaning they only exist in that biome and nowhere else. Some species are able to adapt to a wider variety of climatic and geographic influences.lants and microorganisms grow and reproduce during the short summers when the soil thaws for a brief time. The types of plants that can survive here include shrubs, sedges, mosses, lichens, grasses, and some flowering or herbaceous plants. There are only 48 known species of land mammals that live in the tundra biome.", … "LOCATION: The marine biome is the biggest biome in the world! It covers about 70% of the earth. It includes five main oceans: the Pacific, Atlantic, Indian, Arctic, and Southern, as well as many smaller Gulfs and Bays. Marine regions are usually very salty!arine plants live in the euphotic zone of the ocean because they need energy from the sun for photosynthesis. ANIMALS: The Earth's oceans are home to most of the planet’s biodiversity. Here we can find mollusks, fish, whales, crustaceans, bacteria, fungi, sea anemones and many other animals."],["http://ths.sps.lane.edu/biomes/marine3/marine3.html", "http://gen.uga.edu/documents/biodiversity/activities/Biomes%20who%20what%20lives.pdf", … "http://kids.nceas.ucsb.edu/biomes/marine.html"]} ["Walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish"] 461916 "ENTITY" ["Walrus, star fish, eel, crabs, jelly fish, and fresh and salt water fish are the organisms that might live in a marine biome."]

ideas for next time

https://vishalbakshi.github.io/blog/posts/2024-12-24-PLAID-ColBERTv2-scoring-pipeline/#ragatouille-results

this is probably the quickest https://github.com/lancedb/vectordb-recipes/blob/main/examples/Youtube-Search-QA-Bot/main.ipynb

working with lancedb

Actions to perform:

  • Generate Embeddings with OpenAI
  • Store Embeddings in LanceDB
  • Perform Query and Retrieval
  • Evaluate Initial Results
  • Document Observations and Insights

import relevant libraries

import polars as pl
import lancedb

Connect to Lance.db database or create a new one

db = lancedb.connect("lance.db")

I tried to filter import the data directly from the polars dataset but it wouldnt automatically call openaAI and update the vector columns so I created the dictionary to create a subset of the dataset to test it.

df_test_data = pl.DataFrame({
    "query": ["what is the capital of France?", "how tall is the Eiffel tower?"],
    "passages": ["Paris is the capital of France...", "The Eiffel tower is 324m tall..."],
    "answers": ["Paris", "324 meters"],
    "query_id": ["q1", "q2"],
    "query_type": ["factoid", "factoid"],
    "wellFormedAnswers": ["Paris", "324"]
})
df_test_data
shape: (2, 6)
query passages answers query_id query_type wellFormedAnswers
str str str str str str
"what is the capital of France?" "Paris is the capital of France..." "Paris" "q1" "factoid" "Paris"
"how tall is the Eiffel tower?" "The Eiffel tower is 324m tall..." "324 meters" "q2" "factoid" "324"

Define your LanceModel, its based on pydantic LanceModel with with te columns ive added and the instructions to create a vector based on openai ada 002.

from typing import Optional
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-ada-002")

class MsMarco(LanceModel):
    query: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()
    
    passages: str  # or List[str] if passages are multiple
    answers: List[str]  # since answers can be multiple
    query_id: str  # or int depending on your needs
    query_type: str
    wellFormedAnswers: List[str]  # since these appear to be lists in your data

Create a table based on the class MsMarco above. This will create the folder if it doesnt exsit.

table = db.create_table("msmarko", schema=MsMarco, mode="overwrite")

Convert polars to dictionary of list

table.add(df_test_data.to_dicts())
df_test_data.to_dicts()
[{'query': 'what is the capital of France?',
  'passages': 'Paris is the capital of France...',
  'answers': 'Paris',
  'query_id': 'q1',
  'query_type': 'factoid',
  'wellFormedAnswers': 'Paris'},
 {'query': 'how tall is the Eiffel tower?',
  'passages': 'The Eiffel tower is 324m tall...',
  'answers': '324 meters',
  'query_id': 'q2',
  'query_type': 'factoid',
  'wellFormedAnswers': '324'}]

before we run the query in polars, lets see whats in the in the lance db table

table.to_pandas()
query vector passages answers query_id query_type wellFormedAnswers
0 what is the capital of France? [0.01977344, -0.011157776, -0.01185227, -0.025... Paris is the capital of France... [P, a, r, i, s] q1 factoid [P, a, r, i, s]
1 how tall is the Eiffel tower? [0.012117346, -0.015629334, -0.0050044176, -0.... The Eiffel tower is 324m tall... [3, 2, 4, , m, e, t, e, r, s] q2 factoid [3, 2, 4]

Test the query by using the Lancedb search function to see if it returns the correct results. Its not a exact match but it is the closest match of the vectors matching the effiel tower, where as the enty in the database doesnt mention the effiel tower. This is a good sign that the retrival is working.

user_query = "What city is the Eiffel Tower located in?"
res = table.search(user_query).limit(100).to_pydantic(MsMarco)
print("Best match:", res[0].query)
print("Passage:", res[0].passages)
Best match: how tall is the Eiffel tower?
Passage: The Eiffel tower is 324m tall...

im also happy with this result too.

user_query = "what is the best city in the world?"
res = table.search(user_query).limit(1).to_pydantic(MsMarco)
print("Best match:", res[0].query)
print("Passage:", res[0].passages)
Best match: what is the capital of France?
Passage: Paris is the capital of France...

import the data from the polars dataset

Create a new table in the database with a portion of the data.

from typing import Optional
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry

func = get_registry().get("openai").create(name="text-embedding-ada-002")
    
class Passage(LanceModel):
    is_selected: List[int]
    passage_text: List[str]
    url: List[str]

class MsMarco(LanceModel):
    query: str = Field(alias="query")
    vector: Vector(func.ndims()) = func.VectorField()
    passages: Passage
    answers: List[str]
    query_id: int
    query_type: str
    wellFormedAnswers: List[str]
table = db.create_table("msmarko_top500", schema=MsMarco, mode="overwrite")

save just the top 500 values to the database to test the retrieval

df_filtered_top500 = df_filtered.head(500)

lets take a quick look to remind what it looks like

df_filtered_top500.head(2)
shape: (2, 6)
query passages answers query_id query_type wellFormedAnswers
str struct[3] list[str] i64 str list[str]
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."]
"what is a omurice omelet" {[0, 0, … 0],["In one common rendition, the rice is fried with ketchup, and extra ketchup is squeezed on top as a garnish. In another popular one, seen in the Kyoto video, the chef uses demi-glace (a rich, veal stock–based sauce) to both fry the rice and top the omelette. Japanese mayo is often also squeezed on top.", "오무라이스. 1  Omurice is a contemporary Asian dish consisting of an omelette made with fried rice. Its name derives from the combination of the English words omelette and rice. 2  A relatively simple dish, it typically calls for rice fried with ketchup, chicken and onions wrapped in a thin sheet of fried egg.", … "A cut-open omurice with ketchup. Omurice or omu-rice (オムライス, Omu-raisu) is an example of yōshoku (a Western-influenced style of Japanese cuisine) consisting of an omelette made with fried rice and usually topped with ketchup. With omu and raisu being contractions of the words omelette and rice, the name is an example of wasei-eigo."],["http://www.seriouseats.com/2016/08/how-to-make-omurice-japanese-omelette-fried-rice.html", "http://trifood.com/omurice.asp", … "https://en.wikipedia.org/wiki/Omurice"]} ["A contemporary Asian dish consisting of an omelette made with fried rice."] 693333 "DESCRIPTION" ["An omurice omelet is a contemporary Asian dish consisting of an omelette made with fried rice."]

Running the command table.add(df_filtered_top500.to_dict()) i get ValueError: Cannot add a single dictionary to a table. Use a list.

Picking out the first few rows and just the required column query and wellformed answers to see what it looks like in dict form.

df_filtered_top500['query', 'wellFormedAnswers'].head(2).to_dicts()
[{'query': 'what county is columbus city in',
  'wellFormedAnswers': ['Columbus is a city in Bartholomew County.']},
 {'query': 'what is a omurice omelet',
  'wellFormedAnswers': ['An omurice omelet is a contemporary Asian dish consisting of an omelette made with fried rice.']}]

wellFormedAnswers is a list of 1, i need to unpack to get this into lancedb.

df_filtered_top500 = df_filtered_top500.with_columns(
    pl.col('wellFormedAnswers').list.get(0).alias('wellFormedAnswers')
)
df_filtered_top500.head(1)
shape: (1, 6)
query passages answers query_id query_type wellFormedAnswers
str struct[3] list[str] i64 str str
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" "Columbus is a city in Bartholomew County."

picking out a passage cell, the first cell, cell zero, shows the details of the struct, the passage_text, is_selected, and the url. I will need to create these into seperate tables to store in lancedb database.

table.add(df_filtered_top500['query', 'passages','answers','query_id', 'query_type', 'wellFormedAnswers'].head(1).to_dicts())
table.to_pandas()
query vector passages answers query_id query_type wellFormedAnswers
0 what county is columbus city in None {'is_selected': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]... [Bartholomew] 604568 LOCATION [C, o, l, u, m, b, u, s, , i, s, , a, , c, ...
df_filtered_top500["passages"][0]
{'is_selected': [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
 'passage_text': ['WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.',
  'The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).',
  'Phone Number: Columbus-Muscogee, the first consolidated city-county in Georgia, began development in 1826, building on ceded Creek Indian territory. Muscogee is the name of a branch of the Creek Nation. Columbus, of course, is named for Christopher Columbus.',
  "Sponsored Topics. Columbus ( /kəlʌmbəs/) is a city in and the county seat of Bartholomew County, Indiana, United States. The population was 44,061 at the 2010 census, and the current mayor is Fred Armstrong. Located approximately 40 miles (64 km) south of Indianapolis, on the east fork of the White River, it is the state's 20th largest city.",
  'Columbus, Ohio. Columbus (/kəˈlʌmbəs/; kə-LUM-bəs) is the capital and largest city of the U.S. state of Ohio. It is the 15th-largest city in the United States, with a population of 850,106 as of 2015 estimates. This makes Columbus the fourth-most populous state capital in the United States, and the third-largest city in the Midwestern United States.',
  'Phone Number: Columbus-Muscogee, the first consolidated city-county in Georgia, began development in 1826, building on ceded Creek Indian territory. Muscogee is the name of a branch of the Creek Nation. Columbus, of course, is named for Christopher Columbus.',
  'Latest news from Columbus, IN collected exclusively by city-data.com from local newspapers, TV, and radio stations. Ancestries: American (30.5%), German (13.7%), English (7.7%), Irish (5.3%), European (2.4%), Scottish (1.2%).',
  'Columbus, Indiana. 1  Columbus: covered Bridge at Mill Race Park. 2  Columbus: A statue in cloumbus. 3  Columbus. Columbus: Bartholomew County Courthouse.  Columbus: Tipton Lakes - A wonderful planned 1  community! Columbus: Barthalomew county memorial for veterans.  Columbus: A sculpter called summer storm in 1  columbus. Columbus: Downtown Columbus.',
  'The City owns and operates a volunteer fire department through a joint powers agreement with the City of Forest Lake. Police protection is provided through a contract with the Anoka County Sheriff’s Department. Columbus is located within the Forest Lake Area School District (ISD #831).',
  "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],
 'url': ['http://www.ci.columbus.mn.us/',
  'http://www.city-data.com/city/Columbus-Indiana.html',
  'https://georgia.gov/cities-counties/columbus-muscogee-county',
  'https://www.mapquest.com/us/in/columbus-282032817',
  'https://en.wikipedia.org/wiki/Columbus,_Ohio',
  'https://georgia.gov/cities-counties/columbus',
  'http://www.city-data.com/city/Columbus-Indiana.html',
  'http://www.city-data.com/city/Columbus-Indiana.html',
  'http://www.ci.columbus.mn.us/',
  'https://sheriff.franklincountyohio.gov/services/inmate-information.cfm']}
df_filtered_top500 = df_filtered_top500.with_columns(
    pl.col("wellFormedAnswers").list.first().alias("wellFormedAnswers"),
    pl.col("answers").list.first().alias("answers"),
    pl.col("passages").list.get(0).struct.field("passage_text").alias("passage_text")
)
StructFieldNotFoundError: passage_text

Resolved plan until failure:

    ---> FAILED HERE RESOLVING 'with_columns' <---
DF ["query", "passages", "answers", "query_id"]; PROJECT */6 COLUMNS; SELECTION: None
df_filtered_top500.with_columns([
    pl.col("passages")                          # <-- this is a Struct column
      .struct.field("passage_text")             # <-- get its "passage_text" field
      .list.get(0)                              # <-- take the 0th element of that list
      .alias("first_passage_text")
]).head(1)
shape: (1, 7)
query passages answers query_id query_type wellFormedAnswers first_passage_text
str struct[3] list[str] i64 str list[str] str
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."] "WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial proper…
df_filtered_top500.with_columns([
    pl.col("passages")                          # <-- this is a Struct column
      .struct.field("url")             # <-- get its "passage_text" field
      .alias("url")
]).head(1)
shape: (1, 8)
query passages answers query_id query_type wellFormedAnswers first_passage_text url
str struct[3] list[str] i64 str list[str] list[str] list[str]
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."] ["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"] ["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]
df_filtered_top500.with_columns([
    pl.col("passages")                          # <-- this is a Struct column
      .struct.field("url")             # <-- get its "passage_text" field
      .alias("url")
]).head(1)
shape: (1, 8)
query passages answers query_id query_type wellFormedAnswers first_passage_text url
str struct[3] list[str] i64 str list[str] list[str] list[str]
"what county is columbus city in" {[0, 0, … 0],["WELCOME TO COLUMBUS! The City of Columbus includes a mix of residential, rural and commercial property. Columbus boasts large tracts of public land, including Carlos Avery Wildlife Management Area and Lamprey Pass.", "The ratio of number of residents in Columbus to the number of sex offenders is 488 to 1. The number of registered sex offenders compared to the number of residents in this city is near the state average. Nearest city with pop. 50,000+: Bloomington, IN (33.3 miles , pop. 69,291).", … "Acceptable ID for children: State ID, Birth Certificate, or Health Insurance Card. Effective June 27, 2016, the Franklin County Sheriff's Office will be implementing changes to ensure the safety of inmates, staff, and visitors. Printed materials (magazines, books, pamphlets, leaflets, or catalogues) MUST fit all the below criteria:"],["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]} ["Bartholomew"] 604568 "LOCATION" ["Columbus is a city in Bartholomew County."] ["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"] ["http://www.ci.columbus.mn.us/", "http://www.city-data.com/city/Columbus-Indiana.html", … "https://sheriff.franklincountyohio.gov/services/inmate-information.cfm"]
table.add(df_filtered_top500.to_dict())
ValueError: Cannot add a single dictionary to a table. Use a list.
ValueError: Cannot add a single dictionary to a table. Use a list.
table.to_pandas()
user_query = "what is the best city in the world?"
res = table.search(user_query).limit(1).to_pydantic(MsMarco)
print("Best match:", res[0].query)
print("Passage:", res[0].passages)
df_test_data
shape: (2, 6)
query passages answers query_id query_type wellFormedAnswers
str str str str str str
"what is the capital of France?" "Paris is the capital of France..." "Paris" "q1" "factoid" "Paris"
"how tall is the Eiffel tower?" "The Eiffel tower is 324m tall..." "324 meters" "q2" "factoid" "324"
import lancedb

# 1. Connect to (or create) your LanceDB database
db = lancedb.connect("./my_lancedb_dir")

# 2. Prepare some data that includes Python lists
data = [
    {"id": 1, "numbers": [10, 20, 30]},
    {"id": 2, "numbers": [5, 15, 25]},
]

# 3. Create a new table with your list columns
table = db.create_table("my_table_with_lists", data=data)

# 4. Query your table
results = table.to_pandas()  # Just read it back into a pandas DataFrame
print(results)
   id       numbers
0   1  [10, 20, 30]
1   2   [5, 15, 25]